Parallel NFS suffers delays due to Linux client work, 'monster' spec

Carol Sliwa

UCLA's Institute for Digital Research and Education once hoped storage systems supporting the long-promised parallel Network File System technology would be the answer to its bandwidth woes. But, in April 2011, the Institute gave up the wait and purchased a proprietary system from Panasas Inc. to give its distributed scientific applications running on clustered servers the direct, parallel access to storage they needed.

More than a year later, Scott Friedman, chief technologist at IDRE, said he still hasn't found a parallel NFS (pNFS)-based storage product he would consider using, although a limited number do exist. Panasas postponed delivery of its pNFS support and is now targeting early next year. IDRE's other main storage vendor, BlueArc, announced prior to its acquisition by Hitachi Data Systems (HDS) that it would make pNFS available this year; instead, an HDS product marketing manager said the specification still needed work. More recently, HDS said only that pNFS is on its roadmap.


Parallel NFS: What's behind the delays?

A work in progress for more than eight years, pNFS reached a milestone in January 2010 when the Internet Engineering Task Force (IETF) ratified the performance- and scalability-boosting technology as an optional part of NFS Version 4.1. Expectations grew as the publicity campaign ramped up, and vendors began to project product shipment dates. But the momentum stalled.

One of the main holdups was parallel NFS client support. Both the application server operating system, or client, and the storage system, or server, must support pNFS. So, even when EMC Corp. and NetApp Inc. released storage systems that support pNFS, the products weren't especially useful in the absence of a stable pNFS client. Both vendors, along with developers representing other vested interests, joined forces on a pNFS client. Oracle Corp./Sun Microsystems Inc. worked on Solaris, while others focused on open source Linux, since the initial target enterprise audience tends to use it, especially supported versions such as Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES).

"When I wrote the first prototype of pNFS at the University of Michigan back in 2004, no one could have imagined the number of things that had to happen before it would become available in enterprise-level storage products," Dean Hildebrand, a research staff member in storage systems at IBM's Almaden Research Center, wrote in an email. "The NFSv4.1 RFC 5661 is the largest specification to ever be published by the IETF. Implementing this specification in Linux is extremely complex, and code changes could not disrupt current NFSv2/3/4 users. In the end, eight years is not that long for such a useful technology."

"We ended up with a monster, and, of course, we had a long approval process because of that," said Sorin Faibish, chief scientist of the Fast Data Group at EMC and a prominent pNFS evangelist who recalled starting a "propaganda for pNFS" campaign around 2006 even though "there was no proof behind us." According to Faibish, "It took us two years to go through the [Internet Engineering Steering Group] IESG, which is the body that approves protocols by IETF. We thought it would take about two months. There were many, many complaints from the IESG that they [could not] cope with such a big document."

Redesigning the parallel NFS spec

Because pNFS depends on NFSv4.1, the community first had to implement a basic NFSv4.1 client. The task required considerable re-plumbing of the existing NFS client to add the new features, and then testing to ensure stability. Once they thought they had it right, the group sought a code review from Trond Myklebust, the Linux NFS client maintainer and a principal engineer at NetApp. Myklebust, however, had problems with it.

"[Myklebust] said, 'This code is not going anywhere,'" Faibish recalled. "There was a meeting two-and-a-half years ago with a lot of displeasure and discontent of almost all the members. We decided together that we had to do a redesign of the entire pNFS spec."

Contributors such as EMC, IBM, NetApp, Panasas, Sun and the University of Michigan's Center for Information Technology Integration (CITI) had worked on the project, but their main focus was getting the IETF to grant their request for comments (RFC), Faibish noted. To do that, they simply had to demonstrate that the protocol worked, not that it performed well.

"I suppose that this is a weakness with the IETF process. There is no requirement that the spec and implementations be written at the same time. Usually, the spec is written first, and so this often means that problems start popping out of the woodwork when we actually go out there and try to implement it in practice," Myklebust wrote in an email. He stressed that he was commenting on his own behalf, in his capacity as Linux NFS client maintainer, not on behalf of NetApp.

One of the main issues Myklebust encountered was the developer group's insistence that the server issue callbacks to the client as a "normal mode" of operation. He said his experience with callbacks was that they didn't scale well and involved the server sending remote procedure calls (RPCs) to one client, while forcing another one to wait.

Varied pNFS flavors bring challenges

Another of the many challenges centered on the three pNFS flavors: file, block and object. Each relies on entirely different technologies for doing reads and writes, with file using NFS, block using SCSI, and object using object-based storage device (OSD) commands.

"All this has to be tested, and because the technologies are different, the features that need to be tested are very different," Myklebust explained via email. "This was complicated by the fact that none of us had access to all three types of pNFS server, so there could be no centralized testing effort. There [was] no Linux pNFS server that we could use, so we had to create simulators in some cases and/or rely on help [from] proprietary server vendors (who hadn't at the time shipped a pNFS product either). In other words, we had to rely on each pNFS sub-team doing their own testing, and then [convince] the reviewers that they had covered all the bases."

New, tested features typically go into the official "upstream" Linux kernel that Linus Torvalds and his maintainers manage. According to Myklebust, the pNFS client implementation was done when the O_DIRECT read/write code merged into the 3.5 kernel, which was released on July 21, 2012. Excluding the O_DIRECT support (which pertains primarily to large database applications such as Oracle), the pNFS file layout code was complete with the 2.6.39 kernel on May 18, 2011. The pNFS object layout code followed in the 3.0 kernel on July 21, 2011, and the pNFS block layout code was ready with the 3.1 kernel on Oct. 24, 2011.
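For readers unfamiliar with O_DIRECT: it tells the kernel to bypass the client's page cache and transfer data straight between the application's buffers and storage, which is why large database workloads care about it. As a rough, generic illustration (not tied to any pNFS server; the scratch-file path is arbitrary), dd's oflag=direct is the standard way to request direct I/O from the shell, and dd takes care of the block-aligned buffers that O_DIRECT requires:

```shell
# Write 1 MiB with O_DIRECT, bypassing the page cache. Some filesystems
# (notably tmpfs) reject O_DIRECT with EINVAL, so fall back gracefully.
if dd if=/dev/zero of=/tmp/odirect_test bs=4096 count=256 oflag=direct 2>/dev/null; then
    echo "direct I/O write succeeded"
else
    echo "direct I/O not supported on this filesystem"
fi
rm -f /tmp/odirect_test
```

On a pNFS mount, the equivalent in-kernel path is what the 3.5 kernel's O_DIRECT read/write merge enabled.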

But getting something into the upstream Linux kernel doesn't necessarily mean it will appear in Linux distributions anytime soon. Myklebust said he's proud that Red Hat felt comfortable shipping RHEL 6.2 with a technology preview of pNFS for file layout just seven to eight months after it appeared in the upstream Linux kernel. RHEL 7.0, targeted for next year, is due to support file, block and object storage.
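For administrators trying the technology preview, enabling pNFS is a matter of requesting an NFSv4.1 mount; the layout type is then negotiated with the server. A minimal sketch, assuming a hypothetical pNFS-capable server (`filer.example.com:/export`) and root privileges; RHEL 6.2-era mount.nfs selects the minor version with `minorversion=1` (newer versions also accept `vers=4.1`):

```shell
# Mount with NFSv4.1 so the client can negotiate a pNFS layout.
mount -t nfs4 -o minorversion=1 filer.example.com:/export /mnt/pnfs

# Check whether a layout driver loaded; the file layout typically
# appears as nfs_layout_nfsv41_files in the module list.
grep nfs_layout /proc/modules
```

If no layout driver loads, the client silently falls back to plain NFSv4.1 I/O through the metadata server, so the mount succeeding is not by itself proof that parallel I/O is in use.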

Beyond the pNFS client, IBM Research is working on an open source, user-space NFS server called Ganesha, which features an extensible interface for accessing file systems not in the Linux kernel, such as IBM's General Parallel File System. The Ganesha project is separate from the work that developers did to create a basic open source pNFS server in connection with the pNFS client. That basic pNFS server was functionally incomplete and suffered from design flaws that prevented its inclusion in the Linux kernel, Myklebust said.

"The problem is lack of manpower. The same proprietary vendors who are happy to fund pNFS client support appear to be a lot more reticent to fund work on the kernel server," Myklebust said, noting that such a server would "conclusively" solve the testing problem. "Relying on vendors is not a satisfying solution. I don't have the financing or even the space for one EMC, one Panasas and one NetApp server for my personal testing. The Linux distribution vendors have the same problem."

UCLA's Friedman, for one, fears parallel NFS may continue to stall. "My read is that there is no reason for such a relatively small industry to really support pNFS, as it ultimately constrains them from differentiating themselves from each other," he wrote via email. "They all talk about it, they all profess to support it, but when it comes down to it, pNFS is not a priority for any of them."

Industry analysts, however, say the long road to pNFS is no surprise. "It just takes time," said Randy Kerns, a senior strategist at Evaluator Group in Boulder, Colo. "Certain things that are fundamental like this -- an NFS change -- take a very long time."