
                   Re: A multi-threaded NFS server for Linux

   Olaf Kirch (okir@monad.swb.de)
   Tue, 26 Nov 1996 23:09:08 +0100

     _________________________________________________________________

   Hi all,

   here are some ramblings about implementing nfsd, the differences
   between kernel- and user-space, and life in general. It's become quite
   long, so if you're not interested in any of these topics,
   just skip it...

   On Sun, 24 Nov 1996 12:01:01 PST, "H.J. Lu" wrote:
   > With the upcoming Linux C library 6.0, it is possible to
   > implement a multi-threaded NFS server in the user space using
   > the kernel-based pthread and MT-safe API included in libc 6.0.

   In my opinion, servicing NFS from user space is an idea that should
   die. The current unfsd (and I'm pretty sure this will hold for any
   other implementation) has a host of problems:

   1. Speed.

   This is only partly related to nfsd being single-threaded. A while
   ago I ran some benchmarks comparing my kernel-based nfsd to the
   user-space nfsd.

   In the unfsd case, I was running 4 daemons in parallel (which is
   possible even now as long as you restrict yourself to read-only
   access), and found the upper limit for peak throughput was around
   800 KBps; the rate for sustained reads was even lower. In comparison,
   the kernel-based nfsd achieved around 1.1 MBps peak throughput, which
   is almost the theoretical cheapernet limit; its sustained rate was
   around 1 MBps. Testers of my recent knfsd implementation reported a
   sustained rate of 3.8 MBps over 100 Mbps Ethernet.

   Even though some tweaking of the unfsd source (especially by getting
   rid of the Sun RPC code) may improve performance some more, I don't
   believe the user-space implementation can be pushed much further.
   [Speaking of the RPC library, a rewrite would be required anyway to
   safely support NFS over TCP. You can easily hang a vanilla RPC server
   by sending an incomplete request over TCP and keeping the connection
   open.]
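
   To make that last point concrete, here is a minimal sketch (my own
   illustration, not unfsd or library code; host and port are made up)
   of the kind of client that stalls a server built on the stock SunRPC
   TCP code: it announces a record of 400 bytes, never sends them, and
   keeps the connection open.

   /* RPC over TCP prefixes each record with a 4-byte mark:
    * high bit = last fragment, low 31 bits = fragment length. */
   #include <sys/socket.h>
   #include <netinet/in.h>
   #include <arpa/inet.h>
   #include <string.h>
   #include <unistd.h>
   #include <stdint.h>

   int main(void)
   {
           struct sockaddr_in sin;
           uint32_t mark;
           int s = socket(AF_INET, SOCK_STREAM, 0);

           memset(&sin, 0, sizeof(sin));
           sin.sin_family = AF_INET;
           sin.sin_port = htons(2049);                  /* assumed port */
           sin.sin_addr.s_addr = inet_addr("10.0.0.1"); /* assumed host */
           connect(s, (struct sockaddr *) &sin, sizeof(sin));

           mark = htonl(0x80000000 | 400);  /* "last fragment, 400 bytes" */
           write(s, &mark, 4);              /* ...which never arrive      */
           pause();                         /* hold the connection open   */
           return 0;
   }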

   Now add to that the synchronization overhead required to keep the file
   handle cache in sync between the various threads...

   This leads me straight to the next topic:

   2. File Handle Layout

   Traditional nfsds usually stuff a file's device and inode number into
   the file handle, along with some information on the exported inode.
   Since a user-space program has no way of opening a file given just
   its inode number, unfsd takes a different approach. It basically
   creates a hashed version of the file's path: each path component is
   stat'ed, and an 8-bit hash of the component's device and inode number
   is used.
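
   As an illustration of the scheme (my own sketch; the real unfsd code
   and its hash function differ in detail, and it reserves extra bytes
   for a hash of the file's own inode), the handle construction looks
   roughly like this:

   #include <sys/stat.h>
   #include <string.h>

   /* hypothetical 8-bit mix of device and inode number */
   static unsigned char hash_dev_ino(dev_t dev, ino_t ino)
   {
           return (unsigned char) (dev ^ ino ^ (ino >> 8) ^ (ino >> 16));
   }

   /* Build a handle for "path": one hash byte per path component.
    * Returns the number of bytes used, or -1 on error. */
   static int make_handle(const char *path, unsigned char *fh, int fhlen)
   {
           char buf[1024], prefix[2048] = "";
           struct stat st;
           char *p;
           int n = 0;

           strncpy(buf, path, sizeof(buf) - 1);
           buf[sizeof(buf) - 1] = '\0';
           for (p = strtok(buf, "/"); p != NULL; p = strtok(NULL, "/")) {
                   strcat(prefix, "/");
                   strcat(prefix, p);
                   if (stat(prefix, &st) < 0 || n >= fhlen)
                           return -1;
                   fh[n++] = hash_dev_ino(st.st_dev, st.st_ino);
           }
           return n;
   }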

   The first problem is that this kind of file handle is not invariant
   against renames from one directory to another. Agreed, this doesn't
   happen too often, but it does break Unix semantics. Try this on an
   nfs-mounted file system (with appropriate foo and bar):

   (mv bar foo/bar; cat) < bar

   The second problem is a lot worse. When unfsd is presented with a file
   handle it does not have in its cache, it must map it to a valid path
   name. This is basically done in the following way:

   path = "/";
   depth = 0;
   while (depth < length(fhandle)) {
   deeper:
           /* scan the current directory for an entry whose dev/ino hash
            * matches the next component of the file handle */
           dirp = opendir(path);
           while ((entry = readdir(dirp)) != NULL) {
                   if (hash(dev, ino) matches fhandle component) {
                           remember dirp;          /* so we can backtrack */
                           append entry to path;
                           depth++;
                           goto deeper;
                   }
           }
           /* no entry matched: give up on this branch and backtrack */
           closedir(dirp);
           backtrack;
   }

   Needless to say, this is not very fast. The file handle cache helps
   a lot here, but this kind of mapping operation occurs far more often
   than one might expect (consider a development tree where files get
   created and deleted continuously). In addition, the current
   implementation discards conflicting handles when there's a hash
   collision.

   This file handle layout also leaves little room for any additional
   baggage. Unfsd currently uses 4 bytes for an inode hash of the file
   itself and 28 bytes for the hashed path, but as soon as you add other
   information like the inode generation number, you will sooner or
   later run out of room.
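
   For contrast, a kernel-based server can put the identifying data into
   the handle directly. A hypothetical layout (not necessarily what my
   knfsd uses) that fits in the 32 opaque bytes of an NFSv2 handle might
   look like this:

   struct nfs_fh_sketch {
           unsigned int  fh_dev;         /* device of the file              */
           unsigned int  fh_ino;         /* its inode number                */
           unsigned int  fh_generation;  /* inode generation, guards reuse  */
           unsigned int  fh_xdev;        /* exported (root) inode: device   */
           unsigned int  fh_xino;        /* ...and its inode number         */
           unsigned char fh_pad[12];     /* room to spare                   */
   };                                    /* 32 bytes total                  */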

   Last but not least, the file handle cache must be strictly
   synchronized between different nfsd processes/threads. Suppose you
   rename foo to bar, which is performed by thread1, then try to read
   the file, which is performed by thread2. If the latter doesn't know
   the cached path is stale, it will fail. You could of course retry
   every operation that fails with ENOENT, but this would add even more
   clutter and overhead to the code.

   3. Adherence to the NFSv2 specification

   The Linux nfsd currently does not fulfill the NFSv2 spec in its
   entirety. Especially when it comes to safe writes, it is really a
   fake. It neither makes an attempt to sync file data before replying
   to the client (which could be implemented, along with the `async'
   export option for turning off this kind of behavior), nor does it
   sync meta-data after inode operations (which is impossible from user
   space). To most people this is no big loss, but this behavior is
   definitely not acceptable if you want industrial-strength NFS.
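
   For what it's worth, the file-data half of "safe writes" would look
   roughly like this in user space (a sketch with invented names, not
   current unfsd code); the meta-data half has no user-space equivalent:

   #include <unistd.h>

   /* Handle an NFS WRITE: the reply may only be sent once the data has
    * reached stable storage, unless the export is marked `async'. */
   static int nfsd_do_write(int fd, const char *buf, size_t len,
                            off_t offset, int async_export)
   {
           if (pwrite(fd, buf, len, offset) != (ssize_t) len)
                   return -1;                    /* map to NFSERR_IO     */
           if (!async_export && fsync(fd) < 0)
                   return -1;                    /* data not on disk yet */
           return 0;                             /* now reply NFS_OK     */
   }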

   But even if you did implement at least synchronous file writes in
   unfsd, be it as an option or as the default, there seems to be no way
   to implement some of the more advanced techniques like gathered
   writes. When implementing gathered writes, the server tries to detect
   whether other nfsd threads are writing to the file at the same time
   (which frequently happens when the client's biods flush out the data
   on file close), and if they do, it delays syncing file data for a few
   milliseconds so the others can finish first, and then flushes all
   data in one go. You can do this in kernel-land by watching
   inode->i_writecount, but you're totally at a loss in user space.
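
   In the kernel the idea boils down to something like the following
   sketch (simplified and not the actual code; the sleep idiom and the
   constants are illustrative only):

   /* If other writers still have the inode open, back off for a few
    * milliseconds before syncing, so their data gets flushed with ours. */
   static void nfsd_gather_sync(struct inode *inode, struct file *filp)
   {
           if (inode->i_writecount > 1) {
                   current->state = TASK_INTERRUPTIBLE;
                   current->timeout = jiffies + HZ / 100;   /* ~10 ms */
                   schedule();
           }
           /* now flush the file data in one go */
           if (filp->f_op && filp->f_op->fsync)
                   filp->f_op->fsync(inode, filp);
   }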

   4. Supporting NFSv3

   A user-space NFS server is not particularly well suited for
   implementing NFSv3. For instance, NFSv3 tries to help cache
   consistency on the client by providing pre-operation attributes for
   some operations, such as the WRITE call. When a client finds that
   the pre-operation attributes returned by the server agree with those
   it has cached, it can safely assume that any data it has cached was
   still valid when the server replied to its call, so there's no need
   to discard the cached file data and meta-data.

   However, pre-op attributes can only be provided safely when the
   server retains exclusive access to the inode throughout the
   operation. This is impossible from user space.
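
   To see why, picture what a user-space server would have to do for a
   WRITE (a sketch with made-up names): sampling the pre-op attributes
   and performing the write are two separate system calls, and nothing
   stops another writer from slipping in between them.

   #include <sys/stat.h>
   #include <unistd.h>

   /* (1) sample the pre-op attributes, (2) perform the write.  Between
    * the two calls any other process may modify the file, so the pre-op
    * attributes returned to the client may already be wrong.  A kernel
    * server can hold the inode across both steps. */
   static ssize_t write_with_preop(int fd, const void *buf, size_t len,
                                   off_t offset, struct stat *pre)
   {
           if (fstat(fd, pre) < 0)               /* (1) size/mtime/ctime  */
                   return -1;
           /* <-- the race window is here -->                             */
           return pwrite(fd, buf, len, offset);  /* (2) the WRITE itself  */
   }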

   A similar example is the exclusive create operation, where a verifier
   is stored in the inode's atime/mtime fields by the server to
   guarantee exactly-once behavior even in the face of request
   retransmissions. These values cannot be checked atomically by a
   user-space server.
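
   For illustration (my sketch, with invented helper names), parking the
   verifier in atime/mtime and checking it on a retransmitted CREATE
   looks like this from user space; the point is that the check is a
   separate stat(), so it cannot be atomic with the create itself:

   #include <sys/stat.h>
   #include <sys/time.h>

   /* store the two 32-bit halves of the create verifier in atime/mtime */
   static int store_verifier(const char *path, unsigned int v1, unsigned int v2)
   {
           struct timeval tv[2];

           tv[0].tv_sec = v1;  tv[0].tv_usec = 0;   /* atime <- half one */
           tv[1].tv_sec = v2;  tv[1].tv_usec = 0;   /* mtime <- half two */
           return utimes(path, tv);
   }

   /* on a retransmission: does the existing file carry our verifier? */
   static int verifier_matches(const char *path, unsigned int v1, unsigned int v2)
   {
           struct stat st;

           if (stat(path, &st) < 0)
                   return 0;
           return st.st_atime == (time_t) v1 && st.st_mtime == (time_t) v2;
   }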

   What this boils down to is that a user-space server cannot, without
   violating the protocol spec, implement many of the advanced features
   of NFSv3.

   5. File locking over NFS

   Supporting lockd in user-space is close to impossible. I've tried it,
   and have run into a large number of problems. Some of the highlights:

   * lockd can provide only a limited number of locks at the same
   time because it has only a limited number of file descriptors.

   * When lockd blocks a client's lock request because of a lock held
   by a local process on the server, it must continuously poll
   /proc/locks to see whether the request could be granted (a rough
   sketch of such a poll follows after this list). What's more, if
   there's heavy contention for the file, it may take a long time
   before it succeeds, because it cannot add itself to the inode's
   lock wait list in the kernel. That is, unless you want it to
   create a new thread just for blocking on this lock.

   * Lockd must synchronize its file handle cache with that of
   the NFS servers. Unfortunately, lockd is also needed when
   running as an NFS client only, so you run into problems with
   who owns the file handle cache, and how to share it between
   these two services.
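
   Here is roughly what the /proc/locks poll mentioned above amounts to
   (a sketch only; the exact /proc/locks format differs between kernel
   versions, so the key format below is an assumption):

   #include <stdio.h>
   #include <string.h>

   /* Return 1 while some entry in /proc/locks still refers to the file
    * identified by major:minor:inode; a blocked lockd has to call this
    * again and again until the conflicting lock goes away. */
   static int lock_still_held(unsigned int major, unsigned int minor,
                              unsigned long ino)
   {
           char line[256], key[64];
           FILE *fp;
           int found = 0;

           if ((fp = fopen("/proc/locks", "r")) == NULL)
                   return 0;
           sprintf(key, " %02x:%02x:%lu ", major, minor, ino);
           while (fgets(line, sizeof(line), fp) != NULL) {
                   if (strstr(line, key) != NULL) {
                           found = 1;
                           break;
                   }
           }
           fclose(fp);
           return found;
   }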

   6. Conclusion

   Alright, this has become rather long. Some of the problems I've
   described above may be solvable with more or less effort, but I
   believe that, taken as a whole, they make a pretty strong argument
   against sticking with a user-space nfsd.

   In kernel space, most of these issues are addressed more easily, and
   more efficiently. My current kernel nfsd is fairly small. Together
   with the RPC core, which is used by both client and server, it takes
   up something like 20 pages--don't quote me on the exact number. As
   mentioned above, it is also pretty fast, and I hope I'll also be able
   to provide fully functional file locking soon.

   If you want to take a look at the current snapshot, it's available at
   ftp.mathematik.th-darmstadt.de/pub/linux/okir/dontuse/linux-nfs-X.Y.tar.gz.
   This version still has a bug in the nfsd readdir implementation, but
   I'll release an updated (and fixed) version as soon as I have the
   necessary lockd rewrite sorted out.

   I would particularly welcome comments from Keepers of the Source on
   whether my NFS rewrite has any chance of being incorporated into the
   kernel at some time... that would definitely motivate me to sink more
   time into it than I currently do.

   Happy hacking
   Olaf
--
Olaf Kirch         |  --- o --- Nous sommes du soleil we love when we play
okir@monad.swb.de  |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
             For my PGP public key, finger okir@brewhq.swb.de.

