1 .\" $NetBSD: 1.t,v 1.2 1998/01/09 06:55:36 perry Exp $
4 .\" The Regents of the University of California. All rights reserved.
6 .\" This document is derived from software contributed to Berkeley by
7 .\" Rick Macklem at The University of Guelph.
9 .\" Redistribution and use in source and binary forms, with or without
10 .\" modification, are permitted provided that the following conditions
12 .\" 1. Redistributions of source code must retain the above copyright
13 .\" notice, this list of conditions and the following disclaimer.
14 .\" 2. Redistributions in binary form must reproduce the above copyright
15 .\" notice, this list of conditions and the following disclaimer in the
16 .\" documentation and/or other materials provided with the distribution.
17 .\" 3. Neither the name of the University nor the names of its contributors
18 .\" may be used to endorse or promote products derived from this software
19 .\" without specific prior written permission.
21 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
22 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
24 .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
25 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
27 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
28 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
29 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
30 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
33 .\" @(#)1.t 8.1 (Berkeley) 6/8/93
35 .sh 1 "NFS Implementation"
37 The 4.4BSD implementation of NFS and the alternate protocol nicknamed
38 Not Quite NFS (NQNFS) are kernel resident, but make use of a few system
40 The kernel implementation does not use an RPC library, handling the RPC
41 request and reply messages directly in \fImbuf\fR data areas. NFS
42 interfaces to the network using
43 sockets via the kernel interface available in
44 \fIsys/kern/uipc_syscalls.c\fR as \fIsosend(), soreceive(),\fR...
45 There are connection management routines for support of sockets for connection
46 oriented protocols and timeout/retransmit support for datagram sockets on
48 For connection oriented transport protocols,
49 such as TCP/IP, there is one connection
50 for each client to server mount point that is maintained until an umount.
51 If the connection breaks, the client will attempt a reconnect with a new
53 The client side can operate without any daemons running, but performance
54 will be improved by running nfsiod daemons that perform read-aheads
56 For the server side to function, the daemons portmap, mountd and
58 The mountd daemon performs two important functions.
60 Upon startup and after a hangup signal, mountd reads the exports
61 file and pushes the export information for each local file system down
62 into the kernel via the mount system call.
64 Mountd handles remote mount protocol (RFC1094, Appendix A) requests.
66 The nfsd master daemon forks off children that enter the kernel
67 via the nfssvc system call. The children normally remain kernel
68 resident, providing a process context for the NFS RPC servers. The only
69 exception to this is when a Kerberos [Steiner88]
70 ticket is received; at that time
71 the nfsd exits the kernel temporarily to verify the ticket via the
72 Kerberos libraries and then returns to the kernel with the results.
73 (This only happens for Kerberos mount points as described further under
75 Meanwhile, the master nfsd waits to accept new connections from clients
76 using connection oriented transport protocols and passes the new sockets down
78 The client side mount_nfs along with portmap and
79 mountd are the only parts of the NFS subsystem that make any
80 use of the Sun RPC library.
81 .sh 1 "Mount Problems"
83 There are several problems that can be encountered at the time of an NFS
84 mount, ranging from an unresponsive NFS server (crashed, network partitioned
85 from client, etc.) to various interoperability problems between different
89 if the 4.4BSD NFS server will be handling any PC clients, mountd will
90 require the \fB-n\fR option to enable non-root mount request servicing.
91 A pcnfsd\** daemon must also be running.
93 \** Pcnfsd is available in source form from Sun Microsystems and many
96 The server side requires that the daemons
97 mountd and nfsd be running and that
98 they be registered with portmap properly.
99 If problems are encountered,
100 the safest fix is to kill all the daemons and then restart them in
101 the order portmap, mountd and nfsd.
102 Other server side problems are normally caused by problems with the format
103 of the exports file, which is covered under
104 Security and in the exports man page.
106 On the client side, there are several mount options useful for dealing
107 with server problems.
108 In cases where a file system is not critical for system operation, the
110 mount option may be specified so that mount_nfs will go into the
111 background for a mount attempt on an unresponsive server.
112 This is useful for mounts specified in
114 so that the system will not get hung while booting doing
116 because a file server is not responsive.
117 On the other hand, if the file system is critical to system operation, this
118 option should not be used so that the client will wait for the server to
119 come up before completing bootstrapping.
120 There are also three mount options to help deal with interoperability issues
121 with various non-BSD NFS servers. The
123 option specifies that the NFS
124 client use a reserved IP port number to satisfy some servers' security
127 \**Any security benefit of this is highly questionable and as
128 such the BSD server does not require a client to use a reserved port number.
132 option stops the NFS client from doing a \fIconnect\fR on the UDP
133 socket, so that the mount works with servers that send NFS replies from
134 port numbers other than the standard 2049.\**
136 \**The Encore Multimax is known
141 option sets the maximum size of the group list in the credentials passed
142 to an NFS server in every RPC request. Although RFC1057 specifies a maximum
143 size of 16 for the group list, some servers can't handle that many.
144 If a user, particularly root doing a mount,
145 keeps getting access denied from a file server, try temporarily
146 reducing the number of groups that user is in to fewer than 5
147 by editing /etc/group. If the user can then access the file system, slowly
148 increase the number of groups for that user until the limit is found and
149 then peg the limit there with the
152 This implies that the server will only see the first \fInum\fR
153 groups that the user is in, which can cause some accessibility problems.
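As a rough illustration of the accessibility problem, the following sketch models a server that receives only the first \fInum\fR entries of a user's group list. The function names and sample group ids are hypothetical and chosen for illustration only; a real server performs this check in the kernel.

```python
def groups_seen_by_server(client_groups, num):
    """Return the group ids the server receives when the list is capped at num."""
    return client_groups[:num]


def has_group_access(client_groups, num, required_gid):
    """True if required_gid survives truncation of the group list to num entries."""
    return required_gid in groups_seen_by_server(client_groups, num)


# Hypothetical user who belongs to six groups; gid 60 is the sixth one.
user_groups = [10, 20, 30, 40, 50, 60]
print(has_group_access(user_groups, 4, 20))  # group is within the first four
print(has_group_access(user_groups, 4, 60))  # group is cut off by the limit
```

A file that is accessible only through the sixth group is therefore out of reach once the list is capped at four entries, even though the user is a member of that group on the client.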
155 For sites that have many NFS servers, amd [Pendry93]
156 is a useful administration tool.
157 It also reduces the number of actual NFS mount points, alleviating problems
158 with commands such as df(1) that hang when any of the NFS servers is
160 .sh 1 "Dealing with Hung Servers"
162 There are several mount options available to help a client deal with
163 being hung waiting for a response from a crashed or unreachable\** server.
165 \**Due to network partitioning or similar.
167 By default, a hard mount will continue to try to contact the server
168 ``forever'' to complete the system call. This type of mount is appropriate
169 when processes on the client that access files in the file system do not
170 tolerate file I/O system calls that return -1 with \fIerrno == EINTR\fR
171 and/or access to the file system is critical for normal system operation.
173 There are two other alternatives:
175 A soft mount (\fB-s\fR option) retries an RPC \fIn\fR
176 times and then the corresponding
177 system call returns -1 with errno set to EINTR.
178 For TCP transport, the actual RPC request is not retransmitted, but the
179 timeout intervals waiting for a reply from the server are done
180 in the same manner as UDP for this purpose.
181 The problem with this type of mount is that most applications do not
182 expect an EINTR error return from file I/O system calls (since it never
183 occurs for a local file system) and get confused by the error return
184 from the I/O system call.
187 is used to set the RPC retry limit and if set too low, the error returns
188 will start occurring whenever the NFS server is slow due to heavy load.
189 Alternatively, a large retry limit can result in a process hung for a long
190 time, due to a crashed server or network partitioning.
192 An interruptible mount (\fB-i\fR option) checks to see if a termination signal
193 is pending for the process when waiting for server response and if it is,
194 the I/O system call posts an EINTR. Normally this results in the process
195 being terminated by the signal when returning from the system call.
196 This feature allows you to ``^C'' out of processes that are hung
197 due to unresponsive servers.
198 The problem with this approach is that signals that are caught by
199 a process are not recognized as termination signals
200 and the process will remain hung.\**
202 \**Unfortunately, there are also some resource allocation situations in the
203 BSD kernel where the termination signal will be ignored and the process
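The soft mount behaviour described above can be sketched as a bounded retry loop. This is a simplified model and not the kernel code: \fIattempt_rpc\fR is a hypothetical callable that returns a reply, or None when a timeout interval expires, and the retry limit is treated as a total number of attempts.

```python
import errno


def soft_rpc(attempt_rpc, retry_limit):
    """Try an RPC up to retry_limit times, then fail the call with EINTR."""
    for _ in range(retry_limit):
        reply = attempt_rpc()
        if reply is not None:
            return reply
    raise OSError(errno.EINTR, "soft mount: retry limit reached")


def make_slow_server(timeouts_before_reply):
    """Build a fake server that times out a fixed number of times, then replies."""
    state = {"remaining": timeouts_before_reply}

    def attempt_rpc():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return None  # models a timeout with no reply
        return "reply"

    return attempt_rpc
```

With a loaded server that misses three timeout intervals, a limit of 2 makes the call fail with EINTR while a limit of 5 succeeds, which is the tradeoff between spurious errors and long hangs noted above.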
206 .sh 1 "RPC Transport Issues"
208 The NFS Version 2 protocol runs over UDP/IP transport by
209 sending each Sun Remote Procedure Call (RFC1057)
210 request/reply message in a single UDP
211 datagram. Since UDP does not guarantee datagram delivery, the
212 Remote Procedure Call (RPC) layer
213 times out and retransmits an RPC request if
214 no RPC reply has been received. Since this round trip timeout (RTO) value
215 is for the entire RPC operation, including RPC message transmission to the
216 server, queuing at the server for an nfsd, performing the RPC and
217 sending the RPC reply message back to the client, it can be highly variable
218 for even a moderately loaded NFS server.
219 As a result, the RTO interval must be a conservative (large) estimate, in
220 order to avoid extraneous RPC request retransmits.\**
222 \**At best, an extraneous RPC request retransmit increases
223 the load on the server and at worst can result in damaged files
224 on the server when non-idempotent RPCs are redone [Juszczak89].
226 Also, with an 8Kbyte read/write data size
227 (the default), the read/write reply/request will be an 8+Kbyte UDP datagram
228 that must normally be fragmented at the IP layer for transmission.\**
230 \**6 IP fragments for an Ethernet,
231 which has a maximum transmission unit of 1500 bytes.
233 For IP fragments to be successfully reassembled into
234 the IP datagram at the receive end, all
235 fragments must be received within a fairly short ``time to live''.
236 If one fragment is lost/damaged in transit,
237 the entire RPC must be retransmitted and redone.
238 This problem can be exacerbated by a network interface on the receiver that
239 cannot handle the reception of back-to-back network packets [Kent87a].
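The fragment count quoted for Ethernet can be checked with a little arithmetic. The sketch below assumes a 20 byte IP header with no options, an 8 byte UDP header, and roughly 128 bytes of RPC and NFS header overhead; the overhead figure is an assumption for illustration, not a value taken from the protocol specification.

```python
import math

IP_HEADER = 20   # bytes, assuming no IP options
UDP_HEADER = 8   # bytes


def ip_fragments(data_size, rpc_overhead, mtu):
    """Estimate how many IP fragments one UDP RPC message needs."""
    # Every fragment except the last carries a multiple of 8 payload bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    udp_datagram = UDP_HEADER + rpc_overhead + data_size
    return math.ceil(udp_datagram / per_fragment)


# rpc_overhead=128 is an assumed size for the RPC and NFS headers.
for size in (8192, 4096, 2048, 1024):
    print(size, ip_fragments(size, 128, 1500))
```

Under these assumptions an 8Kbyte transfer needs 6 fragments on an Ethernet, while a 1Kbyte transfer fits in a single packet, which is why reducing the read/write data size relieves fragmentation problems.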
241 There are several tuning mount
242 options on the client side that can prove useful when trying to
243 alleviate performance problems related to UDP RPC transport.
248 specify the maximum read or write data size respectively.
250 should be a power of 2 (4K, 2K, 1K) and adjusted downward from the
252 whenever IP fragmentation is causing problems. The best indicator of
253 IP fragmentation problems is a significant number of
254 \fIfragments dropped after timeout\fR
255 reported by the \fIip:\fR section of a \fBnetstat -s\fR
256 command on either the client or server.
257 Of course, if the fragments are being dropped at the server, it can be
258 fun figuring out which client(s) are involved.
259 The most likely candidates are clients that are not
260 on the same local area network as the
261 server or have network interfaces that do not receive several
262 back-to-back network packets properly.
264 By default, the 4.4BSD NFS client dynamically estimates the retransmit
265 timeout interval for the RPC and this appears to work reasonably well for
266 many environments. However, the
268 flag can be specified to turn off
269 the dynamic estimation of retransmit timeout, so that the client will
270 use a static initial timeout interval.\**
272 \**After the first retransmit timeout, the initial interval is backed off
277 option can be used with
279 to set the initial timeout interval to other than the default of 2 seconds.
280 The best indicator that dynamic estimation should be turned off would
281 be a significant number\** in the \fIX Replies\fR field and a
283 \**Even 0.1% of the total RPCs is probably significant.
285 large number in the \fIRetries\fR field
286 in the \fIRpc Info:\fR section as reported
287 by the \fBnfsstat\fR command.
288 On the server, there would be significant numbers of \fIInprog\fR recent
289 request cache hits in the \fIServer Cache Stats:\fR section as reported
290 by the \fBnfsstat\fR command, when run on the server.
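The 0.1% rule of thumb from the footnote can be applied to the client side counters as follows. This is a minimal sketch: the counts are assumed to have been read by hand from the \fIRpc Info:\fR section, and the function name is hypothetical.

```python
def x_replies_significant(requests, x_replies, threshold=0.001):
    """True if repeated replies reach the given fraction of all RPC requests.

    The default threshold of 0.001 corresponds to the 0.1% figure above.
    """
    if requests <= 0:
        return False
    return x_replies / requests >= threshold


# Sample counts, invented for illustration.
print(x_replies_significant(200000, 350))
print(x_replies_significant(200000, 50))
```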
292 The tradeoff is that a smaller timeout interval results in a better
293 average RPC response time, but increases the risk of extraneous retries
294 that in turn increase server load and the possibility of damaged files
295 on the server. It is probably best to err on the safe side and use a large
296 (>= 2 sec) fixed timeout if the dynamic retransmit timeout estimation
297 seems to be causing problems.
299 An alternative to all this fiddling is to run NFS over TCP transport instead
301 Since the 4.4BSD TCP implementation provides reliable
302 delivery with congestion control, it avoids all of the above problems.
303 It also permits the use of read and write data sizes greater than the 8Kbyte
304 limit for UDP transport.\**
306 \**Read/write data sizes greater than 8Kbytes will not normally improve
307 performance unless the kernel constant MAXBSIZE is increased and the
308 file system on the server has a block size greater than 8Kbytes.
310 NFS over TCP usually delivers comparable to significantly better performance
312 unless the client or server processor runs at less than 5-10 MIPS. For a
313 slow processor, the extra CPU overhead of using TCP transport will become
314 significant and TCP transport may only be useful when the client
315 to server interconnect traverses congested gateways.
316 The main problem with using TCP transport is that it is only supported
317 between BSD clients and servers.\**
319 \**There are rumors of commercial NFS over TCP implementations on the horizon
320 and these may well be worth exploring.
322 .sh 1 "Other Tuning Tricks"
324 Another mount option that may improve performance over
325 certain network interconnects is \fB-a=\fInum\fR
326 which sets the number of blocks that the system will
327 attempt to read-ahead during sequential reading of a file. The default value
328 of 1 seems to be appropriate for most situations, but a larger value might
329 achieve better performance for some environments, such as a mount to a server
330 across a ``high bandwidth * round trip delay'' interconnect.
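To get a feel for what a large ``bandwidth * round trip delay'' product means in blocks, the following sketch estimates how many read blocks fit in the pipe between client and server. Treating this number as a guide for the read-ahead count is an assumption of this sketch, not a rule stated here, and the link figures are invented for illustration.

```python
import math


def blocks_in_flight(bandwidth_bits_per_sec, rtt_sec, block_size_bytes=8192):
    """Number of read blocks needed to fill the bandwidth * delay pipe."""
    pipe_bytes = bandwidth_bits_per_sec / 8 * rtt_sec
    return max(1, math.ceil(pipe_bytes / block_size_bytes))


# Local 10 Mbit/s Ethernet with a 1 ms round trip.
print(blocks_in_flight(10e6, 0.001))
# Hypothetical 45 Mbit/s long haul link with a 60 ms round trip.
print(blocks_in_flight(45e6, 0.060))
```

On the local Ethernet a single block already fills the pipe, which is consistent with the default of 1 being adequate for most situations.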
332 For the adventurous, playing with the size of the buffer cache
333 can also improve performance for some environments that use NFS heavily.
334 Under some workloads, a buffer cache of 4-6Mbytes can result in significant
335 performance improvements over 1-2Mbytes, both in client side system call
336 response time and reduced server RPC load.
337 The buffer cache size defaults to 10% of physical memory,
338 but this can be overridden by specifying the BUFPAGES option
339 in the machine's config file.\**
341 BUFPAGES is the number of physical machine pages allocated to the buffer cache.
342 i.e., BUFPAGES * NBPG = buffer cache size in bytes.
344 When increasing the size of BUFPAGES, it is also advisable to increase the
345 number of buffers NBUF by a corresponding amount.
346 Note that there is a tradeoff between memory allocated to the buffer cache and
347 memory available for paging, which implies that making the buffer cache larger
348 will increase the paging rate, with possibly disastrous results.
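The relation BUFPAGES * NBPG = buffer cache size in bytes can be turned around to pick a BUFPAGES value for a target cache size. The page size of 4096 bytes used below is an assumption; NBPG is machine dependent and should be taken from the machine's own header files.

```python
def buffer_cache_size(bufpages, nbpg):
    """Buffer cache size in bytes for a given BUFPAGES and page size."""
    return bufpages * nbpg


def bufpages_for(target_bytes, nbpg):
    """Smallest BUFPAGES whose cache size is at least target_bytes."""
    return -(-target_bytes // nbpg)  # ceiling division


NBPG = 4096  # assumed page size in bytes
for mbytes in (2, 4, 6):
    print(mbytes, bufpages_for(mbytes * 1024 * 1024, NBPG))
```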
349 .sh 1 "Security Issues"
351 When a machine is running an NFS server it opens up a great big security hole.
352 For ordinary NFS, the server receives client credentials
353 in the RPC request as a user id
354 and a list of group ids and trusts them to be authentic!
355 The only tool available to restrict remote access to
356 file systems is the exports(5) file,
357 so file systems should be exported with great care.
358 The exports file is read by mountd upon startup and after a hangup signal
359 is posted for it and then as much of the access specifications as possible are
360 pushed down into the kernel for use by the nfsd(s).
361 The trick here is that the kernel information is stored on a per
362 local file system mount point and client host address basis and cannot refer to
363 individual directories within the local server file system.
364 It is best to think of the exports file as referring to the various local
365 file systems and not just directory paths as mount points.
366 A local file system may be exported to a specific host, all hosts that
367 match a subnet mask or all other hosts (the world). The latter is very
368 dangerous and should only be used for public information. It is also
369 strongly recommended that file systems exported to ``the world'' be exported
371 For each host or group of hosts, the file system can be exported read-only or
373 You can also define one of three client user id to server credential
374 mappings to help control access.
375 Root (user id == 0) can be mapped to some default credentials while all other
376 user ids are accepted as given.
377 If the default credentials for user id equal zero
378 are root, then there is essentially no remapping.
379 Most NFS file systems are exported this way, most commonly mapping
380 user id == 0 to the credentials for the user nobody.
381 Since the client user id and group id list is used unchanged on the server
382 (except for root), this also implies that
383 the user id and group id space must be common between the client and server.
384 (i.e., user id N on the client must refer to the same user on the server.)
385 All user ids can be mapped to a default set of credentials, typically that of
386 the user nobody. This essentially gives world access to all
387 users on the corresponding hosts.
389 There is also a non-standard BSD
390 \fB-kerb\fR export option that requires the client provide
391 a KerberosIV rcmd service ticket to authenticate the user on the server.
392 If successful, the Kerberos principal is looked up in the server's password
393 and group databases to get a set of credentials and a map of client userid to
394 these credentials is then cached.
395 The use of TCP transport is strongly recommended,
396 since the scheme depends on the TCP connection to avert replay attempts.
397 Unfortunately, this option is only usable
398 between BSD clients and servers since it is
399 not compatible with other known ``kerberized'' NFS systems.
400 To enable use of this Kerberos option, both mount_nfs on the client and
401 nfsd on the server must be rebuilt with the -DKERBEROS option and
402 linked to KerberosIV libraries.
403 The file system is then exported to the client(s) with the \fB-kerb\fR option
404 in the exports file on the server
405 and the client mount specifies the
412 mount option may be used to specify a Kerberos Realm for the ticket
413 (it must be the Kerberos Realm of the server) that is other than
414 the client's local Realm.
415 To access files in a \fB-kerb\fR mount point, the user must have a valid
416 TGT for the server's Realm, as provided by kinit or similar.
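The first two credential mappings described above (root only, and all user ids) can be modelled with a few lines of code. This is an illustrative sketch rather than the kernel's logic: the mapping names are hypothetical, and the uid and group id used for the user nobody are assumed values.

```python
NOBODY = (32767, [32767])  # assumed uid and group list for the user nobody


def map_credentials(uid, gids, mapping, default=NOBODY):
    """Return the (uid, gids) pair the server would use for a request.

    mapping is "root" (only uid 0 is replaced by default) or
    "all" (every uid is replaced by default).
    """
    if mapping == "all" or (mapping == "root" and uid == 0):
        return default
    return (uid, list(gids))


print(map_credentials(0, [0], "root"))        # root is remapped
print(map_credentials(1001, [100], "root"))   # other users pass through
print(map_credentials(1001, [100], "all"))    # everyone is remapped
```

Passing root's own credentials as the default in the "root" case reproduces the situation where there is essentially no remapping.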
418 As well as the standard NFS Version 2 protocol (RFC1094) implementation, BSD
419 systems can use a variant of the protocol called Not Quite NFS (NQNFS) that
420 supports a variety of protocol extensions.
421 This protocol uses 64bit file offsets
422 and sizes, an \fIaccess rpc\fR, an \fIappend\fR option on the write rpc
423 and extended file attributes to support 4.4BSD file system functionality
425 It also makes use of a variant of short term
426 \fIleases\fR [Gray89] with delayed write client caching,
427 in an effort to provide full cache consistency and better performance.
428 This protocol is available between 4.4BSD systems only and is used when
429 the \fB-q\fR mount option is specified.
430 It can be used with any of the aforementioned options for NFS, such as TCP
431 transport (\fB-T\fR) and KerberosIV authentication (\fB-K\fR).
432 Although this protocol is experimental, it is recommended over NFS for
433 mounts between 4.4BSD systems.\**
435 \**I would appreciate email from anyone who can provide
436 NFS vs. NQNFS performance measurements,
437 particularly for fast clients, many clients or over an internetwork
438 connection with a large ``bandwidth * RTT'' product.
440 .sh 1 "Monitoring NFS Activity"
442 The basic command for monitoring NFS activity on clients and servers is
443 nfsstat. It reports cumulative statistics of various NFS activities,
444 such as counts of the various different RPCs and cache hit rates on the client
445 and server. Of particular interest on the server are the fields in the
446 \fIServer Cache Stats:\fR section, which gives numbers for RPC retries received
447 in the first three fields and total RPCs in the fourth. The first three fields
448 should remain a very small percentage of the total. If not, it
449 would indicate one or more clients doing retries too aggressively and the fix
450 would be to isolate these clients,
451 disable the dynamic RTO estimation on them and
452 make their initial timeout interval a conservative (i.e., large) value.
454 On the client side, the fields in the \fIRpc Info:\fR section are of particular
455 interest, as they give an overall picture of NFS activity.
456 The \fITimedOut\fR field is the number of I/O system calls that returned -1
457 for ``soft'' mounts and can be reduced
458 by increasing the retry limit or changing
459 the mount type to ``intr'' or ``hard''.
460 The \fIInvalid\fR field is a count of trashed RPC replies that are received
461 and should remain zero.\**
463 \**Some NFS implementations run with UDP checksums disabled, so garbage RPC
464 messages can be received.
466 The \fIX Replies\fR field counts the number of repeated RPC replies received
467 from the server and is a clear indication of a too aggressive RTO estimate.
468 Unfortunately, a good NFS server implementation will use a ``recent request
469 cache'' [Juszczak89] that will suppress the extraneous replies.
470 A large value for \fIRetries\fR indicates a problem, but
473 a too aggressive RTO estimate
475 an overloaded NFS server
477 IP fragments being dropped (gateway, client or server)
479 and requires further investigation.
480 The \fIRequests\fR field is the total count of RPCs done on all servers.
482 The \fBnetstat -s\fR command is useful when investigating RPC transport
484 The field \fIfragments dropped after timeout\fR in
485 the \fIip:\fR section indicates IP fragments are
486 being lost and a significant number of these occurring indicates that the
487 use of TCP transport or a smaller read/write data size is in order.
488 A significant number of \fIbad checksums\fR reported in the \fIudp:\fR
489 section would suggest network problems of a more generic sort
490 (cabling, transceiver or network hardware interface problems or similar).
492 There is an RPC activity logging facility for both the client and
493 server side in the kernel.
494 When logging is enabled by setting the kernel variable nfsrtton to
495 one, the logs in the kernel structures nfsrtt (for the client side)
496 and nfsdrt (for the server side) are updated upon the completion
497 of each RPC in a circular manner.
498 The pos element of the structure is the index of the next element
499 of the log array to be updated.
500 In other words, elements of the log array from \fIlog\fR[pos] to
501 \fIlog\fR[pos - 1], wrapping around the end of the array, are in chronological order.
502 The include file <sys/nfsrtt.h> should be consulted for details on the
503 fields in the two log structures.\**
505 \**Unfortunately, a monitoring tool that uses these logs is still in the
506 planning (dreaming) stage.
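Reading such a circular log in chronological order amounts to starting at \fIpos\fR and wrapping around. The sketch below uses a plain list in place of the kernel structure and assumes the log has already wrapped at least once, so that every slot holds a valid entry.

```python
def chronological(log, pos):
    """Return log entries oldest first.

    pos is the index of the next slot to be updated, so log[pos] is the
    oldest entry and log[pos - 1] the most recent one.
    """
    return log[pos:] + log[:pos]


# Entries were written in the order c, d, e, f; the next write goes to index 2.
log = ["e", "f", "c", "d"]
print(chronological(log, 2))
```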
508 .sh 1 "Diskless Client Support"
510 The NFS client does include kernel support for diskless/dataless operation
511 where the root file system and optionally the swap area are remotely NFS mounted.
512 A diskless/dataless client is configured using a version of the
513 ``swapvmunix.c'' file as provided in the directory \fIcontrib/diskless.nfs\fR.
514 If the swap device == NODEV, it specifies an NFS mounted swap area and should
515 be configured the same size as set up by diskless_setup when run on the server.
516 This file must be put in the \fIsys/compile/<machine_name>\fR kernel build
517 directory after the config command has been run, since config does
518 not know about specifying NFS root and swap areas.
519 The kernel variable mountroot must be set to nfs_mountroot instead of
520 ffs_mountroot and the kernel structure nfs_diskless must be filled in
522 There are some primitive system administration tools in the \fIcontrib/diskless.nfs\fR directory to assist in filling in
523 the nfs_diskless structure and in setting up an NFS server for
524 diskless/dataless clients.
525 The tools were designed to provide a bare bones capability, to allow maximum
526 flexibility when setting up different servers.
528 The tools are as follows:
530 diskless_offset.c - This little program reads a ``vmunix'' object file and
531 writes the file byte offset of the nfs_diskless structure in it to
532 standard out. It was kept separate because it sometimes has to
533 be compiled/linked in funny ways depending on the client architecture.
534 (See the comment at the beginning of it.)
536 diskless_setup.c - This program is run on the server and sets up files for a
537 given client. It mostly just fills in an nfs_diskless structure and
538 writes it out to either the ``vmunix'' file or a separate file called
539 /var/diskless/setup.<official-hostname>
541 diskless_boot.c - There are two functions in here that may be used
542 by a bootstrap server such as tftpd to permit sharing of the ``vmunix''
543 object file for similar clients. This saves disk space on the bootstrap
544 server and simplifies organization, but is not critical for correct operation.
545 They read the ``vmunix''
546 file, but optionally fill in the nfs_diskless structure from a
547 separate ``setup.<official-hostname>'' file so that there is only
548 one copy of ``vmunix'' for all similar (same arch etc.) clients.
549 These functions use a text file called
550 /var/diskless/boot.<official-hostname> to control the netboot.
552 The basic setup steps are:
554 Make a ``vmunix'' for the client(s) with mountroot() == nfs_mountroot()
555 and swdevt[0].sw_dev == NODEV if it is to do nfs swapping as well
556 (See the same swapvmunix.c file)
558 Run diskless_offset on the vmunix file to find out the byte offset
559 of the nfs_diskless structure
561 Run diskless_setup on the server to set up the server and fill in the
562 nfs_diskless structure for that client.
563 The nfs_diskless structure can either be written into the
564 vmunix file (the -x option) or
565 saved in /var/diskless/setup.<official-hostname>.
567 Set up the bootstrap server. If the nfs_diskless structure was written into
568 the ``vmunix'' file, any vanilla bootstrap protocol such as bootp/tftp can
569 be used. If the bootstrap server has been modified to use the functions in
570 diskless_boot.c, then a
571 file called /var/diskless/boot.<official-hostname>
573 It is simply a two line text file, where the first line is the pathname
574 of the correct ``vmunix'' file and the second line has the pathname of
575 the nfs_diskless structure file and its byte offset in it.
578 /var/diskless/vmunix.pmax
580 /var/diskless/setup.rickers.cis.uoguelph.ca 642308
583 Create a /var subtree for each client in an appropriate place on the server,
584 such as /var/diskless/var/<client-hostname>/...
585 By using the <client-hostname> to differentiate /var for each host,
586 /etc/rc can be modified to mount the correct /var from the server.
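The two line boot.<official-hostname> file described in the setup steps above is simple enough to parse by hand. The following sketch shows one way a tool might read it; the function name is hypothetical and this is not the code found in diskless_boot.c.

```python
def parse_boot_file(text):
    """Parse the two line boot file into (kernel_path, setup_path, offset)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError("expected exactly two non-empty lines")
    kernel_path = lines[0]
    # Second line: pathname of the nfs_diskless structure file, then the
    # byte offset of the structure in the kernel object file.
    setup_path, offset = lines[1].split()
    return kernel_path, setup_path, int(offset)


example = """/var/diskless/vmunix.pmax
/var/diskless/setup.rickers.cis.uoguelph.ca 642308
"""
print(parse_boot_file(example))
```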