## Description of VFS					Thomas Veerman 21-3-2013
## This file is organized such that it can be read both in a Wiki and on
## the MINIX terminal using e.g. vi or less. Please keep the file in the
## source tree as the canonical version and copy changes into the Wiki.
#pragma section-numbers 2
## 1 ..... General description of responsibilities
## 2 ..... General architecture
## 3 ..... Worker threads
## 4 ..... Locking
## 4.1 .... Locking requirements
## 4.2 .... Three-level Lock
## 4.3 .... Data structures subject to locking
## 4.4 .... Locking order
## 4.5 .... Vmnt (file system) locking
## 4.6 .... Vnode (open file) locking
## 4.7 .... Filp (file position) locking
## 4.8 .... Lock characteristics per request type
## 5 ..... Recovery from driver crashes
## 5.1 .... Recovery from block driver crashes
## 5.2 .... Recovery from character driver crashes
## 5.3 .... Recovery from File Server crashes
== General description of responsibilities ==
## 1 General description of responsibilities
VFS implements the file system in cooperation with one or more File Servers
(FS). The File Servers take care of the actual file system on a partition. That
is, they interpret the data structure on disk, write and read data to/from
disk, etc. VFS sits on top of those File Servers and communicates with
them. Looking inside VFS, we can identify several roles. First, VFS handles
most of the POSIX system calls that are supported by Minix. Additionally,
it supports a few calls necessary for libc. The following system calls are
handled by VFS:
access, chdir, chmod, chown, chroot, close, creat, fchdir, fcntl, fstat,
fstatvfs, fsync, ftruncate, getdents, getvfsstat, ioctl, link, lseek,
lstat, mkdir, mknod, mount, open, pipe2, read, readlink, rename, rmdir, select,
stat, statvfs, symlink, sync, truncate, umask, umount, unlink, utimes, write.
Second, it maintains part of the state belonging to a process (process state is
spread out over the kernel, VM, PM, and VFS). For example, it maintains state
for select(2) calls, file descriptors, and file positions. Also, it cooperates
with the Process Manager to handle the fork, exec, and exit system calls.
Third, VFS keeps track of endpoints that are supposed to be drivers for
character or block special files. File Servers can be regarded as drivers for
block special files, although they are handled entirely differently from
other drivers.
The following diagram depicts how a read() on a file in /home is being handled:

 [diagram truncated; the boxes for the user process and VFS are missing]

   -------      --------      ---------
   | MFS |      | MFS  |      | MFS   |
   | /   |      | /usr |      | /home |
   -------      --------      ---------

	Diagram 1: handling of read(2) system call
The user process executes the read system call, which is delivered to VFS. VFS
verifies that the read is done on a valid (open) file and forwards the request
to the FS responsible for the file system on which the file resides. The FS
reads the data, copies it directly to the user process, and replies to VFS
that it has executed the request. Subsequently, VFS replies to the user process
that the operation is done, and the user process continues to run.
== General architecture ==
## 2 General architecture
VFS works roughly identically to every other server and driver in Minix; it
fetches a message (internally referred to as a job in some cases), executes
the request embedded in the message, returns a reply, and fetches the next
job. There are several sources of new jobs: user processes, PM,
the kernel, and suspended jobs inside VFS itself (suspended operations
on pipes, locks, or character special files). File Servers are regarded as
normal user processes in this case, but their abilities are limited. This
is to prevent deadlocks. Once a job is received, a worker thread starts
executing it. During the lifetime of a job, the worker thread might need
to talk to several File Servers. The protocol VFS speaks with File Servers
is fully documented on the Wiki at [0]. The protocol fields are defined in
<minix/vfsif.h>. If the job is an operation on a character or block special
file and the need to talk to a driver arises, VFS uses the Character and
Block Device Protocol. See [1]. This is sadly not official documentation,
but it is an accurate description of how it works. Luckily, driver writers
can use the libchardriver and libblockdriver libraries and don't have to
know the details of the protocol.
== Worker threads ==
## 3 Worker threads
Upon startup, VFS spawns a configurable number of worker threads. The
main thread fetches requests and replies, and hands them off to idle or
reply-pending workers, respectively. If no worker threads are available,
the request is queued. All standard system calls are handled by such worker
threads. One of the threads is reserved to handle new requests from system
processes (i.e., File Servers and drivers) when there are no normal worker
threads available; all normal threads might be blocked on a single system
process that has sent a request of its own. To unblock
all normal threads, we need to reserve one spare thread to handle that
situation. VFS drives all File Servers and drivers asynchronously. While
waiting for a reply, a worker thread is blocked and other workers can keep
processing requests. Upon reply, the worker thread is unblocked.
As mentioned above, the main thread is responsible for retrieving new jobs and
replies to current jobs, and for starting or unblocking the proper worker
thread. Driver replies are processed directly from the main thread. As a
consequence, these processing routines may not block their calling thread.
In some cases, these routines may resume a thread that is blocked waiting
for the reply. This is always the case for block driver replies, and may
or may not be the case for character driver replies. The character driver
reply processing routines may also unblock suspended processes, which in
turn generate new jobs to be handled by the main loop (e.g., suspended reads
and writes on pipes). So depending on the reply, a new thread may have to
be started.
Worker threads are strictly tied to a process, and each process can have at
most one worker thread running for it. Generally speaking, there are two types
of work supported by worker threads: normal work, and work from PM. The main
subtype of normal work is the handling of a system call made by the process
itself. The process is blocked while VFS is handling the system call, so no new
system call can arrive from a process while VFS has not completed a previous
system call from that process. For that reason, if there are no worker threads
available to handle the work, the work is queued in the corresponding process
entry of the fproc table.
The other main type of work consists of requests from PM. The protocol PM
speaks with VFS is asynchronous. PM is allowed to send up to one request per
process to VFS, in addition to a request to initiate a reboot. Most jobs from
PM are taken care of immediately by the main thread, but some jobs require a
worker thread context (to be able to sleep) and/or serialization with normal
work. Therefore, each process may have a PM request queued for execution, also
in the fproc table. Managing proper queuing, addition, and execution of both
normal and PM work is the responsibility of the worker thread infrastructure.
There are several special tasks that require a worker thread, and these are
implemented as normal work associated with a certain special process that does
not make regular VFS calls anyway. For example, the initial ramdisk mount
procedure uses a thread associated with the VFS process. Some of these special
tasks require protection against being started multiple times at once, as this
is not only undesirable but also disallowed. The full list of worker thread
task types and subtypes is shown in Table 1.
-------------------------------------------------------------------------
| Worker thread task        | Type   | Association     | May use spare? |
+---------------------------+--------+-----------------+----------------+
| system call from process  | normal | calling process | if system proc |
+---------------------------+--------+-----------------+----------------+
| resumed pipe operation    | normal | calling process | no             |
+---------------------------+--------+-----------------+----------------+
| postponed PM request      | PM     | target process  | no             |
+---------------------------+--------+-----------------+----------------+
| DS event notification     | normal | DS              | yes            |
+---------------------------+--------+-----------------+----------------+
| initial ramdisk mounting  | normal | VFS             | no             |
+---------------------------+--------+-----------------+----------------+
| reboot sequence           | normal | PM              | no             |
-------------------------------------------------------------------------

	Table 1: worker thread work types and subtypes
Communication with block drivers is asynchronous, but at this time, access to
these drivers is serialized on a per-driver basis. File Servers are treated
differently. VFS was designed to be able to send requests concurrently to File
Servers, although at the time of writing there are no File Servers that can
actually make use of that functionality. To identify which reply from an FS
belongs to which worker thread, all requests have an embedded transaction
identification number (a magic number + thread ID encoded in the mtype field of
a message) which the FS has to echo upon reply. Because the range of valid
transaction IDs is disjoint from the range of valid system call numbers, VFS
can use that ID to differentiate between replies from File Servers and actual
new system calls from FSes. Using this mechanism VFS is able to support FUSE
and ProcFS.
== Locking ==
## 4 Locking
To ensure correct execution of system calls, worker threads sometimes need
certain objects within VFS to remain unchanged during thread suspension
and resumption (i.e., when they need to communicate with a driver or File
Server). Threads keep most state on the stack, but there are a few global
variables that require protection: the fproc table, vmnt table, vnode table,
and filp table. Other tables, such as the lock table, select table, and dmap
table, don't require protection by means of exclusive access; there it is
required and sufficient to simply mark an entry as in use.
=== Locking requirements ===
## 4.1 Locking requirements
VFS implements the locking model described in [2]. For completeness of this
document we'll describe it here, too. The requirements are based on a threading
package that is non-preemptive. VFS must guarantee correct functioning with
several semi-concurrently executing threads in any arbitrary order. The
latter requirement follows from the fact that threads need service from
other components like File Servers and drivers, and those may take any amount
of time to complete requests.
 1. Consistency of replicated values. Several system calls rely on VFS keeping a replicated representation of data in File Servers (e.g., file sizes, file modes, etc.).
 1. Isolation of system calls. Many system calls involve multiple requests to FSes. Concurrent requests from other processes must not lead to otherwise impossible results (e.g., a chmod operation on a file cannot fail halfway through because it's suddenly unlinked or moved).
 1. Integrity of objects. From the point of view of threads, obtaining mutual exclusion is a potentially blocking operation. The integrity of any objects used across blocking calls must be guaranteed (e.g., the file mode in a vnode must remain intact not only when talking to other components, but also when obtaining a lock on a filp).
 1. No deadlock. No call may cause another call to never complete. Deadlock situations are typically the result of two or more threads that each hold exclusive access to one resource and want exclusive access to the resource held by the other thread. These resources are a) data (global variables) and b) worker threads.
  a. Conflicts between locking of different types of objects can be avoided by keeping a locking order: objects of different types must always be locked in the same order. If multiple objects of the same type are to be locked, then first a "common denominator" higher up in the locking order must be locked.
  a. Some threads can only run to completion when another thread does work on their behalf. Examples of this are drivers and file servers that do system calls on their own (e.g., ProcFS, PFS/UNIX Domain Sockets, FUSE) or crashing components (e.g., a driver for a character special file that crashes during a request; a second thread is required to handle resource clean-up or driver restart before the first thread can abort or retry the request).
 1. No starvation. VFS must guarantee that every system call completes in finite time (e.g., an infinite stream of reads must never completely block writes). Furthermore, we want to maximize parallelism to improve performance. This leads to:
  1. A request to one File Server must not block access to other FS processes. This means that most forms of locking cannot take place at a global level, and must at most take place on the file system level.
  1. No read-only operation on a regular file may block an independent read call to that file. In particular, (read-only) open and close operations may not block such reads, and multiple independent reads on the same file must be able to take place concurrently (i.e., reads that do not share a file position between their file descriptors).
=== Three-level Lock ===
## 4.2 Three-level Lock
From the requirements it follows that we need at least two locking types: read
and write locks. Concurrent reads are allowed, but writes are exclusive both
from reads and from each other. However, in a lot of cases it is possible to
use a third locking type that is in between a read and a write lock: the
serialize lock. This is implemented in the three-level lock [2]. The
three-level lock supports the following locking types:
TLL_READ: allows an unlimited number of threads to hold the lock with the
same type (both the thread itself and other threads); N * concurrent.
TLL_READSER: also allows an unlimited number of threads with type TLL_READ,
but only one thread can obtain serial access to the lock; N * concurrent +
1 * serial.
TLL_WRITE: provides full mutual exclusion; 1 * exclusive + 0 * concurrent +
0 * serial.
In the absence of TLL_READ locks, a TLL_READSER is identical to TLL_WRITE.
However, TLL_READSER never blocks concurrent TLL_READ access. TLL_READSER
can be upgraded to TLL_WRITE; the thread will block until the last TLL_READ
lock leaves and new TLL_READ locks are blocked. Locks can be downgraded to a
lower type. The three-level lock is implemented using two FIFO queues with
write-bias. This guarantees no starvation.
=== Data structures subject to locking ===
## 4.3 Data structures subject to locking
VFS has a number of global data structures. See Table 2.

--------------------------------------------------------------------
| Structure  | Object description                                  |
+------------+-----------------------------------------------------|
| fproc      | Process (includes process's file descriptors)       |
+------------+-----------------------------------------------------|
| vmnt       | Virtual mount; a mounted file system                |
+------------+-----------------------------------------------------|
| vnode      | Virtual node; an open file                          |
+------------+-----------------------------------------------------|
| filp       | File position into an open file                     |
+------------+-----------------------------------------------------|
| lock       | File region locking state for an open file          |
+------------+-----------------------------------------------------|
| select     | State for an in-progress select(2) call             |
+------------+-----------------------------------------------------|
| dmap       | Mapping from major device number to a device driver |
--------------------------------------------------------------------

	Table 2: VFS object types.
An fproc object is a process. An fproc object is created by fork(2)
and destroyed by exit(2) (which may or may not be initiated by the
process itself). It is identified by its endpoint number ('fp_endpoint')
and process id ('fp_pid'). Both are unique, although in general the endpoint
number is used throughout the system.
A vmnt object is a mounted file system. It is created by mount(2) and destroyed
by umount(2). It is identified by a device number ('m_dev') and FS endpoint
number ('m_fs_e'); both are unique to each vmnt object. There is always a
single process that handles a file system on a device, and a device cannot
be mounted more than once.
A vnode object is the VFS representation of an open inode on the file
system. A vnode object is created when a first process opens or creates the
corresponding file and is destroyed when the last process, which has that
file open, closes it. It is identified by a combination of FS endpoint number
('v_fs_e') and inode number of that file system ('v_inode_nr'). A vnode
might be mapped to another file system; the actual reading and writing is
handled by a different endpoint. This has no effect on locking.
A filp object contains a file position within a file. It is created when a file
is opened or an anonymous pipe is created, and destroyed when the last user
(i.e., process) closes it. A file descriptor always points to a single filp. A
filp always points to a single vnode, although not all vnodes are pointed to by
a filp. A filp has a reference count ('filp_count') which is identical to the
number of file descriptors pointing to it. It can be increased by a dup(2)
or fork(2). A filp can therefore be shared by multiple processes.
A lock object keeps information about the locking of file regions. This has
nothing to do with the threading type of locking. The lock objects require
no locking protection and won't be discussed further.
A select object keeps information on a select(2) operation that cannot
be fulfilled immediately (waiting for a timeout or for file descriptors to
become ready). They are identified by their owner ('requestor'), a pointer
into the fproc table; a null pointer means not in use. A select object can be
used by only one process, and a process can do only one select(2) at a time.
Select(2) operates on filps and is organized in such a way that it is
sufficient to apply locking on individual filps and not on select objects
themselves. They won't be discussed further.
A dmap object is a mapping from a device number to a device driver. A device
driver can have multiple device numbers associated with it (e.g., TTY). Access
to a driver is exclusive when it uses the synchronous driver protocol.
=== Locking order ===
## 4.4 Locking order
Based on the description in the previous section, we need protection for
fproc, vmnt, vnode, and filp objects. To prevent deadlocks as a result of
object locking, we need to define a strict locking order. In VFS we use the
following order:

  fproc > [exec] > vmnt > vnode > filp > [block special file] > [dmap]

That is, no thread may lock an fproc object while holding a vmnt lock, and
no thread may lock a vmnt object while holding an (associated) vnode lock,
etc.
Fproc needs protection because processes themselves can initiate system
calls, but PM can also cause system calls that have to be executed on their
behalf. For example, a process might be busy reading from a character device
while another process sends it a termination signal. The exit(2) that follows
is sent by PM and is to be executed by the to-be-killed process itself. At this
point there is contention for the fproc object that belongs to the process,
hence the need for protection. This problem is solved in a simple way. Recall
that all worker threads are bound to a process. This also forms the basis of
fproc locking: each worker thread acquires and holds the fproc lock for its
associated process for as long as it is processing work for that process.
There are two cases where a worker thread may hold the lock to more than one
process. First, as mentioned, the reboot procedure is executed from a worker
thread set in the context of the PM process, thus with the PM process entry
lock held. The procedure itself then acquires a temporary lock on every other
process in turn, in order to clean it up without interference. Thus, the PM
process entry is higher up in the locking order than all other process entries.
Second, the exec(2) call is protected by a lock, and this exec lock is
currently implemented as a lock on the VM process entry. The exec lock is
acquired by a worker thread for the process performing the exec(2) call, and
thus, the VM process entry is below all other process entries in the locking
order. The exec(2) call is protected by a lock for the following reason. VFS
uses a number of variables on the heap to read ELF headers. They are on the
heap due to their size; putting them on the stack would increase stack size
demands for worker threads. The exec call does blocking read calls and thus
needs exclusive access to these variables. However, only the exec(2) call
uses these variables, so a single exec lock suffices.
Access to block special files needs to be exclusive. File Servers are
responsible for handling reads from and writes to block special files; if
a block special file is on a device that is mounted, the FS responsible for
that mount point takes care of it, otherwise the FS that handles the root of
the file system is responsible. Due to mounting and unmounting file systems,
the FS handling a block special file may change. Locking the vnode is not
enough since the inode can be on an entirely different File Server. Therefore,
access to block special files must be mutually exclusive from concurrent
mount(2)/umount(2) operations. However, when we're not accessing a block
special file, we don't need this lock.
=== Vmnt (file system) locking ===
## 4.5 Vmnt (file system) locking
Vmnt locking cannot be seen completely separately from vnode locking. For
example, umount(2) fails if there are still in-use vnodes, which means that
FS requests [0] involving only in-use inodes do not have to acquire a vmnt
lock. On the other hand, all other requests do need a vmnt lock. Extrapolating
this to system calls, this means that all system calls involving a file
descriptor don't need a vmnt lock, and all other system calls (that make FS
requests) do need a vmnt lock.
-------------------------------------------------------------------------------
| Category          | System calls                                            |
+-------------------+---------------------------------------------------------+
| System calls with | access, chdir, chmod, chown, chroot, creat, dumpcore+,  |
| a path name       | exec, link, lstat, mkdir, mknod, mount, open, readlink, |
| argument          | rename, rmdir, stat, statvfs, symlink, truncate, umount,|
|                   | unlink, utimes                                          |
+-------------------+---------------------------------------------------------+
| System calls with | close, fchdir, fcntl, fstat, fstatvfs, ftruncate,       |
| a file descriptor | getdents, ioctl, lseek, pipe, read, select, write       |
| argument          |                                                         |
+-------------------+---------------------------------------------------------+
| System calls with | fsync++, getvfsstat, sync, umask                        |
| neither           |                                                         |
-------------------------------------------------------------------------------

	Table 3: System call categories.
+ path name argument is implicit; the path name is "core.<pid>"
++ although fsync actually provides a file descriptor argument, it's only
used to find the vmnt and not to do any actual operations on a file
Before we describe what kind of vmnt locks VFS applies to system calls with a
path name or other arguments, we need to make some notes on path lookup. Path
lookups take arbitrary paths as input (relative and absolute). They can start
at any vmnt (based on the root directory and working directory of the process
doing the lookup) and visit any file system in arbitrary order, possibly
visiting the same file system more than once. As such, VFS can never tell in
advance at which File Server a lookup will end. This has the following
consequences:
 * In the lookup procedure, only one vmnt must be locked at a time. When
 moving from one vmnt to another, the first vmnt has to be unlocked before
 acquiring the next lock, to prevent deadlocks.
 * The lookup procedure must lock each visited file system with TLL_READSER
 and downgrade or upgrade to the lock type desired by the caller for the
 destination file system (as VFS cannot know which file system is final). This
 is to prevent deadlocks when one thread acquires a TLL_READSER on a vmnt and
 another thread acquires TLL_READ on the same vmnt. If the second thread is
 blocked on the first thread due to it acquiring a lock on a vnode, the first
 thread will be unable to upgrade its TLL_READSER lock to TLL_WRITE.
We use the following mapping for vmnt locks onto three-level lock types:

-------------------------------------------------------------------------------
| Lock type  | Mapped to   | Used for                                         |
+------------+-------------+--------------------------------------------------+
| VMNT_READ  | TLL_READ    | Read-only operations and fully independent write |
|            |             | operations                                       |
+------------+-------------+--------------------------------------------------+
| VMNT_WRITE | TLL_READSER | Independent create and modify operations         |
+------------+-------------+--------------------------------------------------+
| VMNT_EXCL  | TLL_WRITE   | Delete and dependent write operations            |
-------------------------------------------------------------------------------

	Table 4: vmnt to tll lock mapping
The following table shows a sub-categorization of system calls without a
file descriptor argument, together with their locking types and the
motivation for them (see Table 5).
-------------------------------------------------------------------------------
| Group       | System calls | Lock type  | Motivation                        |
+-------------+--------------+------------+-----------------------------------+
| File open   | chdir,       | VMNT_READ  | These operations do not interfere |
| ops.        | chroot, exec,|            | with each other, as vnodes can be |
| (non-create)| open         |            | opened concurrently, and open     |
|             |              |            | operations do not affect          |
|             |              |            | replicated state.                 |
+-------------+--------------+------------+-----------------------------------+
| File create-| creat,       | VMNT_EXCL  | File create ops. require mutual   |
| and-open    | open(O_CREAT)| for create | exclusion from concurrent file    |
| ops.        |              | VMNT_WRITE | open ops. If the file already     |
|             |              | for open   | existed, the VMNT_WRITE lock that |
|             |              |            | is necessary for the lookup is    |
|             |              |            | not upgraded                      |
+-------------+--------------+------------+-----------------------------------+
| File create-| pipe         | VMNT_READ  | These create nameless inodes      |
| unique-and- |              |            | which cannot be opened by means   |
| open ops.   |              |            | of a path. Their creation         |
|             |              |            | therefore does not interfere with |
|             |              |            | anything else                     |
+-------------+--------------+------------+-----------------------------------+
| File create-| mkdir, mknod,| VMNT_WRITE | These operations do not affect    |
| only ops.   | symlink      |            | any VFS state, and can therefore  |
|             |              |            | take place concurrently with open |
|             |              |            | operations                        |
+-------------+--------------+------------+-----------------------------------+
| File info   | access, lstat| VMNT_READ  | These operations do not interfere |
| retrieval or| readlink,stat|            | with each other and do not modify |
| modification| utimes       |            | replicated state                  |
+-------------+--------------+------------+-----------------------------------+
| File        | chmod, chown,| VMNT_READ  | These operations do not interfere |
| modification| truncate     |            | with each other. They do need     |
|             |              |            | exclusive access to the vnode     |
+-------------+--------------+------------+-----------------------------------+
| File link   | link         | VMNT_WRITE | Identical to file create-only     |
| ops.        |              |            | operations                        |
+-------------+--------------+------------+-----------------------------------+
| File unlink | rmdir, unlink| VMNT_EXCL  | These must not interfere with     |
| ops.        |              |            | file create operations, to avoid  |
|             |              |            | the scenario where inodes are     |
|             |              |            | reused immediately. However, due  |
|             |              |            | to necessary path checks, the     |
|             |              |            | vmnt is first locked VMNT_WRITE   |
|             |              |            | and then upgraded                 |
+-------------+--------------+------------+-----------------------------------+
| File rename | rename       | VMNT_EXCL  | Identical to file unlink          |
| ops.        |              |            | operations                        |
+-------------+--------------+------------+-----------------------------------+
| Non-file    | sync, umask, | VMNT_READ  | umask does not involve the file   |
| ops.        | getvfsstat   | or none    | system, so it does not need       |
|             |              |            | locks. sync does not alter state  |
|             |              |            | in VFS and is atomic at the FS    |
|             |              |            | level. getvfsstat caches stats    |
|             |              |            | only and requires no exclusion.   |
-------------------------------------------------------------------------------

	Table 5: sub-categorization of system calls without a file
	descriptor argument
=== Vnode (open file) locking ===
## 4.6 Vnode (open file) locking
Compared to vmnt locking, vnode locking is relatively straightforward. All
read-only accesses to vnodes that merely read the vnode object's fields are
allowed to be concurrent. Consequently, all accesses that change fields
of a vnode object must be exclusive. This leaves us with creation and
destruction of vnode objects (and, related to that, their reference counts);
it's sufficient to serialize these accesses. This follows from the fact
that a vnode is only created when the first user opens it, and destroyed
when the last user closes it. An open file in process A cannot be closed
by process B. Note that this also relies on the fact that a process can do
only one system call at a time. Kernel threads would violate this assumption.
We use the following mapping for vnode locks onto three-level lock types:

-------------------------------------------------------------------------------
| Lock type  | Mapped to   | Used for                                         |
+------------+-------------+--------------------------------------------------+
| VNODE_READ | TLL_READ    | Read access to previously opened vnodes          |
+------------+-------------+--------------------------------------------------+
| VNODE_OPCL | TLL_READSER | Creation, opening, closing, and destruction of   |
|            |             | vnodes                                           |
+------------+-------------+--------------------------------------------------+
| VNODE_WRITE| TLL_WRITE   | Write access to previously opened vnodes         |
-------------------------------------------------------------------------------

	Table 6: vnode to tll lock mapping
When vnodes are destroyed, they are initially locked with VNODE_OPCL. After
all, we're going to alter the reference count, so this must be serialized. If
the reference count then reaches zero, we obtain exclusive access. This should
always be immediately possible unless there is a consistency problem. See
section 4.8 for an exhaustive listing of locking methods for all operations
on vnodes.
=== Filp (file position) locking ===
## 4.7 Filp (file position) locking
The main fields of a filp object that are shared between various processes
(and by extension threads), and that can change after object creation,
are filp_count and filp_pos. Writes to and reads from filp objects must be
mutually exclusive, as all system calls have to use the latest version. For
example, a read(2) call changes the file position (i.e., filp_pos), so two
concurrent reads must obtain exclusive access. Consequently, as even read
operations require exclusive access, filp objects don't use three-level locks,
but normal mutexes.

System calls that involve a file descriptor often access both the filp and
the corresponding vnode. The locking order requires us to first lock the
vnode and then the filp. This is taken care of at the filp level. Whenever
a filp is locked, a lock on the vnode is acquired first. Conversely, when
a filp is unlocked, the corresponding vnode is also unlocked. A convenient
consequence is that whenever a vnode is locked exclusively (VNODE_WRITE),
all corresponding filps are implicitly locked. This is of particular use
when multiple filps must be locked at the same time:
 * When opening a named pipe, VFS must make sure that there is at most one filp for the reader end and one filp for the writer end.
 * Pipe readers and writers must be suspended in the absence of (respectively) writers and readers.
Because both filps are linked to the same vnode object (they are for the same
pipe), it suffices to exclusively lock that vnode instead of both filp objects.

In some cases it can happen that a function that operates on a locked filp
calls another function that triggers another lock on a different filp for
the same vnode. An example is close_filp(). At some point, close_filp() calls
release(), which in turn will loop through the filp table looking for pipes
being select(2)ed on. If there are any, the select code will lock the filp and
do operations on it. This works fine when doing a select(2) call, but conflicts
with close(2) or exit(2). Lock_filp() makes an exception for this situation:
if you've already locked a vnode with VNODE_OPCL or VNODE_WRITE when locking
a filp, you obtain a "soft lock" on the vnode for this filp. This means
that lock_filp() won't actually try to lock the vnode (which wouldn't work),
but flags the vnode as "skip unlock_vnode upon unlock_filp." Upon unlocking
the filp, the vnode remains locked, the soft lock is removed, and the filp
mutex is released. Note that this scheme does not violate the locking order;
the vnode is (already) locked before the filp.

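
A minimal sketch of the soft-lock bookkeeping might look like this. All
names and the integer lock encoding are hypothetical; the real lock_filp()
operates on VFS's tll structures and per-thread lock state.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: v_lock records the lock level
 * (0 = none, 1 = VNODE_READ, 2 = VNODE_OPCL, 3 = VNODE_WRITE). */
struct vnode { int v_lock; bool v_softlock; };
struct filp  { struct vnode *f_vnode; bool f_locked; };

static void lock_filp(struct filp *f, int vnode_lock_wanted)
{
	struct vnode *vp = f->f_vnode;
	if (vp->v_lock >= 2) {
		/* Vnode already locked VNODE_OPCL or VNODE_WRITE: don't
		 * lock it again; flag "skip unlock_vnode upon unlock_filp" */
		vp->v_softlock = true;
	} else {
		vp->v_lock = vnode_lock_wanted;	/* normal case: vnode first */
	}
	f->f_locked = true;			/* then the filp mutex */
}

static void unlock_filp(struct filp *f)
{
	struct vnode *vp = f->f_vnode;
	if (vp->v_softlock)
		vp->v_softlock = false;	/* soft lock: vnode stays locked */
	else
		vp->v_lock = 0;		/* normal case: unlock vnode too */
	f->f_locked = false;		/* release the filp mutex */
}
```
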
A similar problem arises with create_pipe. In this case we obtain a new vnode
object, lock it, and obtain two new, locked, filp objects. If everything works
out and the filp objects are linked to the same vnode, we run into trouble
when unlocking both filps. Unlocking the first filp would work; the second
filp no longer has an associated vnode that's locked. Therefore we introduced
a plural unlock_filps(filp1, filp2) that can unlock two filps that both point
to the same vnode.

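
Sketched in code, the plural unlock might look like this (hypothetical names
and a toy lock encoding; the point is only that the shared vnode is released
exactly once):

```c
#include <assert.h>
#include <stdbool.h>

/* Both pipe filps reference the same vnode, which must be unlocked
 * exactly once. v_lock: 0 = unlocked, nonzero = locked. */
struct vnode { int v_lock; };
struct filp  { struct vnode *f_vnode; bool f_locked; };

static void unlock_filps(struct filp *f1, struct filp *f2)
{
	/* precondition: f1->f_vnode == f2->f_vnode (same pipe) */
	f1->f_locked = false;
	f2->f_locked = false;
	f1->f_vnode->v_lock = 0;	/* release the shared vnode once */
}
```
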
=== Lock characteristics per request type ===
## 4.8 Lock characteristics per request type
For File Servers that support concurrent requests, it's useful to know which
locking guarantees VFS provides for vmnts and vnodes, so they can take that
into account when protecting internal data structures. READ = TLL_READ,
READSER = TLL_READSER, WRITE = TLL_WRITE. The vnode locks apply to the
REQ_INODE_NR field in requests, unless the notes say otherwise.

------------------------------------------------------------------------------
| request      | vmnt    | vnode   | notes                                   |
+--------------+---------+---------+-----------------------------------------+
| REQ_BREAD    |         | READ    | VFS serializes reads from and writes to |
|              |         |         | block special files                     |
+--------------+---------+---------+-----------------------------------------+
| REQ_BWRITE   |         | WRITE   | VFS serializes reads from and writes to |
|              |         |         | block special files                     |
+--------------+---------+---------+-----------------------------------------+
| REQ_CHMOD    | READ    | WRITE   | vmnt is only locked if file is not      |
|              |         |         | already opened                          |
+--------------+---------+---------+-----------------------------------------+
| REQ_CHOWN    | READ    | WRITE   | vmnt is only locked if file is not      |
|              |         |         | already opened                          |
+--------------+---------+---------+-----------------------------------------+
| REQ_CREATE   | WRITE   | WRITE   | The directory in which the file is      |
|              |         |         | created is write locked                 |
+--------------+---------+---------+-----------------------------------------+
| REQ_FLUSH    |         |         | Mutually exclusive to REQ_BREAD and     |
|              |         |         | REQ_BWRITE                              |
+--------------+---------+---------+-----------------------------------------+
| REQ_FTRUNC   | READ    | WRITE   | vmnt is only locked if file is not      |
|              |         |         | already opened                          |
+--------------+---------+---------+-----------------------------------------+
| REQ_GETDENTS | READ    | READ    | vmnt is only locked if file is not      |
|              |         |         | already opened                          |
+--------------+---------+---------+-----------------------------------------+
| REQ_INHIBREAD|         | READ    |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_LINK     | READSER | WRITE   | REQ_INODE_NR is locked READ             |
|              |         |         | REQ_DIR_INO is locked WRITE             |
+--------------+---------+---------+-----------------------------------------+
| REQ_LOOKUP   | READSER |         |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_MKDIR    | READSER | WRITE   |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_MKNOD    | READSER | WRITE   |                                         |
+--------------+---------+---------+-----------------------------------------+
|REQ_MOUNTPOINT| WRITE   | WRITE   |                                         |
+--------------+---------+---------+-----------------------------------------+
|REQ_NEW_DRIVER|         |         |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_NEWNODE  |         |         | Only sent to PFS                        |
+--------------+---------+---------+-----------------------------------------+
| REQ_PUTNODE  |         | READSER | READSER when dropping all but one       |
|              |         | or WRITE| reference. WRITE when the final         |
|              |         |         | reference is dropped (no longer in use) |
+--------------+---------+---------+-----------------------------------------+
| REQ_RDLINK   | READ    | READ    | In some circumstances stricter locking  |
|              |         |         | might be applied, but not guaranteed    |
+--------------+---------+---------+-----------------------------------------+
| REQ_READ     |         | READ    |                                         |
+--------------+---------+---------+-----------------------------------------+
|REQ_READSUPER | WRITE   |         |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_RENAME   | WRITE   | WRITE   |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_RMDIR    | WRITE   | WRITE   |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_SLINK    | READSER | READ    |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_STAT     | READ    | READ    | vmnt is only locked if file is not      |
|              |         |         | already opened                          |
+--------------+---------+---------+-----------------------------------------+
| REQ_STATVFS  | READ    | READ    | vmnt is only locked if file is not      |
|              |         |         | already opened                          |
+--------------+---------+---------+-----------------------------------------+
| REQ_SYNC     | READ    |         |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_UNLINK   | WRITE   | WRITE   |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_UNMOUNT  | WRITE   |         |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_UTIME    | READ    | READ    |                                         |
+--------------+---------+---------+-----------------------------------------+
| REQ_WRITE    |         | WRITE   |                                         |
------------------------------------------------------------------------------

Table 7: VFS-FS requests locking guarantees

== Recovery from driver crashes ==
## 5 Recovery from driver crashes
VFS can recover from block special file and character special file driver
crashes. It can recover to some degree from a crashed File Server (which we
can regard as a driver).

=== Recovery from block driver crashes ===
## 5.1 Recovery from block driver crashes
When reading or writing, VFS doesn't communicate with block drivers directly,
but always through a File Server (the root File Server by default). If the
block driver crashes, the File Server does most of the work of the recovery
procedure. VFS loops through all open files for block special files that
were handled by this driver and reopens them. After that it sends the new
endpoint to the File Server so it can finish the recovery procedure. Finally,
the File Server will retry pending requests if possible. However, reopening
files can cause the block driver to crash again. When that happens, VFS will
stop the recovery. A driver can return ERESTART to VFS to tell it to retry
a request. VFS does this with an arbitrary maximum of 5 attempts.

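
The retry policy could be sketched as follows. send_request() is a stub
standing in for the real VFS-driver communication, and the value of ERESTART
here is a placeholder, not MINIX's actual error code.

```c
#include <assert.h>

#define ERESTART     (-100)	/* placeholder value for this sketch */
#define MAX_ATTEMPTS 5		/* arbitrary maximum mentioned above */

/* Test knob: how many calls the stub answers with ERESTART before OK. */
static int attempts_needed;

static int send_request(void)
{
	return (--attempts_needed > 0) ? ERESTART : 0 /* OK */;
}

/* Reissue a request as long as the driver answers ERESTART,
 * giving up after MAX_ATTEMPTS attempts. */
static int do_request_with_retry(void)
{
	int r = ERESTART;
	for (int tries = 0; r == ERESTART && tries < MAX_ATTEMPTS; tries++)
		r = send_request();
	return r;
}
```
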
=== Recovery from character driver crashes ===
## 5.2 Recovery from character driver crashes
While VFS used to support minimal recovery from character driver crashes, the
added complexity has so far proven to outweigh the benefits, especially since
such crash recovery can never be fully transparent: it depends entirely on the
character device whether repeating an I/O request makes sense at all.
Currently, all operations except close(2) on a file descriptor that identifies
a device on a crashed character driver will result in an EIO error. It is up
to the application to reopen the character device and retry whatever it was
doing in the appropriate manner. In the future, automatic reopen and I/O
restart may be reintroduced for a limited subset of character drivers.

=== Recovery from File Server crashes ===
## 5.3 Recovery from File Server crashes
At the time of writing we cannot recover from crashed File Servers. When
VFS detects it has to clean up the remnants of a File Server process (i.e.,
through an exit(2)), it marks all associated file descriptors as invalid
and cancels ongoing and pending requests to that File Server. Resources that
were in use by the File Server are cleaned up.

[0] http://wiki.minix3.org/en/DevelopersGuide/VfsFsProtocol

[1] http://www.cs.vu.nl/~dcvmoole/minix/blockchar.txt

[2] http://www.minix3.org/theses/moolenbroek-multimedia-support.pdf