share/doc/papers/memfs/1.t

   1 .\"     $NetBSD: 1.t,v 1.2 1998/01/09 06:41:29 perry Exp $
   2 .\"
   3 .\" Copyright (c) 1990 The Regents of the University of California.
   4 .\" All rights reserved.
   5 .\"
   6 .\" Redistribution and use in source and binary forms, with or without
   7 .\" modification, are permitted provided that the following conditions
   8 .\" are met:
   9 .\" 1. Redistributions of source code must retain the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer.
  11 .\" 2. Redistributions in binary form must reproduce the above copyright
  12 .\"    notice, this list of conditions and the following disclaimer in the
  13 .\"    documentation and/or other materials provided with the distribution.
  14 .\" 3. Neither the name of the University nor the names of its contributors
  15 .\"    may be used to endorse or promote products derived from this software
  16 .\"    without specific prior written permission.
  17 .\"
  18 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  19 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  20 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  21 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  22 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  23 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  24 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  25 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  26 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  27 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  28 .\" SUCH DAMAGE.
  29 .\"
  30 .\"     @(#)1.t 5.1 (Berkeley) 4/16/91
  31 .\"
  32 .nr PS 11
  33 .nr VS 13
  34 .SH
  35 Introduction
  36 .PP
  37 This paper describes the motivation for and implementation of
  38 a memory-based filesystem.
  39 Memory-based filesystems have existed for a long time;
  40 they have generally been marketed as RAM disks or sometimes
  41 as software packages that use the machine's general purpose memory.
  42 .[
  43 white
  44 .]
  45 .PP
  46 A RAM disk is designed to appear like any other disk peripheral
  47 connected to a machine.
  48 It is normally interfaced to the processor through the I/O bus
  49 and is accessed through a device driver similar or sometimes identical
  50 to the device driver used for a normal magnetic disk.
  51 The device driver sends requests for blocks of data to the device
  52 and the requested data is then DMA'ed to or from the requested block.
  53 Instead of storing its data on a rotating magnetic disk,
  54 the RAM disk stores its data in a large array of random access memory
  55 or bubble memory.
  56 Thus, the latency of accessing the RAM disk is nearly zero
  57 compared to the 15-50 milliseconds of latency incurred when
  58 access rotating magnetic media.
  59 RAM disks also have the benefit of being able to transfer data at
  60 the maximum DMA rate of the system,
  61 while disks are typically limited by the rate that the data passes
  62 under the disk head.
  63 .PP
  64 Software packages simulating RAM disks operate by allocating
  65 a fixed partition of the system memory.
  66 The software then provides a device driver interface similar
  67 to the one described for hardware RAM disks,
  68 except that it uses memory-to-memory copy instead of DMA to move
  69 the data between the RAM disk and the system buffers,
  70 or it maps the contents of the RAM disk into the system buffers.
  71 Because the memory used by the RAM disk is not available for
  72 other purposes, software RAM-disk solutions are used primarily
  73 for machines with limited addressing capabilities such as PC's
  74 that do not have an effective way of using the extra memory anyway.
  75 .PP
  76 Most software RAM disks lose their contents when the system is powered
  77 down or rebooted.
  78 The contents can be saved by using battery backed-up memory,
  79 by storing critical filesystem data structures in the filesystem,
  80 and by running a consistency check program after each reboot.
  81 These conditions increase the hardware cost
  82 and potentially slow down the speed of the disk.
  83 Thus, RAM-disk filesystems are not typically
  84 designed to survive power failures;
  85 because of their volatility, their usefulness is limited to transient
  86 or easily recreated information such as might be found in
  87 .PN /tmp .
  88 Their primary benefit is that they have higher throughput
  89 than disk based filesystems.
  90 .[
  91 smith
  92 .]
  93 This improved throughput is particularly useful for utilities that
  94 make heavy use of temporary files, such as compilers.
  95 On fast processors, nearly half of the elapsed time for a compilation
  96 is spent waiting for synchronous operations required for file
  97 creation and deletion.
  98 The use of the memory-based filesystem nearly eliminates this waiting time.
  99 .PP
 100 Using dedicated memory to exclusively support a RAM disk
 101 is a poor use of resources.
 102 The overall throughput of the system can be improved
 103 by using the memory where it is getting the highest access rate.
 104 These needs may shift between supporting process virtual address spaces
 105 and caching frequently used disk blocks.
 106 If the memory is dedicated to the filesystem,
 107 it is better used in a buffer cache.
 108 The buffer cache permits faster access to the data
 109 because it requires only a single memory-to-memory copy
 110 from the kernel to the user process.
 111 The use of memory is used in a RAM-disk configuration may require two
 112 memory-to-memory copies, one from the RAM disk
 113 to the buffer cache,
 114 then another copy from the buffer cache to the user process.
 115 .PP
 116 The new work being presented in this paper is building a prototype
 117 RAM-disk filesystem in pageable memory instead of dedicated memory.
 118 The goal is to provide the speed benefits of a RAM disk
 119 without paying the performance penalty inherent in dedicating
 120 part of the physical memory on the machine to the RAM disk.
 121 By building the filesystem in pageable memory,
 122 it competes with other processes for the available memory.
 123 When memory runs short, the paging system pushes its
 124 least-recently-used pages to backing store.
 125 Being pageable also allows the filesystem to be much larger than
 126 would be practical if it were limited by the amount of physical
 127 memory that could be dedicated to that purpose.
 128 We typically operate our
 129 .PN /tmp
 130 with 30 to 60 megabytes of space
 131 which is larger than the amount of memory on the machine.
 132 This configuration allows small files to be accessed quickly,
 133 while still allowing
 134 .PN /tmp
 135 to be used for big files,
 136 although at a speed more typical of normal, disk-based filesystems.
 137 .PP
 138 An alternative to building a memory-based filesystem would be to have
 139 a filesystem that never did operations synchronously and never
 140 flushed its dirty buffers to disk.
 141 However, we believe that such a filesystem would either
 142 use a disproportionately large percentage of the buffer
 143 cache space, to the detriment of other filesystems,
 144 or would require the paging system to flush its dirty pages.
 145 Waiting for other filesystems to push dirty pages
 146 subjects them to delays while waiting for the pages to be written.
 147 We await the results of others trying this approach.
 148 .[
 149 Ohta
 150 .]
 151 .SH
 152 Implementation
 153 .PP
 154 The current implementation took less time to write than did this paper.
 155 It consists of 560 lines of kernel code (1.7K text + data)
 156 and some minor modifications to the program that builds
 157 disk based filesystems, \fInewfs\fP.
 158 A condensed version of the kernel code for the
 159 memory-based filesystem are reproduced in Appendix 1.
 160 .PP
 161 A filesystem is created by invoking the modified \fInewfs\fP, with
 162 an option telling it to create a memory-based filesystem.
 163 It allocates a section of virtual address space of the requested
 164 size and builds a filesystem in the memory
 165 instead of on a disk partition.
 166 When built, it does a \fImount\fP system call specifying a filesystem type of
 167 .SM MFS
 168 (Memory File System).
 169 The auxiliary data parameter to the mount call specifies a pointer
 170 to the base of the memory in which it has built the filesystem.
 171 (The auxiliary data parameter used by the local filesystem, \fIufs\fP,
 172 specifies the block device containing the filesystem.)
 173 .PP
 174 The mount system call allocates and initializes a mount table
 175 entry and then calls the filesystem-specific mount routine.
 176 The filesystem-specific routine is responsible for doing
 177 the mount and initializing the filesystem-specific
 178 portion of the mount table entry.
 179 The memory-based filesystem-specific mount routine,
 180 .RN mfs_mount ,
 181 is shown in Appendix 1.
 182 It allocates a block-device vnode to represent the memory disk device.
 183 In the private area of this vnode it stores the base address of
 184 the filesystem and the process identifier of the \fInewfs\fP process
 185 for later reference when doing I/O.
 186 It also initializes an I/O list that it
 187 uses to record outstanding I/O requests.
 188 It can then call the \fIufs\fP filesystem mount routine,
 189 passing the special block-device vnode that it has created
 190 instead of the usual disk block-device vnode.
 191 The mount proceeds just as any other local mount, except that
 192 requests to read from the block device are vectored through
 193 .RN mfs_strategy
 194 (described below) instead of the usual
 195 .RN spec_strategy
 196 block device I/O function.
 197 When the mount is completed,
 198 .RN mfs_mount
 199 does not return as most other filesystem mount functions do;
 200 instead it sleeps in the kernel awaiting I/O requests.
 201 Each time an I/O request is posted for the filesystem,
 202 a wakeup is issued for the corresponding \fInewfs\fP process.
 203 When awakened, the process checks for requests on its buffer list.
 204 A read request is serviced by copying data from the section of the
 205 \fInewfs\fP address space corresponding to the requested disk block
 206 to the kernel buffer.
 207 Similarly a write request is serviced by copying data to the section of the
 208 \fInewfs\fP address space corresponding to the requested disk block
 209 from the kernel buffer.
 210 When all the requests have been serviced, the \fInewfs\fP
 211 process returns to sleep to await more requests.
 212 .PP
 213 Once mounted,
 214 all operations on files in the memory-based filesystem are handled
 215 by the \fIufs\fP filesystem code until they get to the point where the
 216 filesystem needs to do I/O on the device.
 217 Here, the filesystem encounters the second piece of the
 218 memory-based filesystem.
 219 Instead of calling the special-device strategy routine,
 220 it calls the memory-based strategy routine,
 221 .RN mfs_strategy .
 222 Usually,
 223 the request is serviced by linking the buffer onto the
 224 I/O list for the memory-based filesystem
 225 vnode and sending a wakeup to the \fInewfs\fP process.
 226 This wakeup results in a context-switch to the \fInewfs\fP
 227 process, which does a copyin or copyout as described above.
 228 The strategy routine must be careful to check whether
 229 the I/O request is coming from the \fInewfs\fP process itself, however.
 230 Such requests happen during mount and unmount operations,
 231 when the kernel is reading and writing the superblock.
 232 Here,
 233 .RN mfs_strategy
 234 must do the I/O itself to avoid deadlock.
 235 .PP
 236 The final piece of kernel code to support the
 237 memory-based filesystem is the close routine.
 238 After the filesystem has been successfully unmounted,
 239 the device close routine is called.
 240 For a memory-based filesystem, the device close routine is
 241 .RN mfs_close .
 242 This routine flushes any pending I/O requests,
 243 then sets the I/O list head to a special value
 244 that is recognized by the I/O servicing loop in
 245 .RN mfs_mount
 246 as an indication that the filesystem is unmounted.
 247 The
 248 .RN mfs_mount
 249 routine exits, in turn causing the \fInewfs\fP process
 250 to exit, resulting in the filesystem vanishing in a cloud of dirty pages.
 251 .PP
 252 The paging of the filesystem does not require any additional
 253 code beyond that already in the kernel to support virtual memory.
 254 The \fInewfs\fP process competes with other processes on an equal basis
 255 for the machine's available memory.
 256 Data pages of the filesystem that have not yet been used
 257 are zero-fill-on-demand pages that do not occupy memory,
 258 although they currently allocate space in backing store.
 259 As long as memory is plentiful, the entire contents of the filesystem
 260 remain memory resident.
 261 When memory runs short, the oldest pages of \fInewfs\fP will be
 262 pushed to backing store as part of the normal paging activity.
 263 The pages that are pushed usually hold the contents of
 264 files that have been created in the memory-based filesystem
 265 but have not been recently accessed (or have been deleted).
 266 .[
 267 leffler
 268 .]
 269 .SH
 270 Performance
 271 .PP
 272 The performance of the current memory-based filesystem is determined by
 273 the memory-to-memory copy speed of the processor.
 274 Empirically we find that the throughput is about 45% of this
 275 memory-to-memory copy speed.
 276 The basic set of steps for each block written is:
 277 .IP 1)
 278 memory-to-memory copy from the user process doing the write to a kernel buffer
 279 .IP 2)
 280 context-switch to the \fInewfs\fP process
 281 .IP 3)
 282 memory-to-memory copy from the kernel buffer to the \fInewfs\fP address space
 283 .IP 4)
 284 context switch back to the writing process
 285 .LP
 286 Thus each write requires at least two memory-to-memory copies
 287 accounting for about 90% of the
 288 .SM CPU
 289 time.
 290 The remaining 10% is consumed in the context switches and
 291 the filesystem allocation and block location code.
 292 The actual context switch count is really only about half
 293 of the worst case outlined above because
 294 read-ahead and write-behind allow multiple blocks
 295 to be handled with each context switch.
 296 .PP
 297 On the six-\c
 298 .SM "MIPS CCI"
 299 Power 6/32 machine,
 300 the raw reading and writing speed is only about twice that of
 301 a regular disk-based filesystem.
 302 However, for processes that create and delete many files,
 303 the speedup is considerably greater.
 304 The reason for the speedup is that the filesystem
 305 must do two synchronous operations to create a file,
 306 first writing the allocated inode to disk, then creating the
 307 directory entry.
 308 Deleting a file similarly requires at least two synchronous
 309 operations.
 310 Here, the low latency of the memory-based filesystem is
 311 noticeable compared to the disk-based filesystem,
 312 as a synchronous operation can be done with
 313 just two context switches instead of incurring the disk latency.
 314 .SH
 315 Future Work
 316 .PP
 317 The most obvious shortcoming of the current implementation
 318 is that filesystem blocks are copied twice, once between the \fInewfs\fP
 319 process' address space and the kernel buffer cache,
 320 and once between the kernel buffer and the requesting process.
 321 These copies are done in different process contexts, necessitating
 322 two context switches per group of I/O requests.
 323 These problems arise because of the current inability of the kernel
 324 to do page-in operations
 325 for an address space other than that of the currently-running process,
 326 and the current inconvenience of mapping process-owned pages into the kernel
 327 buffer cache.
 328 Both of these problems are expected to be solved in the next version
 329 of the virtual memory system,
 330 and thus we chose not to address them in the current implementation.
 331 With the new version of the virtual memory system, we expect to use
 332 any part of physical memory as part of the buffer cache,
 333 even though it will not be entirely addressable at once within the kernel.
 334 In that system, the implementation of a memory-based filesystem
 335 that avoids the double copy and context switches will be much easier.
 336 .PP
 337 Ideally part of the kernel's address space would reside in pageable memory.
 338 Once such a facility is available it would be most efficient to
 339 build a memory-based filesystem within the kernel.
 340 One potential problem with such a scheme is that many kernels
 341 are limited to a small address space (usually a few megabytes).
 342 This restriction limits the size of memory-based
 343 filesystem that such a machine can support.
 344 On such a machine, the kernel can describe a memory-based filesystem
 345 that is larger than its address space and use a ``window''
 346 to map the larger filesystem address space into its limited address space.
 347 The window would maintain a cache of recently accessed pages.
 348 The problem with this scheme is that if the working set of
 349 active pages is greater than the size of the window, then
 350 much time is spent remapping pages and invalidating
 351 translation buffers.
 352 Alternatively, a separate address space could be constructed for each
 353 memory-based filesystem as in the current implementation,
 354 and the memory-resident pages of that address space could be mapped
 355 exactly as other cached pages are accessed.
 356 .PP
 357 The current system uses the existing local filesystem structures
 358 and code to implement the memory-based filesystem.
 359 The major advantages of this approach are the sharing of code
 360 and the simplicity of the approach.
 361 There are several disadvantages, however.
 362 One is that the size of the filesystem is fixed at mount time.
 363 This means that a fixed number of inodes (files) and data blocks
 364 can be supported.
 365 Currently, this approach requires enough swap space for the entire
 366 filesystem, and prevents expansion and contraction of the filesystem on demand.
 367 The current design also prevents the filesystem from taking advantage
 368 of the memory-resident character of the filesystem.
 369 It would be interesting to explore other filesystem implementations
 370 that would be less expensive to execute and that would make better
 371 use of the space.
 372 For example, the current filesystem structure is optimized for magnetic
 373 disks.
 374 It includes replicated control structures, ``cylinder groups''
 375 with separate allocation maps and control structures,
 376 and data structures that optimize rotational layout of files.
 377 None of this is useful in a memory-based filesystem (at least when the
 378 backing store for the filesystem is dynamically allocated and not
 379 contiguous on a single disk type).
 380 On the other hand,
 381 directories could be implemented using dynamically-allocated
 382 memory organized as linked lists or trees rather than as files stored
 383 in ``disk'' blocks.
 384 Allocation and location of pages for file data might use virtual memory
 385 primitives and data structures rather than direct and indirect blocks.
 386 A reimplementation along these lines will be considered when the virtual
 387 memory system in the current system has been replaced.
 388 .[
 389 $LIST$
 390 .]