   1 .\"     $NetBSD: 4.t,v 1.2 1998/01/09 06:55:32 perry Exp $
   2 .\"
   3 .\" Copyright (c) 1986, 1993
   4 .\"     The Regents of the University of California.  All rights reserved.
   5 .\"
   6 .\" Redistribution and use in source and binary forms, with or without
   7 .\" modification, are permitted provided that the following conditions
   8 .\" are met:
   9 .\" 1. Redistributions of source code must retain the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer.
  11 .\" 2. Redistributions in binary form must reproduce the above copyright
  12 .\"    notice, this list of conditions and the following disclaimer in the
  13 .\"    documentation and/or other materials provided with the distribution.
  14 .\" 3. Neither the name of the University nor the names of its contributors
  15 .\"    may be used to endorse or promote products derived from this software
  16 .\"    without specific prior written permission.
  17 .\"
  18 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  19 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  20 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  21 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  22 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  23 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  24 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  25 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  26 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  27 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  28 .\" SUCH DAMAGE.
  29 .\"
  30 .\"     @(#)4.t 8.1 (Berkeley) 6/8/93
  31 .\"
  32 .ds RH Performance
  33 .NH
  34 Performance
  35 .PP
  36 Ultimately, the proof of the effectiveness of the
  37 algorithms described in the previous section
  38 is the long term performance of the new file system.
  39 .PP
  40 Our empirical studies have shown that the inode layout policy has
  41 been effective.
  42 When running the ``list directory'' command on a large directory
  43 that itself contains many directories (to force the system
  44 to access inodes in multiple cylinder groups),
  45 the number of disk accesses for inodes is cut by a factor of two.
  46 The improvements are even more dramatic for large directories
  47 containing only files,
  48 disk accesses for inodes being cut by a factor of eight.
  49 This is most encouraging for programs such as spooling daemons that
  50 access many small files,
  51 since these programs tend to flood the
  52 disk request queue on the old file system.
  53 .PP
  54 Table 2 summarizes the measured throughput of the new file system.
  55 Several comments need to be made about the conditions under which these
  56 tests were run.
  57 The test programs measure the rate at which user programs can transfer
  58 data to or from a file without performing any processing on it.
  59 These programs must read and write enough data to
  60 ensure that buffering in the
  61 operating system does not affect the results.
  62 They are also run at least three times in succession;
  63 the first run gets the system into a known state
  64 and the next two ensure that the
  65 experiment has stabilized and is repeatable.
  66 The tests used and their results are
  67 discussed in detail in [Kridle83]\(dg.
  68 .FS
  69 \(dg A UNIX command that is similar to the reading test that we used is
  70 ``cp file /dev/null'', where ``file'' is eight megabytes long.
  71 .FE
  72 The systems were running multi-user but were otherwise quiescent.
  73 There was no contention for either the CPU or the disk arm.
  74 The only difference between the UNIBUS and MASSBUS tests
  75 was the controller.
  76 All tests used an AMPEX Capricorn 330 megabyte Winchester disk.
  77 As Table 2 shows, all file system test runs were on a VAX 11/750.
  78 All file systems had been in production use for at least
  79 a month before being measured.
  80 The same number of system calls was performed in all tests;
  81 the basic system call overhead was a negligible portion of
  82 the total running time of the tests.
  83 .KF
  84 .DS B
  85 .TS
  86 box;
  87 c c|c s s
  88 c c|c c c.
  89 Type of Processor and   Read
  90 File System     Bus Measured    Speed   Bandwidth       % CPU
  91 _
  92 old 1024        750/UNIBUS      29 Kbytes/sec   29/983 3%       11%
  93 new 4096/1024   750/UNIBUS      221 Kbytes/sec  221/983 22%     43%
  94 new 8192/1024   750/UNIBUS      233 Kbytes/sec  233/983 24%     29%
  95 new 4096/1024   750/MASSBUS     466 Kbytes/sec  466/983 47%     73%
  96 new 8192/1024   750/MASSBUS     466 Kbytes/sec  466/983 47%     54%
  97 .TE
  98 .ce 1
  99 Table 2a \- Reading rates of the old and new UNIX file systems.
 100 .TS
 101 box;
 102 c c|c s s
 103 c c|c c c.
 104 Type of Processor and   Write
 105 File System     Bus Measured    Speed   Bandwidth       % CPU
 106 _
 107 old 1024        750/UNIBUS      48 Kbytes/sec   48/983 5%       29%
 108 new 4096/1024   750/UNIBUS      142 Kbytes/sec  142/983 14%     43%
 109 new 8192/1024   750/UNIBUS      215 Kbytes/sec  215/983 22%     46%
 110 new 4096/1024   750/MASSBUS     323 Kbytes/sec  323/983 33%     94%
 111 new 8192/1024   750/MASSBUS     466 Kbytes/sec  466/983 47%     95%
 112 .TE
 113 .ce 1
 114 Table 2b \- Writing rates of the old and new UNIX file systems.
 115 .DE
 116 .KE
 117 .PP
 118 Unlike the old file system,
 119 the transfer rates for the new file system do not
 120 appear to change over time.
 121 The throughput rate is tied much more strongly to the
 122 amount of free space that is maintained.
 123 The measurements in Table 2 were based on a file system
 124 with a 10% free space reserve.
 125 Synthetic work loads suggest that throughput deteriorates
 126 to about half the rates given in Table 2 when the file
 127 systems are full.
 128 .PP
 129 The percentage of bandwidth given in Table 2 is a measure
 130 of the effective utilization of the disk by the file system.
 131 An upper bound on the transfer rate from the disk is calculated
 132 by multiplying the number of bytes on a track by the number
 133 of revolutions of the disk per second.
 134 The bandwidth is calculated by comparing the data rates
 135 the file system is able to achieve as a percentage of this rate.
 136 Using this metric, the old file system is only
 137 able to use about 3\-5% of the disk bandwidth,
 138 while the new file system uses up to 47%
 139 of the bandwidth.
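.PP
As a rough check on the bandwidth column, the calculation described
above can be sketched in Python.
The 983 Kbytes/sec upper bound is taken from the denominators in Table 2;
the function name and the data layout are illustrative assumptions,
not part of the original test programs.

```python
# Upper bound on the disk transfer rate in Kbytes/sec, taken from the
# "x/983" entries in Table 2 (bytes per track times revolutions per second).
MAX_KBYTES_PER_SEC = 983

# Measured rates from Table 2 in Kbytes/sec:
# (file system, bus) -> (read rate, write rate).
MEASURED = {
    ("old 1024", "UNIBUS"): (29, 48),
    ("new 4096/1024", "UNIBUS"): (221, 142),
    ("new 8192/1024", "UNIBUS"): (233, 215),
    ("new 4096/1024", "MASSBUS"): (466, 323),
    ("new 8192/1024", "MASSBUS"): (466, 466),
}


def bandwidth_percent(rate, bound=MAX_KBYTES_PER_SEC):
    """Return a transfer rate as a whole-number percentage of the bound."""
    return round(100 * rate / bound)


if __name__ == "__main__":
    for (fs, bus), (read, write) in MEASURED.items():
        print(f"{fs:14} {bus:8} read {bandwidth_percent(read):2d}%"
              f"  write {bandwidth_percent(write):2d}%")
```

Rounding each ratio to the nearest whole percent reproduces the
bandwidth figures listed in Tables 2a and 2b.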
 140 .PP
 141 Both reads and writes are faster in the new system than in the old system.
 142 The biggest factor in this speedup is the larger
 143 block size used by the new file system.
 144 The overhead of allocating blocks in the new system is greater
 145 than the overhead of allocating blocks in the old system;
 146 however, fewer blocks need to be allocated in the new system
 147 because they are bigger.
 148 The net effect is that the cost per byte allocated is about
 149 the same for both systems.
 150 .PP
 151 In the new file system, the reading rate is always at least
 152 as fast as the writing rate.
 153 This is to be expected since the kernel must do more work when
 154 allocating blocks than when simply reading them.
 155 Note that the write rates are about the same
 156 as the read rates in the 8192 byte block file system;
 157 the write rates are slower than the read rates in the 4096 byte block
 158 file system.
 159 The slower write rates occur because
 160 the kernel has to do twice as many disk allocations per second,
 161 making the processor unable to keep up with the disk transfer rate.
 162 .PP
 163 In contrast, the old file system is about 50%
 164 faster at writing files than reading them.
 165 This is because the write system call is asynchronous and
 166 the kernel can generate disk transfer
 167 requests much faster than they can be serviced,
 168 hence disk transfers queue up in the disk buffer cache.
 169 Because the disk buffer cache is sorted by minimum seek distance,
 170 the average seek between the scheduled disk writes is much
 171 less than it would be if the data blocks were written out
 172 in the random disk order in which they are generated.
 173 However, when the file is read,
 174 the read system call is processed synchronously so
 175 the disk blocks must be retrieved from the disk in the
 176 non-optimal seek order in which they are requested.
 177 This forces the disk scheduler to do long
 178 seeks resulting in a lower throughput rate.
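.PP
The effect of seek ordering can be illustrated with a small Python sketch.
The cylinder numbers below are hypothetical,
and a single sorted sweep stands in for the kernel's minimum seek
distance ordering; it is a simplification, not the actual driver algorithm.

```python
def total_seek_distance(cylinders, start=0):
    """Sum of head movements, in cylinders, when servicing requests in order."""
    distance = 0
    position = start
    for cyl in cylinders:
        distance += abs(cyl - position)
        position = cyl
    return distance


# Hypothetical cylinder numbers, in the order the requests are generated.
arrival_order = [90, 10, 70, 30, 50]

# Asynchronous writes queue up, so they can be serviced in sorted order
# (one sweep across the disk). Synchronous reads are issued one at a
# time and must be serviced in arrival order.
sorted_order = sorted(arrival_order)

if __name__ == "__main__":
    print("arrival order:", total_seek_distance(arrival_order))
    print("sorted order: ", total_seek_distance(sorted_order))
```

With these made-up requests the sorted queue moves the head 90 cylinders
in total, against 290 cylinders when the same requests are serviced in
the order they arrive.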
 179 .PP
 180 In the new system the blocks of a file are more optimally
 181 ordered on the disk.
 182 Even though reads are still synchronous,
 183 the requests are presented to the disk in a much better order.
 184 Even though the writes are still asynchronous,
 185 they are already presented to the disk in minimum seek
 186 order so there is no gain to be had by reordering them.
 187 Hence the disk seek latencies that limited the old file system
 188 have little effect in the new file system.
 189 The cost of allocation is the factor in the new system that
 190 causes writes to be slower than reads.
 191 .PP
 192 The performance of the new file system is currently
 193 limited by memory to memory copy operations
 194 required to move data from disk buffers in the
 195 system's address space to data buffers in the user's
 196 address space.  These copy operations account for
 197 about 40% of the time spent performing an input/output operation.
 198 If the buffers in both address spaces were properly aligned,
 199 this transfer could be performed without copying by
 200 using the VAX virtual memory management hardware.
 201 This would be especially desirable when transferring
 202 large amounts of data.
 203 We did not implement this because it would change the
 204 user interface to the file system in two major ways:
 205 user programs would be required to allocate buffers on page boundaries,
 206 and data would disappear from buffers after being written.
 207 .PP
 208 Greater disk throughput could be achieved by rewriting the disk drivers
 209 to chain together kernel buffers.
 210 This would allow contiguous disk blocks to be read
 211 in a single disk transaction.
 212 Many disks used with UNIX systems contain either
 213 32 or 48 sectors of 512 bytes per track.
 214 Each track holds exactly two or three 8192 byte file system blocks,
 215 or four or six 4096 byte file system blocks.
 216 The inability to use contiguous disk blocks
 217 effectively limits the performance
 218 on these disks to less than 50% of the available bandwidth.
 219 If the next block for a file cannot be laid out contiguously,
 220 then the minimum spacing to the next allocatable
 221 block on any platter is between a sixth and a half of a revolution.
 222 The implication of this is that the best possible layout without
 223 contiguous blocks uses only half of the bandwidth of any given track.
 224 If each track contains an odd number of sectors,
 225 then it is possible to resolve the rotational delay to any number of sectors
 226 by finding a block that begins at the desired
 227 rotational position on another track.
 228 Block chaining has not been implemented because it
 229 would require rewriting all the disk drivers in the system,
 230 and the current throughput rates are already limited by the
 231 speed of the available processors.
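.PP
The track geometry arithmetic above can be checked with a short Python sketch.
It assumes that the minimum gap to the next allocatable block equals the
time for one block to pass under the head, which is one reading of the
``between a sixth and a half of a revolution'' figure;
the function names are illustrative.

```python
from fractions import Fraction

SECTOR_BYTES = 512  # sector size given in the text


def blocks_per_track(sectors_per_track, block_bytes):
    """Number of whole file system blocks that fit on one track."""
    track_bytes = sectors_per_track * SECTOR_BYTES
    if track_bytes % block_bytes != 0:
        raise ValueError("track does not hold a whole number of blocks")
    return track_bytes // block_bytes


def min_gap_revolutions(sectors_per_track, block_bytes):
    """Fraction of a revolution taken up by one block.

    Assumption: when the next block cannot be contiguous, the closest
    allocatable block is one block further around the track.
    """
    return Fraction(1, blocks_per_track(sectors_per_track, block_bytes))


if __name__ == "__main__":
    for sectors in (32, 48):
        for block in (8192, 4096):
            print(sectors, "sectors,", block, "byte blocks:",
                  blocks_per_track(sectors, block), "blocks per track, gap",
                  min_gap_revolutions(sectors, block), "revolution")
```

Under this assumption the gap ranges from 1/6 of a revolution
(4096 byte blocks on a 48 sector track) to 1/2 of a revolution
(8192 byte blocks on a 32 sector track).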
 232 .PP
 233 Currently only one block is allocated to a file at a time.
 234 A technique used by the DEMOS file system,
 235 when it finds that a file is growing rapidly,
 236 is to preallocate several blocks at once,
 237 releasing them when the file is closed if they remain unused.
 238 By batching up allocations, the system can reduce the
 239 overhead of allocating at each write,
 240 and it can cut down on the number of disk writes needed to
 241 keep the block pointers on the disk
 242 synchronized with the block allocation [Powell79].
 243 This technique was not included because block allocation
 244 currently accounts for less than 10% of the time spent in
 245 a write system call and, once again, the
 246 current throughput rates are already limited by the speed
 247 of the available processors.
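.PP
The saving from batched allocation can be sketched as follows.
This is a hypothetical model, not the DEMOS policy itself:
the batch size and the function name are assumptions made for illustration,
and the model only counts allocator calls and the blocks released at close.

```python
import math


def allocation_calls(blocks_written, batch_size):
    """Return (allocator calls, blocks released at close).

    Hypothetical model: each call to the allocator reserves batch_size
    blocks, and any reserved blocks still unused are released when the
    file is closed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    calls = math.ceil(blocks_written / batch_size)
    released = calls * batch_size - blocks_written
    return calls, released


if __name__ == "__main__":
    # One block at a time, as in the current system.
    print(allocation_calls(10, 1))
    # Preallocating four blocks at a time.
    print(allocation_calls(10, 4))
```

For a file of ten blocks, allocating one block at a time takes ten
allocator calls, while a batch size of four takes three calls and
releases two unused blocks at close.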
 248 .ds RH Functional enhancements
 249 .sp 2
 250 .ne 1i