share/doc/papers/kerntune/3.t

   1 .\"     $NetBSD: 3.t,v 1.2 1998/01/09 06:41:20 perry Exp $
   2 .\"
   3 .\" Copyright (c) 1984 The Regents of the University of California.
   4 .\" All rights reserved.
   5 .\"
   6 .\" Redistribution and use in source and binary forms, with or without
   7 .\" modification, are permitted provided that the following conditions
   8 .\" are met:
   9 .\" 1. Redistributions of source code must retain the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer.
  11 .\" 2. Redistributions in binary form must reproduce the above copyright
  12 .\"    notice, this list of conditions and the following disclaimer in the
  13 .\"    documentation and/or other materials provided with the distribution.
  14 .\" 3. Neither the name of the University nor the names of its contributors
  15 .\"    may be used to endorse or promote products derived from this software
  16 .\"    without specific prior written permission.
  17 .\"
  18 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  19 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  20 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  21 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  22 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  23 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  24 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  25 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  26 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  27 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  28 .\" SUCH DAMAGE.
  29 .\"
  30 .\" Copyright (c) 1984 M. K. McKusick
  31 .\"
  32 .\" Redistribution and use in source and binary forms, with or without
  33 .\" modification, are permitted provided that the following conditions
  34 .\" are met:
  35 .\" 1. Redistributions of source code must retain the above copyright
  36 .\"    notice, this list of conditions and the following disclaimer.
  37 .\" 2. Redistributions in binary form must reproduce the above copyright
  38 .\"    notice, this list of conditions and the following disclaimer in the
  39 .\"    documentation and/or other materials provided with the distribution.
  40 .\" 3. All advertising materials mentioning features or use of this software
  41 .\"    must display the following acknowledgement:
  42 .\"     This product includes software developed by the University of
  43 .\"     California, Berkeley and its contributors.
  44 .\" 4. Neither the name of the University nor the names of its contributors
  45 .\"    may be used to endorse or promote products derived from this software
  46 .\"    without specific prior written permission.
  47 .\"
  48 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  49 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  50 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  51 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  52 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  53 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  54 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  55 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  56 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  57 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  58 .\" SUCH DAMAGE.
  59 .\"
  60 .\"     @(#)3.t 1.2 (Berkeley) 11/8/90
  61 .\"
  62 .ds RH Techniques for Improving Performance
  63 .NH 1
  64 Techniques for Improving Performance
  65 .PP
  66 This section gives several hints on general optimization techniques.
  67 It then proceeds with an example of how they can be
  68 applied to the 4.2BSD kernel to improve its performance.
  69 .NH 2
  70 Using the Profiler
  71 .PP
  72 The profiler is a useful tool for improving
  73 a set of routines that implement an abstraction.
  74 It can be helpful in identifying poorly coded routines,
  75 and in evaluating the new algorithms and code that replace them.
  76 Taking full advantage of the profiler
  77 requires a careful examination of the call graph profile,
  78 and a thorough knowledge of the abstractions underlying
  79 the kernel.
  80 .PP
  81 The easiest optimization that can be performed
  82 is a small change
  83 to a control construct or data structure.
  84 An obvious starting point
  85 is to expand a small frequently called routine inline.
  86 The drawback to inline expansion is that the data abstractions
  87 in the kernel may become less parameterized,
  88 hence less clearly defined.
  89 The profiling will also become less useful since the loss of
  90 routines will make its output more granular.
  91 .PP
  92 Further potential for optimization lies in routines that
  93 implement data abstractions whose total execution
  94 time is long.
  95 If the data abstraction function cannot easily be speeded up,
  96 it may be advantageous to cache its results,
  97 and eliminate the need to rerun
  98 it for identical inputs.
  99 These and other ideas for program improvement are discussed in
 100 [Bentley81].
 101 .PP
 102 This tool is best used in an iterative approach:
 103 profiling the kernel,
 104 eliminating one bottleneck,
 105 then finding some other part of the kernel
 106 that begins to dominate execution time.
 107 .PP
 108 A completely different use of the profiler is to analyze the control
 109 flow of an unfamiliar section of the kernel.
 110 By running an example that exercises the unfamiliar section of the kernel,
 111 and then using \fIgprof\fR, you can get a view of the
 112 control structure of the unfamiliar section.
 113 .NH 2
 114 An Example of Tuning
 115 .PP
 116 The first step is to come up with a method for generating
 117 profile data.
 118 We prefer to run a profiling system for about a one day
 119 period on one of our general timesharing machines.
 120 While this is not as reproducible as a synthetic workload,
 121 it certainly represents a realistic test.
 122 We have run one day profiles on several
 123 occasions over a three month period.
 124 Despite the long period of time that elapsed
 125 between the test runs the shape of the profiles,
 126 as measured by the number of times each system call
 127 entry point was called, were remarkably similar.
 128 .PP
 129 A second alternative is to write a small benchmark
 130 program to repeated exercise a suspected bottleneck.
 131 While these benchmarks are not useful as a long term profile
 132 they can give quick feedback on whether a hypothesized
 133 improvement is really having an effect.
 134 It is important to realize that the only real assurance
 135 that a change has a beneficial effect is through
 136 long term measurements of general timesharing.
 137 We have numerous examples where a benchmark program
 138 suggests vast improvements while the change
 139 in the long term system performance is negligible,
 140 and conversely examples in which the benchmark program run more slowly,
 141 but the long term system performance improves significantly.
 142 .PP
 143 An investigation of our long term profiling showed that
 144 the single most expensive function performed by the kernel
 145 is path name translation.
 146 We find that our general time sharing systems do about
 147 500,000 name translations per day.
 148 The cost of doing name translation in the original 4.2BSD
 149 is 24.2 milliseconds,
 150 representing 40% of the time processing system calls,
 151 which is 19% of the total cycles in the kernel,
 152 or 11% of all cycles executed on the machine.
 153 The times are shown in Figure 3.
 154 .KF
 155 .DS L
 156 .TS
 157 center box;
 158 l r r.
 159 part    time    % of kernel
 160 _
 161 self    14.3 ms/call    11.3%
 162 child   9.9 ms/call     7.9%
 163 _
 164 total   24.2 ms/call    19.2%
 165 .TE
 166 .ce
 167 Figure 3. Call times for \fInamei\fP.
 168 .DE
 169 .KE
 170 .PP
 171 The system measurements collected showed the
 172 pathname translation routine, \fInamei\fP,
 173 was clearly worth optimizing.
 174 An inspection of \fInamei\fP shows that
 175 it consists of two nested loops.
 176 The outer loop is traversed once per pathname component.
 177 The inner loop performs a linear search through a directory looking
 178 for a particular pathname component.
 179 .PP
 180 Our first idea was to observe that many programs
 181 step through a directory performing an operation on
 182 each entry in turn.
 183 This caused us to modify \fInamei\fP to cache
 184 the directory offset of the last pathname
 185 component looked up by a process.
 186 The cached offset is then used
 187 as the point at which a search in the same directory
 188 begins.  Changing directories invalidates the cache, as
 189 does modifying the directory.
 190 For programs that step sequentially through a directory with
 191 $N$ files, search time decreases from $O ( N sup 2 )$
 192 to $O(N)$.
 193 .PP
 194 The cost of the cache is about 20 lines of code
 195 (about 0.2 kilobytes)
 196 and 16 bytes per process, with the cached data
 197 stored in a process's \fIuser\fP vector.
 198 .PP
 199 As a quick benchmark to verify the effectiveness of the
 200 cache we ran ``ls \-l''
 201 on a directory containing 600 files.
 202 Before the per-process cache this command
 203 used 22.3 seconds of system time.
 204 After adding the cache the program used the same amount
 205 of user time, but the system time dropped to 3.3 seconds.
 206 .PP
 207 This change prompted our rerunning a profiled system
 208 on a machine containing the new \fInamei\fP.
 209 The results showed that the time in \fInamei\fP
 210 dropped by only 2.6 ms/call and
 211 still accounted for 36% of the system call time,
 212 18% of the kernel, or about 10% of all the machine cycles.
 213 This amounted to a drop in system time from 57% to about 55%.
 214 The results are shown in Figure 4.
 215 .KF
 216 .DS L
 217 .TS
 218 center box;
 219 l r r.
 220 part    time    % of kernel
 221 _
 222 self    11.0 ms/call    9.2%
 223 child   10.6 ms/call    8.9%
 224 _
 225 total   21.6 ms/call    18.1%
 226 .TE
 227 .ce
 228 Figure 4. Call times for \fInamei\fP with per-process cache.
 229 .DE
 230 .KE
 231 .PP
 232 The small performance improvement
 233 was caused by a low cache hit ratio.
 234 Although the cache was 90% effective when hit,
 235 it was only usable on about 25% of the names being translated.
 236 An additional reason for the small improvement was that
 237 although the amount of time spent in \fInamei\fP itself
 238 decreased substantially,
 239 more time was spent in the routines that it called
 240 since each directory had to be accessed twice;
 241 once to search from the middle to the end,
 242 and once to search from the beginning to the middle.
 243 .PP
 244 Most missed names were caused by path name components
 245 other than the last.
 246 Thus Robert Elz introduced a system wide cache of most recent
 247 name translations.
 248 The cache is keyed on a name and the
 249 inode and device number of the directory that contains it.
 250 Associated with each entry is a pointer to the corresponding
 251 entry in the inode table.
 252 This has the effect of short circuiting the outer loop of \fInamei\fP.
 253 For each path name component,
 254 \fInamei\fP first looks in its cache of recent translations
 255 for the needed name.
 256 If it exists, the directory search can be completely eliminated.
 257 If the name is not recognized,
 258 then the per-process cache may still be useful in
 259 reducing the directory search time.
 260 The two cacheing schemes complement each other well.
 261 .PP
 262 The cost of the name cache is about 200 lines of code
 263 (about 1.2 kilobytes)
 264 and 44 bytes per cache entry.
 265 Depending on the size of the system,
 266 about 200 to 1000 entries will normally be configured,
 267 using 10-44 kilobytes of physical memory.
 268 The name cache is resident in memory at all times.
 269 .PP
 270 After adding the system wide name cache we reran ``ls \-l''
 271 on the same directory.
 272 The user time remained the same,
 273 however the system time rose slightly to 3.7 seconds.
 274 This was not surprising as \fInamei\fP
 275 now had to maintain the cache,
 276 but was never able to make any use of it.
 277 .PP
 278 Another profiled system was created and measurements
 279 were collected over a one day period.  These measurements
 280 showed a 6 ms/call decrease in \fInamei\fP, with
 281 \fInamei\fP accounting for only 31% of the system call time,
 282 16% of the time in the kernel,
 283 or about 7% of all the machine cycles.
 284 System time dropped from 55% to about 49%.
 285 The results are shown in Figure 5.
 286 .KF
 287 .DS L
 288 .TS
 289 center box;
 290 l r r.
 291 part    time    % of kernel
 292 _
 293 self    9.5 ms/call     9.6%
 294 child   6.1 ms/call     6.1%
 295 _
 296 total   15.6 ms/call    15.7%
 297 .TE
 298 .ce
 299 Figure 5.  Call times for \fInamei\fP with both caches.
 300 .DE
 301 .KE
 302 .PP
 303 Statistics on the performance of both caches show
 304 the large performance improvement is
 305 caused by the high hit ratio.
 306 On the profiled system a 60% hit rate was observed in
 307 the system wide cache.  This, coupled with the 25%
 308 hit rate in the per-process offset cache yielded an
 309 effective cache hit rate of 85%.
 310 While the system wide cache reduces both the amount of time in
 311 the routines that \fInamei\fP calls as well as \fInamei\fP itself
 312 (since fewer directories need to be accessed or searched),
 313 it is interesting to note that the actual percentage of system
 314 time spent in \fInamei\fP itself increases even though the
 315 actual time per call decreases.
 316 This is because less total time is being spent in the kernel,
 317 hence a smaller absolute time becomes a larger total percentage.