1 .\" $NetBSD: kernmalloc.t,v 1.4 2003/02/04 23:47:52 perry Exp $
3 .\" Copyright (c) 1988 The Regents of the University of California.
4 .\" All rights reserved.
6 .\" Redistribution and use in source and binary forms, with or without
7 .\" modification, are permitted provided that the following conditions
9 .\" 1. Redistributions of source code must retain the above copyright
10 .\" notice, this list of conditions and the following disclaimer.
11 .\" 2. Redistributions in binary form must reproduce the above copyright
12 .\" notice, this list of conditions and the following disclaimer in the
13 .\" documentation and/or other materials provided with the distribution.
14 .\" 3. Neither the name of the University nor the names of its contributors
15 .\" may be used to endorse or promote products derived from this software
16 .\" without specific prior written permission.
18 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
19 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
20 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
21 .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
22 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
23 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
24 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
25 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
26 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
27 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30 .\" @(#)kernmalloc.t 5.1 (Berkeley) 4/16/91
32 .\" reference a system routine name
34 \fI\\$1\fP\^(\h'1m/24u')\\$2
36 .\" reference a header name
55 .\" cheat: original indent is stored in \n(OI by .DS B; restore it
56 .\" then center legend after .DE rereads and centers the block.
76 \fIProceedings of the San Francisco USENIX Conference\fP,
77 pp. 295-303, June 1988.
83 Design of a General Purpose Memory Allocator for the 4.3BSD UNIX\(dg Kernel
84 .ds LF Summer USENIX '88
86 .ds RF San Francisco, June 20-24
87 .EH 'Design of a General Purpose Memory ...''McKusick, Karels'
88 .OH 'McKusick, Karels''Design of a General Purpose Memory ...'
90 \(dgUNIX is a registered trademark of AT&T in the US and other countries.
93 Marshall Kirk McKusick
97 Computer Systems Research Group
98 Computer Science Division
99 Department of Electrical Engineering and Computer Science
100 University of California, Berkeley
101 Berkeley, California 94720
103 The 4.3BSD UNIX kernel uses many memory allocation mechanisms,
each designed for the particular needs of the subsystem that uses it.
105 This paper describes a general purpose dynamic memory allocator
106 that can be used by all of the kernel subsystems.
107 The design of this allocator takes advantage of known memory usage
108 patterns in the UNIX kernel and a hybrid strategy that is time-efficient
109 for small allocations and space-efficient for large allocations.
110 This allocator replaces the multiple memory allocation interfaces
111 with a single easy-to-program interface,
112 results in more efficient use of global memory by eliminating
113 partitioned and specialized memory pools,
114 and is quick enough that no performance loss is observed
115 relative to the current implementations.
116 The paper concludes with a discussion of our experience in using
117 the new memory allocator,
118 and directions for future work.
121 .H 1 "Kernel Memory Allocation in 4.3BSD
123 The 4.3BSD kernel has at least ten different memory allocators.
124 Some of them handle large blocks,
125 some of them handle small chained data structures,
126 and others include information to describe I/O operations.
127 Often the allocations are for small pieces of memory that are only
128 needed for the duration of a single system call.
129 In a user process such short-term
130 memory would be allocated on the run-time stack.
131 Because the kernel has a limited run-time stack,
132 it is not feasible to allocate even moderate blocks of memory on it.
133 Consequently, such memory must be allocated through a more dynamic mechanism.
For example, when the system must translate a pathname,
it must allocate a one kilobyte buffer to hold the name.
137 Other blocks of memory must be more persistent than a single system call
138 and really have to be allocated from dynamic memory.
139 Examples include protocol control blocks that remain throughout
140 the duration of the network connection.
142 Demands for dynamic memory allocation in the kernel have increased
143 as more services have been added.
144 Each time a new type of memory allocation has been required,
145 a specialized memory allocation scheme has been written to handle it.
146 Often the new memory allocation scheme has been built on top
147 of an older allocator.
148 For example, the block device subsystem provides a crude form of
149 memory allocation through the allocation of empty buffers [Thompson78].
150 The allocation is slow because of the implied semantics of
151 finding the oldest buffer, pushing its contents to disk if they are dirty,
and moving physical memory into or out of the buffer to create
the requested size.
154 To reduce the overhead, a ``new'' memory allocator was built in 4.3BSD
155 for name translation that allocates a pool of empty buffers.
156 It keeps them on a free list so they can
157 be quickly allocated and freed [McKusick85].
159 This memory allocation method has several drawbacks.
160 First, the new allocator can only handle a limited range of sizes.
161 Second, it depletes the buffer pool, as it steals memory intended
to buffer disk blocks for other purposes.
163 Finally, it creates yet another interface of
164 which the programmer must be aware.
166 A generalized memory allocator is needed to reduce the complexity
167 of writing code inside the kernel.
168 Rather than providing many semi-specialized ways of allocating memory,
169 the kernel should provide a single general purpose allocator.
170 With only a single interface,
171 programmers do not need to figure
172 out the most appropriate way to allocate memory.
173 If a good general purpose allocator is available,
it helps avoid the syndrome of creating yet another special
purpose allocator.
177 To ease the task of understanding how to use it,
178 the memory allocator should have an interface similar to the interface
179 of the well-known memory allocator provided for
applications programmers through the C library routines
\fImalloc\fP and \fIfree\fP.
184 Like the C library interface,
185 the allocation routine should take a parameter specifying the
186 size of memory that is needed.
187 The range of sizes for memory requests should not be constrained.
188 The free routine should take a pointer to the storage being freed,
189 and should not require additional information such as the size
190 of the piece of memory being freed.
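In user-level terms, the calling convention being asked for is simply
the familiar one shown below; the kernel version should look the same
to its callers apart from any kernel-specific arguments
(this fragment is illustrative and is not the interface of Appendix A).
.DS
char *cp;

/* the allocation request specifies only the size needed */
cp = malloc(1024);

/* the free call specifies only the pointer, not the size */
free(cp);
.DE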
191 .H 1 "Criteria for a Kernel Memory Allocator
193 The design specification for a kernel memory allocator is similar to,
194 but not identical to,
195 the design criteria for a user level memory allocator.
196 The first criterion for a memory allocator is that it make good use
197 of the physical memory.
198 Good use of memory is measured by the amount of memory needed to hold
199 a set of allocations at any point in time.
200 Percentage utilization is expressed as:
.EQ
utilization~=~requested over required
.EN
Here, ``requested'' is the sum of the memory that has been requested
and not yet freed.
206 ``Required'' is the amount of memory that has been
207 allocated for the pool from which the requests are filled.
208 An allocator requires more memory than requested because of fragmentation
209 and a need to have a ready supply of free memory for future requests.
210 A perfect memory allocator would have a utilization of 100%.
In practice, having a 50% utilization is considered good [Korn85].
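For example (the numbers are purely illustrative),
an allocator that holds 600 kilobytes in its pool to satisfy
300 kilobytes of outstanding requests has
.EQ
utilization~=~300 over 600~=~50%.
.EN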
Good memory utilization in the kernel is more important than
in user processes.
216 Because user processes run in virtual memory,
217 unused parts of their address space can be paged out.
218 Thus pages in the process address space
219 that are part of the ``required'' pool that are not
220 being ``requested'' need not tie up physical memory.
221 Because the kernel is not paged,
222 all pages in the ``required'' pool are held by the kernel and
223 cannot be used for other purposes.
224 To keep the kernel utilization percentage as high as possible,
225 it is desirable to release unused memory in the ``required'' pool
226 rather than to hold it as is typically done with user processes.
227 Because the kernel can directly manipulate its own page maps,
228 releasing unused memory is fast;
229 a user process must do a system call to release memory.
231 The most important criterion for a memory allocator is that it be fast.
232 Because memory allocation is done frequently,
233 a slow memory allocator will degrade the system performance.
234 Speed of allocation is more critical when executing in the
235 kernel than in user code,
because the kernel must allocate many data structures that user
237 processes can allocate cheaply on their run-time stack.
In addition, the kernel represents the platform on which all user
processes run,
and if it is slow, it will degrade the performance of every process
that is running.
243 Another problem with a slow memory allocator is that programmers
244 of frequently-used kernel interfaces will feel that they
245 cannot afford to use it as their primary memory allocator.
246 Instead they will build their own memory allocator on top of the
247 original by maintaining their own pool of memory blocks.
248 Multiple allocators reduce the efficiency with which memory is used.
249 The kernel ends up with many different free lists of memory
250 instead of a single free list from which all allocation can be drawn.
For example, consider the case of two subsystems that need memory.
253 If they have their own free lists,
254 the amount of memory tied up in the two lists will be the
255 sum of the greatest amount of memory that each of
256 the two subsystems has ever used.
257 If they share a free list,
258 the amount of memory tied up in the free list may be as low as the
259 greatest amount of memory that either subsystem used.
260 As the number of subsystems grows,
261 the savings from having a single free list grow.
262 .H 1 "Existing User-level Implementations
264 There are many different algorithms and
265 implementations of user-level memory allocators.
266 A survey of those available on UNIX systems appeared in [Korn85].
267 Nearly all of the memory allocators tested made good use of memory,
268 though most of them were too slow for use in the kernel.
269 The fastest memory allocator in the survey by nearly a factor of two
was the memory allocator provided on 4.2BSD,
originally written by Chris Kingsley at the California Institute of Technology.
Unfortunately, the 4.2BSD memory allocator also wasted twice as much memory
274 as its nearest competitor in the survey.
276 The 4.2BSD user-level memory allocator works by maintaining a set of lists
277 that are ordered by increasing powers of two.
278 Each list contains a set of memory blocks of its corresponding size.
279 To fulfill a memory request,
280 the size of the request is rounded up to the next power of two.
281 A piece of memory is then removed from the list corresponding
282 to the specified power of two and returned to the requester.
283 Thus, a request for a block of memory of size 53 returns
284 a block from the 64-sized list.
285 A typical memory allocation requires a roundup calculation
286 followed by a linked list removal.
287 Only if the list is empty is a real memory allocation done.
288 The free operation is also fast;
289 the block of memory is put back onto the list from which it came.
290 The correct list is identified by a size indicator stored
291 immediately preceding the memory block.
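The sketch below (ours, for illustration; it is not the Kingsley code)
shows the essentials of this strategy: the roundup to a power of two,
the per-size free lists, and the size indicator kept just ahead of each
allocated block.
The C library \fImalloc\fP stands in here for getting fresh memory from
the system.
.DS
#include <stdlib.h>

/*
 * Illustrative model of a power-of-two list allocator.  A free
 * block begins with a pointer to the next free block of the same
 * size; an allocated block is preceded by a one-word index that
 * records the list it came from, so no size is passed to the
 * free routine.
 */
#define MINBUCKET 4             /* smallest block: 1 << 4 = 16 bytes */
#define NBUCKETS  12

union overhead {
    union overhead *ov_next;    /* when free: next block on the list */
    size_t ov_index;            /* when allocated: which list */
};

static union overhead *nextf[NBUCKETS];  /* one free list per size */

void *
p2_malloc(size_t nbytes)
{
    union overhead *op;
    size_t bsize = 1 << MINBUCKET;
    int bucket = 0;

    /* round the request, plus its overhead, up to a power of two */
    while (bsize < nbytes + sizeof(union overhead)) {
        bsize <<= 1;
        bucket++;
    }
    if (bucket >= NBUCKETS)
        return (NULL);                /* sketch: no large-block path */
    if ((op = nextf[bucket]) != NULL)
        nextf[bucket] = op->ov_next;  /* fast path: unlink list head */
    else if ((op = malloc(bsize)) == NULL)
        return (NULL);                /* stand-in for fresh memory */
    op->ov_index = bucket;            /* size indicator precedes block */
    return ((void *)(op + 1));
}

void
p2_free(void *cp)
{
    union overhead *op = (union overhead *)cp - 1;
    size_t bucket = op->ov_index;     /* read the tag before reusing it */

    op->ov_next = nextf[bucket];      /* back onto its own list */
    nextf[bucket] = op;
}
.DE
A request for 53 bytes, for instance, needs 61 bytes once the tag is
included and is therefore served from the 64-byte list, as described above.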
292 .H 1 "Considerations Unique to a Kernel Allocator
294 There are several special conditions that arise when writing a
memory allocator for the kernel that do not apply to a user process
memory allocator.
297 First, the maximum memory allocation can be determined at
298 the time that the machine is booted.
299 This number is never more than the amount of physical memory on the machine,
300 and is typically much less since a machine with all its
301 memory dedicated to the operating system is uninteresting to use.
302 Thus, the kernel can statically allocate a set of data structures
303 to manage its dynamically allocated memory.
304 These data structures never need to be
305 expanded to accommodate memory requests;
306 yet, if properly designed, they need not be large.
307 For a user process, the maximum amount of memory that may be allocated
308 is a function of the maximum size of its virtual memory.
309 Although it could allocate static data structures to manage
310 its entire virtual memory,
311 even if they were efficiently encoded they would potentially be huge.
312 The other alternative is to allocate data structures as they are needed.
313 However, that adds extra complications such as new
314 failure modes if it cannot allocate space for additional
315 structures and additional mechanisms to link them all together.
317 Another special condition of the kernel memory allocator is that it
318 can control its own address space.
319 Unlike user processes that can only grow and shrink their heap at one end,
320 the kernel can keep an arena of kernel addresses and allocate
321 pieces from that arena which it then populates with physical memory.
322 The effect is much the same as a user process that has parts of
323 its address space paged out when they are not in use,
324 except that the kernel can explicitly control the set of pages
325 allocated to its address space.
326 The result is that the ``working set'' of pages in use by the
327 kernel exactly corresponds to the set of pages that it is really using.
328 .FI "One day memory usage on a Berkeley time-sharing machine"
332 A final special condition that applies to the kernel is that
333 all of the different uses of dynamic memory are known in advance.
334 Each one of these uses of dynamic memory can be assigned a type.
335 For each type of dynamic memory that is allocated,
336 the kernel can provide allocation limits.
337 One reason given for having separate allocators is that
338 no single allocator could starve the rest of the kernel of all
339 its available memory and thus a single runaway
340 client could not paralyze the system.
341 By putting limits on each type of memory,
342 the single general purpose memory allocator can provide the same
343 protection against memory starvation.\(dg
.FS
\(dgOne might seriously ask the question what good it is if ``only''
one subsystem within the kernel hangs if it is something like the
network on a diskless workstation.
.FE
350 \*(Lb shows the memory usage of the kernel over a one day period
351 on a general timesharing machine at Berkeley.
352 The ``In Use'', ``Free'', and ``Mem Use'' fields are instantaneous values;
353 the ``Requests'' field is the number of allocations since system startup;
354 the ``High Use'' field is the maximum value of
355 the ``Mem Use'' field since system startup.
356 The figure demonstrates that most
357 allocations are for small objects.
358 Large allocations occur infrequently,
359 and are typically for long-lived objects
360 such as buffers to hold the superblock for
361 a mounted file system.
362 Thus, a memory allocator only needs to be
363 fast for small pieces of memory.
364 .H 1 "Implementation of the Kernel Memory Allocator
366 In reviewing the available memory allocators,
we found that none of their strategies could be used without some modification.
368 The kernel memory allocator that we ended up with is a hybrid
369 of the fast memory allocator found in the 4.2BSD C library
370 and a slower but more-memory-efficient first-fit allocator.
372 Small allocations are done using the 4.2BSD power-of-two list strategy;
373 the typical allocation requires only a computation of
the list to use and the removal of an element if it is available,
so it is quite fast.
376 Macros are provided to avoid the cost of a subroutine call.
377 Only if the request cannot be fulfilled from a list is a call
378 made to the allocator itself.
379 To ensure that the allocator is always called for large requests,
380 the lists corresponding to large allocations are always empty.
381 Appendix A shows the data structures and implementation of the macros.
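The flavor of the allocation fast path is sketched below as a function
for readability; it is not the Appendix A text, it omits the interrupt
blocking and the statistics that the real macro performs, and the three
external names are placeholders.
.DS
extern char *freelistp[];               /* power-of-two free-list heads */
extern int bucket_index(unsigned int);  /* round size up to a list index */
extern char *kmem_slow_alloc(unsigned int);  /* the allocator proper */

char *
kmem_fast_alloc(unsigned int size)
{
    char *space;
    int indx = bucket_index(size);

    if (freelistp[indx] != 0) {
        space = freelistp[indx];           /* take the list head */
        freelistp[indx] = *(char **)space; /* unlink it */
    } else
        space = kmem_slow_alloc(size);     /* empty list or big request */
    return (space);
}
.DE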
383 Similarly, freeing a block of memory can be done with a macro.
The macro computes the list on which to place the request
and puts the block at the head of that list.
386 The free routine is called only if the block of memory is
387 considered to be a large allocation.
388 Including the cost of blocking out interrupts,
389 the allocation and freeing macros generate respectively
390 only nine and sixteen (simple) VAX instructions.
392 Because of the inefficiency of power-of-two allocation strategies
393 for large allocations,
394 a different strategy is used for allocations larger than two kilobytes.
395 The selection of two kilobytes is derived from our statistics on
396 the utilization of memory within the kernel,
which showed that 95 to 98% of allocations are of size one kilobyte or less.
398 A frequent caller of the memory allocator
399 (the name translation function)
400 always requests a one kilobyte block.
401 Additionally the allocation method for large blocks is based on allocating
402 pieces of memory in multiples of pages.
Consequently, the actual allocation sizes for requests of size
$2~times~pagesize$ or less are identical.\(dg
406 \(dgTo understand why this number is $size 8 {2~times~pagesize}$ one
407 observes that the power-of-two algorithm yields sizes of 1, 2, 4, 8, \&...
408 pages while the large block algorithm that allocates in multiples
409 of pages yields sizes of 1, 2, 3, 4, \&... pages.
410 Thus for allocations of sizes between one and two pages
411 both algorithms use two pages;
412 it is not until allocations of sizes between two and three pages
413 that a difference emerges where the power-of-two algorithm will use
414 four pages while the large block algorithm will use three pages.
416 In 4.3BSD on the VAX, the (software) page size is one kilobyte,
417 so two kilobytes is the smallest logical cutoff.
419 Large allocations are first rounded up to be a multiple of the page size.
420 The allocator then uses a first-fit algorithm to find space in the
421 kernel address arena set aside for dynamic allocations.
422 Thus a request for a five kilobyte piece of memory will use exactly
423 five pages of memory rather than eight kilobytes as with
424 the power-of-two allocation strategy.
425 When a large piece of memory is freed,
426 the memory pages are returned to the free memory pool,
427 and the address space is returned to the kernel address arena
428 where it is coalesced with adjacent free pieces.
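In outline, the first-fit search over the arena might look like the
sketch below; the run structure and the names are ours, and both
coalescing on free and recycling of the run descriptors are omitted.
.DS
struct run {                    /* one free piece of the arena */
    unsigned long start;        /* first free page of the run */
    unsigned long npages;       /* length of the run in pages */
    struct run *next;           /* next run, in address order */
};

/*
 * Take npages from the front of the first free run that is large
 * enough.  Returns the starting page, or 0 on failure (page 0 is
 * assumed never to be part of the arena).
 */
unsigned long
firstfit_alloc(struct run **head, unsigned long npages)
{
    struct run **rpp, *rp;
    unsigned long start;

    for (rpp = head; (rp = *rpp) != 0; rpp = &rp->next) {
        if (rp->npages < npages)
            continue;
        start = rp->start;
        rp->start += npages;
        rp->npages -= npages;
        if (rp->npages == 0)
            *rpp = rp->next;    /* run exhausted: unlink it */
        return (start);
    }
    return (0);
}
.DE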
430 Another technique to improve both the efficiency of memory utilization
431 and the speed of allocation
432 is to cluster same-sized small allocations on a page.
433 When a list for a power-of-two allocation is empty,
434 a new page is allocated and divided into pieces of the needed size.
435 This strategy speeds future allocations as several pieces of memory
436 become available as a result of the call into the allocator.
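The carving step itself is simple; in the sketch below each free piece
holds the link to the next piece in its first word, so the pieces must
be at least the size of a pointer.
.DS
/*
 * Sketch: split a fresh page into equal pieces of the needed size
 * and chain every piece onto the free list for that size.
 */
void
carve_page(char *page, int pagesize, int piecesize, char **listp)
{
    char *cp, *end = page + pagesize;

    for (cp = page; cp + piecesize <= end; cp += piecesize) {
        *(char **)cp = *listp;  /* link the piece onto the list */
        *listp = cp;
    }
}
.DE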
438 .FI "Calculation of allocation size"
441 Because the size is not specified when a block of memory is freed,
442 the allocator must keep track of the sizes of the pieces it has handed out.
443 The 4.2BSD user-level allocator stores the size of each block
444 in a header just before the allocation.
445 However, this strategy doubles the memory requirement for allocations that
446 require a power-of-two-sized block.
Instead of storing the size of each piece of memory with the piece itself,
449 the size information is associated with the memory page.
450 \*(Lb shows how the kernel determines
451 the size of a piece of memory that is being freed,
452 by calculating the page in which it resides,
453 and looking up the size associated with that page.
454 Eliminating the cost of the overhead per piece improved utilization
455 far more than expected.
456 The reason is that many allocations in the kernel are for blocks of
457 memory whose size is exactly a power of two.
458 These requests would be nearly doubled if the user-level strategy were used.
459 Now they can be accommodated with no wasted memory.
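The lookup done at free time then amounts to an array reference indexed
by page number, as in the sketch below (the array name and element type
are placeholders, not the kernel's own structures).
.DS
/*
 * Sketch: kmembase is the start of the allocator's arena, pagesz the
 * page size, and sizeindex[] records, for each page of the arena,
 * the size of the pieces that page was divided into.
 */
int
size_of_free_block(char *addr, char *kmembase, int pagesz,
    int *sizeindex)
{
    long pagenum = (addr - kmembase) / pagesz;

    return (sizeindex[pagenum]);   /* size shared by the whole page */
}
.DE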
461 The allocator can be called both from the top half of the kernel,
462 which is willing to wait for memory to become available,
463 and from the interrupt routines in the bottom half of the kernel
464 that cannot wait for memory to become available.
465 Clients indicate their willingness (and ability) to wait with a flag
466 to the allocation routine.
467 For clients that are willing to wait,
the allocator guarantees that their request will succeed.
Thus, these clients need not check the return value from the allocator.
470 If memory is unavailable and the client cannot wait,
471 the allocator returns a null pointer.
472 These clients must be prepared to cope with this
473 (hopefully infrequent) condition
474 (usually by giving up and hoping to do better later).
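In use, the two cases look roughly as follows; the flag names and the
routine name are placeholders, not the actual interface.
.DS
/* top half of the kernel: willing to wait, so the call cannot fail */
cp = kmem_alloc(size, CAN_WAIT);

/* interrupt level: cannot wait, so the result must be checked */
cp = kmem_alloc(size, NO_WAIT);
if (cp == 0)
    return;    /* give up and hope to do better later */
.DE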
475 .H 1 "Results of the Implementation
477 The new memory allocator was written about a year ago.
478 Conversion from the old memory allocators to the new allocator
479 has been going on ever since.
480 Many of the special purpose allocators have been eliminated.
486 Many of the special purpose memory allocators built on
487 top of other allocators have also been eliminated.
488 For example, the allocator that was built on top of the buffer pool allocator
to allocate pathname buffers in the name translation routine
has been eliminated.
493 Because the typical allocation is so fast,
494 we have found that none of the special purpose pools are needed.
Indeed, the cost of allocation is about the same as the previous cost of
496 allocating buffers from the network pool (\fImbuf\fP\^s).
497 Consequently applications that used to allocate network
498 buffers for their own uses have been switched over to using
499 the general purpose allocator without increasing their running time.
501 Quantifying the performance of the allocator is difficult because
502 it is hard to measure the amount of time spent allocating
503 and freeing memory in the kernel.
504 The usual approach is to compile a kernel for profiling
505 and then compare the running time of the routines that
506 implemented the old abstraction versus those that implement the new one.
507 The old routines are difficult to quantify because
508 individual routines were used for more than one purpose.
For example, the buffer allocation routine was used both to allocate
one kilobyte memory blocks
512 and for its intended purpose of providing buffers to the filesystem.
513 Differentiating these uses is often difficult.
514 To get a measure of the cost of memory allocation before
515 putting in our new allocator,
516 we summed up the running time of all the routines whose
517 exclusive task was memory allocation.
518 To this total we added the fraction
519 of the running time of the multi-purpose routines that could
520 clearly be identified as memory allocation usage.
521 This number showed that approximately three percent of
the time spent in the kernel could be attributed to memory allocation.
524 The new allocator is difficult to measure
525 because the usual case of the memory allocator is implemented as a macro.
526 Thus, its running time is a small fraction of the running time of the
527 numerous routines in the kernel that use it.
528 To get a bound on the cost,
529 we changed the macro always to call the memory allocation routine.
530 Running in this mode, the memory allocator accounted for six percent
531 of the time spent in the kernel.
532 Factoring out the cost of the statistics collection and the
533 subroutine call overhead for the cases that could
534 normally be handled by the macro,
535 we estimate that the allocator would account for
536 at most four percent of time in the kernel.
537 These measurements show that the new allocator does not introduce
538 significant new run-time costs.
The other major success has been in keeping the size information
on a per-page basis.
542 This technique allows the most frequently requested sizes to be
543 allocated without waste.
544 It also reduces the amount of bookkeeping information associated
545 with the allocator to four kilobytes of information
546 per megabyte of memory under management (with a one kilobyte page size).
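As a rough consistency check on that figure (the per-page cost here is
inferred from the totals, not taken from the actual record layout):
a one kilobyte page size gives 1024 pages per megabyte, so four bytes of
size information for each page accounts for the total quoted above,
.EQ
1024~times~4~roman bytes~=~4~roman kilobytes.
.EN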
.H 1 "Future Work
Our next project is to convert many of the static
550 kernel tables to be dynamically allocated.
Static tables include the process table, the file table, and others.
553 Making these tables dynamic will have two benefits.
554 First, it will reduce the amount of memory
555 that must be statically allocated at boot time.
556 Second, it will eliminate the arbitrary upper limit imposed
557 by the current static sizing
558 (although a limit will be retained to constrain runaway clients).
559 Other researchers have already shown the memory savings
560 achieved by this conversion [Rodriguez88].
562 Under the current implementation,
563 memory is never moved from one size list to another.
564 With the 4.2BSD memory allocator this causes problems,
565 particularly for large allocations where a process may use
566 a quarter megabyte piece of memory once,
567 which is then never available for any other size request.
568 In our hybrid scheme,
569 memory can be shuffled between large requests so that large blocks
570 of memory are never stranded as they are with the 4.2BSD allocator.
571 However, pages allocated to small requests are allocated once
572 to a particular size and never changed thereafter.
573 If a burst of requests came in for a particular size,
574 that size would acquire a large amount of memory
575 that would then not be available for other future requests.
577 In practice, we do not find that the free lists become too large.
578 However, we have been investigating ways to handle such problems
579 if they occur in the future.
580 Our current investigations involve a routine
581 that can run as part of the idle loop that would sort the elements
582 on each of the free lists into order of increasing address.
583 Since any given page has only one size of elements allocated from it,
584 the effect of the sorting would be to sort the list into distinct pages.
585 When all the pieces of a page became free,
586 the page itself could be released back to the free pool so that
587 it could be allocated to another purpose.
Although there is no guarantee that all the pieces of a page would ever
be freed,
590 most allocations are short-lived, lasting only for the duration of
591 an open file descriptor, an open network connection, or a system call.
592 As new allocations would be made from the page sorted to
593 the front of the list,
594 return of elements from pages at the back would eventually
595 allow pages later in the list to be freed.
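A sketch of how such a scan could recognize a reclaimable page follows;
since this is proposed future work, the structure is entirely ours.
With the list sorted, all the pieces belonging to one page sit next to
each other, so a single pass over the list can examine every page this way.
.DS
/*
 * Sketch: walk a free list and count the free pieces that lie within
 * the given page.  If every piece of the page is free, the page can
 * be returned to the free memory pool.
 */
int
page_is_reclaimable(char **listp, char *page, int pagesz,
    int piecesize)
{
    int nfree = 0;
    char *cp;

    for (cp = *listp; cp != 0; cp = *(char **)cp)
        if (cp >= page && cp < page + pagesz)
            nfree++;
    return (nfree == pagesz / piecesize);
}
.DE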
597 Two of the traditional UNIX
598 memory allocators remain in the current system.
599 The terminal subsystem uses \fIclist\fP\^s (character lists).
600 That part of the system is expected to undergo major revision within
601 the next year or so, and it will probably be changed to use
602 \fImbuf\fP\^s as it is merged into the network system.
603 The other major allocator that remains is
605 the routine that manages the filesystem buffer pool memory
606 and associated control information.
Only the filesystem uses this allocator
609 in the current system;
610 it manages the constant-sized buffer pool.
611 We plan to merge the filesystem buffer cache into the virtual memory system's
612 page cache in the future.
613 This change will allow the size of the buffer pool to be changed
614 according to memory load,
615 but will require a policy for balancing memory needs
616 with filesystem cache performance.
617 .H 1 "Acknowledgments
619 In the spirit of community support,
620 we have made various versions of our allocator available to our test sites.
621 They have been busily burning it in and giving
622 us feedback on their experiences.
623 We acknowledge their invaluable input.
624 The feedback from the Usenix program committee on the initial draft of
625 our paper suggested numerous important improvements.
.H 1 "References
.IP Korn85 \w'Rodriguez88\0\0'u
David Korn, Kiem-Phong Vo,
``In Search of a Better Malloc'',
\fIProceedings of the Portland Usenix Conference\fP,
pp. 489-506, June 1985.
.IP McKusick85 \w'Rodriguez88\0\0'u
M. McKusick, M. Karels, S. Leffler,
``Performance Improvements and Functional Enhancements in 4.3BSD'',
\fIProceedings of the Portland Usenix Conference\fP,
pp. 519-531, June 1985.
.IP Rodriguez88 \w'Rodriguez88\0\0'u
Robert Rodriguez, Matt Koehler, Larry Palmer, Ricky Palmer,
``A Dynamic UNIX Operating System'',
\fIProceedings of the San Francisco Usenix Conference\fP,
June 1988.
.IP Thompson78 \w'Rodriguez88\0\0'u
Ken Thompson,
``UNIX Implementation'',
\fIBell System Technical Journal\fP, volume 57, number 6,
pp. 1931-1946, July-August 1978.