src/backend/utils/mmgr/README

   1 $PostgreSQL$
   2
   3 Notes About Memory Allocation Redesign
   4 ======================================
   5
   6 Up through version 7.0, Postgres had serious problems with memory leakage
   7 during large queries that process a lot of pass-by-reference data.  There
   8 was no provision for recycling memory until end of query.  This needed to be
   9 fixed, even more so with the advent of TOAST which will allowed very large
  10 chunks of data to be passed around in the system.  This document describes
  11 the new memory management system implemented in 7.1.
  12
  13
  14 Background
  15 ----------
  16
  17 We already do most of our memory allocation in "memory contexts", which
  18 are usually AllocSets as implemented by backend/utils/mmgr/aset.c.  What
  19 we need to do is create more contexts and define proper rules about when
  20 they can be freed.
  21
  22 The basic operations on a memory context are:
  23
  24 * create a context
  25
  26 * allocate a chunk of memory within a context (equivalent of standard
  27   C library's malloc())
  28
  29 * delete a context (including freeing all the memory allocated therein)
  30
  31 * reset a context (free all memory allocated in the context, but not the
  32   context object itself)
  33
  34 Given a chunk of memory previously allocated from a context, one can
  35 free it or reallocate it larger or smaller (corresponding to standard
  36 library's free() and realloc() routines).  These operations return memory
  37 to or get more memory from the same context the chunk was originally
  38 allocated in.
  39
  40 At all times there is a "current" context denoted by the
  41 CurrentMemoryContext global variable.  The backend macro palloc()
  42 implicitly allocates space in that context.  The MemoryContextSwitchTo()
  43 operation selects a new current context (and returns the previous context,
  44 so that the caller can restore the previous context before exiting).
  45
  46 The main advantage of memory contexts over plain use of malloc/free is
  47 that the entire contents of a memory context can be freed easily, without
  48 having to request freeing of each individual chunk within it.  This is
  49 both faster and more reliable than per-chunk bookkeeping.  We already use
  50 this fact to clean up at transaction end: by resetting all the active
  51 contexts, we reclaim all memory.  What we need are additional contexts
  52 that can be reset or deleted at strategic times within a query, such as
  53 after each tuple.
  54
  55
  56 Some Notes About the palloc API Versus Standard C Library
  57 ---------------------------------------------------------
  58
  59 The behavior of palloc and friends is similar to the standard C library's
  60 malloc and friends, but there are some deliberate differences too.  Here
  61 are some notes to clarify the behavior.
  62
  63 * If out of memory, palloc and repalloc exit via elog(ERROR).  They never
  64 return NULL, and it is not necessary or useful to test for such a result.
  65
  66 * palloc(0) is explicitly a valid operation.  It does not return a NULL
  67 pointer, but a valid chunk of which no bytes may be used.  (However, the
  68 chunk might later be repalloc'd larger; it can also be pfree'd without
  69 error.)  (Note: this behavior is new in Postgres 8.0; earlier versions
  70 disallowed palloc(0).  It seems more consistent to allow it, however.)
  71 Similarly, repalloc allows realloc'ing to zero size.
  72
  73 * pfree and repalloc do not accept a NULL pointer.  This is intentional.
  74
  75
  76 pfree/repalloc No Longer Depend On CurrentMemoryContext
  77 -------------------------------------------------------
  78
  79 In this proposal, pfree() and repalloc() can be applied to any chunk
  80 whether it belongs to CurrentMemoryContext or not --- the chunk's owning
  81 context will be invoked to handle the operation, regardless.  This is a
  82 change from the old requirement that CurrentMemoryContext must be set
  83 to the same context the memory was allocated from before one can use
  84 pfree() or repalloc().  The old coding requirement is obviously fairly
  85 error-prone, and will become more so the more context-switching we do;
  86 so I think it's essential to use CurrentMemoryContext only for palloc.
  87 We can avoid needing it for pfree/repalloc by putting restrictions on
  88 context managers as discussed below.
  89
  90 We could even consider getting rid of CurrentMemoryContext entirely,
  91 instead requiring the target memory context for allocation to be specified
  92 explicitly.  But I think that would be too much notational overhead ---
  93 we'd have to pass an appropriate memory context to called routines in
  94 many places.  For example, the copyObject routines would need to be passed
  95 a context, as would function execution routines that return a
  96 pass-by-reference datatype.  And what of routines that temporarily
  97 allocate space internally, but don't return it to their caller?  We
  98 certainly don't want to clutter every call in the system with "here is
  99 a context to use for any temporary memory allocation you might want to
 100 do".  So there'd still need to be a global variable specifying a suitable
 101 temporary-allocation context.  That might as well be CurrentMemoryContext.
 102
 103
 104 Additions to the Memory-Context Mechanism
 105 -----------------------------------------
 106
 107 If we are going to have more contexts, we need more mechanism for keeping
 108 track of them; else we risk leaking whole contexts under error conditions.
 109
 110 We can do this by creating trees of "parent" and "child" contexts.  When
 111 creating a memory context, the new context can be specified to be a child
 112 of some existing context.  A context can have many children, but only one
 113 parent.  In this way the contexts form a forest (not necessarily a single
 114 tree, since there could be more than one top-level context).
 115
 116 We then say that resetting or deleting any particular context resets or
 117 deletes all its direct and indirect children as well.  This feature allows
 118 us to manage a lot of contexts without fear that some will be leaked; we
 119 only need to keep track of one top-level context that we are going to
 120 delete at transaction end, and make sure that any shorter-lived contexts
 121 we create are descendants of that context.  Since the tree can have
 122 multiple levels, we can deal easily with nested lifetimes of storage,
 123 such as per-transaction, per-statement, per-scan, per-tuple.  Storage
 124 lifetimes that only partially overlap can be handled by allocating
 125 from different trees of the context forest (there are some examples
 126 in the next section).
 127
 128 For convenience we will also want operations like "reset/delete all
 129 children of a given context, but don't reset or delete that context
 130 itself".
 131
 132
 133 Globally Known Contexts
 134 -----------------------
 135
 136 There will be several widely-known contexts that will typically be
 137 referenced through global variables.  At any instant the system may
 138 contain many additional contexts, but all other contexts should be direct
 139 or indirect children of one of these contexts to ensure they are not
 140 leaked in event of an error.
 141
 142 TopMemoryContext --- this is the actual top level of the context tree;
 143 every other context is a direct or indirect child of this one.  Allocating
 144 here is essentially the same as "malloc", because this context will never
 145 be reset or deleted.  This is for stuff that should live forever, or for
 146 stuff that the controlling module will take care of deleting at the
 147 appropriate time.  An example is fd.c's tables of open files, as well as
 148 the context management nodes for memory contexts themselves.  Avoid
 149 allocating stuff here unless really necessary, and especially avoid
 150 running with CurrentMemoryContext pointing here.
 151
 152 PostmasterContext --- this is the postmaster's normal working context.
 153 After a backend is spawned, it can delete PostmasterContext to free its
 154 copy of memory the postmaster was using that it doesn't need.  (Anything
 155 that has to be passed from postmaster to backends will be passed in
 156 TopMemoryContext.  The postmaster will have only TopMemoryContext,
 157 PostmasterContext, and ErrorContext --- the remaining top-level contexts
 158 will be set up in each backend during startup.)
 159
 160 CacheMemoryContext --- permanent storage for relcache, catcache, and
 161 related modules.  This will never be reset or deleted, either, so it's
 162 not truly necessary to distinguish it from TopMemoryContext.  But it
 163 seems worthwhile to maintain the distinction for debugging purposes.
 164 (Note: CacheMemoryContext will have child-contexts with shorter lifespans.
 165 For example, a child context is the best place to keep the subsidiary
 166 storage associated with a relcache entry; that way we can free rule
 167 parsetrees and so forth easily, without having to depend on constructing
 168 a reliable version of freeObject().)
 169
 170 MessageContext --- this context holds the current command message from the
 171 frontend, as well as any derived storage that need only live as long as
 172 the current message (for example, in simple-Query mode the parse and plan
 173 trees can live here).  This context will be reset, and any children
 174 deleted, at the top of each cycle of the outer loop of PostgresMain.  This
 175 is kept separate from per-transaction and per-portal contexts because a
 176 query string might need to live either a longer or shorter time than any
 177 single transaction or portal.
 178
 179 TopTransactionContext --- this holds everything that lives until end of the
 180 top-level transaction.  This context will be reset, and all its children
 181 deleted, at conclusion of each top-level transaction cycle.  In most cases
 182 you don't want to allocate stuff directly here, but in CurTransactionContext;
 183 what does belong here is control information that exists explicitly to manage
 184 status across multiple subtransactions.  Note: this context is NOT cleared
 185 immediately upon error; its contents will survive until the transaction block
 186 is exited by COMMIT/ROLLBACK.
 187
 188 CurTransactionContext --- this holds data that has to survive until the end
 189 of the current transaction, and in particular will be needed at top-level
 190 transaction commit.  When we are in a top-level transaction this is the same
 191 as TopTransactionContext, but in subtransactions it points to a child context.
 192 It is important to understand that if a subtransaction aborts, its
 193 CurTransactionContext is thrown away after finishing the abort processing;
 194 but a committed subtransaction's CurTransactionContext is kept until top-level
 195 commit (unless of course one of the intermediate levels of subtransaction
 196 aborts).  This ensures that we do not keep data from a failed subtransaction
 197 longer than necessary.  Because of this behavior, you must be careful to clean
 198 up properly during subtransaction abort --- the subtransaction's state must be
 199 delinked from any pointers or lists kept in upper transactions, or you will
 200 have dangling pointers leading to a crash at top-level commit.  An example of
 201 data kept here is pending NOTIFY messages, which are sent at top-level commit,
 202 but only if the generating subtransaction did not abort.
 203
 204 PortalContext --- this is not actually a separate context, but a
 205 global variable pointing to the per-portal context of the currently active
 206 execution portal.  This can be used if it's necessary to allocate storage
 207 that will live just as long as the execution of the current portal requires.
 208
 209 ErrorContext --- this permanent context will be switched into for error
 210 recovery processing, and then reset on completion of recovery.  We'll
 211 arrange to have, say, 8K of memory available in it at all times.  In this
 212 way, we can ensure that some memory is available for error recovery even
 213 if the backend has run out of memory otherwise.  This allows out-of-memory
 214 to be treated as a normal ERROR condition, not a FATAL error.
 215
 216
 217 Contexts For Prepared Statements And Portals
 218 --------------------------------------------
 219
 220 A prepared-statement object has an associated private context, in which
 221 the parse and plan trees for its query are stored.  Because these trees
 222 are read-only to the executor, the prepared statement can be re-used many
 223 times without further copying of these trees.
 224
 225 An execution-portal object has a private context that is referenced by
 226 PortalContext when the portal is active.  In the case of a portal created
 227 by DECLARE CURSOR, this private context contains the query parse and plan
 228 trees (there being no other object that can hold them).  Portals created
 229 from prepared statements simply reference the prepared statements' trees,
 230 and won't actually need any storage allocated in their private contexts.
 231
 232
 233 Transient Contexts During Execution
 234 -----------------------------------
 235
 236 When creating a prepared statement, the parse and plan trees will be built
 237 in a temporary context that's a child of MessageContext (so that it will
 238 go away automatically upon error).  On success, the finished plan is
 239 copied to the prepared statement's private context, and the temp context
 240 is released; this allows planner temporary space to be recovered before
 241 execution begins.  (In simple-Query mode we'll not bother with the extra
 242 copy step, so the planner temp space stays around till end of query.)
 243
 244 The top-level executor routines, as well as most of the "plan node"
 245 execution code, will normally run in a context that is created by
 246 ExecutorStart and destroyed by ExecutorEnd; this context also holds the
 247 "plan state" tree built during ExecutorStart.  Most of the memory
 248 allocated in these routines is intended to live until end of query,
 249 so this is appropriate for those purposes.  The executor's top context
 250 is a child of PortalContext, that is, the per-portal context of the
 251 portal that represents the query's execution.
 252
 253 The main improvement needed in the executor is that expression evaluation
 254 --- both for qual testing and for computation of targetlist entries ---
 255 needs to not leak memory.  To do this, each ExprContext (expression-eval
 256 context) created in the executor will now have a private memory context
 257 associated with it, and we'll arrange to switch into that context when
 258 evaluating expressions in that ExprContext.  The plan node that owns the
 259 ExprContext is responsible for resetting the private context to empty
 260 when it no longer needs the results of expression evaluations.  Typically
 261 the reset is done at the start of each tuple-fetch cycle in the plan node.
 262
 263 Note that this design gives each plan node its own expression-eval memory
 264 context.  This appears necessary to handle nested joins properly, since
 265 an outer plan node might need to retain expression results it has computed
 266 while obtaining the next tuple from an inner node --- but the inner node
 267 might execute many tuple cycles and many expressions before returning a
 268 tuple.  The inner node must be able to reset its own expression context
 269 more often than once per outer tuple cycle.  Fortunately, memory contexts
 270 are cheap enough that giving one to each plan node doesn't seem like a
 271 problem.
 272
 273 A problem with running index accesses and sorts in a query-lifespan context
 274 is that these operations invoke datatype-specific comparison functions,
 275 and if the comparators leak any memory then that memory won't be recovered
 276 till end of query.  The comparator functions all return bool or int32,
 277 so there's no problem with their result data, but there can be a problem
 278 with leakage of internal temporary data.  In particular, comparator
 279 functions that operate on TOAST-able data types will need to be careful
 280 not to leak detoasted versions of their inputs.  This is annoying, but
 281 it appears a lot easier to make the comparators conform than to fix the
 282 index and sort routines, so that's what I propose to do for 7.1.  Further
 283 cleanup can be left for another day.
 284
 285 There will be some special cases, such as aggregate functions.  nodeAgg.c
 286 needs to remember the results of evaluation of aggregate transition
 287 functions from one tuple cycle to the next, so it can't just discard
 288 all per-tuple state in each cycle.  The easiest way to handle this seems
 289 to be to have two per-tuple contexts in an aggregate node, and to
 290 ping-pong between them, so that at each tuple one is the active allocation
 291 context and the other holds any results allocated by the prior cycle's
 292 transition function.
 293
 294 Executor routines that switch the active CurrentMemoryContext may need
 295 to copy data into their caller's current memory context before returning.
 296 I think there will be relatively little need for that, because of the
 297 convention of resetting the per-tuple context at the *start* of an
 298 execution cycle rather than at its end.  With that rule, an execution
 299 node can return a tuple that is palloc'd in its per-tuple context, and
 300 the tuple will remain good until the node is called for another tuple
 301 or told to end execution.  This is pretty much the same state of affairs
 302 that exists now, since a scan node can return a direct pointer to a tuple
 303 in a disk buffer that is only guaranteed to remain good that long.
 304
 305 A more common reason for copying data will be to transfer a result from
 306 per-tuple context to per-run context; for example, a Unique node will
 307 save the last distinct tuple value in its per-run context, requiring a
 308 copy step.
 309
 310 Another interesting special case is VACUUM, which needs to allocate
 311 working space that will survive its forced transaction commits, yet
 312 be released on error.  Currently it does that through a "portal",
 313 which is essentially a child context of TopMemoryContext.  While that
 314 way still works, it's ugly since xact abort needs special processing
 315 to delete the portal.  Better would be to use a context that's a child
 316 of PortalContext and hence is certain to go away as part of normal
 317 processing.  (Eventually we might have an even better solution from
 318 nested transactions, but this'll do fine for now.)
 319
 320
 321 Mechanisms to Allow Multiple Types of Contexts
 322 ----------------------------------------------
 323
 324 We may want several different types of memory contexts with different
 325 allocation policies but similar external behavior.  To handle this,
 326 memory allocation functions will be accessed via function pointers,
 327 and we will require all context types to obey the conventions given here.
 328 (This is not very far different from the existing code.)
 329
 330 A memory context will be represented by an object like
 331
 332 typedef struct MemoryContextData
 333 {
 334     NodeTag        type;           /* identifies exact kind of context */
 335     MemoryContextMethods methods;
 336     MemoryContextData *parent;     /* NULL if no parent (toplevel context) */
 337     MemoryContextData *firstchild; /* head of linked list of children */
 338     MemoryContextData *nextchild;  /* next child of same parent */
 339     char          *name;           /* context name (just for debugging) */
 340 } MemoryContextData, *MemoryContext;
 341
 342 This is essentially an abstract superclass, and the "methods" pointer is
 343 its virtual function table.  Specific memory context types will use
 344 derived structs having these fields as their first fields.  All the
 345 contexts of a specific type will have methods pointers that point to the
 346 same static table of function pointers, which will look like
 347
 348 typedef struct MemoryContextMethodsData
 349 {
 350     Pointer     (*alloc) (MemoryContext c, Size size);
 351     void        (*free_p) (Pointer chunk);
 352     Pointer     (*realloc) (Pointer chunk, Size newsize);
 353     void        (*reset) (MemoryContext c);
 354     void        (*delete) (MemoryContext c);
 355 } MemoryContextMethodsData, *MemoryContextMethods;
 356
 357 Alloc, reset, and delete requests will take a MemoryContext pointer
 358 as parameter, so they'll have no trouble finding the method pointer
 359 to call.  Free and realloc are trickier.  To make those work, we will
 360 require all memory context types to produce allocated chunks that
 361 are immediately preceded by a standard chunk header, which has the
 362 layout
 363
 364 typedef struct StandardChunkHeader
 365 {
 366     MemoryContext mycontext;         /* Link to owning context object */
 367     Size          size;              /* Allocated size of chunk */
 368 };
 369
 370 It turns out that the existing aset.c memory context type does this
 371 already, and probably any other kind of context would need to have the
 372 same data available to support realloc, so this is not really creating
 373 any additional overhead.  (Note that if a context type needs more per-
 374 allocated-chunk information than this, it can make an additional
 375 nonstandard header that precedes the standard header.  So we're not
 376 constraining context-type designers very much.)
 377
 378 Given this, the pfree routine will look something like
 379
 380     StandardChunkHeader * header =
 381         (StandardChunkHeader *) ((char *) p - sizeof(StandardChunkHeader));
 382
 383     (*header->mycontext->methods->free_p) (p);
 384
 385 We could do it as a macro, but the macro would have to evaluate its
 386 argument twice, which seems like a bad idea (the current pfree macro
 387 does not do that).  This is already saving two levels of function call
 388 compared to the existing code, so I think we're doing fine without
 389 squeezing out that last little bit ...
 390
 391
 392 More Control Over aset.c Behavior
 393 ---------------------------------
 394
 395 Currently, aset.c allocates an 8K block upon the first allocation in
 396 a context, and doubles that size for each successive block request.
 397 That's good behavior for a context that might hold *lots* of data, and
 398 the overhead wasn't bad when we had only a few contexts in existence.
 399 With dozens if not hundreds of smaller contexts in the system, we will
 400 want to be able to fine-tune things a little better.
 401
 402 The creator of a context will be able to specify an initial block size
 403 and a maximum block size.  Selecting smaller values will prevent wastage
 404 of space in contexts that aren't expected to hold very much (an example is
 405 the relcache's per-relation contexts).
 406
 407 Also, it will be possible to specify a minimum context size.  If this
 408 value is greater than zero then a block of that size will be grabbed
 409 immediately upon context creation, and cleared but not released during
 410 context resets.  This feature is needed for ErrorContext (see above),
 411 but will most likely not be used for other contexts.
 412
 413 We expect that per-tuple contexts will be reset frequently and typically
 414 will not allocate very much space per tuple cycle.  To make this usage
 415 pattern cheap, the first block allocated in a context is not given
 416 back to malloc() during reset, but just cleared.  This avoids malloc
 417 thrashing.
 418
 419
 420 Other Notes
 421 -----------
 422
 423 The original version of this proposal suggested that functions returning
 424 pass-by-reference datatypes should be required to return a value freshly
 425 palloc'd in their caller's memory context, never a pointer to an input
 426 value.  I've abandoned that notion since it clearly is prone to error.
 427 In the current proposal, it is possible to discover which context a
 428 chunk of memory is allocated in (by checking the required standard chunk
 429 header), so nodeAgg can determine whether or not it's safe to reset
 430 its working context; it doesn't have to rely on the transition function
 431 to do what it's expecting.