doc/src/sgml/storage.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <chapter id="storage">
   4
   5 <title>Database Physical Storage</title>
   6
   7 <para>
   8 This chapter provides an overview of the physical storage format used by
   9 <productname>PostgreSQL</productname> databases.
  10 </para>
  11
  12 <sect1 id="storage-file-layout">
  13
  14 <title>Database File Layout</title>
  15
  16 <para>
  17 This section describes the storage format at the level of files and
  18 directories.
  19 </para>
  20
  21 <para>
  22 All the data needed for a database cluster is stored within the cluster's data
  23 directory, commonly referred to as <varname>PGDATA</> (after the name of the
  24 environment variable that can be used to define it).  A common location for
  25 <varname>PGDATA</> is <filename>/var/lib/pgsql/data</>.  Multiple clusters,
  26 managed by different server instances, can exist on the same machine.
  27 </para>
  28
  29 <para>
  30 The <varname>PGDATA</> directory contains several subdirectories and control
  31 files, as shown in <xref linkend="pgdata-contents-table">.  In addition to
  32 these required items, the cluster configuration files
  33 <filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
  34 <filename>pg_ident.conf</filename> are traditionally stored in
  35 <varname>PGDATA</> (although in <productname>PostgreSQL</productname> 8.0 and
  36 later, it is possible to keep them elsewhere).
  37 </para>
  38
  39 <table tocentry="1" id="pgdata-contents-table">
  40 <title>Contents of <varname>PGDATA</></title>
  41 <tgroup cols="2">
  42 <thead>
  43 <row>
  44 <entry>
  45 Item
  46 </entry>
  47 <entry>Description</entry>
  48 </row>
  49 </thead>
  50
  51 <tbody>
  52
  53 <row>
  54  <entry><filename>PG_VERSION</></entry>
  55  <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
  56 </row>
  57
  58 <row>
  59  <entry><filename>base</></entry>
  60  <entry>Subdirectory containing per-database subdirectories</entry>
  61 </row>
  62
  63 <row>
  64  <entry><filename>global</></entry>
  65  <entry>Subdirectory containing cluster-wide tables, such as
  66  <structname>pg_database</></entry>
  67 </row>
  68
  69 <row>
  70  <entry><filename>pg_clog</></entry>
  71  <entry>Subdirectory containing transaction commit status data</entry>
  72 </row>
  73
  74 <row>
  75  <entry><filename>pg_multixact</></entry>
  76  <entry>Subdirectory containing multitransaction status data
  77   (used for shared row locks)</entry>
  78 </row>
  79
  80 <row>
  81  <entry><filename>pg_stat_tmp</></entry>
  82  <entry>Subdirectory containing temporary files for the statistics
  83   subsystem</entry>
  84 </row>
  85
  86 <row>
  87  <entry><filename>pg_subtrans</></entry>
  88  <entry>Subdirectory containing subtransaction status data</entry>
  89 </row>
  90
  91 <row>
  92  <entry><filename>pg_tblspc</></entry>
  93  <entry>Subdirectory containing symbolic links to tablespaces</entry>
  94 </row>
  95
  96 <row>
  97  <entry><filename>pg_twophase</></entry>
  98  <entry>Subdirectory containing state files for prepared transactions</entry>
  99 </row>
 100
 101 <row>
 102  <entry><filename>pg_xlog</></entry>
 103  <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
 104 </row>
 105
 106 <row>
 107  <entry><filename>postmaster.opts</></entry>
 108  <entry>A file recording the command-line options the server was
 109 last started with</entry>
 110 </row>
 111
 112 <row>
 113  <entry><filename>postmaster.pid</></entry>
 114  <entry>A lock file recording the current server PID and shared memory
 115 segment ID (not present after server shutdown)</entry>
 116 </row>
 117
 118 </tbody>
 119 </tgroup>
 120 </table>
 121
 122 <para>
 123 For each database in the cluster there is a subdirectory within
 124 <varname>PGDATA</><filename>/base</>, named after the database's OID in
 125 <structname>pg_database</>.  This subdirectory is the default location
 126 for the database's files; in particular, its system catalogs are stored
 127 there.
 128 </para>
 129
 130 <para>
 131 Each table and index is stored in a separate file, named after the table
 132 or index's <firstterm>filenode</> number, which can be found in
 133 <structname>pg_class</>.<structfield>relfilenode</>. In addition to the
 134 main file (a/k/a main fork), each table and index has a <firstterm>free space
 135 map</> (see <xref linkend="storage-fsm">), which stores information about free
 136 space available in the relation.  The free space map is stored in a file named
 137 with the filenode number plus the suffix <literal>_fsm</>.  Tables also have a
 138 <firstterm>visibility map</>, stored in a fork with the suffix
 139 <literal>_vm</>, to track which pages are known to have no dead tuples.
 140 The visibility map is described further in <xref linkend="storage-vm">.
 141 </para>
 142
 143 <caution>
 144 <para>
 145 Note that while a table's filenode often matches its OID, this is
 146 <emphasis>not</> necessarily the case; some operations, like
 147 <command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
 148 of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
 149 Avoid assuming that filenode and table OID are the same.
 150 </para>
 151 </caution>
 152
 153 <para>
 154 When a table or index exceeds 1 GB, it is divided into gigabyte-sized
 155 <firstterm>segments</>.  The first segment's file name is the same as the
 156 filenode; subsequent segments are named filenode.1, filenode.2, etc.
 157 This arrangement avoids problems on platforms that have file size limitations.
 158 (Actually, 1 GB is just the default segment size.  The segment size can be
 159 adjusted using the configuration option <option>--with-segsize</option>
 160 when building <productname>PostgreSQL</>.)
 161 In principle, free space map and visibility map forks could require multiple
 162 segments as well, though this is unlikely to happen in practice.
 163 The contents of tables and indexes are discussed further in
 164 <xref linkend="storage-page-layout">.
 165 </para>
 166
 167 <para>
 168 A table that has columns with potentially large entries will have an
 169 associated <firstterm>TOAST</> table, which is used for out-of-line storage of
 170 field values that are too large to keep in the table rows proper.
 171 <structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
 172 its <acronym>TOAST</> table, if any.
 173 See <xref linkend="storage-toast"> for more information.
 174 </para>
 175
 176 <para>
 177 Tablespaces make the scenario more complicated.  Each user-defined tablespace
 178 has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
 179 directory, which points to the physical tablespace directory (as specified in
 180 its <command>CREATE TABLESPACE</> command).  The symbolic link is named after
 181 the tablespace's OID.  Inside the physical tablespace directory there is
 182 a subdirectory for each database that has elements in the tablespace, named
 183 after the database's OID.  Tables within that directory follow the filenode
 184 naming scheme.  The <literal>pg_default</> tablespace is not accessed through
 185 <filename>pg_tblspc</>, but corresponds to
 186 <varname>PGDATA</><filename>/base</>.  Similarly, the <literal>pg_global</>
 187 tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
 188 <varname>PGDATA</><filename>/global</>.
 189 </para>
 190
 191 <para>
 192 Temporary files (for operations such as sorting more data than can fit in
 193 memory) are created within <varname>PGDATA</><filename>/base/pgsql_tmp</>,
 194 or within a <filename>pgsql_tmp</> subdirectory of a tablespace directory
 195 if a tablespace other than <literal>pg_default</> is specified for them.
 196 The name of a temporary file has the form
 197 <filename>pgsql_tmp<replaceable>PPP</>.<replaceable>NNN</></filename>,
 198 where <replaceable>PPP</> is the PID of the owning backend and
 199 <replaceable>NNN</> distinguishes different temporary files of that backend.
 200 </para>
 201
 202 </sect1>
 203
 204 <sect1 id="storage-toast">
 205
 206 <title>TOAST</title>
 207
 208     <indexterm>
 209      <primary>TOAST</primary>
 210     </indexterm>
 211     <indexterm><primary>sliced bread</><see>TOAST</></indexterm>
 212
 213 <para>
 214 This section provides an overview of <acronym>TOAST</> (The
 215 Oversized-Attribute Storage Technique).
 216 </para>
 217
 218 <para>
 219 <productname>PostgreSQL</productname> uses a fixed page size (commonly
 220 8 kB), and does not allow tuples to span multiple pages.  Therefore,  it is
 221 not possible to store very large field values directly.  To overcome
 222 this limitation, large  field values are compressed and/or broken up into
 223 multiple physical rows. This happens transparently to the user, with only
 224 small impact on most of the backend code.  The technique is affectionately
 225 known as <acronym>TOAST</>  (or <quote>the best thing since sliced bread</>).
 226 </para>
 227
 228 <para>
 229 Only certain data types support <acronym>TOAST</> &mdash; there is no need to
 230 impose the overhead on data types that cannot produce large field values.
 231 To support <acronym>TOAST</>, a data type must have a variable-length
 232 (<firstterm>varlena</>) representation, in which the first 32-bit word of any
 233 stored value contains the total length of the value in bytes (including
 234 itself).  <acronym>TOAST</> does not constrain the rest of the representation.
 235 All the C-level functions supporting a <acronym>TOAST</>-able data type must
 236 be careful to handle <acronym>TOAST</>ed input values.  (This is normally done
 237 by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
 238 value, but in some cases more efficient approaches are possible.)
 239 </para>
 240
 241 <para>
 242 <acronym>TOAST</> usurps two bits of the varlena length word (the high-order
 243 bits on big-endian machines, the low-order bits on little-endian machines),
 244 thereby limiting the logical size of any value of a <acronym>TOAST</>-able
 245 data type to 1 GB (2<superscript>30</> - 1 bytes).  When both bits are zero,
 246 the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
 247 the remaining bits of the length word give the total datum size (including
 248 length word) in bytes.  When the highest-order or lowest-order bit is set,
 249 the value has only a single-byte header instead of the normal four-byte
 250 header, and the remaining bits give the total datum size (including length
 251 byte) in bytes.  As a special case, if the remaining bits are all zero
 252 (which would be impossible for a self-inclusive length), the value is a
 253 pointer to out-of-line data stored in a separate TOAST table.  (The size of
 254 a TOAST pointer is given in the second byte of the datum.)
 255 Values with single-byte headers aren't aligned on any particular
 256 boundary, either.  Lastly, when the highest-order or lowest-order bit is
 257 clear but the adjacent bit is set, the content of the datum has been
 258 compressed and must be decompressed before use.  In this case the remaining
 259 bits of the length word give the total size of the compressed datum, not the
 260 original data.  Note that compression is also possible for out-of-line data
 261 but the varlena header does not tell whether it has occurred &mdash;
 262 the content of the TOAST pointer tells that, instead.
 263 </para>
 264
 265 <para>
 266 If any of the columns of a table are <acronym>TOAST</>-able, the table will
 267 have an associated <acronym>TOAST</> table, whose OID is stored in the table's
 268 <structname>pg_class</>.<structfield>reltoastrelid</> entry.  Out-of-line
 269 <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
 270 described in more detail below.
 271 </para>
 272
 273 <para>
 274 The compression technique used is a fairly simple and very fast member
 275 of the LZ family of compression techniques.  See
 276 <filename>src/backend/utils/adt/pg_lzcompress.c</> for the details.
 277 </para>
 278
 279 <para>
 280 Out-of-line values are divided (after compression if used) into chunks of at
 281 most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
 282 so that four chunk rows will fit on a page, making it about 2000 bytes).
 283 Each chunk is stored
 284 as a separate row in the <acronym>TOAST</> table for the owning table.  Every
 285 <acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
 286 identifying the particular <acronym>TOAST</>ed value),
 287 <structfield>chunk_seq</> (a sequence number for the chunk within its value),
 288 and <structfield>chunk_data</> (the actual data of the chunk).  A unique index
 289 on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
 290 retrieval of the values.  A pointer datum representing an out-of-line
 291 <acronym>TOAST</>ed value therefore needs to store the OID of the
 292 <acronym>TOAST</> table in which to look and the OID of the specific value
 293 (its <structfield>chunk_id</>).  For convenience, pointer datums also store the
 294 logical datum size (original uncompressed data length) and actual stored size
 295 (different if compression was applied).  Allowing for the varlena header bytes,
 296 the total size of a <acronym>TOAST</> pointer datum is therefore 18 bytes
 297 regardless of the actual size of the represented value.
 298 </para>
 299
 300 <para>
 301 The <acronym>TOAST</> code is triggered only
 302 when a row value to be stored in a table is wider than
 303 <symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
 304 The <acronym>TOAST</> code will compress and/or move
 305 field values out-of-line until the row value is shorter than
 306 <symbol>TOAST_TUPLE_TARGET</> bytes (also normally 2 kB)
 307 or no more gains can be had.  During an UPDATE
 308 operation, values of unchanged fields are normally preserved as-is; so an
 309 UPDATE of a row with out-of-line values incurs no <acronym>TOAST</> costs if
 310 none of the out-of-line values change.
 311 </para>
 312
 313 <para>
 314 The <acronym>TOAST</> code recognizes four different strategies for storing
 315 <acronym>TOAST</>-able columns:
 316
 317    <itemizedlist>
 318     <listitem>
 319      <para>
 320       <literal>PLAIN</literal> prevents either compression or
 321       out-of-line storage; furthermore it disables use of single-byte headers
 322       for varlena types.
 323       This is the only possible strategy for
 324       columns of non-<acronym>TOAST</>-able data types.
 325      </para>
 326     </listitem>
 327     <listitem>
 328      <para>
 329       <literal>EXTENDED</literal> allows both compression and out-of-line
 330       storage.  This is the default for most <acronym>TOAST</>-able data types.
 331       Compression will be attempted first, then out-of-line storage if
 332       the row is still too big.
 333      </para>
 334     </listitem>
 335     <listitem>
 336      <para>
 337       <literal>EXTERNAL</literal> allows out-of-line storage but not
 338       compression.  Use of <literal>EXTERNAL</literal> will
 339       make substring operations on wide <type>text</type> and
 340       <type>bytea</type> columns faster (at the penalty of increased storage
 341       space) because these operations are optimized to fetch only the
 342       required parts of the out-of-line value when it is not compressed.
 343      </para>
 344     </listitem>
 345     <listitem>
 346      <para>
 347       <literal>MAIN</literal> allows compression but not out-of-line
 348       storage.  (Actually, out-of-line storage will still be performed
 349       for such columns, but only as a last resort when there is no other
 350       way to make the row small enough.)
 351      </para>
 352     </listitem>
 353    </itemizedlist>
 354
 355 Each <acronym>TOAST</>-able data type specifies a default strategy for columns
 356 of that data type, but the strategy for a given table column can be altered
 357 with <command>ALTER TABLE SET STORAGE</>.
 358 </para>
 359
 360 <para>
 361 This scheme has a number of advantages compared to a more straightforward
 362 approach such as allowing row values to span pages.  Assuming that queries are
 363 usually qualified by comparisons against relatively small key values, most of
 364 the work of the executor will be done using the main row entry. The big values
 365 of <acronym>TOAST</>ed attributes will only be pulled out (if selected at all)
 366 at the time the result set is sent to the client. Thus, the main table is much
 367 smaller and more of its rows fit in the shared buffer cache than would be the
 368 case without any out-of-line storage. Sort sets shrink also, and sorts will
 369 more often be done entirely in memory. A little test showed that a table
 370 containing typical HTML pages and their URLs was stored in about half of the
 371 raw data size including the <acronym>TOAST</> table, and that the main table
 372 contained only about 10% of the entire data (the URLs and some small HTML
 373 pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
 374 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
 375 </para>
 376
 377 </sect1>
 378
 379 <sect1 id="storage-fsm">
 380
 381 <title>Free Space Map</title>
 382
 383 <indexterm>
 384  <primary>Free Space Map</primary>
 385 </indexterm>
 386 <indexterm><primary>FSM</><see>Free Space Map</></indexterm>
 387
 388 <para>
 389 Each heap and index relation, except for hash indexes, has a Free Space Map
 390 (FSM) to keep track of available space in the relation. It's stored
 391 alongside the main relation data in a separate relation fork, named after the
 392 filenode number of the relation, plus a <literal>_fsm</> suffix. For example,
 393 if the filenode of a relation is 12345, the FSM is stored in a file called
 394 <filename>12345_fsm</>, in the same directory as the main relation file.
 395 </para>
 396
 397 <para>
 398 The Free Space Map is organized as a tree of <acronym>FSM</> pages. The
 399 bottom level <acronym>FSM</> pages store the free space available on each
 400 heap (or index) page, using one byte to represent each such page. The upper
 401 levels aggregate information from the lower levels.
 402 </para>
 403
 404 <para>
 405 Within each <acronym>FSM</> page is a binary tree, stored in an array with
 406 one byte per node. Each leaf node represents a heap page, or a lower level
 407 <acronym>FSM</> page. In each non-leaf node, the higher of its children's
 408 values is stored. The maximum value in the leaf nodes is therefore stored
 409 at the root.
 410 </para>
 411
 412 <para>
 413 See <filename>src/backend/storage/freespace/README</> for more details on
 414 how the <acronym>FSM</> is structured, and how it's updated and searched.
 415 The <filename>contrib/pg_freespacemap</> module can be used to examine the
 416 information stored in free space maps (see <xref linkend="pgfreespacemap">).
 417 </para>
 418
 419 </sect1>
 420
 421 <sect1 id="storage-vm">
 422
 423 <title>Visibility Map</title>
 424
 425 <indexterm>
 426  <primary>Visibility Map</primary>
 427 </indexterm>
 428 <indexterm><primary>VM</><see>Visibility Map</></indexterm>
 429
 430 <para>
 431 Each heap relation has a Visibility Map
 432 (VM) to keep track of which pages contain only tuples that are known to be
 433 visible to all active transactions. It's stored
 434 alongside the main relation data in a separate relation fork, named after the
 435 filenode number of the relation, plus a <literal>_vm</> suffix. For example,
 436 if the filenode of a relation is 12345, the VM is stored in a file called
 437 <filename>12345_vm</>, in the same directory as the main relation file.
 438 Note that indexes do not have VMs.
 439 </para>
 440
 441 <para>
 442 The visibility map simply stores one bit per heap page. A set bit means
 443 that all tuples on the page are known to be visible to all transactions.
 444 This means that the page does not contain any tuples that need to be vacuumed;
 445 in future it might also be used to avoid visiting the page for visibility
 446 checks. The map is conservative in the sense that we
 447 make sure that whenever a bit is set, we know the condition is true, but if
 448 a bit is not set, it might or might not be true.
 449 </para>
 450
 451 </sect1>
 452
 453 <sect1 id="storage-page-layout">
 454
 455 <title>Database Page Layout</title>
 456
 457 <para>
 458 This section provides an overview of the page format used within
 459 <productname>PostgreSQL</productname> tables and indexes.<footnote>
 460   <para>
 461     Actually, index access methods need not use this page format.
 462     All the existing index methods do use this basic format,
 463     but the data kept on index metapages usually doesn't follow
 464     the item layout rules.
 465   </para>
 466 </footnote>
 467 Sequences and <acronym>TOAST</> tables are formatted just like a regular table.
 468 </para>
 469
 470 <para>
 471 In the following explanation, a
 472 <firstterm>byte</firstterm>
 473 is assumed to contain 8 bits.  In addition, the term
 474 <firstterm>item</firstterm>
 475 refers to an individual data value that is stored on a page.  In a table,
 476 an item is a row; in an index, an item is an index entry.
 477 </para>
 478
 479 <para>
 480 Every table and index is stored as an array of <firstterm>pages</> of a
 481 fixed size (usually 8 kB, although a different page size can be selected
 482 when compiling the server).  In a table, all the pages are logically
 483 equivalent, so a particular item (row) can be stored in any page.  In
 484 indexes, the first page is generally reserved as a <firstterm>metapage</>
 485 holding control information, and there can be different types of pages
 486 within the index, depending on the index access method.
 487 </para>
 488
 489 <para>
 490 <xref linkend="page-table"> shows the overall layout of a page.
 491 There are five parts to each page.
 492 </para>
 493
 494 <table tocentry="1" id="page-table">
 495 <title>Overall Page Layout</title>
 496 <titleabbrev>Page Layout</titleabbrev>
 497 <tgroup cols="2">
 498 <thead>
 499 <row>
 500 <entry>
 501 Item
 502 </entry>
 503 <entry>Description</entry>
 504 </row>
 505 </thead>
 506
 507 <tbody>
 508
 509 <row>
 510  <entry>PageHeaderData</entry>
 511  <entry>24 bytes long. Contains general information about the page, including
 512 free space pointers.</entry>
 513 </row>
 514
 515 <row>
 516 <entry>ItemIdData</entry>
 517 <entry>Array of (offset,length) pairs pointing to the actual items.
 518 4 bytes per item.</entry>
 519 </row>
 520
 521 <row>
 522 <entry>Free space</entry>
 523 <entry>The unallocated space. New item pointers are allocated from the start
 524 of this area, new items from the end.</entry>
 525 </row>
 526
 527 <row>
 528 <entry>Items</entry>
 529 <entry>The actual items themselves.</entry>
 530 </row>
 531
 532 <row>
 533 <entry>Special space</entry>
 534 <entry>Index access method specific data. Different methods store different
 535 data. Empty in ordinary tables.</entry>
 536 </row>
 537
 538 </tbody>
 539 </tgroup>
 540 </table>
 541
 542  <para>
 543
 544   The first 24 bytes of each page consists of a page header
 545   (PageHeaderData). Its format is detailed in <xref
 546   linkend="pageheaderdata-table">. The first two fields track the most
 547   recent WAL entry related to this page. Next is a 2-byte field
 548   containing flag bits. This is followed by three 2-byte integer fields
 549   (<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
 550   and <structfield>pd_special</structfield>). These contain byte offsets
 551   from the page start to the start
 552   of unallocated space, to the end of unallocated space, and to the start of
 553   the special space.
 554   The next 2 bytes of the page header,
 555   <structfield>pd_pagesize_version</structfield>, store both the page size
 556   and a version indicator.  Beginning with
 557   <productname>PostgreSQL</productname> 8.3 the version number is 4;
 558   <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
 559   <productname>PostgreSQL</productname> 8.0 used version number 2;
 560   <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
 561   prior releases used version number 0.
 562   (The basic page layout and header format has not changed in most of these
 563   versions, but the layout of heap row headers has.)  The page size
 564   is basically only present as a cross-check; there is no support for having
 565   more than one page size in an installation.
 566   The last field is a hint that shows whether pruning the page is likely
 567   to be profitable: it tracks the oldest un-pruned XMAX on the page.
 568
 569  </para>
 570
 571  <table tocentry="1" id="pageheaderdata-table">
 572  <title>PageHeaderData Layout</title>
 573  <titleabbrev>PageHeaderData Layout</titleabbrev>
 574  <tgroup cols="4">
 575  <thead>
 576   <row>
 577    <entry>Field</entry>
 578    <entry>Type</entry>
 579    <entry>Length</entry>
 580    <entry>Description</entry>
 581   </row>
 582  </thead>
 583  <tbody>
 584   <row>
 585    <entry>pd_lsn</entry>
 586    <entry>XLogRecPtr</entry>
 587    <entry>8 bytes</entry>
 588    <entry>LSN: next byte after last byte of xlog record for last change
 589    to this page</entry>
 590   </row>
 591   <row>
 592    <entry>pd_tli</entry>
 593    <entry>uint16</entry>
 594    <entry>2 bytes</entry>
 595    <entry>TimeLineID of last change (only its lowest 16 bits)</entry>
 596   </row>
 597   <row>
 598    <entry>pd_flags</entry>
 599    <entry>uint16</entry>
 600    <entry>2 bytes</entry>
 601    <entry>Flag bits</entry>
 602   </row>
 603   <row>
 604    <entry>pd_lower</entry>
 605    <entry>LocationIndex</entry>
 606    <entry>2 bytes</entry>
 607    <entry>Offset to start of free space</entry>
 608   </row>
 609   <row>
 610    <entry>pd_upper</entry>
 611    <entry>LocationIndex</entry>
 612    <entry>2 bytes</entry>
 613    <entry>Offset to end of free space</entry>
 614   </row>
 615   <row>
 616    <entry>pd_special</entry>
 617    <entry>LocationIndex</entry>
 618    <entry>2 bytes</entry>
 619    <entry>Offset to start of special space</entry>
 620   </row>
 621   <row>
 622    <entry>pd_pagesize_version</entry>
 623    <entry>uint16</entry>
 624    <entry>2 bytes</entry>
 625    <entry>Page size and layout version number information</entry>
 626   </row>
 627   <row>
 628    <entry>pd_prune_xid</entry>
 629    <entry>TransactionId</entry>
 630    <entry>4 bytes</entry>
 631    <entry>Oldest unpruned XMAX on page, or zero if none</entry>
 632   </row>
 633  </tbody>
 634  </tgroup>
 635  </table>
 636
 637  <para>
 638   All the details can be found in
 639   <filename>src/include/storage/bufpage.h</filename>.
 640  </para>
 641
 642  <para>
 643
 644   Following the page header are item identifiers
 645   (<type>ItemIdData</type>), each requiring four bytes.
 646   An item identifier contains a byte-offset to
 647   the start of an item, its length in bytes, and a few attribute bits
 648   which affect its interpretation.
 649   New item identifiers are allocated
 650   as needed from the beginning of the unallocated space.
 651   The number of item identifiers present can be determined by looking at
 652   <structfield>pd_lower</>, which is increased to allocate a new identifier.
 653   Because an item
 654   identifier is never moved until it is freed, its index can be used on a
 655   long-term basis to reference an item, even when the item itself is moved
 656   around on the page to compact free space.  In fact, every pointer to an
 657   item (<type>ItemPointer</type>, also known as
 658   <type>CTID</type>) created by
 659   <productname>PostgreSQL</productname> consists of a page number and the
 660   index of an item identifier.
 661
 662  </para>
 663
 664  <para>
 665
 666   The items themselves are stored in space allocated backwards from the end
 667   of unallocated space.  The exact structure varies depending on what the
 668   table is to contain. Tables and sequences both use a structure named
 669   <type>HeapTupleHeaderData</type>, described below.
 670
 671  </para>
 672
 673  <para>
 674
 675   The final section is the <quote>special section</quote> which can
 676  contain anything the access method wishes to store.  For example,
 677   b-tree indexes store links to the page's left and right siblings,
 678   as well as some other data relevant to the index structure.
 679   Ordinary tables do not use a special section at all (indicated by setting
 680   <structfield>pd_special</> to equal the page size).
 681
 682  </para>
 683
 684  <para>
 685
 686   All table rows are structured in the same way. There is a fixed-size
 687   header (occupying 23 bytes on most machines), followed by an optional null
 688   bitmap, an optional object ID field, and the user data. The header is
 689   detailed
 690   in <xref linkend="heaptupleheaderdata-table">.  The actual user data
 691   (columns of the row) begins at the offset indicated by
 692   <structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
 693   distance for the platform.
 694   The null bitmap is
 695   only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
 696   <structfield>t_infomask</structfield>. If it is present it begins just after
 697   the fixed header and occupies enough bytes to have one bit per data column
 698   (that is, <structfield>t_natts</> bits altogether). In this list of bits, a
 699   1 bit indicates not-null, a 0 bit is a null.  When the bitmap is not
 700   present, all columns are assumed not-null.
 701   The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
 702   is set in <structfield>t_infomask</structfield>.  If present, it appears just
 703   before the <structfield>t_hoff</> boundary.  Any padding needed to make
 704   <structfield>t_hoff</> a MAXALIGN multiple will appear between the null
 705   bitmap and the object ID.  (This in turn ensures that the object ID is
 706   suitably aligned.)
 707
 708  </para>
 709
 710  <table tocentry="1" id="heaptupleheaderdata-table">
 711  <title>HeapTupleHeaderData Layout</title>
 712  <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
 713  <tgroup cols="4">
 714  <thead>
 715   <row>
 716    <entry>Field</entry>
 717    <entry>Type</entry>
 718    <entry>Length</entry>
 719    <entry>Description</entry>
 720   </row>
 721  </thead>
 722  <tbody>
 723   <row>
 724    <entry>t_xmin</entry>
 725    <entry>TransactionId</entry>
 726    <entry>4 bytes</entry>
 727    <entry>insert XID stamp</entry>
 728   </row>
 729   <row>
 730    <entry>t_xmax</entry>
 731    <entry>TransactionId</entry>
 732    <entry>4 bytes</entry>
 733    <entry>delete XID stamp</entry>
 734   </row>
 735   <row>
 736    <entry>t_cid</entry>
 737    <entry>CommandId</entry>
 738    <entry>4 bytes</entry>
 739    <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
 740   </row>
 741   <row>
 742    <entry>t_xvac</entry>
 743    <entry>TransactionId</entry>
 744    <entry>4 bytes</entry>
 745    <entry>XID for VACUUM operation moving a row version</entry>
 746   </row>
 747   <row>
 748    <entry>t_ctid</entry>
 749    <entry>ItemPointerData</entry>
 750    <entry>6 bytes</entry>
 751    <entry>current TID of this or newer row version</entry>
 752   </row>
 753   <row>
 754    <entry>t_infomask2</entry>
 755    <entry>int16</entry>
 756    <entry>2 bytes</entry>
 757    <entry>number of attributes, plus various flag bits</entry>
 758   </row>
 759   <row>
 760    <entry>t_infomask</entry>
 761    <entry>uint16</entry>
 762    <entry>2 bytes</entry>
 763    <entry>various flag bits</entry>
 764   </row>
 765   <row>
 766    <entry>t_hoff</entry>
 767    <entry>uint8</entry>
 768    <entry>1 byte</entry>
 769    <entry>offset to user data</entry>
 770   </row>
 771  </tbody>
 772  </tgroup>
 773  </table>
 774
 775  <para>
 776    All the details can be found in
 777    <filename>src/include/access/htup.h</filename>.
 778  </para>
 779
 780  <para>
 781
 782   Interpreting the actual data can only be done with information obtained
 783   from other tables, mostly <structname>pg_attribute</structname>. The
 784   key values needed to identify field locations are
 785   <structfield>attlen</structfield> and <structfield>attalign</structfield>.
 786   There is no way to directly get a
 787   particular attribute, except when there are only fixed width fields and no
 788   null values. All this trickery is wrapped up in the functions
 789   <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
 790   and <firstterm>heap_getsysattr</firstterm>.
 791
 792  </para>
 793  <para>
 794
 795   To read the data you need to examine each attribute in turn. First check
 796   whether the field is NULL according to the null bitmap. If it is, go to
 797   the next. Then make sure you have the right alignment.  If the field is a
 798   fixed width field, then all the bytes are simply placed. If it's a
 799   variable length field (attlen = -1) then it's a bit more complicated.
 800   All variable-length datatypes share the common header structure
 801   <type>struct varlena</type>, which includes the total length of the stored
 802   value and some flag bits.  Depending on the flags, the data can be either
 803   inline or in a <acronym>TOAST</> table;
 804   it might be compressed, too (see <xref linkend="storage-toast">).
 805
 806  </para>
 807 </sect1>
 808
 809 </chapter>