The code to unlink dropped relations in FinishPreparedTransaction() was
[PostgreSQL.git] / doc / src / sgml / storage.sgml
blob4d83f72469a409c65658063fe36b99a83f286196
1 <!-- $PostgreSQL$ -->
3 <chapter id="storage">
5 <title>Database Physical Storage</title>
7 <para>
8 This chapter provides an overview of the physical storage format used by
9 <productname>PostgreSQL</productname> databases.
10 </para>
12 <sect1 id="storage-file-layout">
14 <title>Database File Layout</title>
16 <para>
17 This section describes the storage format at the level of files and
18 directories.
19 </para>
21 <para>
22 All the data needed for a database cluster is stored within the cluster's data
23 directory, commonly referred to as <varname>PGDATA</> (after the name of the
24 environment variable that can be used to define it). A common location for
25 <varname>PGDATA</> is <filename>/var/lib/pgsql/data</>. Multiple clusters,
26 managed by different server instances, can exist on the same machine.
27 </para>
29 <para>
30 The <varname>PGDATA</> directory contains several subdirectories and control
31 files, as shown in <xref linkend="pgdata-contents-table">. In addition to
32 these required items, the cluster configuration files
33 <filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
34 <filename>pg_ident.conf</filename> are traditionally stored in
35 <varname>PGDATA</> (although in <productname>PostgreSQL</productname> 8.0 and
36 later, it is possible to keep them elsewhere).
37 </para>
39 <table tocentry="1" id="pgdata-contents-table">
40 <title>Contents of <varname>PGDATA</></title>
41 <tgroup cols="2">
42 <thead>
43 <row>
44 <entry>
45 Item
46 </entry>
47 <entry>Description</entry>
48 </row>
49 </thead>
51 <tbody>
53 <row>
54 <entry><filename>PG_VERSION</></entry>
55 <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
56 </row>
58 <row>
59 <entry><filename>base</></entry>
60 <entry>Subdirectory containing per-database subdirectories</entry>
61 </row>
63 <row>
64 <entry><filename>global</></entry>
65 <entry>Subdirectory containing cluster-wide tables, such as
66 <structname>pg_database</></entry>
67 </row>
69 <row>
70 <entry><filename>pg_clog</></entry>
71 <entry>Subdirectory containing transaction commit status data</entry>
72 </row>
74 <row>
75 <entry><filename>pg_multixact</></entry>
76 <entry>Subdirectory containing multitransaction status data
77 (used for shared row locks)</entry>
78 </row>
80 <row>
81 <entry><filename>pg_stat_tmp</></entry>
82 <entry>Subdirectory containing temporary files for the statistics
83 subsystem</entry>
84 </row>
86 <row>
87 <entry><filename>pg_subtrans</></entry>
88 <entry>Subdirectory containing subtransaction status data</entry>
89 </row>
91 <row>
92 <entry><filename>pg_tblspc</></entry>
93 <entry>Subdirectory containing symbolic links to tablespaces</entry>
94 </row>
96 <row>
97 <entry><filename>pg_twophase</></entry>
98 <entry>Subdirectory containing state files for prepared transactions</entry>
99 </row>
101 <row>
102 <entry><filename>pg_xlog</></entry>
103 <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
104 </row>
106 <row>
107 <entry><filename>postmaster.opts</></entry>
108 <entry>A file recording the command-line options the server was
109 last started with</entry>
110 </row>
112 <row>
113 <entry><filename>postmaster.pid</></entry>
114 <entry>A lock file recording the current server PID and shared memory
115 segment ID (not present after server shutdown)</entry>
116 </row>
118 </tbody>
119 </tgroup>
120 </table>
122 <para>
123 For each database in the cluster there is a subdirectory within
124 <varname>PGDATA</><filename>/base</>, named after the database's OID in
125 <structname>pg_database</>. This subdirectory is the default location
126 for the database's files; in particular, its system catalogs are stored
127 there.
128 </para>
130 <para>
131 Each table and index is stored in a separate file, named after the table
132 or index's <firstterm>filenode</> number, which can be found in
133 <structname>pg_class</>.<structfield>relfilenode</>. In addition to the
134 main file (a/k/a main fork), each table and index has a <firstterm>free space
135 map</> (see <xref linkend="storage-fsm">), which stores information about free
136 space available in the relation. The free space map is stored in a file named
137 with the filenode number plus the suffix <literal>_fsm</>. Tables also have a
138 <firstterm>visibility map</>, stored in a fork with the suffix
139 <literal>_vm</>, to track which pages are known to have no dead tuples.
140 The visibility map is described further in <xref linkend="storage-vm">.
141 </para>
143 <caution>
144 <para>
145 Note that while a table's filenode often matches its OID, this is
146 <emphasis>not</> necessarily the case; some operations, like
147 <command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
148 of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
149 Avoid assuming that filenode and table OID are the same.
150 </para>
151 </caution>
153 <para>
154 When a table or index exceeds 1 GB, it is divided into gigabyte-sized
155 <firstterm>segments</>. The first segment's file name is the same as the
156 filenode; subsequent segments are named filenode.1, filenode.2, etc.
157 This arrangement avoids problems on platforms that have file size limitations.
158 (Actually, 1 GB is just the default segment size. The segment size can be
159 adjusted using the configuration option <option>--with-segsize</option>
160 when building <productname>PostgreSQL</>.)
161 In principle, free space map and visibility map forks could require multiple
162 segments as well, though this is unlikely to happen in practice.
163 The contents of tables and indexes are discussed further in
164 <xref linkend="storage-page-layout">.
165 </para>
167 <para>
168 A table that has columns with potentially large entries will have an
169 associated <firstterm>TOAST</> table, which is used for out-of-line storage of
170 field values that are too large to keep in the table rows proper.
171 <structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
172 its <acronym>TOAST</> table, if any.
173 See <xref linkend="storage-toast"> for more information.
174 </para>
176 <para>
177 Tablespaces make the scenario more complicated. Each user-defined tablespace
178 has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
179 directory, which points to the physical tablespace directory (as specified in
180 its <command>CREATE TABLESPACE</> command). The symbolic link is named after
181 the tablespace's OID. Inside the physical tablespace directory there is
182 a subdirectory for each database that has elements in the tablespace, named
183 after the database's OID. Tables within that directory follow the filenode
184 naming scheme. The <literal>pg_default</> tablespace is not accessed through
185 <filename>pg_tblspc</>, but corresponds to
186 <varname>PGDATA</><filename>/base</>. Similarly, the <literal>pg_global</>
187 tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
188 <varname>PGDATA</><filename>/global</>.
189 </para>
191 <para>
192 Temporary files (for operations such as sorting more data than can fit in
193 memory) are created within <varname>PGDATA</><filename>/base/pgsql_tmp</>,
194 or within a <filename>pgsql_tmp</> subdirectory of a tablespace directory
195 if a tablespace other than <literal>pg_default</> is specified for them.
196 The name of a temporary file has the form
197 <filename>pgsql_tmp<replaceable>PPP</>.<replaceable>NNN</></filename>,
198 where <replaceable>PPP</> is the PID of the owning backend and
199 <replaceable>NNN</> distinguishes different temporary files of that backend.
200 </para>
202 </sect1>
204 <sect1 id="storage-toast">
206 <title>TOAST</title>
208 <indexterm>
209 <primary>TOAST</primary>
210 </indexterm>
211 <indexterm><primary>sliced bread</><see>TOAST</></indexterm>
213 <para>
214 This section provides an overview of <acronym>TOAST</> (The
215 Oversized-Attribute Storage Technique).
216 </para>
218 <para>
219 <productname>PostgreSQL</productname> uses a fixed page size (commonly
220 8 kB), and does not allow tuples to span multiple pages. Therefore, it is
221 not possible to store very large field values directly. To overcome
222 this limitation, large field values are compressed and/or broken up into
223 multiple physical rows. This happens transparently to the user, with only
224 small impact on most of the backend code. The technique is affectionately
225 known as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).
226 </para>
228 <para>
229 Only certain data types support <acronym>TOAST</> &mdash; there is no need to
230 impose the overhead on data types that cannot produce large field values.
231 To support <acronym>TOAST</>, a data type must have a variable-length
232 (<firstterm>varlena</>) representation, in which the first 32-bit word of any
233 stored value contains the total length of the value in bytes (including
234 itself). <acronym>TOAST</> does not constrain the rest of the representation.
235 All the C-level functions supporting a <acronym>TOAST</>-able data type must
236 be careful to handle <acronym>TOAST</>ed input values. (This is normally done
237 by invoking <function>PG_DETOAST_DATUM</> before doing anything with an input
238 value, but in some cases more efficient approaches are possible.)
239 </para>
241 <para>
242 <acronym>TOAST</> usurps two bits of the varlena length word (the high-order
243 bits on big-endian machines, the low-order bits on little-endian machines),
244 thereby limiting the logical size of any value of a <acronym>TOAST</>-able
245 data type to 1 GB (2<superscript>30</> - 1 bytes). When both bits are zero,
246 the value is an ordinary un-<acronym>TOAST</>ed value of the data type, and
247 the remaining bits of the length word give the total datum size (including
248 length word) in bytes. When the highest-order or lowest-order bit is set,
249 the value has only a single-byte header instead of the normal four-byte
250 header, and the remaining bits give the total datum size (including length
251 byte) in bytes. As a special case, if the remaining bits are all zero
252 (which would be impossible for a self-inclusive length), the value is a
253 pointer to out-of-line data stored in a separate TOAST table. (The size of
254 a TOAST pointer is given in the second byte of the datum.)
255 Values with single-byte headers aren't aligned on any particular
256 boundary, either. Lastly, when the highest-order or lowest-order bit is
257 clear but the adjacent bit is set, the content of the datum has been
258 compressed and must be decompressed before use. In this case the remaining
259 bits of the length word give the total size of the compressed datum, not the
260 original data. Note that compression is also possible for out-of-line data
261 but the varlena header does not tell whether it has occurred &mdash;
262 the content of the TOAST pointer tells that, instead.
263 </para>
265 <para>
266 If any of the columns of a table are <acronym>TOAST</>-able, the table will
267 have an associated <acronym>TOAST</> table, whose OID is stored in the table's
268 <structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line
269 <acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, as
270 described in more detail below.
271 </para>
273 <para>
274 The compression technique used is a fairly simple and very fast member
275 of the LZ family of compression techniques. See
276 <filename>src/backend/utils/adt/pg_lzcompress.c</> for the details.
277 </para>
279 <para>
280 Out-of-line values are divided (after compression if used) into chunks of at
281 most <symbol>TOAST_MAX_CHUNK_SIZE</> bytes (by default this value is chosen
282 so that four chunk rows will fit on a page, making it about 2000 bytes).
283 Each chunk is stored
284 as a separate row in the <acronym>TOAST</> table for the owning table. Every
285 <acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OID
286 identifying the particular <acronym>TOAST</>ed value),
287 <structfield>chunk_seq</> (a sequence number for the chunk within its value),
288 and <structfield>chunk_data</> (the actual data of the chunk). A unique index
289 on <structfield>chunk_id</> and <structfield>chunk_seq</> provides fast
290 retrieval of the values. A pointer datum representing an out-of-line
291 <acronym>TOAST</>ed value therefore needs to store the OID of the
292 <acronym>TOAST</> table in which to look and the OID of the specific value
293 (its <structfield>chunk_id</>). For convenience, pointer datums also store the
294 logical datum size (original uncompressed data length) and actual stored size
295 (different if compression was applied). Allowing for the varlena header bytes,
296 the total size of a <acronym>TOAST</> pointer datum is therefore 18 bytes
297 regardless of the actual size of the represented value.
298 </para>
300 <para>
301 The <acronym>TOAST</> code is triggered only
302 when a row value to be stored in a table is wider than
303 <symbol>TOAST_TUPLE_THRESHOLD</> bytes (normally 2 kB).
304 The <acronym>TOAST</> code will compress and/or move
305 field values out-of-line until the row value is shorter than
306 <symbol>TOAST_TUPLE_TARGET</> bytes (also normally 2 kB)
307 or no more gains can be had. During an UPDATE
308 operation, values of unchanged fields are normally preserved as-is; so an
309 UPDATE of a row with out-of-line values incurs no <acronym>TOAST</> costs if
310 none of the out-of-line values change.
311 </para>
313 <para>
314 The <acronym>TOAST</> code recognizes four different strategies for storing
315 <acronym>TOAST</>-able columns:
317 <itemizedlist>
318 <listitem>
319 <para>
320 <literal>PLAIN</literal> prevents either compression or
321 out-of-line storage; furthermore it disables use of single-byte headers
322 for varlena types.
323 This is the only possible strategy for
324 columns of non-<acronym>TOAST</>-able data types.
325 </para>
326 </listitem>
327 <listitem>
328 <para>
329 <literal>EXTENDED</literal> allows both compression and out-of-line
330 storage. This is the default for most <acronym>TOAST</>-able data types.
331 Compression will be attempted first, then out-of-line storage if
332 the row is still too big.
333 </para>
334 </listitem>
335 <listitem>
336 <para>
337 <literal>EXTERNAL</literal> allows out-of-line storage but not
338 compression. Use of <literal>EXTERNAL</literal> will
339 make substring operations on wide <type>text</type> and
340 <type>bytea</type> columns faster (at the penalty of increased storage
341 space) because these operations are optimized to fetch only the
342 required parts of the out-of-line value when it is not compressed.
343 </para>
344 </listitem>
345 <listitem>
346 <para>
347 <literal>MAIN</literal> allows compression but not out-of-line
348 storage. (Actually, out-of-line storage will still be performed
349 for such columns, but only as a last resort when there is no other
350 way to make the row small enough.)
351 </para>
352 </listitem>
353 </itemizedlist>
355 Each <acronym>TOAST</>-able data type specifies a default strategy for columns
356 of that data type, but the strategy for a given table column can be altered
357 with <command>ALTER TABLE SET STORAGE</>.
358 </para>
360 <para>
361 This scheme has a number of advantages compared to a more straightforward
362 approach such as allowing row values to span pages. Assuming that queries are
363 usually qualified by comparisons against relatively small key values, most of
364 the work of the executor will be done using the main row entry. The big values
365 of <acronym>TOAST</>ed attributes will only be pulled out (if selected at all)
366 at the time the result set is sent to the client. Thus, the main table is much
367 smaller and more of its rows fit in the shared buffer cache than would be the
368 case without any out-of-line storage. Sort sets shrink also, and sorts will
369 more often be done entirely in memory. A little test showed that a table
370 containing typical HTML pages and their URLs was stored in about half of the
371 raw data size including the <acronym>TOAST</> table, and that the main table
372 contained only about 10% of the entire data (the URLs and some small HTML
373 pages). There was no run time difference compared to an un-<acronym>TOAST</>ed
374 comparison table, in which all the HTML pages were cut down to 7 kB to fit.
375 </para>
377 </sect1>
379 <sect1 id="storage-fsm">
381 <title>Free Space Map</title>
383 <indexterm>
384 <primary>Free Space Map</primary>
385 </indexterm>
386 <indexterm><primary>FSM</><see>Free Space Map</></indexterm>
388 <para>
389 Each heap and index relation, except for hash indexes, has a Free Space Map
390 (FSM) to keep track of available space in the relation. It's stored
391 alongside the main relation data in a separate relation fork, named after the
392 filenode number of the relation, plus a <literal>_fsm</> suffix. For example,
393 if the filenode of a relation is 12345, the FSM is stored in a file called
394 <filename>12345_fsm</>, in the same directory as the main relation file.
395 </para>
397 <para>
398 The Free Space Map is organized as a tree of <acronym>FSM</> pages. The
399 bottom level <acronym>FSM</> pages store the free space available on each
400 heap (or index) page, using one byte to represent each such page. The upper
401 levels aggregate information from the lower levels.
402 </para>
404 <para>
405 Within each <acronym>FSM</> page is a binary tree, stored in an array with
406 one byte per node. Each leaf node represents a heap page, or a lower level
407 <acronym>FSM</> page. In each non-leaf node, the higher of its children's
408 values is stored. The maximum value in the leaf nodes is therefore stored
409 at the root.
410 </para>
412 <para>
413 See <filename>src/backend/storage/freespace/README</> for more details on
414 how the <acronym>FSM</> is structured, and how it's updated and searched.
415 The <filename>contrib/pg_freespacemap</> module can be used to examine the
416 information stored in free space maps (see <xref linkend="pgfreespacemap">).
417 </para>
419 </sect1>
421 <sect1 id="storage-vm">
423 <title>Visibility Map</title>
425 <indexterm>
426 <primary>Visibility Map</primary>
427 </indexterm>
428 <indexterm><primary>VM</><see>Visibility Map</></indexterm>
430 <para>
431 Each heap relation has a Visibility Map
432 (VM) to keep track of which pages contain only tuples that are known to be
433 visible to all active transactions. It's stored
434 alongside the main relation data in a separate relation fork, named after the
435 filenode number of the relation, plus a <literal>_vm</> suffix. For example,
436 if the filenode of a relation is 12345, the VM is stored in a file called
437 <filename>12345_vm</>, in the same directory as the main relation file.
438 Note that indexes do not have VMs.
439 </para>
441 <para>
442 The visibility map simply stores one bit per heap page. A set bit means
443 that all tuples on the page are known to be visible to all transactions.
444 This means that the page does not contain any tuples that need to be vacuumed;
445 in future it might also be used to avoid visiting the page for visibility
446 checks. The map is conservative in the sense that we
447 make sure that whenever a bit is set, we know the condition is true, but if
448 a bit is not set, it might or might not be true.
449 </para>
451 </sect1>
453 <sect1 id="storage-page-layout">
455 <title>Database Page Layout</title>
457 <para>
458 This section provides an overview of the page format used within
459 <productname>PostgreSQL</productname> tables and indexes.<footnote>
460 <para>
461 Actually, index access methods need not use this page format.
462 All the existing index methods do use this basic format,
463 but the data kept on index metapages usually doesn't follow
464 the item layout rules.
465 </para>
466 </footnote>
467 Sequences and <acronym>TOAST</> tables are formatted just like a regular table.
468 </para>
470 <para>
471 In the following explanation, a
472 <firstterm>byte</firstterm>
473 is assumed to contain 8 bits. In addition, the term
474 <firstterm>item</firstterm>
475 refers to an individual data value that is stored on a page. In a table,
476 an item is a row; in an index, an item is an index entry.
477 </para>
479 <para>
480 Every table and index is stored as an array of <firstterm>pages</> of a
481 fixed size (usually 8 kB, although a different page size can be selected
482 when compiling the server). In a table, all the pages are logically
483 equivalent, so a particular item (row) can be stored in any page. In
484 indexes, the first page is generally reserved as a <firstterm>metapage</>
485 holding control information, and there can be different types of pages
486 within the index, depending on the index access method.
487 </para>
489 <para>
490 <xref linkend="page-table"> shows the overall layout of a page.
491 There are five parts to each page.
492 </para>
494 <table tocentry="1" id="page-table">
495 <title>Overall Page Layout</title>
496 <titleabbrev>Page Layout</titleabbrev>
497 <tgroup cols="2">
498 <thead>
499 <row>
500 <entry>
501 Item
502 </entry>
503 <entry>Description</entry>
504 </row>
505 </thead>
507 <tbody>
509 <row>
510 <entry>PageHeaderData</entry>
511 <entry>24 bytes long. Contains general information about the page, including
512 free space pointers.</entry>
513 </row>
515 <row>
516 <entry>ItemIdData</entry>
517 <entry>Array of (offset,length) pairs pointing to the actual items.
518 4 bytes per item.</entry>
519 </row>
521 <row>
522 <entry>Free space</entry>
523 <entry>The unallocated space. New item pointers are allocated from the start
524 of this area, new items from the end.</entry>
525 </row>
527 <row>
528 <entry>Items</entry>
529 <entry>The actual items themselves.</entry>
530 </row>
532 <row>
533 <entry>Special space</entry>
534 <entry>Index access method specific data. Different methods store different
535 data. Empty in ordinary tables.</entry>
536 </row>
538 </tbody>
539 </tgroup>
540 </table>
542 <para>
544 The first 24 bytes of each page consists of a page header
545 (PageHeaderData). Its format is detailed in <xref
546 linkend="pageheaderdata-table">. The first two fields track the most
547 recent WAL entry related to this page. Next is a 2-byte field
548 containing flag bits. This is followed by three 2-byte integer fields
549 (<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
550 and <structfield>pd_special</structfield>). These contain byte offsets
551 from the page start to the start
552 of unallocated space, to the end of unallocated space, and to the start of
553 the special space.
554 The next 2 bytes of the page header,
555 <structfield>pd_pagesize_version</structfield>, store both the page size
556 and a version indicator. Beginning with
557 <productname>PostgreSQL</productname> 8.3 the version number is 4;
558 <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
559 <productname>PostgreSQL</productname> 8.0 used version number 2;
560 <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
561 prior releases used version number 0.
562 (The basic page layout and header format has not changed in most of these
563 versions, but the layout of heap row headers has.) The page size
564 is basically only present as a cross-check; there is no support for having
565 more than one page size in an installation.
566 The last field is a hint that shows whether pruning the page is likely
567 to be profitable: it tracks the oldest un-pruned XMAX on the page.
569 </para>
571 <table tocentry="1" id="pageheaderdata-table">
572 <title>PageHeaderData Layout</title>
573 <titleabbrev>PageHeaderData Layout</titleabbrev>
574 <tgroup cols="4">
575 <thead>
576 <row>
577 <entry>Field</entry>
578 <entry>Type</entry>
579 <entry>Length</entry>
580 <entry>Description</entry>
581 </row>
582 </thead>
583 <tbody>
584 <row>
585 <entry>pd_lsn</entry>
586 <entry>XLogRecPtr</entry>
587 <entry>8 bytes</entry>
588 <entry>LSN: next byte after last byte of xlog record for last change
589 to this page</entry>
590 </row>
591 <row>
592 <entry>pd_tli</entry>
593 <entry>uint16</entry>
594 <entry>2 bytes</entry>
595 <entry>TimeLineID of last change (only its lowest 16 bits)</entry>
596 </row>
597 <row>
598 <entry>pd_flags</entry>
599 <entry>uint16</entry>
600 <entry>2 bytes</entry>
601 <entry>Flag bits</entry>
602 </row>
603 <row>
604 <entry>pd_lower</entry>
605 <entry>LocationIndex</entry>
606 <entry>2 bytes</entry>
607 <entry>Offset to start of free space</entry>
608 </row>
609 <row>
610 <entry>pd_upper</entry>
611 <entry>LocationIndex</entry>
612 <entry>2 bytes</entry>
613 <entry>Offset to end of free space</entry>
614 </row>
615 <row>
616 <entry>pd_special</entry>
617 <entry>LocationIndex</entry>
618 <entry>2 bytes</entry>
619 <entry>Offset to start of special space</entry>
620 </row>
621 <row>
622 <entry>pd_pagesize_version</entry>
623 <entry>uint16</entry>
624 <entry>2 bytes</entry>
625 <entry>Page size and layout version number information</entry>
626 </row>
627 <row>
628 <entry>pd_prune_xid</entry>
629 <entry>TransactionId</entry>
630 <entry>4 bytes</entry>
631 <entry>Oldest unpruned XMAX on page, or zero if none</entry>
632 </row>
633 </tbody>
634 </tgroup>
635 </table>
637 <para>
638 All the details can be found in
639 <filename>src/include/storage/bufpage.h</filename>.
640 </para>
642 <para>
644 Following the page header are item identifiers
645 (<type>ItemIdData</type>), each requiring four bytes.
646 An item identifier contains a byte-offset to
647 the start of an item, its length in bytes, and a few attribute bits
648 which affect its interpretation.
649 New item identifiers are allocated
650 as needed from the beginning of the unallocated space.
651 The number of item identifiers present can be determined by looking at
652 <structfield>pd_lower</>, which is increased to allocate a new identifier.
653 Because an item
654 identifier is never moved until it is freed, its index can be used on a
655 long-term basis to reference an item, even when the item itself is moved
656 around on the page to compact free space. In fact, every pointer to an
657 item (<type>ItemPointer</type>, also known as
658 <type>CTID</type>) created by
659 <productname>PostgreSQL</productname> consists of a page number and the
660 index of an item identifier.
662 </para>
664 <para>
666 The items themselves are stored in space allocated backwards from the end
667 of unallocated space. The exact structure varies depending on what the
668 table is to contain. Tables and sequences both use a structure named
669 <type>HeapTupleHeaderData</type>, described below.
671 </para>
673 <para>
675 The final section is the <quote>special section</quote> which can
676 contain anything the access method wishes to store. For example,
677 b-tree indexes store links to the page's left and right siblings,
678 as well as some other data relevant to the index structure.
679 Ordinary tables do not use a special section at all (indicated by setting
680 <structfield>pd_special</> to equal the page size).
682 </para>
684 <para>
686 All table rows are structured in the same way. There is a fixed-size
687 header (occupying 23 bytes on most machines), followed by an optional null
688 bitmap, an optional object ID field, and the user data. The header is
689 detailed
690 in <xref linkend="heaptupleheaderdata-table">. The actual user data
691 (columns of the row) begins at the offset indicated by
692 <structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
693 distance for the platform.
694 The null bitmap is
695 only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
696 <structfield>t_infomask</structfield>. If it is present it begins just after
697 the fixed header and occupies enough bytes to have one bit per data column
698 (that is, <structfield>t_natts</> bits altogether). In this list of bits, a
699 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
700 present, all columns are assumed not-null.
701 The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
702 is set in <structfield>t_infomask</structfield>. If present, it appears just
703 before the <structfield>t_hoff</> boundary. Any padding needed to make
704 <structfield>t_hoff</> a MAXALIGN multiple will appear between the null
705 bitmap and the object ID. (This in turn ensures that the object ID is
706 suitably aligned.)
708 </para>
710 <table tocentry="1" id="heaptupleheaderdata-table">
711 <title>HeapTupleHeaderData Layout</title>
712 <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
713 <tgroup cols="4">
714 <thead>
715 <row>
716 <entry>Field</entry>
717 <entry>Type</entry>
718 <entry>Length</entry>
719 <entry>Description</entry>
720 </row>
721 </thead>
722 <tbody>
723 <row>
724 <entry>t_xmin</entry>
725 <entry>TransactionId</entry>
726 <entry>4 bytes</entry>
727 <entry>insert XID stamp</entry>
728 </row>
729 <row>
730 <entry>t_xmax</entry>
731 <entry>TransactionId</entry>
732 <entry>4 bytes</entry>
733 <entry>delete XID stamp</entry>
734 </row>
735 <row>
736 <entry>t_cid</entry>
737 <entry>CommandId</entry>
738 <entry>4 bytes</entry>
739 <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
740 </row>
741 <row>
742 <entry>t_xvac</entry>
743 <entry>TransactionId</entry>
744 <entry>4 bytes</entry>
745 <entry>XID for VACUUM operation moving a row version</entry>
746 </row>
747 <row>
748 <entry>t_ctid</entry>
749 <entry>ItemPointerData</entry>
750 <entry>6 bytes</entry>
751 <entry>current TID of this or newer row version</entry>
752 </row>
753 <row>
754 <entry>t_infomask2</entry>
755 <entry>int16</entry>
756 <entry>2 bytes</entry>
757 <entry>number of attributes, plus various flag bits</entry>
758 </row>
759 <row>
760 <entry>t_infomask</entry>
761 <entry>uint16</entry>
762 <entry>2 bytes</entry>
763 <entry>various flag bits</entry>
764 </row>
765 <row>
766 <entry>t_hoff</entry>
767 <entry>uint8</entry>
768 <entry>1 byte</entry>
769 <entry>offset to user data</entry>
770 </row>
771 </tbody>
772 </tgroup>
773 </table>
775 <para>
776 All the details can be found in
777 <filename>src/include/access/htup.h</filename>.
778 </para>
780 <para>
782 Interpreting the actual data can only be done with information obtained
783 from other tables, mostly <structname>pg_attribute</structname>. The
784 key values needed to identify field locations are
785 <structfield>attlen</structfield> and <structfield>attalign</structfield>.
786 There is no way to directly get a
787 particular attribute, except when there are only fixed width fields and no
788 null values. All this trickery is wrapped up in the functions
789 <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
790 and <firstterm>heap_getsysattr</firstterm>.
792 </para>
793 <para>
795 To read the data you need to examine each attribute in turn. First check
796 whether the field is NULL according to the null bitmap. If it is, go to
797 the next. Then make sure you have the right alignment. If the field is a
798 fixed width field, then all the bytes are simply placed. If it's a
799 variable length field (attlen = -1) then it's a bit more complicated.
800 All variable-length datatypes share the common header structure
801 <type>struct varlena</type>, which includes the total length of the stored
802 value and some flag bits. Depending on the flags, the data can be either
803 inline or in a <acronym>TOAST</> table;
804 it might be compressed, too (see <xref linkend="storage-toast">).
806 </para>
807 </sect1>
809 </chapter>