<!-- doc/src/sgml/wal.sgml -->

<chapter id="wal">
 <title>Reliability and the Write-Ahead Log</title>

 <para>
  This chapter explains how to control the reliability of
  <productname>PostgreSQL</productname>, including details about the
  Write-Ahead Log.
 </para>

 <sect1 id="wal-reliability">
  <title>Reliability</title>

  <para>
   Reliability is an important property of any serious database
   system, and <productname>PostgreSQL</productname> does everything possible to
   guarantee reliable operation. One aspect of reliable operation is
   that all data recorded by a committed transaction should be stored
   in a nonvolatile area that is safe from power loss, operating
   system failure, and hardware failure (except failure of the
   nonvolatile area itself, of course). Successfully writing the data
   to the computer's permanent storage (disk drive or equivalent)
   ordinarily meets this requirement. In fact, even if a computer is
   fatally damaged, if the disk drives survive they can be moved to
   another computer with similar hardware and all committed
   transactions will remain intact.
  </para>

  <para>
   While forcing data to the disk platters periodically might seem like
   a simple operation, it is not. Because disk drives are dramatically
   slower than main memory and CPUs, several layers of caching exist
   between the computer's main memory and the disk platters.
   First, there is the operating system's buffer cache, which caches
   frequently requested disk blocks and combines disk writes. Fortunately,
   all operating systems give applications a way to force writes from
   the buffer cache to disk, and <productname>PostgreSQL</productname> uses those
   features. (See the <xref linkend="guc-wal-sync-method"/> parameter
   to adjust how this is done.)
  </para>

  <para>
   Next, there might be a cache in the disk drive controller; this is
   particularly common on <acronym>RAID</acronym> controller cards. Some of
   these caches are <firstterm>write-through</firstterm>, meaning writes are sent
   to the drive as soon as they arrive. Others are
   <firstterm>write-back</firstterm>, meaning data is sent to the drive at
   some later time. Such caches can be a reliability hazard because the
   memory in the disk controller cache is volatile, and will lose its
   contents in a power failure. Better controller cards have
   <firstterm>battery-backup units</firstterm> (<acronym>BBU</acronym>s), meaning
   the card has a battery that
   maintains power to the cache in case of system power loss. After power
   is restored the data will be written to the disk drives.
  </para>

  <para>
   And finally, most disk drives have caches. Some are write-through
   while some are write-back, and the same concerns about data loss
   exist for write-back drive caches as for disk controller
   caches. Consumer-grade IDE and SATA drives are particularly likely
   to have write-back caches that will not survive a power failure. Many
   solid-state drives (SSD) also have volatile write-back caches.
  </para>

  <para>
   These caches can typically be disabled; however, the method for doing
   this varies by operating system and drive type:
  </para>

  <itemizedlist>
   <listitem>
    <para>
     On <productname>Linux</productname>, IDE and SATA drives can be queried using
     <command>hdparm -I</command>; write caching is enabled if there is
     a <literal>*</literal> next to <literal>Write cache</literal>. <command>hdparm -W 0</command>
     can be used to turn off write caching. SCSI drives can be queried
     using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</application></ulink>.
     Use <command>sdparm --get=WCE</command> to check
     whether the write cache is enabled and <command>sdparm --clear=WCE</command>
     to disable it.
    </para>
   </listitem>

   <listitem>
    <para>
     On <productname>FreeBSD</productname>, IDE drives can be queried using
     <command>camcontrol identify</command> and write caching turned off using
     <literal>hw.ata.wc=0</literal> in <filename>/boot/loader.conf</filename>;
     SCSI drives can be queried using <command>camcontrol identify</command>,
     and the write cache both queried and changed using
     <command>sdparm</command> when available.
    </para>
   </listitem>

   <listitem>
    <para>
     On <productname>Solaris</productname>, the disk write cache is controlled by
     <command>format -e</command>.
     (The Solaris <acronym>ZFS</acronym> file system is safe with disk write-cache
     enabled because it issues its own disk cache flush commands.)
    </para>
   </listitem>

   <listitem>
    <para>
     On <productname>Windows</productname>, if <varname>wal_sync_method</varname> is
     <literal>open_datasync</literal> (the default), write caching can be disabled
     by unchecking <literal>My Computer\Open\<replaceable>disk drive</replaceable>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</literal>.
     Alternatively, set <varname>wal_sync_method</varname> to
     <literal>fdatasync</literal> (NTFS only) or <literal>fsync</literal>,
     which prevent write caching.
    </para>
   </listitem>

   <listitem>
    <para>
     On <productname>macOS</productname>, write caching can be prevented by
     setting <varname>wal_sync_method</varname> to <literal>fsync_writethrough</literal>.
    </para>
   </listitem>
  </itemizedlist>

  <para>
   Recent SATA drives (those following <acronym>ATAPI-6</acronym> or later)
   offer a drive cache flush command (<command>FLUSH CACHE EXT</command>),
   while SCSI drives have long supported a similar command
   <command>SYNCHRONIZE CACHE</command>. These commands are not directly
   accessible to <productname>PostgreSQL</productname>, but some file systems
   (e.g., <acronym>ZFS</acronym>, <acronym>ext4</acronym>) can use them to flush
   data to the platters on write-back-enabled drives. Unfortunately, such
   file systems behave suboptimally when combined with battery-backup unit
   (<acronym>BBU</acronym>) disk controllers. In such setups, the synchronize
   command forces all data from the controller cache to the disks,
   eliminating much of the benefit of the BBU. You can run the
   <xref linkend="pgtestfsync"/> program to see
   if you are affected. If you are affected, the performance benefits
   of the BBU can be regained by turning off write barriers in
   the file system or reconfiguring the disk controller, if that is
   an option. If write barriers are turned off, make sure the battery
   remains functional; a faulty battery can potentially lead to data loss.
   Hopefully file system and disk controller designers will eventually
   address this suboptimal behavior.
  </para>

  <para>
   When the operating system sends a write request to the storage hardware,
   there is little it can do to make sure the data has arrived at a truly
   non-volatile storage area. Rather, it is the
   administrator's responsibility to make certain that all storage components
   ensure integrity for both data and file-system metadata.
   Avoid disk controllers that have non-battery-backed write caches.
   At the drive level, disable write-back caching if the
   drive cannot guarantee the data will be written before shutdown.
   If you use SSDs, be aware that many of these do not honor cache flush
   commands by default.
   You can test for reliable I/O subsystem behavior using <ulink
   url="https://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
  </para>

  <para>
   Another risk of data loss is posed by the disk platter write
   operations themselves. Disk platters are divided into sectors,
   commonly 512 bytes each. Every physical read or write operation
   processes a whole sector.
   When a write request arrives at the drive, it might be for some multiple
   of 512 bytes (<productname>PostgreSQL</productname> typically writes 8192 bytes, or
   16 sectors, at a time), and the process of writing could fail due
   to power loss at any time, meaning some of the 512-byte sectors were
   written while others were not. To guard against such failures,
   <productname>PostgreSQL</productname> periodically writes full page images to
   permanent WAL storage <emphasis>before</emphasis> modifying the actual page on
   disk. By doing this, during crash recovery <productname>PostgreSQL</productname> can
   restore partially-written pages from WAL. If you have file-system software
   that prevents partial page writes (e.g., ZFS), you can turn off
   this page imaging by turning off the <xref
   linkend="guc-full-page-writes"/> parameter. Battery-Backed Unit
   (BBU) disk controllers do not prevent partial page writes unless
   they guarantee that data is written to the BBU as full (8kB) pages.
  </para>
  <para>
   <productname>PostgreSQL</productname> also protects against some kinds of data corruption
   on storage devices that may occur because of hardware errors or media failure over time,
   such as reading/writing garbage data.
   <itemizedlist>
    <listitem>
     <para>
      Each individual record in a WAL file is protected by a CRC-32C (32-bit) check
      that allows us to tell if record contents are correct. The CRC value
      is set when we write each WAL record and checked during crash recovery,
      archive recovery and replication.
     </para>
    </listitem>
    <listitem>
     <para>
      Data pages are not currently checksummed by default, though full page images
      recorded in WAL records will be protected; see <link
      linkend="app-initdb-data-checksums"><application>initdb</application></link>
      for details about enabling data checksums.
     </para>
    </listitem>
    <listitem>
     <para>
      Internal data structures such as <filename>pg_xact</filename>, <filename>pg_subtrans</filename>, <filename>pg_multixact</filename>,
      <filename>pg_serial</filename>, <filename>pg_notify</filename>, <filename>pg_stat</filename>, <filename>pg_snapshots</filename> are not directly
      checksummed, nor are pages protected by full page writes. However, where
      such data structures are persistent, WAL records are written that allow
      recent changes to be accurately rebuilt at crash recovery and those
      WAL records are protected as discussed above.
     </para>
    </listitem>
    <listitem>
     <para>
      Individual state files in <filename>pg_twophase</filename> are protected by CRC-32C.
     </para>
    </listitem>
    <listitem>
     <para>
      Temporary data files used in larger SQL queries for sorts,
      materializations and intermediate results are not currently checksummed,
      nor will WAL records be written for changes to those files.
     </para>
    </listitem>
   </itemizedlist>
  </para>

  <para>
   <productname>PostgreSQL</productname> does not protect against correctable memory errors
   and it is assumed you will operate using RAM that uses industry standard
   Error Correcting Codes (ECC) or better protection.
  </para>
 </sect1>

 <sect1 id="checksums">
  <title>Data Checksums</title>
  <indexterm>
   <primary>checksums</primary>
  </indexterm>

  <para>
   By default, data pages are not protected by checksums, but this can
   optionally be enabled for a cluster. When enabled, each data page includes
   a checksum that is updated when the page is written and verified each time
   the page is read. Only data pages are protected by checksums; internal data
   structures and temporary files are not.
  </para>

  <para>
   Checksums are normally enabled when the cluster is initialized using <link
   linkend="app-initdb-data-checksums"><application>initdb</application></link>.
   They can also be enabled or disabled at a later time as an offline
   operation. Data checksums are enabled or disabled at the full cluster
   level, and cannot be specified individually for databases or tables.
  </para>

  <para>
   The current state of checksums in the cluster can be verified by viewing the
   value of the read-only configuration variable <xref
   linkend="guc-data-checksums" /> by issuing the command <command>SHOW
   data_checksums</command>.
  </para>
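  <para>
   For example (the reported value is <literal>on</literal> only in clusters
   that were initialized, or later converted, with checksums enabled):
<programlisting>
-- report whether data page checksums are enabled for this cluster
SHOW data_checksums;
</programlisting>
  </para>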
  <para>
   When attempting to recover from page corruptions, it may be necessary to
   bypass the checksum protection. To do this, temporarily set the
   configuration parameter <xref linkend="guc-ignore-checksum-failure" />.
  </para>
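  <para>
   A minimal sketch of such a salvage session (this is a superuser-only
   setting; use it with care and only for the duration of the recovery work):
<programlisting>
-- allow reads of pages whose checksum does not verify, in this session only
SET ignore_checksum_failure = on;

-- ... copy out whatever data can still be read ...

SET ignore_checksum_failure = off;
</programlisting>
  </para>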
  <sect2 id="checksums-offline-enable-disable">
   <title>Off-line Enabling of Checksums</title>

   <para>
    The <link linkend="app-pgchecksums"><application>pg_checksums</application></link>
    application can be used to enable or disable data checksums, as well as
    verify checksums, on an offline cluster.
   </para>
  </sect2>
 </sect1>

 <sect1 id="wal-intro">
  <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>

  <indexterm zone="wal">
   <primary>WAL</primary>
  </indexterm>

  <indexterm>
   <primary>transaction log</primary>
   <see>WAL</see>
  </indexterm>

  <para>
   <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
   is a standard method for ensuring data integrity. A detailed
   description can be found in most (if not all) books about
   transaction processing. Briefly, <acronym>WAL</acronym>'s central
   concept is that changes to data files (where tables and indexes
   reside) must be written only after those changes have been logged,
   that is, after WAL records describing the changes have been flushed
   to permanent storage. If we follow this procedure, we do not need
   to flush data pages to disk on every transaction commit, because we
   know that in the event of a crash we will be able to recover the
   database using the log: any changes that have not been applied to
   the data pages can be redone from the WAL records. (This is
   roll-forward recovery, also known as REDO.)
  </para>

  <tip>
   <para>
    Because <acronym>WAL</acronym> restores database file
    contents after a crash, journaled file systems are not necessary for
    reliable storage of the data files or WAL files. In fact, journaling
    overhead can reduce performance, especially if journaling
    causes file system <emphasis>data</emphasis> to be flushed
    to disk. Fortunately, data flushing during journaling can
    often be disabled with a file system mount option, e.g.,
    <literal>data=writeback</literal> on a Linux ext3 file system.
    Journaled file systems do improve boot speed after a crash.
   </para>
  </tip>

  <para>
   Using <acronym>WAL</acronym> results in a
   significantly reduced number of disk writes, because only the WAL
   file needs to be flushed to disk to guarantee that a transaction is
   committed, rather than every data file changed by the transaction.
   The WAL file is written sequentially,
   and so the cost of syncing the WAL is much less than the cost of
   flushing the data pages. This is especially true for servers
   handling many small transactions touching different parts of the data
   store. Furthermore, when the server is processing many small concurrent
   transactions, one <function>fsync</function> of the WAL file may
   suffice to commit many transactions.
  </para>

  <para>
   <acronym>WAL</acronym> also makes it possible to support on-line
   backup and point-in-time recovery, as described in <xref
   linkend="continuous-archiving"/>. By archiving the WAL data we can support
   reverting to any time instant covered by the available WAL data:
   we simply install a prior physical backup of the database, and
   replay the WAL just as far as the desired time. What's more,
   the physical backup doesn't have to be an instantaneous snapshot
   of the database state &mdash; if it is made over some period of time,
   then replaying the WAL for that period will fix any internal
   inconsistencies.
  </para>
 </sect1>

 <sect1 id="wal-async-commit">
  <title>Asynchronous Commit</title>

  <indexterm>
   <primary>synchronous commit</primary>
  </indexterm>

  <indexterm>
   <primary>asynchronous commit</primary>
  </indexterm>

  <para>
   <firstterm>Asynchronous commit</firstterm> is an option that allows transactions
   to complete more quickly, at the cost that the most recent transactions may
   be lost if the database should crash. In many applications this is an
   acceptable trade-off.
  </para>

  <para>
   As described in the previous section, transaction commit is normally
   <firstterm>synchronous</firstterm>: the server waits for the transaction's
   <acronym>WAL</acronym> records to be flushed to permanent storage
   before returning a success indication to the client. The client is
   therefore guaranteed that a transaction reported to be committed will
   be preserved, even in the event of a server crash immediately after.
   However, for short transactions this delay is a major component of the
   total transaction time. Selecting asynchronous commit mode means that
   the server returns success as soon as the transaction is logically
   completed, before the <acronym>WAL</acronym> records it generated have
   actually made their way to disk. This can provide a significant boost
   in throughput for small transactions.
  </para>

  <para>
   Asynchronous commit introduces the risk of data loss. There is a short
   time window between the report of transaction completion to the client
   and the time that the transaction is truly committed (that is, it is
   guaranteed not to be lost if the server crashes). Thus asynchronous
   commit should not be used if the client will take external actions
   relying on the assumption that the transaction will be remembered.
   As an example, a bank would certainly not use asynchronous commit for
   a transaction recording an ATM's dispensing of cash. But in many
   scenarios, such as event logging, there is no need for a strong
   guarantee of this kind.
  </para>

  <para>
   The risk that is taken by using asynchronous commit is of data loss,
   not data corruption. If the database should crash, it will recover
   by replaying <acronym>WAL</acronym> up to the last record that was
   flushed. The database will therefore be restored to a self-consistent
   state, but any transactions that were not yet flushed to disk will
   not be reflected in that state. The net effect is therefore loss of
   the last few transactions. Because the transactions are replayed in
   commit order, no inconsistency can be introduced &mdash; for example,
   if transaction B made changes relying on the effects of a previous
   transaction A, it is not possible for A's effects to be lost while B's
   effects are preserved.
  </para>

  <para>
   The user can select the commit mode of each transaction, so that
   it is possible to have both synchronous and asynchronous commit
   transactions running concurrently. This allows flexible trade-offs
   between performance and certainty of transaction durability.
   The commit mode is controlled by the user-settable parameter
   <xref linkend="guc-synchronous-commit"/>, which can be changed in any of
   the ways that a configuration parameter can be set. The mode used for
   any one transaction depends on the value of
   <varname>synchronous_commit</varname> when transaction commit begins.
  </para>
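  <para>
   For example, a session that writes low-value event-log rows can opt out of
   the commit wait for a single transaction while everything else on the
   server keeps the default (the table name here is only illustrative):
<programlisting>
BEGIN;
SET LOCAL synchronous_commit TO off;   -- applies to this transaction only
INSERT INTO event_log (occurred_at, message)
     VALUES (now(), 'cache warmed');
COMMIT;                                -- returns without waiting for the WAL flush
</programlisting>
  </para>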
  <para>
   Certain utility commands, for instance <command>DROP TABLE</command>, are
   forced to commit synchronously regardless of the setting of
   <varname>synchronous_commit</varname>. This is to ensure consistency
   between the server's file system and the logical state of the database.
   The commands supporting two-phase commit, such as <command>PREPARE
   TRANSACTION</command>, are also always synchronous.
  </para>

  <para>
   If the database crashes during the risk window between an
   asynchronous commit and the writing of the transaction's
   <acronym>WAL</acronym> records,
   then changes made during that transaction <emphasis>will</emphasis> be lost.
   The duration of the
   risk window is limited because a background process (the <quote>WAL
   writer</quote>) flushes unwritten <acronym>WAL</acronym> records to disk
   every <xref linkend="guc-wal-writer-delay"/> milliseconds.
   The actual maximum duration of the risk window is three times
   <varname>wal_writer_delay</varname> because the WAL writer is
   designed to favor writing whole pages at a time during busy periods.
  </para>
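  <para>
   With the default <varname>wal_writer_delay</varname> of 200 milliseconds,
   the worst case is therefore three times 200 ms, i.e., roughly 600 ms of
   recently committed work. The current value can be checked with:
<programlisting>
SHOW wal_writer_delay;   -- default is 200ms
</programlisting>
  </para>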
  <caution>
   <para>
    An immediate-mode shutdown is equivalent to a server crash, and will
    therefore cause loss of any unflushed asynchronous commits.
   </para>
  </caution>

  <para>
   Asynchronous commit provides behavior different from setting
   <xref linkend="guc-fsync"/> = off.
   <varname>fsync</varname> is a server-wide
   setting that will alter the behavior of all transactions. It disables
   all logic within <productname>PostgreSQL</productname> that attempts to synchronize
   writes to different portions of the database, and therefore a system
   crash (that is, a hardware or operating system crash, not a failure of
   <productname>PostgreSQL</productname> itself) could result in arbitrarily bad
   corruption of the database state. In many scenarios, asynchronous
   commit provides most of the performance improvement that could be
   obtained by turning off <varname>fsync</varname>, but without the risk
   of data corruption.
  </para>

  <para>
   <xref linkend="guc-commit-delay"/> also sounds very similar to
   asynchronous commit, but it is actually a synchronous commit method
   (in fact, <varname>commit_delay</varname> is ignored during an
   asynchronous commit). <varname>commit_delay</varname> causes a delay
   just before a transaction flushes <acronym>WAL</acronym> to disk, in
   the hope that a single flush executed by one such transaction can also
   serve other transactions committing at about the same time. The
   setting can be thought of as a way of increasing the time window in
   which transactions can join a group about to participate in a single
   flush, to amortize the cost of the flush among multiple transactions.
  </para>
 </sect1>

 <sect1 id="wal-configuration">
  <title><acronym>WAL</acronym> Configuration</title>

  <para>
   There are several <acronym>WAL</acronym>-related configuration parameters that
   affect database performance. This section explains their use.
   Consult <xref linkend="runtime-config"/> for general information about
   setting server configuration parameters.
  </para>

  <para>
   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</primary></indexterm>
   are points in the sequence of transactions at which it is guaranteed
   that the heap and index data files have been updated with all
   information written before that checkpoint. At checkpoint time, all
   dirty data pages are flushed to disk and a special checkpoint record is
   written to the WAL file. (The change records were previously flushed
   to the <acronym>WAL</acronym> files.)
   In the event of a crash, the crash recovery procedure looks at the latest
   checkpoint record to determine the point in the WAL (known as the redo
   record) from which it should start the REDO operation. Any changes made to
   data files before that point are guaranteed to be already on disk.
   Hence, after a checkpoint, WAL segments preceding the one containing
   the redo record are no longer needed and can be recycled or removed. (When
   <acronym>WAL</acronym> archiving is being done, the WAL segments must be
   archived before being recycled or removed.)
  </para>

  <para>
   The checkpoint requirement of flushing all dirty data pages to disk
   can cause a significant I/O load. For this reason, checkpoint
   activity is throttled so that I/O begins at checkpoint start and completes
   before the next checkpoint is due to start; this minimizes performance
   degradation during checkpoints.
  </para>

  <para>
   The server's checkpointer process automatically performs
   a checkpoint every so often. A checkpoint is begun every <xref
   linkend="guc-checkpoint-timeout"/> seconds, or if
   <xref linkend="guc-max-wal-size"/> is about to be exceeded,
   whichever comes first.
   The default settings are 5 minutes and 1 GB, respectively.
   If no WAL has been written since the previous checkpoint, new checkpoints
   will be skipped even if <varname>checkpoint_timeout</varname> has passed.
   (If WAL archiving is being used and you want to put a lower limit on how
   often files are archived in order to bound potential data loss, you should
   adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the
   checkpoint parameters.)
   It is also possible to force a checkpoint by using the SQL
   command <command>CHECKPOINT</command>.
  </para>
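  <para>
   For example, an immediate checkpoint can be forced before taking a
   file-system-level snapshot, and the checkpointer's activity can then be
   inspected through the cumulative statistics system:
<programlisting>
CHECKPOINT;   -- requires superuser privileges or the pg_checkpoint role

-- how many checkpoints were started by timeout versus by request
SELECT num_timed, num_requested
  FROM pg_stat_checkpointer;
</programlisting>
  </para>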
  <para>
   Reducing <varname>checkpoint_timeout</varname> and/or
   <varname>max_wal_size</varname> causes checkpoints to occur
   more often. This allows faster after-crash recovery, since less work
   will need to be redone. However, one must balance this against the
   increased cost of flushing dirty data pages more often. If
   <xref linkend="guc-full-page-writes"/> is set (as is the default), there is
   another factor to consider. To ensure data page consistency,
   the first modification of a data page after each checkpoint results in
   logging the entire page content. In that case,
   a smaller checkpoint interval increases the volume of output to the WAL,
   partially negating the goal of using a smaller interval,
   and in any case causing more disk I/O.
  </para>

  <para>
   Checkpoints are fairly expensive, first because they require writing
   out all currently dirty buffers, and second because they result in
   extra subsequent WAL traffic as discussed above. It is therefore
   wise to set the checkpointing parameters high enough so that checkpoints
   don't happen too often. As a simple sanity check on your checkpointing
   parameters, you can set the <xref linkend="guc-checkpoint-warning"/>
   parameter. If checkpoints happen closer together than
   <varname>checkpoint_warning</varname> seconds,
   a message will be output to the server log recommending increasing
   <varname>max_wal_size</varname>. Occasional appearance of such
   a message is not cause for alarm, but if it appears often then the
   checkpoint control parameters should be increased. Bulk operations such
   as large <command>COPY</command> transfers might cause a number of such warnings
   to appear if you have not set <varname>max_wal_size</varname> high
   enough.
  </para>
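  <para>
   A typical adjustment when such warnings appear frequently is to raise the
   limits and reload the configuration; the values below are only an
   illustration and should be sized to the available disk space:
<programlisting>
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET checkpoint_timeout = '15min';
SELECT pg_reload_conf();   -- both parameters can be changed without a restart
</programlisting>
  </para>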
  <para>
   To avoid flooding the I/O system with a burst of page writes,
   writing dirty buffers during a checkpoint is spread over a period of time.
   That period is controlled by
   <xref linkend="guc-checkpoint-completion-target"/>, which is
   given as a fraction of the checkpoint interval (configured by using
   <varname>checkpoint_timeout</varname>).
   The I/O rate is adjusted so that the checkpoint finishes when the
   given fraction of
   <varname>checkpoint_timeout</varname> seconds have elapsed, or before
   <varname>max_wal_size</varname> is exceeded, whichever is sooner.
   With the default value of 0.9,
   <productname>PostgreSQL</productname> can be expected to complete each checkpoint
   a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's
   duration). This spreads out the I/O as much as possible so that the checkpoint
   I/O load is consistent throughout the checkpoint interval. The disadvantage of
   this is that prolonging checkpoints affects recovery time, because more WAL
   segments will need to be kept around for possible use in recovery. A user
   concerned about the amount of time required to recover might wish to reduce
   <varname>checkpoint_timeout</varname> so that checkpoints occur more frequently
   but still spread the I/O across the checkpoint interval. Alternatively,
   <varname>checkpoint_completion_target</varname> could be reduced, but this would
   result in times of more intense I/O (during the checkpoint) and times of less I/O
   (after the checkpoint completed but before the next scheduled checkpoint) and
   therefore is not recommended.
   Although <varname>checkpoint_completion_target</varname> could be set as high as
   1.0, it is typically recommended to set it to no higher than 0.9 (the default)
   since checkpoints include some other activities besides writing dirty buffers.
   A setting of 1.0 is quite likely to result in checkpoints not being
   completed on time, which would result in performance loss due to
   unexpected variation in the number of WAL segments needed.
  </para>

  <para>
   On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after"/>
   allows you to force OS pages written by the checkpoint to be
   flushed to disk after a configurable number of bytes. Otherwise, these
   pages may be kept in the OS's page cache, inducing a stall when
   <literal>fsync</literal> is issued at the end of a checkpoint. This setting will
   often help to reduce transaction latency, but it also can have an adverse
   effect on performance; particularly for workloads that are bigger than
   <xref linkend="guc-shared-buffers"/>, but smaller than the OS's page cache.
  </para>

  <para>
   The number of WAL segment files in the <filename>pg_wal</filename> directory depends on
   <varname>min_wal_size</varname>, <varname>max_wal_size</varname> and
   the amount of WAL generated in previous checkpoint cycles. When old WAL
   segment files are no longer needed, they are removed or recycled (that is,
   renamed to become future segments in the numbered sequence). If, due to a
   short-term peak of WAL output rate, <varname>max_wal_size</varname> is
   exceeded, the unneeded segment files will be removed until the system
   gets back under this limit. Below that limit, the system recycles enough
   WAL files to cover the estimated need until the next checkpoint, and
   removes the rest. The estimate is based on a moving average of the number
   of WAL files used in previous checkpoint cycles. The moving average
   is increased immediately if the actual usage exceeds the estimate, so it
   accommodates peak usage rather than average usage to some extent.
   <varname>min_wal_size</varname> puts a minimum on the amount of WAL files
   recycled for future usage; that much WAL is always recycled for future use,
   even if the system is idle and the WAL usage estimate suggests that little
   WAL is needed.
  </para>
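  <para>
   The current size of the WAL can also be watched from SQL; for example, the
   following query (which requires superuser privileges or membership in
   <literal>pg_monitor</literal>) reports the number of segment files
   currently in <filename>pg_wal</filename> and their total size:
<programlisting>
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS total_size
  FROM pg_ls_waldir();
</programlisting>
  </para>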
  <para>
   Independently of <varname>max_wal_size</varname>,
   the most recent <xref linkend="guc-wal-keep-size"/> megabytes of
   WAL files plus one additional WAL file are
   kept at all times. Also, if WAL archiving is used, old segments cannot be
   removed or recycled until they are archived. If WAL archiving cannot keep up
   with the pace that WAL is generated, or if <varname>archive_command</varname>
   or <varname>archive_library</varname>
   fails repeatedly, old WAL files will accumulate in <filename>pg_wal</filename>
   until the situation is resolved. A slow or failed standby server that
   uses a replication slot will have the same effect (see
   <xref linkend="streaming-replication-slots"/>).
   Similarly, if <link linkend="runtime-config-wal-summarization">
   WAL summarization</link> is enabled, old segments are kept
   until they are summarized.
  </para>

  <para>
   In archive recovery or standby mode, the server periodically performs
   <firstterm>restartpoints</firstterm>,<indexterm><primary>restartpoint</primary></indexterm>
   which are similar to checkpoints in normal operation: the server forces
   all its state to disk, updates the <filename>pg_control</filename> file to
   indicate that the already-processed WAL data need not be scanned again,
   and then recycles any old WAL segment files in the <filename>pg_wal</filename>
   directory.
   Restartpoints can't be performed more frequently than checkpoints on the
   primary because restartpoints can only be performed at checkpoint records.
   A restartpoint can be triggered by schedule or by an external request.
   The <structfield>restartpoints_timed</structfield> counter in the
   <link linkend="monitoring-pg-stat-checkpointer-view"><structname>pg_stat_checkpointer</structname></link>
   view counts the former, while <structfield>restartpoints_req</structfield>
   counts the latter.
   A restartpoint is triggered by schedule when a checkpoint record is reached
   if at least <xref linkend="guc-checkpoint-timeout"/> seconds have passed since
   the last performed restartpoint, or when the previous attempt to perform
   the restartpoint has failed. In the latter case, the next restartpoint
   will be scheduled in 15 seconds.
   A restartpoint is triggered by request for reasons similar to those for a
   checkpoint, but mostly when the WAL size is about to exceed
   <xref linkend="guc-max-wal-size"/>.
   However, because of limitations on when a restartpoint can be performed,
   <varname>max_wal_size</varname> is often exceeded during recovery,
   by up to one checkpoint cycle's worth of WAL.
   (<varname>max_wal_size</varname> is never a hard limit anyway, so you should
   always leave plenty of headroom to avoid running out of disk space.)
   The <structfield>restartpoints_done</structfield> counter in the
   <link linkend="monitoring-pg-stat-checkpointer-view"><structname>pg_stat_checkpointer</structname></link>
   view counts the restartpoints that have really been performed.
  </para>
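  <para>
   On a standby, the three restartpoint counters can be inspected with an
   ordinary query:
<programlisting>
SELECT restartpoints_timed,
       restartpoints_req,
       restartpoints_done
  FROM pg_stat_checkpointer;
</programlisting>
  </para>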
  <para>
   In some cases, when the WAL size on the primary increases quickly,
   for instance during a massive <command>INSERT</command>,
   the <structfield>restartpoints_req</structfield> counter on the standby
   may show a spike.
   This occurs because requests to create a new restartpoint due to increased
   WAL consumption cannot be honored until the standby has replayed a new safe
   checkpoint record since the last restartpoint.
   This behavior is normal and does not lead to an increase in system resource
   consumption.
   Among the restartpoint-related counters, only
   <structfield>restartpoints_done</structfield> indicates that noticeable
   system resources have been spent.
  </para>

  <para>
   There are two commonly used internal <acronym>WAL</acronym> functions:
   <function>XLogInsertRecord</function> and <function>XLogFlush</function>.
   <function>XLogInsertRecord</function> is used to place a new record into
   the <acronym>WAL</acronym> buffers in shared memory. If there is no
   space for the new record, <function>XLogInsertRecord</function> will have
   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
   buffers. This is undesirable because <function>XLogInsertRecord</function>
   is used on every database low level modification (for example, row
   insertion) at a time when an exclusive lock is held on affected
   data pages, so the operation needs to be as fast as possible. What
   is worse, writing <acronym>WAL</acronym> buffers might also force the
   creation of a new WAL segment, which takes even more
   time. Normally, <acronym>WAL</acronym> buffers should be written
   and flushed by an <function>XLogFlush</function> request, which is
   made, for the most part, at transaction commit time to ensure that
   transaction records are flushed to permanent storage. On systems
   with high WAL output, <function>XLogFlush</function> requests might
   not occur often enough to prevent <function>XLogInsertRecord</function>
   from having to do writes. On such systems
   one should increase the number of <acronym>WAL</acronym> buffers by
   modifying the <xref linkend="guc-wal-buffers"/> parameter. When
   <xref linkend="guc-full-page-writes"/> is set and the system is very busy,
   setting <varname>wal_buffers</varname> higher will help smooth response times
   during the period immediately following each checkpoint.
  </para>
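  <para>
   For example, on a busy system with plenty of shared memory one might raise
   the buffer space from the automatic default (normally 1/32 of
   <varname>shared_buffers</varname>, at most the size of one WAL segment) to
   a fixed value; note that changing <varname>wal_buffers</varname> requires a
   server restart:
<programlisting>
ALTER SYSTEM SET wal_buffers = '64MB';   -- illustrative value; takes effect after restart
</programlisting>
  </para>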
  <para>
   The <xref linkend="guc-commit-delay"/> parameter defines for how many
   microseconds a group commit leader process will sleep after acquiring a
   lock within <function>XLogFlush</function>, while group commit
   followers queue up behind the leader. This delay allows other server
   processes to add their commit records to the WAL buffers so that all of
   them will be flushed by the leader's eventual sync operation. No sleep
   will occur if <xref linkend="guc-fsync"/> is not enabled, or if fewer
   than <xref linkend="guc-commit-siblings"/> other sessions are currently
   in active transactions; this avoids sleeping when it's unlikely that
   any other session will commit soon. Note that on some platforms, the
   resolution of a sleep request is ten milliseconds, so that any nonzero
   <varname>commit_delay</varname> setting between 1 and 10000
   microseconds would have the same effect. Note also that on some
   platforms, sleep operations may take slightly longer than requested by
   the parameter.
  </para>

  <para>
   Since the purpose of <varname>commit_delay</varname> is to allow the
   cost of each flush operation to be amortized across concurrently
   committing transactions (potentially at the expense of transaction
   latency), it is necessary to quantify that cost before the setting can
   be chosen intelligently. The higher that cost is, the more effective
   <varname>commit_delay</varname> is expected to be in increasing
   transaction throughput, up to a point. The <xref
   linkend="pgtestfsync"/> program can be used to measure the average time
   in microseconds that a single WAL flush operation takes. A value of
   half of the average time the program reports it takes to flush after a
   single 8kB write operation is often the most effective setting for
   <varname>commit_delay</varname>, so this value is recommended as the
   starting point to use when optimizing for a particular workload. While
   tuning <varname>commit_delay</varname> is particularly useful when the
   WAL is stored on high-latency rotating disks, benefits can be
   significant even on storage media with very fast sync times, such as
   solid-state drives or RAID arrays with a battery-backed write cache;
   but this should definitely be tested against a representative workload.
   Higher values of <varname>commit_siblings</varname> should be used in
   such cases, whereas smaller <varname>commit_siblings</varname> values
   are often helpful on higher latency media. Note that it is quite
   possible that a setting of <varname>commit_delay</varname> that is too
   high can increase transaction latency by so much that total transaction
   throughput suffers.
  </para>
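  <para>
   As a worked example: if <application>pg_test_fsync</application> reports
   that a single 8kB write followed by a flush takes, say, roughly 6000
   microseconds on a particular rotating disk (a purely illustrative figure),
   half of that suggests a starting point of about 3000 microseconds:
<programlisting>
ALTER SYSTEM SET commit_delay = 3000;   -- microseconds; derived from the measurement above
ALTER SYSTEM SET commit_siblings = 5;   -- the default; raise it on very fast storage
SELECT pg_reload_conf();
</programlisting>
  </para>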
  <para>
   When <varname>commit_delay</varname> is set to zero (the default), it
   is still possible for a form of group commit to occur, but each group
   will consist only of sessions that reach the point where they need to
   flush their commit records during the window in which the previous
   flush operation (if any) is occurring. At higher client counts a
   <quote>gangway effect</quote> tends to occur, so that the effects of group
   commit become significant even when <varname>commit_delay</varname> is
   zero, and thus explicitly setting <varname>commit_delay</varname> tends
   to help less. Setting <varname>commit_delay</varname> can only help
   when (1) there are some concurrently committing transactions, and (2)
   throughput is limited to some degree by commit rate; but with high
   rotational latency this setting can be effective in increasing
   transaction throughput with as few as two clients (that is, a single
   committing client with one sibling transaction).
  </para>

  <para>
   The <xref linkend="guc-wal-sync-method"/> parameter determines how
   <productname>PostgreSQL</productname> will ask the kernel to force
   <acronym>WAL</acronym> updates out to disk.
   All the options should be the same in terms of reliability, with
   the exception of <literal>fsync_writethrough</literal>, which can sometimes
   force a flush of the disk cache even when other options do not do so.
   However, it's quite platform-specific which one will be the fastest.
   You can test the speeds of different options using the <xref
   linkend="pgtestfsync"/> program.
   Note that this parameter is irrelevant if <varname>fsync</varname>
   has been turned off.
  </para>

  <para>
   Enabling the <xref linkend="guc-wal-debug"/> configuration parameter
   (provided that <productname>PostgreSQL</productname> has been
   compiled with support for it) will result in each
   <function>XLogInsertRecord</function> and <function>XLogFlush</function>
   <acronym>WAL</acronym> call being logged to the server log. This
   option might be replaced by a more general mechanism in the future.
  </para>

  <para>
   There are two internal functions to write WAL data to disk:
   <function>XLogWrite</function> and <function>issue_xlog_fsync</function>.
   When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total
   amounts of time <function>XLogWrite</function> writes and
   <function>issue_xlog_fsync</function> syncs WAL data to disk are counted as
   <literal>wal_write_time</literal> and <literal>wal_sync_time</literal> in
   <xref linkend="pg-stat-wal-view"/>, respectively.
   <function>XLogWrite</function> is normally called by
   <function>XLogInsertRecord</function> (when there is no space for the new
   record in WAL buffers), <function>XLogFlush</function> and the WAL writer,
   to write WAL buffers to disk and call <function>issue_xlog_fsync</function>.
   <function>issue_xlog_fsync</function> is normally called by
   <function>XLogWrite</function> to sync WAL files to disk.
   If <varname>wal_sync_method</varname> is either
   <literal>open_datasync</literal> or <literal>open_sync</literal>,
   a write operation in <function>XLogWrite</function> guarantees to sync written
   WAL data to disk and <function>issue_xlog_fsync</function> does nothing.
   If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>,
   <literal>fsync</literal>, or <literal>fsync_writethrough</literal>,
   the write operation moves WAL buffers to kernel cache and
   <function>issue_xlog_fsync</function> syncs them to disk. Regardless
   of the setting of <varname>track_wal_io_timing</varname>, the number
   of times <function>XLogWrite</function> writes and
   <function>issue_xlog_fsync</function> syncs WAL data to disk are also
   counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
   in <structname>pg_stat_wal</structname>, respectively.
  </para>
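  <para>
   These counters can be read directly from the view; the timing columns are
   nonzero only when <varname>track_wal_io_timing</varname> is enabled:
<programlisting>
SELECT wal_write, wal_sync,
       wal_write_time, wal_sync_time
  FROM pg_stat_wal;
</programlisting>
  </para>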
  <para>
   The <xref linkend="guc-recovery-prefetch"/> parameter can be used to reduce
   I/O wait times during recovery by instructing the kernel to initiate reads
   of disk blocks that will soon be needed but are not currently in
   <productname>PostgreSQL</productname>'s buffer pool.
   The <xref linkend="guc-maintenance-io-concurrency"/> and
   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
   concurrency and distance, respectively. By default,
   <varname>recovery_prefetch</varname> is set to
   <literal>try</literal>, which enables the feature on systems that support
   issuing read-ahead advice.
  </para>
 </sect1>

 <sect1 id="wal-internals">
  <title>WAL Internals</title>

  <indexterm zone="wal-internals">
   <primary>LSN</primary>
  </indexterm>

  <para>
   <acronym>WAL</acronym> is automatically enabled; no action is
   required from the administrator except ensuring that the
   disk-space requirements for the <acronym>WAL</acronym> files are met,
   and that any necessary tuning is done (see <xref
   linkend="wal-configuration"/>).
  </para>

  <para>
   <acronym>WAL</acronym> records are appended to the <acronym>WAL</acronym>
   files as each new record is written. The insert position is described by
   a Log Sequence Number (<acronym>LSN</acronym>) that is a byte offset into
   the WAL, increasing monotonically with each new record.
   <acronym>LSN</acronym> values are returned as the datatype
   <link linkend="datatype-pg-lsn"><type>pg_lsn</type></link>. Values can be
   compared to calculate the volume of <acronym>WAL</acronym> data that
   separates them, so they are used to measure the progress of replication
   and recovery.
  </para>
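  <para>
   For example, the current insert position can be captured and the amount of
   WAL generated between two points in time computed in bytes (the literal
   LSN shown is only a placeholder):
<programlisting>
SELECT pg_current_wal_lsn();                  -- e.g., note this before a batch load

-- later, compute the WAL volume generated since the saved position
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), '0/17E3C80'));  -- replace with the saved LSN
</programlisting>
  </para>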
  <para>
   <acronym>WAL</acronym> files are stored in the directory
   <filename>pg_wal</filename> under the data directory, as a set of
   segment files, normally each 16 MB in size (but the size can be changed
   by altering the <option>--wal-segsize</option> <application>initdb</application> option). Each segment is
   divided into pages, normally 8 kB each (this size can be changed via the
   <option>--with-wal-blocksize</option> configure option). The WAL record headers
   are described in <filename>access/xlogrecord.h</filename>; the record
   content is dependent on the type of event that is being logged. Segment
   files are given ever-increasing numbers as names, starting at
   <filename>000000010000000000000001</filename>. The numbers do not wrap,
   but it will take a very, very long time to exhaust the
   available stock of numbers.
  </para>

  <para>
   It is advantageous if the WAL is located on a different disk from the
   main database files. This can be achieved by moving the
   <filename>pg_wal</filename> directory to another location (while the server
   is shut down, of course) and creating a symbolic link from the
   original location in the main data directory to the new location.
  </para>

  <para>
   The aim of <acronym>WAL</acronym> is to ensure that the log is
   written before database records are altered, but this can be subverted by
   disk drives<indexterm><primary>disk drive</primary></indexterm> that falsely report a
   successful write to the kernel,
   when in fact they have only cached the data and not yet stored it
   on the disk. A power failure in such a situation might lead to
   irrecoverable data corruption. Administrators should try to ensure
   that disks holding <productname>PostgreSQL</productname>'s
   <acronym>WAL</acronym> files do not make such false reports.
   (See <xref linkend="wal-reliability"/>.)
  </para>

  <para>
   After a checkpoint has been made and the WAL flushed, the
   checkpoint's position is saved in the file
   <filename>pg_control</filename>. Therefore, at the start of recovery,
   the server first reads <filename>pg_control</filename> and
   then the checkpoint record; then it performs the REDO operation by
   scanning forward from the WAL location indicated in the checkpoint
   record. Because the entire content of data pages is saved in the
   WAL on the first page modification after a checkpoint (assuming
   <xref linkend="guc-full-page-writes"/> is not disabled), all pages
   changed since the checkpoint will be restored to a consistent
   state.
  </para>

  <para>
   To deal with the case where <filename>pg_control</filename> is
   corrupt, we should support the possibility of scanning existing WAL
   segments in reverse order &mdash; newest to oldest &mdash; in order to find the
   latest checkpoint. This has not been implemented yet.
   <filename>pg_control</filename> is small enough (less than one disk page)
   that it is not subject to partial-write problems, and as of this writing
   there have been no reports of database failures due solely to the inability
   to read <filename>pg_control</filename> itself. So while it is
   theoretically a weak spot, <filename>pg_control</filename> does not
   seem to be a problem in practice.
  </para>
 </sect1>

</chapter>