 <title>Reliability and the Write-Ahead Log</title>

 <para>
  This chapter explains how the Write-Ahead Log is used to obtain
  efficient, reliable operation.
 </para>
 <sect1 id="wal-reliability">
  <title>Reliability</title>
  <para>
   Reliability is an important property of any serious database
   system, and <productname>PostgreSQL</> does everything possible to
   guarantee reliable operation. One aspect of reliable operation is
   that all data recorded by a committed transaction should be stored
   in a nonvolatile area that is safe from power loss, operating
   system failure, and hardware failure (except failure of the
   nonvolatile area itself, of course). Successfully writing the data
   to the computer's permanent storage (disk drive or equivalent)
   ordinarily meets this requirement. In fact, even if a computer is
   fatally damaged, if the disk drives survive they can be moved to
   another computer with similar hardware and all committed
   transactions will remain intact.
  </para>
  <para>
   While forcing data periodically to the disk platters might seem like
   a simple operation, it is not. Because disk drives are dramatically
   slower than main memory and CPUs, several layers of caching exist
   between the computer's main memory and the disk platters.
   First, there is the operating system's buffer cache, which caches
   frequently requested disk blocks and combines disk writes. Fortunately,
   all operating systems give applications a way to force writes from
   the buffer cache to disk, and <productname>PostgreSQL</> uses those
   features. (See the <xref linkend="guc-wal-sync-method"> parameter
   to adjust how this is done.)
  </para>
  <para>
   Next, there might be a cache in the disk drive controller; this is
   particularly common on <acronym>RAID</> controller cards. Some of
   these caches are <firstterm>write-through</>, meaning writes are passed
   along to the drive as soon as they arrive. Others are
   <firstterm>write-back</>, meaning data is passed on to the drive at
   some later time. Such caches can be a reliability hazard because the
   memory in the disk controller cache is volatile, and will lose its
   contents in a power failure. Better controller cards have
   <firstterm>battery-backed</> caches, meaning the card has a battery that
   maintains power to the cache in case of system power loss. After power
   is restored the data will be written to the disk drives.
  </para>
  <para>
   And finally, most disk drives have caches. Some are write-through
   while some are write-back, and the
   same concerns about data loss exist for write-back drive caches as
   exist for disk controller caches. Consumer-grade IDE and SATA drives are
   particularly likely to have write-back caches that will not survive a
   power failure. To check write caching on <productname>Linux</> use
   <command>hdparm -I</>; it is enabled if there is a <literal>*</> next
   to <literal>Write cache</>. Use <command>hdparm -W</> to turn off
   write caching. On <productname>FreeBSD</> use
   <application>atacontrol</>. (For SCSI disks use <ulink
   url="http://sg.torque.net/sg/sdparm.html"><application>sdparm</></ulink>
   to turn off <literal>WCE</>.) On <productname>Solaris</> the disk
   write cache is controlled by <ulink
   url="http://www.sun.com/bigadmin/content/submitted/format_utility.jsp"><literal>format
   -e</></ulink>. (The Solaris <acronym>ZFS</> file system is safe with
   disk write-cache enabled because it issues its own disk cache flush
   commands.) On <productname>Windows</> if <varname>wal_sync_method</>
   is <literal>open_datasync</> (the default), write caching is disabled
   by unchecking <literal>My Computer\Open\{select disk
   drive}\Properties\Hardware\Properties\Policies\Enable write caching on
   the disk</>. Also on Windows, <literal>fsync</> and
   <literal>fsync_writethrough</> never do write caching.
  </para>
  <para>
   When the operating system sends a write request to the disk hardware,
   there is little it can do to make sure the data has arrived at a truly
   non-volatile storage area. Rather, it is the
   administrator's responsibility to be sure that all storage components
   ensure data integrity. Avoid disk controllers that have non-battery-backed
   write caches. At the drive level, disable write-back caching if the
   drive cannot guarantee the data will be written before shutdown.
  </para>
  <para>
   Another risk of data loss is posed by the disk platter write
   operations themselves. Disk platters are divided into sectors,
   commonly 512 bytes each. Every physical read or write operation
   processes a whole sector.
   When a write request arrives at the drive, it might be for 512 bytes,
   1024 bytes, or 8192 bytes, and the process of writing could fail due
   to power loss at any time, meaning some of the 512-byte sectors were
   written, and others were not. To guard against such failures,
   <productname>PostgreSQL</> periodically writes full page images to
   permanent storage <emphasis>before</> modifying the actual page on
   disk. By doing this, during crash recovery <productname>PostgreSQL</> can
   restore partially-written pages. If you have a battery-backed disk
   controller or file-system software that prevents partial page writes
   (e.g., ReiserFS 4), you can turn off this page imaging by using the
   <xref linkend="guc-full-page-writes"> parameter.
  </para>
 </sect1>

 <sect1 id="wal-intro">
  <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>

  <indexterm zone="wal">
   <primary>WAL</primary>
  </indexterm>

  <indexterm>
   <primary>transaction log</primary>
  </indexterm>
  <para>
   <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
   is a standard method for ensuring data integrity. A detailed
   description can be found in most (if not all) books about
   transaction processing. Briefly, <acronym>WAL</acronym>'s central
   concept is that changes to data files (where tables and indexes
   reside) must be written only after those changes have been logged,
   that is, after log records describing the changes have been flushed
   to permanent storage. If we follow this procedure, we do not need
   to flush data pages to disk on every transaction commit, because we
   know that in the event of a crash we will be able to recover the
   database using the log: any changes that have not been applied to
   the data pages can be redone from the log records. (This is
   roll-forward recovery, also known as REDO.)
  </para>
  <para>
   Because <acronym>WAL</acronym> restores database file
   contents after a crash, journaled file systems are not necessary for
   reliable storage of the data files or WAL files. In fact, journaling
   overhead can reduce performance, especially if journaling
   causes file system <emphasis>data</emphasis> to be flushed
   to disk. Fortunately, data flushing during journaling can
   often be disabled with a file system mount option, e.g.
   <literal>data=writeback</> on a Linux ext3 file system.
   Journaled file systems do improve boot speed after a crash.
  </para>
  <para>
   Using <acronym>WAL</acronym> results in a
   significantly reduced number of disk writes, because only the log
   file needs to be flushed to disk to guarantee that a transaction is
   committed, rather than every data file changed by the transaction.
   The log file is written sequentially,
   and so the cost of syncing the log is much less than the cost of
   flushing the data pages. This is especially true for servers
   handling many small transactions touching different parts of the data
   store. Furthermore, when the server is processing many small concurrent
   transactions, one <function>fsync</function> of the log file may
   suffice to commit many transactions.
  </para>
  <para>
   <acronym>WAL</acronym> also makes it possible to support on-line
   backup and point-in-time recovery, as described in <xref
   linkend="continuous-archiving">. By archiving the WAL data we can support
   reverting to any time instant covered by the available WAL data:
   we simply install a prior physical backup of the database, and
   replay the WAL log just as far as the desired time. What's more,
   the physical backup doesn't have to be an instantaneous snapshot
   of the database state &mdash; if it is made over some period of time,
   then replaying the WAL log for that period will fix any internal
   inconsistencies.
  </para>
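  <para>
   A minimal WAL-archiving setup in <filename>postgresql.conf</> might
   look like the sketch below; the destination directory
   <filename>/mnt/server/archivedir</> is only an illustration, and
   <xref linkend="continuous-archiving"> describes the details and caveats:
<programlisting>
archive_mode = on                                      # requires a server restart
archive_command = 'cp %p /mnt/server/archivedir/%f'    # copy each completed WAL segment
</programlisting>
  </para>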
 </sect1>

 <sect1 id="wal-async-commit">
  <title>Asynchronous Commit</title>

  <indexterm>
   <primary>synchronous commit</primary>
  </indexterm>

  <indexterm>
   <primary>asynchronous commit</primary>
  </indexterm>
  <para>
   <firstterm>Asynchronous commit</> is an option that allows transactions
   to complete more quickly, at the cost that the most recent transactions may
   be lost if the database should crash. In many applications this is an
   acceptable trade-off.
  </para>
  <para>
   As described in the previous section, transaction commit is normally
   <firstterm>synchronous</>: the server waits for the transaction's
   <acronym>WAL</acronym> records to be flushed to permanent storage
   before returning a success indication to the client. The client is
   therefore guaranteed that a transaction reported to be committed will
   be preserved, even in the event of a server crash immediately after.
   However, for short transactions this delay is a major component of the
   total transaction time. Selecting asynchronous commit mode means that
   the server returns success as soon as the transaction is logically
   completed, before the <acronym>WAL</acronym> records it generated have
   actually made their way to disk. This can provide a significant boost
   in throughput for small transactions.
  </para>
  <para>
   Asynchronous commit introduces the risk of data loss. There is a short
   time window between the report of transaction completion to the client
   and the time that the transaction is truly committed (that is, it is
   guaranteed not to be lost if the server crashes). Thus asynchronous
   commit should not be used if the client will take external actions
   relying on the assumption that the transaction will be remembered.
   As an example, a bank would certainly not use asynchronous commit for
   a transaction recording an ATM's dispensing of cash. But in many
   scenarios, such as event logging, there is no need for a strong
   guarantee of this kind.
  </para>
  <para>
   The risk that is taken by using asynchronous commit is of data loss,
   not data corruption. If the database should crash, it will recover
   by replaying <acronym>WAL</acronym> up to the last record that was
   flushed. The database will therefore be restored to a self-consistent
   state, but any transactions that were not yet flushed to disk will
   not be reflected in that state. The net effect is therefore loss of
   the last few transactions. Because the transactions are replayed in
   commit order, no inconsistency can be introduced &mdash; for example,
   if transaction B made changes relying on the effects of a previous
   transaction A, it is not possible for A's effects to be lost while B's
   effects are preserved.
  </para>
  <para>
   The user can select the commit mode of each transaction, so that
   it is possible to have both synchronous and asynchronous commit
   transactions running concurrently. This allows flexible trade-offs
   between performance and certainty of transaction durability.
   The commit mode is controlled by the user-settable parameter
   <xref linkend="guc-synchronous-commit">, which can be changed in any of
   the ways that a configuration parameter can be set. The mode used for
   any one transaction depends on the value of
   <varname>synchronous_commit</varname> when transaction commit begins.
  </para>
  <para>
   Certain utility commands, for instance <command>DROP TABLE</>, are
   forced to commit synchronously regardless of the setting of
   <varname>synchronous_commit</varname>. This is to ensure consistency
   between the server's file system and the logical state of the database.
   The commands supporting two-phase commit, such as <command>PREPARE
   TRANSACTION</>, are also always synchronous.
  </para>
  <para>
   If the database crashes during the risk window between an
   asynchronous commit and the writing of the transaction's
   <acronym>WAL</acronym> records,
   then changes made during that transaction <emphasis>will</> be lost.
   The duration of the
   risk window is limited because a background process (the <quote>WAL
   writer</>) flushes unwritten <acronym>WAL</acronym> records to disk
   every <xref linkend="guc-wal-writer-delay"> milliseconds.
   The actual maximum duration of the risk window is three times
   <varname>wal_writer_delay</varname> because the WAL writer is
   designed to favor writing whole pages at a time during busy periods.
  </para>
  <para>
   An immediate-mode shutdown is equivalent to a server crash, and will
   therefore cause loss of any unflushed asynchronous commits.
  </para>
  <para>
   Asynchronous commit provides behavior different from setting
   <xref linkend="guc-fsync"> = off.
   <varname>fsync</varname> is a server-wide
   setting that will alter the behavior of all transactions. It disables
   all logic within <productname>PostgreSQL</> that attempts to synchronize
   writes to different portions of the database, and therefore a system
   crash (that is, a hardware or operating system crash, not a failure of
   <productname>PostgreSQL</> itself) could result in arbitrarily bad
   corruption of the database state. In many scenarios, asynchronous
   commit provides most of the performance improvement that could be
   obtained by turning off <varname>fsync</varname>, but without the risk
   of data corruption.
  </para>
  <para>
   <xref linkend="guc-commit-delay"> also sounds very similar to
   asynchronous commit, but it is actually a synchronous commit method
   (in fact, <varname>commit_delay</varname> is ignored during an
   asynchronous commit). <varname>commit_delay</varname> causes a delay
   just before a synchronous commit attempts to flush
   <acronym>WAL</acronym> to disk, in the hope that a single flush
   executed by one such transaction can also serve other transactions
   committing at about the same time. Setting <varname>commit_delay</varname>
   can only help when there are many concurrently committing transactions,
   and it is difficult to tune it to a value that actually helps rather
   than hurts throughput.
  </para>
 </sect1>

 <sect1 id="wal-configuration">
  <title><acronym>WAL</acronym> Configuration</title>

  <para>
   There are several <acronym>WAL</>-related configuration parameters that
   affect database performance. This section explains their use.
   Consult <xref linkend="runtime-config"> for general information about
   setting server configuration parameters.
  </para>
  <para>
   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
   are points in the sequence of transactions at which it is guaranteed
   that the data files have been updated with all information written before
   the checkpoint. At checkpoint time, all dirty data pages are flushed to
   disk and a special checkpoint record is written to the log file.
   (The changes were previously flushed to the <acronym>WAL</acronym> files.)
   In the event of a crash, the crash recovery procedure looks at the latest
   checkpoint record to determine the point in the log (known as the redo
   record) from which it should start the REDO operation. Any changes made to
   data files before that point are guaranteed to be already on disk. Hence, after
   a checkpoint, log segments preceding the one containing
   the redo record are no longer needed and can be recycled or removed. (When
   <acronym>WAL</acronym> archiving is being done, the log segments must be
   archived before being recycled or removed.)
  </para>
  <para>
   The checkpoint requirement of flushing all dirty data pages to disk
   can cause a significant I/O load. For this reason, checkpoint
   activity is throttled so I/O begins at checkpoint start and completes
   before the next checkpoint starts; this minimizes performance
   degradation during checkpoints.
  </para>
  <para>
   The server's background writer process will automatically perform
   a checkpoint every so often. A checkpoint is created every <xref
   linkend="guc-checkpoint-segments"> log segments, or every <xref
   linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
   The default settings are 3 segments and 300 seconds respectively.
   It is also possible to force a checkpoint by using the SQL command
   <command>CHECKPOINT</command>.
  </para>
  <para>
   Reducing <varname>checkpoint_segments</varname> and/or
   <varname>checkpoint_timeout</varname> causes checkpoints to be done
   more often. This allows faster after-crash recovery (since less work
   will need to be redone). However, one must balance this against the
   increased cost of flushing dirty data pages more often. If
   <xref linkend="guc-full-page-writes"> is set (as is the default), there is
   another factor to consider. To ensure data page consistency,
   the first modification of a data page after each checkpoint results in
   logging the entire page content. In that case,
   a smaller checkpoint interval increases the volume of output to the WAL log,
   partially negating the goal of using a smaller interval,
   and in any case causing more disk I/O.
  </para>
  <para>
   Checkpoints are fairly expensive, first because they require writing
   out all currently dirty buffers, and second because they result in
   extra subsequent WAL traffic as discussed above. It is therefore
   wise to set the checkpointing parameters high enough that checkpoints
   don't happen too often. As a simple sanity check on your checkpointing
   parameters, you can set the <xref linkend="guc-checkpoint-warning">
   parameter. If checkpoints happen closer together than
   <varname>checkpoint_warning</> seconds,
   a message will be output to the server log recommending increasing
   <varname>checkpoint_segments</varname>. Occasional appearance of such
   a message is not cause for alarm, but if it appears often then the
   checkpoint control parameters should be increased. Bulk operations such
   as large <command>COPY</> transfers might cause a number of such warnings
   to appear if you have not set <varname>checkpoint_segments</> high
   enough.
  </para>
  <para>
   To avoid flooding the I/O system with a burst of page writes,
   writing dirty buffers during a checkpoint is spread over a period of time.
   That period is controlled by
   <xref linkend="guc-checkpoint-completion-target">, which is
   given as a fraction of the checkpoint interval.
   The I/O rate is adjusted so that the checkpoint finishes when the
   given fraction of <varname>checkpoint_segments</varname> WAL segments
   have been consumed since checkpoint start, or the given fraction of
   <varname>checkpoint_timeout</varname> seconds have elapsed,
   whichever is sooner. With the default value of 0.5,
   <productname>PostgreSQL</> can be expected to complete each checkpoint
   in about half the time before the next checkpoint starts. On a system
   that's very close to maximum I/O throughput during normal operation,
   you might want to increase <varname>checkpoint_completion_target</varname>
   to reduce the I/O load from checkpoints. The disadvantage of this is that
   prolonging checkpoints affects recovery time, because more WAL segments
   will need to be kept around for possible use in recovery. Although
   <varname>checkpoint_completion_target</varname> can be set as high as 1.0,
   it is best to keep it less than that (perhaps 0.9 at most) since
   checkpoints include some other activities besides writing dirty buffers.
   A setting of 1.0 is quite likely to result in checkpoints not being
   completed on time, which would result in performance loss due to
   unexpected variation in the number of WAL segments needed.
  </para>
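  <para>
   As a worked example: with <varname>checkpoint_timeout</> = 300 seconds
   and <varname>checkpoint_completion_target</> = 0.5, the writes for a
   timed checkpoint are paced to finish after roughly 0.5 * 300 = 150
   seconds, leaving the rest of the interval free of checkpoint I/O;
   raising the target to 0.9 would spread the same writes over roughly
   270 seconds instead.
  </para>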
  <para>
   There will always be at least one WAL segment file, and will normally
   not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
   files. Each segment file is normally 16 MB (though this size can be
   altered when building the server). You can use this to estimate space
   requirements for <acronym>WAL</acronym>.
   Ordinarily, when old log segment files are no longer needed, they
   are recycled (renamed to become the next segments in the numbered
   sequence). If, due to a short-term peak of log output rate, there
   are more than 3 * <varname>checkpoint_segments</varname> + 1
   segment files, the unneeded segment files will be deleted instead
   of recycled until the system gets back under this limit.
  </para>
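  <para>
   For example, with <varname>checkpoint_segments</> = 10 and
   <varname>checkpoint_completion_target</> = 0.5, the formula gives
   (2 + 0.5) * 10 + 1 = 26 segment files, or about 26 * 16 MB = 416 MB of
   disk space for WAL in normal operation, with a temporary peak of up to
   3 * 10 + 1 = 31 files (roughly 500 MB) possible during a burst of log
   output.
  </para>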
  <para>
   There are two commonly used internal <acronym>WAL</acronym> functions:
   <function>LogInsert</function> and <function>LogFlush</function>.
   <function>LogInsert</function> is used to place a new record into
   the <acronym>WAL</acronym> buffers in shared memory. If there is no
   space for the new record, <function>LogInsert</function> will have
   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
   buffers. This is undesirable because <function>LogInsert</function>
   is used on every database low level modification (for example, row
   insertion) at a time when an exclusive lock is held on affected
   data pages, so the operation needs to be as fast as possible. What
   is worse, writing <acronym>WAL</acronym> buffers might also force the
   creation of a new log segment, which takes even more
   time. Normally, <acronym>WAL</acronym> buffers should be written
   and flushed by a <function>LogFlush</function> request, which is
   made, for the most part, at transaction commit time to ensure that
   transaction records are flushed to permanent storage. On systems
   with high log output, <function>LogFlush</function> requests might
   not occur often enough to prevent <function>LogInsert</function>
   from having to do writes. On such systems
   one should increase the number of <acronym>WAL</acronym> buffers by
   modifying the configuration parameter <xref
   linkend="guc-wal-buffers">. The default number of <acronym>WAL</acronym>
   buffers is 8. Increasing this value will
   correspondingly increase shared memory usage. When
   <xref linkend="guc-full-page-writes"> is set and the system is very busy,
   setting this value higher will help smooth response times during the
   period immediately following each checkpoint.
  </para>
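  <para>
   On such a system one might raise <varname>wal_buffers</> above its
   default of 8 buffers (each buffer is one WAL page, normally 8 kB); the
   value below is only an illustration:
<programlisting>
wal_buffers = 64      # 64 * 8 kB = 512 kB of shared memory reserved for WAL
</programlisting>
  </para>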
  <para>
   The <xref linkend="guc-commit-delay"> parameter defines for how many
   microseconds the server process will sleep after writing a commit
   record to the log with <function>LogInsert</function> but before
   performing a <function>LogFlush</function>. This delay allows other
   server processes to add their commit records to the log so as to have all
   of them flushed with a single log sync. No sleep will occur if
   <xref linkend="guc-fsync">
   is not enabled, nor if fewer than <xref linkend="guc-commit-siblings">
   other sessions are currently in active transactions; this avoids
   sleeping when it's unlikely that any other session will commit soon.
   Note that on most platforms, the resolution of a sleep request is
   ten milliseconds, so that any nonzero <varname>commit_delay</varname>
   setting between 1 and 10000 microseconds would have the same effect.
   Good values for these parameters are not yet clear; experimentation
   is encouraged.
  </para>
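  <para>
   A group-commit experiment might therefore start from settings like the
   following, which are illustrative values only, to be adjusted by
   measurement:
<programlisting>
commit_delay = 10000      # sleep 10 ms before flushing a commit record
commit_siblings = 5       # but only if at least 5 other transactions are active
</programlisting>
  </para>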
  <para>
   The <xref linkend="guc-wal-sync-method"> parameter determines how
   <productname>PostgreSQL</productname> will ask the kernel to force
   <acronym>WAL</acronym> updates out to disk.
   All the options should be the same as far as reliability goes,
   but it's quite platform-specific which one will be the fastest.
   Note that this parameter is irrelevant if <varname>fsync</varname>
   has been turned off.
  </para>
  <para>
   Enabling the <xref linkend="guc-wal-debug"> configuration parameter
   (provided that <productname>PostgreSQL</productname> has been
   compiled with support for it) will result in each
   <function>LogInsert</function> and <function>LogFlush</function>
   <acronym>WAL</acronym> call being logged to the server log. This
   option might be replaced by a more general mechanism in the future.
  </para>
 </sect1>

 <sect1 id="wal-internals">
  <title>WAL Internals</title>

  <para>
   <acronym>WAL</acronym> is automatically enabled; no action is
   required from the administrator except ensuring that the
   disk-space requirements for the <acronym>WAL</acronym> logs are met,
   and that any necessary tuning is done (see <xref
   linkend="wal-configuration">).
  </para>
  <para>
   <acronym>WAL</acronym> logs are stored in the directory
   <filename>pg_xlog</filename> under the data directory, as a set of
   segment files, normally each 16 MB in size (but the size can be changed
   by altering the <option>--with-wal-segsize</> configure option when
   building the server). Each segment is divided into pages, normally
   8 kB each (this size can be changed via the <option>--with-wal-blocksize</>
   configure option). The log record headers are described in
   <filename>access/xlog.h</filename>; the record content is dependent
   on the type of event that is being logged. Segment files are given
   ever-increasing numbers as names, starting at
   <filename>000000010000000000000000</filename>. The numbers do not wrap, at
   present, but it should take a very long time to exhaust the
   available stock of numbers.
  </para>
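  <para>
   For example, when building the server from source, a nonstandard WAL
   segment size could be selected at configure time as sketched below;
   the 64 MB segment size and the standard 8 kB WAL page size are purely
   illustrative:
<programlisting>
./configure --with-wal-segsize=64 --with-wal-blocksize=8
</programlisting>
  </para>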
  <para>
   It is of advantage if the log is located on another disk than the
   main database files. This can be achieved by moving the directory
   <filename>pg_xlog</filename> to another location (while the server
   is shut down, of course) and creating a symbolic link from the
   original location in the main data directory to the new location.
  </para>
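  <para>
   A minimal sketch of that procedure, assuming the data directory is
   <filename>/usr/local/pgsql/data</> and the new location is
   <filename>/mnt/fastdisk/pg_xlog</> (both paths are illustrations only):
<programlisting>
pg_ctl stop -D /usr/local/pgsql/data
mv /usr/local/pgsql/data/pg_xlog /mnt/fastdisk/pg_xlog
ln -s /mnt/fastdisk/pg_xlog /usr/local/pgsql/data/pg_xlog
pg_ctl start -D /usr/local/pgsql/data
</programlisting>
  </para>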
  <para>
   The aim of <acronym>WAL</acronym>, to ensure that the log is
   written before database records are altered, can be subverted by
   disk drives<indexterm><primary>disk drive</></> that falsely report a
   successful write to the kernel,
   when in fact they have only cached the data and not yet stored it
   on the disk. A power failure in such a situation might still lead to
   irrecoverable data corruption. Administrators should try to ensure
   that disks holding <productname>PostgreSQL</productname>'s
   <acronym>WAL</acronym> log files do not make such false reports.
  </para>
  <para>
   After a checkpoint has been made and the log flushed, the
   checkpoint's position is saved in the file
   <filename>pg_control</filename>. Therefore, when recovery is to be
   done, the server first reads <filename>pg_control</filename> and
   then the checkpoint record; then it performs the REDO operation by
   scanning forward from the log position indicated in the checkpoint
   record. Because the entire content of data pages is saved in the
   log on the first page modification after a checkpoint (assuming
   <xref linkend="guc-full-page-writes"> is not disabled), all pages
   changed since the checkpoint will be restored to a consistent
   state.
  </para>
  <para>
   To deal with the case where <filename>pg_control</filename> is
   corrupted, we should support the possibility of scanning existing log
   segments in reverse order &mdash; newest to oldest &mdash; in order to find the
   latest checkpoint. This has not been implemented yet.
   <filename>pg_control</filename> is small enough (less than one disk page)
   that it is not subject to partial-write problems, and as of this writing
   there have been no reports of database failures due solely to inability
   to read <filename>pg_control</filename> itself. So while it is
   theoretically a weak spot, <filename>pg_control</filename> does not
   seem to be a problem in practice.
  </para>