doc/src/sgml/high-availability.sgml

   1 <!-- doc/src/sgml/high-availability.sgml -->
   2
   3 <chapter id="high-availability">
   4  <title>High Availability, Load Balancing, and Replication</title>
   5
   6  <indexterm><primary>high availability</primary></indexterm>
   7  <indexterm><primary>failover</primary></indexterm>
   8  <indexterm><primary>replication</primary></indexterm>
   9  <indexterm><primary>load balancing</primary></indexterm>
  10  <indexterm><primary>clustering</primary></indexterm>
  11  <indexterm><primary>data partitioning</primary></indexterm>
  12
  13  <para>
  14   Database servers can work together to allow a second server to
  15   take over quickly if the primary server fails (high
  16   availability), or to allow several computers to serve the same
  17   data (load balancing).  Ideally, database servers could work
  18   together seamlessly.  Web servers serving static web pages can
  19   be combined quite easily by merely load-balancing web requests
  20   to multiple machines.  In fact, read-only database servers can
  21   be combined relatively easily too.  Unfortunately, most database
  22   servers have a read/write mix of requests, and read/write servers
  23   are much harder to combine.  This is because though read-only
  24   data needs to be placed on each server only once, a write to any
  25   server has to be propagated to all servers so that future read
  26   requests to those servers return consistent results.
  27  </para>
  28
  29  <para>
  30   This synchronization problem is the fundamental difficulty for
  31   servers working together.  Because there is no single solution
  32   that eliminates the impact of the sync problem for all use cases,
  33   there are multiple solutions.  Each solution addresses this
  34   problem in a different way, and minimizes its impact for a specific
  35   workload.
  36  </para>
  37
  38  <para>
  39   Some solutions deal with synchronization by allowing only one
  40   server to modify the data.  Servers that can modify data are
  41   called read/write, <firstterm>master</firstterm> or <firstterm>primary</firstterm> servers.
  42   Servers that track changes in the primary are called <firstterm>standby</firstterm>
  43   or <firstterm>secondary</firstterm> servers. A standby server that cannot be connected
  44   to until it is promoted to a primary server is called a <firstterm>warm
  45   standby</firstterm> server, and one that can accept connections and serves read-only
  46   queries is called a <firstterm>hot standby</firstterm> server.
  47  </para>
  48
  49  <para>
  50   Some solutions are synchronous,
  51   meaning that a data-modifying transaction is not considered
  52   committed until all servers have committed the transaction.  This
  53   guarantees that a failover will not lose any data and that all
  54   load-balanced servers will return consistent results no matter
  55   which server is queried. In contrast, asynchronous solutions allow some
  56   delay between the time of a commit and its propagation to the other servers,
  57   opening the possibility that some transactions might be lost in
  58   the switch to a backup server, and that load balanced servers
  59   might return slightly stale results.  Asynchronous communication
  60   is used when synchronous would be too slow.
  61  </para>
  62
  63  <para>
  64   Solutions can also be categorized by their granularity.  Some solutions
  65   can deal only with an entire database server, while others allow control
  66   at the per-table or per-database level.
  67  </para>
  68
  69  <para>
  70   Performance must be considered in any choice.  There is usually a
  71   trade-off between functionality and
  72   performance.  For example, a fully synchronous solution over a slow
  73   network might cut performance by more than half, while an asynchronous
  74   one might have a minimal performance impact.
  75  </para>
  76
  77  <para>
  78   The remainder of this section outlines various failover, replication,
  79   and load balancing solutions.
  80  </para>
  81
  82  <sect1 id="different-replication-solutions">
  83  <title>Comparison of Different Solutions</title>
  84
  85  <variablelist>
  86
  87   <varlistentry>
  88    <term>Shared Disk Failover</term>
  89    <listitem>
  90
  91     <para>
  92      Shared disk failover avoids synchronization overhead by having only one
  93      copy of the database.  It uses a single disk array that is shared by
  94      multiple servers.  If the main database server fails, the standby server
  95      is able to mount and start the database as though it were recovering from
  96      a database crash.  This allows rapid failover with no data loss.
  97     </para>
  98
  99     <para>
 100      Shared hardware functionality is common in network storage devices.
 101      Using a network file system is also possible, though care must be
 102      taken that the file system has full <acronym>POSIX</acronym> behavior (see <xref
 103      linkend="creating-cluster-nfs"/>).  One significant limitation of this
 104      method is that if the shared disk array fails or becomes corrupt, the
 105      primary and standby servers are both nonfunctional.  Another issue is
 106      that the standby server should never access the shared storage while
 107      the primary server is running.
 108     </para>
 109
 110    </listitem>
 111   </varlistentry>
 112
 113   <varlistentry>
 114    <term>File System (Block Device) Replication</term>
 115    <listitem>
 116
 117     <para>
 118      A modified version of shared hardware functionality is file system
 119      replication, where all changes to a file system are mirrored to a file
 120      system residing on another computer.  The only restriction is that
 121      the mirroring must be done in a way that ensures the standby server
 122      has a consistent copy of the file system &mdash; specifically, writes
 123      to the standby must be done in the same order as those on the primary.
 124      <productname>DRBD</productname> is a popular file system replication solution
 125      for Linux.
 126     </para>
 127
 128 <!--
 129 https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html
 130
 131 Oracle RAC is a shared disk approach and just send cache invalidations
 132 to other nodes but not actual data. As the disk is shared, data is
 133 only committed once to disk and there is a distributed locking
 134 protocol to make nodes agree on a serializable transactional order.
 135 -->
 136
 137    </listitem>
 138   </varlistentry>
 139
 140   <varlistentry>
 141    <term>Write-Ahead Log Shipping</term>
 142    <listitem>
 143
 144     <para>
 145      Warm and hot standby servers can be kept current by reading a
 146      stream of write-ahead log (<acronym>WAL</acronym>)
 147      records.  If the main server fails, the standby contains
 148      almost all of the data of the main server, and can be quickly
 149      made the new primary database server.  This can be synchronous or
 150      asynchronous and can only be done for the entire database server.
 151     </para>
 152     <para>
 153      A standby server can be implemented using file-based log shipping
 154      (<xref linkend="warm-standby"/>) or streaming replication (see
 155      <xref linkend="streaming-replication"/>), or a combination of both. For
 156      information on hot standby, see <xref linkend="hot-standby"/>.
 157     </para>
 158    </listitem>
 159   </varlistentry>
 160
 161   <varlistentry>
 162    <term>Logical Replication</term>
 163    <listitem>
 164     <para>
 165      Logical replication allows a database server to send a stream of data
 166      modifications to another server.  <productname>PostgreSQL</productname>
 167      logical replication constructs a stream of logical data modifications
 168      from the WAL.  Logical replication allows replication of data changes on
 169      a per-table basis.  In addition, a server that is publishing its own
 170      changes can also subscribe to changes from another server, allowing data
 171      to flow in multiple directions.  For more information on logical
 172      replication, see <xref linkend="logical-replication"/>.  Through the
 173      logical decoding interface (<xref linkend="logicaldecoding"/>),
 174      third-party extensions can also provide similar functionality.
 175     </para>
 176    </listitem>
 177   </varlistentry>
 178
 179   <varlistentry>
 180    <term>Trigger-Based Primary-Standby Replication</term>
 181    <listitem>
 182
 183     <para>
 184      A trigger-based replication setup typically funnels data modification
 185      queries to a designated primary server. Operating on a per-table basis,
 186      the primary server sends data changes (typically) asynchronously to the
 187      standby servers.  Standby servers can answer queries while the primary is
 188      running, and may allow some local data changes or write activity.  This
 189      form of replication is often used for offloading large analytical or data
 190      warehouse queries.
 191     </para>
 192
 193     <para>
 194      <productname>Slony-I</productname> is an example of this type of
 195      replication, with per-table granularity, and support for multiple standby
 196      servers.  Because it updates the standby server asynchronously (in
 197      batches), there is possible data loss during fail over.
 198     </para>
 199    </listitem>
 200   </varlistentry>
 201
 202   <varlistentry>
 203    <term>SQL-Based Replication Middleware</term>
 204    <listitem>
 205
 206     <para>
 207      With SQL-based replication middleware, a program intercepts
 208      every SQL query and sends it to one or all servers.  Each server
 209      operates independently.  Read-write queries must be sent to all servers,
 210      so that every server receives any changes.  But read-only queries can be
 211      sent to just one server, allowing the read workload to be distributed
 212      among them.
 213     </para>
 214
 215     <para>
 216      If queries are simply broadcast unmodified, functions like
 217      <function>random()</function>, <function>CURRENT_TIMESTAMP</function>, and
 218      sequences can have different values on different servers.
 219      This is because each server operates independently, and because
 220      SQL queries are broadcast rather than actual data changes.  If
 221      this is unacceptable, either the middleware or the application
 222      must determine such values from a single source and then use those
 223      values in write queries.  Care must also be taken that all
 224      transactions either commit or abort on all servers, perhaps
 225      using two-phase commit (<xref linkend="sql-prepare-transaction"/>
 226      and <xref linkend="sql-commit-prepared"/>).
 227      <productname>Pgpool-II</productname> and <productname>Continuent Tungsten</productname>
 228      are examples of this type of replication.
 229     </para>
 230    </listitem>
 231   </varlistentry>
 232
 233   <varlistentry>
 234    <term>Asynchronous Multimaster Replication</term>
 235    <listitem>
 236
 237     <para>
 238      For servers that are not regularly connected or have slow
 239      communication links, like laptops or
 240      remote servers, keeping data consistent among servers is a
 241      challenge.  Using asynchronous multimaster replication, each
 242      server works independently, and periodically communicates with
 243      the other servers to identify conflicting transactions.  The
 244      conflicts can be resolved by users or conflict resolution rules.
 245      Bucardo is an example of this type of replication.
 246     </para>
 247    </listitem>
 248   </varlistentry>
 249
 250   <varlistentry>
 251    <term>Synchronous Multimaster Replication</term>
 252    <listitem>
 253
 254     <para>
 255      In synchronous multimaster replication, each server can accept
 256      write requests, and modified data is transmitted from the
 257      original server to every other server before each transaction
 258      commits.  Heavy write activity can cause excessive locking and
 259      commit delays, leading to poor performance.  Read requests can
 260      be sent to any server.  Some implementations use shared disk
 261      to reduce the communication overhead.  Synchronous multimaster
 262      replication is best for mostly read workloads, though its big
 263      advantage is that any server can accept write requests &mdash;
 264      there is no need to partition workloads between primary and
 265      standby servers, and because the data changes are sent from one
 266      server to another, there is no problem with non-deterministic
 267      functions like <function>random()</function>.
 268     </para>
 269
 270     <para>
 271      <productname>PostgreSQL</productname> does not offer this type of replication,
 272      though <productname>PostgreSQL</productname> two-phase commit (<xref
 273      linkend="sql-prepare-transaction"/> and <xref
 274      linkend="sql-commit-prepared"/>)
 275      can be used to implement this in application code or middleware.
 276     </para>
 277    </listitem>
 278   </varlistentry>
 279
 280  </variablelist>
 281
 282  <para>
 283   <xref linkend="high-availability-matrix"/> summarizes
 284   the capabilities of the various solutions listed above.
 285  </para>
 286
 287  <table id="high-availability-matrix">
 288   <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
 289   <tgroup cols="9">
 290    <colspec colname="col1" colwidth="1.1*"/>
 291    <colspec colname="col2" colwidth="1*"/>
 292    <colspec colname="col3" colwidth="1*"/>
 293    <colspec colname="col4" colwidth="1*"/>
 294    <colspec colname="col5" colwidth="1*"/>
 295    <colspec colname="col6" colwidth="1*"/>
 296    <colspec colname="col7" colwidth="1*"/>
 297    <colspec colname="col8" colwidth="1*"/>
 298    <colspec colname="col9" colwidth="1*"/>
 299    <thead>
 300     <row>
 301      <entry>Feature</entry>
 302      <entry>Shared Disk</entry>
 303      <entry>File System Repl.</entry>
 304      <entry>Write-Ahead Log Shipping</entry>
 305      <entry>Logical Repl.</entry>
 306      <entry>Trigger-&zwsp;Based Repl.</entry>
 307      <entry>SQL Repl. Middle-ware</entry>
 308      <entry>Async. MM Repl.</entry>
 309      <entry>Sync. MM Repl.</entry>
 310     </row>
 311    </thead>
 312
 313    <tbody>
 314
 315     <row>
 316      <entry>Popular examples</entry>
 317      <entry align="center">NAS</entry>
 318      <entry align="center">DRBD</entry>
 319      <entry align="center">built-in streaming repl.</entry>
 320      <entry align="center">built-in logical repl., pglogical</entry>
 321      <entry align="center">Londiste, Slony</entry>
 322      <entry align="center">pgpool-II</entry>
 323      <entry align="center">Bucardo</entry>
 324      <entry align="center"></entry>
 325     </row>
 326
 327     <row>
 328      <entry>Comm. method</entry>
 329      <entry align="center">shared disk</entry>
 330      <entry align="center">disk blocks</entry>
 331      <entry align="center">WAL</entry>
 332      <entry align="center">logical decoding</entry>
 333      <entry align="center">table rows</entry>
 334      <entry align="center">SQL</entry>
 335      <entry align="center">table rows</entry>
 336      <entry align="center">table rows and row locks</entry>
 337     </row>
 338
 339     <row>
 340      <entry>No special hardware required</entry>
 341      <entry align="center"></entry>
 342      <entry align="center">&bull;</entry>
 343      <entry align="center">&bull;</entry>
 344      <entry align="center">&bull;</entry>
 345      <entry align="center">&bull;</entry>
 346      <entry align="center">&bull;</entry>
 347      <entry align="center">&bull;</entry>
 348      <entry align="center">&bull;</entry>
 349     </row>
 350
 351     <row>
 352      <entry>Allows multiple primary servers</entry>
 353      <entry align="center"></entry>
 354      <entry align="center"></entry>
 355      <entry align="center"></entry>
 356      <entry align="center">&bull;</entry>
 357      <entry align="center"></entry>
 358      <entry align="center">&bull;</entry>
 359      <entry align="center">&bull;</entry>
 360      <entry align="center">&bull;</entry>
 361     </row>
 362
 363     <row>
 364      <entry>No overhead on primary</entry>
 365      <entry align="center">&bull;</entry>
 366      <entry align="center"></entry>
 367      <entry align="center">&bull;</entry>
 368      <entry align="center">&bull;</entry>
 369      <entry align="center"></entry>
 370      <entry align="center">&bull;</entry>
 371      <entry align="center"></entry>
 372      <entry align="center"></entry>
 373     </row>
 374
 375     <row>
 376      <entry>No waiting for multiple servers</entry>
 377      <entry align="center">&bull;</entry>
 378      <entry align="center"></entry>
 379      <entry align="center">with sync off</entry>
 380      <entry align="center">with sync off</entry>
 381      <entry align="center">&bull;</entry>
 382      <entry align="center"></entry>
 383      <entry align="center">&bull;</entry>
 384      <entry align="center"></entry>
 385     </row>
 386
 387     <row>
 388      <entry>Primary failure will never lose data</entry>
 389      <entry align="center">&bull;</entry>
 390      <entry align="center">&bull;</entry>
 391      <entry align="center">with sync on</entry>
 392      <entry align="center">with sync on</entry>
 393      <entry align="center"></entry>
 394      <entry align="center">&bull;</entry>
 395      <entry align="center"></entry>
 396      <entry align="center">&bull;</entry>
 397     </row>
 398
 399     <row>
 400      <entry>Replicas accept read-only queries</entry>
 401      <entry align="center"></entry>
 402      <entry align="center"></entry>
 403      <entry align="center">with hot standby</entry>
 404      <entry align="center">&bull;</entry>
 405      <entry align="center">&bull;</entry>
 406      <entry align="center">&bull;</entry>
 407      <entry align="center">&bull;</entry>
 408      <entry align="center">&bull;</entry>
 409     </row>
 410
 411     <row>
 412      <entry>Per-table granularity</entry>
 413      <entry align="center"></entry>
 414      <entry align="center"></entry>
 415      <entry align="center"></entry>
 416      <entry align="center">&bull;</entry>
 417      <entry align="center">&bull;</entry>
 418      <entry align="center"></entry>
 419      <entry align="center">&bull;</entry>
 420      <entry align="center">&bull;</entry>
 421     </row>
 422
 423     <row>
 424      <entry>No conflict resolution necessary</entry>
 425      <entry align="center">&bull;</entry>
 426      <entry align="center">&bull;</entry>
 427      <entry align="center">&bull;</entry>
 428      <entry align="center"></entry>
 429      <entry align="center">&bull;</entry>
 430      <entry align="center">&bull;</entry>
 431      <entry align="center"></entry>
 432      <entry align="center">&bull;</entry>
 433     </row>
 434
 435    </tbody>
 436   </tgroup>
 437  </table>
 438
 439  <para>
 440   There are a few solutions that do not fit into the above categories:
 441  </para>
 442
 443  <variablelist>
 444
 445   <varlistentry>
 446    <term>Data Partitioning</term>
 447    <listitem>
 448
 449     <para>
 450      Data partitioning splits tables into data sets.  Each set can
 451      be modified by only one server.  For example, data can be
 452      partitioned by offices, e.g., London and Paris, with a server
 453      in each office.  If queries combining London and Paris data
 454      are necessary, an application can query both servers, or
 455      primary/standby replication can be used to keep a read-only copy
 456      of the other office's data on each server.
 457     </para>
 458    </listitem>
 459   </varlistentry>
 460
 461   <varlistentry>
 462    <term>Multiple-Server Parallel Query Execution</term>
 463    <listitem>
 464
 465     <para>
 466      Many of the above solutions allow multiple servers to handle multiple
 467      queries, but none allow a single query to use multiple servers to
 468      complete faster.  This solution allows multiple servers to work
 469      concurrently on a single query.  It is usually accomplished by
 470      splitting the data among servers and having each server execute its
 471      part of the query and return results to a central server where they
 472      are combined and returned to the user. This can be implemented using the
 473      <productname>PL/Proxy</productname> tool set.
 474     </para>
 475
 476    </listitem>
 477   </varlistentry>
 478
 479  </variablelist>
 480
 481   <para>
 482    It should also be noted that because <productname>PostgreSQL</productname>
 483    is open source and easily extended, a number of companies have
 484    taken <productname>PostgreSQL</productname> and created commercial
 485    closed-source solutions with unique failover, replication, and load
 486    balancing capabilities.  These are not discussed here.
 487   </para>
 488
 489  </sect1>
 490
 491
 492  <sect1 id="warm-standby">
 493  <title>Log-Shipping Standby Servers</title>
 494
 495
 496   <para>
 497    Continuous archiving can be used to create a <firstterm>high
 498    availability</firstterm> (HA) cluster configuration with one or more
 499    <firstterm>standby servers</firstterm> ready to take over operations if the
 500    primary server fails. This capability is widely referred to as
 501    <firstterm>warm standby</firstterm> or <firstterm>log shipping</firstterm>.
 502   </para>
 503
 504   <para>
 505    The primary and standby server work together to provide this capability,
 506    though the servers are only loosely coupled. The primary server operates
 507    in continuous archiving mode, while each standby server operates in
 508    continuous recovery mode, reading the WAL files from the primary. No
 509    changes to the database tables are required to enable this capability,
 510    so it offers low administration overhead compared to some other
 511    replication solutions. This configuration also has relatively low
 512    performance impact on the primary server.
 513   </para>
 514
 515   <para>
 516    Directly moving WAL records from one database server to another
 517    is typically described as log shipping. <productname>PostgreSQL</productname>
 518    implements file-based log shipping by transferring WAL records
 519    one file (WAL segment) at a time. WAL files (16MB) can be
 520    shipped easily and cheaply over any distance, whether it be to an
 521    adjacent system, another system at the same site, or another system on
 522    the far side of the globe. The bandwidth required for this technique
 523    varies according to the transaction rate of the primary server.
 524    Record-based log shipping is more granular and streams WAL changes
 525    incrementally over a network connection (see <xref
 526    linkend="streaming-replication"/>).
 527   </para>
 528
 529   <para>
 530    It should be noted that log shipping is asynchronous, i.e., the WAL
 531    records are shipped after transaction commit. As a result, there is a
 532    window for data loss should the primary server suffer a catastrophic
 533    failure; transactions not yet shipped will be lost.  The size of the
 534    data loss window in file-based log shipping can be limited by use of the
 535    <varname>archive_timeout</varname> parameter, which can be set as low
 536    as a few seconds.  However such a low setting will
 537    substantially increase the bandwidth required for file shipping.
 538    Streaming replication (see <xref linkend="streaming-replication"/>)
 539    allows a much smaller window of data loss.
 540   </para>
 541
 542   <para>
 543    Recovery performance is sufficiently good that the standby will
 544    typically be only moments away from full
 545    availability once it has been activated. As a result, this is called
 546    a warm standby configuration which offers high
 547    availability. Restoring a server from an archived base backup and
 548    rollforward will take considerably longer, so that technique only
 549    offers a solution for disaster recovery, not high availability.
 550    A standby server can also be used for read-only queries, in which case
 551    it is called a <firstterm>hot standby</firstterm> server. See
 552    <xref linkend="hot-standby"/> for more information.
 553   </para>
 554
 555   <indexterm zone="high-availability">
 556    <primary>warm standby</primary>
 557   </indexterm>
 558
 559   <indexterm zone="high-availability">
 560    <primary>PITR standby</primary>
 561   </indexterm>
 562
 563   <indexterm zone="high-availability">
 564    <primary>standby server</primary>
 565   </indexterm>
 566
 567   <indexterm zone="high-availability">
 568    <primary>log shipping</primary>
 569   </indexterm>
 570
 571   <indexterm zone="high-availability">
 572    <primary>witness server</primary>
 573   </indexterm>
 574
 575   <indexterm zone="high-availability">
 576    <primary>STONITH</primary>
 577   </indexterm>
 578
 579   <sect2 id="standby-planning">
 580    <title>Planning</title>
 581
 582    <para>
 583     It is usually wise to create the primary and standby servers
 584     so that they are as similar as possible, at least from the
 585     perspective of the database server.  In particular, the path names
 586     associated with tablespaces will be passed across unmodified, so both
 587     primary and standby servers must have the same mount paths for
 588     tablespaces if that feature is used.  Keep in mind that if
 589     <xref linkend="sql-createtablespace"/>
 590     is executed on the primary, any new mount point needed for it must
 591     be created on the primary and all standby servers before the command
 592     is executed. Hardware need not be exactly the same, but experience shows
 593     that maintaining two identical systems is easier than maintaining two
 594     dissimilar ones over the lifetime of the application and system.
 595     In any case the hardware architecture must be the same &mdash; shipping
 596     from, say, a 32-bit to a 64-bit system will not work.
 597    </para>
 598
 599    <para>
 600     In general, log shipping between servers running different major
 601     <productname>PostgreSQL</productname> release
 602     levels is not possible. It is the policy of the PostgreSQL Global
 603     Development Group not to make changes to disk formats during minor release
 604     upgrades, so it is likely that running different minor release levels
 605     on primary and standby servers will work successfully. However, no
 606     formal support for that is offered and you are advised to keep primary
 607     and standby servers at the same release level as much as possible.
 608     When updating to a new minor release, the safest policy is to update
 609     the standby servers first &mdash; a new minor release is more likely
 610     to be able to read WAL files from a previous minor release than vice
 611     versa.
 612    </para>
 613
 614   </sect2>
 615
 616   <sect2 id="standby-server-operation" xreflabel="Standby Server Operation">
 617    <title>Standby Server Operation</title>
 618
 619    <para>
 620     A server enters standby mode if a
 621     <anchor id="file-standby-signal" xreflabel="standby.signal"/>
 622     <filename>standby.signal</filename>
 623     <indexterm><primary><filename>standby.signal</filename></primary></indexterm>
 624     file exists in the data directory when the server is started.
 625    </para>
 626
 627    <para>
 628     In standby mode, the server continuously applies WAL received from the
 629     primary server. The standby server can read WAL from a WAL archive
 630     (see <xref linkend="guc-restore-command"/>) or directly from the primary
 631     over a TCP connection (streaming replication). The standby server will
 632     also attempt to restore any WAL found in the standby cluster's
 633     <filename>pg_wal</filename> directory. That typically happens after a server
 634     restart, when the standby replays again WAL that was streamed from the
 635     primary before the restart, but you can also manually copy files to
 636     <filename>pg_wal</filename> at any time to have them replayed.
 637    </para>
 638
 639    <para>
 640     At startup, the standby begins by restoring all WAL available in the
 641     archive location, calling <varname>restore_command</varname>. Once it
 642     reaches the end of WAL available there and <varname>restore_command</varname>
 643     fails, it tries to restore any WAL available in the <filename>pg_wal</filename> directory.
 644     If that fails, and streaming replication has been configured, the
 645     standby tries to connect to the primary server and start streaming WAL
 646     from the last valid record found in archive or <filename>pg_wal</filename>. If that fails
 647     or streaming replication is not configured, or if the connection is
 648     later disconnected, the standby goes back to step 1 and tries to
 649     restore the file from the archive again. This loop of retries from the
 650     archive, <filename>pg_wal</filename>, and via streaming replication goes on until the server
 651     is stopped or is promoted.
 652    </para>
 653
 654    <para>
 655     Standby mode is exited and the server switches to normal operation
 656     when <command>pg_ctl promote</command> is run, or
 657     <function>pg_promote()</function> is called. Before failover,
 658     any WAL immediately available in the archive or in <filename>pg_wal</filename>
 659     will be restored, but no attempt is made to connect to the primary.
 660    </para>
 661   </sect2>
 662
 663   <sect2 id="preparing-primary-for-standby">
 664    <title>Preparing the Primary for Standby Servers</title>
 665
 666    <para>
 667     Set up continuous archiving on the primary to an archive directory
 668     accessible from the standby, as described
 669     in <xref linkend="continuous-archiving"/>. The archive location should be
 670     accessible from the standby even when the primary is down, i.e., it should
 671     reside on the standby server itself or another trusted server, not on
 672     the primary server.
 673    </para>
 674
 675    <para>
 676     If you want to use streaming replication, set up authentication on the
 677     primary server to allow replication connections from the standby
 678     server(s); that is, create a role and provide a suitable entry or
 679     entries in <filename>pg_hba.conf</filename> with the database field set to
 680     <literal>replication</literal>.  Also ensure <varname>max_wal_senders</varname> is set
 681     to a sufficiently large value in the configuration file of the primary
 682     server. If replication slots will be used,
 683     ensure that <varname>max_replication_slots</varname> is set sufficiently
 684     high as well.
 685    </para>
 686
 687    <para>
 688     Take a base backup as described in <xref linkend="backup-base-backup"/>
 689     to bootstrap the standby server.
 690    </para>
 691   </sect2>
 692
 693   <sect2 id="standby-server-setup">
 694    <title>Setting Up a Standby Server</title>
 695
 696    <para>
 697     To set up the standby server, restore the base backup taken from primary
 698     server (see <xref linkend="backup-pitr-recovery"/>). Create a file
 699     <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary></indexterm>
 700     in the standby's cluster data
 701     directory. Set <xref linkend="guc-restore-command"/> to a simple command to copy files from
 702     the WAL archive. If you plan to have multiple standby servers for high
 703     availability purposes, make sure that <varname>recovery_target_timeline</varname> is set to
 704     <literal>latest</literal> (the default), to make the standby server follow the timeline change
 705     that occurs at failover to another standby.
 706    </para>
 707
 708    <note>
 709      <para>
 710      <xref linkend="guc-restore-command"/> should return immediately
 711      if the file does not exist; the server will retry the command again if
 712      necessary.
 713     </para>
 714    </note>
 715
 716    <para>
 717      If you want to use streaming replication, fill in
 718      <xref linkend="guc-primary-conninfo"/> with a libpq connection string, including
 719      the host name (or IP address) and any additional details needed to
 720      connect to the primary server. If the primary needs a password for
 721      authentication, the password needs to be specified in
 722      <xref linkend="guc-primary-conninfo"/> as well.
 723    </para>
 724
 725    <para>
 726     If you're setting up the standby server for high availability purposes,
 727     set up WAL archiving, connections and authentication like the primary
 728     server, because the standby server will work as a primary server after
 729     failover.
 730    </para>
 731
 732    <para>
 733     If you're using a WAL archive, its size can be minimized using the <xref
 734     linkend="guc-archive-cleanup-command"/> parameter to remove files that are no
 735     longer required by the standby server.
 736     The <application>pg_archivecleanup</application> utility is designed specifically to
 737     be used with <varname>archive_cleanup_command</varname> in typical single-standby
 738     configurations, see <xref linkend="pgarchivecleanup"/>.
 739     Note however, that if you're using the archive for backup purposes, you
 740     need to retain files needed to recover from at least the latest base
 741     backup, even if they're no longer needed by the standby.
 742    </para>
 743
 744    <para>
 745     A simple example of configuration is:
 746 <programlisting>
 747 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000'''
 748 restore_command = 'cp /path/to/archive/%f %p'
 749 archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
 750 </programlisting>
 751    </para>
 752
 753    <para>
 754     You can have any number of standby servers, but if you use streaming
 755     replication, make sure you set <varname>max_wal_senders</varname> high enough in
 756     the primary to allow them to be connected simultaneously.
 757    </para>
 758
 759   </sect2>
 760
 761   <sect2 id="streaming-replication">
 762    <title>Streaming Replication</title>
 763
 764    <indexterm zone="high-availability">
 765     <primary>Streaming Replication</primary>
 766    </indexterm>
 767
 768    <para>
 769     Streaming replication allows a standby server to stay more up-to-date
 770     than is possible with file-based log shipping. The standby connects
 771     to the primary, which streams WAL records to the standby as they're
 772     generated, without waiting for the WAL file to be filled.
 773    </para>
 774
 775    <para>
 776     Streaming replication is asynchronous by default
 777     (see <xref linkend="synchronous-replication"/>), in which case there is
 778     a small delay between committing a transaction in the primary and the
 779     changes becoming visible in the standby. This delay is however much
 780     smaller than with file-based log shipping, typically under one second
 781     assuming the standby is powerful enough to keep up with the load. With
 782     streaming replication, <varname>archive_timeout</varname> is not required to
 783     reduce the data loss window.
 784    </para>
 785
 786    <para>
 787     If you use streaming replication without file-based continuous
 788     archiving, the server might recycle old WAL segments before the standby
 789     has received them.  If this occurs, the standby will need to be
 790     reinitialized from a new base backup.  You can avoid this by setting
 791     <varname>wal_keep_size</varname> to a value large enough to ensure that
 792     WAL segments are not recycled too early, or by configuring a replication
 793     slot for the standby.  If you set up a WAL archive that's accessible from
 794     the standby, these solutions are not required, since the standby can
 795     always use the archive to catch up provided it retains enough segments.
 796    </para>
 797
 798    <para>
 799     To use streaming replication, set up a file-based log-shipping standby
 800     server as described in <xref linkend="warm-standby"/>. The step that
 801     turns a file-based log-shipping standby into streaming replication
 802     standby is setting the <varname>primary_conninfo</varname> setting
 803     to point to the primary server. Set
 804     <xref linkend="guc-listen-addresses"/> and authentication options
 805     (see <filename>pg_hba.conf</filename>) on the primary so that the standby server
 806     can connect to the <literal>replication</literal> pseudo-database on the primary
 807     server (see <xref linkend="streaming-replication-authentication"/>).
 808    </para>
 809
 810    <para>
 811     On systems that support the keepalive socket option, setting
 812     <xref linkend="guc-tcp-keepalives-idle"/>,
 813     <xref linkend="guc-tcp-keepalives-interval"/> and
 814     <xref linkend="guc-tcp-keepalives-count"/> helps the primary promptly
 815     notice a broken connection.
 816    </para>
 817
 818    <para>
 819     Set the maximum number of concurrent connections from the standby servers
 820     (see <xref linkend="guc-max-wal-senders"/> for details).
 821    </para>
 822
 823    <para>
 824     When the standby is started and <varname>primary_conninfo</varname> is set
 825     correctly, the standby will connect to the primary after replaying all
 826     WAL files available in the archive. If the connection is established
 827     successfully, you will see a <literal>walreceiver</literal> in the standby, and
 828     a corresponding <literal>walsender</literal> process in the primary.
 829    </para>
 830
 831    <sect3 id="streaming-replication-authentication">
 832     <title>Authentication</title>
 833     <para>
 834      It is very important that the access privileges for replication be set up
 835      so that only trusted users can read the WAL stream, because it is
 836      easy to extract privileged information from it.  Standby servers must
 837      authenticate to the primary as an account that has the
 838      <literal>REPLICATION</literal> privilege or a superuser. It is
 839      recommended to create a dedicated user account with
 840      <literal>REPLICATION</literal> and <literal>LOGIN</literal>
 841      privileges for replication. While <literal>REPLICATION</literal>
 842      privilege gives very high permissions, it does not allow the user to
 843      modify any data on the primary system, which the
 844      <literal>SUPERUSER</literal> privilege does.
 845     </para>
 846
 847     <para>
 848      Client authentication for replication is controlled by a
 849      <filename>pg_hba.conf</filename> record specifying <literal>replication</literal> in the
 850      <replaceable>database</replaceable> field. For example, if the standby is running on
 851      host IP <literal>192.168.1.100</literal> and the account name for replication
 852      is <literal>foo</literal>, the administrator can add the following line to the
 853      <filename>pg_hba.conf</filename> file on the primary:
 854
 855 <programlisting>
 856 # Allow the user "foo" from host 192.168.1.100 to connect to the primary
 857 # as a replication standby if the user's password is correctly supplied.
 858 #
 859 # TYPE  DATABASE        USER            ADDRESS                 METHOD
 860 host    replication     foo             192.168.1.100/32        md5
 861 </programlisting>
 862     </para>
 863     <para>
 864      The host name and port number of the primary, connection user name,
 865      and password are specified in the <xref linkend="guc-primary-conninfo"/>.
 866      The password can also be set in the <filename>~/.pgpass</filename> file on the
 867      standby (specify <literal>replication</literal> in the <replaceable>database</replaceable>
 868      field).
 869      For example, if the primary is running on host IP <literal>192.168.1.50</literal>,
 870      port <literal>5432</literal>, the account name for replication is
 871      <literal>foo</literal>, and the password is <literal>foopass</literal>, the administrator
 872      can add the following line to the <filename>postgresql.conf</filename> file on the
 873      standby:
 874
 875 <programlisting>
 876 # The standby connects to the primary that is running on host 192.168.1.50
 877 # and port 5432 as the user "foo" whose password is "foopass".
 878 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
 879 </programlisting>
 880     </para>
 881    </sect3>
 882
 883    <sect3 id="streaming-replication-monitoring">
 884     <title>Monitoring</title>
 885     <para>
 886      An important health indicator of streaming replication is the amount
 887      of WAL records generated in the primary, but not yet applied in the
 888      standby. You can calculate this lag by comparing the current WAL write
 889      location on the primary with the last WAL location received by the
 890      standby. These locations can be retrieved using
 891      <function>pg_current_wal_lsn</function> on the primary and
 892      <function>pg_last_wal_receive_lsn</function> on the standby,
 893      respectively (see <xref linkend="functions-admin-backup-table"/> and
 894      <xref linkend="functions-recovery-info-table"/> for details).
 895      The last WAL receive location in the standby is also displayed in the
 896      process status of the WAL receiver process, displayed using the
 897      <command>ps</command> command (see <xref linkend="monitoring-ps"/> for details).
 898     </para>
 899     <para>
 900      You can retrieve a list of WAL sender processes via the
 901      <link linkend="monitoring-pg-stat-replication-view"><structname>
 902      pg_stat_replication</structname></link> view. Large differences between
 903      <function>pg_current_wal_lsn</function> and the view's <literal>sent_lsn</literal> field
 904      might indicate that the primary server is under heavy load, while
 905      differences between <literal>sent_lsn</literal> and
 906      <function>pg_last_wal_receive_lsn</function> on the standby might indicate
 907      network delay, or that the standby is under heavy load.
 908     </para>
 909     <para>
 910      On a hot standby, the status of the WAL receiver process can be retrieved
 911      via the <link linkend="monitoring-pg-stat-wal-receiver-view">
 912      <structname>pg_stat_wal_receiver</structname></link> view.  A large
 913      difference between <function>pg_last_wal_replay_lsn</function> and the
 914      view's <literal>flushed_lsn</literal> indicates that WAL is being
 915      received faster than it can be replayed.
 916     </para>
 917    </sect3>
 918   </sect2>
 919
 920   <sect2 id="streaming-replication-slots">
 921    <title>Replication Slots</title>
 922    <indexterm>
 923     <primary>replication slot</primary>
 924     <secondary>streaming replication</secondary>
 925    </indexterm>
 926    <para>
 927     Replication slots provide an automated way to ensure that the
 928     primary server does
 929     not remove WAL segments until they have been received by all standbys,
 930     and that the primary does not remove rows which could cause a
 931     <link linkend="hot-standby-conflict">recovery conflict</link> even when the
 932     standby is disconnected.
 933    </para>
 934    <para>
 935     In lieu of using replication slots, it is possible to prevent the removal
 936     of old WAL segments using <xref linkend="guc-wal-keep-size"/>, or by
 937     storing the segments in an archive using
 938     <xref linkend="guc-archive-command"/> or <xref linkend="guc-archive-library"/>.
 939     A disadvantage of these methods is that they
 940     often result in retaining more WAL segments than
 941     required, whereas replication slots retain only the number of segments
 942     known to be needed.
 943    </para>
 944    <para>
 945     Similarly, <xref linkend="guc-hot-standby-feedback"/> on its own, without
 946     also using a replication slot, provides protection against relevant rows
 947     being removed by vacuum, but provides no protection during any time period
 948     when the standby is not connected.
 949    </para>
 950
 951    <caution>
 952     <para>
 953      Beware that replication slots can cause the server to retain so
 954      many WAL segments that they fill up the space allocated for
 955      <literal>pg_wal</literal>.
 956      <xref linkend="guc-max-slot-wal-keep-size"/> can be used to limit the size
 957      of WAL files retained by replication slots.
 958     </para>
 959    </caution>
 960
 961    <sect3 id="streaming-replication-slots-manipulation">
 962     <title>Querying and Manipulating Replication Slots</title>
 963     <para>
 964      Each replication slot has a name, which can contain lower-case letters,
 965      numbers, and the underscore character.
 966     </para>
 967     <para>
 968      Existing replication slots and their state can be seen in the
 969      <link linkend="view-pg-replication-slots"><structname>pg_replication_slots</structname></link>
 970      view.
 971     </para>
 972     <para>
 973      Slots can be created and dropped either via the streaming replication
 974      protocol (see <xref linkend="protocol-replication"/>) or via SQL
 975      functions (see <xref linkend="functions-replication"/>).
 976     </para>
 977    </sect3>
 978    <sect3 id="streaming-replication-slots-config">
 979     <title>Configuration Example</title>
 980     <para>
 981      You can create a replication slot like this:
 982 <programlisting>
 983 postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot');
 984   slot_name  | lsn
 985 -------------+-----
 986  node_a_slot |
 987
 988 postgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots;
 989   slot_name  | slot_type | active
 990 -------------+-----------+--------
 991  node_a_slot | physical  | f
 992 (1 row)
 993 </programlisting>
 994      To configure the standby to use this slot, <varname>primary_slot_name</varname>
 995      should be configured on the standby. Here is a simple example:
 996 <programlisting>
 997 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
 998 primary_slot_name = 'node_a_slot'
 999 </programlisting>
1000     </para>
1001    </sect3>
1002   </sect2>
1003
1004   <sect2 id="cascading-replication">
1005    <title>Cascading Replication</title>
1006
1007    <indexterm zone="high-availability">
1008     <primary>Cascading Replication</primary>
1009    </indexterm>
1010
1011    <para>
1012     The cascading replication feature allows a standby server to accept replication
1013     connections and stream WAL records to other standbys, acting as a relay.
1014     This can be used to reduce the number of direct connections to the primary
1015     and also to minimize inter-site bandwidth overheads.
1016    </para>
1017
1018    <para>
1019     A standby acting as both a receiver and a sender is known as a cascading
1020     standby.  Standbys that are more directly connected to the primary are known
1021     as upstream servers, while those standby servers further away are downstream
1022     servers.  Cascading replication does not place limits on the number or
1023     arrangement of downstream servers, though each standby connects to only
1024     one upstream server which eventually links to a single primary server.
1025    </para>
1026
1027    <para>
1028     A cascading standby sends not only WAL records received from the
1029     primary but also those restored from the archive. So even if the replication
1030     connection in some upstream connection is terminated, streaming replication
1031     continues downstream for as long as new WAL records are available.
1032    </para>
1033
1034    <para>
1035     Cascading replication is currently asynchronous. Synchronous replication
1036     (see <xref linkend="synchronous-replication"/>) settings have no effect on
1037     cascading replication at present.
1038    </para>
1039
1040    <para>
1041     Hot standby feedback propagates upstream, whatever the cascaded arrangement.
1042    </para>
1043
1044    <para>
1045     If an upstream standby server is promoted to become the new primary, downstream
1046     servers will continue to stream from the new primary if
1047     <varname>recovery_target_timeline</varname> is set to <literal>'latest'</literal> (the default).
1048    </para>
1049
1050    <para>
1051     To use cascading replication, set up the cascading standby so that it can
1052     accept replication connections (that is, set
1053     <xref linkend="guc-max-wal-senders"/> and <xref linkend="guc-hot-standby"/>,
1054     and configure
1055     <link linkend="auth-pg-hba-conf">host-based authentication</link>).
1056     You will also need to set <varname>primary_conninfo</varname> in the downstream
1057     standby to point to the cascading standby.
1058    </para>
1059   </sect2>
1060
1061   <sect2 id="synchronous-replication">
1062    <title>Synchronous Replication</title>
1063
1064    <indexterm zone="high-availability">
1065     <primary>Synchronous Replication</primary>
1066    </indexterm>
1067
1068    <para>
1069     <productname>PostgreSQL</productname> streaming replication is asynchronous by
1070     default. If the primary server
1071     crashes then some transactions that were committed may not have been
1072     replicated to the standby server, causing data loss. The amount
1073     of data loss is proportional to the replication delay at the time of
1074     failover.
1075    </para>
1076
1077    <para>
1078     Synchronous replication offers the ability to confirm that all changes
1079     made by a transaction have been transferred to one or more synchronous
1080     standby servers. This extends that standard level of durability
1081     offered by a transaction commit. This level of protection is referred
1082     to as 2-safe replication in computer science theory, and group-1-safe
1083     (group-safe and 1-safe) when <varname>synchronous_commit</varname> is set to
1084     <literal>remote_write</literal>.
1085    </para>
1086
1087    <para>
1088     When requesting synchronous replication, each commit of a
1089     write transaction will wait until confirmation is
1090     received that the commit has been written to the write-ahead log on disk
1091     of both the primary and standby server. The only possibility that data
1092     can be lost is if both the primary and the standby suffer crashes at the
1093     same time. This can provide a much higher level of durability, though only
1094     if the sysadmin is cautious about the placement and management of the two
1095     servers.  Waiting for confirmation increases the user's confidence that the
1096     changes will not be lost in the event of server crashes but it also
1097     necessarily increases the response time for the requesting transaction.
1098     The minimum wait time is the round-trip time between primary and standby.
1099    </para>
1100
1101    <para>
1102     Read-only transactions and transaction rollbacks need not wait for
1103     replies from standby servers. Subtransaction commits do not wait for
1104     responses from standby servers, only top-level commits. Long
1105     running actions such as data loading or index building do not wait
1106     until the very final commit message. All two-phase commit actions
1107     require commit waits, including both prepare and commit.
1108    </para>
1109
1110    <para>
1111     A synchronous standby can be a physical replication standby or a logical
1112     replication subscriber.  It can also be any other physical or logical WAL
1113     replication stream consumer that knows how to send the appropriate
1114     feedback messages.  Besides the built-in physical and logical replication
1115     systems, this includes special programs such
1116     as <command>pg_receivewal</command> and <command>pg_recvlogical</command>
1117     as well as some third-party replication systems and custom programs.
1118     Check the respective documentation for details on synchronous replication
1119     support.
1120    </para>
1121
1122    <sect3 id="synchronous-replication-config">
1123     <title>Basic Configuration</title>
1124
1125    <para>
1126     Once streaming replication has been configured, configuring synchronous
1127     replication requires only one additional configuration step:
1128     <xref linkend="guc-synchronous-standby-names"/> must be set to
1129     a non-empty value.  <varname>synchronous_commit</varname> must also be set to
1130     <literal>on</literal>, but since this is the default value, typically no change is
1131     required.  (See <xref linkend="runtime-config-wal-settings"/> and
1132     <xref linkend="runtime-config-replication-primary"/>.)
1133     This configuration will cause each commit to wait for
1134     confirmation that the standby has written the commit record to durable
1135     storage.
1136     <varname>synchronous_commit</varname> can be set by individual
1137     users, so it can be configured in the configuration file, for particular
1138     users or databases, or dynamically by applications, in order to control
1139     the durability guarantee on a per-transaction basis.
1140    </para>
1141
1142    <para>
1143     After a commit record has been written to disk on the primary, the
1144     WAL record is then sent to the standby. The standby sends reply
1145     messages each time a new batch of WAL data is written to disk, unless
1146     <varname>wal_receiver_status_interval</varname> is set to zero on the standby.
1147     In the case that <varname>synchronous_commit</varname> is set to
1148     <literal>remote_apply</literal>, the standby sends reply messages when the commit
1149     record is replayed, making the transaction visible.
1150     If the standby is chosen as a synchronous standby, according to the setting
1151     of <varname>synchronous_standby_names</varname> on the primary, the reply
1152     messages from that standby will be considered along with those from other
1153     synchronous standbys to decide when to release transactions waiting for
1154     confirmation that the commit record has been received. These parameters
1155     allow the administrator to specify which standby servers should be
1156     synchronous standbys. Note that the configuration of synchronous
1157     replication is mainly on the primary. Named standbys must be directly
1158     connected to the primary; the primary knows nothing about downstream
1159     standby servers using cascaded replication.
1160    </para>
1161
1162    <para>
1163     Setting <varname>synchronous_commit</varname> to <literal>remote_write</literal> will
1164     cause each commit to wait for confirmation that the standby has received
1165     the commit record and written it out to its own operating system, but not
1166     for the data to be flushed to disk on the standby.  This
1167     setting provides a weaker guarantee of durability than <literal>on</literal>
1168     does: the standby could lose the data in the event of an operating system
1169     crash, though not a <productname>PostgreSQL</productname> crash.
1170     However, it's a useful setting in practice
1171     because it can decrease the response time for the transaction.
1172     Data loss could only occur if both the primary and the standby crash and
1173     the database of the primary gets corrupted at the same time.
1174    </para>
1175
1176    <para>
1177     Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
1178     cause each commit to wait until the current synchronous standbys report
1179     that they have replayed the transaction, making it visible to user
1180     queries.  In simple cases, this allows for load balancing with causal
1181     consistency.
1182    </para>
1183
1184    <para>
1185     Users will stop waiting if a fast shutdown is requested.  However, as
1186     when using asynchronous replication, the server will not fully
1187     shutdown until all outstanding WAL records are transferred to the currently
1188     connected standby servers.
1189    </para>
1190
1191    </sect3>
1192
1193    <sect3 id="synchronous-replication-multiple-standbys">
1194     <title>Multiple Synchronous Standbys</title>
1195
1196    <para>
1197     Synchronous replication supports one or more synchronous standby servers;
1198     transactions will wait until all the standby servers which are considered
1199     as synchronous confirm receipt of their data. The number of synchronous
1200     standbys that transactions must wait for replies from is specified in
1201     <varname>synchronous_standby_names</varname>. This parameter also specifies
1202     a list of standby names and the method (<literal>FIRST</literal> and
1203     <literal>ANY</literal>) to choose synchronous standbys from the listed ones.
1204    </para>
1205    <para>
1206     The method <literal>FIRST</literal> specifies a priority-based synchronous
1207     replication and makes transaction commits wait until their WAL records are
1208     replicated to the requested number of synchronous standbys chosen based on
1209     their priorities. The standbys whose names appear earlier in the list are
1210     given higher priority and will be considered as synchronous. Other standby
1211     servers appearing later in this list represent potential synchronous
1212     standbys. If any of the current synchronous standbys disconnects for
1213     whatever reason, it will be replaced immediately with the
1214     next-highest-priority standby.
1215    </para>
1216    <para>
1217     An example of <varname>synchronous_standby_names</varname> for
1218     a priority-based multiple synchronous standbys is:
1219 <programlisting>
1220 synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
1221 </programlisting>
1222     In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>,
1223     <literal>s3</literal> and <literal>s4</literal> are running, the two standbys
1224     <literal>s1</literal> and <literal>s2</literal> will be chosen as synchronous standbys
1225     because their names appear early in the list of standby names.
1226     <literal>s3</literal> is a potential synchronous standby and will take over
1227     the role of synchronous standby when either of <literal>s1</literal> or
1228     <literal>s2</literal> fails. <literal>s4</literal> is an asynchronous standby since
1229     its name is not in the list.
1230    </para>
1231    <para>
1232     The method <literal>ANY</literal> specifies a quorum-based synchronous
1233     replication and makes transaction commits wait until their WAL records
1234     are replicated to <emphasis>at least</emphasis> the requested number of
1235     synchronous standbys in the list.
1236    </para>
1237    <para>
1238     An example of <varname>synchronous_standby_names</varname> for
1239     a quorum-based multiple synchronous standbys is:
1240 <programlisting>
1241 synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
1242 </programlisting>
1243     In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>,
1244     <literal>s3</literal> and <literal>s4</literal> are running, transaction commits will
1245     wait for replies from at least any two standbys of <literal>s1</literal>,
1246     <literal>s2</literal> and <literal>s3</literal>. <literal>s4</literal> is an asynchronous
1247     standby since its name is not in the list.
1248    </para>
1249    <para>
1250     The synchronous states of standby servers can be viewed using
1251     the <structname>pg_stat_replication</structname> view.
1252    </para>
1253    </sect3>
1254
1255    <sect3 id="synchronous-replication-performance">
1256     <title>Planning for Performance</title>
1257
1258    <para>
1259     Synchronous replication usually requires carefully planned and placed
1260     standby servers to ensure applications perform acceptably. Waiting
1261     doesn't utilize system resources, but transaction locks continue to be
1262     held until the transfer is confirmed. As a result, incautious use of
1263     synchronous replication will reduce performance for database
1264     applications because of increased response times and higher contention.
1265    </para>
1266
1267    <para>
1268     <productname>PostgreSQL</productname> allows the application developer
1269     to specify the durability level required via replication. This can be
1270     specified for the system overall, though it can also be specified for
1271     specific users or connections, or even individual transactions.
1272    </para>
1273
1274    <para>
1275     For example, an application workload might consist of:
1276     10% of changes are important customer details, while
1277     90% of changes are less important data that the business can more
1278     easily survive if it is lost, such as chat messages between users.
1279    </para>
1280
1281    <para>
1282     With synchronous replication options specified at the application level
1283     (on the primary) we can offer synchronous replication for the most
1284     important changes, without slowing down the bulk of the total workload.
1285     Application level options are an important and practical tool for allowing
1286     the benefits of synchronous replication for high performance applications.
1287    </para>
1288
1289    <para>
1290     You should consider that the network bandwidth must be higher than
1291     the rate of generation of WAL data.
1292    </para>
1293
1294    </sect3>
1295
1296    <sect3 id="synchronous-replication-ha">
1297     <title>Planning for High Availability</title>
1298
1299    <para>
1300     <varname>synchronous_standby_names</varname> specifies the number and
1301     names of synchronous standbys that transaction commits made when
1302     <varname>synchronous_commit</varname> is set to <literal>on</literal>,
1303     <literal>remote_apply</literal> or <literal>remote_write</literal> will wait for
1304     responses from. Such transaction commits may never be completed
1305     if any one of the synchronous standbys should crash.
1306    </para>
1307
1308    <para>
1309     The best solution for high availability is to ensure you keep as many
1310     synchronous standbys as requested. This can be achieved by naming multiple
1311     potential synchronous standbys using <varname>synchronous_standby_names</varname>.
1312    </para>
1313
1314    <para>
1315     In a priority-based synchronous replication, the standbys whose names
1316     appear earlier in the list will be used as synchronous standbys.
1317     Standbys listed after these will take over the role of synchronous standby
1318     if one of current ones should fail.
1319    </para>
1320
1321    <para>
1322     In a quorum-based synchronous replication, all the standbys appearing
1323     in the list will be used as candidates for synchronous standbys.
1324     Even if one of them should fail, the other standbys will keep performing
1325     the role of candidates of synchronous standby.
1326    </para>
1327
1328    <para>
1329     When a standby first attaches to the primary, it will not yet be properly
1330     synchronized. This is described as <literal>catchup</literal> mode. Once
1331     the lag between standby and primary reaches zero for the first time
1332     we move to real-time <literal>streaming</literal> state.
1333     The catch-up duration may be long immediately after the standby has
1334     been created. If the standby is shut down, then the catch-up period
1335     will increase according to the length of time the standby has been down.
1336     The standby is only able to become a synchronous standby
1337     once it has reached <literal>streaming</literal> state.
1338     This state can be viewed using
1339     the <structname>pg_stat_replication</structname> view.
1340    </para>
1341
1342    <para>
1343     If primary restarts while commits are waiting for acknowledgment, those
1344     waiting transactions will be marked fully committed once the primary
1345     database recovers.
1346     There is no way to be certain that all standbys have received all
1347     outstanding WAL data at time of the crash of the primary. Some
1348     transactions may not show as committed on the standby, even though
1349     they show as committed on the primary. The guarantee we offer is that
1350     the application will not receive explicit acknowledgment of the
1351     successful commit of a transaction until the WAL data is known to be
1352     safely received by all the synchronous standbys.
1353    </para>
1354
1355    <para>
1356     If you really cannot keep as many synchronous standbys as requested
1357     then you should decrease the number of synchronous standbys that
1358     transaction commits must wait for responses from
1359     in <varname>synchronous_standby_names</varname> (or disable it) and
1360     reload the configuration file on the primary server.
1361    </para>
1362
1363    <para>
1364     If the primary is isolated from remaining standby servers you should
1365     fail over to the best candidate of those other remaining standby servers.
1366    </para>
1367
1368    <para>
1369     If you need to re-create a standby server while transactions are
1370     waiting, make sure that the functions <function>pg_backup_start()</function>
1371     and <function>pg_backup_stop()</function> are run in a session with
1372     <varname>synchronous_commit</varname> = <literal>off</literal>, otherwise those
1373     requests will wait forever for the standby to appear.
1374    </para>
1375
1376    </sect3>
1377   </sect2>
1378
1379   <sect2 id="continuous-archiving-in-standby">
1380    <title>Continuous Archiving in Standby</title>
1381
1382    <indexterm>
1383      <primary>continuous archiving</primary>
1384      <secondary>in standby</secondary>
1385    </indexterm>
1386
1387    <para>
1388      When continuous WAL archiving is used in a standby, there are two
1389      different scenarios: the WAL archive can be shared between the primary
1390      and the standby, or the standby can have its own WAL archive. When
1391      the standby has its own WAL archive, set <varname>archive_mode</varname>
1392      to <literal>always</literal>, and the standby will call the archive
1393      command for every WAL segment it receives, whether it's by restoring
1394      from the archive or by streaming replication. The shared archive can
1395      be handled similarly, but the <varname>archive_command</varname> or <varname>archive_library</varname> must
1396      test if the file being archived exists already, and if the existing file
1397      has identical contents. This requires more care in the
1398      <varname>archive_command</varname> or <varname>archive_library</varname>, as it must
1399      be careful to not overwrite an existing file with different contents,
1400      but return success if the exactly same file is archived twice. And
1401      all that must be done free of race conditions, if two servers attempt
1402      to archive the same file at the same time.
1403    </para>
1404
1405    <para>
1406      If <varname>archive_mode</varname> is set to <literal>on</literal>, the
1407      archiver is not enabled during recovery or standby mode. If the standby
1408      server is promoted, it will start archiving after the promotion, but
1409      will not archive any WAL or timeline history files that
1410      it did not generate itself. To get a complete
1411      series of WAL files in the archive, you must ensure that all WAL is
1412      archived, before it reaches the standby. This is inherently true with
1413      file-based log shipping, as the standby can only restore files that
1414      are found in the archive, but not if streaming replication is enabled.
1415      When a server is not in recovery mode, there is no difference between
1416      <literal>on</literal> and <literal>always</literal> modes.
1417    </para>
1418   </sect2>
1419   </sect1>
1420
1421   <sect1 id="warm-standby-failover">
1422    <title>Failover</title>
1423
1424    <para>
1425     If the primary server fails then the standby server should begin
1426     failover procedures.
1427    </para>
1428
1429    <para>
1430     If the standby server fails then no failover need take place. If the
1431     standby server can be restarted, even some time later, then the recovery
1432     process can also be restarted immediately, taking advantage of
1433     restartable recovery. If the standby server cannot be restarted, then a
1434     full new standby server instance should be created.
1435    </para>
1436
1437    <para>
1438     If the primary server fails and the standby server becomes the
1439     new primary, and then the old primary restarts, you must have
1440     a mechanism for informing the old primary that it is no longer the primary. This is
1441     sometimes known as <acronym>STONITH</acronym> (Shoot The Other Node In The Head), which is
1442     necessary to avoid situations where both systems think they are the
1443     primary, which will lead to confusion and ultimately data loss.
1444    </para>
1445
1446    <para>
1447     Many failover systems use just two systems, the primary and the standby,
1448     connected by some kind of heartbeat mechanism to continually verify the
1449     connectivity between the two and the viability of the primary. It is
1450     also possible to use a third system (called a witness server) to prevent
1451     some cases of inappropriate failover, but the additional complexity
1452     might not be worthwhile unless it is set up with sufficient care and
1453     rigorous testing.
1454    </para>
1455
1456    <para>
1457     <productname>PostgreSQL</productname> does not provide the system
1458     software required to identify a failure on the primary and notify
1459     the standby database server.  Many such tools exist and are well
1460     integrated with the operating system facilities required for
1461     successful failover, such as IP address migration.
1462    </para>
1463
1464    <para>
1465     Once failover to the standby occurs, there is only a
1466     single server in operation. This is known as a degenerate state.
1467     The former standby is now the primary, but the former primary is down
1468     and might stay down.  To return to normal operation, a standby server
1469     must be recreated,
1470     either on the former primary system when it comes up, or on a third,
1471     possibly new, system. The <xref linkend="app-pgrewind"/> utility can be
1472     used to speed up this process on large clusters.
1473     Once complete, the primary and standby can be
1474     considered to have switched roles. Some people choose to use a third
1475     server to provide backup for the new primary until the new standby
1476     server is recreated,
1477     though clearly this complicates the system configuration and
1478     operational processes.
1479    </para>
1480
1481    <para>
1482     So, switching from primary to standby server can be fast but requires
1483     some time to re-prepare the failover cluster. Regular switching from
1484     primary to standby is useful, since it allows regular downtime on
1485     each system for maintenance. This also serves as a test of the
1486     failover mechanism to ensure that it will really work when you need it.
1487     Written administration procedures are advised.
1488    </para>
1489
1490    <para>
1491     If you have opted for logical replication slot synchronization (see
1492     <xref linkend="logicaldecoding-replication-slots-synchronization"/>),
1493     then before switching to the standby server, it is recommended to check
1494     if the logical slots synchronized on the standby server are ready
1495     for failover. This can be done by following the steps described in
1496     <xref linkend="logical-replication-failover"/>.
1497    </para>
1498
1499    <para>
1500     To trigger failover of a log-shipping standby server, run
1501     <command>pg_ctl promote</command> or call <function>pg_promote()</function>.
1502     If you're setting up reporting servers that are only used to offload
1503     read-only queries from the primary, not for high availability purposes,
1504     you don't need to promote.
1505    </para>
1506   </sect1>
1507
1508  <sect1 id="hot-standby">
1509   <title>Hot Standby</title>
1510
1511   <indexterm zone="high-availability">
1512    <primary>hot standby</primary>
1513   </indexterm>
1514
1515    <para>
1516     Hot standby is the term used to describe the ability to connect to
1517     the server and run read-only queries while the server is in archive
1518     recovery or standby mode. This
1519     is useful both for replication purposes and for restoring a backup
1520     to a desired state with great precision.
1521     The term hot standby also refers to the ability of the server to move
1522     from recovery through to normal operation while users continue running
1523     queries and/or keep their connections open.
1524    </para>
1525
1526    <para>
1527     Running queries in hot standby mode is similar to normal query operation,
1528     though there are several usage and administrative differences
1529     explained below.
1530    </para>
1531
1532   <sect2 id="hot-standby-users">
1533    <title>User's Overview</title>
1534
1535    <para>
1536     When the <xref linkend="guc-hot-standby"/> parameter is set to true on a
1537     standby server, it will begin accepting connections once the recovery has
1538     brought the system to a consistent state.  All such connections are
1539     strictly read-only; not even temporary tables may be written.
1540    </para>
1541
1542    <para>
1543     The data on the standby takes some time to arrive from the primary server
1544     so there will be a measurable delay between primary and standby. Running the
1545     same query nearly simultaneously on both primary and standby might therefore
1546     return differing results. We say that data on the standby is
1547     <firstterm>eventually consistent</firstterm> with the primary.  Once the
1548     commit record for a transaction is replayed on the standby, the changes
1549     made by that transaction will be visible to any new snapshots taken on
1550     the standby.  Snapshots may be taken at the start of each query or at the
1551     start of each transaction, depending on the current transaction isolation
1552     level.  For more details, see <xref linkend="transaction-iso"/>.
1553    </para>
1554
1555    <para>
1556     Transactions started during hot standby may issue the following commands:
1557
1558     <itemizedlist>
1559      <listitem>
1560       <para>
1561        Query access: <command>SELECT</command>, <command>COPY TO</command>
1562       </para>
1563      </listitem>
1564      <listitem>
1565       <para>
1566        Cursor commands: <command>DECLARE</command>, <command>FETCH</command>, <command>CLOSE</command>
1567       </para>
1568      </listitem>
1569      <listitem>
1570       <para>
1571        Settings: <command>SHOW</command>, <command>SET</command>, <command>RESET</command>
1572       </para>
1573      </listitem>
1574      <listitem>
1575       <para>
1576        Transaction management commands:
1577         <itemizedlist>
1578          <listitem>
1579           <para>
1580            <command>BEGIN</command>, <command>END</command>, <command>ABORT</command>, <command>START TRANSACTION</command>
1581           </para>
1582          </listitem>
1583          <listitem>
1584           <para>
1585            <command>SAVEPOINT</command>, <command>RELEASE</command>, <command>ROLLBACK TO SAVEPOINT</command>
1586           </para>
1587          </listitem>
1588          <listitem>
1589           <para>
1590            <command>EXCEPTION</command> blocks and other internal subtransactions
1591           </para>
1592          </listitem>
1593         </itemizedlist>
1594       </para>
1595      </listitem>
1596      <listitem>
1597       <para>
1598        <command>LOCK TABLE</command>, though only when explicitly in one of these modes:
1599        <literal>ACCESS SHARE</literal>, <literal>ROW SHARE</literal> or <literal>ROW EXCLUSIVE</literal>.
1600       </para>
1601      </listitem>
1602      <listitem>
1603       <para>
1604        Plans and resources: <command>PREPARE</command>, <command>EXECUTE</command>,
1605        <command>DEALLOCATE</command>, <command>DISCARD</command>
1606       </para>
1607      </listitem>
1608      <listitem>
1609       <para>
1610        Plugins and extensions: <command>LOAD</command>
1611       </para>
1612      </listitem>
1613      <listitem>
1614       <para>
1615        <command>UNLISTEN</command>
1616       </para>
1617      </listitem>
1618     </itemizedlist>
1619    </para>
1620
1621    <para>
1622     Transactions started during hot standby will never be assigned a
1623     transaction ID and cannot write to the system write-ahead log.
1624     Therefore, the following actions will produce error messages:
1625
1626     <itemizedlist>
1627      <listitem>
1628       <para>
1629        Data Manipulation Language (DML): <command>INSERT</command>,
1630        <command>UPDATE</command>, <command>DELETE</command>,
1631        <command>MERGE</command>, <command>COPY FROM</command>,
1632        <command>TRUNCATE</command>.
1633        Note that there are no allowed actions that result in a trigger
1634        being executed during recovery.  This restriction applies even to
1635        temporary tables, because table rows cannot be read or written without
1636        assigning a transaction ID, which is currently not possible in a
1637        hot standby environment.
1638       </para>
1639      </listitem>
1640      <listitem>
1641       <para>
1642        Data Definition Language (DDL): <command>CREATE</command>,
1643        <command>DROP</command>, <command>ALTER</command>, <command>COMMENT</command>.
1644        This restriction applies even to temporary tables, because carrying
1645        out these operations would require updating the system catalog tables.
1646       </para>
1647      </listitem>
1648      <listitem>
1649       <para>
1650        <command>SELECT ... FOR SHARE | UPDATE</command>, because row locks cannot be
1651        taken without updating the underlying data files.
1652       </para>
1653      </listitem>
1654      <listitem>
1655       <para>
1656        Rules on <command>SELECT</command> statements that generate DML commands.
1657       </para>
1658      </listitem>
1659      <listitem>
1660       <para>
1661        <command>LOCK</command> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</literal>.
1662       </para>
1663      </listitem>
1664      <listitem>
1665       <para>
1666        <command>LOCK</command> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</literal>.
1667       </para>
1668      </listitem>
1669      <listitem>
1670       <para>
1671        Transaction management commands that explicitly set non-read-only state:
1672         <itemizedlist>
1673          <listitem>
1674           <para>
1675             <command>BEGIN READ WRITE</command>,
1676             <command>START TRANSACTION READ WRITE</command>
1677           </para>
1678          </listitem>
1679          <listitem>
1680           <para>
1681             <command>SET TRANSACTION READ WRITE</command>,
1682             <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</command>
1683           </para>
1684          </listitem>
1685          <listitem>
1686           <para>
1687            <command>SET transaction_read_only = off</command>
1688           </para>
1689          </listitem>
1690         </itemizedlist>
1691       </para>
1692      </listitem>
1693      <listitem>
1694       <para>
1695        Two-phase commit commands: <command>PREPARE TRANSACTION</command>,
1696        <command>COMMIT PREPARED</command>, <command>ROLLBACK PREPARED</command>
1697        because even read-only transactions need to write WAL in the
1698        prepare phase (the first phase of two phase commit).
1699       </para>
1700      </listitem>
1701      <listitem>
1702       <para>
1703        Sequence updates: <function>nextval()</function>, <function>setval()</function>
1704       </para>
1705      </listitem>
1706      <listitem>
1707       <para>
1708        <command>LISTEN</command>, <command>NOTIFY</command>
1709       </para>
1710      </listitem>
1711     </itemizedlist>
1712    </para>
1713
1714    <para>
1715     In normal operation, <quote>read-only</quote> transactions are allowed to
1716     use <command>LISTEN</command> and <command>NOTIFY</command>,
1717     so hot standby sessions operate under slightly tighter
1718     restrictions than ordinary read-only sessions.  It is possible that some
1719     of these restrictions might be loosened in a future release.
1720    </para>
1721
1722    <para>
1723     During hot standby, the parameter <varname>transaction_read_only</varname> is always
1724     true and may not be changed.  But as long as no attempt is made to modify
1725     the database, connections during hot standby will act much like any other
1726     database connection.  If failover or switchover occurs, the database will
1727     switch to normal processing mode.  Sessions will remain connected while the
1728     server changes mode.  Once hot standby finishes, it will be possible to
1729     initiate read-write transactions (even from a session begun during
1730     hot standby).
1731    </para>
1732
1733    <para>
1734     Users can determine whether hot standby is currently active for their
1735     session by issuing <command>SHOW in_hot_standby</command>.
1736     (In server versions before 14, the <varname>in_hot_standby</varname>
1737     parameter did not exist; a workable substitute method for older servers
1738     is <command>SHOW transaction_read_only</command>.)  In addition, a set of
1739     functions (<xref linkend="functions-recovery-info-table"/>) allow users to
1740     access information about the standby server. These allow you to write
1741     programs that are aware of the current state of the database. These
1742     can be used to monitor the progress of recovery, or to allow you to
1743     write complex programs that restore the database to particular states.
1744    </para>
1745   </sect2>
1746
1747   <sect2 id="hot-standby-conflict">
1748    <title>Handling Query Conflicts</title>
1749
1750    <para>
1751     The primary and standby servers are in many ways loosely connected. Actions
1752     on the primary will have an effect on the standby. As a result, there is
1753     potential for negative interactions or conflicts between them. The easiest
1754     conflict to understand is performance: if a huge data load is taking place
1755     on the primary then this will generate a similar stream of WAL records on the
1756     standby, so standby queries may contend for system resources, such as I/O.
1757    </para>
1758
1759    <para>
1760     There are also additional types of conflict that can occur with hot standby.
1761     These conflicts are <emphasis>hard conflicts</emphasis> in the sense that queries
1762     might need to be canceled and, in some cases, sessions disconnected to resolve them.
1763     The user is provided with several ways to handle these
1764     conflicts. Conflict cases include:
1765
1766       <itemizedlist>
1767        <listitem>
1768         <para>
1769          Access Exclusive locks taken on the primary server, including both
1770          explicit <command>LOCK</command> commands and various <acronym>DDL</acronym>
1771          actions, conflict with table accesses in standby queries.
1772         </para>
1773        </listitem>
1774        <listitem>
1775         <para>
1776          Dropping a tablespace on the primary conflicts with standby queries
1777          using that tablespace for temporary work files.
1778         </para>
1779        </listitem>
1780        <listitem>
1781         <para>
1782          Dropping a database on the primary conflicts with sessions connected
1783          to that database on the standby.
1784         </para>
1785        </listitem>
1786        <listitem>
1787         <para>
1788          Application of a vacuum cleanup record from WAL conflicts with
1789          standby transactions whose snapshots can still <quote>see</quote> any of
1790          the rows to be removed.
1791         </para>
1792        </listitem>
1793        <listitem>
1794         <para>
1795          Application of a vacuum cleanup record from WAL conflicts with
1796          queries accessing the target page on the standby, whether or not
1797          the data to be removed is visible.
1798         </para>
1799        </listitem>
1800       </itemizedlist>
1801    </para>
1802
1803    <para>
1804     On the primary server, these cases simply result in waiting; and the
1805     user might choose to cancel either of the conflicting actions.  However,
1806     on the standby there is no choice: the WAL-logged action already occurred
1807     on the primary so the standby must not fail to apply it.  Furthermore,
1808     allowing WAL application to wait indefinitely may be very undesirable,
1809     because the standby's state will become increasingly far behind the
1810     primary's.  Therefore, a mechanism is provided to forcibly cancel standby
1811     queries that conflict with to-be-applied WAL records.
1812    </para>
1813
1814    <para>
1815     An example of the problem situation is an administrator on the primary
1816     server running <command>DROP TABLE</command> on a table that is currently being
1817     queried on the standby server.  Clearly the standby query cannot continue
1818     if the <command>DROP TABLE</command> is applied on the standby. If this situation
1819     occurred on the primary, the <command>DROP TABLE</command> would wait until the
1820     other query had finished. But when <command>DROP TABLE</command> is run on the
1821     primary, the primary doesn't have information about what queries are
1822     running on the standby, so it will not wait for any such standby
1823     queries. The WAL change records come through to the standby while the
1824     standby query is still running, causing a conflict.  The standby server
1825     must either delay application of the WAL records (and everything after
1826     them, too) or else cancel the conflicting query so that the <command>DROP
1827     TABLE</command> can be applied.
1828    </para>
1829
1830    <para>
1831     When a conflicting query is short, it's typically desirable to allow it to
1832     complete by delaying WAL application for a little bit; but a long delay in
1833     WAL application is usually not desirable.  So the cancel mechanism has
1834     parameters, <xref linkend="guc-max-standby-archive-delay"/> and <xref
1835     linkend="guc-max-standby-streaming-delay"/>, that define the maximum
1836     allowed delay in WAL application.  Conflicting queries will be canceled
1837     once it has taken longer than the relevant delay setting to apply any
1838     newly-received WAL data.  There are two parameters so that different delay
1839     values can be specified for the case of reading WAL data from an archive
1840     (i.e., initial recovery from a base backup or <quote>catching up</quote> a
1841     standby server that has fallen far behind) versus reading WAL data via
1842     streaming replication.
1843    </para>
1844
1845    <para>
1846     In a standby server that exists primarily for high availability, it's
1847     best to set the delay parameters relatively short, so that the server
1848     cannot fall far behind the primary due to delays caused by standby
1849     queries.  However, if the standby server is meant for executing
1850     long-running queries, then a high or even infinite delay value may be
1851     preferable.  Keep in mind however that a long-running query could
1852     cause other sessions on the standby server to not see recent changes
1853     on the primary, if it delays application of WAL records.
1854    </para>
1855
1856    <para>
1857     Once the delay specified by <varname>max_standby_archive_delay</varname> or
1858     <varname>max_standby_streaming_delay</varname> has been exceeded, conflicting
1859     queries will be canceled.  This usually results just in a cancellation
1860     error, although in the case of replaying a <command>DROP DATABASE</command>
1861     the entire conflicting session will be terminated.  Also, if the conflict
1862     is over a lock held by an idle transaction, the conflicting session is
1863     terminated (this behavior might change in the future).
1864    </para>
1865
1866    <para>
1867     Canceled queries may be retried immediately (after beginning a new
1868     transaction, of course).  Since query cancellation depends on
1869     the nature of the WAL records being replayed, a query that was
1870     canceled may well succeed if it is executed again.
1871    </para>
1872
1873    <para>
1874     Keep in mind that the delay parameters are compared to the elapsed time
1875     since the WAL data was received by the standby server.  Thus, the grace
1876     period allowed to any one query on the standby is never more than the
1877     delay parameter, and could be considerably less if the standby has already
1878     fallen behind as a result of waiting for previous queries to complete, or
1879     as a result of being unable to keep up with a heavy update load.
1880    </para>
1881
1882    <para>
1883     The most common reason for conflict between standby queries and WAL replay
1884     is <quote>early cleanup</quote>.  Normally, <productname>PostgreSQL</productname> allows
1885     cleanup of old row versions when there are no transactions that need to
1886     see them to ensure correct visibility of data according to MVCC rules.
1887     However, this rule can only be applied for transactions executing on the
1888     primary.  So it is possible that cleanup on the primary will remove row
1889     versions that are still visible to a transaction on the standby.
1890    </para>
1891
1892    <para>
1893     Row version cleanup isn't the only potential cause of conflicts with
1894     standby queries.  All index-only scans (including those that run on
1895     standbys) must use an <acronym>MVCC</acronym> snapshot that
1896     <quote>agrees</quote> with the visibility map.  Conflicts are therefore
1897     required whenever <command>VACUUM</command> <link
1898      linkend="vacuum-for-visibility-map">sets a page as all-visible in the
1899      visibility map</link> containing one or more rows
1900     <emphasis>not</emphasis> visible to all standby queries.  So even running
1901     <command>VACUUM</command> against a table with no updated or deleted rows
1902     requiring cleanup might lead to conflicts.
1903    </para>
1904
1905    <para>
1906     Users should be clear that tables that are regularly and heavily updated
1907     on the primary server will quickly cause cancellation of longer running
1908     queries on the standby. In such cases the setting of a finite value for
1909     <varname>max_standby_archive_delay</varname> or
1910     <varname>max_standby_streaming_delay</varname> can be considered similar to
1911     setting <varname>statement_timeout</varname>.
1912    </para>
1913
1914    <para>
1915     Remedial possibilities exist if the number of standby-query cancellations
1916     is found to be unacceptable.  The first option is to set the parameter
1917     <varname>hot_standby_feedback</varname>, which prevents <command>VACUUM</command> from
1918     removing recently-dead rows and so cleanup conflicts do not occur.
1919     If you do this, you
1920     should note that this will delay cleanup of dead rows on the primary,
1921     which may result in undesirable table bloat. However, the cleanup
1922     situation will be no worse than if the standby queries were running
1923     directly on the primary server, and you are still getting the benefit of
1924     off-loading execution onto the standby.
1925     If standby servers connect and disconnect frequently, you
1926     might want to make adjustments to handle the period when
1927     <varname>hot_standby_feedback</varname> feedback is not being provided.
1928     For example, consider increasing <varname>max_standby_archive_delay</varname>
1929     so that queries are not rapidly canceled by conflicts in WAL archive
1930     files during disconnected periods.  You should also consider increasing
1931     <varname>max_standby_streaming_delay</varname> to avoid rapid cancellations
1932     by newly-arrived streaming WAL entries after reconnection.
1933    </para>
1934
1935    <para>
1936     The number of query cancels and the reason for them can be viewed using
1937     the <structname>pg_stat_database_conflicts</structname> system view on the standby
1938     server. The <structname>pg_stat_database</structname> system view also contains
1939     summary information.
1940    </para>
1941
1942    <para>
1943     Users can control whether a log message is produced when WAL replay is waiting
1944     longer than <varname>deadlock_timeout</varname> for conflicts. This
1945     is controlled by the <xref linkend="guc-log-recovery-conflict-waits"/> parameter.
1946    </para>
1947   </sect2>
1948
1949   <sect2 id="hot-standby-admin">
1950    <title>Administrator's Overview</title>
1951
1952    <para>
1953     If <varname>hot_standby</varname> is <literal>on</literal> in <filename>postgresql.conf</filename>
1954     (the default value) and there is a
1955     <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary><secondary>for hot standby</secondary></indexterm>
1956     file present, the server will run in hot standby mode.
1957     However, it may take some time for hot standby connections to be allowed,
1958     because the server will not accept connections until it has completed
1959     sufficient recovery to provide a consistent state against which queries
1960     can run.  During this period,
1961     clients that attempt to connect will be refused with an error message.
1962     To confirm the server has come up, either loop trying to connect from
1963     the application, or look for these messages in the server logs:
1964
1965 <programlisting>
1966 LOG:  entering standby mode
1967
1968 ... then some time later ...
1969
1970 LOG:  consistent recovery state reached
1971 LOG:  database system is ready to accept read-only connections
1972 </programlisting>
1973
1974     Consistency information is recorded once per checkpoint on the primary.
1975     It is not possible to enable hot standby when reading WAL
1976     written during a period when <varname>wal_level</varname> was not set to
1977     <literal>replica</literal> or <literal>logical</literal> on the primary.  Reaching
1978     a consistent state can also be delayed in the presence of both of these
1979     conditions:
1980
1981       <itemizedlist>
1982        <listitem>
1983         <para>
1984          A write transaction has more than 64 subtransactions
1985         </para>
1986        </listitem>
1987        <listitem>
1988         <para>
1989          Very long-lived write transactions
1990         </para>
1991        </listitem>
1992       </itemizedlist>
1993
1994     If you are running file-based log shipping ("warm standby"), you might need
1995     to wait until the next WAL file arrives, which could be as long as the
1996     <varname>archive_timeout</varname> setting on the primary.
1997    </para>
1998
1999    <para>
2000     The settings of some parameters determine the size of shared memory for
2001     tracking transaction IDs, locks, and prepared transactions.  These shared
2002     memory structures must be no smaller on a standby than on the primary in
2003     order to ensure that the standby does not run out of shared memory during
2004     recovery.  For example, if the primary had used a prepared transaction but
2005     the standby had not allocated any shared memory for tracking prepared
2006     transactions, then recovery could not continue until the standby's
2007     configuration is changed.  The parameters affected are:
2008
2009       <itemizedlist>
2010        <listitem>
2011         <para>
2012          <varname>max_connections</varname>
2013         </para>
2014        </listitem>
2015        <listitem>
2016         <para>
2017          <varname>max_prepared_transactions</varname>
2018         </para>
2019        </listitem>
2020        <listitem>
2021         <para>
2022          <varname>max_locks_per_transaction</varname>
2023         </para>
2024        </listitem>
2025        <listitem>
2026         <para>
2027          <varname>max_wal_senders</varname>
2028         </para>
2029        </listitem>
2030        <listitem>
2031         <para>
2032          <varname>max_worker_processes</varname>
2033         </para>
2034        </listitem>
2035       </itemizedlist>
2036
2037     The easiest way to ensure this does not become a problem is to have these
2038     parameters set on the standbys to values equal to or greater than on the
2039     primary.  Therefore, if you want to increase these values, you should do
2040     so on all standby servers first, before applying the changes to the
2041     primary server.  Conversely, if you want to decrease these values, you
2042     should do so on the primary server first, before applying the changes to
2043     all standby servers.  Keep in mind that when a standby is promoted, it
2044     becomes the new reference for the required parameter settings for the
2045     standbys that follow it.  Therefore, to avoid this becoming a problem
2046     during a switchover or failover, it is recommended to keep these settings
2047     the same on all standby servers.
2048    </para>
2049
2050    <para>
2051     The WAL tracks changes to these parameters on the
2052     primary.  If a hot standby processes WAL that indicates that the current
2053     value on the primary is higher than its own value, it will log a warning
2054     and pause recovery, for example:
2055 <screen>
2056 WARNING:  hot standby is not possible because of insufficient parameter settings
2057 DETAIL:  max_connections = 80 is a lower setting than on the primary server, where its value was 100.
2058 LOG:  recovery has paused
2059 DETAIL:  If recovery is unpaused, the server will shut down.
2060 HINT:  You can then restart the server after making the necessary configuration changes.
2061 </screen>
2062     At that point, the settings on the standby need to be updated and the
2063     instance restarted before recovery can continue.  If the standby is not a
2064     hot standby, then when it encounters the incompatible parameter change, it
2065     will shut down immediately without pausing, since there is then no value
2066     in keeping it up.
2067    </para>
2068
2069    <para>
2070     It is important that the administrator select appropriate settings for
2071     <xref linkend="guc-max-standby-archive-delay"/> and <xref
2072     linkend="guc-max-standby-streaming-delay"/>.  The best choices vary
2073     depending on business priorities.  For example if the server is primarily
2074     tasked as a High Availability server, then you will want low delay
2075     settings, perhaps even zero, though that is a very aggressive setting. If
2076     the standby server is tasked as an additional server for decision support
2077     queries then it might be acceptable to set the maximum delay values to
2078     many hours, or even -1 which means wait forever for queries to complete.
2079    </para>
2080
2081    <para>
2082     Transaction status "hint bits" written on the primary are not WAL-logged,
2083     so data on the standby will likely re-write the hints again on the standby.
2084     Thus, the standby server will still perform disk writes even though
2085     all users are read-only; no changes occur to the data values
2086     themselves.  Users will still write large sort temporary files and
2087     re-generate relcache info files, so no part of the database
2088     is truly read-only during hot standby mode.
2089     Note also that writes to remote databases using
2090     <application>dblink</application> module, and other operations outside the
2091     database using PL functions will still be possible, even though the
2092     transaction is read-only locally.
2093    </para>
2094
2095    <para>
2096     The following types of administration commands are not accepted
2097     during recovery mode:
2098
2099       <itemizedlist>
2100        <listitem>
2101         <para>
2102          Data Definition Language (DDL): e.g., <command>CREATE INDEX</command>
2103         </para>
2104        </listitem>
2105        <listitem>
2106         <para>
2107          Privilege and Ownership: <command>GRANT</command>, <command>REVOKE</command>,
2108          <command>REASSIGN</command>
2109         </para>
2110        </listitem>
2111        <listitem>
2112         <para>
2113          Maintenance commands: <command>ANALYZE</command>, <command>VACUUM</command>,
2114          <command>CLUSTER</command>, <command>REINDEX</command>
2115         </para>
2116        </listitem>
2117       </itemizedlist>
2118    </para>
2119
2120    <para>
2121     Again, note that some of these commands are actually allowed during
2122     "read only" mode transactions on the primary.
2123    </para>
2124
2125    <para>
2126     As a result, you cannot create additional indexes that exist solely
2127     on the standby, nor statistics that exist solely on the standby.
2128     If these administration commands are needed, they should be executed
2129     on the primary, and eventually those changes will propagate to the
2130     standby.
2131    </para>
2132
2133    <para>
2134     <function>pg_cancel_backend()</function>
2135     and <function>pg_terminate_backend()</function> will work on user backends,
2136     but not the startup process, which performs
2137     recovery. <structname>pg_stat_activity</structname> does not show
2138     recovering transactions as active. As a result,
2139     <structname>pg_prepared_xacts</structname> is always empty during
2140     recovery. If you wish to resolve in-doubt prepared transactions, view
2141     <literal>pg_prepared_xacts</literal> on the primary and issue commands to
2142     resolve transactions there or resolve them after the end of recovery.
2143    </para>
2144
2145    <para>
2146     <structname>pg_locks</structname> will show locks held by backends,
2147     as normal. <structname>pg_locks</structname> also shows
2148     a virtual transaction managed by the startup process that owns all
2149     <literal>AccessExclusiveLocks</literal> held by transactions being replayed by recovery.
2150     Note that the startup process does not acquire locks to
2151     make database changes, and thus locks other than <literal>AccessExclusiveLocks</literal>
2152     do not show in <structname>pg_locks</structname> for the Startup
2153     process; they are just presumed to exist.
2154    </para>
2155
2156    <para>
2157     The <productname>Nagios</productname> plugin <productname>check_pgsql</productname> will
2158     work, because the simple information it checks for exists.
2159     The <productname>check_postgres</productname> monitoring script will also work,
2160     though some reported values could give different or confusing results.
2161     For example, last vacuum time will not be maintained, since no
2162     vacuum occurs on the standby.  Vacuums running on the primary
2163     do still send their changes to the standby.
2164    </para>
2165
2166    <para>
2167     WAL file control commands will not work during recovery,
2168     e.g., <function>pg_backup_start</function>, <function>pg_switch_wal</function> etc.
2169    </para>
2170
2171    <para>
2172     Dynamically loadable modules work, including <structname>pg_stat_statements</structname>.
2173    </para>
2174
2175    <para>
2176     Advisory locks work normally in recovery, including deadlock detection.
2177     Note that advisory locks are never WAL logged, so it is impossible for
2178     an advisory lock on either the primary or the standby to conflict with WAL
2179     replay. Nor is it possible to acquire an advisory lock on the primary
2180     and have it initiate a similar advisory lock on the standby. Advisory
2181     locks relate only to the server on which they are acquired.
2182    </para>
2183
2184    <para>
2185     Trigger-based replication systems such as <productname>Slony</productname>,
2186     <productname>Londiste</productname> and <productname>Bucardo</productname> won't run on the
2187     standby at all, though they will run happily on the primary server as
2188     long as the changes are not sent to standby servers to be applied.
2189     WAL replay is not trigger-based so you cannot relay from the
2190     standby to any system that requires additional database writes or
2191     relies on the use of triggers.
2192    </para>
2193
2194    <para>
2195     New OIDs cannot be assigned, though some <acronym>UUID</acronym> generators may still
2196     work as long as they do not rely on writing new status to the database.
2197    </para>
2198
2199    <para>
2200     Currently, temporary table creation is not allowed during read-only
2201     transactions, so in some cases existing scripts will not run correctly.
2202     This restriction might be relaxed in a later release. This is
2203     both an SQL standard compliance issue and a technical issue.
2204    </para>
2205
2206    <para>
2207     <command>DROP TABLESPACE</command> can only succeed if the tablespace is empty.
2208     Some standby users may be actively using the tablespace via their
2209     <varname>temp_tablespaces</varname> parameter. If there are temporary files in the
2210     tablespace, all active queries are canceled to ensure that temporary
2211     files are removed, so the tablespace can be removed and WAL replay
2212     can continue.
2213    </para>
2214
2215    <para>
2216     Running <command>DROP DATABASE</command> or <command>ALTER DATABASE ... SET
2217     TABLESPACE</command> on the primary
2218     will generate a WAL entry that will cause all users connected to that
2219     database on the standby to be forcibly disconnected. This action occurs
2220     immediately, whatever the setting of
2221     <varname>max_standby_streaming_delay</varname>. Note that
2222     <command>ALTER DATABASE ... RENAME</command> does not disconnect users, which
2223     in most cases will go unnoticed, though might in some cases cause a
2224     program confusion if it depends in some way upon database name.
2225    </para>
2226
2227    <para>
2228     In normal (non-recovery) mode, if you issue <command>DROP USER</command> or <command>DROP ROLE</command>
2229     for a role with login capability while that user is still connected then
2230     nothing happens to the connected user &mdash; they remain connected. The user cannot
2231     reconnect however. This behavior applies in recovery also, so a
2232     <command>DROP USER</command> on the primary does not disconnect that user on the standby.
2233    </para>
2234
2235    <para>
2236     The cumulative statistics system is active during recovery. All scans,
2237     reads, blocks, index usage, etc., will be recorded normally on the
2238     standby. However, WAL replay will not increment relation and database
2239     specific counters. I.e. replay will not increment
2240     <structname>pg_stat_all_tables</structname> columns (like <structfield>n_tup_ins</structfield>),
2241     nor will reads or writes performed by the startup process be tracked in the
2242     <structname>pg_statio_</structname> views, nor will associated
2243     <structname>pg_stat_database</structname> columns be incremented.
2244    </para>
2245
2246    <para>
2247     Autovacuum is not active during recovery.  It will start normally at the
2248     end of recovery.
2249    </para>
2250
2251    <para>
2252     The checkpointer process and the background writer process are active during
2253     recovery. The checkpointer process will perform restartpoints (similar to
2254     checkpoints on the primary) and the background writer process will perform
2255     normal block cleaning activities. This can include updates of the hint bit
2256     information stored on the standby server.
2257     The <command>CHECKPOINT</command> command is accepted during recovery,
2258     though it performs a restartpoint rather than a new checkpoint.
2259    </para>
2260   </sect2>
2261
2262   <sect2 id="hot-standby-parameters">
2263    <title>Hot Standby Parameter Reference</title>
2264
2265    <para>
2266     Various parameters have been mentioned above in
2267     <xref linkend="hot-standby-conflict"/> and
2268     <xref linkend="hot-standby-admin"/>.
2269    </para>
2270
2271    <para>
2272     On the primary, the <xref linkend="guc-wal-level"/> parameter can be used.
2273     <xref linkend="guc-max-standby-archive-delay"/> and
2274     <xref linkend="guc-max-standby-streaming-delay"/> have no effect if set on
2275     the primary.
2276    </para>
2277
2278    <para>
2279     On the standby, parameters <xref linkend="guc-hot-standby"/>,
2280     <xref linkend="guc-max-standby-archive-delay"/> and
2281     <xref linkend="guc-max-standby-streaming-delay"/> can be used.
2282    </para>
2283   </sect2>
2284
2285   <sect2 id="hot-standby-caveats">
2286    <title>Caveats</title>
2287
2288    <para>
2289     There are several limitations of hot standby.
2290     These can and probably will be fixed in future releases:
2291
2292   <itemizedlist>
2293    <listitem>
2294     <para>
2295      Full knowledge of running transactions is required before snapshots
2296      can be taken. Transactions that use large numbers of subtransactions
2297      (currently greater than 64) will delay the start of read-only
2298      connections until the completion of the longest running write transaction.
2299      If this situation occurs, explanatory messages will be sent to the server log.
2300     </para>
2301    </listitem>
2302    <listitem>
2303     <para>
2304      Valid starting points for standby queries are generated at each
2305      checkpoint on the primary. If the standby is shut down while the primary
2306      is in a shutdown state, it might not be possible to re-enter hot standby
2307      until the primary is started up, so that it generates further starting
2308      points in the WAL logs.  This situation isn't a problem in the most
2309      common situations where it might happen. Generally, if the primary is
2310      shut down and not available anymore, that's likely due to a serious
2311      failure that requires the standby being converted to operate as
2312      the new primary anyway.  And in situations where the primary is
2313      being intentionally taken down, coordinating to make sure the standby
2314      becomes the new primary smoothly is also standard procedure.
2315     </para>
2316    </listitem>
2317    <listitem>
2318     <para>
2319      At the end of recovery, <literal>AccessExclusiveLocks</literal> held by prepared transactions
2320      will require twice the normal number of lock table entries. If you plan
2321      on running either a large number of concurrent prepared transactions
2322      that normally take <literal>AccessExclusiveLocks</literal>, or you plan on having one
2323      large transaction that takes many <literal>AccessExclusiveLocks</literal>, you are
2324      advised to select a larger value of <varname>max_locks_per_transaction</varname>,
2325      perhaps as much as twice the value of the parameter on
2326      the primary server. You need not consider this at all if
2327      your setting of <varname>max_prepared_transactions</varname> is 0.
2328     </para>
2329    </listitem>
2330    <listitem>
2331     <para>
2332      The Serializable transaction isolation level is not yet available in hot
2333      standby.  (See <xref linkend="xact-serializable"/> and
2334      <xref linkend="serializable-consistency"/> for details.)
2335      An attempt to set a transaction to the serializable isolation level in
2336      hot standby mode will generate an error.
2337     </para>
2338    </listitem>
2339   </itemizedlist>
2340
2341    </para>
2342   </sect2>
2343
2344  </sect1>
2345
2346 </chapter>