<!-- $PostgreSQL$ -->

<chapter id="high-availability">
 <title>High Availability, Load Balancing, and Replication</title>

 <indexterm><primary>high availability</></>
 <indexterm><primary>failover</></>
 <indexterm><primary>replication</></>
 <indexterm><primary>load balancing</></>
 <indexterm><primary>clustering</></>
 <indexterm><primary>data partitioning</></>
 <para>
  Database servers can work together to allow a second server to
  take over quickly if the primary server fails (high
  availability), or to allow several computers to serve the same
  data (load balancing). Ideally, database servers could work
  together seamlessly. Web servers serving static web pages can
  be combined quite easily by merely load-balancing web requests
  to multiple machines. In fact, read-only database servers can
  be combined relatively easily too. Unfortunately, most database
  servers have a read/write mix of requests, and read/write servers
  are much harder to combine. This is because, although read-only
  data needs to be placed on each server only once, a write to any
  server has to be propagated to all servers so that future read
  requests to those servers return consistent results.
 </para>
 <para>
  This synchronization problem is the fundamental difficulty for
  servers working together. Because there is no single solution
  that eliminates the impact of the sync problem for all use cases,
  there are multiple solutions. Each solution addresses this
  problem in a different way, and minimizes its impact for a specific
  workload.
 </para>

 <para>
  Some solutions deal with synchronization by allowing only one
  server to modify the data. Servers that can modify data are
  called read/write or "master" servers. Servers that can reply
  to read-only queries are called "slave" servers. Servers that
  cannot be accessed until they are changed to master servers are
  called "standby" servers.
 </para>
 <para>
  Some solutions are synchronous,
  meaning that a data-modifying transaction is not considered
  committed until all servers have committed the transaction. This
  guarantees that a failover will not lose any data and that all
  load-balanced servers will return consistent results no matter
  which server is queried. In contrast, asynchronous solutions allow some
  delay between the time of a commit and its propagation to the other servers,
  opening the possibility that some transactions might be lost in
  the switch to a backup server, and that load-balanced servers
  might return slightly stale results. Asynchronous communication
  is used when synchronous communication would be too slow.
 </para>
 <para>
  Solutions can also be categorized by their granularity. Some solutions
  can deal only with an entire database server, while others allow control
  at the per-table or per-database level.
 </para>

 <para>
  Performance must be considered in any choice. There is usually a
  trade-off between functionality and
  performance. For example, a fully synchronous solution over a slow
  network might cut performance by more than half, while an asynchronous
  one might have a minimal performance impact.
 </para>

 <para>
  The remainder of this section outlines various failover, replication,
  and load balancing solutions. A <ulink
  url="http://www.postgres-r.org/documentation/terms">glossary</ulink> is
  also available.
 </para>

 <variablelist>

  <varlistentry>
   <term>Shared Disk Failover</term>
   <listitem>
    <para>
     Shared disk failover avoids synchronization overhead by having only one
     copy of the database. It uses a single disk array that is shared by
     multiple servers. If the main database server fails, the standby server
     is able to mount and start the database as though it were recovering from
     a database crash. This allows rapid failover with no data loss.
    </para>

    <para>
     Shared hardware functionality is common in network storage devices.
     Using a network file system is also possible, though care must be
     taken that the file system has full <acronym>POSIX</> behavior (see <xref
     linkend="creating-cluster-nfs">). One significant limitation of this
     method is that if the shared disk array fails or becomes corrupt, the
     primary and standby servers are both nonfunctional. Another issue is
     that the standby server should never access the shared storage while
     the primary server is running.
    </para>

   </listitem>
  </varlistentry>
  <varlistentry>
   <term>File System (Block-Device) Replication</term>
   <listitem>

    <para>
     A modified version of shared hardware functionality is file system
     replication, where all changes to a file system are mirrored to a file
     system residing on another computer. The only restriction is that
     the mirroring must be done in a way that ensures the standby server
     has a consistent copy of the file system &mdash; specifically, writes
     to the standby must be done in the same order as those on the master.
     <productname>DRBD</> is a popular file system replication solution
     for Linux.
    </para>
<!--
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

Oracle RAC is a shared disk approach and just send cache invalidations
to other nodes but not actual data. As the disk is shared, data is
only committed once to disk and there is a distributed locking
protocol to make nodes agree on a serializable transactional order.
-->

   </listitem>
  </varlistentry>
  <varlistentry>
   <term>Warm Standby Using Point-In-Time Recovery (<acronym>PITR</>)</term>
   <listitem>

    <para>
     A warm standby server (see <xref linkend="warm-standby">) can
     be kept current by reading a stream of write-ahead log (<acronym>WAL</>)
     records. If the main server fails, the warm standby contains
     almost all of the data of the main server, and can be quickly
     made the new master database server. This is asynchronous and
     can only be done for the entire database server.
    </para>
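    <para>
     As an illustrative sketch only (the archive directory below is an
     assumption, not a recommendation), the primary can copy each completed
     WAL segment to a location the standby can read, and the standby can
     restore segments continuously with the <application>pg_standby</>
     contrib program:
    </para>

<programlisting>
# postgresql.conf on the primary: ship each finished WAL segment to the archive
archive_mode = on
archive_command = 'cp %p /mnt/server/archivedir/%f'

# recovery.conf on the warm standby: wait for and restore each segment
restore_command = 'pg_standby /mnt/server/archivedir %f %p %r'
</programlisting>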
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Master-Slave Replication</term>
   <listitem>

    <para>
     A master-slave replication setup sends all data modification
     queries to the master server. The master server asynchronously
     sends data changes to the slave server. The slave can answer
     read-only queries while the master server is running. The
     slave server is ideal for data warehouse queries.
    </para>
    <para>
     <productname>Slony-I</> is an example of this type of replication, with per-table
     granularity, and support for multiple slaves. Because it
     updates the slave server asynchronously (in batches), there is
     possible data loss during failover.
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term>Statement-Based Replication Middleware</term>
   <listitem>

    <para>
     With statement-based replication middleware, a program intercepts
     every SQL query and sends it to one or all servers. Each server
     operates independently. Read-write queries are sent to all servers,
     while read-only queries can be sent to just one server, allowing
     the read workload to be distributed.
    </para>
    <para>
     If queries are simply broadcast unmodified, functions like
     <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
     sequences would have different values on different servers.
     This is because each server operates independently, and because
     SQL queries are broadcast (and not actual modified rows). If
     this is unacceptable, either the middleware or the application
     must query such values from a single server and then use those
     values in write queries. Also, care must be taken that all
     transactions either commit or abort on all servers, perhaps
     using two-phase commit (<xref linkend="sql-prepare-transaction"
     endterm="sql-prepare-transaction-title"> and <xref
     linkend="sql-commit-prepared" endterm="sql-commit-prepared-title">).
     <productname>Pgpool-II</> and <productname>Sequoia</> are examples of
     this type of replication.
    </para>
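    <para>
     As an illustration of handling non-deterministic values (the table and
     values here are hypothetical), the middleware or application can
     evaluate the expression once on a single designated server and then
     broadcast a statement containing the resulting literal, so every
     server stores the same value:
    </para>

<programlisting>
-- Step 1: run on one designated server only, to obtain the value
SELECT now() AS order_time;   -- returns, say, '2009-05-01 12:34:56'

-- Step 2: broadcast to all servers with the literal substituted in,
-- so the stored timestamp is identical everywhere
INSERT INTO orders (order_id, placed_at)
    VALUES (1001, '2009-05-01 12:34:56');
</programlisting>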
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Asynchronous Multimaster Replication</term>
   <listitem>

    <para>
     For servers that are not regularly connected, like laptops or
     remote servers, keeping data consistent among servers is a
     challenge. Using asynchronous multimaster replication, each
     server works independently, and periodically communicates with
     the other servers to identify conflicting transactions. The
     conflicts can be resolved by users or conflict resolution rules.
     Bucardo is an example of this type of replication.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Synchronous Multimaster Replication</term>
   <listitem>

    <para>
     In synchronous multimaster replication, each server can accept
     write requests, and modified data is transmitted from the
     original server to every other server before each transaction
     commits. Heavy write activity can cause excessive locking,
     leading to poor performance. In fact, write performance is
     often worse than that of a single server. Read requests can
     be sent to any server. Some implementations use shared disk
     to reduce the communication overhead. Synchronous multimaster
     replication is best for mostly read workloads, though its big
     advantage is that any server can accept write requests &mdash;
     there is no need to partition workloads between master and
     slave servers, and because the data changes are sent from one
     server to another, there is no problem with non-deterministic
     functions like <function>random()</>.
    </para>

    <para>
     <productname>PostgreSQL</> does not offer this type of replication,
     though <productname>PostgreSQL</> two-phase commit (<xref
     linkend="sql-prepare-transaction"
     endterm="sql-prepare-transaction-title"> and <xref
     linkend="sql-commit-prepared" endterm="sql-commit-prepared-title">)
     can be used to implement this in application code or middleware.
    </para>
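    <para>
     A minimal sketch of that approach, using a hypothetical
     <structname>accounts</> table and transaction identifier (each
     participating server must have <varname>max_prepared_transactions</>
     set to a nonzero value):
    </para>

<programlisting>
-- Phase 1: issue the same change on every server, then prepare it
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 1;
PREPARE TRANSACTION 'xfer_1001';

-- Phase 2: once every server has prepared successfully, commit everywhere
COMMIT PREPARED 'xfer_1001';

-- If any server failed to prepare, roll back on the others instead:
-- ROLLBACK PREPARED 'xfer_1001';
</programlisting>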
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Commercial Solutions</term>
   <listitem>

    <para>
     Because <productname>PostgreSQL</> is open source and easily
     extended, a number of companies have taken <productname>PostgreSQL</>
     and created commercial closed-source solutions with unique
     failover, replication, and load balancing capabilities.
    </para>
   </listitem>
  </varlistentry>

 </variablelist>

 <para>
  <xref linkend="high-availability-matrix"> summarizes
  the capabilities of the various solutions listed above.
 </para>
 <table id="high-availability-matrix">
  <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
  <tgroup cols="8">
   <thead>
    <row>
     <entry>Feature</entry>
     <entry>Shared Disk Failover</entry>
     <entry>File System Replication</entry>
     <entry>Warm Standby Using PITR</entry>
     <entry>Master-Slave Replication</entry>
     <entry>Statement-Based Replication Middleware</entry>
     <entry>Asynchronous Multimaster Replication</entry>
     <entry>Synchronous Multimaster Replication</entry>
    </row>
   </thead>

   <tbody>

    <row>
     <entry>Most Common Implementation</entry>
     <entry align="center">NAS</entry>
     <entry align="center">DRBD</entry>
     <entry align="center">PITR</entry>
     <entry align="center">Slony</entry>
     <entry align="center">pgpool-II</entry>
     <entry align="center">Bucardo</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Communication Method</entry>
     <entry align="center">shared disk</entry>
     <entry align="center">disk blocks</entry>
     <entry align="center">WAL</entry>
     <entry align="center">table rows</entry>
     <entry align="center">SQL</entry>
     <entry align="center">table rows</entry>
     <entry align="center">table rows and row locks</entry>
    </row>

    <row>
     <entry>No special hardware required</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Allows multiple master servers</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No master server overhead</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>No waiting for multiple servers</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Master failure will never lose data</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Slaves accept read-only queries</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Per-table granularity</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No conflict resolution necessary</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

   </tbody>
  </tgroup>
 </table>
 <para>
  There are a few solutions that do not fit into the above categories:
 </para>

 <variablelist>

  <varlistentry>
   <term>Data Partitioning</term>
   <listitem>

    <para>
     Data partitioning splits tables into data sets. Each set can
     be modified by only one server. For example, data can be
     partitioned by offices, e.g., London and Paris, with a server
     in each office. If queries combining London and Paris data
     are necessary, an application can query both servers, or
     master/slave replication can be used to keep a read-only copy
     of the other office's data on each server.
    </para>
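    <para>
     A minimal sketch of such a layout (all object names are hypothetical):
     the London server is the only writer for its own rows, while a
     replicated, read-only copy of the Paris rows lets it answer combined
     queries locally:
    </para>

<programlisting>
-- On the London server: the locally writable data set
CREATE TABLE orders_london (
    order_id  serial PRIMARY KEY,
    office    text NOT NULL CHECK (office = 'London'),
    amount    numeric
);

-- orders_paris_replica is assumed to be a read-only copy maintained by,
-- e.g., master/slave replication from the Paris server
CREATE VIEW orders_all AS
    SELECT * FROM orders_london
    UNION ALL
    SELECT * FROM orders_paris_replica;
</programlisting>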
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Multiple-Server Parallel Query Execution</term>
   <listitem>

    <para>
     Many of the above solutions allow multiple servers to handle multiple
     queries, but none allow a single query to use multiple servers to
     complete faster. This solution allows multiple servers to work
     concurrently on a single query. It is usually accomplished by
     splitting the data among servers and having each server execute its
     part of the query and return results to a central server where they
     are combined and returned to the user. <productname>Pgpool-II</>
     has this capability. Also, this can be implemented using the
     <productname>PL/Proxy</> toolset.
    </para>
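    <para>
     As a rough sketch of the <productname>PL/Proxy</> approach (the cluster
     and table names are assumptions, and the cluster is presumed to be
     configured already), a proxy function can run the same query on every
     partition and let the central server combine the partial results:
    </para>

<programlisting>
-- Runs the embedded query on every partition of the cluster and
-- returns one partial count per partition
CREATE FUNCTION count_orders_per_partition() RETURNS SETOF bigint AS $$
    CLUSTER 'ordercluster';
    RUN ON ALL;
    SELECT count(*) FROM orders;
$$ LANGUAGE plproxy;

-- The central server combines the partial results
SELECT sum(partial) AS total_orders
FROM count_orders_per_partition() AS partial;
</programlisting>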
   </listitem>
  </varlistentry>

 </variablelist>

</chapter>