<!-- $PostgreSQL$ -->

<chapter id="high-availability">
 <title>High Availability, Load Balancing, and Replication</title>

 <indexterm><primary>high availability</></>
 <indexterm><primary>failover</></>
 <indexterm><primary>replication</></>
 <indexterm><primary>load balancing</></>
 <indexterm><primary>clustering</></>
 <indexterm><primary>data partitioning</></>
 <para>
  Database servers can work together to allow a second server to
  take over quickly if the primary server fails (high
  availability), or to allow several computers to serve the same
  data (load balancing). Ideally, database servers could work
  together seamlessly. Web servers serving static web pages can
  be combined quite easily by merely load-balancing web requests
  to multiple machines. In fact, read-only database servers can
  be combined relatively easily too. Unfortunately, most database
  servers have a read/write mix of requests, and read/write servers
  are much harder to combine. This is because, although read-only
  data needs to be placed on each server only once, a write to any
  server has to be propagated to all servers so that future read
  requests to those servers return consistent results.
 </para>
 <para>
  This synchronization problem is the fundamental difficulty for
  servers working together. Because there is no single solution
  that eliminates the impact of the sync problem for all use cases,
  there are multiple solutions. Each solution addresses this
  problem in a different way, and minimizes its impact for a specific
  workload.
 </para>

 <para>
  Some solutions deal with synchronization by allowing only one
  server to modify the data. Servers that can modify data are
  called read/write or "master" servers. Servers that can reply
  to read-only queries are called "slave" servers. Servers that
  cannot be accessed until they are changed to master servers are
  called "standby" servers.
 </para>
 <para>
  Some solutions are synchronous,
  meaning that a data-modifying transaction is not considered
  committed until all servers have committed the transaction. This
  guarantees that a failover will not lose any data and that all
  load-balanced servers will return consistent results no matter
  which server is queried. In contrast, asynchronous solutions allow some
  delay between the time of a commit and its propagation to the other servers,
  opening the possibility that some transactions might be lost in
  the switch to a backup server, and that load-balanced servers
  might return slightly stale results. Asynchronous communication
  is used when synchronous communication would be too slow.
 </para>
 <para>
  Solutions can also be categorized by their granularity. Some solutions
  can deal only with an entire database server, while others allow control
  at the per-table or per-database level.
 </para>

 <para>
  Performance must be considered in any choice. There is usually a
  trade-off between functionality and
  performance. For example, a fully synchronous solution over a slow
  network might cut performance by more than half, while an asynchronous
  one might have a minimal performance impact.
 </para>

 <para>
  The remainder of this section outlines various failover, replication,
  and load balancing solutions. A <ulink
  url="http://www.postgres-r.org/documentation/terms">glossary</ulink> is
  also available.
 </para>

 <variablelist>

  <varlistentry>
   <term>Shared Disk Failover</term>
   <listitem>
    <para>
     Shared disk failover avoids synchronization overhead by having only one
     copy of the database. It uses a single disk array that is shared by
     multiple servers. If the main database server fails, the standby server
     is able to mount and start the database as though it were recovering from
     a database crash. This allows rapid failover with no data loss.
    </para>

    <para>
     Shared hardware functionality is common in network storage devices.
     Using a network file system is also possible, though care must be
     taken that the file system has full <acronym>POSIX</> behavior (see <xref
     linkend="creating-cluster-nfs">). One significant limitation of this
     method is that if the shared disk array fails or becomes corrupt, the
     primary and standby servers are both nonfunctional. Another issue is
     that the standby server should never access the shared storage while
     the primary server is running.
    </para>

   </listitem>
  </varlistentry>
  <varlistentry>
   <term>File System (Block-Device) Replication</term>
   <listitem>

    <para>
     A modified version of shared hardware functionality is file system
     replication, where all changes to a file system are mirrored to a file
     system residing on another computer. The only restriction is that
     the mirroring must be done in a way that ensures the standby server
     has a consistent copy of the file system &mdash; specifically, writes
     to the standby must be done in the same order as those on the master.
     <productname>DRBD</> is a popular file system replication solution
     for Linux.
    </para>
<!--
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

Oracle RAC is a shared disk approach and just send cache invalidations
to other nodes but not actual data. As the disk is shared, data is
only committed once to disk and there is a distributed locking
protocol to make nodes agree on a serializable transactional order.
-->

   </listitem>
  </varlistentry>
  <varlistentry>
   <term>Warm Standby Using Point-In-Time Recovery (<acronym>PITR</>)</term>
   <listitem>

    <para>
     A warm standby server (see <xref linkend="warm-standby">) can
     be kept current by reading a stream of write-ahead log (<acronym>WAL</>)
     records. If the main server fails, the warm standby contains
     almost all of the data of the main server, and can be quickly
     made the new master database server. This is asynchronous and
     can only be done for the entire database server.
    </para>
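    <para>
     As an illustrative sketch only (the archive directory below is an
     assumption, not a recommendation), the primary can copy each completed
     WAL segment to a location the standby can read, and the standby can
     restore segments continuously with the <application>pg_standby</>
     contrib program:
    </para>

<programlisting>
# postgresql.conf on the primary: ship each finished WAL segment to the archive
archive_mode = on
archive_command = 'cp %p /mnt/server/archivedir/%f'

# recovery.conf on the warm standby: wait for and restore each segment
restore_command = 'pg_standby /mnt/server/archivedir %f %p %r'
</programlisting>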
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Master-Slave Replication</term>
   <listitem>

    <para>
     A master-slave replication setup sends all data modification
     queries to the master server. The master server asynchronously
     sends data changes to the slave server. The slave can answer
     read-only queries while the master server is running. The
     slave server is ideal for data warehouse queries.
    </para>
    <para>
     <productname>Slony-I</> is an example of this type of replication, with per-table
     granularity, and support for multiple slaves. Because it
     updates the slave server asynchronously (in batches), there is
     possible data loss during failover.
    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
   <term>Statement-Based Replication Middleware</term>
   <listitem>

    <para>
     With statement-based replication middleware, a program intercepts
     every SQL query and sends it to one or all servers. Each server
     operates independently. Read-write queries are sent to all servers,
     while read-only queries can be sent to just one server, allowing
     the read workload to be distributed.
    </para>
    <para>
     If queries are simply broadcast unmodified, functions like
     <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
     sequences would have different values on different servers.
     This is because each server operates independently, and because
     SQL queries are broadcast (and not actual modified rows). If
     this is unacceptable, either the middleware or the application
     must query such values from a single server and then use those
     values in write queries. Also, care must be taken that all
     transactions either commit or abort on all servers, perhaps
     using two-phase commit (<xref linkend="sql-prepare-transaction"
     endterm="sql-prepare-transaction-title"> and <xref
     linkend="sql-commit-prepared" endterm="sql-commit-prepared-title">).
     <productname>Pgpool-II</> and <productname>Sequoia</> are examples of
     this type of replication.
    </para>
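    <para>
     As an illustration of handling non-deterministic values (the table and
     values here are hypothetical), the middleware or application can
     evaluate the expression once on a single designated server and then
     broadcast a statement containing the resulting literal, so every
     server stores the same value:
    </para>

<programlisting>
-- Step 1: run on one designated server only, to obtain the value
SELECT now() AS order_time;   -- returns, say, '2009-05-01 12:34:56'

-- Step 2: broadcast to all servers with the literal substituted in,
-- so the stored timestamp is identical everywhere
INSERT INTO orders (order_id, placed_at)
    VALUES (1001, '2009-05-01 12:34:56');
</programlisting>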
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Asynchronous Multimaster Replication</term>
   <listitem>

    <para>
     For servers that are not regularly connected, like laptops or
     remote servers, keeping data consistent among servers is a
     challenge. Using asynchronous multimaster replication, each
     server works independently, and periodically communicates with
     the other servers to identify conflicting transactions. The
     conflicts can be resolved by users or conflict resolution rules.
     Bucardo is an example of this type of replication.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Synchronous Multimaster Replication</term>
   <listitem>

    <para>
     In synchronous multimaster replication, each server can accept
     write requests, and modified data is transmitted from the
     original server to every other server before each transaction
     commits. Heavy write activity can cause excessive locking,
     leading to poor performance. In fact, write performance is
     often worse than that of a single server. Read requests can
     be sent to any server. Some implementations use shared disk
     to reduce the communication overhead. Synchronous multimaster
     replication is best for mostly read workloads, though its big
     advantage is that any server can accept write requests &mdash;
     there is no need to partition workloads between master and
     slave servers, and because the data changes are sent from one
     server to another, there is no problem with non-deterministic
     functions like <function>random()</>.
    </para>

    <para>
     <productname>PostgreSQL</> does not offer this type of replication,
     though <productname>PostgreSQL</> two-phase commit (<xref
     linkend="sql-prepare-transaction"
     endterm="sql-prepare-transaction-title"> and <xref
     linkend="sql-commit-prepared" endterm="sql-commit-prepared-title">)
     can be used to implement this in application code or middleware.
    </para>
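    <para>
     A minimal sketch of that approach, using a hypothetical
     <structname>accounts</> table and transaction identifier (each
     participating server must have <varname>max_prepared_transactions</>
     set to a nonzero value):
    </para>

<programlisting>
-- Phase 1: issue the same change on every server, then prepare it
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 1;
PREPARE TRANSACTION 'xfer_1001';

-- Phase 2: once every server has prepared successfully, commit everywhere
COMMIT PREPARED 'xfer_1001';

-- If any server failed to prepare, roll back on the others instead:
-- ROLLBACK PREPARED 'xfer_1001';
</programlisting>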
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Commercial Solutions</term>
   <listitem>

    <para>
     Because <productname>PostgreSQL</> is open source and easily
     extended, a number of companies have taken <productname>PostgreSQL</>
     and created commercial closed-source solutions with unique
     failover, replication, and load balancing capabilities.
    </para>
   </listitem>
  </varlistentry>

 </variablelist>

 <para>
  <xref linkend="high-availability-matrix"> summarizes
  the capabilities of the various solutions listed above.
 </para>
 <table id="high-availability-matrix">
  <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
  <tgroup cols="8">
   <thead>
    <row>
     <entry>Feature</entry>
     <entry>Shared Disk Failover</entry>
     <entry>File System Replication</entry>
     <entry>Warm Standby Using PITR</entry>
     <entry>Master-Slave Replication</entry>
     <entry>Statement-Based Replication Middleware</entry>
     <entry>Asynchronous Multimaster Replication</entry>
     <entry>Synchronous Multimaster Replication</entry>
    </row>
   </thead>

   <tbody>

    <row>
     <entry>Most Common Implementation</entry>
     <entry align="center">NAS</entry>
     <entry align="center">DRBD</entry>
     <entry align="center">PITR</entry>
     <entry align="center">Slony</entry>
     <entry align="center">pgpool-II</entry>
     <entry align="center">Bucardo</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Communication Method</entry>
     <entry align="center">shared disk</entry>
     <entry align="center">disk blocks</entry>
     <entry align="center">WAL</entry>
     <entry align="center">table rows</entry>
     <entry align="center">SQL</entry>
     <entry align="center">table rows</entry>
     <entry align="center">table rows and row locks</entry>
    </row>

    <row>
     <entry>No special hardware required</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Allows multiple master servers</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No master server overhead</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>No waiting for multiple servers</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Master failure will never lose data</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Slaves accept read-only queries</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Per-table granularity</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No conflict resolution necessary</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

   </tbody>
  </tgroup>
 </table>
 <para>
  There are a few solutions that do not fit into the above categories:
 </para>

 <variablelist>

  <varlistentry>
   <term>Data Partitioning</term>
   <listitem>

    <para>
     Data partitioning splits tables into data sets. Each set can
     be modified by only one server. For example, data can be
     partitioned by offices, e.g., London and Paris, with a server
     in each office. If queries combining London and Paris data
     are necessary, an application can query both servers, or
     master/slave replication can be used to keep a read-only copy
     of the other office's data on each server.
    </para>
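    <para>
     A minimal sketch of such a layout (all object names are hypothetical):
     the London server is the only writer for its own rows, while a
     replicated, read-only copy of the Paris rows lets it answer combined
     queries locally:
    </para>

<programlisting>
-- On the London server: the locally writable data set
CREATE TABLE orders_london (
    order_id  serial PRIMARY KEY,
    office    text NOT NULL CHECK (office = 'London'),
    amount    numeric
);

-- orders_paris_replica is assumed to be a read-only copy maintained by,
-- e.g., master/slave replication from the Paris server
CREATE VIEW orders_all AS
    SELECT * FROM orders_london
    UNION ALL
    SELECT * FROM orders_paris_replica;
</programlisting>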
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Multiple-Server Parallel Query Execution</term>
   <listitem>

    <para>
     Many of the above solutions allow multiple servers to handle multiple
     queries, but none allow a single query to use multiple servers to
     complete faster. This solution allows multiple servers to work
     concurrently on a single query. It is usually accomplished by
     splitting the data among servers and having each server execute its
     part of the query and return results to a central server where they
     are combined and returned to the user. <productname>Pgpool-II</>
     has this capability. Also, this can be implemented using the
     <productname>PL/Proxy</> toolset.
    </para>
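    <para>
     As a rough sketch of the <productname>PL/Proxy</> approach (the cluster
     and table names are assumptions, and the cluster is presumed to be
     configured already), a proxy function can run the same query on every
     partition and let the central server combine the partial results:
    </para>

<programlisting>
-- Runs the embedded query on every partition of the cluster and
-- returns one partial count per partition
CREATE FUNCTION count_orders_per_partition() RETURNS SETOF bigint AS $$
    CLUSTER 'ordercluster';
    RUN ON ALL;
    SELECT count(*) FROM orders;
$$ LANGUAGE plproxy;

-- The central server combines the partial results
SELECT sum(partial) AS total_orders
FROM count_orders_per_partition() AS partial;
</programlisting>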
   </listitem>
  </varlistentry>

 </variablelist>

</chapter>