doc/src/sgml/perform.sgml

   1 <!-- doc/src/sgml/perform.sgml -->
   2
   3  <chapter id="performance-tips">
   4   <title>Performance Tips</title>
   5
   6   <indexterm zone="performance-tips">
   7    <primary>performance</primary>
   8   </indexterm>
   9
  10   <para>
  11    Query performance can be affected by many things. Some of these can
  12    be controlled by the user, while others are fundamental to the underlying
  13    design of the system.  This chapter provides some hints about understanding
  14    and tuning <productname>PostgreSQL</productname> performance.
  15   </para>
  16
  17  <sect1 id="using-explain">
  18   <title>Using <command>EXPLAIN</command></title>
  19
  20    <indexterm zone="using-explain">
  21     <primary>EXPLAIN</primary>
  22    </indexterm>
  23
  24    <indexterm zone="using-explain">
  25     <primary>query plan</primary>
  26    </indexterm>
  27
  28    <para>
  29     <productname>PostgreSQL</productname> devises a <firstterm>query
  30     plan</firstterm> for each query it receives.  Choosing the right
  31     plan to match the query structure and the properties of the data
  32     is absolutely critical for good performance, so the system includes
  33     a complex <firstterm>planner</firstterm> that tries to choose good plans.
  34     You can use the <link linkend="sql-explain"><command>EXPLAIN</command></link> command
  35     to see what query plan the planner creates for any query.
  36     Plan-reading is an art that requires some experience to master,
  37     but this section attempts to cover the basics.
  38    </para>
  39
  40    <para>
  41     Examples in this section are drawn from the regression test database
  42     after doing a <command>VACUUM ANALYZE</command>, using v17 development sources.
  43     You should be able to get similar results if you try the examples
  44     yourself, but your estimated costs and row counts might vary slightly
  45     because <command>ANALYZE</command>'s statistics are random samples rather
  46     than exact, and because costs are inherently somewhat platform-dependent.
  47    </para>
  48
  49    <para>
  50     The examples use <command>EXPLAIN</command>'s default <quote>text</quote> output
  51     format, which is compact and convenient for humans to read.
  52     If you want to feed <command>EXPLAIN</command>'s output to a program for further
  53     analysis, you should use one of its machine-readable output formats
  54     (XML, JSON, or YAML) instead.
  55    </para>
  56
  57   <sect2 id="using-explain-basics">
  58    <title><command>EXPLAIN</command> Basics</title>
  59
  60    <para>
  61     The structure of a query plan is a tree of <firstterm>plan nodes</firstterm>.
  62     Nodes at the bottom level of the tree are scan nodes: they return raw rows
  63     from a table.  There are different types of scan nodes for different
  64     table access methods: sequential scans, index scans, and bitmap index
  65     scans.  There are also non-table row sources, such as <literal>VALUES</literal>
  66     clauses and set-returning functions in <literal>FROM</literal>, which have their
  67     own scan node types.
  68     If the query requires joining, aggregation, sorting, or other
  69     operations on the raw rows, then there will be additional nodes
  70     above the scan nodes to perform these operations.  Again,
  71     there is usually more than one possible way to do these operations,
  72     so different node types can appear here too.  The output
  73     of <command>EXPLAIN</command> has one line for each node in the plan
  74     tree, showing the basic node type plus the cost estimates that the planner
  75     made for the execution of that plan node.  Additional lines might appear,
  76     indented from the node's summary line,
  77     to show additional properties of the node.
  78     The very first line (the summary line for the topmost
  79     node) has the estimated total execution cost for the plan; it is this
  80     number that the planner seeks to minimize.
  81    </para>
  82
  83    <para>
  84     Here is a trivial example, just to show what the output looks like:
  85
  86 <screen>
  87 EXPLAIN SELECT * FROM tenk1;
  88
  89                          QUERY PLAN
  90 -------------------------------------------------------------
  91  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
  92 </screen>
  93    </para>
  94
  95    <para>
  96     Since this query has no <literal>WHERE</literal> clause, it must scan all the
  97     rows of the table, so the planner has chosen to use a simple sequential
  98     scan plan.  The numbers that are quoted in parentheses are (left
  99     to right):
 100
 101     <itemizedlist>
 102      <listitem>
 103       <para>
 104        Estimated start-up cost.  This is the time expended before the output
 105        phase can begin, e.g., time to do the sorting in a sort node.
 106       </para>
 107      </listitem>
 108
 109      <listitem>
 110       <para>
 111        Estimated total cost.  This is stated on the assumption that the plan
 112        node is run to completion, i.e., all available rows are retrieved.
 113        In practice a node's parent node might stop short of reading all
 114        available rows (see the <literal>LIMIT</literal> example below).
 115       </para>
 116      </listitem>
 117
 118      <listitem>
 119       <para>
 120        Estimated number of rows output by this plan node.  Again, the node
 121        is assumed to be run to completion.
 122       </para>
 123      </listitem>
 124
 125      <listitem>
 126       <para>
 127        Estimated average width of rows output by this plan node (in bytes).
 128       </para>
 129      </listitem>
 130     </itemizedlist>
 131    </para>
 132
 133    <para>
 134     The costs are measured in arbitrary units determined by the planner's
 135     cost parameters (see <xref linkend="runtime-config-query-constants"/>).
 136     Traditional practice is to measure the costs in units of disk page
 137     fetches; that is, <xref linkend="guc-seq-page-cost"/> is conventionally
 138     set to <literal>1.0</literal> and the other cost parameters are set relative
 139     to that.  The examples in this section are run with the default cost
 140     parameters.
 141    </para>
 142
 143    <para>
 144     It's important to understand that the cost of an upper-level node includes
 145     the cost of all its child nodes.  It's also important to realize that
 146     the cost only reflects things that the planner cares about.
 147     In particular, the cost does not consider the time spent to convert
 148     output values to text form or to transmit them to the client, which
 149     could be important factors in the real elapsed time; but the planner
 150     ignores those costs because it cannot change them by altering the
 151     plan.  (Every correct plan will output the same row set, we trust.)
 152    </para>
 153
 154    <para>
 155     The <literal>rows</literal> value is a little tricky because it is
 156     not the number of rows processed or scanned by the
 157     plan node, but rather the number emitted by the node.  This is often
 158     less than the number scanned, as a result of filtering by any
 159     <literal>WHERE</literal>-clause conditions that are being applied at the node.
 160     Ideally the top-level rows estimate will approximate the number of rows
 161     actually returned, updated, or deleted by the query.
 162    </para>
 163
 164    <para>
 165     Returning to our example:
 166
 167 <screen>
 168 EXPLAIN SELECT * FROM tenk1;
 169
 170                          QUERY PLAN
 171 -------------------------------------------------------------
 172  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
 173 </screen>
 174    </para>
 175
 176    <para>
 177     These numbers are derived very straightforwardly.  If you do:
 178
 179 <programlisting>
 180 SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
 181 </programlisting>
 182
 183     you will find that <classname>tenk1</classname> has 345 disk
 184     pages and 10000 rows.  The estimated cost is computed as (disk pages read *
 185     <xref linkend="guc-seq-page-cost"/>) + (rows scanned *
 186     <xref linkend="guc-cpu-tuple-cost"/>).  By default,
 187     <varname>seq_page_cost</varname> is 1.0 and <varname>cpu_tuple_cost</varname> is 0.01,
 188     so the estimated cost is (345 * 1.0) + (10000 * 0.01) = 445.
 189    </para>
 190
 191    <para>
 192     Now let's modify the query to add a <literal>WHERE</literal> condition:
 193
 194 <screen>
 195 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 7000;
 196
 197                          QUERY PLAN
 198 ------------------------------------------------------------
 199  Seq Scan on tenk1  (cost=0.00..470.00 rows=7000 width=244)
 200    Filter: (unique1 &lt; 7000)
 201 </screen>
 202
 203     Notice that the <command>EXPLAIN</command> output shows the <literal>WHERE</literal>
 204     clause being applied as a <quote>filter</quote> condition attached to the Seq
 205     Scan plan node.  This means that
 206     the plan node checks the condition for each row it scans, and outputs
 207     only the ones that pass the condition.
 208     The estimate of output rows has been reduced because of the
 209     <literal>WHERE</literal> clause.
 210     However, the scan will still have to visit all 10000 rows, so the cost
 211     hasn't decreased; in fact it has gone up a bit (by 10000 * <xref
 212     linkend="guc-cpu-operator-cost"/>, to be exact) to reflect the extra CPU
 213     time spent checking the <literal>WHERE</literal> condition.
 214    </para>
 215
 216    <para>
 217     The actual number of rows this query would select is 7000, but the <literal>rows</literal>
 218     estimate is only approximate.  If you try to duplicate this experiment,
 219     you may well get a slightly different estimate; moreover, it can
 220     change after each <command>ANALYZE</command> command, because the
 221     statistics produced by <command>ANALYZE</command> are taken from a
 222     randomized sample of the table.
 223    </para>
 224
 225    <para>
 226     Now, let's make the condition more restrictive:
 227
 228 <screen>
 229 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100;
 230
 231                                   QUERY PLAN
 232 -------------------------------------------------------------------&zwsp;-----------
 233  Bitmap Heap Scan on tenk1  (cost=5.06..224.98 rows=100 width=244)
 234    Recheck Cond: (unique1 &lt; 100)
 235    -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=100 width=0)
 236          Index Cond: (unique1 &lt; 100)
 237 </screen>
 238
 239     Here the planner has decided to use a two-step plan: the child plan
 240     node visits an index to find the locations of rows matching the index
 241     condition, and then the upper plan node actually fetches those rows
 242     from the table itself.  Fetching rows separately is much more
 243     expensive than reading them sequentially, but because not all the pages
 244     of the table have to be visited, this is still cheaper than a sequential
 245     scan.  (The reason for using two plan levels is that the upper plan
 246     node sorts the row locations identified by the index into physical order
 247     before reading them, to minimize the cost of separate fetches.
 248     The <quote>bitmap</quote> mentioned in the node names is the mechanism that
 249     does the sorting.)
 250    </para>
 251
 252    <para>
 253     Now let's add another condition to the <literal>WHERE</literal> clause:
 254
 255 <screen>
 256 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND stringu1 = 'xxx';
 257
 258                                   QUERY PLAN
 259 -------------------------------------------------------------------&zwsp;-----------
 260  Bitmap Heap Scan on tenk1  (cost=5.04..225.20 rows=1 width=244)
 261    Recheck Cond: (unique1 &lt; 100)
 262    Filter: (stringu1 = 'xxx'::name)
 263    -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=100 width=0)
 264          Index Cond: (unique1 &lt; 100)
 265 </screen>
 266
 267     The added condition <literal>stringu1 = 'xxx'</literal> reduces the
 268     output row count estimate, but not the cost because we still have to visit
 269     the same set of rows.  That's because the <literal>stringu1</literal> clause
 270     cannot be applied as an index condition, since this index is only on
 271     the <literal>unique1</literal> column.  Instead it is applied as a filter on
 272     the rows retrieved using the index.  Thus the cost has actually gone up
 273     slightly to reflect this extra checking.
 274    </para>
 275
 276    <para>
 277     In some cases the planner will prefer a <quote>simple</quote> index scan plan:
 278
 279 <screen>
 280 EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
 281
 282                                  QUERY PLAN
 283 -------------------------------------------------------------------&zwsp;----------
 284  Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244)
 285    Index Cond: (unique1 = 42)
 286 </screen>
 287
 288     In this type of plan the table rows are fetched in index order, which
 289     makes them even more expensive to read, but there are so few that the
 290     extra cost of sorting the row locations is not worth it.  You'll most
 291     often see this plan type for queries that fetch just a single row.  It's
 292     also often used for queries that have an <literal>ORDER BY</literal> condition
 293     that matches the index order, because then no extra sorting step is needed
 294     to satisfy the <literal>ORDER BY</literal>.  In this example, adding
 295     <literal>ORDER BY unique1</literal> would use the same plan because the
 296     index already implicitly provides the requested ordering.
 297    </para>
 298
 299    <para>
 300      The planner may implement an <literal>ORDER BY</literal> clause in several
 301      ways.  The above example shows that such an ordering clause may be
 302      implemented implicitly.  The planner may also add an explicit
 303      <literal>Sort</literal> step:
 304
 305 <screen>
 306 EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;
 307
 308                             QUERY PLAN
 309 -------------------------------------------------------------------
 310  Sort  (cost=1109.39..1134.39 rows=10000 width=244)
 311    Sort Key: unique1
 312    ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
 313 </screen>
 314
 315     If a part of the plan guarantees an ordering on a prefix of the
 316     required sort keys, then the planner may instead decide to use an
 317     <literal>Incremental Sort</literal> step:
 318
 319 <screen>
 320 EXPLAIN SELECT * FROM tenk1 ORDER BY hundred, ten LIMIT 100;
 321
 322                                               QUERY PLAN
 323 -------------------------------------------------------------------&zwsp;-----------------------------
 324  Limit  (cost=19.35..39.49 rows=100 width=244)
 325    -&gt;  Incremental Sort  (cost=19.35..2033.39 rows=10000 width=244)
 326          Sort Key: hundred, ten
 327          Presorted Key: hundred
 328          -&gt;  Index Scan using tenk1_hundred on tenk1  (cost=0.29..1574.20 rows=10000 width=244)
 329 </screen>
 330
 331     Compared to regular sorts, sorting incrementally allows returning tuples
 332     before the entire result set has been sorted, which particularly enables
 333     optimizations with <literal>LIMIT</literal> queries.  It may also reduce
 334     memory usage and the likelihood of spilling sorts to disk, but it comes at
 335     the cost of the increased overhead of splitting the result set into multiple
 336     sorting batches.
 337    </para>
 338
 339    <para>
 340     If there are separate indexes on several of the columns referenced
 341     in <literal>WHERE</literal>, the planner might choose to use an AND or OR
 342     combination of the indexes:
 343
 344 <screen>
 345 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000;
 346
 347                                      QUERY PLAN
 348 -------------------------------------------------------------------&zwsp;------------------
 349  Bitmap Heap Scan on tenk1  (cost=25.07..60.11 rows=10 width=244)
 350    Recheck Cond: ((unique1 &lt; 100) AND (unique2 &gt; 9000))
 351    -&gt;  BitmapAnd  (cost=25.07..25.07 rows=10 width=0)
 352          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=100 width=0)
 353                Index Cond: (unique1 &lt; 100)
 354          -&gt;  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0)
 355                Index Cond: (unique2 &gt; 9000)
 356 </screen>
 357
 358     But this requires visiting both indexes, so it's not necessarily a win
 359     compared to using just one index and treating the other condition as
 360     a filter.  If you vary the ranges involved you'll see the plan change
 361     accordingly.
 362    </para>
 363
 364    <para>
 365     Here is an example showing the effects of <literal>LIMIT</literal>:
 366
 367 <screen>
 368 EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000 LIMIT 2;
 369
 370                                      QUERY PLAN
 371 -------------------------------------------------------------------&zwsp;------------------
 372  Limit  (cost=0.29..14.28 rows=2 width=244)
 373    -&gt;  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..70.27 rows=10 width=244)
 374          Index Cond: (unique2 &gt; 9000)
 375          Filter: (unique1 &lt; 100)
 376 </screen>
 377    </para>
 378
 379    <para>
 380     This is the same query as above, but we added a <literal>LIMIT</literal> so that
 381     not all the rows need be retrieved, and the planner changed its mind about
 382     what to do.  Notice that the total cost and row count of the Index Scan
 383     node are shown as if it were run to completion.  However, the Limit node
 384     is expected to stop after retrieving only a fifth of those rows, so its
 385     total cost is only a fifth as much, and that's the actual estimated cost
 386     of the query.  This plan is preferred over adding a Limit node to the
 387     previous plan because the Limit could not avoid paying the startup cost
 388     of the bitmap scan, so the total cost would be something over 25 units
 389     with that approach.
 390    </para>
 391
 392    <para>
 393     Let's try joining two tables, using the columns we have been discussing:
 394
 395 <screen>
 396 EXPLAIN SELECT *
 397 FROM tenk1 t1, tenk2 t2
 398 WHERE t1.unique1 &lt; 10 AND t1.unique2 = t2.unique2;
 399
 400                                       QUERY PLAN
 401 -------------------------------------------------------------------&zwsp;-------------------
 402  Nested Loop  (cost=4.65..118.50 rows=10 width=488)
 403    -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.38 rows=10 width=244)
 404          Recheck Cond: (unique1 &lt; 10)
 405          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0)
 406                Index Cond: (unique1 &lt; 10)
 407    -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.90 rows=1 width=244)
 408          Index Cond: (unique2 = t1.unique2)
 409 </screen>
 410    </para>
 411
 412    <para>
 413     In this plan, we have a nested-loop join node with two table scans as
 414     inputs, or children.  The indentation of the node summary lines reflects
 415     the plan tree structure.  The join's first, or <quote>outer</quote>, child
 416     is a bitmap scan similar to those we saw before.  Its cost and row count
 417     are the same as we'd get from <literal>SELECT ... WHERE unique1 &lt; 10</literal>
 418     because we are
 419     applying the <literal>WHERE</literal> clause <literal>unique1 &lt; 10</literal>
 420     at that node.
 421     The <literal>t1.unique2 = t2.unique2</literal> clause is not relevant yet,
 422     so it doesn't affect the row count of the outer scan.  The nested-loop
 423     join node will run its second,
 424     or <quote>inner</quote> child once for each row obtained from the outer child.
 425     Column values from the current outer row can be plugged into the inner
 426     scan; here, the <literal>t1.unique2</literal> value from the outer row is available,
 427     so we get a plan and costs similar to what we saw above for a simple
 428     <literal>SELECT ... WHERE t2.unique2 = <replaceable>constant</replaceable></literal> case.
 429     (The estimated cost is actually a bit lower than what was seen above,
 430     as a result of caching that's expected to occur during the repeated
 431     index scans on <literal>t2</literal>.)  The
 432     costs of the loop node are then set on the basis of the cost of the outer
 433     scan, plus one repetition of the inner scan for each outer row (10 * 7.90,
 434     here), plus a little CPU time for join processing.
 435    </para>
 436
 437    <para>
 438     In this example the join's output row count is the same as the product
 439     of the two scans' row counts, but that's not true in all cases because
 440     there can be additional <literal>WHERE</literal> clauses that mention both tables
 441     and so can only be applied at the join point, not to either input scan.
 442     Here's an example:
 443
 444 <screen>
 445 EXPLAIN SELECT *
 446 FROM tenk1 t1, tenk2 t2
 447 WHERE t1.unique1 &lt; 10 AND t2.unique2 &lt; 10 AND t1.hundred &lt; t2.hundred;
 448
 449                                          QUERY PLAN
 450 -------------------------------------------------------------------&zwsp;--------------------------
 451  Nested Loop  (cost=4.65..49.36 rows=33 width=488)
 452    Join Filter: (t1.hundred &lt; t2.hundred)
 453    -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.38 rows=10 width=244)
 454          Recheck Cond: (unique1 &lt; 10)
 455          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0)
 456                Index Cond: (unique1 &lt; 10)
 457    -&gt;  Materialize  (cost=0.29..8.51 rows=10 width=244)
 458          -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..8.46 rows=10 width=244)
 459                Index Cond: (unique2 &lt; 10)
 460 </screen>
 461
 462     The condition <literal>t1.hundred &lt; t2.hundred</literal> can't be
 463     tested in the <literal>tenk2_unique2</literal> index, so it's applied at the
 464     join node.  This reduces the estimated output row count of the join node,
 465     but does not change either input scan.
 466    </para>
 467
 468    <para>
 469     Notice that here the planner has chosen to <quote>materialize</quote> the inner
 470     relation of the join, by putting a Materialize plan node atop it.  This
 471     means that the <literal>t2</literal> index scan will be done just once, even
 472     though the nested-loop join node needs to read that data ten times, once
 473     for each row from the outer relation.  The Materialize node saves the data
 474     in memory as it's read, and then returns the data from memory on each
 475     subsequent pass.
 476    </para>
 477
 478    <para>
 479     When dealing with outer joins, you might see join plan nodes with both
 480     <quote>Join Filter</quote> and plain <quote>Filter</quote> conditions attached.
 481     Join Filter conditions come from the outer join's <literal>ON</literal> clause,
 482     so a row that fails the Join Filter condition could still get emitted as
 483     a null-extended row.  But a plain Filter condition is applied after the
 484     outer-join rules and so acts to remove rows unconditionally.  In an inner
 485     join there is no semantic difference between these types of filters.
 486    </para>
 487
 488    <para>
 489     If we change the query's selectivity a bit, we might get a very different
 490     join plan:
 491
 492 <screen>
 493 EXPLAIN SELECT *
 494 FROM tenk1 t1, tenk2 t2
 495 WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;
 496
 497                                         QUERY PLAN
 498 -------------------------------------------------------------------&zwsp;-----------------------
 499  Hash Join  (cost=226.23..709.73 rows=100 width=488)
 500    Hash Cond: (t2.unique2 = t1.unique2)
 501    -&gt;  Seq Scan on tenk2 t2  (cost=0.00..445.00 rows=10000 width=244)
 502    -&gt;  Hash  (cost=224.98..224.98 rows=100 width=244)
 503          -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=5.06..224.98 rows=100 width=244)
 504                Recheck Cond: (unique1 &lt; 100)
 505                -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=100 width=0)
 506                      Index Cond: (unique1 &lt; 100)
 507 </screen>
 508    </para>
 509
 510    <para>
 511     Here, the planner has chosen to use a hash join, in which rows of one
 512     table are entered into an in-memory hash table, after which the other
 513     table is scanned and the hash table is probed for matches to each row.
 514     Again note how the indentation reflects the plan structure: the bitmap
 515     scan on <literal>tenk1</literal> is the input to the Hash node, which constructs
 516     the hash table.  That's then returned to the Hash Join node, which reads
 517     rows from its outer child plan and searches the hash table for each one.
 518    </para>
 519
 520    <para>
 521     Another possible type of join is a merge join, illustrated here:
 522
 523 <screen>
 524 EXPLAIN SELECT *
 525 FROM tenk1 t1, onek t2
 526 WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;
 527
 528                                         QUERY PLAN
 529 -------------------------------------------------------------------&zwsp;-----------------------
 530  Merge Join  (cost=0.56..233.49 rows=10 width=488)
 531    Merge Cond: (t1.unique2 = t2.unique2)
 532    -&gt;  Index Scan using tenk1_unique2 on tenk1 t1  (cost=0.29..643.28 rows=100 width=244)
 533          Filter: (unique1 &lt; 100)
 534    -&gt;  Index Scan using onek_unique2 on onek t2  (cost=0.28..166.28 rows=1000 width=244)
 535 </screen>
 536    </para>
 537
 538    <para>
 539     Merge join requires its input data to be sorted on the join keys.  In this
 540     example each input is sorted by using an index scan to visit the rows
 541     in the correct order; but a sequential scan and sort could also be used.
 542     (Sequential-scan-and-sort frequently beats an index scan for sorting many rows,
 543     because of the nonsequential disk access required by the index scan.)
 544    </para>
 545
 546    <para>
 547     One way to look at variant plans is to force the planner to disregard
 548     whatever strategy it thought was the cheapest, using the enable/disable
 549     flags described in <xref linkend="runtime-config-query-enable"/>.
 550     (This is a crude tool, but useful.  See
 551     also <xref linkend="explicit-joins"/>.)
 552     For example, if we're unconvinced that merge join is the best join
 553     type for the previous example, we could try
 554
 555 <screen>
 556 SET enable_mergejoin = off;
 557
 558 EXPLAIN SELECT *
 559 FROM tenk1 t1, onek t2
 560 WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;
 561
 562                                         QUERY PLAN
 563 -------------------------------------------------------------------&zwsp;-----------------------
 564  Hash Join  (cost=226.23..344.08 rows=10 width=488)
 565    Hash Cond: (t2.unique2 = t1.unique2)
 566    -&gt;  Seq Scan on onek t2  (cost=0.00..114.00 rows=1000 width=244)
 567    -&gt;  Hash  (cost=224.98..224.98 rows=100 width=244)
 568          -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=5.06..224.98 rows=100 width=244)
 569                Recheck Cond: (unique1 &lt; 100)
 570                -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=100 width=0)
 571                      Index Cond: (unique1 &lt; 100)
 572 </screen>
 573
 574     which shows that the planner thinks that hash join would be nearly 50%
 575     more expensive than merge join for this case.
 576     Of course, the next question is whether it's right about that.
 577     We can investigate that using <command>EXPLAIN ANALYZE</command>, as
 578     discussed <link linkend="using-explain-analyze">below</link>.
 579    </para>
 580
 581    <para>
 582     <indexterm>
 583      <primary>subplan</primary>
 584     </indexterm>
 585     Some query plans involve <firstterm>subplans</firstterm>, which arise
 586     from sub-<literal>SELECT</literal>s in the original query.  Such
 587     queries can sometimes be transformed into ordinary join plans, but
 588     when they cannot be, we get plans like:
 589
 590 <screen>
 591 EXPLAIN VERBOSE SELECT unique1
 592 FROM tenk1 t
 593 WHERE t.ten &lt; ALL (SELECT o.ten FROM onek o WHERE o.four = t.four);
 594
 595                                QUERY PLAN
 596 -------------------------------------------------------------------&zwsp;------
 597  Seq Scan on public.tenk1 t  (cost=0.00..586095.00 rows=5000 width=4)
 598    Output: t.unique1
 599    Filter: (ALL (t.ten &lt; (SubPlan 1).col1))
 600    SubPlan 1
 601      -&gt;  Seq Scan on public.onek o  (cost=0.00..116.50 rows=250 width=4)
 602            Output: o.ten
 603            Filter: (o.four = t.four)
 604 </screen>
 605
 606     This rather artificial example serves to illustrate a couple of
 607     points: values from the outer plan level can be passed down into a
 608     subplan (here, <literal>t.four</literal> is passed down) and the
 609     results of the sub-select are available to the outer plan.  Those
 610     result values are shown by <command>EXPLAIN</command> with notations
 611     like
 612     <literal>(<replaceable>subplan_name</replaceable>).col<replaceable>N</replaceable></literal>,
 613     which refers to the <replaceable>N</replaceable>'th output column of
 614     the sub-<literal>SELECT</literal>.
 615    </para>
 616
 617    <para>
 618     <indexterm>
 619      <primary>subplan</primary>
 620      <secondary>hashed</secondary>
 621     </indexterm>
 622     In the example above, the <literal>ALL</literal> operator runs the
 623     subplan again for each row of the outer query (which accounts for the
 624     high estimated cost).  Some queries can use a <firstterm>hashed
 625     subplan</firstterm> to avoid that:
 626
 627 <screen>
 628 EXPLAIN SELECT *
 629 FROM tenk1 t
 630 WHERE t.unique1 NOT IN (SELECT o.unique1 FROM onek o);
 631
 632                                          QUERY PLAN
 633 -------------------------------------------------------------------&zwsp;-------------------------
 634  Seq Scan on tenk1 t  (cost=61.77..531.77 rows=5000 width=244)
 635    Filter: (NOT (ANY (unique1 = (hashed SubPlan 1).col1)))
 636    SubPlan 1
 637      -&gt;  Index Only Scan using onek_unique1 on onek o  (cost=0.28..59.27 rows=1000 width=4)
 638 (4 rows)
 639 </screen>
 640
 641     Here, the subplan is run a single time and its output is loaded into
 642     an in-memory hash table, which is then probed by the
 643     outer <literal>ANY</literal> operator.  This requires that the
 644     sub-<literal>SELECT</literal> not reference any variables of the outer
 645     query, and that the <literal>ANY</literal>'s comparison operator be
 646     amenable to hashing.
 647    </para>
 648
 649    <para>
 650     <indexterm>
 651      <primary>initplan</primary>
 652     </indexterm>
 653     If, in addition to not referencing any variables of the outer query,
 654     the sub-<literal>SELECT</literal> cannot return more than one row,
 655     it may instead be implemented as an <firstterm>initplan</firstterm>:
 656
 657 <screen>
 658 EXPLAIN VERBOSE SELECT unique1
 659 FROM tenk1 t1 WHERE t1.ten = (SELECT (random() * 10)::integer);
 660
 661                              QUERY PLAN
 662 ------------------------------------------------------------&zwsp;--------
 663  Seq Scan on public.tenk1 t1  (cost=0.02..470.02 rows=1000 width=4)
 664    Output: t1.unique1
 665    Filter: (t1.ten = (InitPlan 1).col1)
 666    InitPlan 1
 667      -&gt;  Result  (cost=0.00..0.02 rows=1 width=4)
 668            Output: ((random() * '10'::double precision))::integer
 669 </screen>
 670
 671     An initplan is run only once per execution of the outer plan, and its
 672     results are saved for re-use in later rows of the outer plan.  So in
 673     this example <literal>random()</literal> is evaluated only once and
 674     all the values of <literal>t1.ten</literal> are compared to the same
 675     randomly-chosen integer.  That's quite different from what would
 676     happen without the sub-<literal>SELECT</literal> construct.
 677    </para>
 678
 679   </sect2>
 680
 681   <sect2 id="using-explain-analyze">
 682    <title><command>EXPLAIN ANALYZE</command></title>
 683
 684    <para>
 685     It is possible to check the accuracy of the planner's estimates
 686     by using <command>EXPLAIN</command>'s <literal>ANALYZE</literal> option.  With this
 687     option, <command>EXPLAIN</command> actually executes the query, and then displays
 688     the true row counts and true run time accumulated within each plan node,
 689     along with the same estimates that a plain <command>EXPLAIN</command>
 690     shows.  For example, we might get a result like this:
 691
 692 <screen>
 693 EXPLAIN ANALYZE SELECT *
 694 FROM tenk1 t1, tenk2 t2
 695 WHERE t1.unique1 &lt; 10 AND t1.unique2 = t2.unique2;
 696
 697                                                            QUERY PLAN
 698 -------------------------------------------------------------------&zwsp;--------------------------------------------------------------
 699  Nested Loop  (cost=4.65..118.50 rows=10 width=488) (actual time=0.017..0.051 rows=10 loops=1)
 700    -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.38 rows=10 width=244) (actual time=0.009..0.017 rows=10 loops=1)
 701          Recheck Cond: (unique1 &lt; 10)
 702          Heap Blocks: exact=10
 703          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0) (actual time=0.004..0.004 rows=10 loops=1)
 704                Index Cond: (unique1 &lt; 10)
 705    -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1 loops=10)
 706          Index Cond: (unique2 = t1.unique2)
 707  Planning Time: 0.485 ms
 708  Execution Time: 0.073 ms
 709 </screen>
 710
 711     Note that the <quote>actual time</quote> values are in milliseconds of
 712     real time, whereas the <literal>cost</literal> estimates are expressed in
 713     arbitrary units; so they are unlikely to match up.
 714     The thing that's usually most important to look for is whether the
 715     estimated row counts are reasonably close to reality.  In this example
 716     the estimates were all dead-on, but that's quite unusual in practice.
 717    </para>
 718
 719    <para>
 720     In some query plans, it is possible for a subplan node to be executed more
 721     than once.  For example, the inner index scan will be executed once per
 722     outer row in the above nested-loop plan.  In such cases, the
 723     <literal>loops</literal> value reports the
 724     total number of executions of the node, and the actual time and rows
 725     values shown are averages per-execution.  This is done to make the numbers
 726     comparable with the way that the cost estimates are shown.  Multiply by
 727     the <literal>loops</literal> value to get the total time actually spent in
 728     the node.  In the above example, we spent a total of 0.030 milliseconds
 729     executing the index scans on <literal>tenk2</literal>.
 730    </para>
 731
 732    <para>
 733     In some cases <command>EXPLAIN ANALYZE</command> shows additional execution
 734     statistics beyond the plan node execution times and row counts.
 735     For example, Sort and Hash nodes provide extra information:
 736
 737 <screen>
 738 EXPLAIN ANALYZE SELECT *
 739 FROM tenk1 t1, tenk2 t2
 740 WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2 ORDER BY t1.fivethous;
 741
 742                                                                  QUERY PLAN
 743 -------------------------------------------------------------------&zwsp;-------------------------------------------------------------------&zwsp;------
 744  Sort  (cost=713.05..713.30 rows=100 width=488) (actual time=2.995..3.002 rows=100 loops=1)
 745    Sort Key: t1.fivethous
 746    Sort Method: quicksort  Memory: 74kB
 747    -&gt;  Hash Join  (cost=226.23..709.73 rows=100 width=488) (actual time=0.515..2.920 rows=100 loops=1)
 748          Hash Cond: (t2.unique2 = t1.unique2)
 749          -&gt;  Seq Scan on tenk2 t2  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.026..1.790 rows=10000 loops=1)
 750          -&gt;  Hash  (cost=224.98..224.98 rows=100 width=244) (actual time=0.476..0.477 rows=100 loops=1)
 751                Buckets: 1024  Batches: 1  Memory Usage: 35kB
 752                -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=5.06..224.98 rows=100 width=244) (actual time=0.030..0.450 rows=100 loops=1)
 753                      Recheck Cond: (unique1 &lt; 100)
 754                      Heap Blocks: exact=90
 755                      -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=100 width=0) (actual time=0.013..0.013 rows=100 loops=1)
 756                            Index Cond: (unique1 &lt; 100)
 757  Planning Time: 0.187 ms
 758  Execution Time: 3.036 ms
 759 </screen>
 760
 761     The Sort node shows the sort method used (in particular, whether the sort
 762     was in-memory or on-disk) and the amount of memory or disk space needed.
 763     The Hash node shows the number of hash buckets and batches as well as the
 764     peak amount of memory used for the hash table.  (If the number of batches
 765     exceeds one, there will also be disk space usage involved, but that is not
 766     shown.)
 767    </para>
 768
 769    <para>
 770     Another type of extra information is the number of rows removed by a
 771     filter condition:
 772
 773 <screen>
 774 EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE ten &lt; 7;
 775
 776                                                QUERY PLAN
 777 -------------------------------------------------------------------&zwsp;--------------------------------------
 778  Seq Scan on tenk1  (cost=0.00..470.00 rows=7000 width=244) (actual time=0.030..1.995 rows=7000 loops=1)
 779    Filter: (ten &lt; 7)
 780    Rows Removed by Filter: 3000
 781  Planning Time: 0.102 ms
 782  Execution Time: 2.145 ms
 783 </screen>
 784
 785     These counts can be particularly valuable for filter conditions applied at
 786     join nodes.  The <quote>Rows Removed</quote> line only appears when at least
 787     one scanned row, or potential join pair in the case of a join node,
 788     is rejected by the filter condition.
 789    </para>
 790
 791    <para>
 792     A case similar to filter conditions occurs with <quote>lossy</quote>
 793     index scans.  For example, consider this search for polygons containing a
 794     specific point:
 795
 796 <screen>
 797 EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @&gt; polygon '(0.5,2.0)';
 798
 799                                               QUERY PLAN
 800 -------------------------------------------------------------------&zwsp;-----------------------------------
 801  Seq Scan on polygon_tbl  (cost=0.00..1.09 rows=1 width=85) (actual time=0.023..0.023 rows=0 loops=1)
 802    Filter: (f1 @&gt; '((0.5,2))'::polygon)
 803    Rows Removed by Filter: 7
 804  Planning Time: 0.039 ms
 805  Execution Time: 0.033 ms
 806 </screen>
 807
 808     The planner thinks (quite correctly) that this sample table is too small
 809     to bother with an index scan, so we have a plain sequential scan in which
 810     all the rows got rejected by the filter condition.  But if we force an
 811     index scan to be used, we see:
 812
 813 <screen>
 814 SET enable_seqscan TO off;
 815
 816 EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @&gt; polygon '(0.5,2.0)';
 817
 818                                                         QUERY PLAN
 819 -------------------------------------------------------------------&zwsp;-------------------------------------------------------
 820  Index Scan using gpolygonind on polygon_tbl  (cost=0.13..8.15 rows=1 width=85) (actual time=0.074..0.074 rows=0 loops=1)
 821    Index Cond: (f1 @&gt; '((0.5,2))'::polygon)
 822    Rows Removed by Index Recheck: 1
 823  Planning Time: 0.039 ms
 824  Execution Time: 0.098 ms
 825 </screen>
 826
 827     Here we can see that the index returned one candidate row, which was
 828     then rejected by a recheck of the index condition.  This happens because a
 829     GiST index is <quote>lossy</quote> for polygon containment tests: it actually
 830     returns the rows with polygons that overlap the target, and then we have
 831     to do the exact containment test on those rows.
 832    </para>
 833
 834    <para>
 835     <command>EXPLAIN</command> has a <literal>BUFFERS</literal> option that can be used with
 836     <literal>ANALYZE</literal> to get even more run time statistics:
 837
 838 <screen>
 839 EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000;
 840
 841                                                            QUERY PLAN
 842 -------------------------------------------------------------------&zwsp;--------------------------------------------------------------
 843  Bitmap Heap Scan on tenk1  (cost=25.07..60.11 rows=10 width=244) (actual time=0.105..0.114 rows=10 loops=1)
 844    Recheck Cond: ((unique1 &lt; 100) AND (unique2 &gt; 9000))
 845    Heap Blocks: exact=10
 846    Buffers: shared hit=14 read=3
 847    -&gt;  BitmapAnd  (cost=25.07..25.07 rows=10 width=0) (actual time=0.100..0.101 rows=0 loops=1)
 848          Buffers: shared hit=4 read=3
 849          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=100 width=0) (actual time=0.027..0.027 rows=100 loops=1)
 850                Index Cond: (unique1 &lt; 100)
 851                Buffers: shared hit=2
 852          -&gt;  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0) (actual time=0.070..0.070 rows=999 loops=1)
 853                Index Cond: (unique2 &gt; 9000)
 854                Buffers: shared hit=2 read=3
 855  Planning:
 856    Buffers: shared hit=3
 857  Planning Time: 0.162 ms
 858  Execution Time: 0.143 ms
 859 </screen>
 860
 861     The numbers provided by <literal>BUFFERS</literal> help to identify which parts
 862     of the query are the most I/O-intensive.
 863    </para>
 864
 865    <para>
 866     Keep in mind that because <command>EXPLAIN ANALYZE</command> actually
 867     runs the query, any side-effects will happen as usual, even though
 868     whatever results the query might output are discarded in favor of
 869     printing the <command>EXPLAIN</command> data.  If you want to analyze a
 870     data-modifying query without changing your tables, you can
 871     roll the command back afterwards, for example:
 872
 873 <screen>
 874 BEGIN;
 875
 876 EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 &lt; 100;
 877
 878                                                            QUERY PLAN
 879 -------------------------------------------------------------------&zwsp;-------------------------------------------------------------
 880  Update on tenk1  (cost=5.06..225.23 rows=0 width=0) (actual time=1.634..1.635 rows=0 loops=1)
 881    -&gt;  Bitmap Heap Scan on tenk1  (cost=5.06..225.23 rows=100 width=10) (actual time=0.065..0.141 rows=100 loops=1)
 882          Recheck Cond: (unique1 &lt; 100)
 883          Heap Blocks: exact=90
 884          -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=100 width=0) (actual time=0.031..0.031 rows=100 loops=1)
 885                Index Cond: (unique1 &lt; 100)
 886  Planning Time: 0.151 ms
 887  Execution Time: 1.856 ms
 888
 889 ROLLBACK;
 890 </screen>
 891    </para>
 892
 893    <para>
 894     As seen in this example, when the query is an <command>INSERT</command>,
 895     <command>UPDATE</command>, <command>DELETE</command>, or
 896     <command>MERGE</command> command, the actual work of
 897     applying the table changes is done by a top-level Insert, Update,
 898     Delete, or Merge plan node.  The plan nodes underneath this node perform
 899     the work of locating the old rows and/or computing the new data.
 900     So above, we see the same sort of bitmap table scan we've seen already,
 901     and its output is fed to an Update node that stores the updated rows.
 902     It's worth noting that although the data-modifying node can take a
 903     considerable amount of run time (here, it's consuming the lion's share
 904     of the time), the planner does not currently add anything to the cost
 905     estimates to account for that work.  That's because the work to be done is
 906     the same for every correct query plan, so it doesn't affect planning
 907     decisions.
 908    </para>
 909
 910    <para>
 911     When an <command>UPDATE</command>, <command>DELETE</command>, or
 912     <command>MERGE</command> command affects a partitioned table or
 913     inheritance hierarchy, the output might look like this:
 914
 915 <screen>
 916 EXPLAIN UPDATE gtest_parent SET f1 = CURRENT_DATE WHERE f2 = 101;
 917
 918                                        QUERY PLAN
 919 -------------------------------------------------------------------&zwsp;---------------------
 920  Update on gtest_parent  (cost=0.00..3.06 rows=0 width=0)
 921    Update on gtest_child gtest_parent_1
 922    Update on gtest_child2 gtest_parent_2
 923    Update on gtest_child3 gtest_parent_3
 924    -&gt;  Append  (cost=0.00..3.06 rows=3 width=14)
 925          -&gt;  Seq Scan on gtest_child gtest_parent_1  (cost=0.00..1.01 rows=1 width=14)
 926                Filter: (f2 = 101)
 927          -&gt;  Seq Scan on gtest_child2 gtest_parent_2  (cost=0.00..1.01 rows=1 width=14)
 928                Filter: (f2 = 101)
 929          -&gt;  Seq Scan on gtest_child3 gtest_parent_3  (cost=0.00..1.01 rows=1 width=14)
 930                Filter: (f2 = 101)
 931 </screen>
 932
 933     In this example the Update node needs to consider three child tables,
 934     but not the originally-mentioned partitioned table (since that never
 935     stores any data).  So there are three input
 936     scanning subplans, one per table.  For clarity, the Update node is
 937     annotated to show the specific target tables that will be updated, in the
 938     same order as the corresponding subplans.
 939    </para>
 940
 941    <para>
 942     The <literal>Planning time</literal> shown by <command>EXPLAIN
 943     ANALYZE</command> is the time it took to generate the query plan from the
 944     parsed query and optimize it. It does not include parsing or rewriting.
 945    </para>
 946
 947    <para>
 948     The <literal>Execution time</literal> shown by <command>EXPLAIN
 949     ANALYZE</command> includes executor start-up and shut-down time, as well
 950     as the time to run any triggers that are fired, but it does not include
 951     parsing, rewriting, or planning time.
 952     Time spent executing <literal>BEFORE</literal> triggers, if any, is included in
 953     the time for the related Insert, Update, or Delete node; but time
 954     spent executing <literal>AFTER</literal> triggers is not counted there because
 955     <literal>AFTER</literal> triggers are fired after completion of the whole plan.
 956     The total time spent in each trigger
 957     (either <literal>BEFORE</literal> or <literal>AFTER</literal>) is also shown separately.
 958     Note that deferred constraint triggers will not be executed
 959     until end of transaction and are thus not considered at all by
 960     <command>EXPLAIN ANALYZE</command>.
 961    </para>
 962
 963    <para>
 964     The time shown for the top-level node does not include any time needed
 965     to convert the query's output data into displayable form or to send it
 966     to the client.  While <command>EXPLAIN ANALYZE</command> will never
 967     send the data to the client, it can be told to convert the query's
 968     output data to displayable form and measure the time needed for that,
 969     by specifying the <literal>SERIALIZE</literal> option.  That time will
 970     be shown separately, and it's also included in the
 971     total <literal>Execution time</literal>.
 972    </para>
 973
 974   </sect2>
 975
 976   <sect2 id="using-explain-caveats">
 977    <title>Caveats</title>
 978
 979    <para>
 980     There are two significant ways in which run times measured by
 981     <command>EXPLAIN ANALYZE</command> can deviate from normal execution of
 982     the same query.  First, since no output rows are delivered to the client,
 983     network transmission costs are not included.  I/O conversion costs are
 984     not included either unless <literal>SERIALIZE</literal> is specified.
 985     Second, the measurement overhead added by <command>EXPLAIN
 986     ANALYZE</command> can be significant, especially on machines with slow
 987     <function>gettimeofday()</function> operating-system calls. You can use the
 988     <xref linkend="pgtesttiming"/> tool to measure the overhead of timing
 989     on your system.
 990    </para>
 991
 992    <para>
 993     <command>EXPLAIN</command> results should not be extrapolated to situations
 994     much different from the one you are actually testing; for example,
 995     results on a toy-sized table cannot be assumed to apply to large tables.
 996     The planner's cost estimates are not linear and so it might choose
 997     a different plan for a larger or smaller table.  An extreme example
 998     is that on a table that only occupies one disk page, you'll nearly
 999     always get a sequential scan plan whether indexes are available or not.
1000     The planner realizes that it's going to take one disk page read to
1001     process the table in any case, so there's no value in expending additional
1002     page reads to look at an index.  (We saw this happening in the
1003     <literal>polygon_tbl</literal> example above.)
1004    </para>
1005
1006    <para>
1007     There are cases in which the actual and estimated values won't match up
1008     well, but nothing is really wrong.  One such case occurs when
1009     plan node execution is stopped short by a <literal>LIMIT</literal> or similar
1010     effect.  For example, in the <literal>LIMIT</literal> query we used before,
1011
1012 <screen>
1013 EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000 LIMIT 2;
1014
1015                                                           QUERY PLAN
1016 -------------------------------------------------------------------&zwsp;------------------------------------------------------------
1017  Limit  (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2 loops=1)
1018    -&gt;  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2 loops=1)
1019          Index Cond: (unique2 &gt; 9000)
1020          Filter: (unique1 &lt; 100)
1021          Rows Removed by Filter: 287
1022  Planning Time: 0.077 ms
1023  Execution Time: 0.086 ms
1024 </screen>
1025
1026     the estimated cost and row count for the Index Scan node are shown as
1027     though it were run to completion.  But in reality the Limit node stopped
1028     requesting rows after it got two, so the actual row count is only 2 and
1029     the run time is less than the cost estimate would suggest.  This is not
1030     an estimation error, only a discrepancy in the way the estimates and true
1031     values are displayed.
1032    </para>
1033
1034    <para>
1035     Merge joins also have measurement artifacts that can confuse the unwary.
1036     A merge join will stop reading one input if it's exhausted the other input
1037     and the next key value in the one input is greater than the last key value
1038     of the other input; in such a case there can be no more matches and so no
1039     need to scan the rest of the first input.  This results in not reading all
1040     of one child, with results like those mentioned for <literal>LIMIT</literal>.
1041     Also, if the outer (first) child contains rows with duplicate key values,
1042     the inner (second) child is backed up and rescanned for the portion of its
1043     rows matching that key value.  <command>EXPLAIN ANALYZE</command> counts these
1044     repeated emissions of the same inner rows as if they were real additional
1045     rows.  When there are many outer duplicates, the reported actual row count
1046     for the inner child plan node can be significantly larger than the number
1047     of rows that are actually in the inner relation.
1048    </para>
1049
1050    <para>
1051     BitmapAnd and BitmapOr nodes always report their actual row counts as zero,
1052     due to implementation limitations.
1053    </para>
1054
1055    <para>
1056     Normally, <command>EXPLAIN</command> will display every plan node
1057     created by the planner.  However, there are cases where the executor
1058     can determine that certain nodes need not be executed because they
1059     cannot produce any rows, based on parameter values that were not
1060     available at planning time.  (Currently this can only happen for child
1061     nodes of an Append or MergeAppend node that is scanning a partitioned
1062     table.)  When this happens, those plan nodes are omitted from
1063     the <command>EXPLAIN</command> output and a <literal>Subplans
1064     Removed: <replaceable>N</replaceable></literal> annotation appears
1065     instead.
1066    </para>
1067   </sect2>
1068
1069  </sect1>
1070
1071  <sect1 id="planner-stats">
1072   <title>Statistics Used by the Planner</title>
1073
1074   <indexterm zone="planner-stats">
1075    <primary>statistics</primary>
1076    <secondary>of the planner</secondary>
1077   </indexterm>
1078
1079   <sect2 id="planner-stats-single-column">
1080    <title>Single-Column Statistics</title>
1081   <para>
1082    As we saw in the previous section, the query planner needs to estimate
1083    the number of rows retrieved by a query in order to make good choices
1084    of query plans.  This section provides a quick look at the statistics
1085    that the system uses for these estimates.
1086   </para>
1087
1088   <para>
1089    One component of the statistics is the total number of entries in
1090    each table and index, as well as the number of disk blocks occupied
1091    by each table and index.  This information is kept in the table
1092    <link linkend="catalog-pg-class"><structname>pg_class</structname></link>,
1093    in the columns <structfield>reltuples</structfield> and
1094    <structfield>relpages</structfield>.  We can look at it with
1095    queries similar to this one:
1096
1097 <screen>
1098 SELECT relname, relkind, reltuples, relpages
1099 FROM pg_class
1100 WHERE relname LIKE 'tenk1%';
1101
1102        relname        | relkind | reltuples | relpages
1103 ----------------------+---------+-----------+----------
1104  tenk1                | r       |     10000 |      345
1105  tenk1_hundred        | i       |     10000 |       11
1106  tenk1_thous_tenthous | i       |     10000 |       30
1107  tenk1_unique1        | i       |     10000 |       30
1108  tenk1_unique2        | i       |     10000 |       30
1109 (5 rows)
1110 </screen>
1111
1112    Here we can see that <structname>tenk1</structname> contains 10000
1113    rows, as do its indexes, but the indexes are (unsurprisingly) much
1114    smaller than the table.
1115   </para>
1116
1117   <para>
1118    For efficiency reasons, <structfield>reltuples</structfield>
1119    and <structfield>relpages</structfield> are not updated on-the-fly,
1120    and so they usually contain somewhat out-of-date values.
1121    They are updated by <command>VACUUM</command>, <command>ANALYZE</command>, and a
1122    few DDL commands such as <command>CREATE INDEX</command>.  A <command>VACUUM</command>
1123    or <command>ANALYZE</command> operation that does not scan the entire table
1124    (which is commonly the case) will incrementally update the
1125    <structfield>reltuples</structfield> count on the basis of the part
1126    of the table it did scan, resulting in an approximate value.
1127    In any case, the planner
1128    will scale the values it finds in <structname>pg_class</structname>
1129    to match the current physical table size, thus obtaining a closer
1130    approximation.
1131   </para>
1132
1133   <indexterm>
1134    <primary>pg_statistic</primary>
1135   </indexterm>
1136
1137   <para>
1138    Most queries retrieve only a fraction of the rows in a table, due
1139    to <literal>WHERE</literal> clauses that restrict the rows to be
1140    examined.  The planner thus needs to make an estimate of the
1141    <firstterm>selectivity</firstterm> of <literal>WHERE</literal> clauses, that is,
1142    the fraction of rows that match each condition in the
1143    <literal>WHERE</literal> clause.  The information used for this task is
1144    stored in the
1145    <link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link>
1146    system catalog.  Entries in <structname>pg_statistic</structname>
1147    are updated by the <command>ANALYZE</command> and <command>VACUUM
1148    ANALYZE</command> commands, and are always approximate even when freshly
1149    updated.
1150   </para>
1151
1152   <indexterm>
1153    <primary>pg_stats</primary>
1154   </indexterm>
1155
1156   <para>
1157    Rather than look at <structname>pg_statistic</structname> directly,
1158    it's better to look at its view
1159    <link linkend="view-pg-stats"><structname>pg_stats</structname></link>
1160    when examining the statistics manually.  <structname>pg_stats</structname>
1161    is designed to be more easily readable.  Furthermore,
1162    <structname>pg_stats</structname> is readable by all, whereas
1163    <structname>pg_statistic</structname> is only readable by a superuser.
1164    (This prevents unprivileged users from learning something about
1165    the contents of other people's tables from the statistics.  The
1166    <structname>pg_stats</structname> view is restricted to show only
1167    rows about tables that the current user can read.)
1168    For example, we might do:
1169
1170 <screen>
1171 SELECT attname, inherited, n_distinct,
1172        array_to_string(most_common_vals, E'\n') as most_common_vals
1173 FROM pg_stats
1174 WHERE tablename = 'road';
1175
1176  attname | inherited | n_distinct |          most_common_vals
1177 ---------+-----------+------------+------------------------------------
1178  name    | f         | -0.5681108 | I- 580                        Ramp+
1179          |           |            | I- 880                        Ramp+
1180          |           |            | Sp Railroad                       +
1181          |           |            | I- 580                            +
1182          |           |            | I- 680                        Ramp+
1183          |           |            | I- 80                         Ramp+
1184          |           |            | 14th                          St  +
1185          |           |            | I- 880                            +
1186          |           |            | Mac Arthur                    Blvd+
1187          |           |            | Mission                       Blvd+
1188 ...
1189  name    | t         |    -0.5125 | I- 580                        Ramp+
1190          |           |            | I- 880                        Ramp+
1191          |           |            | I- 580                            +
1192          |           |            | I- 680                        Ramp+
1193          |           |            | I- 80                         Ramp+
1194          |           |            | Sp Railroad                       +
1195          |           |            | I- 880                            +
1196          |           |            | State Hwy 13                  Ramp+
1197          |           |            | I- 80                             +
1198          |           |            | State Hwy 24                  Ramp+
1199 ...
1200  thepath | f         |          0 |
1201  thepath | t         |          0 |
1202 (4 rows)
1203 </screen>
1204
1205    Note that two rows are displayed for the same column, one corresponding
1206    to the complete inheritance hierarchy starting at the
1207    <literal>road</literal> table (<literal>inherited</literal>=<literal>t</literal>),
1208    and another one including only the <literal>road</literal> table itself
1209    (<literal>inherited</literal>=<literal>f</literal>).
1210    (For brevity, we have only shown the first ten most-common values for
1211    the <literal>name</literal> column.)
1212   </para>
1213
1214   <para>
1215    The amount of information stored in <structname>pg_statistic</structname>
1216    by <command>ANALYZE</command>, in particular the maximum number of entries in the
1217    <structfield>most_common_vals</structfield> and <structfield>histogram_bounds</structfield>
1218    arrays for each column, can be set on a
1219    column-by-column basis using the <command>ALTER TABLE SET STATISTICS</command>
1220    command, or globally by setting the
1221    <xref linkend="guc-default-statistics-target"/> configuration variable.
1222    The default limit is presently 100 entries.  Raising the limit
1223    might allow more accurate planner estimates to be made, particularly for
1224    columns with irregular data distributions, at the price of consuming
1225    more space in <structname>pg_statistic</structname> and slightly more
1226    time to compute the estimates.  Conversely, a lower limit might be
1227    sufficient for columns with simple data distributions.
1228   </para>
1229
1230   <para>
1231    Further details about the planner's use of statistics can be found in
1232    <xref linkend="planner-stats-details"/>.
1233   </para>
1234   </sect2>
1235
1236   <sect2 id="planner-stats-extended">
1237    <title>Extended Statistics</title>
1238
1239    <indexterm zone="planner-stats-extended">
1240     <primary>statistics</primary>
1241     <secondary>of the planner</secondary>
1242    </indexterm>
1243
1244    <indexterm>
1245     <primary>correlation</primary>
1246     <secondary>in the query planner</secondary>
1247    </indexterm>
1248
1249    <indexterm>
1250     <primary>pg_statistic_ext</primary>
1251    </indexterm>
1252
1253    <indexterm>
1254     <primary>pg_statistic_ext_data</primary>
1255    </indexterm>
1256
1257    <para>
1258     It is common to see slow queries running bad execution plans because
1259     multiple columns used in the query clauses are correlated.
1260     The planner normally assumes that multiple conditions
1261     are independent of each other,
1262     an assumption that does not hold when column values are correlated.
1263     Regular statistics, because of their per-individual-column nature,
1264     cannot capture any knowledge about cross-column correlation.
1265     However, <productname>PostgreSQL</productname> has the ability to compute
1266     <firstterm>multivariate statistics</firstterm>, which can capture
1267     such information.
1268    </para>
1269
1270    <para>
1271     Because the number of possible column combinations is very large,
1272     it's impractical to compute multivariate statistics automatically.
1273     Instead, <firstterm>extended statistics objects</firstterm>, more often
1274     called just <firstterm>statistics objects</firstterm>, can be created to instruct
1275     the server to obtain statistics across interesting sets of columns.
1276    </para>
1277
1278    <para>
1279     Statistics objects are created using the
1280     <link linkend="sql-createstatistics"><command>CREATE STATISTICS</command></link> command.
1281     Creation of such an object merely creates a catalog entry expressing
1282     interest in the statistics.  Actual data collection is performed
1283     by <command>ANALYZE</command> (either a manual command, or background
1284     auto-analyze).  The collected values can be examined in the
1285     <link linkend="catalog-pg-statistic-ext-data"><structname>pg_statistic_ext_data</structname></link>
1286     catalog.
1287    </para>
1288
1289    <para>
1290     <command>ANALYZE</command> computes extended statistics based on the same
1291     sample of table rows that it takes for computing regular single-column
1292     statistics.  Since the sample size is increased by increasing the
1293     statistics target for the table or any of its columns (as described in
1294     the previous section), a larger statistics target will normally result in
1295     more accurate extended statistics, as well as more time spent calculating
1296     them.
1297    </para>
1298
1299    <para>
1300     The following subsections describe the kinds of extended statistics
1301     that are currently supported.
1302    </para>
1303
1304    <sect3 id="planner-stats-extended-functional-deps">
1305     <title>Functional Dependencies</title>
1306
1307     <para>
1308      The simplest kind of extended statistics tracks <firstterm>functional
1309      dependencies</firstterm>, a concept used in definitions of database normal forms.
1310      We say that column <structfield>b</structfield> is functionally dependent on
1311      column <structfield>a</structfield> if knowledge of the value of
1312      <structfield>a</structfield> is sufficient to determine the value
1313      of <structfield>b</structfield>, that is there are no two rows having the same value
1314      of <structfield>a</structfield> but different values of <structfield>b</structfield>.
1315      In a fully normalized database, functional dependencies should exist
1316      only on primary keys and superkeys. However, in practice many data sets
1317      are not fully normalized for various reasons; intentional
1318      denormalization for performance reasons is a common example.
1319      Even in a fully normalized database, there may be partial correlation
1320      between some columns, which can be expressed as partial functional
1321      dependency.
1322     </para>
1323
1324     <para>
1325      The existence of functional dependencies directly affects the accuracy
1326      of estimates in certain queries.  If a query contains conditions on
1327      both the independent and the dependent column(s), the
1328      conditions on the dependent columns do not further reduce the result
1329      size; but without knowledge of the functional dependency, the query
1330      planner will assume that the conditions are independent, resulting
1331      in underestimating the result size.
1332     </para>
1333
1334     <para>
1335      To inform the planner about functional dependencies, <command>ANALYZE</command>
1336      can collect measurements of cross-column dependency. Assessing the
1337      degree of dependency between all sets of columns would be prohibitively
1338      expensive, so data collection is limited to those groups of columns
1339      appearing together in a statistics object defined with
1340      the <literal>dependencies</literal> option.  It is advisable to create
1341      <literal>dependencies</literal> statistics only for column groups that are
1342      strongly correlated, to avoid unnecessary overhead in both
1343      <command>ANALYZE</command> and later query planning.
1344     </para>
1345
1346     <para>
1347      Here is an example of collecting functional-dependency statistics:
1348 <programlisting>
1349 CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
1350
1351 ANALYZE zipcodes;
1352
1353 SELECT stxname, stxkeys, stxddependencies
1354   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
1355   WHERE stxname = 'stts';
1356  stxname | stxkeys |             stxddependencies
1357 ---------+---------+------------------------------------------
1358  stts    | 1 5     | {"1 => 5": 1.000000, "5 => 1": 0.423130}
1359 (1 row)
1360 </programlisting>
1361      Here it can be seen that column 1 (zip code) fully determines column
1362      5 (city) so the coefficient is 1.0, while city only determines zip code
1363      about 42% of the time, meaning that there are many cities (58%) that are
1364      represented by more than a single ZIP code.
1365     </para>
1366
1367     <para>
1368      When computing the selectivity for a query involving functionally
1369      dependent columns, the planner adjusts the per-condition selectivity
1370      estimates using the dependency coefficients so as not to produce
1371      an underestimate.
1372     </para>
1373
1374     <sect4 id="planner-stats-extended-functional-deps-limits">
1375      <title>Limitations of Functional Dependencies</title>
1376
1377      <para>
1378       Functional dependencies are currently only applied when considering
1379       simple equality conditions that compare columns to constant values,
1380       and <literal>IN</literal> clauses with constant values.
1381       They are not used to improve estimates for equality conditions
1382       comparing two columns or comparing a column to an expression, nor for
1383       range clauses, <literal>LIKE</literal> or any other type of condition.
1384      </para>
1385
1386      <para>
1387       When estimating with functional dependencies, the planner assumes that
1388       conditions on the involved columns are compatible and hence redundant.
1389       If they are incompatible, the correct estimate would be zero rows, but
1390       that possibility is not considered.  For example, given a query like
1391 <programlisting>
1392 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
1393 </programlisting>
1394       the planner will disregard the <structfield>city</structfield> clause as not
1395       changing the selectivity, which is correct.  However, it will make
1396       the same assumption about
1397 <programlisting>
1398 SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
1399 </programlisting>
1400       even though there will really be zero rows satisfying this query.
1401       Functional dependency statistics do not provide enough information
1402       to conclude that, however.
1403      </para>
1404
1405      <para>
1406       In many practical situations, this assumption is usually satisfied;
1407       for example, there might be a GUI in the application that only allows
1408       selecting compatible city and ZIP code values to use in a query.
1409       But if that's not the case, functional dependencies may not be a viable
1410       option.
1411      </para>
1412     </sect4>
1413    </sect3>
1414
1415    <sect3 id="planner-stats-extended-n-distinct-counts">
1416     <title>Multivariate N-Distinct Counts</title>
1417
1418     <para>
1419      Single-column statistics store the number of distinct values in each
1420      column.  Estimates of the number of distinct values when combining more
1421      than one column (for example, for <literal>GROUP BY a, b</literal>) are
1422      frequently wrong when the planner only has single-column statistical
1423      data, causing it to select bad plans.
1424     </para>
1425
1426     <para>
1427      To improve such estimates, <command>ANALYZE</command> can collect n-distinct
1428      statistics for groups of columns.  As before, it's impractical to do
1429      this for every possible column grouping, so data is collected only for
1430      those groups of columns appearing together in a statistics object
1431      defined with the <literal>ndistinct</literal> option.  Data will be collected
1432      for each possible combination of two or more columns from the set of
1433      listed columns.
1434     </para>
1435
1436     <para>
1437      Continuing the previous example, the n-distinct counts in a
1438      table of ZIP codes might look like the following:
1439 <programlisting>
1440 CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
1441
1442 ANALYZE zipcodes;
1443
1444 SELECT stxkeys AS k, stxdndistinct AS nd
1445   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
1446   WHERE stxname = 'stts2';
1447 -[ RECORD 1 ]------------------------------------------------------&zwsp;--
1448 k  | 1 2 5
1449 nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
1450 (1 row)
1451 </programlisting>
1452      This indicates that there are three combinations of columns that
1453      have 33178 distinct values: ZIP code and state; ZIP code and city;
1454      and ZIP code, city and state (the fact that they are all equal is
1455      expected given that ZIP code alone is unique in this table).  On the
1456      other hand, the combination of city and state has only 27435 distinct
1457      values.
1458     </para>
1459
1460     <para>
1461      It's advisable to create <literal>ndistinct</literal> statistics objects only
1462      on combinations of columns that are actually used for grouping, and
1463      for which misestimation of the number of groups is resulting in bad
1464      plans.  Otherwise, the <command>ANALYZE</command> cycles are just wasted.
1465     </para>
1466    </sect3>
1467
1468    <sect3 id="planner-stats-extended-mcv-lists">
1469     <title>Multivariate MCV Lists</title>
1470
1471     <para>
1472      Another type of statistic stored for each column are most-common value
1473      lists.  This allows very accurate estimates for individual columns, but
1474      may result in significant misestimates for queries with conditions on
1475      multiple columns.
1476     </para>
1477
1478     <para>
1479      To improve such estimates, <command>ANALYZE</command> can collect MCV
1480      lists on combinations of columns.  Similarly to functional dependencies
1481      and n-distinct coefficients, it's impractical to do this for every
1482      possible column grouping.  Even more so in this case, as the MCV list
1483      (unlike functional dependencies and n-distinct coefficients) does store
1484      the common column values.  So data is collected only for those groups
1485      of columns appearing together in a statistics object defined with the
1486      <literal>mcv</literal> option.
1487     </para>
1488
1489     <para>
1490      Continuing the previous example, the MCV list for a table of ZIP codes
1491      might look like the following (unlike for simpler types of statistics,
1492      a function is required for inspection of MCV contents):
1493
1494 <programlisting>
1495 CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
1496
1497 ANALYZE zipcodes;
1498
1499 SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
1500                 pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
1501
1502  index |         values         | nulls | frequency | base_frequency
1503 -------+------------------------+-------+-----------+----------------
1504      0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
1505      1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
1506      2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
1507      3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
1508      4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
1509      5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
1510      6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
1511      7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
1512      8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
1513      9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
1514    ...
1515 (99 rows)
1516 </programlisting>
1517      This indicates that the most common combination of city and state is
1518      Washington in DC, with actual frequency (in the sample) about 0.35%.
1519      The base frequency of the combination (as computed from the simple
1520      per-column frequencies) is only 0.0027%, resulting in two orders of
1521      magnitude under-estimates.
1522     </para>
1523
1524     <para>
1525      It's advisable to create <acronym>MCV</acronym> statistics objects only
1526      on combinations of columns that are actually used in conditions together,
1527      and for which misestimation of the number of groups is resulting in bad
1528      plans.  Otherwise, the <command>ANALYZE</command> and planning cycles
1529      are just wasted.
1530     </para>
1531    </sect3>
1532
1533   </sect2>
1534  </sect1>
1535
1536  <sect1 id="explicit-joins">
1537   <title>Controlling the Planner with Explicit <literal>JOIN</literal> Clauses</title>
1538
1539   <indexterm zone="explicit-joins">
1540    <primary>join</primary>
1541    <secondary>controlling the order</secondary>
1542   </indexterm>
1543
1544   <para>
1545    It is possible
1546    to control the query planner to some extent by using the explicit <literal>JOIN</literal>
1547    syntax.  To see why this matters, we first need some background.
1548   </para>
1549
1550   <para>
1551    In a simple join query, such as:
1552 <programlisting>
1553 SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
1554 </programlisting>
1555    the planner is free to join the given tables in any order.  For
1556    example, it could generate a query plan that joins A to B, using
1557    the <literal>WHERE</literal> condition <literal>a.id = b.id</literal>, and then
1558    joins C to this joined table, using the other <literal>WHERE</literal>
1559    condition.  Or it could join B to C and then join A to that result.
1560    Or it could join A to C and then join them with B &mdash; but that
1561    would be inefficient, since the full Cartesian product of A and C
1562    would have to be formed, there being no applicable condition in the
1563    <literal>WHERE</literal> clause to allow optimization of the join.  (All
1564    joins in the <productname>PostgreSQL</productname> executor happen
1565    between two input tables, so it's necessary to build up the result
1566    in one or another of these fashions.)  The important point is that
1567    these different join possibilities give semantically equivalent
1568    results but might have hugely different execution costs.  Therefore,
1569    the planner will explore all of them to try to find the most
1570    efficient query plan.
1571   </para>
1572
1573   <para>
1574    When a query only involves two or three tables, there aren't many join
1575    orders to worry about.  But the number of possible join orders grows
1576    exponentially as the number of tables expands.  Beyond ten or so input
1577    tables it's no longer practical to do an exhaustive search of all the
1578    possibilities, and even for six or seven tables planning might take an
1579    annoyingly long time.  When there are too many input tables, the
1580    <productname>PostgreSQL</productname> planner will switch from exhaustive
1581    search to a <firstterm>genetic</firstterm> probabilistic search
1582    through a limited number of possibilities.  (The switch-over threshold is
1583    set by the <xref linkend="guc-geqo-threshold"/> run-time
1584    parameter.)
1585    The genetic search takes less time, but it won't
1586    necessarily find the best possible plan.
1587   </para>
1588
1589   <para>
1590    When the query involves outer joins, the planner has less freedom
1591    than it does for plain (inner) joins. For example, consider:
1592 <programlisting>
1593 SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
1594 </programlisting>
1595    Although this query's restrictions are superficially similar to the
1596    previous example, the semantics are different because a row must be
1597    emitted for each row of A that has no matching row in the join of B and C.
1598    Therefore the planner has no choice of join order here: it must join
1599    B to C and then join A to that result.  Accordingly, this query takes
1600    less time to plan than the previous query.  In other cases, the planner
1601    might be able to determine that more than one join order is safe.
1602    For example, given:
1603 <programlisting>
1604 SELECT * FROM a LEFT JOIN b ON (a.bid = b.id) LEFT JOIN c ON (a.cid = c.id);
1605 </programlisting>
1606    it is valid to join A to either B or C first.  Currently, only
1607    <literal>FULL JOIN</literal> completely constrains the join order.  Most
1608    practical cases involving <literal>LEFT JOIN</literal> or <literal>RIGHT JOIN</literal>
1609    can be rearranged to some extent.
1610   </para>
1611
1612   <para>
1613    Explicit inner join syntax (<literal>INNER JOIN</literal>, <literal>CROSS
1614    JOIN</literal>, or unadorned <literal>JOIN</literal>) is semantically the same as
1615    listing the input relations in <literal>FROM</literal>, so it does not
1616    constrain the join order.
1617   </para>
1618
1619   <para>
1620    Even though most kinds of <literal>JOIN</literal> don't completely constrain
1621    the join order, it is possible to instruct the
1622    <productname>PostgreSQL</productname> query planner to treat all
1623    <literal>JOIN</literal> clauses as constraining the join order anyway.
1624    For example, these three queries are logically equivalent:
1625 <programlisting>
1626 SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
1627 SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
1628 SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
1629 </programlisting>
1630    But if we tell the planner to honor the <literal>JOIN</literal> order,
1631    the second and third take less time to plan than the first.  This effect
1632    is not worth worrying about for only three tables, but it can be a
1633    lifesaver with many tables.
1634   </para>
1635
1636   <para>
1637    To force the planner to follow the join order laid out by explicit
1638    <literal>JOIN</literal>s,
1639    set the <xref linkend="guc-join-collapse-limit"/> run-time parameter to 1.
1640    (Other possible values are discussed below.)
1641   </para>
1642
1643   <para>
1644    You do not need to constrain the join order completely in order to
1645    cut search time, because it's OK to use <literal>JOIN</literal> operators
1646    within items of a plain <literal>FROM</literal> list.  For example, consider:
1647 <programlisting>
1648 SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
1649 </programlisting>
1650    With <varname>join_collapse_limit</varname> = 1, this
1651    forces the planner to join A to B before joining them to other tables,
1652    but doesn't constrain its choices otherwise.  In this example, the
1653    number of possible join orders is reduced by a factor of 5.
1654   </para>
1655
1656   <para>
1657    Constraining the planner's search in this way is a useful technique
1658    both for reducing planning time and for directing the planner to a
1659    good query plan.  If the planner chooses a bad join order by default,
1660    you can force it to choose a better order via <literal>JOIN</literal> syntax
1661    &mdash; assuming that you know of a better order, that is.  Experimentation
1662    is recommended.
1663   </para>
1664
1665   <para>
1666    A closely related issue that affects planning time is collapsing of
1667    subqueries into their parent query.  For example, consider:
1668 <programlisting>
1669 SELECT *
1670 FROM x, y,
1671     (SELECT * FROM a, b, c WHERE something) AS ss
1672 WHERE somethingelse;
1673 </programlisting>
1674    This situation might arise from use of a view that contains a join;
1675    the view's <literal>SELECT</literal> rule will be inserted in place of the view
1676    reference, yielding a query much like the above.  Normally, the planner
1677    will try to collapse the subquery into the parent, yielding:
1678 <programlisting>
1679 SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
1680 </programlisting>
1681    This usually results in a better plan than planning the subquery
1682    separately.  (For example, the outer <literal>WHERE</literal> conditions might be such that
1683    joining X to A first eliminates many rows of A, thus avoiding the need to
1684    form the full logical output of the subquery.)  But at the same time,
1685    we have increased the planning time; here, we have a five-way join
1686    problem replacing two separate three-way join problems.  Because of the
1687    exponential growth of the number of possibilities, this makes a big
1688    difference.  The planner tries to avoid getting stuck in huge join search
1689    problems by not collapsing a subquery if more than <varname>from_collapse_limit</varname>
1690    <literal>FROM</literal> items would result in the parent
1691    query.  You can trade off planning time against quality of plan by
1692    adjusting this run-time parameter up or down.
1693   </para>
1694
1695   <para>
1696    <xref linkend="guc-from-collapse-limit"/> and <xref
1697    linkend="guc-join-collapse-limit"/>
1698    are similarly named because they do almost the same thing: one controls
1699    when the planner will <quote>flatten out</quote> subqueries, and the
1700    other controls when it will flatten out explicit joins.  Typically
1701    you would either set <varname>join_collapse_limit</varname> equal to
1702    <varname>from_collapse_limit</varname> (so that explicit joins and subqueries
1703    act similarly) or set <varname>join_collapse_limit</varname> to 1 (if you want
1704    to control join order with explicit joins).  But you might set them
1705    differently if you are trying to fine-tune the trade-off between planning
1706    time and run time.
1707   </para>
1708  </sect1>
1709
1710  <sect1 id="populate">
1711   <title>Populating a Database</title>
1712
1713   <para>
1714    One might need to insert a large amount of data when first populating
1715    a database. This section contains some suggestions on how to make
1716    this process as efficient as possible.
1717   </para>
1718
1719   <sect2 id="disable-autocommit">
1720    <title>Disable Autocommit</title>
1721
1722    <indexterm>
1723     <primary>autocommit</primary>
1724     <secondary>bulk-loading data</secondary>
1725    </indexterm>
1726
1727    <para>
1728     When using multiple <command>INSERT</command>s, turn off autocommit and just do
1729     one commit at the end.  (In plain
1730     SQL, this means issuing <command>BEGIN</command> at the start and
1731     <command>COMMIT</command> at the end.  Some client libraries might
1732     do this behind your back, in which case you need to make sure the
1733     library does it when you want it done.)  If you allow each
1734     insertion to be committed separately,
1735     <productname>PostgreSQL</productname> is doing a lot of work for
1736     each row that is added.  An additional benefit of doing all
1737     insertions in one transaction is that if the insertion of one row
1738     were to fail then the insertion of all rows inserted up to that
1739     point would be rolled back, so you won't be stuck with partially
1740     loaded data.
1741    </para>
1742   </sect2>
1743
1744   <sect2 id="populate-copy-from">
1745    <title>Use <command>COPY</command></title>
1746
1747    <para>
1748     Use <link linkend="sql-copy"><command>COPY</command></link> to load
1749     all the rows in one command, instead of using a series of
1750     <command>INSERT</command> commands.  The <command>COPY</command>
1751     command is optimized for loading large numbers of rows; it is less
1752     flexible than <command>INSERT</command>, but incurs significantly
1753     less overhead for large data loads. Since <command>COPY</command>
1754     is a single command, there is no need to disable autocommit if you
1755     use this method to populate a table.
1756    </para>
1757
1758    <para>
1759     If you cannot use <command>COPY</command>, it might help to use <link
1760     linkend="sql-prepare"><command>PREPARE</command></link> to create a
1761     prepared <command>INSERT</command> statement, and then use
1762     <command>EXECUTE</command> as many times as required.  This avoids
1763     some of the overhead of repeatedly parsing and planning
1764     <command>INSERT</command>. Different interfaces provide this facility
1765     in different ways; look for <quote>prepared statements</quote> in the interface
1766     documentation.
1767    </para>
1768
1769    <para>
1770     Note that loading a large number of rows using
1771     <command>COPY</command> is almost always faster than using
1772     <command>INSERT</command>, even if <command>PREPARE</command> is used and
1773     multiple insertions are batched into a single transaction.
1774    </para>
1775
1776    <para>
1777     <command>COPY</command> is fastest when used within the same
1778     transaction as an earlier <command>CREATE TABLE</command> or
1779     <command>TRUNCATE</command> command. In such cases no WAL
1780     needs to be written, because in case of an error, the files
1781     containing the newly loaded data will be removed anyway.
1782     However, this consideration only applies when
1783     <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
1784     as all commands must write WAL otherwise.
1785    </para>
1786
1787   </sect2>
1788
1789   <sect2 id="populate-rm-indexes">
1790    <title>Remove Indexes</title>
1791
1792    <para>
1793     If you are loading a freshly created table, the fastest method is to
1794     create the table, bulk load the table's data using
1795     <command>COPY</command>, then create any indexes needed for the
1796     table.  Creating an index on pre-existing data is quicker than
1797     updating it incrementally as each row is loaded.
1798    </para>
1799
1800    <para>
1801     If you are adding large amounts of data to an existing table,
1802     it might be a win to drop the indexes,
1803     load the table, and then recreate the indexes.  Of course, the
1804     database performance for other users might suffer
1805     during the time the indexes are missing.  One should also think
1806     twice before dropping a unique index, since the error checking
1807     afforded by the unique constraint will be lost while the index is
1808     missing.
1809    </para>
1810   </sect2>
1811
1812   <sect2 id="populate-rm-fkeys">
1813    <title>Remove Foreign Key Constraints</title>
1814
1815    <para>
1816     Just as with indexes, a foreign key constraint can be checked
1817     <quote>in bulk</quote> more efficiently than row-by-row.  So it might be
1818     useful to drop foreign key constraints, load data, and re-create
1819     the constraints.  Again, there is a trade-off between data load
1820     speed and loss of error checking while the constraint is missing.
1821    </para>
1822
1823    <para>
1824     What's more, when you load data into a table with existing foreign key
1825     constraints, each new row requires an entry in the server's list of
1826     pending trigger events (since it is the firing of a trigger that checks
1827     the row's foreign key constraint).  Loading many millions of rows can
1828     cause the trigger event queue to overflow available memory, leading to
1829     intolerable swapping or even outright failure of the command.  Therefore
1830     it may be <emphasis>necessary</emphasis>, not just desirable, to drop and re-apply
1831     foreign keys when loading large amounts of data.  If temporarily removing
1832     the constraint isn't acceptable, the only other recourse may be to split
1833     up the load operation into smaller transactions.
1834    </para>
1835   </sect2>
1836
1837   <sect2 id="populate-work-mem">
1838    <title>Increase <varname>maintenance_work_mem</varname></title>
1839
1840    <para>
1841     Temporarily increasing the <xref linkend="guc-maintenance-work-mem"/>
1842     configuration variable when loading large amounts of data can
1843     lead to improved performance.  This will help to speed up <command>CREATE
1844     INDEX</command> commands and <command>ALTER TABLE ADD FOREIGN KEY</command> commands.
1845     It won't do much for <command>COPY</command> itself, so this advice is
1846     only useful when you are using one or both of the above techniques.
1847    </para>
1848   </sect2>
1849
1850   <sect2 id="populate-max-wal-size">
1851    <title>Increase <varname>max_wal_size</varname></title>
1852
1853    <para>
1854     Temporarily increasing the <xref linkend="guc-max-wal-size"/>
1855     configuration variable can also
1856     make large data loads faster.  This is because loading a large
1857     amount of data into <productname>PostgreSQL</productname> will
1858     cause checkpoints to occur more often than the normal checkpoint
1859     frequency (specified by the <varname>checkpoint_timeout</varname>
1860     configuration variable). Whenever a checkpoint occurs, all dirty
1861     pages must be flushed to disk. By increasing
1862     <varname>max_wal_size</varname> temporarily during bulk
1863     data loads, the number of checkpoints that are required can be
1864     reduced.
1865    </para>
1866   </sect2>
1867
1868   <sect2 id="populate-pitr">
1869    <title>Disable WAL Archival and Streaming Replication</title>
1870
1871    <para>
1872     When loading large amounts of data into an installation that uses
1873     WAL archiving or streaming replication, it might be faster to take a
1874     new base backup after the load has completed than to process a large
1875     amount of incremental WAL data.  To prevent incremental WAL logging
1876     while loading, disable archiving and streaming replication, by setting
1877     <xref linkend="guc-wal-level"/> to <literal>minimal</literal>,
1878     <xref linkend="guc-archive-mode"/> to <literal>off</literal>, and
1879     <xref linkend="guc-max-wal-senders"/> to zero.
1880     But note that changing these settings requires a server restart,
1881     and makes any base backups taken before unavailable for archive
1882     recovery and standby server, which may lead to data loss.
1883    </para>
1884
1885    <para>
1886     Aside from avoiding the time for the archiver or WAL sender to process the
1887     WAL data, doing this will actually make certain commands faster, because
1888     they do not to write WAL at all if <varname>wal_level</varname>
1889     is <literal>minimal</literal> and the current subtransaction (or top-level
1890     transaction) created or truncated the table or index they change.  (They
1891     can guarantee crash safety more cheaply by doing
1892     an <function>fsync</function> at the end than by writing WAL.)
1893    </para>
1894   </sect2>
1895
1896   <sect2 id="populate-analyze">
1897    <title>Run <command>ANALYZE</command> Afterwards</title>
1898
1899    <para>
1900     Whenever you have significantly altered the distribution of data
1901     within a table, running <link linkend="sql-analyze"><command>ANALYZE</command></link> is strongly recommended. This
1902     includes bulk loading large amounts of data into the table.  Running
1903     <command>ANALYZE</command> (or <command>VACUUM ANALYZE</command>)
1904     ensures that the planner has up-to-date statistics about the
1905     table.  With no statistics or obsolete statistics, the planner might
1906     make poor decisions during query planning, leading to poor
1907     performance on any tables with inaccurate or nonexistent
1908     statistics.  Note that if the autovacuum daemon is enabled, it might
1909     run <command>ANALYZE</command> automatically; see
1910     <xref linkend="vacuum-for-statistics"/>
1911     and <xref linkend="autovacuum"/> for more information.
1912    </para>
1913   </sect2>
1914
1915   <sect2 id="populate-pg-dump">
1916    <title>Some Notes about <application>pg_dump</application></title>
1917
1918    <para>
1919     Dump scripts generated by <application>pg_dump</application> automatically apply
1920     several, but not all, of the above guidelines.  To restore a
1921     <application>pg_dump</application> dump as quickly as possible, you need to
1922     do a few extra things manually.  (Note that these points apply while
1923     <emphasis>restoring</emphasis> a dump, not while <emphasis>creating</emphasis> it.
1924     The same points apply whether loading a text dump with
1925     <application>psql</application> or using <application>pg_restore</application> to load
1926     from a <application>pg_dump</application> archive file.)
1927    </para>
1928
1929    <para>
1930     By default, <application>pg_dump</application> uses <command>COPY</command>, and when
1931     it is generating a complete schema-and-data dump, it is careful to
1932     load data before creating indexes and foreign keys.  So in this case
1933     several guidelines are handled automatically.  What is left
1934     for you to do is to:
1935     <itemizedlist>
1936      <listitem>
1937       <para>
1938        Set appropriate (i.e., larger than normal) values for
1939        <varname>maintenance_work_mem</varname> and
1940        <varname>max_wal_size</varname>.
1941       </para>
1942      </listitem>
1943      <listitem>
1944       <para>
1945        If using WAL archiving or streaming replication, consider disabling
1946        them during the restore. To do that, set <varname>archive_mode</varname>
1947        to <literal>off</literal>,
1948        <varname>wal_level</varname> to <literal>minimal</literal>, and
1949        <varname>max_wal_senders</varname> to zero before loading the dump.
1950        Afterwards, set them back to the right values and take a fresh
1951        base backup.
1952       </para>
1953      </listitem>
1954      <listitem>
1955       <para>
1956        Experiment with the parallel dump and restore modes of both
1957        <application>pg_dump</application> and <application>pg_restore</application> and find the
1958        optimal number of concurrent jobs to use. Dumping and restoring in
1959        parallel by means of the <option>-j</option> option should give you a
1960        significantly higher performance over the serial mode.
1961       </para>
1962      </listitem>
1963      <listitem>
1964       <para>
1965        Consider whether the whole dump should be restored as a single
1966        transaction.  To do that, pass the <option>-1</option> or
1967        <option>--single-transaction</option> command-line option to
1968        <application>psql</application> or <application>pg_restore</application>. When using this
1969        mode, even the smallest of errors will rollback the entire restore,
1970        possibly discarding many hours of processing.  Depending on how
1971        interrelated the data is, that might seem preferable to manual cleanup,
1972        or not.  <command>COPY</command> commands will run fastest if you use a single
1973        transaction and have WAL archiving turned off.
1974       </para>
1975      </listitem>
1976      <listitem>
1977       <para>
1978        If multiple CPUs are available in the database server, consider using
1979        <application>pg_restore</application>'s <option>--jobs</option> option.  This
1980        allows concurrent data loading and index creation.
1981       </para>
1982      </listitem>
1983      <listitem>
1984       <para>
1985        Run <command>ANALYZE</command> afterwards.
1986       </para>
1987      </listitem>
1988     </itemizedlist>
1989    </para>
1990
1991    <para>
1992     A data-only dump will still use <command>COPY</command>, but it does not
1993     drop or recreate indexes, and it does not normally touch foreign
1994     keys.
1995
1996      <footnote>
1997       <para>
1998        You can get the effect of disabling foreign keys by using
1999        the <option>--disable-triggers</option> option &mdash; but realize that
2000        that eliminates, rather than just postpones, foreign key
2001        validation, and so it is possible to insert bad data if you use it.
2002       </para>
2003      </footnote>
2004
2005     So when loading a data-only dump, it is up to you to drop and recreate
2006     indexes and foreign keys if you wish to use those techniques.
2007     It's still useful to increase <varname>max_wal_size</varname>
2008     while loading the data, but don't bother increasing
2009     <varname>maintenance_work_mem</varname>; rather, you'd do that while
2010     manually recreating indexes and foreign keys afterwards.
2011     And don't forget to <command>ANALYZE</command> when you're done; see
2012     <xref linkend="vacuum-for-statistics"/>
2013     and <xref linkend="autovacuum"/> for more information.
2014    </para>
2015   </sect2>
2016   </sect1>
2017
2018   <sect1 id="non-durability">
2019    <title>Non-Durable Settings</title>
2020
2021    <indexterm zone="non-durability">
2022     <primary>non-durable</primary>
2023    </indexterm>
2024
2025    <para>
2026     Durability is a database feature that guarantees the recording of
2027     committed transactions even if the server crashes or loses
2028     power.  However, durability adds significant database overhead,
2029     so if your site does not require such a guarantee,
2030     <productname>PostgreSQL</productname> can be configured to run
2031     much faster.  The following are configuration changes you can make
2032     to improve performance in such cases.  Except as noted below, durability
2033     is still guaranteed in case of a crash of the database software;
2034     only an abrupt operating system crash creates a risk of data loss
2035     or corruption when these settings are used.
2036
2037     <itemizedlist>
2038      <listitem>
2039       <para>
2040        Place the database cluster's data directory in a memory-backed
2041        file system (i.e., <acronym>RAM</acronym> disk).  This eliminates all
2042        database disk I/O, but limits data storage to the amount of
2043        available memory (and perhaps swap).
2044       </para>
2045      </listitem>
2046
2047      <listitem>
2048       <para>
2049        Turn off <xref linkend="guc-fsync"/>;  there is no need to flush
2050        data to disk.
2051       </para>
2052      </listitem>
2053
2054      <listitem>
2055       <para>
2056        Turn off <xref linkend="guc-synchronous-commit"/>;  there might be no
2057        need to force <acronym>WAL</acronym> writes to disk on every
2058        commit.  This setting does risk transaction loss (though not data
2059        corruption) in case of a crash of the <emphasis>database</emphasis>.
2060       </para>
2061      </listitem>
2062
2063      <listitem>
2064       <para>
2065        Turn off <xref linkend="guc-full-page-writes"/>;  there is no need
2066        to guard against partial page writes.
2067       </para>
2068      </listitem>
2069
2070      <listitem>
2071       <para>
2072        Increase <xref linkend="guc-max-wal-size"/> and <xref
2073        linkend="guc-checkpoint-timeout"/>; this reduces the frequency
2074        of checkpoints, but increases the storage requirements of
2075        <filename>/pg_wal</filename>.
2076       </para>
2077      </listitem>
2078
2079      <listitem>
2080       <para>
2081        Create <link linkend="sql-createtable-unlogged">unlogged
2082        tables</link> to avoid <acronym>WAL</acronym> writes, though it
2083        makes the tables non-crash-safe.
2084       </para>
2085      </listitem>
2086
2087     </itemizedlist>
2088    </para>
2089   </sect1>
2090
2091  </chapter>