doc/src/sgml/indices.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <chapter id="indexes">
   4  <title id="indexes-title">Indexes</title>
   5
   6  <indexterm zone="indexes">
   7   <primary>index</primary>
   8  </indexterm>
   9
  10  <para>
  11   Indexes are a common way to enhance database performance.  An index
  12   allows the database server to find and retrieve specific rows much
  13   faster than it could do without an index.  But indexes also add
  14   overhead to the database system as a whole, so they should be used
  15   sensibly.
  16  </para>
  17
  18
  19  <sect1 id="indexes-intro">
  20   <title>Introduction</title>
  21
  22   <para>
  23    Suppose we have a table similar to this:
  24 <programlisting>
  25 CREATE TABLE test1 (
  26     id integer,
  27     content varchar
  28 );
  29 </programlisting>
  30    and the application issues many queries of the form:
  31 <programlisting>
  32 SELECT content FROM test1 WHERE id = <replaceable>constant</replaceable>;
  33 </programlisting>
  34    With no advance preparation, the system would have to scan the entire
  35    <structname>test1</structname> table, row by row, to find all
  36    matching entries.  If there are many rows in
  37    <structname>test1</structname> and only a few rows (perhaps zero
  38    or one) that would be returned by such a query, this is clearly an
  39    inefficient method.  But if the system has been instructed to maintain an
  40    index on the <structfield>id</structfield> column, it can use a more
  41    efficient method for locating matching rows.  For instance, it
  42    might only have to walk a few levels deep into a search tree.
  43   </para>
  44
  45   <para>
  46    A similar approach is used in most non-fiction books:  terms and
  47    concepts that are frequently looked up by readers are collected in
  48    an alphabetic index at the end of the book.  The interested reader
  49    can scan the index relatively quickly and flip to the appropriate
  50    page(s), rather than having to read the entire book to find the
  51    material of interest.  Just as it is the task of the author to
  52    anticipate the items that readers are likely to look up,
  53    it is the task of the database programmer to foresee which indexes
  54    will be useful.
  55   </para>
  56
  57   <para>
  58    The following command can be used to create an index on the
  59    <structfield>id</structfield> column, as discussed:
  60 <programlisting>
  61 CREATE INDEX test1_id_index ON test1 (id);
  62 </programlisting>
  63    The name <structname>test1_id_index</structname> can be chosen
  64    freely, but you should pick something that enables you to remember
  65    later what the index was for.
  66   </para>
  67
  68   <para>
  69    To remove an index, use the <command>DROP INDEX</command> command.
  70    Indexes can be added to and removed from tables at any time.
  71   </para>
  72
  73   <para>
  74    Once an index is created, no further intervention is required: the
  75    system will update the index when the table is modified, and it will
  76    use the index in queries when it thinks doing so would be more efficient
  77    than a sequential table scan.  But you might have to run the
  78    <command>ANALYZE</command> command regularly to update
  79    statistics to allow the query planner to make educated decisions.
  80    See <xref linkend="performance-tips"> for information about
  81    how to find out whether an index is used and when and why the
  82    planner might choose <emphasis>not</emphasis> to use an index.
  83   </para>
  84
  85   <para>
  86    Indexes can also benefit <command>UPDATE</command> and
  87    <command>DELETE</command> commands with search conditions.
  88    Indexes can moreover be used in join searches.  Thus,
  89    an index defined on a column that is part of a join condition can
  90    also significantly speed up queries with joins.
  91   </para>
  92
  93   <para>
  94    Creating an index on a large table can take a long time.  By default,
  95    <productname>PostgreSQL</productname> allows reads (selects) to occur
  96    on the table in parallel with index creation, but writes (INSERTs,
  97    UPDATEs, DELETEs) are blocked until the index build is finished.
  98    In production environments this is often unacceptable.
  99    It is possible to allow writes to occur in parallel with index
 100    creation, but there are several caveats to be aware of &mdash;
 101    for more information see <xref linkend="SQL-CREATEINDEX-CONCURRENTLY"
 102    endterm="SQL-CREATEINDEX-CONCURRENTLY-title">.
 103   </para>
 104
 105   <para>
 106    After an index is created, the system has to keep it synchronized with the
 107    table.  This adds overhead to data manipulation operations.
 108    Therefore indexes that are seldom or never used in queries
 109    should be removed.
 110   </para>
 111  </sect1>
 112
 113
 114  <sect1 id="indexes-types">
 115   <title>Index Types</title>
 116
 117   <para>
 118    <productname>PostgreSQL</productname> provides several index types:
 119    B-tree, Hash, GiST and GIN.  Each index type uses a different
 120    algorithm that is best suited to different types of queries.
 121    By default, the <command>CREATE INDEX</command> command creates
 122    B-tree indexes, which fit the most common situations.
 123   </para>
 124
 125   <para>
 126    <indexterm>
 127     <primary>index</primary>
 128     <secondary>B-tree</secondary>
 129    </indexterm>
 130    <indexterm>
 131     <primary>B-tree</primary>
 132     <see>index</see>
 133    </indexterm>
 134    B-trees can handle equality and range queries on data that can be sorted
 135    into some ordering.
 136    In particular, the <productname>PostgreSQL</productname> query planner
 137    will consider using a B-tree index whenever an indexed column is
 138    involved in a comparison using one of these operators:
 139
 140    <simplelist>
 141     <member><literal>&lt;</literal></member>
 142     <member><literal>&lt;=</literal></member>
 143     <member><literal>=</literal></member>
 144     <member><literal>&gt;=</literal></member>
 145     <member><literal>&gt;</literal></member>
 146    </simplelist>
 147
 148    Constructs equivalent to combinations of these operators, such as
 149    <literal>BETWEEN</> and <literal>IN</>, can also be implemented with
 150    a B-tree index search.  Also, an <literal>IS NULL</> condition on
 151    an index column can be used with a B-tree index.
 152   </para>
 153
 154   <para>
 155    The optimizer can also use a B-tree index for queries involving the
 156    pattern matching operators <literal>LIKE</> and <literal>~</literal>
 157    <emphasis>if</emphasis> the pattern is a constant and is anchored to
 158    the beginning of the string &mdash; for example, <literal>col LIKE
 159    'foo%'</literal> or <literal>col ~ '^foo'</literal>, but not
 160    <literal>col LIKE '%bar'</literal>. However, if your database does not
 161    use the C locale you will need to create the index with a special
 162    operator class to support indexing of pattern-matching queries; see
 163    <xref linkend="indexes-opclass"> below. It is also possible to use
 164    B-tree indexes for <literal>ILIKE</literal> and
 165    <literal>~*</literal>, but only if the pattern starts with
 166    non-alphabetic characters, i.e., characters that are not affected by
 167    upper/lower case conversion.
 168   </para>
 169
 170
 171   <para>
 172    <indexterm>
 173     <primary>index</primary>
 174     <secondary>hash</secondary>
 175    </indexterm>
 176    <indexterm>
 177     <primary>hash</primary>
 178     <see>index</see>
 179    </indexterm>
 180    Hash indexes can only handle simple equality comparisons.
 181    The query planner will consider using a hash index whenever an
 182    indexed column is involved in a comparison using the
 183    <literal>=</literal> operator.  (Hash indexes do not support
 184    <literal>IS NULL</> searches.)
 185    The following command is used to create a hash index:
 186 <synopsis>
 187 CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING hash (<replaceable>column</replaceable>);
 188 </synopsis>
 189   </para>
 190
 191   <note>
 192    <para>
 193     Hash index operations are not presently WAL-logged,
 194     so hash indexes might need to be rebuilt with <command>REINDEX</>
 195     after a database crash.
 196     For this reason, hash index use is presently discouraged.
 197    </para>
 198   </note>
 199
 200   <para>
 201    <indexterm>
 202     <primary>index</primary>
 203     <secondary>GiST</secondary>
 204    </indexterm>
 205    <indexterm>
 206     <primary>GiST</primary>
 207     <see>index</see>
 208    </indexterm>
 209    GiST indexes are not a single kind of index, but rather an infrastructure
 210    within which many different indexing strategies can be implemented.
 211    Accordingly, the particular operators with which a GiST index can be
 212    used vary depending on the indexing strategy (the <firstterm>operator
 213    class</>).  As an example, the standard distribution of
 214    <productname>PostgreSQL</productname> includes GiST operator classes
 215    for several two-dimensional geometric data types, which support indexed
 216    queries using these operators:
 217
 218    <simplelist>
 219     <member><literal>&lt;&lt;</literal></member>
 220     <member><literal>&amp;&lt;</literal></member>
 221     <member><literal>&amp;&gt;</literal></member>
 222     <member><literal>&gt;&gt;</literal></member>
 223     <member><literal>&lt;&lt;|</literal></member>
 224     <member><literal>&amp;&lt;|</literal></member>
 225     <member><literal>|&amp;&gt;</literal></member>
 226     <member><literal>|&gt;&gt;</literal></member>
 227     <member><literal>@&gt;</literal></member>
 228     <member><literal>&lt;@</literal></member>
 229     <member><literal>~=</literal></member>
 230     <member><literal>&amp;&amp;</literal></member>
 231    </simplelist>
 232
 233    (See <xref linkend="functions-geometry"> for the meaning of
 234    these operators.)
 235    Many other GiST operator
 236    classes are available in the <literal>contrib</> collection or as separate
 237    projects.  For more information see <xref linkend="GiST">.
 238   </para>
 239   <para>
 240    <indexterm>
 241     <primary>index</primary>
 242     <secondary>GIN</secondary>
 243    </indexterm>
 244    <indexterm>
 245     <primary>GIN</primary>
 246     <see>index</see>
 247    </indexterm>
 248    GIN indexes are inverted indexes which can handle values that contain more
 249    than one key, arrays for example. Like GiST, GIN can support
 250    many different user-defined indexing strategies and the particular
 251    operators with which a GIN index can be used vary depending on the
 252    indexing strategy.
 253    As an example, the standard distribution of
 254    <productname>PostgreSQL</productname> includes GIN operator classes
 255    for one-dimensional arrays, which support indexed
 256    queries using these operators:
 257
 258    <simplelist>
 259     <member><literal>&lt;@</literal></member>
 260     <member><literal>@&gt;</literal></member>
 261     <member><literal>=</literal></member>
 262     <member><literal>&amp;&amp;</literal></member>
 263    </simplelist>
 264
 265    (See <xref linkend="functions-array"> for the meaning of
 266    these operators.)
 267    Many other GIN operator
 268    classes are available in the <literal>contrib</> collection or as separate
 269    projects.  For more information see <xref linkend="GIN">.
 270   </para>
 271  </sect1>
 272
 273
 274  <sect1 id="indexes-multicolumn">
 275   <title>Multicolumn Indexes</title>
 276
 277   <indexterm zone="indexes-multicolumn">
 278    <primary>index</primary>
 279    <secondary>multicolumn</secondary>
 280   </indexterm>
 281
 282   <para>
 283    An index can be defined on more than one column of a table.  For example, if
 284    you have a table of this form:
 285 <programlisting>
 286 CREATE TABLE test2 (
 287   major int,
 288   minor int,
 289   name varchar
 290 );
 291 </programlisting>
 292    (say, you keep your <filename class="directory">/dev</filename>
 293    directory in a database...) and you frequently issue queries like:
 294 <programlisting>
 295 SELECT name FROM test2 WHERE major = <replaceable>constant</replaceable> AND minor = <replaceable>constant</replaceable>;
 296 </programlisting>
 297    then it might be appropriate to define an index on the columns
 298    <structfield>major</structfield> and
 299    <structfield>minor</structfield> together, e.g.:
 300 <programlisting>
 301 CREATE INDEX test2_mm_idx ON test2 (major, minor);
 302 </programlisting>
 303   </para>
 304
 305   <para>
 306    Currently, only the B-tree, GiST and GIN index types support multicolumn
 307    indexes.  Up to 32 columns can be specified.  (This limit can be
 308    altered when building <productname>PostgreSQL</productname>; see the
 309    file <filename>pg_config_manual.h</filename>.)
 310   </para>
 311
 312   <para>
 313    A multicolumn B-tree index can be used with query conditions that
 314    involve any subset of the index's columns, but the index is most
 315    efficient when there are constraints on the leading (leftmost) columns.
 316    The exact rule is that equality constraints on leading columns, plus
 317    any inequality constraints on the first column that does not have an
 318    equality constraint, will be used to limit the portion of the index
 319    that is scanned.  Constraints on columns to the right of these columns
 320    are checked in the index, so they save visits to the table proper, but
 321    they do not reduce the portion of the index that has to be scanned.
 322    For example, given an index on <literal>(a, b, c)</literal> and a
 323    query condition <literal>WHERE a = 5 AND b &gt;= 42 AND c &lt; 77</>,
 324    the index would have to be scanned from the first entry with
 325    <literal>a</> = 5 and <literal>b</> = 42 up through the last entry with
 326    <literal>a</> = 5.  Index entries with <literal>c</> &gt;= 77 would be
 327    skipped, but they'd still have to be scanned through.
 328    This index could in principle be used for queries that have constraints
 329    on <literal>b</> and/or <literal>c</> with no constraint on <literal>a</>
 330    &mdash; but the entire index would have to be scanned, so in most cases
 331    the planner would prefer a sequential table scan over using the index.
 332   </para>
 333
 334   <para>
 335    A multicolumn GiST index can be used with query conditions that
 336    involve any subset of the index's columns. Conditions on additional
 337    columns restrict the entries returned by the index, but the condition on
 338    the first column is the most important one for determining how much of
 339    the index needs to be scanned.  A GiST index will be relatively
 340    ineffective if its first column has only a few distinct values, even if
 341    there are many distinct values in additional columns.
 342   </para>
 343
 344   <para>
 345    A multicolumn GIN index can be used with query conditions that
 346    involve any subset of the index's columns. Unlike B-tree or GiST,
 347    index search effectiveness is the same regardless of which index column(s)
 348    the query conditions use.
 349   </para>
 350
 351   <para>
 352    Of course, each column must be used with operators appropriate to the index
 353    type; clauses that involve other operators will not be considered.
 354   </para>
 355
 356   <para>
 357    Multicolumn indexes should be used sparingly.  In most situations,
 358    an index on a single column is sufficient and saves space and time.
 359    Indexes with more than three columns are unlikely to be helpful
 360    unless the usage of the table is extremely stylized.  See also
 361    <xref linkend="indexes-bitmap-scans"> for some discussion of the
 362    merits of different index configurations.
 363   </para>
 364  </sect1>
 365
 366
 367  <sect1 id="indexes-ordering">
 368   <title>Indexes and <literal>ORDER BY</></title>
 369
 370   <indexterm zone="indexes-ordering">
 371    <primary>index</primary>
 372    <secondary>and <literal>ORDER BY</></secondary>
 373   </indexterm>
 374
 375   <para>
 376    In addition to simply finding the rows to be returned by a query,
 377    an index may be able to deliver them in a specific sorted order.
 378    This allows a query's <literal>ORDER BY</> specification to be honored
 379    without a separate sorting step.  Of the index types currently
 380    supported by <productname>PostgreSQL</productname>, only B-tree
 381    can produce sorted output &mdash; the other index types return
 382    matching rows in an unspecified, implementation-dependent order.
 383   </para>
 384
 385   <para>
 386    The planner will consider satisfying an <literal>ORDER BY</> specification
 387    either by scanning an available index that matches the specification,
 388    or by scanning the table in physical order and doing an explicit
 389    sort.  For a query that requires scanning a large fraction of the
 390    table, an explicit sort is likely to be faster than using an index
 391    because it requires
 392    less disk I/O due to following a sequential access pattern.  Indexes are
 393    more useful when only a few rows need be fetched.  An important
 394    special case is <literal>ORDER BY</> in combination with
 395    <literal>LIMIT</> <replaceable>n</>: an explicit sort will have to process
 396    all the data to identify the first <replaceable>n</> rows, but if there is
 397    an index matching the <literal>ORDER BY</>, the first <replaceable>n</>
 398    rows can be retrieved directly, without scanning the remainder at all.
 399   </para>
 400
 401   <para>
 402    By default, B-tree indexes store their entries in ascending order
 403    with nulls last.  This means that a forward scan of an index on
 404    column <literal>x</> produces output satisfying <literal>ORDER BY x</>
 405    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</>).  The
 406    index can also be scanned backward, producing output satisfying
 407    <literal>ORDER BY x DESC</>
 408    (or more verbosely, <literal>ORDER BY x DESC NULLS FIRST</>, since
 409    <literal>NULLS FIRST</> is the default for <literal>ORDER BY DESC</>).
 410   </para>
 411
 412   <para>
 413    You can adjust the ordering of a B-tree index by including the
 414    options <literal>ASC</>, <literal>DESC</>, <literal>NULLS FIRST</>,
 415    and/or <literal>NULLS LAST</> when creating the index; for example:
 416 <programlisting>
 417 CREATE INDEX test2_info_nulls_low ON test2 (info NULLS FIRST);
 418 CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);
 419 </programlisting>
 420    An index stored in ascending order with nulls first can satisfy
 421    either <literal>ORDER BY x ASC NULLS FIRST</> or
 422    <literal>ORDER BY x DESC NULLS LAST</> depending on which direction
 423    it is scanned in.
 424   </para>
 425
 426   <para>
 427    You might wonder why bother providing all four options, when two
 428    options together with the possibility of backward scan would cover
 429    all the variants of <literal>ORDER BY</>.  In single-column indexes
 430    the options are indeed redundant, but in multicolumn indexes they can be
 431    useful.  Consider a two-column index on <literal>(x, y)</>: this can
 432    satisfy <literal>ORDER BY x, y</> if we scan forward, or
 433    <literal>ORDER BY x DESC, y DESC</> if we scan backward.
 434    But it might be that the application frequently needs to use
 435    <literal>ORDER BY x ASC, y DESC</>.  There is no way to get that
 436    ordering from a plain index, but it is possible if the index is defined
 437    as <literal>(x ASC, y DESC)</> or <literal>(x DESC, y ASC)</>.
 438   </para>
 439
 440   <para>
 441    Obviously, indexes with non-default sort orderings are a fairly
 442    specialized feature, but sometimes they can produce tremendous
 443    speedups for certain queries.  Whether it's worth maintaining such an
 444    index depends on how often you use queries that require a special
 445    sort ordering.
 446   </para>
 447  </sect1>
 448
 449
 450  <sect1 id="indexes-bitmap-scans">
 451   <title>Combining Multiple Indexes</title>
 452
 453   <indexterm zone="indexes-bitmap-scans">
 454    <primary>index</primary>
 455    <secondary>combining multiple indexes</secondary>
 456   </indexterm>
 457
 458   <indexterm zone="indexes-bitmap-scans">
 459    <primary>bitmap scan</primary>
 460   </indexterm>
 461
 462   <para>
 463    A single index scan can only use query clauses that use the index's
 464    columns with operators of its operator class and are joined with
 465    <literal>AND</>.  For example, given an index on <literal>(a, b)</literal>
 466    a query condition like <literal>WHERE a = 5 AND b = 6</> could
 467    use the index, but a query like <literal>WHERE a = 5 OR b = 6</> could not
 468    directly use the index.
 469   </para>
 470
 471   <para>
 472    Fortunately,
 473    <productname>PostgreSQL</> has the ability to combine multiple indexes
 474    (including multiple uses of the same index) to handle cases that cannot
 475    be implemented by single index scans.  The system can form <literal>AND</>
 476    and <literal>OR</> conditions across several index scans.  For example,
 477    a query like <literal>WHERE x = 42 OR x = 47 OR x = 53 OR x = 99</>
 478    could be broken down into four separate scans of an index on <literal>x</>,
 479    each scan using one of the query clauses.  The results of these scans are
 480    then ORed together to produce the result.  Another example is that if we
 481    have separate indexes on <literal>x</> and <literal>y</>, one possible
 482    implementation of a query like <literal>WHERE x = 5 AND y = 6</> is to
 483    use each index with the appropriate query clause and then AND together
 484    the index results to identify the result rows.
 485   </para>
 486
 487   <para>
 488    To combine multiple indexes, the system scans each needed index and
 489    prepares a <firstterm>bitmap</> in memory giving the locations of
 490    table rows that are reported as matching that index's conditions.
 491    The bitmaps are then ANDed and ORed together as needed by the query.
 492    Finally, the actual table rows are visited and returned.  The table rows
 493    are visited in physical order, because that is how the bitmap is laid
 494    out; this means that any ordering of the original indexes is lost, and
 495    so a separate sort step will be needed if the query has an <literal>ORDER
 496    BY</> clause.  For this reason, and because each additional index scan
 497    adds extra time, the planner will sometimes choose to use a simple index
 498    scan even though additional indexes are available that could have been
 499    used as well.
 500   </para>
 501
 502   <para>
 503    In all but the simplest applications, there are various combinations of
 504    indexes that might be useful, and the database developer must make
 505    trade-offs to decide which indexes to provide.  Sometimes multicolumn
 506    indexes are best, but sometimes it's better to create separate indexes
 507    and rely on the index-combination feature.  For example, if your
 508    workload includes a mix of queries that sometimes involve only column
 509    <literal>x</>, sometimes only column <literal>y</>, and sometimes both
 510    columns, you might choose to create two separate indexes on
 511    <literal>x</> and <literal>y</>, relying on index combination to
 512    process the queries that use both columns.  You could also create a
 513    multicolumn index on <literal>(x, y)</>.  This index would typically be
 514    more efficient than index combination for queries involving both
 515    columns, but as discussed in <xref linkend="indexes-multicolumn">, it
 516    would be almost useless for queries involving only <literal>y</>, so it
 517    should not be the only index.  A combination of the multicolumn index
 518    and a separate index on <literal>y</> would serve reasonably well.  For
 519    queries involving only <literal>x</>, the multicolumn index could be
 520    used, though it would be larger and hence slower than an index on
 521    <literal>x</> alone.  The last alternative is to create all three
 522    indexes, but this is probably only reasonable if the table is searched
 523    much more often than it is updated and all three types of query are
 524    common.  If one of the types of query is much less common than the
 525    others, you'd probably settle for creating just the two indexes that
 526    best match the common types.
 527   </para>
 528
 529  </sect1>
 530
 531
 532  <sect1 id="indexes-unique">
 533   <title>Unique Indexes</title>
 534
 535   <indexterm zone="indexes-unique">
 536    <primary>index</primary>
 537    <secondary>unique</secondary>
 538   </indexterm>
 539
 540   <para>
 541    Indexes can also be used to enforce uniqueness of a column's value,
 542    or the uniqueness of the combined values of more than one column.
 543 <synopsis>
 544 CREATE UNIQUE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <optional>, ...</optional>);
 545 </synopsis>
 546    Currently, only B-tree indexes can be declared unique.
 547   </para>
 548
 549   <para>
 550    When an index is declared unique, multiple table rows with equal
 551    indexed values are not allowed.  Null values are not considered
 552    equal.  A multicolumn unique index will only reject cases where all
 553    indexed columns are equal in multiple rows.
 554   </para>
 555
 556   <para>
 557    <productname>PostgreSQL</productname> automatically creates a unique
 558    index when a unique constraint or primary key is defined for a table.
 559    The index covers the columns that make up the primary key or unique
 560    constraint (a multicolumn index, if appropriate), and is the mechanism
 561    that enforces the constraint.
 562   </para>
 563
 564   <note>
 565    <para>
 566     The preferred way to add a unique constraint to a table is
 567     <literal>ALTER TABLE ... ADD CONSTRAINT</literal>.  The use of
 568     indexes to enforce unique constraints could be considered an
 569     implementation detail that should not be accessed directly.
 570     One should, however, be aware that there's no need to manually
 571     create indexes on unique columns; doing so would just duplicate
 572     the automatically-created index.
 573    </para>
 574   </note>
 575  </sect1>
 576
 577
 578  <sect1 id="indexes-expressional">
 579   <title>Indexes on Expressions</title>
 580
 581   <indexterm zone="indexes-expressional">
 582    <primary>index</primary>
 583    <secondary sortas="expressions">on expressions</secondary>
 584   </indexterm>
 585
 586   <para>
 587    An index column need not be just a column of the underlying table,
 588    but can be a function or scalar expression computed from one or
 589    more columns of the table.  This feature is useful to obtain fast
 590    access to tables based on the results of computations.
 591   </para>
 592
 593   <para>
 594    For example, a common way to do case-insensitive comparisons is to
 595    use the <function>lower</function> function:
 596 <programlisting>
 597 SELECT * FROM test1 WHERE lower(col1) = 'value';
 598 </programlisting>
 599    This query can use an index if one has been
 600    defined on the result of the <literal>lower(col1)</literal>
 601    function:
 602 <programlisting>
 603 CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
 604 </programlisting>
 605   </para>
 606
 607   <para>
 608    If we were to declare this index <literal>UNIQUE</>, it would prevent
 609    creation of rows whose <literal>col1</> values differ only in case,
 610    as well as rows whose <literal>col1</> values are actually identical.
 611    Thus, indexes on expressions can be used to enforce constraints that
 612    are not definable as simple unique constraints.
 613   </para>
 614
 615   <para>
 616    As another example, if one often does queries like:
 617 <programlisting>
 618 SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith';
 619 </programlisting>
 620    then it might be worth creating an index like this:
 621 <programlisting>
 622 CREATE INDEX people_names ON people ((first_name || ' ' || last_name));
 623 </programlisting>
 624   </para>
 625
 626   <para>
 627    The syntax of the <command>CREATE INDEX</> command normally requires
 628    writing parentheses around index expressions, as shown in the second
 629    example.  The parentheses can be omitted when the expression is just
 630    a function call, as in the first example.
 631   </para>
 632
 633   <para>
 634    Index expressions are relatively expensive to maintain, because the
 635    derived expression(s) must be computed for each row upon insertion
 636    and whenever it is updated.  However, the index expressions are
 637    <emphasis>not</> recomputed during an indexed search, since they are
 638    already stored in the index.  In both examples above, the system
 639    sees the query as just <literal>WHERE indexedcolumn = 'constant'</>
 640    and so the speed of the search is equivalent to any other simple index
 641    query.  Thus, indexes on expressions are useful when retrieval speed
 642    is more important than insertion and update speed.
 643   </para>
 644  </sect1>
 645
 646
 647  <sect1 id="indexes-partial">
 648   <title>Partial Indexes</title>
 649
 650   <indexterm zone="indexes-partial">
 651    <primary>index</primary>
 652    <secondary>partial</secondary>
 653   </indexterm>
 654
 655   <para>
 656    A <firstterm>partial index</firstterm> is an index built over a
 657    subset of a table; the subset is defined by a conditional
 658    expression (called the <firstterm>predicate</firstterm> of the
 659    partial index).  The index contains entries only for those table
 660    rows that satisfy the predicate.  Partial indexes are a specialized
 661    feature, but there are several situations in which they are useful.
 662   </para>
 663
 664   <para>
 665    One major reason for using a partial index is to avoid indexing common
 666    values.  Since a query searching for a common value (one that
 667    accounts for more than a few percent of all the table rows) will not
 668    use the index anyway, there is no point in keeping those rows in the
 669    index at all.  This reduces the size of the index, which will speed
 670    up those queries that do use the index.  It will also speed up many table
 671    update operations because the index does not need to be
 672    updated in all cases.  <xref linkend="indexes-partial-ex1"> shows a
 673    possible application of this idea.
 674   </para>
 675
 676   <example id="indexes-partial-ex1">
 677    <title>Setting up a Partial Index to Exclude Common Values</title>
 678
 679    <para>
 680     Suppose you are storing web server access logs in a database.
 681     Most accesses originate from the IP address range of your organization but
 682     some are from elsewhere (say, employees on dial-up connections).
 683     If your searches by IP are primarily for outside accesses,
 684     you probably do not need to index the IP range that corresponds to your
 685     organization's subnet.
 686    </para>
 687
 688    <para>
 689     Assume a table like this:
 690 <programlisting>
 691 CREATE TABLE access_log (
 692     url varchar,
 693     client_ip inet,
 694     ...
 695 );
 696 </programlisting>
 697    </para>
 698
 699    <para>
 700     To create a partial index that suits our example, use a command
 701     such as this:
 702 <programlisting>
 703 CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
 704 WHERE NOT (client_ip &gt; inet '192.168.100.0' AND
 705            client_ip &lt; inet '192.168.100.255');
 706 </programlisting>
 707    </para>
 708
 709    <para>
 710     A typical query that can use this index would be:
 711 <programlisting>
 712 SELECT *
 713 FROM access_log
 714 WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';
 715 </programlisting>
 716     A query that cannot use this index is:
 717 <programlisting>
 718 SELECT *
 719 FROM access_log
 720 WHERE client_ip = inet '192.168.100.23';
 721 </programlisting>
 722    </para>
 723
 724    <para>
 725     Observe that this kind of partial index requires that the common
 726     values be predetermined, so such partial indexes are best used for
 727     data distributions that do not change.  The indexes can be recreated
 728     occasionally to adjust for new data distributions, but this adds
 729     maintenance effort.
 730    </para>
 731   </example>
 732
 733   <para>
 734    Another possible use for a partial index is to exclude values from the
 735    index that the
 736    typical query workload is not interested in; this is shown in <xref
 737    linkend="indexes-partial-ex2">.  This results in the same
 738    advantages as listed above, but it prevents the
 739    <quote>uninteresting</quote> values from being accessed via that
 740    index, even if an index scan might be profitable in that
 741    case.  Obviously, setting up partial indexes for this kind of
 742    scenario will require a lot of care and experimentation.
 743   </para>
 744
 745   <example id="indexes-partial-ex2">
 746    <title>Setting up a Partial Index to Exclude Uninteresting Values</title>
 747
 748    <para>
 749     If you have a table that contains both billed and unbilled orders,
 750     where the unbilled orders take up a small fraction of the total
 751     table and yet those are the most-accessed rows, you can improve
 752     performance by creating an index on just the unbilled rows.  The
 753     command to create the index would look like this:
 754 <programlisting>
 755 CREATE INDEX orders_unbilled_index ON orders (order_nr)
 756     WHERE billed is not true;
 757 </programlisting>
 758    </para>
 759
 760    <para>
 761     A possible query to use this index would be:
 762 <programlisting>
 763 SELECT * FROM orders WHERE billed is not true AND order_nr &lt; 10000;
 764 </programlisting>
 765     However, the index can also be used in queries that do not involve
 766     <structfield>order_nr</> at all, e.g.:
 767 <programlisting>
 768 SELECT * FROM orders WHERE billed is not true AND amount &gt; 5000.00;
 769 </programlisting>
 770     This is not as efficient as a partial index on the
 771     <structfield>amount</> column would be, since the system has to
 772     scan the entire index.  Yet, if there are relatively few unbilled
 773     orders, using this partial index just to find the unbilled orders
 774     could be a win.
 775    </para>
 776
 777    <para>
 778     Note that this query cannot use this index:
 779 <programlisting>
 780 SELECT * FROM orders WHERE order_nr = 3501;
 781 </programlisting>
 782     The order 3501 might be among the billed or unbilled
 783     orders.
 784    </para>
 785   </example>
 786
 787   <para>
 788    <xref linkend="indexes-partial-ex2"> also illustrates that the
 789    indexed column and the column used in the predicate do not need to
 790    match.  <productname>PostgreSQL</productname> supports partial
 791    indexes with arbitrary predicates, so long as only columns of the
 792    table being indexed are involved.  However, keep in mind that the
 793    predicate must match the conditions used in the queries that
 794    are supposed to benefit from the index.  To be precise, a partial
 795    index can be used in a query only if the system can recognize that
 796    the <literal>WHERE</> condition of the query mathematically implies
 797    the predicate of the index.
 798    <productname>PostgreSQL</productname> does not have a sophisticated
 799    theorem prover that can recognize mathematically equivalent
 800    expressions that are written in different forms.  (Not
 801    only is such a general theorem prover extremely difficult to
 802    create, it would probably be too slow to be of any real use.)
 803    The system can recognize simple inequality implications, for example
 804    <quote>x &lt; 1</quote> implies <quote>x &lt; 2</quote>; otherwise
 805    the predicate condition must exactly match part of the query's
 806    <literal>WHERE</> condition
 807    or the index will not be recognized as usable. Matching takes
 808    place at query planning time, not at run time. As a result,
 809    parameterized query clauses do not work with a partial index. For
 810    example a prepared query with a parameter might specify
 811    <quote>x &lt; ?</quote> which will never imply
 812    <quote>x &lt; 2</quote> for all possible values of the parameter.
 813   </para>
 814
 815   <para>
 816    A third possible use for partial indexes does not require the
 817    index to be used in queries at all.  The idea here is to create
 818    a unique index over a subset of a table, as in <xref
 819    linkend="indexes-partial-ex3">.  This enforces uniqueness
 820    among the rows that satisfy the index predicate, without constraining
 821    those that do not.
 822   </para>
 823
 824   <example id="indexes-partial-ex3">
 825    <title>Setting up a Partial Unique Index</title>
 826
 827    <para>
 828     Suppose that we have a table describing test outcomes.  We wish
 829     to ensure that there is only one <quote>successful</> entry for
 830     a given subject and target combination, but there might be any number of
 831     <quote>unsuccessful</> entries.  Here is one way to do it:
 832 <programlisting>
 833 CREATE TABLE tests (
 834     subject text,
 835     target text,
 836     success boolean,
 837     ...
 838 );
 839
 840 CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
 841     WHERE success;
 842 </programlisting>
 843     This is a particularly efficient approach when there are few
 844     successful tests and many unsuccessful ones.
 845    </para>
 846   </example>
 847
 848   <para>
 849    Finally, a partial index can also be used to override the system's
 850    query plan choices.  Also, data sets with peculiar
 851    distributions might cause the system to use an index when it really
 852    should not.  In that case the index can be set up so that it is not
 853    available for the offending query.  Normally,
 854    <productname>PostgreSQL</> makes reasonable choices about index
 855    usage (e.g., it avoids them when retrieving common values, so the
 856    earlier example really only saves index size, it is not required to
 857    avoid index usage), and grossly incorrect plan choices are cause
 858    for a bug report.
 859   </para>
 860
 861   <para>
 862    Keep in mind that setting up a partial index indicates that you
 863    know at least as much as the query planner knows, in particular you
 864    know when an index might be profitable.  Forming this knowledge
 865    requires experience and understanding of how indexes in
 866    <productname>PostgreSQL</> work.  In most cases, the advantage of a
 867    partial index over a regular index will be minimal.
 868   </para>
 869
 870   <para>
 871    More information about partial indexes can be found in <xref
 872    linkend="STON89b">, <xref linkend="OLSON93">, and <xref
 873    linkend="SESHADRI95">.
 874   </para>
 875  </sect1>
 876
 877
 878  <sect1 id="indexes-opclass">
 879   <title>Operator Classes and Operator Families</title>
 880
 881   <indexterm zone="indexes-opclass">
 882    <primary>operator class</primary>
 883   </indexterm>
 884
 885   <indexterm zone="indexes-opclass">
 886    <primary>operator family</primary>
 887   </indexterm>
 888
 889   <para>
 890    An index definition can specify an <firstterm>operator
 891    class</firstterm> for each column of an index.
 892 <synopsis>
 893 CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <replaceable>opclass</replaceable> <optional><replaceable>sort options</replaceable></optional> <optional>, ...</optional>);
 894 </synopsis>
 895    The operator class identifies the operators to be used by the index
 896    for that column.  For example, a B-tree index on the type <type>int4</type>
 897    would use the <literal>int4_ops</literal> class; this operator
 898    class includes comparison functions for values of type <type>int4</type>.
 899    In practice the default operator class for the column's data type is
 900    usually sufficient.  The main reason for having operator classes is
 901    that for some data types, there could be more than one meaningful
 902    index behavior.  For example, we might want to sort a complex-number data
 903    type either by absolute value or by real part.  We could do this by
 904    defining two operator classes for the data type and then selecting
 905    the proper class when making an index.  The operator class determines
 906    the basic sort ordering (which can then be modified by adding sort options
 907    <literal>ASC</>/<literal>DESC</> and/or
 908    <literal>NULLS FIRST</>/<literal>NULLS LAST</>).
 909   </para>
 910
 911   <para>
 912    There are also some built-in operator classes besides the default ones:
 913
 914    <itemizedlist>
 915     <listitem>
 916      <para>
 917       The operator classes <literal>text_pattern_ops</literal>,
 918       <literal>varchar_pattern_ops</literal>, and
 919       <literal>bpchar_pattern_ops</literal> support B-tree indexes on
 920       the types <type>text</type>, <type>varchar</type>, and
 921       <type>char</type> respectively.  The
 922       difference from the default operator classes is that the values
 923       are compared strictly character by character rather than
 924       according to the locale-specific collation rules.  This makes
 925       these operator classes suitable for use by queries involving
 926       pattern matching expressions (<literal>LIKE</literal> or POSIX
 927       regular expressions) when the database does not use the standard
 928       <quote>C</quote> locale.  As an example, you might index a
 929       <type>varchar</type> column like this:
 930 <programlisting>
 931 CREATE INDEX test_index ON test_table (col varchar_pattern_ops);
 932 </programlisting>
 933       Note that you should also create an index with the default operator
 934       class if you want queries involving ordinary <literal>&lt;</>,
 935       <literal>&lt;=</>, <literal>&gt;</>, or <literal>&gt;=</> comparisons
 936       to use an index.  Such queries cannot use the
 937       <literal><replaceable>xxx</replaceable>_pattern_ops</literal>
 938       operator classes.  (Ordinary equality comparisons can use these
 939       operator classes, however.)  It is possible to create multiple
 940       indexes on the same column with different operator classes.
 941       If you do use the C locale, you do not need the
 942       <literal><replaceable>xxx</replaceable>_pattern_ops</literal>
 943       operator classes, because an index with the default operator class
 944       is usable for pattern-matching queries in the C locale.
 945      </para>
 946     </listitem>
 947    </itemizedlist>
 948   </para>
 949
 950   <para>
 951     The following query shows all defined operator classes:
 952
 953 <programlisting>
 954 SELECT am.amname AS index_method,
 955        opc.opcname AS opclass_name
 956     FROM pg_am am, pg_opclass opc
 957     WHERE opc.opcmethod = am.oid
 958     ORDER BY index_method, opclass_name;
 959 </programlisting>
 960   </para>
 961
 962   <para>
 963    An operator class is actually just a subset of a larger structure called an
 964    <firstterm>operator family</>.  In cases where several data types have
 965    similar behaviors, it is frequently useful to define cross-data-type
 966    operators and allow these to work with indexes.  To do this, the operator
 967    classes for each of the types must be grouped into the same operator
 968    family.  The cross-type operators are members of the family, but are not
 969    associated with any single class within the family.
 970   </para>
 971
 972   <para>
 973     This query shows all defined operator families and all
 974     the operators included in each family:
 975 <programlisting>
 976 SELECT am.amname AS index_method,
 977        opf.opfname AS opfamily_name,
 978        amop.amopopr::regoperator AS opfamily_operator
 979     FROM pg_am am, pg_opfamily opf, pg_amop amop
 980     WHERE opf.opfmethod = am.oid AND
 981           amop.amopfamily = opf.oid
 982     ORDER BY index_method, opfamily_name, opfamily_operator;
 983 </programlisting>
 984   </para>
 985  </sect1>
 986
 987
 988  <sect1 id="indexes-examine">
 989   <title>Examining Index Usage</title>
 990
 991   <indexterm zone="indexes-examine">
 992    <primary>index</primary>
 993    <secondary>examining usage</secondary>
 994   </indexterm>
 995
 996   <para>
 997    Although indexes in <productname>PostgreSQL</> do not need
 998    maintenance or tuning, it is still important to check
 999    which indexes are actually used by the real-life query workload.
1000    Examining index usage for an individual query is done with the
1001    <xref linkend="sql-explain" endterm="sql-explain-title">
1002    command; its application for this purpose is
1003    illustrated in <xref linkend="using-explain">.
1004    It is also possible to gather overall statistics about index usage
1005    in a running server, as described in <xref linkend="monitoring-stats">.
1006   </para>
1007
1008   <para>
1009    It is difficult to formulate a general procedure for determining
1010    which indexes to create.  There are a number of typical cases that
1011    have been shown in the examples throughout the previous sections.
1012    A good deal of experimentation is often necessary.
1013    The rest of this section gives some tips for that:
1014   </para>
1015
1016   <itemizedlist>
1017    <listitem>
1018     <para>
1019      Always run <xref linkend="sql-analyze" endterm="sql-analyze-title">
1020      first.  This command
1021      collects statistics about the distribution of the values in the
1022      table.  This information is required to estimate the number of rows
1023      returned by a query, which is needed by the planner to assign
1024      realistic costs to each possible query plan.  In absence of any
1025      real statistics, some default values are assumed, which are
1026      almost certain to be inaccurate.  Examining an application's
1027      index usage without having run <command>ANALYZE</command> is
1028      therefore a lost cause.
1029     </para>
1030    </listitem>
1031
1032    <listitem>
1033     <para>
1034      Use real data for experimentation.  Using test data for setting
1035      up indexes will tell you what indexes you need for the test data,
1036      but that is all.
1037     </para>
1038
1039     <para>
1040      It is especially fatal to use very small test data sets.
1041      While selecting 1000 out of 100000 rows could be a candidate for
1042      an index, selecting 1 out of 100 rows will hardly be, because the
1043      100 rows probably fit within a single disk page, and there
1044      is no plan that can beat sequentially fetching 1 disk page.
1045     </para>
1046
1047     <para>
1048      Also be careful when making up test data, which is often
1049      unavoidable when the application is not yet in production.
1050      Values that are very similar, completely random, or inserted in
1051      sorted order will skew the statistics away from the distribution
1052      that real data would have.
1053     </para>
1054    </listitem>
1055
1056    <listitem>
1057     <para>
1058      When indexes are not used, it can be useful for testing to force
1059      their use.  There are run-time parameters that can turn off
1060      various plan types (see <xref linkend="runtime-config-query-enable">).
1061      For instance, turning off sequential scans
1062      (<varname>enable_seqscan</>) and nested-loop joins
1063      (<varname>enable_nestloop</>), which are the most basic plans,
1064      will force the system to use a different plan.  If the system
1065      still chooses a sequential scan or nested-loop join then there is
1066      probably a more fundamental reason why the index is not being
1067      used; for example, the query condition does not match the index.
1068      (What kind of query can use what kind of index is explained in
1069      the previous sections.)
1070     </para>
1071    </listitem>
1072
1073    <listitem>
1074     <para>
1075      If forcing index usage does use the index, then there are two
1076      possibilities: Either the system is right and using the index is
1077      indeed not appropriate, or the cost estimates of the query plans
1078      are not reflecting reality.  So you should time your query with
1079      and without indexes.  The <command>EXPLAIN ANALYZE</command>
1080      command can be useful here.
1081     </para>
1082    </listitem>
1083
1084    <listitem>
1085     <para>
1086      If it turns out that the cost estimates are wrong, there are,
1087      again, two possibilities.  The total cost is computed from the
1088      per-row costs of each plan node times the selectivity estimate of
1089      the plan node.  The costs estimated for the plan nodes can be adjusted
1090      via run-time parameters (described in <xref
1091      linkend="runtime-config-query-constants">).
1092      An inaccurate selectivity estimate is due to
1093      insufficient statistics.  It might be possible to improve this by
1094      tuning the statistics-gathering parameters (see
1095      <xref linkend="sql-altertable" endterm="sql-altertable-title">).
1096     </para>
1097
1098     <para>
1099      If you do not succeed in adjusting the costs to be more
1100      appropriate, then you might have to resort to forcing index usage
1101      explicitly.  You might also want to contact the
1102      <productname>PostgreSQL</> developers to examine the issue.
1103     </para>
1104    </listitem>
1105   </itemizedlist>
1106  </sect1>
1107 </chapter>