doc/src/sgml/ref/create_statistics.sgml

   1 <!--
   2 doc/src/sgml/ref/create_statistics.sgml
   3 PostgreSQL documentation
   4 -->
   5
   6 <refentry id="sql-createstatistics">
   7  <indexterm zone="sql-createstatistics">
   8   <primary>CREATE STATISTICS</primary>
   9  </indexterm>
  10
  11  <refmeta>
  12   <refentrytitle>CREATE STATISTICS</refentrytitle>
  13   <manvolnum>7</manvolnum>
  14   <refmiscinfo>SQL - Language Statements</refmiscinfo>
  15  </refmeta>
  16
  17  <refnamediv>
  18   <refname>CREATE STATISTICS</refname>
  19   <refpurpose>define extended statistics</refpurpose>
  20  </refnamediv>
  21
  22  <refsynopsisdiv>
  23 <synopsis>
  24 CREATE STATISTICS [ [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable> ]
  25     ON ( <replaceable class="parameter">expression</replaceable> )
  26     FROM <replaceable class="parameter">table_name</replaceable>
  27
  28 CREATE STATISTICS [ [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable> ]
  29     [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
  30     ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) }, { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [, ...]
  31     FROM <replaceable class="parameter">table_name</replaceable>
  32 </synopsis>
  33
  34  </refsynopsisdiv>
  35
  36  <refsect1 id="sql-createstatistics-description">
  37   <title>Description</title>
  38
  39   <para>
  40    <command>CREATE STATISTICS</command> will create a new extended statistics
  41    object tracking data about the specified table, foreign table or
  42    materialized view.  The statistics object will be created in the current
  43    database and will be owned by the user issuing the command.
  44   </para>
  45
  46   <para>
  47    The <command>CREATE STATISTICS</command> command has two basic forms. The
  48    first form allows univariate statistics for a single expression to be
  49    collected, providing benefits similar to an expression index without the
  50    overhead of index maintenance.  This form does not allow the statistics
  51    kind to be specified, since the various statistics kinds refer only to
  52    multivariate statistics.  The second form of the command allows
  53    multivariate statistics on multiple columns and/or expressions to be
  54    collected, optionally specifying which statistics kinds to include.  This
  55    form will also automatically cause univariate statistics to be collected on
  56    any expressions included in the list.
  57   </para>
  58
  59   <para>
  60    If a schema name is given (for example, <literal>CREATE STATISTICS
  61    myschema.mystat ...</literal>) then the statistics object is created in the
  62    specified schema.  Otherwise it is created in the current schema.
  63    If given, the name of the statistics object must be distinct from the name
  64    of any other statistics object in the same schema.
  65   </para>
  66  </refsect1>
  67
  68  <refsect1>
  69   <title>Parameters</title>
  70
  71   <variablelist>
  72
  73    <varlistentry>
  74     <term><literal>IF NOT EXISTS</literal></term>
  75     <listitem>
  76      <para>
  77       Do not throw an error if a statistics object with the same name already
  78       exists.  A notice is issued in this case.  Note that only the name of
  79       the statistics object is considered here, not the details of its
  80       definition.
  81       Statistics name is required when <literal>IF NOT EXISTS</literal> is specified.
  82      </para>
  83     </listitem>
  84    </varlistentry>
  85
  86    <varlistentry>
  87     <term><replaceable class="parameter">statistics_name</replaceable></term>
  88     <listitem>
  89      <para>
  90       The name (optionally schema-qualified) of the statistics object to be
  91       created.
  92       If the name is omitted, <productname>PostgreSQL</productname> chooses a
  93       suitable name based on the parent table's name and the defined column
  94       name(s) and/or expression(s).
  95      </para>
  96     </listitem>
  97    </varlistentry>
  98
  99    <varlistentry>
 100     <term><replaceable class="parameter">statistics_kind</replaceable></term>
 101     <listitem>
 102      <para>
 103       A multivariate statistics kind to be computed in this statistics object.
 104       Currently supported kinds are
 105       <literal>ndistinct</literal>, which enables n-distinct statistics,
 106       <literal>dependencies</literal>, which enables functional
 107       dependency statistics, and <literal>mcv</literal> which enables
 108       most-common values lists.
 109       If this clause is omitted, all supported statistics kinds are
 110       included in the statistics object. Univariate expression statistics are
 111       built automatically if the statistics definition includes any complex
 112       expressions rather than just simple column references.
 113       For more information, see <xref linkend="planner-stats-extended"/>
 114       and <xref linkend="multivariate-statistics-examples"/>.
 115      </para>
 116     </listitem>
 117    </varlistentry>
 118
 119    <varlistentry>
 120     <term><replaceable class="parameter">column_name</replaceable></term>
 121     <listitem>
 122      <para>
 123       The name of a table column to be covered by the computed statistics.
 124       This is only allowed when building multivariate statistics.  At least
 125       two column names or expressions must be specified, and their order is
 126       not significant.
 127      </para>
 128     </listitem>
 129    </varlistentry>
 130
 131    <varlistentry>
 132     <term><replaceable class="parameter">expression</replaceable></term>
 133     <listitem>
 134      <para>
 135       An expression to be covered by the computed statistics.  This may be
 136       used to build univariate statistics on a single expression, or as part
 137       of a list of multiple column names and/or expressions to build
 138       multivariate statistics.  In the latter case, separate univariate
 139       statistics are built automatically for each expression in the list.
 140      </para>
 141     </listitem>
 142    </varlistentry>
 143
 144    <varlistentry>
 145     <term><replaceable class="parameter">table_name</replaceable></term>
 146     <listitem>
 147      <para>
 148       The name (optionally schema-qualified) of the table containing the
 149       column(s) the statistics are computed on;  see <xref
 150       linkend="sql-analyze"/> for an explanation of the handling of
 151       inheritance and partitions.
 152      </para>
 153     </listitem>
 154    </varlistentry>
 155
 156   </variablelist>
 157  </refsect1>
 158
 159  <refsect1>
 160   <title>Notes</title>
 161
 162   <para>
 163    You must be the owner of a table to create a statistics object
 164    reading it.  Once created, however, the ownership of the statistics
 165    object is independent of the underlying table(s).
 166   </para>
 167
 168   <para>
 169    Expression statistics are per-expression and are similar to creating an
 170    index on the expression, except that they avoid the overhead of index
 171    maintenance. Expression statistics are built automatically for each
 172    expression in the statistics object definition.
 173   </para>
 174
 175   <para>
 176    Extended statistics are not currently used by the planner for selectivity
 177    estimations made for table joins.  This limitation will likely be removed
 178    in a future version of <productname>PostgreSQL</productname>.
 179   </para>
 180  </refsect1>
 181
 182  <refsect1 id="sql-createstatistics-examples">
 183   <title>Examples</title>
 184
 185   <para>
 186    Create table <structname>t1</structname> with two functionally dependent columns, i.e.,
 187    knowledge of a value in the first column is sufficient for determining the
 188    value in the other column. Then functional dependency statistics are built
 189    on those columns:
 190
 191 <programlisting>
 192 CREATE TABLE t1 (
 193     a   int,
 194     b   int
 195 );
 196
 197 INSERT INTO t1 SELECT i/100, i/500
 198                  FROM generate_series(1,1000000) s(i);
 199
 200 ANALYZE t1;
 201
 202 -- the number of matching rows will be drastically underestimated:
 203 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
 204
 205 CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
 206
 207 ANALYZE t1;
 208
 209 -- now the row count estimate is more accurate:
 210 EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
 211 </programlisting>
 212
 213    Without functional-dependency statistics, the planner would assume
 214    that the two <literal>WHERE</literal> conditions are independent, and would
 215    multiply their selectivities together to arrive at a much-too-small
 216    row count estimate.
 217    With such statistics, the planner recognizes that the <literal>WHERE</literal>
 218    conditions are redundant and does not underestimate the row count.
 219   </para>
 220
 221   <para>
 222    Create table <structname>t2</structname> with two perfectly correlated columns
 223    (containing identical data), and an MCV list on those columns:
 224
 225 <programlisting>
 226 CREATE TABLE t2 (
 227     a   int,
 228     b   int
 229 );
 230
 231 INSERT INTO t2 SELECT mod(i,100), mod(i,100)
 232                  FROM generate_series(1,1000000) s(i);
 233
 234 CREATE STATISTICS s2 (mcv) ON a, b FROM t2;
 235
 236 ANALYZE t2;
 237
 238 -- valid combination (found in MCV)
 239 EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
 240
 241 -- invalid combination (not found in MCV)
 242 EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
 243 </programlisting>
 244
 245    The MCV list gives the planner more detailed information about the
 246    specific values that commonly appear in the table, as well as an upper
 247    bound on the selectivities of combinations of values that do not appear
 248    in the table, allowing it to generate better estimates in both cases.
 249   </para>
 250
 251   <para>
 252    Create table <structname>t3</structname> with a single timestamp column,
 253    and run queries using expressions on that column.  Without extended
 254    statistics, the planner has no information about the data distribution for
 255    the expressions, and uses default estimates.  The planner also does not
 256    realize that the value of the date truncated to the month is fully
 257    determined by the value of the date truncated to the day. Then expression
 258    and ndistinct statistics are built on those two expressions:
 259
 260 <programlisting>
 261 CREATE TABLE t3 (
 262     a   timestamp
 263 );
 264
 265 INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
 266                                              '2020-12-31'::timestamp,
 267                                              '1 minute'::interval) s(i);
 268
 269 ANALYZE t3;
 270
 271 -- the number of matching rows will be drastically underestimated:
 272 EXPLAIN ANALYZE SELECT * FROM t3
 273   WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
 274
 275 EXPLAIN ANALYZE SELECT * FROM t3
 276   WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
 277                                  AND '2020-06-30'::timestamp;
 278
 279 EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
 280    FROM t3 GROUP BY 1, 2;
 281
 282 -- build ndistinct statistics on the pair of expressions (per-expression
 283 -- statistics are built automatically)
 284 CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;
 285
 286 ANALYZE t3;
 287
 288 -- now the row count estimates are more accurate:
 289 EXPLAIN ANALYZE SELECT * FROM t3
 290   WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
 291
 292 EXPLAIN ANALYZE SELECT * FROM t3
 293   WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
 294                                  AND '2020-06-30'::timestamp;
 295
 296 EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
 297    FROM t3 GROUP BY 1, 2;
 298 </programlisting>
 299
 300    Without expression and ndistinct statistics, the planner has no information
 301    about the number of distinct values for the expressions, and has to rely
 302    on default estimates. The equality and range conditions are assumed to have
 303    0.5% selectivity, and the number of distinct values in the expression is
 304    assumed to be the same as for the column (i.e. unique). This results in a
 305    significant underestimate of the row count in the first two queries. Moreover,
 306    the planner has no information about the relationship between the expressions,
 307    so it assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
 308    conditions are independent, and multiplies their selectivities together to
 309    arrive at a severe overestimate of the group count in the aggregate query.
 310    This is further exacerbated by the lack of accurate statistics for the
 311    expressions, forcing the planner to use a default ndistinct estimate for the
 312    expression derived from ndistinct for the column. With such statistics, the
 313    planner recognizes that the conditions are correlated, and arrives at much
 314    more accurate estimates.
 315   </para>
 316
 317  </refsect1>
 318
 319  <refsect1>
 320   <title>Compatibility</title>
 321
 322   <para>
 323    There is no <command>CREATE STATISTICS</command> command in the SQL standard.
 324   </para>
 325  </refsect1>
 326
 327  <refsect1>
 328   <title>See Also</title>
 329
 330   <simplelist type="inline">
 331    <member><xref linkend="sql-alterstatistics"/></member>
 332    <member><xref linkend="sql-dropstatistics"/></member>
 333   </simplelist>
 334  </refsect1>
 335 </refentry>