<!-- doc/src/sgml/planstats.sgml -->

<chapter id="planner-stats-details">

 <title>How the Planner Uses Statistics</title>

  <para>
   This chapter builds on the material covered in <xref
   linkend="using-explain"/> and <xref linkend="planner-stats"/> to show some
   additional details about how the planner uses the
   system statistics to estimate the number of rows each part of a query might
   return.  This is a significant part of the planning process,
   providing much of the raw material for cost calculation.
  </para>

  <para>
   The intent of this chapter is not to document the code in detail,
   but to present an overview of how it works.
   This will perhaps ease the learning curve for someone who subsequently
   wishes to read the code.
  </para>
 <sect1 id="row-estimation-examples">

  <title>Row Estimation Examples</title>

  <indexterm zone="row-estimation-examples">
   <primary>row estimation</primary>
   <secondary>planner</secondary>
  </indexterm>
  <para>
   The examples shown below use tables in the <productname>PostgreSQL</productname>
   regression test database.
   Note also that since <command>ANALYZE</command> uses random sampling
   while producing statistics, the results will change slightly after
   any new <command>ANALYZE</command>.
  </para>
  <para>
   Let's start with a very simple query:

<programlisting>
EXPLAIN SELECT * FROM tenk1;

                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
</programlisting>

   How the planner determines the cardinality of <structname>tenk1</structname>
   is covered in <xref linkend="planner-stats"/>, but is repeated here for
   completeness.  The number of pages and rows is looked up in
   <structname>pg_class</structname>:

<programlisting>
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';

 relpages | reltuples
----------+-----------
      358 |     10000
</programlisting>

   These numbers are current as of the last <command>VACUUM</command> or
   <command>ANALYZE</command> on the table.  The planner then fetches the
   actual current number of pages in the table (this is a cheap operation,
   not requiring a table scan).  If that is different from
   <structfield>relpages</structfield> then
   <structfield>reltuples</structfield> is scaled accordingly to
   arrive at a current number-of-rows estimate.  In the example above, the value of
   <structfield>relpages</structfield> is up-to-date so the rows estimate is
   the same as <structfield>reltuples</structfield>.
  </para>
  <para>
   Let's move on to an example with a range condition in its
   <literal>WHERE</literal> clause:

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000;

                                  QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------
 Bitmap Heap Scan on tenk1  (cost=24.06..394.64 rows=1007 width=244)
   Recheck Cond: (unique1 &lt; 1000)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
         Index Cond: (unique1 &lt; 1000)
</programlisting>

   The planner examines the <literal>WHERE</literal> clause condition
   and looks up the selectivity function for the operator
   <literal>&lt;</literal> in <structname>pg_operator</structname>.
   This is held in the column <structfield>oprrest</structfield>,
   and the entry in this case is <function>scalarltsel</function>.
   The <function>scalarltsel</function> function retrieves the histogram for
   <structfield>unique1</structfield> from
   <structname>pg_statistic</structname>.  For manual queries it is more
   convenient to look in the simpler <structname>pg_stats</structname>
   view:

<programlisting>
SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='unique1';

                   histogram_bounds
------------------------------------------------------
 {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
</programlisting>

   Next the fraction of the histogram occupied by <quote>&lt; 1000</quote>
   is worked out.  This is the selectivity.  The histogram divides the range
   into equal frequency buckets, so all we have to do is locate the bucket
   that our value is in and count <emphasis>part</emphasis> of it and
   <emphasis>all</emphasis> of the ones before.  The value 1000 is clearly in
   the second bucket (993&ndash;1997).  Assuming a linear distribution of
   values inside each bucket, we can calculate the selectivity as:

<programlisting>
selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
            = (1 + (1000 - 993)/(1997 - 993))/10
            = 0.100697
</programlisting>

   that is, one whole bucket plus a linear fraction of the second, divided by
   the number of buckets.  The estimated number of rows can now be calculated as
   the product of the selectivity and the cardinality of
   <structname>tenk1</structname>:

<programlisting>
rows = rel_cardinality * selectivity
     = 10000 * 0.100697
     = 1007  (rounding off)
</programlisting>
  </para>
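The bucket arithmetic above is easy to reproduce outside the server. The following Python sketch is only an illustration of the formula (the planner's actual logic lives in <function>scalarltsel</function> in <filename>src/backend/utils/adt/selfuncs.c</filename>); the bounds are the histogram shown above:

```python
# Illustrative sketch of the histogram-based selectivity estimate
# described above (not the planner's actual code).

histogram_bounds = [0, 993, 1997, 3050, 4040, 5036, 5957, 7057, 8029, 9016, 9995]

def histogram_selectivity(bounds, value):
    """Estimated fraction of rows with column < value, assuming a
    linear distribution inside each equal-frequency bucket."""
    num_buckets = len(bounds) - 1
    for i in range(num_buckets):
        if bounds[i] <= value < bounds[i + 1]:
            fraction = (value - bounds[i]) / (bounds[i + 1] - bounds[i])
            # i whole buckets before this one, plus a fraction of this one
            return (i + fraction) / num_buckets
    return 1.0 if value >= bounds[-1] else 0.0

selectivity = histogram_selectivity(histogram_bounds, 1000)
rows = round(10000 * selectivity)   # rel_cardinality * selectivity -> 1007
```

Running this reproduces the 0.100697 selectivity and the 1007-row estimate from the plan.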
  <para>
   Next let's consider an example with an equality condition in its
   <literal>WHERE</literal> clause:

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=30 width=244)
   Filter: (stringu1 = 'CRAAAA'::name)
</programlisting>

   Again the planner examines the <literal>WHERE</literal> clause condition
   and looks up the selectivity function for <literal>=</literal>, which is
   <function>eqsel</function>.  For equality estimation the histogram is
   not useful; instead the list of <firstterm>most
   common values</firstterm> (<acronym>MCV</acronym>s) is used to determine the
   selectivity.  Let's have a look at the MCVs, with some additional columns
   that will be useful later:

<programlisting>
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';

null_frac         | 0
n_distinct        | 676
most_common_vals  | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,&zwsp;JOAAAA,MCAAAA,NAAAAA,WGAAAA}
most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,&zwsp;0.003,0.003,0.003,0.003}
</programlisting>

   Since <literal>CRAAAA</literal> appears in the list of MCVs, the selectivity is
   merely the corresponding entry in the list of most common frequencies
   (<acronym>MCF</acronym>s):

<programlisting>
selectivity = mcf[3]
            = 0.003
</programlisting>

   As before, the estimated number of rows is just the product of this with the
   cardinality of <structname>tenk1</structname>:

<programlisting>
rows = 10000 * 0.003
     = 30
</programlisting>
  </para>
  <para>
   Now consider the same query, but with a constant that is not in the
   <acronym>MCV</acronym> list:

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=15 width=244)
   Filter: (stringu1 = 'xxx'::name)
</programlisting>

   This is quite a different problem: how to estimate the selectivity when the
   value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list.
   The approach is to use the fact that the value is not in the list,
   combined with the knowledge of the frequencies for all of the
   <acronym>MCV</acronym>s:

<programlisting>
selectivity = (1 - sum(mcv_freqs))/(num_distinct - num_mcv)
            = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
                    0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
            = 0.0014559
</programlisting>

   That is, add up all the frequencies for the <acronym>MCV</acronym>s and
   subtract them from one, then
   divide by the number of <emphasis>other</emphasis> distinct values.
   This amounts to assuming that the fraction of the column that is not any
   of the MCVs is evenly distributed among all the other distinct values.
   Notice that there are no null values so we don't have to worry about those
   (otherwise we'd subtract the null fraction from the numerator as well).
   The estimated number of rows is then calculated as usual:

<programlisting>
rows = 10000 * 0.0014559
     = 15  (rounding off)
</programlisting>
  </para>
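The "not in the MCV list" calculation above can be sketched the same way. This is only an illustration of the arithmetic (the planner's version is <function>var_eq_const</function> in <filename>src/backend/utils/adt/selfuncs.c</filename>), using the <structname>pg_stats</structname> values shown earlier:

```python
# Illustrative sketch of the equality estimate for a value that is
# NOT in the MCV list (not the planner's actual code).

most_common_freqs = [0.00333333] + [0.003] * 9   # pg_stats values from above
n_distinct = 676
null_frac = 0.0

# The probability mass left over after the MCVs (and nulls) is assumed
# to be spread evenly over the remaining distinct values:
selectivity = (1.0 - sum(most_common_freqs) - null_frac) \
              / (n_distinct - len(most_common_freqs))
rows = round(10000 * selectivity)   # -> 15, matching the plan
```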
  <para>
   The previous example with <literal>unique1 &lt; 1000</literal> was an
   oversimplification of what <function>scalarltsel</function> really does;
   now that we have seen an example of the use of MCVs, we can fill in some
   more detail.  The example was correct as far as it went, because since
   <structfield>unique1</structfield> is a unique column it has no MCVs (obviously, no
   value is any more common than any other value).  For a non-unique
   column, there will normally be both a histogram and an MCV list, and
   <emphasis>the histogram does not include the portion of the column
   population represented by the MCVs</emphasis>.  We do things this way because
   it allows more precise estimation.  In this situation
   <function>scalarltsel</function> directly applies the condition (e.g.,
   <quote>&lt; 1000</quote>) to each value of the MCV list, and adds up the
   frequencies of the MCVs for which the condition is true.  This gives
   an exact estimate of the selectivity within the portion of the table
   that is MCVs.  The histogram is then used in the same way as above
   to estimate the selectivity in the portion of the table that is not
   MCVs, and then the two numbers are combined to estimate the overall
   selectivity.  For example, consider

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';

                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=3077 width=244)
   Filter: (stringu1 &lt; 'IAAAAA'::name)
</programlisting>

   We already saw the MCV information for <structfield>stringu1</structfield>,
   and here is its histogram:

<programlisting>
SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';

                                histogram_bounds
-------------------------------------------------------------------&zwsp;-------------
 {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,&zwsp;XLAAAA,ZZAAAA}
</programlisting>

   Checking the MCV list, we find that the condition <literal>stringu1 &lt;
   'IAAAAA'</literal> is satisfied by the first six entries and not the last four,
   so the selectivity within the MCV part of the population is

<programlisting>
selectivity = sum(relevant mcfs)
            = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
            = 0.01833333
</programlisting>

   Summing all the MCFs also tells us that the total fraction of the
   population represented by MCVs is 0.03033333, and therefore the
   fraction represented by the histogram is 0.96966667 (again, there
   are no nulls, else we'd have to exclude them here).  We can see
   that the value <literal>IAAAAA</literal> falls nearly at the end of the
   third histogram bucket.  Using some rather cheesy assumptions
   about the frequency of different characters, the planner arrives
   at the estimate 0.298387 for the portion of the histogram population
   that is less than <literal>IAAAAA</literal>.  We then combine the estimates
   for the MCV and non-MCV populations:

<programlisting>
selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
            = 0.01833333 + 0.298387 * 0.96966667
            = 0.307669

rows        = 10000 * 0.307669
            = 3077  (rounding off)
</programlisting>
  </para>
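The combination step can be sketched as follows. This is an illustration only; the 0.298387 histogram estimate is taken from the text above, since deriving it would require the planner's character-frequency assumptions:

```python
# Illustrative sketch of how the MCV and histogram portions are
# combined, as described above (not the planner's actual code).

most_common_freqs = [0.00333333] + [0.003] * 9
mcv_matches = 6   # the first six MCVs satisfy stringu1 < 'IAAAAA'

mcv_selectivity = sum(most_common_freqs[:mcv_matches])   # exact, 0.01833333
mcv_fraction = sum(most_common_freqs)                    # 0.03033333
histogram_fraction = 1.0 - mcv_fraction                  # no nulls here
histogram_selectivity = 0.298387   # planner's estimate, quoted from the text

selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
rows = round(10000 * selectivity)   # -> 3077, matching the plan
```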
  <para>
   In this particular example, the correction from the MCV list is fairly
   small, because the column distribution is actually quite flat (the
   statistics showing these particular values as being more common than
   others are mostly due to sampling error).  In a more typical case where
   some values are significantly more common than others, this complicated
   process gives a useful improvement in accuracy because the selectivity
   for the most common values is found exactly.
  </para>
  <para>
   Now let's consider a case with more than one
   condition in the <literal>WHERE</literal> clause:

<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';

                                  QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------
 Bitmap Heap Scan on tenk1  (cost=23.80..396.91 rows=1 width=244)
   Recheck Cond: (unique1 &lt; 1000)
   Filter: (stringu1 = 'xxx'::name)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
         Index Cond: (unique1 &lt; 1000)
</programlisting>

   The planner assumes that the two conditions are independent, so that
   the individual selectivities of the clauses can be multiplied together:

<programlisting>
selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
            = 0.100697 * 0.0014559
            = 0.0001466

rows        = 10000 * 0.0001466
            = 1  (rounding off)
</programlisting>
  </para>
  <para>
   Notice that the number of rows estimated to be returned from the bitmap
   index scan reflects only the condition used with the index; this is
   important since it affects the cost estimate for the subsequent heap
   fetches.
  </para>
  <para>
   Finally we will examine a query that involves a join:

<programlisting>
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;

                                      QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------------
 Nested Loop  (cost=4.64..456.23 rows=50 width=488)
   -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.64..142.17 rows=50 width=244)
         Recheck Cond: (unique1 &lt; 50)
         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.63 rows=50 width=0)
               Index Cond: (unique1 &lt; 50)
   -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..6.27 rows=1 width=244)
         Index Cond: (unique2 = t1.unique2)
</programlisting>
  </para>
  <para>
   The restriction on <structname>tenk1</structname>,
   <literal>unique1 &lt; 50</literal>,
   is evaluated before the nested-loop join.
   This is handled analogously to the previous range example.  This time the
   value 50 falls into the first bucket of the
   <structfield>unique1</structfield> histogram:

<programlisting>
selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
            = (0 + (50 - 0)/(993 - 0))/10
            = 0.005035

rows        = 10000 * 0.005035
            = 50  (rounding off)
</programlisting>
  </para>
  <para>
   The restriction for the join is <literal>t2.unique2 = t1.unique2</literal>.
   The operator is just
   our familiar <literal>=</literal>, however the selectivity function is
   obtained from the <structfield>oprjoin</structfield> column of
   <structname>pg_operator</structname>, and is <function>eqjoinsel</function>.
   <function>eqjoinsel</function> looks up the statistical information for both
   <structname>tenk2</structname> and <structname>tenk1</structname>:

<programlisting>
SELECT tablename, null_frac, n_distinct, most_common_vals FROM pg_stats
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';

 tablename | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
 tenk1     |         0 |         -1 |
 tenk2     |         0 |         -1 |
</programlisting>

   In this case there is no <acronym>MCV</acronym> information for
   <structname>unique2</structname> and all the values appear to be
   unique (n_distinct = -1), so we use an algorithm that relies on the row
   count estimates for both relations (num_rows, not shown, but "tenk")
   together with the column null fractions (zero for both):

<programlisting>
selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
            = (1 - 0) * (1 - 0) / max(10000, 10000)
            = 0.0001
</programlisting>

   That is, subtract the null fraction from one for each of the relations,
   and divide by the row count of the larger relation (this value does get
   scaled in the non-unique case).  The number of rows
   that the join is likely to emit is calculated as the cardinality of the
   Cartesian product of the two inputs, multiplied by the
   selectivity:

<programlisting>
rows = (outer_cardinality * inner_cardinality) * selectivity
     = (50 * 10000) * 0.0001
     = 50
</programlisting>
  </para>
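The join arithmetic above can also be sketched in a few lines. This illustrates only the unique, no-MCV case described here; the real <function>eqjoinsel</function> in <filename>src/backend/utils/adt/selfuncs.c</filename> is far more general:

```python
# Illustrative sketch of the join selectivity estimate for the
# unique-column, no-MCV case described above (not planner code).

null_frac1 = null_frac2 = 0.0
num_rows1 = num_rows2 = 10000          # "tenk" row counts for both relations

join_selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)

outer_cardinality = 50                 # tenk1 rows surviving unique1 < 50
inner_cardinality = 10000              # unmodified size of tenk2
rows = round(outer_cardinality * inner_cardinality * join_selectivity)  # -> 50
```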
  <para>
   Had there been MCV lists for the two columns,
   <function>eqjoinsel</function> would have used direct comparison of the MCV
   lists to determine the join selectivity within the part of the column
   populations represented by the MCVs.  The estimate for the remainder of the
   populations follows the same approach shown here.
  </para>
  <para>
   Notice that we showed <literal>inner_cardinality</literal> as 10000, that is,
   the unmodified size of <structname>tenk2</structname>.  It might appear from
   inspection of the <command>EXPLAIN</command> output that the estimate of
   join rows comes from 50 * 1, that is, the number of outer rows times
   the estimated number of rows obtained by each inner index scan on
   <structname>tenk2</structname>.  But this is not the case: the join relation size
   is estimated before any particular join plan has been considered.  If
   everything is working well then the two ways of estimating the join
   size will produce about the same answer, but due to round-off error and
   other factors they sometimes diverge significantly.
  </para>
  <para>
   For those interested in further details, estimation of the size of
   a table (before any <literal>WHERE</literal> clauses) is done in
   <filename>src/backend/optimizer/util/plancat.c</filename>.  The generic
   logic for clause selectivities is in
   <filename>src/backend/optimizer/path/clausesel.c</filename>.  The
   operator-specific selectivity functions are mostly found
   in <filename>src/backend/utils/adt/selfuncs.c</filename>.
  </para>

 </sect1>
 <sect1 id="multivariate-statistics-examples">

  <title>Multivariate Statistics Examples</title>

  <indexterm zone="multivariate-statistics-examples">
   <primary>row estimation</primary>
   <secondary>multivariate</secondary>
  </indexterm>
 <sect2 id="functional-dependencies">

  <title>Functional Dependencies</title>

  <para>
   Multivariate correlation can be demonstrated with a very simple data set
   &mdash; a table with two columns, both containing the same values:

<programlisting>
CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;
</programlisting>
  </para>
  <para>
   As explained in <xref linkend="planner-stats"/>, the planner can determine
   the cardinality of <structname>t</structname> using the number of pages and
   rows obtained from <structname>pg_class</structname>:

<programlisting>
SELECT relpages, reltuples FROM pg_class WHERE relname = 't';

 relpages | reltuples
----------+-----------
       45 |     10000
</programlisting>
  </para>
  <para>
   The data distribution is very simple; there are only 100 distinct values
   in each column, uniformly distributed.
  </para>
  <para>
   The following example shows the result of estimating a <literal>WHERE</literal>
   condition on the <structfield>a</structfield> column:

<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1;
                                 QUERY PLAN
-------------------------------------------------------------------&zwsp;------------
 Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: (a = 1)
   Rows Removed by Filter: 9900
</programlisting>

   The planner examines the condition and determines the selectivity
   of this clause to be 1%.  By comparing this estimate and the actual
   number of rows, we see that the estimate is very accurate
   (in fact exact, as the table is very small).  Changing the
   <literal>WHERE</literal> condition to use the <structfield>b</structfield> column, an
   identical plan is generated.  But observe what happens if we apply the same
   condition on both columns, combining them with <literal>AND</literal>:

<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                QUERY PLAN
-------------------------------------------------------------------&zwsp;----------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
</programlisting>
  </para>
  <para>
   The planner estimates the selectivity for each condition individually,
   arriving at the same 1% estimates as above.  Then it assumes that the
   conditions are independent, and so it multiplies their selectivities,
   producing a final selectivity estimate of just 0.01%.
   This is a significant underestimate, as the actual number of rows
   matching the conditions (100) is two orders of magnitude higher.
  </para>
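The underestimate can be reproduced from the numbers above. This Python sketch is only an illustration of the independence multiplication, not planner code:

```python
# Illustrative sketch of the independence-assumption underestimate
# described above: 1% per clause, multiplied as if unrelated.

table_rows = 10000
sel_a = sel_b = 100 / 10000    # 1% estimate for a = 1, likewise for b = 1

estimated = round(table_rows * sel_a * sel_b)   # planner's guess: 1 row
actual = 100   # in this data set every row with a = 1 also has b = 1
```

The two-orders-of-magnitude gap between `estimated` (1) and `actual` (100) is exactly what the multivariate statistics below are designed to close.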
  <para>
   This problem can be fixed by creating a statistics object that
   directs <command>ANALYZE</command> to calculate functional-dependency
   multivariate statistics on the two columns:

<programlisting>
CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                 QUERY PLAN
-------------------------------------------------------------------&zwsp;------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
</programlisting>
  </para>

 </sect2>
 <sect2 id="multivariate-ndistinct-counts">

  <title>Multivariate N-Distinct Counts</title>

  <para>
   A similar problem occurs with estimation of the cardinality of sets of
   multiple columns, such as the number of groups that would be generated by
   a <command>GROUP BY</command> clause.  When <command>GROUP BY</command>
   lists a single column, the n-distinct estimate (which is visible as the
   estimated number of rows returned by the HashAggregate node) is very
   accurate:

<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a;
                                       QUERY PLAN
-------------------------------------------------------------------&zwsp;----------------------
 HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
   Group Key: a
   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
</programlisting>
  </para>
  <para>
   But without multivariate statistics, the estimate for the number of
   groups in a query with two columns in <command>GROUP BY</command>, as
   in the following example, is off by an order of magnitude:

<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
                                        QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------------------
 HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
   Group Key: a, b
   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
</programlisting>
  </para>
  <para>
   By redefining the statistics object to include n-distinct counts for the
   two columns, the estimate is much improved:

<programlisting>
DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
                                        QUERY PLAN
-------------------------------------------------------------------&zwsp;-------------------------
 HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
   Group Key: a, b
   -&gt;  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
</programlisting>
  </para>

 </sect2>
 <sect2 id="mcv-lists">

  <title>MCV Lists</title>
  <para>
   As explained in <xref linkend="functional-dependencies"/>, functional
   dependencies are a very cheap and efficient type of statistics, but their
   main limitation is their global nature (only tracking dependencies at
   the column level, not between individual column values).
  </para>
  <para>
   This section introduces a multivariate variant of <acronym>MCV</acronym>
   (most-common values) lists, a straightforward extension of the per-column
   statistics described in <xref linkend="row-estimation-examples"/>.  These
   statistics address the limitation by storing individual values, but they are
   naturally more expensive, both in terms of building the statistics in
   <command>ANALYZE</command> and in storage and planning time.
  </para>
  <para>
   Let's look at the query from <xref linkend="functional-dependencies"/>
   again, but this time with an <acronym>MCV</acronym> list created on the
   same set of columns (be sure to drop the functional dependencies, to
   make sure the planner uses the newly created statistics):

<programlisting>
DROP STATISTICS stts;
CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                 QUERY PLAN
-------------------------------------------------------------------&zwsp;------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
</programlisting>
  </para>
  <para>
   The estimate is as accurate as with the functional dependencies, mostly
   thanks to the table being fairly small and having a simple distribution
   with a low number of distinct values.  Before looking at the second query,
   which was not handled by functional dependencies particularly well,
   let's inspect the <acronym>MCV</acronym> list a bit.
  </para>
  <para>
   Inspecting the <acronym>MCV</acronym> list is possible using the
   <function>pg_mcv_list_items</function> set-returning function:

<programlisting>
SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';

 index |  values  | nulls | frequency | base_frequency
-------+----------+-------+-----------+----------------
     0 | {0, 0}   | {f,f} |      0.01 |         0.0001
     1 | {1, 1}   | {f,f} |      0.01 |         0.0001
   ...
    49 | {49, 49} | {f,f} |      0.01 |         0.0001
    50 | {50, 50} | {f,f} |      0.01 |         0.0001
   ...
    97 | {97, 97} | {f,f} |      0.01 |         0.0001
    98 | {98, 98} | {f,f} |      0.01 |         0.0001
    99 | {99, 99} | {f,f} |      0.01 |         0.0001
(100 rows)
</programlisting>
  </para>
  <para>
   This confirms there are 100 distinct combinations in the two columns, and
   all of them are about equally likely (1% frequency for each one).  The
   base frequency is the frequency computed from per-column statistics, as if
   there were no multi-column statistics.  Had there been any null values in
   either of the columns, this would be identified in the
   <structfield>nulls</structfield> column.
  </para>
  <para>
   When estimating the selectivity, the planner applies all the conditions
   on items in the <acronym>MCV</acronym> list, and then sums the frequencies
   of the matching ones.  See <function>mcv_clauselist_selectivity</function>
   in <filename>src/backend/statistics/mcv.c</filename> for details.
  </para>
  <para>
   Compared to functional dependencies, <acronym>MCV</acronym> lists have two
   major advantages.  Firstly, the list stores actual values, making it possible
   to decide which combinations are compatible:

<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
                               QUERY PLAN
-------------------------------------------------------------------&zwsp;--------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
   Filter: ((a = 1) AND (b = 10))
   Rows Removed by Filter: 10000
</programlisting>
  </para>
  <para>
   Secondly, <acronym>MCV</acronym> lists handle a wider range of clause types,
   not just equality clauses like functional dependencies.  For example,
   consider the following range query for the same table:

<programlisting>
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a &lt;= 49 AND b &gt; 49;
                               QUERY PLAN
-------------------------------------------------------------------&zwsp;--------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
   Filter: ((a &lt;= 49) AND (b &gt; 49))
   Rows Removed by Filter: 10000
</programlisting>
  </para>

 </sect2>

 </sect1>
 <sect1 id="planner-stats-security">

  <title>Planner Statistics and Security</title>
  <para>
   Access to the table <structname>pg_statistic</structname> is restricted to
   superusers, so that ordinary users cannot learn about the contents of the
   tables of other users from it.  Some selectivity estimation functions will
   use a user-provided operator (either the operator appearing in the query or
   a related operator) to analyze the stored statistics.  For example, in order
   to determine whether a stored most common value is applicable, the
   selectivity estimator will have to run the appropriate <literal>=</literal>
   operator to compare the constant in the query to the stored value.
   Thus the data in <structname>pg_statistic</structname> is potentially
   passed to user-defined operators.  An appropriately crafted operator can
   intentionally leak the passed operands (for example, by logging them
   or writing them to a different table), or accidentally leak them by showing
   their values in error messages, in either case possibly exposing data from
   <structname>pg_statistic</structname> to a user who should not be able to
   see it.
  </para>
  <para>
   In order to prevent this, the following applies to all built-in selectivity
   estimation functions.  When planning a query, in order to be able to use
   stored statistics, the current user must either
   have <literal>SELECT</literal> privilege on the table or the involved
   columns, or the operator used must be <literal>LEAKPROOF</literal> (more
   accurately, the function that the operator is based on).  If not, then the
   selectivity estimator will behave as if no statistics are available, and
   the planner will proceed with default or fall-back assumptions.
   The <xref linkend="app-psql"/> program's
   <command><link linkend="app-psql-meta-command-do-lc">\do+</link></command>
   meta-command is useful to determine which operators are marked as leakproof.
  </para>
  <para>
   If a user does not have the required privilege on the table or columns,
   then in many cases the query will ultimately receive a permission-denied
   error, in which case this mechanism is invisible in practice.  But if the
   user is reading from a security-barrier view, then the planner might wish
   to check the statistics of an underlying table that is otherwise
   inaccessible to the user.  In that case, the operator should be leakproof
   or the statistics will not be used.  There is no direct feedback about
   that, except that the plan might be suboptimal.  If one suspects that this
   is the case, one could try running the query as a more privileged user,
   to see if a different plan results.
  </para>
  <para>
   This restriction applies only to cases where the planner would need to
   execute a user-defined operator on one or more values
   from <structname>pg_statistic</structname>.  Thus the planner is permitted
   to use generic statistical information, such as the fraction of null values
   or the number of distinct values in a column, regardless of access
   privileges.
  </para>
  <para>
   Selectivity estimation functions contained in third-party extensions that
   potentially operate on statistics with user-defined operators should follow
   the same security rules.  Consult the PostgreSQL source code for guidance.
  </para>

 </sect1>

</chapter>