At update of non-LP_NORMAL TID, fail instead of corrupting page header.
[pgsql.git] / doc / src / sgml / geqo.sgml
blob82bf3b690d85ede7ed3ac05fe7437b17bb165c0d
1 <!-- doc/src/sgml/geqo.sgml -->
3 <chapter id="geqo">
4 <title>Genetic Query Optimizer</title>
6 <para>
7 <note>
8 <title>Author</title>
9 <para>
10 Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
11 for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
12 </para>
13 </note>
14 </para>
16 <sect1 id="geqo-intro">
17 <title>Query Handling as a Complex Optimization Problem</title>
19 <para>
20 Among all relational operators the most difficult one to process
21 and optimize is the <firstterm>join</firstterm>. The number of
22 possible query plans grows exponentially with the
23 number of joins in the query. Further optimization effort is
24 caused by the support of a variety of <firstterm>join
25 methods</firstterm> (e.g., nested loop, hash join, merge join in
26 <productname>PostgreSQL</productname>) to process individual joins
27 and a diversity of <firstterm>indexes</firstterm> (e.g.,
28 B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
29 access paths for relations.
30 </para>
32 <para>
33 The normal <productname>PostgreSQL</productname> query optimizer
34 performs a <firstterm>near-exhaustive search</firstterm> over the
35 space of alternative strategies. This algorithm, first introduced
36 in IBM's System R database, produces a near-optimal join order,
37 but can take an enormous amount of time and memory space when the
38 number of joins in the query grows large. This makes the ordinary
39 <productname>PostgreSQL</productname> query optimizer
40 inappropriate for queries that join a large number of tables.
41 </para>
43 <para>
44 The Institute of Automatic Control at the University of Mining and
45 Technology, in Freiberg, Germany, encountered some problems when
46 it wanted to use <productname>PostgreSQL</productname> as the
47 backend for a decision support knowledge based system for the
48 maintenance of an electrical power grid. The DBMS needed to handle
49 large join queries for the inference machine of the knowledge
50 based system. The number of joins in these queries made using the
51 normal query optimizer infeasible.
52 </para>
54 <para>
55 In the following we describe the implementation of a
56 <firstterm>genetic algorithm</firstterm> to solve the join
57 ordering problem in a manner that is efficient for queries
58 involving large numbers of joins.
59 </para>
60 </sect1>
62 <sect1 id="geqo-intro2">
63 <title>Genetic Algorithms</title>
65 <para>
66 The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
67 operates through randomized search. The set of possible solutions for the
68 optimization problem is considered as a
69 <firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
70 The degree of adaptation of an individual to its environment is specified
71 by its <firstterm>fitness</firstterm>.
72 </para>
74 <para>
75 The coordinates of an individual in the search space are represented
76 by <firstterm>chromosomes</firstterm>, in essence a set of character
77 strings. A <firstterm>gene</firstterm> is a
78 subsection of a chromosome which encodes the value of a single parameter
79 being optimized. Typical encodings for a gene could be <firstterm>binary</firstterm> or
80 <firstterm>integer</firstterm>.
81 </para>
83 <para>
84 Through simulation of the evolutionary operations <firstterm>recombination</firstterm>,
85 <firstterm>mutation</firstterm>, and
86 <firstterm>selection</firstterm> new generations of search points are found
87 that show a higher average fitness than their ancestors. <xref linkend="geqo-figure"/>
88 illustrates these steps.
89 </para>
91 <figure id="geqo-figure">
92 <title>Structure of a Genetic Algorithm</title>
93 <mediaobject>
94 <imageobject>
95 <imagedata fileref="images/genetic-algorithm.svg" format="SVG" width="100%"/>
96 </imageobject>
97 </mediaobject>
98 </figure>
100 <para>
101 According to the <systemitem class="resource">comp.ai.genetic</systemitem> <acronym>FAQ</acronym> it cannot be stressed too
102 strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
103 problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
104 non-random (better than random).
105 </para>
107 </sect1>
109 <sect1 id="geqo-pg-intro">
110 <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>
112 <para>
113 The <acronym>GEQO</acronym> module approaches the query
114 optimization problem as though it were the well-known traveling salesman
115 problem (<acronym>TSP</acronym>).
116 Possible query plans are encoded as integer strings. Each string
117 represents the join order from one relation of the query to the next.
118 For example, the join tree
119 <literallayout class="monospaced">
121 /\ 2
122 /\ 3
124 </literallayout>
125 is encoded by the integer string '4-1-3-2',
126 which means, first join relation '4' and '1', then '3', and
127 then '2', where 1, 2, 3, 4 are relation IDs within the
128 <productname>PostgreSQL</productname> optimizer.
129 </para>
131 <para>
132 Specific characteristics of the <acronym>GEQO</acronym>
133 implementation in <productname>PostgreSQL</productname>
134 are:
136 <itemizedlist spacing="compact" mark="bullet">
137 <listitem>
138 <para>
139 Usage of a <firstterm>steady state</firstterm> <acronym>GA</acronym> (replacement of the least fit
140 individuals in a population, not whole-generational replacement)
141 allows fast convergence towards improved query plans. This is
142 essential for query handling with reasonable time;
143 </para>
144 </listitem>
146 <listitem>
147 <para>
148 Usage of <firstterm>edge recombination crossover</firstterm>
149 which is especially suited to keep edge losses low for the
150 solution of the <acronym>TSP</acronym> by means of a
151 <acronym>GA</acronym>;
152 </para>
153 </listitem>
155 <listitem>
156 <para>
157 Mutation as genetic operator is deprecated so that no repair
158 mechanisms are needed to generate legal <acronym>TSP</acronym> tours.
159 </para>
160 </listitem>
161 </itemizedlist>
162 </para>
164 <para>
165 Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
166 Genitor algorithm.
167 </para>
169 <para>
170 The <acronym>GEQO</acronym> module allows
171 the <productname>PostgreSQL</productname> query optimizer to
172 support large join queries effectively through
173 non-exhaustive search.
174 </para>
176 <sect2 id="geqo-pg-intro-gen-possible-plans">
177 <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
179 <para>
180 The <acronym>GEQO</acronym> planning process uses the standard planner
181 code to generate plans for scans of individual relations. Then join
182 plans are developed using the genetic approach. As shown above, each
183 candidate join plan is represented by a sequence in which to join
184 the base relations. In the initial stage, the <acronym>GEQO</acronym>
185 code simply generates some possible join sequences at random. For each
186 join sequence considered, the standard planner code is invoked to
187 estimate the cost of performing the query using that join sequence.
188 (For each step of the join sequence, all three possible join strategies
189 are considered; and all the initially-determined relation scan plans
190 are available. The estimated cost is the cheapest of these
191 possibilities.) Join sequences with lower estimated cost are considered
192 <quote>more fit</quote> than those with higher cost. The genetic algorithm
193 discards the least fit candidates. Then new candidates are generated
194 by combining genes of more-fit candidates &mdash; that is, by using
195 randomly-chosen portions of known low-cost join sequences to create
196 new sequences for consideration. This process is repeated until a
197 preset number of join sequences have been considered; then the best
198 one found at any time during the search is used to generate the finished
199 plan.
200 </para>
202 <para>
203 This process is inherently nondeterministic, because of the randomized
204 choices made during both the initial population selection and subsequent
205 <quote>mutation</quote> of the best candidates. To avoid surprising changes
206 of the selected plan, each run of the GEQO algorithm restarts its
207 random number generator with the current <xref linkend="guc-geqo-seed"/>
208 parameter setting. As long as <varname>geqo_seed</varname> and the other
209 GEQO parameters are kept fixed, the same plan will be generated for a
210 given query (and other planner inputs such as statistics). To experiment
211 with different search paths, try changing <varname>geqo_seed</varname>.
212 </para>
214 </sect2>
216 <sect2 id="geqo-future">
217 <title>Future Implementation Tasks for
218 <productname>PostgreSQL</productname> <acronym>GEQO</acronym></title>
220 <para>
221 Work is still needed to improve the genetic algorithm parameter
222 settings.
223 In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
224 routines
225 <function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
226 we have to find a compromise for the parameter settings
227 to satisfy two competing demands:
228 <itemizedlist spacing="compact">
229 <listitem>
230 <para>
231 Optimality of the query plan
232 </para>
233 </listitem>
234 <listitem>
235 <para>
236 Computing time
237 </para>
238 </listitem>
239 </itemizedlist>
240 </para>
242 <para>
243 In the current implementation, the fitness of each candidate join
244 sequence is estimated by running the standard planner's join selection
245 and cost estimation code from scratch. To the extent that different
246 candidates use similar sub-sequences of joins, a great deal of work
247 will be repeated. This could be made significantly faster by retaining
248 cost estimates for sub-joins. The problem is to avoid expending
249 unreasonable amounts of memory on retaining that state.
250 </para>
252 <para>
253 At a more basic level, it is not clear that solving query optimization
254 with a GA algorithm designed for TSP is appropriate. In the TSP case,
255 the cost associated with any substring (partial tour) is independent
256 of the rest of the tour, but this is certainly not true for query
257 optimization. Thus it is questionable whether edge recombination
258 crossover is the most effective mutation procedure.
259 </para>
261 </sect2>
262 </sect1>
264 <sect1 id="geqo-biblio">
265 <title>Further Reading</title>
267 <para>
268 The following resources contain additional information about
269 genetic algorithms:
271 <itemizedlist>
272 <listitem>
273 <para>
274 <ulink url="http://www.faqs.org/faqs/ai-faq/genetic/part1/">
275 The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
276 url="news://comp.ai.genetic"></ulink>)
277 </para>
278 </listitem>
280 <listitem>
281 <para>
282 <ulink url="https://www.red3d.com/cwr/evolve.html">
283 Evolutionary Computation and its application to art and design</ulink>, by
284 Craig Reynolds
285 </para>
286 </listitem>
288 <listitem>
289 <para>
290 <xref linkend="elma04"/>
291 </para>
292 </listitem>
294 <listitem>
295 <para>
296 <xref linkend="fong"/>
297 </para>
298 </listitem>
299 </itemizedlist>
300 </para>
302 </sect1>
303 </chapter>