The code to unlink dropped relations in FinishPreparedTransaction() was
[PostgreSQL.git] / doc / src / sgml / geqo.sgml
blob7e03f9b9b59898195601b71f9c44d71a22b2907d
1 <!-- $PostgreSQL$ -->
3 <chapter id="geqo">
4 <chapterinfo>
5 <author>
6 <firstname>Martin</firstname>
7 <surname>Utesch</surname>
8 <affiliation>
9 <orgname>
10 University of Mining and Technology
11 </orgname>
12 <orgdiv>
13 Institute of Automatic Control
14 </orgdiv>
15 <address>
16 <city>
17 Freiberg
18 </city>
19 <country>
20 Germany
21 </country>
22 </address>
23 </affiliation>
24 </author>
25 <date>1997-10-02</date>
26 </chapterinfo>
28 <title id="geqo-title">Genetic Query Optimizer</title>
30 <para>
31 <note>
32 <title>Author</title>
33 <para>
34 Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
35 for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
36 </para>
37 </note>
38 </para>
40 <sect1 id="geqo-intro">
41 <title>Query Handling as a Complex Optimization Problem</title>
43 <para>
44 Among all relational operators the most difficult one to process
45 and optimize is the <firstterm>join</firstterm>. The number of
46 possible query plans grows exponentially with the
47 number of joins in the query. Further optimization effort is
48 caused by the support of a variety of <firstterm>join
49 methods</firstterm> (e.g., nested loop, hash join, merge join in
50 <productname>PostgreSQL</productname>) to process individual joins
51 and a diversity of <firstterm>indexes</firstterm> (e.g.,
52 B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
53 access paths for relations.
54 </para>
56 <para>
57 The normal <productname>PostgreSQL</productname> query optimizer
58 performs a <firstterm>near-exhaustive search</firstterm> over the
59 space of alternative strategies. This algorithm, first introduced
60 in IBM's System R database, produces a near-optimal join order,
61 but can take an enormous amount of time and memory space when the
62 number of joins in the query grows large. This makes the ordinary
63 <productname>PostgreSQL</productname> query optimizer
64 inappropriate for queries that join a large number of tables.
65 </para>
67 <para>
68 The Institute of Automatic Control at the University of Mining and
69 Technology, in Freiberg, Germany, encountered some problems when
70 it wanted to use <productname>PostgreSQL</productname> as the
71 backend for a decision support knowledge based system for the
72 maintenance of an electrical power grid. The DBMS needed to handle
73 large join queries for the inference machine of the knowledge
74 based system. The number of joins in these queries made using the
75 normal query optimizer infeasible.
76 </para>
78 <para>
79 In the following we describe the implementation of a
80 <firstterm>genetic algorithm</firstterm> to solve the join
81 ordering problem in a manner that is efficient for queries
82 involving large numbers of joins.
83 </para>
84 </sect1>
86 <sect1 id="geqo-intro2">
87 <title>Genetic Algorithms</title>
89 <para>
90 The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
91 operates through
92 nondeterministic, randomized search. The set of possible solutions for the
93 optimization problem is considered as a
94 <firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
95 The degree of adaptation of an individual to its environment is specified
96 by its <firstterm>fitness</firstterm>.
97 </para>
99 <para>
100 The coordinates of an individual in the search space are represented
101 by <firstterm>chromosomes</firstterm>, in essence a set of character
102 strings. A <firstterm>gene</firstterm> is a
103 subsection of a chromosome which encodes the value of a single parameter
104 being optimized. Typical encodings for a gene could be <firstterm>binary</firstterm> or
105 <firstterm>integer</firstterm>.
106 </para>
108 <para>
109 Through simulation of the evolutionary operations <firstterm>recombination</firstterm>,
110 <firstterm>mutation</firstterm>, and
111 <firstterm>selection</firstterm> new generations of search points are found
112 that show a higher average fitness than their ancestors.
113 </para>
115 <para>
116 According to the <systemitem class="resource">comp.ai.genetic</> <acronym>FAQ</acronym> it cannot be stressed too
117 strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
118 problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
119 non-random (better than random).
120 </para>
122 <figure id="geqo-diagram">
123 <title>Structured Diagram of a Genetic Algorithm</title>
125 <informaltable frame="none">
126 <tgroup cols="2">
127 <tbody>
128 <row>
129 <entry>P(t)</entry>
130 <entry>generation of ancestors at a time t</entry>
131 </row>
133 <row>
134 <entry>P''(t)</entry>
135 <entry>generation of descendants at a time t</entry>
136 </row>
137 </tbody>
138 </tgroup>
139 </informaltable>
141 <literallayout class="monospaced">
142 +=========================================+
143 |&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Algorithm GA &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;|
144 +=========================================+
145 | INITIALIZE t := 0 |
146 +=========================================+
147 | INITIALIZE P(t) |
148 +=========================================+
149 | evaluate FITNESS of P(t) |
150 +=========================================+
151 | while not STOPPING CRITERION do |
152 | +-------------------------------------+
153 | | P'(t) := RECOMBINATION{P(t)} |
154 | +-------------------------------------+
155 | | P''(t) := MUTATION{P'(t)} |
156 | +-------------------------------------+
157 | | P(t+1) := SELECTION{P''(t) + P(t)} |
158 | +-------------------------------------+
159 | | evaluate FITNESS of P''(t) |
160 | +-------------------------------------+
161 | | t := t + 1 |
162 +===+=====================================+
163 </literallayout>
164 </figure>
165 </sect1>
167 <sect1 id="geqo-pg-intro">
168 <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>
170 <para>
171 The <acronym>GEQO</acronym> module approaches the query
172 optimization problem as though it were the well-known traveling salesman
173 problem (<acronym>TSP</acronym>).
174 Possible query plans are encoded as integer strings. Each string
175 represents the join order from one relation of the query to the next.
176 For example, the join tree
177 <literallayout class="monospaced">
179 /\ 2
180 /\ 3
182 </literallayout>
183 is encoded by the integer string '4-1-3-2',
184 which means, first join relation '4' and '1', then '3', and
185 then '2', where 1, 2, 3, 4 are relation IDs within the
186 <productname>PostgreSQL</productname> optimizer.
187 </para>
189 <para>
190 Specific characteristics of the <acronym>GEQO</acronym>
191 implementation in <productname>PostgreSQL</productname>
192 are:
194 <itemizedlist spacing="compact" mark="bullet">
195 <listitem>
196 <para>
197 Usage of a <firstterm>steady state</firstterm> <acronym>GA</acronym> (replacement of the least fit
198 individuals in a population, not whole-generational replacement)
199 allows fast convergence towards improved query plans. This is
200 essential for query handling with reasonable time;
201 </para>
202 </listitem>
204 <listitem>
205 <para>
206 Usage of <firstterm>edge recombination crossover</firstterm>
207 which is especially suited to keep edge losses low for the
208 solution of the <acronym>TSP</acronym> by means of a
209 <acronym>GA</acronym>;
210 </para>
211 </listitem>
213 <listitem>
214 <para>
215 Mutation as genetic operator is deprecated so that no repair
216 mechanisms are needed to generate legal <acronym>TSP</acronym> tours.
217 </para>
218 </listitem>
219 </itemizedlist>
220 </para>
222 <para>
223 Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
224 Genitor algorithm.
225 </para>
227 <para>
228 The <acronym>GEQO</acronym> module allows
229 the <productname>PostgreSQL</productname> query optimizer to
230 support large join queries effectively through
231 non-exhaustive search.
232 </para>
234 <sect2>
235 <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
237 <para>
238 The <acronym>GEQO</acronym> planning process uses the standard planner
239 code to generate plans for scans of individual relations. Then join
240 plans are developed using the genetic approach. As shown above, each
241 candidate join plan is represented by a sequence in which to join
242 the base relations. In the initial stage, the <acronym>GEQO</acronym>
243 code simply generates some possible join sequences at random. For each
244 join sequence considered, the standard planner code is invoked to
245 estimate the cost of performing the query using that join sequence.
246 (For each step of the join sequence, all three possible join strategies
247 are considered; and all the initially-determined relation scan plans
248 are available. The estimated cost is the cheapest of these
249 possibilities.) Join sequences with lower estimated cost are considered
250 <quote>more fit</> than those with higher cost. The genetic algorithm
251 discards the least fit candidates. Then new candidates are generated
252 by combining genes of more-fit candidates &mdash; that is, by using
253 randomly-chosen portions of known low-cost join sequences to create
254 new sequences for consideration. This process is repeated until a
255 preset number of join sequences have been considered; then the best
256 one found at any time during the search is used to generate the finished
257 plan.
258 </para>
260 <para>
261 This process is inherently nondeterministic, because of the randomized
262 choices made during both the initial population selection and subsequent
263 <quote>mutation</> of the best candidates. Hence different plans may
264 be selected from one run to the next, resulting in varying run time
265 and varying output row order.
266 </para>
268 </sect2>
270 <sect2 id="geqo-future">
271 <title>Future Implementation Tasks for
272 <productname>PostgreSQL</> <acronym>GEQO</acronym></title>
274 <para>
275 Work is still needed to improve the genetic algorithm parameter
276 settings.
277 In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
278 routines
279 <function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
280 we have to find a compromise for the parameter settings
281 to satisfy two competing demands:
282 <itemizedlist spacing="compact">
283 <listitem>
284 <para>
285 Optimality of the query plan
286 </para>
287 </listitem>
288 <listitem>
289 <para>
290 Computing time
291 </para>
292 </listitem>
293 </itemizedlist>
294 </para>
296 <para>
297 In the current implementation, the fitness of each candidate join
298 sequence is estimated by running the standard planner's join selection
299 and cost estimation code from scratch. To the extent that different
300 candidates use similar sub-sequences of joins, a great deal of work
301 will be repeated. This could be made significantly faster by retaining
302 cost estimates for sub-joins. The problem is to avoid expending
303 unreasonable amounts of memory on retaining that state.
304 </para>
306 <para>
307 At a more basic level, it is not clear that solving query optimization
308 with a GA algorithm designed for TSP is appropriate. In the TSP case,
309 the cost associated with any substring (partial tour) is independent
310 of the rest of the tour, but this is certainly not true for query
311 optimization. Thus it is questionable whether edge recombination
312 crossover is the most effective mutation procedure.
313 </para>
315 </sect2>
316 </sect1>
318 <sect1 id="geqo-biblio">
319 <title>Further Reading</title>
321 <para>
322 The following resources contain additional information about
323 genetic algorithms:
325 <itemizedlist>
326 <listitem>
327 <para>
328 <ulink url="http://www.cs.bham.ac.uk/Mirrors/ftp.de.uu.net/EC/clife/www/location.htm">
329 The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
330 url="news://comp.ai.genetic"></ulink>)
331 </para>
332 </listitem>
334 <listitem>
335 <para>
336 <ulink url="http://www.red3d.com/cwr/evolve.html">
337 Evolutionary Computation and its application to art and design</ulink>, by
338 Craig Reynolds
339 </para>
340 </listitem>
342 <listitem>
343 <para>
344 <xref linkend="ELMA04">
345 </para>
346 </listitem>
348 <listitem>
349 <para>
350 <xref linkend="FONG">
351 </para>
352 </listitem>
353 </itemizedlist>
354 </para>
356 </sect1>
357 </chapter>