doc/src/sgml/geqo.sgml

   1 <!-- doc/src/sgml/geqo.sgml -->
   2
   3  <chapter id="geqo">
   4   <title>Genetic Query Optimizer</title>
   5
   6   <para>
   7    <note>
   8     <title>Author</title>
   9     <para>
  10      Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
  11      for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
  12     </para>
  13    </note>
  14   </para>
  15
  16   <sect1 id="geqo-intro">
  17    <title>Query Handling as a Complex Optimization Problem</title>
  18
  19    <para>
  20     Among all relational operators the most difficult one to process
  21     and optimize is the <firstterm>join</firstterm>. The number of
  22     possible query plans grows exponentially with the
  23     number of joins in the query. Further optimization effort is
  24     caused by the support of a variety of <firstterm>join
  25     methods</firstterm> (e.g., nested loop, hash join, merge join in
  26     <productname>PostgreSQL</productname>) to process individual joins
  27     and a diversity of <firstterm>indexes</firstterm> (e.g.,
  28     B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
  29     access paths for relations.
  30    </para>
  31
  32    <para>
  33     The normal <productname>PostgreSQL</productname> query optimizer
  34     performs a <firstterm>near-exhaustive search</firstterm> over the
  35     space of alternative strategies. This algorithm, first introduced
  36     in IBM's System R database, produces a near-optimal join order,
  37     but can take an enormous amount of time and memory space when the
  38     number of joins in the query grows large. This makes the ordinary
  39     <productname>PostgreSQL</productname> query optimizer
  40     inappropriate for queries that join a large number of tables.
  41    </para>
  42
  43    <para>
  44     The Institute of Automatic Control at the University of Mining and
  45     Technology, in Freiberg, Germany, encountered some problems when
  46     it wanted to use <productname>PostgreSQL</productname> as the
  47     backend for a decision support knowledge based system for the
  48     maintenance of an electrical power grid. The DBMS needed to handle
  49     large join queries for the inference machine of the knowledge
  50     based system. The number of joins in these queries made using the
  51     normal query optimizer infeasible.
  52    </para>
  53
  54    <para>
  55     In the following we describe the implementation of a
  56     <firstterm>genetic algorithm</firstterm> to solve the join
  57     ordering problem in a manner that is efficient for queries
  58     involving large numbers of joins.
  59    </para>
  60   </sect1>
  61
  62   <sect1 id="geqo-intro2">
  63    <title>Genetic Algorithms</title>
  64
  65    <para>
  66     The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
  67     operates through randomized search. The set of possible solutions for the
  68     optimization problem is considered as a
  69     <firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
  70     The degree of adaptation of an individual to its environment is specified
  71     by its <firstterm>fitness</firstterm>.
  72    </para>
  73
  74    <para>
  75     The coordinates of an individual in the search space are represented
  76     by <firstterm>chromosomes</firstterm>, in essence a set of character
  77     strings. A <firstterm>gene</firstterm> is a
  78     subsection of a chromosome which encodes the value of a single parameter
  79     being optimized. Typical encodings for a gene could be <firstterm>binary</firstterm> or
  80     <firstterm>integer</firstterm>.
  81    </para>
  82
  83    <para>
  84     Through simulation of the evolutionary operations <firstterm>recombination</firstterm>,
  85     <firstterm>mutation</firstterm>, and
  86     <firstterm>selection</firstterm> new generations of search points are found
  87     that show a higher average fitness than their ancestors. <xref linkend="geqo-figure"/>
  88     illustrates these steps.
  89    </para>
  90
  91    <figure id="geqo-figure">
  92     <title>Structure of a Genetic Algorithm</title>
  93     <mediaobject>
  94      <imageobject>
  95       <imagedata fileref="images/genetic-algorithm.svg" format="SVG" width="100%"/>
  96      </imageobject>
  97     </mediaobject>
  98    </figure>
  99
 100    <para>
 101     According to the <systemitem class="resource">comp.ai.genetic</systemitem> <acronym>FAQ</acronym> it cannot be stressed too
 102     strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
 103     problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
 104     non-random (better than random).
 105    </para>
 106
 107   </sect1>
 108
 109   <sect1 id="geqo-pg-intro">
 110    <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>
 111
 112    <para>
 113     The <acronym>GEQO</acronym> module approaches the query
 114     optimization problem as though it were the well-known traveling salesman
 115     problem (<acronym>TSP</acronym>).
 116     Possible query plans are encoded as integer strings. Each string
 117     represents the join order from one relation of the query to the next.
 118     For example, the join tree
 119 <literallayout class="monospaced">
 120    /\
 121   /\ 2
 122  /\ 3
 123 4  1
 124 </literallayout>
 125     is encoded by the integer string '4-1-3-2',
 126     which means, first join relation '4' and '1', then '3', and
 127     then '2', where 1, 2, 3, 4 are relation IDs within the
 128     <productname>PostgreSQL</productname> optimizer.
 129    </para>
 130
 131    <para>
 132     Specific characteristics of the <acronym>GEQO</acronym>
 133     implementation in <productname>PostgreSQL</productname>
 134     are:
 135
 136     <itemizedlist spacing="compact" mark="bullet">
 137      <listitem>
 138       <para>
 139        Usage of a <firstterm>steady state</firstterm> <acronym>GA</acronym> (replacement of the least fit
 140        individuals in a population, not whole-generational replacement)
 141        allows fast convergence towards improved query plans. This is
 142        essential for query handling with reasonable time;
 143       </para>
 144      </listitem>
 145
 146      <listitem>
 147       <para>
 148        Usage of <firstterm>edge recombination crossover</firstterm>
 149        which is especially suited to keep edge losses low for the
 150        solution of the <acronym>TSP</acronym> by means of a
 151        <acronym>GA</acronym>;
 152       </para>
 153      </listitem>
 154
 155      <listitem>
 156       <para>
 157        Mutation as genetic operator is deprecated so that no repair
 158        mechanisms are needed to generate legal <acronym>TSP</acronym> tours.
 159       </para>
 160      </listitem>
 161     </itemizedlist>
 162    </para>
 163
 164    <para>
 165     Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
 166     Genitor algorithm.
 167    </para>
 168
 169    <para>
 170     The <acronym>GEQO</acronym> module allows
 171     the <productname>PostgreSQL</productname> query optimizer to
 172     support large join queries effectively through
 173     non-exhaustive search.
 174    </para>
 175
 176   <sect2 id="geqo-pg-intro-gen-possible-plans">
 177    <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
 178
 179    <para>
 180     The <acronym>GEQO</acronym> planning process uses the standard planner
 181     code to generate plans for scans of individual relations.  Then join
 182     plans are developed using the genetic approach.  As shown above, each
 183     candidate join plan is represented by a sequence in which to join
 184     the base relations.  In the initial stage, the <acronym>GEQO</acronym>
 185     code simply generates some possible join sequences at random.  For each
 186     join sequence considered, the standard planner code is invoked to
 187     estimate the cost of performing the query using that join sequence.
 188     (For each step of the join sequence, all three possible join strategies
 189     are considered; and all the initially-determined relation scan plans
 190     are available.  The estimated cost is the cheapest of these
 191     possibilities.)  Join sequences with lower estimated cost are considered
 192     <quote>more fit</quote> than those with higher cost.  The genetic algorithm
 193     discards the least fit candidates.  Then new candidates are generated
 194     by combining genes of more-fit candidates &mdash; that is, by using
 195     randomly-chosen portions of known low-cost join sequences to create
 196     new sequences for consideration.  This process is repeated until a
 197     preset number of join sequences have been considered; then the best
 198     one found at any time during the search is used to generate the finished
 199     plan.
 200    </para>
 201
 202    <para>
 203     This process is inherently nondeterministic, because of the randomized
 204     choices made during both the initial population selection and subsequent
 205     <quote>mutation</quote> of the best candidates.  To avoid surprising changes
 206     of the selected plan, each run of the GEQO algorithm restarts its
 207     random number generator with the current <xref linkend="guc-geqo-seed"/>
 208     parameter setting.  As long as <varname>geqo_seed</varname> and the other
 209     GEQO parameters are kept fixed, the same plan will be generated for a
 210     given query (and other planner inputs such as statistics).  To experiment
 211     with different search paths, try changing <varname>geqo_seed</varname>.
 212    </para>
 213
 214   </sect2>
 215
 216   <sect2 id="geqo-future">
 217    <title>Future Implementation Tasks for
 218     <productname>PostgreSQL</productname> <acronym>GEQO</acronym></title>
 219
 220      <para>
 221       Work is still needed to improve the genetic algorithm parameter
 222       settings.
 223       In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
 224       routines
 225       <function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
 226       we have to find a compromise for the parameter settings
 227       to satisfy two competing demands:
 228       <itemizedlist spacing="compact">
 229        <listitem>
 230         <para>
 231          Optimality of the query plan
 232         </para>
 233        </listitem>
 234        <listitem>
 235         <para>
 236          Computing time
 237         </para>
 238        </listitem>
 239       </itemizedlist>
 240      </para>
 241
 242      <para>
 243       In the current implementation, the fitness of each candidate join
 244       sequence is estimated by running the standard planner's join selection
 245       and cost estimation code from scratch.  To the extent that different
 246       candidates use similar sub-sequences of joins, a great deal of work
 247       will be repeated.  This could be made significantly faster by retaining
 248       cost estimates for sub-joins.  The problem is to avoid expending
 249       unreasonable amounts of memory on retaining that state.
 250      </para>
 251
 252      <para>
 253       At a more basic level, it is not clear that solving query optimization
 254       with a GA algorithm designed for TSP is appropriate.  In the TSP case,
 255       the cost associated with any substring (partial tour) is independent
 256       of the rest of the tour, but this is certainly not true for query
 257       optimization.  Thus it is questionable whether edge recombination
 258       crossover is the most effective mutation procedure.
 259      </para>
 260
 261    </sect2>
 262   </sect1>
 263
 264  <sect1 id="geqo-biblio">
 265   <title>Further Reading</title>
 266
 267   <para>
 268    The following resources contain additional information about
 269    genetic algorithms:
 270
 271    <itemizedlist>
 272     <listitem>
 273      <para>
 274       <ulink url="http://www.faqs.org/faqs/ai-faq/genetic/part1/">
 275       The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
 276       url="news://comp.ai.genetic"></ulink>)
 277      </para>
 278     </listitem>
 279
 280     <listitem>
 281      <para>
 282       <ulink url="https://www.red3d.com/cwr/evolve.html">
 283       Evolutionary Computation and its application to art and design</ulink>, by
 284       Craig Reynolds
 285      </para>
 286     </listitem>
 287
 288     <listitem>
 289      <para>
 290       <xref linkend="elma04"/>
 291      </para>
 292     </listitem>
 293
 294     <listitem>
 295      <para>
 296       <xref linkend="fong"/>
 297      </para>
 298     </listitem>
 299    </itemizedlist>
 300   </para>
 301
 302  </sect1>
 303 </chapter>