doc/src/sgml/geqo.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3  <chapter id="geqo">
   4   <chapterinfo>
   5    <author>
   6     <firstname>Martin</firstname>
   7     <surname>Utesch</surname>
   8     <affiliation>
   9      <orgname>
  10       University of Mining and Technology
  11      </orgname>
  12      <orgdiv>
  13       Institute of Automatic Control
  14      </orgdiv>
  15      <address>
  16       <city>
  17        Freiberg
  18       </city>
  19       <country>
  20        Germany
  21       </country>
  22      </address>
  23     </affiliation>
  24    </author>
  25    <date>1997-10-02</date>
  26   </chapterinfo>
  27
  28   <title id="geqo-title">Genetic Query Optimizer</title>
  29
  30   <para>
  31    <note>
  32     <title>Author</title>
  33     <para>
  34      Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
  35      for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
  36     </para>
  37    </note>
  38   </para>
  39
  40   <sect1 id="geqo-intro">
  41    <title>Query Handling as a Complex Optimization Problem</title>
  42
  43    <para>
  44     Among all relational operators the most difficult one to process
  45     and optimize is the <firstterm>join</firstterm>. The number of
  46     possible query plans grows exponentially with the
  47     number of joins in the query. Further optimization effort is
  48     caused by the support of a variety of <firstterm>join
  49     methods</firstterm> (e.g., nested loop, hash join, merge join in
  50     <productname>PostgreSQL</productname>) to process individual joins
  51     and a diversity of <firstterm>indexes</firstterm> (e.g.,
  52     B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
  53     access paths for relations.
  54    </para>
  55
  56    <para>
  57     The normal <productname>PostgreSQL</productname> query optimizer
  58     performs a <firstterm>near-exhaustive search</firstterm> over the
  59     space of alternative strategies. This algorithm, first introduced
  60     in IBM's System R database, produces a near-optimal join order,
  61     but can take an enormous amount of time and memory space when the
  62     number of joins in the query grows large. This makes the ordinary
  63     <productname>PostgreSQL</productname> query optimizer
  64     inappropriate for queries that join a large number of tables.
  65    </para>
  66
  67    <para>
  68     The Institute of Automatic Control at the University of Mining and
  69     Technology, in Freiberg, Germany, encountered some problems when
  70     it wanted to use <productname>PostgreSQL</productname> as the
  71     backend for a decision support knowledge based system for the
  72     maintenance of an electrical power grid. The DBMS needed to handle
  73     large join queries for the inference machine of the knowledge
  74     based system. The number of joins in these queries made using the
  75     normal query optimizer infeasible.
  76    </para>
  77
  78    <para>
  79     In the following we describe the implementation of a
  80     <firstterm>genetic algorithm</firstterm> to solve the join
  81     ordering problem in a manner that is efficient for queries
  82     involving large numbers of joins.
  83    </para>
  84   </sect1>
  85
  86   <sect1 id="geqo-intro2">
  87    <title>Genetic Algorithms</title>
  88
  89    <para>
  90     The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
  91     operates through
  92     nondeterministic, randomized search. The set of possible solutions for the
  93     optimization problem is considered as a
  94     <firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
  95     The degree of adaptation of an individual to its environment is specified
  96     by its <firstterm>fitness</firstterm>.
  97    </para>
  98
  99    <para>
 100     The coordinates of an individual in the search space are represented
 101     by <firstterm>chromosomes</firstterm>, in essence a set of character
 102     strings. A <firstterm>gene</firstterm> is a
 103     subsection of a chromosome which encodes the value of a single parameter
 104     being optimized. Typical encodings for a gene could be <firstterm>binary</firstterm> or
 105     <firstterm>integer</firstterm>.
 106    </para>
 107
 108    <para>
 109     Through simulation of the evolutionary operations <firstterm>recombination</firstterm>,
 110     <firstterm>mutation</firstterm>, and
 111     <firstterm>selection</firstterm> new generations of search points are found
 112     that show a higher average fitness than their ancestors.
 113    </para>
 114
 115    <para>
 116     According to the <systemitem class="resource">comp.ai.genetic</> <acronym>FAQ</acronym> it cannot be stressed too
 117     strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
 118     problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
 119     non-random (better than random).
 120    </para>
 121
 122    <figure id="geqo-diagram">
 123     <title>Structured Diagram of a Genetic Algorithm</title>
 124
 125     <informaltable frame="none">
 126      <tgroup cols="2">
 127       <tbody>
 128        <row>
 129         <entry>P(t)</entry>
 130         <entry>generation of ancestors at a time t</entry>
 131        </row>
 132
 133        <row>
 134         <entry>P''(t)</entry>
 135         <entry>generation of descendants at a time t</entry>
 136        </row>
 137       </tbody>
 138      </tgroup>
 139     </informaltable>
 140
 141 <literallayout class="monospaced">
 142 +=========================================+
 143 |&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;  Algorithm GA  &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;|
 144 +=========================================+
 145 | INITIALIZE t := 0                       |
 146 +=========================================+
 147 | INITIALIZE P(t)                         |
 148 +=========================================+
 149 | evaluate FITNESS of P(t)                |
 150 +=========================================+
 151 | while not STOPPING CRITERION do         |
 152 |   +-------------------------------------+
 153 |   | P'(t)  := RECOMBINATION{P(t)}       |
 154 |   +-------------------------------------+
 155 |   | P''(t) := MUTATION{P'(t)}           |
 156 |   +-------------------------------------+
 157 |   | P(t+1) := SELECTION{P''(t) + P(t)}  |
 158 |   +-------------------------------------+
 159 |   | evaluate FITNESS of P''(t)          |
 160 |   +-------------------------------------+
 161 |   | t := t + 1                          |
 162 +===+=====================================+
 163 </literallayout>
 164    </figure>
 165   </sect1>
 166
 167   <sect1 id="geqo-pg-intro">
 168    <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>
 169
 170    <para>
 171     The <acronym>GEQO</acronym> module approaches the query
 172     optimization problem as though it were the well-known traveling salesman
 173     problem (<acronym>TSP</acronym>).
 174     Possible query plans are encoded as integer strings. Each string
 175     represents the join order from one relation of the query to the next.
 176     For example, the join tree
 177 <literallayout class="monospaced">
 178    /\
 179   /\ 2
 180  /\ 3
 181 4  1
 182 </literallayout>
 183     is encoded by the integer string '4-1-3-2',
 184     which means, first join relation '4' and '1', then '3', and
 185     then '2', where 1, 2, 3, 4 are relation IDs within the
 186     <productname>PostgreSQL</productname> optimizer.
 187    </para>
 188
 189    <para>
 190     Specific characteristics of the <acronym>GEQO</acronym>
 191     implementation in <productname>PostgreSQL</productname>
 192     are:
 193
 194     <itemizedlist spacing="compact" mark="bullet">
 195      <listitem>
 196       <para>
 197        Usage of a <firstterm>steady state</firstterm> <acronym>GA</acronym> (replacement of the least fit
 198        individuals in a population, not whole-generational replacement)
 199        allows fast convergence towards improved query plans. This is
 200        essential for query handling with reasonable time;
 201       </para>
 202      </listitem>
 203
 204      <listitem>
 205       <para>
 206        Usage of <firstterm>edge recombination crossover</firstterm>
 207        which is especially suited to keep edge losses low for the
 208        solution of the <acronym>TSP</acronym> by means of a
 209        <acronym>GA</acronym>;
 210       </para>
 211      </listitem>
 212
 213      <listitem>
 214       <para>
 215        Mutation as genetic operator is deprecated so that no repair
 216        mechanisms are needed to generate legal <acronym>TSP</acronym> tours.
 217       </para>
 218      </listitem>
 219     </itemizedlist>
 220    </para>
 221
 222    <para>
 223     Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
 224     Genitor algorithm.
 225    </para>
 226
 227    <para>
 228     The <acronym>GEQO</acronym> module allows
 229     the <productname>PostgreSQL</productname> query optimizer to
 230     support large join queries effectively through
 231     non-exhaustive search.
 232    </para>
 233
 234   <sect2>
 235    <title>Generating Possible Plans with <acronym>GEQO</acronym></title>
 236
 237    <para>
 238     The <acronym>GEQO</acronym> planning process uses the standard planner
 239     code to generate plans for scans of individual relations.  Then join
 240     plans are developed using the genetic approach.  As shown above, each
 241     candidate join plan is represented by a sequence in which to join
 242     the base relations.  In the initial stage, the <acronym>GEQO</acronym>
 243     code simply generates some possible join sequences at random.  For each
 244     join sequence considered, the standard planner code is invoked to
 245     estimate the cost of performing the query using that join sequence.
 246     (For each step of the join sequence, all three possible join strategies
 247     are considered; and all the initially-determined relation scan plans
 248     are available.  The estimated cost is the cheapest of these
 249     possibilities.)  Join sequences with lower estimated cost are considered
 250     <quote>more fit</> than those with higher cost.  The genetic algorithm
 251     discards the least fit candidates.  Then new candidates are generated
 252     by combining genes of more-fit candidates &mdash; that is, by using
 253     randomly-chosen portions of known low-cost join sequences to create
 254     new sequences for consideration.  This process is repeated until a
 255     preset number of join sequences have been considered; then the best
 256     one found at any time during the search is used to generate the finished
 257     plan.
 258    </para>
 259
 260    <para>
 261     This process is inherently nondeterministic, because of the randomized
 262     choices made during both the initial population selection and subsequent
 263     <quote>mutation</> of the best candidates.  Hence different plans may
 264     be selected from one run to the next, resulting in varying run time
 265     and varying output row order.
 266    </para>
 267
 268   </sect2>
 269
 270   <sect2 id="geqo-future">
 271    <title>Future Implementation Tasks for
 272     <productname>PostgreSQL</> <acronym>GEQO</acronym></title>
 273
 274      <para>
 275       Work is still needed to improve the genetic algorithm parameter
 276       settings.
 277       In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
 278       routines
 279       <function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
 280       we have to find a compromise for the parameter settings
 281       to satisfy two competing demands:
 282       <itemizedlist spacing="compact">
 283        <listitem>
 284         <para>
 285          Optimality of the query plan
 286         </para>
 287        </listitem>
 288        <listitem>
 289         <para>
 290          Computing time
 291         </para>
 292        </listitem>
 293       </itemizedlist>
 294      </para>
 295
 296      <para>
 297       In the current implementation, the fitness of each candidate join
 298       sequence is estimated by running the standard planner's join selection
 299       and cost estimation code from scratch.  To the extent that different
 300       candidates use similar sub-sequences of joins, a great deal of work
 301       will be repeated.  This could be made significantly faster by retaining
 302       cost estimates for sub-joins.  The problem is to avoid expending
 303       unreasonable amounts of memory on retaining that state.
 304      </para>
 305
 306      <para>
 307       At a more basic level, it is not clear that solving query optimization
 308       with a GA algorithm designed for TSP is appropriate.  In the TSP case,
 309       the cost associated with any substring (partial tour) is independent
 310       of the rest of the tour, but this is certainly not true for query
 311       optimization.  Thus it is questionable whether edge recombination
 312       crossover is the most effective mutation procedure.
 313      </para>
 314
 315    </sect2>
 316   </sect1>
 317
 318  <sect1 id="geqo-biblio">
 319   <title>Further Reading</title>
 320
 321   <para>
 322    The following resources contain additional information about
 323    genetic algorithms:
 324
 325    <itemizedlist>
 326     <listitem>
 327      <para>
 328       <ulink url="http://www.cs.bham.ac.uk/Mirrors/ftp.de.uu.net/EC/clife/www/location.htm">
 329       The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
 330       url="news://comp.ai.genetic"></ulink>)
 331      </para>
 332     </listitem>
 333
 334     <listitem>
 335      <para>
 336       <ulink url="http://www.red3d.com/cwr/evolve.html">
 337       Evolutionary Computation and its application to art and design</ulink>, by
 338       Craig Reynolds
 339      </para>
 340     </listitem>
 341
 342     <listitem>
 343      <para>
 344       <xref linkend="ELMA04">
 345      </para>
 346     </listitem>
 347
 348     <listitem>
 349      <para>
 350       <xref linkend="FONG">
 351      </para>
 352     </listitem>
 353    </itemizedlist>
 354   </para>
 355
 356  </sect1>
 357 </chapter>