<chapter id="overview">
 <title>Overview of PostgreSQL Internals</title>

 <note>
  <para>
   This chapter originated as part of
   <xref linkend="SIM98">, Stefan Simkovics'
   Master's Thesis prepared at Vienna University of Technology under the
   direction of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
  </para>
 </note>
 <para>
  This chapter gives an overview of the internal structure of the
  backend of <productname>PostgreSQL</productname>. After having
  read the following sections you should have an idea of how a query
  is processed. This chapter does not aim to provide a detailed
  description of the internal operation of
  <productname>PostgreSQL</productname>, as such a document would be
  very extensive. Rather, this chapter is intended to help the reader
  understand the general sequence of operations that occur within the
  backend from the point at which a query is received, to the point
  at which the results are returned to the client.
 </para>
 <sect1 id="query-path">
  <title>The Path of a Query</title>

  <para>
   Here we give a short overview of the stages a query has to pass in
   order to obtain a result.
  </para>

  <procedure>
   <step>
    <para>
     A connection from an application program to the
     <productname>PostgreSQL</productname> server has to be established.
     The application program transmits a query to the server and waits
     to receive the results sent back by the server.
    </para>
   </step>
   <step>
    <para>
     The <firstterm>parser stage</firstterm> checks the query transmitted
     by the application program for correct syntax and creates a
     <firstterm>query tree</firstterm>.
    </para>
   </step>
   <step>
    <para>
     The <firstterm>rewrite system</firstterm> takes the query tree created
     by the parser stage and looks for any <firstterm>rules</firstterm>
     (stored in the <firstterm>system catalogs</firstterm>) to apply to
     the query tree. It performs the transformations given in the
     <firstterm>rule bodies</firstterm>.
    </para>
    <para>
     One application of the rewrite system is in the realization of
     <firstterm>views</firstterm>.
     Whenever a query against a view (i.e., a
     <firstterm>virtual table</firstterm>) is made, the rewrite system
     rewrites the user's query to a query that accesses the
     <firstterm>base tables</firstterm> given in the
     <firstterm>view definition</firstterm> instead.
    </para>
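    <para>
     For example, a query against a simple view (the table and view
     names here are hypothetical) behaves as if the view definition had
     been inlined as a subquery:
    </para>

<programlisting>
CREATE TABLE employees (id int, name text, salary numeric);
CREATE VIEW well_paid AS
    SELECT name, salary FROM employees WHERE salary > 50000;

-- The user's query against the view ...
SELECT name FROM well_paid;

-- ... is rewritten, roughly, to this query on the base table:
SELECT name
  FROM (SELECT name, salary FROM employees
        WHERE salary > 50000) AS well_paid;
</programlisting>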
   </step>

   <step>
    <para>
     The <firstterm>planner/optimizer</firstterm> takes the (rewritten)
     query tree and creates a <firstterm>query plan</firstterm> that will
     be the input to the <firstterm>executor</firstterm>.
    </para>
    <para>
     It does so by first creating all possible <firstterm>paths</firstterm>
     leading to the same result. For example, if there is an index on a
     relation to be scanned, there are two paths for the scan. One
     possibility is a simple sequential scan and the other possibility
     is to use the index. Next the cost for the execution of each path
     is estimated and the cheapest path is chosen. The cheapest path is
     expanded into a complete plan that the executor can use.
    </para>
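    <para>
     The chosen path can be inspected with <command>EXPLAIN</command>.
     (The table and index here are hypothetical; for a tiny table the
     planner may well decide that a sequential scan is cheaper even
     though an index exists.)
    </para>

<programlisting>
CREATE TABLE accounts (id int, balance numeric);
CREATE INDEX accounts_id_idx ON accounts (id);

-- Both a sequential-scan path and an index-scan path exist here;
-- the planner picks whichever it estimates to be cheaper:
EXPLAIN SELECT * FROM accounts WHERE id = 42;
</programlisting>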
   </step>

   <step>
    <para>
     The executor recursively steps through the
     <firstterm>plan tree</firstterm> and
     retrieves rows in the way represented by the plan.
     The executor makes use of the
     <firstterm>storage system</firstterm> while scanning
     relations, performs <firstterm>sorts</firstterm> and
     <firstterm>joins</firstterm>, evaluates
     <firstterm>qualifications</firstterm>, and finally hands back the
     rows derived.
    </para>
   </step>
  </procedure>
  <para>
   In the following sections we will cover each of the above listed items
   in more detail to give a better understanding of
   <productname>PostgreSQL</productname>'s internal control and data
   structures.
  </para>
 </sect1>
 <sect1 id="connect-estab">
  <title>How Connections are Established</title>
  <para>
   <productname>PostgreSQL</productname> is implemented using a
   simple <quote>process per user</> client/server model. In this model
   there is one <firstterm>client process</firstterm> connected to
   exactly one <firstterm>server process</firstterm>. As we do not
   know ahead of time how many connections will be made, we have to
   use a <firstterm>master process</firstterm> that spawns a new
   server process every time a connection is requested. This master
   process is called <literal>postgres</literal> and listens at a
   specified TCP/IP port for incoming connections. Whenever a request
   for a connection is detected the <literal>postgres</literal>
   process spawns a new server process. The server tasks
   communicate with each other using <firstterm>semaphores</firstterm> and
   <firstterm>shared memory</firstterm> to ensure data integrity
   throughout concurrent data access.
  </para>
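  <para>
   A simple way to observe the process-per-connection model is to ask
   each session for the PID of its server process:
  </para>

<programlisting>
-- Run this in two concurrent psql sessions: each reports a different
-- PID, i.e., each connection is served by its own server process.
SELECT pg_backend_pid();
</programlisting>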
  <para>
   The client process can be any program that understands the
   <productname>PostgreSQL</productname> protocol described in
   <xref linkend="protocol">. Many clients are based on the
   C-language library <application>libpq</>, but several independent
   implementations of the protocol exist, such as the Java
   <application>JDBC</> driver.
  </para>
  <para>
   Once a connection is established the client process can send a query
   to the <firstterm>backend</firstterm> (server). The query is transmitted
   using plain text, i.e., there is no parsing done in the
   <firstterm>frontend</firstterm> (client). The server parses the query,
   creates an <firstterm>execution plan</firstterm>, executes the plan,
   and returns the retrieved rows to the client by transmitting them over
   the established connection.
  </para>
 </sect1>
 <sect1 id="parser-stage">
  <title>The Parser Stage</title>

  <para>
   The <firstterm>parser stage</firstterm> consists of two parts:

   <itemizedlist>
    <listitem>
     <para>
      The <firstterm>parser</firstterm> defined in
      <filename>gram.y</filename> and <filename>scan.l</filename> is
      built using the Unix tools <application>yacc</application>
      and <application>lex</application>.
     </para>
    </listitem>
    <listitem>
     <para>
      The <firstterm>transformation process</firstterm> does
      modifications and augmentations to the data structures returned
      by the parser.
     </para>
    </listitem>
   </itemizedlist>
  </para>
  <sect2>
   <title>Parser</title>
   <para>
    The parser has to check the query string (which arrives as plain
    ASCII text) for valid syntax. If the syntax is correct a
    <firstterm>parse tree</firstterm> is built up and handed back;
    otherwise an error is returned. The parser and lexer are
    implemented using the well-known Unix tools <application>yacc</>
    and <application>lex</>.
   </para>
   <para>
    The <firstterm>lexer</firstterm> is defined in the file
    <filename>scan.l</filename> and is responsible
    for recognizing <firstterm>identifiers</firstterm>,
    the <firstterm>SQL key words</firstterm>, etc. For
    every key word or identifier that is found, a <firstterm>token</firstterm>
    is generated and handed to the parser.
   </para>
   <para>
    The parser is defined in the file <filename>gram.y</filename> and
    consists of a set of <firstterm>grammar rules</firstterm> and
    <firstterm>actions</firstterm> that are executed whenever a rule
    is fired. The code of the actions (which is actually C code) is
    used to build up the parse tree.
   </para>
   <para>
    The file <filename>scan.l</filename> is transformed to the C
    source file <filename>scan.c</filename> using the program
    <application>lex</application>, and <filename>gram.y</filename> is
    transformed to <filename>gram.c</filename> using
    <application>yacc</application>. After these transformations
    have taken place a normal C compiler can be used to create the
    parser. Never make any changes to the generated C files as they
    will be overwritten the next time <application>lex</application>
    or <application>yacc</application> is called.
   </para>
   <para>
    The mentioned transformations and compilations are normally done
    automatically using the <firstterm>makefiles</firstterm>
    shipped with the <productname>PostgreSQL</productname>
    source distribution.
   </para>
   <para>
    A detailed description of <application>yacc</application> or
    the grammar rules given in <filename>gram.y</filename> would be
    beyond the scope of this paper. There are many books and
    documents dealing with <application>lex</application> and
    <application>yacc</application>. You should be familiar with
    <application>yacc</application> before you start to study the
    grammar given in <filename>gram.y</filename>; otherwise you won't
    understand what happens there.
   </para>
  </sect2>
  <sect2>
   <title>Transformation Process</title>
   <para>
    The parser stage creates a parse tree using only fixed rules about
    the syntactic structure of SQL. It does not make any lookups in the
    system catalogs, so there is no possibility to understand the detailed
    semantics of the requested operations. After the parser completes,
    the <firstterm>transformation process</firstterm> takes the tree handed
    back by the parser as input and does the semantic interpretation needed
    to understand which tables, functions, and operators are referenced by
    the query. The data structure that is built to represent this
    information is called the <firstterm>query tree</>.
   </para>
   <para>
    The reason for separating raw parsing from semantic analysis is that
    system catalog lookups can only be done within a transaction, and we
    do not wish to start a transaction immediately upon receiving a query
    string. The raw parsing stage is sufficient to identify the transaction
    control commands (<command>BEGIN</>, <command>ROLLBACK</>, etc.), and
    these can then be correctly executed without any further analysis.
    Once we know that we are dealing with an actual query (such as
    <command>SELECT</> or <command>UPDATE</>), it is okay to
    start a transaction if we're not already in one. Only then can the
    transformation process be invoked.
   </para>
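   <para>
    The division of labor is visible in the errors each stage reports
    (the table names here are hypothetical):
   </para>

<programlisting>
-- Rejected by the raw parser: FORM is a syntax error, so no catalog
-- lookup is needed and no transaction has to be started.
SELECT * FORM mytable;

-- Syntactically valid, so it passes raw parsing; the transformation
-- process then fails to find the table in the system catalogs.
SELECT * FROM no_such_table;
</programlisting>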
   <para>
    The query tree created by the transformation process is structurally
    similar to the raw parse tree in most places, but it has many differences
    in detail. For example, a <structname>FuncCall</> node in the
    parse tree represents something that looks syntactically like a function
    call. This might be transformed to either a <structname>FuncExpr</>
    or <structname>Aggref</> node depending on whether the referenced
    name turns out to be an ordinary function or an aggregate function.
    Also, information about the actual data types of columns and expression
    results is added to the query tree.
   </para>
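   <para>
    For example, only a catalog lookup can tell the following two
    syntactically identical calls apart (assuming a hypothetical table
    <literal>t</literal> with a <literal>text</literal> column
    <literal>name</literal>):
   </para>

<programlisting>
-- upper() is an ordinary function: its FuncCall node becomes a FuncExpr.
-- count() is an aggregate: its FuncCall node becomes an Aggref.
SELECT upper(name), count(name) FROM t GROUP BY upper(name);
</programlisting>
  </sect2>
 </sect1>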
 <sect1 id="rule-system">
  <title>The <productname>PostgreSQL</productname> Rule System</title>

  <para>
   <productname>PostgreSQL</productname> supports a powerful
   <firstterm>rule system</firstterm> for the specification
   of <firstterm>views</firstterm> and ambiguous
   <firstterm>view updates</firstterm>.
   Originally the <productname>PostgreSQL</productname>
   rule system consisted of two implementations:

   <itemizedlist>
    <listitem>
     <para>
      The first one worked using <firstterm>row level</firstterm> processing
      and was implemented deep in the <firstterm>executor</firstterm>.
      The rule system was called whenever an individual row had been
      accessed. This implementation was removed in 1995 when the last
      official release of the <productname>Berkeley Postgres</productname>
      project was transformed into <productname>Postgres95</productname>.
     </para>
    </listitem>
    <listitem>
     <para>
      The second implementation of the rule system is a technique
      called <firstterm>query rewriting</firstterm>.
      The <firstterm>rewrite system</firstterm> is a module
      that exists between the <firstterm>parser stage</firstterm> and the
      <firstterm>planner/optimizer</firstterm>. This technique is still
      implemented.
     </para>
    </listitem>
   </itemizedlist>
  </para>
  <para>
   The query rewriter is discussed in some detail in
   <xref linkend="rules">, so there is no need to cover it here.
   We will only point out that both the input and the output of the
   rewriter are query trees, that is, there is no change in the
   representation or level of semantic detail in the trees. Rewriting
   can be thought of as a form of macro expansion.
  </para>
 </sect1>
 <sect1 id="planner-optimizer">
  <title>Planner/Optimizer</title>

  <para>
   The task of the <firstterm>planner/optimizer</firstterm> is to
   create an optimal execution plan. A given SQL query (and hence, a
   query tree) can be actually executed in a wide variety of
   different ways, each of which will produce the same set of
   results. If it is computationally feasible, the query optimizer
   will examine each of these possible execution plans, ultimately
   selecting the execution plan that is expected to run the fastest.
  </para>
  <para>
   In some situations, examining each possible way in which a query
   can be executed would take an excessive amount of time and memory
   space. In particular, this occurs when executing queries
   involving large numbers of join operations. In order to determine
   a reasonable (not necessarily optimal) query plan in a reasonable amount
   of time, <productname>PostgreSQL</productname> uses a
   <xref linkend="geqo" endterm="geqo-title"> when the number of joins
   exceeds a threshold (see <xref linkend="guc-geqo-threshold">).
  </para>
  <para>
   The planner's search procedure actually works with data structures
   called <firstterm>paths</>, which are simply cut-down representations of
   plans containing only as much information as the planner needs to make
   its decisions. After the cheapest path is determined, a full-fledged
   <firstterm>plan tree</> is built to pass to the executor. This represents
   the desired execution plan in sufficient detail for the executor to run it.
   In the rest of this section we'll ignore the distinction between paths
   and plans.
  </para>
  <sect2>
   <title>Generating Possible Plans</title>

   <para>
    The planner/optimizer starts by generating plans for scanning each
    individual relation (table) used in the query. The possible plans
    are determined by the available indexes on each relation.
    There is always the possibility of performing a
    sequential scan on a relation, so a sequential scan plan is always
    created. Assume an index is defined on a
    relation (for example a B-tree index) and a query contains the
    restriction
    <literal>relation.attribute OPR constant</literal>. If
    <literal>relation.attribute</literal> happens to match the key of the
    B-tree index and <literal>OPR</literal> is one of the operators listed in
    the index's <firstterm>operator class</>, another plan is created using
    the B-tree index to scan the relation. If there are further indexes
    present and the restrictions in the query happen to match a key of an
    index, further plans will be considered. Index scan plans are also
    generated for indexes that have a sort ordering that can match the
    query's <literal>ORDER BY</> clause (if any), or a sort ordering that
    might be useful for merge joining (see below).
   </para>
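   <para>
    As a sketch of the <literal>ORDER BY</> case, with a hypothetical
    table and index:
   </para>

<programlisting>
CREATE TABLE events (ts timestamptz, payload text);
CREATE INDEX events_ts_idx ON events (ts);

-- The index's sort ordering matches the ORDER BY, so in addition to
-- "sequential scan plus explicit sort" the planner considers an
-- index scan that needs no separate sort step:
EXPLAIN SELECT * FROM events ORDER BY ts;
</programlisting>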
   <para>
    If the query requires joining two or more relations,
    plans for joining relations are considered
    after all feasible plans have been found for scanning single relations.
    The three available join strategies (an example plan appears after
    this list) are:

    <itemizedlist>
     <listitem>
      <para>
       <firstterm>nested loop join</firstterm>: The right relation is scanned
       once for every row found in the left relation. This strategy
       is easy to implement but can be very time consuming. (However,
       if the right relation can be scanned with an index scan, this can
       be a good strategy. It is possible to use values from the current
       row of the left relation as keys for the index scan of the right.)
      </para>
     </listitem>
     <listitem>
      <para>
       <firstterm>merge join</firstterm>: Each relation is sorted on the join
       attributes before the join starts. Then the two relations are
       scanned in parallel, and matching rows are combined to form
       join rows. This kind of join is more
       attractive because each relation has to be scanned only once.
       The required sorting might be achieved either by an explicit sort
       step, or by scanning the relation in the proper order using an
       index on the join key.
      </para>
     </listitem>
     <listitem>
      <para>
       <firstterm>hash join</firstterm>: The right relation is first scanned
       and loaded into a hash table, using its join attributes as hash keys.
       Next the left relation is scanned and the
       appropriate values of every row found are used as hash keys to
       locate the matching rows in the table.
      </para>
     </listitem>
    </itemizedlist>
   </para>
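   <para>
    With two hypothetical tables <literal>a</literal> and
    <literal>b</literal>, <command>EXPLAIN</command> shows which strategy
    the planner settled on; a hash join, for instance, might look like
    this:
   </para>

<programlisting>
EXPLAIN SELECT * FROM a JOIN b ON a.id = b.a_id;
-- One possible output, with cost estimates omitted:
--   Hash Join
--     Hash Cond: (a.id = b.a_id)
--     ->  Seq Scan on a
--     ->  Hash
--           ->  Seq Scan on b
</programlisting>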
   <para>
    When the query involves more than two relations, the final result
    must be built up by a tree of join steps, each with two inputs.
    The planner examines different possible join sequences to find the
    cheapest one.
   </para>
   <para>
    If the query uses fewer than <xref linkend="guc-geqo-threshold">
    relations, a near-exhaustive search is conducted to find the best
    join sequence. The planner preferentially considers joins between any
    two relations for which there exists a corresponding join clause in the
    <literal>WHERE</literal> qualification (i.e., for
    which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
    exists). Join pairs with no join clause are considered only when there
    is no other choice, that is, a particular relation has no available
    join clauses to any other relation. All possible plans are generated for
    every join pair considered by the planner, and the one that is
    (estimated to be) the cheapest is chosen.
   </para>
   <para>
    When <varname>geqo_threshold</varname> is exceeded, the join
    sequences considered are determined by heuristics, as described
    in <xref linkend="geqo">. Otherwise the process is the same.
   </para>
   <para>
    The finished plan tree consists of sequential or index scans of
    the base relations, plus nested-loop, merge, or hash join nodes as
    needed, plus any auxiliary steps needed, such as sort nodes or
    aggregate-function calculation nodes. Most of these plan node
    types have the additional ability to do <firstterm>selection</>
    (discarding rows that do not meet a specified boolean condition)
    and <firstterm>projection</> (computation of a derived column set
    based on given column values, that is, evaluation of scalar
    expressions where needed). One of the responsibilities of the
    planner is to attach selection conditions from the
    <literal>WHERE</literal> clause and computation of required
    output expressions to the most appropriate nodes of the plan
    tree.
   </para>
  </sect2>
 </sect1>
 <sect1 id="executor">
  <title>Executor</title>

  <para>
   The <firstterm>executor</firstterm> takes the plan handed back by the
   planner/optimizer and recursively processes it to extract the required set
   of rows. This is essentially a demand-pull pipeline mechanism.
   Each time a plan node is called, it must deliver one more row, or
   report that it is done delivering rows.
  </para>
  <para>
   To provide a concrete example, assume that the top
   node is a <literal>MergeJoin</literal> node.
   Before any merge can be done two rows have to be fetched (one from
   each subplan). So the executor recursively calls itself to
   process the subplans (it starts with the subplan attached to
   <literal>lefttree</literal>). The new top node (the top node of the left
   subplan) is, let's say, a
   <literal>Sort</literal> node and again recursion is needed to obtain
   an input row. The child node of the <literal>Sort</literal> might
   be a <literal>SeqScan</> node, representing actual reading of a table.
   Execution of this node causes the executor to fetch a row from the
   table and return it up to the calling node. The <literal>Sort</literal>
   node will repeatedly call its child to obtain all the rows to be sorted.
   When the input is exhausted (as indicated by the child node returning
   a NULL instead of a row), the <literal>Sort</literal> code performs
   the sort, and finally is able to return its first output row, namely
   the first one in sorted order. It keeps the remaining rows stored so
   that it can deliver them in sorted order in response to later demands.
  </para>
  <para>
   The <literal>MergeJoin</literal> node similarly demands the first row
   from its right subplan. Then it compares the two rows to see if they
   can be joined; if so, it returns a join row to its caller. On the next
   call, or immediately if it cannot join the current pair of inputs,
   it advances to the next row of one table
   or the other (depending on how the comparison came out), and again
   checks for a match. Eventually, one subplan or the other is exhausted,
   and the <literal>MergeJoin</literal> node returns NULL to indicate that
   no more join rows can be formed.
  </para>
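  <para>
   A plan tree of exactly this shape can be seen in
   <command>EXPLAIN</command> output. (The tables are hypothetical, and
   the planner might well prefer another join strategy for them; the
   <varname>enable_mergejoin</varname> and related parameters can be
   used to experiment.)
  </para>

<programlisting>
EXPLAIN SELECT * FROM a JOIN b ON a.id = b.a_id;
-- One possible output, with cost estimates omitted:
--   Merge Join
--     Merge Cond: (a.id = b.a_id)
--     ->  Sort
--           Sort Key: a.id
--           ->  Seq Scan on a
--     ->  Sort
--           Sort Key: b.a_id
--           ->  Seq Scan on b
</programlisting>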
  <para>
   Complex queries can involve many levels of plan nodes, but the general
   approach is the same: each node computes and returns its next output
   row each time it is called. Each node is also responsible for applying
   any selection or projection expressions that were assigned to it by
   the planner.
  </para>
  <para>
   The executor mechanism is used to evaluate all four basic SQL query types:
   <command>SELECT</>, <command>INSERT</>, <command>UPDATE</>, and
   <command>DELETE</>. For <command>SELECT</>, the top-level executor
   code only needs to send each row returned by the query plan tree off
   to the client. For <command>INSERT</>, each returned row is inserted
   into the target table specified for the <command>INSERT</>. (A simple
   <command>INSERT ... VALUES</> command creates a trivial plan tree
   consisting of a single <literal>Result</> node, which computes just one
   result row. But <command>INSERT ... SELECT</> can demand the full power
   of the executor mechanism.) For <command>UPDATE</>, the planner arranges
   that each computed row includes all the updated column values, plus
   the <firstterm>TID</> (tuple ID, or row ID) of the original target row;
   the executor top level uses this information to create a new updated row
   and mark the old row deleted. For <command>DELETE</>, the only column
   that is actually returned by the plan is the TID, and the executor top
   level simply uses the TID to visit each target row and mark it deleted.
  </para>
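  <para>
   The trivial plan for a simple <command>INSERT</command> can be
   displayed with <command>EXPLAIN</command> (the table is hypothetical
   and the exact output varies across versions):
  </para>

<programlisting>
EXPLAIN INSERT INTO accounts VALUES (1, 100.0);
-- One possible output, with cost estimates omitted:
--   Insert on accounts
--     ->  Result
</programlisting>
 </sect1>
</chapter>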