<!-- doc/src/sgml/arch-dev.sgml -->

<chapter id="overview">
 <title>Overview of PostgreSQL Internals</title>

 <note>
  <title>Author</title>
  <para>
   This chapter originated as part of
   <xref linkend="sim98"/>, Stefan Simkovics'
   Master's Thesis prepared at Vienna University of Technology under the direction
   of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
  </para>
 </note>

 <para>
  This chapter gives an overview of the internal structure of the
  backend of <productname>PostgreSQL</productname>.  After having
  read the following sections you should have an idea of how a query
  is processed.  This chapter is intended to help the reader
  understand the general sequence of operations that occur within the
  backend from the point at which a query is received, to the point
  at which the results are returned to the client.
 </para>

 <sect1 id="query-path">
  <title>The Path of a Query</title>

  <para>
   Here we give a short overview of the stages a query has to pass
   through in order to obtain a result.
  </para>

  <procedure>
   <step>
    <para>
     A connection from an application program to the <productname>PostgreSQL</productname>
     server has to be established.  The application program transmits a
     query to the server and waits to receive the results sent back by the
     server.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>parser stage</firstterm> checks the query
     transmitted by the application
     program for correct syntax and creates
     a <firstterm>query tree</firstterm>.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>rewrite system</firstterm> takes
     the query tree created by the parser stage and looks for
     any <firstterm>rules</firstterm> (stored in the
     <firstterm>system catalogs</firstterm>) to apply to
     the query tree.  It performs the
     transformations given in the <firstterm>rule bodies</firstterm>.
    </para>

    <para>
     One application of the rewrite system is in the realization of
     <firstterm>views</firstterm>.
     Whenever a query against a view
     (i.e., a <firstterm>virtual table</firstterm>) is made,
     the rewrite system rewrites the user's query to
     a query that accesses the <firstterm>base tables</firstterm> given in
     the <firstterm>view definition</firstterm> instead.
    </para>
   </step>

   <step>
    <para>
     The <firstterm>planner/optimizer</firstterm> takes
     the (rewritten) query tree and creates a
     <firstterm>query plan</firstterm> that will be the input to the
     <firstterm>executor</firstterm>.
    </para>

    <para>
     It does so by first creating all possible <firstterm>paths</firstterm>
     leading to the same result.  For example, if there is an index on a
     relation to be scanned, there are two paths for the
     scan.  One possibility is a simple sequential scan and the other
     possibility is to use the index.  Next the cost for the execution of
     each path is estimated and the cheapest path is chosen.  The cheapest
     path is expanded into a complete plan that the executor can use.
    </para>
   </step>

   <step>
    <para>
     The executor recursively steps through
     the <firstterm>plan tree</firstterm> and
     retrieves rows in the way represented by the plan.
     The executor makes use of the
     <firstterm>storage system</firstterm> while scanning
     relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>,
     evaluates <firstterm>qualifications</firstterm>, and finally hands back the rows derived.
    </para>
   </step>
  </procedure>
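  <para>
   The combined effect of the rewriting and planning stages can be
   observed with <command>EXPLAIN</command>.  The following sketch
   (the table and view names are invented, and the exact costs and
   plan shape depend on your data and settings) shows a query against
   a view being rewritten into a scan of the underlying base table:

<programlisting>
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
CREATE VIEW rich_accounts AS
    SELECT * FROM accounts WHERE balance > 1000000;

EXPLAIN SELECT id FROM rich_accounts;
--                     QUERY PLAN
-- ---------------------------------------------------
--  Seq Scan on accounts  (cost=... rows=... width=...)
--    Filter: (balance > '1000000'::numeric)
</programlisting>

   The plan references only <literal>accounts</literal>: the view has
   been expanded away before planning.
  </para>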
  <para>
   In the following sections we will cover each of the above listed items
   in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal
   control and data structures.
  </para>
 </sect1>

 <sect1 id="connect-estab">
  <title>How Connections Are Established</title>

  <para>
   <productname>PostgreSQL</productname> implements a
   <quote>process per user</quote> client/server model.
   In this model, every
   <glossterm linkend="glossary-client">client process</glossterm>
   connects to exactly one
   <glossterm linkend="glossary-backend">backend process</glossterm>.
   As we do not know ahead of time how many connections will be made,
   we have to use a <quote>supervisor process</quote> that spawns a new
   backend process every time a connection is requested.  This supervisor
   process is called
   <glossterm linkend="glossary-postmaster">postmaster</glossterm>
   and listens at a specified TCP/IP port for incoming connections.
   Whenever it detects a request for a connection, it spawns a new
   backend process.  Those backend processes communicate with each
   other and with other processes of the
   <glossterm linkend="glossary-instance">instance</glossterm>
   using <firstterm>semaphores</firstterm> and
   <glossterm linkend="glossary-shared-memory">shared memory</glossterm>
   to ensure data integrity throughout concurrent data access.
  </para>
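  <para>
   The one-backend-per-connection model can be observed from SQL: the
   system view <structname>pg_stat_activity</structname> reports one
   row per server process, including the backend serving each
   connected client.  For example (the columns selected here are just
   an illustrative subset):

<programlisting>
SELECT pid, state, query
FROM pg_stat_activity
WHERE backend_type = 'client backend';
</programlisting>

   Each connected client appears with its own <literal>pid</literal>,
   that is, its own backend process.
  </para>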
  <para>
   The client process can be any program that understands the
   <productname>PostgreSQL</productname> protocol described in
   <xref linkend="protocol"/>.  Many clients are based on the
   C-language library <application>libpq</application>, but several independent
   implementations of the protocol exist, such as the Java
   <application>JDBC</application> driver.
  </para>

  <para>
   Once a connection is established, the client process can send a query
   to the backend process it's connected to.  The query is transmitted using
   plain text, i.e., there is no parsing done in the client.  The backend
   process parses the query, creates an <firstterm>execution plan</firstterm>,
   executes the plan, and returns the retrieved rows to the client
   by transmitting them over the established connection.
  </para>
 </sect1>

 <sect1 id="parser-stage">
  <title>The Parser Stage</title>

  <para>
   The <firstterm>parser stage</firstterm> consists of two parts:

   <itemizedlist>
    <listitem>
     <para>
      The <firstterm>parser</firstterm> defined in
      <filename>gram.y</filename> and <filename>scan.l</filename> is
      built using the Unix tools <application>bison</application>
      and <application>flex</application>.
     </para>
    </listitem>
    <listitem>
     <para>
      The <firstterm>transformation process</firstterm> does
      modifications and augmentations to the data structures returned by the parser.
     </para>
    </listitem>
   </itemizedlist>
  </para>

  <sect2 id="parser-stage-parser">
   <title>Parser</title>

   <para>
    The parser has to check the query string (which arrives as plain
    text) for valid syntax.  If the syntax is correct a
    <firstterm>parse tree</firstterm> is built up and handed back;
    otherwise an error is returned.  The parser and lexer are
    implemented using the well-known Unix tools <application>bison</application>
    and <application>flex</application>.
   </para>

   <para>
    The <firstterm>lexer</firstterm> is defined in the file
    <filename>scan.l</filename> and is responsible
    for recognizing <firstterm>identifiers</firstterm>,
    the <firstterm>SQL key words</firstterm>, etc.  For
    every key word or identifier that is found, a <firstterm>token</firstterm>
    is generated and handed to the parser.
   </para>

   <para>
    The parser is defined in the file <filename>gram.y</filename> and
    consists of a set of <firstterm>grammar rules</firstterm> and
    <firstterm>actions</firstterm> that are executed whenever a rule
    is fired.  The code of the actions (which is actually C code) is
    used to build up the parse tree.
   </para>

   <para>
    The file <filename>scan.l</filename> is transformed to the C
    source file <filename>scan.c</filename> using the program
    <application>flex</application> and <filename>gram.y</filename> is
    transformed to <filename>gram.c</filename> using
    <application>bison</application>.  After these transformations
    have taken place a normal C compiler can be used to create the
    parser.  Never make any changes to the generated C files as they
    will be overwritten the next time <application>flex</application>
    or <application>bison</application> is called.

    <note>
     <para>
      The mentioned transformations and compilations are normally done
      automatically using the <firstterm>makefiles</firstterm>
      shipped with the <productname>PostgreSQL</productname>
      source distribution.
     </para>
    </note>
   </para>

   <para>
    A detailed description of <application>bison</application> or
    the grammar rules given in <filename>gram.y</filename> would be
    beyond the scope of this manual.  There are many books and
    documents dealing with <application>flex</application> and
    <application>bison</application>.  You should be familiar with
    <application>bison</application> before you start to study the
    grammar given in <filename>gram.y</filename>; otherwise you won't
    understand what happens there.
   </para>

  </sect2>

  <sect2 id="parser-stage-transformation-process">
   <title>Transformation Process</title>

   <para>
    The parser stage creates a parse tree using only fixed rules about
    the syntactic structure of SQL.  It does not make any lookups in the
    system catalogs, so there is no possibility to understand the detailed
    semantics of the requested operations.  After the parser completes,
    the <firstterm>transformation process</firstterm> takes the tree handed
    back by the parser as input and does the semantic interpretation needed
    to understand which tables, functions, and operators are referenced by
    the query.  The data structure that is built to represent this
    information is called the <firstterm>query tree</firstterm>.
   </para>

   <para>
    The reason for separating raw parsing from semantic analysis is that
    system catalog lookups can only be done within a transaction, and we
    do not wish to start a transaction immediately upon receiving a query
    string.  The raw parsing stage is sufficient to identify the transaction
    control commands (<command>BEGIN</command>, <command>ROLLBACK</command>, etc.), and
    these can then be correctly executed without any further analysis.
    Once we know that we are dealing with an actual query (such as
    <command>SELECT</command> or <command>UPDATE</command>), it is okay to
    start a transaction if we're not already in one.  Only then can the
    transformation process be invoked.
   </para>
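   <para>
    The division of labor between raw parsing and semantic analysis
    shows up in the errors each stage raises.  For example (both
    statements are deliberately wrong, and the table name is
    invented):

<programlisting>
-- Rejected by the parser: not valid SQL syntax.
SELCT 1;
-- ERROR:  syntax error at or near "SELCT"

-- Syntactically valid, but the transformation process finds no
-- such table in the system catalogs.
SELECT * FROM no_such_table;
-- ERROR:  relation "no_such_table" does not exist
</programlisting>
   </para>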
   <para>
    The query tree created by the transformation process is structurally
    similar to the raw parse tree in most places, but it has many differences
    in detail.  For example, a <structname>FuncCall</structname> node in the
    parse tree represents something that looks syntactically like a function
    call.  This might be transformed to either a <structname>FuncExpr</structname>
    or <structname>Aggref</structname> node depending on whether the referenced
    name turns out to be an ordinary function or an aggregate function.
    Also, information about the actual data types of columns and expression
    results is added to the query tree.
   </para>
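   <para>
    Developers who want to inspect these trees can ask the server to
    dump them to the server log.  A minimal sketch using the standard
    debugging parameters (the dumps are verbose, intended for
    developers, and appear in the log rather than in the client
    output):

<programlisting>
SET debug_print_parse = on;   -- dump the query tree produced by analysis
SET debug_print_plan = on;    -- dump the plan tree chosen by the planner
SELECT count(*) FROM pg_class;
</programlisting>
   </para>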
  </sect2>
 </sect1>

 <sect1 id="rule-system">
  <title>The <productname>PostgreSQL</productname> Rule System</title>

  <para>
   <productname>PostgreSQL</productname> supports a powerful
   <firstterm>rule system</firstterm> for the specification
   of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>.
   Originally the <productname>PostgreSQL</productname>
   rule system consisted of two implementations:

   <itemizedlist>
    <listitem>
     <para>
      The first one worked using <firstterm>row level</firstterm> processing and was
      implemented deep in the <firstterm>executor</firstterm>.  The rule system was
      called whenever an individual row had been accessed.  This
      implementation was removed in 1995 when the last official release
      of the <productname>Berkeley Postgres</productname> project was
      transformed into <productname>Postgres95</productname>.
     </para>
    </listitem>

    <listitem>
     <para>
      The second implementation of the rule system is a technique
      called <firstterm>query rewriting</firstterm>.
      The <firstterm>rewrite system</firstterm> is a module
      that exists between the <firstterm>parser stage</firstterm> and the
      <firstterm>planner/optimizer</firstterm>.  This technique is still implemented.
     </para>
    </listitem>
   </itemizedlist>
  </para>

  <para>
   The query rewriter is discussed in some detail in
   <xref linkend="rules"/>, so there is no need to cover it here.
   We will only point out that both the input and the output of the
   rewriter are query trees, that is, there is no change in the
   representation or level of semantic detail in the trees.  Rewriting
   can be thought of as a form of macro expansion.
  </para>
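  <para>
   The rules applied by the rewriter are stored in the system catalog
   <structname>pg_rewrite</structname>.  A view, for example, is
   represented there as an unconditional <literal>ON SELECT</literal>
   rule named <literal>_RETURN</literal>.  A small illustration (the
   view is invented):

<programlisting>
CREATE VIEW one AS SELECT 1 AS x;

SELECT rulename, ev_type
FROM pg_rewrite
WHERE ev_class = 'one'::regclass;
--  rulename | ev_type
-- ----------+---------
--  _RETURN  | 1
</programlisting>
  </para>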
 </sect1>

 <sect1 id="planner-optimizer">
  <title>Planner/Optimizer</title>

  <para>
   The task of the <firstterm>planner/optimizer</firstterm> is to
   create an optimal execution plan.  A given SQL query (and hence, a
   query tree) can be actually executed in a wide variety of
   different ways, each of which will produce the same set of
   results.  If it is computationally feasible, the query optimizer
   will examine each of these possible execution plans, ultimately
   selecting the execution plan that is expected to run the fastest.
  </para>

  <note>
   <para>
    In some situations, examining each possible way in which a query
    can be executed would take an excessive amount of time and memory.
    In particular, this occurs when executing queries
    involving large numbers of join operations.  In order to determine
    a reasonable (not necessarily optimal) query plan in a reasonable amount
    of time, <productname>PostgreSQL</productname> uses a <firstterm>Genetic
    Query Optimizer</firstterm> (see <xref linkend="geqo"/>) when the number of joins
    exceeds a threshold (see <xref linkend="guc-geqo-threshold"/>).
   </para>
  </note>

  <para>
   The planner's search procedure actually works with data structures
   called <firstterm>paths</firstterm>, which are simply cut-down representations of
   plans containing only as much information as the planner needs to make
   its decisions.  After the cheapest path is determined, a full-fledged
   <firstterm>plan tree</firstterm> is built to pass to the executor.  This represents
   the desired execution plan in sufficient detail for the executor to run it.
   In the rest of this section we'll ignore the distinction between paths
   and plans.
  </para>

  <sect2 id="planner-optimizer-generating-possible-plans">
   <title>Generating Possible Plans</title>

   <para>
    The planner/optimizer starts by generating plans for scanning each
    individual relation (table) used in the query.  The possible plans
    are determined by the available indexes on each relation.
    There is always the possibility of performing a
    sequential scan on a relation, so a sequential scan plan is always
    created.  Assume an index is defined on a
    relation (for example a B-tree index) and a query contains the
    restriction
    <literal>relation.attribute OPR constant</literal>.  If
    <literal>relation.attribute</literal> happens to match the key of the B-tree
    index and <literal>OPR</literal> is one of the operators listed in
    the index's <firstterm>operator class</firstterm>, another plan is created using
    the B-tree index to scan the relation.  If there are further indexes
    present and the restrictions in the query happen to match a key of an
    index, further plans will be considered.  Index scan plans are also
    generated for indexes that have a sort ordering that can match the
    query's <literal>ORDER BY</literal> clause (if any), or a sort ordering that
    might be useful for merge joining (see below).
   </para>
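   <para>
    A simple way to see this choice being made is to compare plans with
    and without a usable index.  For example (the table and index are
    invented, and the planner's choice also depends on table size and
    statistics):

<programlisting>
CREATE TABLE t (id int, payload text);
-- Without an index, only a sequential scan is possible:
EXPLAIN SELECT * FROM t WHERE id = 42;
--  Seq Scan on t  (cost=...)
--    Filter: (id = 42)

CREATE INDEX t_id_idx ON t (id);
ANALYZE t;
-- Now an index scan path exists as well, and may be chosen:
EXPLAIN SELECT * FROM t WHERE id = 42;
--  Index Scan using t_id_idx on t  (cost=...)
--    Index Cond: (id = 42)
</programlisting>
   </para>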
   <para>
    If the query requires joining two or more relations,
    plans for joining relations are considered
    after all feasible plans have been found for scanning single relations.
    The three available join strategies are:

    <itemizedlist>
     <listitem>
      <para>
       <firstterm>nested loop join</firstterm>: The right relation is scanned
       once for every row found in the left relation.  This strategy
       is easy to implement but can be very time consuming.  (However,
       if the right relation can be scanned with an index scan, this can
       be a good strategy.  It is possible to use values from the current
       row of the left relation as keys for the index scan of the right.)
      </para>
     </listitem>

     <listitem>
      <para>
       <firstterm>merge join</firstterm>: Each relation is sorted on the join
       attributes before the join starts.  Then the two relations are
       scanned in parallel, and matching rows are combined to form
       join rows.  This kind of join is
       attractive because each relation has to be scanned only once.
       The required sorting might be achieved either by an explicit sort
       step, or by scanning the relation in the proper order using an
       index on the join key.
      </para>
     </listitem>

     <listitem>
      <para>
       <firstterm>hash join</firstterm>: The right relation is first scanned
       and loaded into a hash table, using its join attributes as hash keys.
       Next the left relation is scanned and the
       appropriate values of every row found are used as hash keys to
       locate the matching rows in the table.
      </para>
     </listitem>
    </itemizedlist>
   </para>
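   <para>
    Which of these strategies the planner actually picks can be
    inspected with <command>EXPLAIN</command> and, for experimentation,
    steered with the <literal>enable_*</literal> configuration
    parameters.  For example (the tables are invented, and the chosen
    strategy depends on sizes and statistics):

<programlisting>
EXPLAIN SELECT *
FROM orders o JOIN customers c ON o.customer_id = c.id;
--  Hash Join  (cost=...)
--    Hash Cond: (o.customer_id = c.id)
--    ...

-- Discourage hash joins to see an alternative strategy chosen:
SET enable_hashjoin = off;
EXPLAIN SELECT *
FROM orders o JOIN customers c ON o.customer_id = c.id;
--  Merge Join  (cost=...)   -- or a nested loop, depending on costs
</programlisting>
   </para>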
   <para>
    When the query involves more than two relations, the final result
    must be built up by a tree of join steps, each with two inputs.
    The planner examines different possible join sequences to find the
    cheapest one.
   </para>

   <para>
    If the query uses fewer than <xref linkend="guc-geqo-threshold"/>
    relations, a near-exhaustive search is conducted to find the best
    join sequence.  The planner preferentially considers joins between any
    two relations for which there exists a corresponding join clause in the
    <literal>WHERE</literal> qualification (i.e., for
    which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
    exists).  Join pairs with no join clause are considered only when there
    is no other choice, that is, a particular relation has no available
    join clauses to any other relation.  All possible plans are generated for
    every join pair considered by the planner, and the one that is
    (estimated to be) the cheapest is chosen.
   </para>

   <para>
    When <varname>geqo_threshold</varname> is exceeded, the join
    sequences considered are determined by heuristics, as described
    in <xref linkend="geqo"/>.  Otherwise the process is the same.
   </para>

   <para>
    The finished plan tree consists of sequential or index scans of
    the base relations, plus nested-loop, merge, or hash join nodes as
    needed, plus any auxiliary steps needed, such as sort nodes or
    aggregate-function calculation nodes.  Most of these plan node
    types have the additional ability to do <firstterm>selection</firstterm>
    (discarding rows that do not meet a specified Boolean condition)
    and <firstterm>projection</firstterm> (computation of a derived column set
    based on given column values, that is, evaluation of scalar
    expressions where needed).  One of the responsibilities of the
    planner is to attach selection conditions from the
    <literal>WHERE</literal> clause and computation of required
    output expressions to the most appropriate nodes of the plan
    tree.
   </para>
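   <para>
    Selection and projection attached to a plan node are visible in
    <command>EXPLAIN (VERBOSE)</command> output as
    <literal>Filter</literal> and <literal>Output</literal> lines.
    A sketch (the table is invented and the output abridged):

<programlisting>
EXPLAIN (VERBOSE) SELECT id, balance * 1.05 FROM accounts
WHERE balance > 0;
--  Seq Scan on public.accounts  (cost=...)
--    Output: id, (balance * 1.05)      -- projection
--    Filter: (accounts.balance > ...)  -- selection
</programlisting>
   </para>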
  </sect2>
 </sect1>

 <sect1 id="executor">
  <title>Executor</title>

  <para>
   The <firstterm>executor</firstterm> takes the plan created by the
   planner/optimizer and recursively processes it to extract the required set
   of rows.  This is essentially a demand-pull pipeline mechanism.
   Each time a plan node is called, it must deliver one more row, or
   report that it is done delivering rows.
  </para>

  <para>
   To provide a concrete example, assume that the top
   node is a <literal>MergeJoin</literal> node.
   Before any merge can be done two rows have to be fetched (one from
   each subplan).  So the executor recursively calls itself to
   process the subplans (it starts with the subplan attached to
   <literal>lefttree</literal>).  The new top node (the top node of the left
   subplan) is, let's say, a
   <literal>Sort</literal> node and again recursion is needed to obtain
   an input row.  The child node of the <literal>Sort</literal> might
   be a <literal>SeqScan</literal> node, representing actual reading of a table.
   Execution of this node causes the executor to fetch a row from the
   table and return it up to the calling node.  The <literal>Sort</literal>
   node will repeatedly call its child to obtain all the rows to be sorted.
   When the input is exhausted (as indicated by the child node returning
   a NULL instead of a row), the <literal>Sort</literal> code performs
   the sort, and finally is able to return its first output row, namely
   the first one in sorted order.  It keeps the remaining rows stored so
   that it can deliver them in sorted order in response to later demands.
  </para>

  <para>
   The <literal>MergeJoin</literal> node similarly demands the first row
   from its right subplan.  Then it compares the two rows to see if they
   can be joined; if so, it returns a join row to its caller.  On the next
   call, or immediately if it cannot join the current pair of inputs,
   it advances to the next row of one table
   or the other (depending on how the comparison came out), and again
   checks for a match.  Eventually, one subplan or the other is exhausted,
   and the <literal>MergeJoin</literal> node returns NULL to indicate that
   no more join rows can be formed.
  </para>
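  <para>
   The plan tree described in this walkthrough would look roughly like
   the following in <command>EXPLAIN</command> output (the tables are
   invented; a merge join with explicit sorts is only chosen when the
   planner estimates it to be cheapest, or when forced for
   demonstration):

<programlisting>
--  Merge Join
--    Merge Cond: (a.x = b.x)
--    ->  Sort
--          Sort Key: a.x
--          ->  Seq Scan on a
--    ->  Sort
--          Sort Key: b.x
--          ->  Seq Scan on b
</programlisting>

   Each indentation level is a child node that delivers rows on demand
   to the node above it.
  </para>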
  <para>
   Complex queries can involve many levels of plan nodes, but the general
   approach is the same: each node computes and returns its next output
   row each time it is called.  Each node is also responsible for applying
   any selection or projection expressions that were assigned to it by
   the planner.
  </para>

  <para>
   The executor mechanism is used to evaluate all five basic SQL query
   types: <command>SELECT</command>, <command>INSERT</command>,
   <command>UPDATE</command>, <command>DELETE</command>, and
   <command>MERGE</command>.
   For <command>SELECT</command>, the top-level executor code
   only needs to send each row returned by the query plan tree
   off to the client.  <command>INSERT ... SELECT</command>,
   <command>UPDATE</command>, <command>DELETE</command>, and
   <command>MERGE</command>
   are effectively <command>SELECT</command>s under a special
   top-level plan node called <literal>ModifyTable</literal>.
  </para>

  <para>
   <command>INSERT ... SELECT</command> feeds the rows up
   to <literal>ModifyTable</literal> for insertion.  For
   <command>UPDATE</command>, the planner arranges that each
   computed row includes all the updated column values, plus the
   <firstterm>TID</firstterm> (tuple ID, or row ID) of the original
   target row; this data is fed up to the <literal>ModifyTable</literal>
   node, which uses the information to create a new updated row and
   mark the old row deleted.  For <command>DELETE</command>, the only
   column that is actually returned by the plan is the TID, and the
   <literal>ModifyTable</literal> node simply uses the TID to visit each
   target row and mark it deleted.  For <command>MERGE</command>, the
   planner joins the source and target relations, and includes all
   column values required by any of the <literal>WHEN</literal> clauses,
   plus the TID of the target row; this data is fed up to the
   <literal>ModifyTable</literal> node, which uses the information to
   work out which <literal>WHEN</literal> clause to execute, and then
   inserts, updates or deletes the target row, as required.
  </para>
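  <para>
   The <literal>ModifyTable</literal> node is visible in
   <command>EXPLAIN</command> output for data-modifying statements.
   A sketch (the table is invented and the output abridged):

<programlisting>
EXPLAIN UPDATE accounts SET balance = balance + 1 WHERE id = 7;
--  Update on accounts            -- the ModifyTable node
--    ->  Index Scan using accounts_pkey on accounts
--          Index Cond: (id = 7)
</programlisting>

   The scan below the <literal>Update</literal> node plays the role of
   the <command>SELECT</command> that computes the new column values
   and the TID of each target row.
  </para>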
  <para>
   A simple <command>INSERT ... VALUES</command> command creates a
   trivial plan tree consisting of a single <literal>Result</literal>
   node, which computes just one result row, feeding that up
   to <literal>ModifyTable</literal> to perform the insertion.
  </para>
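  <para>
   This trivial plan can also be seen with <command>EXPLAIN</command>
   (the table is invented and exact output can vary across versions):

<programlisting>
EXPLAIN INSERT INTO accounts VALUES (1, 0);
--  Insert on accounts            -- the ModifyTable node
--    ->  Result
</programlisting>
  </para>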
 </sect1>
</chapter>