1 <!-- doc/src/sgml/arch-dev.sgml -->
3 <chapter id=
"overview">
4 <title>Overview of PostgreSQL Internals
</title>
9 This chapter originated as part of
10 <xref linkend=
"sim98"/> Stefan Simkovics'
11 Master's Thesis prepared at Vienna University of Technology under the direction
12 of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
17 This chapter gives an overview of the internal structure of the
18 backend of
<productname>PostgreSQL
</productname>. After having
19 read the following sections you should have an idea of how a query
20 is processed. This chapter is intended to help the reader
21 understand the general sequence of operations that occur within the
22 backend from the point at which a query is received, to the point
23 at which the results are returned to the client.
26 <sect1 id=
"query-path">
27 <title>The Path of a Query
</title>
30 Here we give a short overview of the stages a query has to pass
37 A connection from an application program to the
<productname>PostgreSQL
</productname>
38 server has to be established. The application program transmits a
39 query to the server and waits to receive the results sent back by the
46 The
<firstterm>parser stage
</firstterm> checks the query
47 transmitted by the application
48 program for correct syntax and creates
49 a
<firstterm>query tree
</firstterm>.
55 The
<firstterm>rewrite system
</firstterm> takes
56 the query tree created by the parser stage and looks for
57 any
<firstterm>rules
</firstterm> (stored in the
58 <firstterm>system catalogs
</firstterm>) to apply to
59 the query tree. It performs the
60 transformations given in the
<firstterm>rule bodies
</firstterm>.
64 One application of the rewrite system is in the realization of
65 <firstterm>views
</firstterm>.
66 Whenever a query against a view
67 (i.e., a
<firstterm>virtual table
</firstterm>) is made,
68 the rewrite system rewrites the user's query to
69 a query that accesses the
<firstterm>base tables
</firstterm> given in
70 the
<firstterm>view definition
</firstterm> instead.
76 The
<firstterm>planner/optimizer
</firstterm> takes
77 the (rewritten) query tree and creates a
78 <firstterm>query plan
</firstterm> that will be the input to the
79 <firstterm>executor
</firstterm>.
83 It does so by first creating all possible
<firstterm>paths
</firstterm>
84 leading to the same result. For example if there is an index on a
85 relation to be scanned, there are two paths for the
86 scan. One possibility is a simple sequential scan and the other
87 possibility is to use the index. Next the cost for the execution of
88 each path is estimated and the cheapest path is chosen. The cheapest
89 path is expanded into a complete plan that the executor can use.
95 The executor recursively steps through
96 the
<firstterm>plan tree
</firstterm> and
97 retrieves rows in the way represented by the plan.
98 The executor makes use of the
99 <firstterm>storage system
</firstterm> while scanning
100 relations, performs
<firstterm>sorts
</firstterm> and
<firstterm>joins
</firstterm>,
101 evaluates
<firstterm>qualifications
</firstterm> and finally hands back the rows derived.
107 In the following sections we will cover each of the above listed items
108 in more detail to give a better understanding of
<productname>PostgreSQL
</productname>'s internal
109 control and data structures.
113 <sect1 id=
"connect-estab">
114 <title>How Connections Are Established
</title>
117 <productname>PostgreSQL
</productname> implements a
118 <quote>process per user
</quote> client/server model.
120 <glossterm linkend=
"glossary-client">client process
</glossterm>
121 connects to exactly one
122 <glossterm linkend=
"glossary-backend">backend process
</glossterm>.
123 As we do not know ahead of time how many connections will be made,
124 we have to use a
<quote>supervisor process
</quote> that spawns a new
125 backend process every time a connection is requested. This supervisor
127 <glossterm linkend=
"glossary-postmaster">postmaster
</glossterm>
128 and listens at a specified TCP/IP port for incoming connections.
129 Whenever it detects a request for a connection, it spawns a new
130 backend process. Those backend processes communicate with each
131 other and with other processes of the
132 <glossterm linkend=
"glossary-instance">instance
</glossterm>
133 using
<firstterm>semaphores
</firstterm> and
134 <glossterm linkend=
"glossary-shared-memory">shared memory
</glossterm>
135 to ensure data integrity throughout concurrent data access.
139 The client process can be any program that understands the
140 <productname>PostgreSQL
</productname> protocol described in
141 <xref linkend=
"protocol"/>. Many clients are based on the
142 C-language library
<application>libpq
</application>, but several independent
143 implementations of the protocol exist, such as the Java
144 <application>JDBC
</application> driver.
148 Once a connection is established, the client process can send a query
149 to the backend process it's connected to. The query is transmitted using
150 plain text, i.e., there is no parsing done in the client. The backend
151 process parses the query, creates an
<firstterm>execution plan
</firstterm>,
152 executes the plan, and returns the retrieved rows to the client
153 by transmitting them over the established connection.
157 <sect1 id=
"parser-stage">
158 <title>The Parser Stage
</title>
161 The
<firstterm>parser stage
</firstterm> consists of two parts:
166 The
<firstterm>parser
</firstterm> defined in
167 <filename>gram.y
</filename> and
<filename>scan.l
</filename> is
168 built using the Unix tools
<application>bison
</application>
169 and
<application>flex
</application>.
174 The
<firstterm>transformation process
</firstterm> does
175 modifications and augmentations to the data structures returned by the parser.
181 <sect2 id=
"parser-stage-parser">
182 <title>Parser
</title>
185 The parser has to check the query string (which arrives as plain
186 text) for valid syntax. If the syntax is correct a
187 <firstterm>parse tree
</firstterm> is built up and handed back;
188 otherwise an error is returned. The parser and lexer are
189 implemented using the well-known Unix tools
<application>bison
</application>
190 and
<application>flex
</application>.
194 The
<firstterm>lexer
</firstterm> is defined in the file
195 <filename>scan.l
</filename> and is responsible
196 for recognizing
<firstterm>identifiers
</firstterm>,
197 the
<firstterm>SQL key words
</firstterm> etc. For
198 every key word or identifier that is found, a
<firstterm>token
</firstterm>
199 is generated and handed to the parser.
203 The parser is defined in the file
<filename>gram.y
</filename> and
204 consists of a set of
<firstterm>grammar rules
</firstterm> and
205 <firstterm>actions
</firstterm> that are executed whenever a rule
206 is fired. The code of the actions (which is actually C code) is
207 used to build up the parse tree.
211 The file
<filename>scan.l
</filename> is transformed to the C
212 source file
<filename>scan.c
</filename> using the program
213 <application>flex
</application> and
<filename>gram.y
</filename> is
214 transformed to
<filename>gram.c
</filename> using
215 <application>bison
</application>. After these transformations
216 have taken place a normal C compiler can be used to create the
217 parser. Never make any changes to the generated C files as they
218 will be overwritten the next time
<application>flex
</application>
219 or
<application>bison
</application> is called.
223 The mentioned transformations and compilations are normally done
224 automatically using the
<firstterm>makefiles
</firstterm>
225 shipped with the
<productname>PostgreSQL
</productname>
232 A detailed description of
<application>bison
</application> or
233 the grammar rules given in
<filename>gram.y
</filename> would be
234 beyond the scope of this manual. There are many books and
235 documents dealing with
<application>flex
</application> and
236 <application>bison
</application>. You should be familiar with
237 <application>bison
</application> before you start to study the
238 grammar given in
<filename>gram.y
</filename> otherwise you won't
239 understand what happens there.
244 <sect2 id=
"parser-stage-transformation-process">
245 <title>Transformation Process
</title>
248 The parser stage creates a parse tree using only fixed rules about
249 the syntactic structure of SQL. It does not make any lookups in the
250 system catalogs, so there is no possibility to understand the detailed
251 semantics of the requested operations. After the parser completes,
252 the
<firstterm>transformation process
</firstterm> takes the tree handed
253 back by the parser as input and does the semantic interpretation needed
254 to understand which tables, functions, and operators are referenced by
255 the query. The data structure that is built to represent this
256 information is called the
<firstterm>query tree
</firstterm>.
260 The reason for separating raw parsing from semantic analysis is that
261 system catalog lookups can only be done within a transaction, and we
262 do not wish to start a transaction immediately upon receiving a query
263 string. The raw parsing stage is sufficient to identify the transaction
264 control commands (
<command>BEGIN
</command>,
<command>ROLLBACK
</command>, etc.), and
265 these can then be correctly executed without any further analysis.
266 Once we know that we are dealing with an actual query (such as
267 <command>SELECT
</command> or
<command>UPDATE
</command>), it is okay to
268 start a transaction if we're not already in one. Only then can the
269 transformation process be invoked.
273 The query tree created by the transformation process is structurally
274 similar to the raw parse tree in most places, but it has many differences
275 in detail. For example, a
<structname>FuncCall
</structname> node in the
276 parse tree represents something that looks syntactically like a function
277 call. This might be transformed to either a
<structname>FuncExpr
</structname>
278 or
<structname>Aggref
</structname> node depending on whether the referenced
279 name turns out to be an ordinary function or an aggregate function.
280 Also, information about the actual data types of columns and expression
281 results is added to the query tree.
286 <sect1 id=
"rule-system">
287 <title>The
<productname>PostgreSQL
</productname> Rule System
</title>
290 <productname>PostgreSQL
</productname> supports a powerful
291 <firstterm>rule system
</firstterm> for the specification
292 of
<firstterm>views
</firstterm> and ambiguous
<firstterm>view updates
</firstterm>.
293 Originally the
<productname>PostgreSQL
</productname>
294 rule system consisted of two implementations:
299 The first one worked using
<firstterm>row level
</firstterm> processing and was
300 implemented deep in the
<firstterm>executor
</firstterm>. The rule system was
301 called whenever an individual row had been accessed. This
302 implementation was removed in
1995 when the last official release
303 of the
<productname>Berkeley Postgres
</productname> project was
304 transformed into
<productname>Postgres95
</productname>.
310 The second implementation of the rule system is a technique
311 called
<firstterm>query rewriting
</firstterm>.
312 The
<firstterm>rewrite system
</firstterm> is a module
313 that exists between the
<firstterm>parser stage
</firstterm> and the
314 <firstterm>planner/optimizer
</firstterm>. This technique is still implemented.
321 The query rewriter is discussed in some detail in
322 <xref linkend=
"rules"/>, so there is no need to cover it here.
323 We will only point out that both the input and the output of the
324 rewriter are query trees, that is, there is no change in the
325 representation or level of semantic detail in the trees. Rewriting
326 can be thought of as a form of macro expansion.
331 <sect1 id=
"planner-optimizer">
332 <title>Planner/Optimizer
</title>
335 The task of the
<firstterm>planner/optimizer
</firstterm> is to
336 create an optimal execution plan. A given SQL query (and hence, a
337 query tree) can be actually executed in a wide variety of
338 different ways, each of which will produce the same set of
339 results. If it is computationally feasible, the query optimizer
340 will examine each of these possible execution plans, ultimately
341 selecting the execution plan that is expected to run the fastest.
346 In some situations, examining each possible way in which a query
347 can be executed would take an excessive amount of time and memory.
348 In particular, this occurs when executing queries
349 involving large numbers of join operations. In order to determine
350 a reasonable (not necessarily optimal) query plan in a reasonable amount
351 of time,
<productname>PostgreSQL
</productname> uses a
<firstterm>Genetic
352 Query Optimizer
</firstterm> (see
<xref linkend=
"geqo"/>) when the number of joins
353 exceeds a threshold (see
<xref linkend=
"guc-geqo-threshold"/>).
358 The planner's search procedure actually works with data structures
359 called
<firstterm>paths
</firstterm>, which are simply cut-down representations of
360 plans containing only as much information as the planner needs to make
361 its decisions. After the cheapest path is determined, a full-fledged
362 <firstterm>plan tree
</firstterm> is built to pass to the executor. This represents
363 the desired execution plan in sufficient detail for the executor to run it.
364 In the rest of this section we'll ignore the distinction between paths
368 <sect2 id=
"planner-optimizer-generating-possible-plans">
369 <title>Generating Possible Plans
</title>
372 The planner/optimizer starts by generating plans for scanning each
373 individual relation (table) used in the query. The possible plans
374 are determined by the available indexes on each relation.
375 There is always the possibility of performing a
376 sequential scan on a relation, so a sequential scan plan is always
377 created. Assume an index is defined on a
378 relation (for example a B-tree index) and a query contains the
380 <literal>relation.attribute OPR constant
</literal>. If
381 <literal>relation.attribute
</literal> happens to match the key of the B-tree
382 index and
<literal>OPR
</literal> is one of the operators listed in
383 the index's
<firstterm>operator class
</firstterm>, another plan is created using
384 the B-tree index to scan the relation. If there are further indexes
385 present and the restrictions in the query happen to match a key of an
386 index, further plans will be considered. Index scan plans are also
387 generated for indexes that have a sort ordering that can match the
388 query's
<literal>ORDER BY
</literal> clause (if any), or a sort ordering that
389 might be useful for merge joining (see below).
393 If the query requires joining two or more relations,
394 plans for joining relations are considered
395 after all feasible plans have been found for scanning single relations.
396 The three available join strategies are:
401 <firstterm>nested loop join
</firstterm>: The right relation is scanned
402 once for every row found in the left relation. This strategy
403 is easy to implement but can be very time consuming. (However,
404 if the right relation can be scanned with an index scan, this can
405 be a good strategy. It is possible to use values from the current
406 row of the left relation as keys for the index scan of the right.)
412 <firstterm>merge join
</firstterm>: Each relation is sorted on the join
413 attributes before the join starts. Then the two relations are
414 scanned in parallel, and matching rows are combined to form
415 join rows. This kind of join is
416 attractive because each relation has to be scanned only once.
417 The required sorting might be achieved either by an explicit sort
418 step, or by scanning the relation in the proper order using an
419 index on the join key.
425 <firstterm>hash join
</firstterm>: the right relation is first scanned
426 and loaded into a hash table, using its join attributes as hash keys.
427 Next the left relation is scanned and the
428 appropriate values of every row found are used as hash keys to
429 locate the matching rows in the table.
436 When the query involves more than two relations, the final result
437 must be built up by a tree of join steps, each with two inputs.
438 The planner examines different possible join sequences to find the
443 If the query uses fewer than
<xref linkend=
"guc-geqo-threshold"/>
444 relations, a near-exhaustive search is conducted to find the best
445 join sequence. The planner preferentially considers joins between any
446 two relations for which there exists a corresponding join clause in the
447 <literal>WHERE
</literal> qualification (i.e., for
448 which a restriction like
<literal>where rel1.attr1=rel2.attr2
</literal>
449 exists). Join pairs with no join clause are considered only when there
450 is no other choice, that is, a particular relation has no available
451 join clauses to any other relation. All possible plans are generated for
452 every join pair considered by the planner, and the one that is
453 (estimated to be) the cheapest is chosen.
457 When
<varname>geqo_threshold
</varname> is exceeded, the join
458 sequences considered are determined by heuristics, as described
459 in
<xref linkend=
"geqo"/>. Otherwise the process is the same.
463 The finished plan tree consists of sequential or index scans of
464 the base relations, plus nested-loop, merge, or hash join nodes as
465 needed, plus any auxiliary steps needed, such as sort nodes or
466 aggregate-function calculation nodes. Most of these plan node
467 types have the additional ability to do
<firstterm>selection
</firstterm>
468 (discarding rows that do not meet a specified Boolean condition)
469 and
<firstterm>projection
</firstterm> (computation of a derived column set
470 based on given column values, that is, evaluation of scalar
471 expressions where needed). One of the responsibilities of the
472 planner is to attach selection conditions from the
473 <literal>WHERE
</literal> clause and computation of required
474 output expressions to the most appropriate nodes of the plan
480 <sect1 id=
"executor">
481 <title>Executor
</title>
484 The
<firstterm>executor
</firstterm> takes the plan created by the
485 planner/optimizer and recursively processes it to extract the required set
486 of rows. This is essentially a demand-pull pipeline mechanism.
487 Each time a plan node is called, it must deliver one more row, or
488 report that it is done delivering rows.
492 To provide a concrete example, assume that the top
493 node is a
<literal>MergeJoin
</literal> node.
494 Before any merge can be done two rows have to be fetched (one from
495 each subplan). So the executor recursively calls itself to
496 process the subplans (it starts with the subplan attached to
497 <literal>lefttree
</literal>). The new top node (the top node of the left
498 subplan) is, let's say, a
499 <literal>Sort
</literal> node and again recursion is needed to obtain
500 an input row. The child node of the
<literal>Sort
</literal> might
501 be a
<literal>SeqScan
</literal> node, representing actual reading of a table.
502 Execution of this node causes the executor to fetch a row from the
503 table and return it up to the calling node. The
<literal>Sort
</literal>
504 node will repeatedly call its child to obtain all the rows to be sorted.
505 When the input is exhausted (as indicated by the child node returning
506 a NULL instead of a row), the
<literal>Sort
</literal> code performs
507 the sort, and finally is able to return its first output row, namely
508 the first one in sorted order. It keeps the remaining rows stored so
509 that it can deliver them in sorted order in response to later demands.
513 The
<literal>MergeJoin
</literal> node similarly demands the first row
514 from its right subplan. Then it compares the two rows to see if they
515 can be joined; if so, it returns a join row to its caller. On the next
516 call, or immediately if it cannot join the current pair of inputs,
517 it advances to the next row of one table
518 or the other (depending on how the comparison came out), and again
519 checks for a match. Eventually, one subplan or the other is exhausted,
520 and the
<literal>MergeJoin
</literal> node returns NULL to indicate that
521 no more join rows can be formed.
525 Complex queries can involve many levels of plan nodes, but the general
526 approach is the same: each node computes and returns its next output
527 row each time it is called. Each node is also responsible for applying
528 any selection or projection expressions that were assigned to it by
533 The executor mechanism is used to evaluate all five basic SQL query
534 types:
<command>SELECT
</command>,
<command>INSERT
</command>,
535 <command>UPDATE
</command>,
<command>DELETE
</command>, and
536 <command>MERGE
</command>.
537 For
<command>SELECT
</command>, the top-level executor code
538 only needs to send each row returned by the query plan tree
539 off to the client.
<command>INSERT ... SELECT
</command>,
540 <command>UPDATE
</command>,
<command>DELETE
</command>, and
541 <command>MERGE
</command>
542 are effectively
<command>SELECT
</command>s under a special
543 top-level plan node called
<literal>ModifyTable
</literal>.
547 <command>INSERT ... SELECT
</command> feeds the rows up
548 to
<literal>ModifyTable
</literal> for insertion. For
549 <command>UPDATE
</command>, the planner arranges that each
550 computed row includes all the updated column values, plus the
551 <firstterm>TID
</firstterm> (tuple ID, or row ID) of the original
552 target row; this data is fed up to the
<literal>ModifyTable
</literal>
553 node, which uses the information to create a new updated row and
554 mark the old row deleted. For
<command>DELETE
</command>, the only
555 column that is actually returned by the plan is the TID, and the
556 <literal>ModifyTable
</literal> node simply uses the TID to visit each
557 target row and mark it deleted. For
<command>MERGE
</command>, the
558 planner joins the source and target relations, and includes all
559 column values required by any of the
<literal>WHEN
</literal> clauses,
560 plus the TID of the target row; this data is fed up to the
561 <literal>ModifyTable
</literal> node, which uses the information to
562 work out which
<literal>WHEN
</literal> clause to execute, and then
563 inserts, updates or deletes the target row, as required.
567 A simple
<command>INSERT ... VALUES
</command> command creates a
568 trivial plan tree consisting of a single
<literal>Result
</literal>
569 node, which computes just one result row, feeding that up
570 to
<literal>ModifyTable
</literal> to perform the insertion.