<chapter id="overview">
 <title>Overview of PostgreSQL Internals</title>

 <note>
  <para>
   This chapter originated as part of
   <xref linkend="SIM98">, Stefan Simkovics'
   Master's Thesis prepared at Vienna University of Technology under the
   direction of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
  </para>
 </note>
 <para>
  This chapter gives an overview of the internal structure of the
  backend of <productname>PostgreSQL</productname>. After having
  read the following sections you should have an idea of how a query
  is processed. This chapter does not aim to provide a detailed
  description of the internal operation of
  <productname>PostgreSQL</productname>, as such a document would be
  very extensive. Rather, this chapter is intended to help the reader
  understand the general sequence of operations that occur within the
  backend from the point at which a query is received, to the point
  at which the results are returned to the client.
 </para>
 <sect1 id="query-path">
  <title>The Path of a Query</title>

  <para>
   Here we give a short overview of the stages a query has to pass in
   order to obtain a result.
  </para>

  <procedure>
   <step>
    <para>
     A connection from an application program to the
     <productname>PostgreSQL</productname> server has to be established.
     The application program transmits a query to the server and waits
     to receive the results sent back by the server.
    </para>
   </step>
   <step>
    <para>
     The <firstterm>parser stage</firstterm> checks the query transmitted
     by the application program for correct syntax and creates a
     <firstterm>query tree</firstterm>.
    </para>
   </step>
   <step>
    <para>
     The <firstterm>rewrite system</firstterm> takes the query tree created
     by the parser stage and looks for any <firstterm>rules</firstterm>
     (stored in the <firstterm>system catalogs</firstterm>) to apply to
     the query tree. It performs the transformations given in the
     <firstterm>rule bodies</firstterm>.
    </para>
    <para>
     One application of the rewrite system is in the realization of
     <firstterm>views</firstterm>.
     Whenever a query against a view (i.e., a
     <firstterm>virtual table</firstterm>) is made, the rewrite system
     rewrites the user's query to a query that accesses the
     <firstterm>base tables</firstterm> given in the
     <firstterm>view definition</firstterm> instead.
    </para>
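    <para>
     For example, a query against a simple view (the table and view
     names here are hypothetical) behaves as if the view definition had
     been inlined as a subquery:
    </para>

<programlisting>
CREATE TABLE employees (id int, name text, salary numeric);
CREATE VIEW well_paid AS
    SELECT name, salary FROM employees WHERE salary > 50000;

-- The user's query against the view ...
SELECT name FROM well_paid;

-- ... is rewritten, roughly, to this query on the base table:
SELECT name
  FROM (SELECT name, salary FROM employees
        WHERE salary > 50000) AS well_paid;
</programlisting>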
   </step>

   <step>
    <para>
     The <firstterm>planner/optimizer</firstterm> takes the (rewritten)
     query tree and creates a <firstterm>query plan</firstterm> that will
     be the input to the <firstterm>executor</firstterm>.
    </para>
    <para>
     It does so by first creating all possible <firstterm>paths</firstterm>
     leading to the same result. For example, if there is an index on a
     relation to be scanned, there are two paths for the scan. One
     possibility is a simple sequential scan and the other possibility
     is to use the index. Next the cost for the execution of each path
     is estimated and the cheapest path is chosen. The cheapest path is
     expanded into a complete plan that the executor can use.
    </para>
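    <para>
     The chosen path can be inspected with <command>EXPLAIN</command>.
     (The table and index here are hypothetical; for a tiny table the
     planner may well decide that a sequential scan is cheaper even
     though an index exists.)
    </para>

<programlisting>
CREATE TABLE accounts (id int, balance numeric);
CREATE INDEX accounts_id_idx ON accounts (id);

-- Both a sequential-scan path and an index-scan path exist here;
-- the planner picks whichever it estimates to be cheaper:
EXPLAIN SELECT * FROM accounts WHERE id = 42;
</programlisting>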
   </step>

   <step>
    <para>
     The executor recursively steps through the
     <firstterm>plan tree</firstterm> and
     retrieves rows in the way represented by the plan.
     The executor makes use of the
     <firstterm>storage system</firstterm> while scanning
     relations, performs <firstterm>sorts</firstterm> and
     <firstterm>joins</firstterm>, evaluates
     <firstterm>qualifications</firstterm>, and finally hands back the
     rows derived.
    </para>
   </step>
  </procedure>
  <para>
   In the following sections we will cover each of the above listed items
   in more detail to give a better understanding of
   <productname>PostgreSQL</productname>'s internal control and data
   structures.
  </para>
 </sect1>
 <sect1 id="connect-estab">
  <title>How Connections are Established</title>
  <para>
   <productname>PostgreSQL</productname> is implemented using a
   simple <quote>process per user</> client/server model. In this model
   there is one <firstterm>client process</firstterm> connected to
   exactly one <firstterm>server process</firstterm>. As we do not
   know ahead of time how many connections will be made, we have to
   use a <firstterm>master process</firstterm> that spawns a new
   server process every time a connection is requested. This master
   process is called <literal>postgres</literal> and listens at a
   specified TCP/IP port for incoming connections. Whenever a request
   for a connection is detected the <literal>postgres</literal>
   process spawns a new server process. The server tasks
   communicate with each other using <firstterm>semaphores</firstterm> and
   <firstterm>shared memory</firstterm> to ensure data integrity
   throughout concurrent data access.
  </para>
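  <para>
   A simple way to observe the process-per-connection model is to ask
   each session for the PID of its server process:
  </para>

<programlisting>
-- Run this in two concurrent psql sessions: each reports a different
-- PID, i.e., each connection is served by its own server process.
SELECT pg_backend_pid();
</programlisting>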
  <para>
   The client process can be any program that understands the
   <productname>PostgreSQL</productname> protocol described in
   <xref linkend="protocol">. Many clients are based on the
   C-language library <application>libpq</>, but several independent
   implementations of the protocol exist, such as the Java
   <application>JDBC</> driver.
  </para>
  <para>
   Once a connection is established the client process can send a query
   to the <firstterm>backend</firstterm> (server). The query is transmitted
   using plain text, i.e., there is no parsing done in the
   <firstterm>frontend</firstterm> (client). The server parses the query,
   creates an <firstterm>execution plan</firstterm>, executes the plan,
   and returns the retrieved rows to the client by transmitting them over
   the established connection.
  </para>
 </sect1>
 <sect1 id="parser-stage">
  <title>The Parser Stage</title>

  <para>
   The <firstterm>parser stage</firstterm> consists of two parts:

   <itemizedlist>
    <listitem>
     <para>
      The <firstterm>parser</firstterm> defined in
      <filename>gram.y</filename> and <filename>scan.l</filename> is
      built using the Unix tools <application>yacc</application>
      and <application>lex</application>.
     </para>
    </listitem>
    <listitem>
     <para>
      The <firstterm>transformation process</firstterm> does
      modifications and augmentations to the data structures returned
      by the parser.
     </para>
    </listitem>
   </itemizedlist>
  </para>
  <sect2>
   <title>Parser</title>
   <para>
    The parser has to check the query string (which arrives as plain
    ASCII text) for valid syntax. If the syntax is correct a
    <firstterm>parse tree</firstterm> is built up and handed back;
    otherwise an error is returned. The parser and lexer are
    implemented using the well-known Unix tools <application>yacc</>
    and <application>lex</>.
   </para>
   <para>
    The <firstterm>lexer</firstterm> is defined in the file
    <filename>scan.l</filename> and is responsible
    for recognizing <firstterm>identifiers</firstterm>,
    the <firstterm>SQL key words</firstterm>, etc. For
    every key word or identifier that is found, a <firstterm>token</firstterm>
    is generated and handed to the parser.
   </para>
   <para>
    The parser is defined in the file <filename>gram.y</filename> and
    consists of a set of <firstterm>grammar rules</firstterm> and
    <firstterm>actions</firstterm> that are executed whenever a rule
    is fired. The code of the actions (which is actually C code) is
    used to build up the parse tree.
   </para>
   <para>
    The file <filename>scan.l</filename> is transformed to the C
    source file <filename>scan.c</filename> using the program
    <application>lex</application>, and <filename>gram.y</filename> is
    transformed to <filename>gram.c</filename> using
    <application>yacc</application>. After these transformations
    have taken place a normal C compiler can be used to create the
    parser. Never make any changes to the generated C files as they
    will be overwritten the next time <application>lex</application>
    or <application>yacc</application> is called.
   </para>
   <para>
    The mentioned transformations and compilations are normally done
    automatically using the <firstterm>makefiles</firstterm>
    shipped with the <productname>PostgreSQL</productname>
    source distribution.
   </para>
   <para>
    A detailed description of <application>yacc</application> or
    the grammar rules given in <filename>gram.y</filename> would be
    beyond the scope of this paper. There are many books and
    documents dealing with <application>lex</application> and
    <application>yacc</application>. You should be familiar with
    <application>yacc</application> before you start to study the
    grammar given in <filename>gram.y</filename>; otherwise you won't
    understand what happens there.
   </para>
  </sect2>
  <sect2>
   <title>Transformation Process</title>
   <para>
    The parser stage creates a parse tree using only fixed rules about
    the syntactic structure of SQL. It does not make any lookups in the
    system catalogs, so there is no possibility to understand the detailed
    semantics of the requested operations. After the parser completes,
    the <firstterm>transformation process</firstterm> takes the tree handed
    back by the parser as input and does the semantic interpretation needed
    to understand which tables, functions, and operators are referenced by
    the query. The data structure that is built to represent this
    information is called the <firstterm>query tree</>.
   </para>
   <para>
    The reason for separating raw parsing from semantic analysis is that
    system catalog lookups can only be done within a transaction, and we
    do not wish to start a transaction immediately upon receiving a query
    string. The raw parsing stage is sufficient to identify the transaction
    control commands (<command>BEGIN</>, <command>ROLLBACK</>, etc.), and
    these can then be correctly executed without any further analysis.
    Once we know that we are dealing with an actual query (such as
    <command>SELECT</> or <command>UPDATE</>), it is okay to
    start a transaction if we're not already in one. Only then can the
    transformation process be invoked.
   </para>
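   <para>
    The division of labor is visible in the errors each stage reports
    (the table names here are hypothetical):
   </para>

<programlisting>
-- Rejected by the raw parser: FORM is a syntax error, so no catalog
-- lookup is needed and no transaction has to be started.
SELECT * FORM mytable;

-- Syntactically valid, so it passes raw parsing; the transformation
-- process then fails to find the table in the system catalogs.
SELECT * FROM no_such_table;
</programlisting>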
   <para>
    The query tree created by the transformation process is structurally
    similar to the raw parse tree in most places, but it has many differences
    in detail. For example, a <structname>FuncCall</> node in the
    parse tree represents something that looks syntactically like a function
    call. This might be transformed to either a <structname>FuncExpr</>
    or <structname>Aggref</> node depending on whether the referenced
    name turns out to be an ordinary function or an aggregate function.
    Also, information about the actual data types of columns and expression
    results is added to the query tree.
   </para>
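   <para>
    For example, only a catalog lookup can tell the following two
    syntactically identical calls apart (assuming a hypothetical table
    <literal>t</literal> with a <literal>text</literal> column
    <literal>name</literal>):
   </para>

<programlisting>
-- upper() is an ordinary function: its FuncCall node becomes a FuncExpr.
-- count() is an aggregate: its FuncCall node becomes an Aggref.
SELECT upper(name), count(name) FROM t GROUP BY upper(name);
</programlisting>
  </sect2>
 </sect1>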
 <sect1 id="rule-system">
  <title>The <productname>PostgreSQL</productname> Rule System</title>

  <para>
   <productname>PostgreSQL</productname> supports a powerful
   <firstterm>rule system</firstterm> for the specification
   of <firstterm>views</firstterm> and ambiguous
   <firstterm>view updates</firstterm>.
   Originally the <productname>PostgreSQL</productname>
   rule system consisted of two implementations:

   <itemizedlist>
    <listitem>
     <para>
      The first one worked using <firstterm>row level</firstterm> processing
      and was implemented deep in the <firstterm>executor</firstterm>.
      The rule system was called whenever an individual row had been
      accessed. This implementation was removed in 1995 when the last
      official release of the <productname>Berkeley Postgres</productname>
      project was transformed into <productname>Postgres95</productname>.
     </para>
    </listitem>
    <listitem>
     <para>
      The second implementation of the rule system is a technique
      called <firstterm>query rewriting</firstterm>.
      The <firstterm>rewrite system</firstterm> is a module
      that exists between the <firstterm>parser stage</firstterm> and the
      <firstterm>planner/optimizer</firstterm>. This technique is still
      implemented.
     </para>
    </listitem>
   </itemizedlist>
  </para>
  <para>
   The query rewriter is discussed in some detail in
   <xref linkend="rules">, so there is no need to cover it here.
   We will only point out that both the input and the output of the
   rewriter are query trees, that is, there is no change in the
   representation or level of semantic detail in the trees. Rewriting
   can be thought of as a form of macro expansion.
  </para>
 </sect1>
 <sect1 id="planner-optimizer">
  <title>Planner/Optimizer</title>

  <para>
   The task of the <firstterm>planner/optimizer</firstterm> is to
   create an optimal execution plan. A given SQL query (and hence, a
   query tree) can be actually executed in a wide variety of
   different ways, each of which will produce the same set of
   results. If it is computationally feasible, the query optimizer
   will examine each of these possible execution plans, ultimately
   selecting the execution plan that is expected to run the fastest.
  </para>
  <para>
   In some situations, examining each possible way in which a query
   can be executed would take an excessive amount of time and memory
   space. In particular, this occurs when executing queries
   involving large numbers of join operations. In order to determine
   a reasonable (not necessarily optimal) query plan in a reasonable amount
   of time, <productname>PostgreSQL</productname> uses a
   <xref linkend="geqo" endterm="geqo-title"> when the number of joins
   exceeds a threshold (see <xref linkend="guc-geqo-threshold">).
  </para>
  <para>
   The planner's search procedure actually works with data structures
   called <firstterm>paths</>, which are simply cut-down representations of
   plans containing only as much information as the planner needs to make
   its decisions. After the cheapest path is determined, a full-fledged
   <firstterm>plan tree</> is built to pass to the executor. This represents
   the desired execution plan in sufficient detail for the executor to run it.
   In the rest of this section we'll ignore the distinction between paths
   and plans.
  </para>
  <sect2>
   <title>Generating Possible Plans</title>

   <para>
    The planner/optimizer starts by generating plans for scanning each
    individual relation (table) used in the query. The possible plans
    are determined by the available indexes on each relation.
    There is always the possibility of performing a
    sequential scan on a relation, so a sequential scan plan is always
    created. Assume an index is defined on a
    relation (for example a B-tree index) and a query contains the
    restriction
    <literal>relation.attribute OPR constant</literal>. If
    <literal>relation.attribute</literal> happens to match the key of the
    B-tree index and <literal>OPR</literal> is one of the operators listed in
    the index's <firstterm>operator class</>, another plan is created using
    the B-tree index to scan the relation. If there are further indexes
    present and the restrictions in the query happen to match a key of an
    index, further plans will be considered. Index scan plans are also
    generated for indexes that have a sort ordering that can match the
    query's <literal>ORDER BY</> clause (if any), or a sort ordering that
    might be useful for merge joining (see below).
   </para>
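   <para>
    As a sketch of the <literal>ORDER BY</> case, with a hypothetical
    table and index:
   </para>

<programlisting>
CREATE TABLE events (ts timestamptz, payload text);
CREATE INDEX events_ts_idx ON events (ts);

-- The index's sort ordering matches the ORDER BY, so in addition to
-- "sequential scan plus explicit sort" the planner considers an
-- index scan that needs no separate sort step:
EXPLAIN SELECT * FROM events ORDER BY ts;
</programlisting>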
   <para>
    If the query requires joining two or more relations,
    plans for joining relations are considered
    after all feasible plans have been found for scanning single relations.
    The three available join strategies (an example plan appears after
    this list) are:

    <itemizedlist>
     <listitem>
      <para>
       <firstterm>nested loop join</firstterm>: The right relation is scanned
       once for every row found in the left relation. This strategy
       is easy to implement but can be very time consuming. (However,
       if the right relation can be scanned with an index scan, this can
       be a good strategy. It is possible to use values from the current
       row of the left relation as keys for the index scan of the right.)
      </para>
     </listitem>
     <listitem>
      <para>
       <firstterm>merge join</firstterm>: Each relation is sorted on the join
       attributes before the join starts. Then the two relations are
       scanned in parallel, and matching rows are combined to form
       join rows. This kind of join is more
       attractive because each relation has to be scanned only once.
       The required sorting might be achieved either by an explicit sort
       step, or by scanning the relation in the proper order using an
       index on the join key.
      </para>
     </listitem>
     <listitem>
      <para>
       <firstterm>hash join</firstterm>: The right relation is first scanned
       and loaded into a hash table, using its join attributes as hash keys.
       Next the left relation is scanned and the
       appropriate values of every row found are used as hash keys to
       locate the matching rows in the table.
      </para>
     </listitem>
    </itemizedlist>
   </para>
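   <para>
    With two hypothetical tables <literal>a</literal> and
    <literal>b</literal>, <command>EXPLAIN</command> shows which strategy
    the planner settled on; a hash join, for instance, might look like
    this:
   </para>

<programlisting>
EXPLAIN SELECT * FROM a JOIN b ON a.id = b.a_id;
-- One possible output, with cost estimates omitted:
--   Hash Join
--     Hash Cond: (a.id = b.a_id)
--     ->  Seq Scan on a
--     ->  Hash
--           ->  Seq Scan on b
</programlisting>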
   <para>
    When the query involves more than two relations, the final result
    must be built up by a tree of join steps, each with two inputs.
    The planner examines different possible join sequences to find the
    cheapest one.
   </para>
   <para>
    If the query uses fewer than <xref linkend="guc-geqo-threshold">
    relations, a near-exhaustive search is conducted to find the best
    join sequence. The planner preferentially considers joins between any
    two relations for which there exists a corresponding join clause in the
    <literal>WHERE</literal> qualification (i.e., for
    which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
    exists). Join pairs with no join clause are considered only when there
    is no other choice, that is, a particular relation has no available
    join clauses to any other relation. All possible plans are generated for
    every join pair considered by the planner, and the one that is
    (estimated to be) the cheapest is chosen.
   </para>
   <para>
    When <varname>geqo_threshold</varname> is exceeded, the join
    sequences considered are determined by heuristics, as described
    in <xref linkend="geqo">. Otherwise the process is the same.
   </para>
   <para>
    The finished plan tree consists of sequential or index scans of
    the base relations, plus nested-loop, merge, or hash join nodes as
    needed, plus any auxiliary steps needed, such as sort nodes or
    aggregate-function calculation nodes. Most of these plan node
    types have the additional ability to do <firstterm>selection</>
    (discarding rows that do not meet a specified boolean condition)
    and <firstterm>projection</> (computation of a derived column set
    based on given column values, that is, evaluation of scalar
    expressions where needed). One of the responsibilities of the
    planner is to attach selection conditions from the
    <literal>WHERE</literal> clause and computation of required
    output expressions to the most appropriate nodes of the plan
    tree.
   </para>
  </sect2>
 </sect1>
 <sect1 id="executor">
  <title>Executor</title>

  <para>
   The <firstterm>executor</firstterm> takes the plan handed back by the
   planner/optimizer and recursively processes it to extract the required set
   of rows. This is essentially a demand-pull pipeline mechanism.
   Each time a plan node is called, it must deliver one more row, or
   report that it is done delivering rows.
  </para>
  <para>
   To provide a concrete example, assume that the top
   node is a <literal>MergeJoin</literal> node.
   Before any merge can be done two rows have to be fetched (one from
   each subplan). So the executor recursively calls itself to
   process the subplans (it starts with the subplan attached to
   <literal>lefttree</literal>). The new top node (the top node of the left
   subplan) is, let's say, a
   <literal>Sort</literal> node and again recursion is needed to obtain
   an input row. The child node of the <literal>Sort</literal> might
   be a <literal>SeqScan</> node, representing actual reading of a table.
   Execution of this node causes the executor to fetch a row from the
   table and return it up to the calling node. The <literal>Sort</literal>
   node will repeatedly call its child to obtain all the rows to be sorted.
   When the input is exhausted (as indicated by the child node returning
   a NULL instead of a row), the <literal>Sort</literal> code performs
   the sort, and finally is able to return its first output row, namely
   the first one in sorted order. It keeps the remaining rows stored so
   that it can deliver them in sorted order in response to later demands.
  </para>
  <para>
   The <literal>MergeJoin</literal> node similarly demands the first row
   from its right subplan. Then it compares the two rows to see if they
   can be joined; if so, it returns a join row to its caller. On the next
   call, or immediately if it cannot join the current pair of inputs,
   it advances to the next row of one table
   or the other (depending on how the comparison came out), and again
   checks for a match. Eventually, one subplan or the other is exhausted,
   and the <literal>MergeJoin</literal> node returns NULL to indicate that
   no more join rows can be formed.
  </para>
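  <para>
   A plan tree of exactly this shape can be seen in
   <command>EXPLAIN</command> output. (The tables are hypothetical, and
   the planner might well prefer another join strategy for them; the
   <varname>enable_mergejoin</varname> and related parameters can be
   used to experiment.)
  </para>

<programlisting>
EXPLAIN SELECT * FROM a JOIN b ON a.id = b.a_id;
-- One possible output, with cost estimates omitted:
--   Merge Join
--     Merge Cond: (a.id = b.a_id)
--     ->  Sort
--           Sort Key: a.id
--           ->  Seq Scan on a
--     ->  Sort
--           Sort Key: b.a_id
--           ->  Seq Scan on b
</programlisting>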
  <para>
   Complex queries can involve many levels of plan nodes, but the general
   approach is the same: each node computes and returns its next output
   row each time it is called. Each node is also responsible for applying
   any selection or projection expressions that were assigned to it by
   the planner.
  </para>
  <para>
   The executor mechanism is used to evaluate all four basic SQL query types:
   <command>SELECT</>, <command>INSERT</>, <command>UPDATE</>, and
   <command>DELETE</>. For <command>SELECT</>, the top-level executor
   code only needs to send each row returned by the query plan tree off
   to the client. For <command>INSERT</>, each returned row is inserted
   into the target table specified for the <command>INSERT</>. (A simple
   <command>INSERT ... VALUES</> command creates a trivial plan tree
   consisting of a single <literal>Result</> node, which computes just one
   result row. But <command>INSERT ... SELECT</> can demand the full power
   of the executor mechanism.) For <command>UPDATE</>, the planner arranges
   that each computed row includes all the updated column values, plus
   the <firstterm>TID</> (tuple ID, or row ID) of the original target row;
   the executor top level uses this information to create a new updated row
   and mark the old row deleted. For <command>DELETE</>, the only column
   that is actually returned by the plan is the TID, and the executor top
   level simply uses the TID to visit each target row and mark it deleted.
  </para>
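  <para>
   The trivial plan for a simple <command>INSERT</command> can be
   displayed with <command>EXPLAIN</command> (the table is hypothetical
   and the exact output varies across versions):
  </para>

<programlisting>
EXPLAIN INSERT INTO accounts VALUES (1, 100.0);
-- One possible output, with cost estimates omitted:
--   Insert on accounts
--     ->  Result
</programlisting>
 </sect1>
</chapter>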