external/ibm-public/postfix/dist/README_FILES/SCHEDULER_README

   1 P\bPo\bos\bst\btf\bfi\bix\bx Q\bQu\bue\beu\bue\be S\bSc\bch\bhe\bed\bdu\bul\ble\ber\br
   2
   3 -------------------------------------------------------------------------------
   4
   5 O\bOv\bve\ber\brv\bvi\bie\bew\bw
   6
   7 The queue manager is by far the most complex part of the Postfix mail system.
   8 It schedules delivery of new mail, retries failed deliveries at specific times,
   9 and removes mail from the queue after the last delivery attempt. There are two
  10 major classes of mechanisms that control the operation of the queue manager.
  11
  12 Topics covered by this document:
  13
  14   * Concurrency scheduling, concerned with the number of concurrent deliveries
  15     to a specific destination, including decisions on when to suspend
  16     deliveries after persistent failures.
  17   * Preemptive scheduling, concerned with the selection of email messages and
  18     recipients for a given destination.
  19   * Credits, something this document would not be complete without.
  20
  21 C\bCo\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by s\bsc\bch\bhe\bed\bdu\bul\bli\bin\bng\bg
  22
  23 The following sections document the Postfix 2.5 concurrency scheduler, after a
  24 discussion of the limitations of the existing concurrency scheduler. This is
  25 followed by results of medium-concurrency experiments, and a discussion of
  26 trade-offs between performance and robustness.
  27
  28 The material is organized as follows:
  29
  30   * Drawbacks of the existing concurrency scheduler
  31   * Summary of the Postfix 2.5 concurrency feedback algorithm
  32   * Summary of the Postfix 2.5 "dead destination" detection algorithm
  33   * Pseudocode for the Postfix 2.5 concurrency scheduler
  34   * Results for delivery to concurrency limited servers
  35   * Discussion of concurrency limited server results
  36   * Limitations of less-than-1 per delivery feedback
  37   * Concurrency configuration parameters
  38
  39 D\bDr\bra\baw\bwb\bba\bac\bck\bks\bs o\bof\bf t\bth\bhe\be e\bex\bxi\bis\bst\bti\bin\bng\bg c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by s\bsc\bch\bhe\bed\bdu\bul\ble\ber\br
  40
  41 From the start, Postfix has used a simple but robust algorithm where the per-
  42 destination delivery concurrency is decremented by 1 after delivery failed due
  43 to connection or handshake failure, and incremented by 1 otherwise. Of course
  44 the concurrency is never allowed to exceed the maximum per-destination
  45 concurrency limit. And when a destination's concurrency level drops to zero,
  46 the destination is declared "dead" and delivery is suspended.
  47
  48 Drawbacks of +/-1 concurrency feedback per delivery are:
  49
  50   * Overshoot due to exponential delivery concurrency growth with each pseudo-
  51     cohort(*). This can be an issue with high-concurrency channels. For
  52     example, with the default initial concurrency of 5, concurrency would
  53     proceed over time as (5-10-20).
  54
  55   * Throttling down to zero concurrency after a single pseudo-cohort(*)
  56     failure. This was especially an issue with low-concurrency channels where a
  57     single failure could be sufficient to mark a destination as "dead", causing
  58     the suspension of further deliveries to the affected destination.
  59
  60 (*) A pseudo-cohort is a number of delivery requests equal to a destination's
  61 delivery concurrency.
  62
  63 The revised concurrency scheduler has a highly modular structure. It uses
  64 separate mechanisms for per-destination concurrency control and for "dead
  65 destination" detection. The concurrency control in turn is built from two
  66 separate mechanisms: it supports less-than-1 feedback per delivery to allow for
  67 more gradual concurrency adjustments, and it uses feedback hysteresis to
  68 suppress concurrency oscillations. And instead of waiting for delivery
  69 concurrency to throttle down to zero, a destination is declared "dead" after a
  70 configurable number of pseudo-cohorts reports connection or handshake failure.
  71
  72 S\bSu\bum\bmm\bma\bar\bry\by o\bof\bf t\bth\bhe\be P\bPo\bos\bst\btf\bfi\bix\bx 2\b2.\b.5\b5 c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by f\bfe\bee\bed\bdb\bba\bac\bck\bk a\bal\blg\bgo\bor\bri\bit\bth\bhm\bm
  73
  74 We want to increment a destination's delivery concurrency when some (not
  75 necessarily consecutive) number of deliveries complete without connection or
  76 handshake failure. This is implemented with positive feedback g(N) where N is
  77 the destination's delivery concurrency. With g(N)=1 feedback per delivery,
  78 concurrency increases by 1 after each positive feedback event; this gives us
  79 the old scheduler's exponential growth in time. With g(N)=1/N feedback per
  80 delivery, concurrency increases by 1 after an entire pseudo-cohort N of
  81 positive feedback reports; this gives us linear growth in time. Less-than-
  82 1 feedback per delivery and integer truncation naturally give us hysteresis, so
  83 that transitions to larger concurrency happen every 1/g(N) positive feedback
  84 events.
  85
  86 We want to decrement a destination's delivery concurrency when some (not
  87 necessarily consecutive) number of deliveries complete after connection or
  88 handshake failure. This is implemented with negative feedback f(N) where N is
  89 the destination's delivery concurrency. With f(N)=1 feedback per delivery,
  90 concurrency decreases by 1 after each negative feedback event; this gives us
  91 the old scheduler's behavior where concurrency is throttled down dramatically
  92 after a single pseudo-cohort failure. With f(N)=1/N feedback per delivery,
  93 concurrency backs off more gently. Again, less-than-1 feedback per delivery and
  94 integer truncation naturally give us hysteresis, so that transitions to lower
  95 concurrency happen every 1/f(N) negative feedback events.
  96
  97 However, with negative feedback we introduce a subtle twist. We "reverse" the
  98 negative hysteresis cycle so that the transition to lower concurrency happens
  99 at the b\bbe\beg\bgi\bin\bnn\bni\bin\bng\bg of a sequence of 1/f(N) negative feedback events. Otherwise, a
 100 correction for overload would be made too late. This makes the choice of f(N)
 101 relatively unimportant, as borne out by measurements later in this document.
 102
 103 In summary, the main ingredients for the Postfix 2.5 concurrency feedback
 104 algorithm are a) the option of less-than-1 positive feedback per delivery to
 105 avoid overwhelming servers, b) the option of less-than-1 negative feedback per
 106 delivery to avoid giving up too fast, c) feedback hysteresis to avoid rapid
 107 oscillation, and d) a "reverse" hysteresis cycle for negative feedback, so that
 108 it can correct for overload quickly.
 109
 110 S\bSu\bum\bmm\bma\bar\bry\by o\bof\bf t\bth\bhe\be P\bPo\bos\bst\btf\bfi\bix\bx 2\b2.\b.5\b5 "\b"d\bde\bea\bad\bd d\bde\bes\bst\bti\bin\bna\bat\bti\bio\bon\bn"\b" d\bde\bet\bte\bec\bct\bti\bio\bon\bn a\bal\blg\bgo\bor\bri\bit\bth\bhm\bm
 111
 112 We want to suspend deliveries to a specific destination after some number of
 113 deliveries suffers connection or handshake failure. The old scheduler declares
 114 a destination "dead" when negative (-1) feedback throttles the delivery
 115 concurrency down to zero. With less-than-1 feedback per delivery, this
 116 throttling down would obviously take too long. We therefore have to separate
 117 "dead destination" detection from concurrency feedback. This is implemented by
 118 introducing the concept of pseudo-cohort failure. The Postfix 2.5 concurrency
 119 scheduler declares a destination "dead" after a configurable number of pseudo-
 120 cohorts suffers from connection or handshake failures. The old scheduler
 121 corresponds to the special case where the pseudo-cohort failure limit is equal
 122 to 1.
 123
 124 P\bPs\bse\beu\bud\bdo\boc\bco\bod\bde\be f\bfo\bor\br t\bth\bhe\be P\bPo\bos\bst\btf\bfi\bix\bx 2\b2.\b.5\b5 c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by s\bsc\bch\bhe\bed\bdu\bul\ble\ber\br
 125
 126 The pseudo code shows how the ideas behind new concurrency scheduler are
 127 implemented as of November 2007. The actual code can be found in the module
 128 qmgr/qmgr_queue.c.
 129
 130 Types:
 131         Each destination has one set of the following variables
 132         int concurrency
 133         double success
 134         double failure
 135         double fail_cohorts
 136
 137 Feedback functions:
 138         N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
 139         positive feedback: g(N) = x/N | x/sqrt(N) | x
 140         negative feedback: f(N) = y/N | y/sqrt(N) | y
 141
 142 Initialization:
 143         concurrency = initial_concurrency
 144         success = 0
 145         failure = 0
 146         fail_cohorts = 0
 147
 148 After success:
 149         fail_cohorts = 0
 150         Be prepared for feedback > hysteresis, or rounding error
 151         success += g(concurrency)
 152         while (success >= 1)            Hysteresis 1
 153             concurrency += 1            Hysteresis 1
 154             failure = 0
 155             success -= 1                Hysteresis 1
 156         Be prepared for overshoot
 157         if (concurrency > concurrency limit)
 158             concurrency = concurrency limit
 159
 160 Safety:
 161         Don't apply positive feedback unless
 162             concurrency < busy_refcount + init_dest_concurrency
 163         otherwise negative feedback effect could be delayed
 164
 165 After failure:
 166         if (concurrency > 0)
 167             fail_cohorts += 1.0 / concurrency
 168             if (fail_cohorts > cohort_failure_limit)
 169                 concurrency = 0
 170         if (concurrency > 0)
 171             Be prepared for feedback > hysteresis, rounding errors
 172             failure -= f(concurrency)
 173             while (failure < 0)
 174                 concurrency -= 1        Hysteresis 1
 175                 failure += 1            Hysteresis 1
 176                 success = 0
 177             Be prepared for overshoot
 178             if (concurrency < 1)
 179                 concurrency = 1
 180
 181 R\bRe\bes\bsu\bul\blt\bts\bs f\bfo\bor\br d\bde\bel\bli\biv\bve\ber\bry\by t\bto\bo c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by-\b-l\bli\bim\bmi\bit\bte\bed\bd s\bse\ber\brv\bve\ber\brs\bs
 182
 183 Discussions about the concurrency scheduler redesign started early 2004, when
 184 the primary goal was to find alternatives that did not exhibit exponential
 185 growth or rapid concurrency throttling. No code was implemented until late
 186 2007, when the primary concern had shifted towards better handling of server
 187 concurrency limits. For this reason we measure how well the new scheduler does
 188 this job. The table below compares mail delivery performance of the old +/-
 189 1 feedback per delivery with several less-than-1 feedback functions, for
 190 different limited-concurrency server scenarios. Measurements were done with a
 191 FreeBSD 6.2 client and with FreeBSD 6.2 and various Linux servers.
 192
 193 Server configuration:
 194
 195   * The mail flow was slowed down with 1 second latency per recipient
 196     ("smtpd_client_restrictions = sleep 1"). The purpose was to make results
 197     less dependent on hardware details, by avoiding slow-downs by queue file I/
 198     O, logging I/O, and network I/O.
 199   * Concurrency was limited by the server process limit ("default_process_limit
 200     = 5" and "smtpd_client_event_limit_exceptions = static:all"). Postfix was
 201     stopped and started after changing the process limit, because the same
 202     number is also used as the backlog argument to the listen(2) system call,
 203     and "postfix reload" does not re-issue this call.
 204   * Mail was discarded with "local_recipient_maps = static:all" and
 205     "local_transport = discard". The discard action in access maps or header/
 206     body checks could not be used as it fails to update the in_flow_delay
 207     counters.
 208
 209 Client configuration:
 210
 211   * Queue file overhead was minimized by sending one message to a virtual alias
 212     that expanded into 2000 different remote recipients. All recipients were
 213     accounted for according to the maillog file. The
 214     virtual_alias_expansion_limit setting was increased to avoid complaints
 215     from the cleanup(8) server.
 216   * The number of deliveries was maximized with
 217     "smtp_destination_recipient_limit = 2". A smaller limit would cause Postfix
 218     to schedule the concurrency per recipient instead of domain, which is not
 219     what we want.
 220   * Maximum concurrency was limited with "smtp_destination_concurrency_limit =
 221     20", and initial_destination_concurrency was set to the same value.
 222   * The positive and negative concurrency feedback hysteresis was 1.
 223     Concurrency was incremented by 1 at the END of 1/feedback steps of positive
 224     feedback, and was decremented by 1 at the START of 1/feedback steps of
 225     negative feedback.
 226   * The SMTP client used the default 30s SMTP connect timeout and 300s SMTP
 227     greeting timeout.
 228
 229 I\bIm\bmp\bpa\bac\bct\bt o\bof\bf t\bth\bhe\be 3\b30\b0s\bs S\bSM\bMT\bTP\bP c\bco\bon\bnn\bne\bec\bct\bt t\bti\bim\bme\beo\bou\but\bt
 230
 231 The first results are for a FreeBSD 6.2 server, where our artificially low
 232 listen(2) backlog results in a very short kernel queue for established
 233 connections. The table shows that all deferred deliveries failed due to a 30s
 234 connection timeout, and none failed due to a server greeting timeout. This
 235 measurement simulates what happens when the server's connection queue is
 236 completely full under load, and the TCP engine drops new connections.
 237
 238     c\bcl\bli\bie\ben\bnt\bt s\bse\ber\brv\bve\ber\br f\bfe\bee\bed\bdb\bba\bac\bck\bk  c\bco\bon\bnn\bne\bec\bct\bti\bio\bon\bn p\bpe\ber\brc\bce\ben\bnt\bta\bag\bge\be c\bcl\bli\bie\ben\bnt\bt         t\bti\bim\bme\bed\bd-\b-o\bou\but\bt i\bin\bn
 239     l\bli\bim\bmi\bit\bt  l\bli\bim\bmi\bit\bt  s\bst\bty\byl\ble\be     c\bca\bac\bch\bhi\bin\bng\bg    d\bde\bef\bfe\ber\brr\bre\bed\bd   c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by    c\bco\bon\bnn\bne\bec\bct\bt/\b/
 240                                                   a\bav\bve\ber\bra\bag\bge\be/\b/s\bst\btd\bdd\bde\bev\bv g\bgr\bre\bee\bet\bti\bin\bng\bg
 241
 242     -------------------------------------------------------------------------
 243        20     5       1/N         no        9.9   19.4    0.49   198      -
 244
 245        20     5       1/N        yes       10.3   19.4    0.49   206      -
 246
 247        20     5   1/sqrt(N)       no       10.4   19.6    0.59   208      -
 248
 249        20     5   1/sqrt(N)      yes       10.6   19.6    0.61   212      -
 250
 251        20     5         1         no       10.1   19.5    1.29   202      -
 252
 253        20     5         1        yes       10.8   19.3    1.57   216      -
 254
 255     -------------------------------------------------------------------------
 256
 257     A busy server with a completely full connection queue. N is the client
 258     delivery concurrency. Failed deliveries time out after 30s without
 259     completing the TCP handshake. See text for a discussion of results.
 260
 261 I\bIm\bmp\bpa\bac\bct\bt o\bof\bf t\bth\bhe\be 3\b30\b00\b0s\bs S\bSM\bMT\bTP\bP g\bgr\bre\bee\bet\bti\bin\bng\bg t\bti\bim\bme\beo\bou\but\bt
 262
 263 The next table shows results for a Fedora Core 8 server (results for RedHat 7.3
 264 are identical). In this case, the artificially small listen(2) backlog argument
 265 does not impact our measurement. The table shows that practically all deferred
 266 deliveries fail after the 300s SMTP greeting timeout. As these timeouts were
 267 10x longer than with the first measurement, we increased the recipient count
 268 (and thus the running time) by a factor of 10 to keep the results comparable.
 269 The deferred mail percentages are a factor 10 lower than with the first
 270 measurement, because the 1s per-recipient delay was 1/300th of the greeting
 271 timeout instead of 1/30th of the connection timeout.
 272
 273     c\bcl\bli\bie\ben\bnt\bt s\bse\ber\brv\bve\ber\br f\bfe\bee\bed\bdb\bba\bac\bck\bk  c\bco\bon\bnn\bne\bec\bct\bti\bio\bon\bn p\bpe\ber\brc\bce\ben\bnt\bta\bag\bge\be c\bcl\bli\bie\ben\bnt\bt         t\bti\bim\bme\bed\bd-\b-o\bou\but\bt i\bin\bn
 274     l\bli\bim\bmi\bit\bt  l\bli\bim\bmi\bit\bt  s\bst\bty\byl\ble\be     c\bca\bac\bch\bhi\bin\bng\bg    d\bde\bef\bfe\ber\brr\bre\bed\bd   c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by    c\bco\bon\bnn\bne\bec\bct\bt/\b/
 275                                                   a\bav\bve\ber\bra\bag\bge\be/\b/s\bst\btd\bdd\bde\bev\bv g\bgr\bre\bee\bet\bti\bin\bng\bg
 276
 277     -------------------------------------------------------------------------
 278        20     5       1/N         no       1.16   19.8    0.37   -      230
 279
 280        20     5       1/N        yes       1.36   19.8    0.36   -      272
 281
 282        20     5   1/sqrt(N)       no       1.21   19.9    0.23   4      238
 283
 284        20     5   1/sqrt(N)      yes       1.36   20.0    0.23   -      272
 285
 286        20     5         1         no       1.18   20.0    0.16   -      236
 287
 288        20     5         1        yes       1.39   20.0    0.16   -      278
 289
 290     -------------------------------------------------------------------------
 291
 292     A busy server with a non-full connection queue. N is the client delivery
 293     concurrency. Failed deliveries complete at the TCP level, but time out
 294     after 300s while waiting for the SMTP greeting. See text for a discussion
 295     of results.
 296
 297 I\bIm\bmp\bpa\bac\bct\bt o\bof\bf a\bac\bct\bti\biv\bve\be s\bse\ber\brv\bve\ber\br c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by l\bli\bim\bmi\bit\bte\ber\br
 298
 299 The final concurrency-limited result shows what happens when SMTP connections
 300 don't time out, but are rejected immediately with the Postfix server's
 301 smtpd_client_connection_count_limit feature (the server replies with a 421
 302 status and disconnects immediately). Similar results can be expected with
 303 concurrency limiting features built into other MTAs or firewalls. For this
 304 measurement we specified a server concurrency limit and a client initial
 305 destination concurrency of 5, and a server process limit of 10; all other
 306 conditions were the same as with the first measurement. The same result would
 307 be obtained with a FreeBSD or Linux server, because the "pushing back" is done
 308 entirely by the receiving side.
 309
 310     c\bcl\bli\bie\ben\bnt\bt s\bse\ber\brv\bve\ber\br f\bfe\bee\bed\bdb\bba\bac\bck\bk  c\bco\bon\bnn\bne\bec\bct\bti\bio\bon\bn p\bpe\ber\brc\bce\ben\bnt\bta\bag\bge\be c\bcl\bli\bie\ben\bnt\bt         t\bth\bhe\beo\bor\bre\bet\bti\bic\bca\bal\bl
 311     l\bli\bim\bmi\bit\bt  l\bli\bim\bmi\bit\bt  s\bst\bty\byl\ble\be     c\bca\bac\bch\bhi\bin\bng\bg    d\bde\bef\bfe\ber\brr\bre\bed\bd   c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by    d\bde\bef\bfe\ber\br r\bra\bat\bte\be
 312                                                   a\bav\bve\ber\bra\bag\bge\be/\b/s\bst\btd\bdd\bde\bev\bv
 313
 314     -------------------------------------------------------------------------
 315        20     5       1/N         no       16.5   5.17    0.38         1/6
 316
 317        20     5       1/N        yes       16.5   5.17    0.38         1/6
 318
 319        20     5   1/sqrt(N)       no       24.5   5.28    0.45         1/4
 320
 321        20     5   1/sqrt(N)      yes       24.3   5.28    0.46         1/4
 322
 323        20     5         1         no       49.7   5.63    0.67         1/2
 324
 325        20     5         1        yes       49.7   5.68    0.70         1/2
 326
 327     -------------------------------------------------------------------------
 328
 329     A server with active per-client concurrency limiter that replies with 421
 330     and disconnects. N is the client delivery concurrency. The theoretical
 331     defer rate is 1/(1+roundup(1/feedback)). This is always 1/2 with the fixed
 332     +/-1 feedback per delivery; with the concurrency-dependent feedback
 333     variants, the defer rate decreases with increasing concurrency. See text
 334     for a discussion of results.
 335
 336 D\bDi\bis\bsc\bcu\bus\bss\bsi\bio\bon\bn o\bof\bf c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by-\b-l\bli\bim\bmi\bit\bte\bed\bd s\bse\ber\brv\bve\ber\br r\bre\bes\bsu\bul\blt\bts\bs
 337
 338 All results in the previous sections are based on the first delivery runs only;
 339 they do not include any second etc. delivery attempts. It's also worth noting
 340 that the measurements look at steady-state behavior only. They don't show what
 341 happens when the client starts sending at a much higher or lower concurrency.
 342
 343 The first two examples show that the effect of feedback is negligible when
 344 concurrency is limited due to congestion. This is because the initial
 345 concurrency is already at the client's concurrency maximum, and because there
 346 is 10-100 times more positive than negative feedback. Under these conditions,
 347 it is no surprise that the contribution from SMTP connection caching is also
 348 negligible.
 349
 350 In the last example, the old +/-1 feedback per delivery will defer 50% of the
 351 mail when confronted with an active (anvil-style) server concurrency limit,
 352 where the server hangs up immediately with a 421 status (a TCP-level RST would
 353 have the same result). Less aggressive feedback mechanisms fare better than
 354 more aggressive ones. Concurrency-dependent feedback fares even better at
 355 higher concurrencies than shown here, but has limitations as discussed in the
 356 next section.
 357
 358 L\bLi\bim\bmi\bit\bta\bat\bti\bio\bon\bns\bs o\bof\bf l\ble\bes\bss\bs-\b-t\bth\bha\ban\bn-\b-1\b1 p\bpe\ber\br d\bde\bel\bli\biv\bve\ber\bry\by f\bfe\bee\bed\bdb\bba\bac\bck\bk
 359
 360 Less-than-1 feedback is of interest primarily when sending large amounts of
 361 mail to destinations with active concurrency limiters (servers that reply with
 362 421, or firewalls that send RST). When sending small amounts of mail per
 363 destination, less-than-1 per-delivery feedback won't have a noticeable effect
 364 on the per-destination concurrency, because the number of deliveries to the
 365 same destination is too small. You might just as well use zero per-delivery
 366 feedback and stay with the initial per-destination concurrency. And when mail
 367 deliveries fail due to congestion instead of active concurrency limiters, the
 368 measurements above show that per-delivery feedback has no effect. With large
 369 amounts of mail you might just as well use zero per-delivery feedback and start
 370 with the maximal per-destination concurrency.
 371
 372 The scheduler with less-than-1 concurrency feedback per delivery solves a
 373 problem with servers that have active concurrency limiters. This works only
 374 because feedback is handled in a peculiar manner: positive feedback will
 375 increment the concurrency by 1 at the e\ben\bnd\bd of a sequence of events of length 1/
 376 feedback, while negative feedback will decrement concurrency by 1 at the
 377 b\bbe\beg\bgi\bin\bnn\bni\bin\bng\bg of such a sequence. This is how Postfix adjusts quickly for overshoot
 378 without causing lots of mail to be deferred. Without this difference in
 379 feedback treatment, less-than-1 feedback per delivery would defer 50% of the
 380 mail, and would be no better in this respect than the old +/-1 feedback per
 381 delivery.
 382
 383 Unfortunately, the same feature that corrects quickly for concurrency overshoot
 384 also makes the scheduler more sensitive for noisy negative feedback. The reason
 385 is that one lonely negative feedback event has the same effect as a complete
 386 sequence of length 1/feedback: in both cases delivery concurrency is dropped by
 387 1 immediately. As a worst-case scenario, consider multiple servers behind a
 388 load balancer on a single IP address, and no backup MX address. When 1 out of K
 389 servers fails to complete the SMTP handshake or drops the connection, a
 390 scheduler with 1/N (N = concurrency) feedback stops increasing its concurrency
 391 once it reaches a concurrency level of about K, even though the good servers
 392 behind the load balancer are perfectly capable of handling more traffic.
 393
 394 This noise problem gets worse as the amount of positive feedback per delivery
 395 gets smaller. A compromise is to use fixed less-than-1 positive feedback values
 396 instead of concurrency-dependent positive feedback. For example, to tolerate 1
 397 of 4 bad servers in the above load balancer scenario, use positive feedback of
 398 1/4 per "good" delivery (no connect or handshake error), and use an equal or
 399 smaller amount of negative feedback per "bad" delivery. The downside of using
 400 concurrency-independent feedback is that some of the old +/-1 feedback problems
 401 will return at large concurrencies. Sites that must deliver mail at non-trivial
 402 per-destination concurrencies will require special configuration.
 403
 404 C\bCo\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by c\bco\bon\bnf\bfi\big\bgu\bur\bra\bat\bti\bio\bon\bn p\bpa\bar\bra\bam\bme\bet\bte\ber\brs\bs
 405
 406 The Postfix 2.5 concurrency scheduler is controlled with the following
 407 configuration parameters, where "transport_foo" provides a transport-specific
 408 parameter override. All parameter default settings are compatible with earlier
 409 Postfix versions.
 410
 411     P\bPa\bar\bra\bam\bme\bet\bte\ber\br n\bna\bam\bme\be                                        P\bPo\bos\bst\btf\bfi\bix\bx D\bDe\bes\bsc\bcr\bri\bip\bpt\bti\bio\bon\bn
 412                                                           v\bve\ber\brs\bsi\bio\bon\bn
 413
 414     ---------------------------------------------------------------------------
 415                                                                   Initial per-
 416     initial_destination_concurrency                          all  destination
 417     transport_initial_destination_concurrency                2.5  delivery
 418                                                                   concurrency
 419
 420                                                                   Maximum per-
 421     default_destination_concurrency_limit                    all  destination
 422     transport_destination_concurrency_limit                  all  delivery
 423                                                                   concurrency
 424
 425                                                                   Per-
 426                                                                   destination
 427                                                                   positive
 428                                                                   feedback
 429     default_destination_concurrency_positive_feedback        2.5  amount, per
 430     transport_destination_concurrency_positive_feedback      2.5  delivery that
 431                                                                   does not fail
 432                                                                   with
 433                                                                   connection or
 434                                                                   handshake
 435                                                                   failure
 436
 437                                                                   Per-
 438                                                                   destination
 439                                                                   negative
 440                                                                   feedback
 441     default_destination_concurrency_negative_feedback        2.5  amount, per
 442     transport_destination_concurrency_negative_feedback      2.5  delivery that
 443                                                                   fails with
 444                                                                   connection or
 445                                                                   handshake
 446                                                                   failure
 447
 448                                                                   Number of
 449                                                                   failed
 450                                                                   pseudo-
 451                                                                   cohorts after
 452     default_destination_concurrency_failed_cohort_limit      2.5  which a
 453     transport_destination_concurrency_failed_cohort_limit    2.5  destination
 454                                                                   is declared
 455                                                                   "dead" and
 456                                                                   delivery is
 457                                                                   suspended
 458
 459                                                                   Enable
 460                                                                   verbose
 461     destination_concurrency_feedback_debug                   2.5  logging of
 462                                                                   concurrency
 463                                                                   scheduler
 464                                                                   activity
 465
 466     ---------------------------------------------------------------------------
 467
 468 P\bPr\bre\bee\bem\bmp\bpt\bti\biv\bve\be s\bsc\bch\bhe\bed\bdu\bul\bli\bin\bng\bg
 469
 470 The following sections describe the new queue manager and its preemptive
 471 scheduler algorithm. Note that the document was originally written to describe
 472 the changes between the new queue manager (in this text referred to as nqmgr,
 473 the name it was known by before it became the default queue manager) and the
 474 old queue manager (referred to as oqmgr). This is why it refers to oqmgr every
 475 so often.
 476
 477 This document is divided into sections as follows:
 478
 479   * The structures used by nqmgr
 480   * What happens when nqmgr picks up the message - how it is assigned to
 481     transports, jobs, peers, entries
 482   * How the entry selection works
 483   * How the preemption works - what messages may be preempted and how and what
 484     messages are chosen to preempt them
 485   * How destination concurrency limits affect the scheduling algorithm
 486   * Dealing with memory resource limits
 487
 488 T\bTh\bhe\be s\bst\btr\bru\buc\bct\btu\bur\bre\bes\bs u\bus\bse\bed\bd b\bby\by n\bnq\bqm\bmg\bgr\br
 489
 490 Let's start by recapitulating the structures and terms used when referring to
 491 queue manager and how it operates. Many of these are partially described
 492 elsewhere, but it is nice to have a coherent overview in one place:
 493
 494   * Each message structure represents one mail message which Postfix is to
 495     deliver. The message recipients specify to what destinations is the message
 496     to be delivered and what transports are going to be used for the delivery.
 497
 498   * Each recipient entry groups a batch of recipients of one message which are
 499     all going to be delivered to the same destination.
 500
 501   * Each transport structure groups everything what is going to be delivered by
 502     delivery agents dedicated for that transport. Each transport maintains a
 503     set of queues (describing the destinations it shall talk to) and jobs
 504     (referencing the messages it shall deliver).
 505
 506   * Each transport queue (not to be confused with the on-disk active queue or
 507     incoming queue) groups everything what is going be delivered to given
 508     destination (aka nexthop) by its transport. Each queue belongs to one
 509     transport, so each destination may be referred to by several queues, one
 510     for each transport. Each queue maintains a list of all recipient entries
 511     (batches of message recipients) which shall be delivered to given
 512     destination (the todo list), and a list of recipient entries already being
 513     delivered by the delivery agents (the busy list).
 514
 515   * Each queue corresponds to multiple peer structures. Each peer structure is
 516     like the queue structure, belonging to one transport and referencing one
 517     destination. The difference is that it lists only the recipient entries
 518     which all originate from the same message, unlike the queue structure,
 519     whose entries may originate from various messages. For messages with few
 520     recipients, there is usually just one recipient entry for each destination,
 521     resulting in one recipient entry per peer. But for large mailing list
 522     messages the recipients may need to be split to multiple recipient entries,
 523     in which case the peer structure may list many entries for single
 524     destination.
 525
 526   * Each transport job groups everything it takes to deliver one message via
 527     its transport. Each job represents one message within the context of the
 528     transport. The job belongs to one transport and message, so each message
 529     may have multiple jobs, one for each transport. The job groups all the peer
 530     structures, which describe the destinations the job's message has to be
 531     delivered to.
 532
 533 The first four structures are common to both nqmgr and oqmgr, the latter two
 534 were introduced by nqmgr.
 535
 536 These terms are used extensively in the text below, feel free to look up the
 537 description above anytime you'll feel you have lost a sense what is what.
 538
 539 W\bWh\bha\bat\bt h\bha\bap\bpp\bpe\ben\bns\bs w\bwh\bhe\ben\bn n\bnq\bqm\bmg\bgr\br p\bpi\bic\bck\bks\bs u\bup\bp t\bth\bhe\be m\bme\bes\bss\bsa\bag\bge\be
 540
 541 Whenever nqmgr moves a queue file into the active queue, the following happens:
 542 It reads all necessary information from the queue file as oqmgr does, and also
 543 reads as many recipients as possible - more on that later, for now let's just
 544 pretend it always reads all recipients.
 545
 546 Then it resolves the recipients as oqmgr does, which means obtaining (address,
 547 nexthop, transport) triple for each recipient. For each triple, it finds the
 548 transport; if it does not exist yet, it instantiates it (unless it's dead).
 549 Within the transport, it finds the destination queue for given nexthop; if it
 550 does not exist yet, it instantiates it (unless it's dead). The triple is then
 551 bound to given destination queue. This happens in qmgr_resolve() and is
 552 basically the same as in oqmgr.
 553
 554 Then for each triple which was bound to some queue (and thus transport), the
 555 program finds the job which represents the message within that transport's
 556 context; if it does not exist yet, it instantiates it. Within the job, it finds
 557 the peer which represents the bound destination queue within this jobs context;
 558 if it does not exist yet, it instantiates it. Finally, it stores the address
 559 from the resolved triple to the recipient entry which is appended to both the
 560 queue entry list and the peer entry list. The addresses for same nexthop are
 561 batched in the entries up to recipient_concurrency limit for that transport.
 562 This happens in qmgr_assign() and apart from that it operates with job and peer
 563 structures it is basically the same as in oqmgr.
 564
 565 When the job is instantiated, it is enqueued on the transport's job list based
 566 on the time its message was picked up by nqmgr. For first batch of recipients
 567 this means it is appended to the end of the job list, but the ordering of the
 568 job list by the enqueue time is important as we will see shortly.
 569
 570 [Now you should have pretty good idea what is the state of the nqmgr after
 571 couple of messages was picked up, what is the relation between all those job,
 572 peer, queue and entry structures.]
 573
 574 H\bHo\bow\bw t\bth\bhe\be e\ben\bnt\btr\bry\by s\bse\bel\ble\bec\bct\bti\bio\bon\bn w\bwo\bor\brk\bks\bs
 575
 576 Having prepared all those above mentioned structures, the task of the nqmgr's
 577 scheduler is to choose the recipient entries one at a time and pass them to the
 578 delivery agent for corresponding transport. Now how does this work?
 579
 580 The first approximation of the new scheduling algorithm is like this:
 581
 582     foreach transport (round-robin-by-transport)
 583     do
 584         if transport busy continue
 585         if transport process limit reached continue
 586         foreach transport's job (in the order of the transport's job list)
 587         do
 588         foreach job's peer (round-robin-by-destination)
 589              if peer->queue->concurrency < peer->queue->window
 590                  return next peer entry.
 591         done
 592         done
 593     done
 594
 595 Now what is the "order of the transport's job list"? As we know already, the
 596 job list is by default kept in the order the message was picked up by the
 597 nqmgr. So by default we get the top-level round-robin transport, and within
 598 each transport we get the FIFO message delivery. The round-robin of the peers
 599 by the destination is perhaps of little importance in most real-life cases
 600 (unless the recipient_concurrency limit is reached, in one job there is only
 601 one peer structure for each destination), but theoretically it makes sure that
 602 even within single jobs, destinations are treated fairly.
 603
 604 [By now you should have a feeling you really know how the scheduler works,
 605 except for the preemption, under ideal conditions - that is, no recipient
 606 resource limits and no destination concurrency problems.]
 607
 608 H\bHo\bow\bw t\bth\bhe\be p\bpr\bre\bee\bem\bmp\bpt\bti\bio\bon\bn w\bwo\bor\brk\bks\bs
 609
 610 As you might perhaps expect by now, the transport's job list does not remain
 611 sorted by the job's message enqueue time all the time. The most cool thing
 612 about nqmgr is not the simple FIFO delivery, but that it is able to slip mail
 613 with little recipients past the mailing-list bulk mail. This is what the job
 614 preemption is about - shuffling the jobs on the transport's job list to get the
 615 best message delivery rates. Now how is it achieved?
 616
 617 First I have to tell you that there are in fact two job lists in each
 618 transport. One is the scheduler's job list, which the scheduler is free to play
 619 with, while the other one keeps the jobs always listed in the order of the
 620 enqueue time and is used for recipient pool management we will discuss later.
 621 For now, we will deal with the scheduler's job list only.
 622
 623 So, we have the job list, which is first ordered by the time the jobs' messages
 624 were enqueued, oldest messages first, the most recently picked one at the end.
 625 For now, let's assume that there are no destination concurrency problems.
 626 Without preemption, we pick some entry of the first (oldest) job on the queue,
 627 assign it to delivery agent, pick another one from the same job, assign it
 628 again, and so on, until all the entries are used and the job is delivered. We
 629 would then move onto the next job and so on and on. Now how do we manage to
 630 sneak in some entries from the recently added jobs when the first job on the
 631 job list belongs to a message going to the mailing-list and has thousands of
 632 recipient entries?
 633
 634 The nqmgr's answer is that we can artificially "inflate" the delivery time of
 635 that first job by some constant for free - it is basically the same trick you
 636 might remember as "accumulation of potential" from the amortized complexity
 637 lessons. For example, instead of delivering the entries of the first job on the
 638 job list every time a delivery agent becomes available, we can do it only every
 639 second time. If you view the moments the delivery agent becomes available on a
 640 timeline as "delivery slots", then instead of using every delivery slot for the
 641 first job, we can use only every other slot, and still the overall delivery
 642 efficiency of the first job remains the same. So the delivery 11112222 becomes
 643 1.1.1.1.2.2.2.2 (1 and 2 are the imaginary job numbers, . denotes the free
 644 slot). Now what do we do with free slots?
 645
 646 As you might have guessed, we will use them for sneaking the mail with little
 647 recipients in. For example, if we have one four-recipient mail followed by four
 648 one recipients mail, the delivery sequence (that is, the sequence in which the
 649 jobs are assigned to the delivery slots) might look like this: 12131415. Hmm,
 650 fine for sneaking in the single recipient mail, but how do we sneak in the mail
 651 with more than one recipient? Say if we have one four-recipient mail followed
 652 by two two-recipient mails?
 653
 654 The simple answer would be to use delivery sequence 12121313. But the problem
 655 is that this does not scale well. Imagine you have mail with thousand
 656 recipients followed by mail with hundred recipients. It is tempting to suggest
 657 the delivery sequence like 121212...., but alas! Imagine there arrives another
 658 mail with say ten recipients. But there are no free slots anymore, so it can't
 659 slip by, not even if it had just only one recipients. It will be stuck until
 660 the hundred-recipient mail is delivered, which really sucks.
 661
 662 So, it becomes obvious that while inflating the message to get free slots is
 663 great idea, one has to be really careful of how the free slots are assigned,
 664 otherwise one might corner himself. So, how does nqmgr really use the free
 665 slots?
 666
 667 The key idea is that one does not have to generate the free slots in a uniform
 668 way. The delivery sequence 111...1 is no worse than 1.1.1.1, in fact, it is
 669 even better as some entries are in the first case selected earlier than in the
 670 second case, and none is selected later! So it is possible to first
 671 "accumulate" the free delivery slots and then use them all at once. It is even
 672 possible to accumulate some, then use them, then accumulate some more and use
 673 them again, as in 11..1.1 .
 674
 675 Let's get back to the one hundred recipient example. We now know that we could
 676 first accumulate one hundred free slots, and only after then to preempt the
 677 first job and sneak the one hundred recipient mail in. Applying the algorithm
 678 recursively, we see the hundred recipient job can accumulate ten free delivery
 679 slots, and then we could preempt it and sneak in the ten-recipient mail... Wait
 680 wait wait! Could we? Aren't we overinflating the original one thousand
 681 recipient mail?
 682
 683 Well, despite it looks so at the first glance, another trick will allow us to
 684 answer "no, we are not!". If we had said that we will inflate the delivery time
 685 twice at maximum, and then we consider every other slot as a free slot, then we
 686 would overinflate in case of the recursive preemption. BUT! The trick is that
 687 if we use only every n-th slot as a free slot for n>2, there is always some
 688 worst inflation factor which we can guarantee not to be breached, even if we
 689 apply the algorithm recursively. To be precise, if for every k>1 normally used
 690 slots we accumulate one free delivery slot, than the inflation factor is not
 691 worse than k/(k-1) no matter how many recursive preemptions happen. And it's
 692 not worse than (k+1)/k if only non-recursive preemption happens. Now, having
 693 got through the theory and the related math, let's see how nqmgr implements
 694 this.
 695
 696 Each job has so called "available delivery slot" counter. Each transport has a
 697 transport_delivery_slot_cost parameter, which defaults to
 698 default_delivery_slot_cost parameter which is set to 5 by default. This is the
 699 k from the paragraph above. Each time k entries of the job are selected for
 700 delivery, this counter is incremented by one. Once there are some slots
 701 accumulated, job which requires no more than that number of slots to be fully
 702 delivered can preempt this job.
 703
 704 [Well, the truth is, the counter is incremented every time an entry is selected
 705 and it is divided by k when it is used. Or even more true, there is no
 706 division, the other side of the equation is multiplied by k. But for the
 707 understanding it's good enough to use the above approximation of the truth.]
 708
 709 OK, so now we know the conditions which must be satisfied so one job can
 710 preempt another one. But what job gets preempted, how do we choose what job
 711 preempts it if there are several valid candidates, and when does all this
 712 exactly happen?
 713
 714 The answer for the first part is simple. The job whose entry was selected the
 715 last time is so called current job. Normally, it is the first job on the
 716 scheduler's job list, but destination concurrency limits may change this as we
 717 will see later. It is always only the current job which may get preempted.
 718
 719 Now for the second part. The current job has certain amount of recipient
 720 entries, and as such may accumulate at maximum some amount of available
 721 delivery slots. It might have already accumulated some, and perhaps even
 722 already used some when it was preempted before (remember a job can be preempted
 723 several times). In either case, we know how many are accumulated and how many
 724 are left to deliver, so we know how many it may yet accumulate at maximum.
 725 Every other job which may be delivered by less than that number of slots is a
 726 valid candidate for preemption. How do we choose among them?
 727
 728 The answer is - the one with maximum enqueue_time/recipient_entry_count. That
 729 is, the older the job is, the more we should try to deliver it in order to get
 730 best message delivery rates. These rates are of course subject to how many
 731 recipients the message has, therefore the division by the recipient (entry)
 732 count. No one shall be surprised that message with n recipients takes n times
 733 longer to deliver than message with one recipient.
 734
 735 Now let's recap the previous two paragraphs. Isn't it too complicated? Why
 736 don't the candidates come only among the jobs which can be delivered within the
 737 number of slots the current job already accumulated? Why do we need to estimate
 738 how much it has yet to accumulate? If you found out the answer, congratulate
 739 yourself. If we did it this simple way, we would always choose the candidate
 740 with least recipient entries. If there were enough single recipient mails
 741 coming in, they would always slip by the bulk mail as soon as possible, and the
 742 two and more recipients mail would never get a chance, no matter how long they
 743 have been sitting around in the job list.
 744
 745 This candidate selection has interesting implication - that when we choose the
 746 best candidate for preemption (this is done in qmgr_choose_candidate()), it may
 747 happen that we may not use it for preemption immediately. This leads to an
 748 answer to the last part of the original question - when does the preemption
 749 happen?
 750
 751 The preemption attempt happens every time next transport's recipient entry is
 752 to be chosen for delivery. To avoid needless overhead, the preemption is not
 753 attempted if the current job could never accumulate more than
 754 transport_minimum_delivery_slots (defaults to default_minimum_delivery_slots
 755 which defaults to 3). If there is already enough accumulated slots to preempt
 756 the current job by the chosen best candidate, it is done immediately. This
 757 basically means that the candidate is moved in front of the current job on the
 758 scheduler's job list and decreasing the accumulated slot counter by the amount
 759 used by the candidate. If there is not enough slots... well, I could say that
 760 nothing happens and the another preemption is attempted the next time. But
 761 that's not the complete truth.
 762
 763 The truth is that it turns out that it is not really necessary to wait until
 764 the jobs counter accumulates all the delivery slots in advance. Say we have
 765 ten-recipient mail followed by two two-recipient mails. If the preemption
 766 happened when enough delivery slot accumulate (assuming slot cost 2), the
 767 delivery sequence becomes 11112211113311. Now what would we get if we would
 768 wait only for 50% of the necessary slots to accumulate and we promise we would
 769 wait for the remaining 50% later, after we get back to the preempted job? If we
 770 use such slot loan, the delivery sequence becomes 11221111331111. As we can
 771 see, it makes it no considerably worse for the delivery of the ten-recipient
 772 mail, but it allows the small messages to be delivered sooner.
 773
 774 The concept of these slot loans is where the transport_delivery_slot_discount
 775 and transport_delivery_slot_loan come from (they default to
 776 default_delivery_slot_discount and default_delivery_slot_loan, whose values are
 777 by default 50 and 3, respectively). The discount (resp. loan) specifies how
 778 many percent (resp. how many slots) one "gets in advance", when the number of
 779 slots required to deliver the best candidate is compared with the number of
 780 slots the current slot had accumulated so far.
 781
 782 And it pretty much concludes this chapter.
 783
 784 [Now you should have a feeling that you pretty much understand the scheduler
 785 and the preemption, or at least that you will have it after you read the last
 786 chapter couple more times. You shall clearly see the job list and the
 787 preemption happening at its head, in ideal delivery conditions. The feeling of
 788 understanding shall last until you start wondering what happens if some of the
 789 jobs are blocked, which you might eventually figure out correctly from what had
 790 been said already. But I would be surprised if your mental image of the
 791 scheduler's functionality is not completely shattered once you start wondering
 792 how it works when not all recipients may be read in-core. More on that later.]
 793
 794 H\bHo\bow\bw d\bde\bes\bst\bti\bin\bna\bat\bti\bio\bon\bn c\bco\bon\bnc\bcu\bur\brr\bre\ben\bnc\bcy\by l\bli\bim\bmi\bit\bts\bs a\baf\bff\bfe\bec\bct\bt t\bth\bhe\be s\bsc\bch\bhe\bed\bdu\bul\bli\bin\bng\bg a\bal\blg\bgo\bor\bri\bit\bth\bhm\bm
 795
 796 The nqmgr uses the same algorithm for destination concurrency control as oqmgr.
 797 Now what happens when the destination limits are reached and no more entries
 798 for that destination may be selected by the scheduler?
 799
 800 From user's point of view it is all simple. If some of the peers of a job can't
 801 be selected, those peers are simply skipped by the entry selection algorithm
 802 (the pseudo-code described before) and only the selectable ones are used. If
 803 none of the peers may be selected, the job is declared a "blocker job". Blocker
 804 jobs are skipped by the entry selection algorithm and they are also excluded
 805 from the candidates for preemption of current job. Thus the scheduler
 806 effectively behaves as if the blocker jobs didn't exist on the job list at all.
 807 As soon as at least one of the peers of a blocker job becomes unblocked (that
 808 is, the delivery agent handling the delivery of the recipient entry for given
 809 destination successfully finishes), the job's blocker status is removed and the
 810 job again participates in all further scheduler actions normally.
 811
 812 So the summary is that the users don't really have to be concerned about the
 813 interaction of the destination limits and scheduling algorithm. It works well
 814 on its own and there are no knobs they would need to control it.
 815
 816 From a programmer's point of view, the blocker jobs complicate the scheduler
 817 quite a lot. Without them, the jobs on the job list would be normally delivered
 818 in strict FIFO order. If the current job is preempted, the job preempting it is
 819 completely delivered unless it is preempted itself. Without blockers, the
 820 current job is thus always either the first job on the job list, or the top of
 821 the stack of jobs preempting the first job on the job list.
 822
 823 The visualization of the job list and the preemption stack without blockers
 824 would be like this:
 825
 826     first job->    1--2--3--5--6--8--...    <- job list
 827     on job list    |
 828                    4    <- preemption stack
 829                    |
 830     current job->  7
 831
 832 In the example above we see that job 1 was preempted by job 4 and then job 4
 833 was preempted by job 7. After job 7 is completed, remaining entries of job 4
 834 are selected, and once they are all selected, job 1 continues.
 835
 836 As we see, it's all very clean and straightforward. Now how does this change
 837 because of blockers?
 838
 839 The answer is: a lot. Any job may become blocker job at any time, and also
 840 become normal job again at any time. This has several important implications:
 841
 842  1. The jobs may be completed in arbitrary order. For example, in the example
 843     above, if the current job 7 becomes blocked, the next job 4 may complete
 844     before the job 7 becomes unblocked again. Or if both 7 and 4 are blocked,
 845     then 1 is completed, then 7 becomes unblocked and is completed, then 2 is
 846     completed and only after that 4 becomes unblocked and is completed... You
 847     get the idea.
 848
 849     [Interesting side note: even when jobs are delivered out of order, from
 850     single destination's point of view the jobs are still delivered in the
 851     expected order (that is, FIFO unless there was some preemption involved).
 852     This is because whenever a destination queue becomes unblocked (the
 853     destination limit allows selection of more recipient entries for that
 854     destination), all jobs which have peers for that destination are unblocked
 855     at once.]
 856
 857  2. The idea of the preemption stack at the head of the job list is gone. That
 858     is, it must be possible to preempt any job on the job list. For example, if
 859     the jobs 7, 4, 1 and 2 in the example above become all blocked, job 3
 860     becomes the current job. And of course we do not want the preemption to be
 861     affected by the fact that there are some blocked jobs or not. Therefore, if
 862     it turns out that job 3 might be preempted by job 6, the implementation
 863     shall make it possible.
 864
 865  3. The idea of the linear preemption stack itself is gone. It's no longer true
 866     that one job is always preempted by only one job at one time (that is
 867     directly preempted, not counting the recursively nested jobs). For example,
 868     in the example above, job 1 is directly preempted by only job 4, and job 4
 869     by job 7. Now assume job 7 becomes blocked, and job 4 is being delivered.
 870     If it accumulates enough delivery slots, it is natural that it might be
 871     preempted for example by job 8. Now job 4 is preempted by both job 7 AND
 872     job 8 at the same time.
 873
 874 Now combine the points 2) and 3) with point 1) again and you realize that the
 875 relations on the once linear job list became pretty complicated. If we extend
 876 the point 3) example: jobs 7 and 8 preempt job 4, now job 8 becomes blocked
 877 too, then job 4 completes. Tricky, huh?
 878
 879 If I illustrate the relations after the above mentioned examples (but those in
 880 point 1)), the situation would look like this:
 881
 882                                 v- parent
 883
 884     adoptive parent ->    1--2--3--5--...      <- "stack" level 0
 885                           |     |
 886     parent gone ->        ?     6              <- "stack" level 1
 887                          / \
 888     children ->         7   8   ^- child       <- "stack" level 2
 889
 890                           ^- siblings
 891
 892 Now how does nqmgr deal with all these complicated relations?
 893
 894 Well, it maintains them all as described, but fortunately, all these relations
 895 are necessary only for purposes of proper counting of available delivery slots.
 896 For purposes of ordering the jobs for entry selection, the original rule still
 897 applies: "the job preempting the current job is moved in front of the current
 898 job on the job list". So for entry selection purposes, the job relations remain
 899 as simple as this:
 900
 901     7--8--1--2--6--3--5--..   <- scheduler's job list order
 902
 903 The job list order and the preemption parent/child/siblings relations are
 904 maintained separately. And because the selection works only with the job list,
 905 you can happily forget about those complicated relations unless you want to
 906 study the nqmgr sources. In that case the text above might provide some helpful
 907 introduction to the problem domain. Otherwise I suggest you just forget about
 908 all this and stick with the user's point of view: the blocker jobs are simply
 909 ignored.
 910
 911 [By now, you should have a feeling that there is more things going under the
 912 hood than you ever wanted to know. You decide that forgetting about this
 913 chapter is the best you can do for the sake of your mind's health and you
 914 basically stick with the idea how the scheduler works in ideal conditions, when
 915 there are no blockers, which is good enough.]
 916
 917 D\bDe\bea\bal\bli\bin\bng\bg w\bwi\bit\bth\bh m\bme\bem\bmo\bor\bry\by r\bre\bes\bso\bou\bur\brc\bce\be l\bli\bim\bmi\bit\bts\bs
 918
 919 When discussing the nqmgr scheduler, we have so far assumed that all recipients
 920 of all messages in the active queue are completely read into the memory. This
 921 is simply not true. There is an upper bound on the amount of memory the nqmgr
 922 may use, and therefore it must impose some limits on the information it may
 923 store in the memory at any given time.
 924
 925 First of all, not all messages may be read in-core at once. At any time, only
 926 qmgr_message_active_limit messages may be read in-core at maximum. When read
 927 into memory, the messages are picked from the incoming and deferred message
 928 queues and moved to the active queue (incoming having priority), so if there is
 929 more than qmgr_message_active_limit messages queued in the active queue, the
 930 rest will have to wait until (some of) the messages in the active queue are
 931 completely delivered (or deferred).
 932
 933 Even with the limited amount of in-core messages, there is another limit which
 934 must be imposed in order to avoid memory exhaustion. Each message may contain
 935 huge amount of recipients (tens or hundreds of thousands are not uncommon), so
 936 if nqmgr read all recipients of all messages in the active queue, it may easily
 937 run out of memory. Therefore there must be some upper bound on the amount of
 938 message recipients which are read into the memory at the same time.
 939
 940 Before discussing how exactly nqmgr implements the recipient limits, let's see
 941 how the sole existence of the limits themselves affects the nqmgr and its
 942 scheduler.
 943
 944 The message limit is straightforward - it just limits the size of the lookahead
 945 the nqmgr's scheduler has when choosing which message can preempt the current
 946 one. Messages not in the active queue simply are not considered at all.
 947
 948 The recipient limit complicates more things. First of all, the message reading
 949 code must support reading the recipients in batches, which among other things
 950 means accessing the queue file several times and continuing where the last
 951 recipient batch ended. This is invoked by the scheduler whenever the current
 952 job has space for more recipients, subject to transport's refill_limit and
 953 refill_delay parameters. It is also done any time when all in-core recipients
 954 of the message are dealt with (which may also mean they were deferred) but
 955 there are still more in the queue file.
 956
 957 The second complication is that with some recipients left unread in the queue
 958 file, the scheduler can't operate with exact counts of recipient entries. With
 959 unread recipients, it is not clear how many recipient entries there will be, as
 960 they are subject to per-destination grouping. It is not even clear to what
 961 transports (and thus jobs) the recipients will be assigned. And with messages
 962 coming from the deferred queue, it is not even clear how many unread recipients
 963 are still to be delivered. This all means that the scheduler must use only
 964 estimates of how many recipients entries there will be. Fortunately, it is
 965 possible to estimate the minimum and maximum correctly, so the scheduler can
 966 always err on the safe side. Obviously, the better the estimates, the better
 967 results, so it is best when we are able to read all recipients in-core and turn
 968 the estimates into exact counts, or at least try to read as many as possible to
 969 make the estimates as accurate as possible.
 970
 971 The third complication is that it is no longer true that the scheduler is done
 972 with a job once all of its in-core recipients are delivered. It is possible
 973 that the job will be revived later, when another batch of recipients is read in
 974 core. It is also possible that some jobs will be created for the first time
 975 long after the first batch of recipients was read in core. The nqmgr code must
 976 be ready to handle all such situations.
 977
 978 And finally, the fourth complication is that the nqmgr code must somehow impose
 979 the recipient limit itself. Now how does it achieve it?
 980
 981 Perhaps the easiest solution would be to say that each message may have at
 982 maximum X recipients stored in-core, but such solution would be poor for
 983 several reasons. With reasonable qmgr_message_active_limit values, the X would
 984 have to be quite low to maintain reasonable memory footprint. And with low X
 985 lots of things would not work well. The nqmgr would have problems to use the
 986 transport_destination_recipient_limit efficiently. The scheduler's preemption
 987 would be suboptimal as the recipient count estimates would be inaccurate. The
 988 message queue file would have to be accessed many times to read in more
 989 recipients again and again.
 990
 991 Therefore it seems reasonable to have a solution which does not use a limit
 992 imposed on per-message basis, but which maintains a pool of available recipient
 993 slots, which can be shared among all messages in the most efficient manner. And
 994 as we do not want separate transports to compete for resources whenever
 995 possible, it seems appropriate to maintain such recipient pool for each
 996 transport separately. This is the general idea, now how does it work in
 997 practice?
 998
 999 First we have to solve little chicken-and-egg problem. If we want to use the
1000 per-transport recipient pools, we first need to know to what transport(s) is
1001 the message assigned. But we will find that out only after we read in the
1002 recipients first. So it is obvious that we first have to read in some
1003 recipients, use them to find out to what transports is the message to be
1004 assigned, and only after that we can use the per-transport recipient pools.
1005
1006 Now how many recipients shall we read for the first time? This is what
1007 qmgr_message_recipient_minimum and qmgr_message_recipient_limit values control.
1008 The qmgr_message_recipient_minimum value specifies how many recipients of each
1009 message we will read for the first time, no matter what. It is necessary to
1010 read at least one recipient before we can assign the message to a transport and
1011 create the first job. However, reading only qmgr_message_recipient_minimum
1012 recipients even if there are only few messages with few recipients in-core
1013 would be wasteful. Therefore if there is less than qmgr_message_recipient_limit
1014 recipients in-core so far, the first batch of recipients may be larger than
1015 qmgr_message_recipient_minimum - as large as is required to reach the
1016 qmgr_message_recipient_limit limit.
1017
1018 Once the first batch of recipients was read in core and the message jobs were
1019 created, the size of the subsequent recipient batches (if any - of course it's
1020 best when all recipients are read in one batch) is based solely on the position
1021 of the message jobs on their corresponding transports' job lists. Each
1022 transport has a pool of transport_recipient_limit recipient slots which it can
1023 distribute among its jobs (how this is done is described later). The subsequent
1024 recipient batch may be as large as the sum of all recipient slots of all jobs
1025 of the message permits (plus the qmgr_message_recipient_minimum amount which
1026 always applies).
1027
1028 For example, if a message has three jobs, first with 1 recipient still in-core
1029 and 4 recipient slots, second with 5 recipient in-core and 5 recipient slots,
1030 and third with 2 recipients in-core and 0 recipient slots, it has 1+5+2=7
1031 recipients in-core and 4+5+0=9 jobs' recipients slots in total. This means that
1032 we could immediately read 2+qmgr_message_recipient_minimum more recipients of
1033 that message in core.
1034
1035 The above example illustrates several things which might be worth mentioning
1036 explicitly: first, note that although the per-transport slots are assigned to
1037 particular jobs, we can't guarantee that once the next batch of recipients is
1038 read in core, that the corresponding amounts of recipients will be assigned to
1039 those jobs. The jobs lend its slots to the message as a whole, so it is
1040 possible that some jobs end up sponsoring other jobs of their message. For
1041 example, if in the example above the 2 newly read recipients were assigned to
1042 the second job, the first job sponsored the second job with 2 slots. The second
1043 notable thing is the third job, which has more recipients in-core than it has
1044 slots. Apart from sponsoring by other job we just saw it can be result of the
1045 first recipient batch, which is sponsored from global recipient pool of
1046 qmgr_message_recipient_limit recipients. It can be also sponsored from the
1047 message recipient pool of qmgr_message_recipient_minimum recipients.
1048
1049 Now how does each transport distribute the recipient slots among its jobs? The
1050 strategy is quite simple. As most scheduler activity happens on the head of the
1051 job list, it is our intention to make sure that the scheduler has the best
1052 estimates of the recipient counts for those jobs. As we mentioned above, this
1053 means that we want to try to make sure that the messages of those jobs have all
1054 recipients read in-core. Therefore the transport distributes the slots "along"
1055 the job list from start to end. In this case the job list sorted by message
1056 enqueue time is used, because it doesn't change over time as the scheduler's
1057 job list does.
1058
1059 More specifically, each time a job is created and appended to the job list, it
1060 gets all unused recipient slots from its transport's pool. It keeps them until
1061 all recipients of its message are read. When this happens, all unused recipient
1062 slots are transferred to the next job (which is now in fact now first such job)
1063 on the job list which still has some recipients unread, or eventually back to
1064 the transport pool if there is no such job. Such transfer then also happens
1065 whenever a recipient entry of that job is delivered.
1066
1067 There is also a scenario when a job is not appended to the end of the job list
1068 (for example it was created as a result of second or later recipient batch).
1069 Then it works exactly as above, except that if it was put in front of the first
1070 unread job (that is, the job of a message which still has some unread
1071 recipients in queue file), that job is first forced to return all of its unused
1072 recipient slots to the transport pool.
1073
1074 The algorithm just described leads to the following state: The first unread job
1075 on the job list always gets all the remaining recipient slots of that transport
1076 (if there are any). The jobs queued before this job are completely read (that
1077 is, all recipients of their message were already read in core) and have at
1078 maximum as many slots as they still have recipients in-core (the maximum is
1079 there because of the sponsoring mentioned before) and the jobs after this job
1080 get nothing from the transport recipient pool (unless they got something before
1081 and then the first unread job was created and enqueued in front of them later -
1082 in such case the also get at maximum as many slots as they have recipients in-
1083 core).
1084
1085 Things work fine in such state for most of the time, because the current job is
1086 either completely read in-core or has as much recipient slots as there are, but
1087 there is one situation which we still have to take care of specially. Imagine
1088 if the current job is preempted by some unread job from the job list and there
1089 are no more recipient slots available, so this new current job could read only
1090 batches of qmgr_message_recipient_minimum recipients at a time. This would
1091 really degrade performance. For this reason, each transport has extra pool of
1092 transport_extra_recipient_limit recipient slots, dedicated exactly for this
1093 situation. Each time an unread job preempts the current job, it gets half of
1094 the remaining recipient slots from the normal pool and this extra pool.
1095
1096 And that's it. It sure does sound pretty complicated, but fortunately most
1097 people don't really have to care how exactly it works as long as it works.
1098 Perhaps the only important things to know for most people are the following
1099 upper bound formulas:
1100
1101 Each transport has at maximum
1102
1103     max(
1104     qmgr_message_recipient_minimum * qmgr_message_active_limit
1105     + *_recipient_limit + *_extra_recipient_limit,
1106     qmgr_message_recipient_limit
1107     )
1108
1109 recipients in core.
1110
1111 The total amount of recipients in core is
1112
1113     max(
1114     qmgr_message_recipient_minimum * qmgr_message_active_limit
1115     + sum( *_recipient_limit + *_extra_recipient_limit ),
1116     qmgr_message_recipient_limit
1117     )
1118
1119 where the sum is over all used transports.
1120
1121 And this terribly complicated chapter concludes the documentation of nqmgr
1122 scheduler.
1123
1124 [By now you should theoretically know the nqmgr scheduler inside out. In
1125 practice, you still hope that you will never have to really understand the last
1126 or last two chapters completely, and fortunately most people really won't.
1127 Understanding how the scheduler works in ideal conditions is more than good
1128 enough for vast majority of users.]
1129
1130 C\bCr\bre\bed\bdi\bit\bts\bs
1131
1132   * Wietse Venema designed and implemented the initial queue manager with per-
1133     domain FIFO scheduling, and per-delivery +/-1 concurrency feedback.
1134   * Patrik Rak designed and implemented preemption where mail with fewer
1135     recipients can slip past mail with more recipients in a controlled manner,
1136     and wrote up its documentation.
1137   * Wietse Venema initiated a discussion with Patrik Rak and Victor Duchovni on
1138     alternatives for the +/-1 feedback scheduler's aggressive behavior. This is
1139     when K/N feedback was reviewed (N = concurrency). The discussion ended
1140     without a good solution for both negative feedback and dead site detection.
1141   * Victor Duchovni resumed work on concurrency feedback in the context of
1142     concurrency-limited servers.
1143   * Wietse Venema then re-designed the concurrency scheduler in terms of the
1144     simplest possible concepts: less-than-1 concurrency feedback per delivery,
1145     forward and reverse concurrency feedback hysteresis, and pseudo-cohort
1146     failure. At this same time, concurrency feedback was separated from dead
1147     site detection.
1148   * These simplifications, and their modular implementation, helped to develop
1149     further insights into the different roles that positive and negative
1150     concurrency feedback play, and helped to identify some worst-case
1151     scenarios.
1152