src/docs/user/field/exit_codes.diviner

   1 @title Command Line Exit Codes
   2 @group fieldmanual
   3
   4 Explains the use of exit codes in Phabricator command line scripts.
   5
   6 Overview
   7 ========
   8
   9 When you run a command from the command line, it exits with an //exit code//.
  10 This code is normally not shown on the CLI, but you can examine the exit code
  11 of the last command you ran by looking at `$?` in your shell:
  12
  13   $ ls
  14   ...
  15   $ echo $?
  16   0
  17
  18 Programs which run commands can inspect exit codes, and shell constructs like
  19 `cmdx && cmdy` branch on them: `cmdy` runs only if `cmdx` exits with code `0`.
  20
  21 The code `0` means success. Other codes signal some sort of error or status
  22 condition, depending on the system and command.
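
As a rough illustration of how a program (rather than an interactive shell) reads an exit code, here is a minimal Python sketch. The helper name `exit_code_of` and the child commands are stand-ins invented for this example; they are not part of Phabricator.

```python
import subprocess
import sys


def exit_code_of(argv):
    """Run a command, wait for it to finish, and return its exit code."""
    return subprocess.run(argv).returncode


# Stand-in child commands: each starts a Python interpreter that exits with
# a chosen code, so the example does not depend on any other installed tool.
ok = exit_code_of([sys.executable, "-c", "raise SystemExit(0)"])
failed = exit_code_of([sys.executable, "-c", "raise SystemExit(3)"])

print(ok)      # 0
print(failed)  # 3
```

This is the programmatic equivalent of looking at `$?` after running a command.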
  23
  24 With rare exception, Phabricator uses //all other codes// to signal
  25 **catastrophic failure**.
  26
  27 This is an explicit architectural decision and one we are unlikely to deviate
  28 from: generally, we will not accept patches which give a command a nonzero exit
  29 code to indicate an expected state, an application status, or a minor abnormal
  30 condition.
  31
  32 Generally, this decision reflects a philosophical belief that attaching
  33 application semantics to exit codes is a relic of a simpler time, and that
  34 they are not appropriate for communicating application state in a modern
  35 operational environment. This document explains the reasoning behind our use of
  36 exit codes in more detail.
  37
  38 In particular, this approach is informed by a focus on operating Phabricator
  39 clusters at scale. This is not a common deployment scenario, but we consider it
  40 the most important one. Our use of exit codes makes it easier to deploy and
  41 operate a Phabricator cluster at larger scales. It makes it slightly harder to
  42 deploy and operate a small cluster or single host by gluing together `bash`
  43 scripts. We are willingly trading the small scale away for advantages at larger
  44 scales.
  45
  46
  47 Problems With Exit Codes
  48 ========================
  49
  50 We do not use exit codes to communicate application state because doing so
  51 makes it harder to write correct scripts, and the primary benefit is that it
  52 makes it easier to write incorrect ones.
  53
  54 This is somewhat at odds with the philosophy of "worse is better", but a modern
  55 operations environment faces different forces than the interactive shell did
  56 in the 1970s, particularly at scale.
  57
  58 We consider correctness to be very important to modern operations environments.
  59 In particular, we manage a Phabricator cluster (Phacility) and believe that
  60 having reliable, repeatable processes for provisioning, configuration and
  61 deployment is critical to maintaining and scaling our operations. Our use of
  62 exit codes makes it easier to implement processes that are correct and reliable
  63 on top of Phabricator management scripts.
  64
  65 Exit codes as signals for application state are problematic because they are
  66 ambiguous: you can't use them to distinguish between dissimilar failure states
  67 which should prompt very different operational responses.
  68
  69 Exit codes primarily make writing things like `bash` scripts easier, but we
  70 think you shouldn't be writing `bash` scripts in a modern operational
  71 environment if you care very much about your software working.
  72
  73 Software environments which are powerful enough to handle errors properly are
  74 also powerful enough to parse command output to unambiguously read and react to
  75 complex state. Communicating application state through exit codes almost
  76 exclusively makes it easier to handle errors in a haphazard way which is often
  77 incorrect.
  78
  79
  80 Exit Codes are Ambiguous
  81 ========================
  82
  83 In many cases, exit codes carry very little information and many different
  84 conditions can produce the same exit code, including conditions which should
  85 prompt very different responses.
  86
  87 The command line tool `grep` searches for text. For example, you might run
  88 a command like this:
  89
  90   $ grep zebra corpus.txt
  91
  92 This searches for the text `zebra` in the file `corpus.txt`. If the text is
  93 not found, `grep` exits with a nonzero exit code (specifically, `1`).
  94
  95 Suppose you run `grep zebra corpus.txt` and observe a nonzero exit code. What
  96 does that mean? These are //some// of the possible conditions which are
  97 consistent with your observation:
  98
  99   - The text `zebra` was not found in `corpus.txt`.
 100   - `corpus.txt` does not exist.
 101   - You do not have permission to read `corpus.txt`.
 102   - `grep` is not installed.
 103   - You do not have permission to run `grep`.
 104   - There is a bug in `grep`.
 105   - Your `grep` binary is corrupt.
 106   - `grep` was killed by a signal.
 107
 108 If you're running this command interactively on a single machine, it's probably
 109 OK for all of these conditions to be conflated. You aren't going to examine the
 110 exit code anyway (it isn't even visible to you by default), and `grep` likely
 111 printed useful information to `stderr` if you hit one of the less common issues.
 112
 113 If you're running this command from operational software (like deployment,
 114 configuration or monitoring scripts) and you care about the correctness and
 115 repeatability of your process, we believe conflating these conditions is not
 116 OK. The operational response to text not being present in a file should almost
 117 always differ substantially from the response to the file not being present or
 118 `grep` being broken.
 119
 120 In a particularly bad case, a broken `grep` might cause a careless deployment
 121 script to continue down an inappropriate path and cascade into a more serious
 122 failure.
 123
 124 Even in a less severe case, unexpected conditions should be detected and raised
 125 to operations staff. `grep` being broken or a file that is expected to exist
 126 not existing are both detectable, unexpected, and likely severe conditions, but
 127 they cannot be differentiated and handled by examining the exit code of
 128 `grep`. It is much better to detect and raise these problems immediately than
 129 discover them after a lengthy root cause analysis.
 130
 131 Some of these conditions can be differentiated by examining the specific exit
 132 code of the command instead of acting on all nonzero exit codes. However, many
 133 failure conditions produce the same exit codes (particularly code `1`) and
 134 there is no way to guarantee that a particular code signals a particular
 135 condition, especially across systems.
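
To make the ambiguity concrete, here is a small Python sketch. It does not run `grep` itself: the two child scripts are hypothetical stand-ins for a search tool, one that correctly reports "no match" and one that crashes because of a simulated bug. Both end up with exit code `1`, so a caller looking only at the exit code cannot tell them apart.

```python
import subprocess
import sys

# Hypothetical stand-in for a search tool: exits 1 when the word is absent.
SEARCH_NO_MATCH = "import sys; sys.exit(0 if 'zebra' in 'lion tiger' else 1)"

# The same tool with a bug: the uncaught exception makes Python exit with 1.
SEARCH_BROKEN = "raise RuntimeError('simulated bug in the search tool')"


def exit_code_of(script):
    """Run a Python snippet as a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        stderr=subprocess.DEVNULL,  # hide the simulated traceback
    )
    return result.returncode


no_match = exit_code_of(SEARCH_NO_MATCH)
broken = exit_code_of(SEARCH_BROKEN)

# Both values are 1: the exit code alone cannot distinguish "text not
# found" from "the tool is broken".
print(no_match, broken)
```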
 136
 137 Realistically, it is also relatively rare for scripts to even make an effort to
 138 distinguish between exit codes, and all nonzero exit codes are often treated
 139 the same way.
 140
 141
 142 Bash Scripts are not Robust
 143 ===========================
 144
 145 Exit codes that indicate application status make writing `bash` scripts (or
 146 scripts in other tools which provide a thin layer on top of what is essentially
 147 `bash`) a lot easier and more convenient.
 148
 149 For example, it is pretty tricky to parse JSON in `bash` or with standard
 150 command-line tools, and much easier to react to exit codes. This is sometimes
 151 used as an argument for communicating application status in exit codes.
 152
 153 We reject this because we don't think you should be writing `bash` scripts if
 154 you're doing real operations. Fundamentally, `bash` shell scripts are not a
 155 robust building block for creating correct, reliable operational processes.
 156
 157 Here is one problem with using `bash` scripts to perform operational tasks.
 158 Consider this command:
 159
 160   $ mysqldump | gzip > backup.sql.gz
 161
 162 Now, consider this command:
 163
 164   $ mysqldermp | gzip > backup.sql.gz
 165
 166 These commands represent a fairly standard way to accomplish a task (dumping
 167 a compressed database backup to disk) in a `bash` script.
 168
 169 Note that the second command contains a typo (`dermp` instead of `dump`), so the
 170 shell cannot find `mysqldermp` and that stage fails with a nonzero exit code.
 171
 172 However, both pipelines exit with code `0` (indicating success) because, by
 173 default, the shell reports only the exit code of the last command in a pipeline
 174 (here, `gzip`). Both create a `backup.sql.gz` file. One backs up your data; the
 175 other never does. The second pipeline will never do what the author intended,
 176 but will appear successful under casual inspection.
 177
 178 These behaviors are the same under `set -e`.
 179
 180 This fragile attitude toward error handling is endemic to `bash` scripts. The
 181 default behavior is to continue on errors, and it isn't easy to change this
 182 default. Options like `set -e` are unreliable and it is difficult to detect and
 183 react to errors in fundamental constructs like pipes. The tools that `bash`
 184 scripts employ (like `grep`) emit ambiguous exit codes. Scripts cannot help
 185 but propagate this ambiguity no matter how careful they are with error handling.
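
For comparison, here is a sketch of how a two-stage pipeline might be run from Python with every stage's exit code checked. This is an illustration under stated assumptions, not how Phabricator runs commands: `run_pipeline`, `checked_pipeline`, and the two stand-in commands are names invented for this example, and the failing producer simply mimics a command that cannot be found by exiting with code `127`.

```python
import subprocess
import sys


def run_pipeline(producer_argv, consumer_argv):
    """Run `producer | consumer`; return (producer_code, consumer_code, output)."""
    producer = subprocess.Popen(
        producer_argv, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL
    )
    consumer = subprocess.Popen(
        consumer_argv, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy of the pipe so only the consumer holds the read end.
    producer.stdout.close()
    output, _ = consumer.communicate()
    return producer.wait(), consumer.returncode, output


def checked_pipeline(producer_argv, consumer_argv):
    """Like run_pipeline, but raise if any stage exits with a nonzero code."""
    producer_code, consumer_code, output = run_pipeline(
        producer_argv, consumer_argv
    )
    if producer_code != 0 or consumer_code != 0:
        raise RuntimeError(
            f"pipeline failed: producer={producer_code}, consumer={consumer_code}"
        )
    return output


# Stand-ins: a producer that fails without writing anything, and a consumer
# that copies stdin to stdout and succeeds even when its input is empty.
FAILING_PRODUCER = [sys.executable, "-c", "import sys; sys.exit(127)"]
COPY_STDIN = [
    sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())",
]

codes = run_pipeline(FAILING_PRODUCER, COPY_STDIN)
print(codes)  # (127, 0, b'')
```

The last stage reports `0` even though the first stage failed, which is the same shape of problem as the backup example. `checked_pipeline` raises in this situation instead of silently returning empty output.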
 186
 187 It is likely //possible// to implement these things safely and correctly in
 188 `bash`, but it is not easy or straightforward. More importantly, it is not the
 189 default: the default behavior of `bash` is to ignore errors and continue.
 190
 191 Gluing commands together in `bash` or something that sits on top of `bash`
 192 makes it easy and convenient to get a process that works fairly well most of
 193 the time at small scales, but we are not satisfied that it represents a robust
 194 foundation for operations at larger scales.
 195
 196
 197 Reacting to State
 198 =================
 199
 200 Instead of communicating application state through exit codes, we generally
 201 communicate application state through machine-parseable output with a success
 202 (`0`) exit code. All nonzero exit codes indicate catastrophic failure which
 203 requires operational intervention.
 204
 205 Callers are expected to request machine-parseable output if necessary (for
 206 example, by passing a `--json` flag or other similar flags), verify the command
 207 exits with a `0` exit code, parse the output, then react to the state it
 208 communicates as appropriate.
 209
 210 In a sufficiently powerful scripting environment (e.g., one with data
 211 structures and a JSON parser), this is straightforward and makes it easy to
 212 react precisely and correctly. It also allows scripts to communicate
 213 arbitrarily complex state. Provided your environment gives you an appropriate
 214 toolset, it is much more powerful and not significantly more complex than using
 215 exit codes.
 216
 217 Most importantly, it allows the calling environment to treat nonzero exit
 218 statuses as catastrophic failure by default.
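
The calling pattern described above can be sketched in Python as follows. Everything named here is an assumption made for illustration: `run_json`, `CatastrophicFailure`, the stand-in command, and the shape of its JSON output (`status`, `matches`) are invented and do not describe any real Phabricator script.

```python
import json
import subprocess
import sys


class CatastrophicFailure(Exception):
    """Raised when a command exits with a nonzero exit code."""


def run_json(argv):
    """Run a command, require exit code 0, and return its parsed JSON output."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        # Any nonzero code is treated as a failure needing intervention.
        raise CatastrophicFailure(
            f"{argv[0]} exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return json.loads(result.stdout)


# Stand-in for a management script invoked with a `--json`-style flag. The
# output format is made up for this example.
FAKE_COMMAND = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'not-found', 'matches': 0}))",
]

state = run_json(FAKE_COMMAND)

# Application state comes from the parsed output, not from the exit code.
if state["status"] == "not-found":
    print("expected state: nothing matched")
```

With this split, an expected condition such as "nothing matched" arrives as data, while a crash, a missing binary, or malformed output surfaces as an exception that the caller cannot mistake for a normal result.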
 219
 220
 221 Moving Forward
 222 ==============
 223
 224 Given these concerns, we are generally unwilling to bring changes which use
 225 exit codes to communicate application state (other than catastrophic failure)
 226 into the upstream. There are some exceptions, but these are rare. In
 227 particular, ease of use in a `bash` environment is not a compelling motivation.
 228
 229 We are broadly willing to make output machine parseable or provide an explicit
 230 machine output mode (often a `--json` flag) if there is a reasonable use case
 231 for it. However, we operate a large production cluster of Phabricator instances
 232 with the tools available in the upstream, so the lack of machine-parseable
 233 output is not sufficient to motivate adding such output on its own: we also
 234 need to understand the problem you're facing, and why it isn't a problem we
 235 face. A simpler or cleaner approach to the problem may already exist.
 236
 237 If you just want to write `bash` scripts on top of Phabricator scripts and you
 238 are unswayed by these concerns, you can often just build a composite command to
 239 get roughly the same effect that you'd get out of an exit code.
 240
 241 For example, you can pipe things to `grep` to convert output into exit codes.
 242 This should generally have failure rates that are comparable to the background
 243 failure level of relying on `bash` as a scripting environment.