# Some explanations about Desyncs

Last updated: 2014-02-23

## Table of contents

- 1.0) Desync theory
    - 1.1) [OpenTTD multiplayer architecture](#11-openttd-multiplayer-architecture)
    - 1.2) [What is a Desync and how is it detected](#12-what-is-a-desync-and-how-is-it-detected)
    - 1.3) [Typical causes of Desyncs](#13-typical-causes-of-desyncs)
- 2.0) What to do in case of a Desync
    - 2.1) [Cache debugging](#21-cache-debugging)
    - 2.2) [Desync recording](#22-desync-recording)
- 3.0) Evaluating the Desync records
    - 3.1) [Replaying](#31-replaying)
    - 3.2) [Evaluation of the replay](#32-evaluation-of-the-replay)
    - 3.3) [Comparing savegames](#33-comparing-savegames)


## 1.1) OpenTTD multiplayer architecture

  OpenTTD has a huge gamestate, which changes all of the time.
  The savegame contains the complete gamestate at a specific point
  in time. But this state changes completely each tick: Vehicles move
  and trees grow.

  However, most of these changes in the gamestate are deterministic:
  Without a player interfering, a vehicle always follows its orders
  in the same way, and trees always grow the same.

  In OpenTTD, multiplayer synchronisation works by creating a savegame
  when a client joins, and then transferring that savegame to the client,
  so it has the complete gamestate at a fixed point in time.

  Afterwards clients only receive 'commands', that is: Stuff which is
  not predictable, like
   - player actions
   - AI actions
   - GameScript actions
   - Admin Port commands
   - rcon commands
   - ...

  These commands contain the information on how to execute the command,
  and when to execute it. Time is measured in 'network frames'.
  Mind that network frames do not match ingame time. Network frames
  also run while the game is paused, to give a defined behaviour to
  stuff that is executing while the game is paused.

  The deterministic part of the gamestate is run by the clients on
  their own. All they get from the server is the instruction to
  run the gamestate up to a certain network time, which basically
  says that there are no commands scheduled in that time.

  When a client (which includes the server itself) wants to execute
  a command (i.e. a non-predictable action), the following happens:
   - The client calls DoCommandP or DoCommandPInternal.
   - These functions first do a local test-run of the command to
     check simple preconditions. (This just gives the client an
     immediate response without bothering the server and waiting for
     its reply.) The test-run must not actually change the
     gamestate; all changes must be discarded.
   - If the local test-run succeeds, the command is sent to the server.
   - The server inserts the command into the command queue, which
     assigns a network frame to the command, i.e. when it shall be
     executed on all clients.
   - Enhanced with this specific timestamp, the command is sent to all
     clients, which execute the command simultaneously in the same
     network frame in the same order.
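
  The scheme above can be illustrated with a small toy model. This is
  only a sketch and not OpenTTD code: the `Client` class, the tuple used
  as 'savegame' and the integer 'commands' are all made up for this
  example. It shows that two parties which start from the same savegame
  and execute the same commands in the same network frames end up with
  the same gamestate.

```python
import random
from collections import defaultdict


class Client:
    """Toy client: the gamestate is a frame counter, one value and an RNG."""

    def __init__(self, savegame):
        # Joining means loading the savegame transferred by the server.
        self.frame, self.value, rng_state = savegame
        self.rng = random.Random()
        self.rng.setstate(rng_state)
        self.queue = defaultdict(list)  # network frame -> queued commands

    def save(self):
        """Return the complete gamestate at the current network frame."""
        return (self.frame, self.value, self.rng.getstate())

    def receive(self, frame, command):
        """Queue a command for execution in the given network frame."""
        self.queue[frame].append(command)

    def run_until(self, frame):
        """Run the simulation up to and including the given frame."""
        while self.frame < frame:
            self.frame += 1
            for amount in self.queue.pop(self.frame, []):
                self.value += amount              # non-predictable part
            self.value += self.rng.randrange(10)  # deterministic tick


# The 'server' starts a game, executes one command and runs to frame 4.
server = Client((0, 0, random.Random(42).getstate()))
server.receive(3, +100)
server.run_until(4)

# A client joins late: it only gets the savegame and all later commands.
late = Client(server.save())
for client in (server, late):
    client.receive(6, -7)
    client.run_until(10)

print(server.save() == late.save())
```

  Note that the RNG state is part of the toy savegame. If it were left
  out, the late joiner would draw different numbers and diverge at once.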

## 1.2) What is a Desync and how is it detected

  In the ideal case all clients have the same gamestate as the server
  and run in sync. That is, vehicle movement is the same on all
  clients, and commands are executed the same everywhere and
  have the same results.

  When a Desync happens, it means that the gamestates on the clients
  (including the server) are no longer the same. Just imagine
  that a vehicle picks the left line instead of the right line at
  a junction on one client.

  The important thing here is that no one notices when a Desync
  occurs. The desynced client will continue to simulate the gamestate
  and execute commands from the server. Once the gamestate differs
  it will increasingly spiral out of control: If a vehicle picks a
  different route, it will arrive at a different time at a station,
  where it will load different cargo, which causes other vehicles to
  load other stuff, which causes industries to notice different
  servicing, which causes industries to change production, and so on.
  The client could run all day in a different universe.

  To limit how long a Desync can remain unnoticed, the server
  transfers a checksum of the gamestate every now and then.
  Currently this checksum is the state of the random number
  generator of the game logic. A lot of things in OpenTTD depend
  on the RNG, and if the gamestate differs, it is likely that the
  RNG is called at different times, and the state differs when
  checked.

  The clients compare this 'checksum' with the checksum of their
  own gamestate at the specific network frame. If they differ,
  the client disconnects with a Desync error.

  The important thing here is: The detection of the Desync is
  only a last-resort failure detection. It does not give any
  indication of when the Desync happened. The Desync may after
  all have occurred long ago, and just did not affect the checksum
  up to now. The checksum may have matched 10 times or more
  since the Desync happened, and only now has the Desync spiraled
  enough to finally affect the checksum. (There was once a desync
  which was only noticed by the checksum after 20 game years.)
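
  The delay between a Desync and its detection can be reproduced in a
  toy simulation. The sketch below is an illustration only; the
  `simulate` function, the tile numbers and the seed are invented. The
  client starts with a vehicle that is one tile ahead, so its gamestate
  is wrong from the first tick, but the RNG state (the 'checksum') only
  differs once the RNG is consulted at a different time.

```python
import random


def simulate(ticks, start_position):
    """Return the RNG state after every tick of a toy simulation.

    The vehicle advances one tile per tick. The RNG is only consulted
    when the vehicle reaches the station at tile 10.
    """
    rng = random.Random(1234)  # same seed on server and client
    position = start_position
    checksums = []
    for _ in range(ticks):
        position += 1
        if position == 10:
            rng.random()  # e.g. some random decision at the station
        checksums.append(rng.getstate())
    return checksums


server = simulate(12, start_position=0)
client = simulate(12, start_position=1)  # desynced from tick 1 on

# First tick (1-based) at which the 'checksums' differ.
first_mismatch = next(
    tick
    for tick, (s, c) in enumerate(zip(server, client), start=1)
    if s != c
)
print(first_mismatch)  # → 9
```

  In this toy the gamestates differ during all twelve ticks, yet the
  RNG states only differ at tick 9. A check at any other tick would not
  notice anything.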

## 1.3) Typical causes of Desyncs

  Desyncs can be caused by the following scenarios:
   - The savegame does not describe the complete gamestate.
      - Some information which affects the progression of the
        gamestate is not saved in the savegame.
      - Some information which affects the progression of the
        gamestate is not loaded from the savegame.
        This includes the case that something is not completely
        reset before loading the savegame, so data from the
        previous game is carried over to the new one.
   - The gamestate does not behave deterministically.
      - Cache mismatch: The game logic depends on some cached
        values, which are not invalidated properly. This is
        the usual case for NewGRF-specific Desyncs.
      - Undefined behaviour: The game logic performs multiple
        things in an undefined order or with an undefined
        result. E.g. sorting something by a key while
        some keys are equal, or a computation that depends
        on the CPU architecture (32/64 bit, little/big endian).
   - The gamestate is modified when it must not be modified.
      - The test-run of a command alters the gamestate.
      - The gamestate is altered by a player or script without
        using commands.
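
  The sorting case can be shown with a short sketch. It is written in
  Python for brevity and does not mirror any OpenTTD data structure; the
  `Vehicle` class and its fields are hypothetical. Python's sort is
  stable, so the order of equal keys depends on the incoming order. With
  an unstable sort (such as C++ `std::sort`) the order of equal keys is
  not specified at all. In both cases the remedy is the same: break ties
  with a value that is itself part of the gamestate.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True)
class Vehicle:
    index: int     # unique, identical on all clients
    priority: int  # sort key, may be equal for several vehicles


def fragile_order(vehicles):
    # Equal priorities keep their incoming order, which is not
    # guaranteed to be the same on every client.
    return sorted(vehicles, key=attrgetter("priority"))


def deterministic_order(vehicles):
    # Ties are broken by the unique index, so the result only
    # depends on the gamestate.
    return sorted(vehicles, key=attrgetter("priority", "index"))


# Same set of vehicles, but held in a different order on the client.
server_list = [Vehicle(1, 5), Vehicle(2, 5), Vehicle(3, 1)]
client_list = [Vehicle(2, 5), Vehicle(1, 5), Vehicle(3, 1)]

print(fragile_order(server_list) == fragile_order(client_list))              # → False
print(deterministic_order(server_list) == deterministic_order(client_list))  # → True
```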


## 2.1) Cache debugging

  Desyncs which are caused by improper cache invalidation can
  often be found by enabling cache validation:
   - Start OpenTTD with '-d desync=2'.
   - This will enable validation of caches every tick.
     That is, cached values are recomputed every tick and compared
     to the cached value.
   - Differences are logged to 'commands-out.log' in the autosave
     folder.

  Mind that this type of debugging can also be done in singleplayer.
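
  The idea behind cache validation is simple enough to sketch. The code
  below is not how OpenTTD implements it; the `Train` class, the
  deliberately buggy method and the message format are invented for
  this illustration. It recomputes a cached value from the underlying
  data and reports every mismatch.

```python
class Train:
    def __init__(self, wagon_weights):
        self.wagon_weights = list(wagon_weights)
        self.cached_weight = sum(self.wagon_weights)  # the cache

    def add_wagon_buggy(self, weight):
        # Bug on purpose: the cached weight is not updated.
        self.wagon_weights.append(weight)


def validate_caches(trains):
    """Recompute every cached value and describe the stale ones."""
    problems = []
    for number, train in enumerate(trains):
        expected = sum(train.wagon_weights)
        if train.cached_weight != expected:
            problems.append(
                f"train {number}: cached weight {train.cached_weight}, "
                f"expected {expected}"
            )
    return problems


trains = [Train([10, 20]), Train([5])]
trains[1].add_wagon_buggy(7)

for line in validate_caches(trains):
    print(line)  # → train 1: cached weight 5, expected 12
```

  Running such a check every tick is expensive, but it pins the problem
  to the tick in which the cache became stale.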

## 2.2) Desync recording

  If you have a server which encounters Desyncs often,
  you can enable recording of the gamestate alterations. This
  will later allow you to replay the gamestate and locate the
  cause of the Desync.

  There are two levels of Desync recording, which are enabled
  via '-d desync=2' and '-d desync=3' respectively. Both will record
  all commands to a file 'commands-out.log' in the autosave folder.

  If you have the savegame from the start of the server and
  this command log, you can replay the whole game (see Section 3.1).

  If you do not start the server from a savegame, there will
  also be a savegame created just after a map has been generated.
  The savegame will be named 'dmp_cmds_*.sav' and be put into
  the autosave folder.

  In addition to that, '-d desync=3' also creates regular savegames
  at defined spots in network time (more defined than regular
  autosaves). These will be created in the autosave folder
  and will also be named 'dmp_cmds_*.sav'.

  These saves allow comparing the gamestate with the original
  gamestate during replaying, and thus greatly help debugging.
  However, they also take a lot of disk space.


## 3.1) Replaying

  To replay a Desync recording, you need these files:
   - The savegame from when the server was started, or
     the automatically created savegame from when the map
     was generated.
   - The 'commands-out.log' file.
   - Optionally the 'dmp_cmds_*.sav' files.

  Put these files into a safe spot. (Not your autosave folder!)

  Next, prepare your OpenTTD for replaying:
   - Get the same version of OpenTTD as the original server was running.
   - Uncomment/enable the define 'DEBUG_DUMP_COMMANDS' in
     'src/network/network_func.h'.
     (DEBUG_FAILED_DUMP_COMMANDS is explained in Section 3.2.)
   - Put the 'commands-out.log' into the root save folder, and rename
     it to 'commands.log'. Strip everything up to and including the
     "newgame" entry from the log.
   - Run 'openttd -D -d desync=0 -g startsavegame.sav'.
     This replays the server log. Use '-d desync=3' to also create a
     new 'commands-out.log' and 'dmp_cmds_*.sav' in your autosave folder.
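
  Stripping the log can be done by hand or with a small helper such as
  the sketch below. It rests on assumptions: the exact layout of a log
  entry is not described here, so the helper merely looks for the first
  line that contains the text "newgame" and drops everything up to and
  including that line. The function names and the sample entries are
  made up. Check the result before replaying.

```python
from pathlib import Path


def strip_until_newgame(lines, marker="newgame"):
    """Drop everything up to and including the first line with `marker`.

    If no line contains the marker, the lines are returned unchanged.
    """
    for number, line in enumerate(lines):
        if marker in line:
            return list(lines[number + 1:])
    return list(lines)


def write_replay_log(source, target):
    """Read 'commands-out.log' and write the stripped 'commands.log'."""
    lines = Path(source).read_text(encoding="utf-8").splitlines(keepends=True)
    Path(target).write_text("".join(strip_until_newgame(lines)), encoding="utf-8")


# Placeholder entries, not the real log format.
sample = ["entry A\n", "newgame entry\n", "entry B\n", "entry C\n"]
print(strip_until_newgame(sample))  # → ['entry B\n', 'entry C\n']
```

  A call like `write_replay_log("commands-out.log", "commands.log")`
  would then produce the file for the replay.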

## 3.2) Evaluation of the replay

  The replay also compares the checksums which are part of
  the 'commands-out.log' with the replayed gamestate.
  If they differ, it will trigger a 'NOT_REACHED'.

  If the replay succeeds without mismatch, that is, the replay reproduces
  the original server state:
   - Repeat the replay starting from incrementally later 'dmp_cmds_*.sav'
     while truncating the 'commands.log' at the beginning appropriately.
     The 'dmp_cmds_*.sav' can be your own ones from the first replay, or
     the ones from the original server (if you have them).
     (This simulates the view of joining clients during the game.)
   - If one of those replays fails, you have located the Desync between
     the last dmp_cmds that reproduces the replay and the first one
     that fails.

  If the replay hits a mismatch, check the logs for failed commands.
  Then you may try to replay with DEBUG_FAILED_DUMP_COMMANDS enabled.
  If the replay then fails, the command test-run of the failed command
  modified the game state.

  If you have the original 'dmp_cmds_*.sav', you can also compare those
  savegames with your own ones from the replay. You can also comment out
  or disable the 'NOT_REACHED' mentioned above, to get another
  'dmp_cmds_*.sav' from the replay after the mismatch has already been
  detected.
  See Section 3.3 on how to compare savegames.
  If the saves differ, you have located the Desync between the last dmp_cmds
  that match and the first one that does not. The difference between the
  saves may point you in the direction of what causes it.

  If the replay succeeds without mismatch, and you do not have any
  'dmp_cmds_*.sav' from the original server, it is a lost cause.
  Enable creation of the 'dmp_cmds_*.sav' on the server, and wait for the
  next Desync.

  Finally, you can also compare the 'commands-out.log' from the original
  server with the one from the replay. They will differ in stuff like
  dates, and the original log will contain the chat, but otherwise they
  should match.
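
  The search over incrementally later 'dmp_cmds_*.sav' described earlier
  in this section boils down to a simple scan. The sketch below only
  shows the bookkeeping. `replay_matches` is a hypothetical callable
  standing in for the manual step of replaying from one savegame and
  checking for a mismatch, and the file names are placeholders.

```python
def locate_desync(savegames, replay_matches):
    """Return (last_good, first_bad), or None if every replay matches.

    savegames: savegame names ordered by network time.
    replay_matches: callable that replays from one savegame and returns
        True if the replay reproduced the recorded checksums.
    """
    last_good = None
    for save in savegames:
        if not replay_matches(save):
            return last_good, save
        last_good = save
    return None


# Placeholder names and a faked replay result for illustration.
saves = ["start.sav", "dmp_cmds_a.sav", "dmp_cmds_b.sav", "dmp_cmds_c.sav"]
failing = {"dmp_cmds_b.sav", "dmp_cmds_c.sav"}

print(locate_desync(saves, lambda save: save not in failing))
# → ('dmp_cmds_a.sav', 'dmp_cmds_b.sav')
```

  The scan is linear on purpose. Nothing guarantees that all savegames
  after the first failing one fail as well, so a binary search could
  skip over the interesting interval.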

## 3.3) Comparing savegames

  The binary form of the savegames from the original server and from
  your replay will always differ:
   - The savegame contains paths to used NewGRF files.
   - The gamelog will log your loading of the savegame.
   - The savegame data of AIs and the Gamescript will differ.
     Scripts are not run during the replay, only their recorded commands
     are replayed. Their internal state will thus not change in the
     replay and will differ.

  To compare savegames more semantically, it is easiest to first export
  them to a JSON format, for example with:

  https://github.com/TrueBrain/OpenTTD-savegame-reader

  By running:

  python -m savegame_reader --export-json dmp_cmds_NNN.sav | jq . > NNN.json

  Now you can use any (JSON) diff tool to compare the two savegames in a
  somewhat human readable way.
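
  If no diff tool is at hand, a few lines of Python are enough for a
  first look. The sketch below is a generic recursive comparison and
  assumes nothing about the layout of the exported JSON. The keys in
  the sample data are invented. A missing key is reported like a `null`
  value, and lists of different length are reported as a whole.

```python
import json


def diff_json(left, right, path=""):
    """Yield (path, left_value, right_value) for every difference."""
    if isinstance(left, dict) and isinstance(right, dict):
        for key in sorted(set(left) | set(right)):
            yield from diff_json(left.get(key), right.get(key), f"{path}/{key}")
    elif isinstance(left, list) and isinstance(right, list) and len(left) == len(right):
        for number, (a, b) in enumerate(zip(left, right)):
            yield from diff_json(a, b, f"{path}/{number}")
    elif left != right:
        yield path, left, right


def diff_files(left_name, right_name):
    """Compare two exported JSON files."""
    with open(left_name, encoding="utf-8") as f, open(right_name, encoding="utf-8") as g:
        return list(diff_json(json.load(f), json.load(g)))


# Invented sample data standing in for two exported savegames.
original = {"date": 100, "vehicles": [{"x": 10, "y": 4}]}
replayed = {"date": 100, "vehicles": [{"x": 11, "y": 4}]}

print(list(diff_json(original, replayed)))  # → [('/vehicles/0/x', 10, 11)]
```

  Expect some noise from the differences listed at the start of this
  section (NewGRF paths, gamelog, script data). Those paths can be
  filtered out of the result before looking at the rest.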
 265   Now you can use any (JSON) diff tool to compare the two savegames in a
 266   somewhat human readable way.