docs/devel/migration/postcopy.rst

========
Postcopy
========

.. contents::

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound
on the amount of migration traffic and the time it takes; the down side is that
during the postcopy phase, a failure of *either* side causes the guest to be
lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
=================

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.
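
The commands above are monitor (HMP) commands.  As a minimal sketch, assuming
a management tool that speaks QMP instead, the same sequence could be built as
JSON messages.  The helper name ``qmp`` and the destination URI are
hypothetical, and the snippet only prints the messages; it does not connect to
a QEMU socket.

```python
import json


def qmp(command, **arguments):
    """Serialise one QMP command as a JSON string (hypothetical helper)."""
    message = {"execute": command}
    if arguments:
        message["arguments"] = arguments
    return json.dumps(message)


# Sent to both source and destination before migration starts.
enable_postcopy = qmp(
    "migrate-set-capabilities",
    capabilities=[{"capability": "postcopy-ram", "state": True}],
)

# Sent to the source only: the migration still starts in precopy mode.
# The URI is a placeholder, not a value taken from this document.
start_migration = qmp("migrate", uri="tcp:dest.example.org:4444")

# Sent to the source at any later point to switch to postcopy.
switch_to_postcopy = qmp("migrate-start-postcopy")

for line in (enable_postcopy, start_migration, switch_to_postcopy):
    print(line)
```

Keeping the three messages separate mirrors the text: the capability is set
first, the migration starts as precopy, and the switch is a distinct step.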

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value in the reply shows the overlapped blocking
time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of blocking
times per vCPU.
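
As an illustration, the two fields could be pulled out of a ``query-migrate``
reply as sketched below.  The reply shown is a hypothetical, trimmed-down
sample (a real reply carries many more fields), and ``summarise_blocktime`` is
a name invented for this example.

```python
# Hypothetical sample reply; the values are made up for illustration.
sample_reply = {
    "return": {
        "status": "completed",
        "postcopy-blocktime": 12,
        "postcopy-vcpu-blocktime": [12, 3, 0, 7],
    }
}


def summarise_blocktime(reply):
    """Return (overlapped blocktime, per-vCPU list, index of worst vCPU)."""
    info = reply["return"]
    overlapped = info.get("postcopy-blocktime")
    per_vcpu = info.get("postcopy-vcpu-blocktime", [])
    # The fields are absent when the capability was not enabled.
    worst = max(range(len(per_vcpu)), key=per_vcpu.__getitem__) if per_vcpu else None
    return overlapped, per_vcpu, worst


overlapped, per_vcpu, worst = summarise_blocktime(sample_reply)
print(f"overlapped={overlapped} per-vCPU={per_vcpu} worst vCPU index={worst}")
```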

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy internals
==================

State machine
-------------

Postcopy moves through a series of states (see ``postcopy_state``) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of the POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    now carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
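
The ordering of the states can be sketched as a small model.  This is only an
illustration of the sequence described above, not QEMU's implementation: the
class and function names are invented for the example, and only the five
states named in this section are modelled.

```python
from enum import Enum


class PostcopyState(Enum):
    # Each value records the trigger described in the text above.
    ADVISE = "POSTCOPY_ADVISE command received"
    DISCARD = "first 'discard' command received"
    LISTEN = "POSTCOPY_LISTEN command received"
    RUNNING = "POSTCOPY_RUN command received"
    END = "migration complete"


ORDER = list(PostcopyState)


def next_state(current):
    """Return the state after `current`; None means no postcopy state yet."""
    if current is None:
        return ORDER[0]
    index = ORDER.index(current)
    if index + 1 == len(ORDER):
        raise ValueError("END is the final state")
    return ORDER[index + 1]


state = None
visited = []
while state is not PostcopyState.END:
    state = next_state(state)
    visited.append(state.name)

print("->".join(visited))  # ADVISE->DISCARD->LISTEN->RUNNING->END
```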

Device transfer
---------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up.  This is
achieved by 'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream,
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.
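
The layout of the package can be pictured with the following sketch.  It only
models the ordering of the contents as tuples; the function name and the
device names are hypothetical, and the real stream is a binary format rather
than Python objects.

```python
def build_package(device_sections):
    """Return a model of CMD_PACKAGED with its contents in stream order."""
    contents = [("command", "postcopy listen")]
    # Device state sections, everything except postcopiable devices (RAM).
    contents += [("section", name) for name in device_sections]
    contents.append(("command", "postcopy run"))
    return ("command", "CMD_PACKAGED", contents)


# Hypothetical device names, used only to make the example concrete.
package = build_package(["cpu", "virtio-net"])
for entry in package[2]:
    print(entry)
```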

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy).
However, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.
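
The effect of restarting the scan can be seen in a toy model.  This is a
sketch under simplifying assumptions, not the QEMU code: pages are plain
integers, the dirty set stands in for the bitmap, and ``requests`` maps "number
of pages already sent" to the page requested at that moment (a made-up way to
simulate the arrival time of a request).

```python
def send_order(dirty, num_pages, requests):
    """Return the order in which dirty pages are sent.

    dirty: iterable of dirty page indexes, each in range(num_pages)
    requests: {pages_sent_so_far: requested_page_index}
    """
    dirty = set(dirty)
    order = []
    cursor = 0
    while dirty:
        requested = requests.get(len(order))
        if requested is not None and requested in dirty:
            # A page request restarts the scan at the requested location.
            cursor = requested
        # Advance to the next dirty page, wrapping around at the end.
        while cursor not in dirty:
            cursor = (cursor + 1) % num_pages
        order.append(cursor)
        dirty.discard(cursor)  # the dirty bit is cleared once the page is sent
        cursor = (cursor + 1) % num_pages
    return order


# Page 5 is requested after two pages have been sent in the background.
print(send_order(range(8), 8, {2: 5}))  # [0, 1, 5, 6, 7, 2, 3, 4]
```

Pages 6 and 7 follow the requested page 5 before the scan wraps around to the
pages it skipped, which is the locality effect described above.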

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into ``qemu_loadvm_state_main``
   to process the contents of the package (2), which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3) (i.e. the first command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.  The listen thread
   loads normal background page data (b), but if a fault happens during a
   device load (5), the returned page (c) is loaded by the listen thread,
   allowing the main thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Source side page bitmap
-----------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the source CPUs are stopped and nothing
should dirty anything any more.  Instead, dirty bits are cleared when the
relevant pages are sent during postcopy.

Postcopy features
=================

Postcopy recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When an
error happens (in this case, mostly network errors), QEMU cannot easily
fail the migration because VM data resides in both source and destination
QEMU instances.  Instead, when an issue happens, QEMU on both sides
will go into a paused state.  A recovery phase is needed to continue a
paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMUs will go into the
    **POSTCOPY_PAUSED** migration state.

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.
    Source QEMU will go into the **POSTCOPY_RECOVER_SETUP** state, trying to
    re-establish the channels.

  - When both sides of QEMU successfully reconnect using a new or fixed up
    channel, they will go into the **POSTCOPY_RECOVER** state.  A handshake
    procedure is then needed to properly synchronize the VM states between
    the two QEMUs to continue the postcopy migration.  For example, pages
    can be sent right during the window when the network is interrupted;
    the handshake guarantees that pages lost in flight are resent.

  - After a proper handshake synchronization, QEMU will continue the
    postcopy migration on both sides and go back to the **POSTCOPY_ACTIVE**
    state.
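
The two QMP commands from the steps above could look roughly like the
following sketch.  Both URIs are hypothetical placeholders, and the list of
states simply restates the source-side sequence from the steps; nothing here
talks to a running QEMU.

```python
import json

# Destination: provide the new channel to listen on (placeholder URI).
recover_on_destination = {
    "execute": "migrate-recover",
    "arguments": {"uri": "tcp:0:4445"},
}

# Source: resume the paused migration over the new channel (placeholder URI).
resume_on_source = {
    "execute": "migrate",
    "arguments": {"uri": "tcp:dest.example.org:4445", "resume": True},
}

# Source-side migration states, in the order the steps describe them.
SOURCE_RECOVERY_STATES = [
    "POSTCOPY_PAUSED",
    "POSTCOPY_RECOVER_SETUP",
    "POSTCOPY_RECOVER",
    "POSTCOPY_ACTIVE",
]

print(json.dumps(recover_on_destination))
print(json.dumps(resume_on_source))
print(" -> ".join(SOURCE_RECOVERY_STATES))
```

The destination command is issued first so that a channel exists by the time
the source tries to reconnect.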

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by any access to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the
VM thread will be halted waiting for the page to be migrated, which means it
can be halted until the recovery is complete.

The impact of accessing missing pages can depend on the configuration of
the guest.  For example, with async page fault enabled, the guest can
logically schedule out the threads accessing missing pages proactively.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs-backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
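
The figure in (d) follows from simple arithmetic, sketched below.  The
calculation assumes the link is fully available to the page transfer and
ignores protocol overhead, so real numbers will be somewhat worse.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
LINK_BITS_PER_SECOND = 10e9  # 10Gbps link, as in the text


def transfer_seconds(page_bytes, link_bits_per_second=LINK_BITS_PER_SECOND):
    """Ideal time to push one page across the link."""
    return page_bytes * 8 / link_bits_per_second


one_gb_page = transfer_seconds(GIB)
two_mb_page = transfer_seconds(2 * MIB)

print(f"1GB hugepage: {one_gb_page:.2f} s")         # about 0.86 s
print(f"2MB hugepage: {two_mb_page * 1000:.2f} ms")  # about 1.68 ms
```

A faulting thread is therefore blocked for the better part of a second with
1GB pages, against under two milliseconds with 2MB pages.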

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU.  There are restrictions on the
types of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
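
The role of the mapping table can be illustrated with a small sketch.  The
structure, field names, addresses and the ``pc.ram`` block name are all
assumptions made for the example; the actual table exchanged over vhost-user
has its own layout.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    client_start: int     # start of the region in the client's address space
    length: int           # size of the region in bytes
    ramblock: str         # name of the RAMBlock backing the region
    ramblock_offset: int  # offset of the region inside that RAMBlock


def translate(table, fault_addr):
    """Convert a client fault address to (RAMBlock name, offset), or None."""
    for m in table:
        if m.client_start <= fault_addr < m.client_start + m.length:
            return m.ramblock, m.ramblock_offset + (fault_addr - m.client_start)
    return None


# Hypothetical table with two regions of the same RAMBlock.
table = [
    Mapping(0x7F0000000000, 0x200000, "pc.ram", 0x0),
    Mapping(0x7F0040000000, 0x100000, "pc.ram", 0x200000),
]

print(translate(table, 0x7F0000001000))  # ('pc.ram', 4096)
print(translate(table, 0x7F0040000010))  # ('pc.ram', 2097168)
print(translate(table, 0x1000))          # None
```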

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy preemption mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release.  It
allows urgent pages (those explicitly requested by the destination QEMU
because of a page fault) to be sent over a separate preempt channel, rather
than queued in the background migration channel.  Anyone who cares about the
latency of page faults during a postcopy migration should enable this
feature.  It is not enabled by default.
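
A minimal sketch of enabling the feature over QMP is shown below.  The
capability name ``postcopy-preempt`` is not spelled out in the text above and
is used here as an assumption; the snippet only builds and prints the message.

```python
import json

# Assumption: the capability is called "postcopy-preempt" and is enabled
# alongside "postcopy-ram" before the migration starts.
enable_preempt = {
    "execute": "migrate-set-capabilities",
    "arguments": {
        "capabilities": [
            {"capability": "postcopy-ram", "state": True},
            {"capability": "postcopy-preempt", "state": True},
        ]
    },
}

print(json.dumps(enable_preempt))
```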