Backwards compatibility
=======================

How backwards compatibility works
---------------------------------

When we do migration, we have two QEMU processes: the source and the
target. There are two cases: they are the same version, or they are
different versions. The easy case is when they are the same version.
The difficult one is when they are different versions.

There are two things that are different, but they have very similar
names and sometimes get confused:

- QEMU version
- machine type version

Let's start with a practical example. We start with:

- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
- qemu-system-x86_64 (v5.1), from now on qemu-5.1.

Related to this are the "latest" machine types defined on each of
them:

- pc-q35-5.2 (newest one in qemu-5.2), from now on pc-5.2.
- pc-q35-5.1 (newest one in qemu-5.1), from now on pc-5.1.

First of all, migration is only supposed to work if you use the same
machine type on both source and destination. The QEMU hardware
configuration also needs to be the same on source and destination.
Most aspects of the backend configuration can be changed at will,
except for a few cases where the backend features influence frontend
device feature exposure. But that is not relevant for this section.

Let's list the combinations that we can have. We start with the
trivial ones, where QEMU is the same on source and destination:

1 - qemu-5.2 -M pc-5.2 -> migrates to -> qemu-5.2 -M pc-5.2

    This is the latest QEMU with the latest machine type.
    This has to work, and if it doesn't work it is a bug.

2 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1

    Exactly the same case as the previous one, but for 5.1.
    Nothing to see here either.

These are the easiest ones, so we will not talk more about them in
this section.

Now we start with the more interesting cases. Consider the case where
we have the same QEMU version on both sides (qemu-5.2), but we are not
using the latest machine type for that version (pc-5.2) but one from
an older QEMU version, in this case pc-5.1.

3 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1

    It needs to use the definition of pc-5.1 and the devices as they
    were configured on 5.1, but this should be easy in the sense that
    both sides are the same QEMU and both sides have exactly the same
    idea of what the pc-5.1 machine is.

4 - qemu-5.1 -M pc-5.2 -> migrates to -> qemu-5.1 -M pc-5.2

    This combination is not possible, as qemu-5.1 doesn't understand
    the pc-5.2 machine type. So there is nothing to worry about here.

Now come the interesting ones, where the two QEMU processes are
different. Notice also that the machine type needs to be pc-5.1,
because of the limitation that qemu-5.1 doesn't know pc-5.2. So
the possible cases are:

5 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1

    This migration is known as newer to older. When we are developing
    5.2, we need to take care not to break migration to qemu-5.1.
    Notice that we can't update qemu-5.1 to understand whatever
    qemu-5.2 decides to change, so it is up to the qemu-5.2 side to
    make the relevant changes.

6 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1

    This migration is known as older to newer. We need to make sure
    that we are able to receive migrations from qemu-5.1. The problem
    is similar to the previous one.

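The six combinations above can be summarised with a small toy model.
This is only an illustrative sketch, not QEMU code: the ``classify``
function and the tuple encoding of versions are assumptions made for
the example. The only rule it encodes is the one stated above, that a
QEMU binary does not know machine types newer than itself.

```python
# Toy classification of the migration combinations (illustrative only).
# QEMU versions and machine-type versions are (major, minor) tuples.

def classify(src_qemu, dst_qemu, machine):
    """Label a migration of machine type `machine` from src_qemu to dst_qemu."""
    # A QEMU binary only knows machine types up to its own version (case 4).
    if machine > src_qemu or machine > dst_qemu:
        return "impossible"
    if src_qemu == dst_qemu:
        # Cases 1, 2 and 3: both sides share one definition of the machine.
        return "same version"
    if src_qemu > dst_qemu:
        return "newer to older"   # case 5
    return "older to newer"       # case 6


V51, V52 = (5, 1), (5, 2)
combinations = [
    (V52, V52, V52),  # case 1
    (V51, V51, V51),  # case 2
    (V52, V52, V51),  # case 3
    (V51, V51, V52),  # case 4
    (V52, V51, V51),  # case 5
    (V51, V52, V51),  # case 6
]
for src, dst, machine in combinations:
    label = classify(src, dst, machine)
    print("qemu-%d.%d -> qemu-%d.%d, pc-%d.%d: %s" % (src + dst + machine + (label,)))
```
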
If qemu-5.1 and qemu-5.2 were the same, there would not be any
compatibility problems. But the reason that we create qemu-5.2 is to
get new features, devices, defaults, etc.

If we get a device that has a new feature, or change a default value,
we have a problem when we try to migrate between different QEMU
versions.

So we need a way to tell qemu-5.2 that when we are using machine type
pc-5.1, it needs to **not** use the feature, to be able to migrate to
qemu-5.1.

The equivalent applies when migrating from qemu-5.1 to qemu-5.2:
qemu-5.2 has to expect that it is not going to get data for the new
feature, because qemu-5.1 doesn't know about it.

How do we tell QEMU about these device feature changes? In the
hw/core/machine.c:hw_compat_X_Y arrays.

If we change a default value, we need to put the old value back in
that array. During initialization, the device needs to look at
that array to see what value it needs to use for that feature. What
we put in that array is the value of a property.

To create a property for a device, we need to use one of the
DEFINE_PROP_*() macros. See include/hw/qdev-properties.h to find the
macros that exist. With it, we set the default value for that
property, and that is what it is going to get in the latest released
version. But if we want a different value for a previous version, we
can change that in the hw_compat_X_Y arrays.

hw_compat_X_Y is an array of records that have the format:

- name_device
- name_property
- value

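As a rough mental model, the lookup can be pictured like this. It is a
sketch only, not how QEMU implements it: ``DEVICE_DEFAULTS``,
``HW_COMPAT``, ``effective_value`` and the device and property names are
all hypothetical. The point is that the property default describes the
newest version, and a compat record overrides it for an older machine
type.

```python
# Hypothetical model of "default value plus compat override".

# (device, property) -> default value in the newest version.
DEVICE_DEFAULTS = {
    ("some-device", "some-feature"): "on",
}

# One list per older machine-type version; each record has the three
# fields described above: device name, property name, value.
HW_COMPAT = {
    "5.1": [("some-device", "some-feature", "off")],
}


def effective_value(machine_version, device, prop):
    """Value a device property gets when running the given machine type."""
    for dev, name, value in HW_COMPAT.get(machine_version, []):
        if dev == device and name == prop:
            return value          # the old value is put back for this machine
    return DEVICE_DEFAULTS[(device, prop)]


print(effective_value("5.2", "some-device", "some-feature"))
print(effective_value("5.1", "some-device", "some-feature"))
```

With the data above, the first call yields the new default and the
second the value recorded for the older machine type.
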
Let's see a practical example.

In qemu-5.2 virtio-blk-device got multi queue support. This is a
change that is not backward compatible. In qemu-5.1 it has one
queue. In qemu-5.2 it has the same number of queues as the number of
vCPUs in the system.

When we are doing migration, if we migrate from a device that has 4
queues to a device that has only one queue, we don't know where to
put the extra information for the other 3 queues, and migration
fails.

There is a similar problem when we migrate from qemu-5.1, which has
only one queue, to qemu-5.2: we only send information for one queue,
but the destination has 4, so 3 queues are not properly initialized
and anything can happen.

So, how can we address this problem? Easy: just convince qemu-5.2
that when it is running pc-5.1, it needs to set the number of queues
for virtio-blk-devices to 1.

That way we fix cases 5 and 6.

5 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1

    qemu-5.2 -M pc-5.1 sets number of queues to be 1.
    qemu-5.1 -M pc-5.1 expects number of queues to be 1.

    Correct: migration works.

6 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1

    qemu-5.1 -M pc-5.1 sets number of queues to be 1.
    qemu-5.2 -M pc-5.1 expects number of queues to be 1.

    Correct: migration works.

And now the other interesting case, case 3. In this case we have:

3 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1

    Here we have the same QEMU on both sides. So it doesn't matter a
    lot whether we have set the number of queues to 1 or not, because
    both sides are the same.

    What is the problem here?

    Think about what happens if we do one of these double migrations:

        A -> migrates -> B -> migrates -> C

    where:

    A: qemu-5.1 -M pc-5.1
    B: qemu-5.2 -M pc-5.1
    C: qemu-5.2 -M pc-5.1

    Migration A -> B is case 6, so the number of queues needs to be 1.

    Migration B -> C is case 3, so we don't care. But actually we
    do care, because we haven't started the guest in qemu-5.2: it came
    migrated from qemu-5.1. So to be on the safe side, we need to
    always use number of queues 1 when we are using pc-5.1.

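The queue-count argument can be played through with a toy model. This
is a sketch under stated assumptions, not QEMU's migration code: device
state is modelled as a list with one entry per queue, loading fails when
the counts differ, the guest is assumed to have 4 vCPUs, and the
``compat`` switch stands in for the rule "pc-5.1 always uses 1 queue".

```python
# Toy model of the virtio-blk queue-count problem (illustrative only).

VCPUS = 4  # assumed vCPU count for this example


def num_queues(qemu, machine, compat=True):
    """Queues a virtio-blk device is configured with in this model."""
    if qemu == "5.1":
        return 1                  # qemu-5.1 only has one queue
    if machine == "pc-5.1" and compat:
        return 1                  # qemu-5.2 told to behave like 5.1
    return VCPUS                  # qemu-5.2 default: one queue per vCPU


def migrate(state, dst_qemu, machine, compat=True):
    """Load `state` on the destination; fail if the queue counts differ."""
    expected = num_queues(dst_qemu, machine, compat)
    if len(state) != expected:
        raise ValueError("queue count mismatch: got %d, expected %d"
                         % (len(state), expected))
    return list(state)


# Double migration A -> B -> C, all with machine type pc-5.1.
state_a = ["queue-state"] * num_queues("5.1", "pc-5.1")  # A: qemu-5.1
state_b = migrate(state_a, "5.2", "pc-5.1")              # A -> B, case 6
state_c = migrate(state_b, "5.2", "pc-5.1")              # B -> C, case 3
print(len(state_c))
```

Calling ``migrate(state_a, "5.2", "pc-5.1", compat=False)`` raises in this
model, which corresponds to the failure described for a destination
that expects 4 queues but only receives data for one.
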
Now, how was this done in reality? The following commit shows how it
was done::

    commit 9445e1e15e66c19e42bea942ba810db28052cd05
    Author: Stefan Hajnoczi <stefanha@redhat.com>
    Date:   Tue Aug 18 15:33:47 2020 +0100

        virtio-blk-pci: default num_queues to -smp N

The relevant parts for migration are::

    @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
         DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
                         true),
    -    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
    +    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
    +                       VIRTIO_BLK_AUTO_NUM_QUEUES),
         DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),

It changes the default value of num_queues. But it fixes it for old
machine types to have the right value::

    GlobalProperty hw_compat_5_1[] = {
    +    { "virtio-blk-device", "num-queues", "1"},

A device with different features on both sides
----------------------------------------------

Let's assume that we are using the same QEMU binary on both sides,
just to make things easier. But we have a device that has
different features on the two sides of the migration. That can be
because the devices are different, because the kernel drivers of the
two devices have different features, or whatever.

How can we get this to work with migration? The way to do that is
"theoretically" easy. You take the features that the device has on
the source of the migration and the features that the device has on
the target of the migration, compute the intersection of the features
of both sides, and that is the way that you should launch QEMU.

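The intersection step itself is trivial to express. The following is a
minimal sketch with made-up feature names; ``common_features`` is a
hypothetical helper that a management application might implement, not
something QEMU provides.

```python
def common_features(source_features, target_features):
    """Features supported on both ends: the only ones safe to enable."""
    return set(source_features) & set(target_features)


# Hypothetical feature sets reported for the same device on two hosts.
source = {"feature-y", "feature-z"}
target = {"feature-z"}

# Launch the device on both sides with only these features enabled.
print(sorted(common_features(source, target)))  # ['feature-z']
```
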
Notice that this is not completely related to QEMU. The most
important thing here is that this should be handled by the managing
application that launches QEMU. If QEMU is configured correctly, the
migration will succeed.

That said, actually doing it is complicated. Almost all devices are
bad at being launched with only some features enabled.
With one big exception: cpus.

You can read the documentation for QEMU x86 cpu models here:

https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html

Notice that when they talk about migration, they recommend choosing the
newest cpu model that is supported by all cpus.

Let's say that we have:

Host A:

Device X has the feature Y

Host B:

Device X does not have the feature Y

If we try to migrate without any care from host A to host B, it will
fail because when migration tries to load the feature Y on
destination, it will find that the hardware is not there.

Doing this would be the equivalent of doing with cpus:

Host A:

$ qemu-system-x86_64 -cpu host

Host B:

$ qemu-system-x86_64 -cpu host

When both hosts have different cpu features this is guaranteed to
fail. Especially if Host B has fewer features than host A. If host A
has fewer features than host B, sometimes it works. The important word
in the last sentence is "sometimes".

So, forgetting about cpu models and continuing with the -cpu host
example, let's say that the difference between the cpus is that Host A
and B have the following features:

Features:   'pcid'  'stibp' 'taa-no'
Host A:        X       X
Host B:                        X

And we want to migrate between them. The way to configure both QEMU
cpus will be:

Host A:

$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off

Host B:

$ qemu-system-x86_64 -cpu host,taa-no=off

And you would be able to migrate between them. It is the
responsibility of the management application or of the user to make
sure that the configuration is correct. QEMU doesn't know how to look
at this kind of features in general.

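A management application could derive those ``-cpu`` arguments
mechanically. The sketch below is only an illustration: ``cpu_option``
is a hypothetical helper, and the feature sets assume (as the two
command lines above imply) that host A has 'pcid' and 'stibp' while
host B has 'taa-no'. Only the features that differ between the hosts
are listed.

```python
def cpu_option(host_features, other_features):
    """Build a -cpu argument that disables what the other host lacks."""
    common = set(host_features) & set(other_features)
    to_disable = sorted(set(host_features) - common)
    return ",".join(["host"] + ["%s=off" % name for name in to_disable])


# Assumed feature sets; only the differing features are modelled.
host_a = {"pcid", "stibp"}
host_b = {"taa-no"}

print("Host A: -cpu", cpu_option(host_a, host_b))
print("Host B: -cpu", cpu_option(host_b, host_a))
```
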
Notice that we don't recommend using -cpu host for migration. It is
used in this example because it makes the example simpler.

Other devices have worse control over individual features. If they
want to be able to migrate between hosts that show different features,
the device needs a way to configure which ones it is going to use.

In this section we have assumed that we are using the same QEMU
binary on both sides of the migration. If we use different QEMU
versions, then we need to take into account all the other
differences, and the examples become even more complicated.

How to mitigate when we have a backward compatibility error
-----------------------------------------------------------

We break migration for old machine types continuously during
development. But as soon as we find that there is a problem, we fix
it. The problem is what happens when we detect, after we have done a
release, that something has gone wrong.

Let's see how it worked with one example.

After the release of qemu-8.0 we found a problem when doing migration
of the machine type pc-7.2.

- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2

  This migration fails

- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2

  This migration fails

So clearly something fails when migrating between qemu-7.2 and
qemu-8.0 with machine type pc-7.2. The error messages and git bisect
pointed to this commit.

In qemu-8.0 we got this commit::

    commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Date:   Thu Mar 2 13:37:02 2023 +0000

        hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register

The relevant bits of the commit for our example are these ones::

    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_MASK_DEFAULT);
    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_SUPPORTED);

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

The patch changes how we configure PCI space for AER. But QEMU fails
when the PCI space configuration is different between source and
destination.

The following commit shows how this got fixed::

    commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
    Author: Leonardo Bras <leobras@redhat.com>
    Date:   Tue May 2 21:27:02 2023 -0300

        hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

The relevant parts of the fix in QEMU are as follows:

First, we create a new property for the device to be able to configure
the old behaviour or the new behaviour::

    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
    index 8a87ccc8b0..5153ad63d6 100644
    --- a/hw/pci/pci.c
    +++ b/hw/pci/pci.c
    @@ -79,6 +79,8 @@ static Property pci_props[] = {
         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
                            failover_pair_id),
         DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0),
    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
         DEFINE_PROP_END_OF_LIST()

Notice that we enable the feature for new machine types.

Now we see how the fix is done. This is going to depend on what kind
of breakage happens, but in this case it is quite simple::

    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
    index 103667c368..374d593ead 100644
    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_MASK_DEFAULT);
    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_SUPPORTED);
    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_MASK_DEFAULT);
    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_SUPPORTED);
    +    }

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

I.e. if the property bit is enabled, we configure it as we did for
qemu-8.0. If the property bit is not set, we configure it as it was in 7.2.

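The effect of gating the new register setup on a property bit can be
mimicked in a few lines. This is a toy stand-in, not the real
``pcie_aer_init()``: the function ``aer_init``, the register names as
strings, and the bit value of ``ERR_UNC_MASK_BIT`` are all invented for
the illustration.

```python
# Made-up bit value for this sketch; not the real QEMU constant.
ERR_UNC_MASK_BIT = 1 << 0


def aer_init(cap_present):
    """Return the names of the registers this toy init function sets up."""
    configured = {"UNCOR_STATUS", "UNCOR_SEVER"}
    if cap_present & ERR_UNC_MASK_BIT:
        # Property bit set: behave like qemu-8.0 and also set up the mask.
        configured.add("UNCOR_MASK")
    return configured


print(sorted(aer_init(ERR_UNC_MASK_BIT)))  # new behaviour
print(sorted(aer_init(0)))                 # old (7.2) behaviour
```
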
And now, all that is missing is disabling the feature for old
machine types::

    diff --git a/hw/core/machine.c b/hw/core/machine.c
    index 47a34841a5..07f763eb2e 100644
    --- a/hw/core/machine.c
    +++ b/hw/core/machine.c
    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
         { "e1000e", "migrate-timadj", "off" },
         { "virtio-mem", "x-early-migration", "false" },
         { "migration", "x-preempt-pre-7-2", "true" },
    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
     };
     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);

And now, when qemu-8.0.1 is released with this fix, all combinations
are going to work as expected.

- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)

So normality has been restored and everything is OK, no?

Not really; now our matrix is much bigger. We started with the easy
cases: migration from the same version to the same version always
works:

- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2
- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2

Now the interesting ones, where the QEMU process versions are
different. For the first set, they fail and we can do nothing: both
versions are released and we can't change anything.

- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2
- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2

These two are the ones that work. The whole point of making the
change in the qemu-8.0.1 release was to fix this issue:

- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2

But now we find that qemu-8.0 can migrate neither to qemu-7.2 nor to
qemu-8.0.1.

- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0 -M pc-7.2

So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
anything except to qemu-8.0.

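The whole matrix follows from a single fact per release: whether it
configures PCI_ERR_UNCOR_MASK for machine type pc-7.2. The sketch below
encodes only what the text above states (7.2 does not, 8.0 does, 8.0.1
does not again) and treats a migration as working when both sides
agree. The names ``MASK_ON_PC_7_2`` and ``migration_works`` are invented
for the example.

```python
from itertools import product

# Does this release configure PCI_ERR_UNCOR_MASK for pc-7.2?
MASK_ON_PC_7_2 = {"7.2": False, "8.0": True, "8.0.1": False}


def migration_works(src, dst):
    """In this model, migration works when both sides configure PCI space alike."""
    return MASK_ON_PC_7_2[src] == MASK_ON_PC_7_2[dst]


for src, dst in product(MASK_ON_PC_7_2, repeat=2):
    result = "works" if migration_works(src, dst) else "fails"
    print("qemu-%s -M pc-7.2 -> qemu-%s -M pc-7.2: %s" % (src, dst, result))
```
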
Can we do better?

Yes. If we know that we are going to do this migration:

- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2

We can launch the appropriate devices with::

    --device...,x-pcie-err-unc-mask=on

And now we can receive a migration from 8.0. And from now on, we can
do that migration to newer QEMU versions if we remember to enable that
property for pc-7.2. Notice that we need to remember: it is not
enough to know that the source of the migration is qemu-8.0. Think of
this example:

$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2

In the second migration, the source is not qemu-8.0, but we still have
that "problem" and need to have that property enabled. Notice that we
need to keep this mark/property until the machine is rebooted. But a
normal reboot is not enough (it doesn't reload QEMU): we need the
machine to poweroff/poweron on a fixed QEMU. And from then on we can
use the proper real machine.

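The "we need to remember" point can be illustrated by extending the toy
model with an explicit override. Again this is only a sketch: ``launch``
and ``migrate`` are hypothetical helpers, the ``mask`` argument stands in
for passing ``x-pcie-err-unc-mask=on`` on the command line, and the
qemu-8.2 entry is an assumption (it is taken to keep the pc-7.2 compat
entry, like qemu-8.0.1).

```python
# Default for machine type pc-7.2 in each release. The "8.2" entry is an
# assumption made for this example.
MASK_DEFAULT_PC_7_2 = {"7.2": False, "8.0": True, "8.0.1": False, "8.2": False}


def launch(qemu, mask=None):
    """Start (or prepare to receive) a pc-7.2 guest; `mask` overrides the default."""
    value = MASK_DEFAULT_PC_7_2[qemu] if mask is None else mask
    return {"qemu": qemu, "mask": value}


def migrate(guest, dst_qemu, mask=None):
    """Migrate `guest`; fail if the destination configures PCI space differently."""
    dst = launch(dst_qemu, mask)
    if dst["mask"] != guest["mask"]:
        raise ValueError("PCI config space differs between source and destination")
    return dst


# The guest was started on qemu-8.0, so the override has to follow it.
guest = launch("8.0")
guest = migrate(guest, "8.0.1", mask=True)
guest = migrate(guest, "8.2", mask=True)   # source is 8.0.1, override still needed
print(guest)
```

Dropping ``mask=True`` on the second hop makes ``migrate`` raise in this
model, even though the source of that hop is qemu-8.0.1 and not
qemu-8.0.
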