1 6.828 2011 Lecture 19: Virtual Machines
3 Read: A comparison of software and hardware techniques for x86
4 virtualization, Keith Adams and Ole Agesen, ASPLOS 2006.
6 what's a virtual machine?
7 simulation of a computer
8 running as an application on a host computer
14 one computer, multiple operating systems (OSX and Windows)
15 manage big machines (allocate CPUs/memory at o/s granularity)
16 kernel development environment (like qemu)
17 better fault isolation: contain break-ins
19 how accurate does the simulation need to be?
20 handle weird quirks of operating system kernels
21 reproduce bugs exactly
22 handle malicious software
23 cannot let guest break out of virtual machine!
25 impossible for guest to distinguish VM from real computer
26 impossible for guest to escape its VM
27 some VMs compromise, require guest kernel modifications
30 1960s: IBM used VMs to share big machines
31 1990s: VMware re-popularized VMs, for x86 hardware
34 [diagram: h/w, VMM, VMs..]
36 guest: kernel, user programs
37 VMM might run in a host O/S, e.g. OSX
38 or VMM might be stand-alone
41 divide memory among guests
42 time-share CPU among guests
43 simulate per-guest virtual disk, network
44 really e.g. slice of real disk
47 VMM interprets each guest instruction
48 maintain virtual machine state for each guest
52 idea: execute guest instructions on real CPU when possible
53 works fine for most instructions
55 how to prevent guest from executing privileged instructions?
56 could then wreck the VMM, other guests, &c
58 idea: run each guest kernel at CPL=3
59 ordinary instructions work fine
60 privileged instructions will (usually) trap to the VMM
61 VMM can apply the privileged operation to *virtual* state
62 not to the real hardware
65 Trap-and-emulate example -- CLI / STI
66 VMM maintains virtual IF for guest
67 VMM controls hardware IF
68 Probably leaves interrupts enabled when guest runs
69 Even if a guest uses CLI to disable them
70 VMM looks at virtual IF to decide when to interrupt guest
71 When guest executes CLI or STI:
72 Protection violation, since guest at CPL=3
74 VMM looks at *virtual* CPL
75 If 0, changes *virtual* IF
76 If not 0, emulates a protection trap to guest kernel
77 VMM must cause guest to see only virtual IF
78 and completely hide/protect real IF
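sketch of the CLI/STI handling above, as a toy Python model. not from the lecture: GuestState, on_protection_fault and on_hardware_interrupt are made-up names, and "delivering" an interrupt is just appending to a list.

```python
class GuestState:
    """Per-guest virtual CPU state kept by the VMM (hypothetical names)."""
    def __init__(self):
        self.virtual_cpl = 0      # what the guest believes its CPL is
        self.virtual_if = True    # what the guest believes IF is
        self.pending_irqs = []    # interrupts postponed while virtual IF is off
        self.delivered = []       # log of events injected into the guest


def on_protection_fault(guest, instr):
    """Called when CLI/STI traps because the guest runs at real CPL=3."""
    if guest.virtual_cpl != 0:
        # Guest user code tried a privileged instruction:
        # emulate a protection trap to the guest kernel.
        guest.delivered.append("protection-trap")
        return
    if instr == "cli":
        guest.virtual_if = False
    elif instr == "sti":
        guest.virtual_if = True
        # Deliver whatever was postponed while virtual IF was off.
        while guest.pending_irqs:
            guest.delivered.append(guest.pending_irqs.pop(0))


def on_hardware_interrupt(guest, irq):
    """Real IF stays on; virtual IF decides when the guest sees the interrupt."""
    if guest.virtual_if:
        guest.delivered.append(irq)
    else:
        guest.pending_irqs.append(irq)
```

usage: after on_protection_fault(g, "cli"), a call to on_hardware_interrupt(g, "timer") only queues the interrupt; the guest sees it once it executes STI.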
80 trap-and-emulate is hard on an x86
81 not all privileged instructions trap at CPL=3
82 popf silently ignores changes to interrupt flag
83 pushf reveals *real* interrupt flag
84 all those traps can be slow
85 VMM must see PTE writes, which don't use privileged instructions
87 what real x86 state do we have to hide (i.e. != virtual state)?
88 CPL (low bits of CS) since it is 3, guest expecting 0
89 gdt descriptors (DPL 3, not 0)
90 gdtr (pointing to shadow gdt)
91 idt descriptors (traps go to VMM, not guest kernel)
93 pagetable (doesn't map to expected physical addresses)
94 %cr3 (points to shadow pagetable)
98 how can VMM give guest kernel illusion of dedicated physical memory?
99 guest wants to start at PA=0, use all "installed" DRAM
100 VMM must support many guests, they can't all really use PA=0
101 VMM must protect one guest's memory from other guests
103 claim DRAM size is smaller than real DRAM
104 ensure paging is enabled
105 maintain a "shadow" copy of guest's page table
106 shadow maps VAs to different PAs than guest
107 real %cr3 refers to shadow page table
108 virtual %cr3 refers to guest's page table
110 VMM allocates a guest phys mem 0x1000000 to 0x2000000
111 VMM gets trap if guest changes %cr3 (since guest kernel at CPL=3)
112 VMM copies guest's pagetable to "shadow" pagetable
113 VMM adds 0x1000000 to each PA in shadow table
114 VMM checks that each PA is < 0x2000000
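sketch of the copy / add / check steps above. assumptions: a page table is modeled as a flat dict from VA page to PA page (real x86 tables are two-level), build_shadow is a made-up name, and an out-of-range entry is simply left unmapped so the access faults into the VMM.

```python
GUEST_BASE = 0x1000000   # start of the real memory given to this guest
GUEST_LIMIT = 0x2000000  # end (exclusive) of that memory


def build_shadow(guest_pt):
    """guest_pt maps virtual page address -> guest-physical page address.

    Returns a shadow table mapping the same VAs to real physical addresses.
    """
    shadow = {}
    for va, guest_pa in guest_pt.items():
        real_pa = guest_pa + GUEST_BASE
        if GUEST_BASE <= real_pa < GUEST_LIMIT:
            shadow[va] = real_pa
        # else: guest named memory it does not own; leave it unmapped
    return shadow
```

the guest's own table is left untouched, so the guest still reads back the PAs it wrote.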
116 Why can't VMM just modify the guest's page-table in-place?
118 also shadow the GDT, IDT
119 real IDT refers to VMM's trap entry points
120 VMM can forward to guest kernel if needed
121 VMM may also fake interrupts from virtual disk
122 real GDT allows execution of guest kernel by CPL=3
124 note we rely on h/w trapping to VMM if guest writes %cr3, gdtr, &c
125 do we also need a trap if guest *read*s?
127 do all instructions that read/write sensitive state cause traps at CPL=3?
128 push %cs will show CPL=3, not 0
129 sgdt reveals real GDTR
131 suppose guest turned IF off
132 VMM will leave real IF on, just postpone interrupts to guest
133 popf ignores IF if CPL=3, no trap
134 so VMM won't know if guest kernel wants interrupts
135 IRET: no ring change so won't restore SS/ESP
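toy model of the popf problem above: IF changes only when CPL <= IOPL, otherwise it silently keeps its old value and nothing traps. assumptions: only the IF bit is modeled (real POPF has more rules, e.g. for the IOPL field itself), and popf here is a made-up Python function, not an emulator.

```python
IF_BIT = 1 << 9   # interrupt-enable flag in EFLAGS


def popf(eflags, popped, cpl, iopl=0):
    """Return the new EFLAGS after POPF loads `popped` (IF handling only)."""
    if cpl <= iopl:
        return popped
    # Not privileged enough: IF keeps its old value, and no trap is raised,
    # so a VMM relying on traps never learns what the guest wanted.
    return (popped & ~IF_BIT) | (eflags & IF_BIT)
```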
137 how can we cope with non-trapping instructions that reveal real state?
138 modify guest code, change them to INT 3, which traps
139 keep track of original instruction, emulate in VMM
140 INT 3 is one byte, so doesn't change code size/layout
141 this is a simplified version of the paper's Binary Translation
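sketch of the INT 3 patching idea. the opcode bytes are real x86 (0xCC = int3, 0x9D = popf, 0x90 = nop); the Patcher class and its method names are made up, and the emulation itself is left out.

```python
INT3 = 0xCC  # one-byte x86 breakpoint opcode


class Patcher:
    def __init__(self, code):
        self.code = bytearray(code)   # guest code as the CPU will execute it
        self.saved = {}               # offset -> original instruction bytes

    def patch(self, offset, length):
        """Overwrite the first byte of the instruction at offset with INT 3."""
        self.saved[offset] = bytes(self.code[offset:offset + length])
        self.code[offset] = INT3

    def on_breakpoint(self, offset):
        """Trap handler: return the original instruction for the VMM to emulate."""
        return self.saved[offset]
```

usage: Patcher(bytes([0x90, 0x9D, 0x90])).patch(1, 1) turns the popf into int3 and leaves the code the same length.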
143 how does rewriter know where instruction boundaries are?
144 or whether bytes are code or data?
145 can VMM look at symbol table for function entry points?
147 idea: scan only as executed, since execution reveals instr boundaries
148 original start of kernel (making up these instructions):
160 when VMM first loads guest kernel, rewrite from entry to first jump
161 replace bad instrs (popf) with int3
162 replace jump with int3
163 then start the guest kernel
165 look where the jump could go (now we know the boundaries)
166 for each branch, xlate until first jump again
167 replace int3 w/ original branch
169 keep track of what we've rewritten, so we don't do it again
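sketch of scan-as-executed, under heavy simplification: guest code is a list of (mnemonic, jump target) pairs with the list index as the address, so instruction lengths don't appear. Rewriter, scan and on_int3 are made-up names; only direct jumps are handled.

```python
BAD = {"popf", "pushf", "sgdt"}   # non-trapping instrs that reveal real state


class Rewriter:
    def __init__(self, program):
        self.original = list(program)   # what the guest wrote
        self.live = list(program)       # what the CPU actually executes
        self.scanned = set()            # entry points already rewritten

    def scan(self, entry):
        """Rewrite from entry up to and including the first jump."""
        if entry in self.scanned:
            return
        self.scanned.add(entry)
        pc = entry
        while pc < len(self.original):
            op, _target = self.original[pc]
            if op in BAD or op == "jmp":
                self.live[pc] = ("int3", None)
            if op == "jmp":
                break
            pc += 1

    def on_int3(self, pc):
        """Trap handler. Returns the next guest pc."""
        op, target = self.original[pc]
        if op == "jmp":
            self.scan(target)                   # target is a known boundary now
            self.live[pc] = self.original[pc]   # direct jump: no more traps
            return target
        # bad instruction: VMM would emulate it on virtual state (omitted)
        return pc + 1
```

note that bytes never reached by execution (data between blocks) are never touched.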
171 indirect calls/jumps?
172 same, but can't replace int3 with the original jump
173 since we're not sure address will be the same next time
174 so must take a trap every time
176 ret (function return)?
177 == indirect jump via ptr on stack
178 can't assume that ret PC on stack is from a call
179 so must take a trap every time. slow!
181 what if guest reads or writes its own code?
182 can't let guest see int3
183 must re-rewrite any code the guest modifies
184 can we use page protections to trap and emulate reads/writes?
185 no: can't set up PTE for X but no R
186 perhaps make CS != DS
187 put rewritten code in CS
188 put original code in DS
189 write-protect original code pages
192 re-rewrite if already rewritten
193 tricky: must find first instruction boundary in overwritten code
195 do we need to rewrite guest user-level code?
196 technically yes: SGDT, IF
197 but probably not in practice
198 user code only does INT, which traps to VMM
200 how to handle pagetable?
201 remember VMM keeps shadow pagetable w/ different PAs in PTEs
202 scan the whole pagetable on every %cr3 load?
203 to create the shadow page table
205 what if guest writes %cr3 often, during context switches?
206 idea: lazy population of shadow page table
207 start w/ empty shadow page table (just VMM mappings)
208 so guest will generate many page faults after it loads %cr3
209 VMM page fault handler just copies needed PTE to shadow pagetable
210 restarts guest, no guest-visible page fault
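sketch of lazy shadow population. assumptions: flat dict page tables, a single fixed offset per guest, no bounds check, and made-up names (LazyShadow, translate). translate stands in for one guest memory access.

```python
class LazyShadow:
    def __init__(self, guest_pt, base):
        self.guest_pt = guest_pt   # guest's table: VA page -> guest PA page
        self.base = base           # offset of this guest's memory in real DRAM
        self.shadow = {}           # starts empty (VMM's own mappings omitted)
        self.hidden_faults = 0     # faults the guest never sees

    def load_cr3(self, guest_pt):
        """Guest wrote %cr3: switch tables and start with an empty shadow."""
        self.guest_pt = guest_pt
        self.shadow = {}

    def translate(self, va):
        """Return the real PA, or None for a true guest page fault."""
        if va not in self.shadow:
            # Hardware page fault into the VMM.
            if va not in self.guest_pt:
                return None    # guest has no mapping: forward fault to guest
            self.hidden_faults += 1
            self.shadow[va] = self.guest_pt[va] + self.base
        return self.shadow[va]
```

each page costs one hidden fault after every %cr3 load, which is the cost the next idea tries to avoid.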
212 what if guest frequently switches among a set of page tables?
213 as it context-switches among running processes
214 probably doesn't modify them, so re-scan (or lazy faults) wasted
215 idea: VMM could cache multiple shadow page tables
216 cache indexed by address of guest pagetable
217 start with pre-populated page table on guest %cr3 write
218 would make context switch much faster
220 what if guest kernel writes a PTE?
221 store instruction is not privileged, no trap
222 does VMM need to know about that write?
223 yes, if VMM is caching multiple page tables
224 idea: VMM can write-protect guest's PTE pages
225 trap on PTE write, emulate the write, and update the shadow pagetable too
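sketch combining the shadow cache with write-protected PTE pages. assumptions: flat dict page tables, made-up names (ShadowCache, on_pte_write), and the write-protection trap is modeled as an explicit call.

```python
class ShadowCache:
    def __init__(self, base):
        self.base = base    # offset of this guest's memory in real DRAM
        self.cache = {}     # guest pagetable address -> shadow table

    def load_cr3(self, pt_addr, guest_pt):
        """Guest %cr3 write traps here. Reuse a cached shadow if there is one."""
        if pt_addr not in self.cache:
            self.cache[pt_addr] = {
                va: pa + self.base for va, pa in guest_pt.items()
            }
        return self.cache[pt_addr]

    def on_pte_write(self, pt_addr, guest_pt, va, new_pa):
        """Trap on a store to a write-protected PTE page: emulate the store
        and mirror it into the cached shadow, if any."""
        guest_pt[va] = new_pa
        if pt_addr in self.cache:
            self.cache[pt_addr][va] = new_pa + self.base
```

switching back to a cached table costs no rescan and no hidden faults, but every guest PTE write now traps.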
227 this is the three-way tradeoff the paper talks about
228 trace costs / hidden page faults / context switch cost
229 reducing one requires more of the others
230 and all three are expensive
232 how to guard guest kernel against writes by guest programs?
234 delete kernel PTEs on IRET, re-install on INT?
236 how to handle devices?
238 DMA addresses are physical, VMM must translate and check
239 rarely makes sense for guest to use real device
240 want to share w/ other guests
241 each guest gets a part of the disk
242 each guest looks like a distinct Internet host
243 each guest gets an X window
244 VMM might mimic some standard ethernet or disk controller
245 regardless of actual h/w on host computer
246 or guest might run special drivers that jump to VMM
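sketch of the "translate and check" step for DMA addresses mentioned above. assumptions: the guest owns one contiguous region of real memory (base, size), and translate_dma is a made-up helper, not any real device model's API.

```python
def translate_dma(guest_pa, length, base, size):
    """Translate a guest-physical DMA buffer to a real physical address.

    Raises ValueError if any byte of the buffer is outside the guest's memory.
    """
    if length <= 0 or guest_pa < 0 or guest_pa + length > size:
        raise ValueError("DMA outside guest memory")
    return base + guest_pa
```

the check covers the whole buffer, not just its first byte, since the device would otherwise write past the end of the guest's memory.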
251 How to cope with instructions that reveal privileged state?
252 e.g. pushf, looking at low bits of %cs
253 How to avoid expensive traps?
255 VMware's answer: binary translation (BT)
256 Replace offending instructions with code that does the right thing
257 Code must have access to VMM's virtual state for that guest
260 CLI/STI/pushf/popf -- read/write virtual IF
261 Detect memory stores that modify PTEs
262 Write-protect pages, trap the first time, and rewrite
263 New sequence modifies shadow pagetable as well as the guest's own pagetable
265 How to hide VMM state from guest code?
266 Since unprivileged BT code now reads/writes VMM state
267 Put VMM state in very high memory
268 Use segment limits to prevent guest from using last few pages
269 But set up %gs to allow BT code to get at those pages
272 Hard to find instruction boundaries, instructions vs data
273 Translated code is a different size
274 Thus code pointers are different
275 Program expects to see original fn ptrs, return PCs on stack
276 Translated code must map before use
277 Thus every RET needs to look up in VMM state
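sketch of why RET needs a lookup: the stack holds original PCs (what the guest expects to see), so translated code must map them to translated addresses before jumping. TranslationCache and its methods are made-up names; the paper's actual data structures are not shown here.

```python
class TranslationCache:
    def __init__(self):
        self.orig_to_xlat = {}   # original guest PC -> translated code address

    def add(self, orig_pc, xlat_pc):
        self.orig_to_xlat[orig_pc] = xlat_pc

    def call(self, stack, return_orig_pc, target_orig_pc):
        """Translated CALL: push the *original* return PC, jump to translation."""
        stack.append(return_orig_pc)
        return self.orig_to_xlat[target_orig_pc]

    def ret(self, stack):
        """Translated RET: pop an original PC and map it before jumping.

        Returns None if that code has not been translated yet.
        """
        orig_pc = stack.pop()
        return self.orig_to_xlat.get(orig_pc)
```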
279 Intel/AMD hardware support for virtual machines
280 has made it much easier to implement a VMM w/ reasonable performance
281 h/w itself directly maintains per-guest virtual state
282 CS (w/ CPL), EFLAGS, idtr, &c
283 h/w knows it is in "guest mode"
284 instructions directly modify virtual state
285 avoids lots of traps to VMM
286 h/w basically adds a new priv level
287 VMM mode, CPL=0, ..., CPL=3
288 guest-mode CPL=0 is not fully privileged
289 no traps to VMM on system calls
290 h/w handles CPL transition
291 what about memory, pagetables?
292 h/w supports *two* page tables
295 guest memory refs go through double lookup
296 each phys addr in guest pagetable translated through VMM's pagetable
297 thus guest can directly modify its page table w/o VMM having to shadow it
298 no need for VMM to write-protect guest pagetables
299 no need for VMM to track %cr3 changes
300 and VMM can ensure guest uses only its own memory
301 only map guest's memory in VMM page table
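sketch of the double lookup. assumptions: both tables are flat dicts keyed by page address, and nested_translate is a made-up name. real hardware also sends the guest page table's own addresses through the VMM's table during the walk, which this omits.

```python
PAGE = 0x1000


def nested_translate(va, guest_pt, vmm_pt):
    """Guest VA -> guest PA via guest_pt, then guest PA -> real PA via vmm_pt.

    Returns (real_pa, None) on success or (None, reason) on a fault.
    """
    vpage, offset = va & ~(PAGE - 1), va & (PAGE - 1)
    if vpage not in guest_pt:
        return None, "guest page fault"   # handled by the guest kernel
    gpa_page = guest_pt[vpage]
    if gpa_page not in vmm_pt:
        return None, "VMM fault"          # guest named memory it does not own
    return vmm_pt[gpa_page] + offset, None
```

the guest can edit guest_pt freely; isolation comes only from what the VMM puts in vmm_pt.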