   1 6.828 2011 Lecture 19: Virtual Machines
   2
   3 Read: A comparison of software and hardware techniques for x86
   4 virtualization, Keith Adams and Ole Agesen, ASPLOS 2006.
   5
   6 what's a virtual machine?
   7   simulation of a computer
   8   running as an application on a host computer
   9   accurate
  10   isolated
  11   fast
  12
  13 why use a VM?
  14   one computer, multiple operating systems (OSX and Windows)
  15   manage big machines (allocate CPUs/memory at o/s granularity)
  16   kernel development environment (like qemu)
  17   better fault isolation: contain break-ins
  18
  19 how accurate do we need?
  20   handle weird quirks of operating system kernels
  21   reproduce bugs exactly
  22   handle malicious software
  23     cannot let guest break out of virtual machine!
  24   usual goal:
  25     impossible for guest to distinguish VM from real computer
  26     impossible for guest to escape its VM
  27   some VMs compromise, require guest kernel modifications
  28
  29 VMs are an old idea
  30   1960s: IBM used VMs to share big machines
  31   1990s: VMware re-popularized VMs, for x86 hardware
  32
  33 terminology
  34   [diagram: h/w, VMM, VMs..]
  35   VMM ("host")
  36   guest: kernel, user programs
  37   VMM might run in a host O/S, e.g. OSX
  38     or VMM might be stand-alone
  39
  40 VMM responsibilities
  41   divide memory among guests
  42   time-share CPU among guests
  43   simulate per-guest virtual disk, network
  44     really e.g. slice of real disk
  45
  46 why not simulation?
  47   VMM interprets each guest instruction
  48   maintain virtual machine state for each guest
  49     eflags, %cr3, &c
  50   much too slow!
  51
  52 idea: execute guest instructions on real CPU when possible
  53   works fine for most instructions
  54   e.g. add %eax, %ebx
  55   how to prevent guest from executing privileged instructions?
  56     could then wreck the VMM, other guests, &c
  57
  58 idea: run each guest kernel at CPL=3
  59   ordinary instructions work fine
  60   privileged instructions will (usually) trap to the VMM
  61   VMM can apply the privileged operation to *virtual* state
  62     not to the real hardware
  63   "trap-and-emulate"
  64
  65 Trap-and-emulate example -- CLI / STI
  66   VMM maintains virtual IF for guest
  67   VMM controls hardware IF
  68     Probably leaves interrupts enabled when guest runs
  69     Even if a guest uses CLI to disable them
  70   VMM looks at virtual IF to decide when to interrupt guest
  71   When guest executes CLI or STI:
  72     Protection violation, since guest at CPL=3
  73     Hardware traps to VMM
  74     VMM looks at *virtual* CPL
  75       If 0, changes *virtual* IF
  76       If not 0, emulates a protection trap to guest kernel
  77   VMM must cause guest to see only virtual IF
  78     and completely hide/protect real IF
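
The CLI/STI handling above can be sketched as a small model. This is only an illustration, not code from the paper or from any real VMM; `GuestState`, `emulate_cli_sti`, `can_deliver_interrupt`, and the returned strings are names made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class GuestState:
    """Per-guest virtual CPU state kept by the VMM (hypothetical layout)."""
    virtual_cpl: int = 0      # privilege level the guest believes it has
    virtual_if: bool = True   # interrupt flag the guest believes it has


def emulate_cli_sti(guest: GuestState, instr: str) -> str:
    """Run by the VMM after the hardware traps on CLI/STI at real CPL=3."""
    if instr not in ("cli", "sti"):
        raise ValueError("not CLI/STI: " + instr)
    if guest.virtual_cpl != 0:
        # Guest user code tried it: reflect a protection trap to the guest kernel.
        return "inject protection trap into guest kernel"
    # Guest kernel: change only the *virtual* flag; the real IF is untouched.
    guest.virtual_if = (instr == "sti")
    return "resume guest"


def can_deliver_interrupt(guest: GuestState) -> bool:
    """The VMM consults the virtual IF before interrupting the guest."""
    return guest.virtual_if
```

The real IF never appears in this model: the VMM keeps it under its own control and only the virtual flag decides when the guest is interrupted.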
  79
  80 trap-and-emulate is hard on an x86
  81   not all privileged instructions trap at CPL=3
  82     popf silently ignores changes to interrupt flag
  83     pushf reveals *real* interrupt flag
  84   all those traps can be slow
  85   VMM must see PTE writes, which don't use privileged instructions
  86
  87 what real x86 state do we have to hide (i.e. != virtual state)?
  88   CPL (low bits of CS) since it is 3, guest expecting 0
  89   gdt descriptors (DPL 3, not 0)
  90   gdtr (pointing to shadow gdt)
  91   idt descriptors (traps go to VMM, not guest kernel)
  92   idtr
  93   pagetable (doesn't map to expected physical addresses)
  94   %cr3 (points to shadow pagetable)
  95   IF in EFLAGS
  96   %cr0 &c
  97
  98 how can VMM give guest kernel illusion of dedicated physical memory?
  99   guest wants to start at PA=0, use all "installed" DRAM
 100   VMM must support many guests, they can't all really use PA=0
 101   VMM must protect one guest's memory from other guests
 102   idea:
 103     claim DRAM size is smaller than real DRAM
 104     ensure paging is enabled
 105     maintain a "shadow" copy of guest's page table
 106     shadow maps VAs to different PAs than guest
 107     real %cr3 refers to shadow page table
 108     virtual %cr3 refers to guest's page table
 109   example:
 110     VMM allocates a guest phys mem 0x1000000 to 0x2000000
 111     VMM gets trap if guest changes %cr3 (since guest kernel at CPL=3)
 112     VMM copies guest's pagetable to "shadow" pagetable
 113     VMM adds 0x1000000 to each PA in shadow table
 114     VMM checks that each PA is < 0x2000000
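
A minimal sketch of the example above, assuming a flat page table modeled as a dict from page-aligned VA to page-aligned PA (real x86 tables are two-level and carry permission bits). `build_shadow` is a made-up name; the two constants come from the example.

```python
GUEST_BASE = 0x1000000    # start of this guest's slice of real DRAM
GUEST_LIMIT = 0x2000000   # end of the slice (exclusive)


def build_shadow(guest_pt: dict) -> dict:
    """Copy the guest's VA->PA mappings into a shadow table with relocated PAs."""
    shadow = {}
    for va, guest_pa in guest_pt.items():
        machine_pa = guest_pa + GUEST_BASE
        if not (GUEST_BASE <= machine_pa < GUEST_LIMIT):
            # Guest named memory outside its allocation: leave it unmapped,
            # so an access faults to the VMM instead of reaching another guest.
            continue
        shadow[va] = machine_pa
    return shadow
```

The guest's own table is only read here, never changed, so the guest still sees the PAs it wrote.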
 115
 116 Why can't VMM just modify the guest's page-table in-place?
 117
 118 also shadow the GDT, IDT
 119   real IDT refers to VMM's trap entry points
 120     VMM can forward to guest kernel if needed
 121     VMM may also fake interrupts from virtual disk
 122   real GDT allows execution of guest kernel by CPL=3
 123
 124 note we rely on h/w trapping to VMM if guest writes %cr3, gdtr, &c
 125   do we also need a trap if guest *read*s?
 126
 127 do all instructions that read/write sensitive state cause traps at CPL=3?
 128   push %cs will show CPL=3, not 0
 129   sgdt reveals real GDTR
 130   pushf pushes real IF
 131     suppose guest turned IF off
 132     VMM will leave real IF on, just postpone interrupts to guest
 133   popf ignores IF if CPL=3, no trap
 134     so VMM won't know if guest kernel wants interrupts
 135   IRET: no ring change, so won't restore SS/ESP
 136
 137 how can we cope with non-trapping instructions that reveal real state?
 138   modify guest code, change them to INT 3, which traps
 139   keep track of original instruction, emulate in VMM
 140   INT 3 is one byte, so doesn't change code size/layout
 141   this is a simplified version of the paper's Binary Translation
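
The INT 3 patching idea can be sketched as follows. This assumes the rewriter already knows which offsets are instruction starts, and it handles only the one-byte pushf (0x9C) and popf (0x9D) opcodes. `patch`, `on_int3`, and the `saved` table are names invented for this sketch.

```python
INT3 = 0xCC                                  # one-byte breakpoint opcode
SENSITIVE = {0x9C: "pushf", 0x9D: "popf"}    # one-byte sensitive opcodes


def patch(code: bytearray, instr_offsets, saved: dict) -> None:
    """Overwrite sensitive instructions at known instruction starts with INT 3."""
    for off in instr_offsets:
        op = code[off]
        if op in SENSITIVE:
            saved[off] = op      # remember the original for later emulation
            code[off] = INT3     # same size, so code layout is unchanged


def on_int3(saved: dict, patched_offset: int) -> str:
    """VMM breakpoint handler: report which original instruction to emulate."""
    return SENSITIVE[saved[patched_offset]]
```

Because the replacement is exactly one byte, no jump target or code pointer in the guest moves.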
 142
 143 how does rewriter know where instruction boundaries are?
 144   or whether bytes are code or data?
 145   can VMM look at symbol table for function entry points?
 146
 147 idea: scan only as executed, since execution reveals instr boundaries
 148   original start of kernel (making up these instructions):
 149   entry:
 150     pushl %ebp
 151     ...
 152     popf
 153     ...
 154     jnz x
 155     ...
 156     jxx y
 157   x:
 158     ...
 159     jxx z
 160   when VMM first loads guest kernel, rewrite from entry to first jump
 161     replace bad instrs (popf) with int3
 162     replace jump with int3
 163     then start the guest kernel
 164   on int3 trap to VMM
 165     look where the jump could go (now we know the boundaries)
 166     for each branch, xlate until first jump again
 167     replace int3 w/ original branch
 168     re-start
 169   keep track of what we've rewritten, so we don't do it again
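
A toy model of scan-as-executed, under heavy simplification: each instruction occupies one address, the program is a dict invented for this sketch, and only direct jumps appear. `translate_block` and `on_jump_trap` are hypothetical names. `patched` maps each address that currently holds int3 to its original mnemonic.

```python
# Toy program: address -> (mnemonic, direct branch targets).
PROGRAM = {
    0: ("pushl", []),
    1: ("popf", []),
    2: ("jnz", [5]),    # conditional: goes to 5 or falls through to 3
    3: ("nop", []),
    4: ("jmp", [0]),
    5: ("popf", []),
    6: ("jmp", [0]),
}
BAD = {"popf", "pushf"}
JUMPS = {"jnz", "jmp"}


def translate_block(start, patched, done):
    """Rewrite from `start` up to and including the first jump."""
    if start in done:
        return                      # already rewritten, don't do it again
    done.add(start)
    addr = start
    while addr in PROGRAM:
        op, _targets = PROGRAM[addr]
        if op in BAD or op in JUMPS:
            patched[addr] = op      # site now holds int3
        if op in JUMPS:
            break
        addr += 1


def on_jump_trap(addr, patched, done):
    """int3 hit at a direct jump: rewrite its successors, then restore the jump."""
    op, targets = PROGRAM[addr]
    assert op in JUMPS
    successors = list(targets)
    if op == "jnz":
        successors.append(addr + 1)  # fall-through path
    for s in successors:
        translate_block(s, patched, done)
    del patched[addr]                # original jump goes back; no more traps here
```

Once every reachable jump has trapped once, the only int3 sites left are the sensitive instructions themselves.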
 170
 171 indirect calls/jumps?
 172   same, but can't replace int3 with the original jump
 173   since we're not sure address will be the same next time
 174   so must take a trap every time
 175
 176 ret (function return)?
 177   == indirect jump via ptr on stack
 178   can't assume that ret PC on stack is from a call
 179   so must take a trap every time. slow!
 180
 181 what if guest reads or writes its own code?
 182   can't let guest see int3
 183   must re-rewrite any code the guest modifies
 184   can we use page protections to trap and emulate reads/writes?
 185     no: can't set up PTE for X but no R
 186   perhaps make CS != DS
 187     put rewritten code in CS
 188     put original code in DS
 189     write-protect original code pages
 190   on write trap
 191     emulate write
 192     re-rewrite if already rewritten
 193     tricky: must find first instruction boundary in overwritten code
 194
 195 do we need to rewrite guest user-level code?
 196   technically yes: SGDT, IF
 197   but probably not in practice
 198   user code only does INT, which traps to VMM
 199
 200 how to handle pagetable?
 201   remember VMM keeps shadow pagetable w/ different PAs in PTEs
 202   scan the whole pagetable on every %cr3 load?
 203     to create the shadow page table
 204
 205 what if guest writes %cr3 often, during context switches?
 206   idea: lazy population of shadow page table
 207   start w/ empty shadow page table (just VMM mappings)
 208   so guest will generate many page faults after it loads %cr3
 209   VMM page fault handler just copies needed PTE to shadow pagetable
 210     restarts guest, no guest-visible page fault
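
Lazy population might look like the sketch below, again with a flat dict as the page table and the relocation offset from the earlier example. `LazyShadow`, `GuestPageFault`, and `hidden_faults` are names made up here.

```python
GUEST_BASE = 0x1000000
PAGE = 0x1000


class GuestPageFault(Exception):
    """A fault the guest kernel must see: its own table has no mapping."""


class LazyShadow:
    def __init__(self, guest_pt):
        self.hidden_faults = 0
        self.load_cr3(guest_pt)

    def load_cr3(self, guest_pt):
        """Guest wrote %cr3: start over with an empty shadow table."""
        self.guest_pt = guest_pt
        self.shadow = {}

    def access(self, va):
        """Model one guest memory reference; returns the machine address."""
        page, offset = va & ~(PAGE - 1), va & (PAGE - 1)
        if page not in self.shadow:
            self._vmm_page_fault(page)      # hardware faults to the VMM
        return self.shadow[page] + offset   # restarted access now succeeds

    def _vmm_page_fault(self, page):
        if page not in self.guest_pt:
            raise GuestPageFault(hex(page))  # real fault: forward to guest
        self.hidden_faults += 1              # guest never sees this one
        self.shadow[page] = self.guest_pt[page] + GUEST_BASE
```

Each page costs one hidden fault after every %cr3 load, which is the price paid for not scanning the whole table up front.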
 211
 212 what if guest frequently switches among a set of page tables?
 213   as it context-switches among running processes
 214   probably doesn't modify them, so re-scan (or lazy faults) wasted
 215   idea: VMM could cache multiple shadow page tables
 216     cache indexed by address of guest pagetable
 217   start with pre-populated page table on guest %cr3 write
 218   would make context switch much faster
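
A sketch of the cache, keyed by the guest's %cr3 value as described above. `ShadowCache`, `build_shadow`, and the `builds` counter are invented for illustration, and page tables are again flat dicts.

```python
GUEST_BASE = 0x1000000


def build_shadow(guest_pt):
    """Full scan of a guest page table (the expensive step we want to avoid)."""
    return {va: pa + GUEST_BASE for va, pa in guest_pt.items()}


class ShadowCache:
    def __init__(self):
        self.cache = {}     # guest %cr3 value -> shadow page table
        self.builds = 0     # how many full scans were needed

    def on_cr3_write(self, guest_cr3, guest_pt):
        """Trap handler for a guest %cr3 load; returns the shadow to install."""
        if guest_cr3 not in self.cache:
            self.cache[guest_cr3] = build_shadow(guest_pt)
            self.builds += 1
        return self.cache[guest_cr3]
```

A cached shadow goes stale as soon as the guest edits the corresponding page table, so this only works if the VMM also learns about PTE writes.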
 219
 220 what if guest kernel writes a PTE?
 221   store instruction is not privileged, no trap
 222   does VMM need to know about that write?
 223     yes, if VMM is caching multiple page tables
 224   idea: VMM can write-protect guest's PTE pages
 225   trap on PTE write, emulate, also in shadow pagetable
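
The write-protect handler could be sketched like this, assuming the same flat-dict tables and relocation offset as the earlier example. `on_pte_write_trap` is a hypothetical name, and `None` stands in for the guest clearing a PTE.

```python
GUEST_BASE = 0x1000000


def on_pte_write_trap(guest_pt, shadow, va_page, new_guest_pa):
    """Write-protection fault on a guest page-table page: emulate the store."""
    if new_guest_pa is None:
        # Guest cleared the PTE: drop the mapping from both tables.
        guest_pt.pop(va_page, None)
        shadow.pop(va_page, None)
        return
    guest_pt[va_page] = new_guest_pa              # what the guest meant to write
    shadow[va_page] = new_guest_pa + GUEST_BASE   # keep the shadow in sync
```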
 226
 227 this is the three-way tradeoff the paper talks about
 228   trace costs / hidden page faults / context switch cost
 229   reducing one requires more of the others
 230   and all three are expensive
 231
 232 how to guard guest kernel against writes by guest programs?
 233   both are at CPL=3
 234   delete kernel PTEs on IRET, re-install on INT?
 235
 236 how to handle devices?
 237   trap INB and OUTB
 238   DMA addresses are physical, VMM must translate and check
 239   rarely makes sense for guest to use real device
 240     want to share w/ other guests
 241     each guest gets a part of the disk
 242     each guest looks like a distinct Internet host
 243     each guest gets an X window
 244   VMM might mimic some standard ethernet or disk controller
 245     regardless of actual h/w on host computer
 246   or guest might run special drivers that jump to VMM
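
The "translate and check" step for DMA addresses can be sketched as below. `translate_dma` is a made-up name, and the default `base` and `size` reuse the guest memory range from the earlier example as an assumption.

```python
def translate_dma(guest_pa, length, base=0x1000000, size=0x1000000):
    """Map a guest-physical DMA buffer to a machine address, or refuse it."""
    if length <= 0 or guest_pa < 0 or guest_pa + length > size:
        # Buffer would reach outside this guest's memory.
        raise PermissionError("DMA outside guest memory")
    return base + guest_pa
```

The whole buffer is checked, not just its first byte, since the device would otherwise be able to write past the end of the guest's allocation.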
 247
 248 Today's paper
 249
 250 Two big issues:
 251   How to cope with instructions that reveal privileged state?
 252     e.g. pushf, looking at low bits of %cs
 253   How to avoid expensive traps?
 254
 255 VMware's answer: binary translation (BT)
 256   Replace offending instructions with code that does the right thing
 257     Code must have access to VMM's virtual state for that guest
 258
 259 Example uses of BT
 260   CLI/STI/pushf/popf -- read/write virtual IF
 261   Detect memory stores that modify PTEs
 262     Write-protect pages, trap the first time, and rewrite
 263     New sequence modifies shadow pagetable as well as guest's own
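
What the translated pushf/popf sequences need to compute can be sketched as follows. IF is bit 9 of EFLAGS (0x200). The function names are invented, and the popf rule assumes the guest runs with virtual IOPL=0, so only virtual CPL=0 may change IF.

```python
IF_BIT = 0x200   # bit 9 of EFLAGS


def translated_pushf(real_eflags, virtual_if):
    """Value the guest should see pushed: real flags, but the *virtual* IF."""
    flags = real_eflags & ~IF_BIT
    if virtual_if:
        flags |= IF_BIT
    return flags


def translated_popf(popped, virtual_cpl, virtual_if):
    """Return the new virtual IF after the guest pops `popped` into EFLAGS."""
    if virtual_cpl == 0:
        return bool(popped & IF_BIT)   # guest kernel may change IF
    return virtual_if                  # user-level popf leaves IF unchanged
```

No trap is needed for either instruction: the translated code reads and writes the VMM's virtual IF directly.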
 264
 265 How to hide VMM state from guest code?
 266   Since unprivileged BT code now reads/writes VMM state
 267   Put VMM state in very high memory
 268   Use segment limits to prevent guest from using last few pages
 269   But set up %gs to allow BT code to get at those pages
 270
 271 BT challenges
 272   Hard to find instruction boundaries, instructions vs data
 273   Translated code is a different size
 274     Thus code pointers are different
 275     Program expects to see original fn ptrs, return PCs on stack
 276     Translated code must map original addresses to translated ones before use
 277     Thus every RET needs to look up in VMM state
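
The RET lookup can be modeled as a table from original PCs to translated PCs. This is a sketch only; `TranslationMap` and its methods are hypothetical, and the paper's actual data structures are not shown here.

```python
class TranslationMap:
    def __init__(self):
        self.orig_to_translated = {}

    def add(self, orig_pc, translated_pc):
        """Record where the translation of the code at orig_pc lives."""
        self.orig_to_translated[orig_pc] = translated_pc

    def emulate_ret(self, stack):
        """Pop an *original* return PC and find the translated code to run.

        Returns None if that address has not been translated yet, in which
        case the translator would have to run first.
        """
        orig_pc = stack.pop()
        return self.orig_to_translated.get(orig_pc)
```

The guest's stack keeps holding original addresses, so a guest that inspects its return PCs sees what it would see on real hardware.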
 278
 279 Intel/AMD hardware support for virtual machines
 280   has made it much easier to implement a VMM w/ reasonable performance
 281   h/w itself directly maintains per-guest virtual state
 282     CS (w/ CPL), EFLAGS, idtr, &c
 283   h/w knows it is in "guest mode"
 284     instructions directly modify virtual state
 285     avoids lots of traps to VMM
 286   h/w basically adds a new priv level
 287     VMM mode, CPL=0, ..., CPL=3
 288     guest-mode CPL=0 is not fully privileged
 289   no traps to VMM on system calls
 290     h/w handles CPL transition
 291   what about memory, pagetables?
 292     h/w supports *two* page tables
 293     guest page table
 294     VMM's page table
 295     guest memory refs go through double lookup
 296       each phys addr in guest pagetable translated through VMM's pagetable
 297     thus guest can directly modify its page table w/o VMM having to shadow it
 298       no need for VMM to write-protect guest pagetables
 299       no need for VMM to track %cr3 changes
 300     and VMM can ensure guest uses only its own memory
 301       only map guest's memory in VMM page table
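
The double lookup can be sketched with two flat dicts. This is a simplification: real hardware also translates the addresses of the guest's page-table pages through the VMM's table during the walk. `translate` and the status strings are made up for this sketch.

```python
PAGE = 0x1000


def translate(va, guest_pt, vmm_pt):
    """Guest VA -> guest PA (guest's table), then guest PA -> machine PA (VMM's)."""
    page, offset = va & ~(PAGE - 1), va & (PAGE - 1)
    guest_pa_page = guest_pt.get(page)
    if guest_pa_page is None:
        return ("guest page fault", None)   # handled by the guest kernel
    machine_page = vmm_pt.get(guest_pa_page)
    if machine_page is None:
        return ("fault to VMM", None)       # guest PA is not this guest's memory
    return ("ok", machine_page + offset)
```

The guest can add a mapping to `guest_pt` with no VMM involvement, and anything it maps still has to pass through `vmm_pt`.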
 302