6.828 2011 Lecture 10: Crash Recovery, Logging

what is crash recovery?
  you're writing to the file system
  then the power fails
  you reboot
  is your file system still usable?

the main problem:
  crash during multi-step operation
  leaves FS invariants violated
  can lead to ugly FS corruption

examples:
  create:
    new dirent
    allocate file inode
    crash: dirent points to free inode -- disaster!
    crash: inode not free but not used -- not so bad
  write:
    block content
    inode addrs[] and len
    indirect block
    block free bitmap
    crash: inode refers to free block -- disaster!
    crash: block not free but not used -- not so bad
  unlink:
    block free bitmaps
    free inode
    erase dirent

what can we hope for?
  after rebooting and running recovery code
  1. FS internal invariants maintained
     e.g., no block is both in free list and in a file
  2. all but last few operations preserved on disk
     e.g., data I wrote yesterday are preserved
     user might have to check last few operations
  3. no order anomalies
     echo 99 > result ; echo done > status
     (after a crash, status must not say "done" unless result holds 99)

simplifying assumption: disk is fail-stop
  disk executes the writes FS sends it, and does nothing else
    perhaps doesn't perform the very last write
  thus:
    no wild writes
    no decay of sectors

correctness and performance often conflict
  safety => write to disk ASAP
  speed => don't write the disk (batch, write-back cache, sort by track, &c)

we'll discuss two approaches:
  synchronous meta-data update + fsck
  logging (xv6 and Linux EXT3)

synchronous meta-data update
  an old approach to crash recovery
  simple, slow, incomplete

most problem cases look like dangling references
  inode -> free block
  dirent -> free inode

idea: always initialize *on disk* before creating reference
  implement by doing the initialization write,
  waiting for it to complete,
  and only then doing the referencing write
  "synchronous writes"

example: file creation
  what's the right order of synchronous writes?
  1. mark inode as allocated
  2. create directory entry
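
A minimal sketch of the ordering rule on a toy in-memory disk, not xv6 code. The names sync_disk, sync_create, and sync_no_dangling are hypothetical, and nwrites stands in for a crash after that many synchronous writes. With the initialization write first, stopping at any point should never leave a dirent naming a free inode.

```c
#include <stdbool.h>

enum { NINODE = 8, NDIRENT = 8 };

/* Toy on-disk state (hypothetical; not xv6's structures). */
struct sync_disk {
    bool inode_used[NINODE];    /* inode allocation flags */
    int  dirent_inum[NDIRENT];  /* 0 = empty slot, else an i-number */
};

/* create() as two synchronous writes in the safe order. `nwrites` is how
 * many of them reach the disk before a simulated crash (0, 1, or 2). */
void sync_create(struct sync_disk *d, int inum, int slot, int nwrites)
{
    if (nwrites >= 1)
        d->inode_used[inum] = true;   /* 1. initialize: mark inode allocated */
    if (nwrites >= 2)
        d->dirent_inum[slot] = inum;  /* 2. reference: create the dirent */
}

/* The invariant the ordering protects: no dirent names a free inode. */
bool sync_no_dangling(const struct sync_disk *d)
{
    for (int i = 0; i < NDIRENT; i++) {
        int inum = d->dirent_inum[i];
        if (inum != 0 && !d->inode_used[inum])
            return false;
    }
    return true;
}
```

Calling sync_create(&d, 3, 0, 1) on a zeroed disk models a crash after the first write: inode 3 is allocated but unreferenced (the "not so bad" case), and sync_no_dangling(&d) still holds.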

example: file deletion
  1. erase directory entry
  2. erase inode addrs[], mark as free
  3. mark blocks free

example: rename() (not in xv6)
  between directories, i.e. mv d1/x d2/y
  1. create new dirent
  2. erase old dirent
  or the other way around?
  probably safest to create then erase!

what will be true after crash+reboot?
  all completed sys calls guaranteed visible on disk
  reachable part of FS will be mostly correct
    except interrupted rename leaves file in both directories!
  blocks and inodes may be unreferenced but not marked free

so: sync meta-data update system needs to check at reboot
  to free unreferenced inodes and blocks
  descend dir tree from root, remembering all i-numbers and block #s seen
  mark everything else free
  probably have to punt on interrupted rename()
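
The reboot-time check can be sketched as a mark-and-free pass. This is a toy under stated assumptions, not a real fsck: fsck_inode, fsck_mark, and fsck_free_unreachable are hypothetical names, the inode table is assumed to be already in memory, and only inodes are tracked (block numbers would be handled the same way).

```c
#include <stdbool.h>

enum { NINODES = 16, MAXKIDS = 4 };

/* Hypothetical in-memory picture of the inode table. For a directory,
 * kids[] lists the i-numbers its entries name. */
struct fsck_inode {
    bool allocated;   /* marked in-use on disk */
    int  nkids;
    int  kids[MAXKIDS];
};

/* Depth-first descent from `inum`, remembering every i-number seen. */
static void fsck_mark(const struct fsck_inode *tab, int inum, bool *seen)
{
    if (seen[inum])
        return;
    seen[inum] = true;
    for (int i = 0; i < tab[inum].nkids; i++)
        fsck_mark(tab, tab[inum].kids[i], seen);
}

/* Free every allocated inode not reachable from root; return the count. */
int fsck_free_unreachable(struct fsck_inode *tab, int root)
{
    bool seen[NINODES] = { false };
    int freed = 0;

    fsck_mark(tab, root, seen);
    for (int i = 0; i < NINODES; i++) {
        if (tab[i].allocated && !seen[i]) {
            tab[i].allocated = false;   /* "mark everything else free" */
            freed++;
        }
    }
    return freed;
}
```

An inode left allocated by an interrupted create (allocated, but named by no dirent) is exactly what this pass reclaims.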

many kinds of UNIX used sync writes until roughly 10 years ago (i.e., around 2001)

problems with synchronous meta-data update
  very slow during normal operation
  very slow during recovery

how long would fsck take?
  a read from a random place on disk takes about 10 milliseconds
  descending the directory hierarchy might involve a random read per inode
  so maybe (number of inodes / 100) seconds, at 100 random reads per second?
  faster if you read all inodes (and dir blocks) sequentially,
    then descend hierarchy in memory
  my server: fsck takes 10 minutes per 70GB disk w/ 2 million inodes
    clearly reading many inodes sequentially, not seeking
    still a long time, probably linear in disk size
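
The estimate above as arithmetic, in a tiny sketch. The function name is hypothetical and the 10 ms figure is the notes' ballpark, not a measurement.

```c
/* Rough fsck time, in seconds, if every inode costs one random disk
 * read of `ms_per_read` milliseconds. */
long fsck_random_read_seconds(long n_inodes, long ms_per_read)
{
    return n_inodes * ms_per_read / 1000;
}
```

For the 2 million inodes mentioned above, fsck_random_read_seconds(2000000, 10) gives 20000 seconds, about 5.5 hours. The observed 10 minutes (600 seconds) is roughly 33 times faster, which fits the guess that the inodes are read sequentially.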

ordinary performance of sync meta-data update?
  creating a file and writing a few bytes takes 8 writes, probably 80 ms
    (ialloc, init inode, write dirent, alloc data block, add to inode,
     write data, set length in inode, one other mystery write to data)
  so can create only about a dozen small files per second!
  think about un-tar or rm *
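
The same numbers as a one-line calculation. The function name is hypothetical; 8 writes and 10 ms per write are the figures from the notes.

```c
/* Small-file creations per second when each one needs `writes`
 * synchronous disk writes of `ms_per_write` milliseconds each
 * (integer division, so the result is rounded down). */
int sync_creates_per_second(int writes, int ms_per_write)
{
    return 1000 / (writes * ms_per_write);
}
```

sync_creates_per_second(8, 10) is 12. At 80 ms per file, un-tarring a hypothetical archive of 10,000 small files would take 800 seconds, over 13 minutes, nearly all of it waiting for the disk.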

how to get better performance?
  RAM is cheap
  disk sequential throughput is high, 50 MB/sec
  (maybe someday solid state disks will change the landscape)
  we'll talk about big memory, then sequential disk throughput

why not use a big write-back disk cache?
  *no* sync meta-data update
  operations *only* modify in-memory disk cache (no disk write)
    so creat(), unlink(), write() &c return almost immediately
  bufs written to disk later
    if cache is full, write LRU dirty block
    write all dirty blocks every 30 seconds, to limit loss on a crash
  this is how the old Linux EXT2 file system worked
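
A toy model of the write-back policy just described, assuming a 4-slot cache and a logical clock for LRU. All names (wb_cache, wb_write, wb_flush) are hypothetical; this is not EXT2 or xv6 code. It only counts simulated disk writes, to show when the disk is actually touched.

```c
#include <stdbool.h>

enum { NBUF = 4 };

struct wb_buf {
    int  blockno;    /* -1 = empty slot */
    bool dirty;
    long last_use;   /* logical time of latest access; 0 = never used */
};

struct wb_cache {
    struct wb_buf buf[NBUF];
    long clock;
    int  disk_writes;   /* simulated disk writes so far */
};

void wb_init(struct wb_cache *c)
{
    c->clock = 0;
    c->disk_writes = 0;
    for (int i = 0; i < NBUF; i++) {
        c->buf[i].blockno = -1;
        c->buf[i].dirty = false;
        c->buf[i].last_use = 0;
    }
}

/* Modify a block in the cache only. The disk is touched only when the
 * least recently used slot must be recycled and it is dirty. */
void wb_write(struct wb_cache *c, int blockno)
{
    int slot = -1, lru = 0;

    for (int i = 0; i < NBUF; i++) {
        if (c->buf[i].blockno == blockno)
            slot = i;
        if (c->buf[i].last_use < c->buf[lru].last_use)
            lru = i;
    }
    if (slot < 0) {                     /* miss: recycle the LRU slot */
        slot = lru;
        if (c->buf[slot].dirty)
            c->disk_writes++;           /* write back the evicted block */
        c->buf[slot].blockno = blockno;
    }
    c->buf[slot].dirty = true;
    c->buf[slot].last_use = ++c->clock;
}

/* The every-30-seconds pass: write all dirty blocks; return how many. */
int wb_flush(struct wb_cache *c)
{
    int n = 0;
    for (int i = 0; i < NBUF; i++) {
        if (c->buf[i].dirty) {
            c->buf[i].dirty = false;
            c->disk_writes++;
            n++;
        }
    }
    return n;
}
```

In this model, five wb_write calls on the same block followed by one wb_flush cost a single disk write; with synchronous updates they would cost five.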

would write-back cache improve performance? why, exactly?
  after all, you have to write the disk in the end anyway

what can go wrong w/ write-back cache?
  example: unlink() followed by create()
    an existing file x with some content, all safely on disk
    one user runs unlink(x)
      1. delete x's dir entry **
      2. put blocks in free bitmap
      3. mark x's inode free
    another user then runs create(y)
      4. allocate a free inode
      5. initialize the inode to be in-use and zero-length
      6. create y's directory entry **
    again, all writes initially just to disk buffer cache
    suppose only ** writes forced to disk, then crash
    what is the problem?
    can fsck detect and fix this?
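
One way to explore the questions above is to replay the six steps against two copies of a toy state: the cache sees every step, the disk sees only the ** writes. This sketch rests on assumptions: the names (wbfs, wbfs_step, wbfs_run) are hypothetical, the i-number 5 is arbitrary, and create(y) is assumed to pick the inode that unlink(x) just freed.

```c
#include <stdbool.h>
#include <string.h>

/* Toy file system with one directory slot, one inode, one data block. */
struct wbfs {
    char name[8];      /* directory entry: file name, "" if none */
    int  inum;         /* directory entry: i-number, 0 if none */
    bool inode_used;
    int  inode_len;    /* file length recorded in the inode */
    bool block_used;   /* free-bitmap bit for the data block */
    char data[16];     /* data block content */
};

/* Starting point: file x with some content, all safely on disk. */
void wbfs_init(struct wbfs *fs)
{
    memset(fs, 0, sizeof *fs);
    strcpy(fs->name, "x");
    fs->inum = 5;
    fs->inode_used = true;
    fs->inode_len = 6;
    fs->block_used = true;
    strcpy(fs->data, "secret");
}

/* Apply numbered step 1..6 from the notes to one copy of the state. */
void wbfs_step(struct wbfs *fs, int step)
{
    switch (step) {
    case 1: fs->name[0] = '\0'; fs->inum = 0; break;            /* ** */
    case 2: fs->block_used = false; break;
    case 3: fs->inode_used = false; break;
    case 4: break;   /* picks inode 5, just freed (assumed); no state change */
    case 5: fs->inode_used = true; fs->inode_len = 0; break;
    case 6: strcpy(fs->name, "y"); fs->inum = 5; break;         /* ** */
    }
}

/* Cache sees all six steps; disk sees only the ** writes, then "crash". */
void wbfs_run(struct wbfs *cache, struct wbfs *disk)
{
    wbfs_init(cache);
    wbfs_init(disk);
    for (int step = 1; step <= 6; step++) {
        wbfs_step(cache, step);
        if (step == 1 || step == 6)
            wbfs_step(disk, step);
    }
}
```

In this model the post-crash disk has a dirent y naming inode 5, and inode 5 still records x's length and data block, so y shows x's old content. Every reference on disk points at an allocated object, so a reachability check like the one described earlier finds nothing to repair.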

how can we get both speed and safety?
  write only to cache
  somehow remember relationships among writes
    e.g. don't send #1 to disk w/o #2 and #3

most popular solution: logging (== journaling)
  goal: atomic system calls w.r.t. crashes
  goal: fast recovery (no hour-long fsck)
  goal: speed of write-back cache for normal operations

will introduce logging in two steps
  first xv6's log, which only provides safety
  then Linux EXT3, which is also fast

the basic idea behind logging
  you want atomicity: all of a system call's writes, or none
    let's call an atomic operation a "transaction"
  record all writes the sys call *will* do in the log
  then record "done"
  then do the writes
  on crash+recovery:
    if "done" in log, replay all writes in log
    if no "done", ignore log
  this is a WRITE-AHEAD LOG
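
A minimal sketch of the write-ahead rule, with a crash injected after a chosen number of disk writes. It is a toy, not any real file system's log: blocks hold a single int, and wal_disk, wal_commit, wal_recover, and the budget parameter are hypothetical. The record count is assumed to be written in the same disk write as "done".

```c
#include <stdbool.h>

enum { NBLOCKS = 8, LOGCAP = 4 };

struct wal_disk {
    int  blocks[NBLOCKS];      /* real locations */
    int  log_blockno[LOGCAP];  /* log records: destination ... */
    int  log_value[LOGCAP];    /* ... and new content */
    int  log_n;                /* record count, written together with done */
    bool done;                 /* the commit marker */
};

/* Consume one disk write from the budget; false means "power failed". */
static bool wal_step(int *budget)
{
    if (*budget <= 0)
        return false;
    (*budget)--;
    return true;
}

/* Run one transaction of n <= LOGCAP writes, crashing after `budget`
 * disk writes. Order follows the notes: log, "done", write, erase "done". */
void wal_commit(struct wal_disk *d, const int *blockno, const int *value,
                int n, int budget)
{
    for (int i = 0; i < n; i++) {
        if (!wal_step(&budget)) return;
        d->log_blockno[i] = blockno[i];
        d->log_value[i] = value[i];
    }
    if (!wal_step(&budget)) return;
    d->log_n = n;
    d->done = true;                        /* commit point */
    for (int i = 0; i < n; i++) {
        if (!wal_step(&budget)) return;
        d->blocks[blockno[i]] = value[i];
    }
    if (!wal_step(&budget)) return;
    d->done = false;
}

/* After reboot: replay the whole log if "done" is set, else ignore it. */
void wal_recover(struct wal_disk *d)
{
    if (d->done) {
        for (int i = 0; i < d->log_n; i++)
            d->blocks[d->log_blockno[i]] = d->log_value[i];
        d->done = false;
    }
}
```

For a two-write transaction, a budget below 3 crashes before the commit point and recovery leaves both blocks untouched; any budget of 3 or more ends, after wal_recover, with both blocks updated. No budget yields just one of the two.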

xv6's simple logging
  [diagram: buffer cache, FS tree on disk, log on disk]
  FS has a log on disk
  syscall:
    begin_trans()
      bp = bread()
      bp->data[] = ...
      log_write(bp)
      more writes ...
    commit_trans()
  begin_trans:
    need to indicate which group of writes must be atomic!
    lock -- xv6 allows only one transaction at a time
  log_write:
    record sector #
    append buffer content to log
    leave modified block in buffer cache (but do not write)
  commit_trans:
    record "done" and sector #s in log
    do the writes
    erase "done" from log
  recovery:
    if log says "done":
      copy blocks from log to real locations on disk
      then erase "done" from log
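
The outline above, modeled in a few lines of C. This is a toy and not xv6's log.c: the xlog_ names, the tiny sizes, and keeping "disk" and header in one struct are all assumptions. It also assumes that logging the same sector twice in one transaction reuses its log slot, and it returns false where real code would wait on a lock or panic.

```c
#include <stdbool.h>
#include <string.h>

enum { BSIZE = 16, NSECT = 32, LOGSIZE = 4 };

struct xlog {
    char disk[NSECT][BSIZE];       /* real sector locations */
    char logdata[LOGSIZE][BSIZE];  /* on-disk log body */
    int  head_n;                   /* on-disk header; nonzero means "done" */
    int  head_sector[LOGSIZE];     /* on-disk header: sector #s */
    int  n;                        /* in-memory: records so far */
    int  sector[LOGSIZE];          /* in-memory: sector #s so far */
    bool busy;                     /* only one transaction at a time */
};

bool xlog_begin_trans(struct xlog *l)
{
    if (l->busy)
        return false;
    l->busy = true;
    l->n = 0;
    return true;
}

/* Record the sector # and copy the buffer content into the log. */
bool xlog_log_write(struct xlog *l, int sector, const char *data)
{
    int i;
    for (i = 0; i < l->n; i++)
        if (l->sector[i] == sector)
            break;                 /* already logged: reuse the slot */
    if (i == LOGSIZE)
        return false;              /* transaction does not fit in the log */
    l->sector[i] = sector;
    memcpy(l->logdata[i], data, BSIZE);
    if (i == l->n)
        l->n++;
    return true;
}

/* Copy logged blocks to their real locations (no-op if head_n is 0). */
static void xlog_install(struct xlog *l)
{
    for (int i = 0; i < l->head_n; i++)
        memcpy(l->disk[l->head_sector[i]], l->logdata[i], BSIZE);
}

void xlog_commit_trans(struct xlog *l)
{
    memcpy(l->head_sector, l->sector, sizeof l->sector);
    l->head_n = l->n;              /* record "done" and sector #s */
    xlog_install(l);               /* do the writes */
    l->head_n = 0;                 /* erase "done" */
    l->busy = false;
}

void xlog_recover(struct xlog *l)
{
    xlog_install(l);               /* replays only if header says "done" */
    l->head_n = 0;
    l->n = 0;
    l->busy = false;
}
```

A caller brackets its updates the way the syscall outline does: xlog_begin_trans, one xlog_log_write per modified buffer (each buffer BSIZE bytes), then xlog_commit_trans. Nothing reaches disk[] until the commit.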

let's look at the code:
  sys_unlink, sheet 54
    begin_trans before ilock to avoid deadlock
      then error checks, which need the inode lock
      on err, commit empty transaction
    writei of dirent
    iupdate and iunlockput of file
      thus freeing of blocks, erasing of addrs[], freeing inode
    commit_trans
  begin_trans, sheet 41
    why only one transaction at a time?
  log_write
  commit_trans
    write_head
    install_trans
  recover_from_log

let's look at today's homework
the log header is at sector 1014; log data sectors follow it
trace format before the commit point: log sector -- real block #, caller
$ rm README
bwrite sector 1015 -- 29, writei
bwrite sector 1016 -- 2, iupdate
bwrite sector 1017 -- 28, bfree
bwrite sector 1017 -- 28, bfree
bwrite sector 1017 -- 28, bfree
bwrite sector 1017 -- 28, bfree
bwrite sector 1016 -- 2, iupdate
bwrite sector 1016 -- 2, iupdate
bwrite sector 1014 -- log header <-- commit point
bwrite sector 29   -- dir content
bwrite sector 2    -- root and file inodes
bwrite sector 28   -- free bitmap
bwrite sector 1014 -- erase transaction

what's wrong with xv6's logging?
  only one transaction at a time
    two system calls might be modifying different parts of the FS,
    yet they still have to run one after the other
  log traffic will be huge: every operation is many records
  logs whole blocks even if only a few bytes written
  eager write to log -- slow
  eager write to real location -- slow
  every block written twice
  trouble with operations that don't fit in the log
    unlink might dirty many blocks while truncating file