1 6.828 2011 Lecture 10: Crash Recovery, Logging
3 what is crash recovery?
4 you're writing the file system
7 is your file system still useable?
10 crash during multi-step operation
11 leaves FS invariants violated
12 can lead to ugly FS corruption
18 crash: dirent points to free inode -- disaster!
19 crash: inode not free but not used -- not so bad
25 crash: inode refers to free block -- disaster!
26 crash: block not free but not used -- not so bad
33 after rebooting and running recovery code
34 1. FS internal invariants maintained
35 e.g., no block is both in free list and in a file
36 2. all but last few operations preserved on disk
37 e.g., data I wrote yesterday are preserved
38 user might have to check last few operations
40 echo 99 > result ; echo done > status
42 simplifying assumption: disk is fail-stop
43 disk executes the writes FS sends it, and does nothing else
44 perhaps doesn't perform the very last write
49 correctness and performance often conflict
50 safety => write to disk ASAP
51 speed => don't write the disk (batch, write-back cache, sort by track, &c)
53 we'll discuss two approaches:
54 synchronous meta-data update + fsck
55 logging (xv6 and linux ext3)
57 synchronous meta-data update
58 an old approach to crash recovery
59 simple, slow, incomplete
61 most problem cases look like dangling references
65 idea: always initialize *on disk* before creating reference
66 implement by doing the initialization write,
67 waiting for it to complete,
68 and only then doing the referencing write
71 example: file creation
72 what's the right order of synchronous writes?
73 1. mark inode as allocated
74 2. create directory entry
76 example: file deletion
77 1. erase directory entry
78 2. erase inode addrs[], mark as free
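A minimal sketch of the ordering rule, assuming a toy disk modelled as a Python dict. The names `apply`, `create_writes`, `unlink_writes`, `dangling`, and `crash_states` are hypothetical and not xv6 code. It lists the synchronous writes in order and checks what a crash after any prefix of them leaves on disk.

```python
import copy

def apply(disk, write):
    # A write is (table, key, value); value None means "erase this entry".
    table, key, value = write
    if value is None:
        disk[table].pop(key, None)
    else:
        disk[table][key] = value

def create_writes(name, inum):
    # Initialize first (mark inode allocated), reference second (dirent).
    return [("inode_used", inum, True), ("dir", name, inum)]

def unlink_writes(name, inum):
    # Remove the reference first, then free what it referred to.
    return [("dir", name, None), ("inode_used", inum, False)]

def dangling(disk):
    # The "disaster" case: a dirent that points to a free inode.
    return any(not disk["inode_used"].get(i, False) for i in disk["dir"].values())

def crash_states(initial, writes):
    # Yield the on-disk state after a crash following each prefix of writes.
    for n in range(len(writes) + 1):
        disk = copy.deepcopy(initial)
        for w in writes[:n]:
            apply(disk, w)
        yield disk
```

In this model, no crash point produces a dangling dirent when the writes go in the order above; reversing either list does produce one. The "inode not free but not used" state still occurs, which is the case left for fsck.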
81 example: rename() (not in xv6)
82 between directories, e.g. mv d1/x d2/y
85 create the new dirent then erase the old one, or the other way around?
86 probably safest to create then erase!
88 what will be true after crash+reboot?
89 all completed sys calls guaranteed visible on disk
90 reachable part of FS will be mostly correct
91 except interrupted rename leaves file in both directories!
92 blocks and inodes may be unreferenced but not marked free
94 so: sync meta-data update system needs to check at reboot
95 to free unreferenced inodes and blocks
96 descend dir tree from root, remembering all i-numbers and block #s seen
97 mark everything else free
98 probably have to punt on interrupted rename()
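The reboot-time check can be sketched as a mark-and-sweep over the directory tree. This is an illustration only: the function `fsck_sweep` and its dict-based inputs are assumptions, not the layout of any real file system, and it ignores the interrupted-rename case.

```python
def fsck_sweep(root, dirs, inode_blocks, all_inodes, all_blocks):
    """Return (free_inodes, free_blocks) after walking the tree from root.

    dirs:         inum -> {name: child inum}, for directory inodes only
    inode_blocks: inum -> list of block numbers that inode refers to
    """
    seen_inodes, seen_blocks = set(), set()
    stack = [root]
    while stack:
        inum = stack.pop()
        if inum in seen_inodes:
            continue  # hard links can reach the same inode more than once
        seen_inodes.add(inum)
        seen_blocks.update(inode_blocks.get(inum, []))
        stack.extend(dirs.get(inum, {}).values())
    # Everything not reached from the root is marked free.
    return set(all_inodes) - seen_inodes, set(all_blocks) - seen_blocks
```

An inode that was allocated but never linked into a directory (the "not so bad" crash case) is never reached by the walk, so it and its blocks land in the free sets.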
100 many kinds of UNIX used sync writes until about 10 years ago (circa 2001)
102 problems with synchronous meta-data update
103 very slow during normal operation
104 very slow during recovery
106 how long would fsck take?
107 a read from a random place on disk takes about 10 milliseconds
108 descending the directory hierarchy might involve a random read per inode
109 so maybe (n-inodes / 100) seconds?
110 faster if you read all inodes (and dir blocks) sequentially,
111 then descend hierarchy in memory
112 my server: fsck takes 10 minutes per 70GB disk w/ 2 million inodes
113 clearly reading many inodes sequentially, not seeking
114 still a long time, probably linear in disk size
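The estimate can be checked with a few lines of arithmetic. The function name is hypothetical; the figures (10 ms per random read, 2 million inodes, 10 minutes observed) are the ones in the notes.

```python
RANDOM_READS_PER_SECOND = 100   # one random read takes about 10 ms

def fsck_seconds_if_random(n_inodes):
    # One random read per inode, as in the (n-inodes / 100) estimate.
    return n_inodes / RANDOM_READS_PER_SECOND

estimate = fsck_seconds_if_random(2_000_000)   # seconds if every inode costs a seek
observed = 10 * 60                             # 10 minutes, in seconds
speedup = estimate / observed                  # how far off the seek-bound estimate is
```

The seek-bound estimate comes to 20000 seconds, about 5.5 hours, roughly 33 times the observed 10 minutes. That gap is what suggests the real fsck reads inodes sequentially.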
116 ordinary performance of sync meta-data update?
117 creating a file and writing a few bytes takes 8 writes, probably 80 ms
118 (ialloc, init inode, write dirent, alloc data block, add to inode,
119 write data, set length in inode, one other mystery write to data)
120 so can create only about a dozen small files per second!
121 think about un-tar or rm *
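The same kind of back-of-the-envelope arithmetic for file creation, as a sketch. `untar_seconds` and its `n_files` argument are hypothetical, added only to show how the per-file cost scales.

```python
WRITE_MS = 10          # one synchronous disk write, about 10 ms
WRITES_PER_CREATE = 8  # the eight writes listed in the notes

def create_ms():
    return WRITES_PER_CREATE * WRITE_MS

def creates_per_second():
    return 1000 / create_ms()

def untar_seconds(n_files):
    # n_files is a made-up archive size, not a figure from the notes.
    return n_files * create_ms() / 1000
```

At 80 ms per file that is 12.5 creates per second, so un-tarring even a thousand small files would take over a minute under these assumptions.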
123 how to get better performance?
125 disk sequential throughput is high, 50 MB/sec
126 (maybe someday solid state disks will change the landscape)
127 we'll talk about big memory, then sequential disk throughput
129 why not use a big write-back disk cache?
130 *no* sync meta-data update
131 operations *only* modify in-memory disk cache (no disk write)
132 so creat(), unlink(), write() &c return almost immediately
133 bufs written to disk later
134 if cache is full, write LRU dirty block
135 write all dirty blocks every 30 seconds, to limit loss if crash
136 this is how old Linux EXT2 file system worked
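A minimal sketch of such a write-back cache, assuming the disk is a plain dict from block number to data. The class `WriteBackCache` is hypothetical and not EXT2 code; it evicts the least recently used buffer, writing it out only if it is dirty, and `flush()` stands in for the every-30-seconds write.

```python
from collections import OrderedDict

class WriteBackCache:
    def __init__(self, disk, capacity):
        self.disk = disk            # dict: block number -> data
        self.capacity = capacity    # max number of cached blocks
        self.bufs = OrderedDict()   # block -> (data, dirty), least recent first
        self.disk_writes = 0

    def write(self, block, data):
        # Modify only the in-memory copy; no disk write happens here.
        self.bufs[block] = (data, True)
        self.bufs.move_to_end(block)
        while len(self.bufs) > self.capacity:
            self._evict_lru()

    def _evict_lru(self):
        block, (data, dirty) = self.bufs.popitem(last=False)
        if dirty:
            self._to_disk(block, data)

    def _to_disk(self, block, data):
        self.disk[block] = data
        self.disk_writes += 1

    def flush(self):
        # Stand-in for the periodic write of all dirty blocks.
        for block, (data, dirty) in list(self.bufs.items()):
            if dirty:
                self._to_disk(block, data)
                self.bufs[block] = (data, False)
```

One reason this helps: repeated writes to the same block (an inode or bitmap block touched by many operations) cost a single disk write at flush time. Nothing in the sketch constrains the order in which dirty blocks reach the disk.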
138 would write-back cache improve performance? why, exactly?
139 after all, you have to write the disk in the end anyway
141 what can go wrong w/ write-back cache?
142 example: unlink() followed by create()
143 an existing file x with some content, all safely on disk
144 one user runs unlink(x)
145 1. delete x's dir entry **
146 2. put blocks in free bitmap
147 3. mark x's inode free
148 another user then runs create(y)
149 4. allocate a free inode
150 5. initialize the inode to be in-use and zero-length
151 6. create y's directory entry **
152 again, all writes initially just to disk buffer cache
153 suppose only ** writes forced to disk, then crash
155 can fsck detect and fix this?
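The scenario can be replayed on a toy disk. This is a sketch under loud assumptions: create(y) is assumed to reuse x's inode (number 5 here), step 4 is modelled as a purely in-memory decision, and `looks_consistent` is a hypothetical stand-in for the structural checks an fsck could make.

```python
def initial_disk():
    # File x: inode 5, in use, pointing at block 40, which holds x's data.
    return {
        "dir": {"x": 5},
        "inode": {5: {"used": True, "addrs": [40]}},
        "free_blocks": set(),
        "data": {40: "x's content"},
    }

# The six writes from the notes, as (step number, effect on disk) pairs.
WRITES = [
    (1, lambda d: d["dir"].pop("x")),                           # **
    (2, lambda d: d["free_blocks"].add(40)),
    (3, lambda d: d["inode"][5].update(used=False)),
    (4, lambda d: None),            # allocator picks inode 5 (assumption)
    (5, lambda d: d["inode"][5].update(used=True, addrs=[])),
    (6, lambda d: d["dir"].__setitem__("y", 5)),                # **
]

def disk_after_crash(forced):
    # Apply only the writes that reached the disk before the crash.
    disk = initial_disk()
    for step, write in WRITES:
        if step in forced:
            write(disk)
    return disk

def read_file(disk, name):
    inode = disk["inode"][disk["dir"][name]]
    return [disk["data"][b] for b in inode["addrs"]]

def looks_consistent(disk):
    # Dirents point to in-use inodes; no block is both free and in a file.
    for inum in disk["dir"].values():
        if not disk["inode"][inum]["used"]:
            return False
    for inode in disk["inode"].values():
        if inode["used"] and disk["free_blocks"] & set(inode["addrs"]):
            return False
    return True
```

With only writes 1 and 6 on disk, `y` names the old inode, which still points at x's blocks, so reading `y` returns x's content. Every structural invariant in this model still holds, so a check of this kind has nothing to flag.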
157 how can we get both speed and safety?
159 somehow remember relationships among writes
160 e.g. don't send #1 to disk w/o #2 and #3
162 most popular solution: logging (== journaling)
163 goal: atomic system calls w.r.t. crashes
164 goal: fast recovery (no hour-long fsck)
165 goal: speed of write-back cache for normal operations
167 will introduce logging in two steps
168 first xv6's log, which only provides safety
169 then Linux EXT3, which is also fast
171 the basic idea behind logging
172 you want atomicity: all of a system call's writes, or none
173 let's call an atomic operation a "transaction"
174 record all writes the sys call *will* do in the log, then record "done", then do the writes
178 on crash+recovery: if "done" in log, replay all writes in log
179 if no "done", ignore log
180 this is a WRITE-AHEAD LOG
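A minimal sketch of a write-ahead log on a toy disk that can be told to crash after a given number of writes. All names (`Disk`, `CrashNow`, `transaction`, `install`, `recover`, `state_after_crash`) are hypothetical, and erasing the log is modelled as a single atomic write, which is an assumption.

```python
class CrashNow(Exception):
    """Raised by the toy disk to simulate power loss."""

class Disk:
    def __init__(self, blocks, writes_before_crash=None):
        self.blocks = dict(blocks)          # real locations: block -> data
        self.log = []                       # logged (block, data) records
        self.done = False                   # the "done" marker
        self.budget = writes_before_crash   # None means never crash

    def _tick(self):
        # Every disk write passes through here; crash when the budget is spent.
        if self.budget is not None:
            if self.budget == 0:
                raise CrashNow()
            self.budget -= 1

    def append_log(self, block, data):
        self._tick()
        self.log.append((block, data))

    def mark_done(self):
        self._tick()
        self.done = True

    def write_block(self, block, data):
        self._tick()
        self.blocks[block] = data

    def erase_log(self):
        # Modelled as one atomic write clearing "done" and the records.
        self._tick()
        self.done = False
        self.log = []

def install(disk):
    for block, data in disk.log:
        disk.write_block(block, data)
    disk.erase_log()

def transaction(disk, writes):
    for block, data in writes:
        disk.append_log(block, data)   # record what the call *will* write
    disk.mark_done()                   # commit point
    install(disk)                      # only now write the real locations

def recover(disk):
    disk.budget = None                 # assume no crash during recovery
    if disk.done:
        install(disk)                  # replay every logged write
    else:
        disk.log = []                  # no "done": ignore the log

def state_after_crash(k):
    # Crash after k successful disk writes, reboot, recover, return the blocks.
    disk = Disk({1: "old1", 2: "old2"}, writes_before_crash=k)
    try:
        transaction(disk, [(1, "new1"), (2, "new2")])
    except CrashNow:
        pass
    recover(disk)
    return disk.blocks
```

For every crash point, the recovered blocks are either both old or both new. Replay is safe to repeat because each logged record overwrites a whole block with the same data.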
183 [diagram: buffer cache, FS tree on disk, log on disk]
193 need to indicate which group of writes must be atomic!
194 lock -- xv6 allows only one transaction at a time
197 append buffer content to log
198 leave modified block in buffer cache (but do not write)
200 at commit: record "done" and sector #s in log
202 do the writes, then erase "done" from log
205 at recovery, if log says "done": copy blocks from log to real locations on disk
207 let's look at the code:
209 begin_trans before ilock to avoid deadlock
210 then error checks, which need the inode lock
211 on err, commit empty transaction
213 iupdate and iunlockput of file
214 thus freeing of blocks, erasing of addrs[], freeing inode
216 begin_trans, sheet 41
217 why only one transaction at a time?
224 let's look at today's homework
225 the log header is at 1014
227 bwrite sector 1015 -- 29, writei
228 bwrite sector 1016 -- 2, iupdate
229 bwrite sector 1017 -- 28, bfree
230 bwrite sector 1017 -- 28, bfree
231 bwrite sector 1017 -- 28, bfree
232 bwrite sector 1017 -- 28, bfree
233 bwrite sector 1016 -- 2, iupdate
234 bwrite sector 1016 -- 2, iupdate
235 bwrite sector 1014 -- log header <-- commit point
236 bwrite sector 29 -- dir content
237 bwrite sector 2 -- root and file inodes
238 bwrite sector 28 -- free bitmap
239 bwrite sector 1014 -- erase transaction
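The trace above can be checked mechanically against the write-ahead rule. This is a sketch: `check_write_ahead` is a hypothetical helper, and it assumes sectors above the header (1014) are log data blocks while sectors below it are real locations.

```python
LOG_HEADER = 1014

# (sector written, home sector it is a logged copy of, or None)
TRACE = [
    (1015, 29), (1016, 2),
    (1017, 28), (1017, 28), (1017, 28), (1017, 28),
    (1016, 2), (1016, 2),
    (1014, None),                            # commit point
    (29, None), (2, None), (28, None),       # install to real locations
    (1014, None),                            # erase transaction
]

def check_write_ahead(trace, header=LOG_HEADER):
    state = "logging"
    logged_homes = set()
    for sector, home in trace:
        if sector == header:
            if state == "logging":
                state = "committed"          # first header write commits
            else:
                state = "logging"            # second header write erases
                logged_homes = set()
        elif sector > header:
            if state != "logging":
                return False                 # log write after the commit point
            logged_homes.add(home)
        else:
            # A real location may be written only after commit, and only
            # if that sector was logged in this transaction.
            if state != "committed" or sector not in logged_homes:
                return False
    return True
```

The trace passes, and it also shows the cost: three modified blocks (29, 2, 28) take 13 disk writes, 8 of them to the log, because each `bwrite` of an already-logged block writes the log copy again.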
241 what's wrong with xv6's logging?
242 only one transaction at a time
243 two system calls might be modifying different parts of the FS
244 log traffic will be huge: every operation is many records
245 logs whole blocks even if only a few bytes written
246 eager write to log -- slow
247 eager write to real location -- slow
248 every block written twice
249 trouble with operations that don't fit in the log
250 unlink might dirty many blocks while truncating file