On Tue, Nov 06, 2007 at 02:33:53AM -0800, akpm@linux-foundation.org wrote:
[mmotm.git] / fs / reiser4 / page_cache.c
blob9409c19a2a4d4976c52338d8065bafe75eb17798
/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by
 * reiser4/README */

/* Memory pressure hooks. Fake inodes handling. */

/* GLOSSARY

   . Formatted and unformatted nodes.
     Elements of the reiser4 balanced tree that store data and metadata.
     Unformatted nodes are pointed to by extent pointers. Such nodes
     are used to store data of large objects. Unlike unformatted nodes,
     formatted ones have an associated format described by the node4X plugin.

   . Jnode (or journal node)
     The in-memory header which is used to track formatted and unformatted
     nodes, bitmap nodes, etc. In particular, jnodes are used to track
     transactional information associated with each block (see
     reiser4/jnode.c for details).

   . Znode
     The in-memory header which is used to track formatted nodes. Contains
     an embedded jnode (see reiser4/znode.c for details).
*/

/* We store all file system meta data (and data, of course) in the page cache.

   What does this mean? Instead of using bread/brelse we create a special
   "fake" inode (one per super block) and store the content of formatted nodes
   in pages bound to this inode in the page cache. In newer kernels bread()
   already uses the inode attached to the block device (bd_inode). The
   advantage of having our own fake inode is that we can install appropriate
   methods in its address_space operations. Such methods are called by the VM
   on memory pressure (or during background page flushing) and we can use them
   to react appropriately.

   In the initial version we only support one block per page. Support for
   multiple blocks per page is complicated by relocation.

   A jnode is attached to each page used by reiser4. A jnode is analogous to
   a buffer head. The difference is that a jnode is bound to the page
   permanently: a jnode cannot be removed from memory until its backing page
   is.

   A jnode contains a pointer to its page (->pg field) and the page contains
   a pointer to the jnode in its ->private field. The pointer from jnode to
   page is protected by the jnode's spinlock, and the pointer from page to
   jnode is protected by the page lock (PG_locked bit). Lock ordering is:
   first take the page lock, then the jnode spin lock. To go in the reverse
   direction use the jnode_lock_page() function, which uses the standard
   try-lock-and-release device.

   Properties:

   1. when the jnode-to-page mapping is established (by jnode_attach_page()),
   the page reference counter is increased.

   2. when the jnode-to-page mapping is destroyed (by page_clear_jnode()),
   the page reference counter is decreased.

   3. on jload() the reference counter on the jnode page is increased, and
   the page is kmapped and `referenced'.

   4. on jrelse() the inverse operations are performed.

   5. kmapping/kunmapping of unformatted pages is done by read/write methods.

   DEADLOCKS RELATED TO MEMORY PRESSURE. [OUTDATED. Only interesting
   historically.]

   [In the following discussion, `lock' invariably means long term lock on
   znode.] (What about page locks?)

   There is a special class of deadlock possibilities related to memory
   pressure. Locks acquired by other reiser4 threads are accounted for in the
   deadlock prevention mechanism (lock.c), but when ->vm_writeback() is
   invoked an additional hidden arc is added to the locking graph: the thread
   that tries to allocate memory waits for ->vm_writeback() to finish. If this
   thread holds a lock and ->vm_writeback() tries to acquire this lock,
   deadlock prevention is useless.

   Another related problem is the possibility for ->vm_writeback() to run out
   of memory itself. This is not a problem for ext2 and friends, because their
   ->vm_writeback() methods don't allocate much memory, but reiser4 flush is
   definitely able to allocate huge amounts of memory.

   It seems that there is no reliable way to cope with the problems above.
   Instead it was decided that ->vm_writeback() (as invoked in the kswapd
   context) wouldn't perform any flushing itself, but rather should just wake
   up some auxiliary thread dedicated to this purpose (or the same thread
   that does periodic commit of old atoms (ktxnmgrd.c)).

   Details:

   1. A page is called `reclaimable' against a particular reiser4 mount F if
   this page can be ultimately released by try_to_free_pages() under the
   presumptions that:

    a. ->vm_writeback() for F is a no-op, and

    b. none of the threads accessing F are making any progress, and

    c. other reiser4 mounts obey the same memory reservation protocol as F
    (described below).

   For example, a clean un-pinned page, or a page occupied by ext2 data, is
   reclaimable against any reiser4 mount.

   When there is more than one reiser4 mount in a system, condition (c) makes
   reclaimability not easily verifiable beyond the trivial cases mentioned
   above.

   THIS COMMENT IS VALID FOR "MANY BLOCKS ON PAGE" CASE

   The fake inode is used to bind formatted nodes, and each node is indexed
   within the fake inode by its block number. If the block size is smaller
   than the page size, it may so happen that a block mapped to the page with a
   formatted node is occupied by an unformatted node or is unallocated. This
   leads to some complications, because flushing the whole page can lead to an
   incorrect overwrite of an unformatted node that, moreover, can be cached in
   some other place as part of the file body. To avoid this, buffers for
   unformatted nodes are never marked dirty. Also, pages in the fake inode are
   never marked dirty. This rules out usage of ->writepage() as the memory
   pressure hook. Instead ->releasepage() is used.

   Josh is concerned that page->buffer is going to die. This should not pose
   a significant problem though, because we need to add some data structures
   to the page anyway (jnode) and all necessary bookkeeping can be put there.

*/

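/* The lock ordering described above (page lock first, then jnode spin lock,
   with try-lock-and-release for the reverse direction) can be sketched in
   userspace roughly as follows. This is only an illustration under stated
   assumptions: every toy_* name is hypothetical, C11 atomic_flag stands in
   for both PG_locked and the jnode spinlock, and the real jnode_lock_page()
   may differ in detail (for example, it can sleep on the page lock instead
   of retrying immediately). */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_jnode;

struct toy_page {
	atomic_flag locked;		/* stands in for PG_locked */
	struct toy_jnode *private;	/* page -> jnode, guarded by the page lock */
};

struct toy_jnode {
	atomic_flag guard;		/* stands in for the jnode spinlock */
	struct toy_page *pg;		/* jnode -> page, guarded by ->guard */
};

static bool toy_trylock(atomic_flag *f)
{
	return !atomic_flag_test_and_set(f);
}

static void toy_lock(atomic_flag *f)
{
	while (atomic_flag_test_and_set(f))
		;	/* spin until the flag is released */
}

static void toy_unlock(atomic_flag *f)
{
	atomic_flag_clear(f);
}

/*
 * Take the locks in the "wrong" order without risking deadlock: hold the
 * jnode guard, only *try* the page lock, and release the guard on failure.
 * Returns with node->guard held; a non-NULL result is also page-locked.
 */
static struct toy_page *toy_jnode_lock_page(struct toy_jnode *node)
{
	for (;;) {
		struct toy_page *pg;

		toy_lock(&node->guard);
		pg = node->pg;
		if (pg == NULL)
			return NULL;		/* no page attached */
		if (toy_trylock(&pg->locked))
			return pg;		/* both locks held */
		/* page lock is busy: back off and retry from scratch */
		toy_unlock(&node->guard);
	}
}

/* Single-threaded walk-through; returns 1 if the invariants hold. */
static int toy_lock_demo(void)
{
	struct toy_page page = { ATOMIC_FLAG_INIT, NULL };
	struct toy_jnode node = { ATOMIC_FLAG_INIT, &page };
	struct toy_page *got;

	page.private = &node;
	got = toy_jnode_lock_page(&node);
	if (got != &page)
		return 0;
	/* both locks are held now, so further try-locks must fail */
	if (toy_trylock(&page.locked) || toy_trylock(&node.guard))
		return 0;
	toy_unlock(&node.guard);
	toy_unlock(&page.locked);
	return 1;
}
```

/* The point of the design is that a thread holding the jnode guard never
   blocks on the page lock, so it cannot deadlock against a thread that
   follows the documented page-then-jnode order. */
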
/* Life cycle of pages/nodes.

   A jnode contains a reference to its page and the page contains a reference
   back to the jnode. This reference is counted in page ->count. Thus, a page
   bound to a jnode cannot be released back into the free pool.

    1. Formatted nodes.

      1. a formatted node is represented by a znode. When a new znode is
      created its ->pg pointer is NULL initially.

      2. when node content is loaded into the znode (by a call to zload()) for
      the first time, the following happens (in the call to ->read_node() or
      ->allocate_node()):

        1. a new page is added to the page cache.

        2. this page is attached to the znode and its ->count is increased.

        3. the page is kmapped.

      3. if more calls to zload() follow (without corresponding zrelses), the
      page counter is left intact and in its stead ->d_count is increased in
      the znode.

      4. each call to zrelse decreases ->d_count. When ->d_count drops to zero
      ->release_node() is called and the page is kunmapped as a result.

      5. at some moment the node can be captured by a transaction. Its
      ->x_count is then increased by the transaction manager.

      6. if the node is removed from the tree (empty node with
      JNODE_HEARD_BANSHEE bit set) the following will happen (also see the
      comment at the top of znode.c):

        1. when the last lock is released, the node will be uncaptured from
        the transaction. This releases the reference that the transaction
        manager acquired at step 5.

        2. when the last reference is released, zput() detects that the node
        is actually deleted and calls the ->delete_node() operation. The
        page_cache_delete_node() implementation detaches the jnode from the
        page and releases the page.

      7. otherwise (the node wasn't removed from the tree), the last reference
      to the znode will be released after the transaction manager has
      committed the transaction the node was in. This implies squallocing of
      this node (see flush.c). Nothing special happens at this point. The
      znode is still in the hash table and the page is still attached to it.

      8. the znode is actually removed from memory because of memory pressure,
      or during umount (znodes_tree_done()). In either case, the znode is
      removed by a call to zdrop(). At this moment, the page is detached from
      the znode and removed from the inode address space.

*/

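/* The counting rules in the life cycle above can be modelled with a few
   plain counters. This is a minimal sketch, not the reiser4 code: all toy_*
   names are hypothetical, and it assumes that every load (including the
   first) bumps ->d_count, which is what makes "drops to zero" in step 4
   meaningful. */

```c
#include <assert.h>
#include <stdbool.h>

struct toy_znode {
	bool has_page;		/* models ->pg != NULL */
	int page_count;		/* models the page's ->count */
	int d_count;		/* models ->d_count */
	bool kmapped;		/* models the page being kmapped */
};

/* steps 2 and 3: attach a page on first load, then only count loads */
static void toy_zload(struct toy_znode *z)
{
	if (!z->has_page) {
		z->has_page = true;
		z->page_count++;
	}
	if (z->d_count == 0)
		z->kmapped = true;
	z->d_count++;
}

/* step 4: the last release unmaps the page but keeps it attached */
static void toy_zrelse(struct toy_znode *z)
{
	assert(z->d_count > 0);
	if (--z->d_count == 0)
		z->kmapped = false;
}

/* step 8: dropping the node detaches the page and gives up its reference */
static void toy_zdrop(struct toy_znode *z)
{
	assert(z->d_count == 0);
	if (z->has_page) {
		z->has_page = false;
		z->page_count--;
	}
}

/* Walk through load, load, release, release, drop; returns 1 on success. */
static int toy_lifecycle_demo(void)
{
	struct toy_znode z = { false, 0, 0, false };

	toy_zload(&z);
	toy_zload(&z);
	if (z.page_count != 1 || z.d_count != 2 || !z.kmapped)
		return 0;
	toy_zrelse(&z);
	if (!z.kmapped)
		return 0;	/* still loaded once */
	toy_zrelse(&z);
	if (z.kmapped || !z.has_page || z.page_count != 1)
		return 0;	/* unmapped, but page stays attached */
	toy_zdrop(&z);
	return z.page_count == 0 && !z.has_page;
}
```
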
#include "debug.h"
#include "dformat.h"
#include "key.h"
#include "txnmgr.h"
#include "jnode.h"
#include "znode.h"
#include "block_alloc.h"
#include "tree.h"
#include "vfs_ops.h"
#include "inode.h"
#include "super.h"
#include "entd.h"
#include "page_cache.h"
#include "ktxnmgrd.h"

#include <linux/types.h>
#include <linux/fs.h>
#include <linux/mm.h>		/* for struct page */
#include <linux/swap.h>		/* for struct page */
#include <linux/pagemap.h>
#include <linux/bio.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>

static struct bio *page_bio(struct page *, jnode *, int rw, gfp_t gfp);

static struct address_space_operations formatted_fake_as_ops;

static const oid_t fake_ino = 0x1;
static const oid_t bitmap_ino = 0x2;
static const oid_t cc_ino = 0x3;

static void
init_fake_inode(struct super_block *super, struct inode *fake,
		struct inode **pfake)
{
	assert("nikita-2168", fake->i_state & I_NEW);
	fake->i_mapping->a_ops = &formatted_fake_as_ops;
	*pfake = fake;
	/* NOTE-NIKITA something else? */
	unlock_new_inode(fake);
}

/**
 * reiser4_init_formatted_fake - iget inodes for formatted nodes and bitmaps
 * @super: super block to init fake inode for
 *
 * Initializes fake inode to which formatted nodes are bound in the page cache
 * and inode for bitmaps.
 */
int reiser4_init_formatted_fake(struct super_block *super)
{
	struct inode *fake;
	struct inode *bitmap;
	struct inode *cc;
	reiser4_super_info_data *sinfo;

	assert("nikita-1703", super != NULL);

	sinfo = get_super_private_nocheck(super);
	fake = iget_locked(super, oid_to_ino(fake_ino));

	if (fake != NULL) {
		init_fake_inode(super, fake, &sinfo->fake);

		bitmap = iget_locked(super, oid_to_ino(bitmap_ino));
		if (bitmap != NULL) {
			init_fake_inode(super, bitmap, &sinfo->bitmap);

			cc = iget_locked(super, oid_to_ino(cc_ino));
			if (cc != NULL) {
				init_fake_inode(super, cc, &sinfo->cc);
				return 0;
			} else {
				iput(sinfo->fake);
				iput(sinfo->bitmap);
				sinfo->fake = NULL;
				sinfo->bitmap = NULL;
			}
		} else {
			iput(sinfo->fake);
			sinfo->fake = NULL;
		}
	}
	return RETERR(-ENOMEM);
}

/**
 * reiser4_done_formatted_fake - release inode used by formatted nodes and bitmaps
 * @super: super block whose fake inodes are released
 *
 * Releases inodes which were used as address spaces of bitmap and formatted
 * nodes.
 */
void reiser4_done_formatted_fake(struct super_block *super)
{
	reiser4_super_info_data *sinfo;

	sinfo = get_super_private_nocheck(super);

	if (sinfo->fake != NULL) {
		iput(sinfo->fake);
		sinfo->fake = NULL;
	}

	if (sinfo->bitmap != NULL) {
		iput(sinfo->bitmap);
		sinfo->bitmap = NULL;
	}

	if (sinfo->cc != NULL) {
		iput(sinfo->cc);
		sinfo->cc = NULL;
	}
	return;
}

void reiser4_wait_page_writeback(struct page *page)
{
	assert("zam-783", PageLocked(page));

	do {
		unlock_page(page);
		wait_on_page_writeback(page);
		lock_page(page);
	} while (PageWriteback(page));
}

/* return tree @page is in */
reiser4_tree *reiser4_tree_by_page(const struct page *page /* page to query */)
{
	assert("nikita-2461", page != NULL);
	return &get_super_private(page->mapping->host->i_sb)->tree;
}

/* completion handler for single page bio-based read.

   mpage_end_io_read() would also do. But it's static.

*/
static void
end_bio_single_page_read(struct bio *bio, int err UNUSED_ARG)
{
	struct page *page;

	page = bio->bi_io_vec[0].bv_page;

	if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
		SetPageUptodate(page);
	} else {
		ClearPageUptodate(page);
		SetPageError(page);
	}
	unlock_page(page);
	bio_put(bio);
}

/* completion handler for single page bio-based write.

   mpage_end_io_write() would also do. But it's static.

*/
static void
end_bio_single_page_write(struct bio *bio, int err UNUSED_ARG)
{
	struct page *page;

	page = bio->bi_io_vec[0].bv_page;

	if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
		SetPageError(page);
	end_page_writeback(page);
	bio_put(bio);
}

/* ->readpage() method for formatted nodes */
static int formatted_readpage(struct file *f UNUSED_ARG,
			      struct page *page /* page to read */)
{
	assert("nikita-2412", PagePrivate(page) && jprivate(page));
	return reiser4_page_io(page, jprivate(page), READ,
			       reiser4_ctx_gfp_mask_get());
}

/**
 * reiser4_page_io - submit single-page bio request
 * @page: page to perform io for
 * @node: jnode of page
 * @rw: read or write
 * @gfp: gfp mask for bio allocation
 *
 * Submits single page read or write.
 */
int reiser4_page_io(struct page *page, jnode *node, int rw, gfp_t gfp)
{
	struct bio *bio;
	int result;

	assert("nikita-2094", page != NULL);
	assert("nikita-2226", PageLocked(page));
	assert("nikita-2634", node != NULL);
	assert("nikita-2893", rw == READ || rw == WRITE);

	/* writes to a read-only file system are silently skipped */
	if (rw == WRITE) {
		if (unlikely(page->mapping->host->i_sb->s_flags & MS_RDONLY)) {
			unlock_page(page);
			return 0;
		}
	}

	bio = page_bio(page, node, rw, gfp);
	if (!IS_ERR(bio)) {
		if (rw == WRITE) {
			set_page_writeback(page);
			unlock_page(page);
		}
		reiser4_submit_bio(rw, bio);
		result = 0;
	} else {
		unlock_page(page);
		result = PTR_ERR(bio);
	}

	return result;
}

/* helper function to construct bio for page */
static struct bio *page_bio(struct page *page, jnode *node, int rw, gfp_t gfp)
{
	struct bio *bio;
	assert("nikita-2092", page != NULL);
	assert("nikita-2633", node != NULL);

	/* Simple implementation in the assumption that blocksize == pagesize.

	   We only have to submit one block, but submit_bh() will allocate bio
	   anyway, so let's use all the bells-and-whistles of bio code.
	 */

	bio = bio_alloc(gfp, 1);
	if (bio != NULL) {
		int blksz;
		struct super_block *super;
		reiser4_block_nr blocknr;

		super = page->mapping->host->i_sb;
		assert("nikita-2029", super != NULL);
		blksz = super->s_blocksize;
		assert("nikita-2028", blksz == (int)PAGE_CACHE_SIZE);

		spin_lock_jnode(node);
		blocknr = *jnode_get_io_block(node);
		spin_unlock_jnode(node);

		assert("nikita-2275", blocknr != (reiser4_block_nr) 0);
		assert("nikita-2276", !reiser4_blocknr_is_fake(&blocknr));

		bio->bi_bdev = super->s_bdev;
		/* fill bio->bi_sector before calling bio_add_page(), because
		 * q->merge_bvec_fn may want to inspect it (see
		 * drivers/md/linear.c:linear_mergeable_bvec() for example). */
		bio->bi_sector = blocknr * (blksz >> 9);

		if (!bio_add_page(bio, page, blksz, 0)) {
			warning("nikita-3452",
				"Single page bio cannot be constructed");
			/* drop the bio allocated above so it is not leaked */
			bio_put(bio);
			return ERR_PTR(RETERR(-EINVAL));
		}

		/* bio->bi_idx is filled by bio_init() */
		bio->bi_end_io = (rw == READ) ?
		    end_bio_single_page_read : end_bio_single_page_write;

		return bio;
	} else
		return ERR_PTR(RETERR(-ENOMEM));
}

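/* The sector computation used above, blocknr * (blksz >> 9), converts a
   file system block number into 512-byte device sectors. A small standalone
   sketch of the arithmetic follows; the toy_* names are hypothetical and the
   512-byte sector unit is the assumption behind the shift by 9. */

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9	/* 1 << 9 == 512 bytes per sector */

/* number of the first 512-byte sector backing block @blocknr */
static uint64_t toy_block_to_sector(uint64_t blocknr, uint32_t blksz)
{
	return blocknr * (blksz >> TOY_SECTOR_SHIFT);
}
```

/* With 4096-byte blocks each block covers 8 sectors, so block 10 starts at
   sector 80; with 512-byte blocks the block and sector numbers coincide. */
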
#if 0
static int can_hit_entd(reiser4_context *ctx, struct super_block *s)
{
	if (ctx == NULL || ((unsigned long)ctx->magic) != context_magic)
		return 1;
	if (ctx->super != s)
		return 1;
	if (get_super_private(s)->entd.tsk == current)
		return 0;
	if (!lock_stack_isclean(&ctx->stack))
		return 0;
	if (ctx->trans->atom != NULL)
		return 0;
	return 1;
}
#endif

/**
 * reiser4_writepage - writepage of struct address_space_operations
 * @page: page to write
 * @wbc: writeback control
 *
 * Common memory pressure notification.
 */
int reiser4_writepage(struct page *page,
		      struct writeback_control *wbc)
{
	struct super_block *s;
	reiser4_context *ctx;

	assert("vs-828", PageLocked(page));

	s = page->mapping->host->i_sb;
	ctx = get_current_context_check();

	/* assert("", can_hit_entd(ctx, s)); */
	return write_page_by_ent(page, wbc);
}

/* ->set_page_dirty() method of formatted address_space */
static int formatted_set_page_dirty(struct page *page)
{
	assert("nikita-2173", page != NULL);
	BUG();
	return __set_page_dirty_nobuffers(page);
}

/* The writepages method of address space operations in reiser4 is used to
   bring pages which are dirtied via mmap into transactions. Only regular
   files can have such pages. The fake inode is used to access formatted nodes
   via the page cache. As formatted nodes can never be mmapped, the fake
   inode's writepages has nothing to do */
static int
writepages_fake(struct address_space *mapping, struct writeback_control *wbc)
{
	return 0;
}

/* address space operations for the fake inode */
static struct address_space_operations formatted_fake_as_ops = {
	/* Perform a writeback of a single page as a memory-freeing
	 * operation. */
	.writepage = reiser4_writepage,
	/* this is called to read formatted node */
	.readpage = formatted_readpage,
	/* ->sync_page() method of fake inode address space operations. Called
	   from wait_on_page() and lock_page().

	   This is a most annoyingly misnamed method. Actually it is called
	   from wait_on_page_bit() and lock_page() and its purpose is to
	   actually start io by jabbing device drivers.
	 */
	.sync_page = block_sync_page,
	/* Write back some dirty pages from this mapping. Called during sync
	   (pdflush) */
	.writepages = writepages_fake,
	/* Set a page dirty */
	.set_page_dirty = formatted_set_page_dirty,
	/* used for read-ahead. Not applicable */
	.readpages = NULL,
	.write_begin = NULL,
	.write_end = NULL,
	.bmap = NULL,
	/* called just before page is being detached from inode mapping and
	   removed from memory. Called on truncate, cut/squeeze, and
	   umount. */
	.invalidatepage = reiser4_invalidatepage,
	/* this is called by shrink_cache() so that the file system can try to
	   release objects (jnodes, buffers, journal heads) attached to the
	   page and, maybe, make the page itself freeable.
	 */
	.releasepage = reiser4_releasepage,
	.direct_IO = NULL
};

/* called just before page is released (no longer used by reiser4). Callers:
   jdelete() and extent2tail(). */
void reiser4_drop_page(struct page *page)
{
	assert("nikita-2181", PageLocked(page));
	clear_page_dirty_for_io(page);
	ClearPageUptodate(page);
#if defined(PG_skipped)
	ClearPageSkipped(page);
#endif
	unlock_page(page);
}

#define JNODE_GANG_SIZE (16)

/* find all jnodes from range specified and invalidate them */
static int
truncate_jnodes_range(struct inode *inode, pgoff_t from, pgoff_t count)
{
	reiser4_inode *info;
	int truncated_jnodes;
	reiser4_tree *tree;
	unsigned long index;
	unsigned long end;

	if (inode_file_plugin(inode) ==
	    file_plugin_by_id(CRYPTCOMPRESS_FILE_PLUGIN_ID))
		/*
		 * No need to get rid of jnodes here: if the single jnode of
		 * page cluster did not have page, then it was found and killed
		 * before in
		 * truncate_complete_page_cluster()->jput()->jput_final(),
		 * otherwise it will be dropped by reiser4_invalidatepage()
		 */
		return 0;
	truncated_jnodes = 0;

	info = reiser4_inode_data(inode);
	tree = reiser4_tree_by_inode(inode);

	index = from;
	end = from + count;

	while (1) {
		jnode *gang[JNODE_GANG_SIZE];
		int taken;
		int i;
		jnode *node;

		assert("nikita-3466", index <= end);

		read_lock_tree(tree);
		taken =
		    radix_tree_gang_lookup(jnode_tree_by_reiser4_inode(info),
					   (void **)gang, index,
					   JNODE_GANG_SIZE);
		for (i = 0; i < taken; ++i) {
			node = gang[i];
			if (index_jnode(node) < end)
				jref(node);
			else
				gang[i] = NULL;
		}
		read_unlock_tree(tree);

		for (i = 0; i < taken; ++i) {
			node = gang[i];
			if (node != NULL) {
				index = max(index, index_jnode(node));
				spin_lock_jnode(node);
				assert("edward-1457", node->pg == NULL);
				/* this is always called after
				   truncate_inode_pages_range(). Therefore, a
				   jnode can not have a page here. New pages
				   can not be created because
				   truncate_jnodes_range() runs with exclusive
				   access to the file held, whereas new page
				   creation requires non-exclusive access */
				JF_SET(node, JNODE_HEARD_BANSHEE);
				reiser4_uncapture_jnode(node);
				unhash_unformatted_jnode(node);
				truncated_jnodes++;
				jput(node);
			} else
				break;
		}
		if (i != taken || taken == 0)
			break;
	}
	return truncated_jnodes;
}

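/* The batching pattern used by truncate_jnodes_range() (look up a small
   "gang" of entries at or beyond the current index, process those inside the
   range, and stop at the first one outside it) can be sketched in userspace
   as follows. This is an illustration only: the toy_* names are hypothetical
   and a sorted array with presence flags stands in for the radix tree, so no
   locking or reference counting is modelled. */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_GANG_SIZE 2		/* deliberately tiny to force several batches */
#define TOY_MAX_ITEMS 16

struct toy_store {
	unsigned long index[TOY_MAX_ITEMS];	/* sorted ascending */
	bool present[TOY_MAX_ITEMS];		/* false once "unhashed" */
	size_t nr;
};

/* collect up to @max slots whose index is >= @first, in ascending order */
static size_t toy_gang_lookup(const struct toy_store *s, unsigned long first,
			      size_t *out, size_t max)
{
	size_t found = 0;
	size_t i;

	for (i = 0; i < s->nr && found < max; i++)
		if (s->present[i] && s->index[i] >= first)
			out[found++] = i;
	return found;
}

/* remove every entry with index in [from, from + count); return how many */
static int toy_truncate_range(struct toy_store *s, unsigned long from,
			      unsigned long count)
{
	unsigned long index = from;
	unsigned long end = from + count;
	int truncated = 0;

	for (;;) {
		size_t gang[TOY_GANG_SIZE];
		size_t taken = toy_gang_lookup(s, index, gang, TOY_GANG_SIZE);
		size_t i;

		for (i = 0; i < taken; i++) {
			size_t slot = gang[i];

			if (s->index[slot] >= end)
				break;			/* past the range */
			index = s->index[slot];		/* resume point */
			s->present[slot] = false;
			truncated++;
		}
		/* stop on an out-of-range entry or an empty lookup */
		if (i != taken || taken == 0)
			break;
	}
	return truncated;
}

/* Entries 3, 4 and 7 fall in [3, 9); returns 1 if exactly those are gone. */
static int toy_truncate_demo(void)
{
	struct toy_store s = {
		.index = { 1, 3, 4, 7, 9, 20 },
		.present = { true, true, true, true, true, true },
		.nr = 6,
	};
	int removed = toy_truncate_range(&s, 3, 6);

	return removed == 3 && s.present[0] && !s.present[1] &&
	       !s.present[2] && !s.present[3] && s.present[4] && s.present[5];
}
```
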
/* Truncating files in reiser4: problems and solutions.

   VFS calls the fs's truncate after it has called truncate_inode_pages()
   to get rid of pages corresponding to the part of the file being truncated.
   In reiser4 this may leave unallocated extents which do not have jnodes.
   Flush code does not expect that. The solution to this problem is
   straightforward. As the VFS's truncate is implemented using the setattr
   operation, it seems reasonable to have a ->setattr() that will cut the
   file body. However, flush code also does not expect dirty pages without
   parent items, so it is impossible to first cut all items and then truncate
   all pages in two steps. We resolve this problem by cutting items
   one-by-one. Each such fine-grained step, performed under a longterm znode
   lock, calls at the end the ->kill_hook() method of the killed item to
   remove its bound pages and jnodes.

   The following function is the common part of the mentioned kill hooks.
   It is also called before tail-to-extent conversion (to avoid managing
   several copies of the data).
*/
void reiser4_invalidate_pages(struct address_space *mapping, pgoff_t from,
			      unsigned long count, int even_cows)
{
	loff_t from_bytes, count_bytes;

	if (count == 0)
		return;
	from_bytes = ((loff_t) from) << PAGE_CACHE_SHIFT;
	count_bytes = ((loff_t) count) << PAGE_CACHE_SHIFT;

	unmap_mapping_range(mapping, from_bytes, count_bytes, even_cows);
	truncate_inode_pages_range(mapping, from_bytes,
				   from_bytes + count_bytes - 1);
	truncate_jnodes_range(mapping->host, from, count);
}

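/* The page-to-byte conversion in reiser4_invalidate_pages() turns a page
   index and page count into an inclusive byte range. A standalone sketch of
   that arithmetic follows, assuming 4096-byte pages (a shift of 12, which is
   an assumption of this example and not fixed by the code above); the toy_*
   names are hypothetical. */

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumed 4096-byte pages */

struct toy_byte_range {
	int64_t first;		/* offset of the first byte in the range */
	int64_t last;		/* offset of the last byte, inclusive */
};

/* @count must be non-zero, mirroring the early return in the caller */
static struct toy_byte_range toy_pages_to_bytes(uint64_t from, uint64_t count)
{
	struct toy_byte_range r;
	int64_t from_bytes = ((int64_t) from) << TOY_PAGE_SHIFT;
	int64_t count_bytes = ((int64_t) count) << TOY_PAGE_SHIFT;

	r.first = from_bytes;
	r.last = from_bytes + count_bytes - 1;
	return r;
}
```

/* For example, three pages starting at page index 2 cover bytes 8192
   through 20479. The "- 1" matters because the end offset handed to
   truncate_inode_pages_range() is inclusive. */
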
/*
 * Local variables:
 * c-indentation-style: "K&R"
 * mode-name: "LC"
 * c-basic-offset: 8
 * tab-width: 8
 * fill-column: 120
 * scroll-step: 1
 * End:
 */