/* fs/reiser4/page_cache.c */
/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by
 * reiser4/README */

/* Memory pressure hooks. Fake inodes handling. */
/* GLOSSARY

   . Formatted and unformatted nodes.
     Elements of the reiser4 balanced tree used to store data and metadata.
     Unformatted nodes are pointed to by extent pointers. Such nodes
     are used to store data of large objects. Unlike unformatted nodes,
     formatted ones have an associated format described by the node4X plugin.

   . Jnode (or journal node)
     The in-memory header which is used to track formatted and unformatted
     nodes, bitmap nodes, etc. In particular, jnodes are used to track
     transactional information associated with each block (see reiser4/jnode.c
     for details).

   . Znode
     The in-memory header which is used to track formatted nodes. Contains
     an embedded jnode (see reiser4/znode.c for details).
*/
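/*
 * The relation between the two headers above can be pictured as struct
 * embedding: a znode contains an embedded jnode, so generic block-tracking
 * code can treat a znode as a jnode. The sketch below is illustrative only
 * (the real definitions live in jnode.h and znode.h; the field names and
 * comments here are assumptions) and is kept under #if 0 so it is never
 * compiled.
 */
#if 0
struct znode_sketch {
	jnode zjnode;		/* embedded jnode: backing page, block number,
				 * state bits shared with unformatted nodes */
	/* ... formatted-node specific fields: tree level, delimiting keys,
	 * long term lock queues, etc. ... */
};
#endif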
/* We store all file system meta data (and data, of course) in the page cache.

   What does this mean? Instead of using bread/brelse we create a special
   "fake" inode (one per super block) and store the content of formatted nodes
   in pages bound to this inode in the page cache. In newer kernels bread()
   already uses an inode attached to the block device (bd_inode). The
   advantage of having our own fake inode is that we can install appropriate
   methods in its address_space operations. Such methods are called by the VM
   on memory pressure (or during background page flushing) and we can use them
   to react appropriately.

   In the initial version we only support one block per page. Support for
   multiple blocks per page is complicated by relocation.
   To each page used by reiser4, a jnode is attached. The jnode is analogous
   to a buffer head. The difference is that the jnode is bound to the page
   permanently: the jnode cannot be removed from memory until its backing
   page is.

   The jnode contains a pointer to the page (->pg field) and the page contains
   a pointer to the jnode in its ->private field. The pointer from jnode to
   page is protected by the jnode's spinlock and the pointer from page to
   jnode is protected by the page lock (PG_locked bit). Lock ordering is:
   first take the page lock, then the jnode spin lock. To go in the reverse
   direction use the jnode_lock_page() function, which uses the standard
   try-lock-and-release device. (An illustrative sketch of this protocol
   follows this comment.)

   Properties:

   1. when the jnode-to-page mapping is established (by jnode_attach_page()),
   the page reference counter is increased.

   2. when the jnode-to-page mapping is destroyed (by page_clear_jnode()),
   the page reference counter is decreased.

   3. on jload() the reference counter on the jnode's page is increased, the
   page is kmapped and `referenced'.

   4. on jrelse() the inverse operations are performed.

   5. kmapping/kunmapping of unformatted pages is done by read/write methods.
   DEADLOCKS RELATED TO MEMORY PRESSURE. [OUTDATED. Only interesting
   historically.]

   [In the following discussion, `lock' invariably means a long term lock on
   a znode.] (What about page locks?)

   There is a special class of deadlock possibilities related to memory
   pressure. Locks acquired by other reiser4 threads are accounted for in the
   deadlock prevention mechanism (lock.c), but when ->vm_writeback() is
   invoked an additional hidden arc is added to the locking graph: the thread
   that tries to allocate memory waits for ->vm_writeback() to finish. If this
   thread keeps a lock and ->vm_writeback() tries to acquire this lock, the
   deadlock prevention is useless.

   Another related problem is the possibility of ->vm_writeback() running out
   of memory itself. This is not a problem for ext2 and friends, because their
   ->vm_writeback() doesn't allocate much memory, but reiser4 flush is
   definitely able to allocate huge amounts of memory.

   It seems that there is no reliable way to cope with the problems above.
   Instead it was decided that ->vm_writeback() (as invoked in the kswapd
   context) wouldn't perform any flushing itself, but rather should just wake
   up some auxiliary thread dedicated to this purpose (or the same thread
   that does periodic commit of old atoms (ktxnmgrd.c)).
   Details:

   1. A page is called `reclaimable' against a particular reiser4 mount F if
   this page can be ultimately released by try_to_free_pages() under the
   presumptions that:

       a. ->vm_writeback() for F is a no-op, and

       b. none of the threads accessing F are making any progress, and

       c. other reiser4 mounts obey the same memory reservation protocol as F
       (described below).

   For example, a clean un-pinned page, or a page occupied by ext2 data, is
   reclaimable against any reiser4 mount.

   When there is more than one reiser4 mount in a system, condition (c) makes
   reclaimability hard to verify beyond the trivial cases mentioned above.
   THIS COMMENT IS VALID FOR THE "MANY BLOCKS ON PAGE" CASE

   The fake inode is used to bind formatted nodes and each node is indexed
   within the fake inode by its block number. If the block size is smaller
   than the page size, it may so happen that a block mapped to the page with
   a formatted node is occupied by an unformatted node or is unallocated.
   This leads to some complications, because flushing the whole page can lead
   to an incorrect overwrite of an unformatted node that, moreover, can be
   cached in some other place as part of the file body. To avoid this, buffers
   for unformatted nodes are never marked dirty. Also pages in the fake inode
   are never marked dirty. This rules out usage of ->writepage() as a memory
   pressure hook. Instead ->releasepage() is used.

   Josh is concerned that page->buffer is going to die. This should not pose
   a significant problem though, because we need to add some data structures
   to the page anyway (jnode) and all necessary book keeping can be put there.
*/
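/*
 * The sketch below illustrates the locking and pinning protocol described
 * above. It is illustrative only and kept under #if 0 so it is never
 * compiled; it assumes that lock_page()/spin_lock_jnode(), jprivate(),
 * jload()/jrelse() and jdata() behave as described in the comment above.
 */
#if 0
static void lock_ordering_sketch(struct page *page)
{
	jnode *node;

	/* forward direction: take the page lock first ... */
	lock_page(page);
	/* ... the page -> jnode pointer is now stable (protected by
	 * PG_locked) ... */
	node = jprivate(page);
	/* ... then take the jnode spin lock */
	spin_lock_jnode(node);
	/* both directions of the jnode <-> page binding are stable here */
	spin_unlock_jnode(node);
	unlock_page(page);
	/* going from a jnode to its locked page (the reverse direction) is
	 * done with jnode_lock_page(), which uses try-lock-and-release */
}

static int jload_sketch(jnode * node)
{
	int result;

	/* property 3 above: pins the jnode's page, kmaps it and marks it
	 * referenced */
	result = jload(node);
	if (result != 0)
		return result;
	/* ... access node contents through jdata(node) ... */
	/* property 4 above: inverse of jload() */
	jrelse(node);
	return 0;
}
#endif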
/* Life cycle of pages/nodes.

   A jnode contains a reference to its page and the page contains a reference
   back to the jnode. This reference is counted in the page ->count. Thus, a
   page bound to a jnode cannot be released back into the free pool.

   1. Formatted nodes.

      1. A formatted node is represented by a znode. When a new znode is
      created, its ->pg pointer is NULL initially.

      2. When node content is loaded into the znode (by a call to zload())
      for the first time, the following happens (in the call to ->read_node()
      or ->allocate_node()):

         1. a new page is added to the page cache.

         2. this page is attached to the znode and its ->count is increased.

         3. the page is kmapped.

      3. If more calls to zload() follow (without corresponding zrelse()
      calls), the page counter is left intact and instead ->d_count is
      increased in the znode.

      4. Each call to zrelse() decreases ->d_count. When ->d_count drops to
      zero, ->release_node() is called and the page is kunmapped as a result.

      5. At some moment the node can be captured by a transaction. Its
      ->x_count is then increased by the transaction manager.

      6. If the node is removed from the tree (empty node with the
      JNODE_HEARD_BANSHEE bit set) the following will happen (also see the
      comment at the top of znode.c):

         1. when the last lock is released, the node will be uncaptured from
         the transaction. This releases the reference that the transaction
         manager acquired at step 5.

         2. when the last reference is released, zput() detects that the node
         is actually deleted and calls the ->delete_node() operation. The
         page_cache_delete_node() implementation detaches the jnode from the
         page and releases the page.

      7. Otherwise (the node wasn't removed from the tree), the last reference
      to the znode will be released after the transaction manager has
      committed the transaction the node was in. This implies squallocing of
      this node (see flush.c). Nothing special happens at this point. The
      znode is still in the hash table and the page is still attached to it.

      8. The znode is actually removed from memory because of memory pressure,
      or during umount (znodes_tree_done()). Either way, the znode is removed
      by a call to zdrop(). At this moment, the page is detached from the
      znode and removed from the inode address space.
*/
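/*
 * A sketch of the zload()/zrelse() part of the life cycle above. Illustrative
 * only and kept under #if 0 so it is never compiled; it assumes that zload(),
 * zdata() and zrelse() behave as described in steps 2-4 above.
 */
#if 0
static int zload_sketch(znode * node)
{
	int result;

	/* steps 2-3: on the first zload() a page is added to the page cache,
	 * attached to the znode and kmapped; later calls only bump ->d_count */
	result = zload(node);
	if (result != 0)
		return result;
	/* ... read the formatted node contents via zdata(node) ... */
	/* step 4: ->d_count drops; when it reaches zero the page is
	 * kunmapped */
	zrelse(node);
	return 0;
}
#endif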
#include "debug.h"
#include "dformat.h"
#include "key.h"
#include "txnmgr.h"
#include "jnode.h"
#include "znode.h"
#include "block_alloc.h"
#include "tree.h"
#include "vfs_ops.h"
#include "inode.h"
#include "super.h"
#include "entd.h"
#include "page_cache.h"
#include "ktxnmgrd.h"

#include <linux/types.h>
#include <linux/fs.h>
#include <linux/mm.h>		/* for struct page */
#include <linux/swap.h>		/* for struct page */
#include <linux/pagemap.h>
#include <linux/bio.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>

static struct bio *page_bio(struct page *, jnode *, int rw, gfp_t gfp);

static struct address_space_operations formatted_fake_as_ops;

static const oid_t fake_ino = 0x1;
static const oid_t bitmap_ino = 0x2;
static const oid_t cc_ino = 0x3;
static void
init_fake_inode(struct super_block *super, struct inode *fake,
		struct inode **pfake)
{
	assert("nikita-2168", fake->i_state & I_NEW);
	fake->i_mapping->a_ops = &formatted_fake_as_ops;
	*pfake = fake;
	/* NOTE-NIKITA something else? */
	unlock_new_inode(fake);
}
/**
 * reiser4_init_formatted_fake - iget inodes for formatted nodes and bitmaps
 * @super: super block to init fake inode for
 *
 * Initializes fake inode to which formatted nodes are bound in the page cache
 * and inode for bitmaps.
 */
int reiser4_init_formatted_fake(struct super_block *super)
{
	struct inode *fake;
	struct inode *bitmap;
	struct inode *cc;
	reiser4_super_info_data *sinfo;

	assert("nikita-1703", super != NULL);

	sinfo = get_super_private_nocheck(super);

	fake = iget_locked(super, oid_to_ino(fake_ino));

	if (fake != NULL) {
		init_fake_inode(super, fake, &sinfo->fake);

		bitmap = iget_locked(super, oid_to_ino(bitmap_ino));
		if (bitmap != NULL) {
			init_fake_inode(super, bitmap, &sinfo->bitmap);

			cc = iget_locked(super, oid_to_ino(cc_ino));
			if (cc != NULL) {
				init_fake_inode(super, cc, &sinfo->cc);
				return 0;
			} else {
				iput(sinfo->fake);
				iput(sinfo->bitmap);
				sinfo->fake = NULL;
				sinfo->bitmap = NULL;
			}
		} else {
			iput(sinfo->fake);
			sinfo->fake = NULL;
		}
	}
	return RETERR(-ENOMEM);
}
/**
 * reiser4_done_formatted_fake - release inodes used by formatted nodes and bitmaps
 * @super: super block to release fake inodes for
 *
 * Releases inodes which were used as address spaces of bitmap and formatted
 * nodes.
 */
void reiser4_done_formatted_fake(struct super_block *super)
{
	reiser4_super_info_data *sinfo;

	sinfo = get_super_private_nocheck(super);

	if (sinfo->fake != NULL) {
		iput(sinfo->fake);
		sinfo->fake = NULL;
	}

	if (sinfo->bitmap != NULL) {
		iput(sinfo->bitmap);
		sinfo->bitmap = NULL;
	}

	if (sinfo->cc != NULL) {
		iput(sinfo->cc);
		sinfo->cc = NULL;
	}
	return;
}
void reiser4_wait_page_writeback(struct page *page)
{
	assert("zam-783", PageLocked(page));

	do {
		unlock_page(page);
		wait_on_page_writeback(page);
		lock_page(page);
	} while (PageWriteback(page));
}
/* return tree @page is in */
reiser4_tree *reiser4_tree_by_page(const struct page *page/* page to query */)
{
	assert("nikita-2461", page != NULL);
	return &get_super_private(page->mapping->host->i_sb)->tree;
}
/* completion handler for single page bio-based read.

   mpage_end_io_read() would also do. But it's static.
*/
static void
end_bio_single_page_read(struct bio *bio, int err UNUSED_ARG)
{
	struct page *page;

	page = bio->bi_io_vec[0].bv_page;

	if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
		SetPageUptodate(page);
	} else {
		ClearPageUptodate(page);
		SetPageError(page);
	}
	unlock_page(page);
	bio_put(bio);
}
/* completion handler for single page bio-based write.

   mpage_end_io_write() would also do. But it's static.
*/
static void
end_bio_single_page_write(struct bio *bio, int err UNUSED_ARG)
{
	struct page *page;

	page = bio->bi_io_vec[0].bv_page;

	if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
		SetPageError(page);
	end_page_writeback(page);
	bio_put(bio);
}
/* ->readpage() method for formatted nodes */
static int formatted_readpage(struct file *f UNUSED_ARG,
			      struct page *page/* page to read */)
{
	assert("nikita-2412", PagePrivate(page) && jprivate(page));
	return reiser4_page_io(page, jprivate(page), READ,
			       reiser4_ctx_gfp_mask_get());
}
/**
 * reiser4_page_io - submit single-page bio request
 * @page: page to perform io for
 * @node: jnode of page
 * @rw: read or write
 * @gfp: gfp mask for bio allocation
 *
 * Submits single page read or write.
 */
int reiser4_page_io(struct page *page, jnode *node, int rw, gfp_t gfp)
{
	struct bio *bio;
	int result;

	assert("nikita-2094", page != NULL);
	assert("nikita-2226", PageLocked(page));
	assert("nikita-2634", node != NULL);
	assert("nikita-2893", rw == READ || rw == WRITE);

	if (rw) {
		if (unlikely(page->mapping->host->i_sb->s_flags & MS_RDONLY)) {
			unlock_page(page);
			return 0;
		}
	}

	bio = page_bio(page, node, rw, gfp);
	if (!IS_ERR(bio)) {
		if (rw == WRITE) {
			set_page_writeback(page);
			unlock_page(page);
		}
		reiser4_submit_bio(rw, bio);
		result = 0;
	} else {
		unlock_page(page);
		result = PTR_ERR(bio);
	}

	return result;
}
/* helper function to construct bio for page */
static struct bio *page_bio(struct page *page, jnode * node, int rw, gfp_t gfp)
{
	struct bio *bio;
	assert("nikita-2092", page != NULL);
	assert("nikita-2633", node != NULL);

	/* Simple implementation under the assumption that blocksize ==
	   pagesize.

	   We only have to submit one block, but submit_bh() would allocate a
	   bio anyway, so let's use all the bells-and-whistles of the bio code.
	 */

	bio = bio_alloc(gfp, 1);
	if (bio != NULL) {
		int blksz;
		struct super_block *super;
		reiser4_block_nr blocknr;

		super = page->mapping->host->i_sb;
		assert("nikita-2029", super != NULL);
		blksz = super->s_blocksize;
		assert("nikita-2028", blksz == (int)PAGE_CACHE_SIZE);

		spin_lock_jnode(node);
		blocknr = *jnode_get_io_block(node);
		spin_unlock_jnode(node);

		assert("nikita-2275", blocknr != (reiser4_block_nr) 0);
		assert("nikita-2276", !reiser4_blocknr_is_fake(&blocknr));

		bio->bi_bdev = super->s_bdev;
		/* fill bio->bi_sector before calling bio_add_page(), because
		 * q->merge_bvec_fn may want to inspect it (see
		 * drivers/md/linear.c:linear_mergeable_bvec() for example). */
		bio->bi_sector = blocknr * (blksz >> 9);

		if (!bio_add_page(bio, page, blksz, 0)) {
			warning("nikita-3452",
				"Single page bio cannot be constructed");
			/* release the just allocated bio to avoid leaking it
			 * on this (unlikely) error path */
			bio_put(bio);
			return ERR_PTR(RETERR(-EINVAL));
		}

		/* bio->bi_idx is filled by bio_init() */
		bio->bi_end_io = (rw == READ) ?
		    end_bio_single_page_read : end_bio_single_page_write;

		return bio;
	} else
		return ERR_PTR(RETERR(-ENOMEM));
}
/* this function is internally called by jnode_make_dirty() */
int reiser4_set_page_dirty_internal(struct page *page)
{
	struct address_space *mapping;

	mapping = page->mapping;
	BUG_ON(mapping == NULL);

	if (!TestSetPageDirty(page)) {
		if (mapping_cap_account_dirty(mapping))
			inc_zone_page_state(page, NR_FILE_DIRTY);

		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
	}

	/* znode must be dirty ? */
	if (mapping->host == reiser4_get_super_fake(mapping->host->i_sb))
		assert("", JF_ISSET(jprivate(page), JNODE_DIRTY));
	return 0;
}
#if 0
static int can_hit_entd(reiser4_context *ctx, struct super_block *s)
{
	if (ctx == NULL || ((unsigned long)ctx->magic) != context_magic)
		return 1;
	if (ctx->super != s)
		return 1;
	if (get_super_private(s)->entd.tsk == current)
		return 0;
	if (!lock_stack_isclean(&ctx->stack))
		return 0;
	if (ctx->trans->atom != NULL)
		return 0;
	return 1;
}
#endif
/**
 * reiser4_writepage - writepage of struct address_space_operations
 * @page: page to write
 * @wbc: writeback control
 */
/* Common memory pressure notification. */
int reiser4_writepage(struct page *page,
		      struct writeback_control *wbc)
{
	struct super_block *s;
	reiser4_context *ctx;

	assert("vs-828", PageLocked(page));

	s = page->mapping->host->i_sb;
	ctx = get_current_context_check();

	/* assert("", can_hit_entd(ctx, s)); */
	return write_page_by_ent(page, wbc);
}
/* ->set_page_dirty() method of formatted address_space */
static int formatted_set_page_dirty(struct page *page)
{
	assert("nikita-2173", page != NULL);
	BUG();
	return __set_page_dirty_nobuffers(page);
}
/* The writepages method of address space operations in reiser4 is used to
   pull into transactions pages which are dirtied via mmap. Only regular
   files can have such pages. The fake inode is used to access formatted
   nodes via the page cache. As formatted nodes can never be mmapped, the
   fake inode's writepages has nothing to do. */
static int
writepages_fake(struct address_space *mapping, struct writeback_control *wbc)
{
	return 0;
}
/* address space operations for the fake inode */
static struct address_space_operations formatted_fake_as_ops = {
	/* Perform a writeback of a single page as a memory-freeing
	 * operation. */
	.writepage = reiser4_writepage,
	/* this is called to read formatted node */
	.readpage = formatted_readpage,
	/* ->sync_page() method of fake inode address space operations. Called
	   from wait_on_page() and lock_page().

	   This is a most annoyingly misnamed method. Actually it is called
	   from wait_on_page_bit() and lock_page() and its purpose is to
	   actually start io by jabbing device drivers.
	 */
	.sync_page = block_sync_page,
	/* Write back some dirty pages from this mapping. Called during sync
	   (pdflush). */
	.writepages = writepages_fake,
	/* Set a page dirty */
	.set_page_dirty = formatted_set_page_dirty,
	/* used for read-ahead. Not applicable */
	.readpages = NULL,
	.prepare_write = NULL,
	.commit_write = NULL,
	.bmap = NULL,
	/* called just before page is being detached from inode mapping and
	   removed from memory. Called on truncate, cut/squeeze, and
	   umount. */
	.invalidatepage = reiser4_invalidatepage,
	/* this is called by shrink_cache() so that the file system can try to
	   release objects (jnodes, buffers, journal heads) attached to the
	   page and, maybe, make the page itself freeable.
	 */
	.releasepage = reiser4_releasepage,
	.direct_IO = NULL
};
/* called just before page is released (no longer used by reiser4). Callers:
   jdelete() and extent2tail(). */
void reiser4_drop_page(struct page *page)
{
	assert("nikita-2181", PageLocked(page));
	clear_page_dirty_for_io(page);
	ClearPageUptodate(page);
#if defined(PG_skipped)
	ClearPageSkipped(page);
#endif
	unlock_page(page);
}
#define JNODE_GANG_SIZE (16)

/* find all jnodes from the range specified and invalidate them */
static int
truncate_jnodes_range(struct inode *inode, pgoff_t from, pgoff_t count)
{
	reiser4_inode *info;
	int truncated_jnodes;
	reiser4_tree *tree;
	unsigned long index;
	unsigned long end;

	if (inode_file_plugin(inode) ==
	    file_plugin_by_id(CRYPTCOMPRESS_FILE_PLUGIN_ID))
		/*
		 * No need to get rid of jnodes here: if the single jnode of
		 * page cluster did not have page, then it was found and killed
		 * before in
		 * truncate_complete_page_cluster()->jput()->jput_final(),
		 * otherwise it will be dropped by reiser4_invalidatepage()
		 */
		return 0;

	truncated_jnodes = 0;

	info = reiser4_inode_data(inode);
	tree = reiser4_tree_by_inode(inode);

	index = from;
	end = from + count;

	while (1) {
		jnode *gang[JNODE_GANG_SIZE];
		int taken;
		int i;
		jnode *node;

		assert("nikita-3466", index <= end);

		read_lock_tree(tree);
		taken =
		    radix_tree_gang_lookup(jnode_tree_by_reiser4_inode(info),
					   (void **)gang, index,
					   JNODE_GANG_SIZE);
		for (i = 0; i < taken; ++i) {
			node = gang[i];
			if (index_jnode(node) < end)
				jref(node);
			else
				gang[i] = NULL;
		}
		read_unlock_tree(tree);

		for (i = 0; i < taken; ++i) {
			node = gang[i];
			if (node != NULL) {
				index = max(index, index_jnode(node));
				spin_lock_jnode(node);
				assert("edward-1457", node->pg == NULL);
				/* this is always called after
				   truncate_inode_pages_range(). Therefore the
				   jnode cannot have a page here. New pages
				   cannot be created because
				   truncate_jnodes_range is called with
				   exclusive access to the file held, whereas
				   new page creation requires non-exclusive
				   access */
				JF_SET(node, JNODE_HEARD_BANSHEE);
				reiser4_uncapture_jnode(node);
				unhash_unformatted_jnode(node);
				truncated_jnodes++;
				jput(node);
			} else
				break;
		}
		if (i != taken || taken == 0)
			break;
	}
	return truncated_jnodes;
}
/* Truncating files in reiser4: problems and solutions.

   The VFS calls the fs's truncate after it has called truncate_inode_pages()
   to get rid of pages corresponding to the part of the file being truncated.
   In reiser4 this may cause the existence of unallocated extents which do
   not have jnodes. The flush code does not expect that. The solution to this
   problem is straightforward. As the VFS's truncate is implemented using the
   setattr operation, it seems reasonable to have a ->setattr() that will cut
   the file body. However, the flush code also does not expect dirty pages
   without parent items, so it is impossible to cut all items and then
   truncate all pages in two steps. We resolve this problem by cutting items
   one-by-one. Each such fine-grained step, performed under a longterm znode
   lock, calls at the end the ->kill_hook() method of the killed item to
   remove the pages and jnodes bound to it.

   The following function is a common part of the mentioned kill hooks.
   Also, this is called before tail-to-extent conversion (to avoid managing
   several copies of the data).
*/
void reiser4_invalidate_pages(struct address_space *mapping, pgoff_t from,
			      unsigned long count, int even_cows)
{
	loff_t from_bytes, count_bytes;

	if (count == 0)
		return;
	from_bytes = ((loff_t) from) << PAGE_CACHE_SHIFT;
	count_bytes = ((loff_t) count) << PAGE_CACHE_SHIFT;

	unmap_mapping_range(mapping, from_bytes, count_bytes, even_cows);
	truncate_inode_pages_range(mapping, from_bytes,
				   from_bytes + count_bytes - 1);
	truncate_jnodes_range(mapping->host, from, count);
}
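
/*
 * A sketch of how a caller such as an item kill hook or tail-to-extent
 * conversion might invoke reiser4_invalidate_pages() for a byte range.
 * Illustrative only and kept under #if 0 so it is never compiled; the helper
 * name and the rounding of the byte range to whole pages are assumptions for
 * the illustration, not code used elsewhere in reiser4.
 */
#if 0
static void invalidate_byte_range_sketch(struct inode *inode,
					 loff_t from, loff_t to)
{
	pgoff_t index = from >> PAGE_CACHE_SHIFT;
	unsigned long count = (to >> PAGE_CACHE_SHIFT) - index + 1;

	/* drop the pages and jnodes backing [from, to]; even_cows != 0 also
	 * unmaps copy-on-write mappings of the range */
	reiser4_invalidate_pages(inode->i_mapping, index, count, 1);
}
#endif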
/*
 * Local variables:
 * c-indentation-style: "K&R"
 * mode-name: "LC"
 * c-basic-offset: 8
 * tab-width: 8
 * fill-column: 120
 * scroll-step: 1
 * End:
 */