1 ext4: import inodes chapter from wiki page
3 From: Darrick J. Wong <darrick.wong@oracle.com>
5 Import the chapter about inodes from the on-disk format wiki
6 page into the kernel documentation.
8 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
9 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
11 Documentation/filesystems/ext4/ondisk/dynamic.rst | 9 ++
12 Documentation/filesystems/ext4/ondisk/index.rst | 1 +
13 Documentation/filesystems/ext4/ondisk/inodes.rst | 575 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
14 3 files changed, 585 insertions(+)
16 diff --git a/Documentation/filesystems/ext4/ondisk/dynamic.rst b/Documentation/filesystems/ext4/ondisk/dynamic.rst
18 index 000000000000..7c5f5019b9d6
20 +++ b/Documentation/filesystems/ext4/ondisk/dynamic.rst
22 +.. SPDX-License-Identifier: GPL-2.0
27 +Dynamic metadata are created on the fly when files and blocks are
30 +.. include:: inodes.rst
31 diff --git a/Documentation/filesystems/ext4/ondisk/index.rst b/Documentation/filesystems/ext4/ondisk/index.rst
32 index dbb259f83976..f7d082c3a435 100644
33 --- a/Documentation/filesystems/ext4/ondisk/index.rst
34 +++ b/Documentation/filesystems/ext4/ondisk/index.rst
35 @@ -6,3 +6,4 @@ Data Structures and Algorithms
36 .. include:: about.rst
37 .. include:: overview.rst
38 .. include:: globals.rst
39 +.. include:: dynamic.rst
40 diff --git a/Documentation/filesystems/ext4/ondisk/inodes.rst b/Documentation/filesystems/ext4/ondisk/inodes.rst
42 index 000000000000..655ce898f3f5
44 +++ b/Documentation/filesystems/ext4/ondisk/inodes.rst
46 +.. SPDX-License-Identifier: GPL-2.0
51 +In a regular UNIX filesystem, the inode stores all the metadata
52 +pertaining to the file (time stamps, block maps, extended attributes,
53 +etc.), not the directory entry. To find the information associated with a
54 +file, one must traverse the directory files to find the directory entry
55 +associated with it, then load the inode to find the metadata for
56 +that file. ext4 cheats a little bit (for performance reasons)
57 +by storing a copy of the file type (normally stored in the inode) in the
58 +directory entry. (Compare all this to FAT, which stores all the file
59 +information directly in the directory entry, but does not support hard
60 +links and is in general more seek-happy than ext4 due to its simpler
61 +block allocator and extensive use of linked lists.)
63 +The inode table is a linear array of ``struct ext4_inode``. The table is
64 +sized to have enough blocks to store at least
65 +``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
66 +block group containing an inode can be calculated as
67 +``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
68 +group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
71 +The inode checksum is calculated against the FS UUID, the inode number,
72 +and the inode structure itself.
74 +The inode table entry is laid out in ``struct ext4_inode``.
87 + - File mode. See the table i_mode_ below.
91 + - Lower 16-bits of Owner UID.
95 + - Lower 32-bits of size in bytes.
99 + - Last access time, in seconds since the epoch. However, if the EA\_INODE
100 + inode flag is set, this inode stores an extended attribute value and
101 + this field contains the checksum of the value.
105 + - Last inode change time, in seconds since the epoch. However, if the
106 + EA\_INODE inode flag is set, this inode stores an extended attribute
107 + value and this field contains the lower 32 bits of the attribute value's
112 + - Last data modification time, in seconds since the epoch. However, if the
113 + EA\_INODE inode flag is set, this inode stores an extended attribute
114 + value and this field contains the number of the inode that owns the
115 + extended attribute.
119 + - Deletion Time, in seconds since the epoch.
123 + - Lower 16-bits of GID.
127 + - Hard link count. Normally, ext4 does not permit an inode to have more
128 + than 65,000 hard links. This applies to files as well as directories,
129 + which means that there cannot be more than 64,998 subdirectories in a
130 + directory (each subdirectory's '..' entry counts as a hard link, as does
131 + the '.' entry in the directory itself). With the DIR\_NLINK feature
132 + enabled, ext4 supports more than 64,998 subdirectories by setting this
133 + field to 1 to indicate that the number of hard links is not known.
137 + - Lower 32-bits of “block” count. If the huge\_file feature flag is not
138 + set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
139 + on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
140 + ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi
141 + << 32)`` 512-byte blocks on disk. If huge\_file is set and
142 + EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``, then this file
143 + consumes ``i_blocks_lo + (i_blocks_hi << 32)`` filesystem blocks on
148 + - Inode flags. See the table i_flags_ below.
152 + - See the table i_osd1_ for more details.
155 + - i\_block[EXT4\_N\_BLOCKS=15]
156 + - Block map or extent tree. See the section “The Contents of inode.i\_block”.
160 + - File version (for NFS).
164 + - Lower 32-bits of extended attribute block. ACLs are one of
165 + many possible extended attributes; the name of this field probably
166 + results from the first use of extended attributes being for ACLs.
169 + - i\_size\_high / i\_dir\_acl
170 + - Upper 32-bits of file/directory size. In ext2/3 this field was named
171 + i\_dir\_acl, though it was usually set to zero and never used.
175 + - (Obsolete) fragment address.
179 + - See the table i_osd2_ for more details.
183 + - Size of this inode - 128. Alternately, the size of the extended inode
184 + fields beyond the original ext2 inode, including this field.
188 + - Upper 16-bits of the inode checksum.
192 + - Extra change time bits. This provides sub-second precision. See the
193 + Inode Timestamps section.
197 + - Extra modification time bits. This provides sub-second precision.
201 + - Extra access time bits. This provides sub-second precision.
205 + - File creation time, in seconds since the epoch.
209 + - Extra file creation time bits. This provides sub-second precision.
213 + - Upper 32-bits for version number.
221 +The ``i_mode`` value is a combination of the following flags:
230 + - S\_IXOTH (Others may execute)
232 + - S\_IWOTH (Others may write)
234 + - S\_IROTH (Others may read)
236 + - S\_IXGRP (Group members may execute)
238 + - S\_IWGRP (Group members may write)
240 + - S\_IRGRP (Group members may read)
242 + - S\_IXUSR (Owner may execute)
244 + - S\_IWUSR (Owner may write)
246 + - S\_IRUSR (Owner may read)
248 + - S\_ISVTX (Sticky bit)
250 + - S\_ISGID (Set GID)
252 + - S\_ISUID (Set UID)
254 + - These are mutually-exclusive file types:
258 + - S\_IFCHR (Character device)
260 + - S\_IFDIR (Directory)
262 + - S\_IFBLK (Block device)
264 + - S\_IFREG (Regular file)
266 + - S\_IFLNK (Symbolic link)
268 + - S\_IFSOCK (Socket)
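The split between permission bits and the mutually exclusive file type can be illustrated with a short sketch. It assumes the ``i_mode`` value has already been read from disk as an integer, and it relies on Python's standard ``stat`` module, whose ``S_IF*`` constants match the Linux values on Linux hosts; the helper name ``describe_mode`` is made up for this example.

```python
import stat


def describe_mode(i_mode):
    """Split a 16-bit i_mode value into its file type and permission bits.

    Returns (file_type, permission_bits, ls_style_string).
    """
    file_type = stat.S_IFMT(i_mode)      # top four bits: one of the S_IF* values
    permissions = stat.S_IMODE(i_mode)   # lower 12 bits: rwx bits plus setuid/setgid/sticky
    return file_type, permissions, stat.filemode(i_mode)


# 0x81A4 is S_IFREG (0x8000) combined with permission bits 0o644.
file_type, permissions, text = describe_mode(0x81A4)
print(hex(file_type), oct(permissions), text)
```

Because the file types are mutually exclusive, they are compared for equality after masking (as ``stat.S_ISDIR()`` and friends do) rather than tested bit by bit.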
272 +The ``i_flags`` field is a combination of these values:
281 + - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
283 + - This file should be preserved, should undeletion be desired
284 + (EXT4\_UNRM\_FL). (not implemented)
286 + - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
288 + - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
290 + - File is immutable (EXT4\_IMMUTABLE\_FL).
292 + - File can only be appended (EXT4\_APPEND\_FL).
294 + - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
296 + - Do not update access time (EXT4\_NOATIME\_FL).
298 + - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
300 + - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
302 + - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
304 + - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
305 + EXT4\_ECOMPR\_FL (compression error), which was never used.
307 + - Directory has hashed indexes (EXT4\_INDEX\_FL).
309 + - AFS magic directory (EXT4\_IMAGIC\_FL).
311 + - File data must always be written through the journal
312 + (EXT4\_JOURNAL\_DATA\_FL).
314 + - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
316 + - All directory entry data should be written synchronously (see
317 + ``dirsync``) (EXT4\_DIRSYNC\_FL).
319 + - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
321 + - This is a huge file (EXT4\_HUGE\_FILE\_FL).
323 + - Inode uses extents (EXT4\_EXTENTS\_FL).
325 + - Inode stores a large extended attribute value in its data blocks
326 + (EXT4\_EA\_INODE\_FL).
328 + - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
331 + - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
333 + - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
336 + - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
339 + - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
341 + - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
343 + - Reserved for ext4 library (EXT4\_RESERVED\_FL).
347 + - User-visible flags.
349 + - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
350 + EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
351 + EXT4\_FL\_USER\_MODIFIABLE mask, since the kernel needs to handle the setting of
352 + these flags in a special manner and they are masked out of the set of
353 + flags that are saved directly to i\_flags.
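Since ``i_flags`` is a plain bitmask, decoding it amounts to testing each known bit. The sketch below covers only a handful of flags; the numeric values are taken from the kernel's ``fs/ext4/ext4.h`` rather than from the text above, and the names ``EXT4_FLAG_NAMES`` and ``decode_flags`` are hypothetical helpers for this example.

```python
# A subset of the flag bits, with values as defined in fs/ext4/ext4.h
# (assumption: the values are not quoted in the surrounding text).
EXT4_FLAG_NAMES = {
    0x00000010: "EXT4_IMMUTABLE_FL",
    0x00000020: "EXT4_APPEND_FL",
    0x00000080: "EXT4_NOATIME_FL",
    0x00001000: "EXT4_INDEX_FL",
    0x00040000: "EXT4_HUGE_FILE_FL",
    0x00080000: "EXT4_EXTENTS_FL",
    0x10000000: "EXT4_INLINE_DATA_FL",
}


def decode_flags(i_flags):
    """Return (names of known flags that are set, remaining unrecognised bits)."""
    names = []
    known_mask = 0
    for bit in sorted(EXT4_FLAG_NAMES):
        known_mask |= bit
        if i_flags & bit:
            names.append(EXT4_FLAG_NAMES[bit])
    return names, i_flags & ~known_mask


names, leftover = decode_flags(0x00081000)
print(names, hex(leftover))
```

Returning the unrecognised bits separately makes it easy to notice flags that this partial table does not cover.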
357 +The ``osd1`` field has multiple meanings depending on the creator:
372 + - Inode version. However, if the EA\_INODE inode flag is set, this inode
373 + stores an extended attribute value and this field contains the upper 32
374 + bits of the attribute value's reference count.
408 +The ``osd2`` field has multiple meanings depending on the filesystem creator:
422 + - l\_i\_blocks\_high
423 + - Upper 16-bits of the block count. Please see the note attached to
427 + - l\_i\_file\_acl\_high
428 + - Upper 16-bits of the extended attribute block (historically, the file
429 + ACL location). See the Extended Attributes section below.
433 + - Upper 16-bits of the Owner UID.
437 + - Upper 16-bits of the GID.
440 + - l\_i\_checksum\_lo
441 + - Lower 16-bits of the inode checksum.
464 + - Upper 16-bits of the file mode.
468 + - Upper 16-bits of the Owner UID.
472 + - Upper 16-bits of the GID.
494 + - m\_i\_file\_acl\_high
495 + - Upper 16-bits of the extended attribute block (historically, the file
499 + - m\_i\_reserved2[2]
505 +In ext2 and ext3, the inode structure size was fixed at 128 bytes
506 +(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
507 +128 bytes. Starting with ext4, it is possible to allocate a larger
508 +on-disk inode at format time for all inodes in the filesystem to provide
509 +space beyond the end of the original ext2 inode. The on-disk inode
510 +record size is recorded in the superblock as ``s_inode_size``. The
511 +number of bytes actually used by struct ext4\_inode beyond the original
512 +128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
513 +inode, which allows struct ext4\_inode to grow for a new kernel without
514 +having to upgrade all of the on-disk inodes. Access to fields beyond
515 +EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
516 +``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
517 +of October 2013) the inode structure is 156 bytes
518 +(``i_extra_isize = 28``). The extra space between the end of the inode
519 +structure and the end of the inode record can be used to store extended
520 +attributes. Each inode record can be as large as the filesystem block
521 +size, though this is not terribly efficient.
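The bounds check described above can be sketched as follows. This is only an illustration: the field offsets used in the example (0x84 for ``i_ctime_extra``, 0x98 for ``i_version_hi``) are taken from the full inode layout table rather than from this paragraph, and the helper names are made up.

```python
EXT2_GOOD_OLD_INODE_SIZE = 128


def field_fits(field_offset, field_size, i_extra_isize):
    """True if a field beyond the first 128 bytes lies within i_extra_isize."""
    return field_offset + field_size <= EXT2_GOOD_OLD_INODE_SIZE + i_extra_isize


def in_inode_xattr_space(s_inode_size, i_extra_isize):
    """Bytes left between the end of the inode structure and the end of the record."""
    return s_inode_size - EXT2_GOOD_OLD_INODE_SIZE - i_extra_isize


# With i_extra_isize = 28 the structure ends at byte 156, so the 4-byte
# i_version_hi field at offset 0x98 (152) is just inside it.
print(field_fits(0x98, 4, 28))
# An inode written with i_extra_isize = 4 has no room for i_ctime_extra (0x84).
print(field_fits(0x84, 4, 4))
# A 256-byte record with i_extra_isize = 28 leaves this much space for
# in-inode extended attributes.
print(in_inode_xattr_space(256, 28))
```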
526 +Each block group contains ``sb->s_inodes_per_group`` inodes. Because
527 +inode 0 is defined not to exist, this formula can be used to find the
528 +block group that an inode lives in:
529 +``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
530 +can be found within the block group's inode table at
531 +``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
532 +address within the inode table, use
533 +``offset = index * sb->s_inode_size``.
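The three formulas above combine into one small routine. The following sketch assumes the superblock values have already been parsed into plain integers; ``locate_inode`` is a hypothetical helper name, and the returned offset is relative to the start of the block group's inode table, not to the start of the disk.

```python
def locate_inode(inode_num, s_inodes_per_group, s_inode_size):
    """Return (block group, index within the group's table, byte offset in the table)."""
    if inode_num < 1:
        raise ValueError("inode 0 is defined not to exist")
    bg, index = divmod(inode_num - 1, s_inodes_per_group)
    return bg, index, index * s_inode_size


# Example geometry: 8192 inodes per group, 256-byte inode records.
print(locate_inode(12, 8192, 256))
print(locate_inode(8193, 8192, 256))
```

The caller would still need the group descriptor for ``bg`` to find where that group's inode table starts on disk.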
538 +Four timestamps are recorded in the lower 128 bytes of the inode
539 +structure -- inode change time (ctime), access time (atime), data
540 +modification time (mtime), and deletion time (dtime). The four fields
541 +are 32-bit signed integers that represent seconds since the Unix epoch
542 +(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
543 +January 2038. For inodes that are not linked from any directory but are
544 +still open (orphan inodes), the dtime field is overloaded for use with
545 +the orphan list. The superblock field ``s_last_orphan`` points to the
546 +first inode in the orphan list; dtime is then the number of the next
547 +orphaned inode, or zero if there are no more orphans.
549 +If the inode structure size ``sb->s_inode_size`` is larger than 128
550 +bytes and the ``i_extra_isize`` field is large enough to encompass the
551 +respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
552 +inode fields are widened to 64 bits. Within this “extra” 32-bit field,
553 +the lower two bits are used to extend the 32-bit seconds field to be 34
554 +bits wide; the upper 30 bits are used to provide nanosecond timestamp
555 +accuracy. Therefore, timestamps should not overflow until May 2446.
556 +dtime was not widened. There is also a fifth timestamp to record inode
557 +creation time (crtime); this field is 64 bits wide and decoded in the
558 +same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
559 +through the regular stat() interface, though debugfs will report them.
561 +We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
565 + :widths: 20 20 20 20 20
568 + * - Extra epoch bits
569 + - MSB of 32-bit time
570 + - Adjustment for signed 32-bit to 64-bit tv\_sec
571 + - Decoded 64-bit tv\_sec
576 + - ``-0x80000000 - -0x00000001``
577 + - 1901-12-13 to 1969-12-31
581 + - ``0x000000000 - 0x07fffffff``
582 + - 1970-01-01 to 2038-01-19
586 + - ``0x080000000 - 0x0ffffffff``
587 + - 2038-01-19 to 2106-02-07
591 + - ``0x100000000 - 0x17fffffff``
592 + - 2106-02-07 to 2174-02-25
596 + - ``0x180000000 - 0x1ffffffff``
597 + - 2174-02-25 to 2242-03-16
601 + - ``0x200000000 - 0x27fffffff``
602 + - 2242-03-16 to 2310-04-04
606 + - ``0x280000000 - 0x2ffffffff``
607 + - 2310-04-04 to 2378-04-22
611 + - ``0x300000000 - 0x37fffffff``
612 + - 2378-04-22 to 2446-05-10
614 +This is a somewhat odd encoding since there are effectively seven times
615 +as many positive values as negative values. There have also been
616 +long-standing bugs decoding and encoding dates beyond 2038, which don't
617 +seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
618 +incorrectly use the extra epoch bits 1,1 for dates between 1901 and
619 +1970. At some point the kernel will be fixed and e2fsck will fix this
620 +situation, assuming that it is run before 2310.
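The intended decoding of a widened timestamp (signed 32-bit seconds plus ``2^32`` times the extra epoch bits, with nanoseconds in the upper 30 bits of the extra field) can be sketched as below. The sketch assumes both on-disk fields have been read as unsigned little-endian 32-bit integers; ``decode_timestamp`` is a hypothetical helper, and it implements the encoding as documented in the table, not the buggy behaviour mentioned above.

```python
def decode_timestamp(time_lo, time_extra):
    """Decode an i_[cma]time / i_[cma]time_extra pair.

    Returns (seconds since the Unix epoch, nanoseconds).
    """
    # Reinterpret the unsigned 32-bit value as a signed 32-bit integer.
    seconds = time_lo - (1 << 32) if time_lo & 0x80000000 else time_lo
    epoch_bits = time_extra & 0x3   # lower two bits extend the seconds field
    nanoseconds = time_extra >> 2   # upper 30 bits hold the nanoseconds
    return seconds + (epoch_bits << 32), nanoseconds


# MSB set with epoch bits 0 decodes to a pre-1970 time; the same seconds
# field with epoch bits 1 lands just after the 2038 rollover.
print(decode_timestamp(0x80000000, 0))
print(decode_timestamp(0x80000000, 1))
```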