
Proceedings of the

Linux Symposium

June 27th–30th, 2007


Ottawa, Ontario
Canada
Conference Organizers
Andrew J. Hutton, Steamballoon, Inc.
C. Craig Ross, Linux Symposium

Review Committee
Andrew J. Hutton, Steamballoon, Inc.
Dirk Hohndel, Intel
Martin Bligh, Google
Gerrit Huizenga, IBM
Dave Jones, Red Hat, Inc.
C. Craig Ross, Linux Symposium

Proceedings Formatting Team


John W. Lockhart, Red Hat, Inc.
Gurhan Ozen, Red Hat, Inc.
John Feeney, Red Hat, Inc.
Len DiMaggio, Red Hat, Inc.
John Poelstra, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights
to all as a condition of submission.
The new ext4 filesystem: current status and future plans
Avantika Mathur, Mingming Cao, Suparna Bhattacharya
IBM Linux Technology Center
mathur@us.ibm.com, cmm@us.ibm.com, suparna@in.ibm.com
Andreas Dilger, Alex Tomas
Cluster File Systems, Inc.
adilger@clusterfs.com, alex@clusterfs.com
Laurent Vivier
Bull S.A.S.
laurent.vivier@bull.net

Abstract

Ext3 has been the most widely used general Linux® filesystem for many years. In keeping with increasing disk capacities and state-of-the-art feature requirements, the next generation of the ext3 filesystem, ext4, was created last year. This new filesystem incorporates scalability and performance enhancements for supporting large filesystems, while maintaining reliability and stability. Ext4 will be suitable for a larger variety of workloads and is expected to replace ext3 as the “Linux filesystem.”

In this paper we will first discuss the reasons for starting the ext4 filesystem, then explore the enhanced capabilities currently available and planned for ext4, discuss methods for migrating between ext3 and ext4, and finally compare ext4 and other filesystem performance on three classic filesystem benchmarks.

1 Introduction

Ext3 has been a very popular Linux filesystem due to its reliability, rich feature set, relatively good performance, and strong compatibility between versions. The conservative design of ext3 has given it the reputation of being stable and robust, but has also limited its ability to scale and perform well on large configurations.

With the pressure of increasing capabilities of new hardware and online resizing support in ext3, the requirement to address ext3 scalability and performance is more urgent than ever. One of the outstanding limits faced by ext3 today is the 16 TB maximum filesystem size. Enterprise workloads are already approaching this limit, and with disk capacities doubling every year and 1 TB hard disks easily available in stores, it will soon be hit by desktop users as well.

To address this limit, in August 2006, we posted a series of patches introducing two key features to ext3: larger filesystem capacity and extents mapping. The patches unavoidably change the on-disk format and break forwards compatibility. In order to maintain the stability of ext3 for its massive user base, we decided to branch to ext4 from ext3.

The primary goal of this new filesystem is to address scalability, performance, and reliability issues faced by ext3. A common question is why not use XFS or start an entirely new filesystem from scratch? We want to give the large number of ext3 users the opportunity to easily upgrade their filesystem, as was done from ext2 to ext3. Also, there has been considerable investment in the capabilities, robustness, and reliability of ext3 and e2fsck. Ext4 developers can take advantage of this previous work, and focus on adding advanced features and delivering a new scalable enterprise-ready filesystem in a short time frame.

Thus, ext4 was born. The new filesystem has been in mainline Linux since version 2.6.19. As of the writing of this paper, the filesystem is marked as developmental, titled ext4dev, explicitly warning users that it is not ready for production use. Currently, extents and 48-bit block numbers are included in ext4, but there are many new filesystem features in the roadmap that will be discussed throughout this paper. The current ext4 development git tree is hosted at git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4. Up-to-date ext4 patches and feature discussions can be found at the ext4 wiki page, http://ext4.wiki.kernel.org.
Some of the features in progress could possibly continue to change the on-disk layout. Ext4 will be converted from development mode to stable mode once the layout has been finalized. At that time, ext4 will be available for general use by all users in need of a more scalable and modern version of ext3. In the following three sections we will discuss new capabilities currently included in or planned for ext4 in the areas of scalability, fragmentation, and reliability.

2 Scalability enhancements

The first goal of ext4 was to become a more scalable filesystem. In this section we will discuss the scalability features that will be available in ext4.

2.1 Large filesystem

The current 16 TB filesystem size limit is caused by the 32-bit block number in ext3. To enlarge the filesystem limit, the straightforward method is to increase the number of bits used to represent block numbers and then fix all references to data and metadata blocks.

Previously, there was an extents [3] patch for ext3 with the capacity to support 48-bit physical block numbers. In ext4, instead of just extending the block numbers to 64 bits based on the current ext3 indirect block mapping, the ext4 developers decided to use extents mapping with 48-bit block numbers. This both increases filesystem capacity and improves large file efficiency. With 48-bit block numbers, ext4 can support a maximum filesystem size of up to 2^(48+12) = 2^60 bytes (1 EB) with a 4 KB block size.

After changing the data block numbers to 48-bit, the next step was to correct the references to metadata blocks correspondingly. Metadata is present in the superblock, the group descriptors, and the journal. New fields have been added at the end of the superblock structure to store the most significant 32 bits of the block-counter variables s_free_blocks_count, s_blocks_count, and s_r_blocks_count, extending them to 64 bits. Similarly, we introduced new 32-bit fields at the end of the block group descriptor structure to store the most significant bits of 64-bit values for bitmaps and inode table pointers.
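To make the split-field scheme concrete, the two halves can be recombined at run time. The following is a minimal sketch, assuming field names in the style described above (the real accessors live in the kernel's ext4 headers, and the little-endian on-disk types are simplified here to plain integers):

    #include <stdint.h>

    /* Simplified on-disk view: the original 32-bit count plus the
     * new high-32-bit field added at the end of the superblock. */
    struct sb_counts {
        uint32_t s_blocks_count_lo;   /* original 32-bit field */
        uint32_t s_blocks_count_hi;   /* new field at end of superblock */
    };

    /* Recombine the two halves into the full 64-bit block count. */
    static inline uint64_t sb_blocks_count(const struct sb_counts *es)
    {
        return ((uint64_t)es->s_blocks_count_hi << 32) |
                es->s_blocks_count_lo;
    }

The same high/low split applies to the block group descriptor's bitmap and inode table pointers, so old fields keep their offsets and unmodified code paths continue to work.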
Since the addresses of modified blocks in the filesystem are logged in the journal, the journaling block layer (JBD) is also required to support at least 48-bit block addresses. Therefore, JBD was branched to JBD2 to support more than 32-bit block numbers at the same time ext4 was forked. Although currently only ext4 is using JBD2, it can provide general journaling support for both 32-bit and 64-bit filesystems.

One may question why we chose 48-bit rather than full 64-bit support. The 1 EB limit will be sufficient for many years. Long before this limit is hit there will be reliability issues that need to be addressed. At current speeds, a 1 EB filesystem would take 119 years to finish one full e2fsck, and 65536 times that for a 2^64-block (64 ZB) filesystem. Overcoming these kinds of reliability issues is the priority of ext4 developers before addressing full 64-bit support, and this is discussed later in the paper.

2.1.1 Future work

After extending the limit created by 32-bit block numbers, the filesystem capacity is still restricted by the number of block groups in the filesystem. In ext3, for safety concerns, all block group descriptor copies are kept in the first block group. With the new uninitialized block group feature discussed in Section 4.1, the new block group descriptor size is 64 bytes. Given the default 128 MB (2^27 bytes) block group size, ext4 can have at most 2^27/64 = 2^21 block groups. This limits the entire filesystem size to 2^21 * 2^27 = 2^48 bytes, or 256 TB.

The solution to this problem is to use the metablock group feature (META_BG), which is already in ext3 for all 2.6 releases. With the META_BG feature, ext4 filesystems are partitioned into many metablock groups. Each metablock group is a cluster of block groups whose group descriptor structures can be stored in a single disk block. For ext4 filesystems with a 4 KB block size, a single metablock group partition includes 64 block groups, or 8 GB of disk space. The metablock group feature moves the location of the group descriptors from the congested first block group of the whole filesystem into the first group of each metablock group itself. The backups are in the second and last group of each metablock group. This increases the 2^21 maximum block groups limit to the hard limit of 2^32, allowing support for the full 1 EB filesystem.
[Figure 1: Ext4 extents, header and index structures — the 95-bit extent record holds a 48-bit physical block number, the uninitialized-extent flag, the extent length, and a 32-bit logical block number; ext4_extent_header holds eh_magic, eh_entries, eh_max, eh_depth, and eh_generation; ext4_extent_idx holds ei_block, ei_leaf, ei_leaf_hi, and ei_unused]

[Figure 2: Ext4 extent tree layout — the root stored in the ext4_inode i_block field points through index nodes, each with a node header and extent index entries, to leaf nodes holding the extents that map to disk blocks]

2.2 Extents

The ext3 filesystem uses an indirect block mapping scheme providing one-to-one mapping from logical blocks to disk blocks. This scheme is very efficient for sparse or small files, but has high overhead for larger files, performing poorly especially on large file delete and truncate operations [3].

As mentioned earlier, extents mapping is included in ext4. This approach efficiently maps logical to physical blocks for large contiguous files. An extent is a single descriptor which represents a range of contiguous physical blocks. Figure 1 shows the extents structure. As we discussed previously, the physical block field in an extents structure takes 48 bits. A single extent can represent 2^15 contiguous blocks, or 128 MB, with a 4 KB block size. The MSB of the extent length is used to flag uninitialized extents, used for the preallocation feature discussed in Section 3.1.
Four extents can be stored in the ext4 inode structure directly. This is generally sufficient to represent small or contiguous files. For very large, highly fragmented, or sparse files, more extents are needed. In this case a constant depth extent tree is used to store the extents map of a file. Figure 2 shows the layout of the extents tree. The root of this tree is stored in the ext4 inode structure and extents are stored in the leaf nodes of the tree.

Each node in the tree starts with an extent header (Figure 1), which contains the number of valid entries in the node, the capacity of entries the node can store, the depth of the tree, and a magic number. The magic number can be used to differentiate between different versions of extents, as new enhancements are made to the feature, such as increasing to 64-bit block numbers.

The extent header and magic number also add much-needed robustness to the on-disk structure of the data files. For very small filesystems, the block-mapped files implicitly depended on the fact that random corruption of an indirect block would be easily detectable, because the number of valid filesystem blocks is a small subset of a random 32-bit integer. With growing filesystem sizes, random corruption in an indirect block is by itself indistinguishable from valid block numbers.

In addition to the simple magic number stored in the extent header, the tree structure of the extent tree can be verified at runtime or by e2fsck in several ways. The ext4_extent_header has some internal consistency (eh_entries and eh_max) that also depends on the filesystem block size. eh_depth decreases from the root toward the leaves. The ext4_extent entries in a leaf block must have increasing ee_block numbers, and must not overlap their neighbors with ee_len. Similarly, the ext4_extent_idx also needs increasing ei_block values, and the range of blocks that an index covers can be verified against the actual range of blocks in the extent leaf.
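The on-disk layout of Figure 1 translates to three small structures. The sketch below mirrors the ext4 headers of this era, with the kernel's little-endian __le16/__le32 on-disk types reduced to plain integers so that it stands alone; the authoritative definitions are in the kernel source:

    #include <stdint.h>

    typedef uint16_t __le16;   /* little-endian on-disk types, simplified */
    typedef uint32_t __le32;

    struct ext4_extent_header {
        __le16  eh_magic;      /* magic number, to allow future formats */
        __le16  eh_entries;    /* number of valid entries in this node */
        __le16  eh_max;        /* capacity of entries the node can store */
        __le16  eh_depth;      /* depth of the tree below this node */
        __le32  eh_generation; /* generation of the tree */
    };

    struct ext4_extent {       /* leaf node entry */
        __le32  ee_block;      /* first logical block the extent covers */
        __le16  ee_len;        /* number of blocks; MSB flags uninitialized */
        __le16  ee_start_hi;   /* high 16 bits of the physical block */
        __le32  ee_start;      /* low 32 bits of the physical block */
    };

    struct ext4_extent_idx {   /* index node entry */
        __le32  ei_block;      /* index covers logical blocks from here */
        __le32  ei_leaf;       /* low 32 bits of the next-level block */
        __le16  ei_leaf_hi;    /* high 16 bits of the next-level block */
        __le16  ei_unused;
    };

Note how the 48-bit physical block number is split into a 32-bit low part and a 16-bit high part, matching the 48-bit addressing described in Section 2.1.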

Currently, extents mapping is enabled in ext4 with the extents mount option.
After the filesystem is mounted, any new files will be created with extent mapping. The benefits of extent maps are reflected in the performance evaluation in Section 7.

2.2.1 Future work

Extents are not very efficient for representing sparse or highly fragmented files. For highly fragmented files, we could introduce a new type of extent, a block-mapped extent. A different magic number, stored in the extent header, distinguishes the new type of leaf block, which contains a list of allocated block numbers similar to an ext3 indirect block. This would give us the increased robustness of the extent format, with the block allocation flexibility of the block-mapped format.

In order to improve the robustness of the on-disk data, there is a proposal to create an “extent tail” in the extent blocks, in addition to the extent header. The extent tail would contain the inode number and generation of the inode that has allocated the block, and a checksum of the extent block itself (though not the data). The checksum would detect internal corruption, and could also detect misplaced writes if the block number is included therein. The inode number could be used to detect corruption that causes the tree to reference the wrong block (whether by higher-level corruption, or misplaced writes). The inode number could also be used to reconstruct the data of a corrupted inode or assemble a deleted file, and also help in doing reverse-mapping of blocks for defragmentation among other things.

2.3 Large files

In Linux, file size is calculated based on the i_blocks counter value. However, the unit is in sectors (512 bytes), rather than in the filesystem block size (4096 bytes by default). Since ext4's i_blocks is a 32-bit variable in the inode structure, this limits the maximum file size in ext4 to 2^32 * 512 bytes = 2^41 bytes = 2 TB. This is a scalability limit that ext3 has planned to break for a while.

The solution for ext4 is quite straightforward. The first part is simply changing the i_blocks units in the ext4 inode to filesystem blocks. An ROCOMPAT feature flag HUGE_FILE is added in ext4 to signify that the i_blocks field in some inodes is in units of filesystem block size. Those inodes are marked with a flag EXT4_HUGE_FILE_FL, to allow existing inodes to keep i_blocks in 512-byte units without requiring a full filesystem conversion. In addition, the i_blocks variable is extended to 48 bits by using some of the reserved inode fields. We still have the limitation of 32-bit logical block numbers with the current extent format, which limits the file size to 16 TB. With the flexible extents format in the future (see Section 2.2.1), we may remove that limit and fully use the 48-bit i_blocks to enlarge the file size even more.
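The per-inode unit change can be illustrated with a small helper. This is a sketch, not kernel code — only the flag name and value come from the ext4 headers, and the shifts assume the default 4 KB block size mentioned above:

    #include <stdint.h>

    #define EXT4_HUGE_FILE_FL 0x00040000  /* i_blocks in filesystem blocks */

    /* Convert the on-disk i_blocks count into bytes. Without the flag,
     * i_blocks counts 512-byte sectors (the historic Linux unit); with
     * it, i_blocks counts 4096-byte filesystem blocks. */
    static inline uint64_t inode_bytes(uint64_t i_blocks, uint32_t i_flags)
    {
        if (i_flags & EXT4_HUGE_FILE_FL)
            return i_blocks << 12;   /* 4 KB filesystem blocks */
        return i_blocks << 9;        /* 512-byte sectors */
    }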
2.4 Large number of files

Some applications already create billions of files today, and even ask for support for trillions of files. In theory, the ext4 filesystem can support billions of files with 32-bit inode numbers. However, in practice, it cannot scale to this limit. This is because ext4, following ext3, still allocates inode tables statically. Thus, the maximum number of inodes has to be fixed at filesystem creation time. To avoid running out of inodes later, users often choose a very large number of inodes up-front. The consequence is that unnecessary disk space has to be allocated to store unused inode structures. The wasted space becomes more of an issue in ext4 with the larger default inode. This also makes the management and repair of large filesystems more difficult than it should be. The uninitialized group feature (Section 4.1) addresses this issue to some extent, but the problem still exists with aged filesystems in which the used and unused inodes can be mixed and spread across the whole filesystem.

Ext3 and ext4 developers have been thinking about supporting dynamic inode allocation for a while [9, 3]. There are three general considerations about dynamic inode table allocation:

• Performance: We need an efficient way to translate an inode number to the block where the inode structure is stored.

• Robustness: e2fsck should be able to locate inode table blocks scattered across the filesystem, in case the filesystem is corrupted.

• Compatibility: We need to handle the possible inode number collision issue with 64-bit inode numbers on 32-bit systems, due to overflow.

These three requirements make the design challenging.
[Figure 3: 64-bit inode layout — from the most significant bits down: 32-bit block group number, 15-bit relative block number, 4-bit offset]

With dynamic inode tables, the blocks storing the inode structure are no longer at a fixed location. One way to efficiently map the inode number to the block storing the corresponding inode structure is to encode the block number into the inode number directly, similar to what is done in XFS. This implies the use of 64-bit inode numbers. The low four to five bits of the inode number store the offset bits within the inode table block. The rest store the 32-bit block group number as well as a 15-bit relative block number within the group, as shown in Figure 3. Then, a cluster of contiguous inode table blocks (ITBC) can be allocated on demand. A bitmap at the head of the ITBC would be used to keep track of the free and used inodes, allowing fast inode allocation and deallocation.
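A sketch of this proposed encoding follows, using the field widths from Figure 3; the helper names, and the choice of a 4-bit offset, are illustrative only, since the design is still under discussion:

    #include <stdint.h>

    #define ITB_OFFSET_BITS 4    /* inode offset within an inode table block */
    #define ITB_RELBLK_BITS 15   /* block number relative to the group start */

    /* Pack a group number, relative block number, and in-block offset
     * into a 64-bit inode number, per the layout in Figure 3. */
    static inline uint64_t make_ino(uint32_t group, uint32_t rel_block,
                                    uint32_t offset)
    {
        return ((uint64_t)group << (ITB_RELBLK_BITS + ITB_OFFSET_BITS)) |
               ((uint64_t)rel_block << ITB_OFFSET_BITS) |
               offset;
    }

    /* Recover the location of the inode structure from its number. */
    static inline void parse_ino(uint64_t ino, uint32_t *group,
                                 uint32_t *rel_block, uint32_t *offset)
    {
        *offset    = ino & ((1u << ITB_OFFSET_BITS) - 1);
        *rel_block = (ino >> ITB_OFFSET_BITS) & ((1u << ITB_RELBLK_BITS) - 1);
        *group     = (uint32_t)(ino >> (ITB_RELBLK_BITS + ITB_OFFSET_BITS));
    }

The appeal of this scheme is that locating an inode requires no lookup table at all — the inode number itself is the map.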
In the case where the filesystem is corrupted, the majority of inode tables could be located by checking the directory entries. To further address the reliability concern, a magic number could be stored at the head of the ITBC, to help e2fsck recognize this metadata block.

Relocating inodes becomes tricky with this block-number-in-inode-number proposal. If the filesystem is resized or defragmented, we may have to change the location of the inode blocks, which would require changing all references to that inode number. The proposal to address this concern is to have a per-group “inode exception map” that translates an old block/inode number into a new block number where the relocated inode structure is actually stored. The map will usually be empty, unless the inode was moved.

One concern with the 64-bit inode number is the possible inode number collision with 32-bit applications, as applications might still be using the 32-bit stat() to access inode numbers and could break. Investigation is underway to see how common this case is, and whether most applications are currently fixed to use the 64-bit stat64(). One way to address this concern is to generate 32-bit inode numbers on 32-bit platforms. Seventeen bits is enough to represent block group numbers on 32-bit architectures, and we could limit the inode table blocks to the first 2^10 blocks of a block group to construct the 32-bit inode number. This way user applications will be ensured of getting unique inode numbers on 32-bit platforms. For 32-bit applications running on 64-bit platforms, we hope they are fixed by the time ext4 is in production, and this only starts to be an issue for filesystems over 1 TB in size.

In summary, dynamic inode allocation and 64-bit inode numbers are needed to support large numbers of files in ext4. The benefits are obvious, but the changes to the on-disk format may be intrusive. The design details are still under discussion.

2.5 Directory scalability

The maximum number of subdirectories contained in a single directory in ext3 is 32,000. To address directory scalability, this limit will be eliminated in ext4, providing unlimited sub-directory support.

In order to better support large directories with many entries, the directory indexing feature [6] will be turned on by default in ext4. By default in ext3, directory entries are still stored in a linked list, which is very inefficient for directories with large numbers of entries. The directory indexing feature addresses this scalability issue by storing directory entries in a constant-depth HTree data structure, which is a specialized BTree-like structure using 32-bit hashes. The fast lookup time of the HTree significantly improves performance on large directories. For directories with more than 10,000 files, improvements were often by a factor of 50 to 100 [3].

2.5.1 Future work

While the HTree implementation allowed the ext2 directory format to be improved from linear to a tree search compatibly, there are also limitations to this approach. The HTree implementation has a limit of 510 * 511 4 KB directory leaf blocks (approximately 25M 24-byte filenames) that can be indexed with a 2-level tree. It would be possible to change the code to allow a 3-level HTree. There is also currently a 2 GB file size limit on directories, because the code for using the high 32 bits of i_size on directories was not implemented when the 2 GB limit was fixed for regular files.
Because the hashing used to find filenames in indexed directories is essentially random compared to the linear order in which inodes are allocated, we end up doing random seeks around the disk when accessing many inodes in a large directory. We need to have readdir in hash-index order because directory entries might be moved during the split of a directory leaf block, so to satisfy POSIX requirements we can only safely walk the directory in hash order.

To address this problem, there is a proposal to put the whole inode into the directory instead of just a directory entry that references a separate inode. This avoids the need to seek to the inode when doing a readdir, because the whole inode has been read into memory already in the readdir step. If the blocks that make up the directory are efficiently allocated, then reading the directory also does not require any seeking.

This would also allow dynamic inode allocation, with the directory as the “container” of the inode table. The inode numbers would be generated in a similar manner as previously discussed (Section 2.4), so that the block that an inode resides in can be located directly from the inode number itself. Hard linked files imply that the same block is allocated to multiple directories at the same time, but this can be reconciled by the link count in the inode itself.

We also need to store one or more file names in the inode itself, and this can be done by means of an extended attribute that uses the directory inode number as the EA name. We can then return the name(s) associated with that inode for a single directory immediately when doing readdir, and skip any other name(s) for the inode that belong to hard links in another directory. For efficient name-to-inode lookup in the directory, we would still use a secondary tree similar to the current ext3 HTree (though it would need an entry per name instead of per directory block). But because the directory entries (the inodes themselves) do not get moved as the directory grows, we can just use disk block or directory offset order for readdir.
2.6 Large inode and fast extended attributes

Ext3 supports different inode sizes. The inode size can be set to any power of two larger than 128 bytes, up to the filesystem block size, by using the mke2fs -I [inode size] option at format time. The default inode structure size is 128 bytes, which is already crowded with data and has little space for new fields. In ext4, the default inode structure size will be 256 bytes.

[Figure 4: Layout of the large inode — the original 128-bit inode occupies bytes 0–127; it is followed by the fixed fields i_extra_isize, i_pad1, i_ctime_extra, i_mtime_extra, i_atime_extra, i_crtime, i_crtime_extra, and i_version_hi, then by fast extended attributes, up to byte 255]

In order to avoid duplicating a lot of code in the kernel and e2fsck, the large inodes keep the same fixed layout for the first 128 bytes, as shown in Figure 4. The rest of the inode is split into two parts: a fixed-field section that allows addition of fields common to all inodes, such as nanosecond timestamps (Section 5), and a section for fast extended attributes (EAs) that consumes the rest of the inode.

The fixed-field part of the inode is dynamically sized, based on what fields the current kernel knows about. The size of this area is stored in each inode in the i_extra_isize field, which is the first field beyond the original 128-byte inode. The superblock also contains two fields, s_min_extra_isize and s_want_extra_isize, which allow down-level kernels to allocate a larger i_extra_isize than they would otherwise do.

The s_min_extra_isize is the guaranteed minimum amount of fixed-field space in each inode. s_want_extra_isize is the desired amount of fixed-field space for a new inode, but there is no guarantee that this much space will be available in every inode. A ROCOMPAT feature flag EXTRA_ISIZE indicates whether these superblock fields are valid. The ext4 code will soon also be able to expand i_extra_isize dynamically as needed to cover the fixed fields, so long as there is space available to store the fast EAs or migrate them to an external EA block.
The remaining large inode space may be used for storing EA data inside the inode. Since the EAs are already in memory after the inode is read from disk, this avoids a costly seek to an external EA block. This can greatly improve the performance of applications that are using EAs, sometimes by a factor of 3–7 [4]. An external EA block is still available in addition to the fast EA space, which allows storing up to 4 KB of EAs for each file.

The support for fast EAs in large inodes has been available in Linux kernels since 2.6.12, though it is rarely used because many people do not know of this capability at mke2fs time. Since ext4 will have larger inodes, this feature will be enabled by default.

There have also been discussions about breaking the 4 KB EA limit, in order to store larger or more EAs. It is likely that larger single EAs will be stored in their own inode (to allow arbitrary-sized EAs) and it may also be that for many EAs they will be stored in a directory-like structure, possibly leveraging the same code as regular ext4 directories and storing small values inline.
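To make the large-inode layout concrete: the fast-EA area simply begins where the fixed fields end. The calculation below is illustrative, not kernel code; only the 128-byte constant and the i_extra_isize field come from the on-disk format described above:

    #include <stdint.h>

    #define GOOD_OLD_INODE_SIZE 128u   /* original fixed ext2/ext3 layout */

    /* Byte offset, within a large inode, where the fast EA area starts:
     * the original 128 bytes, then i_extra_isize bytes of fixed fields. */
    static inline uint32_t fast_ea_start(uint16_t i_extra_isize)
    {
        return GOOD_OLD_INODE_SIZE + i_extra_isize;
    }

    /* Space left for fast EAs in, e.g., a 256-byte inode. */
    static inline uint32_t fast_ea_room(uint32_t inode_size,
                                        uint16_t i_extra_isize)
    {
        return inode_size - fast_ea_start(i_extra_isize);
    }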
3 Block allocation enhancements

Increased filesystem throughput is the premier goal for all modern filesystems. In order to meet this goal, developers are constantly attempting to reduce filesystem fragmentation. High fragmentation rates cause greater disk access time affecting overall throughput, and increased metadata overhead causing less efficient mapping.

There is an array of new features in line for ext4, which take advantage of the existing extents mapping and are aimed at reducing filesystem fragmentation by improving block allocation techniques.
3.1 Persistent preallocation

Some applications, like databases and streaming media servers, benefit from the ability to preallocate blocks for a file up-front (typically extending the size of the file in the process), without having to initialize those blocks with valid data or zeros. Preallocation helps ensure contiguous allocation as far as possible for a file (irrespective of when and in what order data actually gets written) and guaranteed space allocation for writes within the preallocated size. It is useful when an application has some foreknowledge of how much space the file will require. The filesystem internally interprets the preallocated but not yet initialized portions of the file as zero-filled blocks. This avoids exposing stale data for each block until it is explicitly initialized through a subsequent write. Preallocation must be persistent across reboots, unlike ext3 and ext4 block reservations [3].

For applications involving purely sequential writes, it is possible to distinguish between initialized and uninitialized portions of the file. This can be done by maintaining a single high water mark value representing the size of the initialized portion. However, for databases and other applications where random writes into the preallocated blocks can occur in any order, this is not sufficient. The filesystem needs to be able to identify ranges of uninitialized blocks in the middle of the file. Therefore, some extent-based filesystems, like XFS, and now ext4, provide support for marking allocated but uninitialized extents associated with a given file.

Ext4 implements this by using the MSB of the extent length field to indicate whether a given extent contains uninitialized data, as shown in Figure 1. During reads, an uninitialized extent is treated just like a hole, so that the VFS returns zero-filled blocks. Upon writes, the extent must be split into initialized and uninitialized extents, merging the initialized portion with an adjacent initialized extent if contiguous.
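The MSB convention can be expressed directly against the extent structure sketched in Section 2.2. The helper names below are illustrative, but the encoding mirrors the kernel's convention of treating stored lengths above 2^15 as uninitialized:

    #include <stdint.h>

    #define EXT_INIT_MAX_LEN (1u << 15)  /* max length of an initialized extent */

    /* An extent whose ee_len exceeds 2^15 is allocated but uninitialized;
     * its true length is ee_len with the MSB stripped off. */
    static inline int ext_is_uninitialized(uint16_t ee_len)
    {
        return ee_len > EXT_INIT_MAX_LEN;
    }

    static inline uint16_t ext_real_len(uint16_t ee_len)
    {
        return ee_len <= EXT_INIT_MAX_LEN
                   ? ee_len
                   : (uint16_t)(ee_len - EXT_INIT_MAX_LEN);
    }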
Until now, XFS, the other Linux filesystem that implements preallocation, provided an ioctl interface to applications. With more filesystems, including ext4, now providing this feature, a common system-call interface for fallocate and an associated inode operation have been introduced. This allows filesystem-specific implementations of preallocation to be exploited by applications using the posix_fallocate API.
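From user space, persistent preallocation is reached through the portable API named above. For instance, a database could reserve space up front before writing randomly into it (the file name and size here are illustrative):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return EXIT_FAILURE;

        /* Reserve 1 GB on disk without writing any data; on ext4 this
         * becomes an uninitialized extent rather than zero-filled blocks. */
        if (posix_fallocate(fd, 0, 1024 * 1024 * 1024) != 0) {
            close(fd);
            return EXIT_FAILURE;
        }

        close(fd);
        return EXIT_SUCCESS;
    }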
3.2 Delayed and multiple block allocation

The block allocator in ext3 allocates one block at a time during the write operation, which is inefficient for larger I/O. Since block allocation requests are passed through the VFS layer one at a time, the underlying ext3 filesystem cannot foresee and cluster future requests. This also increases the possibility of file fragmentation.

Delayed allocation is a well-known technique in which block allocations are postponed to page flush time, rather than during the write() operation [3]. This method provides the opportunity to combine many block allocation requests into a single request, reducing possible fragmentation and saving CPU cycles. Delayed allocation also avoids unnecessary block allocation for short-lived files.

Ext4 delayed allocation patches have been implemented, but there is work underway to move this support to the VFS layer, so multiple filesystems can benefit from the feature.

With delayed allocation support, multiple block allocation for buffered I/O is now possible. An entire extent, containing multiple contiguous blocks, is allocated at once rather than one block at a time. This eliminates multiple calls to ext4_get_blocks and ext4_new_blocks and reduces CPU utilization.

Ext4 multiple block allocation builds per-block-group free extents information based on the on-disk block bitmap. It uses this information to guide the search for free extents to satisfy an allocation request. This free extent information is generated at filesystem mount time and stored in memory using a buddy structure.

The performance benefits of delayed allocation alone are very obvious, and can be seen in Section 7. In a previous study [3], we have seen about 30% improved throughput and 50% reduction in CPU usage with the two features combined. Overall, delayed and multiple block allocation can significantly improve filesystem performance on large I/O.

There are two other features in progress that are built on top of delayed and multiple block allocation, trying to further reduce fragmentation:
• In-core Preallocation: Using the in-core free extents information, a more powerful in-core block preallocation/reservation can be built. This further improves block placement and reduces fragmentation with concurrent write workloads. An inode can have a number of preallocated chunks, indexed by the logical blocks. This improvement can help HPC applications when a number of nodes write to one huge file at very different offsets.

• Locality Groups: Currently, allocation policy decisions for individual files are made independently. If the allocator had knowledge of file relationships, it could intelligently place related files close together, greatly benefiting read performance. The locality groups feature clusters related files together by a given attribute, such as SID or a combination of SID and parent directory. At the deferred page-flush time, dirty pages are written out by groups, instead of by individual files. The number of non-allocated blocks is tracked at the group level, and upon flush time, the allocator can try to preallocate enough space for the entire group. This space is shared by the files in the group for their individual block allocation. In this way, related files are placed tightly together.

In summary, ext4 will have a powerful block allocation scheme that can efficiently handle large block I/O and reduce filesystem fragmentation with small files under multi-threaded workloads.

3.3 Online defragmentation

Though the features discussed in this section improve block allocation to avoid fragmentation in the first place, with age, the filesystem can still become quite fragmented. The ext4 online defragmentation tool, e4defrag, has been developed to address this. This tool can defragment individual files or the entire filesystem. For each file, the tool creates a temporary inode and allocates contiguous extents to the temporary inode using multiple block allocation. It then copies the original file data to the page cache and flushes the dirty pages to the temporary inode's blocks. Finally, it migrates the block pointers from the temporary inode to the original inode.

4 Reliability enhancements

Reliability is very important to ext3 and is one of the reasons for its vast popularity. In keeping with this reputation, ext4 developers are putting much effort into maintaining the reliability of the filesystem. While it is relatively easy for any filesystem designer to make their fields 64 bits in size, it is much more difficult to make such large amounts of space actually usable in the real world.

Despite the use of journaling and RAID, there are invariably corruptions to the disk filesystem. The first line of defense is detecting and avoiding problems proactively by a combination of robust metadata design, internal redundancy at various levels, and built-in integrity checking using checksums. The fallback will always be doing integrity checking (fsck) to both detect and correct problems that will happen anyway.
One of the primary concerns with all filesystems is the speed at which a filesystem can be validated and recovered after corruption. With reasonably high-end RAID storage, a full fsck of a 2 TB ext3 filesystem can take between 2 to 4 hours for a relatively “clean” filesystem. This process can degrade sharply to many days if there are large numbers of shared filesystem blocks that need expensive extra passes to correct.

Some features, like extents, have already added to the robustness of the ext4 metadata as previously described. Many more related changes are either complete, in progress, or being designed in order to ensure that ext4 will be usable at scales that will become practical in the future.

[Figure 5: e2fsck performance improvement with uninitialized block groups — fsck time (seconds) versus total inode count (millions), comparing ext3 with 0, 100k, and 2.1M used files against ext4 with 100k and 2.1M used files]
4.1 Unused inode count and fast e2fsck

In e2fsck, the checking of inodes in pass 1 is by far the most time consuming part of the operation. This requires reading all of the large inode tables from disk, scanning them for valid, invalid, or unused inodes, and then verifying and updating the block and inode allocation bitmaps. The uninitialized groups and inode table high watermark feature allows much of the lengthy pass 1 scanning to be safely skipped. This can dramatically reduce the total time taken by e2fsck by 2 to 20 times, depending on how full the filesystem is. This feature can be enabled at mke2fs time or using tune2fs via the -O uninit_groups option.

With this feature, the kernel stores the number of unused inodes at the end of each block group's inode table. As a result, e2fsck can skip both reading these blocks from disk, and scanning them for in-use inodes. In order to ensure that the unused inode count is safe to use by e2fsck, the group descriptor has a CRC16 checksum added to it that allows validation of all fields therein.
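The idea behind the descriptor checksum can be sketched as follows. The exact inputs and algorithm are a guess at the general approach rather than the kernel's implementation; the point is that chaining per-filesystem and per-group identity into the CRC makes a descriptor copied from another filesystem, or another group, fail validation:

    #include <stddef.h>
    #include <stdint.h>

    /* Plain bitwise CRC-16 (polynomial 0xA001, reflected); kernels use a
     * table-driven equivalent. */
    static uint16_t crc16(uint16_t crc, const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        while (len--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0xA001 : 0);
        }
        return crc;
    }

    /* Hypothetical group descriptor checksum: cover the filesystem UUID,
     * the group number, and the descriptor contents (with the checksum
     * field itself zeroed out before hashing). */
    static uint16_t group_desc_csum(const uint8_t uuid[16], uint32_t group,
                                    const void *desc, size_t desc_len)
    {
        uint16_t crc = crc16(0xFFFF, uuid, 16);
        crc = crc16(crc, &group, sizeof(group));
        return crc16(crc, desc, desc_len);
    }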
Since typical ext3 filesystems use only in the neighborhood of 1% to 10% of their inodes, and the inode allocation policy keeps a majority of those inodes at the start of the inode table, this can avoid processing a large majority of the inodes and speed up the pass 1 processing. The kernel does not currently increase the unused inodes count when files are deleted. This counter is updated on every e2fsck run, so in the case where a block group had many inodes deleted, e2fsck will be more efficient in the next run.

Figure 5 shows that e2fsck time on ext3 grows linearly with the total number of inodes in the filesystem, regardless of how many are used. On ext3, e2fsck takes the same amount of time with zero used files as with 2.1 million used files. In ext4, with the unused inode high watermark feature, the e2fsck time is only dependent on the number of used inodes. As we can see, fsck of an ext4 filesystem with 100,000 used files takes a fraction of the time ext3 takes.

In addition to the unused inodes count, it is possible for mke2fs and e2fsck to mark a group's block or inode bitmap as uninitialized, so that the kernel does not need to read them from disk when first allocating from the group. Similarly, e2fsck does not need to read these bitmaps from disk, though this does not play a major role in performance improvements. What is more significant is that mke2fs will not write out the bitmaps or inode tables at format time if the mke2fs -O lazy_bg feature is given. Writing out the inode tables can take a significant amount of time, and has been known to cause problems for large filesystems due to the amount of dirty pages this generates in a short time.

4.2 Checksumming
Adding metadata checksumming into ext4 will allow it to more easily detect corruption, and behave appropriately instead of blindly trusting the data it gets from disk. The group descriptors already have a checksum added, per the previous section. The next immediate target for checksumming is the journal, because it has such a high density of important metadata and is constantly being written to, so it has a higher chance of wearing out the platters or seeing other random corruption.

Adding checksumming to the ext4 journal is nearly complete [7]. In ext3 and ext4, each journal transaction has a header block and commit block. During normal journal operation the commit block is not sent to the disk until the transaction header and all metadata blocks which make up that transaction have been written to disk [8]. The next transaction needs to wait for the previous commit block to hit the disk before it can start to modify the filesystem.

With this two-phase commit, if the commit block has the same transaction number as the header block, it should indicate that the transaction can be replayed at recovery time. If they don't match, the journal recovery is ended. However, there are several scenarios where this can go wrong and lead to filesystem corruption.

With journal checksumming, the journal code computes a CRC32 over all of the blocks in the transaction (including the header), and the checksum is written to the commit block of the transaction. If the checksum does not match at journal recovery time, it indicates that one or more metadata blocks in the transaction are corrupted or were not written to disk. Then the transaction (along with later ones) is discarded as if the computer had crashed slightly earlier and not written a commit block at all.
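The commit-path flow can be sketched as follows. The structure and function names are hypothetical (the real implementation lives in JBD2), but the flow matches what is described above — accumulate a CRC32 across the header and metadata blocks, then record it in the commit block:

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC32 (IEEE polynomial, reflected form 0xEDB88320). */
    static uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        crc = ~crc;
        while (len--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
        }
        return ~crc;
    }

    struct commit_block { uint32_t c_sequence; uint32_t c_checksum; };

    /* Hypothetical commit path: checksum the transaction's header block
     * and every metadata block, then seal the commit block. */
    static void seal_transaction(struct commit_block *commit, uint32_t seq,
                                 const void *blocks[], size_t nblocks,
                                 size_t block_size)
    {
        uint32_t crc = 0;
        for (size_t i = 0; i < nblocks; i++)
            crc = crc32_update(crc, blocks[i], block_size);
        commit->c_sequence = seq;
        commit->c_checksum = crc;
        /* The commit block can now be written along with the rest of the
         * transaction; recovery recomputes the CRC and discards the
         * transaction on mismatch. */
    }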
Since the journal checksum in the commit block allows detection of blocks that were not written into the journal, as an added bonus there is no longer a need for a two-phase commit for each transaction. The commit block can be written at the same time as the rest of the blocks in the transaction. This can actually speed up the filesystem operation noticeably (as much as 20% [7]), instead of the journal checksum being an overhead.

There are also some long-term plans to add checksumming to the extent tail, the allocation bitmaps, the inodes, and possibly also directories. This can be done efficiently once we have journal checksumming in place. Rather than computing the checksum of filesystem metadata each time it is changed (which has high overhead for often-modified structures), we can write the metadata to the checksummed journal and still be confident that it is valid and correct at recovery time. The blocks can have metadata-specific checksums computed a single time when they are written into the filesystem.

5 Other new features

New features are continuously being added to ext4. Two features expected to be seen in ext4 are nanosecond timestamps and inode versioning. These two features provide precision when dealing with file access times and tracking changes to files.

Ext3 has second-resolution timestamps, but with today's high-speed processors, this is not sufficient to record multiple changes to a file within a second. In ext4, since we use a larger inode, there is room to support nanosecond-resolution timestamps. High 32-bit fields for the atime, mtime, and ctime timestamps, and also a new crtime timestamp documenting file creation time, will be added to the ext4 inode (Figure 4). 30 bits are sufficient to represent the nanosecond field, and the remaining 2 bits are used to extend the epoch by 272 years.
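The extra 32 bits per timestamp can be packed exactly as the text describes — 30 bits of nanoseconds plus 2 epoch-extension bits. The helpers below are a sketch with hypothetical names; only the field widths come from the description above:

    #include <stdint.h>

    #define EPOCH_BITS 2
    #define EPOCH_MASK ((1u << EPOCH_BITS) - 1)

    /* Pack seconds bits 32..33 and a 30-bit nanosecond value into the
     * *_extra field that follows each timestamp in the large inode. */
    static inline uint32_t encode_extra_time(int64_t sec, uint32_t nsec)
    {
        return ((uint32_t)((sec >> 32) & EPOCH_MASK)) | (nsec << EPOCH_BITS);
    }

    static inline void decode_extra_time(uint32_t extra, int64_t *sec_hi,
                                         uint32_t *nsec)
    {
        *sec_hi = extra & EPOCH_MASK;   /* extends the 2038 epoch limit */
        *nsec   = extra >> EPOCH_BITS;  /* 30-bit nanoseconds */
    }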
The NFSv4 clients need the ability to detect updates to a file made at the server end, in order to keep the client-side cache up to date. Even with nanosecond support for ctime, the timestamp is not necessarily updated at the nanosecond level. The ext4 inode versioning feature addresses this issue by providing a global 64-bit counter in each inode. This counter is incremented whenever the file is changed. By comparing values of the counter, one can see whether the file has been updated. The counter is reset on file creation, and overflows are unimportant, because only equality is being tested. The i_version field already present in the 128-byte inode is used for the low 32 bits, and a high 32-bit field is added to the large ext4 inode.

6 Migration tool

Ext3 developers worked to maintain backwards compatibility between ext2 and ext3, a characteristic users appreciate and depend on. While ext4 attempts to retain compatibility with ext3 as much as possible, some of the incompatible on-disk layout changes are unavoidable. Even with these changes, users can still easily upgrade their ext3 filesystem to ext4, as was possible from ext2 to ext3. There are methods available for users to try new ext4 features immediately, or migrate their entire filesystem to ext4 without requiring back-up and restore.

6.1 Upgrading from ext3 to ext4

There is a simple upgrade solution for ext3 users to start using extents and some ext4 features without requiring a full backup or migration. By mounting an existing ext3 filesystem as ext4 (with extents enabled), any new files are created using extents, while old files are still indirect block mapped and interpreted as such. A flag in the inode differentiates between the two formats, allowing both to coexist in one ext4 filesystem. All new ext4 features based on extents, such as preallocation and multiple block allocation, are available to the new extents files immediately.
A tool will also be available to perform a system-wide filesystem migration from ext3 to ext4. This migration tool performs two functions: migrating from indirect to extents mapping, and enlarging the inode to 256 bytes.

• Extents migration: The first step can be performed online and uses the defragmentation tool. During the defragmentation process, files are changed to extents mapping. In this way, the files are being converted to extents and defragmented at the same time.

• Inode migration: Enlarging the inode structure size must be done offline. In this case, data is backed up, and the entire filesystem is scanned and converted to extents mapping and large inodes.

For users who are not yet ready to move to ext4, but may want to in the future, it is possible to prepare their ext3 filesystem to avoid offline migration later. If an ext3 filesystem is formatted with a larger inode structure, 256 bytes or more, the fast extended attribute feature (Section 2.6), which is the default in ext4, can be used instantly. When the user later wants to upgrade to ext4, then other ext4 features using the larger inode size, such as nanosecond timestamps, can also be used without requiring any offline migration.

6.2 Downgrading from ext4 to ext3

Though not as straightforward as ext3 to ext4, there is a path for any user who may want to downgrade from ext4 back to ext3. In this case the user would remount the filesystem with the noextents mount option, copy all files to temporary files, and rename those files over the original file. After all files have been converted back to indirect block mapping format, the INCOMPAT_EXTENTS flag must be cleared using tune2fs, and the filesystem can be re-mounted as ext3.

7 Performance evaluation

We have conducted a performance evaluation of ext4, as compared to ext3 and XFS, on three well-known filesystem benchmarks. Ext4 was tested with extents and delayed allocation enabled. The benchmarks in this analysis were chosen to show the impact of new changes in ext4. The three benchmarks chosen were: Flexible Filesystem Benchmark (FFSB) [1], Postmark [5], and IOzone [2]. FFSB, configured with a large file workload, was used to test the extents feature in ext4. Postmark was chosen to see performance of ext4 on small file workloads. Finally, we used IOzone to evaluate overall ext4 filesystem performance.

The tests were all run on the 2.6.21-rc4 kernel with delayed allocation patches. For ext3 and ext4 tests, the filesystem was mounted in writeback mode, and appropriate extents and delayed allocation mount options were set for ext4. Default mount options were used for XFS testing.

FFSB and IOzone benchmarks were run on the same 4-CPU 2.8 GHz Intel® Xeon™ system with 2 GB of RAM, on a 68 GB Ultra320 SCSI disk (10,000 rpm). Postmark was run on a 4-CPU 700 MHz Pentium® III system with 4 GB of RAM on a 9 GB SCSI disk (7,200 rpm). Full test results including raw data are available at the ext4 wiki page, http://ext4.wiki.kernel.org.

7.1 FFSB comparison

FFSB is a powerful filesystem benchmarking tool that can be tuned to simulate very specific workloads. We have tested multithreaded creation of large files. The test runs 4 threads, which combined create 24 1-GB files, and stress the sequential write operation.
[Figure 6: FFSB sequential write comparison]

[Figure 7: Postmark read/write comparison]

[Figure 8: IOzone results: throughput of transactions on 512 MB files]
The results, shown in Figure 6, indicate about 35% improvement in throughput and 40% decrease in CPU utilization in ext4 as compared to ext3. This performance improvement shows a diminishing gap between ext4 and XFS on sequential writes. As expected, the results verify that extents and delayed allocation improve performance on large contiguous file creation.

7.2 Postmark comparison

Postmark is a well-known benchmark simulating a mail server performing many single-threaded transactions on small to medium files. The graph in Figure 7 shows about 30% throughput gain with ext4. Similar percent improvements in CPU utilization are seen, because metadata is much more compact with extents. The write throughput is higher than read throughput because everything is being written to memory.

These results show that, aside from the obvious performance gain on large contiguous files, ext4 is also a good choice on smaller file workloads.

7.3 IOzone comparison

For the IOzone benchmark testing, the system was booted with only 64 MB of memory to really stress disk I/O. The tests were performed with 8 MB record sizes on various file sizes. Write, rewrite, read, reread, random write, and random read operations were tested. Figure 8 shows throughput results for 512 MB sized files. Overall, there is great improvement between ext3 and ext4, especially on rewrite, random-write, and reread operations. In this test, XFS still has better read performance, while ext4 has shown higher throughput on write operations.

8 Conclusion

As we have discussed, the new ext4 filesystem brings many new features and enhancements to ext3, making it a good choice for a variety of workloads. A tremendous amount of work has gone into bringing ext4 to Linux, with a busy roadmap ahead to finalize ext4 for production use. What was once essentially a simple filesystem has become an enterprise-ready solution, with a good balance of scalability, reliability, performance, and stability. Soon, the ext3 user community will have the option to upgrade their filesystem and take advantage of the newest generation of the ext family.
Acknowledgements

The authors would like to extend their thanks to Jean-Noël Cordenner and Valérie Clément, for their help on performance testing and analysis, and development and support of ext4.

We would also like to give special thanks to Andrew Morton for supporting ext4, and helping to bring ext4 to mainline Linux. We also owe thanks to all ext4 developers who work hard to make the filesystem better, especially: Ted Ts'o, Stephen Tweedie, Badari Pulavarty, Dave Kleikamp, Eric Sandeen, Amit Arora, Aneesh Veetil, and Takashi Sato.

Finally, thank you to all ext3 users who have put their faith in the filesystem, and inspire us to strive to make ext4 better.

Legal Statement

Copyright © 2007 IBM.

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM and the IBM logo are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Lustre is a trademark of Cluster File Systems, Inc.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

This document is provided “AS IS,” with no express or implied warranties. Use the information in this document at your own risk.

References

[1] FFSB project on SourceForge. Technical report. http://sourceforge.net/projects/ffsb.

[2] IOzone. Technical report. http://www.iozone.org.

[3] Mingming Cao, Theodore Y. Ts'o, Badari Pulavarty, Suparna Bhattacharya, Andreas Dilger, and Alex Tomas. State of the art: Where we are with the ext3 filesystem. In Ottawa Linux Symposium, 2005.

[4] Jonathan Corbet. Which filesystem for samba4? Technical report. http://lwn.net/Articles/112566/.

[5] Jeffrey Katcher. Postmark: A new filesystem benchmark. Technical report, Network Appliances, 2002.

[6] Daniel Phillips. A directory index for ext2. In 5th Annual Linux Showcase and Conference, pages 173–182, 2001.

[7] Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. IRON file systems. In SOSP '05, pages 206–220, 2005.

[8] Stephen Tweedie. Ext3 journalling filesystem. In Ottawa Linux Symposium, 2000.

[9] Stephen Tweedie and Theodore Y. Ts'o. Planned extensions to the Linux ext2/3 filesystem. In USENIX Annual Technical Conference, pages 235–244, 2002.