
Comparative Study of Different Log Structured File Systems supported by the Linux Kernel


Submitted in partial fulfillment of the requirement for the award of the degree of
MASTER OF TECHNOLOGY
in
SOFTWARE ENGINEERING

Submitted by
Kailash Pati Mondal
Roll No.: 30004312005
Reg. No.: 123000410067 of 2013-2014

Under the Supervision of
Mr. Santanu Chatterjee
Assistant Professor
Department of Computer Science and Engineering (CSE)
West Bengal University of Technology

WEST BENGAL UNIVERSITY OF TECHNOLOGY
BF-142, Sector-1, Salt Lake,
Kolkata-64

C"#!)F)C!" $F $#)')&L)!(
I hereby declare that this submission is my own work and that, to the best of my
knowledge, it contains no material previously published or written by another person, nor
material which to a substantial extent has been accepted for the award of any other degree at
West Bengal University of Technology or any other educational institution, except where
due acknowledgment is made in the thesis.
I also declare that the intellectual content of this thesis is the product of my own work,
except to the extent that assistance from others in the project design and conception or in
style, presentation and linguistic expression is acknowledged.
Although I have tried my best to maintain accuracy, there may be some inadvertent
errors in this dissertation, for which I alone am responsible.

Date: ...........................                Kailash Pati Mondal
Roll No.: 30004312005
Reg. No.: 123000410067 of 2012-2013

CK&$*L"D'"M"&!
0his project report ?Comparative study of different Log Structure File System in the
environment of Linux $perating System@ is an academic activity on the re1uirement of
partial fulfillment of the degree of #aster of 0echnology.
I am keen grateful to my respected teachers whose close guidance helped me a lot to
complete this project.
I would like to thank Dr. 2riyankar Acharyya, 3ead of the Department of Computer
Science and "ngineering= *est :engal ;niversity of !echnology= for providing me with
all the necessary facilities to make my project work.
I shall be always thankful to #r. 2antanu 4hatterjee, my project guide for their constant
supporting and guidance during the course of my work. 0heir words of encouragement
motivated me to achieve my goal and impetus to excel.
I want to thank all the faculty members of the institute who helped me for doing the project
work.
5ast but not the least I thank all my friends for their cooperation and encouragement that they
have bestowed on me.
_______________________________                Date: ..........
Kailash Pati Mondal
M. Tech. in Software Engineering (SE)
Dept. of Computer Science and Engineering
West Bengal University of Technology

ABSTRACT
This abstract presents a technique for disk storage management called a
log-structured file system. A log-structured file system writes all
modifications to disk sequentially in a log-like structure, thereby speeding up
both file writing and crash recovery. The log is the only structure on disk; it
contains indexing information so that files can be read back from the log
efficiently. In order to maintain large free areas on disk for fast writing, the log
is divided into segments and a segment cleaner is used to compress the live
information from heavily fragmented segments. A series of simulations
demonstrates the efficiency of a simple cleaning policy based on cost and
benefit. A prototype log-structured file system called Sprite LFS outperforms
current Unix file systems by an order of magnitude for small-file writes while
matching or exceeding Unix performance for reads and large writes. Even when
the overhead for cleaning is included, Sprite LFS can use 70% of the disk
bandwidth for writing, whereas Unix file systems typically can use only 5-10%.
A comparative study of different log-structured file systems and their analysis
is also presented.

C"#!)F)C!"
0his is to certify that the dissertation entitled ?Comparative study of different Log
Structure File System in the environment of Linux $perating System@ which was
submitted by !ailash "ati #ondal 7$egistration %o & *+'((()*((-. of +(*+/+(*' and $oll
%o. / '((()'*+((,8 in partial fulfillment of the re1uirement for the award of degree #aster
of 0echnology in 2oftware 6ngineering under 4omputer 2cience A Information 0echnology
Department of 9est :engal ;niversity of 0echnology. 0he project work was carried out by
him under our supervision and has been duly completed within the given period of time to the
satisfaction of the 426 department, 9est :engal ;niversity of 0echnology.
---------------------------------------- -----------------------------------
Dr. Sriyankar Acharyya                         Mr. Santanu Chatterjee
Head of the Department                         Assistant Professor (Guide)
Department of CSE, WBUT                        Department of CSE, WBUT

External Examiner

Contents:

                                                                    Page no.
Introduction ...................................................... 07-08
Design for file systems of the 1990's ............................. 08-09
    i) Technology
    ii) Workloads
Problems with existing file systems ............................... 10-10
Log-structured file systems ....................................... 11-11
Issues of Log-Structured File Systems ............................. 11-12
File Location and Reading ......................................... 12-13
Free space management (segments) .................................. 13-14
Segment cleaning mechanism ........................................ 14-15
Segment cleaning policies ......................................... 16-18
Simulation results ................................................ 18-18
Segment usage table ............................................... 19-19
Crash Recovery .................................................... 19-19
Checkpoints ....................................................... 19-20
Roll-forward ...................................................... 20-21
The LogFS Flash Filesystem (Specification) ........................ 22-26
Journaling Flash File System (JFFS) ............................... 26-27
Flash Friendly File System (F2FS):
    Introduction .................................................. 28-28
    Flash Memory .................................................. 28-28
    Issues of Flash Memory File System ............................ 28-29
    Flash Translation Layer ....................................... 29-29
    Flash Friendly File System .................................... 29-30
    Design of On-Disk Layout ...................................... 30-31
    Inode Structure ............................................... 32-32
    Directory Structure ........................................... 32-32
    Solution of Wandering Tree Problem ............................ 32-32
    F2FS Garbage Collection ....................................... 33-33
    F2FS Adaptive Write Policy .................................... 34-34
    Threaded Log .................................................. 34-34
About Phoronix-Test-Suite Benchmark Testing ....................... 35-41
Different file system testing Result on pts environment ........... 42-50
Conclusion ........................................................ 51-51
References ........................................................ 51-52
Introduction
Over the last decade CPU speeds have increased dramatically while disk
access times have improved only slowly. This trend is likely to continue in the
future and it will cause more and more applications to become disk bound. To
lessen the impact of this problem, Mendel Rosenblum and J. K. Ousterhout
devised a new disk storage management technique called a log-structured file
system, which uses disks an order of magnitude more efficiently than current
file systems.
Log-structured file systems are based on the assumption that files are
cached in main memory and that increasing memory sizes will make the caches
more and more effective at satisfying read requests. As a result, disk traffic will
become dominated by writes. A log-structured file system writes all new
information to disk in a sequential structure called the log. This approach
increases write performance dramatically by eliminating almost all seeks. The
sequential nature of the log also permits much faster crash recovery: current
Unix file systems typically must scan the entire disk to restore consistency after
a crash, but a log-structured file system need only examine the most recent
portion of the log.
The notion of logging is not new, and a number of recent file systems
have incorporated a log as an auxiliary structure to speed up writes and crash
recovery. However, these other systems use the log only for temporary storage;
the permanent home for information is a traditional random-access storage
structure on disk. In contrast, a log-structured file system stores data
permanently in the log: there is no other structure on disk. The log contains
indexing information so that files can be read back with efficiency comparable
to current file systems.
For a log-structured file system to operate efficiently, it must ensure that
there are always large extents of free space available for writing new data. This
is the most difficult challenge in the design of a log-structured file system. The
problem is solved using large extents called segments, where a segment cleaner
process continually regenerates empty segments by compressing the live data
from heavily fragmented segments. A simulator was used to explore different
cleaning policies, leading to a simple but effective algorithm based on cost and
benefit: it segregates older, more slowly changing data from young, rapidly
changing data and treats them differently during cleaning.
There is also a prototype log-structured file system called Sprite LFS,
which is now in production use as part of the Sprite network operating system.
Benchmark programs demonstrate that the raw writing speed of Sprite LFS is
more than an order of magnitude greater than that of Unix for small files. Even
for other workloads, such as those including reads and large file accesses,
Sprite LFS is at least as fast as Unix in all cases but one (files read sequentially
after being written randomly). The long-term overhead for cleaning can be
measured easily in the production system. Overall, Sprite LFS permits about
65-75% of a disk's raw bandwidth to be used for writing new data (the rest is
used for cleaning). For comparison, Unix systems can only utilize 5-10% of a
disk's raw bandwidth for writing new data; the rest of the time is spent seeking.
The remainder of this portion reviews the issues in designing file systems
for computers of the 1990's, discusses design alternatives for a log-structured
file system and derives the structure of Sprite LFS, with particular focus on the
cleaning mechanism, and then describes the crash recovery system for Sprite
LFS, presents long-term measurements of cleaning overhead and concludes.
Design for file systems of the 1990's
File system design is governed by two general forces: technology, which
provides a set of basic building blocks, and workload, which determines a set of
operations that must be carried out efficiently. This section summarizes
technology changes that are underway and describes their impact on file system
design. It also describes the workloads that influenced the design of Sprite LFS
and shows how current file systems are ill equipped to deal with the workloads
and technology changes.
i) Technology
Three components of technology are particularly significant for file
system design: processors, disks, and main memory. Processors are significant
because their speed is increasing at a nearly exponential rate, and the
improvements seem likely to continue through much of the 1990's. This puts
pressure on all the other elements of the computer system to speed up as well,
so that the system doesn't become unbalanced.
Disk technology is also improving rapidly, but the improvements have
been primarily in the areas of cost and capacity rather than performance. There
are two components of disk performance: transfer bandwidth and access time.
Although both of these factors are improving, the rate of improvement is much
slower than for CPU speed. Disk transfer bandwidth can be improved
substantially with the use of disk arrays and parallel-head disks, but no major
improvements seem likely for access time (it is determined by mechanical
motions that are hard to improve). If an application causes a sequence of small
disk transfers separated by seeks, then the application is not likely to experience
much speedup over the next ten years, even with faster processors.
The third component of technology is main memory, which is increasing
in size at an exponential rate. Modern file systems cache recently used file data
in main memory, and larger main memories make larger file caches possible.
This has two effects on file system behavior. First, larger file caches alter the
workload presented to the disk by absorbing a greater fraction of the read
requests. Most write requests must eventually be reflected on disk for safety, so
disk traffic (and disk performance) will become more and more dominated by
writes.
Second, large file caches can serve as write buffers where large numbers
of modified blocks can be collected before writing any of them to disk.
Buffering may make it possible to write the blocks more efficiently, for
example by writing them all in a single sequential transfer with only one seek.
Of course, write buffering has the disadvantage of increasing the amount of
data lost during a crash. The design of older file systems assumed that crashes
are infrequent and that it is acceptable to lose a few seconds or minutes of work
in each crash; for applications that require better crash recovery, non-volatile
RAM may be used for the write buffer.
ii) Workloads
Several different file system workloads are common in computer
applications. One of the most difficult workloads for file system designs to
handle efficiently is found in office and engineering environments. Office and
engineering applications tend to be dominated by accesses to small files;
generally file sizes are only a few kilobytes. Small files usually result in small
random disk I/Os, and the creation and deletion times for such files are often
dominated by updates to file system "metadata" (the data structures used to
locate the attributes and blocks of the file).
Workloads dominated by sequential accesses to large files also pose
interesting problems, but not for file system software. A number of techniques
exist for ensuring that such files are laid out sequentially on disk, so I/O
performance tends to be limited by the bandwidth of the I/O and memory
subsystems rather than by the file allocation policies. In designing a
log-structured file system the designers decided to focus on the efficiency of
small-file accesses, and to leave it to hardware designers to improve bandwidth
for large-file accesses. The techniques used in Sprite LFS work well for large
files as well as small ones.
Problems with existing file systems
Current file systems suffer from two general problems that make it hard
for them to cope with the technologies and workloads of the 1990's. First, they
spread information around the disk in a way that causes too many small
accesses. For example, the Berkeley Unix fast file system (Unix FFS) is quite
effective at laying out each file sequentially on disk, but it physically separates
different files. Furthermore, the attributes ("inode") for a file are separate from
the file's contents, as is the directory entry containing the file's name. It takes at
least five separate disk I/Os, each preceded by a seek, to create a new file in
Unix FFS: two different accesses to the file's attributes plus one access each for
the file's data, the directory's data, and the directory's attributes. When writing
small files in such a system, less than 5% of the disk's potential bandwidth is
used for new data; the rest of the time is spent seeking.
The second problem with current file systems is that they tend to write
synchronously: the application must wait for the write to complete, rather than
continuing while the write is handled in the background. For example, even
though Unix FFS writes file data blocks asynchronously, file system metadata
structures such as directories and inodes are written synchronously. For
workloads with many small files, the disk traffic is dominated by the
synchronous metadata writes. Synchronous writes couple the application's
performance to that of the disk and make it hard for the application to benefit
from faster CPUs. They also defeat the potential use of the file cache as a write
buffer. Unfortunately, network file systems have introduced additional
synchronous behavior where it didn't used to exist. This has simplified crash
recovery, but it has reduced write performance.
The Unix fast file system (Unix FFS) is described here as an example of
current file system design and compared to log-structured file systems. The
Unix FFS design is used because it is well documented in the literature and
used in several popular Unix operating systems. The problems presented in this
section are not unique to Unix FFS and can be found in most other file systems.
Log-structured file systems:
A log-structured file system is a file system in which data and metadata
are written sequentially to a circular buffer, called a log. The design was first
proposed in 1988 and implemented in 1992 by John K. Ousterhout and Mendel
Rosenblum in the form of the Sprite file system. The fundamental idea of a
log-structured file system is to improve write performance by buffering a
sequence of file system changes in the file cache and then writing all the
changes to disk sequentially in a single disk write operation. The information
written to disk in the write operation includes file data blocks, attributes, index
blocks, directories, and almost all the other information used to manage the file
system. For workloads that contain many small files, a log-structured file
system converts the many small synchronous random writes of traditional file
systems into large asynchronous sequential transfers that can utilize nearly
100% of the raw disk bandwidth.
Initially the log-structured file system was designed for conventional
optical and magnetic media. As writes are made sequentially to the end of the
log, write throughput is increased. Crash recovery also becomes simpler: upon
its next mount, the file system does not need to walk all its data structures to fix
any inconsistencies, but can reconstruct its state from the last consistent point in
the log. Another feature, called snapshotting, makes old versions of data in the
log accessible and nameable. This creates a sort of versioning.
Issues of Log-Structured File Systems
One major issue of log-structured file systems is reclaiming free space
once the tail end of the log is reached. In a vanilla log-structured file system
(e.g. Sprite) this is done by advancing the tail past obsolete data. When live
data is found, it is appended to the head of the log.
To reduce the overhead incurred by this garbage collection, most
implementations avoid purely circular logs and divide up their storage into
segments. The head of the log simply advances into non-adjacent segments
which are already free. If space is needed, the least-full segments are reclaimed
first. This decreases the I/O load of the garbage collector, but becomes
increasingly ineffective as the file system fills up and nears capacity.
Another issue of log-structured file systems is associated with metadata
propagation and is known as the wandering tree problem. In Linux-like systems
every file has an inode which contains data about the file (metadata) and the
addresses of data blocks or of indirect blocks containing addresses of blocks.
Thus whenever some file data is updated, the corresponding direct block,
indirect blocks and inode also need to be updated. This leads to a lot of
overhead.
File Location and Reading
Although the term "log-structured" might suggest that sequential scans
are required to retrieve information from the log, this is not the case in Sprite
LFS. The goal was to match or exceed the read performance of Unix FFS. To
accomplish this goal, Sprite LFS outputs index structures in the log to permit
random-access retrievals. The basic structures used by Sprite LFS are identical
to those used in Unix FFS: for each file there exists a data structure called an
inode, which contains the file's attributes (type, owner, permissions, etc.) plus
the disk addresses of the first ten blocks of the file; for files larger than ten
blocks, the inode also contains the disk addresses of one or more indirect
blocks, each of which contains the addresses of more data or indirect blocks.
Once a file's inode has been found, the number of disk I/Os required to read the
file is identical in Sprite LFS and Unix FFS.
In Unix FFS each inode is at a fixed location on disk; given the
identifying number for a file, a simple calculation yields the disk address of the
file's inode. In contrast, Sprite LFS doesn't place inodes at fixed positions; they
are written to the log. Sprite LFS uses a data structure called an inode map to
maintain the current location of each inode. Given the identifying number for a
file, the inode map must be indexed to determine the disk address of the inode.
The inode map is divided into blocks that are written to the log; a fixed
checkpoint region on each disk identifies the locations of all the inode map
blocks. Fortunately, inode maps are compact enough to keep the active portions
cached in main memory: inode map lookups rarely require disk accesses.
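As a rough illustration of this indirection, the fragment below contrasts the fixed
inode placement of FFS with an in-memory inode map; the names and sizes are
hypothetical and are not taken from the Sprite LFS sources.

/* Illustrative sketch (not Sprite LFS code): locating an inode through
 * an in-memory inode map instead of a fixed on-disk position. */
#include <stdint.h>

struct imap_entry {
    uint64_t inode_addr;   /* latest log address of this inode             */
    uint32_t version;      /* bumped when the file is deleted or truncated */
};

/* Active portion of the inode map, cached in main memory; the map blocks
 * themselves live in the log and are found via the checkpoint region.    */
static struct imap_entry inode_map[1024];

/* FFS-style lookup: the inode address follows from the inode number.     */
static uint64_t ffs_inode_addr(uint64_t ino, uint64_t itable_start,
                               uint64_t inode_size)
{
    return itable_start + ino * inode_size;
}

/* LFS-style lookup: the inode map gives the inode's current log address. */
static uint64_t lfs_inode_addr(uint64_t ino)
{
    return inode_map[ino].inode_addr;
}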
Although Sprite LFS and Unix FFS have the same logical structure, the
log-structured file system produces a much more compact arrangement. As a
result, the write performance of Sprite LFS is much better than that of Unix
FFS, while its read performance is just as good.
[Figure: The indexed structure is the same as in FFS — an inode holds the file's
metadata and block pointers to data blocks and index (indirect) blocks, and a
directory entry maps a file name to an inode number. The difference is that the
location of inodes is fixed in FFS.]
Free space management (segments)
The most difficult design issue for log-structured file systems is the
management of free space. The goal is to maintain large free extents for writing
new data. Initially all the free space is in a single extent on disk, but by the time
the log reaches the end of the disk the free space will have been fragmented into
many small extents corresponding to the files that were deleted or overwritten.
[Figure: The segment is the unit of writing and cleaning, typically 512 KB to
1024 KB. The disk consists of segments plus a checkpoint region. A segment
summary block contains each block's identity (inode number, offset) and is used
to check the validity of each block.]
From this point on, the file system has two choices: threading and
copying. The first alternative is to leave the live data in place and thread the log
through the free extents. Unfortunately, threading will cause the free space to
become severely fragmented, so that large contiguous writes won't be possible
and a log-structured file system will be no faster than traditional file systems.
The second alternative is to copy live data out of the log in order to leave large
free extents for writing. Here live data is written back in a compacted form at
the head of the log; it could also be moved to another log-structured file system
to form a hierarchy of logs, or it could be moved to some totally different file
system or archive. The disadvantage of copying is its cost, particularly for
long-lived files; in the simplest case, where the log works circularly across the
disk and live data is copied back into the log, all of the long-lived files will have
to be copied in every pass of the log across the disk.
Sprite LFS uses a combination of threading and copying. The disk is
divided into large fixed-size extents called segments. Any given segment is
always written sequentially from its beginning to its end, and all live data must
be copied out of a segment before the segment can be rewritten.
However, the log is threaded on a segment-by-segment basis; if the
system can collect long-lived data together into segments, those segments can
be skipped over so that the data doesn't have to be copied repeatedly. The
segment size is chosen large enough that the transfer time to read or write a
whole segment is much greater than the cost of a seek to the beginning of the
segment. This allows whole-segment operations to run at nearly the full
bandwidth of the disk, regardless of the order in which segments are accessed.
Sprite LFS currently uses segment sizes of either 512 kilobytes or one
megabyte.
Segment cleaning mechanism
The process of copying live data out of a segment is called segment
cleaning. In Sprite LFS it is a simple three-step process: read a number of
segments into memory, identify the live data, and write the live data back to a
smaller number of clean segments. After this operation is complete, the
segments that were read are marked as clean, and they can be used for new data
or for additional cleaning.
As part of segment cleaning it must be possible to identify which blocks of
each segment are live, so that they can be written out again. It must also be
possible to identify the file to which each block belongs and the position of the
block within the file; this information is needed in order to update the file's
inode to point to the new location of the block. Sprite LFS solves both of these
problems by writing a segment summary block as part of each segment. The
summary block identifies each piece of information that is written in the
segment; for example, for each file data block the summary block contains the
file number and block number for the block. Segments can contain multiple
segment summary blocks when more than one log write is needed to fill the
segment. (Partial-segment writes occur when the number of dirty blocks
buffered in the file cache is insufficient to fill a segment.) Segment summary
blocks impose little overhead during writing, and they are useful during crash
recovery as well as during cleaning.
Sprite LFS also uses the segment summary information to distinguish live
blocks from those that have been overwritten or deleted. Once a block's identity
is known, its liveness can be determined by checking the file's inode or indirect
block to see if the appropriate block pointer still refers to this block. If it does,
then the block is live; if it doesn't, then the block is dead. Sprite LFS optimizes
this check slightly by keeping a version number in the inode map entry for each
file; the version number is incremented whenever the file is deleted or truncated
to length zero. The version number combined with the inode number forms a
unique identifier (uid) for the contents of the file. The segment summary block
records this uid for each block in the segment; if the uid of a block does not
match the uid currently stored in the inode map when the segment is cleaned,
the block can be discarded immediately without examining the file's inode.
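A minimal sketch of this liveness check is given below; the structure layout and
helper names are assumptions for illustration, not the actual Sprite LFS data
structures (the stubs stand in for the real inode map and inode lookups).

#include <stdint.h>
#include <stdbool.h>

struct summary_entry {            /* one per block in the segment summary     */
    uint32_t ino;                 /* file the block belongs to                 */
    uint32_t version;             /* file version when the block was written   */
    uint32_t block_no;            /* position of the block within the file     */
};

/* Stubs standing in for the real inode map and inode lookups. */
static uint32_t imap_current_version(uint32_t ino) { (void)ino; return 1; }
static bool inode_points_at(uint32_t ino, uint32_t block_no, uint64_t addr)
{ (void)ino; (void)block_no; (void)addr; return true; }

static bool block_is_live(const struct summary_entry *e, uint64_t block_addr)
{
    /* Fast path: the file was deleted or truncated since the block was
     * written, so the block is certainly dead.                            */
    if (e->version != imap_current_version(e->ino))
        return false;

    /* Slow path: consult the file's inode (or indirect block) to see
     * whether the appropriate block pointer still refers to this block.   */
    return inode_points_at(e->ino, e->block_no, block_addr);
}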
This approach to cleaning means that there is no free-block list or bitmap
in Sprite LFS. In addition to saving memory and disk space, the elimination of
these data structures also simplifies crash recovery. If these data structures
existed, additional code would be needed to log changes to the structures and
restore consistency after crashes.
Segment cleaning policies
Given the basic mechanism described above, four policy issues must be
addressed:
(1) When should the segment cleaner execute? Some possible choices are for it
to run continuously in the background at low priority, or only at night, or only
when disk space is nearly exhausted.
(2) How many segments should it clean at a time? Segment cleaning offers an
opportunity to reorganize data on disk; the more segments cleaned at once, the
more opportunities to rearrange.
(3) Which segments should be cleaned? An obvious choice is the ones that are
most fragmented, but this turns out not to be the best choice.
(4) How should the live blocks be grouped when they are written out? One
possibility is enhancing the locality of future reads, for example by grouping
files in the same directory together into a single output segment. Another
possibility is to sort the blocks by the time they were last modified and group
blocks of similar age together into new segments; this approach is called age
sort.
Sprite LFS starts cleaning segments when the number of clean segments
drops below a threshold value (typically a few tens of segments). It cleans a few
tens of segments at a time until the number of clean segments surpasses another
threshold value (typically 50-100 clean segments). The overall performance of
Sprite LFS does not seem to be very sensitive to the exact choice of the
threshold values. In contrast, the third and fourth policy decisions are critically
important: they are the primary factors that determine the performance of a
log-structured file system.
The term write cost is used to compare cleaning policies. The write cost is
the average amount of time the disk is busy per byte of new data written,
including all the cleaning overheads. The write cost is expressed as a multiple of
the time that would be required if there were no cleaning overhead and the data
could be written at its full bandwidth with no seek time or rotational latency. A
write cost of 1.0 is perfect: it would mean that new data could be written at the
full disk bandwidth and there is no cleaning overhead. A write cost of 10 means
that only one-tenth of the disk's maximum bandwidth is actually used for
writing new data; the rest of the disk time is spent in seeks, rotational latency, or
cleaning.
For a log-structured file system with large segments, seeks and rotational
latency are negligible both for writing and for cleaning, so the write cost is the
total number of bytes moved to and from the disk divided by the number of
those bytes that represent new data. This cost is determined by the utilization
(the fraction of data still live) in the segments that are cleaned. In the steady
state, the cleaner must generate one clean segment for every segment of new
data written. To do this, it reads N segments in their entirety and writes out N*u
segments of live data (where u is the utilization of the segments and
0 <= u < 1). This creates N*(1-u) segments of contiguous free space for new
data. Thus
    write cost = (total bytes read and written) / (new data written)
               = (read segs + write live + write new) / (new data written)
               = (N + N*u + N*(1-u)) / (N*(1-u))
               = 2 / (1-u)
The formula above makes the conservative assumption that a segment must be
read in its entirety to recover the live blocks; in practice it may be faster to read
just the live blocks, particularly if the utilization is very low. If a segment to be
cleaned has no live blocks (u = 0) then it need not be read at all and the write
cost is 1.0.
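The derivation above can be checked with a few lines of C; the values it prints
(for example a write cost of 4.0 at u = 0.5 and 10.0 at u = 0.8) match the
break-even points discussed below.

#include <stdio.h>

/* Write cost from the formula above: 2 / (1 - u), with the special case
 * that a segment with no live blocks (u = 0) need not be read at all.   */
static double write_cost(double u)
{
    if (u == 0.0)
        return 1.0;
    return 2.0 / (1.0 - u);
}

int main(void)
{
    const double u[] = { 0.0, 0.2, 0.5, 0.8 };
    for (int i = 0; i < 4; i++)
        printf("u = %.1f  ->  write cost = %.1f\n", u[i], write_cost(u[i]));
    return 0;
}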
Write cost is a function of u. For reference, Unix FFS on small-file
workloads utilizes at most 5-10% of the disk bandwidth, for a write cost of
10-20. (With logging, delayed writes, and disk request sorting this can probably
be improved to about 25% of the bandwidth, or a write cost of 4.) The segments
cleaned must have a utilization of less than 0.8 in order for a log-structured file
system to outperform the current Unix FFS; the utilization must be less than 0.5
to outperform an improved Unix FFS.
It is important to note that the utilization discussed above is not the
overall fraction of the disk containing live data; it is just the fraction of live
blocks in segments that are cleaned. Variations in file usage will cause some
segments to be less utilized than others, and the cleaner can choose the least
utilized segments to clean; these will have lower utilization than the overall
average for the disk.
Even so, the performance of a log-structured file system can be improved
by reducing the overall utilization of the disk space. With less of the disk in use,
the segments that are cleaned will have fewer live blocks, resulting in a lower
write cost. Log-structured file systems therefore provide a cost-performance
tradeoff: if disk space is underutilized, higher performance can be achieved but
at a high cost per usable byte; if disk capacity utilization is increased, storage
costs are reduced but so is performance. In such a tradeoff the write cost
depends strongly on the utilization of the segments that are cleaned: the more
live data in the segments cleaned, the more disk bandwidth is needed for
cleaning and is not available for writing new data. Two reference points are
"FFS today", which represents Unix FFS today, and "FFS improved", which is
an estimate of the best performance possible in an improved Unix FFS; the
write cost for Unix FFS is not sensitive to the amount of disk space in use. The
tradeoff between performance and space utilization is not unique to
log-structured file systems. For example, Unix FFS only allows 90% of the disk
space to be occupied by files. The remaining 10% is kept free to allow the space
allocation algorithm to operate efficiently.
The key to achieving high performance at low cost in a log-structured file
system is to force the disk into a bimodal segment distribution where most of
the segments are nearly full, a few are empty or nearly empty, and the cleaner
can almost always work with the empty segments. This allows a high overall
disk capacity utilization yet provides a low write cost. The following section
describes how Sprite LFS achieves such a bimodal distribution.
Simulation results
A simple file system simulator was used to analyze different cleaning
policies under controlled conditions. The simulator's model does not reflect
actual file system usage patterns (its model is much harsher than reality), but it
helps in understanding the effects of random access patterns and locality, both
of which can be exploited to reduce the cost of cleaning. The simulator models a
file system as a fixed number of 4 KB files, with the number chosen to produce
a particular overall disk capacity utilization. At each step, the simulator
overwrites one of the files with new data, using one of two pseudo-random
access patterns:
Uniform: Each file has an equal likelihood of being selected in each step.
Hot-and-cold: Files are divided into two groups. One group contains 10% of
the files; it is called hot because its files are selected 90% of the time. The other
group is called cold; it contains 90% of the files but they are selected only 10%
of the time. Within each group every file is equally likely to be selected. This
access pattern models a simple form of locality.
In this approach the overall disk capacity utilization is constant and no
read traffic is modeled. The simulator runs until all clean segments are
exhausted, then simulates the actions of a cleaner until a threshold number of
clean segments is available again. In each run the simulator was allowed to run
until the write cost stabilized and all cold-start variance had been removed.
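A minimal sketch of the two access patterns, assuming the C standard library's
rand() as the pseudo-random source (the file count below is arbitrary):

#include <stdlib.h>

#define NFILES 10000

/* Uniform: every file has an equal likelihood of being selected.       */
static int pick_uniform(void)
{
    return rand() % NFILES;
}

/* Hot-and-cold: 10% of the files (the hot group) receive 90% of the
 * selections; the remaining 90% (the cold group) receive only 10%.     */
static int pick_hot_and_cold(void)
{
    int hot = NFILES / 10;

    if (rand() % 100 < 90)
        return rand() % hot;                  /* hot group  */
    return hot + rand() % (NFILES - hot);     /* cold group */
}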
Segment usage table
In order to support the cost-benefit cleaning policy, Sprite LFS maintains
a data structure called the segment usage table. For each segment, the table
records the number of live bytes in the segment and the most recent modified
time of any block in the segment. These two values are used by the segment
cleaner when choosing segments to clean. The values are initially set when the
segment is written, and the count of live bytes is decremented when files are
deleted or blocks are overwritten. If the count falls to zero then the segment can
be reused without cleaning. The blocks of the segment usage table are written to
the log, and the addresses of the blocks are stored in the checkpoint regions.
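As an illustration, a table entry and the cost-benefit ratio used to rank segments
can be sketched as below. The ratio, benefit/cost = (1 - u) * age / (1 + u), is the
one given in Rosenblum and Ousterhout's Sprite LFS paper; the structure layout
itself is hypothetical.

#include <stdint.h>
#include <time.h>

struct seg_usage {
    uint32_t live_bytes;   /* decremented as files are deleted/overwritten */
    time_t   mod_time;     /* most recent modified time of any block       */
};

/* A higher ratio makes the segment a better candidate for cleaning.       */
static double cost_benefit(const struct seg_usage *s, uint32_t seg_bytes,
                           time_t now)
{
    double u   = (double)s->live_bytes / seg_bytes;
    double age = difftime(now, s->mod_time);

    if (s->live_bytes == 0)        /* reusable without any cleaning at all */
        return 1e30;
    return (1.0 - u) * age / (1.0 + u);
}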
Crash Recovery
When a system crash occurs, the last few operations performed on the
disk may have left it in an inconsistent state (for example, a new file may have
been written without writing the directory containing the file); during reboot the
operating system must review these operations in order to correct any
inconsistencies. In traditional Unix file systems without logs, the system cannot
determine where the last changes were made, so it must scan all of the metadata
structures on disk to restore consistency. The cost of these scans is already high
(tens of minutes in typical configurations), and it is getting higher as storage
systems expand.
In a log-structured file system the locations of the last disk operations are
easy to determine: they are at the end of the log. Thus it should be possible to
recover very quickly after crashes. This benefit of logs is well known and has
been used to advantage both in database systems and in other file systems. Like
many other logging systems, Sprite LFS uses a two-pronged approach to
recovery: checkpoints, which define consistent states of the file system, and
roll-forward, which is used to recover information written since the last
checkpoint.
Checkpoints
A checkpoint is a position in the log at which all of the file system
structures are consistent and complete. Sprite LFS uses a two-phase process to
create a checkpoint. First, it writes out all modified information to the log,
including file data blocks, indirect blocks, inodes, and blocks of the inode map
and segment usage table. Second, it writes a checkpoint region to a special fixed
position on disk. The checkpoint region contains the addresses of all the blocks
in the inode map and segment usage table, plus the current time and a pointer to
the last segment written.
During reboot, Sprite LFS reads the checkpoint region and uses that
information to initialize its main-memory data structures. In order to handle a
crash during a checkpoint operation, there are actually two checkpoint regions,
and checkpoint operations alternate between them. The checkpoint time is in the
last block of the checkpoint region, so if the checkpoint fails the time will not be
updated. During reboot, the system reads both checkpoint regions and uses the
one with the most recent time.
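A minimal sketch of this alternation, assuming an illustrative (not the real Sprite
LFS) checkpoint layout in which the timestamp sits in the last block written:

#include <stdint.h>

struct checkpoint_region {
    uint64_t imap_block_addrs[64];      /* inode map block addresses       */
    uint64_t seg_usage_block_addrs[64]; /* segment usage table addresses   */
    uint64_t last_segment;              /* pointer to last segment written */
    uint64_t timestamp;                 /* written last; stays stale if the
                                           checkpoint was interrupted      */
};

/* On reboot, read both regions and use the one with the later time.      */
static const struct checkpoint_region *
pick_checkpoint(const struct checkpoint_region *a,
                const struct checkpoint_region *b)
{
    return (a->timestamp >= b->timestamp) ? a : b;
}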
Sprite LFS performs checkpoints at periodic intervals as well as when the
file system is unmounted or the system is shut down. A long interval between
checkpoints reduces the overhead of writing the checkpoints but increases the
time needed to roll forward during recovery; a short checkpoint interval
improves recovery time but increases the cost of normal operation. Sprite LFS
currently uses a checkpoint interval of thirty seconds, which is probably much
too short. An alternative to periodic checkpointing is to perform checkpoints
after a given amount of new data has been written to the log; this would set a
limit on recovery time while reducing the checkpoint overhead when the file
system is not operating at maximum throughput.
Roll-forward
In principle it would be safe to restart after crashes by simply reading the
latest checkpoint region and discarding any data in the log after that checkpoint.
This would result in instantaneous recovery, but any data written since the last
checkpoint would be lost. In order to recover as much information as possible,
Sprite LFS scans through the log segments that were written after the last
checkpoint. This operation is called roll-forward.
During roll-forward Sprite LFS uses the information in segment summary
blocks to recover recently written file data. When a summary block indicates the
presence of a new inode, Sprite LFS updates the inode map it read from the
checkpoint, so that the inode map refers to the new copy of the inode. This
automatically incorporates the file's new data blocks into the recovered file
system. If data blocks are discovered for a file without a new copy of the file's
inode, then the roll-forward code assumes that the new version of the file on
disk is incomplete and it ignores the new data blocks.
The roll-forward code also adjusts the utilizations in the segment usage
table read from the checkpoint. The utilizations of the segments written since
the checkpoint will be zero; they must be adjusted to reflect the live data left
after roll-forward. The utilizations of older segments will also have to be
adjusted to reflect file deletions and overwrites (both of these can be identified
by the presence of new inodes in the log).
The final issue in roll-forward is how to restore consistency between
directory entries and inodes. Each inode contains a count of the number of
directory entries referring to that inode; when the count drops to zero the file is
deleted. Unfortunately, it is possible for a crash to occur when an inode has been
written to the log with a new reference count while the block containing the
corresponding directory entry has not yet been written, or vice versa.
To restore consistency between directories and inodes, Sprite LFS outputs
a special record in the log for each directory change. The record includes an
operation code (create, link, rename, or unlink), the location of the directory
entry (i-number for the directory and the position within the directory), the
contents of the directory entry (name and i-number), and the new reference
count for the inode named in the entry. These records are collectively called the
directory operation log; Sprite LFS guarantees that each directory operation log
entry appears in the log before the corresponding directory block or inode.
During roll-forward, the directory operation log is used to ensure
consistency between directory entries and inodes: if a log entry appears but the
inode and directory block were not both written, roll-forward updates the
directory and/or inode to complete the operation. Roll-forward operations can
cause entries to be added to or removed from directories and reference counts
on inodes to be updated. The recovery program appends the changed directories,
inodes, inode map, and segment usage table blocks to the log and writes a new
checkpoint region to include them. The only operation that can't be completed
is the creation of a new file for which the inode is never written; in this case the
directory entry will be removed. In addition to its other functions, the directory
log made it easy to provide an atomic rename operation.
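The fields listed above can be pictured as a record like the following; the field
names and widths are assumptions for illustration, not the Sprite LFS on-disk
format.

#include <stdint.h>

enum dirop { DIROP_CREATE, DIROP_LINK, DIROP_RENAME, DIROP_UNLINK };

struct dirop_log_entry {
    uint8_t  opcode;          /* create, link, rename or unlink             */
    uint32_t dir_ino;         /* i-number of the directory                  */
    uint32_t dir_offset;      /* position of the entry within the directory */
    char     name[256];       /* name stored in the directory entry         */
    uint32_t entry_ino;       /* i-number named by the entry                */
    uint16_t new_link_count;  /* new reference count for that inode         */
};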
The interaction between the directory operation log and checkpoints
introduces additional synchronization issues into Sprite LFS. In particular, each
checkpoint must represent a state where the directory operation log is consistent
with the inode and directory blocks in the log. This requires additional
synchronization to prevent directory modifications while checkpoints are being
written.
The LogFS Flash Filesystem (Specification)

Superblocks
-----------------
Two superblocks exist at the beginning and end of the file system. Each
superblock is 256 bytes large, with another 3840 bytes reserved for future
purposes, making a total of 4096 bytes. Superblock locations may differ for
MTD and block devices. On MTD the first non-bad block contains a superblock
in the first 4096 bytes and the last non-bad block contains a superblock in the
last 4096 bytes. On block devices, the first 4096 bytes of the device contain the
first superblock and the last aligned 4096-byte block contains the second
superblock. For the most part, the superblocks can be considered read-only.
They are written only to correct errors detected within the superblocks, to move
the journal and to change the file system parameters through tunefs. As a result,
the superblock does not contain any fields that require constant updates, like the
amount of free space, etc.
Segments
------------
The space in the device is split up into equal-sized segments. Segments
are the primary write unit of LogFS. Within each segment, writes happen from
front (low addresses) to back (high addresses). If only a partial segment has
been written, the segment number, the current position within it and optionally
a write buffer are stored in the journal. Segments are erased as a whole;
therefore Garbage Collection may be required to completely free a segment
before doing so.

Journal
-----------
The journal contains all global information about the file system that is
subject to frequent change. At mount time, it has to be scanned for the most
recent commit entry, which contains a list of pointers to all currently valid
entries.
Object Store
----------------
All space except for the superblocks and journal is part of the object
store. Each segment contains a segment header and a number of objects, each
consisting of an object header and the payload. Objects are either inodes,
directory entries (dentries), file data blocks or indirect blocks.

Levels
--------
Garbage collection (GC) may fail if all data is written indiscriminately.
One requirement of GC is that data is separated roughly according to the
distance between the tree root and the data. Effectively that means all file data
is on level 0, indirect blocks are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x
indirect blocks, respectively. Inode file data is on level 6 for the inodes and
7-11 for indirect blocks. Each segment contains objects of a single level only.
As a result, each level requires its own separate segment to be open for writing.

Inode File
------------
All inodes are stored in a special file, the inode file. The single exception
is the inode file's inode (master inode), which for obvious reasons is stored in
the journal instead. Instead of data blocks, the leaf nodes of the inode file are
inodes.
Aliases
--------
Writes in LogFS are done by means of a wandering tree. A naive
implementation would require that for each write of a block, all parent blocks
are written as well, since the block pointers have changed. Such an
implementation would not be very efficient. In LogFS, the block pointer
changes are cached in the journal by means of alias entries. Each alias consists
of its logical address (inode number, block index, level and child number, i.e.
index into the block) and the changed data. Any 8-byte word can be changed in
this manner. Currently aliases are used for block pointers, file size, file used
bytes and the height of an inode's indirect tree.
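For illustration, an alias entry carrying the fields described above might look
like the following; this is a sketch, not the actual LogFS journal format.

#include <stdint.h>

struct alias_entry {
    uint64_t ino;          /* inode number                                  */
    uint64_t block_index;  /* logical block index within the file           */
    uint8_t  level;        /* tree level of the block holding the word      */
    uint16_t child_no;     /* index of the changed pointer within the block */
    uint64_t new_value;    /* the changed 8-byte word                       */
};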
Segment Aliases
---------------------
Related to regular aliases, these are used to handle bad blocks. Initially,
bad blocks are handled by moving the affected segment content to a spare
segment and noting this move in the journal with a segment alias, a simple (to,
from) tuple. GC will later empty this segment and the alias can be removed
again. This is used on MTD only.

Vim
------
By cleverly predicting the lifetime of data, it is possible to separate
long-living data from short-living data and thereby reduce the GC overhead
later. Each type of distinct life expectancy (vim) can have a separate segment
open for writing. Each (level, vim) tuple can be open just once. If an open
segment with unknown vim is encountered at mount time, it is closed and
ignored henceforth.
Indirect Tree
-----------------
Inodes in LogFS are similar to FFS-style file systems, with direct and
indirect block pointers. One difference is that LogFS uses a single indirect
pointer that can be either a 1x, 2x, etc. indirect pointer. A height field in the
inode defines the height of the indirect tree and thereby the indirection of the
pointer. Another difference is the addressing of indirect blocks. In LogFS, the
first 16 pointers in the first indirect block are left empty, corresponding to the
16 direct pointers in the inode. In ext2 (maybe others as well) the first pointer in
the first indirect block corresponds to logical block 12, skipping the 12 direct
pointers. So where ext2 uses arithmetic to better utilize space, LogFS keeps the
arithmetic simple and uses compression to save space.
Compression
-----------------
Both file data and metadata can be compressed. Compression for file
data can be enabled with chattr +c and disabled with chattr -c. Doing so has no
effect on existing data, but new data will be stored accordingly. New inodes
will inherit the compression flag of the parent directory. Metadata is always
compressed. However, the space accounting ignores this and charges for the
uncompressed size. Failing to do so could result in GC failures when, after
moving some data, indirect blocks compress worse than previously. Even on a
100% full medium, GC may not consume any extra space, so the compression
gains are lost space to the user. However, they are not lost space to the file
system internals. By cheating the user of those bytes, the file system gains
some slack space and GC will run less often and faster.
Garbage Collection and Wear Leveling
--------------------------------------------------
Garbage collection is invoked whenever the number of free segments
falls below a threshold. The best (known) candidate is picked based on the least
amount of valid data contained in the segment. All remaining valid data is
copied elsewhere, thereby invalidating it. The GC code also checks for aliases
and writes them back if their number gets too large. Wear leveling is done by
occasionally picking a suboptimal segment for garbage collection. If a stale
segment's erase count is significantly lower than the active segments' erase
counts, it will be picked. Wear leveling is rate limited, so it will never
monopolize the device for more than one segment's worth at a time. Values for
"occasionally" and "significantly lower" are compile-time constants.
Hashed directories
------------------------
To support efficient lookup(), directory entries are hashed and located
based on the hash. In order to both support large directories and not be overly
inefficient for small directories, several hash tables of increasing size are used.
For each table, the hash value modulo the table size gives the table index.
Table sizes are chosen to limit the number of indirect blocks with a fully
populated table to 0, 1, 2 or 3 respectively. So the first table contains 16 entries,
the second 512-16, etc. The last table is special in several ways. First, its size
depends on the effective 32-bit limit on telldir/seekdir cookies. Since logfs uses
the upper half of the address space for indirect blocks, the size is limited to
2^31. Secondly, the table contains hash buckets with 16 entries each. Using
single-entry buckets would result in birthday "attacks": at just 2^16 used
entries, hash collisions would be likely (P >= 0.5). Testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was 188,057,130
entries, with an average of 315,149,915 entries. So for directory sizes of up to a
million, bucket overflows should be virtually impossible under normal
circumstances. With carefully chosen filenames, it is obviously possible to
cause an overflow with just 21 entries (4 higher tables + 16 entries + 1). So
there may be a security concern if a malicious user has write access to a
directory.
Device Address Space
------------------------------
A device address space is used for caching. Both block devices and MTD
provide functions to either read a single page or write a segment. Partial
segments may be written for data integrity, but where possible complete
segments are written, for performance on simple block device flash media.

Meta Inodes
----------------
Inodes are stored in the inode file, which is just a regular file for most
purposes. At umount time, however, the inode file needs to remain open until
all dirty inodes are written. So generic_shutdown_super() may not close this
inode, but shouldn't complain about remaining inodes due to the inode file
either. The same goes for the mapping inode of the device address space.
Currently logfs uses a hack that essentially copies part of the fs/inode.c code
over. A general solution would be preferred.

Indirect block mapping
------------------------------
With compression, the block device (or mapping inode) cannot be used to
cache indirect blocks. Some other place is required. Currently logfs uses the
top half of each inode's address space. The low 8 TB (on 32-bit) are filled with
file data, the high 8 TB are used for indirect blocks. One problem is that 16 TB
files created on 64-bit systems actually have data in the top 8 TB. But files
larger than 16 TB would cause problems anyway, so only the limit has changed.
Conclusion
The basic principle behind a log-structured file system is a simple one:
collect large amounts of new data in a file cache in main memory, then write the
data to disk in a single large I/O that can use all of the disk's bandwidth.
Implementing this idea is complicated by the need to maintain large free areas
on disk, but both the simulation analysis and the experience with Sprite LFS
suggest that low cleaning overheads can be achieved with a simple policy based
on cost and benefit.
The bottom line is that a log-structured file system can use disks an order
of magnitude more efficiently than existing file systems. This should make it
possible to take advantage of several more generations of faster processors
before I/O limitations once again threaten the scalability of computer systems.
Journaling Flash File System (JFFS)
The Journaling Flash File System (or JFFS) is a log-structured file system
for use on NOR flash memory devices on the Linux operating system. It has
been superseded by JFFS2.
Design
Flash memory (specifically NOR flash) must be erased prior to writing. The
erase process has several limitations:
- Erasing is very slow (typically 1-100 ms per erase block, which is 10^3 to
  10^5 times slower than reading data from the same region).
- It is only possible to erase flash in large segments (usually 64 KB or more),
  whereas it can be read or written in smaller blocks (often 512 bytes).
- Flash memory can only be erased a limited number of times (typically 10^3
  to 10^6) before it becomes worn out.
These constraints combine to produce a profound asymmetry between
patterns of read and write access to flash memory. In contrast, magnetic hard
disk drives offer nearly symmetrical read and write access: read speed and write
speed are nearly identical (as both are constrained by the rate at which the disk
spins), it is possible to both read and write small blocks or sectors (typically 512
or 4096 bytes), and there is no practical limit to the number of times magnetic
media can be written and rewritten.
Traditional file systems, such as ext2 or FAT, which were designed for use on
magnetic media, typically update their data structures in place, with data
structures like inodes and directories updated on-disk after every modification.
This concentrated lack of wear levelling makes conventional file systems
unsuitable for read-write use on flash devices.
JFFS enforces wear levelling by treating the flash device as a circular log.
All changes to files and directories are written to the tail of the log in nodes. In
each node, a header containing metadata is written first, followed by file data, if
any. Nodes are chained together with offset pointers in the header. Nodes start
out as valid and then become obsolete when a newer version of them is created.
The free space remaining in the file system is the gap between the log's tail
and its head. When this runs low, a garbage collector copies valid nodes from
the head to the tail and skips obsolete ones, thus reclaiming space.
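For illustration, a node of this kind could carry a header along the following
lines; the field names are assumptions, not the actual JFFS on-flash layout.

#include <stdint.h>

struct log_node_header {
    uint32_t ino;          /* file or directory this node belongs to       */
    uint32_t version;      /* a newer version makes older nodes obsolete   */
    uint32_t data_offset;  /* offset within the file of the data carried   */
    uint32_t data_len;     /* length of the file data following the header */
    uint32_t next_offset;  /* offset pointer chaining nodes together       */
};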
Disadvantages
- At mount time, the file system driver must read the entire inode chain and
  then keep it in memory. This can be very slow. Memory consumption of
  JFFS is also proportional to the number of files in the file system.
- The circular log design means all data in the file system is rewritten,
  regardless of whether it is static or not. This generates many unnecessary
  erase cycles and reduces the life of the flash medium.
Flash Friendly File System (F2FS):
Introduction:
In recent years flash-drive based memory systems have become quite
popular due to their low cost and high storage capacity. These devices have
replaced traditional EEPROM devices used in computers and are used as
storage in various digital devices like digital cameras, mobile phones, etc. The
structure and construction of these devices, however, differ considerably from
normal (magnetic) hard drives. Due to this, traditional file systems like ntfs and
ext4 are very inefficient for flash-based devices. Hence a new group of
log-structured file systems is used with flash memory.
This document serves as a basic introduction to log-based file systems with
special focus on a particular file system called Flash Friendly File System
(F2FS).
Flash Memory
Flash devices are constantly-powered non-volatile memory devices. They
can be either NOR or NAND devices.
In flash devices a 1 bit denotes empty data. Whenever some data has to be
written, 1 bits are flipped to 0. Erasing data in flash memory therefore consists of
flipping all 0 bits back to 1. Due to the nature of these devices, individual bytes
cannot be erased; rather, data are erased in units of blocks called erase-blocks.
As writing a byte of data may require changing one or more 0 bits to 1, the
whole block containing the byte needs to be erased first. The sequence of
operations consisting of first copying the contents of an erase-block to memory,
erasing the contents of the block and then writing the updated in-core content
back to the block is called an erase-cycle. Every flash-memory erase-block
has a lifetime of a limited number of such erase-cycles, after which the block
becomes unusable. This limitation has led to the unique requirements of a flash
memory file system.
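The erase-cycle sequence just described can be sketched in a few lines of C. This is a minimal illustration only; flash_erase_block() and flash_write_block() are hypothetical device operations standing in for whatever driver interface the hardware actually provides.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define ERASE_BLOCK_SIZE (64 * 1024)

extern void flash_erase_block(uint8_t *block);                    /* sets every bit back to 1 */
extern void flash_write_block(uint8_t *block, const uint8_t *src);

/* Updating even a single byte in place costs one full erase-cycle of the
 * block's limited lifetime.                                               */
void update_byte_in_place(uint8_t *erase_block, size_t offset, uint8_t value)
{
    static uint8_t ram_copy[ERASE_BLOCK_SIZE];

    memcpy(ram_copy, erase_block, ERASE_BLOCK_SIZE);  /* 1. copy the block to memory */
    ram_copy[offset] = value;                         /* 2. update the in-core copy  */
    flash_erase_block(erase_block);                   /* 3. erase the whole block    */
    flash_write_block(erase_block, ram_copy);         /* 4. write the content back   */
}
```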
Issues of Flash Memory File Systems
The most important issue for a flash memory file system is to extend the
life of each erase-block as much as possible. In traditional hard drives data is
often updated in place. As the main source of inefficiency in hard disk drives is
the rotational latency and seek time of the mechanical read/write head, it makes
sense to reduce these by adopting a policy of in-place update.
However, in flash devices, rotational latency caused by the movement of
mechanical parts is not an issue. Moreover, frequently updating a particular
erase-block leads to a quick reduction of its lifetime. Thus flash-based file
systems adopt a copy-on-write (COW) policy, where multiple applications reading
the same block of data share a single in-core copy of the block, and writes to
that block of data are done on separate empty blocks. A second, related issue is to
choose blocks for writing in such a way that no block is used up more quickly
than any other block. This is called wear-leveling. As data change, blocks
become fragmented. Erase-blocks which contain non-obsolete data are said to
be live. Once no empty block remains in the flash device, the need for
reclaiming obsolete data arises. This is called garbage collection.
Flash Translation Layer
USB sticks, which are a type of flash device, are generally formatted
with file systems like NTFS and ext4. If these file systems were allowed direct
access to the raw device, its lifetime would be severely compromised. Hence an
intermediate layer, called the flash translation layer (FTL), interfaces with the disk
drivers accessing the flash device.
The main tasks of the FTL are address translation, wear-levelling and
garbage collection. The higher-level VFS can treat the flash device as a regular
block device, while the FTL internally maps all block addresses to internal flash
addresses using various techniques like sector mapping, block mapping, hybrid
mapping, etc.
Although convenient, the presence of the FTL is inefficient due to the extra
level of translation required. Hence most flash file systems, like JFFS2 or LogFS,
do not use the FTL. However, as we will see, F2FS is an exception, as it delegates
many operations to the underlying FTL.
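A minimal sketch of the block-mapping style of address translation mentioned above is given below. The table layout, the constants and the function name are illustrative assumptions, not the design of any particular FTL; the point is only that the file system above sees ordinary logical block numbers while the FTL decides which physical erase block currently holds the data.

```c
#include <stdint.h>

#define BLOCKS_PER_ERASE_BLOCK 128    /* e.g. a 512 KB erase block of 4 KB sectors */
#define LOGICAL_ERASE_BLOCKS   4096

/* logical erase-block number -> physical erase-block number,
 * remapped by the FTL over time for wear-levelling            */
static uint32_t map[LOGICAL_ERASE_BLOCKS];

uint32_t ftl_translate(uint32_t logical_block)
{
    uint32_t logical_eb = logical_block / BLOCKS_PER_ERASE_BLOCK;
    uint32_t offset     = logical_block % BLOCKS_PER_ERASE_BLOCK;

    return map[logical_eb] * BLOCKS_PER_ERASE_BLOCK + offset;
}
```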
Flash-Friendly File System:
F2FS (Flash-Friendly File System) is a flash file system created by Jaegeuk
Kim at Samsung for the Linux operating system kernel. The source code of
F2FS was made available to the Linux kernel by Samsung in 2012. It uses a log-
structured file system approach. Although there are other log-structured flash
file systems, F2FS has some unique features. First of all, it is flash-aware, i.e. it
takes into consideration the unique characteristics of the flash device. Secondly, as
mentioned previously, it uses the underlying FTL to do some of its work. It also
tries to solve the wandering-tree problem. Lastly, it tries to make garbage
collection more efficient.
Design of On-Disk Layout:
[Figure: on-disk layout, in order — Superblock | Checkpoint | Segment Information Table | Node Address Table | Segment Summary Area | Main Area]
The on-disk layout of F2FS is divided into the main area and the metadata
area. All data and node blocks (including direct and indirect node blocks) are stored
in the main area. The main area comprises segments, each 2 MB in size,
which contain 512 blocks. Each segment is a log which is written from start
to finish consecutively. When a segment becomes filled up, another free segment
is chosen. F2FS groups segments together into sections. By default there is one
segment per section, but this can be modified. At any time there are six sections
which are open for writing at the same time. Most modern flash devices are a
collection of multiple independent chips which can perform their own
reads/writes at the same time. F2FS takes advantage of this feature by collecting
sections into zones (again one per zone by default) and aligning the zones
containing the six open sections with the independent read/write chips of the
device. As we will see, this not only makes device access faster but is also
useful during garbage collection.
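The default sizing described above can be summarised with a few illustrative constants (a sketch only; segment, section and zone sizes other than the defaults can be chosen when the file system is created):

```c
enum {
    F2FS_BLOCK_SIZE      = 4096,                                  /* 4 KB block        */
    BLOCKS_PER_SEGMENT   = 512,
    SEGMENT_SIZE         = BLOCKS_PER_SEGMENT * F2FS_BLOCK_SIZE,  /* 512 x 4 KB = 2 MB */
    SEGMENTS_PER_SECTION = 1,                                     /* default           */
    SECTIONS_PER_ZONE    = 1,                                     /* default           */
    OPEN_SECTIONS        = 6                                      /* six open logs     */
};
```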
Metadata Area:
The metadata area contains all the information required to make sense of
the main data. F2FS uses a shadow-copy mechanism while updating metadata. As
far as F2FS is concerned, each metadata area (like the CP, the NAT, etc.) has two
fixed locations, and only one of them contains the most current version of the data.
Updates to metadata occur in the currently obsolete location. This acts as a safeguard
against metadata corruption in case of a crash in the midst of changing metadata. Of
course, updating in place is highly risky in flash memory, so F2FS offloads the
actual task of writing and wear-leveling to the underlying FTL. As the metadata size
is very small, the overhead associated with the FTL is not very high. The metadata is
divided into the following parts:
1. Superblock
The Superblock contains read-only data concerning the whole file
system. Once the file system is created the Superblock never changes. It is
stored at the beginning of the partition and stores data like how big the file
system is, how big the segments, sections, and zones are, how much space has
been allocated for the various parts of the "meta" area, etc.
2. Checkpoint
The checkpoint (CP) area contains the rest of the dynamic file-system-related
information, like the amount of free space, the address of the segment to which data
should be written next, etc. After a file system crash the drive can recover part
of the data instantaneously by reading the latest information from the CP.
3. Segment Information Table
The SIT stores 74 bytes per segment. It primarily keeps track of which
blocks are still in active use, so that a segment can be reused when it has no
active blocks, or can be cleaned when its active block count gets low.
4. Node Address Table
The NAT is essentially an array containing the address of every node block
stored in the main area. The address of a node block can be found by using its
node number as an index into this array. As mentioned, the NAT is used to
solve the wandering-tree problem.
5. Segment Summary Area
The SSA contains summary entries which hold the owner information of all the
data and node blocks stored in the main area.
Inode Structure
F2FS has three types of nodes: inode, direct node and indirect node. F2FS assigns
4 KB to each inode block, which contains 923 data block indices, two direct node
pointers, two indirect node pointers and one double indirect node pointer. One direct
node block contains entries for 1018 data blocks. Thus one inode block covers
4 KB × (923 + 2 × 1018 + 2 × 1018 × 1018 + 1018 × 1018 × 1018) ≈ 3.94 TB.
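The figure of roughly 3.94 TB can be checked with a quick calculation from the pointer counts and the 4 KB block size given above:

Blocks addressable from one inode = 923 + 2 × 1018 + 2 × 1018² + 1018³ = 923 + 2,036 + 2,072,648 + 1,054,977,832 = 1,057,053,439
Maximum file size ≈ 1,057,053,439 × 4 KB ≈ 3.94 TB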
Directory Structure
A directory entry in Unix consists of a (file name, inode number) pair.
The original Unix file system searched linearly through each directory entry in
order to find a file. Linear search is not efficient in modern file systems, so F2FS
uses multi-level hash tables to implement its directory structure. Each
level has a hash table with a dedicated number of hash buckets.
The number of blocks and buckets is determined by the following formulas:
Number of blocks in level n = 2, if n < max_directory_hash_depth / 2; 4, otherwise.
Number of buckets in level n = 2^n, if n < max_directory_hash_depth / 2;
2^((max_directory_hash_depth / 2) − 1), otherwise.
When F2FS looks up a file name in a directory, a hash value of the file name
is calculated first. Then F2FS scans the hash table in level 0 to find the dentry
consisting of the file name and its inode number. If it is not found, F2FS scans the
next hash table in level 1. In this way, F2FS scans the hash tables of each level
incrementally from 1 to N. In each level F2FS needs to scan only one bucket,
determined by the following equation:
Bucket number to scan in level n = (hash value) % (number of buckets in level n).
In the case of file creation, F2FS finds empty consecutive slots that cover the
file name. F2FS searches for the empty slots in the hash tables of all levels from
1 to N in the same way as the lookup operation.
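The level and bucket formulas above can be expressed directly in code. The sketch below mirrors the equations in the text; MAX_DIR_HASH_DEPTH is assumed to be 63 here purely for illustration, and the helper names are not the kernel's own.

```c
#include <stdint.h>

#define MAX_DIR_HASH_DEPTH 63   /* assumed value, for illustration only */

static unsigned blocks_in_level(unsigned n)
{
    return (n < MAX_DIR_HASH_DEPTH / 2) ? 2 : 4;
}

static uint64_t buckets_in_level(unsigned n)
{
    if (n < MAX_DIR_HASH_DEPTH / 2)
        return 1ULL << n;                           /* 2^n                 */
    return 1ULL << (MAX_DIR_HASH_DEPTH / 2 - 1);    /* 2^((depth / 2) - 1) */
}

/* Only one bucket per level has to be scanned for a given file-name hash. */
static uint64_t bucket_to_scan(uint64_t name_hash, unsigned n)
{
    return name_hash % buckets_in_level(n);
}
```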
Solution of the Wandering Tree Problem
When the address of an inode is stored in a directory, or an index block is
stored in an inode or another index block, it isn't the block address that is stored,
but rather an offset into the NAT. The actual block address is stored in the NAT
at that offset. This means that when a data block is written, we still need to
update and write the node that points to it, but writing that node only requires
updating its NAT entry. The NAT is part of the metadata that uses two-location
journaling (thus depending on the FTL for write-gathering) and so does not
require further indexing. Thus the introduction of the NAT solves the wandering-
tree issue.
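A small sketch of this indirection follows. The types and names are illustrative rather than the actual F2FS structures; the point is that index blocks store a node ID, and only the NAT entry for that ID changes when the node block moves.

```c
#include <stdint.h>

typedef uint32_t nid_t;          /* node ID = index into the NAT         */
typedef uint32_t block_addr_t;   /* physical block address on the device */

#define MAX_NODES (1u << 20)

static block_addr_t nat[MAX_NODES];   /* Node Address Table */

/* Resolving a node reference costs one extra table lookup. */
block_addr_t node_block_addr(nid_t nid)
{
    return nat[nid];
}

/* When a node block is rewritten at a new location, only its NAT entry is
 * updated; the inode or index block that references `nid` is untouched,
 * which is what stops the update from propagating up the tree.            */
void node_block_moved(nid_t nid, block_addr_t new_addr)
{
    nat[nid] = new_addr;
}
```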
F2FS Garbage Collection
F2FS follows a dynamic cleaning policy: depending on the situation, the
cleaning algorithm used by F2FS changes.
Traditional garbage collection works by copying and compacting the live data of
highly fragmented segments into empty segments, thus freeing up the space previously
occupied by obsolete data. Three questions become important during garbage
collection: when to start, how many blocks to free up, and which segments to
choose for cleaning.
The first two questions are easily answered. Normally garbage collection
starts when the number of free segments goes below a certain threshold (typically
tens of segments), and it continues to clean segments until the number of free
segments goes above another threshold.
The question of which segments to clean (called victim segments) is
however not so simple. One greedy approach is to simply choose the segment
with the least number of live blocks in order to minimize the copying overhead.
However, this has been found empirically not to be the best solution.
If recently copied data becomes obsolete very rapidly, then the effort of
copying it is wasted; we might as well have waited a little longer and not
copied that data at all. This leads to the concept of the temperature of data: fast-
changing data is hot, whereas relatively stable data is cold. At the time of victim
selection, not only the cost of copying data is considered, but the benefit of cleaning
stable data is also taken into account. This leads to the cost-benefit approach to
victim selection.
The stability of a segment depends on its age, where age is
the most recent modification time of the data in the segment. How obsolete a segment
is, is known by keeping track of the number of live blocks in each segment. Both
pieces of information are kept in the Segment Information Table. At the time of
selecting a victim segment, the segment age is maximized and the number of live
blocks in the segment is minimized.
By default F2FS uses the cost-benefit approach; however, it can be configured
by the user to use the greedy approach.
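The two victim-selection policies can be contrasted with a short sketch. The cost-benefit score below follows the classic LFS formulation, benefit/cost = age × (1 − u) / (1 + u), where u is the fraction of live blocks in a segment; the structure layout and function names are illustrative assumptions, not the exact F2FS implementation.

```c
#include <stddef.h>

struct segment_info {
    unsigned           live_blocks;   /* tracked in the Segment Information Table */
    unsigned           total_blocks;  /* e.g. 512                                 */
    unsigned long long mtime;         /* latest modification time of the segment  */
};

/* Greedy policy: pick the segment with the fewest live blocks. */
size_t pick_victim_greedy(const struct segment_info *seg, size_t nsegs)
{
    size_t best = 0;
    for (size_t i = 1; i < nsegs; i++)
        if (seg[i].live_blocks < seg[best].live_blocks)
            best = i;
    return best;
}

/* Cost-benefit policy: prefer old (stable) segments with little live data. */
size_t pick_victim_cost_benefit(const struct segment_info *seg, size_t nsegs,
                                unsigned long long now)
{
    size_t best = 0;
    double best_score = -1.0;

    for (size_t i = 0; i < nsegs; i++) {
        double u     = (double)seg[i].live_blocks / seg[i].total_blocks;
        double age   = (double)(now - seg[i].mtime);
        double score = age * (1.0 - u) / (1.0 + u);

        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```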
F2FS Adaptive Write Policy
F2FS performs further optimization in order to make garbage collection more
efficient. The aim of ideal garbage collection is to achieve a bimodal distribution
of data: in such an ideal case most segments are filled with rarely
changing cold data, and a few segments are always occupied by rapidly
changing hot data. Under such a scenario, copying of live data is rarely required,
as segments quickly become filled with obsolete blocks.
Achieving a bimodal distribution of data is however not so easy. F2FS divides
blocks into node blocks and data blocks, and each of these categories is again
divided into hot/warm/cold temperatures as follows:
• Hot node: direct node blocks of directories.
• Warm node: direct node blocks other than hot node blocks.
• Cold node: indirect node blocks.
• Hot data: dentry blocks.
• Warm data: data blocks other than hot and cold data blocks.
• Cold data: multimedia data or migrated data blocks.
These six types of data are stored in the six separate sections that are
open for parallel writing. Thus something approaching a bimodal distribution of
data is achieved by F2FS.
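The six-way classification listed above amounts to a simple decision rule per block, sketched below. The enum and the two helper functions are illustrative only and do not reproduce the kernel's own definitions.

```c
enum temp { HOT, WARM, COLD };

/* Node blocks: classified by their position in the indexing tree. */
enum temp classify_node_block(int is_dir_direct_node, int is_indirect_node)
{
    if (is_dir_direct_node) return HOT;    /* direct node blocks of directories */
    if (is_indirect_node)   return COLD;   /* indirect node blocks              */
    return WARM;                           /* remaining direct node blocks      */
}

/* Data blocks: classified by what they contain. */
enum temp classify_data_block(int is_dentry, int is_multimedia_or_migrated)
{
    if (is_dentry)                 return HOT;    /* dentry blocks                    */
    if (is_multimedia_or_migrated) return COLD;   /* multimedia or GC-migrated blocks */
    return WARM;                                  /* everything else                  */
}

/* Two block kinds x three temperatures = the six logs written in parallel. */
```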
Threaded Log
Once the flash drive starts filling up, most blocks contain live data and
garbage collection becomes very inefficient. In such a case F2FS switches to
threaded logging, and no garbage collection is done. Instead, each obsolete
block maintains a pointer to the next obsolete block. These pointers, or threaded
logs, are used by F2FS to write data directly into the obsolete blocks.
Though this technique is fast, the reason it is not used more often is that
sequential data becomes fragmented and spatial locality is lost.
Open-Source Benchmarking:
The Phoronix Test Suite is the most comprehensive testing and
benchmarking platform available for Linux, Solaris and Mac OS X, providing an
extensible framework to which new tests can be easily added. The software is
designed to effectively carry out both qualitative and quantitative benchmarks in
a clean, reproducible, and easy-to-use manner.
The Phoronix Test Suite is based upon the extensive testing and internal
tools developed by Phoronix.com since 2004, along with support from leading
tier-one computer hardware and software vendors. This software is open source
and licensed under the GNU GPLv3.
Originally developed for automated Linux testing, support in the
Phoronix Test Suite has since been added for OpenSolaris, Apple Mac OS X,
Microsoft Windows, and BSD operating systems. The Phoronix Test Suite
consists of a lightweight processing core (pts-core), with each benchmark
consisting of an XML-based profile and related resource scripts. The process
from benchmark installation, to the actual benchmarking, to the parsing of
important hardware and software components is heavily automated and
completely repeatable, asking users only for confirmation of actions.
The Phoronix Test Suite interfaces with OpenBenchmarking.org as a
collaborative web platform for the centralized storage of test results, the sharing of
test profiles and results, advanced analytical features, and other functionality.
Phoromatic is an enterprise component to orchestrate test execution across
multiple systems with remote management capabilities.
Software Features:
1. Runs On Linux, Solaris, Mac OS X, Windows & BSD Operating Systems
2. Extensible Testing Architecture
3. Embedded To Cloud Scale
4. 450+ Test Profiles
5. 100+ Test Suites
6. Automated Batch Mode Support
7. Automated Test Downloading & Installation
8. Dependency Management Support
9. Module-based Plug-In Architecture
10. Minimum / Average / Maximum Result Reporting
11. Standard Deviation Monitoring & Insurance
12. PNG, SVG Graph Rendering Support
13. PDF Result Reports
14. Detailed Software, Hardware Detection
15. System Monitoring Support
16. Commercial Support & Custom Engineering Available
OpenBenchmarking.org Features:
• Global Database For Result Uploads, Benchmark Comparisons
• Side-by-Side Results Comparisons
• Performance Classifications
• Crowd-Sourced Aggregated Results Analysis
Phoromatic Features:
• Conduct Tests Across Multiple Systems On A Scheduled Basis
• Per-Commit Git Testing
• Multi-System Support
• Turn-Key Deployment
• Remote Testing
Overview:
The Phoronix Test Suite can be used simply for comparing our
computer's performance with that of our friends and colleagues, or it can be used
within our organization for internal quality assurance purposes, hardware validation,
and continuous performance management.
The Phoronix Test Suite has received widespread adoption: it is used by
numerous technical publications for facilitating hardware performance
comparisons, is used by nearly every tier-one independent hardware vendor
(IHV), is depended upon by software projects for monitoring the performance
of their active code-base and for tracking regressions, and is used by many
other Fortune 500 companies and global organizations for various purposes.
Other ways the Phoronix Test Suite has been used in real-world
applications include stressing a system's different sub-systems, financial
institutions and trading firms using the framework for finding the
software/hardware combination that offers peak performance so as to maximize
their transactions, and hardware vendors using it for driver performance
management. The Phoronix Test Suite can be adapted to run on platforms
ranging from smartphones and personal computers to multi-core workstations
and cloud computing infrastructures.
About:
The Phoronix Test Suite was conceived out of the internal tools
developed by Phoronix Media over the years of operating Phoronix.com, one of
the first Linux-based hardware review web sites and now the largest provider
of Linux software and hardware benchmarks on the Internet. Phoronix launched
in June of 2004, and in mid-2007 work on the public version of the
Phoronix Test Suite commenced. The lead developer of the Phoronix Test Suite
is Michael Larabel, the founder of Phoronix. Matthew Tippett, formerly the
Linux Core Engineering Manager of Advanced Micro Devices (AMD), is a
Phoronix Test Suite partner and shares responsibility for the overall direction of
the test suite.
Many independent hardware and software vendors have also provided
feedback, patches, and other forms of support to the development of the
Phoronix Test Suite. More than two dozen different corporations and individuals
have directly contributed to the development of this software. Phoronix Media
remains the key sponsor and entity behind the development of the Phoronix Test
Suite, striving to make Linux benchmarking incredibly robust and innovative, to
make it very easy to perform nearly any kind of benchmark, and to keep it ahead
of other leading operating systems in its benchmarking abilities. Phoronix Media
provides commercial services and custom engineering for the long-term
sustainability of this open source software product.
While the internal Linux testing tools at Phoronix began as just a simple
set of scripts many years ago, the Phoronix Test Suite has evolved into a
one-of-a-kind, feature-rich framework and a software package that is unique
regardless of platform for its capabilities and openness. Vendors no longer need
to rely upon their own engineers to develop in-house utilities for their internal
testing needs, whatever they may be, for computer hardware or software;
instead, they are able to leverage this extensible platform that provides a greater
feature set while lowering their costs.
The Phoronix Test Suite has received many awards and accolades for its
features and capabilities. Various milestones are shared on the press page and
release history.
How to get phoronix-test-suite:
The only explicit requirement for the Phoronix Test Suite on Linux, OpenSolaris,
BSD, Windows 7, and Mac OS X operating systems is PHP CLI
(packages for it are usually called php5-cli or php-cli or just php). Only
PHP is needed, not a web server or the other packages commonly associated
with PHP. Many of the benchmarking profiles do require the standard Linux
development tools/libraries (GCC, etc.) and other common programs. However,
on many Linux distributions and operating systems the Phoronix Test Suite is
able to use the system's package manager for installing these
additional dependencies. On a clean Ubuntu installation, it is as easy as first
running sudo apt-get install php5-cli and then setting up the Phoronix Test
Suite.
The detailed installation instructions:
The Phoronix Test Suite can be rapidly deployed across most operating
systems that support PHP 5 CLI, including Linux, FreeBSD, NetBSD,
OpenBSD, Mac OS X, OpenSolaris, and Windows. After we have downloaded
the Phoronix Test Suite (from its download page) and installed the necessary
dependencies, we can immediately begin using the software.
If we have downloaded the source package of the Phoronix Test Suite, we
can run it locally after extracting the compressed file, or use the generic
installation script. Running the install-sh file within the phoronix-test-suite/
directory will install the Phoronix Test Suite files to /usr/bin/phoronix-test-suite,
/usr/share/doc/phoronix-test-suite/, and /usr/share/phoronix-test-suite/ by
default. The installation prefix can be adjusted via the first argument passed to this
installer (i.e. install-sh /usr/local). If we are using one of the distribution
packages, once it is installed we should be able to run phoronix-test-suite from
the terminal.
To see all available tests, run phoronix-test-suite list-tests. When we find a test
we are interested in using, we run phoronix-test-suite info <test name> to view
more information. When we are ready to use the test, we run phoronix-test-suite
benchmark <test name>. The benchmark option will proceed to install the test,
execute it and display the results.
The Phoronix Test Suite will automatically download all needed test files.
In addition, on supported platforms it will also use the distribution's package
management system for installing the needed test dependencies. However, for
software being tested that may already be installed on the system by the
user, the Phoronix Test Suite will ignore those installations. The Phoronix Test
Suite installs all tests within its own configured environment to improve the
reliability of the testing process. This ensures all tests being used are at the same
version, are run or built from the same source packages, and minimizes
outside changes such as patches or build-time optimizations that may be
specific to a single distribution or environment. This does, however, result in
additional download time and extra disk space: installing a large test
suite may consume several gigabytes of disk space. Once the files have been
downloaded, we can run phoronix-test-suite make-download-cache to
generate a backup of these files at ~/.phoronix-test-suite/download-cache. This
allows us to quickly and easily transfer these files to other PCs.
The Phoronix Test Suite will then prompt us for any test options and
whether we would like to save the results.
When the testing is complete, the results will be displayed and we will be
prompted to upload these results to Phoronix Global. For more information and
to see all of the options of the Phoronix Test Suite, refer to the full
Phoronix Test Suite documentation.
With the random-write Threaded I/O Tester test, F2FS performance
appears degraded with the Linux 3.15 kernel compared to
Linux 3.14 stable; the other file systems appeared unaffected.
While the AIO-Stress run on the Vertex 3 SSD seemed to be getting into
some system memory caching, the EXT4 and F2FS file systems reported
improved results with the Linux 3.15 kernel over Linux 3.14. Meanwhile, Btrfs
and XFS readings were slightly lower compared to Linux 3.14.
With Dbench there wasn't too much to see, but F2FS from Samsung was
the fastest. However, Samsung's Flash-Friendly File-System reported slightly
faster results than what would be expected of Serial ATA 3.0 solid-state drives,
so compared to the other file systems it may still be waiting to commit some data
to the disk.
With the PostMark mail server disk benchmark, the F2FS file system
edged past XFS as the fastest.
Conclusion
F2FS is a modern file system for flash devices that tries to tackle very
common but persistent issues. It does so by taking advantage of the flash device
itself, cooperating with the FTL and trying to make garbage collection as
painless as possible. It serves as an excellent example of the decisions that need to
be made in real-world engineering problems. Compared with the other file systems
studied here, F2FS is the best-performing log-structured file system. Being a recent
file system it is still a work in progress, and there is scope for improvements and
modifications.