
Log Structured FS & Unix
Arvind Krishnamurthy
Spring 2001

Log Structured File Systems
- Motivation:
  - Small writes are expensive on disks
  - Treat the disk like a tape; do large I/Os
- Collect writes in memory
  - Write large chunks
  - Append-only log
- The log is the only representation on disk
- Two problems:
  - Data constantly moves around
  - Disk fills up, leaving behind holes

Locating Data on the Disk
- Use the same structures as Unix
  - Inodes, data blocks, indirect blocks, doubly indirect blocks
  - When a data block is rewritten, rewrite the I-node as well to hold the new pointer
- How to find I-nodes?
  - In Unix, an I-number maps to a fixed location on disk; in LFS, I-nodes float
  - When you rewrite an I-node, it goes to the end of the log
- Just add another level of indirection: the Inode-map (see the sketch below)
  - Maps I-numbers to disk positions
  - Inode-map gets written to the disk
- How do you find the Inode-map?
  - Add another level of indirection
  - Now it fits in memory
  - Write it into checkpoint space on disk

Wrap Around Problem
- Two approaches:
  - Compaction
    - Read in information that is still alive and compact it
  - Threading
    - Leave data in place and write new information around old data
    - Eventually fragments
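
A minimal C sketch of the inode-map indirection described above (types, sizes, and names are invented for illustration; this is not Sprite's actual code):

    #include <stdint.h>

    #define IMAP_ENTRIES 4096

    typedef uint32_t disk_addr_t;             /* position of a block in the log */

    struct inode_map {                        /* cached in memory; written out  */
        disk_addr_t inode_addr[IMAP_ENTRIES]; /* to the checkpoint space on disk */
    };

    /* In Unix an I-number maps to a fixed disk location; in LFS it is one
     * more table lookup, because I-nodes float in the log. */
    disk_addr_t lookup_inode(const struct inode_map *imap, uint32_t inumber)
    {
        return imap->inode_addr[inumber];
    }

    /* A rewritten I-node is appended to the end of the log, so only its
     * inode-map entry has to change. */
    void move_inode(struct inode_map *imap, uint32_t inumber, disk_addr_t new_addr)
    {
        imap->inode_addr[inumber] = new_addr;
    }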

Sprite (implementation of LFS)
- Combination of the two; open up free segments & avoid copying
- Segmented log:
  - Statically partition the disk into segments (1MB in LFS)
  - Pick a segment size large enough to amortize disk seek costs
  - Compaction within segments
  - Threading in between
- Segment cleaner:
  - Clean segments in the background
  - Pick one or more segments and write them out as clean segments without holes
  - Goal: collect long-lived information into segments and get rid of the holes made by new data

Cleaner
- Write cost: measures how expensive the I/O is
  - Write cost = total amount of I/O / number of new bytes written
- Write cost at utilization "u": 2/(1 - u)
  - Read in one full dirty segment (1)
  - Write back the live data (u)
  - Write back new data (1 - u)
  - Write cost = 4 for u = 0.5
- Write cost for normal file systems:
  - Total disk access time / useful disk time
  - Seek time / actual disk write cost
  - FFS write cost: 10 (normal) to 4 (best)
- The disk does not have to be 50% utilized, only the segments have to be 50% utilized!
- Would be great to have a bi-modal distribution
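
As a quick check of the arithmetic above, a tiny C program (illustrative only) that evaluates the 2/(1 - u) formula:

    /* Write cost from the derivation above: per segment cleaned we do
     * 1 segment of reads + u of live writes + (1 - u) of new writes = 2
     * segments of I/O, for (1 - u) segments of new data. */
    #include <stdio.h>

    double lfs_write_cost(double u)    /* u = fraction of the segment still live */
    {
        return 2.0 / (1.0 - u);
    }

    int main(void)
    {
        for (int i = 0; i <= 9; i++) {
            double u = i / 10.0;
            printf("u = %.1f  write cost = %.1f\n", u, lfs_write_cost(u));
        }
        /* u = 0.5 gives write cost 4.0, as on the slide. */
        return 0;
    }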

Write Cost Graph
- [Graph: write cost as a function of the fraction alive in segments cleaned]

Simulation Results
- 4 KB writes
- Randomly write blocks
- Greedy cleaner (see the sketch below)
  - Pick the segment with the most free data
- [Graph: write cost as a function of disk utilization]
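
A small C sketch of the greedy policy used in the simulation above (segment size and live-block counts are made up; this is not the actual simulator):

    #include <stdio.h>

    #define NSEGS   8
    #define SEG_BLK 256                      /* 4 KB blocks per 1 MB segment */

    /* Greedy cleaner: pick the segment with the most free data,
     * i.e. the fewest live blocks. */
    int pick_greedy(const int live[NSEGS])
    {
        int best = 0;
        for (int s = 1; s < NSEGS; s++)
            if (live[s] < live[best])
                best = s;
        return best;
    }

    int main(void)
    {
        int live[NSEGS] = { 250, 128, 200, 64, 255, 190, 230, 100 };
        int victim = pick_greedy(live);
        double u = (double)live[victim] / SEG_BLK;

        printf("clean segment %d (u = %.2f, write cost = %.1f)\n",
               victim, u, 2.0 / (1.0 - u));
        return 0;
    }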

Locality
- 90-10 pattern:
  - 90% of accesses go to 10% of the blocks
  - Things got worse!
- Maybe we should segregate data:
  - Clean a whole bunch of segments at a time
  - Segregate old data from new data
  - Things didn't improve!
- Let us look at the distribution of segment utilization

Observations
- A greedy strategy based on utilization alone doesn't work
- Need to consider the age of the blocks as well
  - Some segments might tie up just a small number of blocks, but they are tied up for a long time
  - Need to consider block-seconds rather than just blocks
- Cost-benefit analysis:
  - Benefit = free space generated * age of data = (1 - u) * age
  - Cost = (1 + u)
  - Pick the segment with the greatest benefit/cost ratio
- Voila: we get a bimodal distribution!
- Excellent research:
  - High risk, simulations, lessons learnt, better algorithm -> implementation
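
A C sketch of the cost-benefit selection above (struct and field names are invented for illustration):

    #include <stddef.h>

    struct segment_usage {
        double u;     /* fraction of blocks still live */
        double age;   /* age of the data in the segment (e.g. in seconds) */
    };

    /* Benefit/cost = (free space generated * age) / cost of cleaning
     *              = ((1 - u) * age) / (1 + u)                        */
    static double benefit_cost(const struct segment_usage *s)
    {
        return (1.0 - s->u) * s->age / (1.0 + s->u);
    }

    /* Clean the segment with the greatest benefit/cost ratio. */
    size_t pick_cost_benefit(const struct segment_usage *segs, size_t n)
    {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (benefit_cost(&segs[i]) > benefit_cost(&segs[best]))
                best = i;
        return best;
    }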

Unix
- Multics: visionary, tried lots of amazing new ideas, ungainly
- Unix: craftsmanship, elegance, taste
  - 10 pages describe the entire system; it fits together harmoniously; the same ideas are used everywhere
  - Current systems don't have the same properties (committee work)
- Unix is the only system ported to many different machines
  - Adapted to changing technologies
- Ken Thompson: student at Berkeley in the mid-60's (time-sharing project)
- Dennis Ritchie: student in the Multics group at MIT (others also from the Multics project)

Unix Implementation
- Implemented on a:
  - PDP-11/45
  - 16-bit word, 144KB of core memory
  - 1MB disk, four 2.5MB removable disks
- Unix occupies 42KB ("very large number of device drivers and enjoys a generous allotment of space for I/O buffers")
- A combined solution:
  - Kernel implements the basic system calls
  - Assembler, linker, loader, C compiler
  - Fortran compiler, Snobol interpreter, YACC
  - Text editor, text formatter, macro processor

Unix File System
- File system paper! 50-75% of the code in a modern OS goes to the file system and device drivers (support structures)
- Hierarchical file system: seems natural
  - In comparison, TOPS-10 had a single directory per user
- Directories are like files
  - Beauty in Unix is uniformity
- Byte-oriented (no records; previously 80-byte records were popular – punched cards!)
  - No structure imposed by the OS; structure is imposed only by the application
- Device-independent I/O
  - Devices and files are the same
  - Operations, naming, permissions
- Set-user-ID (avoid special kernel calls)

Process Management
- Fork & Exec (see the sketch below):
  - Fork copies a process
  - Exec overlays the process with a new program
- Advantages:
  - Fork, then change some small pieces of the process
  - Small piece of code
  - Child has a different set of I/O pipes; used for redirection
- Disadvantages:
  - Fork copies the entire data area
  - People have developed virtual copies (copy on write)
- Very simple kernel; pull everything into user level
  - Simplifies the kernel
  - Different people can do it differently
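
A minimal C sketch of the fork & exec pattern above (the program being exec'ed, "ls", is just an arbitrary example):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();                   /* fork copies the process */

        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {
            /* Child: change a few small things if needed, then overlay
             * the copied image with a new program. */
            execlp("ls", "ls", "-l", (char *)NULL);
            perror("execlp");                 /* reached only if exec fails */
            _exit(127);
        }
        int status;                           /* parent waits for the child */
        waitpid(pid, &status, 0);
        printf("child exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }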

Random Topics
- Notion of standard I/O, channels
  - Stdin, stdout, stderr
  - Redirection
- Pipes & filters (see the sketch below)
- Mounting mechanism
  - Hooking file systems together; very simple idea
- Provide mechanisms (glue), not solutions
- Success story of "component programming"
- Keep the kernel at a minimum size (a "micro-kernel" in a different sense of the word)

Changes made to Unix
- What changed:
  - Groups and permissions for groups
  - No restrictions on file names
  - ".." on a mounted directory
- What didn't change:
  - The entire file system interface!
  - Even some of the implementation details: I-nodes, etc.
  - Even restrictions such as no hard links across mount points
  - Process interface (fork, exec, wait, exit)
  - Shell command operations
  - Terminology: system calls, traps, core files, ...
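
A C sketch of how a shell could wire up a pipeline such as "ls | wc -l" with the primitives above, using dup2() for redirection (the two programs are arbitrary examples):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fd[2];
        if (pipe(fd) < 0) { perror("pipe"); exit(1); }

        if (fork() == 0) {                  /* left side of the pipeline */
            dup2(fd[1], STDOUT_FILENO);     /* redirect stdout into the pipe */
            close(fd[0]); close(fd[1]);
            execlp("ls", "ls", (char *)NULL);
            _exit(127);
        }
        if (fork() == 0) {                  /* right side of the pipeline */
            dup2(fd[0], STDIN_FILENO);      /* redirect stdin from the pipe */
            close(fd[0]); close(fd[1]);
            execlp("wc", "wc", "-l", (char *)NULL);
            _exit(127);
        }
        close(fd[0]); close(fd[1]);         /* parent keeps no pipe ends open */
        while (wait(NULL) > 0)              /* reap both children */
            ;
        return 0;
    }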

Some Quotables
- “Perhaps paradoxically, the success of Unix is largely due to the fact that it was not designed to meet any predefined objectives.”
- “Since we are programmers, we naturally designed the system to make it easy to write, test, and run programs.”
- “The size constraint has encouraged not only economy but a certain elegance of design.”
- “If designers of a system are forced to use that system…, they are strongly motivated to correct before it is too late.”
- “The success of Unix lies not so much in new inventions, but rather in the full exploitation of a carefully selected set of fertile ideas.”
- “It is hoped, however, the users of Unix will find that the most important characteristics of the system are its simplicity, elegance, and ease of use.”

RAIDs and availability
- Suppose you need to store more data than fits on a single disk (e.g., large database or file servers). How should we arrange data across the disks?
- Option 1: treat the disks as one huge pool of disk blocks
  - Disk1 has blocks 1, 2, …, N
  - Disk2 has blocks N+1, N+2, …, 2N
  - …
- Option 2: RAID (Redundant Arrays of Inexpensive Disks)
  - Stripe data across the disks; with k disks:
  - Disk1 has blocks 1, k+1, 2k+1, …
  - Disk2 has blocks 2, k+2, 2k+2, …
  - …
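
A C sketch of the two block layouts above (disk and block numbers start at 1, as on the slide; the values of N and k in main are made-up parameters):

    #include <stdio.h>

    struct location { int disk; long offset; };    /* both 1-based */

    /* Option 1: concatenation -- disk i holds blocks (i-1)*N+1 .. i*N. */
    struct location concat_map(long block, long N)
    {
        struct location loc = { (int)((block - 1) / N) + 1, (block - 1) % N + 1 };
        return loc;
    }

    /* Option 2: striping -- disk i holds blocks i, i+k, i+2k, ... */
    struct location stripe_map(long block, int k)
    {
        struct location loc = { (int)((block - 1) % k) + 1, (block - 1) / k + 1 };
        return loc;
    }

    int main(void)
    {
        struct location c = concat_map(1234, 1000);
        struct location s = stripe_map(1234, 4);
        printf("concatenated: block 1234 is block %ld on disk %d\n", c.offset, c.disk);
        printf("striped (k=4): block 1234 is block %ld on disk %d\n", s.offset, s.disk);
        return 0;
    }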

More on RAIDs
- Benefits:
  - Load gets automatically balanced among the disks
  - Can transfer a large file at the aggregate bandwidth of all the disks
- Problem --- what if one disk fails?
- Goal --- availability --- never lose access to data
  - The system should continue to work even if some components are not working
- Solution: dedicate one disk to hold bitwise parity for the other disks in the stripe. Thus we can lose any one disk and the data would still be available.
- Updating a disk block needs to update both the data and the parity --- need to use write-ahead logging to support crash recovery

Adding parity bits to RAID
- With k+1 disks:
  - Disk1 has blocks 1, k+1, 2k+1, …
  - Disk2 has blocks 2, k+2, 2k+2, …
  - …
  - Parity disk has blocks parity(1..k), parity(k+1..2k), …
- If we lose any disk, we can recover its data from the other disks plus the parity:
  - Disk1 holds 1001
  - Disk2 holds 0101
  - Disk3 holds 1000
  - Parity disk: 0100
  - What if we lose Disk2? Its contents are the parity of the remainder!
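
The worked example above in C: the parity disk holds the XOR of the data disks, and a lost disk is rebuilt as the XOR of everything that survives:

    #include <stdio.h>

    int main(void)
    {
        unsigned disk1 = 0x9;   /* 1001 */
        unsigned disk2 = 0x5;   /* 0101 */
        unsigned disk3 = 0x8;   /* 1000 */

        unsigned parity = disk1 ^ disk2 ^ disk3;      /* 0100 */
        printf("parity disk: %X\n", parity);

        /* Lose disk2: its contents are the parity of the remainder. */
        unsigned rebuilt = disk1 ^ disk3 ^ parity;    /* 0101 == disk2 */
        printf("rebuilt disk2: %X (original was %X)\n", rebuilt, disk2);
        return 0;
    }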
