
Transactions & Log Structured FS
Arvind Krishnamurthy
Spring 2001

Outline

- File system consistency issues in the presence of failures:
  - On-disk data structures can become inconsistent
- Solutions so far:
  - Synchronous operations
  - Ordering the synchronous operations
  - Run a program to fix mistakes: "fsck"
- Today:
  - Transactions for correctness
  - Log structured file systems for better performance

User data consistency

- For user data, Unix uses "write back" --- forced to disk every 30 seconds
  (or the user can call "sync" to force it to disk immediately).
- No guarantee blocks are written to disk in any order.
- Sometimes meta-data consistency is good enough.
- Fsck checks:
  - The number of directory links to a file vs. the file's link count
  - I-nodes that contain a block marked free on the disk block map
  - I-nodes that contain overlapping blocks/fragments
  - Builds a graph data structure and does connectivity analysis

Transaction concept

- Transactions: group actions together so that they are
  - Atomic: either happens or it does not (no partial operations)
  - Serializable: transactions appear to happen one after the other
  - Durable: once it happens, it stays happened
- Critical sections are atomic and serializable, but not durable.
- Need two more items:
  - Commit --- when the transaction is done (durable)
  - Rollback --- if failure during a transaction (means it didn't happen at all)
- Do a set of operations tentatively. If you get to commit, OK. Otherwise,
  roll back the operations as if the transaction never happened.

Transaction implementation

- Key idea: fix the problem of how you make multiple updates to disk by
  turning the multiple updates into a single disk write!
- Example: money transfer from account y to account x:

    Begin transaction
      x = x + 1
      y = y - 1
    Commit

- Keep a "write-ahead" (or "redo") log on disk of all changes in the
  transaction.
- A log is like a journal: never erased, a record of everything you've done.
- Once both changes are in the log, the transaction is committed.
- Then you can "write behind" the changes to disk --- if you crash after
  commit, replay the log to make sure the updates get to disk.

Transaction implementation (cont'd)

[Figure: the memory cache and the disk both hold X: 0, Y: 2; the
write-ahead log (on disk, tape, or non-volatile RAM) holds X=1, Y=1, commit]

Sequence of steps to execute the transaction:
  1. Write new value of X to log
  2. Write new value of Y to log
  3. Write commit
  4. Write X to disk
  5. Write Y to disk
  6. Reclaim space on log
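
A minimal sketch of the six steps in Python --- the log list, the disk
dict, and the transfer function are illustrative assumptions, not part
of the original slides:

    # Toy redo-log protocol; "log" and "disk" stand in for real storage.
    log = []                 # the on-disk write-ahead log
    disk = {"x": 0, "y": 2}  # home locations of the data

    def transfer():
        new_x, new_y = disk["x"] + 1, disk["y"] - 1
        log.append(("x", new_x))   # 1. write new value of X to log
        log.append(("y", new_y))   # 2. write new value of Y to log
        log.append("commit")       # 3. write commit: now durable
        disk["x"] = new_x          # 4. write X to disk
        disk["y"] = new_y          # 5. write Y to disk
        log.clear()                # 6. reclaim space on log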

Transaction implementation (cont'd)

Given the log contents X=1, Y=1, commit and the six steps above:
- What if we crash after 1? No commit, nothing on disk, so just ignore
  the changes.
- What if we crash after 2? Ditto.
- What if we crash after 3, before 4 or 5? Commit was written to the log,
  so replay those changes back to disk.
- What if we crash while we are writing "commit"? As with concurrency, we
  need some primitive atomic operation or else we can't build anything
  (e.g., writing a single sector on disk is atomic!).

Transaction implementation (cont'd)

- Can we write X back to disk before the commit?
- Yes: keep an "undo log"
  - Save the old value along with the new value
  - If the transaction does not commit, "undo" the change!
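
These crash cases reduce to a single recovery rule; a sketch continuing
the earlier example (function and record names are assumptions):

    # Crash recovery for the redo log of the previous sketch.
    def recover(log, disk):
        if "commit" not in log:   # crash after step 1 or 2 (or mid-commit):
            log.clear()           # nothing reached disk, so just ignore
            return
        for entry in log:         # crash after step 3: commit is durable,
            if entry == "commit": # so replay the logged writes to disk
                break
            name, value = entry
            disk[name] = value
        log.clear()               # step 6: reclaim space on log

    # Undo-log variant: log ("x", old, new) triples instead; if no commit
    # record is found, write the old values back rather than ignoring.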

Transactions under multiple


threads Two-phase locking
n What if two threads run same transaction at same time? use locks! n Don’t allow “unlock” before commit.
Begin transaction
Lock x, y
x=x+1
n First phase: only allowed to acquire locks (this avoids
y=y–1 deadlock concerns).
Unlock x, y
Commit
n Second phase: all unlocks happen at commit
n Problematic scenario:
n Thread A grabs locks, modifies x, y, writes to the log, unlocks
n Before A commits, thread B grabs lock, writes x, y, unlocks, and does n Thread B can’t see any of A’s changes, until A commits and
commit. releases locks. This provides serializability.
n Before A commits, it crashes!

B commits values for x and y that depends on A committing!
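
A sketch of the two-phase discipline with Python threads (the lock names
and the fixed acquisition order are assumptions of this example):

    import threading

    lock_x, lock_y = threading.Lock(), threading.Lock()

    def transfer(disk, log):
        lock_x.acquire()               # first (growing) phase: acquire
        lock_y.acquire()               # all locks, in a fixed order
        try:
            new_x, new_y = disk["x"] + 1, disk["y"] - 1
            log.extend([("x", new_x), ("y", new_y), "commit"])
            disk["x"], disk["y"] = new_x, new_y
        finally:
            lock_y.release()           # second phase: every unlock happens
            lock_x.release()           # only at commit, so no other thread
                                       # ever observes uncommitted values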

Transactions in file systems

- Write-ahead logging:
  - Almost all file systems built after 1985 (NT, Solaris) use write-ahead
    logging.
  - Write all changes in a transaction to the log (update directory,
    allocate block, etc.) before sending any changes to disk.
  - Example transactions: "Create file", "Delete file", "Move file"
    (a sketch of a journaled create follows below).
- This eliminates any need for a file system check (fsck) after a crash.
- If we crash, read the log:
  - If the log is not complete, no change!
  - If the log is completely written, apply all changes to disk.
  - If the log is empty, then all updates have already gotten to disk.
- Pros: reliability. Cons: all data written twice.

What is next?

- Wednesday: original Unix paper
- Next week: networking
- Followed by: remote procedure calls, distributed file systems
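
The journaled "Create file" mentioned above might look like this sketch
(the record names and journal structure are illustrative, not any
particular file system's on-disk format):

    # Journaled metadata update: every change hits the log before its
    # home location on disk, exactly as in the money-transfer example.
    journal = []

    def create_file(name, inumber):
        journal.append(("alloc_inode", inumber))     # allocate the i-node
        journal.append(("dir_add", name, inumber))   # update the directory
        journal.append("commit")
        # Only after the commit record is durable may the real i-node
        # table and directory blocks be written back to disk.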

Log Structured File Systems

- A radical, different approach to designing file systems.
- The disk will run much faster if you use it as a tape drive!
- Technology motivations:
  - Some technologies are advancing faster than others.
  - CPUs are getting faster every year (x2 every 1-2 years).
  - Everything except the CPU will become a bottleneck (Amdahl's law).
  - Disks are not getting much faster; the only good disk is an idle disk.
  - Memory is growing in size dramatically (x2 every 1.5 years), so file
    caches are a good idea (they cut down on disk bandwidth).

Motivation (contd.)

- File system motivations:
  - File caches help reads a lot.
  - File caches do not help writes very much.
  - Delayed writes help, but cannot delay forever.
  - With a file cache, disk writes become more frequent than disk reads.
  - Files are mostly small.
  - Too much synchronous I/O.
  - Disk geometries are not predictable.
- RAID: a whole bunch of disks with data striped across them.
  - Increases bandwidth, but does not change latency.
  - Does not help small files (more on this later).

Implementation

- Treat the disk like a tape:
  - Collect all information in memory (inodes, directories, data, ...)
  - Wait a little while.
  - Do one big I/O (say 1MB).
- The log is append-only --- no overwrite in place.
- The log is the only thing on disk! It is the main storage structure;
  there is no data other than the log.
- Two main problems:
  - How do you read stuff back from the log?
    - Add another level of indirection (see the lookup sketch below).
  - Wrap around? Reclaim free space.
    - How do you guarantee there is always free space?

Locating Data on the Disk

- Use the same structures as Unix:
  - Inodes, data blocks, indirect blocks, doubly indirect blocks.
- How do you find inodes?
  - In Unix, the i-number maps to a fixed location on disk.
  - In LFS, inodes float: when you rewrite an inode, it goes to the end
    of the log.
- Just add another level of indirection: the inode map.
  - Maps i-numbers to disk positions.
  - The inode map itself gets written to the disk.
- How do you find the inode map?
  - Add another level of indirection.
  - Now it fits in memory; its location is written into a checkpoint.

Wrap Around Problem

- Two approaches:
  - Compaction: read in the information that is still alive and compact it.
  - Threading: leave data in place and write new information around the
    old data. Eventually fragments.
- Sprite (the first implementation of LFS): a combination of the two;
  open up free segments & avoid copying.
  - Segmented log: statically partition the disk into segments (1MB in
    LFS); pick a segment size large enough to amortize disk seek costs;
    compaction within segments, threading in between.

Cleaner

- Write whole clean segments.
- Clean segments in the background.
- Write cost at segment utilization u: 2/(1 - u). (Reading a segment,
  rewriting its live fraction u, and writing (1 - u) of new data into
  the freed space totals 2 segments of I/O per (1 - u) of new data; a
  worked example follows.)
- FFS write cost: 10 (normal) to 4 (best case).
- The disk does not have to be 50% utilized; only the segments being
  cleaned have to be 50% utilized.
- Would be great to have a bi-modal distribution.
- Simulations: tried lots of approaches.
  - Initial simulation: random access pattern, greedy cleaner.
  - Next: hot and cold data -> results got worse!
  - Next: segregate new and old data -> results didn't change.
  - Finally: do a cost-benefit analysis.
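
Plugging in numbers (a tiny sketch; the utilization values are arbitrary):

    # LFS cleaner write cost as a function of segment utilization u.
    def write_cost(u):
        return 2 / (1 - u)      # 2 segments of I/O per (1 - u) freed

    for u in (0.2, 0.5, 0.8):
        print(f"u={u}: write cost {write_cost(u):.1f}")
    # u=0.2: 2.5   u=0.5: 4.0   u=0.8: 10.0
    # i.e., at 80% segment utilization LFS matches FFS's "normal" cost
    # of 10, and at 50% it matches FFS's best case of 4.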

RAIDs and availability

- Suppose you need to store more data than fits on a single disk (e.g.,
  large database or file servers). How should you arrange the data
  across disks?
- Option 1: treat the disks as one huge pool of disk blocks.
  - Disk1 has blocks 1, 2, ..., N
  - Disk2 has blocks N+1, N+2, ..., 2N
  - ...
- Option 2: RAID (Redundant Arrays of Inexpensive Disks). Stripe data
  across the disks; with k disks (see the mapping sketch below):
  - Disk1 has blocks 1, k+1, 2k+1, ...
  - Disk2 has blocks 2, k+2, 2k+2, ...
  - ...

More on RAIDs

- Benefits:
  - Load gets automatically balanced among the disks.
  - Can transfer a large file at the aggregate bandwidth of all the disks.
- Problem --- what if one disk fails?
  - Goal --- availability --- never lose access to the data.
  - The system should continue to work even if some components are not
    working.
- Solution: dedicate one disk to hold the bitwise parity of the other
  disks in the stripe.
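
The striped block-to-disk mapping is one line of arithmetic; a sketch
(1-indexed blocks and disks as on the slide; the within-disk offset is
an assumption of this example):

    # Where does logical block b live when striped across k disks?
    def stripe(b, k):
        disk = (b - 1) % k + 1      # disks are numbered 1..k
        offset = (b - 1) // k       # position within that disk
        return disk, offset

    # e.g. k=3: blocks 1,4,7,... land on disk 1; blocks 2,5,8,... on disk 2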

Adding parity bits to RAID

- With k+1 disks:
  - Disk1 has blocks 1, k+1, 2k+1, ...
  - Disk2 has blocks 2, k+2, 2k+2, ...
  - ...
  - The parity disk has blocks parity(1..k), parity(k+1..2k), ...
- If we lose any disk, we can recover its data from the other disks plus
  the parity. For example:
  - Disk1 holds 1001
  - Disk2 holds 0101
  - Disk3 holds 1000
  - Parity disk: 0100 (the bitwise XOR of the three)
  - What if we lose disk2? Its contents are the parity of the remainder:
    1001 XOR 1000 XOR 0100 = 0101. Thus we can lose any one disk and the
    data is still available.
- Updating a disk block needs to update both the data and the parity ---
  we need write-ahead logging to support crash recovery.
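
The same example as runnable Python (disk contents copied from the slide):

    # Bitwise parity across a stripe, using the slide's 4-bit example.
    disks = {1: 0b1001, 2: 0b0101, 3: 0b1000}
    parity = 0
    for data in disks.values():
        parity ^= data                 # parity = XOR of all data disks
    assert parity == 0b0100            # matches the parity disk above

    # Lose disk 2: XOR the survivors with the parity to rebuild it.
    rebuilt = disks[1] ^ disks[3] ^ parity
    assert rebuilt == disks[2]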
