
Transactions & Log Structured FS
Arvind Krishnamurthy
Spring 2001

Outline

- File system consistency issues in the presence of failures:
  - On-disk data structures can become inconsistent
- Solutions so far:
  - Synchronous operations
  - Ordering the synchronous operations
  - Run a program to fix mistakes: "fsck"
- Today:
  - Transactions for correctness
  - Log structured file systems for better performance

User data consistency

- For user data, Unix uses "write back" --- forced to disk every 30 seconds
  (or the user can call "sync" to force it to disk immediately).
- No guarantee blocks are written to disk in any order.
- Sometimes meta-data consistency is good enough.
- Fsck checks:
  - The number of directory links to a file vs. the file's link count
  - I-nodes that contain a block marked free on the disk block map
  - I-nodes that contain overlapping blocks/fragments
  - Builds a graph data structure and does connectivity analysis

Transaction concept

- Transactions: group actions together so that they are
  - Atomic: either happens or it does not (no partial operations)
  - Serializable: transactions appear to happen one after the other
  - Durable: once it happens, it stays happened
- Critical sections are atomic and serializable, but not durable.
- Need two more items:
  - Commit --- when the transaction is done (durable)
  - Rollback --- if failure during a transaction (means it didn't happen at all)
- Do a set of operations tentatively. If you get to commit, OK. Otherwise,
  roll back the operations as if the transaction never happened.

Transaction implementation

- Key idea: fix the problem of how you make multiple updates to disk by
  turning the multiple updates into a single disk write!
- Example: money transfer from account y to account x:

    Begin transaction
      x = x + 1
      y = y - 1
    Commit

- Keep a "write-ahead" (or "redo") log on disk of all changes in the
  transaction.
- A log is like a journal: never erased, a record of everything you've done.
- Once both changes are in the log, the transaction is committed.
- Then you can "write behind" the changes to disk --- if you crash after
  commit, replay the log to make sure the updates get to disk.

Transaction implementation (cont'd)

[Figure: the memory cache and the disk both hold X: 0, Y: 2; the
write-ahead log (on disk, tape, or non-volatile RAM) holds X=1, Y=1, commit]

Sequence of steps to execute the transaction:
  1. Write new value of X to log
  2. Write new value of Y to log
  3. Write commit
  4. Write X to disk
  5. Write Y to disk
  6. Reclaim space on log
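
A minimal sketch of the six steps in Python --- the log list, the disk
dict, and the transfer function are illustrative assumptions, not part
of the original slides:

    # Toy redo-log protocol; "log" and "disk" stand in for real storage.
    log = []                 # the on-disk write-ahead log
    disk = {"x": 0, "y": 2}  # home locations of the data

    def transfer():
        new_x, new_y = disk["x"] + 1, disk["y"] - 1
        log.append(("x", new_x))   # 1. write new value of X to log
        log.append(("y", new_y))   # 2. write new value of Y to log
        log.append("commit")       # 3. write commit: now durable
        disk["x"] = new_x          # 4. write X to disk
        disk["y"] = new_y          # 5. write Y to disk
        log.clear()                # 6. reclaim space on log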

Transaction implementation (cont'd)

Given the log contents X=1, Y=1, commit and the six steps above:
- What if we crash after 1? No commit, nothing on disk, so just ignore
  the changes.
- What if we crash after 2? Ditto.
- What if we crash after 3, before 4 or 5? Commit was written to the log,
  so replay those changes back to disk.
- What if we crash while we are writing "commit"? As with concurrency, we
  need some primitive atomic operation or else we can't build anything
  (e.g., writing a single sector on disk is atomic!).

Transaction implementation (cont'd)

- Can we write X back to disk before the commit?
- Yes: keep an "undo log"
  - Save the old value along with the new value
  - If the transaction does not commit, "undo" the change!
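
These crash cases reduce to a single recovery rule; a sketch continuing
the earlier example (function and record names are assumptions):

    # Crash recovery for the redo log of the previous sketch.
    def recover(log, disk):
        if "commit" not in log:   # crash after step 1 or 2 (or mid-commit):
            log.clear()           # nothing reached disk, so just ignore
            return
        for entry in log:         # crash after step 3: commit is durable,
            if entry == "commit": # so replay the logged writes to disk
                break
            name, value = entry
            disk[name] = value
        log.clear()               # step 6: reclaim space on log

    # Undo-log variant: log ("x", old, new) triples instead; if no commit
    # record is found, write the old values back rather than ignoring.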

Transactions under multiple


threads Two-phase locking
n What if two threads run same transaction at same time? use locks! n Don’t allow “unlock” before commit.
Begin transaction
Lock x, y
x=x+1
n First phase: only allowed to acquire locks (this avoids
y=y–1 deadlock concerns).
Unlock x, y
Commit
n Second phase: all unlocks happen at commit
n Problematic scenario:
n Thread A grabs locks, modifies x, y, writes to the log, unlocks
n Before A commits, thread B grabs lock, writes x, y, unlocks, and does n Thread B can’t see any of A’s changes, until A commits and
commit. releases locks. This provides serializability.
n Before A commits, it crashes!

B commits values for x and y that depends on A committing!
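
A sketch of the two-phase discipline with Python threads (the lock names
and the fixed acquisition order are assumptions of this example):

    import threading

    lock_x, lock_y = threading.Lock(), threading.Lock()

    def transfer(disk, log):
        lock_x.acquire()               # first (growing) phase: acquire
        lock_y.acquire()               # all locks, in a fixed order
        try:
            new_x, new_y = disk["x"] + 1, disk["y"] - 1
            log.extend([("x", new_x), ("y", new_y), "commit"])
            disk["x"], disk["y"] = new_x, new_y
        finally:
            lock_y.release()           # second phase: every unlock happens
            lock_x.release()           # only at commit, so no other thread
                                       # ever observes uncommitted values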

Transactions in file systems

- Write-ahead logging:
  - Almost all file systems built after 1985 (NT, Solaris) use write-ahead
    logging.
  - Write all changes in a transaction to the log (update directory,
    allocate block, etc.) before sending any changes to disk.
  - Example transactions: "Create file", "Delete file", "Move file"
    (a sketch of a journaled create follows below).
- This eliminates any need for a file system check (fsck) after a crash.
- If we crash, read the log:
  - If the log is not complete, no change!
  - If the log is completely written, apply all changes to disk.
  - If the log is empty, then all updates have already gotten to disk.
- Pros: reliability. Cons: all data written twice.

What is next?

- Wednesday: original Unix paper
- Next week: networking
- Followed by: remote procedure calls, distributed file systems
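
The journaled "Create file" mentioned above might look like this sketch
(the record names and journal structure are illustrative, not any
particular file system's on-disk format):

    # Journaled metadata update: every change hits the log before its
    # home location on disk, exactly as in the money-transfer example.
    journal = []

    def create_file(name, inumber):
        journal.append(("alloc_inode", inumber))     # allocate the i-node
        journal.append(("dir_add", name, inumber))   # update the directory
        journal.append("commit")
        # Only after the commit record is durable may the real i-node
        # table and directory blocks be written back to disk.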

Log Structured File Systems

- A radical, different approach to designing file systems.
- The disk will run much faster if you use it as a tape drive!
- Technology motivations:
  - Some technologies are advancing faster than others.
  - CPUs are getting faster every year (x2 every 1-2 years).
  - Everything except the CPU will become a bottleneck (Amdahl's law).
  - Disks are not getting much faster; the only good disk is an idle disk.
  - Memory is growing in size dramatically (x2 every 1.5 years), so file
    caches are a good idea (they cut down on disk bandwidth).

Motivation (contd.)

- File system motivations:
  - File caches help reads a lot.
  - File caches do not help writes very much.
  - Delayed writes help, but cannot delay forever.
  - With a file cache, disk writes become more frequent than disk reads.
  - Files are mostly small.
  - Too much synchronous I/O.
  - Disk geometries are not predictable.
- RAID: a whole bunch of disks with data striped across them.
  - Increases bandwidth, but does not change latency.
  - Does not help small files (more on this later).

Implementation

- Treat the disk like a tape:
  - Collect all information in memory (inodes, directories, data, ...)
  - Wait a little while.
  - Do one big I/O (say 1MB).
- The log is append-only --- no overwrite in place.
- The log is the only thing on disk! It is the main storage structure;
  there is no data other than the log.
- Two main problems:
  - How do you read stuff back from the log?
    - Add another level of indirection (see the lookup sketch below).
  - Wrap around? Reclaim free space.
    - How do you guarantee there is always free space?

Locating Data on the Disk

- Use the same structures as Unix:
  - Inodes, data blocks, indirect blocks, doubly indirect blocks.
- How do you find inodes?
  - In Unix, the i-number maps to a fixed location on disk.
  - In LFS, inodes float: when you rewrite an inode, it goes to the end
    of the log.
- Just add another level of indirection: the inode map.
  - Maps i-numbers to disk positions.
  - The inode map itself gets written to the disk.
- How do you find the inode map?
  - Add another level of indirection.
  - Now it fits in memory; its location is written into a checkpoint.

Wrap Around Problem

- Two approaches:
  - Compaction: read in the information that is still alive and compact it.
  - Threading: leave data in place and write new information around the
    old data. Eventually fragments.
- Sprite (the first implementation of LFS): a combination of the two;
  open up free segments & avoid copying.
  - Segmented log: statically partition the disk into segments (1MB in
    LFS); pick a segment size large enough to amortize disk seek costs;
    compaction within segments, threading in between.

Cleaner

- Write whole clean segments.
- Clean segments in the background.
- Write cost at segment utilization u: 2/(1 - u). (Reading a segment,
  rewriting its live fraction u, and writing (1 - u) of new data into
  the freed space totals 2 segments of I/O per (1 - u) of new data; a
  worked example follows.)
- FFS write cost: 10 (normal) to 4 (best case).
- The disk does not have to be 50% utilized; only the segments being
  cleaned have to be 50% utilized.
- Would be great to have a bi-modal distribution.
- Simulations: tried lots of approaches.
  - Initial simulation: random access pattern, greedy cleaner.
  - Next: hot and cold data -> results got worse!
  - Next: segregate new and old data -> results didn't change.
  - Finally: do a cost-benefit analysis.
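
Plugging in numbers (a tiny sketch; the utilization values are arbitrary):

    # LFS cleaner write cost as a function of segment utilization u.
    def write_cost(u):
        return 2 / (1 - u)      # 2 segments of I/O per (1 - u) freed

    for u in (0.2, 0.5, 0.8):
        print(f"u={u}: write cost {write_cost(u):.1f}")
    # u=0.2: 2.5   u=0.5: 4.0   u=0.8: 10.0
    # i.e., at 80% segment utilization LFS matches FFS's "normal" cost
    # of 10, and at 50% it matches FFS's best case of 4.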

RAIDs and availability

- Suppose you need to store more data than fits on a single disk (e.g.,
  large database or file servers). How should you arrange the data
  across disks?
- Option 1: treat the disks as one huge pool of disk blocks.
  - Disk1 has blocks 1, 2, ..., N
  - Disk2 has blocks N+1, N+2, ..., 2N
  - ...
- Option 2: RAID (Redundant Arrays of Inexpensive Disks). Stripe data
  across the disks; with k disks (see the mapping sketch below):
  - Disk1 has blocks 1, k+1, 2k+1, ...
  - Disk2 has blocks 2, k+2, 2k+2, ...
  - ...

More on RAIDs

- Benefits:
  - Load gets automatically balanced among the disks.
  - Can transfer a large file at the aggregate bandwidth of all the disks.
- Problem --- what if one disk fails?
  - Goal --- availability --- never lose access to the data.
  - The system should continue to work even if some components are not
    working.
- Solution: dedicate one disk to hold the bitwise parity of the other
  disks in the stripe.
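
The striped block-to-disk mapping is one line of arithmetic; a sketch
(1-indexed blocks and disks as on the slide; the within-disk offset is
an assumption of this example):

    # Where does logical block b live when striped across k disks?
    def stripe(b, k):
        disk = (b - 1) % k + 1      # disks are numbered 1..k
        offset = (b - 1) // k       # position within that disk
        return disk, offset

    # e.g. k=3: blocks 1,4,7,... land on disk 1; blocks 2,5,8,... on disk 2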

Adding parity bits to RAID

- With k+1 disks:
  - Disk1 has blocks 1, k+1, 2k+1, ...
  - Disk2 has blocks 2, k+2, 2k+2, ...
  - ...
  - The parity disk has blocks parity(1..k), parity(k+1..2k), ...
- If we lose any disk, we can recover its data from the other disks plus
  the parity. For example:
  - Disk1 holds 1001
  - Disk2 holds 0101
  - Disk3 holds 1000
  - Parity disk: 0100 (the bitwise XOR of the three)
  - What if we lose disk2? Its contents are the parity of the remainder:
    1001 XOR 1000 XOR 0100 = 0101. Thus we can lose any one disk and the
    data is still available.
- Updating a disk block needs to update both the data and the parity ---
  we need write-ahead logging to support crash recovery.
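
The same example as runnable Python (disk contents copied from the slide):

    # Bitwise parity across a stripe, using the slide's 4-bit example.
    disks = {1: 0b1001, 2: 0b0101, 3: 0b1000}
    parity = 0
    for data in disks.values():
        parity ^= data                 # parity = XOR of all data disks
    assert parity == 0b0100            # matches the parity disk above

    # Lose disk 2: XOR the survivors with the parity to rebuild it.
    rebuilt = disks[1] ^ disks[3] ^ parity
    assert rebuilt == disks[2]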
