ZFS Overview
zfs-team@sun.com
Existing Filesystems
Write my data?
limited size (16TB for UFS)
limited number of files
limited directory entries
Existing Filesystems
Keep it safe?
bit rot causes silent data corruption
no defense against phantom writes,
misdirections, other firmware bugs
no defense against administrative errors
(e.g. swap on active filesystem device)
no security: spying, tampering, theft
Existing Filesystems
Read it back?
no data integrity checks
no data authentication
data might be good, might be bad
don't know
couldn't fix it if we did
like running a server without DRAM parity
Existing Filesystems
Do it fast?
linear-time directory ops
linear-time newfs(1M), fsck(1M)
limited read/write concurrency
fixed block size
fixed stripe width
poor random write performance
slow mirroring
ZFS: The Zettabyte Filesystem
Existing Filesystems
Don't hassle me?
create a partition for every FS
grow: manual process
shrink: not possible
remember a bunch of c0t0d0s0 names
edit /etc/vfstab by hand
wait around for fsck(1M)
take system down to upgrade disks
ZFS Objective
Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance
Simple Administration
Pooled storage
Immense capacity
Quotas and Reservations
User undo
Pooled Storage
Traditional model: each FS sits on its own volume (virtual disk)
naming and storage tightly bound
no space sharing
ZFS model: filesystems draw from one storage pool
naming and storage decoupled
FSes share space
ZFS/pool interface: object transactions
Immense Capacity
128-bit storage pools
128-bit filesystems
128-bit files, but limited to 64-bit access
until we have 128-bit OS support
64-bit max files per dataset
64-bit max files per directory
statvfs128() will be needed
User Undo
Unlimited snapshots
recover previous version of a file
Undelete
recover recently deleted file
No sysadmin intervention required
Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance
ZFS is Object-Based
An object is a "flat file"
Everything is stored in objects: user data,
znodes, directories, free block lists, etc.
Arbitrarily complex operations reduce to reads
and writes on a set of objects
Simplifies interfaces, design, and analysis
single I/O path
single interposition point
single object read/write model
ZFS Components
ZPL
SPA
SPA Components
IOG: gather non-dependent I/O into I/O groups
[Diagram: DMU above the SPA; SPA stages shown: compression, metaslab allocator, IOG, encryption, checksum; below them a mirror vdev and two disk vdevs]
Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance
Copy-on-Write TX Model
Problem: modify several objects atomically
DMU provides transactional interface
ZPL groups work into transactions
DMU sends whole transactions to SPA
SPA commits transaction groups
SPA never modifies active blocks
entire storage pool is a tree of blocks
rooted at the "uberblock"
transactions COW nodes of the tree
transaction group is committed when
uberblock is rewritten to point to new tree
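The copy-on-write commit described above can be sketched in a few lines. This is a minimal illustration, not ZFS code: `Block`, `cow_write`, and `Pool` are hypothetical names, the tree lives in memory, and the "atomic" uberblock rewrite is modeled as a single pointer assignment.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Block:
    """Immutable block: a leaf carries data, an indirect block carries children."""
    data: bytes = b""
    children: Tuple["Block", ...] = ()


def cow_write(node: Block, path: Tuple[int, ...], data: bytes) -> Block:
    """Return the root of a new tree with the leaf at `path` replaced."""
    if not path:
        return Block(data=data)                 # fresh copy of the data block
    i = path[0]
    new_child = cow_write(node.children[i], path[1:], data)
    kids = node.children[:i] + (new_child,) + node.children[i + 1:]
    return Block(children=kids)                 # fresh copy of the indirect block


class Pool:
    def __init__(self, root: Block):
        self.uberblock = root                   # the only pointer to the live tree

    def commit(self, new_root: Block) -> None:
        self.uberblock = new_root               # single pointer swap commits the change


# Two-level tree with four leaves; rewrite the leaf at path (1, 0).
leaves = tuple(Block(data=bytes([i])) for i in range(4))
old_root = Block(children=(Block(children=leaves[:2]), Block(children=leaves[2:])))
pool = Pool(old_root)
pool.commit(cow_write(pool.uberblock, (1, 0), b"new"))
```

Only the blocks on the path from the modified leaf to the root are copied; untouched subtrees are shared between the old and new trees, and the old tree stays intact until nothing references it.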
Copy-on-Write TX Model
initial block tree
write: COWs a data block
COW its level-1 indirect block
COW its level-2 indirect block
rewrite the uberblock (atomic)
Snapshots
COW TX model enables constant-time snapshots
snapshot storage pool by copying its uberblock
snapshot single FS by copying its root block
snapshot single file by copying its dnode
Provides data recovery and fixed target for backup
Snapshot delta = incremental
Unlimited number of snapshots
cf. 1 with UFS, 32 with WAFL
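To illustrate why snapshots cost constant time under a COW model, here is a toy sketch. `Filesystem` is a hypothetical class with a flat, immutable root (a tuple) standing in for the block tree; a real root block would point at a tree, so only the snapshot step is representative.

```python
class Filesystem:
    def __init__(self):
        self.root = ()          # immutable tuple of (name, contents) pairs
        self.snapshots = {}     # label -> saved root

    def write(self, name, contents):
        # Copy-on-write: build a new root instead of editing the old one.
        kept = tuple(entry for entry in self.root if entry[0] != name)
        self.root = kept + ((name, contents),)

    def snapshot(self, label):
        # Constant time: keep a reference to the current root, copy nothing.
        self.snapshots[label] = self.root

    def read(self, name, label=None):
        root = self.root if label is None else self.snapshots[label]
        return dict(root).get(name)


fs = Filesystem()
fs.write("report.txt", "v1")
fs.snapshot("monday")
fs.write("report.txt", "v2")
previous = fs.read("report.txt", "monday")   # the version saved in the snapshot
current = fs.read("report.txt")              # the live version
```

Because later writes never modify the saved root, the snapshot keeps describing the old state without any copying at snapshot time.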
Snapshots
Save old uberblock: describes complete snapshot
[Diagram labels: snapshot uberblock, current uberblock]
Checksums
Traditional model: checksum stored with block
Checksums, cont'd
SPA model: checksum stored with indirect block
Self-validating
Detects bit rot, phantom writes, misdirections,
admin error (e.g. swap on active ZFS disk)
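A small sketch of why storing the checksum with the parent pointer catches a phantom write. Everything here is illustrative: `disk` is a dictionary standing in for a device, `write_block` and `read_block` are hypothetical helpers, and truncated SHA-256 is only a stand-in since the slides do not name the checksum function.

```python
import hashlib

disk = {}   # address -> bytes, a toy device


def checksum(data: bytes) -> bytes:
    # Stand-in 64-bit checksum (assumption, not the algorithm ZFS uses).
    return hashlib.sha256(data).digest()[:8]


def write_block(addr: int, data: bytes):
    """Write data and return the block pointer the parent would store."""
    disk[addr] = data
    return (addr, checksum(data))


def read_block(ptr):
    """Read through a block pointer, validating against the parent's checksum."""
    addr, expected = ptr
    data = disk[addr]
    if checksum(data) != expected:
        raise IOError(f"checksum mismatch at address {addr}")
    return data


old_ptr = write_block(7, b"old contents")
new_ptr = write_block(7, b"new contents")
disk[7] = b"old contents"        # simulate a phantom write: the device dropped the update

try:
    read_block(new_ptr)
    phantom_detected = False
except IOError:
    phantom_detected = True
```

Had the checksum been stored alongside the block itself, the stale block and its stale checksum would still agree, and the lost write would go unnoticed.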
Checksums, cont'd
Physical separation improves fault isolation,
yet doesn't require additional I/O
64-bit strength ensures data integrity
provides 99.99999999999999999%
("nineteen nines") error detection probability
Checksum vectoring provides flexibility
weaker checksums for performance
faster checksums in the future
secure checksums for data authentication
(uberblock checksum provides unforgeable
signature for the entire storage pool)
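The "nineteen nines" figure can be checked with a short calculation. The sketch assumes an idealized 64-bit checksum, so that a random corruption goes undetected with probability 2^-64; that uniformity assumption is mine, not a statement from the slides.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50

miss = Decimal(1) / Decimal(2 ** 64)     # chance a random corruption still matches
detect = Decimal(1) - miss               # detection probability

# Count the leading nines after the decimal point of the detection probability.
digits = f"{detect:f}".split(".")[1]
nines = len(digits) - len(digits.lstrip("9"))
```

Under this assumption `nines` comes out as 19, which matches the percentage on the slide (two nines before the decimal point and seventeen after).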
Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance
Always-Available Data
Always-consistent on-disk format
Elimination of fsck(1M)
Self-Healing Data
Failure Prediction and Disk Scrubbing
Hot Space
Data Migration
Real-Time Remote Replication
User Undo
Self-Healing Data
Media error under traditional FS:
bad user data causes silent data corruption
bad metadata causes SDC, panic, or both
Media error under ZFS:
checksum detects data corruption
SPA gets valid data from another replica
and uses it to repair the damaged one
SPA returns valid data to application
no sysadmin intervention required
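The self-healing read path above might look roughly like this. The sketch assumes a mirrored layout where each replica is a dictionary from address to bytes; `self_healing_read` is a hypothetical name and CRC-32 is only a convenient stand-in checksum.

```python
import zlib


def checksum(data: bytes) -> int:
    return zlib.crc32(data)      # stand-in; the slides do not name the algorithm


def self_healing_read(replicas, addr, expected):
    """Return valid data for `addr` and repair any replica that fails the checksum."""
    good = None
    damaged = []
    for replica in replicas:
        data = replica.get(addr)
        if data is not None and checksum(data) == expected:
            if good is None:
                good = data
        else:
            damaged.append(replica)
    if good is None:
        raise IOError("no valid replica available")
    for replica in damaged:
        replica[addr] = good     # rewrite the bad copy from the good one
    return good


side_a = {0: b"payload"}
side_b = {0: b"pay1oad"}         # silently corrupted copy
expected = checksum(b"payload")  # checksum held by the parent block
result = self_healing_read([side_b, side_a], 0, expected)
```

The caller receives valid data even though the first replica tried was bad, and the damaged side is repaired as a side effect of the read.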
Failure Prediction
SPA automatically migrates data from failing
devices to healthy devices
Detects health by monitoring error rate
Employs disk scrubbing to detect latent errors
while they're still correctable
Hot Space
Hot spare model
Data Migration
Allow transparent disk upgrades and
data migration from failing devices
Apply VM principles to storage
DMU names blocks by 128-bit DVA
(Data Virtual Address)
high-order 64 bits specify metaslab
SPA translates metaslab to <vdev, offset>
SPA can migrate metaslabs from one vdev
to another without affecting any DMU state
Data remains available during migration
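The DVA indirection can be sketched as a lookup table. The slides say only that the high-order 64 bits select a metaslab and that the SPA maps a metaslab to <vdev, offset>; treating the low 64 bits as an offset within the metaslab is an assumption here, and the `SPA` class and its methods are hypothetical.

```python
MASK64 = (1 << 64) - 1


def split_dva(dva: int):
    """Split a 128-bit DVA into (metaslab id, low 64 bits)."""
    return dva >> 64, dva & MASK64


class SPA:
    def __init__(self):
        self.metaslab_map = {}   # metaslab id -> (vdev name, base offset)

    def translate(self, dva: int):
        metaslab, low = split_dva(dva)
        vdev, base = self.metaslab_map[metaslab]
        return vdev, base + low  # assumes the low bits are an offset within the metaslab

    def migrate(self, metaslab: int, new_vdev: str, new_base: int) -> None:
        # The data copy is omitted; only the mapping changes, the DVA does not.
        self.metaslab_map[metaslab] = (new_vdev, new_base)


spa = SPA()
spa.metaslab_map[3] = ("vdev0", 1000)
dva = (3 << 64) | 42
before = spa.translate(dva)
spa.migrate(3, "vdev1", 5000)
after = spa.translate(dva)
```

The same DVA resolves to a new physical location after migration, which is why upper layers holding DVAs need no update, much as virtual addresses survive a page being moved in memory.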
Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance
High Performance
Write Sequentialization
Dynamic striping
Parallel three-phase TX groups
Intelligent prefetch
Multiple block sizes
Sync semantics at async speed
Concurrent, constant-time directory ops
POSIX-compliant concurrent writes
Hot space
Write Sequentialization
Traditional FS: random file writes become
random disk writes
ZFS: random file writes become
sequential disk writes
follows from COW model
modified blocks are newly allocated
SPA has complete allocation freedom
SPA chooses sequential free blocks
Cost of writing extra ZFS metadata more than
offset by improved locality
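A toy model of how COW turns random file writes into sequential disk writes. `BumpAllocator` and `cow_flush` are hypothetical; the allocator simply hands out consecutive free blocks, which is the simplest possible reading of "SPA chooses sequential free blocks".

```python
class BumpAllocator:
    """Hands out consecutive free disk blocks (a simplifying assumption)."""

    def __init__(self, start: int = 0):
        self.next_free = start

    def alloc(self) -> int:
        addr = self.next_free
        self.next_free += 1
        return addr


def cow_flush(block_map, dirty_file_blocks, allocator):
    """Relocate each modified file block to a newly allocated disk block."""
    written = []
    for file_block in dirty_file_blocks:
        addr = allocator.alloc()
        block_map[file_block] = addr     # the file block now lives at the new address
        written.append(addr)
    return written


block_map = {i: i for i in range(1000)}          # file block -> disk block
allocator = BumpAllocator(start=1000)            # free space begins at block 1000
written = cow_flush(block_map, [907, 12, 455, 3], allocator)
```

The file offsets touched are scattered, yet the disk addresses written are contiguous. The extra metadata writes implied by COW are not modeled here.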
Dynamic Striping
Traditional striping: spread data across multiple
devices at fixed stride
Inflexible: can't change stripe width,
can't add or remove devices
[Figure: blocks striped across five devices, one column per device]
0   1   2   3   4
5   6   7   8   9
10  11  12  13  14
... ... ... ... ...
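The fixed-stride layout in the figure corresponds to a simple modulo mapping, which also shows why the stripe width is hard to change. `stripe_location` is a hypothetical helper written for this illustration.

```python
def stripe_location(block: int, width: int):
    """Map a logical block to (device index, row) under fixed-stride striping."""
    return block % width, block // width


# Blocks that land on the first device when striping across five devices.
first_device = [b for b in range(15) if stripe_location(b, 5)[0] == 0]

# Count how many of the first 15 blocks would have to move if a sixth device were added.
moved = sum(1 for b in range(15) if stripe_location(b, 5) != stripe_location(b, 6))
```

With five devices the first device holds blocks 0, 5, 10, as in the figure. Going from five to six devices changes the location of most blocks, because the location is a pure function of the width.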
Concurrent Writes
Existing filesystems force trade-off between
POSIX compliance and write concurrency
ZFS employs byte-range locking to allow maximum
concurrency while satisfying POSIX overlapping
write semantics
Parallel read/write
Serialized
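Byte-range locking can be sketched as follows. This is a non-blocking toy, not the ZFS implementation: `RangeLock` and `try_acquire` are hypothetical names, and a real lock would block or queue a conflicting writer rather than return `None`.

```python
import threading


class RangeLock:
    """Toy byte-range lock: writers conflict only when their ranges overlap."""

    def __init__(self):
        self._held = []                  # list of half-open (start, end) ranges
        self._mutex = threading.Lock()   # protects the list itself

    @staticmethod
    def _overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def try_acquire(self, start: int, length: int):
        rng = (start, start + length)
        with self._mutex:
            if any(self._overlaps(rng, held) for held in self._held):
                return None              # an overlapping write is in progress
            self._held.append(rng)
            return rng

    def release(self, rng) -> None:
        with self._mutex:
            self._held.remove(rng)


lock = RangeLock()
first = lock.try_acquire(0, 4096)            # bytes 0..4095
second = lock.try_acquire(8192, 4096)        # disjoint range: may proceed in parallel
overlapping = lock.try_acquire(2048, 4096)   # overlaps `first`: must be serialized
```

Disjoint writes to the same file proceed concurrently, while overlapping writes are serialized, which is the behavior POSIX overlapping-write semantics require.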
Compression
Block-level compression in SPA
transparent to all other layers
enabled by multiple block size support
DMU translations: all 8k
SPA block allocations vary with compression: 8k, 4k, 2k, 8k
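A rough sketch of how a fixed 8k logical block can map to variable physical allocations. The 512-byte allocation granularity and the use of zlib are assumptions for illustration; the slides do not say which compressor or which size classes the SPA uses.

```python
import os
import zlib

SECTOR = 512            # assumed allocation granularity
LOGICAL = 8 * 1024      # every logical block is 8k, as on the slide


def allocated_size(block: bytes) -> int:
    """Bytes a pool might allocate for one logical block after compression."""
    if len(block) != LOGICAL:
        raise ValueError("expected one 8k logical block")
    compressed = zlib.compress(block)
    size = min(len(compressed), LOGICAL)     # fall back to raw size if compression loses
    return -(-size // SECTOR) * SECTOR       # round up to a whole sector


sizes = [
    allocated_size(bytes(LOGICAL)),          # all zeros: highly compressible
    allocated_size(os.urandom(LOGICAL)),     # random bytes: effectively incompressible
]
```

The layers above still see uniform 8k blocks; only the allocation made underneath varies, which is why compression depends on multiple block size support.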
Futures
POSIX isn't the only game in town
DMU as native Oracle API
Object-based appliances
agnostic: NFS, database, volume emulation
DMU as "foundation classes"
[Diagram: UFS, ZPL, NFS, Oracle, *FS, raw, zvol; DMU; SPA]
ZFS:
The Zettabyte Filesystem
Please send questions, comments and ideas to:
zfs-team@sun.com
Want to follow ZFS developments? Join:
zfs-interest@sun.com
For the latest information, visit:
http://zfs.eng