ZFS Overview

ZFS:
The Zettabyte Filesystem
zfs-team@sun.com
ZFS: The Zettabyte Filesystem
February 10, 2003
Sun Microsystems Proprietary / Confidential Need to Know
The Perfect Filesystem

Write my data
Keep it safe
Read it back
Do it fast
Don't hassle me
Existing Filesystems
Write my data?
limited size (16TB for UFS)
limited number of files
limited directory entries
Keep it safe?
bit rot causes silent data corruption
no defense against phantom writes,
misdirections, other firmware bugs
no defense against administrative errors
(e.g. swap on active filesystem device)
no security: spying, tampering, theft
Read it back?
no data integrity checks
no data authentication
data might be good, might be bad
don't know
couldn't fix it if we did
like running a server without DRAM parity
Do it fast?
linear-time directory ops
linear-time newfs(1M), fsck(1M)
limited read/write concurrency
fixed block size
fixed stripe width
poor random write performance
slow mirroring
Don't hassle me?
create a partition for every FS
grow: manual process
shrink: not possible
remember a bunch of c0t0d0s0 names
edit /etc/vfstab by hand
wait around for fsck(1M)
take system down to upgrade disks
ZFS Objective
End the suffering
The ZFS Filesystem

Write my data!
immense capacity (128-bit)
there's no SI prefix for this!
zettabyte = 70-bit (a billion TB)
ZFS capacity: 256 quadrillion ZB
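The capacity figures above can be checked with integer arithmetic. This is a minimal sketch, assuming the slide's "zettabyte" means 2**70 bytes and that "quadrillion" is used loosely for 2**50; both readings are interpretations, not something the slide states.

```python
# Assumption: "zettabyte" is read as 2**70 bytes and "quadrillion" as 2**50,
# following the slide's power-of-two framing.
POOL_ADDRESS_BITS = 128
ZETTABYTE = 2 ** 70
BINARY_QUADRILLION = 2 ** 50

capacity_in_zb = 2 ** POOL_ADDRESS_BITS // ZETTABYTE        # 2**58 zettabytes
binary_quadrillions = capacity_in_zb // BINARY_QUADRILLION  # 2**8
decimal_quadrillions = capacity_in_zb // 10 ** 15           # same quantity in units of 10**15

print(binary_quadrillions, decimal_quadrillions)
```

Under the binary reading the result is 256, matching the slide; counted in decimal quadrillions the same capacity is about 288.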
The ZFS Filesystem

Keep it safe!
self-healing data
copes with every class of error
bit rot
phantom writes
misdirected reads and writes
administrative errors
disk scrubbing
real-time remote replication
encryption
The ZFS Filesystem

Read it back!
provable data integrity model
detects and corrects errors
The ZFS Filesystem

Do it fast!
write sequentialization
dynamic striping
multiple block sizes
constant-time snapshots
concurrent, constant-time directory ops
byte-range locking for concurrent writes
sync semantics at async speed
(critical for good NFS performance)
The ZFS Filesystem

Don't hassle me!
FS creation is as easy as mkdir
grow and shrink are automatic
no raw device names to remember
no volumes at all
no more fsck(1M)
no more editing /etc/vfstab
all administration online
Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance
Simple Administration
Pooled storage
Immense capacity
Quotas and Reservations
User undo
Volumes vs. Storage Pools

Traditional volumes
partition per FS
FS/volume interface: block-level I/O
naming and storage tightly bound
no space sharing
(figure: three FSes, each on its own Volume (Virtual Disk))
Pooled Storage
FSes share space
ZFS/pool interface: object transactions
naming and storage decoupled
all space shared
(figure: three ZFS filesystems on one Storage Pool)
Volumes vs. Storage Pools, cont'd

Both manage disks and provide mirroring
Traditional FS/volume model: volume provides
space, but FS manages it
volume doesn't know which blocks are in use
FS can't easily grow or shrink
FS creation requires new partition
ZFS model: SPA provides and manages space
many filesystems share space
grow and shrink are implicit
FS create/delete are just like mkdir/rmdir
only one pool to manage (vs. volume per FS)
Volumes vs. Storage Pools, cont'd

Advantages of pooled storage
reduces fragmentation
simplifies administration
decouples logical and physical structure
filesystems named by default mount point
Proof of concept: tmpfs
all tmpfs mounts share common swap space
administration is trivial: swap -a / swap -d
FS becomes more powerful administrative point
no longer tied to physical configuration
more like a directory with heritable attributes
Immense Capacity
128-bit storage pools
128-bit filesystems
128-bit files, but limited to 64-bit access
until we have 128-bit OS support
64-bit max files per dataset
64-bit max files per directory
statvfs128() will be needed
Quotas and Reservations

Traditional model
quotas: per-user UFS bolt-on
(cred structures all the way down to bmap)
reservations: no (nothing to reserve against)
ZFS model
FS is now the administrative point
FS per home directory, project, workspace, ...
quotas: per-FS
reservations: per-FS
group quotas, hierarchical quotas almost free
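The per-FS quota and reservation model can be illustrated with a toy accounting class. This is a sketch under stated assumptions, not ZFS code: the `Pool` class, its method names, and the rule that each filesystem claims the larger of its usage and its reservation are all hypothetical choices made for illustration.

```python
class Pool:
    """Toy storage pool: every filesystem draws on one shared space budget."""

    def __init__(self, size):
        self.size = size
        self.fs = {}  # name -> {"used", "quota", "reservation"}

    def _claimed(self):
        # Each filesystem claims its usage or its reservation, whichever is larger.
        return sum(max(f["used"], f["reservation"]) for f in self.fs.values())

    def create(self, name, quota=None, reservation=0):
        # Creating a filesystem needs no partition, only bookkeeping.
        if self._claimed() + reservation > self.size:
            raise ValueError("cannot guarantee reservation")
        self.fs[name] = {"used": 0, "quota": quota, "reservation": reservation}

    def write(self, name, nbytes):
        f = self.fs[name]
        new_used = f["used"] + nbytes
        if f["quota"] is not None and new_used > f["quota"]:
            raise ValueError("quota exceeded")
        old_claim = max(f["used"], f["reservation"])
        new_claim = max(new_used, f["reservation"])
        if self._claimed() - old_claim + new_claim > self.size:
            raise ValueError("pool out of space")
        f["used"] = new_used


pool = Pool(size=100)
pool.create("home/alice", quota=30)     # capped at 30 units
pool.create("db", reservation=50)       # guaranteed 50 units
pool.create("scratch")                  # no limits, shares what is left
pool.write("home/alice", 30)
pool.write("scratch", 20)               # pool is now fully claimed
pool.write("db", 50)                    # still succeeds: space was reserved
```

In this sketch a reservation is simply space other filesystems are not allowed to consume, which is why it only makes sense once space is shared.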
User Undo
Unlimited snapshots
recover previous version of a file
Undelete
recover recently deleted file
No sysadmin intervention required
ZFS is Object-Based
An object is a "flat file"
Everything is stored in objects: user data,
znodes, directories, free block lists, etc.
Arbitrarily complex operations reduce to reads
and writes on a set of objects
Simplifies interfaces, design, and analysis
single I/O path
single interposition point
single object read/write model
ZFS Components
ZPL (ZFS POSIX Layer): standard POSIX semantics (permission, mode, timestamps); translates vnode ops into object read/write
ZAP (ZFS Attribute Processor): constant-time, concurrent attribute operations (directories, object properties, etc.)
DMU (Data Management Unit): transactions, caching, object translations
SPA (Storage Pool Allocator): space allocation, replication, checksums, resource controls, encryption, compression, fault management
SPA Components
(figure: DMU above the SPA; inside the SPA, I/O groups (IOG), the metaslab allocator, and the compression, encryption, and checksum modules; below, a vdev stack with a mirror vdev over two disk vdevs)
Gather non-dependent I/O into I/O groups
Allocate space from metaslab layer
Apply pluggable modules: compression, encryption, checksum
Dispatch parallel, async I/O to vdev stack
Issue disk I/O
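The "pluggable modules" step can be pictured as lookup tables of interchangeable functions applied on the write path and undone on the read path. A minimal sketch follows; the table names, the block dictionary layout, and the choice of zlib, CRC-32, and SHA-256 are illustrative assumptions (the slides name no algorithms), and the encryption stage is omitted.

```python
import hashlib
import zlib

# Hypothetical module tables; each entry is swappable per block.
COMPRESSORS = {"off": lambda b: b, "zlib": zlib.compress}
DECOMPRESSORS = {"off": lambda b: b, "zlib": zlib.decompress}
CHECKSUMS = {
    "crc32": lambda b: zlib.crc32(b).to_bytes(4, "big"),
    "sha256": lambda b: hashlib.sha256(b).digest(),
}


def write_block(data, compression="zlib", checksum="sha256"):
    """Compress, then checksum the bytes that would actually reach the disk."""
    payload = COMPRESSORS[compression](data)
    return {
        "payload": payload,
        "compression": compression,
        "checksum": checksum,
        "digest": CHECKSUMS[checksum](payload),
    }


def read_block(block):
    """Verify the checksum first, then undo the compression."""
    if CHECKSUMS[block["checksum"]](block["payload"]) != block["digest"]:
        raise OSError("checksum mismatch")
    return DECOMPRESSORS[block["compression"]](block["payload"])


block = write_block(b"hello " * 100, compression="zlib", checksum="crc32")
assert read_block(block) == b"hello " * 100
```

Because each block records which functions were used, different blocks can use different modules without the layers above noticing.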
Provable Data Integrity Model

All operations are copy-on-write
never overwrite live data
All operations are transactional
related changes succeed or fail as a whole
All data is checksummed
no silent data corruption
Copy-on-Write TX Model
Problem: modify several objects atomically
DMU provides transactional interface
ZPL groups work into transactions
DMU sends whole transactions to SPA
SPA commits transaction groups
SPA never modifies active blocks
entire storage pool is a tree of blocks
rooted at the "uberblock"
transactions COW nodes of the tree
transaction group is committed when
uberblock is rewritten to point to new tree
initial block tree
write: COWs a data block
COW its level-1 indirect block
COW its level-2 indirect block
rewrite the uberblock (atomic)
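The sequence above (copy the data block, copy each indirect block on the path, then switch the root pointer) can be sketched with an append-only block store. This is an illustrative model only: the `Disk` class, the fanout of 2, and the use of a plain variable as the "uberblock" are assumptions made to keep the example small.

```python
class Disk:
    """Append-only block store: a block is never modified once written."""

    def __init__(self):
        self.blocks = []

    def alloc(self, content):
        self.blocks.append(content)
        return len(self.blocks) - 1  # address of the new block


def build_tree(disk, items, fanout=2):
    """Write data blocks, then indirect blocks (tuples of child addresses)."""
    level = [disk.alloc(d) for d in items]
    while len(level) > 1:
        level = [disk.alloc(tuple(level[i:i + fanout]))
                 for i in range(0, len(level), fanout)]
    return level[0]


def cow_write(disk, addr, depth, index, data, fanout=2):
    """Return the address of a new subtree in which leaf `index` holds `data`."""
    if depth == 0:
        return disk.alloc(data)                  # new copy of the data block
    children = list(disk.blocks[addr])
    span = fanout ** (depth - 1)                 # leaves under each child
    slot = index // span
    children[slot] = cow_write(disk, children[slot], depth - 1,
                               index % span, data, fanout)
    return disk.alloc(tuple(children))           # new copy of the indirect block


def read(disk, addr, depth, index, fanout=2):
    while depth > 0:
        span = fanout ** (depth - 1)
        addr = disk.blocks[addr][index // span]
        index %= span
        depth -= 1
    return disk.blocks[addr]


disk = Disk()
uberblock = build_tree(disk, [b"a", b"b", b"c", b"d"])  # two indirect levels
old_root = uberblock
new_root = cow_write(disk, uberblock, 2, 2, b"C")        # nothing visible yet
uberblock = new_root                                     # the single atomic step

assert read(disk, old_root, 2, 2) == b"c"   # old tree is intact
assert read(disk, uberblock, 2, 2) == b"C"  # new tree sees the write
```

Until the last assignment, a reader starting from the old root sees a complete, consistent tree; the write costs one new block per level on the path.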
Snapshots
COW TX model enables constant-time snapshots
snapshot storage pool by copying its uberblock
snapshot single FS by copying its root block
snapshot single file by copying its dnode
Provides data recovery and fixed target for backup
Snapshot delta = incremental
Unlimited number of snapshots
cf. 1 with UFS, 32 with WAFL
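"Snapshot delta = incremental" can be read as: the blocks reachable from the current root but not from the snapshot root are exactly what an incremental backup needs. A small sketch under that reading, using a hand-built block table (the addresses and the tuple-as-indirect-block encoding are hypothetical):

```python
def reachable(blocks, root):
    """Addresses of all blocks reachable from `root` (tuples are indirect blocks)."""
    seen, stack = set(), [root]
    while stack:
        addr = stack.pop()
        if addr in seen:
            continue
        seen.add(addr)
        content = blocks[addr]
        if isinstance(content, tuple):
            stack.extend(content)
    return seen


blocks = {
    0: b"a",
    1: b"b",
    2: (0, 1),   # root block at snapshot time
    3: b"B",     # copy-on-write replacement for block 1
    4: (0, 3),   # root block after the write
}
snapshot_root = 2   # taking the snapshot only meant remembering this address
current_root = 4

incremental = reachable(blocks, current_root) - reachable(blocks, snapshot_root)
print(sorted(incremental))
```

Taking the snapshot costs the same regardless of how much data sits under the root, which is the sense in which it is constant-time.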
Snapshots
Save old uberblock - describes complete snapshot
(figure: block tree with the snapshot uberblock and the current uberblock)
Checksums
Traditional model: checksum stored with block
Fine for detecting bit rot, but:

can't detect phantom writes, misdirections
can't validate the checksum itself
can't protect against tampering
Checksums, cont'd
SPA model: checksum stored with indirect block
Self-validating
Detects bit rot, phantom writes, misdirections,
admin error (e.g. swap on active ZFS disk)
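The difference between the two models can be shown with a toy disk. In this sketch the parent holds a block pointer of address plus checksum; SHA-256 stands in for the checksum function, and the dictionary-based disk and all function names are assumptions for illustration.

```python
import hashlib


def digest(data):
    # SHA-256 is a stand-in; the slides do not name a checksum function.
    return hashlib.sha256(data).digest()


disk = {}  # toy disk: address -> bytes


def write_block(addr, data):
    """Write data and return the block pointer the parent would store."""
    disk[addr] = data
    return {"addr": addr, "checksum": digest(data)}


def read_block(ptr):
    """Read through a parent pointer, validating against the parent's checksum."""
    data = disk.get(ptr["addr"], b"")
    if digest(data) != ptr["checksum"]:
        raise OSError("checksum mismatch at address %d" % ptr["addr"])
    return data


def self_checksum_ok(stored):
    """Traditional model: the checksum travels with the block it covers."""
    data, checksum = stored
    return digest(data) == checksum


# Phantom write: the parent was updated for b"new", but the disk kept b"old".
write_block(7, b"old")
stale_with_own_checksum = (b"old", digest(b"old"))
phantom_ptr = {"addr": 7, "checksum": digest(b"new")}

print(self_checksum_ok(stale_with_own_checksum))  # stale block still validates itself
try:
    read_block(phantom_ptr)
except OSError as err:
    print("detected:", err)
```

A stale or misplaced block is internally consistent, so only a checksum held somewhere else can say that it is the wrong block.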
Checksums, cont'd
Physical separation improves fault isolation,
yet doesn't require additional I/O
64-bit strength ensures data integrity
provides 99.99999999999999999%
("nineteen nines") error detection probability
Checksum vectoring provides flexibility
weaker checksums for performance
faster checksums in the future
secure checksums for data authentication
(uberblock checksum provides unforgeable
signature for the entire storage pool)
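The "nineteen nines" figure follows from the checksum width. A worked check, assuming an idealized 64-bit checksum for which a corrupted block passes by chance with probability 2**-64 (real checksum functions only approximate this):

```python
from fractions import Fraction

CHECKSUM_BITS = 64
miss_probability = Fraction(1, 2 ** CHECKSUM_BITS)   # idealized undetected-error rate
detection_probability = 1 - miss_probability

# Count how many leading nines the detection probability has, exactly.
nines = 0
while detection_probability >= 1 - Fraction(1, 10 ** (nines + 1)):
    nines += 1

print(nines)
```

The loop stops at the largest n with 10**n <= 2**64, which is 19, so the probability reads 0.9999999999999999999... and the percentage form has the same nineteen nines.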
Always-Available Data
Always-consistent on-disk format
Elimination of fsck(1M)
Self-Healing Data
Failure Prediction and Disk Scrubbing
Hot Space
Data Migration
Real-Time Remote Replication
User Undo
Always-Consistent On-Disk Format

ZFS is always self-consistent
follows from COW transaction model
Doesn't depend on the intent log
No more fsck(1M)
no "clean bit"
no off-line maintenance
ZFS is always mountable
Self-Healing Data
Media error under traditional FS:
bad user data causes silent data corruption
bad metadata causes SDC, panic, or both
Media error under ZFS:
checksum detects data corruption
SPA gets valid data from another replica
and uses it to repair the damaged one
SPA returns valid data to application
no sysadmin intervention required
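The ZFS read path described above can be sketched as a mirror that checks each replica against an expected checksum and rewrites any copy that fails. The `Mirror` class and its methods are hypothetical, and SHA-256 again stands in for the unspecified checksum.

```python
import hashlib


def checksum(data):
    # Stand-in checksum; the real function is not specified in the slides.
    return hashlib.sha256(data).digest()


class Mirror:
    """Toy n-way mirror whose reads verify and repair replicas."""

    def __init__(self, copies=2):
        self.replicas = [dict() for _ in range(copies)]

    def write(self, addr, data):
        for replica in self.replicas:
            replica[addr] = data
        return checksum(data)  # kept by the parent block, not next to the data

    def read(self, addr, expected):
        good, damaged = None, []
        for i, replica in enumerate(self.replicas):
            data = replica.get(addr)
            if data is not None and checksum(data) == expected:
                good = data
                break
            damaged.append(i)
        if good is None:
            raise OSError("no valid replica")
        for i in damaged:
            self.replicas[i][addr] = good  # repair the bad copy in place
        return good


mirror = Mirror()
expected = mirror.write(0, b"payload")
mirror.replicas[0][0] = b"pAyload"             # simulate bit rot on one side
assert mirror.read(0, expected) == b"payload"  # application gets valid data
assert mirror.replicas[0][0] == b"payload"     # damaged replica was healed
```

A mirror without checksums can tell that two copies differ but not which one is right; the expected checksum is what makes the repair safe.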
Failure Prediction
SPA automatically migrates data from failing
devices to healthy devices
Detects health by monitoring error rate
Employs disk scrubbing to detect latent errors
while they're still correctable
Hot Space
(figure: hot spare model compared with "hot space" model)
No more dedicated hot spares

"hot space" spread across all devices
Keeps all devices active
uses all available I/O bandwidth
improves drive utilization
improves failure prediction
prevents silent atrophy
Data Migration
Allow transparent disk upgrades and
data migration from failing devices
Apply VM principles to storage
DMU names blocks by 128-bit DVA
(Data Virtual Address)
high-order 64 bits specify metaslab
SPA translates metaslab to <vdev, offset>
SPA can migrate metaslabs from one vdev
to another without affecting any DMU state
Data remains available during migration
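The DVA indirection works like a page table: upper layers hold a stable virtual address, and only the SPA's mapping changes when data moves. In the sketch below the high-order 64 bits select the metaslab as the slide says; treating the low-order 64 bits as an offset within the metaslab, and the `SPA` class itself, are assumptions for illustration.

```python
MASK_64 = (1 << 64) - 1


def make_dva(metaslab, low_bits):
    """Pack a 128-bit DVA: high 64 bits = metaslab, low 64 bits = assumed offset."""
    return (metaslab << 64) | (low_bits & MASK_64)


def split_dva(dva):
    return dva >> 64, dva & MASK_64


class SPA:
    def __init__(self):
        self.metaslab_map = {}  # metaslab id -> (vdev name, base offset)

    def translate(self, dva):
        metaslab, low_bits = split_dva(dva)
        vdev, base = self.metaslab_map[metaslab]
        return vdev, base + low_bits

    def migrate(self, metaslab, new_vdev, new_base):
        # A real migration would copy the data first; here only the map changes.
        self.metaslab_map[metaslab] = (new_vdev, new_base)


spa = SPA()
spa.metaslab_map[3] = ("old_disk", 1 << 30)
dva = make_dva(3, 4096)

assert spa.translate(dva) == ("old_disk", (1 << 30) + 4096)
spa.migrate(3, "new_disk", 0)
assert spa.translate(dva) == ("new_disk", 4096)  # same DVA, new physical home
```

Nothing that stores the DVA has to be rewritten when the metaslab moves, which is what lets migration proceed without touching DMU state.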
Real-Time Remote Replication

Everything in ZFS is an object
Every change is just a write to an object
Writes are always batched into TX groups
Contents of TX group can be sent async
Latency insensitive!
Occasional ACK for remote TX group commit
High Performance
Write Sequentialization
Dynamic striping
Parallel three-phase TX groups
Intelligent prefetch
Multiple block sizes
Sync semantics at async speed
Concurrent, constant-time directory ops
POSIX-compliant concurrent writes
Hot space
Write Sequentialization
Traditional FS: random file writes become
random disk writes
ZFS: random file writes become
sequential disk writes
follows from COW model
modified blocks are newly allocated
SPA has complete allocation freedom
SPA chooses sequential free blocks
Cost of writing extra ZFS metadata more than
offset by improved locality
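Because a modified block is always written to a newly allocated location, the allocator is free to hand out consecutive addresses regardless of which file offsets are being written. A deliberately tiny sketch; the `SequentialAllocator` class is hypothetical and ignores freeing and reuse.

```python
class SequentialAllocator:
    """Toy allocator: each write lands at the next free disk address."""

    def __init__(self):
        self.next_free = 0
        self.location = {}  # logical block number -> current disk address

    def write(self, logical_block):
        addr = self.next_free
        self.next_free += 1
        self.location[logical_block] = addr  # the pointer is updated, not the old block
        return addr


allocator = SequentialAllocator()
random_file_writes = [907, 12, 455, 3, 788]
disk_addresses = [allocator.write(b) for b in random_file_writes]
print(disk_addresses)
```

The logical block numbers are scattered, yet the disk sees addresses 0 through 4 in order; the price is the extra pointer updates the slide mentions.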
Dynamic Striping
Traditional striping: spread data across multiple
devices at fixed stride
Inflexible: can't change stripe width,
can't add or remove devices
(figure: five devices at fixed stride, holding blocks 0, 5, 10, ...; 1, 6, 11, ...; 2, 7, 12, ...; 3, 8, 13, ...; 4, 9, 14, ...)
Dynamic striping: round-robin allocation

balances writes across all available devices
enabled by COW model
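The contrast can be made concrete. With a fixed stride the device is computed from the block number, so changing the width relocates most blocks; with round-robin allocation the chosen device is recorded per block, so a new device simply joins the rotation. The function and class names below are illustrative assumptions.

```python
def static_device(block, width):
    """Fixed-stride striping: the location is a pure function of the block number."""
    return block % width


class DynamicStripe:
    """Round-robin allocation with the chosen device recorded per block."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.cursor = 0
        self.where = {}  # block id -> device name

    def add_device(self, name):
        self.devices.append(name)

    def allocate(self, block):
        device = self.devices[self.cursor % len(self.devices)]
        self.cursor += 1
        self.where[block] = device
        return device


# Widening a static stripe from 4 to 5 devices moves most existing blocks.
moved = sum(static_device(b, 4) != static_device(b, 5) for b in range(20))

stripe = DynamicStripe(["d0", "d1"])
for block in range(4):
    stripe.allocate(block)
stripe.add_device("d2")          # existing blocks stay where they are
for block in range(4, 7):
    stripe.allocate(block)
print(moved, stripe.where)
```

In the static case 16 of the 20 blocks change device; in the dynamic case none do, and new writes start landing on the added device.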
Three-Phase Transaction Groups

Open: accepting new transactions
Quiescing: waiting for transactions to finish
Syncing: pushing changes to disk
Up to three transaction groups active
one in each state - prevents burstiness
uses all available disk bandwidth
(figure: transaction groups over time in the Open, Quiescing, Syncing, and Closed states)
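The three states form a pipeline in which every group moves one stage forward at each step. A sketch of that rotation follows; the `TxgPipeline` class is hypothetical and models each stage as finishing instantly when `advance()` is called.

```python
class TxgPipeline:
    """Toy model: at most one transaction group in each of three states."""

    def __init__(self):
        self.next_id = 1
        self.open = self._new_group()
        self.quiescing = None
        self.syncing = None
        self.committed = []  # ids of groups that have reached disk

    def _new_group(self):
        group = {"id": self.next_id, "changes": []}
        self.next_id += 1
        return group

    def add(self, change):
        self.open["changes"].append(change)  # only the open group accepts work

    def advance(self):
        if self.syncing is not None:
            self.committed.append(self.syncing["id"])
        self.syncing = self.quiescing
        self.quiescing = self.open
        self.open = self._new_group()


pipeline = TxgPipeline()
pipeline.add("write A")
pipeline.advance()
pipeline.add("write B")
pipeline.advance()
pipeline.add("write C")
# Three groups are now active, one per state.
print(pipeline.syncing["id"], pipeline.quiescing["id"], pipeline.open["id"])
```

New transactions are never blocked by a sync in progress, since there is always an open group to join.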
Multiple Block Sizes

No block size is optimal for everything
large blocks: less metadata
small blocks: more efficient for small objects
record-structured files have natural granularity;
we want to match it to avoid read/modify/write
ZFS supports any power of two block size
Per-object granularity
automatic block size selection by default
manual override
Enables transparent block-based compression
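One plausible automatic policy is to pick the smallest power-of-two block that holds the object, within some bounds. The slides do not describe the actual selection rule, so the policy and the 512-byte and 128 KB limits below are purely hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pick_block_size(object_size, min_size=512, max_size=128 * 1024):
    """Hypothetical policy: fit small objects in one block, cap large ones."""
    return max(min_size, min(max_size, next_power_of_two(object_size)))


for size in (300, 5000, 8192, 10 ** 9):
    print(size, pick_block_size(size))
```

A manual override would simply bypass `pick_block_size`, for example to match a database record size and avoid read/modify/write.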
Multiple Block Sizes, cont'd

Why not extents?
extents don't COW: writes force extent breaks
greater code complexity
Multiple block sizes combine the simplicity
of blocks with the metadata savings of extents
Sync Semantics at Async Speed

Review: ZFS is always self-consistent on disk
However: after system crash, ZFS won't contain
transactions since last sync
Use intent log to recover recent transactions
log metadata only: lose recent writes (UFS)
log user + metadata: recover everything (NFS)
log to disk: wait for one sequential disk write
log to NVRAM on I/O bus: fast (NetApp filers)
log to NVRAM on main memory bus: blazing
Ideal configuration: log all ops to NVRAM
need HW/sales/marketing on board
big payoff: only a system vendor can do this
Fast Directory Operations

Large directories: need constant-time operations
(lookup, create, delete)
Hot directories: need concurrent operations
Solution: extendible hashing
block-based
amortized growth cost
short chains for constant-time ops
per-block locking for high concurrency
readdir: returns entries in hash-value order
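Extendible hashing keeps a directory of bucket references indexed by hash bits; a full bucket is split in two, and the directory doubles only when the splitting bucket already uses every directory bit. The sketch below is a generic in-memory version of that technique, not the ZAP's on-disk format; the class names, the bucket capacity, and the use of Python's built-in `hash` are assumptions.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of low-order hash bits this bucket covers
        self.items = {}


class ExtendibleHash:
    def __init__(self, bucket_capacity=4):
        self.capacity = bucket_capacity
        self.global_depth = 0
        self.table = [Bucket(0)]  # directory with 2**global_depth entries

    def _bucket(self, key):
        return self.table[hash(key) & ((1 << self.global_depth) - 1)]

    def lookup(self, key):
        return self._bucket(key).items.get(key)

    def delete(self, key):
        return self._bucket(key).items.pop(key, None)

    def insert(self, key, value):
        # Caveat: keys with fully identical hashes beyond capacity would loop forever.
        while True:
            bucket = self._bucket(key)
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            self.table = self.table + self.table  # double the directory
            self.global_depth += 1
        bit = 1 << bucket.depth
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        for key in list(bucket.items):
            if hash(key) & bit:
                sibling.items[key] = bucket.items.pop(key)
        for i, entry in enumerate(self.table):
            if entry is bucket and i & bit:
                self.table[i] = sibling


directory = ExtendibleHash(bucket_capacity=2)
for inode in range(100):
    directory.insert(inode, "file%d" % inode)
assert directory.lookup(42) == "file42"
assert directory.delete(42) == "file42"
assert directory.lookup(42) is None
```

A lookup touches one directory entry and one bucket regardless of directory size, and a lock per bucket (not shown) would let operations on different buckets proceed concurrently.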
Concurrent Writes
Existing filesystems force trade-off between
POSIX compliance and write concurrency
ZFS employs byte-range locking to allow maximum
concurrency while satisfying POSIX overlapping
write semantics
(figure: panels labeled "Parallel read/write" and "Serialized")
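The core of byte-range locking is an overlap test: writers to disjoint ranges of a file proceed together, and only overlapping ranges must wait. A non-blocking sketch of that test follows; the `RangeLocks` class and its `try_lock` interface are hypothetical, and a real implementation would block and distinguish readers from writers.

```python
class RangeLocks:
    """Toy per-file lock table over half-open byte ranges [start, end)."""

    def __init__(self):
        self.held = []

    def try_lock(self, start, length):
        end = start + length
        for held_start, held_end in self.held:
            if start < held_end and held_start < end:  # ranges overlap
                return False
        self.held.append((start, end))
        return True

    def unlock(self, start, length):
        self.held.remove((start, start + length))


locks = RangeLocks()
assert locks.try_lock(0, 4096)          # writer 1
assert locks.try_lock(4096, 4096)       # writer 2, disjoint: runs in parallel
assert not locks.try_lock(2048, 4096)   # writer 3 overlaps both: must wait
locks.unlock(0, 4096)
locks.unlock(4096, 4096)
assert locks.try_lock(2048, 4096)       # now it can proceed
```

A whole-file lock is the degenerate case where every write overlaps every other, which is the serialization the slide is contrasting against.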
Compression
Block-level compression in SPA
transparent to all other layers
enabled by multiple block size support
DMU translations: all 8k
SPA block allocations: vary with compression (8k, 4k, 2k, 8k in the figure)
Per-file, per-filesystem, or per-pool

Vectoring for different compression functions
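The figure's idea, that fixed-size logical blocks map to variable-size allocations, can be sketched by compressing an 8k block and rounding the result up to a power-of-two allocation. zlib, the 512-byte minimum, and the rule of storing incompressible data as-is are assumptions for illustration.

```python
import os
import zlib

LOGICAL_BLOCK = 8192


def allocation_size(data, min_alloc=512):
    """Bytes to allocate for one logical block after trying compression."""
    compressed = zlib.compress(data)
    if len(compressed) >= len(data):
        return len(data)          # not worth it: store the block uncompressed
    size = min_alloc
    while size < len(compressed):
        size *= 2                 # round up to the next power-of-two block size
    return size


zeros = bytes(LOGICAL_BLOCK)             # highly compressible
noise = os.urandom(LOGICAL_BLOCK)        # effectively incompressible
text = (b"ZFS pools share space. " * 400)[:LOGICAL_BLOCK]

for name, block in (("zeros", zeros), ("text", text), ("noise", noise)):
    print(name, allocation_size(block))
```

The layer above still sees 8k blocks in every case; only the space the SPA allocates differs, which is why the feature depends on multiple block sizes.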
Encryption and Data Security

Block-level encryption in SPA
transparent to all other layers
supports any symmetric block cipher mode:
DES, AES, IDEA, RC6, Blowfish, SEAL, OCB...
Per-filesystem or per-pool
Vectoring for different encryption functions
Data authentication via secure checksums
Open issues:
key management
larger data security model
Futures
POSIX isn't the only game in town
DMU as native Oracle API
Object-based appliances
agnostic: NFS, database, volume emulation
DMU as "foundation classes"
(figure: UFS, ZPL, NFS, Oracle, *FS, raw, and zvol layered above the DMU, which sits on the SPA)
Case Study: Jurassic on UFS/SVM

Upgrading disks
major down time
significant manual labor
FS-to-user mapping
single FS impossible: exceeds 1TB
FS per user impractical: fragments storage
/var/mail
create/delete .lock files: serial and slow
Case Study: Jurassic on UFS/SVM

Quotas and reservations
quotas: too expensive and broken to use
reservations: no such concept
User error recovery
restore from tape
last 24 hours lost
Reliability / Availability
several instances of data loss this year
hours of down time for fsck(1M)
Case Study: Jurassic on ZFS

Upgrading disks
add new disks to storage pool
remove old disks from storage pool
(SPA auto-migrates the data)
FS-to-user mapping
single FS possible
FS per user better: enables per-user
reservations, snapshots, encryption, etc.
/var/mail
create/delete .lock files: parallel and fast
Case Study: Jurassic on ZFS

Quotas and reservations
per-filesystem: e.g. per-workspace,
per-home directory, per-project
User error recovery
user undo
restore from snapshot
either way, no sysadmin required
Reliability / Availability
no fsck(1M); ZFS is always mountable
provable data integrity model
Where Are We Now?

"Hello world" on Oct 31, 2002
complete POSIX-compliant filesystem
most key features working: pooled storage,
crash resilience, self-healing data
Full builds of ON10 on ZFS filesystems
zvol driver used for MTB-UFS test/bringup
Still plenty to do
intent log, snapshots, perf work
internal alpha program
Phase 1 putback in October
ZFS:
The Zettabyte Filesystem
Please send questions, comments and ideas to:
zfs-team@sun.com
Want to follow ZFS developments? Join:
zfs-interest@sun.com
For the latest information, visit:
http://zfs.eng