ZFS Overview


ZFS:

The Zettabyte Filesystem

zfs-team@sun.com

ZFS: The Zettabyte Filesystem

February 10, 2003

Sun Microsystems Proprietary / Confidential Need to Know

Page 1

The Perfect Filesystem


Write my data
Keep it safe
Read it back
Do it fast
Don't hassle me


Existing Filesystems
Write my data?
limited size (16TB for UFS)
limited number of files
limited directory entries


Existing Filesystems
Keep it safe?
bit rot causes silent data corruption
no defense against phantom writes,
misdirections, other firmware bugs
no defense against administrative errors
(e.g. swap on active filesystem device)
no security: spying, tampering, theft


Existing Filesystems
Read it back?
no data integrity checks
no data authentication
data might be good, might be bad
don't know
couldn't fix it if we did
like running a server without DRAM parity


Existing Filesystems
Do it fast?
linear-time directory ops
linear-time newfs(1M), fsck(1M)
limited read/write concurrency
fixed block size
fixed stripe width
poor random write performance
slow mirroring

Existing Filesystems
Don't hassle me?
create a partition for every FS
grow: manual process
shrink: not possible
remember a bunch of c0t0d0s0 names
edit /etc/vfstab by hand
wait around for fsck(1M)
take system down to upgrade disks

ZFS Objective

End the suffering


The ZFS Filesystem


Write my data!
immense capacity (128-bit)
there's no SI prefix for this!
zettabyte = 70-bit (a billion TB)
ZFS capacity: 256 quadrillion ZB
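
The arithmetic behind these figures can be checked directly. The sketch below assumes binary units (a zettabyte taken as 2^70 bytes, a quadrillion approximated as 2^50), which is the reading under which the slide's numbers line up. The variable names are illustrative only.

```python
# Assumption: binary units, as the "70-bit" remark suggests.
POOL_BYTES = 2**128            # bytes addressable with 128-bit offsets
ZETTABYTE = 2**70              # "zettabyte = 70-bit"
TERABYTE = 2**40

zb_in_pool = POOL_BYTES // ZETTABYTE      # 2**58 zettabytes
tb_per_zb = ZETTABYTE // TERABYTE         # 2**30, roughly a billion TB

# With a "quadrillion" approximated as 2**50, 2**58 comes out as 256 of them.
quadrillions = zb_in_pool // 2**50

print(zb_in_pool == 2**58, tb_per_zb, quadrillions)
```

In strict decimal terms 2^58 is closer to 288 quadrillion, so the slide's figure is best read as a power-of-two approximation.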


The ZFS Filesystem


Keep it safe!
self-healing data
copes with every class of error
bit rot
phantom writes
misdirected reads and writes
administrative errors
disk scrubbing
real-time remote replication
encryption

The ZFS Filesystem


Read it back!
provable data integrity model
detects and corrects errors


The ZFS Filesystem


Do it fast!
write sequentialization
dynamic striping
multiple block sizes
constant-time snapshots
concurrent, constant-time directory ops
byte-range locking for concurrent writes
sync semantics at async speed
(critical for good NFS performance)

The ZFS Filesystem


Don't hassle me!
FS creation is as easy as mkdir
grow and shrink are automatic
no raw device names to remember
no volumes at all
no more fsck(1M)
no more editing /etc/vfstab
all administration online

Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance


Simple Administration
Pooled storage
Immense capacity
Quotas and Reservations
User undo


Volumes vs. Storage Pools


Traditional volumes
partition per FS
FS/volume interface: block-level I/O
naming and storage tightly bound
no space sharing
[Diagram: three FSes, each on its own volume (virtual disk)]

Pooled storage
FSes share space
ZFS/pool interface: object transactions
naming and storage decoupled
all space shared
[Diagram: three ZFS filesystems on one shared storage pool]

Volumes vs. Storage Pools, cont'd


Both manage disks and provide mirroring
Traditional FS/volume model: volume provides
space, but FS manages it
volume doesn't know which blocks are in use
FS can't easily grow or shrink
FS creation requires new partition
ZFS model: SPA provides and manages space
many filesystems share space
grow and shrink are implicit
FS create/delete are just like mkdir/rmdir
only one pool to manage (vs. volume per FS)

Volumes vs. Storage Pools, cont'd


Advantages of pooled storage
reduces fragmentation
simplifies administration
decouples logical and physical structure
filesystems named by default mount point
Proof of concept: tmpfs
all tmpfs mounts share common swap space
administration is trivial: swap -a / swap -d
FS becomes more powerful administrative point
no longer tied to physical configuration
more like a directory with heritable attributes

Immense Capacity
128-bit storage pools
128-bit filesystems
128-bit files, but limited to 64-bit access
until we have 128-bit OS support
64-bit max files per dataset
64-bit max files per directory
statvfs128() will be needed

Quotas and Reservations


Traditional model
quotas: per-user UFS bolt-on
(cred structures all the way down to bmap)
reservations: no (nothing to reserve against)
ZFS model
FS is now the administrative point
FS per home directory, project, workspace, ...
quotas: per-FS
reservations: per-FS
group quotas, hierarchical quotas almost free


User Undo
Unlimited snapshots
recover previous version of a file
Undelete
recover recently deleted file
No sysadmin intervention required


Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance


ZFS is Object-Based
An object is a "flat file"
Everything is stored in objects: user data,
znodes, directories, free block lists, etc.
Arbitrarily complex operations reduce to reads
and writes on a set of objects
Simplifies interfaces, design, and analysis
single I/O path
single interposition point
single object read/write model

ZFS Components
ZPL (ZFS POSIX Layer): standard POSIX semantics (permission, mode, timestamps); translates vnode ops into object read/write
ZAP (ZFS Attribute Processor): constant-time, concurrent attribute operations (directories, object properties, etc.)
DMU (Data Management Unit): transactions, caching, object translations
SPA (Storage Pool Allocator): space allocation, replication, checksums, resource controls, encryption, compression, fault management


SPA Components
Gather non-dependent I/O into I/O groups (IOG)
Allocate space from metaslab layer
Apply pluggable modules
compression
encryption
checksum
Dispatch parallel, async I/O to vdev stack
Issue disk I/O
[Diagram: DMU above SPA; the SPA contains the IOG, metaslab allocator, compression, encryption, and checksum modules, and a mirror vdev over two disk vdevs]


Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance


Provable Data Integrity Model


All operations are copy-on-write
never overwrite live data
All operations are transactional
related changes succeed or fail as a whole
All data is checksummed
no silent data corruption


Copy-on-Write TX Model
Problem: modify several objects atomically
DMU provides transactional interface
ZPL groups work into transactions
DMU sends whole transactions to SPA
SPA commits transaction groups
SPA never modifies active blocks
entire storage pool is a tree of blocks
rooted at the "uberblock"
transactions COW nodes of the tree
transaction group is committed when
uberblock is rewritten to point to new tree
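
The commit rule above can be sketched in a few lines. This is a toy model under stated assumptions, not the SPA's actual data structures: `Block` and `cow_write` are hypothetical names, and blocks live in memory rather than on disk.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Block:
    """Immutable block: either a leaf payload or a tuple of child blocks."""
    payload: Union[bytes, Tuple["Block", ...]]

def cow_write(root: Block, path: Tuple[int, ...], data: bytes) -> Block:
    """Return a new root with the leaf at `path` replaced by `data`.

    Only the blocks on the path from the root to the leaf are copied;
    every other subtree is shared with the old tree.
    """
    if not path:
        return Block(data)
    children = list(root.payload)
    children[path[0]] = cow_write(children[path[0]], path[1:], data)
    return Block(tuple(children))

# Two-level tree: root -> 2 indirect blocks -> 2 data blocks each.
leaves = [Block(bytes([i])) for i in range(4)]
old_root = Block((Block((leaves[0], leaves[1])), Block((leaves[2], leaves[3]))))

uberblock = old_root                              # pointer to the live tree
new_root = cow_write(uberblock, (1, 0), b"new")   # builds the new tree on the side
uberblock = new_root                              # one pointer swap commits it
```

Until the last line runs, the old tree is complete and untouched, which is why a crash at any earlier point leaves a consistent pool.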

Copy-on-Write TX Model: diagram sequence
initial block tree
write: COWs a data block
COW its level-1 indirect block
COW its level-2 indirect block
rewrite the uberblock (atomic)

Snapshots
COW TX model enables constant-time snapshots
snapshot storage pool by copying its uberblock
snapshot single FS by copying its root block
snapshot single file by copying its dnode
Provides data recovery and fixed target for backup
Snapshot delta = incremental
Unlimited number of snapshots
cf. 1 with UFS, 32 with WAFL
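
A minimal sketch of why a snapshot costs constant time, assuming a flat, hypothetical `Pool` class in which the live tree is an immutable tuple. A real tree would copy only the modified path on write; the point here is that taking the snapshot copies no data at all.

```python
class Pool:
    def __init__(self, blocks):
        self.uberblock = tuple(blocks)   # root of the live (immutable) tree
        self.snapshots = {}

    def write(self, index, data):
        # Copy-on-write: build a new root; the old one is left untouched.
        blocks = list(self.uberblock)
        blocks[index] = data
        self.uberblock = tuple(blocks)

    def snapshot(self, name):
        # Constant time: remember the current root, copy no data.
        self.snapshots[name] = self.uberblock

pool = Pool([b"a", b"b", b"c"])
pool.snapshot("monday")
pool.write(1, b"B")                      # the live tree moves on

# "Snapshot delta = incremental": only changed blocks need backing up.
old = pool.snapshots["monday"]
delta = [i for i, (x, y) in enumerate(zip(old, pool.uberblock)) if x != y]
```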


Snapshots
Save old uberblock - describes complete snapshot
[Diagram: snapshot uberblock and current uberblock]

Checksums
Traditional model: checksum stored with block

Fine for detecting bit rot, but:


can't detect phantom writes, misdirections
can't validate the checksum itself
can't protect against tampering


Checksums, cont'd
SPA model: checksum stored with indirect block

Self-validating
Detects bit rot, phantom writes, misdirections,
admin error (e.g. swap on active ZFS disk)
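
The difference from the traditional model can be illustrated with a toy example. The sketch assumes SHA-256 as a stand-in (the slides do not name a checksum function) and uses hypothetical helpers `write_block` and `read_block`; the "disk" is a dictionary.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # Stand-in checksum; the real function is not specified in the slides.
    return hashlib.sha256(data).digest()

class Disk:
    """Toy disk: a dict from address to block contents."""
    def __init__(self):
        self.blocks = {}

def write_block(disk, addr, data):
    """Write data and return the block pointer the parent would store."""
    disk.blocks[addr] = data
    return (addr, checksum(data))

def read_block(disk, pointer):
    """Read through a parent's pointer, verifying the child's checksum."""
    addr, expected = pointer
    data = disk.blocks.get(addr, b"")
    if checksum(data) != expected:
        raise IOError(f"checksum mismatch at address {addr}")
    return data

disk = Disk()
ptr = write_block(disk, 7, b"version 1")

# Phantom write: the parent records version 2, but the disk silently
# drops the data and still holds version 1 at address 7.
new_ptr = (7, checksum(b"version 2"))

try:
    read_block(disk, new_ptr)
    detected = False
except IOError:
    detected = True
```

Had the checksum been stored alongside the block, the stale block and its stale checksum would still agree with each other and the lost write would go unnoticed.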

Checksums, cont'd
Physical separation improves fault isolation,
yet doesn't require additional I/O
64-bit strength ensures data integrity
provides 99.99999999999999999%
("nineteen nines") error detection probability
Checksum vectoring provides flexibility
weaker checksums for performance
faster checksums in the future
secure checksums for data authentication
(uberblock checksum provides unforgeable
signature for the entire storage pool)
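
The "nineteen nines" figure follows from the checksum width. The check below assumes a corrupted block yields a uniformly random 64-bit checksum, so the chance of an accidental match is 2^-64; that uniformity is a modelling assumption, not a claim from the slides.

```python
from fractions import Fraction

# Assumption: corrupted data produces a uniformly random 64-bit checksum.
miss = Fraction(1, 2**64)        # chance a bad block matches by accident
detect = 1 - miss                # exact detection probability

# Count the leading nines in the decimal expansion of `detect`.
nines = 0
remainder = detect
while True:
    remainder *= 10
    digit = int(remainder)       # next decimal digit
    if digit != 9:
        break
    nines += 1
    remainder -= digit

print(nines)
```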

Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance


Always-Available Data
Always-consistent on-disk format
Elimination of fsck(1M)
Self-Healing Data
Failure Prediction and Disk Scrubbing
Hot Space
Data Migration
Real-Time Remote Replication
User Undo

Always-Consistent On-Disk Format


ZFS is always self-consistent
follows from COW transaction model
Doesn't depend on the intent log
No more fsck(1M)
no "clean bit"
no off-line maintenance
ZFS is always mountable


Self-Healing Data
Media error under traditional FS:
bad user data causes silent data corruption
bad metadata causes SDC, panic, or both
Media error under ZFS:
checksum detects data corruption
SPA gets valid data from another replica
and uses it to repair the damaged one
SPA returns valid data to application
no sysadmin intervention required
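
The ZFS read path described above might look roughly like this for a two-way mirror. It is a sketch under assumptions: `self_healing_read` is a hypothetical name, SHA-256 stands in for the unspecified checksum, and each mirror side is a dictionary.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()    # stand-in checksum function

def self_healing_read(replicas, index, expected):
    """Return valid data for block `index`, repairing bad replicas.

    `replicas` is a list of dicts (one per mirror side) mapping block
    index to bytes; `expected` is the checksum held by the parent block.
    """
    good = None
    for replica in replicas:
        if checksum(replica[index]) == expected:
            good = replica[index]
            break
    if good is None:
        raise IOError("no valid replica")
    for replica in replicas:
        if replica[index] != good:
            replica[index] = good            # repair the damaged copy
    return good

side_a = {0: b"payroll"}
side_b = {0: b"payroll"}
expected = checksum(b"payroll")

side_a[0] = b"pAyroll"                       # simulated bit rot on one side
data = self_healing_read([side_a, side_b], 0, expected)
```

The caller gets correct data and the damaged side is rewritten in the same pass, with no administrator involved.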


Failure Prediction
SPA automatically migrates data from failing
devices to healthy devices
Detects health by monitoring error rate
Employs disk scrubbing to detect latent errors
while they're still correctable


Hot Space
[Diagram: hot spare model vs. "hot space" model]
No more dedicated hot spares
"hot space" spread across all devices
Keeps all devices active
uses all available I/O bandwidth
improves drive utilization
improves failure prediction
prevents silent atrophy

Data Migration
Allow transparent disk upgrades and
data migration from failing devices
Apply VM principles to storage
DMU names blocks by 128-bit DVA
(Data Virtual Address)
high-order 64 bits specify metaslab
SPA translates metaslab to <vdev, offset>
SPA can migrate metaslabs from one vdev
to another without affecting any DMU state
Data remains available during migration
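
The indirection can be sketched as a small translation table. The slides state only that the high-order 64 bits of a DVA name the metaslab; treating the low-order 64 bits as an offset inside that metaslab, and the table layout below, are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1

def split_dva(dva: int):
    """Split a 128-bit DVA into (metaslab id, offset within the metaslab).

    High 64 bits = metaslab (from the slides); low 64 bits = offset
    (an assumption of this sketch).
    """
    return dva >> 64, dva & MASK64

# SPA-private table: metaslab id -> (vdev name, starting offset on that vdev)
metaslab_map = {5: ("disk0", 1_000_000)}

def resolve(dva: int):
    """Translate a DVA to a <vdev, offset> pair."""
    metaslab, offset = split_dva(dva)
    vdev, base = metaslab_map[metaslab]
    return vdev, base + offset

dva = (5 << 64) | 4096
before = resolve(dva)

# Migration: after copying the metaslab, the SPA updates one table entry.
# The DVA held by the DMU does not change.
metaslab_map[5] = ("disk3", 0)
after = resolve(dva)
```

This mirrors virtual memory: the name stays fixed while the physical location behind it moves.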

Real-Time Remote Replication


Everything in ZFS is an object
Every change is just a write to an object
Writes are always batched into TX groups
Contents of TX group can be sent async
Latency insensitive!
Occasional ACK for remote TX group commit


Organizing Principles
Simple administration
Extensible, modular design
Provable data integrity model
Always-available data
High performance


High Performance
Write Sequentialization
Dynamic striping
Parallel three-phase TX groups
Intelligent prefetch
Multiple block sizes
Sync semantics at async speed
Concurrent, constant-time directory ops
POSIX-compliant concurrent writes
Hot space

Write Sequentialization
Traditional FS: random file writes become
random disk writes
ZFS: random file writes become
sequential disk writes
follows from COW model
modified blocks are newly allocated
SPA has complete allocation freedom
SPA chooses sequential free blocks
Cost of writing extra ZFS metadata more than
offset by improved locality
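
A toy allocator shows the effect. `SequentialAllocator` is a hypothetical name, and the model ignores freeing and reuse of old blocks; it only illustrates that the disk location is chosen at write time, independent of the file offset.

```python
class SequentialAllocator:
    """Hands out the next free disk block, whatever the file offset is."""
    def __init__(self):
        self.next_free = 0
        self.block_map = {}      # file block number -> disk block number

    def write(self, file_block):
        # COW: never overwrite in place; always take a fresh block.
        disk_block = self.next_free
        self.next_free += 1
        self.block_map[file_block] = disk_block
        return disk_block

alloc = SequentialAllocator()
random_file_blocks = [907, 12, 455, 3, 12]     # scattered file offsets
disk_blocks = [alloc.write(b) for b in random_file_blocks]
```

The scattered file offsets land on consecutive disk blocks, and rewriting block 12 simply remaps it to the newest location.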

Dynamic Striping
Traditional striping: spread data across multiple
devices at fixed stride
Inflexible: can't change stripe width,
can't add or remove devices
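[Diagram: blocks striped across five devices at fixed stride]
device 0: blocks 0, 5, 10, ...
device 1: blocks 1, 6, 11, ...
device 2: blocks 2, 7, 12, ...
device 3: blocks 3, 8, 13, ...
device 4: blocks 4, 9, 14, ...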

Dynamic striping: round-robin allocation


balances writes across all available devices
enabled by COW model
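
Round-robin allocation can be sketched as follows. `StripedPool` is a hypothetical class; a real allocator would presumably also weigh free space and device health, which this sketch leaves out.

```python
from itertools import count

class StripedPool:
    def __init__(self, devices):
        self.devices = list(devices)
        self.cursor = count()                    # ever-increasing counter
        self.next_block = {d: 0 for d in self.devices}

    def add_device(self, device):
        # No restriping needed: new writes simply start using the device.
        self.devices.append(device)
        self.next_block[device] = 0

    def allocate(self):
        """Pick the next device round-robin and a fresh block on it."""
        device = self.devices[next(self.cursor) % len(self.devices)]
        block = self.next_block[device]
        self.next_block[device] += 1
        return device, block

pool = StripedPool(["d0", "d1"])
first = [pool.allocate()[0] for _ in range(4)]
pool.add_device("d2")
later = [pool.allocate()[0] for _ in range(3)]
```

Because blocks are never rewritten in place, existing data can stay where it is when the device set changes; only new allocations see the wider stripe.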

Three-Phase Transaction Groups


Open: accepting new transactions
Quiescing: waiting for transactions to finish
Syncing: pushing changes to disk
Up to three transaction groups active
one in each state - prevents burstiness
uses all available disk bandwidth
[Diagram: timeline of transaction groups passing through Open, Quiescing, Syncing, and Closed]
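
The three states form a pipeline. The sketch below assumes every group advances one state per step, which is a simplification (real groups would advance when their work completes); `TxgPipeline` is a hypothetical name.

```python
from collections import deque

STATES = ["open", "quiescing", "syncing"]

class TxgPipeline:
    def __init__(self):
        self.next_id = 1
        self.active = deque([(self.next_id, "open")])    # oldest first

    def advance(self):
        """Move every group one state forward and open a new group.

        Returns the id of the group that finished syncing, if any.
        """
        closed = None
        moved = deque()
        for txg, state in self.active:
            i = STATES.index(state)
            if i + 1 == len(STATES):
                closed = txg                  # synced to disk: now closed
            else:
                moved.append((txg, STATES[i + 1]))
        self.next_id += 1
        moved.append((self.next_id, "open"))
        self.active = moved
        return closed

pipe = TxgPipeline()
closed = [pipe.advance() for _ in range(3)]
```

Once the pipeline fills, exactly one group sits in each state, so new transactions are accepted while an older group is being written out.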

Multiple Block Sizes


No block size is optimal for everything
large blocks: less metadata
small blocks: more efficient for small objects
record-structured files have natural granularity;
we want to match it to avoid read/modify/write
ZFS supports any power of two block size
Per-object granularity
automatic block size selection by default
manual override
Enables transparent block-based compression

Multiple Block Sizes, cont'd


Why not extents?
extents don't COW: writes force extent breaks
greater code complexity
Multiple block sizes combine the simplicity
of blocks with the metadata savings of extents


Sync Semantics at Async Speed


Review: ZFS is always self-consistent on disk
However: after system crash, ZFS won't contain
transactions since last sync
Use intent log to recover recent transactions
log metadata only: lose recent writes (UFS)
log user + metadata: recover everything (NFS)
log to disk: wait for one sequential disk write
log to NVRAM on I/O bus: fast (NetApp filers)
log to NVRAM on main memory bus: blazing
Ideal configuration: log all ops to NVRAM
need HW/sales/marketing on board
big payoff: only a system vendor can do this

Fast Directory Operations


Large directories: need constant-time operations
(lookup, create, delete)
Hot directories: need concurrent operations
Solution: extendible hashing
block-based
amortized growth cost
short chains for constant-time ops
per-block locking for high concurrency
readdir: returns entries in hash-value order
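
A compact in-memory sketch of extendible hashing, to make the bullet points concrete. It is not the ZAP's on-disk format: the class names, the bucket capacity, and the use of SHA-256 as the hash are all assumptions, and per-block locking is omitted.

```python
import hashlib

class Bucket:
    def __init__(self, depth):
        self.depth = depth                # local depth of this block
        self.items = {}

class ExtendibleHash:
    def __init__(self, capacity=4):
        self.capacity = capacity          # entries per block
        self.global_depth = 0
        self.directory = [Bucket(0)]

    @staticmethod
    def _hash(name):
        digest = hashlib.sha256(name.encode()).digest()
        return int.from_bytes(digest[:8], "big")     # stand-in hash

    def _bucket(self, name):
        mask = (1 << self.global_depth) - 1
        return self.directory[self._hash(name) & mask]

    def lookup(self, name):
        return self._bucket(name).items.get(name)

    def delete(self, name):
        self._bucket(name).items.pop(name, None)

    def create(self, name, value):
        bucket = self._bucket(name)
        if name in bucket.items or len(bucket.items) < self.capacity:
            bucket.items[name] = value
            return
        self._split(bucket)
        self.create(name, value)          # retry after the split

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            self.directory = self.directory + self.directory   # double
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        bit = 1 << (bucket.depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = sibling
        old, bucket.items = bucket.items, {}
        for k, v in old.items():
            self._bucket(k).items[k] = v

    def readdir(self):
        names = {k for b in self.directory for k in b.items}
        return sorted(names, key=self._hash)          # hash-value order

d = ExtendibleHash(capacity=4)
for i in range(100):
    d.create(f"file{i}", i)
d.delete("file7")
```

Each operation touches one directory slot and one bucket, and a full bucket splits on its own without rehashing the whole table, which is where the amortized growth cost comes from.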


Concurrent Writes
Existing filesystems force trade-off between
POSIX compliance and write concurrency
ZFS employs byte-range locking to allow maximum
concurrency while satisfying POSIX overlapping
write semantics
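
The core of byte-range locking is an interval overlap test. The sketch below is non-blocking for simplicity (a real lock would make the caller wait rather than return False), and `RangeLockTable` is a hypothetical name.

```python
class RangeLockTable:
    """Tracks held byte ranges as half-open [start, end) intervals."""
    def __init__(self):
        self.held = []

    def try_lock(self, start, end):
        """Grant the range unless it overlaps one already held."""
        for s, e in self.held:
            if start < e and s < end:         # the two intervals overlap
                return False
        self.held.append((start, end))
        return True

    def unlock(self, start, end):
        self.held.remove((start, end))

locks = RangeLockTable()
a = locks.try_lock(0, 4096)          # writer A
b = locks.try_lock(8192, 12288)      # writer B is disjoint: runs in parallel
c = locks.try_lock(2048, 6144)       # writer C overlaps A: must wait
locks.unlock(0, 4096)
c_retry = locks.try_lock(2048, 6144)
```

Only writers whose ranges actually collide are serialized; a whole-file lock would have serialized all three.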
[Diagram: parallel read/write vs. serialized]

Compression
Block-level compression in SPA
transparent to all other layers
enabled by multiple block size support
DMU translations: all 8k
SPA block allocations: vary with compression (8k, 4k, 2k, 8k in the diagram)
Per-file, per-filesystem, or per-pool
Vectoring for different compression functions
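
How multiple block sizes enable compression can be sketched with zlib as a stand-in compressor. The 512-byte minimum allocation and the helper names are assumptions, and the sketch does not record whether a block was stored compressed, which a real implementation would need for reads.

```python
import os
import zlib

def allocation_size(nbytes, minimum=512):
    """Smallest power-of-two size (>= minimum) that holds nbytes."""
    size = minimum
    while size < nbytes:
        size *= 2
    return size

def store(logical: bytes):
    """Compress a logical block; return (allocated size, stored bytes)."""
    packed = zlib.compress(logical)
    if len(packed) >= len(logical):
        packed = logical                  # incompressible: store as-is
    return allocation_size(len(packed)), packed

LOGICAL = 8192                            # every logical block is 8k
text_block = (b"the quick brown fox " * 500)[:LOGICAL]
zero_block = bytes(LOGICAL)
random_block = os.urandom(LOGICAL)

text_alloc, _ = store(text_block)
zero_alloc, _ = store(zero_block)
random_alloc, _ = store(random_block)
```

The upper layers see 8k blocks throughout, while the space actually allocated varies with how well each block compresses.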

Encryption and Data Security


Block-level encryption in SPA
transparent to all other layers
supports any symmetric block cipher mode:
DES, AES, IDEA, RC6, Blowfish, SEAL, OCB...
Per-filesystem or per-pool
Vectoring for different encryption functions
Data authentication via secure checksums
Open issues:
key management
larger data security model

Futures
POSIX isn't the only game in town
DMU as native Oracle API
Object-based appliances
agnostic: NFS, database, volume emulation
DMU as "foundation classes"
[Diagram: layered stack with UFS, ZPL, NFS, Oracle, *FS, and raw consumers at the top, then zvol, DMU, and SPA]

Case Study: Jurassic on UFS/SVM


Upgrading disks
major down time
significant manual labor
FS-to-user mapping
single FS impossible: exceeds 1TB
FS per user impractical: fragments storage
/var/mail
create/delete .lock files: serial and slow


Case Study: Jurassic on UFS/SVM


Quotas and reservations
quotas: too expensive and broken to use
reservations: no such concept
User error recovery
restore from tape
last 24 hours lost
Reliability / Availability
several instances of data loss this year
hours of down time for fsck(1M)


Case Study: Jurassic on ZFS


Upgrading disks
add new disks to storage pool
remove old disks from storage pool
(SPA auto-migrates the data)
FS-to-user mapping
single FS possible
FS per user better: enables per-user
reservations, snapshots, encryption, etc.
/var/mail
create/delete .lock files: parallel and fast

Case Study: Jurassic on ZFS


Quotas and reservations
per-filesystem: e.g. per-workspace,
per-home directory, per-project
User error recovery
user undo
restore from snapshot
either way, no sysadmin required
Reliability / Availability
no fsck(1M); ZFS is always mountable
provable data integrity model

Where Are We Now?


"Hello world" on Oct 31, 2002
complete POSIX-compliant filesystem
most key features working: pooled storage,
crash resilience, self-healing data
Full builds of ON10 on ZFS filesystems
zvol driver used for MTB-UFS test/bringup
Still plenty to do
intent log, snapshots, perf work
internal alpha program
Phase 1 putback in October

ZFS:
The Zettabyte Filesystem
Please send questions, comments and ideas to:
zfs-team@sun.com
Want to follow ZFS developments? Join:
zfs-interest@sun.com
For the latest information, visit:
http://zfs.eng
