Google File System

Ravi Prakash
Overview

• Built from inexpensive commodity parts
• Component failures are the norm, rather than the exception
• Multi-GB files are common
• Dominant file update mode is append, rather than overwrite
• Reads tend to be sequential
• Co-designing application and file system APIs is beneficial for the intended application scenario



Assumptions

• System must detect, tolerate, and recover quickly from component failures
• Optimize for large file sizes
• Large streaming reads and small random reads
• Large sequential writes: appends
• Infrequent small writes to arbitrary file offsets
• Multiple clients may concurrently append to the same file
• Atomic updates with minimal synchronization desirable
• High bandwidth more desirable than low latency



Operations

• Usual operations like file create, delete, open, close, read and write
• Snapshot: copy a file or directory tree
• Record append:
  • Useful for multi-way merge without locking
  • Producer-consumer queues



Typical Application Overview (MapReduce)

Source: Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004
Architecture

• Master: maintains file system metadata
• Chunkservers
  • Commodity Linux machines running a user-level server process
• Clients
• Files
  • Divided into chunks of size 64 MB (see the sketch below)
  • Chunks replicated on multiple chunkservers: default = 3 replicas
• HeartBeat messages between Master and Chunkservers
• Each chunkserver reports a subset of the chunks it holds:
  • Over multiple HeartBeats, all chunks are reported
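A minimal sketch, not GFS code, of the arithmetic a client library could use to turn a byte offset into a chunk index and an offset within that chunk, assuming only the fixed 64 MB chunk size above; the locate() name is illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size from the slide above

def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset within that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read starting at byte 200,000,000 lands in the third chunk (index 2).
print(locate(200_000_000))
```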



Architecture

Source: Ghemawat et al., “The Google File System,” ACM SOSP 2003.
Single Master: not a bottleneck

• Master stores and provides file system metadata in memory (see the sketch below)
  • File and chunk namespaces
  • Mapping from files to chunks
    • Namespaces and the file-to-chunk mapping are also kept persistent in the operation log
  • 64 bytes of metadata per chunk
  • Chunk handle: 64-bit, assigned at chunk creation
  • Chunk locations (not persisted; learned from chunkservers)
• Data moves directly between clients and chunkservers, not through the master
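A minimal sketch of the kind of in-memory tables the master could keep for the metadata listed above; the dataclass and field names are illustrative assumptions, not the actual GFS structures.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                 # 64-bit handle assigned at chunk creation
    version: int = 0            # bumped when a new lease is granted
    locations: list[str] = field(default_factory=list)  # chunkserver addresses (not persisted)

@dataclass
class MasterMetadata:
    # File namespace: path -> ordered list of chunk handles (persisted via the operation log)
    files: dict[str, list[int]] = field(default_factory=dict)
    # Chunk handle -> per-chunk record (roughly the "64 bytes per chunk" above)
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)
```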



Tracking Chunk Locations

• Chunkservers have authoritative information about the chunks they contain
• Master requests this data:
  • On start-up, and
  • Periodically, via HeartBeat messages (sketch below)
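A minimal sketch of how the master could rebuild its chunk-location table from HeartBeat reports, assuming each report carries only a subset of a chunkserver's chunks; the class and method names are illustrative.

```python
from collections import defaultdict

class LocationTable:
    """Master-side view of chunk locations, rebuilt entirely from reports."""

    def __init__(self) -> None:
        # chunk handle -> set of chunkserver addresses believed to hold it
        self.locations: dict[int, set[str]] = defaultdict(set)

    def apply_heartbeat(self, server: str, reported_handles: list[int]) -> None:
        # Each HeartBeat carries only some of the server's chunks; over many
        # HeartBeats the full set gets reported, so we only ever add here.
        for handle in reported_handles:
            self.locations[handle].add(server)

table = LocationTable()
table.apply_heartbeat("cs-17:7001", [101, 102])
table.apply_heartbeat("cs-04:7001", [101])
# Chunk 101 is now known to live on both servers.
```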



Operation Log

• Replicated on multiple machines
• Master checkpoints its state when the log becomes large
  • Helps keep the log small
• Master recovery from failures (sketch below):
  • Load the latest checkpoint
  • Replay operations logged after the checkpoint
• Master responds to client requests only after flushing the corresponding log record to stable storage
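A minimal sketch of checkpoint-plus-replay recovery, assuming a JSON checkpoint file and a log with one JSON mutation record per line; the file formats and record types are illustrative, not the real log format.

```python
import json

def apply_mutation(state: dict, record: dict) -> None:
    """Apply one logged metadata mutation; only two record types shown."""
    if record["op"] == "create":
        state["files"][record["path"]] = []
    elif record["op"] == "add_chunk":
        state["files"][record["path"]].append(record["handle"])

def recover(checkpoint_path: str, log_path: str) -> dict:
    """Load the latest checkpoint, then replay mutations logged after it."""
    with open(checkpoint_path) as f:
        state = json.load(f)              # e.g. {"files": {"/a": [7, 8]}}
    with open(log_path) as f:
        for line in f:                    # one JSON record per logged mutation
            apply_mutation(state, json.loads(line))
    return state
```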



Chunk Leases and Mutation Order

• Primary chunkserver for a given chunk C:
  • A replica selected by the master
  • Holds a finite-duration, renewable, and revocable lease granted by the master
  • 60 seconds: typical lease duration
  • Chunk version number is increased by the master on all available replicas when a new lease is granted: helps detect stale replicas (sketch below)
• Primary determines the serial order of concurrent updates by clients to a chunk
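A minimal sketch of lease granting with a version-number bump, assuming the roughly 60-second lease above; how the master actually picks the primary and tracks expiry is simplified, and all names are illustrative.

```python
import time

LEASE_SECONDS = 60          # typical lease duration from the slide above

class ChunkLeaseState:
    def __init__(self, handle: int, replicas: list[str]) -> None:
        self.handle = handle
        self.replicas = replicas        # chunkservers currently holding the chunk
        self.version = 0
        self.primary = None             # address of the current primary, or None
        self.lease_expiry = 0.0

    def grant_lease(self) -> str:
        """Pick a primary and bump the version on all available replicas."""
        now = time.time()
        if self.primary is None or now >= self.lease_expiry:
            self.version += 1                 # replicas missing this bump are stale
            self.primary = self.replicas[0]   # master may pick any replica
            self.lease_expiry = now + LEASE_SECONDS
        return self.primary
```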



Fault Tolerance

• Chunk replication
• Master operation log and checkpoints replicated on multiple machines
• A failed master process can be restarted immediately
• If the master's machine or disk fails: a new master is started on another machine
  • Clients use a canonical name for the master, which is a DNS alias
• Shadow masters: provide read-only access to clients when the primary master is down



Write Control and Data Flow

• Data pushed in a pipelined fashion (sketch below)
  • Fully utilize each machine's network bandwidth
  • Avoid network bottlenecks
  • Minimize latency
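A minimal sketch of the pipelined push, with replicas modeled as in-process objects that forward each piece as soon as it arrives; real chunkservers forward over TCP and buffer data in an LRU cache, so the classes and piece size here are illustrative assumptions.

```python
class Replica:
    """Toy stand-in for a chunkserver in the data-push chain."""

    def __init__(self, name: str, next_hop: "Replica | None" = None) -> None:
        self.name = name
        self.next_hop = next_hop
        self.buffer = bytearray()

    def receive(self, piece: bytes) -> None:
        self.buffer.extend(piece)            # stage the data locally
        if self.next_hop is not None:
            self.next_hop.receive(piece)     # forward immediately: the pipeline

def push(data: bytes, head: Replica, piece_size: int = 64 * 1024) -> None:
    """Client sends data only to the nearest replica; the chain forwards it."""
    for start in range(0, len(data), piece_size):
        head.receive(data[start:start + piece_size])

# Chain chosen by network distance, independent of which replica is primary.
s3 = Replica("S3"); s2 = Replica("S2", s3); s1 = Replica("S1", s2)
push(b"x" * 300_000, s1)
assert s1.buffer == s2.buffer == s3.buffer
```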

Source: Ghemawat et al., “The Google File System,” ACM SOSP 2003.



Atomic Record Appends

• Client only provides the data to be appended
  • Not the offset at which to append it
• Data is appended at least once, atomically
  • The offset chosen by the primary is returned to the client
• Additional logic (sketch below):
  • Maximum size of an append = 16 MB (1/4 of the maximum chunk size)
  • If there is insufficient space in the chunk, the primary:
    • Pads the chunk locally and at the secondary replicas
    • Informs the client to retry on the next chunk
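A minimal sketch of the primary's append decision under the 64 MB chunk and 16 MB append limit above; the return values are illustrative, and the real protocol also forwards the record and the chosen offset to the secondary replicas.

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_APPEND = CHUNK_SIZE // 4        # 16 MB limit from the slide above

def try_record_append(chunk_used: int, record_len: int) -> tuple[str, int]:
    """Primary's decision: return ("appended", offset) or ("retry", bytes padded)."""
    if record_len > MAX_APPEND:
        raise ValueError("record exceeds the append size limit")
    if chunk_used + record_len <= CHUNK_SIZE:
        # The primary chooses the offset and tells the secondaries to use it too.
        return "appended", chunk_used
    # Not enough room: pad the remainder (here and on secondaries) and make
    # the client retry on the next chunk.
    return "retry", CHUNK_SIZE - chunk_used

print(try_record_append(10 * 2**20, 1 * 2**20))    # ('appended', 10485760)
print(try_record_append(63 * 2**20, 2 * 2**20))    # ('retry', 1048576)
```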



Snapshot

• Copy-on-write (sketch below)
• Master revokes all outstanding leases on the chunks of the file to be snapshotted
• Master takes the snapshot: logs the operation to disk
  • Duplicates the corresponding in-memory metadata
  • Snapshot and source file initially point to the same chunks
• Master, on the first client write to a chunk C after the snapshot:
  • Picks a new chunk handle C'
  • Asks every chunkserver that holds C to create a local copy C'
  • Grants one of them the primary lease on C'
  • Directs the client's writes to C'
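A minimal sketch of chunk-level copy-on-write using reference counts, assuming a toy master that tracks only file-to-chunk mappings; the handle allocation and refcount scheme are illustrative, not the actual implementation.

```python
import itertools

class SnapshotMaster:
    """Toy master: file -> chunk handles, plus per-chunk reference counts."""

    def __init__(self) -> None:
        self._handles = itertools.count(1000)        # hypothetical handle allocator
        self.files: dict[str, list[int]] = {}
        self.refcount: dict[int, int] = {}

    def create(self, path: str, nchunks: int) -> None:
        self.files[path] = [next(self._handles) for _ in range(nchunks)]
        for h in self.files[path]:
            self.refcount[h] = 1

    def snapshot(self, src: str, dst: str) -> None:
        # Leases on src's chunks would be revoked first; then only metadata is copied.
        self.files[dst] = list(self.files[src])       # both files share the chunks
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write(self, path: str, index: int) -> int:
        """Return the chunk handle the client should write to; copy if shared."""
        h = self.files[path][index]
        if self.refcount[h] > 1:                      # first write after a snapshot
            self.refcount[h] -= 1
            new_h = next(self._handles)               # chunkservers copy C -> C' locally
            self.refcount[new_h] = 1
            self.files[path][index] = new_h
            return new_h
        return h

m = SnapshotMaster()
m.create("/home/user/db", 2)
m.snapshot("/home/user/db", "/save/user/db")
print(m.write("/home/user/db", 0))   # a new handle: the shared chunk was copied
```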
Namespace Management and Locking

Example: prevent new file creation in a directory while it is being snapshotted (sketch below)
• Directory: /home/user
• Snapshot destination: /save/user
• New file being created: /home/user/foo
• Snapshot operation acquires:
  • Read locks on /home and /save
  • Write locks on /home/user and /save/user
• Conflicting operation: the file creation tries to acquire
  • Read locks on /home and /home/user
  • Write lock on /home/user/foo
  • The creation's read lock on /home/user conflicts with the snapshot's write lock on it, so the two operations are serialized
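A minimal sketch of the locking rule in this example: read locks on every ancestor directory plus a read or write lock on the final path, with two operations serialized when either needs a write lock on a shared path. There is no real concurrency here, and the helper names are illustrative.

```python
def ancestors(path: str) -> list[str]:
    """All proper ancestor directories of a path, e.g. /home for /home/user."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def lock_set(path: str, write: bool) -> dict[str, str]:
    """Read locks on every ancestor, plus a read or write lock on the path itself."""
    locks = {p: "read" for p in ancestors(path)}
    locks[path] = "write" if write else "read"
    return locks

def conflicts(a: dict[str, str], b: dict[str, str]) -> bool:
    """Two lock sets conflict if either holds a write lock on a shared path."""
    return any(a[p] == "write" or b[p] == "write" for p in set(a) & set(b))

# Snapshotting /home/user into /save/user vs. creating /home/user/foo:
snap = {**lock_set("/home/user", write=True), **lock_set("/save/user", write=True)}
create = lock_set("/home/user/foo", write=True)
print(conflicts(snap, create))   # True: write lock on /home/user vs. read lock on it
```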
Replica Placement

• Balancing disk space utilization: place new replicas on chunkservers with below-average disk space utilization (sketch below)
• Load balancing: limit the number of recent chunk creations on each chunkserver
  • A creation predicts imminent heavy write traffic
• Fault tolerance: spread replicas across racks
• Chunk re-replication: when the number of replicas falls below the user-specified goal
• Replica rebalancing: for better disk space and load balancing
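A minimal sketch of the creation-time placement heuristics above (emptier disks first, a cap on recent creations, at most one replica per rack); the fields, thresholds, and greedy selection are illustrative assumptions, not the actual policy.

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_used: float          # fraction of disk space in use
    recent_creations: int     # chunks created here recently

def place_replicas(servers: list[Chunkserver], n: int = 3, max_recent: int = 10) -> list[Chunkserver]:
    """Prefer emptier disks, skip busy creators, use at most one server per rack.
    May return fewer than n servers if too few candidates qualify."""
    chosen: list[Chunkserver] = []
    used_racks: set[str] = set()
    for s in sorted(servers, key=lambda s: s.disk_used):      # emptiest disks first
        if len(chosen) == n:
            break
        if s.recent_creations >= max_recent:                  # imminent write hotspot
            continue
        if s.rack in used_racks:                              # spread across racks
            continue
        chosen.append(s)
        used_racks.add(s.rack)
    return chosen

servers = [
    Chunkserver("cs-01", "rackA", 0.42, 1),
    Chunkserver("cs-02", "rackA", 0.10, 2),
    Chunkserver("cs-03", "rackB", 0.55, 12),   # too many recent creations
    Chunkserver("cs-04", "rackB", 0.30, 0),
    Chunkserver("cs-05", "rackC", 0.61, 3),
]
print([s.name for s in place_replicas(servers)])   # ['cs-02', 'cs-04', 'cs-05']
```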

