Google File System

Ravi Prakash
Overview

• Built from inexpensive commodity parts
• Component failures are the norm, rather than the exception
• Multi-GB files are common
• Dominant file update mode is append, rather than overwrite
• Reads tend to be sequential
• Co-designing application and file system APIs is beneficial for the intended application scenario



Assumptions

• System must detect, tolerate, and recover quickly from component failures
• Optimize for large file sizes
• Large streaming reads and small random reads
• Large sequential writes: appends
• Infrequent small writes to arbitrary file offsets
• Multiple clients may concurrently append to the same file
• Atomic updates with minimal synchronization desirable
• High bandwidth more desirable than low latency



Operations

• Usual operations like file create, delete, open, close, read and write
• Snapshot: copy a file or directory tree
• Record append:
  • Useful for multi-way merge without locking
  • Producer-consumer queues



Typical Application Overview (MapReduce)

Source: Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004
Architecture

• Master: maintains file system metadata
• Chunkservers
  • Commodity Linux machines running a user-level server process
• Clients
• Files
  • Divided into chunks of size 64 MB (see the sketch below)
  • Chunks replicated on multiple chunkservers: default = 3 replicas
• HeartBeat messages between Master and Chunkservers
• Each chunkserver reports a subset of the chunks it holds:
  • Over multiple HeartBeats, all chunks are reported
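A minimal sketch, not GFS code, of the arithmetic a client library could use to turn a byte offset into a chunk index and an offset within that chunk, assuming only the fixed 64 MB chunk size above; the locate() name is illustrative.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size from the slide above

def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset within that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read starting at byte 200,000,000 lands in the third chunk (index 2).
print(locate(200_000_000))
```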



Architecture

Source: Ghemawat et al., “The Google File System,” ACM SOSP 2003.
Single Master: not a bottleneck

• Master stores and provides file system metadata in memory (see the sketch below)
  • File and chunk namespaces
  • Mapping from files to chunks
    • Namespaces and the file-to-chunk mapping are also kept persistent in the operation log
  • 64 bytes of metadata per chunk
  • Chunk handle: 64-bit, assigned at chunk creation
  • Chunk locations (not persisted; learned from chunkservers)
• Data moves directly between clients and chunkservers, not through the master
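A minimal sketch of the kind of in-memory tables the master could keep for the metadata listed above; the dataclass and field names are illustrative assumptions, not the actual GFS structures.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                 # 64-bit handle assigned at chunk creation
    version: int = 0            # bumped when a new lease is granted
    locations: list[str] = field(default_factory=list)  # chunkserver addresses (not persisted)

@dataclass
class MasterMetadata:
    # File namespace: path -> ordered list of chunk handles (persisted via the operation log)
    files: dict[str, list[int]] = field(default_factory=dict)
    # Chunk handle -> per-chunk record (roughly the "64 bytes per chunk" above)
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)
```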



Tracking Chunk Locations

• Chunkservers have authoritative information about the chunks they contain
• Master requests this data:
  • On start-up, and
  • Periodically, via HeartBeat messages (sketch below)
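A minimal sketch of how the master could rebuild its chunk-location table from HeartBeat reports, assuming each report carries only a subset of a chunkserver's chunks; the class and method names are illustrative.

```python
from collections import defaultdict

class LocationTable:
    """Master-side view of chunk locations, rebuilt entirely from reports."""

    def __init__(self) -> None:
        # chunk handle -> set of chunkserver addresses believed to hold it
        self.locations: dict[int, set[str]] = defaultdict(set)

    def apply_heartbeat(self, server: str, reported_handles: list[int]) -> None:
        # Each HeartBeat carries only some of the server's chunks; over many
        # HeartBeats the full set gets reported, so we only ever add here.
        for handle in reported_handles:
            self.locations[handle].add(server)

table = LocationTable()
table.apply_heartbeat("cs-17:7001", [101, 102])
table.apply_heartbeat("cs-04:7001", [101])
# Chunk 101 is now known to live on both servers.
```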



Operation Log

• Replicated on multiple machines
• Master checkpoints its state when the log becomes large
  • Helps keep the log small
• Master recovery from failures (sketch below):
  • Load the latest checkpoint
  • Replay operations logged after the checkpoint
• Master responds to client requests only after flushing the corresponding log record to stable storage
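A minimal sketch of checkpoint-plus-replay recovery, assuming a JSON checkpoint file and a log with one JSON mutation record per line; the file formats and record types are illustrative, not the real log format.

```python
import json

def apply_mutation(state: dict, record: dict) -> None:
    """Apply one logged metadata mutation; only two record types shown."""
    if record["op"] == "create":
        state["files"][record["path"]] = []
    elif record["op"] == "add_chunk":
        state["files"][record["path"]].append(record["handle"])

def recover(checkpoint_path: str, log_path: str) -> dict:
    """Load the latest checkpoint, then replay mutations logged after it."""
    with open(checkpoint_path) as f:
        state = json.load(f)              # e.g. {"files": {"/a": [7, 8]}}
    with open(log_path) as f:
        for line in f:                    # one JSON record per logged mutation
            apply_mutation(state, json.loads(line))
    return state
```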



Chunk Leases and Mutation Order

• Primary chunkserver for a given chunk C:
  • A replica selected by the master
  • Holds a finite-duration, renewable, and revocable lease granted by the master
  • 60 seconds: typical lease duration
  • Chunk version number is increased by the master on all available replicas when a new lease is granted: helps detect stale replicas (sketch below)
• Primary determines the serial order of concurrent updates by clients to a chunk
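A minimal sketch of lease granting with a version-number bump, assuming the roughly 60-second lease above; how the master actually picks the primary and tracks expiry is simplified, and all names are illustrative.

```python
import time

LEASE_SECONDS = 60          # typical lease duration from the slide above

class ChunkLeaseState:
    def __init__(self, handle: int, replicas: list[str]) -> None:
        self.handle = handle
        self.replicas = replicas        # chunkservers currently holding the chunk
        self.version = 0
        self.primary = None             # address of the current primary, or None
        self.lease_expiry = 0.0

    def grant_lease(self) -> str:
        """Pick a primary and bump the version on all available replicas."""
        now = time.time()
        if self.primary is None or now >= self.lease_expiry:
            self.version += 1                 # replicas missing this bump are stale
            self.primary = self.replicas[0]   # master may pick any replica
            self.lease_expiry = now + LEASE_SECONDS
        return self.primary
```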



Fault Tolerance

• Chunk replication
• Master operation log and checkpoints replicated on multiple machines
• A failed master process can be restarted immediately
• If the master's machine or disk fails: a new master is started on another machine
  • Clients use a canonical name for the master, which is a DNS alias
• Shadow masters: provide read-only access to clients when the primary master is down



Write Control and Data Flow

• Data pushed in a pipelined fashion (sketch below)
  • Fully utilize each machine's network bandwidth
  • Avoid network bottlenecks
  • Minimize latency
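A minimal sketch of the pipelined push, with replicas modeled as in-process objects that forward each piece as soon as it arrives; real chunkservers forward over TCP and buffer data in an LRU cache, so the classes and piece size here are illustrative assumptions.

```python
class Replica:
    """Toy stand-in for a chunkserver in the data-push chain."""

    def __init__(self, name: str, next_hop: "Replica | None" = None) -> None:
        self.name = name
        self.next_hop = next_hop
        self.buffer = bytearray()

    def receive(self, piece: bytes) -> None:
        self.buffer.extend(piece)            # stage the data locally
        if self.next_hop is not None:
            self.next_hop.receive(piece)     # forward immediately: the pipeline

def push(data: bytes, head: Replica, piece_size: int = 64 * 1024) -> None:
    """Client sends data only to the nearest replica; the chain forwards it."""
    for start in range(0, len(data), piece_size):
        head.receive(data[start:start + piece_size])

# Chain chosen by network distance, independent of which replica is primary.
s3 = Replica("S3"); s2 = Replica("S2", s3); s1 = Replica("S1", s2)
push(b"x" * 300_000, s1)
assert s1.buffer == s2.buffer == s3.buffer
```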

Source: Ghemawat et al., “The Google File System,” ACM SOSP 2003.



Atomic Record Appends

• Client only provides the data to be appended
  • Not the offset at which to append it
• Data is appended at least once, atomically
  • The offset chosen by the primary is returned to the client
• Additional logic (sketch below):
  • Maximum size of an append = 16 MB (1/4 of the maximum chunk size)
  • If there is insufficient space in the chunk, the primary:
    • Pads the chunk locally and at the secondary replicas
    • Informs the client to retry on the next chunk
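A minimal sketch of the primary's append decision under the 64 MB chunk and 16 MB append limit above; the return values are illustrative, and the real protocol also forwards the record and the chosen offset to the secondary replicas.

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_APPEND = CHUNK_SIZE // 4        # 16 MB limit from the slide above

def try_record_append(chunk_used: int, record_len: int) -> tuple[str, int]:
    """Primary's decision: return ("appended", offset) or ("retry", bytes padded)."""
    if record_len > MAX_APPEND:
        raise ValueError("record exceeds the append size limit")
    if chunk_used + record_len <= CHUNK_SIZE:
        # The primary chooses the offset and tells the secondaries to use it too.
        return "appended", chunk_used
    # Not enough room: pad the remainder (here and on secondaries) and make
    # the client retry on the next chunk.
    return "retry", CHUNK_SIZE - chunk_used

print(try_record_append(10 * 2**20, 1 * 2**20))    # ('appended', 10485760)
print(try_record_append(63 * 2**20, 2 * 2**20))    # ('retry', 1048576)
```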



Snapshot

• Copy-on-write (sketch below)
• Master revokes all outstanding leases on the chunks of the file to be snapshotted
• Master takes the snapshot: logs the operation to disk
  • Duplicates the corresponding in-memory metadata
  • Snapshot and source file initially point to the same chunks
• Master, on the first client write to a chunk C after the snapshot:
  • Picks a new chunk handle C'
  • Asks every chunkserver that holds C to create a local copy C'
  • Grants one of them the primary lease on C'
  • Directs the client's writes to C'
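A minimal sketch of chunk-level copy-on-write using reference counts, assuming a toy master that tracks only file-to-chunk mappings; the handle allocation and refcount scheme are illustrative, not the actual implementation.

```python
import itertools

class SnapshotMaster:
    """Toy master: file -> chunk handles, plus per-chunk reference counts."""

    def __init__(self) -> None:
        self._handles = itertools.count(1000)        # hypothetical handle allocator
        self.files: dict[str, list[int]] = {}
        self.refcount: dict[int, int] = {}

    def create(self, path: str, nchunks: int) -> None:
        self.files[path] = [next(self._handles) for _ in range(nchunks)]
        for h in self.files[path]:
            self.refcount[h] = 1

    def snapshot(self, src: str, dst: str) -> None:
        # Leases on src's chunks would be revoked first; then only metadata is copied.
        self.files[dst] = list(self.files[src])       # both files share the chunks
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write(self, path: str, index: int) -> int:
        """Return the chunk handle the client should write to; copy if shared."""
        h = self.files[path][index]
        if self.refcount[h] > 1:                      # first write after a snapshot
            self.refcount[h] -= 1
            new_h = next(self._handles)               # chunkservers copy C -> C' locally
            self.refcount[new_h] = 1
            self.files[path][index] = new_h
            return new_h
        return h

m = SnapshotMaster()
m.create("/home/user/db", 2)
m.snapshot("/home/user/db", "/save/user/db")
print(m.write("/home/user/db", 0))   # a new handle: the shared chunk was copied
```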
Namespace Management and Locking

Example: prevent new file creation in a directory while it is being snapshotted (sketch below)
• Directory: /home/user
• Snapshot destination: /save/user
• New file being created: /home/user/foo
• Snapshot operation acquires:
  • Read locks on /home and /save
  • Write locks on /home/user and /save/user
• Conflicting operation: the file creation tries to acquire
  • Read locks on /home and /home/user
  • Write lock on /home/user/foo
  • The creation's read lock on /home/user conflicts with the snapshot's write lock on it, so the two operations are serialized
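A minimal sketch of the locking rule in this example: read locks on every ancestor directory plus a read or write lock on the final path, with two operations serialized when either needs a write lock on a shared path. There is no real concurrency here, and the helper names are illustrative.

```python
def ancestors(path: str) -> list[str]:
    """All proper ancestor directories of a path, e.g. /home for /home/user."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def lock_set(path: str, write: bool) -> dict[str, str]:
    """Read locks on every ancestor, plus a read or write lock on the path itself."""
    locks = {p: "read" for p in ancestors(path)}
    locks[path] = "write" if write else "read"
    return locks

def conflicts(a: dict[str, str], b: dict[str, str]) -> bool:
    """Two lock sets conflict if either holds a write lock on a shared path."""
    return any(a[p] == "write" or b[p] == "write" for p in set(a) & set(b))

# Snapshotting /home/user into /save/user vs. creating /home/user/foo:
snap = {**lock_set("/home/user", write=True), **lock_set("/save/user", write=True)}
create = lock_set("/home/user/foo", write=True)
print(conflicts(snap, create))   # True: write lock on /home/user vs. read lock on it
```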
Replica Placement

• Balancing disk space utilization: place new replicas on chunkservers with below-average disk space utilization (sketch below)
• Load balancing: limit the number of recent chunk creations on each chunkserver
  • A creation predicts imminent heavy write traffic
• Fault tolerance: spread replicas across racks
• Chunk re-replication: when the number of replicas falls below the user-specified goal
• Replica rebalancing: for better disk space and load balancing
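A minimal sketch of the creation-time placement heuristics above (emptier disks first, a cap on recent creations, at most one replica per rack); the fields, thresholds, and greedy selection are illustrative assumptions, not the actual policy.

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_used: float          # fraction of disk space in use
    recent_creations: int     # chunks created here recently

def place_replicas(servers: list[Chunkserver], n: int = 3, max_recent: int = 10) -> list[Chunkserver]:
    """Prefer emptier disks, skip busy creators, use at most one server per rack.
    May return fewer than n servers if too few candidates qualify."""
    chosen: list[Chunkserver] = []
    used_racks: set[str] = set()
    for s in sorted(servers, key=lambda s: s.disk_used):      # emptiest disks first
        if len(chosen) == n:
            break
        if s.recent_creations >= max_recent:                  # imminent write hotspot
            continue
        if s.rack in used_racks:                              # spread across racks
            continue
        chosen.append(s)
        used_racks.add(s.rack)
    return chosen

servers = [
    Chunkserver("cs-01", "rackA", 0.42, 1),
    Chunkserver("cs-02", "rackA", 0.10, 2),
    Chunkserver("cs-03", "rackB", 0.55, 12),   # too many recent creations
    Chunkserver("cs-04", "rackB", 0.30, 0),
    Chunkserver("cs-05", "rackC", 0.61, 3),
]
print([s.name for s in place_replicas(servers)])   # ['cs-02', 'cs-04', 'cs-05']
```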

