
What is BIGDATA?

3 V's of BIGDATA
• Volume: petabyte-scale data
• Velocity: fast-arriving data, e.g. from social media and sensors
• Variety: structured, semi-structured, and unstructured data
What is Hadoop?
A new hardware and software approach to handling BIGDATA.
HDFS
A self-healing distributed filesystem running on
clusters of commodity hardware, intended for storing
large files with streaming data access patterns.
Principles of HDFS
• Highly fault-tolerant
• Designed to be deployed on low-cost hardware
• Highly scalable
• Provides high throughput access to application data
• Suitable for applications that have large data
sets (typically GBs to TBs)
• Portable across heterogeneous hardware and
operating system platforms
• No support for random updates but append is
allowed
HDFS Concepts
• A file is split into blocks for storage in HDFS. Blocks of the
same file are distributed across multiple machines in the
cluster.

• Concept of a block
  • The minimum amount of data that can be read or written
  • Size on a normal filesystem: a few kilobytes (~2 KB)
  • Size in HDFS: 64 MB by default in Hadoop 1.x (128 MB from
    Hadoop 2.x onward), and configurable per cluster
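The block model above implies simple arithmetic for how many blocks a file occupies. A short sketch (plain Python for illustration, not HDFS code); note that the last block may be smaller than the block size and wastes no extra space:

```python
def block_count(file_size_mb: int, block_size_mb: int = 64) -> int:
    """Number of HDFS blocks a file of the given size occupies (ceiling division)."""
    return -(-file_size_mb // block_size_mb)

# A 200 MB file with the default 64 MB block size uses 4 blocks;
# the last block stores only the remaining 8 MB.
print(block_count(200))        # 4
print(block_count(200, 128))   # 2
```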
Namenode = master
• Manages the filesystem namespace (the filesystem tree and
metadata for directories and files) and maintains the EditLog
file. The namespace image and EditLog are stored persistently
on disk.

• Stores the information about which Datanodes hold the blocks
of a given file. This information is kept in RAM and rebuilt
from the Datanodes' block reports.
Datanode = slave
• Serves as storage for data blocks

• Responsible for serving read and write requests from clients

• Sends periodic "heartbeats" to the Namenode, along with
block reports
Write Path in HDFS
Read Path in HDFS
Fault Tolerance and
Self-Healing in HDFS
Detecting DataNode Failures: HeartBeat
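The heartbeat mechanism named above can be sketched as a toy model (not Hadoop's actual implementation; in real HDFS, Datanodes heartbeat roughly every 3 seconds and a node is declared dead only after about 10 minutes of silence). The Namenode records the last heartbeat time per Datanode and treats any node silent for longer than a timeout as failed, which triggers re-replication of its blocks:

```python
import time

class HeartbeatMonitor:
    """Toy model of Namenode-side failure detection: a Datanode that has
    not sent a heartbeat within `timeout` seconds is considered dead."""

    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout
        self.last_seen = {}  # datanode id -> timestamp of last heartbeat

    def heartbeat(self, node: str, now: float = None):
        # Record a heartbeat; `now` can be injected for testing.
        self.last_seen[node] = time.time() if now is None else now

    def dead_nodes(self, now: float = None):
        # Any node whose last heartbeat is older than the timeout is dead;
        # its blocks would then be re-replicated on surviving Datanodes.
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

mon = HeartbeatMonitor(timeout=30)
mon.heartbeat("dn1", now=0.0)
mon.heartbeat("dn2", now=20.0)
print(mon.dead_nodes(now=45.0))   # ['dn1'] -- dn1 missed its window
```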
Filesystem Metadata
• The HDFS namespace is stored by Namenode.
• The Namenode uses a transaction log called the EditLog
to record every change that occurs to the filesystem
metadata.
– For example, creating a new file
– Changing the replication factor of a file
– The EditLog is stored in the Namenode's local filesystem
• The entire filesystem namespace, including the mapping of
blocks to files and filesystem properties, is stored in a
file called FsImage, also kept in the Namenode's local
filesystem.
• The Namenode periodically merges the EditLog into the
FsImage (a checkpoint).
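The checkpoint described above can be illustrated with a toy model (plain Python, not Hadoop code; the operation names are invented for illustration): the FsImage is a snapshot of the namespace, the EditLog is the list of changes since that snapshot, and a checkpoint replays the log onto the image and empties it.

```python
def checkpoint(fsimage: dict, editlog: list) -> tuple:
    """Replay the EditLog onto the FsImage snapshot and clear the log."""
    image = dict(fsimage)  # namespace snapshot: path -> replication factor
    for op, path, value in editlog:
        if op == "create":
            image[path] = value
        elif op == "set_replication":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
    return image, []  # the EditLog starts empty again after the merge

fsimage = {"/user/cloudera/a.txt": 3}
editlog = [("create", "/user/cloudera/b.txt", 3),
           ("set_replication", "/user/cloudera/a.txt", 2)]
fsimage, editlog = checkpoint(fsimage, editlog)
print(fsimage)   # {'/user/cloudera/a.txt': 2, '/user/cloudera/b.txt': 3}
```

Keeping only recent changes in the EditLog keeps Namenode restarts fast: the FsImage is loaded once and only the short log is replayed.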
HDFS Access
• WebHDFS
HDFS Shell Commands

-ls path : Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr path : Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du path : Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-mv src dest : Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest : Copies the file or directory identified by src to dest, within HDFS.
-rm path : Removes the file or empty directory identified by path.
-rmr path : Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrc dest : Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
-copyFromLocal localSrc dest : Identical to -put.
-moveFromLocal localSrc dest : Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] src localDest : Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
HDFS Shell Commands (continued)

-copyToLocal [-crc] src localDest : Identical to -get.
-moveToLocal [-crc] src localDest : Works like -get, but deletes the HDFS copy on success.
-cat filename : Displays the contents of filename on stdout.
-mkdir path : Creates a directory named path in HDFS. Creates any parent directories in path that are missing (like mkdir -p in Linux).
-test -[ezd] path : Returns 1 if path exists (-e), has zero length (-z), or is a directory (-d), and 0 otherwise.
-stat [format] path : Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file : Shows the last 1 KB of file on stdout.
-chmod [-R] mode,mode,... path... : Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path... : Sets the owning user and/or group for files or directories identified by path.... Sets the owner recursively if -R is specified.
-help cmd : Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
Enable WebHDFS in Your Cluster

Step 1: Add the following property to hdfs-site.xml to enable
WebHDFS access:

<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

Location of hdfs-site.xml: /etc/hadoop/conf/hdfs-site.xml

Step 2: Restart the HDFS service from Cloudera Manager.
1) Create a directory called temp under /user/cloudera

http://localhost:50070/webhdfs/v1/user/cloudera/temp?user.name=cloudera&op=MKDIRS

2) Get the status of the directory /user/cloudera

http://localhost:50070/webhdfs/v1/user/cloudera?user.name=cloudera&op=GETFILESTATUS

3) Create and write into a file

4) Open and read a file
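The two URLs above follow one pattern, which steps 3) and 4) reuse with op=CREATE and op=OPEN. As a sketch, a small helper (hypothetical, not part of Hadoop; the data.txt path is invented for illustration) can assemble such URLs. Note that CREATE is an HTTP PUT answered by the Namenode with a 307 redirect to a Datanode, which then receives the file contents; OPEN is an HTTP GET that is likewise redirected.

```python
def webhdfs_url(path: str, op: str, user: str = "cloudera",
                host: str = "localhost", port: int = 50070) -> str:
    """Build a WebHDFS REST URL following the pattern of the examples above."""
    return f"http://{host}:{port}/webhdfs/v1{path}?user.name={user}&op={op}"

# 1) above, reconstructed:
print(webhdfs_url("/user/cloudera/temp", "MKDIRS"))
# 3) Create and write into a file: send an HTTP PUT to this URL, follow the
#    307 redirect to the Datanode, and send the file contents there.
print(webhdfs_url("/user/cloudera/temp/data.txt", "CREATE"))
# 4) Open and read a file: send an HTTP GET, also answered with a redirect.
print(webhdfs_url("/user/cloudera/temp/data.txt", "OPEN"))
```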
