Big Data Analytics
1. Volume - a huge amount of data is not necessarily good; information overload happens here, because only a small portion of the data is actually needed and we may not be able to identify the suitable portion.
2. Velocity
3. Variety
a. Many types of data - this increases the complexity. When the number of variables exceeds the number of observations, the data matrix is wide ("short and fat" rather than tall and skinny).
b. Data may come from different sources and may not have been collected for our own purpose (i.e., secondary data), so ensure the completeness of the data.
c. There can be fake data and missing data - make sure to clean the data.
d. To clean it, we can follow a distributed approach.
• In parallel processing, we divide the problem into multiple tasks that are executed in parallel.
• If an error occurs while executing task 1, we only need to focus on task 1.
• Executions are done on different nodes.
- Why parallel processing - slide 6.
➢ Some tasks may need to communicate - some aggregation and transfer between nodes might be needed - which makes the problem "nearly embarrassingly parallel".
➢ Embarrassingly parallel - a problem that requires no communication or dependency between the processes.
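A minimal sketch of an embarrassingly parallel workload (the numbers and the squaring task are illustrative, not from the lecture): each element is processed independently, so no communication between tasks is needed.

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class EmbarrassinglyParallel {
        public static void main(String[] args) {
            // Each element is squared on its own - no task needs the result of
            // any other task, so there is no communication or dependency.
            List<Integer> squares = IntStream.rangeClosed(1, 10)
                    .parallel()          // run the independent tasks in parallel
                    .map(n -> n * n)
                    .boxed()
                    .collect(Collectors.toList());
            System.out.println(squares); // [1, 4, 9, ..., 100]
        }
    }

A word count, by contrast, is only nearly embarrassingly parallel: the per-node counts still have to be aggregated across nodes at the end.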
Data locality
- It is expensive to transfer data from one node to another, and it costs network effort.
- Processing of the data happens in the same place where the data resides (which is what happens on the map side).
Data technologies - how we collect, process, and share data. Amazon S3 is a current trend.
Analytics and visualization
Commodity hardware - hardware based on open standards, on which a distributed file system is designed to run.
- Capable of running on different operating systems without special drivers.
- Off-the-shelf hardware which is widely available and interchangeable with other hardware of its type.
- A commodity cluster is an ensemble of fully independent computing systems integrated by a commodity off-the-shelf interconnection communication network.
MapReduce
- A programming paradigm that enables scalability across hundreds or thousands of servers in a Hadoop cluster.
- It performs two tasks: mapping and reducing.
- A job runs through a map stage, a shuffle stage, and a reduce stage.
- Mapper: a function or task used to process all the input records from a file and generate output that works as the input for the Reducer. It creates the small chunks of data.
○ Hadoop has 2 map tasks and 1 reduce task by default.
- Reducer: the combination of the shuffle and reduce stages; it stores its output in HDFS. Works with <key, value> pairs.
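The dataflow can be summarized with the key/value signatures from the original MapReduce paper:

    map(k1, v1)          -> list(k2, v2)
    shuffle              -> groups all intermediate values by key k2
    reduce(k2, list(v2)) -> list(k3, v3)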
Keywords
1. Name-node
○ The master node in the Hadoop HDFS (Hadoop Distributed File System) architecture; it maintains and manages the blocks (data saved as small files) present on the data-nodes (slave nodes).
○ A highly available server that manages the file system namespace and controls access to files by clients.
2. Namespace
- The files and directories handled by the Name-node.
Usually, when a file is stored in HDFS, it is stored as chunks that we call blocks (default 64MB).
Example
- The file is 150MB, so with 64MB blocks it is split into three blocks (64MB + 64MB + 22MB; see the sketch after this list).
- These blocks are separated across nodes, called data nodes.
- Each data node has blocks inside it.
- Data nodes are handled by the name-node, which saves the metadata.
- To avoid failure and loss of the data in blocks, HDFS creates 3 copies of each block elsewhere.
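A minimal sketch of the block arithmetic for this example (the class and variable names are illustrative):

    public class BlockMath {
        public static void main(String[] args) {
            long fileMB = 150, blockMB = 64, replication = 3;
            // The number of blocks is the file size over the block size, rounded up.
            long blocks = (fileMB + blockMB - 1) / blockMB;     // ceil(150/64) = 3
            long lastBlockMB = fileMB - (blocks - 1) * blockMB; // 150 - 128 = 22
            long replicas = blocks * replication;               // 3 * 3 = 9
            System.out.println(blocks + " blocks, last block " + lastBlockMB
                    + "MB, " + replicas + " block replicas stored in total");
        }
    }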
- When you run a MapReduce job, you submit it to a job tracker.
- Each data node has a task tracker.
- We can have only one name node. The secondary name node is not a standby name node.
- If we have lots of metadata to record, one name node is not practical - there is no horizontal scaling capability.
- If something happens to the name node, we cannot use it or perform activities - a single point of failure. So replication is useless if that happens.
- The job tracker handles resource management, scheduling, and monitoring. That is too many responsibilities, and it can get overburdened.
- Name node - one name node can handle 4,000 nodes per cluster. If we need to exceed that, one name node will not be able to handle it.
- One name node has only one namespace.
- Suitable only for batch processing of huge amounts of data. Real-time data streaming is not possible.
- Other workloads such as graph processing are not supported; storage would need different pipelines.
Hadoop 2.0
- HDFS Federation
- More name nodes are added, and more nodes in the cluster.
- High availability, because we have more name nodes.
- Non-MapReduce types of processing can be supported.
- NNx is a name node in slide 8; this helps to segregate namespaces according to departments in the organization.
- Namespace layer - has files, blocks, and directories. File system operations can be done and will be based on the namespace.
- Name nodes can access all the data nodes, e.g., data node 1, 2, or 3.
- Data nodes are considered common storage and are independent. No need to worry about coordination.
- Once we add a data node, it has to get registered with all the name nodes in the cluster.
- Block pools - a collection of blocks belongs to a single pool. Each name node has a different pool, managed independently by the respective name node.
- Different blocks can be identified using the block ID, which also contains the namespace ID.
- Namespace volume = namespace + block pool.
- If we delete a namespace, the block pool assigned to it is also deleted.
Resource manager
- Scheduler
- Application manager
➢ We used the word "slot" before; a container is the same idea - resources are now called containers.
➢ An application master exists for each application. It was introduced with YARN.
What is a combiner? A combiner is an optional "mini-reducer" that runs on the map side to pre-aggregate a mapper's output before it is shuffled, reducing the data transferred between nodes.
• Slide 5 - note that, despite the name, the secondary name-node is not a standby name-node that takes over if the main name-node fails; its role is checkpointing, described below.
1. Fs-image
a. Exists on both active and standby name-nodes.
b. A file on the OS filesystem that contains the complete directory structure (namespace) of HDFS, with details about the location of the data in the data blocks and which blocks are stored on which nodes.
c. The file is used by the name-node when it starts up.
2. Edits
a. Also called the edit log; a transaction log that records the changes in the HDFS file system, i.e., any action performed on the HDFS cluster such as addition of a new block, replication, or deletion.
b. It records the changes since the last fs-image was created.
3. Fs-time - contains the timestamp of the last checkpoint.
Checkpointing
1. Involves the primary and the secondary name-node.
2. The name-node keeps metadata in two files: edits and fsimage.
3. The edits file usually keeps growing, and during a recovery a very large edits file is slow to replay.
• The secondary name-node asks the primary name-node to roll the edits over into a new file.
• Using an HTTP GET, the secondary name-node receives the latest edits and fsimage files (the secondary name-node usually already has an fsimage).
• The secondary name-node then merges them to create a new fsimage namespace metadata file.
• That file is sent to the primary name-node using an HTTP POST.
• After that, the primary name-node has the latest fsimage along with the lightweight new edits file, and at this point the primary name-node's old edit logs can be cleared to save space.
Blocks
- Data is stored in data nodes in blocks.
- A block is the minimum amount of data that HDFS can read or write.
- The default is now 128MB; previously it was 64MB.
➢ Rack awareness helps to decide where a copy of a data block should be stored.
• Slide 16 - if a network issue happens in one of the racks, we can still use the copies stored on data nodes in other racks.
• Write once and read many - new data can be added, but already stored data cannot be edited.
• Reading the data for many different operations can be done in HDFS.
- Replication factor - 3.
- The client has the large file and splits it into blocks based on the block size.
- The client tells the name node the block details and the replication factor.
- The name node sees which data nodes can be used to store the data and identifies 3 data nodes.
- The name node passes that information to the client.
- The client starts writing the data once it receives the data node locations.
- The client sends the data to the first data node, which is the closest to the client in the same rack.
- That data node writes the data to its hard drive and sends the data on to the next data node.
- The name node keeps a log of the material stored on each data node (see the client-side sketch below).
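A minimal sketch of the client side of this flow using the Hadoop FileSystem API (the path and cluster URI are illustrative; block placement and pipelining happen inside the client library):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative cluster address; in practice this comes from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
            // The client asks the name node for target data nodes, then streams
            // the data to the first data node, which forwards it along the pipeline.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
                out.writeUTF("hello HDFS");
            }
            fs.close();
        }
    }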
HDFS delete
1. The initial delete is not an actual delete; it just changes the file path.
2. The file is moved to a trash directory - that operation is fast.
3. We can set a duration for which the file stays in the trash directory; after that, the file can be deleted.
4. We can restore the file from the trash directory before that duration expires.
5. Once the file is deleted from the HDFS namespace, the space is freed on the data nodes.
6. This then shows as increased available space.
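A minimal sketch of what step 1 amounts to (the trash path shown follows the usual per-user layout; in practice this rename is done for you by the trash feature, not by application code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TrashSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/sample.txt");
            // "Deleting" initially just moves the file under the trash directory,
            // which is a cheap metadata change; no block data is touched yet.
            Path trash = new Path("/user/demo/.Trash/Current/user/demo/sample.txt");
            fs.mkdirs(trash.getParent());
            fs.rename(file, trash);
            // A real delete that frees block space would be: fs.delete(file, false);
        }
    }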
Goals of HDFS
Map Reduce
- Mapper : happens on the data nodes which have the data.
- Reducer : aggregation or summarization happens on another data node.
Example:
Splitting (the input is split line by line):
Deer Bear Cat
Dog River Cat
Deer Bear Dog
Mapping (each word is emitted with a count of 1):
Deer, 1   Bear, 1   Cat, 1
Dog, 1    River, 1  Cat, 1
Deer, 1   Bear, 1   Dog, 1
Shuffling (intermediate pairs are grouped by key):
Bear, (1, 1)
Cat, (1, 1)
Deer, (1, 1)
Dog, (1, 1)
River, (1)
Reducing
#count the number of words
Bear, 2
Cat, 2
Deer, 2
Dog, 2
River, 1
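The classic Hadoop WordCount job implements exactly this example. A sketch using the standard Hadoop MapReduce API (the input and output paths come from the command line; the combiner, mentioned earlier in these notes, pre-aggregates counts on the map side):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);   // e.g. (Deer, 1)
                }
            }
        }

        // Reducer: receives (word, [1, 1, ...]) after the shuffle and sums the counts.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);     // e.g. (Deer, 2)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // the combiner mentioned above
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }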
Two input files will be used, Employee and Department, and one output file will be generated.
Count the number of departments each employee works in, and the hours (a sketch follows).
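The file formats are not given in the notes, so as an assumption-laden sketch, suppose the joined input lines look like emp_id,dept_id,hours; then the mapper can key the records by employee and the reducer can count distinct departments and sum the hours:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class EmployeeDeptHours {

        // Assumed input line layout: emp_id,dept_id,hours (not specified in the notes).
        public static class EmpMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] f = value.toString().split(",");
                // Key by employee; carry department and hours as the value.
                context.write(new Text(f[0]), new Text(f[1] + ":" + f[2]));
            }
        }

        public static class EmpReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                Set<String> departments = new HashSet<>();
                double hours = 0.0;
                for (Text v : values) {
                    String[] p = v.toString().split(":");
                    departments.add(p[0]);             // count each department once
                    hours += Double.parseDouble(p[1]); // accumulate hours
                }
                context.write(key, new Text(departments.size()
                        + " departments, " + hours + " hours"));
            }
        }
    }

If Employee and Department really are two separate files that must be joined first, a reduce-side join (tagging each record with its source file in the mapper) would precede this aggregation.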