Big Data Analytics
1. Volume - a huge amount of data is not necessarily good; information overload happens here, because only a small portion of the data is actually needed and we may not be able to identify the suitable portion.
2. Velocity
3. Variety
a. Many types of data - this increases the complexity. When the number of variables exceeds the number of observations, the data matrix is wide ("short and fat" rather than tall and skinny).
b. Data may come from different sources and may not have been collected for our own purpose (i.e., secondary data), so ensure the completeness of the data.
c. There can be fake data and missing data - make sure to clean the data.
d. To clean it, we can follow a distributed approach.
• In parallel processing, we divide the problem into multiple tasks that are executed in parallel.
• If an error occurs while executing task 1, we only need to focus on task 1.
• Executions are done on different nodes.
- Why parallel processing - slide 6.
➢ Some tasks may need to communicate - some aggregation and transfer between nodes might be needed - which makes the problem "nearly embarrassingly parallel".
➢ Embarrassingly parallel - a problem that requires no communication or dependency between the processes.
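A minimal sketch of an embarrassingly parallel workload (the numbers and the squaring task are illustrative, not from the lecture): each element is processed independently, so no communication between tasks is needed.

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class EmbarrassinglyParallel {
        public static void main(String[] args) {
            // Each element is squared on its own - no task needs the result of
            // any other task, so there is no communication or dependency.
            List<Integer> squares = IntStream.rangeClosed(1, 10)
                    .parallel()          // run the independent tasks in parallel
                    .map(n -> n * n)
                    .boxed()
                    .collect(Collectors.toList());
            System.out.println(squares); // [1, 4, 9, ..., 100]
        }
    }

A word count, by contrast, is only nearly embarrassingly parallel: the per-node counts still have to be aggregated across nodes at the end.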
Data locality
- It is expensive to transfer data from one node to another, and it costs network effort.
- Processing of the data happens in the same place where the data resides (which is what happens on the map side).
Data technologies - how we collect, process, and share data. Amazon S3 is a current trend.
Analytics and visualization
Commodity hardware - hardware based on open standards, on which a distributed file system is designed to run.
- Capable of running on different operating systems without special drivers.
- Off-the-shelf hardware which is widely available and interchangeable with other hardware of its type.
- A commodity cluster is an ensemble of fully independent computing systems integrated by a commodity off-the-shelf interconnection communication network.
MapReduce
- A programming paradigm that enables scalability across hundreds or thousands of servers in a Hadoop cluster.
- It performs two tasks: mapping and reducing.
- A job runs through a map stage, a shuffle stage, and a reduce stage.
- Mapper: a function or task used to process all the input records from a file and generate output that works as the input for the Reducer. It creates the small chunks of data.
○ Hadoop has 2 map tasks and 1 reduce task by default.
- Reducer: the combination of the shuffle and reduce stages; it stores its output in HDFS. Works with <key, value> pairs.
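The dataflow can be summarized with the key/value signatures from the original MapReduce paper:

    map(k1, v1)          -> list(k2, v2)
    shuffle              -> groups all intermediate values by key k2
    reduce(k2, list(v2)) -> list(k3, v3)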
Keywords
1. Name-node
○ The master node in the Hadoop HDFS (Hadoop Distributed File System) architecture; it maintains and manages the blocks (data saved as small files) present on the data-nodes (slave nodes).
○ A highly available server that manages the file system namespace and controls access to files by clients.
2. Namespace
- The files and directories handled by the Name-node.
Usually, when a file is stored in HDFS, it is stored as chunks that we call blocks (default 64MB).
Example
- The file is 150MB, so with 64MB blocks it is split into three blocks (64MB + 64MB + 22MB; see the sketch after this list).
- These blocks are separated across nodes, called data nodes.
- Each data node has blocks inside it.
- Data nodes are handled by the name-node, which saves the metadata.
- To avoid failure and loss of the data in blocks, HDFS creates 3 copies of each block elsewhere.
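A minimal sketch of the block arithmetic for this example (the class and variable names are illustrative):

    public class BlockMath {
        public static void main(String[] args) {
            long fileMB = 150, blockMB = 64, replication = 3;
            // The number of blocks is the file size over the block size, rounded up.
            long blocks = (fileMB + blockMB - 1) / blockMB;     // ceil(150/64) = 3
            long lastBlockMB = fileMB - (blocks - 1) * blockMB; // 150 - 128 = 22
            long replicas = blocks * replication;               // 3 * 3 = 9
            System.out.println(blocks + " blocks, last block " + lastBlockMB
                    + "MB, " + replicas + " block replicas stored in total");
        }
    }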
- When you run a MapReduce job, you submit it to a job tracker.
- Each data node has a task tracker.
- We can have only one name node. The secondary name node is not a standby name node.
- If we have lots of metadata to record, one name node is not practical - there is no horizontal scaling capability.
- If something happens to the name node, we cannot use it or perform activities - a single point of failure. So replication is useless if that happens.
- The job tracker handles resource management, scheduling, and monitoring. That is too many responsibilities, and it can get overburdened.
- Name node - one name node can handle 4,000 nodes per cluster. If we need to exceed that, one name node will not be able to handle it.
- One name node has only one namespace.
- Suitable only for batch processing of huge amounts of data. Real-time data streaming is not possible.
- Other workloads such as graph processing are not supported; storage would need different pipelines.
Hadoop 2.0
- HDFS Federation
- More name nodes are added, and more nodes in the cluster.
- High availability, because we have more name nodes.
- Non-MapReduce types of processing can be supported.
- NNx is a name node in slide 8; this helps to segregate namespaces according to departments in the organization.
- Namespace layer - has files, blocks, and directories. File system operations can be done and will be based on the namespace.
- Name nodes can access all the data nodes, e.g., data node 1, 2, or 3.
- Data nodes are considered common storage and are independent. No need to worry about coordination.
- Once we add a data node, it has to get registered with all the name nodes in the cluster.
- Block pools - a collection of blocks belongs to a single pool. Each name node has a different pool, managed independently by the respective name node.
- Different blocks can be identified using the block ID, which also contains the namespace ID.
- Namespace volume = namespace + block pool.
- If we delete a namespace, the block pool assigned to it is also deleted.
Resource manager
- Scheduler
- Application manager
➢ We used the word "slot" before; a container is the same idea - resources are now called containers.
➢ An application master exists for each application. It was introduced with YARN.
What is a combiner? A combiner is an optional "mini-reducer" that runs on the map side to pre-aggregate a mapper's output before it is shuffled, reducing the data transferred between nodes.
• Slide 5 - note that, despite the name, the secondary name-node is not a standby name-node that takes over if the main name-node fails; its role is checkpointing, described below.
1. Fs-image
a. Exists on both active and standby name-nodes.
b. A file on the OS filesystem that contains the complete directory structure (namespace) of HDFS, with details about the location of the data in the data blocks and which blocks are stored on which nodes.
c. The file is used by the name-node when it starts up.
2. Edits
a. Also called the edit log; a transaction log that records the changes in the HDFS file system, i.e., any action performed on the HDFS cluster such as addition of a new block, replication, or deletion.
b. It records the changes since the last fs-image was created.
3. Fs-time - contains the timestamp of the last checkpoint.
Checkpointing
1. Involves the primary and the secondary name-node.
2. The name-node keeps metadata in two files: edits and fsimage.
3. The edits file usually keeps growing, and during a recovery a very large edits file is slow to replay.
• The secondary name-node asks the primary name-node to roll the edits over into a new file.
• Using an HTTP GET, the secondary name-node receives the latest edits and fsimage files (the secondary name-node usually already has an fsimage).
• The secondary name-node then merges them to create a new fsimage namespace metadata file.
• That file is sent to the primary name-node using an HTTP POST.
• After that, the primary name-node has the latest fsimage along with the lightweight new edits file, and at this point the primary name-node's old edit logs can be cleared to save space.
Blocks
- Data is stored in data nodes in blocks.
- A block is the minimum amount of data that HDFS can read or write.
- The default is now 128MB; previously it was 64MB.
➢ Rack awareness helps to decide where a copy of a data block should be stored.
• Slide 16 - if a network issue happens in one of the racks, we can still use the copies stored on data nodes in other racks.
• Write once and read many - new data can be added, but already stored data cannot be edited.
• Reading the data for many different operations can be done in HDFS.
- Replication factor - 3.
- The client has the large file and splits it into blocks based on the block size.
- The client tells the name node the block details and the replication factor.
- The name node sees which data nodes can be used to store the data and identifies 3 data nodes.
- The name node passes that information to the client.
- The client starts writing the data once it receives the data node locations.
- The client sends the data to the first data node, which is the closest to the client in the same rack.
- That data node writes the data to its hard drive and sends the data on to the next data node.
- The name node keeps a log of the material stored on each data node (see the client-side sketch below).
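A minimal sketch of the client side of this flow using the Hadoop FileSystem API (the path and cluster URI are illustrative; block placement and pipelining happen inside the client library):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative cluster address; in practice this comes from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
            // The client asks the name node for target data nodes, then streams
            // the data to the first data node, which forwards it along the pipeline.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
                out.writeUTF("hello HDFS");
            }
            fs.close();
        }
    }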
HDFS delete
1. The initial delete is not an actual delete; it just changes the file path.
2. The file is moved to a trash directory - that operation is fast.
3. We can set a duration for which the file stays in the trash directory; after that, the file can be deleted.
4. We can restore the file from the trash directory before that duration expires.
5. Once the file is deleted from the HDFS namespace, the space is freed on the data nodes.
6. This then shows as increased available space.
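A minimal sketch of what step 1 amounts to (the trash path shown follows the usual per-user layout; in practice this rename is done for you by the trash feature, not by application code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TrashSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/sample.txt");
            // "Deleting" initially just moves the file under the trash directory,
            // which is a cheap metadata change; no block data is touched yet.
            Path trash = new Path("/user/demo/.Trash/Current/user/demo/sample.txt");
            fs.mkdirs(trash.getParent());
            fs.rename(file, trash);
            // A real delete that frees block space would be: fs.delete(file, false);
        }
    }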
Goals of HDFS
Map Reduce
- Mapper : happens on the data nodes which have the data.
- Reducer : aggregation or summarization happens on another data node.
Example:
Splitting (the input is split line by line):
Deer Bear Cat
Dog River Cat
Deer Bear Dog
Mapping (each word is emitted with a count of 1):
Deer, 1   Bear, 1   Cat, 1
Dog, 1    River, 1  Cat, 1
Deer, 1   Bear, 1   Dog, 1
Shuffling (intermediate pairs are grouped by key):
Bear, (1, 1)
Cat, (1, 1)
Deer, (1, 1)
Dog, (1, 1)
River, (1)
Reducing
#count the number of words
Bear, 2
Cat, 2
Deer, 2
Dog, 2
River, 1
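The classic Hadoop WordCount job implements exactly this example. A sketch using the standard Hadoop MapReduce API (the input and output paths come from the command line; the combiner, mentioned earlier in these notes, pre-aggregates counts on the map side):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);   // e.g. (Deer, 1)
                }
            }
        }

        // Reducer: receives (word, [1, 1, ...]) after the shuffle and sums the counts.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);     // e.g. (Deer, 2)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // the combiner mentioned above
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }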
Two input files will be used, Employee and Department, and one output file will be generated.
Count the number of departments each employee works in, and the hours (a sketch follows).
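The file formats are not given in the notes, so as an assumption-laden sketch, suppose the joined input lines look like emp_id,dept_id,hours; then the mapper can key the records by employee and the reducer can count distinct departments and sum the hours:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class EmployeeDeptHours {

        // Assumed input line layout: emp_id,dept_id,hours (not specified in the notes).
        public static class EmpMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] f = value.toString().split(",");
                // Key by employee; carry department and hours as the value.
                context.write(new Text(f[0]), new Text(f[1] + ":" + f[2]));
            }
        }

        public static class EmpReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                Set<String> departments = new HashSet<>();
                double hours = 0.0;
                for (Text v : values) {
                    String[] p = v.toString().split(":");
                    departments.add(p[0]);             // count each department once
                    hours += Double.parseDouble(p[1]); // accumulate hours
                }
                context.write(key, new Text(departments.size()
                        + " departments, " + hours + " hours"));
            }
        }
    }

If Employee and Department really are two separate files that must be joined first, a reduce-side join (tagging each record with its source file in the mapper) would precede this aggregation.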