Seminar 2

MS4252
Big Data Analytics

2024.1.24
Prof. Louie Wong

Class Schedule
Date Topics
Jan 17 Course Introduction
Jan 24 Introduction to Hadoop & MapReduce
Jan 31 Text Mining & Natural Language Processing
Feb 07 Data Transformation & Topic Extraction
Feb 21 Text Mining Applications - I
Feb 28 Text Mining Applications - II
Mar 06 Text Mining Applications - III
Mar 13 Mid-term 20%
MS4252 2023/24 Sem B 2

Introduction to Hadoop and MapReduce

Restaurant Analogy

Problem

Distributed Processing Scenario

Solution

What is Hadoop?
• An open-source software framework used for storing and
processing large datasets across clusters of computers
• Commonly used in big data applications and allows for distributed
storage and processing of data
• Designed to be highly scalable and fault-tolerant, allowing for the
processing of extremely large datasets

History of Hadoop
• Key challenges of big data
• How to store/process big data with reasonable cost, time & reliability?
• Hadoop was designed to solve this problem
• Inspired by Google’s research papers,
Doug Cutting (father of Hadoop) and Michael J. Cafarella
developed Hadoop to support distribution for the Nutch search
engine project; Cutting continued the work at Yahoo!

The Google File System Paper

MapReduce Paper

Hadoop
• Some characteristics of Hadoop include:
• Open-source
• Simple-to-use distributed file system
• Supports highly parallel processing
• It’s scalable, so it’s suitable for massive amounts of data
• It is designed to work on low-cost hardware
• It’s fault-tolerant at the data level
• automatic replication of data
• automatic fail-over

Hadoop Cluster at Yahoo!

Organizations using Hadoop
• Hadoop has been used at many organizations that handle big data:
• Yahoo!
• IBM
• Facebook
• Amazon
• Netflix
• LinkedIn
• …

LinkedIn Use Case
• LinkedIn utilizes Hadoop for the following purposes:
• Process daily production database transaction logs
• Examine users’ activities such as views and clicks
• Feed the extracted data back to the production systems
• Restructure the data to add to an analytical database
• Develop and test analytical models

Core Hadoop Modules
• HDFS: a file system that distributes large files across the Hadoop
cluster of computers
• YARN: a framework for job scheduling and cluster resource
management
• MapReduce: a YARN-based system for parallel processing of large
data sets
• Automates the processing of large datasets in a distributed environment
• Allows programmers to focus on writing programs for data processing as if
they were using a single computer

HDFS: Hadoop Distributed File System
• HDFS is hierarchical, with Linux-style paths, file
ownership and permissions
• hadoop fs commands are similar to Linux commands
• HDFS is not built into the operating system
• Files are append-only after they are written
• hadoop fs commands are run from the Linux command prompt

HDFS Command

A File Stored in HDFS
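
As a rough illustration of how a large file might be spread over a cluster, the sketch below splits a file into fixed-size blocks and assigns each block to several machines. The 128 MB block size and replication factor of 3 are assumed here as the usual HDFS defaults; the function name `place_blocks`, the node names, and the round-robin placement are hypothetical simplifications (real HDFS uses a rack-aware placement policy).

```python
import math

BLOCK_SIZE_MB = 128   # assumption: the usual HDFS default block size
REPLICATION = 3       # assumption: the usual HDFS default replication factor


def place_blocks(file_size_mb, datanodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} for a file of the given size."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement: a simplification, not the real HDFS policy.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A hypothetical 500 MB file on a hypothetical four-node cluster.
layout = place_blocks(500, ["node1", "node2", "node3", "node4"])
for block, nodes in layout.items():
    print(f"block {block}: {nodes}")
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block, which is the idea behind fault tolerance at the data level.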

How MapReduce Works

MapReduce: A Real World Analogy
Coins Deposit

Coins Deposit
Coins Counting Machine

Coins Deposit
Mapper: Categorize coins by their face values

Reducer: Count the coins of each face value in parallel
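
The coin analogy can be sketched in a few lines of Python. This is only an illustration: the `coins` list is made-up data, and the function names `mapper` and `reducer` are chosen to mirror the roles above. The reducer calls run one after another here, but each call touches only its own pile, which is why they could run in parallel.

```python
from collections import defaultdict


def mapper(coin):
    """Categorize one coin by emitting (face value, 1)."""
    return (coin, 1)


def reducer(face_value, ones):
    """Count the coins in the pile for one face value."""
    return (face_value, sum(ones))


coins = [1, 5, 2, 1, 10, 5, 1, 2]  # hypothetical handful of coins

# Sort the mapper output into one pile per face value.
piles = defaultdict(list)
for coin in coins:
    face_value, one = mapper(coin)
    piles[face_value].append(one)

# Each pile is counted independently of the others.
counts = dict(reducer(face_value, ones) for face_value, ones in piles.items())
print(counts)  # {1: 3, 5: 2, 2: 2, 10: 1}
```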

MapReduce Paradigm
• Implement two functions:
• map(k1, v1) -> list(k2, v2)
• reduce(k2, list(v2)) -> list(v3)
• The framework handles everything else
• Values with the same key go to the same reducer
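
To make the division of labour concrete, here is a minimal single-process sketch of what the framework does around the two user functions. It is not Hadoop's implementation: `run_mapreduce`, `partition`, the CRC32-based routing, and the even/odd demo data are all assumptions made for illustration. The point it shows is that routing by key guarantees every value for a key lands at the same reducer.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key; the same key always gets the same reducer."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    """Run map, shuffle, and reduce over (k1, v1) records in one process."""
    # Map + shuffle: route every (k2, v2) pair to a reducer chosen by its key.
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            reducers[partition(k2, num_reducers)][k2].append(v2)
    # Reduce: each reducer processes its own keys in sorted order.
    output = []
    for groups in reducers:
        for k2 in sorted(groups):
            output.extend(reduce_fn(k2, groups[k2]))
    return output


def map_fn(k1, v1):
    """map(k1, v1) -> list(k2, v2): label each number as even or odd."""
    return [("even" if v1 % 2 == 0 else "odd", v1)]


def reduce_fn(k2, values):
    """reduce(k2, list(v2)) -> list(v3): total of the numbers for one label."""
    return [(k2, sum(values))]


records = list(enumerate([1, 2, 3, 4, 5, 6]))  # hypothetical input records
result = sorted(run_mapreduce(records, map_fn, reduce_fn))
print(result)  # [('even', 12), ('odd', 9)]
```

Only `map_fn` and `reduce_fn` would change for a different job; the grouping and routing code stays the same.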

MapReduce Example: Word Count
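Stages: Input -> Split -> Map -> Shuffle/Sort -> Reduce -> Output
Input:
• Deer Beer River
• Car Car River
• Deer Car Beer
Split:
• Split 1: Deer Beer River
• Split 2: Car Car River
• Split 3: Deer Car Beer
Map:
• Split 1: Deer, 1 | Beer, 1 | River, 1
• Split 2: Car, 1 | Car, 1 | River, 1
• Split 3: Deer, 1 | Car, 1 | Beer, 1
Shuffle/Sort:
• Beer, 1 | Beer, 1
• Car, 1 | Car, 1 | Car, 1
• Deer, 1 | Deer, 1
• River, 1 | River, 1
Reduce:
• Beer, 2
• Car, 3
• Deer, 2
• River, 2
Output: Beer, 2 | Car, 3 | Deer, 2 | River, 2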
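
The word count example can be reproduced with a short Python sketch that follows the same stages. The function names `map_words` and `reduce_counts` are illustrative, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, ones):
    """Reduce: sum the 1s collected for a single word."""
    return (word, sum(ones))


text = "Deer Beer River\nCar Car River\nDeer Car Beer"

# Split: one line per split.
splits = text.splitlines()

# Map: every split produces its own list of (word, 1) pairs.
mapped = [pair for line in splits for pair in map_words(line)]

# Shuffle/Sort: group the pairs by word, then visit the words in sorted order.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce: one call per distinct word.
output = [reduce_counts(word, grouped[word]) for word in sorted(grouped)]
print(output)  # [('Beer', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```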

In-Class Exercise
• What are the popular Hadoop distributions?
• Any alternatives to Hadoop?
• What are they?
• Are they better? Why?
• Will Hadoop still be relevant in the future?