Seminar 2


MS4252

Big Data Analytics


2024.1.24

Prof. Louie Wong


Class Schedule
Date Topics
Jan 17 Course Introduction
Jan 24 Introduction to Hadoop & MapReduce
Jan 31 Text Mining & Natural Language Processing
Feb 07 Data Transformation & Topic Extraction
Feb 21 Text Mining Applications - I
Feb 28 Text Mining Applications - II
Mar 06 Text Mining Applications - III
Mar 13 Mid-term 20%

MS4252 2023/24 Sem B 2


Introduction to Hadoop and MapReduce



Restaurant Analogy



Problem



Distributed Processing Scenario



Solution



What is Hadoop?
• An open-source software framework used for storing and
processing large datasets across clusters of computers
• Commonly used in big data applications and allows for distributed
storage and processing of data
• Designed to be highly scalable and fault-tolerant, allowing for the
processing of extremely large datasets



History of Hadoop
• Key challenges of big data
• How to store/process big data with reasonable cost, time & reliability?
• Hadoop was designed to solve this problem
• Inspired by Google’s research papers,
Doug Cutting (father of Hadoop) and Michael J. Cafarella
developed Hadoop to support distribution for the Nutch search
engine project; Cutting continued the work at Yahoo!



The Google File System Paper



MapReduce Paper



Hadoop
• Some characteristics of Hadoop include:
• Open-source
• Simple to use distributed file system
• Supports highly parallel processing
• It’s scalable, so it’s suitable for massive amounts of data
• It is designed to work on low-cost hardware
• It’s fault-tolerant at the data level
• automatic replication of data
• automatic fail-over



Hadoop Cluster at Yahoo!



Organizations using Hadoop
• Hadoop is in use at many organizations that handle big data:
• Yahoo!
• IBM
• Facebook
• Amazon
• Netflix
• LinkedIn
• …



LinkedIn Use Case
• LinkedIn utilizes Hadoop for the following purposes:
• Process daily production database transaction logs
• Examine users’ activities such as views and clicks
• Feed the extracted data back to the production systems
• Restructure the data to add to an analytical database
• Develop and test analytical models



Core Hadoop Modules
• HDFS: a file system that distributes large files across the Hadoop
cluster of computers
• YARN: a framework for job scheduling and cluster resource
management
• MapReduce: a YARN-based system for parallel processing of large
data sets

• Automate the processing of large datasets in a distributed environment


• Allow programmers to focus on writing programs for data processing as if
they were using a single computer



HDFS: Hadoop Distributed File System
• HDFS is hierarchical, with Linux-style paths, file ownership, and
permissions
• Hadoop FS commands are similar to Linux commands
• HDFS is not built into the operating system
• Files are append-only after they are written
Hadoop FS commands from the Linux command prompt:
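
To make the similarity with Linux commands concrete, here is a small sketch that maps a few familiar Linux commands to their `hadoop fs` counterparts and builds the command line for each. The helper name `hadoop_fs` and the example paths are hypothetical; the options themselves (`-ls`, `-mkdir`, `-cat`, and so on) are standard `hadoop fs` shell commands. The sketch only prints the commands, so it runs without a Hadoop installation.

```python
import shlex
from typing import List

# Linux command -> corresponding `hadoop fs` option
LINUX_TO_HADOOP = {
    "ls": "-ls",
    "mkdir": "-mkdir",
    "cat": "-cat",
    "rm": "-rm",
    "cp": "-cp",
    "mv": "-mv",
    "chmod": "-chmod",
    "chown": "-chown",
}


def hadoop_fs(linux_command: str, *args: str) -> List[str]:
    """Build the argument list for the `hadoop fs` equivalent of a Linux command.

    `hadoop_fs` is an illustrative helper, not part of Hadoop itself.
    """
    option = LINUX_TO_HADOOP[linux_command]
    return ["hadoop", "fs", option, *args]


if __name__ == "__main__":
    # Hypothetical HDFS paths, chosen only for illustration.
    examples = [
        ("ls", ["/user/student"]),
        ("mkdir", ["/user/student/input"]),
        ("cat", ["/user/student/input/words.txt"]),
    ]
    for command, args in examples:
        print(shlex.join(hadoop_fs(command, *args)))
```

Because HDFS is not built into the operating system, moving files between the local disk and HDFS needs dedicated commands: `hadoop fs -put` copies a local file into HDFS and `hadoop fs -get` copies it back.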



HDFS Command



A File Stored in HDFS
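
As a rough illustration of how a file might be laid out in HDFS, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 are the HDFS defaults in Hadoop 2 and later; the function name `place_blocks`, the node names, and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware).

```python
from typing import Dict, List

BLOCK_SIZE_MB = 128  # default HDFS block size in Hadoop 2 and later
REPLICATION = 3      # default HDFS replication factor


def place_blocks(file_size_mb: int, nodes: List[str],
                 block_size_mb: int = BLOCK_SIZE_MB,
                 replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Split a file into blocks and assign each block to `replication` nodes.

    Round-robin placement is a simplification for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes (wrapping around) hold the copies of this block.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


if __name__ == "__main__":
    # A hypothetical 300 MB file on a hypothetical four-node cluster.
    layout = place_blocks(300, ["node1", "node2", "node3", "node4"])
    for block, replicas in layout.items():
        print(f"block {block}: {replicas}")
```

With three copies of every block on different nodes, losing any single node still leaves two copies of each block, which is what makes the data-level fault tolerance possible.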



How MapReduce Works



MapReduce: A Real World Analogy
Coins Deposit


Coins Counting Machine


Mapper: Categorize coins by their face values


Reducer: Count the coins in each face value in parallel
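
The coin analogy can be sketched in a few lines of Python. This is an illustration on a single machine, not Hadoop code: the function names and the sample coin values are assumptions, and a thread pool stands in for reducers running in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def mapper(coin):
    """Categorize a coin by its face value: emit (face value, 1)."""
    return (coin, 1)


def reducer(pile):
    """Count the coins collected for one face value."""
    face_value, ones = pile
    return (face_value, sum(ones))


def count_coins(coins):
    # Map: tag every coin with its face value.
    pairs = [mapper(coin) for coin in coins]
    # Group: put coins with the same face value into the same pile.
    piles = defaultdict(list)
    for face_value, one in pairs:
        piles[face_value].append(one)
    # Reduce: count each pile independently; the thread pool only
    # illustrates that the piles can be counted at the same time.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(reducer, piles.items()))


if __name__ == "__main__":
    deposit = [1, 2, 5, 1, 10, 2, 1]  # hypothetical face values
    print(count_coins(deposit))
```

Each pile is counted without looking at any other pile, which is why the reducers can work in parallel.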



MapReduce Paradigm
• Implement two functions:
• Map (k1,v1) -> list (k2, v2)
• Reduce(k2, list(v2)) -> list (v3)
• Framework handles everything else
• Values with the same key go to the same reducer
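
The two signatures above can be read as a contract between the programmer and the framework. The sketch below is a minimal single-machine stand-in for "everything else": `run_mapreduce` is a hypothetical driver, not a Hadoop API, and the activity-log records are made up for illustration. It shows the grouping step that sends all values with the same key to the same reduce call.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

Pair = Tuple[Any, Any]


def run_mapreduce(records: Iterable[Pair],
                  map_fn: Callable[[Any, Any], Iterable[Pair]],
                  reduce_fn: Callable[[Any, List[Any]], Iterable[Any]]
                  ) -> Dict[Any, List[Any]]:
    """Toy driver: Map, then group by key, then Reduce."""
    # Map: each (k1, v1) record yields a list of (k2, v2) pairs.
    intermediate: List[Pair] = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle: group values by key so that every value with the
    # same k2 reaches the same reduce call.
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce: each (k2, list(v2)) yields list(v3).
    return {k2: list(reduce_fn(k2, groups[k2])) for k2 in sorted(groups)}


def users_by_action_map(line_no, line):
    """Map (line number, 'user action') -> [(action, user)]."""
    user, action = line.split()
    return [(action, user)]


def users_by_action_reduce(action, users):
    """Reduce (action, [users]) -> sorted list of distinct users."""
    return sorted(set(users))


if __name__ == "__main__":
    log = [(0, "alice view"), (1, "bob click"),
           (2, "alice click"), (3, "alice view")]
    print(run_mapreduce(log, users_by_action_map, users_by_action_reduce))
```

Only the two small functions at the bottom are specific to the problem; the driver never needs to change when the map and reduce logic does.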



MapReduce Example: Word Count
Stages: Input -> Split -> Map -> Shuffle/Sort -> Reduce -> Output

Input:
  Deer Beer River
  Car Car River
  Deer Car Beer

Split and Map (one mapper per split):
  Deer Beer River  ->  Deer, 1   Beer, 1   River, 1
  Car Car River    ->  Car, 1    Car, 1    River, 1
  Deer Car Beer    ->  Deer, 1   Car, 1    Beer, 1

Shuffle/Sort and Reduce (one group per key):
  Beer, 1   Beer, 1            ->  Beer, 2
  Car, 1    Car, 1    Car, 1   ->  Car, 3
  Deer, 1   Deer, 1            ->  Deer, 2
  River, 1  River, 1           ->  River, 2

Output:
  Beer, 2
  Car, 3
  Deer, 2
  River, 2
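
The word count flow can be sketched in the style of Hadoop Streaming, where the mapper and reducer exchange key/value pairs as tab-separated text lines. This is a sketch under assumptions: the functions take lists of lines so they run without a cluster (a real streaming job would read standard input and write standard output), and `sorted` stands in for the framework's shuffle/sort step.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of each word; the input must be sorted by word."""
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    splits = ["Deer Beer River", "Car Car River", "Deer Car Beer"]
    mapped = list(mapper(splits))
    shuffled = sorted(mapped)  # stand-in for the framework's shuffle/sort
    for line in reducer(shuffled):
        print(line)
```

The reducer relies on its input arriving sorted by key, which is why the shuffle/sort stage sits between Map and Reduce.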



In-Class Exercise
• What are the popular Hadoop distributions?
• Any alternatives to Hadoop?
• What are they?
• Are they better? Why?
• Will Hadoop still be relevant in the future?
