DS2 1: Big Data
DS
Introduction
György Ottucsák
Objective
A practical guide to data science, continuing the journey you began in the
Data Science 1 class.
● Improvement Areas 🤨
○ Hard homework vs. ChatGPT
○ Mixed languages
https://www.kaggle.com/competitions/player10
Big Data
György Ottucsák
Information Explosion Era
● In the past
○ the bulk of customer interaction data came from your own cash registers;
● Today
○ detailed web logs track each customer's interactions, feed individual customer profiles, and target promotions;
○ track customers' sentiment on social media and through the search engines they use to reach the website;
○ track your customers in the real world by keeping track of markers such as Wi-Fi and cellular IDs;
○ Target example
Use-case: InfoScout
● Big Data in Retail Industry
Quiz: What is Big Data?
Which of these would you consider to be ‘big data’? (best guess).
4. all stock transactions made on the New York Stock Exchange during the year;
What is Big Data?
● In 2001, industry analyst Doug Laney described Big Data using the 3 Vs
(Volume, Velocity, Variety), and the name stuck.
The Three Vs of Data: Volume
● 2023: 1 TB ~ $50 → ~5E-05 $/MB
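The arrow on the slide is just unit arithmetic: at roughly $50 per terabyte (2023 disk prices, per the slide), storage costs about 5×10⁻⁵ dollars per megabyte. A quick check:

```python
# Back-of-envelope storage cost from the slide: 1 TB of disk ~ $50 (2023).
cost_per_tb = 50.0       # USD per terabyte (slide's figure)
mb_per_tb = 1_000_000    # 1 TB = 10^6 MB (decimal units)

cost_per_mb = cost_per_tb / mb_per_tb
print(f"{cost_per_mb:.0e}")  # → 5e-05 USD per MB
```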
The Three Vs of Data: Variety
● Growing number of types of data being generated: databases, XML, JSON
(key-value), …;
The Three Vs of Data: Velocity
● Reacting quickly is important
● Periodic peaks in data arrival
Digital Transformation
1. Data Discovery: explore every type of data at scale; discover data that you never had
2. Single View: you know the data you need, but it's disjointed
3. Predictive Analytics: what past data can predict future events
Traditional business 🡪 next-generation business: Uber/Airbnb/Netflix
Data Discovery → Single View → Predictive Analytics
Cost Savings
● No revolution 🡪 just optimize
● Gentle introduction of big data
● Active Archive
○ Cold data storage to active archive
● Data Enrichment
○ Incorporate publicly available datasets
History of Hadoop
● Doug Cutting
● 2006: joined Yahoo!
● Hadoop became the operating system for big data
● The name "Hadoop": Cutting's son's yellow plush toy
History of Hadoop - Timeline
Year Month Event
2003 October Google File System paper released
2004 December MapReduce: Simplified Data Processing on Large Clusters
2006 January Hadoop subproject created with mailing lists, jira, and wiki
2006 January Hadoop is born from Nutch 197
2006 February NDFS+ MapReduce moved out of Apache Nutch to create Hadoop
2006 February Hadoop is named after Cutting's son's yellow plush toy
2006 April Hadoop 0.1.0 released
2006 April Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
2006 May Yahoo deploys 300 machine Hadoop cluster
2006 October Yahoo Hadoop cluster reaches 600 machines
2007 April Yahoo runs two clusters of 1,000 machines
2007 October First release of Hadoop that includes HBase
2007 October Yahoo Labs creates Pig, and donates it to the ASF
2008 January YARN JIRA opened
2008 January 20 companies on "Powered by Hadoop Page"
2008 February Yahoo moves its web index onto Hadoop
2008 February Yahoo! production search index generated by a 10,000-core Hadoop cluster
2008 March First Hadoop Summit
2008 April Hadoop sets the world record as the fastest system to sort a terabyte of data: running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds
2011 January Facebook, LinkedIn, eBay and IBM collectively contribute 200,000 lines of code
2011 March Apache Hadoop takes top prize at Media Guardian Innovation Awards
2011 June Rob Bearden and Eric Baldeschwieler spin Hortonworks out of Yahoo
2011 June Yahoo has 42K Hadoop nodes and hundreds of petabytes of storage
Advantages of Hadoop
● Freedom from draconian licensing costs;
● Affordable, efficient data processing and analytics that scale well with
the size of the data.
Hadoop
● Storage is offered by HDFS (Hadoop Distributed File System) and the
processing capabilities are offered by YARN (Yet Another Resource
Negotiator).
● Schema-on-read
○ HDFS knows nothing about the data's fields, columns, or structure; a schema is applied only during processing
● Schema-on-write
○ Traditional databases: the schema is enforced when data is written
● Files are divided into big blocks and distributed across the cluster;
● A program can ask "Where do the pieces of my file live?" and send the
computation to the data, instead of sending the data to the client.
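The schema-on-read idea above can be sketched in a few lines: the storage layer holds raw bytes, and each reader imposes its own schema at processing time. The data and field names here are made up for illustration.

```python
import csv
import io

# Schema-on-read sketch (hypothetical data): the storage layer keeps raw
# bytes; the *reader* decides what the fields mean at processing time.
raw = b"1,Alice,42\n2,Bob,37\n"

def read_with_schema(data, schema):
    """Apply a schema while reading; the stored bytes know nothing about it."""
    rows = csv.reader(io.StringIO(data.decode("utf-8")))
    return [dict(zip(schema, row)) for row in rows]

# Two different "views" of the very same stored bytes:
v1 = read_with_schema(raw, ["id", "name", "age"])
v2 = read_with_schema(raw, ["key", "user", "score"])
print(v1[0])  # {'id': '1', 'name': 'Alice', 'age': '42'}
```

A schema-on-write database would instead reject or coerce the data at insert time; here nothing is validated until a reader asks for it.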
HDFS Blocks
HDFS Components
NameNode (one or two per cluster)
○ Java process running as a UNIX daemon
● Actual data never resides here, only metadata (e.g., maps of where
blocks are distributed).
DataNode (as many as you want per cluster)
● Stores the chunks of data, and is responsible for replicating the chunks
across other DataNodes
2. Disk failure on DataNode;
3. Not all DataNodes are used;
4. Block sizes are different;
5. Disk failure on NameNode;
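The NameNode/DataNode split above can be modeled as a toy block map (this is not real HDFS code; node names and placement policy are invented, while the 128 MB block size and replication factor 3 are HDFS defaults):

```python
import itertools

# Toy model of NameNode metadata: the NameNode keeps only the
# block -> DataNode map; the actual bytes live on the DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MB
REPLICATION = 3                  # HDFS default replication factor

datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical cluster

def place_blocks(file_size):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    rotation = itertools.cycle(datanodes)
    block_map = {}
    for b in range(n_blocks):
        block_map[b] = [next(rotation) for _ in range(REPLICATION)]
    return block_map

# "Where do the pieces of my file live?" for a 300 MB file -> 3 blocks
print(place_blocks(300 * 1024 * 1024))
```

With replication 3, losing a disk on one DataNode (failure scenario 2 above) leaves two live copies of every affected block, which the NameNode then re-replicates elsewhere.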
HDFS: web-based browser and a simple Command-Line Interface (CLI)
HDFS: it acts like a filesystem
HDFS: code example (Cloudera)
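A minimal CLI sketch of the filesystem-like behavior described above, assuming a configured Hadoop client on the cluster; the paths and filenames are illustrative, not from the original example:

```shell
# Hypothetical session; `hdfs dfs` mirrors familiar UNIX filesystem commands.
hdfs dfs -mkdir -p /user/student/demo              # create a directory in HDFS
hdfs dfs -put local_data.csv /user/student/demo/   # upload a local file
hdfs dfs -ls /user/student/demo                    # list directory contents
hdfs dfs -cat /user/student/demo/local_data.csv    # print the file to stdout
hdfs dfs -rm -r /user/student/demo                 # clean up
```

Because these commands require a running cluster, they are shown as a fragment rather than a runnable script.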
Hadoop Cluster Architecture
NameNode Architecture
● Lots of memory
● HDFS offers classic POSIX filesystem permissions for controlling who can
read and write (e.g., -rwxr-xr--)
● HDFS also offers extended Access Control Lists (ACLs) for richer scenarios
● Outside core HDFS, the Apache Ranger HDFS plugin offers centralized
authorization policies and auditing
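The POSIX mode string in the example above encodes three permission triples (owner, group, other); each triple is an octal digit. A small helper makes the mapping concrete (the function name is mine, not an HDFS API):

```python
# The slide's example mode string "-rwxr-xr--" maps to octal 754:
# owner rwx = 7, group r-x = 5, other r-- = 4.
def mode_to_octal(mode):
    """Convert an ls-style mode string (file-type char + 9 bits) to octal."""
    bits = "".join("0" if c == "-" else "1" for c in mode[1:])  # rwxrwxrwx
    return oct(int(bits, 2))[2:]

print(mode_to_octal("-rwxr-xr--"))  # → 754
```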
Heterogeneous Storage
● Disk
● SSDs
● Memory.
YARN Architectural Components
● Resource Manager (one or two per cluster) that provides
- Global resource scheduler
- Hierarchical queues
● Node Manager (running next to the DataNode)
- Encapsulates RAM and CPU resources available on a worker
node into units called YARN containers
- Manages the lifecycle of YARN containers
- Container resource monitoring
● Application Master (created on-demand)
- Manages application scheduling and task execution
- Typically specific to a higher-level framework (e.g., the MapReduce
Application Master).
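The container idea above can be sketched as a toy model (this is not YARN's real API; class and method names are invented): a NodeManager carves its worker node's RAM and CPU into containers that an ApplicationMaster requests.

```python
from dataclasses import dataclass

# Toy model of YARN resource accounting, not the real NodeManager.
@dataclass
class NodeManager:
    ram_mb: int   # RAM still available on this worker node
    vcores: int   # virtual cores still available

    def allocate(self, ram_mb, vcores):
        """Grant a container if the node still has enough RAM and CPU."""
        if ram_mb <= self.ram_mb and vcores <= self.vcores:
            self.ram_mb -= ram_mb
            self.vcores -= vcores
            return True
        return False

nm = NodeManager(ram_mb=8192, vcores=4)
print(nm.allocate(2048, 1))  # True  -> container granted
print(nm.allocate(8192, 4))  # False -> not enough resources left
```

In real YARN the ResourceManager makes the global scheduling decision and the NodeManager only enforces and monitors the container's limits; the toy collapses both into one object for brevity.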
Hadoop Cluster Architecture
YARN ResourceManager
YARN NodeManager
YARN NodeManager 2
Container and ApplicationMaster
● Security is important
ApplicationMaster
Bringing Computation to the Data
Policy based allocation of resources
Managing Queue Limits with
Apache Ambari
Appendix
File permissions
Useful Literature
● Udacity:
● edX:
● DataCamp:
○ Introduction to PySpark
○ Introduction to Spark in R
● Books:
+ Industry experts…
+ Personal experience…