
DS2
Introduction
György Ottucsák
Objective
A practical guide to data science, continuing the journey you started in the Data Science 1 class.

● Breadth-first search → gives the audience pointers to deepen their knowledge.
● Practical knowledge → homework and programming exercises stimulate this.
● Hacking skills

“Try again. Fail again. Fail better.”


Feedback / Expectations

● Outstanding Areas 🙃
  ○ Guest lectures
  ○ Big homework/Competition
  ○ Individual work / HW
  ○ …

● Improvement Areas 🤨
  ○ Hard HW
  ○ ChatGPT
  ○ Mixed languages
  ○ Hard to learn from the slides (post slides earlier)

● Wish List
  ○ …

“The homework was hard to get started with, but the tasks themselves weren’t difficult.”

“A recording of the practical parts (GCP) would have been useful.”


ChatGPT
Hacking Skills
● Super important;

● Quick idea validation;

● Picking the best tool for the problem;

● Applying and combining existing technologies is essential.

This class will focus on hacking skills.


Agenda of the Semester

● Cloud Service: Google Cloud Platform (Course License?)

● Deep Learning: TensorFlow, Keras

● Technology Stack: GCP, Kaggle, Python, Linux (bash + awk), Excel
Requirements
+ 40% - 2x homework, Theory + Programming
+ 20% - BIG homework, Data Science Competition
+ 20% - BIG homework presentation (5-7 minutes)
+ 20% - 5x Quiz

Offered grade [-50%, 60%, 70%, 80%-]

Are you happy with the grade?

🙃 ⇒ no exam

🤨 ⇒ go to exam* to replace the quiz points.

1/ Minimum 50% per homework and on the exam
2/ Best solutions for the big homework ⇒ grade 5

*Discussion of a big data topic (from a predefined set), 20%
Big homework in 2022

https://www.kaggle.com/competitions/player10
Big Data

György Ottucsák
Information Explosion Era

• Where is this data coming from?

• More diverse data sources: click streams, mobile web tracking, speech-to-text conversion, etc.

• Example: a retail business in the US


Retail Store Example
● 30 years ago

  ○ The bulk of the customer interaction data came from your own cash registers,
  ○ in physical stores, tracking both purchases and payment methods.

● Today

  ○ You are likely operating a digital storefront on the web;
  ○ detailed web logs track each customer’s interactions, feed individual customer profiles, and drive targeted promotions;
  ○ you track customers’ sentiment on social media and through the search engines they use to reach your website;
  ○ you even get to track your customers in the real world, using markers such as Wi-Fi and cellular IDs.
  ○ Target example
Use-case: InfoScout
● Big Data in the Retail Industry
Quiz: What is Big Data?
Which of these would you consider to be ‘big data’? (best guess).

1. order details for a purchase at a store;

2. all orders across hundreds of branches nationwide;

3. information about a person’s stock portfolio;

4. all stock transactions made on the New York Stock Exchange during the year.
What is Big Data?

● In 2001, the industry analyst Doug Laney described Big Data using 3 Vs
(Velocity, Variety, Volume) and the name kind of stuck.
The Three Vs of Data: Volume

2023: 1 TB ≈ $50 → about 5E-05 dollars per MB ($50 / 10⁶ MB)
The Three Vs of Data: Variety
● A growing number of data types being generated: databases, XML, JSON (key-value), …;

● Earlier: MySQL, MSSQL, Oracle, IBM;

● Unstructured data: sensors, GPS, mobile logs, scans, social networks;

● Keep it in its original format.


The Three Vs of Data: Velocity
● Speed: GB/sec, MB/sec

● Logs, social media, RFID, streaming

● Reacting quickly is important

● Periodic peaks

● eCommerce example: capturing only purchases vs. other information too (OS, time, ...)


3 V’s + 1/3

1. Data Discovery: explore types of data, at scale, that you never had before.

2. Single View: you know the data you need, but it’s disjointed.

3. Predictive Analytics: what past data can predict future events.

Digital Transformation

Traditional business 🡪 next-generation business: Uber/Airbnb/Netflix

Data Discovery → Single View → Predictive Analytics
Cost Savings
● No revolution 🡪 just optimize
● A gentle introduction to big data

● Active Archive
  ○ Move cold data storage to an active archive

● ETL (Extract-Transform-Load) Onboarding
  ○ 60% of cycles go to generating the schema
  ○ Big Data/Hadoop sits next to the existing solution

● Data Enrichment
  ○ Incorporate publicly available datasets
History of Hadoop
Doug Cutting

● 2003 Nutch, part-time, scalable 5 🡪 40 machines

● 2003/4 Google papers

  ○ MapReduce: Simplified Data Processing on Large Clusters

  ○ The Google File System (GFS)

● 2006 Yahoo!

● Hadoop became the operating system for big data

● The name “Hadoop”
History of Hadoop - Timeline
Year Month Event
2003 October Google File System paper released
2004 December MapReduce: Simplified Data Processing on Large Clusters
2006 January Hadoop subproject created with mailing lists, JIRA, and wiki
2006 January Hadoop is born from Nutch (NUTCH-197)
2006 February NDFS+ MapReduce moved out of Apache Nutch to create Hadoop
2006 February Hadoop is named after Cutting's son's yellow plush toy
2006 April Hadoop 0.1.0 released
2006 April Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
2006 May Yahoo deploys 300 machine Hadoop cluster
2006 October Yahoo Hadoop cluster reaches 600 machines
2007 April Yahoo runs two clusters of 1,000 machines
2007 October First release of Hadoop that includes HBase
2007 October Yahoo Labs creates Pig, and donates it to the ASF
2008 January YARN JIRA opened
2008 January 20 companies on "Powered by Hadoop Page"
2008 February Yahoo moves its web index onto Hadoop
2008 February Yahoo! production search index generated by a 10,000-core Hadoop cluster
2008 March First Hadoop Summit
2008 April Hadoop sets the world record as the fastest system to sort a terabyte of data: running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds
2008 May Hadoop wins TeraByte Sort (World Record, sortbenchmark.org)
2008 July Hadoop wins Terabyte Sort Benchmark
History of Hadoop – Timeline 2
2008 October Loading 10 TB/day in Yahoo clusters
2008 October Cloudera, Hadoop distributor, is founded
2008 November Google MapReduce implementation sorted one terabyte in 68 seconds
2009 March Yahoo runs 17 clusters with 24,000 machines
2009 April Hadoop sorts a petabyte
2009 May Yahoo! used Hadoop to sort one terabyte in 62 seconds
2009 June Second Hadoop Summit
2009 July Hadoop Core is renamed Hadoop Common
2009 July MapR, Hadoop distributor, founded
2009 July HDFS now a separate subproject
2009 July MapReduce now a separate subproject
2010 January Kerberos support added to Hadoop
2010 May Apache HBase graduates
2010 June Third Hadoop Summit
2010 June Yahoo: 4,000 nodes / 70 petabytes
2010 June Facebook: 2,300 clusters / 40 petabytes
2010 September Apache Hive graduates
2010 September Apache Pig graduates
2011 January Apache ZooKeeper graduates
2011 January Facebook, LinkedIn, eBay and IBM collectively contribute 200,000 lines of code
2011 March Apache Hadoop takes top prize at Media Guardian Innovation Awards
2011 June Rob Bearden and Eric Baldeschwieler spin Hortonworks out of Yahoo
2011 June Yahoo has 42K Hadoop nodes and hundreds of petabytes of storage
Advantages of Hadoop
● Be free from draconian licensing costs;

● Run on commodity hardware, with no requirements for custom servers and/or networking;

● Scale linearly with the growth of data volume;

● Afford efficient data processing and analytics that scale well with the size of the data.
Hadoop
● Storage is offered by HDFS (Hadoop Distributed File System) and the processing capabilities are offered by YARN (Yet Another Resource Negotiator).

● Hadoop = HDFS + YARN

● Schema-on-read
  ○ HDFS doesn’t know the data, fields, columns, or structure; the schema is applied only during processing (see the sketch below)
  ○ A repository of unstructured data is called a data lake

● Schema-on-write
  ○ Traditional databases

Extreme convenience of relational databases vs. extreme scalability of Hadoop.
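A minimal schema-on-read sketch using bash + awk (the file name and field layout are made up for illustration). The raw log is stored as-is; the “schema” (which field means what) is applied only at read time:

    # clicks.log holds raw, schema-less lines such as:
    #   2023-09-01T10:15:02,user42,/checkout,199.99
    # Nothing was validated or structured when the data was written.

    # At read time we decide that field 3 is the URL and field 4 is the price,
    # then aggregate revenue per URL:
    awk -F',' '{ revenue[$3] += $4 } END { for (u in revenue) print u, revenue[u] }' clicks.log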


HDFS + YARN

● YARN and HDFS are independent, loosely coupled services with well-defined APIs → flexibility.
Apache Software Foundation
● Apache Hadoop

● Hadoop Ecosystem under the Apache umbrella

● Consistent software package
  ○ the Linux way (Debian)
  ○ focused on the added value
  ○ Apache Bigtop

*Open Source Innovation in Artificial Intelligence and Data


Hadoop Ecosystem (partial)
Summary
● A brief history of the database industry and the rise of the big data challenge;
● The 3 Vs of the big data challenge and the introduction of big data platforms into enterprise IT;
● Key drivers for building enterprise big data management platforms on Hadoop and its ecosystem projects;
● A typical customer journey with Hadoop adoption;
● Key Hadoop concepts:
  - Hadoop = HDFS + YARN
  - Schema-on-read
  - Enterprise "data lake"
● The origins of Apache Hadoop and its ecosystem;
● The development and governance of Hadoop and its ecosystem, as seen by:
  - The Apache Software Foundation
  - ODPi.
Next – Core Hadoop Architecture
● The Hadoop Distributed File System (HDFS) and its components: NameNode, DataNode, and Clients;

● Yet Another Resource Negotiator (YARN) and its components: ResourceManager, NodeManager, and ApplicationMaster;

● Additional YARN and HDFS features: High Availability, the resource request model, schedulers;

● Topology of Hadoop clusters.


HDFS
● It feels like a regular filesystem;

● Divides files into big blocks and distributes them across the cluster;

● Stores multiple replicas of each block for reliability;

● Writing into the middle of a file is not supported;

● A program can ask “Where do the pieces of my file live?”, so instead of sending data to the client, the computation is sent to the data (see the sketch below).
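A minimal sketch of this in practice with the standard HDFS CLI (the file and path names are made up for illustration): put a file into HDFS, then ask where its blocks ended up.

    # Copy a local file into HDFS; it is split into blocks behind the scenes.
    hdfs dfs -put access.log /data/access.log

    # fsck lists each block of the file and the DataNodes holding its replicas.
    hdfs fsck /data/access.log -files -blocks -locations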
HDFS Blocks
HDFS Components
● one or two NameNodes
○ Java process running as a UNIX daemon

● as many DataNodes as your IT budget will allow ☺


NameNode (one or two per cluster)
●  Represents a single filesystem namespace rooted at / 

● Is the master service of HDFS

● Determines and maintains how the chunks of data are distributed


across the DataNodes

● Actual data never resides here, only metadata (e.g., maps of where
blocks are distributed).
DataNode (as many as you want per cluster)

● Stores the chunks of data, and is responsible for replicating the chunks across other DataNodes

● Default number of replicas on most clusters is 3 (but it can be changed on a per-file basis, as sketched below)

● Default block size on most clusters is 128 MB.
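A minimal sketch of changing these defaults per file (the paths and values are made up for illustration):

    # Raise the replication factor of a single file to 5; other files keep the default.
    hdfs dfs -setrep -w 5 /data/access.log

    # Block size is fixed at write time; here one file is written with 256 MB blocks
    # (268435456 bytes) instead of the cluster default.
    hdfs dfs -D dfs.blocksize=268435456 -put big_dataset.csv /data/big_dataset.csv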


Quiz: Are there problems?
1. Network failure between nodes;

2. Disk failure on DataNode;

3. Not all DataNodes are used;

4. Block sizes are different; 

5. Disk failure on NameNode;
HDFS - Web-based browser and simple Command Line Interface (CLI)
HDFS - It acts like a filesystem
HDFS - Code example (Cloudera)
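A minimal sketch of the CLI in action (directory and file names are made up for illustration); the commands deliberately mirror the classic UNIX filesystem commands:

    hdfs dfs -mkdir -p /user/student/input      # create a directory
    hdfs dfs -put notes.txt /user/student/input # upload a local file
    hdfs dfs -ls /user/student/input            # list directory contents
    hdfs dfs -cat /user/student/input/notes.txt # print a file
    hdfs dfs -rm -r /user/student/input         # remove recursively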
Hadoop Cluster Architecture
NameNode Architecture
● Lots of memory

● DataNodes connect to the NameNode

● Access Control List


Isn't the NameNode a SPOF?

● In Hadoop prior to version 2.0, the NameNode was a single point of failure (SPOF).

● HDFS NameNode High Availability (HA): an Active/Standby pair, or shared storage on a NAS (see the sketch below).
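A minimal sketch, assuming an HA cluster whose two NameNodes are registered under the service IDs nn1 and nn2 (the IDs are made up for illustration; they are cluster-specific):

    # Ask each NameNode whether it is currently active or standby.
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2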


DataNodes
● Heartbeat: each DataNode periodically signals the NameNode that it is alive.
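One way to observe heartbeats in practice is the cluster report, which lists every DataNode with its capacity and the time of its last contact (i.e., its last heartbeat) with the NameNode:

    hdfs dfsadmin -report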
DataNodes
● What is the role of the NameNode?
File write
Replication and File Placement
Multi-Tenant Control
● HDFS supports the notion of users and groups of users. Note that these are separate accounts from whatever you may have provisioned as Linux or LDAP users on the servers running the HDFS services;

● HDFS offers classic POSIX filesystem permissions for controlling who can read and write (e.g., -rwxr-xr--);

● HDFS also offers extended Access Control Lists (ACLs) for richer scenarios, as sketched below;

● Outside core HDFS, the Apache Ranger HDFS plugin offers centralized authorization policies and auditing.
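A minimal sketch of both mechanisms (user, group, and path names are made up for illustration):

    # Classic POSIX-style permissions: owner, group, and mode bits.
    hdfs dfs -chown alice:analytics /data/reports
    hdfs dfs -chmod 750 /data/reports

    # Extended ACL: additionally grant read+execute to one more user.
    hdfs dfs -setfacl -m user:bob:r-x /data/reports
    hdfs dfs -getfacl /data/reports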
Heterogeneous Storage
● Disk

● SSDs

● Memory.
YARN Architectural Components
● ResourceManager (one or two per cluster) that provides
  - a global resource scheduler
  - hierarchical queues
● NodeManager (running next to the DataNode)
  - encapsulates the RAM and CPU resources available on a worker node into units called YARN containers
  - manages the lifecycle of YARN containers
  - monitors container resource usage
● ApplicationMaster (created on demand)
  - manages application scheduling and task execution
  - is typically specific to a higher-level framework (e.g., the MapReduce ApplicationMaster).
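A minimal sketch of inspecting these components from the CLI (the application ID shown is made up; real IDs differ per cluster):

    # NodeManagers registered with the ResourceManager, with container counts.
    yarn node -list

    # Applications currently running; each one has its own ApplicationMaster.
    yarn application -list -appStates RUNNING

    # Fetch the logs of a finished application by its ID.
    yarn logs -applicationId application_1700000000000_0001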
Hadoop Cluster Architecture
YARN ResourceManager
YARN NodeManager
YARN NodeManager 2
Container and ApplicationMaster

● Transient service (e.g. different for Pig, Hive)

● Security is important
ApplicationMaster
Bringing Computation to the Data
Policy based allocation of resources
Managing Queue Limits with Apache Ambari
Appendix
File permissions
Useful Literature
● Udacity:

○ Intro to Hadoop and MapReduce

○ Deploying a Hadoop Cluster

● edX:

○ Introduction to Apache Hadoop

● DataCamp:

○ Introduction to PySpark

○ Introduction to Spark in R

● Books:

○ Tom White, Hadoop: The Definitive Guide

○ Hadoop MapReduce v2 Cookbook

○ Nicolò Cesa-Bianchi, Gábor Lugosi, Prediction, Learning, and Games

+ Industry experts…
+ Personal experience…
