Unit 1


Data Science – Why all the excitement?

Data Analysis Has Been Around for a While…

R.A. Fisher, W.E. Deming, Peter Luhn, Howard Dresner
Data Science: Why all the Excitement?
Exciting new, effective applications of data analytics, e.g.:

• Google Flu Trends: detecting outbreaks two weeks ahead of CDC data.
• New models are estimating which cities are most at risk for the spread of the Ebola virus.
• The prediction model is built on various data sources, data types, and analyses.
Why all the Excitement?
Predicting political campaigns and election outcomes.
PageRank: The web as a behavioral dataset
Sponsored search
• Google's revenue is around $50 bn/year from advertising, about 97% of the company's revenue.

• Sponsored search uses an auction – a pure competition for marketers trying to win access to consumers.

• In other words, a competition between models of consumers – their likelihood of responding to the ad – and of determining the right bid for the item.

• There are around 30 billion search requests a month – perhaps a trillion events of history across search providers.
Other Data Science Applications

• Transaction Databases → Recommender Systems (Netflix), Fraud Detection (Security and Privacy)

• Wireless Sensor Data → Smart Home, Real-time Monitoring, Internet of Things

• Text Data, Social Media Data → Product Review and Consumer Satisfaction (Facebook, Twitter, LinkedIn), E-discovery

• Software Log Data → Automatic Troubleshooting (Splunk)

• Genotype and Phenotype Data → DNA Sequencing, Patient-Centered Care, Personalized Medicine
Where does data come from?

"Big Data" Sources (petabytes, zettabytes, or exabytes)

It's all happening online. Every click, ad impression, billing event, fast-forward/pause, server request, transaction, network message, and fault is recorded.

• User generated data (web & mobile)
• Internet of Things / M2M
• Health / scientific computing
5 Vs of Big Data

• Volume: raw data
• Velocity: change over time
• Variety: data types
• Veracity: data quality
• Value: information for decision making
What is Data Science?

Data Science – A Definition

Data science uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, and visualize data, and to interact with it, in order to create data products.
Goal of Data Science

Turn data into data products.


Data Science – A Visual Definition
Contrast: Databases

                Databases                          Data Science
Data Value      "Precious"                         "Cheap"
Data Volume     Modest                             Massive
Examples        Bank records, personnel            Online clicks, GPS logs, tweets,
                records, census, medical records   building sensor readings
Priorities      Consistency, error recovery,       Speed, availability, query richness
                auditability
Structured      Strongly (schema)                  Weakly or none (text)
Properties      Transactions, ACID*                CAP* theorem (2 out of 3),
                                                   eventual consistency
Realizations    SQL                                NoSQL: MongoDB, CouchDB, HBase,
                                                   Cassandra, Riak, Memcached,
                                                   Apache River, …

ACID = Atomicity, Consistency, Isolation and Durability
CAP = Consistency, Availability, Partition Tolerance
Contrast: Business Intelligence
Business Intelligence: querying the past
Data Science: querying the past, present and future
Contrast: Machine Learning

Machine Learning                                   Data Science
Develop new (individual) models                    Explore many models; build and tune hybrids
Prove mathematical properties of models            Understand empirical properties of models
Improve/validate on a few, relatively clean,       Develop/use tools that can handle massive
small datasets                                     datasets
Publish a paper                                    Take action!
Big Data vs Data Analytics vs Data Science:
What’s The Difference?
• Big data refers to any large and complex collection of data.
• Data analytics is the process of extracting meaningful information
from data.
• Data science is a multidisciplinary field that aims to produce broader
insights.

What is big data?

• As the name suggests, big data simply refers to extremely large data
sets. (petabytes, zettabytes, or exabytes. )
• This size, combined with the complexity and evolving nature of these data sets, means they exceed the capabilities of traditional data management tools.
• Some data sets that we can consider truly big data include:
➢Stock market data
➢Social media
➢Sporting events and games
➢Scientific and research data

Facets of data
• Structured
• Semi structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming

Types of big data

• Structured data.
➢Any data set that adheres to a specific structure can be called structured data.
➢These structured data sets can be processed relatively easily compared to other data
types as users can exactly identify the structure of the data.
➢A good example of structured data is a distributed RDBMS, which contains data in organized table structures, or an Excel sheet.
• Semi-structured data.
➢This type of data does not adhere to a specific structure yet retains some kind of
observable structure such as a grouping or an organized hierarchy.
➢Some examples of semi-structured data will be markup languages (XML), web pages,
emails, etc.
• Unstructured data.
➢This type of data consists of data that does not adhere to a schema or a preset
structure.
➢It is the most common type of data when dealing with big data—things like text,
pictures, video, and audio all come up under this type.
• Natural language
➢Natural language is a special type of unstructured data; it’s challenging to process
because it requires knowledge of specific data science techniques and linguistics.
➢The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but models
trained in one domain don’t generalize well to other domains.
• Machine-generated data
➢Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention.
➢Machine-generated data is becoming a major data resource and will continue to be one.
➢ Examples of machine data are web server logs, call detail records, network event logs,
and telemetry (remote sources data collected using sensors)
• Graph-based or network data
➢ Data that focuses on the relationship or adjacency of objects.
➢The graph structures use nodes, edges, and properties to represent and store
graphical data.
➢Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people (a short code sketch follows this list).
➢Examples of graph-based data can be found on many social media websites
➢For instance, on LinkedIn you can see who you know at which company.
➢Your follower list on Twitter is another example of graph-based data.
➢The power and sophistication comes from multiple, overlapping graphs of the
same nodes.
➢For example, imagine the connecting edges here to show “friends” on
Facebook. Imagine another graph with the same people which connects
business colleagues via LinkedIn. Imagine a third graph based on movie
interests on Netflix. Overlapping the three different-looking graphs makes
more interesting questions possible.
• Streaming data
➢The data flows into the system when an event happens instead of being
loaded into a data store in a batch.
➢ Examples are the “What’s trending” on Twitter, live sporting or music
events, and the stock market.
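As a concrete illustration of graph-based data, the sketch below (a hypothetical mini-network, not taken from the slides) stores a small friendship graph as an adjacency list and uses breadth-first search to find the shortest path between two people.

# Minimal sketch: a social graph as an adjacency list + shortest path via BFS.
from collections import deque

friends = {                      # nodes = people, edges = "friend" links (invented names)
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice", "Dave"],
    "Carol": ["Alice", "Dave"],
    "Dave":  ["Bob", "Carol", "Eve"],
    "Eve":   ["Dave"],
}

def shortest_path(graph, start, goal):
    """Return the shortest chain of people connecting start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path(friends, "Alice", "Eve"))   # e.g. ['Alice', 'Bob', 'Dave', 'Eve']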
Big data systems & tools

• When it comes to managing big data, many solutions are available to store and process the data sets. Cloud providers like AWS, Azure, and GCP offer their own data warehousing and data lake implementations, such as:
• AWS Redshift
• GCP BigQuery
• Azure SQL Data Warehouse
• Azure Synapse Analytics
• Azure Data Lake

• Data lakes and data warehouses are both widely used for storing big data, but
they are not interchangeable terms.
• A data lake is a vast pool of raw data, the purpose for which is not yet defined.
A data warehouse is a repository for structured, filtered data that has already
been processed for a specific purpose.
• There is even an emerging data management architecture trend of the data
lakehouse, which combines the flexibility of a data lake with the data
management capabilities of a data warehouse.
• The distinction is important because they serve different purposes and require
different sets of eyes to be properly optimized. While a data lake works for one
company, a data warehouse will be a better fit for another.

                  Data Lake                                 Data Warehouse
Data Structure    Raw                                       Processed
Purpose of Data   Not yet determined                        Currently in use
Users             Data scientists                           Business professionals
Accessibility     Highly accessible and quick to update     More complicated and costly to make changes
Data structure: raw vs. processed
• Raw data is data that has not yet been processed for a purpose. Perhaps the
greatest difference between data lakes and data warehouses is the varying
structure of raw vs. processed data. Data lakes primarily store raw,
unprocessed data, while data warehouses store processed and refined data.
• Because of this, data lakes typically require much larger storage capacity
than data warehouses.
• Additionally, raw, unprocessed data is malleable, can be quickly analyzed for
any purpose, and is ideal for machine learning. The risk of all that raw data,
however, is that data lakes sometimes become data swamps without
appropriate data quality and data governance measures in place.
• Data warehouses, by storing only processed data, save on pricey storage
space by not maintaining data that may never be used. Additionally,
processed data can be easily understood by a larger audience.

Purpose: undetermined vs in-use
• The purpose of individual data pieces in a data lake is not fixed. Raw
data flows into a data lake, sometimes with a specific future use in
mind and sometimes just to have on hand.
• This means that data lakes have less organization and less filtration of
data than their counterpart.
• Processed data is raw data that has been put to a specific use. Since
data warehouses only house processed data, all of the data in a data
warehouse has been used for a specific purpose within the
organization.
• This means that storage space is not wasted on data that may never
be used.

Users: data scientists vs business professionals

• Data lakes are often difficult to navigate for those unfamiliar with unprocessed data. Raw, unstructured data usually requires a data scientist and specialized tools to understand and translate it for any specific business use.
• Alternatively, there is growing momentum behind data preparation
tools that create self-service access to the information stored in data
lakes.
• Processed data is used in charts, spreadsheets, tables, and more, so
that most, if not all, of the employees at a company can read it.
Processed data, like that stored in data warehouses, only requires
that the user be familiar with the topic represented.

Accessibility: flexible vs secure

• Accessibility and ease of use refer to the data repositories as a whole, not the data within them.
• Data lake architecture has no structure and is therefore easy to
access and easy to change. Plus, any changes that are made to the
data can be done quickly since data lakes have very few limitations.
• Data warehouses are, by design, more structured. One major benefit of data warehouse architecture is that the processing and structure of the data make the data itself easier to decipher; however, the limitations of that structure make data warehouses difficult and costly to manipulate.

What is data analytics?
• Data analytics is the process of analyzing data in order to extract meaningful information from a given data set. These analytics techniques and methods are most often applied to big data, though they can certainly be applied to any data set.
• The primary goal of data analytics is to help individuals or organizations to
make informed decisions based on patterns, behaviors, trends, preferences,
or any type of meaningful data extracted from a collection of data.
• For example, businesses can use analytics to identify their customer
preferences, purchase habits, and market trends and then create strategies
to address them and handle evolving market conditions. In a scientific sense,
a medical research organization can collect data from medical trials and
evaluate the effectiveness of drugs or treatments accurately by analyzing
those research data.
• Combining these analytics with data visualization techniques will help you
get a clearer picture of the underlying data and present them more flexibly
and purposefully.

Types of analytics
While there are multiple analytics methods and techniques for data analytics,
there are four types that apply to any data set.
• Descriptive. This refers to understanding what has happened in the data
set. As the starting point in any analytics process, the descriptive analysis
will help users understand what has happened in the past.
• Diagnostic. The next step after descriptive analytics is diagnostic analytics, which builds on the descriptive analysis to understand why something happened. It allows users to pinpoint the root causes of past events, patterns, etc.
• Predictive. As the name suggests, predictive analytics will predict what will
happen in the future. This will combine data from descriptive and diagnostic
analytics and use ML and AI techniques to predict future trends, patterns,
problems, etc.
• Prescriptive. Prescriptive analytics takes the predictions from predictive analytics a step further by exploring what should be done about them. It can be considered the most important type of analytics, as it allows users to understand likely future events and tailor strategies to handle them effectively.
Accuracy of data analytics

• The most important thing to remember is that the accuracy of the analytics depends on the underlying data set. Inconsistencies or errors in the data set will result in inefficient or outright incorrect analytics.
• Any good analytical method will consider external factors like data
purity, bias, and variance in the analytical methods. Normalization,
purifying, and transforming raw data can significantly help in this
aspect.
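As a small illustration of the normalization step mentioned above, the sketch below (hypothetical values, plain Python) rescales a raw numeric column to the [0, 1] range with min-max normalization before analysis.

# Minimal sketch: min-max normalization of a raw numeric column (invented data).
raw_values = [12.0, 48.5, 7.2, 95.0, 33.3]        # e.g., raw sensor or sales readings

lo, hi = min(raw_values), max(raw_values)
normalized = [(v - lo) / (hi - lo) for v in raw_values]

print(normalized)   # every value now lies between 0 and 1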

Data analytics tools & technologies

• There are both open source and commercial products for data
analytics. They will range from simple analytics tools such as
Microsoft Excel’s Analysis ToolPak that comes with Microsoft Office
to SAP BusinessObjects suite and open source tools such as Apache
Spark.
• Among cloud providers, Azure offers a particularly broad toolset for data analytics, covering most needs with its Azure Synapse Analytics suite, Apache Spark-based Databricks, HDInsight, Machine Learning, etc.
• AWS and GCP also provide tools such as Amazon QuickSight, Amazon
Kinesis, GCP Stream Analytics to cater to analytics needs.

What is data science?
Unlike the first two, data science cannot be limited to a single function or
field. Data science is a multidisciplinary approach that extracts information
from data by combining:
• Scientific methods
• Maths and statistics
• Programming
• Advanced analytics
• ML and AI
• Deep learning
• In data analytics, the primary focus is to gain meaningful insights from the underlying data. The scope of data science far exceeds this purpose: data science deals with everything from analyzing complex data and creating new analytics algorithms and tools for data processing and purification, to building powerful, useful visualizations.

Data science tools & technologies
• This includes programming languages like R, Python, Julia, which can be
used to create new algorithms, ML models, AI processes for big data
platforms like Apache Spark and Apache Hadoop.
• Data processing and purification tools such as WinPure and Data Ladder, data visualization tools such as Microsoft Power Platform, Google Data Studio and Tableau, and visualization frameworks like matplotlib and Plotly can also be considered data science tools.
• As data science covers everything related to data, any tool or
technology that is used in Big Data and Data Analytics can somehow be
utilized in the Data Science process.

Applications of Data Science
Internet Search
• Search engines make use of data science algorithms to deliver the best results for
search queries in seconds.
Digital Advertisements
• The entire digital marketing spectrum uses data science algorithms, from display
banners to digital billboards.
• This is the main reason that digital ads have higher click-through rates than
traditional advertisements.
Recommender Systems
• The recommender systems not only make it easy to find relevant products from
billions of available products, but they also add a lot to the user experience. Many
companies use this system to promote their products and suggestions in accordance
with the user’s demands and relevance of information. The recommendations are
based on the user’s previous search results.

Applications of Big Data
Big Data for Financial Services
• Credit card companies, retail banks, private wealth management advisories, insurance
firms, venture funds, and institutional investment banks all use big data for their
financial services.
• The common problem among them all is the massive amounts of multi-structured data
living in multiple disparate systems, which big data can solve.
• As such, big data is used in several ways, including:
• Customer analytics
• Compliance analytics
• Fraud analytics
• Operational analytics
Big Data in Communications
• Gaining new subscribers, retaining customers, and expanding within current subscriber
bases are top priorities for telecommunication service providers.
• The solutions to these challenges lie in the ability to combine and analyze the masses of
customer-generated data and machine-generated data that is being created every day.
Big Data for Retail
• Whether it’s a brick-and-mortar company an online retailer, the answer to staying in the
game and being competitive is understanding the customer better.
• This requires the ability to analyze all disparate data sources that companies deal with
every day, including the weblogs, customer transaction data, social media, store-branded
credit card data, and loyalty program data.
Applications of Data Analytics
Healthcare
• Instrument and machine data are increasingly being used to track and optimize patient flow,
treatment, and equipment used in hospitals. It is estimated that there will be a one percent
efficiency gain that could yield more than $63 billion in global healthcare savings by
leveraging software from data analytics companies.
Travel
• Data analytics can optimize the buying experience through mobile/weblog and social media
data analysis.
• Travel websites can gain insights into the customer’s preferences.
• Products can be upsold by correlating current sales with subsequent browsing, increasing browse-to-buy conversions via customized packages and offers.
• Data analytics that is based on social media data can also deliver personalized travel
recommendations.
Gaming
• Data analytics helps in collecting data to optimize spending within and across games. Gaming companies are also able to learn more about what their users like and dislike.
Energy Management
• Most firms are using data analytics for energy management, including smart-grid
management, energy optimization, energy distribution, and building automation in utility
companies.
• The application here is centered on controlling and monitoring network devices and dispatch crews, as well as managing service outages. Utilities can integrate millions of data points on network performance, giving engineers the opportunity to use analytics to monitor the network.
Data science is all about:
• Asking the correct questions and analyzing the raw data.
• Modeling the data using various complex and efficient algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data to make better decisions and finding the final
result.

Example:
Suppose we want to travel from station A to station B by car.
• We need to make some decisions, such as which route will get us there fastest, which route will have no traffic jams, and which will be cost-effective.
• All these decision factors act as input data, and we get an appropriate answer from these decisions; this analysis of data is called data analysis, which is a part of data science.
Need for Data Science:

• Some years ago, data was less and mostly available in a structured form,
which could be easily stored in excel sheets, and processed using BI tools.
• But in today's world data has become so vast – approximately 2.5 quintillion bytes of data are generated every day – that it has led to a data explosion. It is estimated that, at present, about 1.7 MB of data is created every second for every person on earth. Every company requires data to work, grow, and improve its business.
• Handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we need complex, powerful, and efficient algorithms and technology, and that technology came into existence as data science.
• Following are some main reasons for using data science technology:
• With the help of data science technology, we can convert the massive amount of raw
and unstructured data into meaningful insights.
• Data science technology is being adopted by various companies, whether a big brand or a startup. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science algorithms for a better customer experience.
• Data science is working for automating transportation such as creating a self-driving
car, which is the future of transportation.
• Data science can help in different predictions, such as surveys, elections, flight ticket confirmation, etc.
Data Science Components:

The main components of Data Science are given below:
1. Statistics: Statistics is one of the most important components of data science. Statistics is a
way to collect and analyze numerical data in large amounts and find meaningful insights from it.
2. Domain Expertise: Domain expertise binds data science together. Domain expertise means
specialized knowledge or skills of a particular area. In data science, there are various areas for
which we need domain experts.
3. Data engineering: Data engineering is a part of data science which involves acquiring, storing,
retrieving, and transforming the data. Data engineering also includes adding metadata (data
about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that people can
easily understand its significance. Data visualization makes it easy to grasp huge amounts of
data through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves
designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of
quantity, structure, space, and change. For a data scientist, good knowledge of mathematics is
essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is all
about training a machine so that it can act like a human brain. In data science, we use various
machine learning algorithms to solve problems.
Tools for Data Science

• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel, RapidMiner.
• Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
• Data Visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML Studio.
Machine learning in Data Science
• Following are the name of some machine learning algorithms used in
data science:
1. Regression
2. Decision tree
3. Clustering
4. Principal component analysis
5. Support vector machines
6. Naive Bayes
7. Artificial neural network
8. Apriori

1. Linear Regression Algorithm:
• Linear regression is the most popular machine learning algorithm based on supervised learning.
• This algorithm performs regression, a method of modeling a target value based on independent variables.
• It uses a linear equation to model the relationship between a set of inputs and a predicted output.
• This algorithm is mostly used in forecasting and prediction. Since it models a linear relationship between the input and output variables, it is called linear regression.

The relationship between the x and y variables can be described by the equation:

Y = mX + c

where Y = dependent variable, X = independent variable, m = slope, c = intercept.
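A minimal sketch of the idea, assuming NumPy is available and using made-up sample data: fit the slope m and intercept c of Y = mX + c by least squares and use them for a prediction.

# Minimal sketch: least-squares fit of Y = mX + c (invented data).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)     # independent variable
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])        # dependent variable

m, c = np.polyfit(x, y, deg=1)                 # slope and intercept
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
print("prediction for x = 6:", m * 6 + c)      # forecast a new value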
2. Decision Tree: The decision tree algorithm is another machine learning algorithm, which belongs
to the supervised learning family. It is one of the most popular machine learning algorithms and
can be used for both classification and regression problems.
• In the decision tree algorithm, we solve the problem using a tree representation in which each
internal node represents a feature, each branch represents a decision, and each leaf represents
the outcome.
• A typical example is a job-offer problem (a minimal code sketch follows).
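A minimal sketch of a decision tree classifier, assuming scikit-learn is available; the tiny "job offer" data set (salary and commute features) is invented purely for illustration.

# Minimal sketch: a decision tree deciding whether a job offer is accepted (invented data).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[10, 60], [15, 20], [8, 15], [20, 90], [12, 30], [18, 45]]  # [salary, commute minutes]
y = [0, 1, 0, 0, 1, 1]                                           # 1 = offer accepted

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["salary", "commute"]))    # the learned rules
print(tree.predict([[16, 25]]))                                  # classify a new offer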
3. K-Means Clustering:
• K-means clustering is one of the most popular machine learning algorithms; it belongs to the unsupervised learning family.
• It solves the clustering problem.
• If we are given a data set of items with certain features and values, and we need to categorize those items into groups, such problems can be solved using the k-means clustering algorithm.
• The k-means clustering algorithm aims to minimize an objective function known as the squared error function, given as:

J(V) = Σ (i = 1 … C)  Σ (j = 1 … c_i)  ( ||x_i − v_j|| )²

where J(V) = objective function, ||x_i − v_j|| = Euclidean distance between point x_i and centroid v_j, c_i = number of data points in the i-th cluster, and C = number of clusters.
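A minimal sketch of the algorithm on made-up one-dimensional data, using plain NumPy: alternate the assignment and update steps a few times and report the squared-error objective J(V) defined above.

# Minimal sketch: a few k-means iterations on invented 1-D data (k = 2).
import numpy as np

points = np.array([1.0, 1.2, 0.8, 8.0, 8.3, 7.9])   # items to group
centroids = np.array([0.0, 5.0])                     # initial centroid guesses

for _ in range(10):
    # assignment step: each point joins its nearest centroid
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # update step: each centroid moves to the mean of its cluster
    centroids = np.array([points[labels == k].mean() for k in range(len(centroids))])

# squared-error objective J(V): sum of squared distances to the assigned centroid
J = sum(((points[labels == k] - centroids[k]) ** 2).sum() for k in range(len(centroids)))
print(centroids, "objective J =", J)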
• How to solve a problem in Data Science using Machine learning
algorithms?
• Now, let's understand the most common types of problems that occur in data science and the approach to solving them. In data science, problems are solved using algorithms chosen to match the type of question being asked.

Data Science Lifecycle

(Lifecycle phases: Discovery → Data preparation → Model planning → Model building → Operationalize → Communicate results)
The main phases of data science life cycle are given below:
1. Discovery:
• The first phase is discovery, which involves asking the right questions.
• When you start any data science project, you need to determine what are the basic
requirements, priorities, and project budget.
• In this phase, we need to determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this
phase, we need to perform the following tasks:
• Data cleaning
• Data Reduction
• Data integration
• Data transformation
3. Model Planning: In this phase, we need to determine the various methods and techniques to establish the relationships between input variables. We will apply exploratory data analysis (EDA), using various statistical formulas and visualization tools, to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:
• SQL Analysis Services
•R
• SAS
• Python
4. Model-building:
• In this phase, the process of model building starts.
• We will create datasets for training and testing purpose.
• We will apply different techniques such as association, classification, and
clustering, to build the model.
• Following are some common Model building tools:
➢SAS Enterprise Miner
➢WEKA
➢SPSS Modeler
➢MATLAB
5. Operationalize:
• In this phase, we will deliver the final reports of the project, along with briefings,
code, and technical documents.
• This phase provides you with a clear overview of the complete project performance and other components on a small scale before the full deployment.
6. Communicate results:
• In this phase, we check whether we have reached the goal that was set in the initial phase.
• We communicate the findings and final result to the business team.

Applications of Data Science:
• Image recognition and speech recognition:
➢Data science is currently used for image and speech recognition.
➢When you upload an image on Facebook and start getting suggestions to tag your friends, this automatic tagging suggestion uses an image recognition algorithm, which is part of data science.
➢When you say something to "Ok Google", Siri, Cortana, etc., and these devices respond to your voice, this is made possible by speech recognition algorithms.
• Gaming world:
➢In the gaming world, the use of Machine learning algorithms is increasing day by
day. EA Sports, Sony, Nintendo, are widely using data science for enhancing user
experience.
• Internet search:
➢When we want to search for something on the internet, we use different types of search engines such as Google, Yahoo, Bing, Ask, etc.
➢All these search engines use data science technology to make the search experience better, and you can get a search result within a fraction of a second.
• Transport:
➢Transport industries are also using data science technology to create self-driving cars. With self-driving cars, it will be easier to reduce the number of road accidents.
Applications of Data Science:
• Healthcare:
➢ In the healthcare sector, data science is providing lots of benefits.
➢Data science is being used for tumor detection, drug discovery, medical
image analysis, virtual medical bots, etc.
• Recommendation systems:
➢Most of the companies, such as Amazon, Netflix, Google Play, etc., are using
data science technology for making a better user experience with
personalized recommendations.
➢When you search for something on Amazon and start getting suggestions for similar products, this is because of data science technology.
• Risk detection:
➢Finance industries have always had issues of fraud and risk of losses, but with the help of data science this risk can be reduced.
➢Most finance companies are looking for data scientists to help avoid risk and losses while increasing customer satisfaction.

Types of Data Science Job
If you learn data science, then you get the opportunity to find the various exciting
job roles in this domain. The main job roles are given below:
• Data Scientist: A data scientist is a professional who works with an enormous
amount of data to come up with compelling business insights through the
deployment of various tools, techniques, methodologies, algorithms, etc.
• Data Analyst: A data analyst is an individual who mines huge amounts of data, models the data, and looks for patterns, relationships, trends, and so on. At the end of the day, he or she produces visualizations and reports for analyzing the data for decision making and problem solving.
• Machine learning expert: The machine learning expert is the one who works with
various machine learning algorithms used in data science such as regression,
clustering, classification, decision tree, random forest, etc.
• Data engineer: A data engineer works with massive amounts of data and is responsible for building and maintaining the data architecture of a data science project. Data engineers also create the data set processes used in modeling, mining, acquisition, and verification.
• Data Architect
• Data Administrator
• Business Analyst
• Business Intelligence Manager
BIG DATA ECOSYSTEM

• With the rapid evolution of computing technology, it has become very tedious to process and manage huge amounts of information without the use of supercomputers.
• There is an urgent need for companies to deploy special tools and technologies that can be used to store, access, and analyze large amounts of data in near-real time.
• Big Data cannot be stored in a single machine and thus, several machines are
required.
• Common tools that are used to manipulate Big Data are Hadoop,
MapReduce, and BigTable.

Computer Clusters
• A computer cluster is a single logical unit consisting of several computers linked through a fast local area network (LAN).
• The components of a cluster, commonly termed nodes, each run their own instance of an operating system.
• Computer clusters are needed for big data.

Apache Hadoop
• It is an open-source software framework for processing and querying vast amounts of data on large clusters of commodity hardware.
• Hadoop is written in Java and can process huge volumes of structured and unstructured data.
• It is an open-source implementation of Google's MapReduce and is based on the simple MapReduce programming model.
• It provides reliability through replication.
• The Apache Hadoop ecosystem is composed of the Hadoop kernel, MapReduce, HDFS and several other components such as Apache Hive, HBase and ZooKeeper.
Characteristics of Hadoop

• Scalable– New nodes are added without disruption and without any
change on the format of the data.
• Cost effective – Parallel computing is applied across commodity servers using Hadoop. This decreases the cost and makes it affordable to process massive amounts of data.
• Flexible– Hadoop is able to process any type of data from various
sources and deep analysis can be performed.
• Fault tolerant– When a node is damaged, the system is able to
redirect the work to another location to continue the processing
without missing any data.

Hadoop/MapReduce
Computing Paradigm

Large-Scale Data Analytics

• MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems

 Many enterprises are turning to Hadoop
 Especially applications generating big data
 Web applications, social networks, scientific applications
Why Hadoop is able to compete?

Hadoop:
 Scalability (petabytes of data, thousands of machines)
 Flexibility in accepting all data formats (no schema)
 Efficient and simple fault-tolerant mechanism
 Commodity, inexpensive hardware

Database:
 Performance (lots of indexing, tuning, data organization techniques)
 Features: provenance tracking, annotation management, …
What is Hadoop

• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
• Large datasets → Terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
• Hadoop is open-source implementation for Google MapReduce
• Hadoop is based on a simple programming model called MapReduce
• Hadoop is based on a simple data model, any data will fit

Hadoop Master/Slave Architecture

• Hadoop is designed as a master-slave, shared-nothing architecture
• One master node (single node)
• Many slave nodes
Design Principles of Hadoop

• Need to process big data


• Need to parallelize computation across thousands of nodes
• Commodity hardware
• Large number of low-end cheap machines working in parallel to solve a
computing problem
• This is in contrast to Parallel DBs
• Small number of high-end expensive machines

Design Principles of Hadoop

• Automatic parallelization & distribution


• Hidden from the end-user

• Fault tolerance and automatic recovery


• Nodes/tasks will fail and will recover automatically

• Clean and simple programming abstraction


• Users only provide two functions “map” and “reduce”

Who Uses MapReduce/Hadoop

• Google: inventors of the MapReduce computing paradigm
• Yahoo: developed Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others + universities and research labs

Hadoop scalability
◼ Hadoop can reach massive scalability by
exploiting a simple distribution architecture and
coordination model
◼ Huge clusters can be made up using (cheap)
commodity hardware
 A 1000-CPU machine would be much more
expensive than 1000 single-CPU or 250 quad-core
machines
◼ Cluster can easily scale up with little or no
modifications to the programs
Hadoop Components

◼ HDFS: Hadoop Distributed File System


 Abstraction of a file system over a cluster
 Stores large amount of data by transparently
spreading it on different machines
◼ MapReduce
 Simple programming model that enables parallel
execution of data processing programs
 Executes the work on the data near the data
◼ In a nutshell: HDFS places the data on the cluster
and MapReduce does the processing work
Hadoop Principle

• Hadoop is basically a middleware platform that manages a cluster of machines.
• The core component is a distributed file system (HDFS).
• Files in HDFS are split into blocks that are scattered over the cluster.
• The cluster can grow indefinitely simply by adding new nodes.
The MapReduce Paradigm

 Parallel processing paradigm
 The programmer is unaware of the parallelism
 Programs are structured into a two-phase execution: Map and Reduce
  • Map: data elements are classified into categories
  • Reduce: an algorithm is applied to all the elements of the same category
MapReduce and Hadoop

 Hadoop MapReduce is logically placed on top of HDFS
MapReduce and Hadoop

 MR works on (big) files loaded on HDFS
 Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
 Output is written back to HDFS

Scalability principle:
Perform the computation where the data is
Main Properties of HDFS

• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault Tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
• The Namenode is constantly checking the Datanodes
Map-Reduce Execution Engine (Example: Color Count)

• Input blocks on HDFS feed the Map tasks, which produce (k, v) pairs, e.g. (color, 1).
• Shuffle & sorting based on k delivers to each Reduce task a (k, [v]) pair, e.g. (color, [1,1,1,1,1,1,…]).
• Each Reduce task consumes these lists and produces (k', v') pairs, e.g. (color, 100).

Users only provide the "Map" and "Reduce" functions.
Properties of MapReduce Engine (Cont'd)

• The Task Tracker is the slave node (it runs on each datanode)
• It receives its tasks from the Job Tracker
• Runs each task until completion (either a map or a reduce task)
• Is always in communication with the Job Tracker, reporting progress
• In the Color Count example, one map-reduce job consists of 4 map tasks and 3 reduce tasks
Key-Value Pairs

• Mappers and Reducers are users’ code (provided functions)


• Just need to obey the Key-Value pairs interface
• Mappers:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume <key, <list of values>>
• Produce <key, value>
• Shuffling and Sorting:
• Hidden phase between mappers and reducers
• Groups all similar keys from all mappers, sorts and passes them to a certain
reducer in the form of <key, <list of values>>

MapReduce Phases

Deciding on what will be the key and what will be the value ➔ developer’s
responsibility

Example 1: Word Count
• Job: Count the occurrences of each word in a data set

(Figure: the job is split into map tasks followed by reduce tasks.)
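The sketch below is an in-memory simulation of the same word-count job in Python (it does not run on Hadoop, and the tiny documents are invented): the map phase emits (word, 1) pairs, shuffle & sort groups them by key, and the reduce phase sums each group.

# Minimal sketch: the MapReduce word-count pattern simulated in plain Python.
from collections import defaultdict

documents = ["the cat sat", "the dog sat", "the cat ran"]   # stand-ins for input splits

# Map phase: every record becomes a list of (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle & sort: group all values that share a key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: one output pair per key
counts = {word: sum(values) for word, values in sorted(grouped.items())}
print(counts)   # {'cat': 2, 'dog': 1, 'ran': 1, 'sat': 2, 'the': 3}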
Example 2: Color Count
Job: Count the number of each color in a data set

• Input blocks on HDFS feed the map tasks, which produce (color, 1) pairs.
• Shuffle & sorting by color delivers (color, [1,1,1,1,1,1,…]) lists to the reduce tasks.
• Each reduce task produces (color, total) pairs and writes its own output part (Part0001, Part0002, Part0003).
• The output file therefore has 3 parts, probably on 3 different machines.
Example 3: Color Filter
Job: Select only the blue and the green colors

• Each map task selects only the blue or green colors from its input block.
• There is no need for a reduce phase; each map task writes its output directly to HDFS (Part0001 … Part0004).
• The output file therefore has 4 parts, probably on 4 different machines.
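A minimal in-memory sketch of the map-only pattern above (hypothetical colour blocks, not a real Hadoop job): each "map task" filters its own input block and writes its own output part, with no reduce phase needed.

# Minimal sketch: a map-only filter job simulated in plain Python (invented data).
input_blocks = [
    ["red", "blue", "green", "red"],
    ["green", "yellow", "blue"],
    ["blue", "red", "green", "green"],
    ["yellow", "blue"],
]

wanted = {"blue", "green"}
for i, block in enumerate(input_blocks, start=1):
    part = [colour for colour in block if colour in wanted]   # the "map" step
    print(f"Part{i:04d}: {part}")                             # one output part per map task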
HDFS has 2 components –
Namenode and Datanodes
 Master/slave architecture
 HDFS cluster consists of a single Namenode, a master server that
manages the file system namespace and regulates access to files by
clients.
 There are a number of DataNodes usually one per node in a cluster.
 The DataNodes manage storage attached to the nodes that they run on.
 HDFS exposes a file system namespace and allows user data to be stored
in files.
 A file is split into one or more blocks and set of blocks are stored in
DataNodes.
 DataNodes: serves read, write requests, performs block creation,
deletion, and replication upon instruction from Namenode.

How HDFS stores the data

1) A file (e.g., 1.1 GB) is to be stored on HDFS.
2) The file is split into 256 MB blocks: four full 256 MB blocks plus one final 102 MB block.
3) The client asks the NameNode where to put the blocks.
4) The blocks, with their replicas (3 by default), are distributed across the DataNodes (DataNode 1–4).
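A small sketch of the arithmetic in the figure above: splitting a 1.1 GB file into 256 MB blocks and counting the block replicas the cluster must hold with the default replication factor of 3.

# Minimal sketch: block and replica arithmetic for a 1.1 GB file on HDFS.
import math

file_size_mb  = 1.1 * 1024     # ~1126 MB
block_size_mb = 256
replication   = 3              # HDFS default

n_blocks   = math.ceil(file_size_mb / block_size_mb)          # 5 blocks
last_block = file_size_mb - (n_blocks - 1) * block_size_mb    # ~102 MB final block
total_replicas = n_blocks * replication                       # 15 block replicas in total

print(n_blocks, round(last_block), total_replicas)   # 5 102 15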
HDFS Architecture

(Figure: clients perform metadata operations against the Namenode, which holds the metadata – file names, replica counts, block-to-file mapping – and issues block operations and replication instructions to the Datanodes; clients read and write blocks directly to the Datanodes, which are spread across racks.)
File system Namespace
• Hierarchical file system with directories and files
• Create, remove, move, rename etc.
• Namenode maintains the file system
• Any meta information changes to the file system recorded by the
Namenode.
• An application can specify the number of replicas of the file needed:
replication factor of the file. This information is stored in the Namenode.

Data Replication
 HDFS is designed to store very large files across machines in a large
cluster.
 Each file is a sequence of blocks.
 All blocks in the file except the last are of the same size.
 Blocks are replicated for fault tolerance.
 Block size and replicas are configurable per file.
 The Namenode receives a Heartbeat and a BlockReport from each
DataNode in the cluster.
 BlockReport contains all the blocks on a Datanode.

Filesystem Metadata
• The HDFS namespace is stored by Namenode.
• Namenode uses a transaction log called the EditLog to record every change
that occurs to the filesystem meta data.
• For example, creating a new file.
• Change replication factor of a file
• EditLog is stored in the Namenode’s local filesystem
• Entire filesystem namespace including mapping of blocks to files and file
system properties is stored in a file FsImage. Stored in Namenode’s local
filesystem.

Datanode
 A Datanode stores data in files in its local file system.
 Datanode has no knowledge about HDFS filesystem
 It stores each block of HDFS data in a separate file.
 Datanode does not create all files in the same directory.
 It uses heuristics to determine optimal number of files per directory and
creates directories appropriately:
 Research issue?
 When the filesystem starts up, it generates a list of all HDFS blocks and sends this report to the Namenode: the Blockreport.

YARN(“Yet Another Resource Negotiator”)
• Hadoop YARN is the resource management layer of Hadoop that coordinates the various processing tools.
Why YARN?
• MapReduce Version 1 performed both processing and resource management functions.
• It consisted of a Job Tracker which was the single master. The Job Tracker allocated the
resources, performed scheduling and monitored the processing jobs.
• It assigned map and reduce tasks on a number of subordinate processes called the Task
Trackers. The Task Trackers periodically reported their progress to the Job Tracker.

This design resulted in scalability bottleneck due to a single Job Tracker.


• The practical limits of such a design are reached with a cluster of 5000
nodes and 40,000 tasks running concurrently.
• Apart from this limitation, the utilization of computational resources is
inefficient in MRV1.
• To overcome all these issues, YARN was introduced in Hadoop version 2.0
• The basic idea behind YARN is to relieve MapReduce by taking over the
responsibility of Resource Management and Job Scheduling.
• YARN started to give Hadoop the ability to run non-MapReduce jobs within
the Hadoop framework.

Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all your processing activities by allocating resources and scheduling tasks. The Apache Hadoop YARN architecture consists of the following main components:

1. Resource Manager: runs on a master daemon and manages the resource allocation in the cluster.

2. Node Manager: runs on the slave daemons and is responsible for the execution of tasks on every single Data Node.

3. Application Master: manages the user job lifecycle and resource needs of individual applications. It works along with the Node Manager and monitors the execution of tasks.

4. Container: a package of resources including RAM, CPU, network, HDD, etc. on a single node.
Components of YARN
You can consider YARN as the brain of your Hadoop ecosystem. (The YARN architecture diagram shows a Resource Manager coordinating Node Managers, Application Masters, and Containers across the cluster.)
The first component of YARN Architecture is,
1. Resource Manager
➢It is the ultimate authority in resource allocation.
➢On receiving the processing requests, it passes parts of requests to
corresponding node managers accordingly, where the actual processing takes
place.
➢It is the arbitrator of the cluster resources and decides the allocation of the
available resources for competing applications.
➢Optimizes cluster utilization, i.e., keeps all resources in use all the time, subject to various constraints such as capacity guarantees, fairness, and SLAs.
➢It has two major components: a) Scheduler b) Application Manager

a) Scheduler

• The scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc.
• It is called a pure scheduler in Resource Manager, which means that it does
not perform any monitoring or tracking of status for the applications.
• If there is an application failure or hardware failure, the Scheduler does not
guarantee to restart the failed tasks.
• Performs scheduling based on the resource requirements of the
applications.
• It has a pluggable policy plug-in, which is responsible for partitioning the
cluster resources among the various applications.

b) Application Manager

• It is responsible for accepting job submissions.


• Negotiates the first container from the Resource Manager for executing
the application specific Application Master.
• Manages running the Application Masters in a cluster and provides service
for restarting the Application Master container on failure.

The second component which is :
2. Node Manager

• It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.
• It registers with the Resource Manager and sends heartbeats with the
health status of the node.
• Its primary goal is to manage application containers assigned to it by the
resource manager.
• It keeps up-to-date with the Resource Manager.
• Application Master requests the assigned container from the Node
Manager by sending it a Container Launch Context(CLC) which includes
everything the application needs in order to run. The Node Manager
creates the requested container process and starts it.
• Monitors resource usage (memory, CPU) of individual containers.
• Performs Log management.
• It also kills the container as directed by the Resource Manager.
The third component of Apache Hadoop YARN is,
3. Application Master
• An application is a single job submitted to the framework. Each such
application has a unique Application Master associated with it which is a
framework specific entity.
• It is the process that coordinates an application’s execution in the cluster and
also manages faults.
• Its task is to negotiate resources from the Resource Manager and work with
the Node Manager to execute and monitor the component tasks.
• It is responsible for negotiating appropriate resource containers from the
Resource Manager, tracking their status and monitoring progress.
• Once started, it periodically sends heartbeats to the Resource Manager to
affirm its health and to update the record of its resource demands.

The fourth component is:
4.Container
• It is a collection of physical resources such as RAM, CPU cores, and disks
on a single node.
• YARN containers are managed by a Container Launch Context (CLC), which describes the container life-cycle. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payload for Node Manager services, and the command necessary to create the process.
• It grants rights to an application to use a specific amount of
resources (memory, CPU etc.) on a specific host.

Steps involved in application submission of Hadoop YARN:

1.Client submits an application


2.Resource Manager allocates a container to start Application Manager
3.Application Manager registers with Resource Manager
4.Application Manager asks containers from Resource Manager
5.Application Manager notifies Node Manager to launch containers
6.Application code is executed in the container
7.Client contacts Resource Manager/Application Manager to monitor application’s status
8.Application Manager unregisters with Resource Manager
MapReduce – the first data processor on Hadoop
• It is the first batch data processing framework for Hadoop
• A programming model for parallel processing of distributed data
• Executes the user's Java code in parallel
• 2 stages: Map and Reduce
• Optimized for local data access

MapReduce

YARN
Cluster resource manager

HDFS
Hadoop Distributed File System
Data processing with MapReduce

• Each node (Node 1 … Node X) holds a data slice and runs its own data processor.
• Mapping: extraction, filtering and transformation are applied locally to each data slice.
• Data shuffling moves the mapped output to the data collectors.
• Reducing: grouping, aggregating and dismissing are applied by the data collectors to produce the result.
Example: The famous “word counting”
Demo

• The problem
  • Q: "What happens after two rainy days in the Geneva region?"
• The goal
  • Determine good or bad conditions with MapReduce
• Solution
  • Build a histogram over the days of the week (Mon–Sun), counting days preceded by 2 or more bad-weather days, based on meteo data for GVA
Demo

• The source data (http://rp5.co.uk)
  • Source: last 5 years of weather data taken at GVA airport
  • CSV format, e.g.:

"Local time in Geneva (airport)";"T";"P0";"P";"U";"DD";"Ff";"ff10";"WW";"W'W'";"c";"VV";"Td";
"06.06.2015 00:50";"18.0";"730.4";"767.3";"100";"variable wind direction";"2";"";"";"";"No Significant Clouds";"10.0 and more";"18.0";
"06.06.2015 00:20";"18.0";"730.4";"767.3";"94";"variable wind direction";"1";"";"";"";"Few clouds (10-30%) 300 m, scattered clouds (40-50%) 3300 m";"10.0 and more";"17.0";
… (further rows omitted)

• What is a bad weather day?
  • Weather anomalies between 8am and 10pm
Demo – MapReduce flow

• Input data: a weather report every hour, as records of date/time and temperature (e.g. "05.06.2015 22:50";"19.0";…).
• 1st MR job – reduced data: one record per date with a counter, identifying dates with good weather preceded by two or more days with bad weather (e.g. 2016.09.20 6, 2016.09.26 5, 2016.09.30 3, 2016.10.10 2, 2016.10.12 1, 2016.10.15 2, 2016.10.20 4, 2016.10.27 4, …).
• 2nd MR job – reduced data: each day of the week with a counter of occurrences:
  Monday 32 | Tuesday 0 | Wednesday 3 | Thursday 10 | Friday 20 | Saturday 23 | Sunday 25
• Code: https://gitlab.cern.ch/db/hadoop-intro/tree/master/MapReduce
Apache Pig
• Apache Pig is a scripting platform for processing and analyzing large data sets.
• Apache Pig is used with Hadoop: the usual data manipulation operations in Hadoop can all be performed through Pig.
• Apache Pig lets Hadoop users express complex MapReduce transformations in a simple scripting language called Pig Latin.
• The Pig Latin language provides a large number of operators, and programmers can also develop their own functions for reading, writing, and processing data.
• Pig is an abstraction over MapReduce: every script is internally converted into Map and Reduce tasks. The component that does this, the Pig Engine, accepts Pig Latin scripts as input and converts them into MapReduce jobs.

Why Do We Need Apache Pig?
• Not every programmer is comfortable with Java, and since Java is the native platform for Hadoop, such programmers can struggle to work with Hadoop directly.
• Programmers who do not know Java, or who find Hadoop's Java API difficult, can use Apache Pig instead; Pig is a benefit for all programmers.
• With Pig Latin, programmers can perform MapReduce tasks easily without having to type complex lines of Java code.
• Apache Pig uses a multi-query approach, which greatly reduces code length. An operation that would require around 200 lines of code (LoC) in Java can often be written in about 10 LoC in Apache Pig, reducing development time by almost 16 times compared with development in Java (a sketch of such a script follows this list).
• Pig Latin is an SQL-like language, so programmers who are already familiar with SQL find it easy to learn.
• Apache Pig provides many built-in operators to support data operations like joins, filters, and ordering, and it supports nested data types such as tuples, bags, and maps that plain MapReduce lacks.
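As a rough illustration of this multi-query style (the file name, delimiter, and field names below are assumptions, not taken from any specific dataset), a filter-group-count task fits in a handful of Pig Latin lines:

```pig
-- Hypothetical Pig Latin sketch: filter, group and count in a few lines,
-- where the equivalent hand-written Java MapReduce would be much longer.
weather = LOAD 'weather.csv' USING PigStorage(';')
          AS (ts:chararray, temp:float, pressure:float);
hot     = FILTER weather BY temp > 25.0;
by_day  = GROUP hot BY SUBSTRING(ts, 1, 11);            -- key = the date part
counts  = FOREACH by_day GENERATE group AS day, COUNT(hot) AS n;
STORE counts INTO 'hot_days_by_date';
```

Written directly against the Java MapReduce API, the same pipeline would need a mapper, a reducer, and a driver class.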
Features of Pig
• Rich set of operators − Pig provides a rich set of operators to perform operations like join, sort, and filter.
• Ease of programming − Pig Latin is similar to SQL, so programmers who are good at SQL find it easy to write Pig scripts.
• Optimization opportunities − Apache Pig tasks optimize their execution automatically, so programmers need to focus only on the semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read, write, and process data.
• UDFs (User Defined Functions) − Apache Pig lets users write User Defined Functions in other programming languages such as Java and invoke them from Pig scripts (see the sketch after this list).
• Handles all kinds of data − Apache Pig analyzes all kinds of data, structured as well as unstructured, and stores the results in HDFS (Hadoop Distributed File System).
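For example, a minimal, hypothetical Java UDF (not one from the course material) could look like this:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical Pig UDF in Java: upper-cases a chararray field.
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                        // nothing to convert
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

The compiled class is packaged into a jar, registered in the script with REGISTER, and then invoked like any built-in function inside a FOREACH ... GENERATE statement.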
Apache Pig vs. MapReduce
• Apache Pig is a data flow language; MapReduce is a data processing paradigm.
• Apache Pig is a high-level language; MapReduce is low level and rigid.
• Performing a Join operation in Apache Pig is pretty simple; in MapReduce it is quite difficult to perform a Join between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig; exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, reducing the length of the code to a great extent; MapReduce requires almost 20 times more lines of code to perform the same task.
• Apache Pig needs no separate compilation: on execution, every Pig operator is converted internally into a MapReduce job. MapReduce jobs, by contrast, have a long compilation process.
Architecture of Hive

Hive chiefly consists of three core parts:
• Hive Clients: Hive offers a variety of drivers for communicating with different applications. For example, it provides Thrift clients (a cross-language service) for Thrift-based applications, as well as JDBC and ODBC drivers. These clients and drivers communicate with the Hive server, which falls under Hive Services (a minimal JDBC client sketch follows this list).
• Hive Services: Hive Services handle client interactions with Hive. For example, if a client wants to run a query, it must talk to Hive Services.
• Hive Storage and Computing: Hive services such as the file system, job client, and metastore communicate with Hive storage, which holds things like table metadata and query results.
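A minimal, hypothetical JDBC client sketch (host, port, credentials, and the query are placeholders; it assumes a running HiveServer2 and the hive-jdbc driver on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical Hive client: connect over JDBC and list the tables.
public class HiveClientDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 listens on port 10000 by default.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));   // print each table name
            }
        }
    }
}
```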

Hive's Features
• Hive is designed for querying and managing only structured data stored in tables.
• Hive is scalable, fast, and uses familiar concepts.
• The schema is stored in a database (the metastore), while the processed data goes into the Hadoop Distributed File System (HDFS).
• Tables and databases are created first; data is then loaded into the proper tables.
• Hive supports four file formats: ORC (Optimized Row Columnar, which stores rows in a columnar layout), SEQUENCEFILE, RCFILE (Record Columnar File), and TEXTFILE.
• Hive uses an SQL-inspired language, sparing the user from the complexity of MapReduce programming. It makes learning more accessible by relying on familiar relational-database concepts such as columns, tables, rows, and schemas.
• The most significant difference between the Hive Query Language (HQL) and SQL is that Hive executes queries on Hadoop's infrastructure instead of on a traditional database.
• Since Hadoop works on flat files, Hive uses directory structures to "partition" data, improving performance for specific queries.
• Hive supports partitions and buckets for fast and simple data retrieval (see the HiveQL sketch after this list).
• Hive supports custom user-defined functions (UDFs) for tasks like data cleansing and filtering; Hive UDFs can be defined according to programmers' requirements.
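As a short illustration of partitioning and HQL's SQL-like syntax (table, column, and path names below are hypothetical, not from the course material):

```sql
-- Hypothetical HiveQL sketch: a partitioned table stored as ORC.
CREATE TABLE weather (
  ts       STRING,
  temp     DOUBLE,
  pressure DOUBLE
)
PARTITIONED BY (obs_date STRING)
STORED AS ORC;

-- Load a file into one partition (one HDFS directory per obs_date value) ...
LOAD DATA INPATH '/data/weather/2015-06-05' INTO TABLE weather
  PARTITION (obs_date = '2015-06-05');

-- ... so a query that filters on the partition column reads only that directory.
SELECT obs_date, AVG(temp) AS avg_temp
FROM weather
WHERE obs_date = '2015-06-05'
GROUP BY obs_date;
```

Because each partition maps to its own HDFS directory, the WHERE clause on the partition column lets Hive skip all other directories.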
Limitations of Hive
• Hive doesn't support OLTP: it supports Online Analytical Processing (OLAP), but not Online Transaction Processing (OLTP).
• It has only limited support for subqueries.
• It has a high latency.
• By default, Hive tables don't support delete or update operations.

Hive vs. Relational Databases
• Relational database: maintains a database. Hive: maintains a data warehouse.
• Relational database: fixed schema. Hive: varied schema.
• Relational database: sparse tables. Hive: dense tables.
• Relational database: doesn't support partitioning. Hive: supports automatic partitioning.
• Relational database: stores normalized data. Hive: stores both normalized and denormalized data.
• Relational database: uses SQL (Structured Query Language). Hive: uses HQL (Hive Query Language).
Bigger Picture: Hadoop vs. Other Systems
• Computing Model − Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties and concurrency control. Hadoop: notion of jobs; the job is the unit of work; no concurrency control.
• Data Model − Distributed databases: structured data with a known schema; read/write mode. Hadoop: any data fits, in any format (unstructured or semi-structured); read-only mode.
• Cost Model − Distributed databases: expensive servers. Hadoop: cheap commodity machines.
• Fault Tolerance − Distributed databases: failures are rare; recovery mechanisms. Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance.
• Key Characteristics − Distributed databases: efficiency, optimizations, fine-tuning. Hadoop: scalability, flexibility, fault tolerance.
• Cloud Computing
• A computing model where any computing infrastructure can
run on the cloud
• Hardware & Software are provided as remote services
• Elastic: grows and shrinks based on the user’s demand
• Example: Amazon EC2
Hadoop pros & cons

◼ Good for
 - Repetitive tasks on big-size data

◼ Not good for
 - Replacing an RDBMS
 - Complex processing requiring various phases and/or iterations
 - Processing small to medium size data