
Established as per the Section 2(f) of the UGC Act, 1956

Approved by AICTE, COA and BCI, New Delhi

UNIT- II

Analytics and Big Data


Dr. S.Senthil
Professor and Director
School of Computer Science and Applications
REVA University
DATA SCIENCE

• Science of extracting knowledge from data.


• Science of drawing out hidden patterns amongst data using
statistical and mathematical techniques.
• Employs techniques and theories drawn from many fields within the
broad areas of mathematics, statistics and information technology,
including machine learning, data engineering, probability models,
statistical learning, pattern recognition, etc.
DATA SCIENCE

A blend of various tools, algorithms, and machine learning

principles with the goal of discovering hidden patterns in raw data.
BIG DATA ANALYTICS IN PRACTICE

• Etihad Airways uses big data techniques to harvest and analyze
gigabytes of data generated by thousands of sensors working
inside its planes.
• This allows it to monitor planes in real time, reduce fuel costs,
manage plane maintenance and even spot problems before they
happen (predicting when parts will fail).
BIG DATA ANALYTICS IN PRACTICE

❖ Many people use Facebook to update their status, share photos and like
content.

• The Obama presidential campaign used all that data on the network not
just to find voters, but to assemble an army of volunteers.
BIG DATA ANALYTICS IN PRACTICE

• One of the highest rated TV shows (Satyamev Jayate) aggregates and
analyses the millions of messages it receives from viewers on
controversial issues like female feticide, caste discrimination and
child abuse – and uses that data for political change.
BIG DATA ANALYTICS IN PRACTICE

❖ Using Big Data to win elections


• India’s current ruling party, the BJP, used Big Data analytics effectively
in the elections.

• The BJP mined data from almost every Internet user in the country, and
used this data to accurately understand voter sentiments and local issues.

• The targeting was done not on national issues, but on local issues, which
were considered far more important.
BIG DATA ANALYTICS IN PRACTICE

❖ Big Data for finding a perfect match


• The online portal Matrimony.com, which is in the business of matchmaking, adds over
8,000 subscribers on a daily basis.

• Today, by capturing customer data from multiple channels including e-mails,
SMSes, banner ads (across the website), telesales and their retail centers, the
online firm uses IBM’s technology to gain insights and leverage them to drive
personalized marketing campaigns that match potential partners faster and attract
more subscribers.
BIG DATA ANALYTICS IN PRACTICE
❖ Big Data for detecting water leakages
• The Bangalore Water Supply and Sewerage Board (BWSSB) is using Big Data and
predictive analytics technology from IBM to create systems for monitoring water
distribution systems.

• Bangalore’s massive population growth (from 5.4 million in 2000 to over 10 million)
has put tremendous strain on the city’s water supply and distribution systems.

• In partnership with IBM, the BWSSB has built an operational dashboard, which
serves as a “command center” for managing the city’s water supply networks.

• Around 45% of the water supplied by the BWSSB goes unaccounted for.

• Implementing this solution will help minimize unaccounted-for water by detecting large
changes in water flow through real-time monitoring.
BIG DATA ANALYTICS IN PRACTICE
❖ Using Big Data to predict ticket confirmations for trains
• Using the power of Big Data, ixigo has launched a PNR prediction feature for train
travelers.

• For any given train’s wait-listed status, ixigo is now able to show a near-accurate
probability that the ticket will be confirmed, so that travelers may decide whether
or not to book a wait-listed ticket.

• The PNR prediction feature also shows the probability of getting your ticket confirmed
if already booked, and solves a huge pain point for millions of daily train travelers.

• The company claims that its app gives far more accurate PNR predictions than all
existing PNR prediction services, since ixigo has mined data from over 10 million
PNRs over the last two years.

• The company claims an accuracy rate of 90% and hopes to raise it to 95% over a
period of time.
HOW IS IT DIFFERENT?

• A Data Analyst usually explains what is going on by processing the history of the
data.

• A Data Scientist not only does exploratory analysis to discover insights
from the data, but also uses various advanced machine learning algorithms
to identify the occurrence of a particular event in the future.

• A Data Scientist will look at the data from many angles, sometimes angles
not known earlier.
HOW IS IT DIFFERENT?
COMPONENTS OF DATA SCIENCE
COMPONENTS OF DATA SCIENCE

❖ Three Components (OPD of Data)


• Organizing
• Packaging and
• Delivering data.
COMPONENTS OF DATA SCIENCE

❖ Organizing the data:

• Involves the physical storage and format of data and incorporates best practices in
data management.

❖ Packaging the data:

• Involves logically manipulating and joining the underlying raw data into a new
representation and package.

❖ Delivering the data:

• Involves ensuring that the message the data carries reaches those who need to
hear it.
COMPONENTS OF DATA SCIENCE

❖ Plus, at all steps have answers to these questions.


• What is being created?
• How will it be created?
• Who will be involved in creating it?
• Why is it to be created?
BI VERSUS DATA SCIENCE
BI VERSUS DATA SCIENCE
BI ENGAGEMENT PROCESS
BI ENGAGEMENT PROCESS

❖ Step 1: Build the Data Model.


• The process starts by building the underlying data model.

• Whether it’s a data warehouse or data mart or hub-and-spoke approach, or whether


it’s a star schema, snowflake schema, or third normal form schema, the BI Analyst
must go through a formal requirements gathering process with the business users to
identify all (or at least the vast majority of) the questions that the business users want
to answer.

• In this requirements gathering process, the BI analyst must identify the first and
second level questions the business users want to address in order to build a robust
and scalable data warehouse.
BI ENGAGEMENT PROCESS

• 1st level question: How many patients did we treat last month?

• 2nd level question: How did that compare to the previous month?

• 2nd level question: What were the major DRG (Diagnosis Related Groups) types
treated?

• 1st level question: How many patients came through ER last night?

• 2nd level question: How did that compare to the previous night?

• 2nd level question: What were the top admission reasons?


BI ENGAGEMENT PROCESS

• 1st level question: What percentage of beds was used at Hospital X last week?

• 2nd level question: What is the trend of bed utilization over the past year?

• 2nd level question: What departments had the largest increase in bed utilization?

• The BI Analyst then works closely with the data warehouse team to define and build the
underlying data models that support the questions being asked.

• The data warehouse follows a “schema-on-load” approach because the data schema must be
defined and built prior to loading data into the data warehouse. Without an underlying data
model, the BI tools will not work.
BI ENGAGEMENT PROCESS

❖ Step 2: Define The Report.


• Once the analytic requirements have been transcribed into a data
model, then step 2 of the process is where the BI Analyst uses a
Business Intelligence (BI) product – SAP Business Objects,
MicroStrategy, Cognos, Qlikview, Pentaho, etc. – to create the SQL-
based query for the desired questions.
BI ENGAGEMENT PROCESS
BI ENGAGEMENT PROCESS

• The BI Analyst will use the BI tool’s graphical user interface (GUI) to create the
SQL query by selecting the measures and dimensions; selecting page, column
and row descriptors; specifying constraints, subtotals and totals; creating
special calculations (mean, moving average, rank, share of); and selecting sort
criteria.

• The BI GUI hides much of the complexity of creating the SQL query.
BI ENGAGEMENT PROCESS

❖ Step 3: Generate SQL commands.


• Once the BI Analyst or the business user has defined the desired report or
query request, the BI tool then creates the SQL commands.

• In some cases, the BI Analyst will modify the SQL commands generated by the
BI tool to include unique SQL commands that may not be supported by the BI
tool.
BI ENGAGEMENT PROCESS

❖ Step 4: Create Report.


• In step 4, the BI tool issues the SQL commands against the data
warehouse and creates the corresponding report or dashboard
widget.

• This is a highly iterative process, where the Business Analyst will tweak
the SQL (either using the GUI or hand-coding the SQL statement) to
fine-tune the SQL request.
BI ENGAGEMENT PROCESS

• The BI Analyst can also specify graphical rendering options (bar


charts, line charts, pie charts) until they get the exact report
and/or graphic that they want.
BI ENGAGEMENT PROCESS

• The BI approach leans heavily on the pre-built data warehouse (schema-on-


load), which enables users to quickly, and easily ask further questions – as
long as the data that they need is already in the data warehouse.

• If the data is not in the data warehouse, then adding data to an existing
warehouse (and creating all the supporting ETL processes) can take months
to make happen.
THE DATA SCIENCE ENGAGEMENT PROCESS
THE DATA SCIENCE ENGAGEMENT PROCESS

❖ Step 1: Define Hypothesis To Test.


• Starts with the Data Scientist identifying the prediction or hypothesis
that they want to test.

• Again, this is a result of collaborating with the business users to


understand the key sources of business differentiation (e.g., how the
organization delivers value) and then brainstorming data and variables
that might yield better predictors of performance.
THE DATA SCIENCE ENGAGEMENT PROCESS

A Vision workshop process can add considerable value in driving the


collaboration between the business users and the data scientists to identify data
sources that may help improve predictive value
THE DATA SCIENCE ENGAGEMENT PROCESS

❖ Step 2: Gather Data.


• Data scientist gathers relevant and/or interesting data from a multitude of
sources – ideally both internal and external to the organization.

• The data lake (A data lake is a centralized repository that allows you to store all
your structured and unstructured data at any scale) is a great approach for this
process, as the data scientist can grab any data they want, test it, ascertain its
value given the hypothesis or prediction, and then decide whether to include that
data in the predictive model or throw it away.
THE DATA SCIENCE ENGAGEMENT PROCESS

❖ Step 3: Build Data Model.


• Data scientist defines and builds the schema necessary to address the
hypothesis being tested.

• The data scientist can’t define the schema until they know the hypothesis that
they are testing and know what data sources they are going to be using to build
their analytic models.
THE DATA SCIENCE ENGAGEMENT PROCESS

❖ The “schema on query” process is notably different from the traditional data

warehouse “schema on load” process.
• The data scientist doesn’t spend months integrating all the different data sources
together into a formal data model first.

• Instead, the data scientist will define the schema as needed based upon the data that is
being used in the analysis.

• The data scientist will likely iterate through several different versions of the schema
until finding a schema (and analytic model) that sufficiently answers the hypothesis
being tested.
THE DATA SCIENCE ENGAGEMENT PROCESS

❖ Step 4: Explore The Data.


• Leverages the outstanding data visualization tools to uncover correlations and
outliers of interest in the data.

• Data visualization tools like Tableau, Spotfire, Domo and DataRPM are great data
scientist tools for exploring the data and identifying variables that they might
want to test.
THE DATA SCIENCE ENGAGEMENT PROCESS
THE DATA SCIENCE ENGAGEMENT PROCESS

❖ Step 5: Build and Refine Analytic Models.


• Here the real data science work begins – where the data scientist starts
using tools like SAS, SAS Miner, R, Mahout, MADlib, and Alpine Miner to
build analytic models.

• At this point, the data scientist will explore different analytic techniques
and algorithms to try to create the most predictive models.
THE DATA SCIENCE ENGAGEMENT PROCESS

❖ Step 6: Ascertain Goodness of Fit.


• Data scientist will try to ascertain the model’s goodness of fit.

• The goodness of fit of a statistical model describes how well the model fits a set of
observations.

• A number of different analytic techniques will be used to determine the goodness of fit
including Kolmogorov–Smirnov test, Pearson's chi-squared test, ANalysis Of VAriance
(ANOVA) and confusion (or error) matrix.
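
A minimal sketch of how these goodness-of-fit checks might look in Python (assuming SciPy and scikit-learn are available; the data below is synthetic and purely illustrative):

import numpy as np
from scipy import stats
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)

# Kolmogorov–Smirnov test: do the model residuals look normally distributed?
residuals = rng.normal(loc=0.0, scale=1.0, size=500)
ks_stat, ks_p = stats.kstest(residuals, "norm")

# Pearson's chi-squared test: do observed category counts match the expected counts?
chi2_stat, chi2_p = stats.chisquare(f_obs=[48, 52, 100], f_exp=[50, 50, 100])

# ANOVA: do several groups share the same mean?
g1, g2, g3 = rng.normal(0, 1, 50), rng.normal(0.1, 1, 50), rng.normal(0, 1, 50)
f_stat, anova_p = stats.f_oneway(g1, g2, g3)

# Confusion (error) matrix for a classifier's predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))

print(ks_p, chi2_p, anova_p)   # large p-values give no evidence of poor fit
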
THE DATA SCIENCE ENGAGEMENT PROCESS

❖ In the BI process,
• the schema must be built first and must be built to support a wide variety of
questions across a wide range of business functions.

• So the data model must be extensible and scalable which means that it is heavily
engineered.

• Think production quality.


THE DATA SCIENCE ENGAGEMENT PROCESS

In the data science process,

the schema is built only to support the hypothesis being tested, so the data model
can be built more quickly and with less overhead.

Think ad hoc quality.


THE DATA SCIENCE ENGAGEMENT PROCESS

• The data science process is highly collaborative; the more subject matter
experts involved in the process, the better the resulting model.

• And maybe even more importantly, involvement of the business users

throughout the process ensures that the data scientists focus on
uncovering analytic insights that pass the S.A.M. test.
• Strategic (to the business),

• Actionable (insights that the organization can actually act on), and

• Material (where the value of acting on the insights is greater than the cost of acting on
the insights).
DATA SCIENTIST

“A data scientist is someone who is better at statistics than any


software engineer and better at software engineering than any
statistician.”
DATA SCIENTIST

• The term "Data Science" (originally used interchangeably with "datalogy") was
initially used as a substitute for computer science by Peter Naur in 1960.

• In 2008, DJ Patil and Jeff Hammerbacher coined the term "Data Scientist" to
define their jobs at Linkedin and Facebook, respectively.
ROLE OF A DATA SCIENTIST
ROLE OF A DATA SCIENTIST
❖ Business Acumen Skills
• A Data Scientist should have the prowess to handle the pressures of the
business.
• List of traits that need to be honed to play the role of a data scientist
✓ Understanding of domain
✓ Business strategy
✓ Problem solving
✓ Communication
✓ Presentation
✓ Inquisitiveness
ROLE OF A DATA SCIENTIST

❖ Technology Expertise
• Good database knowledge such as RDBMS
• Good NoSQL database knowledge such as MongoDB, Cassandra, Hbase, etc
• Programming languages such as Java, Python, C++, etc.
• Open source tools such as Hadoop.
• Data Warehousing
• Data Mining
• Visualization such as Tableau, Flare, Google Visualization APIs, etc
ROLE OF A DATA SCIENTIST
❖ Mathematics Expertise
• The job of a data scientist requires him/her to comprehend data, interpret it, make sense
of it and analyse it; he/she will also have to dabble in learning algorithms.
• Mathematics
• Statistics
• Artificial Intelligence (AI)
• Algorithms
• Machine learning
• Pattern recognition
• Natural Language Processing
ROLE OF A DATA SCIENTIST
• Identifying the data-analytics problems that offer the greatest opportunities to the
organization.
• Determining the correct data sets and variables.
• Collecting large sets of structured and unstructured data from disparate sources.
• Cleaning and validating the data to ensure accuracy, completeness, and uniformity.
• Devising and applying models and algorithms to mine the stores of big data.
• Analyzing the data to identify patterns and trends.
• Interpreting the data to discover solutions and opportunities.
• Communicating findings to stakeholders using visualization and other means.
CLASSIFICATION OF ANALYTICS

❖ First school of thought


• Basic, Operationalized, Advanced and Monetized
❖ Second school of thought
• Analytics 1.0, 2.0 and 3.0
FIRST SCHOOL OF THOUGHT

❖ Basic Analytics
• Slicing and dicing of data to help with basic business insights.
• Reporting on historical data, basic visualization, etc.,
❖ Operationalized analytics
• Gets woven into enterprise’s business process
FIRST SCHOOL OF THOUGHT

❖ Advanced analytics
• Forecasting for the future by way of predictive and prescriptive modeling.
❖ Monetized analytics
• To derive direct business revenue.
SECOND SCHOOL OF THOUGHT

❖ Analytics 1.0
• Mid 1950’s to 2009
• Descriptive statistics (and Diagnostic)
• Report on events, occurrences, etc of the past.
• What happened?
• Why did it happen?
SECOND SCHOOL OF THOUGHT

❖ Analytics 2.0
• 2005 to 2012
• Descriptive statistics + Predictive statistics
• Use data from the past to make predictions for the future
• What will happen?
• Why will it happen?
SECOND SCHOOL OF THOUGHT

❖ Analytics 3.0
• 2012 to present
• Descriptive + Predictive + Prescriptive statistics
• Use data from the past to make prophecies for the future and at the same time make
recommendations to leverage the situation to one’s advantage.
• What will happen?
• When will it happen?
• Why will it happen?
• What should be the action taken to take advantage of what will happen?
ANALYTICS 1.0, 2.0, 3.0

❖ Descriptive Analytics
• which use data aggregation and data mining to provide insight into the
past and answer: “What has happened?”
• Insight into the past
• Use Descriptive Analytics when you need to understand at an aggregate
level what is going on in your company, and when you want to
summarize and describe different aspects of your business.
ANALYTICS 1.0, 2.0, 3.0
❖ Descriptive analytics
• Describing or summarizing the existing data using existing business intelligence tools
to better understand what is going on or what has happened.
• simplest form of analytics
• purpose of this analytics type is just to summarize the findings and understand what is
going on.
• It is said that 80% of business analytics mainly involves descriptions based on
aggregations of past performance.
• The tools used in this phase are MS Excel, MATLAB, SPSS, STATA, etc.
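
The same kind of summarization can be sketched in Python with pandas (the data and column names below are made up, purely for illustration):

import pandas as pd

# Hypothetical transactions
df = pd.DataFrame({
    "region":  ["North", "South", "North", "South", "East", "East"],
    "month":   ["Jan", "Jan", "Feb", "Feb", "Jan", "Feb"],
    "revenue": [120.0, 95.0, 130.0, 90.0, 110.0, 115.0],
})

# Summarize and describe what has happened
print(df["revenue"].describe())                 # count, mean, std, min, quartiles, max
print(df.groupby("region")["revenue"].sum())    # aggregation by region

# Slicing and dicing: revenue by region and month
print(df.pivot_table(values="revenue", index="region",
                     columns="month", aggfunc="sum"))
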
ANALYTICS 1.0, 2.0, 3.0

❖ Diagnostic Analysis
• Focus on past performance to determine what happened and why.
• Diagnostic analytics is used to determine why something happened in the past.
• It is characterized by techniques such as drill-down, data discovery, data mining and
correlations.
• Diagnostic analytics takes a deeper look at data to understand the root causes of the
events.
• has a limited ability to give actionable insights.
• It just provides an understanding of causal relationships and sequences while looking
backward.
ANALYTICS 1.0, 2.0, 3.0

❖ Predictive Analytics
• which use statistical models and forecasting techniques to understand
the future and answer: “What could happen?”
• Understanding the future
• Use Predictive Analytics any time you need to know something about the
future, or fill in the information that you do not have.
ANALYTICS 1.0, 2.0, 3.0

❖ Predictive Analytics
• Emphasizes on predicting the possible outcome using statistical models and machine
learning techniques.
• It is important to note that it cannot say for certain whether an event will occur in the
future; it merely forecasts the probability of the event occurring.
• A predictive model builds on the preliminary descriptive analytics stage to derive the
possibility of the outcomes.
ANALYTICS 1.0, 2.0, 3.0

• The essence of predictive analytics is to devise models from which the
existing data is understood well enough to extrapolate future occurrences
or, simply, to predict future data.
• The most popular tools for predictive analytics include Python, R,
RapidMiner, etc.
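
A minimal illustrative sketch in Python (scikit-learn, with synthetic data standing in for historical records): the model does not state whether the event will occur, it estimates the probability of the event for each new observation.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic history: two explanatory variables and a binary outcome (event / no event)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Probability of the event for each new observation, not a yes/no prophecy
event_probability = model.predict_proba(X_test)[:, 1]
print(event_probability[:5])
print("hold-out accuracy:", model.score(X_test, y_test))
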
ANALYTICS 1.0, 2.0, 3.0

❖ Prescriptive Analytics
• which use optimization and simulation algorithms to advise on possible
outcomes and answer: “What should we do?”
• Advise on possible outcomes
• Use Prescriptive Analytics any time you need to provide users with
advice on what action to take.
ANALYTICS 1.0, 2.0, 3.0

❖ Prescriptive analytics
• It is a type of predictive analytics that is used to recommend one or more
courses of action based on analysis of the data.
• It can suggest all favorable outcomes according to a specified course of
action, and also suggest various courses of action to get to a particular
outcome.
• Hence, it uses a strong feedback system that constantly learns and updates
the relationship between the action and the outcome.
• Recommendation engines also use prescriptive analytics.
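
A toy sketch of the prescriptive step (the actions, probabilities, payoffs and costs below are invented): combine the predicted outcome of each candidate action with the cost of taking it, then recommend the action with the best expected value.

# action: (predicted probability of success, payoff if successful, cost of action)
actions = {
    "send_discount_email": (0.08, 40.0, 1.0),
    "call_customer":       (0.20, 40.0, 8.0),
    "do_nothing":          (0.02, 40.0, 0.0),
}

def expected_value(prob, payoff, cost):
    return prob * payoff - cost

best_action = max(actions, key=lambda a: expected_value(*actions[a]))
print("recommended action:", best_action)
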
ANALYTICS 1.0, 2.0, 3.0
Analytics 1.0
• Era: mid 1950s to 2009
• Descriptive statistics (report on events, occurrences, etc. of the past)
• Key questions asked: What happened? Why did it happen?
• Data from legacy systems, ERP, CRM and 3rd party applications
• Small and structured data sources; data stored in enterprise data warehouses or data marts
• Data was internally sourced
• Relational databases

Analytics 2.0
• Era: 2005 to 2012
• Descriptive statistics + Predictive statistics (use data from the past to make predictions for the future)
• Key questions asked: What will happen? Why will it happen?
• Big Data
• Big data is being taken up seriously; data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop
• Data was often externally sourced
• Database appliances, Hadoop clusters, SQL to Hadoop environments, etc.

Analytics 3.0
• Era: 2012 to present
• Descriptive statistics + Predictive statistics + Prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one’s advantage)
• Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?
• A blend of big data and data from legacy systems, ERP, CRM and 3rd party applications
• A blend of big data and traditional analytics to yield insights and offerings with speed and impact
• Data is both internally and externally sourced
• In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
DATA ANALYTICS LIFECYCLE

• Data science projects differ from BI projects


• More exploratory in nature
• Critical to have a project process
• Participants should be thorough and rigorous
• Break large projects into smaller pieces
• Spend time to plan and scope the work
• Documenting adds rigor and credibility
DATA ANALYTICS LIFECYCLE

• Data Analytics Lifecycle Overview


• Phase 1: Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communicate Results
• Phase 6: Operationalize
DATA ANALYTICS LIFECYCLE OVERVIEW

• The data analytic lifecycle is designed for Big Data problems and data
science projects
• With six phases the project work can occur in several phases
simultaneously
• The cycle is iterative to portray a real project
• Work can return to earlier phases as new information is uncovered
KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT
KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT

• Business User – understands the domain area


• Project Sponsor – provides requirements
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business domain expertise based on
deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data management and
extraction, supports analytic sandbox
• Data Scientist – provides analytic techniques and modeling
PHASE 1 : DISCOVERY

• Learning the Business Domain


• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
• Identifying Potential Data Sources
PHASE 2: DATA PREPARATION

❖ Includes steps to explore, preprocess, and condition data


❖ Create robust environment – analytics sandbox
• An Analytics Sandbox is a separate environment that is part of the overall data lake
architecture, meaning that it is a centralized environment meant to be used by multiple
users and is maintained with the support of IT

❖ Data preparation tends to be the most labor-intensive step in the analytics


lifecycle
• Often at least 50% of the data science project’s time

❖ The data preparation phase is generally the most iterative and the one that
teams tend to underestimate most often.
2.1 PREPARING THE ANALYTIC SANDBOX

• Create the analytic sandbox (also called workspace)


• Allows team to explore data without interfering with live production data
• Sandbox collects all kinds of data (expansive approach)
• The sandbox allows organizations to undertake ambitious projects beyond
traditional data analysis and BI to perform advanced predictive analytics
• Although the concept of an analytics sandbox is relatively new, this concept
has become acceptable to data science teams and IT groups
2.2 PERFORMING ETLT
(EXTRACT, TRANSFORM, LOAD, TRANSFORM)
• In ETL users perform extract, transform, load
• In the sandbox the process is often ELT – early load preserves the raw data
which can be useful to examine
• Example – in credit card fraud detection, outliers can represent high-risk
transactions that might be inadvertently filtered out or transformed before
being loaded into the database
• Hadoop is often used here
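
A small pandas sketch of the ELT idea (file and column names are hypothetical): the raw data is landed first, and transformations add flags and derived columns rather than filtering rows before the load, so potential fraud outliers are kept instead of being silently dropped.

import pandas as pd

# Extract + Load: land the raw transactions untouched in the sandbox
raw = pd.read_csv("transactions_raw.csv")            # e.g. columns: txn_id, amount, country
raw.to_parquet("sandbox/transactions_raw.parquet")   # preserve the raw data as loaded

# Transform after loading: derive features and flags on a working copy
work = raw.copy()
threshold = work["amount"].mean() + 3 * work["amount"].std()
work["is_outlier"] = work["amount"] > threshold      # flagged, not filtered out
work["amount_zscore"] = (work["amount"] - work["amount"].mean()) / work["amount"].std()

work.to_parquet("sandbox/transactions_enriched.parquet")
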
2.3 LEARNING ABOUT THE DATA
• Becoming familiar with the data is critical
• This activity accomplishes several goals:
• Determines the data available to the team early in the project
• Highlights gaps – identifies data not currently available
• Identifies data outside the organization that might be useful.
LEARNING ABOUT THE DATA – SAMPLE DATASET INVENTORY
2.4 DATA CONDITIONING
• Data conditioning includes cleaning data, normalizing datasets, and
performing transformations
• Often viewed as a preprocessing step prior to data analysis, it might be performed by
data owner, IT department, DBA, etc.
• Best to have data scientists involved
• Data science teams prefer having too much data rather than too little
2.4 DATA CONDITIONING
❖ Additional questions and considerations
• What are the data sources? Target fields?
• How clean is the data?
• How consistent are the contents and files? Missing or inconsistent values?
• Assess the consistency of the data types – numeric, alphanumeric?
• Review the contents to ensure the data makes sense
• Look for evidence of systematic error
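
A brief pandas sketch of these data conditioning checks (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("customer_data.csv")      # hypothetical input from the data owner

# How clean is the data? Missing or inconsistent values?
print(df.isna().sum())                     # missing values per column
print(df.dtypes)                           # numeric vs. alphanumeric consistency
print(df.duplicated().sum(), "duplicate rows")

# Normalize and transform: trim text fields, enforce types, standardize a numeric column
df["region"] = df["region"].str.strip().str.title()
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Look for evidence of systematic error, e.g. impossible values
print(df[(df["age"] < 0) | (df["age"] > 120)])
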
2.5 SURVEY AND VISUALIZE
• Leverage data visualization tools to gain an overview of the data
• Shneiderman’s mantra:
• “Overview first, zoom and filter, then details-on-demand”
• This enables the user to find areas of interest, zoom and filter to find more detailed
information about a particular area, then find the detailed data in that area.
2.5 SURVEY AND VISUALIZE -
GUIDELINES AND CONSIDERATIONS
• Review data to ensure calculations are consistent
• Does the data distribution stay consistent?
• Assess the granularity of the data, the range of values, and the level of
aggregation of the data
• Does the data represent the population of interest?
• Check time-related variables – daily, weekly, monthly? Is this good enough?
• Is the data standardized/normalized? Scales consistent?
• For geospatial datasets, are state/country abbreviations consistent?
2.6 COMMON TOOLS FOR DATA PREPARATION
• Hadoop can perform parallel ingest and analysis
• Alpine Miner provides a graphical user interface for creating analytic
workflows
• OpenRefine (formerly Google Refine) is a free, open source tool for working
with messy data
• Similar to OpenRefine, Data Wrangler is an interactive tool for data cleansing
and transformation
PHASE 3: MODEL PLANNING

❖ Activities to consider
• Assess the structure of the data – this dictates the tools and analytic techniques
for the next phase
• Ensure the analytic techniques enable the team to meet the business objectives
and accept or reject the working hypotheses
• Determine if the situation warrants a single model or a series of techniques as
part of a larger analytic workflow
• Research and understand how other analysts have approached this kind or
similar kind of problem.
PHASE 3: MODEL PLANNING
MODEL PLANNING IN INDUSTRY VERTICALS
Example of other analysts approaching a similar problem
3.1 DATA EXPLORATION
AND VARIABLE SELECTION
• Explore the data to understand the relationships among the variables to inform selection
of the variables and methods
• A common way to do this is to use data visualization tools
• Often, stakeholders and subject matter experts may have ideas
• For example, some hypothesis that led to the project
• Aim for capturing the most essential predictors and variables
• This often requires iterations and testing to identify key variables
• If the team plans to run regression analysis, identify the candidate predictors and outcome
variables of the model
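
A short pandas sketch of this kind of exploration (it assumes a prepared dataset with a numeric outcome column; the names are hypothetical):

import pandas as pd

df = pd.read_csv("analysis_dataset.csv")    # hypothetical prepared dataset

# Relationships among candidate predictors and the outcome variable
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr["outcome"].sort_values(ascending=False))   # assumes a numeric 'outcome' column

# A quick visual overview (scatter matrix) to help spot the most essential predictors
pd.plotting.scatter_matrix(numeric, figsize=(8, 8))
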
3.2 MODEL SELECTION
• The main goal is to choose an analytical technique, or several candidates, based on the
end goal of the project
• We observe events in the real world and attempt to construct models that emulate this
behavior with a set of rules and conditions
• A model is simply an abstraction from reality
• Determine whether to use techniques best suited for structured data, unstructured data, or
a hybrid approach
• Teams often create initial models using statistical software packages such as R, SAS, or
Matlab
• Which may have limitations when applied to very large datasets
• The team moves to the model building phase once it has a good idea about the type of
model to try.
3.3 COMMON TOOLS FOR THE MODEL PLANNING PHASE

• R has a complete set of modeling capabilities


• R contains about 5000 packages for data analysis and graphical presentation
• SQL Analysis services can perform in-database analytics of common data mining
functions, involved aggregations, and basic predictive models.
• SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple
data connections.
PHASE 4: MODEL BUILDING
• Execute the models defined in Phase 3
• Develop datasets for training, testing, and production
• Develop analytic model on training data, test on test data (a minimal sketch of this workflow follows the questions below)
• Question to consider
• Does the model appear valid and accurate on the test data?

• Does the model output/behavior make sense to the domain experts?

• Do the parameter values make sense in the context of the domain?

• Is the model sufficiently accurate to meet the goal?

• Does the model avoid intolerable mistakes?

• Are more data or inputs needed?

• Will the kind of model chosen support the runtime environment?

• Is a different form of the model required to address the business problem?
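
A minimal scikit-learn sketch of the train/test workflow referred to above (synthetic data stands in for the prepared analytic dataset):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for the prepared analytic dataset
X = rng.normal(size=(2000, 5))
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

# Develop datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Develop the analytic model on the training data, evaluate it on the held-out test data
model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("test accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))    # does the model avoid intolerable mistakes?
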


4.1 COMMON TOOLS FOR THE MODEL BUILDING PHASE
❖ Commercial Tools

• SAS Enterprise Miner – built for enterprise-level computing and analytics

• SPSS Modeler (IBM) – provides enterprise-level computing and analytics

• Matlab – high-level language for data analytics, algorithms, data exploration

• Alpine Miner – provides GUI frontend for backend analytics tools

• STATISTICA and MATHEMATICA – popular data mining and analytics tools

❖ Free or Open Source Tools

• R and PL/R - PL/R is a procedural language for PostgreSQL with R

• Octave – language for computational modeling

• WEKA – data mining software package with analytic workbench

• Python – language providing toolkits for machine learning and analysis

• SQL – in-database implementations provide an alternative tool


PHASE 5: COMMUNICATE RESULTS
• Determine if the team succeeded or failed in its objectives
• Assess if the results are statistically significant and valid
• If so, identify aspects of the results that present salient findings
• Identify surprising results and those in line with the hypotheses

• Communicate and document the key findings and major insights derived
from the analysis
• This is the most visible portion of the process to the outside stakeholders and sponsors
PHASE 6: OPERATIONALIZE
• In this last phase, the team communicates the benefits of the project more
broadly and sets up a pilot project to deploy the work in a controlled way
• Risk is managed effectively by undertaking small scope, pilot deployment
before a wide-scale rollout
• During the pilot project, the team may need to execute the algorithm more
efficiently in the database rather than with in-memory tools like R, especially
with larger datasets
• To test the model in a live setting, consider running the model in a production
environment for a discrete set of products or a single line of business
• Monitor model accuracy and retrain the model if necessary
PHASE 6: OPERATIONALIZE
KEY OUTPUTS FROM SUCCESSFUL ANALYTICS PROJECT

• Business user – tries to determine business benefits and implications


• Project sponsor – wants business impact, risks, ROI
• Project manager – needs to determine if project completed on time, within
budget, goals met
• Business intelligence analyst – needs to know if reports and dashboards will
be impacted and need to change
• Data engineer and DBA – must share code and document
• Data scientist – must share code and explain model to peers, managers,
stakeholders
PHASE 6: OPERATIONALIZE
FOUR MAIN DELIVERABLES

• Although the seven roles represent many interests, the interests overlap and
can be met with four main deliverables
• Presentation for project sponsors – high-level takeaways for executive level
stakeholders
• Presentation for analysts – describes business process changes and reporting
changes, includes details and technical graphs
• Code for technical people
• Technical specifications of implementing the code
WHAT IS HADOOP?

“Apache Hadoop is an open-source software framework for


distributed storage and distributed processing of very large
data sets on clusters of commodity hardware”
FUNDAMENTAL PRINCIPLES OF HADOOP

• Parallel Execution
• Data Locality
• Fault Tolerance
• Scalability
• Economical
• Distributed environment
WHAT IS HADOOP?
HDFS

HDFS creates a level of abstraction over the underlying distributed

storage resources, so that we can see HDFS as a single unit.
HDFS

❖ HDFS has two core components:


• Name Node
• Main node that contains metadata about the data stored.
❖ Data Nodes
• Data is stored on the data nodes, which are commodity hardware in the distributed
environment.
HDFS
HDFS

HDFS is a specially designed file system for storing

huge datasets on commodity hardware in a distributed environment.
HDFS

❖ Key points of HDFS


• Storage component of Hadoop.
• Distributed file system
• Modeled after Google File System
• Optimized for high throughput (HDFS leverages large block size and moves computation
where data is stored)
• A file can be replicated a configured number of times, which makes HDFS tolerant to
both hardware and software failures.
• Automatically re-replicates the data blocks that were stored on failed nodes.
• Power of HDFS can be realized if we perform read or write on large files (Gigabytes).
WHY HDFS?
WHY HDFS?
HADOOP DAEMONS

❖ Daemon – a process or service that runs on a machine in the cluster.


• Data Node
• Name node
• Secondary Name Node
• Task tracker
• Job tracker
HADOOP DAEMONS

❖ Data Node
• Slave node or slave daemon
• Stores the actual data.
• Can be multiple data nodes.
❖ Name node
• Master node or Master daemon.
• Manages the Data nodes.
• Stores the meta data.
• Only one master node.
HADOOP DAEMONS
HADOOP DAEMONS

• During read and write, DataNodes communicate with each other.

• A DataNode also continuously sends “heartbeat” messages to the NameNode to

ensure the connectivity between the NameNode and the DataNode.

• A DataNode sends a heartbeat every 3 seconds to the NameNode.

• If a DataNode fails to send a heartbeat within its time interval, that DataNode is
considered dead and its tasks are reassigned to another DataNode.
HADOOP DAEMONS

Heartbeat is the signal that DataNodes send continuously


to the NameNode. This signal shows the status of the DataNode
HADOOP DAEMONS

❖ Metadata in HDFS is maintained by two files


• editlog
• Keeps track of the recent changes made on HDFS

• fsimage
• Keeps track of every change made on HDFS from the beginning.
HADOOP DAEMONS

❖ What happens when


• editlog file size increase?
• Namenode fails?

❖ Solution :
• Make copies of editlog and fsimage files
SECONDARY NAMENODE
SECONDARY NAMENODE

• Takes a snapshot of HDFS metadata at intervals specified in the Hadoop


configuration (usually happens every hour).

• Since the memory requirements of Secondary NameNode are the same as


NameNode, it is better to run NameNode and Secondary NameNode on
different machines.

• In case of failure of NameNode, the secondary NameNode can be configured


manually to bring up the cluster.
HDFS CLUSTER ARCHITECTURE
HDFS DATA BLOCKS

• HDFS divides massive files into small chunks; these small chunks are called
blocks.

• Each file in HDFS is stored as one or more blocks.

• Default block size: 128 MB (earlier it was 64 MB).

• The block size is configurable and is typically set in multiples of 128 MB.


HDFS DATA BLOCKS

❖ Why 128 MB?

• If the block size is smaller, then there will be too many data blocks along
with lots of metadata which will create overhead.

• Similarly, if the block size is very large then the processing time for each
block increases.
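
A small back-of-the-envelope sketch in Python of how block size and replication affect block counts (the 500 MB file is hypothetical):

import math

BLOCK_SIZE_MB = 128          # current HDFS default (earlier releases used 64 MB)
REPLICATION_FACTOR = 3       # default replication factor

def hdfs_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file occupies; the last block may be smaller."""
    return math.ceil(file_size_mb / block_size_mb)

file_size_mb = 500
blocks = hdfs_blocks(file_size_mb)              # 4 blocks: 128 + 128 + 128 + 116 MB
stored_mb = file_size_mb * REPLICATION_FACTOR   # raw storage consumed with replication
print(blocks, "blocks,", stored_mb, "MB stored across the cluster")

# Why not a tiny block size? The NameNode must keep metadata for every block:
print(hdfs_blocks(file_size_mb, block_size_mb=1), "blocks if the block size were 1 MB")
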
HDFS DATA BLOCKS
HDFS DATA BLOCKS

• How are files stored in HDFS?


HDFS DATA BLOCKS

• Data node failure

What will happen if Node 5 crashes?


HDFS DATA BLOCKS
REPLICATION

HDFS overcomes the issue of DataNode failure by creating copies of

the data – the replication method.
REPLICATION
HDFS DATA BLOCKS

• Default replication factor is 3.

• We will have 3 copies of each data block.


BLOCK PLACEMENT STRATEGY
BLOCK PLACEMENT STRATEGY

The first replica is placed on the node closest to the client.

The second replica is placed on a node on a different rack.
The third replica is placed on the same rack as the second, but on a different node in that rack.
BLOCK PLACEMENT STRATEGY
BLOCK PLACEMENT STRATEGY
RACK AWARENESS
• Rack – a collection of around 40-50 machines.

• Communication between nodes on the same rack is faster than communication
between racks that are far apart.

• In a large Hadoop cluster, in order to reduce network traffic during read/write
operations, the NameNode chooses DataNodes that are on the same rack or on a
nearby rack.

• The NameNode maintains the rack ID of each DataNode, through which it keeps
information about every rack.

• This concept of choosing closer DataNodes based on rack information is called
Rack Awareness.
RACK AWARENESS
MAPREDUCE

• MapReduce programming is a software framework.


• It is the processing component of Apache Hadoop.
• It helps us to process massive amounts of data in parallel.

MAPREDUCE IN A NUTSHELL

ADVANTAGES OF MAPREDUCE

• Parallel Processing
• Data Locality – Processing to Storage

PARALLEL PROCESSING

• Data is processed in parallel


• Processing becomes fast

DATA LOCALITY

• Moving data to processing is very costly.


• In Mapreduce, we move processing to data.

ELECTRONIC VOTES COUNTING

• Votes are stored in different booths.


• Result centre has the details of all booths.

ELECTRONIC VOTES COUNTING

❖ Counting – Traditional approach


• Votes are moved to result centre for counting.
• Moving all votes to the centre is costly.
• Result centre is over burdened.
• Counting takes time.

ELECTRONIC VOTES COUNTING

❖ Counting – MapReduce approach


• Votes are counted at individual booths.
• Booth-wise results are sent back to the result centre.
• Final result is declared easily and quickly.
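
A small Python sketch of this booth-wise counting (the booths and ballots are made up): each booth tallies its own votes locally – the “map” step – and only the small per-booth tallies travel to the result centre, where they are combined – the “reduce” step.

from collections import Counter
from functools import reduce

# Hypothetical ballots recorded at each booth
booths = {
    "booth_1": ["A", "B", "A", "C", "A"],
    "booth_2": ["B", "B", "A", "C"],
    "booth_3": ["C", "A", "B", "B", "B"],
}

# "Map": each booth counts its own votes locally (in parallel on a real cluster)
local_counts = {booth: Counter(votes) for booth, votes in booths.items()}

# "Reduce": the result centre combines the booth-wise tallies
final_result = reduce(lambda a, b: a + b, local_counts.values(), Counter())
print(final_result.most_common())    # e.g. [('B', 6), ('A', 5), ('C', 3)]
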

CASE STUDY : LIBRARY MANAGEMENT
MAPREDUCE METHOD

• In MapReduce programming, the input dataset is divided into independent
chunks.

• Map tasks process these independent chunks completely in a parallel
manner.

• The output produced by the map tasks serves as intermediate data and is
stored on the local disk of that server.
MAPREDUCE METHOD

• The output of the mappers is automatically shuffled and sorted by the
framework.

• The MapReduce framework sorts the output based on keys.

• This sorted output becomes the input to the reduce tasks.

• Reduce tasks provide the reduced output by combining the output of the
various mappers.
MAPREDUCE METHOD

❖ In summary,

• Map jobs take data sets as input and process them to produce
key-value pairs.

• Reduce job takes the output of the Map job i.e. the key value pairs and
aggregates them to produce desired results.
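
A minimal word-count sketch in plain Python, written in the style of a Hadoop Streaming mapper and reducer (an illustration only, not Hadoop's own API): the mapper emits key-value pairs, the framework's shuffle-and-sort phase is simulated with sorted(), and the reducer aggregates the values per key.

from itertools import groupby

def mapper(lines):
    """Map: turn each input line into (key, value) pairs."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce: pairs arrive sorted by key; aggregate the values per key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    shuffled = sorted(mapper(lines))      # stands in for the shuffle-and-sort phase
    for word, total in reducer(shuffled):
        print(word, total)
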

MAPREDUCE ALGORITHM

• Generally, the MapReduce paradigm is based on sending the computation to
where the data resides.

• A MapReduce program executes in three stages, namely the map stage,
shuffle stage, and reduce stage.
MAPREDUCE ALGORITHM
❖ Map stage
• The map or mapper’s job is to process the input data.
• Generally the input data is in the form of a file or directory and is stored in the Hadoop file
system (HDFS).
• The input file is passed to the mapper function line by line.
• The mapper processes the data and creates several small chunks of data.
❖ Reduce stage
• This stage is the combination of the Shuffle stage and the Reduce stage.
• The Reducer’s job is to process the data that comes from the mapper.
• After processing, it produces a new set of output, which will be stored in the HDFS.
❖ During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
HOW DOES MAPREDUCE WORK?
• MapReduce divides data analysis task into two parts

• map

• reduce

• Mapper works on the partial dataset that is stored on that node and the
reducer combines the output from the mappers to produce the reduced
result set.

MAPREDUCE DAEMONS

• JobTracker

• JobTracker process runs on a separate node and not usually on a Data Node.

• JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced


by ResourceManager/ApplicationMaster in MRv2.

• JobTracker receives the requests for MapReduce execution from the client.

• JobTracker talks to the NameNode to determine the location of the data.

MAPREDUCE DAEMONS

• JobTracker
• JobTracker finds the best TaskTracker nodes to execute tasks based on the data
locality (proximity of the data) and the available slots to execute a task on a given
node.

• JobTracker monitors the individual TaskTrackers and submits the overall status
of the job back to the client.

• JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.

• When the JobTracker is down, HDFS will still be functional, but MapReduce
execution cannot be started and the existing MapReduce jobs will be halted.
MAPREDUCE DAEMONS

• TaskTracker
• TaskTracker runs on DataNode. Mostly on all DataNodes.
• TaskTracker is replaced by Node Manager in MRv2.
• Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
• TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
• TaskTracker will be in constant communication with the JobTracker signalling the
progress of the task in execution.
• TaskTracker failure is not considered fatal. When a TaskTracker becomes
unresponsive, JobTracker will assign the task executed by the TaskTracker to another
node.

WHAT HAPPENS WITH MAP AND REDUCE FUNCTIONS

THANK YOU
