BDA Unit 2
UNIT- II
❖ Many people use Facebook to update their status, share photos and like
content.
• The Obama presidential campaign used all of that data on the network not
just to find voters, but also to assemble an army of volunteers.
BIG DATA ANALYTICS IN PRACTICE
• The BJP mined data from almost every Internet user in the country, and
used this data to accurately understand voter sentiment and local issues.
• The targeting was done not on national issues, but on local issues, which
were considered far more important.
BIG DATA ANALYTICS IN PRACTICE
• Bangalore’s massive population growth (from 5.4 million in 2000 to over 10 million)
has put tremendous strain on the city’s water supply and distribution systems.
• In partnership with IBM, the BWSSB has built an operational dashboard, which
serves as a “command center” for managing the city’s water supply networks.
• Implementing this solution will help minimize unaccounted-for water by detecting large
changes in water flow through real-time monitoring.
BIG DATA ANALYTICS IN PRACTICE
❖ Using Big Data to predict ticket confirmations for trains
• Using the power of Big Data, ixigo has launched a PNR prediction feature for train
travelers.
• For any given train’s wait-listed status, ixigo is now able to show a near-accurate
probability that the ticket will be confirmed, so that travelers may decide whether
or not to book a wait-listed ticket.
• The PNR prediction feature also shows the probability of an already-booked ticket
getting confirmed, and solves a huge pain point for millions of daily train travelers.
• The company claims that its app gives far more accurate PNR prediction than all
existing PNR prediction services since ixigo has mined data from over 10 million
PNRs over the last two years.
• The company claims an accuracy rate of 90% and hopes to raise it to 95% over a
period of time.
HOW IS IT DIFFERENT?
• A Data Scientist not only performs exploratory analysis on the data to discover
insights from it, but also uses various advanced machine learning algorithms
to predict the occurrence of a particular event in the future.
• A Data Scientist will look at the data from many angles, sometimes angles
not known earlier.
HOW IS IT DIFFERENT?
COMPONENTS OF DATA SCIENCE
• In this requirements-gathering process, the BI analyst must identify the first- and
second-level questions the business users want to address in order to build a robust
and scalable data warehouse.
BI ENGAGEMENT PROCESS
• 1st level question: How many patients did we treat last month?
• 2nd level question: How did that compare to the previous month?
• 2nd level question: What were the major DRG (Diagnosis Related Groups) types
treated?
• 1st level question: How many patients came through ER last night?
• 2nd level question: How did that compare to the previous night?
• 1st level question: What percentage of beds was used at Hospital X last week?
• 2nd level question: What is the trend of bed utilization over the past year?
• 2nd level question: What departments had the largest increase in bed utilization?
• The BI Analyst then works closely with the data warehouse team to define and build the
underlying data models that support the questions being asked.
• The BI Analyst will use the BI tool’s graphical user interface (GUI) to create the
SQL query by selecting the measures and dimensions; selecting page, column
and row descriptors; specifying constraints, subtotals and totals; creating
special calculations (mean, moving average, rank, share of); and selecting sort
criteria.
• The BI GUI hides much of the complexity of creating the SQL query.
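To make the generated-SQL idea concrete, here is a minimal sketch using Python’s built-in sqlite3 as a stand-in for the warehouse database; the admissions table and its columns are hypothetical, and the query is only an example of the kind of aggregate SQL a BI tool might emit for the hospital questions above.

```python
import sqlite3

# Hypothetical miniature "data warehouse" table standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (admit_month TEXT, drg_type TEXT, patients INTEGER)")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [("2024-01", "Cardiac", 120), ("2024-01", "Orthopedic", 80),
     ("2024-02", "Cardiac", 135), ("2024-02", "Orthopedic", 95)],
)

# The kind of query a BI GUI generates: measures (SUM), dimensions (month, DRG),
# constraints (WHERE), grouping for subtotals (GROUP BY), and sort criteria (ORDER BY).
query = """
    SELECT admit_month, drg_type, SUM(patients) AS total_patients
    FROM admissions
    WHERE admit_month >= '2024-01'
    GROUP BY admit_month, drg_type
    ORDER BY admit_month, total_patients DESC
"""
for row in conn.execute(query):
    print(row)
```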
BI ENGAGEMENT PROCESS
• In some cases, the BI Analyst will modify the SQL commands generated by the
BI tool to include unique SQL commands that may not be supported by the BI
tool.
BI ENGAGEMENT PROCESS
• This is a highly iterative process, where the BI Analyst will tweak the SQL
(either using the GUI or hand-coding the SQL statement) to fine-tune the SQL
request.
BI ENGAGEMENT PROCESS
• If the data is not in the data warehouse, then adding it to an existing
warehouse (and creating all the supporting ETL processes) can take months.
THE DATA SCIENCE ENGAGEMENT PROCESS
• The data lake (A data lake is a centralized repository that allows you to store all
your structured and unstructured data at any scale) is a great approach for this
process, as the data scientist can grab any data they want, test it, ascertain its
value given the hypothesis or prediction, and then decide whether to include that
data in the predictive model or throw it away.
THE DATA SCIENCE ENGAGEMENT PROCESS
• The data scientist can’t define the schema until they know the hypothesis that
they are testing and know what data sources they are going to be using to build
their analytic models.
THE DATA SCIENCE ENGAGEMENT PROCESS
• Instead, the data scientist will define the schema as needed based upon the data that is
being used in the analysis.
• The data scientist will likely iterate through several different versions of the schema
until finding a schema (and analytic model) that sufficiently answers the hypothesis
being tested.
THE DATA SCIENCE ENGAGEMENT PROCESS
• Data visualization tools like Tableau, Spotfire, Domo and DataRPM are great data
scientist tools for exploring the data and identifying variables that they might
want to test.
THE DATA SCIENCE ENGAGEMENT PROCESS
• At this point, the data scientist will explore different analytic techniques
and algorithms to try to create the most predictive models.
THE DATA SCIENCE ENGAGEMENT PROCESS
• The goodness of fit of a statistical model describes how well the model fits a set of
observations.
• A number of different analytic techniques will be used to determine the goodness of fit
including Kolmogorov–Smirnov test, Pearson's chi-squared test, ANalysis Of VAriance
(ANOVA) and confusion (or error) matrix.
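As a hedged illustration, the sketch below runs each of the listed goodness-of-fit checks on small synthetic data using scipy and scikit-learn; the data values are invented purely for demonstration.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)

# Kolmogorov-Smirnov test: does a sample plausibly come from a standard normal?
sample = rng.normal(loc=0.0, scale=1.0, size=500)
ks_stat, ks_p = stats.kstest(sample, "norm")
print("KS test:", ks_stat, ks_p)

# Pearson's chi-squared test on a 2x2 contingency table of observed counts.
observed = np.array([[30, 10], [20, 40]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(observed)
print("Chi-squared:", chi2, chi2_p)

# One-way ANOVA: do three groups share the same mean?
g1, g2, g3 = rng.normal(0, 1, 100), rng.normal(0.2, 1, 100), rng.normal(0.5, 1, 100)
f_stat, anova_p = stats.f_oneway(g1, g2, g3)
print("ANOVA:", f_stat, anova_p)

# Confusion (error) matrix for a classifier's predictions vs. actual labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```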
THE DATA SCIENCE ENGAGEMENT PROCESS
❖ In the BI process,
• the schema must be built first and must be built to support a wide variety of
questions across a wide range of business functions.
• So the data model must be extensible and scalable, which means that it is heavily
engineered.
❖ In the data science process,
• the work is highly collaborative; the more subject matter experts involved in the
process, the better the resulting model.
• The resulting insights must be:
• Actionable (insights that the organization can actually act on), and
• Material (where the value of acting on the insights is greater than the cost of acting on
the insights).
DATA SCIENTIST
• The term "Data Science" (originally used interchangeably with "datalogy") was
initially used as a substitute for computer science by Peter Naur in 1960.
• In 2008, DJ Patil and Jeff Hammerbacher coined the term "Data Scientist" to
define their jobs at LinkedIn and Facebook, respectively.
ROLE OF A DATA SCIENTIST
❖ Business Acumen Skills
• A Data Scientist should have the prowess to counter the pressures of the
business.
• A list of traits that need to be honed to play the role of a data scientist:
✓ Understanding of domain
✓ Business strategy
✓ Problem solving
✓ Communication
✓ Presentation
✓ Inquisitiveness
ROLE OF A DATA SCIENTIST
❖ Technology Expertise
• Good database knowledge such as RDBMS
• Good NoSQL database knowledge such as MongoDB, Cassandra, HBase, etc.
• Programming languages such as Java, Python, C++, etc.
• Open source tools such as Hadoop.
• Data Warehousing
• Data Mining
• Visualization tools such as Tableau, Flare, Google Visualization APIs, etc.
ROLE OF A DATA SCIENTIST
❖ Mathematics Expertise
• Since the job of a data scientist requires comprehending, interpreting, making sense of,
and analysing data, he/she will have to dabble in learning algorithms and related areas:
• Mathematics
• Statistics
• Artificial Intelligence (AI)
• Algorithms
• Machine learning
• Pattern recognition
• Natural Language Processing
ROLE OF A DATA SCIENTIST
• Identifying the data-analytics problems that offer the greatest opportunities to the
organization.
• Determining the correct data sets and variables.
• Collecting large sets of structured and unstructured data from disparate sources.
• Cleaning and validating the data to ensure accuracy, completeness, and uniformity.
• Devising and applying models and algorithms to mine the stores of big data.
• Analyzing the data to identify patterns and trends.
• Interpreting the data to discover solutions and opportunities.
• Communicating findings to stakeholders using visualization and other means.
CLASSIFICATION OF ANALYTICS
❖ Basic Analytics
• Slicing and dicing of data to help with basic business insights.
• Reporting on historical data, basic visualization, etc.
❖ Operationalized analytics
• Gets woven into the enterprise’s business processes.
FIRST SCHOOL OF THOUGHT
❖ Advanced analytics
• Forecasting for the future by way of predictive and prescriptive modeling.
❖ Monetized analytics
• To derive direct business revenue.
SECOND SCHOOL OF THOUGHT
❖ Analytics 1.0
• Mid-1950s to 2009
• Descriptive statistics (and Diagnostic)
• Report on events, occurrences, etc., of the past.
• What happened?
• Why did it happen?
SECOND SCHOOL OF THOUGHT
❖ Analytics 2.0
• 2005 to 2012
• Descriptive statistics + Predictive statistics
• Use data from the past to make predictions for the future
• What will happen?
• Why will it happen?
SECOND SCHOOL OF THOUGHT
❖ Analytics 3.0
• 2012 to present
• Descriptive + Predictive + Prescriptive statistics
• Use data from the past to make prophecies for the future and at the same time make
recommendations to leverage the situation to one’s advantage.
• What will happen?
• When will it happen?
• Why will it happen?
• What should be the action taken to take advantage of what will happen?
ANALYTICS 1.0, 2.0, 3.0
❖ Descriptive Analytics
• which use data aggregation and data mining to provide insight into the
past and answer: “What has happened?”
• Insight into the past
• Use Descriptive Analytics when you need to understand at an aggregate
level what is going on in your company, and when you want to
summarize and describe different aspects of your business.
ANALYTICS 1.0, 2.0, 3.0
❖ Descriptive analytics
• Describing or summarizing the existing data using existing business intelligence tools
to better understand what is going on or what has happened.
• simplest form of analytics
• purpose of this analytics type is just to summarize the findings and understand what is
going on.
• It is said that 80% of business analytics mainly involves descriptions based on
aggregations of past performance.
• The tools used in this phase are MS Excel, MATLAB, SPSS, STATA, etc.
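A minimal descriptive-analytics sketch in Python (pandas), assuming a hypothetical admissions table; it simply aggregates and summarizes what has already happened.

```python
import pandas as pd

# Hypothetical admissions data; descriptive analytics just aggregates and summarizes it.
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "department": ["ER", "Cardiology", "ER", "Cardiology", "ER", "Cardiology"],
    "patients": [410, 120, 455, 135, 430, 150],
})

# "What has happened?" -- total patients treated per month, and month-over-month change.
monthly = df.groupby("month", sort=False)["patients"].sum()
print(monthly)
print(monthly.pct_change())          # e.g. Feb vs. Jan, Mar vs. Feb

# Summary statistics describing the business at an aggregate level.
print(df.groupby("department")["patients"].describe())
```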
ANALYTICS 1.0, 2.0, 3.0
❖ Diagnostic Analysis
• Focus on past performance to determine what happened and why.
• Diagnostic analytics is used to determine why something happened in the past.
• It is characterized by techniques such as drill-down, data discovery, data mining and
correlations.
• Diagnostic analytics takes a deeper look at data to understand the root causes of the
events.
• It has a limited ability to give actionable insights.
• It just provides an understanding of causal relationships and sequences while looking
backward.
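A small diagnostic-analytics sketch, again on hypothetical data, showing the drill-down and correlation techniques mentioned above; correlation suggests a candidate root cause, it does not prove causation.

```python
import pandas as pd

# Hypothetical sales data; diagnostic analytics drills down and correlates to ask "why?".
df = pd.DataFrame({
    "region":   ["North", "North", "South", "South", "North", "South"],
    "month":    ["Jan", "Feb", "Jan", "Feb", "Mar", "Mar"],
    "ad_spend": [10.0, 4.0, 8.0, 9.0, 3.0, 9.5],
    "revenue":  [100.0, 55.0, 90.0, 95.0, 50.0, 98.0],
})

# Drill-down: which region drove the drop in February?
print(df.pivot_table(index="region", columns="month", values="revenue", aggfunc="sum"))

# Correlation: revenue tracks ad spend closely -- a candidate root cause to investigate.
print(df[["ad_spend", "revenue"]].corr())
```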
ANALYTICS 1.0, 2.0, 3.0
❖ Predictive Analytics
• which use statistical models and forecasting techniques to understand
the future and answer: “What could happen?”
• Understanding the future
• Use Predictive Analytics any time you need to know something about the
future, or fill in the information that you do not have.
ANALYTICS 1.0, 2.0, 3.0
❖ Predictive Analytics
• Emphasizes predicting the possible outcome using statistical models and machine
learning techniques.
• It is important to note that it cannot say with certainty whether an event will occur in
the future; it merely forecasts the probability of the event occurring.
• A predictive model builds on the preliminary descriptive analytics stage to derive the
possibility of the outcomes.
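A minimal predictive-analytics sketch along the lines of the ixigo example, using scikit-learn’s logistic regression on synthetic data; note that the model outputs probabilities of confirmation, not certainties.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: one feature (waitlist position) and whether
# the ticket was eventually confirmed (1) or not (0).
waitlist_position = rng.integers(1, 60, size=200).reshape(-1, 1)
confirmed = (waitlist_position.ravel() + rng.normal(0, 10, size=200) < 30).astype(int)

model = LogisticRegression()
model.fit(waitlist_position, confirmed)

# The model does not say whether a ticket *will* confirm; it estimates the probability.
for pos in [5, 25, 50]:
    prob = model.predict_proba([[pos]])[0, 1]
    print(f"waitlist position {pos}: P(confirm) = {prob:.2f}")
```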
ANALYTICS 1.0, 2.0, 3.0
❖ Prescriptive Analytics
• which use optimization and simulation algorithms to advise on possible
outcomes and answer: “What should we do?”
• Advise on possible outcomes
• Use Prescriptive Analytics any time you need to provide users with
advice on what action to take.
ANALYTICS 1.0, 2.0, 3.0
❖ Prescriptive analytics
• It is a type of predictive analytics that is used to recommend one or more
courses of action based on the analysis of the data.
• It can suggest all favorable outcomes according to a specified course of
action, and can also suggest various courses of action to get to a particular
outcome.
• Hence, it uses a strong feedback system that constantly learns and updates
the relationship between the action and the outcome.
• Recommendation engines also use prescriptive analytics.
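A toy prescriptive-analytics sketch using linear-programming optimization from scipy; the budget numbers and return rates are invented purely to show how an optimizer can recommend a course of action.

```python
from scipy.optimize import linprog

# Toy prescriptive question: how to split budget across two campaigns to
# maximize expected revenue, subject to a total budget and per-campaign limits?
# linprog minimizes, so the revenue-per-dollar coefficients are negated.
revenue_per_dollar = [-3.0, -2.0]          # campaign A returns 3x, campaign B returns 2x
budget_constraint = [[1.0, 1.0]]           # total spend across both campaigns
budget_limit = [100.0]                     # at most 100 units of budget
bounds = [(0, 70), (0, 70)]                # no single campaign may exceed 70

result = linprog(revenue_per_dollar, A_ub=budget_constraint, b_ub=budget_limit,
                 bounds=bounds, method="highs")
print("Recommended spend (A, B):", result.x)   # the prescribed course of action
print("Expected revenue:", -result.fun)
```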
ANALYTICS 1.0, 2.0, 3.0
❖ Analytics 1.0 (Era: mid-1950s to 2009)
• Descriptive statistics (report on events, occurrences, etc. of the past)
• Data from legacy systems, ERP, CRM and 3rd-party applications
• Small and structured data sources; data stored in enterprise data warehouses or data marts
• Data was internally sourced
• Relational databases
❖ Analytics 2.0 (Era: 2005 to 2012)
• Descriptive statistics + predictive statistics (use data from the past to make predictions for the future)
• Big Data
• Big data is being taken up seriously; data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop
• Data was often externally sourced
• Database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc.
❖ Analytics 3.0 (Era: 2012 to present)
• Descriptive statistics + predictive statistics + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one’s advantage)
• A blend of big data and data from legacy systems, ERP, CRM and 3rd-party applications
• A blend of big data and traditional analytics to yield insights and offerings with speed and impact
• Data is both internally and externally sourced
• In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
ANALYTICS 1.0, 2.0, 3.0
DATA ANALYTICS LIFECYCLE
• The data analytic lifecycle is designed for Big Data problems and data
science projects
• It has six phases, and project work can occur in several phases
simultaneously
• The cycle is iterative to portray a real project
• Work can return to earlier phases as new information is uncovered
KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT
❖ The data preparation phase is generally the most iterative and the one that
teams tend to underestimate most often.
2.1 PREPARING THE ANALYTIC SANDBOX
❖ Activities to consider
• Assess the structure of the data – this dictates the tools and analytic techniques
for the next phase
• Ensure the analytic techniques enable the team to meet the business objectives
and accept or reject the working hypotheses
• Determine if the situation warrants a single model or a series of techniques as
part of a larger analytic workflow
• Research and understand how other analysts have approached this kind of
problem or similar ones.
PHASE 3: MODEL PLANNING
MODEL PLANNING IN INDUSTRY VERTICALS
Example of other analysts approaching a similar problem
3.1 DATA EXPLORATION AND VARIABLE SELECTION
• Explore the data to understand the relationships among the variables to inform selection
of the variables and methods
• A common way to do this is to use data visualization tools
• Often, stakeholders and subject matter experts may have ideas
• For example, some hypothesis that led to the project
• Aim for capturing the most essential predictors and variables
• This often requires iterations and testing to identify key variables
• If the team plans to run regression analysis, identify the candidate predictors and outcome
variables of the model
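A small sketch of one common variable-selection step, on a hypothetical dataset: rank candidate predictors by their correlation with the outcome variable to shortlist them for the regression model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical dataset: a few candidate predictors and one outcome variable.
n = 300
df = pd.DataFrame({
    "age": rng.normal(45, 12, n),
    "visits_last_year": rng.poisson(3, n),
    "random_noise": rng.normal(0, 1, n),
})
df["length_of_stay"] = 0.05 * df["age"] + 0.8 * df["visits_last_year"] + rng.normal(0, 1, n)

# Rank candidate predictors by absolute correlation with the outcome;
# the strongest ones become the short list for the regression model.
correlations = df.drop(columns="length_of_stay").corrwith(df["length_of_stay"]).abs()
print(correlations.sort_values(ascending=False))
```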
3.2 MODEL SELECTION
• The main goal is to choose an analytical technique, or several candidates, based on the
end goal of the project
• We observe events in the real world and attempt to construct models that emulate this
behavior with a set of rules and conditions
• A model is simply an abstraction from reality
• Determine whether to use techniques best suited for structured data, unstructured data, or
a hybrid approach
• Teams often create initial models using statistical software packages such as R, SAS, or
Matlab, which may have limitations when applied to very large datasets
• The team moves to the model building phase once it has a good idea about the type of
model to try.
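The sketch below illustrates trying several candidate techniques with cross-validation (here in Python/scikit-learn rather than R, SAS, or Matlab, and on synthetic data) before committing to one for the model building phase.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic structured data standing in for the project's real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Cross-validated accuracy gives a first read on which technique to carry forward.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```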
3.3 COMMON TOOLS FOR THE MODEL PLANNING PHASE
• Communicate and document the key findings and major insights derived
from the analysis
• This is the most visible portion of the process to the outside stakeholders and sponsors
PHASE 6: OPERATIONALIZE
• In this last phase, the team communicates the benefits of the project more
broadly and sets up a pilot project to deploy the work in a controlled way
• Risk is managed effectively by undertaking small scope, pilot deployment
before a wide-scale rollout
• During the pilot project, the team may need to execute the algorithm more
efficiently in the database rather than with in-memory tools like R, especially
with larger datasets
• To test the model in a live setting, consider running the model in a production
environment for a discrete set of products or a single line of business
• Monitor model accuracy and retrain the model if necessary
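A toy sketch of the monitor-and-retrain loop described above; the data generator, accuracy threshold, and drift values are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)

def make_batch(n=200, drift=0.0):
    """Hypothetical scoring batch; `drift` shifts the data away from what was trained on."""
    X = rng.normal(drift, 1.0, size=(n, 3))
    y = (X.sum(axis=1) > drift * 3).astype(int)
    return X, y

# Initial model trained before deployment.
X_train, y_train = make_batch()
model = LogisticRegression().fit(X_train, y_train)

# During the pilot, score incoming batches, track accuracy, retrain when it degrades.
THRESHOLD = 0.85
for batch_id, drift in enumerate([0.0, 0.2, 1.5]):
    X_live, y_live = make_batch(drift=drift)
    acc = accuracy_score(y_live, model.predict(X_live))
    print(f"batch {batch_id}: accuracy = {acc:.2f}")
    if acc < THRESHOLD:
        print(f"batch {batch_id}: below {THRESHOLD}, retraining on recent data")
        model = LogisticRegression().fit(X_live, y_live)
```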
PHASE 6: OPERATIONALIZE
KEY OUTPUTS FROM SUCCESSFUL ANALYTICS PROJECT
• Although the seven roles represent many interests, the interests overlap and
can be met with four main deliverables
• Presentation for project sponsors – high-level takeaways for executive level
stakeholders
• Presentation for analysts – describes business process changes and reporting
changes, includes details and technical graphs
• Code for technical people
• Technical specifications for implementing the code
WHAT IS HADOOP?
• Parallel Execution
• Data Locality
• Fault Tolerance
• Scalability
• Economical
• Distributed environment.
WHAT IS HADOOP?
HDFS
❖ DataNode
• Slave node or slave daemon.
• Stores the actual data.
• There can be multiple DataNodes.
❖ NameNode
• Master node or master daemon.
• Manages the DataNodes.
• Stores the metadata.
• There is only one master node.
HADOOP DAEMONS
• If a DataNode fails to send a heartbeat within the expected time interval, that
DataNode is considered dead and its tasks are reassigned to another
DataNode.
HADOOP DAEMONS
• fsimage – a snapshot of the complete HDFS namespace (file system metadata) at a point in time.
• editlog – keeps track of every change made on HDFS since the last fsimage snapshot.
HADOOP DAEMONS
❖ Solution :
• Make copies of editlog and fsimage files
SECONDARY NAMENODE
• HDFS divides massive files into small chunks; these small chunks are called
blocks.
• If the block size is smaller, then there will be too many data blocks along
with lots of metadata which will create overhead.
• Similarly, if the block size is very large then the processing time for each
block increases.
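The trade-off can be seen with simple arithmetic; the sketch below counts how many blocks (and hence NameNode metadata entries) a hypothetical 10 GB file needs at different block sizes.

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb):
    """Number of HDFS blocks (and hence metadata entries) needed for one file."""
    return math.ceil(file_size_mb / block_size_mb)

file_size_mb = 10 * 1024  # a 10 GB file

# Smaller blocks -> many more blocks, so more metadata for the NameNode to hold;
# larger blocks -> fewer blocks, but each block takes longer to process.
for block_size_mb in (4, 64, 128, 256):
    print(f"{block_size_mb} MB blocks -> {hdfs_block_count(file_size_mb, block_size_mb)} blocks")
```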
HDFS DATA BLOCKS
• So communication between nodes on the same rack is faster than
communication between nodes on racks that are far away.
• The NameNode holds the rack ID of every DataNode, through which it
maintains information about each rack.
MAPREDUCE IN A NUTSHELL
ADVANTAGES OF MAPREDUCE
• Parallel Processing
• Data Locality – Processing to Storage
PARALLEL PROCESSING
DATA LOCALITY
ELECTRONIC VOTES COUNTING
CASE STUDY : LIBRARY MANAGEMENT
MAPREDUCE METHOD
• The output produced by the map tasks serves as intermediate data and is
stored on the local disk of that server.
MAPREDUCE METHOD
• The output of the mappers is automatically shuffled and sorted by the
framework.
MAPREDUCE METHOD
❖ In summary,
• The Map job takes data sets as input and processes them to produce
key-value pairs.
• The Reduce job takes the output of the Map job, i.e. the key-value pairs, and
aggregates them to produce the desired results.
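As a hedged, self-contained illustration (plain Python simulating what Hadoop does across a cluster), the sketch below counts votes with explicit map, shuffle, and reduce steps; the candidate names and splits are made up.

```python
from collections import defaultdict

# Input split across "nodes": each record is one ballot naming a candidate.
splits = [
    ["alice", "bob", "alice"],
    ["bob", "bob", "carol"],
    ["alice", "carol", "alice"],
]

def map_task(records):
    """Map: emit a (key, value) pair per record -- here (candidate, 1)."""
    return [(candidate, 1) for candidate in records]

def shuffle(mapped):
    """Shuffle/sort: group all values by key across every mapper's output."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    """Reduce: aggregate the values for one key -- here, sum the votes."""
    return key, sum(values)

intermediate = [pair for split in splits for pair in map_task(split)]
results = [reduce_task(k, v) for k, v in shuffle(intermediate).items()]
print(results)   # e.g. [('alice', 4), ('bob', 3), ('carol', 2)]
```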
MAPREDUCE METHOD
MAPREDUCE ALGORITHM
❖ Map stage
• The map or mapper’s job is to process the input data.
• Generally, the input data is in the form of a file or directory and is stored in the Hadoop file
system (HDFS).
• The input file is passed to the mapper function line by line.
• The mapper processes the data and creates several small chunks of data.
❖ Reduce stage
• This stage is the combination of the Shuffle stage and the Reduce stage.
• The Reducer’s job is to process the data that comes from the mapper.
• After processing, it produces a new set of output, which will be stored in the HDFS.
❖ During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
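For a more Hadoop-flavored sketch, the two scripts below follow the Hadoop Streaming convention, where the mapper and reducer read from stdin and write tab-separated key-value pairs to stdout; the file names mapper.py and reducer.py are just illustrative. They would typically be submitted with the streaming jar (hadoop jar hadoop-streaming*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <in> -output <out>), with exact paths depending on the installation, and Hadoop performs the shuffle/sort between the two scripts.

```python
# mapper.py -- Map stage: read input line by line from stdin, emit "key<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Reduce stage: input arrives sorted by key, so counts can be
# accumulated until the key changes, then the aggregated pair is emitted.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```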
HOW DOES MAPREDUCE WORK?
• MapReduce divides data analysis task into two parts
• map
• reduce
• The mapper works on the partial dataset that is stored on its node, and the
reducer combines the output from the mappers to produce the reduced
result set.
MAPREDUCE DAEMONS
• JobTracker
• JobTracker process runs on a separate node and not usually on a Data Node.
• JobTracker receives the requests for MapReduce execution from the client.
MAPREDUCE DAEMONS
• JobTracker
• JobTracker finds the best TaskTracker nodes to execute tasks based on the data
locality (proximity of the data) and the available slots to execute a task on a given
node.
• JobTracker monitors the individual TaskTrackers and submits the overall
status of the job back to the client.
• When the JobTracker is down, HDFS will still be functional, but MapReduce
execution cannot be started and the existing MapReduce jobs will be halted.
MAPREDUCE DAEMONS
• TaskTracker
• TaskTracker runs on DataNodes, typically on every DataNode.
• TaskTracker is replaced by Node Manager in MRv2.
• Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
• TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
• TaskTracker will be in constant communication with the JobTracker signalling the
progress of the task in execution.
• TaskTracker failure is not considered fatal. When a TaskTracker becomes
unresponsive, JobTracker will assign the task executed by the TaskTracker to another
node.
WHAT HAPPENS WITH MAP AND REDUCE FUNCTIONS
THANK YOU