
CS8091 BIG DATA ANALYTICS

UNIT I INTRODUCTION TO BIG DATA

Evolution of Big Data - Best Practices for Big Data Analytics - Big data characteristics - Validating
- The Promotion of the Value of Big Data - Big Data Use Cases- Characteristics of Big Data
Applications - Perception and Quantification of Value -Understanding Big Data Storage - A
General Overview of Architecture - HDFS - MapReduce and YARN - Map Reduce Programming
Model
Evolution of Big data
The term "big data" refers to data that is so large, fast, or complex that it is difficult or impossible
to process using traditional methods. The act of accessing and storing large amounts of information
for analytics has been around a long time, but the concept of big data gained momentum in the
early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big
data as the three Vs.
The 3 Vs of Big Data
1. Volume,
2. Variety, and
3. Velocity.
Definition of Big Data
According to the McKinsey Global Institute report,
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new
technical architectures and analytics to enable insights that unlock new sources of business value.
Characteristics of Big Data
It is defined by three attributes, namely
1. Huge volume of data
2. Complexity of data types and structures
3. Speed of new data creation and growth
Driving sources of data deluge
• Mobile sensors
• Social media
• Video surveillance
• Video rendering
• Smart grids
• Medical imaging
• Gene sequencing
• Geophysical exploration
Types of data structures
• Structured
Data containing a defined data type, format, and structure

Ex. transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS,
CSV files, and even simple spreadsheets
• Semi structured
Textual data files with a discernible (noticeable) pattern that enables parsing
Ex. Extensible Markup Language [XML] data files that are self-describing and defined by
an XML schema
• Quasi structured
Textual data with erratic data formats that can be formatted with effort, tools, and time
Ex. web clickstream data that may contain inconsistencies in data values and formats
• Unstructured
Data that has no inherent (natural) structure
Ex. text documents, PDFs, images, and video
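The four structure types can be illustrated with a short Python sketch; the sample records below (an order CSV, an XML order, a clickstream log line, and a product review) are hypothetical:

```python
import csv, io, re
import xml.etree.ElementTree as ET

# Structured: CSV with a fixed schema -- every row parses the same way
csv_data = "id,amount\n1001,250.75\n1002,89.90\n"
rows = list(csv.DictReader(io.StringIO(csv_data)))
print(rows[0]["amount"])          # fields are predictable by name

# Semi-structured: XML is self-describing, parsed via its tags
xml_data = "<order><id>1001</id><amount>250.75</amount></order>"
order = ET.fromstring(xml_data)
print(order.find("amount").text)

# Quasi-structured: a clickstream line takes a regex and some effort
log_line = '10.0.0.5 - [12/Mar/2020] "GET /products/42 HTTP/1.1" 200'
match = re.search(r'"GET (\S+)', log_line)
print(match.group(1))             # extracted URL path

# Unstructured: free text has no fields to parse at all
review = "Great product, arrived late but works fine."
print(len(review.split()))        # only crude measures apply directly
```

Note how the parsing effort grows as structure decreases: the CSV needs no custom code, the log line needs a hand-written pattern, and the review offers nothing to parse directly.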

Analyst Perspective on Data Repositories

Types of Data Repositories

• Spreadsheets and data marts
- Spreadsheets and low-volume databases for record keeping
- Analyst depends on data extracts.
• Data Warehouses
- Centralized data containers in a purpose-built space
- Supports BI and reporting, but restricts robust analyses
- Analyst dependent on IT and DBAs for data access and schema changes
- Analysts must spend significant time to get aggregated and disaggregated data extracts
from multiple sources.
• Analytic Sandbox
- Data assets gathered from multiple sources and technologies for analysis
- Enables flexible, high-performance analysis in a non-production environment
- Can leverage in-database processing
- Reduces costs and risks associated with data replication into "shadow" file systems
- "Analyst owned" rather than "DBA owned"

State of the Practice in Analytics


Business Drivers for Advanced Analytics
(Categories of business problems where analytics can be applied)
• Optimize business operations: sales, pricing, profitability, efficiency
• Identify business risk: customer churn, fraud, default
• Predict new business opportunities: upsell, cross-sell, best new customer prospects
• Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending,
Basel II and III, Sarbanes-Oxley (SOX)
BI Versus Data Science

BI tends to provide reports, dashboards, and queries on business questions for the current period
or in the past. BI systems make it easy to answer questions related to quarter-to-date revenue and
progress toward quarterly targets, and to determine how much of a given product was sold in a prior
quarter or year. These questions tend to be closed-ended and explain current or past behavior,
typically by aggregating historical data and grouping it in some way. BI provides hindsight and
some insight and generally answers questions related to "when" and "where" events occurred.

Data Science tends to use disaggregated data in a more forward-looking, exploratory way, focusing
on analyzing the present and enabling informed decisions about the future. Rather than aggregating
historical data to look at how many units of a given product were sold in the previous quarter, a team
may employ Data Science techniques, such as time series analysis, to forecast future product sales
and revenue more accurately than extending a simple trend line. In addition, Data Science tends to
be more exploratory in nature and may use scenario optimization to deal with more open-ended
questions. This approach provides insight into current activity and foresight into future events,
while generally focusing on questions related to "how" and "why" events occur.
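As a minimal sketch of this contrast, the following Python example (with made-up quarterly sales figures) first extends a simple linear trend line, BI-style, and then refines the forecast with the average seasonal deviation for that quarter, a rudimentary Data Science step:

```python
# Quarterly unit sales (hypothetical numbers for illustration)
sales = [120, 95, 130, 180, 135, 110, 150, 205]  # two years, Q1..Q4

n = len(sales)
xs = list(range(n))

# BI-style baseline: fit and extend a simple linear trend line
x_mean = sum(xs) / n
y_mean = sum(sales) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean
trend_forecast = intercept + slope * n           # next quarter (Q1 of year 3)

# Refinement: add the average seasonal deviation for that quarter
quarter = n % 4                                  # 0 -> Q1
seasonal = [sales[i] - (intercept + slope * i)   # Q1 residuals from the trend
            for i in range(quarter, n, 4)]
seasonal_forecast = trend_forecast + sum(seasonal) / len(seasonal)

print(round(trend_forecast, 1), round(seasonal_forecast, 1))
```

A real forecasting team would use proper time series models rather than this two-line decomposition, but the sketch shows the shift in intent: from summarizing the past to estimating the future.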

Current Analytical Architecture

Data flow to the Data Scientist

For data sources to be loaded into the data warehouse, data needs to be well understood, structured,
and normalized with the appropriate data type definitions. Although this kind of centralization
enables security, backup, and fail over of highly critical data, it also means that data typically must
go through significant preprocessing and checkpoints before it can enter this sort of controlled
environment, which does not lend itself to data exploration and iterative analytics.

As a result of this level of control on the EDW, additional local systems may emerge in the form
of departmental warehouses and local data marts that business users create to accommodate their
need for flexible analysis. These local data marts may not have the same constraints for security
and structure as the main EDW and allow users to do some level of more in-depth analysis.
However, these one-off systems reside in isolation, often are not synchronized or integrated with
other data stores, and may not be backed up.
Once in the data warehouse, data is read by additional applications across the enterprise for BI and
reporting purposes. These are high-priority operational processes getting critical data feeds from
the data warehouses and repositories.
At the end of this workflow, analysts get data provisioned for their downstream analytics. Because
users generally are not allowed to run custom or intensive analytics on production databases,
analysts create data extracts from the EDW to analyze data offline in R or other local analytical
tools. Many times, these tools are limited to in-memory analytics on desktops analyzing samples
of data, rather than the entire population of a dataset. Because these analyses are based on data
extracts, they reside in a separate location, and the results of the analysis, along with any insights on
the quality of the data or anomalies, rarely are fed back into the main data repository.

Implications to Data Scientists


• High-value data is hard to reach and leverage
• Data moves in batches from the EDW to local analytical tools, which limits the size of the
datasets data scientists can use
• Data Science projects will remain isolated and ad hoc, rather than centrally managed

Big Data Eco system – the four main players


• Data devices - continuously generate new data
• Data collectors - collect data from the device and users
• Data aggregators - compile data from the devices and usage patterns
• Data users and buyers - benefit from the data collected and aggregated by others

Roles in Big Data Eco system


• Deep Analytical Talent
People in this group are technically savvy, with strong analytical skills. Members possess
a combination of skills to handle raw, unstructured data and to apply complex analytical
techniques at massive scales. This group has advanced training in quantitative disciplines,
such as mathematics, statistics, and machine learning. Ex. statisticians, economists
• Data Savvy Professionals
People in this group have less technical depth but have a basic knowledge of statistics or
machine learning and can define key questions that can be answered using advanced
analytics. These people tend to have a base knowledge of working with data, or an
appreciation for some of the work being performed by data scientists and others with deep
analytical talent. Ex. financial analysts, market research analysts
• Technology and Data Enablers
This group represents people providing technical expertise to support analytical projects,
such as provisioning and administrating analytical sandboxes, and managing large-scale
data architectures that enable widespread analytics within companies and other
organizations. This role requires skills related to computer engineering, programming, and
database administration.

Activities performed by a Data Scientist

• Reframe business challenges as analytics challenges


• Design, implement, and deploy statistical models and data mining techniques on Big Data
• Develop insights that lead to actionable recommendations

Skills and behavioral characteristics of a Data Scientist

• Quantitative skill
• Technical aptitude
• Skeptical mind-set and critical thinking
• Curious and creative
• Communicative and collaborative
Examples of Big Data Analytics
• Retail
• IT infrastructure
• Social media

Validating (against) the hype: organizational fitness


Big Data analytics is a technology-driven activity. Before deciding to adopt the technology, a
number of factors have to be considered. They are as follows:
1. feasibility,
2. reasonability,
3. value,
4. integrability, and
5. sustainability
The organization's fitness is assessed as a combination of these five factors, each scored from 0
(lowest level) to 4 (highest level). Finally, the resulting scores are reviewed using a radar chart.

Score by dimension (0 = lowest, 4 = highest)

Feasibility
• 0: Evaluation of new technology is not officially sanctioned
• 1: Organization tests new technologies in reaction to market pressure
• 2: Organization evaluates and tests new technologies after market evidence of successful use
• 3: Organization is open to evaluation of new technology; adopts technology on an ad hoc
basis based on convincing business justifications
• 4: Organization encourages evaluation and testing of new technology; clear decision process
for adoption or rejection; organization supports allocation of time to innovation

Reasonability
• 0: Organization's resource requirements for the near-, mid-, and long-terms are satisfactorily met
• 1: Organization's resource requirements for the near- and mid-terms are satisfactorily met;
unclear whether long-term needs are met
• 2: Organization's resource requirements for the near-term are satisfactorily met; unclear
whether mid- and long-term needs are met
• 3: Business challenges are expected to have resource requirements in the mid- and long-terms
that will exceed the capability of the existing and planned environment
• 4: Business challenges have resource requirements that clearly exceed the capability of the
existing and planned environment; organization's go-forward business model is highly
information-centric

Value
• 0: Investment in hardware resources, software tools, skills training, and ongoing management
and maintenance exceeds the expected quantifiable value
• 1: The expected quantifiable value is evenly balanced by the investment in hardware
resources, software tools, skills training, and ongoing management and maintenance
• 2: Selected instances of perceived value may suggest a positive return on investment
• 3: Expectations for some quantifiable value for investing in limited aspects of the technology
• 4: The expected quantifiable value widely exceeds the investment in hardware resources,
software tools, skills training, and ongoing management and maintenance

Integrability
• 0: Significant impediments to incorporating any nontraditional technology into the environment
• 1: Willingness to invest effort in determining ways to integrate the technology, with some
successes
• 2: New technologies can be integrated into the environment within limitations and with some
level of effort
• 3: Clear processes exist for migrating or integrating new technologies, but require dedicated
resources and level of effort
• 4: No constraints or impediments to fully integrating technology into the operational
environment

Sustainability
• 0: No plan in place for acquiring funding for ongoing management and maintenance costs
• 1: Continued funding for maintenance and engagement is given on an ad hoc basis; no plan
for managing the skills inventory
• 2: Need for year-by-year business justifications for continued funding; sustainability is at risk
on a continuous basis
• 3: Business justifications ensure continued funding and investments in skills and maintenance
costs
• 4: Program management office effective in absorbing and amortizing management costs;
program for continuous skills enhancement and training

[Figure: example radar chart plotting the five fitness dimension scores]
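A minimal Python sketch of such a self-assessment, using hypothetical scores for the five dimensions, might look like:

```python
# Hypothetical fitness self-assessment: each dimension scored 0 (lowest) to 4 (highest)
scores = {
    "Feasibility":    3,
    "Reasonability":  2,
    "Value":          4,
    "Integrability":  1,
    "Sustainability": 2,
}

assert all(0 <= s <= 4 for s in scores.values()), "scores must lie on the 0-4 scale"

# A quick text profile; in practice the five scores are plotted on a radar chart
for dimension, score in scores.items():
    print(f"{dimension:<14} {'#' * score:<4} {score}/4")

overall = sum(scores.values()) / len(scores)
print(f"Overall fitness: {overall:.1f}/4")
```

The radar chart itself would simply place these five values on five axes; a low spoke (here, Integrability at 1) immediately flags the dimension that threatens the adoption decision.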

The promotion of the value of Big Data


The value of Big Data, from the perspective of an economic study (titled "Data Equity: Unlocking
the Value of Big Data") undertaken and published by the Centre for Economics and Business
Research, lies in the following areas:
• optimized consumer spending as a result of improved targeted customer marketing
• improvements to research and analytics within the manufacturing sectors to lead to new
product development
• improvements in strategizing and business planning leading to innovation and new start-
up companies
• predictive analytics for improving supply chain management to optimize stock
management, replenishment, and forecasting
• improving the scope and accuracy of fraud detection
which can also be stated as

• Better targeted customer marketing


• Improved product analytics
• Improved business planning
• Improved supply chain management
• Improved analysis for fraud, waste, and abuse

Big Data use cases

Application categories

• Counting functions applied to large bodies of data that can be segmented and distributed
among a pool of computing and storage resources, such as document indexing, concept
filtering, and aggregation (counts and sums).
• Scanning functions that can be broken up into parallel threads, such as sorting, data
transformations, semantic text analysis, pattern recognition, and searching.
• Modeling capabilities for analysis and prediction.
• Storing large datasets while providing relatively rapid access.
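Counting functions of this kind are the classic MapReduce workload. A minimal single-machine Python sketch of the map, shuffle, and reduce phases, using three made-up documents, might look like:

```python
from collections import defaultdict

# Three "documents", as if partitioned across a pool of workers
documents = [
    "big data needs big storage",
    "data analytics needs data",
    "big analytics",
]

# Map phase: each partition independently emits (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted pairs by key (word)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts["big"], word_counts["data"])  # -> 3 3
```

In a real cluster the map and reduce phases run in parallel on many nodes and the shuffle moves data across the network, but the logical structure is exactly this.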
Characteristics of Big Data Applications

• Data throttling
• Computation-restricted throttling
• Large data volumes
• Significant data variety
• Benefits from data parallelization

Examples of Applications suited to Big Data Analytics


• Energy network monitoring and optimization
Characteristics: data throttling; computation throttling; large data volumes
Sample data sources: sensor data from smart meters and network components
• Credit fraud detection
Characteristics: data throttling; computation throttling; large data volumes;
parallelization; data variety
Sample data sources: point-of-sale data; customer profiles; transaction histories;
predictive models
• Clustering and customer segmentation
Characteristics: data throttling; computation throttling; large data volumes;
parallelization; data variety
Sample data sources: customer profiles; transaction histories; enhancement datasets
• Recommendation engines
Characteristics: data throttling; computation throttling; large data volumes;
parallelization; data variety
Sample data sources: customer profiles; transaction histories; enhancement datasets;
social network data
• Price modeling
Characteristics: data throttling; computation throttling; large data volumes;
parallelization; data variety
Sample data sources: point-of-sale data; customer profiles; transaction histories;
predictive models

Perception and Quantification of Value


Contribution of Big Data to an Organization
• Increasing revenues
• Lowering costs
• Increasing productivity
• Reducing risk
