Data - Analytics - Unit - I - III MCA'A'

Big Data Analytics using R
Dr. S. SENTHIL
School of Computer Science and Applications
SCHEME
• Introduction to BigData and its importance
• Understanding the Characteristics of Big Data
• The V’s
o Velocity
o Variety
o Volume
• Types of Data
o Structured
o Unstructured
o Semi-structured
• Examples of structured, unstructured and Semi-structured data
• Understanding the Waves of managing Data
o Wave 1: Creating manageable data structures
o Wave 2: Web and content management
o Wave 3: Managing big data
• Big Data architecture
o Beginning with capture, organize, integrate, analyze, and act
o Setting the architectural foundation
o Performance matters
o Traditional and advanced analytics
• Big Data Technology Components
• Layer 0: Redundant Physical Infrastructure
o Physical redundant networks
o Managing hardware: Storage and servers
o Infrastructure operations
• Layer 1: Security Infrastructure
o Interfaces and Feeds to and from Applications and the Internet
• Layer 2: Operational Databases
• Layer 3: Organizing Data Services and Tools
• Layer 4: Analytical Data Warehouses
3
SCHEME
• Industry examples of Big Data

• Big data and Digital marketing
• Fraud and big data
• Risk and bigdata
• Credit risk management
• Big data and healthcare
• Advertising and big data
INTRODUCTION
Data - ?
“The quantities, characters, or symbols on which operations are

performed by a computer, which may be stored and transmitted in
the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.”
-- Oxford Dictionary
INTRODUCTION
• Information - ?
• Data Structures - ?
• Database Management Systems - ?
• DBMS Vs RDBMS
• Data Models
INTRODUCTION
• Irrespective of the size of the enterprise, DATA continues to be a

precious and irreplaceable asset.
• Need of the hour is to understand, manage, process and analyse

data to draw valuable insights.
Data - Information - Insights (Knowledge)

Do you have USB stick? Capacity? How much?
INTRODUCTION
 “We are living in the Information Age”.
 Data is growing at a phenomenal rate
• Data is collected everyday from business, society, science and engineering,

medicine and almost every other aspect of life.
• This explosive growth of data is a result of computerization of our society and

fast development of powerful data collection and storage tools.
 Users need more sophisticated information
• Traditional Data Analysis tools and techniques can not be used because of the massive size
of the data set.
 How to extract?
• Traditional Data Analysis tools - ?
• SQL - ?
 Solution –
• Powerful and versatile tools are badly needed to automatically uncover valuable information
from the tremendous amount of data and to transform such data into organized knowledge.
 Data doubles about every year while useful information seems to be
decreasing.
 We are data rich, but information poor.
“From the dawn of civilization until 2003, human
kind generated 5 Exabyte of data. Now we
produce 5 Exabyte of data every two days
and the pace is accelerating”
-- Eric Schimidt, Executive Chairman, Google

Moore’s law states that processor speeds, or overall
processing power for computers will double every two
years.
- Gordon Moore, Co-founder of Intel

MODEL OF GENERATING/CONSUMING DATA
Old Model: Few companies are generating data, all others are consuming data.
New Model: All of us are generating data, and all of us are consuming data.
“Data is the new oil. Like oil, data is valuable, but if unrefined
it cannot really be used. It has to be changed into gas, plastic,
chemicals, etc. to create a valuable entity that drives profitable
activity. so, must data be broken down, analysed for it to have
value.”
- British Mathematician, Clive Humby
CHARACTERISTICS OF DATA
 Composition
• Deals with the structure of data (sources, granularity, types and the
nature of the data)
 Condition
• Deals with the state of the data.
• Can we use the data as it is for analysis?
• Does it require cleaning for further enhancement and enrichment?
 Context
• Where this data has been generated?
• Why was this data generated?
• How sensitive the data is?
• What are the events associated with the data?
TYPES OF DIGITAL DATA
• Structured Data
• Semi-Structured Data
• Unstructured Data
 Structured Data
• Data which is in a organized form (rows and columns) and can be

easily used by a computer program.
• Conforms to a Data model.
• E.g: Data stored in Databases.
 Semi-Structured Data
• Data which does not conform to a data model but has some
structure.
• It is not in a form which can be easily used by a computer program.
• E.g: XML, HTML, e-mails,…..
 Un Structured Data
• Data which does not conform to a data model or is not in a form which can
be used easily by a computer program.
• E.g: Memos, Images, Audio, Video, letters, etc.
 Sources of Structured Data
• Database such as Oracle, MySQL, DB2, Teradata, ….

• Spreadsheets
• OLTP systems
 Sources of Semi-structured Data
• XML (eXtensible Markup Language)

• JSON (Java Script Object Notation)
Used to transmit data between a server and a web
application.
 Sources of Unstructured Data

• Web pages
• Images
• Audios
• Videos
• Body of e-mail
• Text messages
• Chat
• Social media data
• Word document
Data sources
Structured data Unstructured data Semi-Structured

data
•Rows & Columns •Audio, Video,
Analog data •XML data.
•DBMS, RDBMS
•Data capture by
•Supports both
•SQL Sensors
Structured and
•RFID Unstructured data.
•Weather
forecasting
•NoSQL
Gartner estimates that 80% of data generated in any
enterprise today is unstructured data. Roughly 10%
of data is in the structured and semi-structured
category.
HOW TO DEAL WITH UNSTRUCTURED DATA?
 Data Mining
• Process of discovering knowledge hidden in large volumes of data.
 Text Analytics (or) Text Mining
• Process of gleaning high quality meaningful information from text.
• Includes tasks such as Text categorization, Text clustering, Sentiment
analysis, ….
HOW TO DEAL WITH UNSTRUCTURED DATA?
 Natural Language Processing (NLP)

• Related to the area of human computer interaction.
• Enabling computers to understand human (or) Natural language input.
 Noisy Text Analytics
• Process of extracting structured (or) semi-structured information from
noisy unstructured data such as chats, blogs, emails, text messages, etc.
How to manage very large amounts of data and extract
value and knowledge from them ?
“Every day, we create 2.5 quintillion bytes of
data — so much that 90% of the data in the
world today has been created in the last two
years alone.”
WHERE DO WE GET THIS DATA?
 This data comes from everywhere:
• sensors used to gather climate information,

• posts to social media sites,
• digital pictures and videos,
• purchase transaction records, and
• cell phone GPS signals to name a few.
This data is called as ??????????????

WHAT IS BIG DATA?
• Collection of Datasets that are large and complex that can not
be processed by traditional data processing applications.
• Constitute both structured and un structured data that grow

large so fast that they are not manageable by traditional
RDBMS tools or conventional statistical tools.
WHAT IS BIG DATA?
 A few responses
• Anything beyond the human and technical infrastructure
needed to support storage, processing and analysis.
• Today’s BIG may be tomorrow’s NORMAL.
• Terabytes or Petabytes or Zettabytes of data.
“Big Data is a high volume, high velocity and high
variety information assets that demand cost
effective, innovative forms of information
processing for enhanced insights
and decision making”
Source : Gartner IT Glossary

uncertainty
data
structured data
to unstructured
decision
data
making
batch to
streaming data terabyte to
zettabytes
BIG DATA
Volume .. facts and statistics

Velocity collected together for
Variety reference or analysis
value
V’S
 Volume
• Bits -> Bytes -> Kilobytes -> Megabytes -> Gigabytes ->
Terabytes -> Petabytes -> Exabytes -> Zettabytes ->
Yottabytes
V’S
 Velocity
• refers to the increasing speed at which this data is created, and the increasing
speed at which the data can be processed, stored and analysed by relational
databases
• Data is being generated fast and need to be processed fast
• Online Data Analytics
• Late decisions  missing opportunities
• Examples
• E-Promotions: Based on your current location, your purchase history, what you like 
send promotions right now for store next to you
• Healthcare monitoring: sensors monitoring your activities and body  any abnormal
measurements require immediate reaction .
V’S
Batch Processing -> Periodic -> Near real time ->

Real time processing
 Variety (Structured, Unstructured and Semi-structured)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data,
multi-dimensional arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many types of data

OTHER CHARACTERISTICS OF DATA –
WHICH ARE NOT DEFINITIONAL TRAITS OF
BIG DATA
 Veracity
• Refers to biases, noise and abnormality in data.
 Validity
• Refers to the accuracy and correctness of data.
 Volatility
• Deals with how long the data is valid? And how long it should be
stored?
 Variability
• Data whose meaning is constantly changing.
WHY BIG DATA?
WHY BIG DATA?
a ta
D
WHO IS GENERATING BIG
DATA?
 Social media and networks
 Scientific Instruments
 Sensor technology
 e-Commerce
WHO IS GENERATING BIG DATA?
• Internet - Google, amazon, eBay, AOL, etc.,

• Mobile gaming - On-line betting, multi-user games
• Marketing - Social Network Analysis, Digital Advertising, etc.,
• Telecom - Call Routing Management, Subscriber data, etc.,
• Healthcare - Maintaining patient records, etc.,
WHAT IS BIG DATA
ANALYTICS?
“Big Data Analytics is the process of examining big data to uncover patterns,
unearth trends and find unknown correlations and other useful
information to make faster and better decisions”
“Process of collecting, organizing and analyzing of large sets of data (big

data) to discover patterns and other useful information”
WHAT IS BIG DATA ANALYTICS?
CLASSIFICATION OF
ANALYTICS
 First school of thought

• Basic, Operationalized, Advanced and Monetized
 Second school of thought
• Analytics 1.0, 2.0 and 3.0
FIRST SCHOOL OF THOUGHT
 Basic Analytics
• Slicing and dicing of data to help with basic business insights.
• Reporting on historical data, basic visualization, etc.,
 Operationalized analytics
• Gets woven into enterprise’s business process
FIRST SCHOOL OF THOUGHT
 Advanced analytics
• Forecasting for the future by way of predictive and prescriptive
modeling.
 Monetized analytics
• To derive direct business revenue.
SECOND SCHOOL OF
THOUGHT
 Analytics 1.0
• Mid 1950’s to 2009
• Descriptive statistics (and Diagnostic)
• Report on events, occurrences, etc of the past.
• What happened?
• Why did it happen?

SECOND SCHOOL OF
THOUGHT
 Analytics 2.0
• 2005 to 2012
• Descriptive statistics + Predictive statistics
• Use data from the past to make predictions for the future
• What will happen?
• Why will it happen?

SECOND SCHOOL OF
THOUGHT
 Analytics 3.0
• 2012 to present
• Descriptive + Predictive + Prescriptive statistics
• Use data from the past to make prophecies for the future and at the same
time make recommendations to leverage the situation to one’s advantage.
• What will happen?
• When will it happen?
• Why will it happen?
• What should be the action taken to take advantage of what will happen?
ANALYTICS 1.0, 2.0, 3.0
ANALYTICS 1.0, 2.0, 3.0
 Descriptive Analytics
• which use data aggregation and data mining to provide insight into the
past and answer: “What has happened?”
• Insight into the past
• Use Descriptive Analytics when you need to understand at an aggregate

level what is going on in your company, and when you want to
summarize and describe different aspects of your business.
ANALYTICS 1.0, 2.0, 3.0
 Predictive Analytics
• which use statistical models and forecasting techniques to understand

the future and answer: “What could happen?”
• Understanding the future
• Use Predictive Analytics any time you need to know something about
the future, or fill in the information that you do not have.
ANALYTICS 1.0, 2.0, 3.0
 Prescriptive Analytics
• which use optimization and simulation algorithms to advise on possible

outcomes and answer: “What should we do?”
• Advise on possible outcomes
• Use Prescriptive Analytics any time you need to provide users with
advice on what action to take.
TRADITIONAL BI VS. BIG DATA
• In traditional BI environment, all the enterprise’s data is stored in central

server whereas in a big data environment data resides in a distributed file
system.
• DFS scales by scaling in or out horizontally as compared to typical

database server that scales vertically.
TRADITIONAL BI VS. BIG DATA
• In traditional BI, data is generally analysed in an offline mode whereas in

big data it is analysed in both real time as well as in offline mode.
• Traditional BI is about structured data and it is here that data is taken to
processing functions (move data to code)
• Big Data is about variety; processing functions are taken to the data (move
code to the data)
READING DATA WITH A SINGLE MACHINE
READING DATA WITH A SINGLE MACHINE
PARALLEL PROCESSING
PARALLEL PROCESSING
CHALLENGES FACED
Solution ---- ?????
WHAT IS HADOOP?
“Apache Hadoop is an open-source software

framework for distributed storage and
distributed processing of very large data sets
on clusters of commodity hardware”
WHAT IS HADOOP?
To process big data, first it has to stored in a distributed manner

and has to be processed in parallel.
Hadoop = HDFS + MapReduce

BRIEF HISTORY OF HADOOP
HISTORY OF HADOOP
 Originally built as an
Infrastructure for the “Nutch”
project.
 Based on Google’s map reduce
and Google File System.
 Created by Doug Cutting in
2005 at Yahoo.
 Named after his son’s toy
yellow elephant.
HDFS
HDFS creates a level of abstraction over the resources,

from where we can see the HDFS
as a single unit.
HDFS
HDFS has two core components:
•Name Node
Main node that contains metadata about the data stored.
•Data Nodes
Data is stored on the data nodes which are commodity
hardware in the distributed environment.
HDFS
FUNDAMENTAL PRINCIPLES OF HADOOP
• Parallel Execution
• Data Locality
• Fault Tolerance
• Scalability
• Economical
HADOOP ECO SYSTEM
HDFS - STORAGE
• Stores different types of large data sets (i.e. structured, semi-structured

and un structured data).
• HDFS creates a level of abstraction over the resources, from where we
can see the whole HDFS as a single unit.
• Stores data across various nodes and maintains the log file about the
stored data (meta data)
• HDFS has two core components : Namenode and Datanode.
YARN (YET ANOTHER RESOURCE
NAVIGATOR)
 Performs all your processing activities by allocating resources and scheduling

tasks.
 Two services
• ResourceManager
• Manages resources and schedule applications running on top of yarn.
• NodeManager
• Manages containers and monitors resource utilisation in each container.

MAPREDUCE - DATA PROCESSING USING
PROGRAMMING
• Core component in a Hadoop Ecosystem for processing.
• Helps in writing applications that processes large datasets using distributed and
parallel algorithms.
• In a MapReduce program, Map() and Reduce() are two functions.
• Map function performs actions like filtering, grouping and sorting.
• Reduce function aggregates and summarizes the result produced by Map function.
PIG : DATA PROCESSING SERVICE USING QUERY
• PIG has two parts: Pig Latin, the language and the Pig runtime, for the execution
environment.
• 1 line of pig latin = 100 lines of Map-reduce job (approx)
• The compiler internally converts pig latin to MapReduce.
• It gives us a platform for building data flow for ETL.
• PIG first loads the data, then perform various functions like grouping, filtering,
joining, sorting, etc., and finally dumps the data on the screen or stores in HDFS.
HIVE : DATA PROCESSING SERVICE
USING QUERY
• A data warehousing component which analyses data sets in a distributed

environment using SQL – like interface.
• The query language of Hive is called as Hive Query Language (HQL)
• 2 basic components:
• Hive command Line and
• JDBC / ODBC driver
• Supports User Defined Functions to accomplish specific needs.
MAHOUT : MACHINE LEARNING
• Provides an environment for creating machine learning applications.

• It performs collaborative filtering, clustering and classification.
• Provides a command line to invoke various algorithms.
• It has predefined set of library which already contains different inbuilt
algorithms for different use cases.
SPARK : IN-MEMORY DATA PROCESSING
• A framework for real time analytics in a distributed computing environment.

• Written in Scala was originally developed at the University of California,
Berkeley.
• It executes in-memory computations to increase speed of data processing over
Map-Reduce.
• 100x faster than Hadoop for large scale date processing by exploting in-memory
computations.
HBASE : NOSQL DATABASE
• An open-source, non-relational database – a NoSQL database.

• Supports all types of data and it is capable of handling of anything and everything.
• Modeled after Google’s BigTable.
• Gives us a fault tolerant way of storing sparse data.
• Written in Java and Hbase applications can be written in REST, Avro and Thrift
APIs.
OOZIE : JOB SCHEDULER
• Oozie is a job scheduler in Hadoop ecosystem.

• Two kinds of oozie jobs
• Oozie workflow
• Sequential set of actions to be executed.
• Oozie co-ordinator
• Oozie jobs which are triggered when the data is made available to it or even
triggered based on time.
APACHE FLUME : DATA INGESTING
SERVICE
• Ingests unstructured and semi-structured data into HDFS.

• It helps in collecting, aggregating and moving large amount of data sets.
• It helps us to ingest online streaming data from various sources like network
traffic, social media, email messages, log files etc. in HDFS.
APACHE SQOOP : DATA INGESTING
SERVICE
• Another Data Ingesting service.
• Sqoop can import as well as export structured data from RDBMS.
• Flume only ingests unstructured data or semi-structured data into HDFS.

APACHE SOLR AND LUCENE
• Two services which are used for searching and indexing in Hadoop ecosystem.
• Apache Lucene is based on Java, which also helps in spell checking.
• Apache Lucene is the engine, Apache Solr is a complete application built aroung
Lucene.
• Solr uses Apache Lucene Java search library for searching and indexing.
ZOOKEEPER : CO-ORDINATOR
• An open source-server which enables highly reliable distributed co-ordination.
• Apache Zookeeper coordinates with various Hadoop services in a distributed

environment.
• Performs synchronization, configuration maintenance, grouping and naming.

APACHE AMBARI : CLUSTER MANAGER
• Software for provisioning, managing and monitoring Apache Hadoop clusters.

• Gives us step by step process for installing Hadoop services.
• Handles configuration of Hadoop services.
• Provides a central management service for starting, stopping and re-configuring
Hadoop services.
• Monitors health and status of the Hadoop cluster.
HADOOP USE CASE
STRUCTURED DATA
• Data that has a defined length and format
• Examples of structured data include numbers, dates, and groups of words and
numbers called strings (for example, a customer’s name, address, and so on).
• It’s usually stored in a database.
• Structured data from “traditional” sources like customer relationship management

(CRM) data, operational enterprise resource planning (ERP) data, and financial data.
• Often these data elements are integrated in a data warehouse for analysis.
SOURCES OF STRUCTURED DATA
 Computer- or machine-generated
• Machine-generated data generally refers to data that is created by a

machine without human intervention.
 Human-generated
• This is data that humans, in interaction with computers, supply.

COMPUTER OR MACHINE GENERATED STRUCTURED
DATA
 Sensor data:
• Examples include radio frequency ID (RFID) tags, smart meters,
medical devices, and Global Positioning System (GPS) data.
• Another example of sensor data is smartphones that contain sensors like
GPS that can be used to understand customer behavior in new ways.
 Web log data:

• When servers, applications, networks, and so on operate, they capture
all kinds of data about their activity.
• This can amount to huge volumes of data that can be useful, for
example, to deal with service-level agreements or to predict security
breaches.
COMPUTER OR MACHINE GENERATED STRUCTURED
DATA
 Point-of-sale data
• When the cashier swipes the bar code of any product that you are
purchasing, all that data associated with the product is generated. Just
think of all the products across all the people who purchase them, and
you can understand how big this data set can be.
 Financial data
• Lots of financial systems are now programmatic; they are operated
based on predefined rules that automate processes. Stock-trading data is
a good example of this. It contains structured data such as the company
symbol and dollar value. Some of this data is machine generated, and
some is human generated.
HUMAN GENERATED STRUCTURED DATA
 Input data
 This is any piece of data that a human might input into a computer, such as
name, age, income, non-free-form survey responses, and so on. This data
can be useful to understand basic customer behavior.
 Click-stream data
 Data is generated every time you click a link on a website. This data can be
analyzed to determine customer behavior and buying patterns.
 Gaming-related data
 Every move you make in a game can be recorded. This can be useful in
understanding how end users move through a gaming portfolio.
UNSTRUCTURED DATA
• Unstructured data is data that does not follow a specified

format.
SOURCES OF UNSTRUCTURED
DATA
• Just as with structured data, unstructured data is either machine generated

or human generated.
COMPUTER OR MACHINE GENERATED UNSTRUCTURED DATA
 Satellite images:
• This includes weather data or the data that the government captures in
its satellite surveillance imagery.
 Scientific data:
• This includes seismic imagery, atmospheric data, and high energy
physics.
 Photographs and video:
• This includes security, surveillance, and traffic video.
 Radar or sonar data:
• This includes vehicular, meteorological, and oceanographic seismic
profiles.
EXAMPLES OF HUMAN GENERATED UNSTRUCTURED DATA
 Text internal to your company:
• Think of all the text within documents, logs, survey results, and e-mails.
Enterprise information actually represents a large percent of the text information
in the world today.
 Social media data:

• This data is generated from the social media plat-forms such as YouTube,
Facebook, Twitter, LinkedIn, and Flickr.
 Mobile data:
• This includes data such as text messages and location information.
 Website content:
• This comes from any site delivering unstructured content, like YouTube, Flickr,
or Instagram.
SEMI-STRUCTURED DATA
• Semi-structured data is a kind of data that falls between structured and

unstructured data.
• Semi-structured data does not necessarily conform to a fixed schema (that is,
structure) but may be self-describing and may have simple label/value pairs.
• For example, label/value pairs might include:
• <family>=Jones, <mother>=Jane, and <daughter>=Sarah.
• Examples of semi-structured data include EDI(Electronic Data Interchange),

SWIFT (Society of Worldwide Interbank Financial Telecommunication),
Email and XML.
THE EVOLUTION OF DATA MANAGEMENT
 Wave 1
• Creating manageable data structures
 Wave 2
• Web and content management
 Wave 3
• Managing big data
WAVE 1
• As computing moved into the commercial market in the late 1960s, data was stored
in flat files (is a data file that does not contain links to other files or is a non-
relational database) that imposed no structure.
• Bob|123 street|California|$200.00
Nathan|800 Street|Utah|$10.00
• When companies needed to get to a level of detailed understanding about customers,

they had to apply brute-force methods, including very detailed programming models
to create some value.
• Later in the 1970s, things changed with the invention of the relational data model and
the Relational DataBase Management System (RDBMS) that imposed structure and a
method for improving performance.
WAVE 1
• Most importantly, the relational model added a level of abstraction (the structured
query language [SQL], report generators, and data management tools) so that it was
easier for programmers to satisfy the growing business demands to extract value from
data.
• The relational model offered an ecosystem of tools from a large number of emerging
software companies.
• But a problem emerged from this exploding demand for answers: Storing this growing
volume of data was expensive and accessing it was slow.
• Making matters worse, lots of data duplication existed, and the actual business value of
that data was hard to measure.
WAVE 1
• At this stage, an urgent need existed to find a new set of technologies to support
the relational model.
• The Entity-Relationship (ER) model emerged, which added additional

abstraction to increase the usability of the data. In this model, each item was
defined independently of its use. Therefore, developers could create new
relationships between data sources without complex programming.
WAVE 1
• When the volume of data that organizations needed to manage grew out of control,
the data warehouse provided a solution.
• The data warehouse enabled the IT organization to select a subset of the data being
stored so that it would be easier for the business to try to gain insights. It also
provided an integrated source of information from across various data sources that
could be used for analysis.
WAVE 1
• Data warehouses were commercialized in the 1990s, and today, both content
management systems and data warehouses are able to take advantage of
improvements in scalability of hardware, virtualization technologies, and the
ability to create integrated hardware and software systems, also known as
appliances.
• Sometimes these data warehouses themselves were too complex and large and
didn’t offer the speed and agility that the business required. The answer was a
further refinement of the data being managed through data marts.
• These data marts were focused on specific business issues and were much more
streamlined and supported the business need for speedy queries than the more
massive data warehouses.
WAVE 1
• Data warehouses and data marts solved many problems for companies needing a
consistent way to manage massive transactional data. But when it came to managing
huge volumes of unstructured or semi-structured data, the ware-house was not able to
evolve enough to meet changing demands.
• To complicate matters, data warehouses are typically fed in batch intervals, usually
weekly or daily. This is fine for planning, financial reporting, and traditional
marketing campaigns, but is too slow for increasingly real-time business and
consumer environments.
WAVE 1
• To transform their traditional data management approaches to handle the expanding

volume of unstructured data elements, as companies began to store unstructured
data, vendors began to add capabilities such as BLOBs (binary large objects).
• In essence, an unstructured data element would be stored in a relational database as

one contiguous chunk of data. This object could be labeled (that is, a customer
inquiry) but you couldn’t see what was inside that object. Clearly, this wasn’t going
to solve changing customer or business needs.
WAVE 1
• Later Object Database Management System (ODBMS) was introduced. The

object data-base stored the BLOB as an addressable set of pieces so that we could
see what was in there. Unlike the BLOB, which was an independent unit
appended to a traditional relational database, the object database provided a
unified approach for dealing with unstructured data. Object databases include a
programming language and a structure for the data elements so that it is easier to
manipulate various data objects without programming and complex joins.
WAVE 2
• Enterprise Content Management systems evolved in the 1980s to provide

businesses with the capability to better manage unstructured data, mostly
documents. In the 1990s with the rise of the web, organizations wanted to move
beyond documents and store and manage web content, images, audio, and video.
WAVE 2
• The market evolved from a set of disconnected solutions to a more unified model
that brought together these elements into a platform that incorporated business
process management, version control, information recognition, text management,
and collaboration. This new generation of systems added meta-data (information
about the organization and characteristics of the stored information).
WAVE 2
• A new generation of requirements has begun to emerge that drive us to the next
wave.
• These new requirements have been driven, in large part, by a convergence of

factors including the web, virtualization, and cloud computing.
• In this new wave, organizations are beginning to understand that they need to
manage a new generation of data sources with an unprecedented amount and
variety of data that needs to be processed at an unheard-of speed.
WAVE 3
• With big data, it is now possible to virtualize data so that it can be stored
efficiently and, utilizing cloud-based storage, more cost-effectively as well.
• In addition, improvements in network speed and reliability have removed other

physical limitations of being able to manage massive amounts of data at an
acceptable pace.
WAVE 3
• But no technology transition happens in isolation; it happens when an important

need exists that can be met by the availability and maturation of technology.
• Many of the technologies at the heart of big data, such as virtualization, parallel
processing, distributed file systems, and in-memory databases, have been around
for decades.
• Advanced analytics have also been around for decades, although they have not
always been practical. Other technologies such as Hadoop and MapReduce have
been on the scene for only a few years.
IS THERE A FOURTH “BIG” WAVE ? – EVOLUTION, IOT
• Currently we are still at an early stage of leveraging huge volumes of data to gain
a 360-degree view of the business and anticipate shifts and changes in customer
expectations.
• The technologies required to get the answers the business needs are still isolated
from each other.
• To get to the desired end state, the technologies from all three waves will have to
come together.
• Big data is not simply about one tool or one technology.
• It is about how all these technologies come together to give the right insights, at
the right time, based on the right data whether it is generated by people, machines,
or the web.
BUILDING A SUCCESSFUL BIG DATA MANAGEMENT
ARCHITECTURE
• We have moved from an era where an organization could implement a database to

meet a specific project need and be done.
• But as data has become the fuel of growth and innovation, it is more important
than ever to have an underlying architecture to support growing requirements.
BEGINNING WITH CAPTURE, ORGANIZE,
INTEGRATE, ANALYZE, AND ACT
The cycle of Big Data

Management
• Data must first be captured, and then organized and integrated.
• After this phase is successfully implemented, data can be analyzed based

on the problem being addressed.
• Finally, management takes action based on the outcome of that analysis.
• For example, Amazon.com might recommend a book based on a past

purchase or a customer might receive a coupon for a discount for a
future purchase of a related product to one that was just purchased.
• Capturing the Big Data
• Several sources of big data which generate a lot of new data.
• Unstructured data, now mostly generated through the internet and

social media.
• Text messages, tweets, (blog) posts are increasing becoming an

important source of data relevant for any organization.
• Organize
• Big data research platform needs to process massive quantities of data

– filtering, transforming and sorting it before loading it into a data
warehouse.
• Integration
• Big data integration is discovering information, profiling them,

understanding the data, tracking through metadata, improving the
quality of data and then transforming it into the form that is required
for big data. data warehouse.
• Every big data use case needs integration.
• Several challenges during the integration

• Analysis, Data curation, capture, sharing, visualization, information privacy and storage.
• Analyze
• Infrastructure required for analyzing big data must be able to support

deeper analytics such as statistical analysis and data mining on a wider
variety of data types stored in diverse systems; scale to extreme data
volumes; deliver faster response times; and automate decisions based
on analytical models.
SETTING THE ARCHITECTURAL FOUNDATION
• In addition to supporting the functional requirements, it is important to support
the required performance.
• Your needs will depend on the nature of the analysis you are supporting.
• You will need the right amount of computational power and speed.
• While some of the analysis you will do will be performed in real time, you will
inevitably be storing some amount of data as well.
• Your architecture also has to have the right amount of redundancy so that you
are protected from unanticipated latency and downtime.
• Your organization and its needs will determine how much attention you have
to pay to these performance issues.
 Start asking the following questions:

• How much data will my organization need to manage today and in the
future?
• How often will my organization need to manage data in real time or
near real time?
• How much risk can my organization afford? Is my industry subject to
strict security, compliance, and governance requirements?
• How important is speed to my need to manage data?
• How certain or precise does the data need to be?
• A big data management architecture must include a variety of services

that enable companies to make use of numerous data sources in a fast and
effective manner.
The Big Data Architecture

• Interfaces and feeds
• Redundant physical infrastructure
• Security infrastructure
• Operational data sources
• Performance matters
• Organizing data services and tools
• MapReduce, Hadoop, and Big Table
• Traditional and advanced analytics
• Analytical data warehouses and data marts
• Big data analytics
• Reporting and visualization https://vimeo.com/45579023
• Big data applications
 Interfaces and Feeds
• Big data is the fact that it relies on picking up lots of data from lots of
sources.
• Therefore, open application programming interfaces (APIs) will be core
to any big data architecture.
• In addition, keep in mind that interfaces exist at every level and
between every layer of the stack.
• Without integration services, big data can’t happen.
 Redundant Physical Infrastructure

• Without the availability of robust physical infrastructures, big data would
probably not have emerged as such an important trend.
• To support an unanticipated or unpredictable volume of data, a physical
infrastructure for big data has to be different than that for traditional data.
• The physical infrastructure is based on a distributed computing model.
• This means that data may be physically stored in many different locations
and can be linked together through networks, the use of a distributed file
system, and various big data analytic tools and applications.
• Redundancy is important because we are dealing with so much data from so

many different sources.
• Redundancy comes in many forms.
• If your company has created a private cloud, you will want to have
redundancy built within the private environment so that it can scale out to
support changing workloads.
• If your company wants to contain internal IT growth, it may use external
cloud services to augment its internal resources.
• In some cases, this redundancy may come in the form of a Software as a
Service offering that allows companies to do sophisticated data analysis as a
service.
• The SaaS approach offers lower costs, quicker startup, and seamless evolution
of the underlying technology.
 Security infrastructure
• The more important big data analysis becomes to companies, the more important
it will be to secure that data.
• For example, if you are a healthcare company, you will probably want to use big
data applications to determine changes in demographics or shifts in patient needs.
• This data about your constituents needs to be protected both to meet compliance
requirements and to protect the patients’ privacy.
• You will need to take into account who is allowed to see the data and under what
circumstances they are allowed to do so.
• You will need to be able to verify the identity of users as well as protect the
identity of patients.
• These types of security requirements need to be part of the big data fabric from the
outset and not an afterthought.
 Operational data sources
• Traditionally, an operational data source consisted of highly structured

data managed by the line of business in a relational database.
• But as the world changes, it is important to understand that operational
data now has to encompass a broader set of data sources, including
unstructured sources such as customer and social media data in all its
forms.
SETTING THE ARCHITECTURAL
FOUNDATION
• You find new emerging approaches to data management in the big data
world, including document, graph, columnar, and geospatial database
architectures.
• Collectively, these are referred to as NoSQL, or not only SQL,
databases.
 All these operational data sources have several characteristics in common:
• They represent systems of record that keep track of the critical data required
for real-time, day-to-day operation of the business.
• They are continually updated based on transactions happening within business
units and from the web.
• For these sources to provide an accurate representation of the business, they
must blend structured and unstructured data.
• These systems also must be able to scale to support thousands of users on a
consistent basis.
• These might include transactional e-commerce systems, customer relationship
management systems, or call center applications.
 Performance matters
• Your data architecture also needs to perform in concert with your organization’s
supporting infrastructure.
• For example, you might be interested in running models to determine

whether it is safe to drill for oil in an offshore area given real-time data of
temperature, salinity, sediment re-suspension, and a host of other biological,
chemical, and physical properties of the water column.
• It might take days to run this model using a traditional server configuration.
• However, using a distributed computing model, what took days might now
take minutes.
• Performance might also determine the kind of database you would use.
• A graphing database might be a better choice, as it is specifically designed to
separate the “nodes” or entities from its “properties” or the information that
defines that entity, and the “edge” or relationship between nodes and properties.
• Using the right database will also improve performance.
• Typically the graph database will be used in scientific and technical
applications.
• Other important operational database approaches include columnar databases

that store information efficiently in columns rather than rows.
• This approach leads to faster performance because input/output is extremely

fast.
• When geographic data storage is part of the equation, a spatial database is

optimized to store and query data based on how objects are related in space.
 Organizing data services and tools

• Not all the data that organizations use is operational.
• A growing amount of data comes from a variety of sources that aren’t quite as
organized or straightforward, including data that comes from machines or
sensors, and massive public and private data sources.
• In the past, most companies weren’t able to either capture or store this vast
amount of data.
• It was simply too expensive or too overwhelming.
• Even if companies were able to capture the data, they did not have the tools to
do anything about it.
 Organizing data services and tools
• Very few tools could make sense of these vast amounts of data.
• The tools that did exist were complex to use and did not produce results in a
reasonable time frame.
• In the end, those who really wanted to go to the enormous effort of analyzing
this data were forced to work with snapshots of data.
• This has the undesirable effect of missing important events because they were
not in a particular snapshot.
FOUNDATION
 MapReduce, Hadoop, and Big Table
• MapReduce was designed by Google as a way of efficiently executing a set of

functions against a large amount of data in batch mode.
• The “map” component distributes the programming problem or tasks across a

large number of systems and handles the placement of the tasks in a way that
balances the load and manages recovery from failures.
• After the distributed computation is completed, another function called

“reduce” aggregates all the elements back together to provide a result.
 Big Table
• Big Table was developed by Google to be a distributed storage system intended

to manage highly scalable structured data.
• Data is organized into tables with rows and columns.
• Unlike a traditional relational database model, Big Table is a sparse, distributed,

persistent multidimensional sorted map.
• It is intended to store huge volumes of data across commodity servers.

FOUNDATION
 Hadoop
• Hadoop is an Apache-managed software framework derived from MapReduce and Big
Table.
• Hadoop allows applications based on MapReduce to run on large clusters of commodity
hardware.
• The project is the foundation for the computing architecture supporting Yahoo!’s
business.
• Hadoop is designed to parallelize data processing across computing nodes to speed
computations and hide latency.
• Two major components of Hadoop exist: a massively scalable distributed file system
that can support petabytes of data and a massively scalable MapReduce engine that
computes results in batch.
 Traditional and Advanced analytics
• Some analysis will use a traditional data warehouse, while other analyses will
take advantage of advanced predictive analytics.
• Managing big data holistically requires many different approaches to help the
business to successfully plan for the future.
 Analytical data warehouses and data marts

• After a company sorts through the massive amounts of data available, it is
often sensible to take the subset of data that reveals patterns and put it into a
form that’s available to the business.
• These warehouses and marts provide compression, multilevel partitioning,

and a massively parallel processing architecture.
 Big data analytics
• The capability to manage and analyze petabytes of data enables

companies to deal with clusters of information that could have an
impact on the business.
• This requires analytical engines that can manage this highly
distributed data and provide results that can be optimized to solve a
business problem.
• Reporting and Visualisation

• With big data, reporting and data visualization become tools for looking at the
context of how data is related and the impact of those relationships on the
future.
 Big data applications

• Some of the emerging applications are in areas such as healthcare,
manufacturing management, traffic management, and so on.
• What do all these big data applications have in common?

• They rely on huge volumes, velocities, and varieties of data to transform the behavior
of a market.
• In healthcare, a big data application might be able to monitor premature infants

to determine when data indicates when intervention is needed.
• In manufacturing, a big data application can be used to prevent a machine from
shutting down during a production run.
• A big data traffic management application can reduce the number of traffic jams
on busy city highways to decrease accidents, save fuel, and reduce pollution.
INDUSTRY EXAMPLES OF BIG DATA
 Big Data in Real world
• Nothing helps us understand Big Data more than examples of how the
technology and approaches is being used in the “real world.”
• Helps us to learn how to apply ideas from other industries into business.
• Every Data today is from the Online world.
• Examples include ..
• Digital Marketing, Financial Services, Advertising, and Healthcare.

• Digital Marketing and the Non-line World

• Google ’s digital marketing evangelist and author Avinash Kaushik from google
• Data warehouses – Nothing but Business Intelligence is a single source of
truth.
• The data is historical and collected data from (RDBMS)Oracle, ERP and other
sources within the organization.
• BUT.. The big data warehouse approach does not work in the online world.
• It was found that, it was a big disaster because everything that works in the
Business Intelligence world (did) does not work in the online world.
Evangelist – A person who promotes the use of particular product/technology

through talks, articles, blogging, user demonstrations etc.,
 Multiplicity-To succeed in online world
• In order for you to be successful online, it is important to embrace

multiplicity .
• This multiplicity requires multiple skills in the decision-making team,
multiple tools, and multiple types of data (clickstream data, consumer
data, competitive intelligence data, etc.).
 Multiplicity-To succeed in online world
• “The problem is that this approach and thinking is diametrically 100

percent to Business Intelligence and Data warehouse world”.
• Avinash says, companies struggle with making smart decisions online
because they cannot embrace multiplicity.
• Also to take decisions managers must embrace incomplete data, they ’re
forced to pick perfect data in order to make decisions.”
• Origins of Digital Marketing-DB Marketing

• As technology evolved to absorb greater volumes of data, the costs of data
environments started to come down
• Companies began collecting even more transactional data.
• As more industries (retail, insurance, consumer credit, automotive, pharmaceutical)

saw the value of database marketing, the trend continued accelerating.
INDUSTRY EXAMPLES OF BIG
DATA
 Origins of Digital Marketing-DB Marketing

• Now we had really rich data, and that led to richer data analytics.
• This transactional data led to Multi-Dimensional Data warehouses and Data

marts
• Before the onset of the digital age, marketing professionals used

TELEPHONE and TELEVISION and PRINT MEDIA(letters, news
papers).
• Digital marketing is the component of marketing, that utilizes the internet

and online-based digital technologies to promote products and services.
DATA
 Big Data and the New School of Marketing
• Dan Springer, CEO of Responsys, defines the new school of marketing:
• “Today ’s consumers have changed. They ’ve put down the newspaper,
• They fast forward through TV commercials, and they junk unsolicited email.
Why?
• They have new options that better fit their digital lifestyle. They can choose
which marketing messages they receive, when, where, and from whom.
• They Choose across the digital power channels: email, mobile, social, display and
the web
 Digital Marketing-Cross channel marketing

• Digital marketing encompasses using any sort of online media channel.
• Digital marketing is the marketing of products or services using digital

technologies, mainly on the Internet, but also including mobile phones, display
advertising, and any other digital medium.
• Cross-Channel Lifecycle Marketing (to tell all your audience and customers a
similar brand story across multiple channels) really starts with the capture of
customer permission, contact information, and preferences for multiple channels.
• It also requires marketers to have the right integrated marketing and customer
information systems, so that
• they can have complete understanding of customers through stated preferences

and observed behavior at any given time; and
• they can automate and optimize their programs and processes throughout the
customer lifecycle.
• Once marketers have that, they need a practical framework for planning
marketing activities.
NEW SCHOOL OF MARKETING

• Digital Marketing-Cross channel marketing
• And it means driving people to a website, a mobile app, and the like,
and, once there, retaining them, interacting with them.
• Mercedes-Benz has a cross-selling strategy that includes digital and
social media channels, including paid media, owned media, earned
media and content marketing. For example, its “Generation Benz”
online community was integral in developing a customer profile for
Mercedes Benz that would help them understand which marketing
tactics would work best for each channel.
 Fraud and Big data

• Fraud is intentional deception made for personal gain or to damage another
individual.
• One of the most common forms of fraudulent activity is credit card fraud. The
credit card fraud rate in United States and other countries is increasing.
 Fraud and Big data

• Credit card fraud incidence increased 87 percent in 2011 culminating in an
aggregate fraud loss of $6 billion.
• Banks need more preventive approaches to prevent fraud.
• Credit card issuers should prioritize preventing fraud.

 Reasons for fraud

• Social media and mobile phones are forming the new frontiers for fraud.
• Despite warnings that social networks are a great resource for fraudsters, consumers
are still sharing a significant amount of personal information frequently used to
authenticate a consumer ’s identity.
• Those with public profiles (those visible to everyone) were more likely to expose
this personal information.
DATA
• Prevent Fraud
• In order to prevent the fraud, credit card transactions are monitored and checked in
near real time.
• If the checks identify pattern inconsistencies and suspicious activity, the

transaction is identified for review and escalation.
• The Capgemini Financial Services team believes that due to the nature of data
streams and processing required, Big Data technologies provide an optimal
technology solution based on the following three Vs:
 High volume.
• Years of customer records and transactions (150 billion records per

year)
 High velocity.
• Dynamic transactions and social media information
 High variety.
• Social media plus other unstructured data such as customer emails,

call center conversations, as well as transactional structured data
DATA
 Capgeminis Fraud Detection mechanism
• Capgemini ’s new fraud Big Data initiative focuses on flagging the

suspicious credit card transactions to prevent fraud in near real-time via
• multi-attribute monitoring.
• Real-time inputs involving transaction data
• customers records are monitored via validity checks and detection

rules.
DATA
• Pattern recognition is performed against the data to score and weight individual
transactions across each of the rules and scoring dimensions.
• A cumulative score is then calculated for each transaction record and compared
against thresholds to decide if the transaction is potentially suspicious or not.
 Capgeminis Fraud Detection mechanism

• Elastic search is able to achieve fast search responses because, instead of searching
the text directly, it searches an index instead.
• Once the transaction data has been processed, the percolator query then performs the
functioning of identifying new transactions that have raised profiles.
• Percolator is a system for incrementally processing updates to large data sets.

Percolator is the technology that Google used in building the index— that links
keywords and URLs—used to answer searches on the Google page.
• Percolator query can handle both structured and unstructured data.

 Social Network Analysis

• Another approach to solving fraud with Big Data is social network analysis(SNA).
• SNA is the precise analysis of social networks. Social network analysis views social
relationships and makes assumptions.
• SNA could reveal all individuals involved in fraudulent activity, from perpetrators to their
associates,
• SNA helps to understand their relationships and behaviors to identify a bust out fraud case.
DATA
• Risk and Big Data
• Many of the world ’s top analytics professionals work in risk management.
• Risk management is data-driven—
• Advanced data analytics drives modern risk management.
• The two most common types of risk management are
• credit risk management and
• market risk management.
• A third type of risk, operational risk management, is not as common as credit and market risk.
• Credit risk analytics focus on past credit behaviors to predict the likelihood that
a borrower will default on any type of debt by failing to make payments.
• For example, “Is this person likely to default on their $300,000 mortgage?”
• Market risk analytics focus on understanding the likelihood that the value of a
portfolio will decrease due to the change in stock prices, interest rates, foreign
exchange rates, and commodity prices.
• For example, “Should we sell this holding if the price drops another 10
percent?”
• Big Data and Healthcare

• Big Data promises an enormous revolution in health care, with important
advancements in everything from the management of chronic disease to the delivery
of personalized medicine.
• In addition to saving and improving lives, Big Data has the potential to transform
the entire health care system by replacing guesswork and intuition with objective,
data-driven science.
• “Big data in healthcare” refers to the abundant health data amassed from
numerous sources including electronic health records (EHRs), medical imaging,
genomic sequencing, payer(Insurance) records, pharmaceutical research,
wearables, and medical devices, to name a few.
• The Health care data is Voluminous, Moves at high velocity and variety of data
formats is used.
1. Big Data and Healthcare
EMR – Electronic Medical Records ; EHR – Electronic Health Records;

RWE – Real world Evidence
• Health care industry is now awash in data
• From Gene expression to Next Generation Gene sequencing
• Exponential growth in Electronic Health Records, Health Information Exchange,

Imaging, Medical and Prescription claims,…
• Has value for individual stakeholders:- Providers (physicians), Producers

(pharmaceutical and medical device companies), Payers (insurance companies)
HOW BIG DATA ANALYTICS IS APPLIED IN HEALTHCARE
• Healthcare Analytics helps doctors to make data driven decisions
• Wearable devices from Apple can notify wearers that they need to seek medical
attention.
• Analysis of healthcare big data also contributes to greater insight into patient that are
at greatest risk for illness, thereby permitting a proactive approach to prevention.
• Medical Images can be analysed for patterns for cancers, tumours
• Machine learning can be applied to improve patient care.

FUTURE FOR BIG DATA IN HEALTHCARE
 Precision medicine intends to “understand the person’s genetics, environment, and

lifestyle which can help determine the best approach to prevent or treat disease.
 Wearables and IoT sensors can provide a direct, real-time feed to a patient’s
electronic health records, which allows medical staff to monitor and then consult
with the patient, either face-to-face or remotely.
BIG DATA IN HEALTHCARE
 As per the Wise Guy Reports, by 2022, the Big Data Analytics industry in
healthcare will be more than $34.27 billion. We can expect a CAGR of 22.07 %.
The overall value of the Big Data Analytics segment globally will be more than
$68.03 billion by 2024.
 Health Tracking
• Big Data Analytics and the Internet of things is revolutionizing the
healthcare industry.
• Nowadays various wearables are there to record sleep, heart rate,
distance walked, exercise, etc. Along with this data, there are also the
devices to monitor blood pressure, blood sugar level, oximeters and
many more.
• Data received from sensors and continuous monitoring of body vitals
can help identify important patterns through which we can conclude the
health of the overall body and thereby the potential future health risk.
• People can be alerted about potential health issues before the situation
gets worse. This will result in increasing life expectancy and better
control over chronic illnesses and infectious diseases.
 Predictive Analytics
• Developed economies like Europe could save more than $149 billion by
improving operational efficiency through Big Data Analytics.
• We can increase capacity utilization through predictive analysis. Analyzing the
patients’ admission rate with the help of past data can help increase/decrease the
number of beds. This way hospitals can serve more patients with the same
capacity.
• With the help of Big Data Analytics, we can manage hospital staff effectively
through demand forecasting. Other examples of predictive modeling:
• Predicting the chances of a heart attack in the patient
• Regression models can help predict the cost a patient will incur during
treatment. Similarly, hospitals can forecast the demand for their medical
supplies to avoid stockout.
• Customized Care (for the high-risk patients)

• Predictive analytics can help save more than 25% of the annual cost
of healthcare institutions.
• Through predictive analytics, we can identify the patients who visit
frequently to the hospital. We can classify these patients based on their
health condition. Patients with serious health conditions can be given
priority and treated accordingly.
• By analyzing their past visits we can provide them customized care and
reduce their number of visits. Big data analytics plays a crucial role in
delivering all these kinds of service benefits to the patients.
• Preventing human errors
• Doctors are not gods and they can make mistakes as well. Therefore to reduce
human error, EHRs (Electronic health records) can come in handy. Digital
health records can provide lots of data about the patient’s medical history.
• Analyzing the past prescriptions and its effectiveness, analytics can keep a
check on the wrong prescription and alert the patient immediately.
• More effective diagnostic and therapeutic techniques
• Medical reports and doctors’ prescriptions generate tremendous

amounts of data daily. We can analyze past data to check the
effectiveness of the treatment process and medicines. This will help us
know which treatment process is suitable for a particular condition.
• We can remove ineffective treatments and processes to achieve the

desired results.
• Patient Similarity
• Patient Similarity Algorithms helps identify patients with similar

characteristics based on their past health records. Through this doctors
can predict treatment strategy more precisely for a particular disease.
• For example, identifying which treatment strategy will work best for
which groups of people.
• Telemedicine
• These days the world is facing an acute shortage of Medical Staff. In India, the
situation is worse as compared to WHO recommendations. WHO recommends
that there should be 1 doctor per 1000 of the population, but in India, there is 1
doctor per 10,000 of population.
• Big Data Analytics can help improve this situation. Telemedicine refers to
delivering medical services to remote areas using technology. Telemedicine can
be used for medical education for health professionals, remote patient
monitoring, etc.
• Remote medical staff can check and collect medical data from the patients.
Doctors can prescribe the treatment based on the data. This helps avoid the
physical presence of doctors to treat the patients.
 Big Data Analytics and Medical Imaging

• Millions of CT Scans, MRIs, X-Rays, ECGs are done daily. The
Healthcare industry in recent years is leveraging this data to find patterns
across millions of images. This can help study the disease more precisely
and can provide a new knowledge pool in the field of medical science.
• There may be the possibility that a radiologist may no longer need to look
at the image and the algorithm will do all the required work for you.

Data - Analytics - Unit - I - III MCA'A'

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data - Analytics - Unit - I - III MCA'A'

Uploaded by

Copyright:

Available Formats

Big Data Analytics using R

• Industry examples of Big Data

“The quantities, characters, or symbols on which operations are

• Irrespective of the size of the enterprise, DATA continues to be a

• Need of the hour is to understand, manage, process and analyse

Data - Information - Insights (Knowledge)

 “We are living in the Information Age”.

 Data is growing at a phenomenal rate

• Data is collected everyday from business, society, science and engineering,

• This explosive growth of data is a result of computerization of our society and

• Traditional Data Analysis tools - ?

-- Eric Schimidt, Executive Chairman, Google

- Gordon Moore, Co-founder of Intel

• Data which is in a organized form (rows and columns) and can be

 Sources of Structured Data

• Database such as Oracle, MySQL, DB2, Teradata, ….

 Sources of Semi-structured Data

• XML (eXtensible Markup Language)

 Sources of Unstructured Data

Structured data Unstructured data Semi-Structured

 Natural Language Processing (NLP)

 This data comes from everywhere:

• sensors used to gather climate information,

This data is called as ??????????????

• Constitute both structured and un structured data that grow

Source : Gartner IT Glossary

Volume .. facts and statistics

Batch Processing -> Periodic -> Near real time ->

• Various formats, types, and structures

• Static data vs. streaming data

• A single application can be generating/collecting many types of data

 Social media and networks

• Internet - Google, amazon, eBay, AOL, etc.,

“Process of collecting, organizing and analyzing of large sets of data (big

 First school of thought

• Mid 1950’s to 2009

• Descriptive statistics (and Diagnostic)

• Report on events, occurrences, etc of the past.

• Why did it happen?

• Descriptive statistics + Predictive statistics

• What will happen?

• Why will it happen?

• Insight into the past

• Use Descriptive Analytics when you need to understand at an aggregate

• which use statistical models and forecasting techniques to understand

• Understanding the future

• which use optimization and simulation algorithms to advise on possible

• Advise on possible outcomes

• In traditional BI environment, all the enterprise’s data is stored in central

• DFS scales by scaling in or out horizontally as compared to typical

• In traditional BI, data is generally analysed in an offline mode whereas in

“Apache Hadoop is an open-source software

To process big data, first it has to stored in a distributed manner

Hadoop = HDFS + MapReduce

HDFS creates a level of abstraction over the resources,

HDFS has two core components:

• Stores different types of large data sets (i.e. structured, semi-structured

 Performs all your processing activities by allocating resources and scheduling

• Manages resources and schedule applications running on top of yarn.

• Manages containers and monitors resource utilisation in each container.

• Core component in a Hadoop Ecosystem for processing.

• In a MapReduce program, Map() and Reduce() are two functions.