Data - Analytics - Unit - I - III MCA'A'
Dr. S. SENTHIL
School of Computer Science and Applications
SCHEME
• Introduction to BigData and its importance
• Understanding the Characteristics of Big Data
• The V’s
o Velocity
o Variety
o Volume
• Types of Data
o Structured
o Unstructured
o Semi-structured
• Examples of structured, unstructured and Semi-structured data
• Understanding the Waves of managing Data
o Wave 1: Creating manageable data structures
o Wave 2: Web and content management
o Wave 3: Managing big data
• Big Data architecture
o Beginning with capture, organize, integrate, analyze, and act
o Setting the architectural foundation
o Performance matters
o Traditional and advanced analytics
• Big Data Technology Components
• Layer 0: Redundant Physical Infrastructure
o Physical redundant networks
o Managing hardware: Storage and servers
o Infrastructure operations
• Layer 1: Security Infrastructure
o Interfaces and Feeds to and from Applications and the Internet
• Layer 2: Operational Databases
• Layer 3: Organizing Data Services and Tools
• Layer 4: Analytical Data Warehouses
Data: “facts and statistics collected together for reference or analysis.”
-- Oxford Dictionary
INTRODUCTION
• Information - ?
• Data Structures - ?
• Database Management Systems - ?
• DBMS Vs RDBMS
• Data Models
• Traditional data analysis tools and techniques cannot be used because of the massive size
of the data set.
How to extract?
• SQL - ?
Solution –
• Powerful and versatile tools are badly needed to automatically uncover valuable information
from the tremendous amount of data and to transform such data into organized knowledge.
Data doubles about every year while useful information seems to be
decreasing.
We are data rich, but information poor.
“From the dawn of civilization until 2003, humankind generated 5 exabytes of data.
Now we produce 5 exabytes of data every two days, and the pace is accelerating.”
- Eric Schmidt, Google
Old Model: Few companies are generating data, all others are consuming data.
New Model: All of us are generating data, and all of us are consuming data.
“Data is the new oil. Like oil, data is valuable, but if unrefined
it cannot really be used. It has to be changed into gas, plastic,
chemicals, etc. to create a valuable entity that drives profitable
activity. So must data be broken down and analysed for it to have
value.”
- Clive Humby, British mathematician
CHARACTERISTICS OF DATA
Composition
• Deals with the structure of data (sources, granularity, types and the
nature of the data)
Condition
• Deals with the state of the data.
• Can we use the data as it is for analysis?
• Does it require cleaning for further enhancement and enrichment?
Context
• Where was this data generated?
• Why was this data generated?
• How sensitive is the data?
• What are the events associated with the data?
TYPES OF DIGITAL DATA
• Structured Data
• Semi-Structured Data
• Unstructured Data
Structured Data
• Data which conforms to a data model and is stored in a fixed format (rows and
columns).
• It is in a form which can be easily used by a computer program.
• E.g: data in relational databases, spreadsheets,…..
Semi-Structured Data
• Data which does not conform to a data model but has some
structure.
• It is not in a form which can be easily used by a computer program.
• E.g: XML, HTML, e-mails,…..
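The appeal of semi-structured formats like XML is that the structure travels with the data. A minimal sketch in Python (the tag names here are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A small, self-describing XML fragment (tag names are hypothetical).
doc = """
<customer>
  <name>Bob</name>
  <city>California</city>
  <balance>200.00</balance>
</customer>
"""

root = ET.fromstring(doc)
# Tags describe each value, so a program can navigate the record
# without a fixed, externally defined schema.
record = {child.tag: child.text for child in root}
print(record)  # {'name': 'Bob', 'city': 'California', 'balance': '200.00'}
```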
Unstructured Data
• Data which does not conform to a data model or is not in a form which can
be used easily by a computer program.
• E.g: Memos, Images, Audio, Video, letters, etc.
Data sources
(Figure: examples of data sources, including weather forecasting feeds and NoSQL systems)
Gartner estimates that 80% of data generated in any
enterprise today is unstructured data. Roughly 10%
of data is in the structured and semi-structured
category.
HOW TO DEAL WITH UNSTRUCTURED DATA?
Data Mining
• Process of discovering knowledge hidden in large volumes of data.
Text Analytics (or Text Mining)
• Process of gleaning high-quality, meaningful information from text.
• Includes tasks such as Text categorization, Text clustering, Sentiment
analysis, ….
WHAT IS BIG DATA?
• Collections of datasets so large and complex that they cannot
be processed by traditional data processing applications.
A few responses
• Anything beyond the human and technical infrastructure
needed to support storage, processing and analysis.
• Today’s BIG may be tomorrow’s NORMAL.
• Terabytes or Petabytes or Zettabytes of data.
“Big Data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative
forms of information processing for enhanced insight
and decision making”
- Gartner
(Figure: shifts driving Big Data — from structured to unstructured data, from batch to
streaming data, from terabytes to zettabytes — all in support of decision making.)
BIG DATA
Volume
• Bits -> Bytes -> Kilobytes -> Megabytes -> Gigabytes ->
Terabytes -> Petabytes -> Exabytes -> Zettabytes ->
Yottabytes
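The scale above can be made concrete with a little arithmetic; a quick sketch using the binary convention (1 kilobyte = 1024 bytes; decimal definitions use 1000):

```python
# Each unit is 1024x the previous one (binary convention).
units = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes",
         "petabytes", "exabytes", "zettabytes", "yottabytes"]

def to_bytes(value, unit):
    """Convert a value in the given unit to raw bytes."""
    return value * (1024 ** units.index(unit))

print(to_bytes(1, "petabytes"))  # 1125899906842624
print(to_bytes(5, "exabytes"))   # the 5 exabytes from the quote above
```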
V’S
Velocity
• Refers to the increasing speed at which data is created, and the increasing
speed at which it can be processed, stored and analysed by relational
databases.
• Data is being generated fast and needs to be processed fast.
• Online Data Analytics
• Late decisions mean missed opportunities.
• Examples
• E-Promotions: based on your current location, your purchase history and what you
like, send promotions right now for the store next to you.
• Healthcare monitoring: sensors monitoring your activities and body; any abnormal
measurements require immediate reaction.
V’S
Variety
• Text, numerical, images, audio, video, sequences, time series, social media data,
multi-dimensional arrays, etc.
Veracity
• Refers to biases, noise and abnormality in data.
Validity
• Refers to the accuracy and correctness of data.
Volatility
• Deals with how long the data is valid and how long it should be
stored.
Variability
• Data whose meaning is constantly changing.
WHY BIG DATA?
WHO IS GENERATING BIG DATA?
Scientific Instruments
Sensor technology
e-Commerce
BIG DATA ANALYTICS
“Big Data Analytics is the process of examining big data to uncover patterns,
unearth trends and find unknown correlations and other useful
information to make faster and better decisions”
Basic Analytics
• Slicing and dicing of data to help with basic business insights.
• Reporting on historical data, basic visualization, etc.
Operationalized analytics
• Gets woven into enterprise’s business process
FIRST SCHOOL OF THOUGHT
Advanced analytics
• Forecasting for the future by way of predictive and prescriptive
modeling.
Monetized analytics
• To derive direct business revenue.
SECOND SCHOOL OF THOUGHT
Analytics 1.0
• Era of traditional descriptive analytics and business intelligence (up to about 2005)
• What happened?
Analytics 2.0
• 2005 to 2012
• Use data from the past to make predictions for the future
Analytics 3.0
• 2012 onwards
• Combine traditional and big data analytics to embed insights into products and services
Descriptive Analytics
• which use data aggregation and data mining to provide insight into the
past and answer: “What has happened?”
Predictive Analytics
• Use Predictive Analytics any time you need to know something about
the future, or fill in the information that you do not have.
ANALYTICS 1.0, 2.0, 3.0
Prescriptive Analytics
• Use Prescriptive Analytics any time you need to provide users with
advice on what action to take.
TRADITIONAL BI VS. BIG DATA
HADOOP
• Originally built as an infrastructure for the “Nutch” project.
• Based on Google's MapReduce and the Google File System.
• Created by Doug Cutting in 2005 at Yahoo.
• Named after his son's toy yellow elephant.
HDFS
• NameNode
Master node that holds metadata about the data stored.
• DataNodes
Data is stored on the data nodes, which are commodity
hardware in the distributed environment.
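The NameNode/DataNode split can be illustrated with a toy model (purely didactic, not HDFS's actual API): the NameNode holds only metadata — which blocks make up a file and which nodes hold each replica — while the DataNodes hold the bytes.

```python
# Toy model of HDFS's metadata/data split (illustrative only).
name_node = {
    # file path -> list of (block_id, [data nodes holding a replica])
    "/logs/app.log": [("blk_1", ["dn1", "dn2", "dn3"]),
                      ("blk_2", ["dn2", "dn3", "dn4"])],
}
data_nodes = {
    "dn1": {"blk_1": b"first 128MB..."},
    "dn2": {"blk_1": b"first 128MB...", "blk_2": b"next 128MB..."},
    "dn3": {"blk_1": b"first 128MB...", "blk_2": b"next 128MB..."},
    "dn4": {"blk_2": b"next 128MB..."},
}

def read_file(path):
    """Ask the name node for block locations, then fetch each block
    from any replica -- replication gives fault tolerance."""
    chunks = []
    for block_id, replicas in name_node[path]:
        chunks.append(data_nodes[replicas[0]][block_id])
    return b"".join(chunks)

print(read_file("/logs/app.log"))
```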
FUNDAMENTAL PRINCIPLES OF HADOOP
• Parallel Execution
• Data Locality
• Fault Tolerance
• Scalability
• Economical
HADOOP ECOSYSTEM
HDFS - STORAGE
YARN - RESOURCE MANAGEMENT
Two services
• ResourceManager
• NodeManager
MAPREDUCE - PROCESSING
• Helps in writing applications that process large datasets using distributed and
parallel algorithms.
• The Map function transforms input records into intermediate key/value pairs; the
Reduce function aggregates and summarizes the results produced by the Map function.
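The Map/Reduce pattern can be sketched in plain Python (a simulation of the programming model, not the Hadoop API): map emits key/value pairs, a shuffle step groups them by key, and reduce aggregates each group — here, the classic word count.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word -- the classic word-count mapper.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Aggregate all values emitted for one key.
    return word, sum(counts)

lines = ["big data is big", "data moves fast"]

# Shuffle: group mapper output by key.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```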
PIG : DATA PROCESSING SERVICE USING QUERY
• Pig has two parts: Pig Latin (the language) and the Pig runtime (the execution
environment).
• 1 line of Pig Latin ≈ 100 lines of MapReduce code (approx.).
• The compiler internally converts Pig Latin to MapReduce.
• It gives us a platform for building data flows for ETL.
• Pig first loads the data, then performs various functions like grouping, filtering,
joining and sorting, and finally dumps the data to the screen or stores it in HDFS.
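The load → filter → group → sort → dump flow that Pig Latin expresses declaratively can be mimicked in Python (a sketch of the dataflow idea only; the records and field names are invented):

```python
# Simulate a Pig-style dataflow: load, filter, group, aggregate, sort, dump.
records = [
    {"user": "bob", "state": "CA", "amount": 200.0},
    {"user": "nathan", "state": "UT", "amount": 10.0},
    {"user": "alice", "state": "CA", "amount": 50.0},
]

# FILTER: keep purchases of at least $20.
filtered = [r for r in records if r["amount"] >= 20.0]

# GROUP BY state, then SUM within each group.
totals = {}
for r in filtered:
    totals[r["state"]] = totals.get(r["state"], 0.0) + r["amount"]

# ORDER BY total, descending, then DUMP.
for state, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(state, total)
```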
HIVE : DATA PROCESSING SERVICE USING QUERY
SOLR & LUCENE : SEARCHING AND INDEXING
• Two services which are used for searching and indexing in the Hadoop ecosystem.
• Apache Lucene is based on Java, which also helps in spell checking.
• Apache Lucene is the engine; Apache Solr is a complete application built around
Lucene.
• Solr uses Apache Lucene Java search library for searching and indexing.
ZOOKEEPER : CO-ORDINATOR
• Examples of structured data include numbers, dates, and groups of words and
numbers called strings (for example, a customer’s name, address, and so on).
• Often these data elements are integrated in a data warehouse for analysis.
SOURCES OF STRUCTURED DATA
Computer- or machine-generated
Human-generated
Sensor data:
• Examples include radio frequency ID (RFID) tags, smart meters,
medical devices, and Global Positioning System (GPS) data.
• Another example of sensor data is smartphones that contain sensors like
GPS that can be used to understand customer behavior in new ways.
Financial data
• Lots of financial systems are now programmatic; they are operated
based on predefined rules that automate processes. Stock-trading data is
a good example of this. It contains structured data such as the company
symbol and dollar value. Some of this data is machine generated, and
some is human generated.
HUMAN GENERATED STRUCTURED DATA
Input data
This is any piece of data that a human might input into a computer, such as
name, age, income, non-free-form survey responses, and so on. This data
can be useful to understand basic customer behavior.
Click-stream data
Data is generated every time you click a link on a website. This data can be
analyzed to determine customer behavior and buying patterns.
Gaming-related data
Every move you make in a game can be recorded. This can be useful in
understanding how end users move through a gaming portfolio.
UNSTRUCTURED DATA
Satellite images:
• This includes weather data or the data that the government captures in
its satellite surveillance imagery.
Scientific data:
• This includes seismic imagery, atmospheric data, and high energy
physics.
Photographs and video:
• This includes security, surveillance, and traffic video.
Radar or sonar data:
• This includes vehicular, meteorological, and oceanographic seismic
profiles.
EXAMPLES OF HUMAN GENERATED UNSTRUCTURED DATA
Text internal to your company:
• Think of all the text within documents, logs, survey results, and e-mails.
Enterprise information actually represents a large percentage of the text information
in the world today.
Mobile data:
• This includes data such as text messages and location information.
Website content:
• This comes from any site delivering unstructured content, like YouTube, Flickr,
or Instagram.
SEMI-STRUCTURED DATA
• Semi-structured data does not necessarily conform to a fixed schema (that is,
structure) but may be self-describing and may have simple label/value pairs.
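JSON is a convenient example: each record carries its own labels, and records in the same collection may differ in the fields they contain. A short sketch (the records are invented):

```python
import json

# Two self-describing records: the first has a field the second lacks,
# and vice versa -- no fixed schema is required.
raw = '[{"name": "Bob", "city": "California"}, {"name": "Nathan", "phone": "555-0100"}]'

for record in json.loads(raw):
    # Label/value pairs let a program discover each record's structure.
    print(sorted(record.keys()))
```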
Wave 1
• Creating manageable data structures
Wave 2
• Web and content management
Wave 3
• Managing big data
WAVE 1
• As computing moved into the commercial market in the late 1960s, data was stored
in flat files (data files that contain no links to other files and impose no structure).
• Bob|123 street|California|$200.00
Nathan|800 Street|Utah|$10.00
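A flat file like the one above imposes no structure beyond a field delimiter, so every program reading it must hard-code the layout. A sketch of parsing those two records (the column meanings are assumed from the example):

```python
# Pipe-delimited flat-file records, as in the Wave 1 example above.
flat_file = """Bob|123 street|California|$200.00
Nathan|800 Street|Utah|$10.00"""

# The layout (name, address, state, amount) lives in the program,
# not in the file -- there is no schema to consult.
for line in flat_file.splitlines():
    name, address, state, amount = line.split("|")
    print(name, state, float(amount.lstrip("$")))
```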
• Later in the 1970s, things changed with the invention of the relational data model and
the Relational DataBase Management System (RDBMS) that imposed structure and a
method for improving performance.
WAVE 1
• Most importantly, the relational model added a level of abstraction (the structured
query language [SQL], report generators, and data management tools) so that it was
easier for programmers to satisfy the growing business demands to extract value from
data.
• The relational model offered an ecosystem of tools from a large number of emerging
software companies.
• But a problem emerged from this exploding demand for answers: Storing this growing
volume of data was expensive and accessing it was slow.
• Making matters worse, lots of data duplication existed, and the actual business value of
that data was hard to measure.
WAVE 1
• At this stage, an urgent need existed to find a new set of technologies to support
the relational model.
• When the volume of data that organizations needed to manage grew out of control,
the data warehouse provided a solution.
• The data warehouse enabled the IT organization to select a subset of the data being
stored so that it would be easier for the business to try to gain insights. It also
provided an integrated source of information from across various data sources that
could be used for analysis.
WAVE 1
• Data warehouses were commercialized in the 1990s, and today, both content
management systems and data warehouses are able to take advantage of
improvements in scalability of hardware, virtualization technologies, and the
ability to create integrated hardware and software systems, also known as
appliances.
• Sometimes these data warehouses themselves were too complex and large and
didn’t offer the speed and agility that the business required. The answer was a
further refinement of the data being managed through data marts.
• These data marts were focused on specific business issues and were much more
streamlined and supported the business need for speedy queries than the more
massive data warehouses.
WAVE 1
• Data warehouses and data marts solved many problems for companies needing a
consistent way to manage massive transactional data. But when it came to managing
huge volumes of unstructured or semi-structured data, the warehouse was not able to
evolve enough to meet changing demands.
• To complicate matters, data warehouses are typically fed in batch intervals, usually
weekly or daily. This is fine for planning, financial reporting, and traditional
marketing campaigns, but is too slow for increasingly real-time business and
consumer environments.
WAVE 2
• The market evolved from a set of disconnected solutions to a more unified model
that brought together these elements into a platform that incorporated business
process management, version control, information recognition, text management,
and collaboration. This new generation of systems added meta-data (information
about the organization and characteristics of the stored information).
WAVE 2
• A new generation of requirements has begun to emerge that drive us to the next
wave.
• In this new wave, organizations are beginning to understand that they need to
manage a new generation of data sources with an unprecedented amount and
variety of data that needs to be processed at an unheard-of speed.
WAVE 3
• With big data, it is now possible to virtualize data so that it can be stored
efficiently and, utilizing cloud-based storage, more cost-effectively as well.
WAVE 3
• Many of the technologies at the heart of big data, such as virtualization, parallel
processing, distributed file systems, and in-memory databases, have been around
for decades.
• Advanced analytics have also been around for decades, although they have not
always been practical. Other technologies such as Hadoop and MapReduce have
been on the scene for only a few years.
IS THERE A FOURTH “BIG” WAVE? – EVOLUTION, IoT
• Currently we are still at an early stage of leveraging huge volumes of data to gain
a 360-degree view of the business and anticipate shifts and changes in customer
expectations.
• The technologies required to get the answers the business needs are still isolated
from each other.
• To get to the desired end state, the technologies from all three waves will have to
come together.
• Big data is not simply about one tool or one technology.
• It is about how all these technologies come together to give the right insights, at
the right time, based on the right data whether it is generated by people, machines,
or the web.
BUILDING A SUCCESSFUL BIG DATA MANAGEMENT
ARCHITECTURE
• But as data has become the fuel of growth and innovation, it is more important
than ever to have an underlying architecture to support growing requirements.
BEGINNING WITH CAPTURE, ORGANIZE,
INTEGRATE, ANALYZE, AND ACT
• Capture
• Organize
• Integrate
• Analyze
• Act
• A defining characteristic of big data is that it relies on picking up lots of data from
lots of sources.
• Therefore, open application programming interfaces (APIs) will be core
to any big data architecture.
• In addition, keep in mind that interfaces exist at every level and
between every layer of the stack.
• Without integration services, big data can’t happen.
SETTING THE ARCHITECTURAL FOUNDATION
Security infrastructure
• The more important big data analysis becomes to companies, the more important
it will be to secure that data.
• For example, if you are a healthcare company, you will probably want to use big
data applications to determine changes in demographics or shifts in patient needs.
• This data about your constituents needs to be protected both to meet compliance
requirements and to protect the patients’ privacy.
• You will need to take into account who is allowed to see the data and under what
circumstances they are allowed to do so.
• You will need to be able to verify the identity of users as well as protect the
identity of patients.
• These types of security requirements need to be part of the big data fabric from the
outset and not an afterthought.
SETTING THE ARCHITECTURAL FOUNDATION
• You find new emerging approaches to data management in the big data
world, including document, graph, columnar, and geospatial database
architectures.
• Collectively, these are referred to as NoSQL, or not only SQL,
databases.
SETTING THE ARCHITECTURAL FOUNDATION
All these operational data sources have several characteristics in common:
• They represent systems of record that keep track of the critical data required
for real-time, day-to-day operation of the business.
• They are continually updated based on transactions happening within business
units and from the web.
• For these sources to provide an accurate representation of the business, they
must blend structured and unstructured data.
• These systems also must be able to scale to support thousands of users on a
consistent basis.
• These might include transactional e-commerce systems, customer relationship
management systems, or call center applications.
SETTING THE ARCHITECTURAL FOUNDATION
Performance matters
• Your data architecture also needs to perform in concert with your organization’s
supporting infrastructure.
• It might take days to run this model using a traditional server configuration.
• However, using a distributed computing model, what took days might now
take minutes.
SETTING THE ARCHITECTURAL FOUNDATION
• Performance might also determine the kind of database you would use.
• A graph database might be a better choice, as it is specifically designed to
separate the “nodes” (entities) from their “properties” (the information that
defines an entity), connected by “edges” (the relationships between nodes and properties).
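The node/property/edge separation can be sketched with plain dictionaries (a toy model; real graph databases such as Neo4j add indexing and a query language on top):

```python
# Toy property graph: nodes, their properties, and labeled edges.
nodes = {"card_1": {"type": "credit_card"},
         "acct_1": {"type": "account", "owner": "Bob"},
         "acct_2": {"type": "account", "owner": "Nathan"}}

edges = [("card_1", "USED_BY", "acct_1"),
         ("card_1", "USED_BY", "acct_2")]

def neighbors(node, relation):
    """Follow edges of one relation type outward from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# Two accounts sharing one card is exactly the kind of relationship
# a graph traversal surfaces cheaply.
print(neighbors("card_1", "USED_BY"))  # ['acct_1', 'acct_2']
```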
• Using the right database will also improve performance.
• Typically the graph database will be used in scientific and technical
applications.
SETTING THE ARCHITECTURAL FOUNDATION
• Very few tools could make sense of these vast amounts of data.
• The tools that did exist were complex to use and did not produce results in a
reasonable time frame.
• In the end, those who really wanted to go to the enormous effort of analyzing
this data were forced to work with snapshots of data.
• This has the undesirable effect of missing important events because they were
not in a particular snapshot.
SETTING THE ARCHITECTURAL FOUNDATION
• Some analysis will use a traditional data warehouse, while other analyses will
take advantage of advanced predictive analytics.
• Managing big data holistically requires many different approaches to help the
business to successfully plan for the future.
SETTING THE ARCHITECTURAL FOUNDATION
• Nothing helps us understand Big Data more than examples of how the
technology and approaches are being used in the “real world.”
• It helps us learn how to apply ideas from other industries to our own business.
• Examples include:
• “Today’s consumers have changed. They’ve put down the newspaper, they
fast-forward through TV commercials, and they junk unsolicited email. Why?
• They have new options that better fit their digital lifestyle. They can choose
which marketing messages they receive, when, where, and from whom.
• They choose across the digital power channels: email, mobile, social, display and
the web.”
INDUSTRY EXAMPLES OF BIG DATA
• Cross-Channel Lifecycle Marketing (to tell all your audience and customers a
similar brand story across multiple channels) really starts with the capture of
customer permission, contact information, and preferences for multiple channels.
• It also requires marketers to have the right integrated marketing and customer
information systems, so that they can automate and optimize their programs and
processes throughout the customer lifecycle.
• Once marketers have that, they need a practical framework for planning
marketing activities.
INDUSTRY EXAMPLES OF BIG DATA
• And it means driving people to a website, a mobile app, and the like,
and, once there, retaining them, interacting with them.
• Mercedes-Benz has a cross-selling strategy that includes digital and
social media channels, including paid media, owned media, earned
media and content marketing. For example, its “Generation Benz”
online community was integral in developing a customer profile for
Mercedes Benz that would help them understand which marketing
tactics would work best for each channel.
INDUSTRY EXAMPLES OF BIG DATA
• One of the most common forms of fraudulent activity is credit card fraud. The
credit card fraud rate in the United States and other countries is increasing.
INDUSTRY EXAMPLES OF BIG DATA
• Despite warnings that social networks are a great resource for fraudsters, consumers
are still sharing a significant amount of personal information frequently used to
authenticate a consumer’s identity.
• Those with public profiles (those visible to everyone) were more likely to expose
this personal information.
INDUSTRY EXAMPLES OF BIG DATA
• Prevent Fraud
• To prevent fraud, credit card transactions are monitored and checked in
near real time.
• The Capgemini Financial Services team believes that due to the nature of data
streams and processing required, Big Data technologies provide an optimal
technology solution based on the following three Vs:
INDUSTRY EXAMPLES OF BIG DATA
• High volume.
• High velocity.
• High variety.
• Pattern recognition is performed against the data to score and weight individual
transactions across each of the rules and scoring dimensions.
• A cumulative score is then calculated for each transaction record and compared
against thresholds to decide if the transaction is potentially suspicious or not.
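The scoring scheme described — weight each matching rule, accumulate a score, compare against a threshold — can be sketched as follows (the rules, weights and threshold are invented for illustration):

```python
# Hypothetical scoring rules: each returns True when a transaction
# matches a suspicious pattern; the weights are illustrative.
RULES = [
    (lambda t: t["amount"] > 5000, 40),          # unusually large amount
    (lambda t: t["country"] != t["home"], 30),   # foreign transaction
    (lambda t: t["hour"] < 5, 20),               # middle-of-the-night purchase
]
THRESHOLD = 60

def score(txn):
    """Cumulative weighted score across all matching rules."""
    return sum(weight for rule, weight in RULES if rule(txn))

txn = {"amount": 7200, "country": "RO", "home": "US", "hour": 3}
s = score(txn)
print(s, "suspicious" if s >= THRESHOLD else "ok")  # 90 suspicious
```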
INDUSTRY EXAMPLES OF BIG DATA
• Elasticsearch is able to achieve fast search responses because, instead of searching
the text directly, it searches an index.
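The index in question is an inverted index: rather than scanning documents at query time, the engine precomputes a map from each term to the documents that contain it. A minimal sketch (Lucene's real index adds analysis, scoring and compression):

```python
from collections import defaultdict

docs = {1: "credit card fraud detection",
        2: "card transaction monitoring",
        3: "fraud pattern recognition"}

# Build the inverted index once: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """Documents containing every query term (AND semantics)."""
    results = set(docs)
    for term in terms:
        results &= index.get(term, set())
    return sorted(results)

print(search("card", "fraud"))  # [1]
```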
• Once the transaction data has been processed, the percolator query then performs the
function of identifying new transactions that have raised profiles.
• SNA (Social Network Analysis) is the analysis of social networks: it views social
relationships in terms of the individuals involved and the connections between them.
• SNA could reveal all individuals involved in fraudulent activity, from perpetrators to their
associates.
• SNA helps to understand their relationships and behaviors, for example to identify a
bust-out fraud case.
INDUSTRY EXAMPLES OF BIG DATA
• Risk and Big Data
• Many of the world’s top analytics professionals work in risk management.
• A third type of risk, operational risk management, is not as common as credit and market risk.
INDUSTRY EXAMPLES OF BIG DATA
• Credit risk analytics focus on past credit behaviors to predict the likelihood that
a borrower will default on any type of debt by failing to make payments.
• For example, “Is this person likely to default on their $300,000 mortgage?”
• Market risk analytics focus on understanding the likelihood that the value of a
portfolio will decrease due to the change in stock prices, interest rates, foreign
exchange rates, and commodity prices.
• For example, “Should we sell this holding if the price drops another 10
percent?”
INDUSTRY EXAMPLES OF BIG DATA
• In addition to saving and improving lives, Big Data has the potential to transform
the entire health care system by replacing guesswork and intuition with objective,
data-driven science.
INDUSTRY EXAMPLES OF BIG DATA
• “Big data in healthcare” refers to the abundant health data amassed from
numerous sources including electronic health records (EHRs), medical imaging,
genomic sequencing, payer(Insurance) records, pharmaceutical research,
wearables, and medical devices, to name a few.
• The Health care data is Voluminous, Moves at high velocity and variety of data
formats is used.
INDUSTRY EXAMPLES OF BIG DATA
• Wearable devices from Apple can notify wearers that they need to seek medical
attention.
• Analysis of healthcare big data also contributes to greater insight into patients that are
at greatest risk for illness, thereby permitting a proactive approach to prevention.
Wearables and IoT sensors can provide a direct, real-time feed to a patient’s
electronic health records, which allows medical staff to monitor and then consult
with the patient, either face-to-face or remotely.
BIG DATA IN HEALTHCARE
As per the Wise Guy Reports, by 2022, the Big Data Analytics industry in
healthcare will be more than $34.27 billion, with an expected CAGR of 22.07%.
The overall value of the Big Data Analytics segment globally will be more than
$68.03 billion by 2024.
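A compound annual growth rate ties such figures together; a quick sketch of the arithmetic (the dollar figures below are taken from the text, the time span is assumed):

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end/start)**(1/years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1

# With the reported figures, a market growing from $34.27B to $68.03B
# over an assumed three-year span implies roughly this annual rate.
print(round(cagr(34.27, 68.03, 3) * 100, 2), "% per year")
```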
BIG DATA IN HEALTHCARE
Health Tracking
• Big Data Analytics and the Internet of Things are revolutionizing the
healthcare industry.
• Nowadays various wearables record sleep, heart rate, distance walked,
exercise, etc. Alongside these, there are also devices to monitor blood
pressure, blood sugar level, oxygen saturation (oximeters) and many more.
• Data received from sensors and continuous monitoring of body vitals
can help identify important patterns through which we can conclude the
health of the overall body and thereby the potential future health risk.
• People can be alerted about potential health issues before the situation
gets worse. This will result in increasing life expectancy and better
control over chronic illnesses and infectious diseases.
BIG DATA IN HEALTHCARE
Predictive Analytics
• Developed economies such as those in Europe could save more than $149 billion by
improving operational efficiency through Big Data Analytics.
• We can increase capacity utilization through predictive analysis. Analyzing the
patients’ admission rate with the help of past data can help increase/decrease the
number of beds. This way hospitals can serve more patients with the same
capacity.
• With the help of Big Data Analytics, we can manage hospital staff effectively
through demand forecasting. Other examples of predictive modeling:
• Predicting the chances of a heart attack in the patient
• Regression models can help predict the cost a patient will incur during
treatment. Similarly, hospitals can forecast the demand for their medical
supplies to avoid stockout.
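A regression of the kind mentioned can be sketched with a one-variable least-squares fit (the stay-length and cost data below are entirely made up for illustration):

```python
# Hypothetical training data: length of hospital stay (days) vs.
# treatment cost (dollars). Fit y = a*x + b by least squares.
days  = [1, 2, 3, 4, 5]
costs = [1100, 1900, 3100, 3900, 5100]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(costs) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(days, costs))
     / sum((x - mean_x) ** 2 for x in days))
b = mean_y - a * mean_x

def predict_cost(stay_days):
    """Predicted treatment cost for a given length of stay."""
    return a * stay_days + b

print(round(predict_cost(7)))  # extrapolated cost for a 7-day stay -> 7020
```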
BIG DATA IN HEALTHCARE
• Doctors are not gods and they can make mistakes as well. Therefore to reduce
human error, EHRs (Electronic health records) can come in handy. Digital
health records can provide lots of data about the patient’s medical history.
• By analyzing past prescriptions and their effectiveness, analytics can keep a
check on wrong prescriptions and alert the patient immediately.
BIG DATA IN HEALTHCARE
• Patient Similarity
• For example, identifying which treatment strategy will work best for
which groups of people.
BIG DATA IN HEALTHCARE
• Telemedicine
• These days the world is facing an acute shortage of medical staff. In India the
situation is worse than WHO recommendations: WHO recommends 1 doctor per
1,000 people, but India has about 1 doctor per 10,000.
• Big Data Analytics can help improve this situation. Telemedicine refers to
delivering medical services to remote areas using technology. Telemedicine can
be used for medical education for health professionals, remote patient
monitoring, etc.
• Remote medical staff can check and collect medical data from the patients.
Doctors can prescribe treatment based on the data, avoiding the need for the
doctor's physical presence to treat the patient.
BIG DATA IN HEALTHCARE