
Unit-2

Fundamentals of Big Data Analytics

BIG DATA ANALYTICS 1


Overview of business intelligence
▪ BI technologies provide historical, current and predictive views of business
operations.
▪ Common functions of business intelligence technologies include reporting, online
analytical processing, analytics, data mining, process mining, business performance
management, text mining, predictive analytics and prescriptive analytics.
▪ BI technologies can handle large amounts of structured and sometimes unstructured
data to help businesses identify and develop new strategic opportunities.
▪ BI comprises the strategies and technologies used by enterprises for the data analysis
of business information.



Overview of business intelligence
▪ BI tools access and analyze data sets and present analytical findings in reports,
summaries, dashboards, graphs, charts and maps to provide users with detailed
intelligence about the state of the business.
▪ A typical BI infrastructure includes software solutions for gathering, cleansing,
integrating, analyzing and sharing data.
▪ BI produces analysis and provides reliable information to help make effective,
high-quality business decisions.



Business Intelligence (BI) Architecture
Business intelligence architecture is a term used to describe standards and policies for organizing
data with the help of computer-based techniques and technologies that create business intelligence
systems used for online data visualization, reporting, and analysis.

Business intelligence architecture consists of various components and layers, each of which
has its own purpose.

A solid BI architecture framework consists of:


• Collection of data
• Data integration
• Storage of data
• Data analysis
• Distribution of data
• Reaction based on insights



BI Architecture Framework



Data science
Data science is the science of extracting knowledge from data. It helps to find hidden patterns
in data using statistical and mathematical techniques.
It employs techniques and theory drawn from many fields within the broad areas of
mathematics, statistics and information technology, including machine learning, data
engineering, etc.
Data science uses massive datasets for applications like weather prediction, financial
fraud detection, global economic impact analysis, social media analytics, retail, regression
analysis, collaborative filtering, etc.



Data scientist

[Figure: a data scientist combines three areas of expertise: Business Acumen, Technology
Expertise and Mathematics Expertise]



Business Acumen Skills
A data scientist should have the skills to counter the pressures of business. The following is
a list of traits that need to be sharpened to play the role of data scientist:
1. Understanding of domain
2. Business strategy
3. Problem solving
4. Communication
5. Presentation
6. Inquisitiveness



Technology Expertise
The following are a few of the skills required as technical expertise:
1. Good database knowledge such as RDBMS
2. Good NoSQL database knowledge such as MongoDB, Cassandra, HBase, etc.
3. Programming languages such as Java, Python, C++, etc.
4. Open-source tools such as Hadoop
5. Data warehousing
6. Data mining
7. Visualization such as Tableau, Flare, Google visualization APIs etc.



Mathematics Expertise
The following are the key skills that a data scientist should have:
1. Mathematics
2. Statistics
3. Artificial Intelligence
4. Algorithms
5. Machine Learning
6. Pattern recognition
7. Natural Language Processing



[Figure: what a data scientist does]
• Prepares and integrates large, varied datasets
• Models and analyses data to comprehend it, interpret relationships, unveil patterns and
spot trends
• Applies business/domain knowledge to provide context
• Communicates/presents findings and results



Big Data Analytics
• Big Data Analytics is the process of examining big data to uncover patterns, unearth trends, and
find unknown correlations and other useful information to make faster and better decisions.

• Analytics begins with analyzing all available data.

[Figure: Analyze all available data. Sources shown: Websites, Billing (POS), ERP, CRM, RFID,
Social Media]



Big Data Analytics (Cont..)
Big Data Analytics is:
1. Technology-enabled analytics: Analytical tools help to process and analyze big data.
2. About gaining meaningful, deeper and richer insights into the business in order to steer it in the
right direction, understanding the customer’s demographics, better leveraging the services of vendors
and suppliers, etc.
3. About gaining a competitive edge over competitors by providing findings that allow quicker and
better decision making.
4. A tight handshake between 3 communities: IT, Business users and Data Scientists.
5. Working with datasets whose volume and variety exceed the current storage and processing
capabilities and infrastructure of the enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed processing is tiny
compared to the data.



What is Big Data Analytics?



What Big Data Analytics isn’t?



Need of big data analytics
We need big data analytics for three reasons:
1. The volume of business data worldwide is expected to double every 1.2 years.
2. The cost per gigabyte of storage has dropped hugely.
3. There is an overwhelming number of user-friendly analytics tools available in the
market today.

[Figure labels: More data produced, More data stored, More data analyzed, Better prediction,
Steady growth of analysis]
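
To make the first reason concrete, the short sketch below works out what a 1.2-year doubling period implies over longer horizons. It assumes steady exponential growth, which the slide does not state explicitly; the function name `growth_factor` is purely illustrative.

```python
def growth_factor(years: float, doubling_period: float = 1.2) -> float:
    """Factor by which data volume grows after `years`, assuming it keeps
    doubling every `doubling_period` years (an idealized assumption)."""
    return 2 ** (years / doubling_period)


if __name__ == "__main__":
    for years in (1.2, 2.4, 6.0):
        print(f"after {years} years: about x{growth_factor(years):.0f}")
```

Under this assumption, six years correspond to five doublings, so the volume grows roughly 32-fold.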



Classification of analytics
There are basically two schools of thought:
1. Those that classify analytics into basic, operational, advanced and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0 and analytics 3.0



First school of thought:
1. Basic analytics: This is primarily the slicing and dicing of data to help with basic business
insights. It is about reporting on historical data, basic visualization, etc.
2. Operationalized analytics: Analytics is operationalized when it is woven into the
enterprise’s business processes.
3. Advanced analytics: This is largely about forecasting the future by way of
predictive and prescriptive modeling.
4. Monetized analytics: This is analytics used to derive direct business revenue.



Second school of thought:
| | Analytics 1.0 | Analytics 2.0 | Analytics 3.0 |
|---|---|---|---|
| Era | mid-1950s to 2009 | 2005 to 2012 | 2012 to present |
| Statistics | Descriptive statistics (report events, occurrences, etc. of the past) | Descriptive statistics + predictive statistics (use data from the past to make predictions for the future) | Descriptive statistics + predictive statistics + prescriptive statistics (use data from the past to make predictions for the future and at the same time make recommendations to leverage the situation to one’s advantage) |
| Key questions | What happened? Why did it happen? | What will happen? Why will it happen? | What will happen? When will it happen? Why will it happen? What action should be taken to take advantage of what will happen? |



| | Analytics 1.0 | Analytics 2.0 | Analytics 3.0 |
|---|---|---|---|
| Data sources | Data from legacy systems, ERP, CRM and third-party applications. | Big data. | A blend of big data and data from legacy systems, ERP, CRM and third-party applications. |
| Nature of data | Small and structured data sources. Data stored in enterprise data warehouses or data marts. | Big data is being taken up seriously. Data is mainly unstructured, arriving at a higher pace. This fast flow of high-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop. | A blend of big data and traditional analytics to yield insights and offerings with speed and impact. |
| Sourcing | Data was internally sourced. | Data was often externally sourced. | Data is being both internally and externally sourced. |
| Technologies | Relational databases. | Database applications, Hadoop clusters, SQL-to-Hadoop environments, etc. | In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc. |



Analytics 1.0, 2.0 and 3.0
[Figure: types of analytics, progressing from hindsight through insight to foresight]
• Descriptive analytics: What happened?
• Diagnostic analytics: Why did it happen?
• Predictive analytics: What will happen?
• Prescriptive analytics: How can we make it happen?
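
As a rough illustration of how the descriptive, predictive and prescriptive questions differ, the sketch below applies each to the same small series. The sales figures, the straight-line forecast and the reorder rule are all hypothetical choices made for this example (the slides do not prescribe any particular method), and the diagnostic step is left out for brevity.

```python
from statistics import mean

# Hypothetical monthly sales (units sold); illustrative data only.
monthly_sales = [100, 110, 120, 130, 140, 150]


def describe(sales):
    """Descriptive: summarize what happened."""
    return {"total": sum(sales), "average": mean(sales)}


def predict_next(sales):
    """Predictive: fit a least-squares straight line and extrapolate one step."""
    n = len(sales)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(sales)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def prescribe(forecast, stock_on_hand):
    """Prescriptive: turn the forecast into a recommended action (toy rule)."""
    shortfall = forecast - stock_on_hand
    if shortfall > 0:
        return f"order {shortfall:.0f} more units"
    return "no reorder needed"


if __name__ == "__main__":
    print(describe(monthly_sales))
    forecast = predict_next(monthly_sales)
    print(forecast)
    print(prescribe(forecast, stock_on_hand=120))
```

The point of the sketch is only the division of labour: the first function looks backward, the second looks forward, and the third recommends what to do about the forecast.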



Challenges to Big Data Analytics
1. Scale: Storage (RDBMS or NoSQL) is one major concern that needs to be addressed to handle the need
for scaling rapidly and elastically.
2. Security: Most NoSQL big data platforms have poor security mechanisms when it comes to
safeguarding big data. This is a weak spot that cannot be ignored, given that big data carries credit
card information, personal information and other sensitive data.
3. Schema: Rigid schemas have no place in a big data environment.
4. Continuous availability: Almost all RDBMS and NoSQL big data platforms have a certain amount of
downtime built in.
5. Consistency: Should one opt for consistency or eventual consistency?
6. Partition tolerance: How to build partition-tolerant systems that can take care of both hardware and
software failures?
7. Data quality: How to maintain data quality (data accuracy, completeness, timeliness, etc.)? Do we have
appropriate metadata in place?
Greatest challenges that prevent businesses
from capitalizing on Big data

1. Obtaining executive sponsorships for investments in big data and its related activities

2. Getting the business units to share information across organizational silos.

3. Finding people with the right skills to manage large amounts of structured and semi-structured
data and create insights from it.

4. Determining the approach to scale rapidly and elastically.

5. Deciding whether to use structured or unstructured, internal or external data to make business
decisions.

6. Choosing the optimal way to report findings and analysis of big data for the presentations to
make the most sense.

7. Determining what to do with the insights created from big data.



Importance of big data analytics
The various approaches to analysis of data and what it leads to:
1. Reactive – Business Intelligence: This is about analyzing past or historical data and then
displaying the findings of the analysis or reports in the form of enterprise dashboards, alerts,
notifications, etc.
2. Reactive – Big Data Analytics: Here the analysis is done on huge datasets, but the approach is
still reactive because it is still based on static data.
3. Proactive – Analytics: This supports futuristic decision making through the use of data mining,
predictive modeling, text mining and statistical analysis. This analysis is not on big data, as it
still uses traditional database management practices.
4. Proactive – Big Data Analytics: This is sieving through terabytes of information to filter out the
relevant data to analyze. It also includes high-performance analytics to gain rapid insights
from big data and the ability to solve complex problems using more data.



Basic terminologies in big data
environment
Following are the terminologies of Big data:
a. In-Memory Analytics
b. In-Database processing
c. Symmetric Multi-processor system
d. Massively parallel processing
e. Shared nothing architecture
f. CAP Theorem



▪ In-memory Analytics: Data access from non-volatile storage such as a hard disk is a
slow process. This problem is addressed by in-memory analytics. Here
all the relevant data is stored in random access memory (RAM), or primary storage,
thus eliminating the need to access the data from hard disk. The advantages are faster
access, rapid deployment, better insights and minimal IT involvement.
▪ In-Database Processing: In-database processing is also called in-database
analytics. It works by fusing data warehouses with analytical systems. Typically, the
data from various enterprise OLTP systems is cleaned up through the process
of ETL and stored in the enterprise data warehouse or data marts. The huge data sets
are then exported to analytical programs for complex and extensive computations.
In-database processing removes this export step by running the computations inside the
database itself.
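
The difference between exporting data for analysis and processing it in the database can be sketched with Python's built-in `sqlite3` module. This is only a small-scale analogy under assumed data: the `sales` table and its rows are hypothetical, and a real warehouse would use a far larger engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 60.0), ("south", 40.0)],
)

# Export-then-compute: every row leaves the database before it is aggregated.
exported_rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals_exported = {}
for region, amount in exported_rows:
    totals_exported[region] = totals_exported.get(region, 0.0) + amount

# In-database: the aggregation runs inside the engine; only the results come back.
totals_in_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

print(totals_exported)
print(totals_in_db)
```

Both approaches give the same totals; what changes is how much data has to move. With four rows the difference is invisible, but with warehouse-sized tables the export step tends to dominate.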



▪ Symmetric Multi-Processor System: In SMP, a single common main memory is
shared by two or more identical processors. The processors have full access to all I/O
devices and are controlled by a single operating system instance. SMP systems are tightly
coupled multiprocessor systems. Each processor has its own high-speed cache memory, and
the processors are connected by a system bus.



▪ Massively Parallel Processing: Massively parallel processing (MPP) refers to the
coordinated processing of programs by a number of processors working in parallel. The
processors each have their own OS and dedicated memory. They work on different parts of
the same program. The MPP processors communicate using some sort of messaging
interface. MPP differs from symmetric multiprocessing (SMP) in that SMP processors
share the same OS and the same memory. SMP is also referred to as tightly coupled
multiprocessing.



▪ Shared nothing Architecture: The three most common types of architecture for
multiprocessor systems are:

1. Shared memory architecture: a common central memory is shared by multiple
processors.

2. Shared disk architecture: multiple processors share a common collection of disks
while having their own private memory.

3. Shared nothing architecture: neither memory nor disk is shared among multiple
processors.



Advantages of shared nothing architecture:

• Fault Isolation: A “shared nothing architecture” provides the benefit of isolating faults. A
fault in a single node is contained and confined to that node exclusively and is exposed only
through messages or the lack of them.

• Scalability: Assume that the disk is a shared resource. This implies that the controller and the
disk bandwidth are also shared. Synchronization has to be implemented to maintain a
consistent shared state, which means that different nodes have to take turns to
access the critical data. This imposes a limit on how many nodes can be added to a
distributed shared disk system, thus compromising scalability. A shared nothing
architecture has no such shared resource, so it avoids this limit.
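
A minimal sketch of the idea follows, assuming a hash-partitioned key-value store. The `Node` and `Cluster` classes are hypothetical and exist only for this illustration; real systems add replication and rebalancing, which are omitted here.

```python
import zlib


class Node:
    """A node with private storage; nothing is shared with other nodes."""

    def __init__(self, name):
        self.name = name
        self.store = {}    # private to this node
        self.alive = True


class Cluster:
    """Routes each key to exactly one owning node by hashing the key."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def owner(self, key):
        # crc32 is stable across runs, so a key always maps to the same node.
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key, value):
        node = self.owner(key)
        if not node.alive:
            raise ConnectionError(f"{node.name} is down")
        node.store[key] = value

    def get(self, key):
        node = self.owner(key)
        if not node.alive:
            raise ConnectionError(f"{node.name} is down")
        return node.store[key]


if __name__ == "__main__":
    cluster = Cluster(["n0", "n1", "n2"])
    for i in range(10):
        cluster.put(f"user-{i}", i)
    cluster.owner("user-0").alive = False  # simulate one node failing
    for i in range(10):
        try:
            print(f"user-{i}", cluster.get(f"user-{i}"))
        except ConnectionError as err:
            print(f"user-{i}", "unavailable:", err)
```

Because each key lives on exactly one node, a failure affects only the keys that node owns (fault isolation), and no node ever waits on another node's disk or memory.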



CAP Theorem:

The CAP theorem is also called Brewer’s theorem. It states that in a distributed computing
environment, it is impossible to provide all three of the following guarantees at the same time.

At best you can have two of the three; one must be sacrificed.

1. Consistency

2. Availability

3. Partition tolerance



1. Consistency implies that every read fetches the last write. It means that all
nodes see the same data at the same time. If there are multiple replicas and an
update is being processed, all users see the update go live at the same time, even if they are
reading from different replicas.

2. Availability implies that reads and writes always succeed. It is a guarantee that
every request receives a response about whether it succeeded or failed.

3. Partition tolerance implies that the system continues to function when a network
partition occurs. It means that the system continues to operate despite arbitrary message
loss or failure of part of the system.
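
The trade-off between these guarantees can be sketched with a toy two-replica store. Everything here is an assumption made for illustration: the class names are hypothetical, replication is synchronous, and the "CP" mode simply refuses all writes during a partition, which is cruder than what real systems do.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store showing the choice a network partition forces."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by refusing the write: availability is given up.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the replicas may now disagree.
            replica.data[key] = value
            return
        # Healthy network: replicate synchronously to both replicas.
        replica.data[key] = value
        other.data[key] = value

    def read(self, replica, key):
        return replica.data.get(key)


if __name__ == "__main__":
    ap = TwoNodeStore("AP")
    ap.write(ap.a, "x", 1)
    ap.partitioned = True
    ap.write(ap.a, "x", 2)
    print(ap.read(ap.a, "x"), ap.read(ap.b, "x"))  # the replicas now differ
```

In "AP" mode the store keeps answering during the partition but a read from the other replica returns stale data; in "CP" mode the replicas never disagree, at the cost of rejecting writes until the partition heals.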



Big Data in
Business Context



Case study of Big data solution
• Undoubtedly, Big Data has become a major game changer in most cutting-edge industries
over the last few years.
• As Big Data keeps growing day by day, the number of organizations adopting Big Data
keeps expanding.
• Let’s discuss an example:
  • An e-commerce site XYZ (with 100 million users) wants to offer a $100 gift voucher to the top 10
  customers who spent the most in the previous year.
  • Moreover, it wants to find the buying trends of these customers so that the company can suggest
  more items related to them.
• Issue: A huge amount of unstructured data needs to be stored, processed and analyzed.
• Solution:
  • Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System),
  which uses commodity hardware to form clusters and stores data in a distributed fashion. It works on
  the write once, read many times principle.
  • Processing: The MapReduce paradigm is applied to the data distributed over the network to find the
  required output.
  • Analysis: Pig and Hive can be used to analyze the data.
  • Cost: Hadoop is open source, so cost is no longer an issue.
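
The processing step of this case study can be sketched in plain Python. This is a single-process imitation of the map, shuffle and reduce phases, not Hadoop code; the order records and function names are hypothetical, and a real job would run the same phases in parallel over data stored in HDFS.

```python
from collections import defaultdict
import heapq

# Hypothetical order records: (customer_id, amount_spent).
orders = [("c1", 50.0), ("c2", 20.0), ("c1", 30.0), ("c3", 75.0), ("c2", 10.0)]


def map_phase(records):
    """Map: emit one (customer_id, amount) pair per order."""
    for customer_id, amount in records:
        yield customer_id, amount


def shuffle(pairs):
    """Shuffle: group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: total spend per customer."""
    return {key: sum(values) for key, values in grouped.items()}


def top_customers(totals, n=10):
    """Pick the n customers with the highest total spend."""
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    totals = reduce_phase(shuffle(map_phase(orders)))
    print(top_customers(totals, n=2))
```

With the sample data, the two highest spenders are `c1` (80.0) and `c3` (75.0); the site in the case study would use `n=10`.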



Where are businesses finding uses for
Big Data?



Walmart
• Walmart is the biggest retailer in the world and the world’s biggest organization by revenue.
• It has approx. 2 million workers and 20,000 stores in 28+ nations.
• It started using Big Data concepts at an early stage.
• It used data mining to find patterns that can be used to give product suggestions to
customers, depending on which products were bought together.
• Based on the data mining results, it has increased its customer conversion rate.
• Walmart’s main aim is to retain customers and enhance their experience.
• Hadoop and NoSQL technologies are used to provide customers with real-time data
gathered from various sources and to put that data to effective use.
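
The idea of suggesting products based on what was bought together can be illustrated with a simple co-occurrence count. This is a generic sketch, not Walmart's actual method; the baskets and product names are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; product names are illustrative only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def pair_counts(baskets):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def suggest(product, counts, k=1):
    """Return the k products most often bought together with `product`."""
    related = Counter()
    for (first, second), n in counts.items():
        if first == product:
            related[second] += n
        elif second == product:
            related[first] += n
    return [item for item, _ in related.most_common(k)]


if __name__ == "__main__":
    counts = pair_counts(baskets)
    print(suggest("bread", counts))
```

Here bread and butter share three baskets, so butter is the top suggestion for a customer buying bread.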



Uber
• It is a first choice for people around the globe for moving people and making
deliveries.
• It uses the personal data of its users to closely monitor which features of the
service are used.
• It does this to analyze usage patterns and to figure out where the service should be more focused.
• It focuses on the supply and demand of its services, which causes the prices of the services
provided to change.
• One major use of this data is surge pricing, which in turn influences the rate of demand.



Netflix
• It is a very popular entertainment company that provides online, on-demand video
streaming for its customers.
• It has set out to use Big Data to predict precisely what its customers will enjoy
watching.
• Recently, Netflix has begun positioning itself as a content creator, not simply a distribution
medium, a strategy firmly based on data analytics.
• Its recommendation engines draw on data such as what customers watch, how often playback
is stopped, ratings and so on.
• Its data infrastructure incorporates Hadoop, Hive and Pig along with other traditional
business intelligence tools.
