
Big Data Technology

1
Chapter 1: Introduction to Big Data

• Big Data Overview
• Background of Data Analytics
• Role of Distributed System in Big Data
• Role of Data Scientist
• Current Trend in Big Data Analytics

2
Big Data Overview

• Big data is the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications.

• “Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making.” (Gartner)

• Big Data refers to datasets whose size is beyond the ability of typical database
software tools to capture, store, manage and analyze.

3
4
Big Data Everywhere!

Lots of data is being collected and warehoused:
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/Credit Card transactions
• Social Networks

5
Who’s Generating Big Data

• Progress and innovation are no longer hindered by the ability to collect data
• But, by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable
fashion

6
Mobile devices
(tracking all objects all the time)

Social media and networks
(all of us are generating data)

Sensor technology and networks
(measuring all kinds of data)

Scientific instruments
(collecting all sorts of data)

7
How much data?

• 500 Million Tweets sent each day!


• More than 4 Million Hours of content uploaded to Youtube every day!
• 3.6 Billion Instagram Likes each day.
• 4.3 BILLION Facebook messages posted daily!
• 5.75 BILLION Facebook likes every day.
• 40 Million Tweets shared each day!
• 6 BILLION daily Google Searches!

https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/
8
Disk Storage
1 Bit= Binary Digit
8 Bits=1 Byte
1000 Bytes=1 Kilobyte
1000 Kilobytes=1 Megabyte
1000 Megabytes=1 Gigabyte
1000 Gigabytes=1 Terabyte
1000 Terabytes=1 Petabyte
1000 Petabytes=1 Exabyte
1000 Exabytes=1 Zettabyte
1000 Zettabytes=1 Yottabyte
1000 Yottabytes=1 Brontobyte
1000 Brontobytes=1 Geopbyte

Processor or Virtual Storage


1 Bit=Binary Digit
8 Bits=1 Byte
1024 Bytes=1 Kilobyte
1024 Kilobytes=1 Megabyte
1024 Megabytes=1 Gigabyte
1024 Gigabytes=1 Terabyte
1024 Terabytes=1 Petabyte
1024 Petabytes=1 Exabyte
1024 Exabytes=1 Zettabyte
1024 Zettabytes=1 Yottabyte
1024 Yottabytes=1 Brontobyte
1024 Brontobytes=1 Geopbyte
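The two unit ladders above differ only in their step factor: disk storage uses decimal steps of 1000, while processor/virtual storage uses binary steps of 1024. A minimal sketch of both conventions (the function name and prefix list are illustrative, not from the slides):

```python
DECIMAL_STEP = 1000   # disk storage convention
BINARY_STEP = 1024    # processor / virtual storage convention

PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa",
            "zetta", "yotta", "bronto", "geop"]

def bytes_in(prefix: str, binary: bool = False) -> int:
    """Bytes in one <prefix>byte under the chosen convention."""
    step = BINARY_STEP if binary else DECIMAL_STEP
    return step ** (PREFIXES.index(prefix) + 1)

print(bytes_in("giga"))               # 1000**3 = 1000000000
print(bytes_in("giga", binary=True))  # 1024**3 = 1073741824
```

The roughly 7% gap per step between the two conventions is why a "1 TB" disk reports less than a terabyte in an operating system that counts in powers of 1024.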

9
Understanding Large Numbers

Name              Number of Zeros   Groups of (3) Zeros
Ten               1                         (10)
Hundred           2                         (100)
Thousand          3                 1       (1,000)
Ten thousand      4                         (10,000)
Hundred thousand  5                         (100,000)
Million           6                 2       (1,000,000)
Billion           9                 3       (1,000,000,000)
Trillion          12                4       (1,000,000,000,000)
Quadrillion       15                5
Quintillion       18                6
Sextillion        21                7
Septillion        24                8
Octillion         27                9
Nonillion         30                10
…
10
3Vs of Big Data

Big Data Technology is a new set of approaches for analyzing data sets that
were not previously accessible because they posed challenges across one or
more of the “3 V’s” of Big Data.

• Volume - too Big – Terabytes and more of Credit Card Transactions, Web
Usage data, System logs
• Variety - too Complex – truly unstructured data such as Social Media,
Customer Reviews, Call Center Records
• Velocity - too Fast - Sensor data, live web traffic, Mobile Phone usage,
GPS Data
11
Some Make it 4V’s

12
13
Volume

• Refers to the vast amounts of data generated every second.


• We are not talking Terabytes but Zettabytes or Brontobytes.
• If we take all the data generated in the world between the beginning
of time and 2008, the same amount of data will soon be generated every
minute. This makes most data sets too large to store and analyze using
traditional database technology.
• New big data tools use distributed systems so that we can store and
analyze data across databases that are dotted around anywhere in the
world.

14
Variety

• Refers to the different types of data we can now use.


• In the past we only focused on structured data that neatly fitted into tables
or relational databases, such as financial data.
• In fact, 80% of the world’s data is unstructured (text, images, video,
voice, etc.)
• With big data technology we can now analyze and bring together data of
different types such as messages, social media conversations, photos,
sensor data, video or voice recordings.

15
Velocity

Refers to the speed at which new data is generated and the speed at which
data moves around. Just think of social media messages going viral in
seconds.

Technology allows us now to analyze the data while it is being generated
(sometimes referred to as in-memory analytics), without ever putting it
into databases.
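A minimal sketch of the in-memory analytics idea described above: compute a running statistic over readings as they arrive, without ever storing the stream itself. The class name and the readings are made-up illustrations, not from the slides.

```python
class RunningMean:
    """Incrementally updated mean; holds only a count and a total."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> float:
        # Fold the new reading in and return the mean so far.
        self.count += 1
        self.total += value
        return self.total / self.count

meter = RunningMean()
for reading in [3.0, 5.0, 4.0, 8.0]:   # stand-in for a live sensor feed
    latest = meter.update(reading)
print(latest)  # 5.0 — mean of the four readings seen so far
```

The point of the design is velocity: memory use stays constant no matter how many readings flow past, so the analysis keeps pace with the stream.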

16
Veracity

• Refers to the messiness or trustworthiness of the data.

• With many forms of big data, quality and accuracy are less
controllable (just think of Twitter posts with hashtags, abbreviations,
typos and colloquial speech as well as the reliability and accuracy
of content) but big data and analytics technology now allows us to work
with these types of data.

• The volumes often make up for the lack of quality or accuracy.

17
Value

Having access to big data is no good unless we can turn it into value.
Companies are starting to generate amazing value from their big data.
18
The 7 Vs of Big Data

Validity
• The interpreted data having a sound basis in logic or fact – is a result of
the logical inferences from matching data.
Volume − Validity = Worthlessness
Visibility
• The state of being able to see or be seen – is implied

https://livingstoneadvisory.com/2013/06/vs-big-data/

19
The 10 Vs of Big Data

Variability
• Big data is also variable because of the multitude of data dimensions resulting from
multiple disparate data types and sources.
Vulnerability
• Big data brings new security concerns. After all, a data breach with big data is a big
breach.
Volatility
• How old does your data need to be before it is considered irrelevant, historic, or not
useful any longer? How long does data need to be kept for?

https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx

20
Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
21
Challenges in Handling Big Data

• The bottleneck is in technology: new architectures, algorithms and
techniques are needed.
• It is also in technical skills: experts in using the new technology and
dealing with big data.
22
Big Data Challenges

Capturing Storing Searching

Sharing Analyzing Visualization

23
Background of Data Analytics

• Big data analytics is the process of examining large amounts of data of a
variety of types.

• The primary goal of big data analytics is to help companies make better
business decisions.

• It analyzes huge volumes of transaction data as well as other data sources
that may be left untapped by conventional business intelligence (BI)
programs.

24
Data Analytics

Big data analytics uncovers:
• Hidden patterns.
• Unknown correlations and other useful information.
• Such information can provide business benefits, such as more effective
marketing and increased revenue.
25
• Big data analytics can be done with the software tools commonly used as
part of advanced analytics disciplines such as predictive analysis and
data mining.

• But the unstructured data sources used for big data analytics may not
fit in traditional data warehouses.

• Traditional data warehouses may not be able to handle the processing
demands posed by big data.

26
• The technologies associated with big data analytics include NoSQL
databases, Hadoop and MapReduce.
• These technologies form the core of an open-source software framework
that supports the processing of large data sets across clustered systems.
• Common challenges for Big Data analytics initiatives include:
  • internal data analytics skills;
  • the high cost of hiring experienced analytics professionals;
  • challenges in integrating Hadoop systems and data warehouses.

27
• Big Analytics delivers competitive advantage compared to the traditional
analytical model.

• Big Analytics describes the efficient use of a simple model applied to
volumes of data that would be too large for the traditional analytical
environment.

• Research suggests that a simple algorithm with a large volume of data is
more accurate than a sophisticated algorithm with little data.

28
Big Analytics supports the following objectives for working with Big Data
Analytics:

1. Avoid sampling / aggregation;
2. Reduce data movement and replication;
3. Bring the analytics as close as possible to the data;
4. Optimize computation speed.

29
The Process of Data Analytics

30
31
Data Analytics Process: Discovery

The knowledge discovery phase involves:
• gathering data to be analyzed;
• pre-processing it into a format that can be used;
• consolidating it for analysis;
• analyzing it to discover what it may reveal;
• interpreting it to understand the processes by which the data was
analyzed and how conclusions were reached.

32
Data Analytics Process: Discovery

Acquisition
• Data acquisition involves collecting or acquiring data for analysis.
• Acquisition requires access to information and a mechanism for
gathering it.

Pre-processing
• Pre-processing is necessary if analytics is to yield trustworthy, useful
results.
• It places the data in a standard format for analysis.

33
Data Analytics Process: Discovery

Integration
• Integration involves consolidating data for analysis.
• Retrieving relevant data from various sources for analysis
• Eliminating redundant data or clustering data to obtain a smaller
representative sample.
Analysis
• Searching for relationships between data items in a database or
exploring data in search of classifications or associations.
• Analysis can yield descriptions or predictions.
• Based on interpretation of the analysis, organizations can determine
whether and how to act on the results.
34
Data Analytics Process: Discovery

Interpretation
• Analytic processes are reviewed by data scientists to understand results
and how they were determined.
• Interpretation involves retracing methods, understanding choices made
throughout the process and critically examining the quality of the
analysis.
• It provides the foundation for decisions about whether analytic
outcomes are trustworthy.

35
Data Analytics Process: Application

Application
• Associations discovered amongst data in the knowledge discovery phase of
the analytic process are incorporated into an algorithm and applied.
• In the application phase organizations reap the benefits of knowledge
discovery.
• Through application of derived algorithms, organizations make
determinations upon which they can act.
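The discovery phases above can be illustrated end to end with a toy pass over made-up transaction records (source contents and the "amount" field are illustrative assumptions, not from the slides):

```python
# Acquisition: raw records collected from two sources (amounts as strings).
source_a = [{"amount": "12.50"}, {"amount": "7.25"}]
source_b = [{"amount": "12.50"}, {"amount": "30.00"}]  # shares a duplicate

# Pre-processing: place everything in a standard format (floats).
cleaned = [float(r["amount"]) for r in source_a + source_b]

# Integration: consolidate the sources and eliminate redundant records.
integrated = sorted(set(cleaned))

# Analysis: a simple descriptive statistic over the integrated data.
average = sum(integrated) / len(integrated)
print(integrated, round(average, 2))  # [7.25, 12.5, 30.0] 16.58
```

Interpretation and application then sit on top of this: a data scientist would retrace each step to judge whether the average is trustworthy before the organization acts on it.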

36
A Brief History of Big Data

18,000 BCE
• Humans use tally sticks to record data for the first time. These are used
to track trading activity and record inventory.
2400 BCE
• The abacus is developed, and the first libraries are built in Babylonia.
300 BCE – 48 AD
• The Library of Alexandria is the world’s largest data storage center –
until it is destroyed by the Romans.

37
A Brief History of Big Data

c. 150 – 100 BCE
• The Antikythera Mechanism, the first known mechanical computer, is built in Greece.
1663
• John Graunt conducts the first recorded statistical-analysis experiments in an
attempt to curb the spread of the bubonic plague in Europe.
1865
• The term “business intelligence” is used by Richard Millar Devens in his
Encyclopedia of Commercial and Business Anecdotes.

38
A Brief History of Big Data

1881
• Herman Hollerith creates the Hollerith Tabulating Machine which uses
punch cards to vastly reduce the workload of the US Census.
1926
• Nikola Tesla predicts that in the future, a man will be able to
access and analyze vast amounts of data using a device small enough to
fit in his pocket.
1928
• Fritz Pfleumer creates a method of storing data magnetically, which
forms the basis of modern digital data storage technology.
39
A Brief History of Big Data

1944
• Fremont Rider speculates that the Yale Library will contain 200 million
books stored on 6,000 miles of shelves by 2040.
1958
• Hans Peter Luhn defines Business Intelligence as “the ability to
apprehend the interrelationships of presented facts in such a way as to
guide action towards a desired goal.”
1965
• The US Government plans the world’s first data center to store 742
million tax returns and 175 million sets of fingerprints on magnetic
tape.
40
A Brief History of Big Data

1970
• The relational database model is developed by IBM mathematician Edgar F. Codd.
• The hierarchical file system allows records to be accessed using a simple
index system. This means anyone can use databases, not just computer scientists.
1976
• Material Requirements Planning (MRP) systems are commonly used in business.
Computer and data storage is used for everyday routine tasks.
1989
• Early use of the term Big Data in a magazine article by fiction author
Erik Larson, commenting on advertisers’ use of data to target customers.

41
A Brief History of Big Data

1991
• The World Wide Web goes public. Anyone can now go online and upload
their own data, or analyze data uploaded by other people.
1996
• The price of digital storage falls to the point where it is more cost-
effective than paper.
1997
• Google launches its search engine, which will quickly become the most
popular in the world.
• Michael Lesk estimates the digital universe is increasing tenfold in
size every year.
42
A Brief History of Big Data
1999
• First use of the term Big Data in an academic paper – Visually Exploring Gigabyte
Datasets in Real-time (ACM).
• First use of term Internet of Things, in a business presentation by Kevin Ashton to
Procter and Gamble.
2001
• Three “Vs” of Big Data – Volume, Velocity, Variety – defined by Doug Laney.
2003
• Google File System paper published
2005
• Hadoop – an open-source Big Data framework now developed by Apache – is
developed.
• The birth of “Web 2.0 – the user-generated web”.
43
A Brief History of Big Data

2008
• Globally 9.57 zettabytes (9.57 trillion gigabytes) of information is processed by
the world’s CPUs.
• An estimated 14.7 exabytes of new information is produced this year.
2009
• The average US company with over 1,000 employees is storing more than 200
terabytes of data according to the report Big Data: The Next Frontier for
Innovation, Competition and Productivity by McKinsey Global Institute.
2010
• Eric Schmidt, executive chairman of Google, tells a conference that as much data
is now being created every two days, as was created from the beginning of
human civilization to the year 2003.
44
A Brief History of Big Data

2011
• The McKinsey report states that by 2018 the US will face a shortfall
of between 140,000 and 190,000 professional data scientists, and
warns that issues including privacy, security and intellectual
property will have to be resolved before the full value of Big
Data will be realized.
2014
• Mobile internet use overtakes desktop for the first time.
• 88% of executives responding to an international survey by GE say
that big data analysis is a top priority
45
Role of Distributed System in Big Data

What is a Distributed System?
• Consists of a collection of autonomous computers, connected through a
network and distribution middleware.
• Enables computers to coordinate their activities and to share the
resources of the system.
• Users perceive the system as a single, integrated computing facility.

46
Big data is distributed data

• Big data is distributed data: the data is so massive it cannot be stored
or processed by a single node.

• The way to scale fast and affordably is to use commodity hardware to
distribute the storage and processing of our massive data streams across
several nodes, adding and removing nodes as needed.
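A minimal sketch of spreading records across commodity nodes as described above: each record key is hashed to pick a node, so storage and processing distribute across the cluster. The node names and key are made up for illustration; real systems add replication and rebalancing on top of this idea.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical commodity nodes

def node_for(key: str) -> str:
    """Assign a record key to a node (stable across runs)."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# The same key always lands on the same node; different keys spread out.
print(node_for("customer-42") == node_for("customer-42"))  # True
```

Adding or removing nodes changes `len(NODES)` and therefore most assignments, which is why production systems prefer consistent hashing over this plain modulo scheme.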

47
Distributed data generation is fueling big data growth

• The reason we have data problems so big that we need large-scale
distributed computing architectures to solve them is that the creation of
the data is also large-scale and distributed.

• Most of us walk around carrying devices that are constantly pulsing all
sorts of data into the cloud and beyond – our locations, our photos, our
tweets, our status updates, our connections, even our heartbeats.

48
“Hadoop” and “MapReduce”

Hadoop: an open-source platform for consolidating, combining and
understanding large-scale data in order to make better business decisions.

2 key parts to Hadoop:
• HDFS (Hadoop Distributed File System), which lets you store data
across multiple nodes.
• MapReduce, which lets you process data in parallel across multiple
nodes.
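The MapReduce model can be sketched with the classic word-count example: map emits (word, 1) pairs, the shuffle groups pairs by key, and reduce sums each group. In Hadoop the map and reduce steps run in parallel across nodes; this toy version runs locally on made-up input lines.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Sum all counts emitted for one word.
    return word, sum(counts)

lines = ["big data big analytics", "big data"]

grouped = defaultdict(list)            # the "shuffle": group pairs by key
for line in lines:                     # in Hadoop, one mapper per split
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 3, 'data': 2, 'analytics': 1}
```

Because each mapper sees only its own lines and each reducer only its own key group, the same program scales from this toy to cluster-sized inputs without changing the map and reduce logic.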

49
50
Who Are Data Scientists?

• High-ranking professionals with the training and curiosity to make
discoveries in the world of big data.
• The people who understand how to fish out answers to important business
questions from today's tsunami of unstructured information.
• The term was coined in 2008 by D.J. Patil and Jeff Hammerbacher.
• A hybrid of data hacker, analyst, communicator, and trusted adviser. The
combination is extremely powerful and rare.

51
Data Scientist

• The sudden appearance of Data Scientists on the business scene reflects
the fact that companies are now wrestling with information that comes in
varieties and volumes never encountered before.

• If the organization stores multiple petabytes of data, if the information
most critical to the business resides in forms other than rows and columns
of numbers, or if answering the biggest question would involve a “mashup”
of several analytical efforts, it has a big data opportunity.

52
53
Data scientist: a brand-new profession

• Data Scientist: The Sexiest Job of the 21st Century [Harvard Business
Review, 2012]
• Data scientist? A guide to 2015's hottest profession [Mashable, 2015]
• “It’s official – data scientist is the best job in America” [Forbes, 2016]
• “This hot new field promises to revolutionize industries from business to
government, health care to academia.” — The New York Times
54
Successful Data Scientist Characteristics
Intellectual curiosity, Intuition
• Find a needle in a haystack (something that is difficult to locate in a much larger
space)
• Ask the right questions – value to the business
Communication and engagements
Presentation skills
• Let the data speak but tell a story
• Storyteller – drive business value not just data insights
Creativity
• Guide further investigation
Business Savvy
• Discovering patterns that identify risks and opportunities
• Measure
55
Skills of
data
scientists

56
Role/Skill of Data Scientist

A Data Scientist should have the skill set to:
• use technologies that make taming big data possible, including Hadoop
(the most widely used framework for distributed file system
processing) and related open-source tools, cloud computing, and data
visualization.
• make discoveries while swimming in a pool of data
• bring structure to large quantities of formless data and make analysis
possible
• identify rich data sources, join them with other, potentially incomplete
data sources, and clean the resulting set

57
Role/Skill of Data Scientist

A Data Scientist should have the skill set to… (contd.):
• communicate what they’ve learned and suggest its implications for
new business directions.
• be creative in displaying information visually and making the patterns
they find clear and compelling.
• fashion their own tools and even conduct academic-style research.
• write code.
• desire to go beneath the surface of a problem, find the questions at its
heart, and distill them into a very clear set of hypotheses that can be
tested.

58
Data Scientist Job Description

• Amazon’s Shopper Marketing & Insights team focuses on serving the
advertisers and our overall ad business to provide strategic media
planning, customer insights, targeting recommendations, and
measurement and optimization of advertising.

• We are hiring outstanding Data Scientists who will use innovative
statistical and machine learning approaches to drive advertising
optimization and contribute to the creation of scalable insights. The ideal
candidate should have one hand on the white-board writing equations and
one hand on the keyboard writing code.

59
The 50 best jobs in America
60
Current Trend in Big Data Analytics

• Impact of Big Data on IoT
• A Wiser Role in Prescriptive Analysis
• Faster and More Accurate Machine Learning
• Dark Data Revelation
• More Use in Artificial Intelligence (AI)
• Enhanced Cybersecurity
• More Cognitive Technology in Use
• Big Data Intervention in Blockchain

https://www.whizlabs.com/blog/big-data-trends-in-2018/
61
Thank You!

62
