1. We suggest you make your own notes/bookmarks while reading the chapter.
2. We suggest you refer to the learning resources suggested at the end of this chapter and the exercises to get a grip on this topic.
1.1 CLASSIFICATION OF DIGITAL DATA
As depicted in Figure 1.1, digital data can be broadly classified into structured, semi-structured, and unstructured data.
1. Unstructured data: This is data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80-90% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, papers, the body of an email, etc.
2. Semi-structured data: This is data which does not conform to a data model but has some structure. However, it is not in a form which can be used easily by a computer program; for example, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
3. Structured data: This is data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.
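The three classes above can be illustrated with a short sketch. It is only an illustration, not part of any standard: the sample employee records and the free-text message are hypothetical, and only Python's standard `csv` and `xml.etree.ElementTree` modules are used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns under a fixed header (hypothetical sample data).
structured = "emp_id,name,dept\n101,Asha,Sales\n102,Ravi,Finance\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tags give some structure, but records need not share
# the same fields; the second record has a <phone> the first one lacks.
semi_structured = (
    "<employees>"
    "<employee><name>Asha</name></employee>"
    "<employee><name>Ravi</name><phone>555-0100</phone></employee>"
    "</employees>"
)
root = ET.fromstring(semi_structured)
tags_per_record = [[child.tag for child in emp] for emp in root]

# Unstructured: free text; without further processing a program sees
# only a sequence of characters.
unstructured = "Hi team, Ravi from Finance will join the Sales review on Friday."

print(rows[1]["dept"])   # a column lookup works directly on structured data
print(tags_per_record)   # field sets differ from record to record
print(len(unstructured)) # only generic properties such as length are available
```

Note how the structured rows can be queried by column name straight away, whereas the XML records must be inspected one by one and the free text offers no fields at all.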
Since the 1980s most of the enterprise data has been stored in relational databases complete with rows/records/tuples, columns/attributes/fields, primary keys, foreign keys, etc. Over a period of time, the Relational Database Management System (RDBMS) matured and the RDBMS, as they are available today, are robust, cost-effective, and efficient. We have grown comfortable working with RDBMS, and the storage, retrieval, and management of data has been immensely simplified. The data held in RDBMS is typically structured data. However, with the Internet connecting the world, data that existed beyond the enterprise started to become an integral part of daily transactions. This data grew by leaps and bounds, so much so that it became difficult for the enterprises to ignore it. Not all of this data was structured. In fact, Gartner estimates that almost 80% of data generated in any enterprise today is unstructured. Roughly around 10% of data is in the structured and semi-structured categories.
Let us ask a very basic question: when do we say that the data is structured? When data conforms to a pre-defined schema/structure, we say it is structured data.
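The idea of conforming to a pre-defined schema can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `SCHEMA` dictionary, the `conforms` helper, and the sample records are all hypothetical names introduced here, and a real RDBMS enforces far more (keys, constraints, relationships).

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"emp_id": int, "name": str, "dept": str}

def conforms(record: dict, schema: dict = SCHEMA) -> bool:
    """Return True if the record has exactly the schema's fields, each with the expected type."""
    if record.keys() != schema.keys():
        return False
    return all(isinstance(record[field], ftype) for field, ftype in schema.items())

good = {"emp_id": 101, "name": "Asha", "dept": "Sales"}
extra_field = {"emp_id": 102, "name": "Ravi", "dept": "Finance", "phone": "555-0100"}
wrong_type = {"emp_id": "103", "name": "Meena", "dept": "HR"}

for record in (good, extra_field, wrong_type):
    print(record["name"], conforms(record))
```

Only the first record passes: the second carries a field the schema does not define, and the third stores `emp_id` as text rather than as a number.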
1970s and before was the era of mainframes. The data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s. The era was of data-intensive applications. The World Wide Web and the Internet of Things have led to an onslaught of structured, unstructured, and multimedia data. Refer Table 2.1.
Table 2.1 The evolution of big data

Data Generation and Storage     Data Utilization               Data Driven
------------------------------  -----------------------------  ---------------------------------------------------
Mainframes: Basic data storage  Relational databases:          Structured data, unstructured data, multimedia data
                                Data-intensive applications
1970s and before                Relational (1980s and 1990s)   2000s and beyond
DEFINITION OF BIG DATA
If we were to ask you the simple question: "Define Big Data", what would your answer be? Well, we will give you a few responses that we have heard over time:
1. Anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
2. Today's BIG may be tomorrow's NORMAL.
3. Terabytes or petabytes or zettabytes of data.
4. I think it is about 3 Vs.
Refer Figure 2.2. Well, all of these responses are correct. But it is not just one of these; in fact, big data is all of the above and more.
Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Source: Gartner IT Glossary

The 3Vs concept was proposed by the analyst Doug Laney in a 2001 MetaGroup research publication, 3D Data Management: Controlling Data Volume, Velocity and Variety.
Source: blogs.gartner.com/doug-l...
For the sake of easy comprehension, we will look at the definition in three parts. Refer Figure 2.3.
Figure 2.3 Definition of big data - Gartner.
Part I of the definition, "big data is high-volume, high-velocity, and high-variety information assets", talks about voluminous data (humongous data) that may have great variety (a good mix of structured, semi-structured, and unstructured data) and will require a good speed/pace for storage, preparation, processing, and analysis.
Part II of the definition, "cost-effective, innovative forms of information processing", talks about embracing new techniques and technologies to capture (ingest), store, process, persist, integrate, and visualize the high-volume, high-velocity, and high-variety data.
Part III of the definition, "enhanced insight and decision making", talks about deriving deeper, richer, and meaningful insights and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.
Data → Information → Actionable intelligence → Better decisions → Enhanced business value
2.4 CHALLENGES WITH BIG DATA
Refer Figure 2.4. Following are a few challenges with big data:
1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading is concerned. This further complicates the decision to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may quickly become irrelevant and obsolete just a few hours after having been generated.
Data visualization is becoming popular as a separate discipline. We are short by quite a number, as far as visualization experts are concerned.
2.5 WHAT IS BIG DATA?
Big data is data that is big in volume, velocity, and variety. Refer Figure 2.5.
2.5.1 Volume
We have seen it grow from bits to bytes to petabytes and exabytes. Refer Table 2.2 and Figure 2.6.
Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes → Petabytes → Exabytes → Zettabytes → Yottabytes
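To get a feel for this ladder of units, the sketch below converts a raw byte count into the largest convenient unit. It is an illustration only: the `humanize` function is a hypothetical helper, and the default step of 1000 between units is an assumption (pass `base=1024` if you prefer the binary convention).

```python
UNITS = ["Bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
         "Petabytes", "Exabytes", "Zettabytes", "Yottabytes"]

def humanize(num_bytes: int, base: int = 1000) -> str:
    """Express a byte count in the largest unit that keeps the value below `base`."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit available.
        if value < base or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= base

def bits_to_bytes(num_bits: int) -> float:
    """One byte is eight bits."""
    return num_bits / 8

print(humanize(1_500))
print(humanize(10**15))
print(humanize(1024**4, base=1024))
print(bits_to_bytes(64))
```

Each step up the ladder multiplies the previous unit by the chosen base, which is why the jump from terabytes to zettabytes is a factor of a billion rather than a factor of three.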
2.5.1.1 Where Does This Data get Generated?
There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on YouTube, a chat conversation on Internet Messenger, a customer feedback form on an online ...
2.6 OTHER CHARACTERISTICS OF DATA WHICH ARE NOT DEFINITIONAL TRAITS OF BIG DATA
There are yet other characteristics of data which are not necessarily the definitional traits of big data. A few of these are listed as follows:
1. Veracity and validity: Veracity concerns the trustworthiness of the data: "Is all the data that is mined and analyzed meaningful and pertinent to the problem under consideration?" Validity refers to the accuracy and correctness of the data. Any data that is picked up for analysis needs to be accurate. This is not true of big data alone.
2. Volatility: Volatility of data deals with how long the data is valid and how long it should be stored. There is some data that is required for long-term decisions and remains valid for longer periods of time. However, there are also pieces of data that quickly become obsolete minutes after their generation.
3. Variability: Data flows can be highly inconsistent with periodic peaks.
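The volatility trait listed above can be made concrete with a small retention check. This is a sketch under assumptions, not a recommended policy: the `RETENTION` periods, the kinds of data, and the `is_expired` helper are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods per kind of data (assumed values).
RETENTION = {
    "sensor_reading": timedelta(minutes=30),    # obsolete soon after generation
    "sales_record": timedelta(days=365 * 7),    # kept for long-term decisions
}

def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """Return True once a record is older than the retention period for its kind."""
    return now - created_at > RETENTION[kind]

now = datetime(2024, 1, 1, 12, 0)
old_reading = is_expired("sensor_reading", datetime(2024, 1, 1, 11, 0), now)
old_sale = is_expired("sales_record", datetime(2023, 1, 1), now)
print(old_reading, old_sale)
```

The hour-old sensor reading has already outlived its usefulness, while the year-old sales record is still well within its retention period.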