Dsbda Unit 1
Note : Throughout this book, terms such as Big Data Analytics and Machine Learning Models are used interchangeably. As of today, those two terms involve a similar set of tasks and are used to refer to analytics tasks carried out on a given set of data. So, do not get confused if you see references to machine learning or models anywhere. You are reading the correct book!

1.1 Introduction and Big Data Overview

University Question
Q. What is Big Data ? (SPPU - Aug. 18, 2 Marks; May 19, 3 Marks)

Increasingly, people and things are getting interconnected. Data is continuously created by devices and users. For example, when you go shopping online, all your clicks and views, your interaction with the website, your interaction with the competitor website, your additions to the cart, removals from the cart, price comparisons, review reading, etc. are all recorded to analyse you as a user and a prospective buyer who could be influenced to make a purchase. The entire agenda of conducting data analytics is based on making informed decisions that can be further used to shape your behaviour and drive the business intentions.

Have you heard about the company Cambridge Analytica? It was a political consulting firm that harvested data of about 87 million US voters during Trump's presidential campaign, starting in 2014. It built a system that could profile individual US voters in order to target them with personalised political advertisements. The result, everyone knows!

Data analytics combined with the right set of data is an extremely powerful mechanism today for businesses and nations. It can be used to derive meaningful predictions and shape user behaviour.
1.1.1 5 Vs (Characteristics) of Big Data (Data Explosion)

University Question
Q. Explain the five Vs (characteristics) of Big Data. (SPPU - Aug. 18, May 19, 5 Marks)

The term Big Data refers to massive datasets that are too large and complex to be stored, manipulated, and analysed by traditional data storage and processing systems, and that therefore require special tools and techniques.

Definition : Big Data (always capitalized and referred to as a noun) consists of extensive datasets that require a scalable architecture for efficient storage, manipulation, and analysis.

Big Data is usually characterised by five Vs (characteristics):
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value

Fig. 1.1.1 : The five Vs (characteristics) of Big Data

1. Volume

Volume refers to the size of the dataset. The size of a Big Data dataset typically ranges from terabytes to petabytes (a petabyte consists of 10^15 bytes), and such a dataset could consist of billions of rows. The data accumulated over time could grow too large for traditional database tools. For example, flipkart.com has millions of active users, and the clicks made by all of those users on a sale day could be accumulated into a single dataset for analysis.

2. Velocity

Velocity refers to the rate at which new data is generated and the speed at which it has to be ingested, stored, processed, and analysed. For example, mobile handset users could generate some 50 million tweets in a day on a particular topic. During a live cricket world cup match, users could generate several thousand tweets per second, and such tweets could be analysed in real time to produce live trends and scoring reports.

3. Variety

Variety refers to the different types and formats of data coming from various sources. The data could be structured, semi-structured, or unstructured in nature, and could arrive in a wide variety of structures such as database tables, texts, tweets, comments, "likes" and "shares", photos, images, audio, videos, and news reports. Carrying out meaningful analysis over such a variety of data is complex and may require multi-tiered storage and retrieval mechanisms.

4. Veracity

Veracity refers to the quality, integrity, and accuracy of the data. It is a measure of the usefulness and trustworthiness of the data. The collected data could be inaccurate, noisy, or incomplete; if the quality of the chosen dataset is low, the objectives of the analysis may not be met and the analysis could produce inaccurate results. For example, assuming that historical medical records of cancer and non-cancer patients were carefully compiled over several years, analyses carried out on such a dataset could produce useful and meaningful insights; if the records were inaccurate, the analysis would mislead.
The data should be free of noises, biases, and abnormality. The accuracy of the analysis results depends on the veracity of the chosen data, so an organisation should take measures to ensure that data with high veracity is stored and preserved for analysis.

5. Value

Value refers to the usefulness of the data for the chosen purpose, that is, the degree to which meaningful information can be generated out of the data. The value of the data to an organisation generally determines questions such as
(a) How long should the data be stored and preserved?
(b) How should the old and the recent data be treated?
(c) How much value would the analysis of the data generate for the organisation?
High-value data, carefully chosen for the application at hand, is one of the biggest assets of an organisation, and proponents of Big Data consider Value the most important of the five Vs.

1.1.2 Major Applications of Big Data

Today, Big Data analytics is used in various areas for various purposes. Some of the most common applications of Big Data are as shown in Fig. 1.1.2.

1. Medical and Health care
2. E-Commerce and Marketing
3. Financial Sector
4. Weather Patterns
5. Media and Entertainment
Fig. 1.1.2 : Major Applications of Big Data

1. Medical and Health Care

The advancement in Big Data analysis has driven the advancement of medical and health care research. Large data sets of patient records and research results could be analysed to
(a) Predict epidemics and the outbreak of diseases and viruses such as AIDS.
(b) Carry out genetic sequencing and research on genes to identify diseases such as cancer and Alzheimer's in their early stages.
(c) Predict a patient's chances of survival based on the analysis of historical patient data.
(d) Find cures for life-threatening diseases and identify any physical abnormality early.
(e) Reduce the overall cost of health care.

2. E-Commerce and Marketing

The e-commerce and marketing industry relies heavily on Big Data analysis. Some of the common applications are
(a) Identifying consumers' shopping patterns based on their transactions, likes, and past purchases, and pushing new products and customised offers to them. For example, based on your shopping patterns, companies could identify a pregnancy in a family at an early stage and push baby products to be sold around the infant's arrival.
(b) Predicting consumer demands and trends to build new products that are likely to be purchased.
(c) Placing products in warehouses closer to consumers so that they can be shipped faster, and maintaining adequate inventory to balance surplus against demand.
(d) Building marketing strategies for new products and services based on the analysis of transactions carried out throughout the year.

3. Financial Sector

The financial sector uses Big Data analysis heavily. Some of the common applications are
(a) Quoting the premiums of insurance products based on the analysis of consumer data; for example, the premium quoted to a consumer in Mumbai could differ from that quoted to a consumer in Delhi.
(b) Computing the credit score of a consumer based on the analysis of their transactions and account data, to identify whether the consumer is likely or unlikely to pay their dues.
(c) Detecting frauds in transactions; for example, it is highly unlikely that you would make physical purchases in both Mumbai and Delhi within one hour.
(d) Selling new financial and investment products to the consumers who are most likely to buy them, and predicting the yield of such products.
Data Science and Big Data Analytics (SPPU) - Introduction to Data Science and Big Data

Apart from these, the financial sector also uses Big Data for detecting money laundering, shell companies, fraudulent transactions, and reporting. Based on the transactions carried out by individuals, the companies can build a financial health profile of their consumers and identify future spending patterns and requirements.

4. Weather Patterns

Big Data analysis is crucial for detecting changes in the weather patterns. You would have heard about
(a) The rising ocean temperatures
(b) Global warming
(c) Melting glaciers in Antarctica
(d) Reducing oxygen levels
There is a huge amount of data that can be used to predict the weather changes and report how they are affecting our environment. It can be used to predict the weather forecast, natural disasters, and any other changes that could affect our well-being.

5. Media and Entertainment

The media and entertainment industry uses Big Data analysis to understand viewing and liking patterns for the media content. Based on the time of the day, the season, the device you are on, and your personal interests and taste, the content can automatically be recommended for you. I am sure you would have seen YouTube recommendations as you watch YouTube videos. Similarly, companies like Spotify can automatically create curated and customised playlists for you based on your listening profile.

1.1.3 Data Formats (Types)

Big Data comes in different formats. Data can be machine generated (such as log files) or human generated (such as tabular data). Overall, the data formats are classified as shown in Fig. 1.1.3.

1. Structured data
2. Semi-structured data
3. Unstructured data
Fig. 1.1.3 : Data Formats

1. Structured Data

Structured data exhibits a particular order (also known as model or schema) for storing and working with the data. The data attributes are usually related and are often the basis of analysis. Structured data is usually generated by machines or compiled by humans. For example, spreadsheets, customer records, transaction records, sales reports, etc. are all structured data. Structured data is usually stored in relational databases or in simple CSV or spreadsheet files.

2. Semi-structured Data

Semi-structured data has some definitive patterns for storage, but the data attributes may not be inter-related. The data could be hierarchical or graph-based in nature. Semi-structured data is usually stored in text files in XML, JSON or YAML formats. The common sources of semi-structured data are machines such as sensors, website feeds, or other application programs.

3. Unstructured Data

Unstructured data does not exhibit a fixed pattern or a particular schema. This is the most common format of Big Data. Examples of unstructured data are video, audio, tweets, likes, shares, text documents, PDFs, and scanned images. Special tools and mechanisms are required to process unstructured data. Also, it is usually cleaned (sanitised) before it can be used for analysis.
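The first two formats can be contrasted with a small sketch (the order records and field names below are invented for illustration; unstructured data such as images needs special tools and is not shown):

```python
import csv
import io
import json

# Structured data: a fixed schema, every record has the same columns.
csv_text = "order_id,amount,city\n101,250.0,Mumbai\n102,99.5,Pune\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["city"])  # every row is guaranteed to carry a "city" field

# Semi-structured data: hierarchical, and attributes may vary per record.
json_text = '{"order_id": 103, "items": [{"sku": "A1"}, {"sku": "B2", "gift": true}]}'
order = json.loads(json_text)
# Nested attributes must be navigated; "gift" exists only on some items.
skus = [item["sku"] for item in order["items"]]
print(skus)
```

Note how the CSV reader can assume the schema, while the JSON consumer has to navigate the hierarchy and tolerate optional attributes.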
This is covered in Unit 3 under the section "Sources of Big Data". Please refer Section 3.1.1.
2. Information
3. Knowledge, and
4. Wisdom

Fig. : The DIKW pyramid. Data plus context gives Information (hindsight); Information plus meaning gives Knowledge (insight); Knowledge plus understanding gives Wisdom (foresight). The value increases as you move up the pyramid.
(c) How many calls did you receive for a particular issue?
This kind of analytics is usually done using database queries or simple spreadsheet filters. You could have periodic dashboards and reports that can be used to visualise the results of the descriptive analytics.

2. Diagnostic Analytics

Diagnostic analytics is done to find out the cause of a phenomenon or derive the reasoning behind events. This analytics goes a level deeper to provide information that can be used to fix a particular situation or event. Diagnostic analytics usually adds more context to the data to get information about a particular interest. For example, following are a few questions that can be answered using diagnostic analytics.
(a) Why were the sales in quarter 2 lower than in quarter 1?
(b) Why are people falling ill after eating a particular type of biscuit?
(c) Why is model X of the car preferable over model Y?
Diagnostic analytics requires careful examination of data from multiple sources and is a little more involved and skilful an exercise than descriptive analytics.

3. Predictive Analytics

Predictive analytics is carried out to forecast and predict what could happen, based on the available data. For example, following are a few questions that can be answered using predictive analytics.
(a) What would be the improved life expectancy if choosing medicine A over medicine B?
(b) What would be the sales figure for model X of the car in the third quarter?
Predictive analytics assumes that a certain set of conditions is met or would exist. If there are changes to those conditions, then predictive analytics may not be accurate.

4. Prescriptive Analytics

Prescriptive analytics takes the results from predictive analytics and further adds human judgement to prescribe or advise further actions. This reflects the wisdom level from the DIKW pyramid that you learnt earlier. Prescriptive analytics could answer questions such as
(a) What should you do to delay cancer?
(b) What is the best time to leave home to reach the airport on time?
(c) Which medicine would have higher chances of survival for the patient?
Prescriptive analytics is the most difficult out of all the analytics. It requires significant skills and time to give effective actions and results. It could also be dependent not only on the analysed data but on external conditions such as political pressure, social acceptability, and personal preferences.

1.1.8 Comparison between Categories of Data Analytics

Table 1.1.2 : Comparison between categories of Data Analytics

                        Descriptive   Diagnostic    Predictive   Prescriptive
Value of results        Short term    Medium term   Long term    Very long term
Data enrichment level   Data          Information   Knowledge    Wisdom
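As a minimal sketch, the descriptive question "how many calls did you receive for a particular issue?" answered with a database query (the call-centre table and its rows are invented for illustration):

```python
import sqlite3

# A toy call-centre log; in practice this would be a real database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (issue TEXT, duration_min REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("billing", 4.0), ("billing", 6.5), ("network", 3.2)],
)

# Descriptive analytics: summarise what happened, e.g. calls per issue.
counts = dict(
    conn.execute("SELECT issue, COUNT(*) FROM calls GROUP BY issue")
)
print(counts)  # {'billing': 2, 'network': 1}
```

The same GROUP BY result could feed a periodic dashboard or report.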
The data analytics life cycle broadly has six phases. Each of these phases is worked through iteratively with the previous phase before moving to the next phase:
1. Discovery
2. Data preparation
3. Model planning
4. Model building
5. Communicate results
6. Operationalise
1.4.1 Phase 1: Discovery

In the Discovery phase, the data science team
1. Learns about the business problem to solve,
2. Investigates the problem,
3. Develops context and understanding,
4. Examines the available data sources, and
5. Formulates the initial hypothesis.

The team learns about the business domain in which the problem is to be solved. It assesses the resources available for the project and carries out the feasibility analysis. It spends time in framing the right problem.

Definition : Framing is the process of stating the analytics problem to be solved.

As part of the framing activity, the main objectives of the project are ascertained and the success criteria for the project are clearly defined. The team also develops the initial hypothesis that can later be substantiated with the data.

1.4.2 Phase 2: Data Preparation

The data preparation phase explores, pre-processes and conditions the data before modelling and analysis can be carried out. In this phase, the following activities are carried out.

1. Preparing the analytics environment : In this step, an isolated workspace is created in which the team can explore the data without interfering with the live data. The data from various data sources is collected in the isolated workspace.

2. Perform ETL process : ETL stands for Extract, Transform and Load. In this step, the raw data is extracted from the datastore, transformed as deemed right (removing noise, outliers, and biases from the data) and then loaded into the datastore again for analysis.

3. Learn about the data : Once the ETL process is complete, the team spends time in learning about the data and its attributes. Understanding the data itself is the key to building a good data model in the subsequent phase.

4. Data conditioning : In this step, the data is further cleaned and normalized by performing further transformations as required. The data from several sources could be joined or combined as required. The actual data attributes that would be used for analytics are decided.

5. Data visualisation : Once the data is in a clean state and ready to be analysed, it is a good idea to visualise it to identify patterns and explore data characteristics. Understanding patterns about the data enables building a perspective about the data model.

Some of the common tools used in this phase are as following. Note here that the following list is not exhaustive. The choice of tools largely depends on the problem at hand, desired outcomes, and the team's skills.

1. Apache Hadoop : The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

2. Apache Kafka : Apache Kafka is a distributed streaming platform. You can publish and subscribe to streams of records, store streams of records, and process streams of records as they occur.

3. Alpine Miner : Alpine Miner provides a graphical interface for creating analytics workflows and is optimised for fast experimentation, collaboration, and an ability to work within the database itself.

4. OpenRefine : OpenRefine is a powerful tool for working with messy data. It cleans the data and transforms it from one format into another.

1.4.3 Phase 3: Model Planning

In this phase, the team explores and evaluates the possible data models that could be applied to the given datasets to get the desired results. The team can try several models before finalising. Some of the major activities carried out in this phase are as following.

1. Data Exploration : The team spends time in understanding the available data and the various patterns and relationships amongst its attributes.
2. Model Selection : Based on the type of the data at hand (structured, semi-structured, or unstructured) and the desired outcome, an analytical technique or a statistical model is chosen out of the available techniques. The team could consult subject matter experts, business analysts, and other stakeholders, who might have a different viewpoint on how the dataset should be interpreted and which models could be applied. The chosen model is then examined against the given dataset.

Some of the common tools used in this phase are as following. Note here that the following list is not exhaustive. The choice of tools largely depends on the problem at hand, the desired outcomes, and the team's skills.

1. R : R is a free software language and environment for statistical computing and graphics. It provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible.

2. SQL Server Analysis Services : It is an analytical data engine used in decision support and business intelligence (BI) solutions. It supports tabular models, multidimensional models, and data mining, and provides access to the analysed data via client tools such as Power Pivot, Excel, SharePoint, and Reporting Services reports, and via connectors such as OLE DB, ODBC, and JDBC.

3. SAS/ACCESS : It provides integration between SAS and third-party databases by providing connectors for data access, so that data from multiple platforms and sources can be analysed together.

1.4.4 Phase 4: Model Building

In this phase, the team builds the data model that was planned (designed) in the previous phase. The available dataset is divided into datasets for training, testing, and production purposes. The model is trained on the training dataset, and the results it produces are then validated against the testing dataset. The team runs statistical tests on the results and checks whether the model meets the pre-established objectives. The model could be refined and executed several times before the team is confident about the results.

Some of the common tools used in this phase are as following. Note again that the list is not exhaustive; the choice of tools largely depends on the problem at hand, the desired outcomes, the team's skills, and your budget.

Free and open source tools

1. R : As described earlier, R provides a wide variety of statistical and graphical techniques for building and testing models.

2. GNU Octave : GNU Octave is free software featuring a powerful mathematics-oriented syntax with built-in plotting and visualization tools. Its syntax is largely compatible with MATLAB. It runs on GNU/Linux, macOS, and Windows.

3. WEKA : WEKA is a free data mining software package that provides a collection of machine learning algorithms and an analysis workbench, along with code that could be used in your own applications.

4. Python : Python is a free programming language that provides toolkits for machine learning and analysis, such as scikit-learn, numpy, scipy, and pandas, and data visualization using matplotlib.

Commercial tools

1. MATLAB : MATLAB is a commercial programming language and environment that provides functions for mathematics-oriented computation, statistical analysis, and visualization.

2. SAS Enterprise Miner, SPSS Modeler, and Alpine Miner : These are commercial analytics suites. You could choose the one that suits your requirements and budget.

1.4.5 Phase 5: Communicate Results

University Question
Q. Why is the "Communicate results" phase important in the data analytics lifecycle of projects? (SPPU - May 19, 8 Marks)

Once the model execution is complete, the team compares the outcomes of the modelling to the pre-established success criteria. The results are examined to determine whether they are statistically proven and whether the objectives of the project were met. Only after the team is confident about the results is the model readied to go live in the production environment.
The team then articulates the findings and documents the results. The findings are communicated to the project stakeholders. Note here that the model building exercise could be unsuccessful. The findings are still documented and reported before the team goes on to try and build another model. Recall from the data analytics life cycle diagram that each phase is an iterative process and works with the previous phase.

1.4.6 Phase 6: Operationalise

In the final phase, the model is deployed in the staging environment before it goes live on a wider scale. The staging environment is very similar to the production environment. The idea is to ensure that the model sustains the performance requirements and other execution constraints, and that any issues are identified before the model is deployed in the production environment. If any changes are required, they are carried out and tested again.

The project outcome is shared with the key stakeholders such as

1. Business user : The business user ascertains the benefits and implications of the project findings.

2. Project sponsor : The project sponsor asks questions around ROI (return on investment) and any potential risks to maintaining the project.

3. Project manager : The project manager determines if the project was completed on time and the goals were met.

4. Business intelligence analyst : The business intelligence analyst determines if any of the reports or dashboards need to be changed to accommodate the new findings.

5. Database administrator : The database administrator needs to plan for backup of the datasets and any other code that was written to be run on the database for the analytics project.

6. Data engineer : The data engineer needs to share the code, version control it, and maintain it. Any issues or bugs found in the code should be fixed.

7. Data scientist : The data scientist could explain the model to her peers and other stakeholders. She also documents the model and how it was implemented.

1.5 Data Wrangling

You understand that even readymade jeans that you buy need some form of alteration before you can wear them, isn't it? Similarly, in any real-world scenario, the data that you collect for analysis and build models on is usually not in a form where you can consume it directly. That is precisely where data wrangling comes into the picture.

Definition : Data wrangling is the process of cleaning and unifying messy and complex data sets to make them more appropriate and valuable for a variety of downstream purposes such as analytics.

Data wrangling is also called data munging or data pre-processing. Before you proceed, let's take a step back and look at a few concepts.

1.5.1 Need for Data Wrangling

So, why do you need data wrangling at all? Let's revise.

1.5.1(A) Data

The raw data, or just data, is a collection of observations of real-world phenomena. For instance, stock market data might involve observations of daily stock prices, announcements of earnings by individual companies, and even opinion articles from pundits. Sports data could have information on matches, the environment in which those matches were played, player performances, and several other observations. Similarly, personal biometric data can include measurements of your minute-by-minute heart rate, blood sugar level, blood pressure, oxygen level, etc. You can come up with endless examples of data across different domains.

Each piece of data provides a small window into a limited aspect of reality. The collection of all of these observations gives you a picture of the whole. But the picture is messy because it is composed of a thousand little pieces, and there's always measurement noise and missing pieces.
The collected data could also have other problems:

1. Wrong data : The data could be wrong as a result of an incorrect measurement. For instance, a measure might come in wrong because of a mistake in the instrument or in the way it was recorded.

2. Redundant data : The same information might be present multiple times, or multiple features might convey exactly the same thing; such extra data is redundant.

3. Missing data : Some values or data points might not be present at all; for example, the data for a particular day could be missing.

1.5.1(B) Tasks

Why do you collect data in the first place? There are questions that you would like to answer with it, such as
(a) Why are customers buying product A but not product B?
(b) Which food should you eat if you want to be healthier?
(c) Are you at risk of diabetes?
(d) How likely is a customer to buy your product again next month?
(e) What will the stock prices be next month?
(f) Will your favourite team win the match?
(g) How is the weather going to be?

However, the answers to such questions do not just fall out of the data. The path from data to answers is a multistage, iterative process. The workflow might start with a hunch, lead to a promising model, hit a dead end, and start all over again. For instance, the data may originally be stored in a production database or pulled from an exchange or a provider such as Thomson Reuters. It might then be dumped into a Hadoop cluster, parsed and subsampled by a Hive script, massaged and aggregated by another script, dumped to a CSV file, and converted into features by a script written in R or Python. The predictions made by a machine learning model might be checked by an evaluator, rewritten in Scala, Java or C++ by the production team, run multiple times, and the final predictions pumped back into the data store.

1.5.1(C) Mathematical Models

Trying to understand the world through data is like trying to piece together reality using a noisy, incomplete jigsaw puzzle with a bunch of extra pieces. This is where mathematical modelling comes in.

A mathematical model of data describes the relationships between different aspects of the data. For instance, a model that predicts stock prices might be a formula that maps a company's earning history, past stock prices, and industry to the predicted price. A model that recommends music might measure the similarity between users (based on their listening habits) and recommend the same artists to users who have listened to a lot of the same songs.

Mathematical formulas relate numeric quantities to each other. But raw data is often not numeric; for example, the statement "Rohit bought a Motorola handset on Sunday" is not numeric. This is where features come in.

1.5.1(D) Features

A feature describes a characteristic or an aspect of the raw data in a numeric form that is required for mathematical computation and statistical modelling. There can be many ways to represent the same raw data as features. For instance, a day-of-week variable that contains the values "Monday", "Tuesday", ..., "Sunday" is categorical, not numeric; the same aspect could instead be represented as an integer or as a set of Boolean (0/1) values. The choice of features matters a lot for the model that is built on top of them.
So, let's redefine features as,

Definition : A feature is a numeric representation of raw data.

As you know, the features in a data set are also called its dimensions. So a data set having n features is called an n-dimensional data set. For example, consider the following data set.

Gender   Marks
Girl     65
Girl     46

How many features or dimensions does it have? Two, right? Yes, this is a two-dimensional data set, or you could also say that this data set has 2 features.

I know you are saying out loud that "hey look, the gender field is not numeric". I understand; that is where feature engineering would come in, where you would understand how you could convert this raw data set into something more meaningful and computationally more appropriate. For example, you could assign a value of "0" for boys and a value of "1" for girls. Now that is numeric, isn't it?
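The 0-for-boys, 1-for-girls assignment above is a simple label encoding; a minimal sketch (the rows mirror the small data set above):

```python
# Raw, partly categorical data set from the example: (gender, marks).
raw = [("Girl", 65), ("Girl", 46)]

# Feature engineering: map the categorical field to a numeric code,
# 0 for boys and 1 for girls, so that every feature is numeric.
encoding = {"Boy": 0, "Girl": 1}
features = [(encoding[g], marks) for g, marks in raw]
print(features)  # [(1, 65), (1, 46)]
```

Every record is now a purely numeric vector that a mathematical model can consume.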
The right features are relevant to the task at hand and should be easy for the model to ingest.
Definition : Feature engineering is the process of formulating the most appropriate features given the data,
the model, and the task.
The Fig. 1.5.2 depicts where feature engineering sits in the machine learning pipeline.
Fig. 1.5.2 : The machine learning pipeline. Raw data from multiple sources (Source 1, Source 2, ..., Source n) is cleaned and transformed into features (feature engineering), which feed the modelling step to produce insights.
Features and models sit between raw data and the desired insights. In a machine learning workflow, you pick not only
the model, but also the features.
This is a double-jointed lever, and the choice of one affects the other. Good features make the
subsequent modelling
step easy and the resulting model more capable of completing the desired task.
Fig. : The feature creation loop. Features are created and evaluated; if the feature set is not good enough, creation is iterated until it is.
2. Prepared data : This refers to the dataset in the form ready for your machine learning task. Data sources have been parsed, joined, and put into a tabular form. Data has been aggregated and summarized to the right granularity; for example, each row in the dataset represents a unique customer, and each column represents summary information about the customer, like the total spent in the last six weeks. In the case of supervised learning tasks, the target feature is present. Irrelevant columns have been dropped, and invalid records have been filtered out.

3. Engineered features : This refers to the dataset with the tuned features expected by the model, that is, after performing certain machine learning specific operations on the columns in the prepared dataset and creating new features for your model during training and prediction. Some of the common examples are scaling numerical columns to a value between 0 and 1, clipping values, and one-hot-encoding categorical features. In practice, data from the same source is often at different stages of readiness. For example, a field from a table in your data warehouse could be used directly as an engineered feature. At the same time, another field in the same table might need to go through transformations before becoming an engineered feature. Similarly, data engineering and feature engineering operations might be combined in the same data pre-processing step. Fig. 1.5.5 highlights the placement of data engineering and feature engineering tasks.
Fig. 1.5.5 : Data engineering produces prepared data; feature engineering turns prepared data into the engineered features consumed by machine learning.
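A rough sketch of the three common operations mentioned above (min-max scaling to [0, 1], clipping, and one-hot encoding), using made-up column values:

```python
# 1. Scale a numeric column to a value between 0 and 1 (min-max scaling).
incomes = [50_000, 75_000, 80_000, 90_000]
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]   # 50,000 -> 0.0, 90,000 -> 1.0

# 2. Clip values into a fixed range to tame outliers.
ages = [15, 34, 37, 102]
clipped = [min(max(a, 18), 80) for a in ages]      # [18, 34, 37, 80]

# 3. One-hot encode a categorical column.
cities = ["Mumbai", "Bangalore", "Mumbai"]
vocab = sorted(set(cities))                        # ['Bangalore', 'Mumbai']
one_hot = [[1 if c == v else 0 for v in vocab] for c in cities]
# [[0, 1], [1, 0], [0, 1]]
```

Libraries such as scikit-learn provide these as reusable transformers, but the underlying arithmetic is no more than what is shown here.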
Fig. 1.6.1 : Data pre-processing tasks
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
5. Data discretization

Data cleaning can be applied to remove noise and correct inconsistencies in data. Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If the data is dirty, then it could lead to inaccurate results. For example, the day Monday could be represented as "Mon", "M", "1", "Monday", and so on.
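A tiny cleaning routine for the "Monday" example could look like this; the alias table and the "Unknown" default value are assumptions for illustration:

```python
# Map every known spelling of Monday to one canonical form,
# and fill missing values with a default.
DAY_ALIASES = {"mon": "Monday", "m": "Monday", "1": "Monday", "monday": "Monday"}

raw = ["Mon", "M", "1", "Monday", None, "monday"]

def clean_day(value, default="Unknown"):
    """Normalise a day-of-week value; fill missing entries with `default`."""
    if value is None:
        return default
    return DAY_ALIASES.get(str(value).strip().lower(), default)

cleaned = [clean_day(v) for v in raw]
print(cleaned)  # ['Monday', 'Monday', 'Monday', 'Monday', 'Unknown', 'Monday']
```

Real cleaning routines do the same thing at scale: a canonical vocabulary, a normalisation step, and an explicit policy for missing or unrecognised values.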
Tech Knowledge (Copyright No. - L82548/2019)
Data Science and Big Data Analytics (SPPU) 1-23 Introduction to Data Science and Big Data
Let's take a simple example.

Income   Credit Score   Age   Location    Give Loan
50,000   High           34    Mumbai      Yes
75,000   Low            33    Bangalore   No
80,000   High           37    Mumbai      Yes
90,000   High           29    Kolkata     Yes

This dataset has 4 dimensions (Income, Credit Score, Age, and Location) based on which loan approval seems to be granted. But, if you look closely, you would find that the other dimensions do not influence the decision as much as the Credit Score dimension. So, the same dataset with reduced dimensions could be as follows.

Credit Score   Give Loan
High           Yes
Low            No
High           Yes
High           Yes

This dimensionality reduction not only makes the algorithms computationally less intensive but also makes it simpler to understand and visualise the dataset as well as the results.

Note here that the goal of dimensionality reduction is NOT to reduce (or compromise on) the quality of data when discarding unnecessary dimensions, but to only keep the dimensions that matter the most without any significant loss of quality. This is very similar to how you compress a file without losing information, or how you switch to a lower resolution for a video if your internet speed is not optimal. Lowering the resolution does not change the video much.

Consider two attributes such as "age" and "annual salary". The "annual salary" attribute usually takes much larger values than age. Therefore, if the attributes are left unnormalized, the distance measurements taken on "annual salary" will generally outweigh distance measurements taken on age. One such common transformation technique is the log transform.

Log Transform

The log transform is a powerful tool for dealing with large positive numbers with a heavy-tailed distribution. A heavy-tailed distribution places more entries towards the tail end of the plot rather than the centre. The log transform compresses the long tail in the high end of the distribution into a shorter tail and expands the low end into a longer head. Let's understand how.

The log function is the inverse of the exponential function. It is defined such that log_a(a^x) = x, where a is a positive constant and x can be any positive number. Since a^0 = 1, you get log_a(1) = 0. This means that the log function maps the small range of numbers between (0, 1) to the entire range of negative numbers (-∞, 0).

The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on. In other words, the log function compresses the range of large numbers and expands the range of small numbers. The larger x is, the slower log(x) increments.

For example, in Fig. 1.6.3, note how the horizontal x values from 100 to 1,000 get compressed into just 2.0 to 3.0 in the vertical y range, while the tiny horizontal portion of x values less than 100 is mapped to the rest of the vertical range.

Fig. 1.6.3 : Plot of log10(x)
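A quick sketch of this compression effect using Python's standard math.log10 on made-up review counts:

```python
import math

# A heavy-tailed review-count column: after a log transform, a business
# with 2,000 reviews no longer dwarfs one with 20.
review_counts = [20, 35, 150, 2000]

log_counts = [math.log10(x) for x in review_counts]
print([round(v, 2) for v in log_counts])  # [1.3, 1.54, 2.18, 3.3]

# The whole range [100, 1000] collapses into just one unit of log space.
assert math.log10(1000) - math.log10(100) == 1.0
```

Raw counts spanning two orders of magnitude end up within about two units of each other, which is exactly the normalising behaviour described above.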
Log transform is commonly used when you are dealing with large numbers. For example, some businesses have a lot of reviews (say over 2,000) and some have only a few (say in the 20s). In such a scenario, it becomes difficult to compare and correlate one business with the other, because a large count in one element of the data set would outweigh the similarity in all other elements, which could throw off the entire similarity measurement for various machine learning algorithms. Log transformation helps to normalise skewed data.

5. Data Discretisation

Data discretisation and concept hierarchy generation can also be useful, where raw data values for attributes are replaced by ranges or higher conceptual levels. For example, raw values for age may be replaced by higher level concepts, such as youth, adult, or senior. One common data discretisation technique is quantization or binning.

6. Quantization or Binning

Quantization or binning is a feature construction technique where you could combine features to create segments or bins of information. In other words, you group the counts into bins, and get rid of the actual count values.

For example, consider the following data set for real estate sites.

Site Length   Site Breadth   Site Price
30            40             40 Lakhs
40            32             40 Lakhs
30            30             30 Lakhs
35            45             45 Lakhs
40            60             60 Lakhs
60            80             90 Lakhs

Instead of having site length and site breadth, which are not very useful for establishing a modelling pattern, you could create a new feature such as "site area". The "site area" feature could then provide a good estimate of site price.

Site Length   Site Breadth   Site Area   Site Price
30            40             1200        40 Lakhs
40            32             1280        40 Lakhs
30            30             900         30 Lakhs
35            45             1575        45 Lakhs
40            60             2400        60 Lakhs
60            80             4800        90 Lakhs

There could be several other examples of binning. For example, it is common to see custom-designed age ranges that better correspond to stages of life, such as:

0-12 years old
12-17 years old
18-24 years old
25-34 years old
35-44 years old
45-54 years old
55-64 years old
65-74 years old
75 years or older

You could get rid of the actual age in a large data set and create bins based on age groups. You could then model things such as product or service preferences, reviews, demographics, eating habits, etc. more elegantly.

It is up to your requirements to create bins as you need. For example, you may need to create bins for income. You could then have something like the following (remember how income tax or electricity bills have various slabs?).

Below 5 Lakhs
5-10 Lakhs
10-20 Lakhs
20-50 Lakhs
Over 50 Lakhs

You can create bins on a fixed value range, or take a value range based on other mathematical derivations such as quantile, percentile, median, etc.
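The income slabs and the "site area" feature above can be sketched as fixed-range binning plus feature construction; the function name income_slab is our own choice for illustration:

```python
def income_slab(lakhs):
    """Map an income (in lakhs) to one of the fixed-range slabs above."""
    if lakhs < 5:
        return "Below 5 Lakhs"
    if lakhs < 10:
        return "5-10 Lakhs"
    if lakhs < 20:
        return "10-20 Lakhs"
    if lakhs < 50:
        return "20-50 Lakhs"
    return "Over 50 Lakhs"

print(income_slab(7))   # 5-10 Lakhs
print(income_slab(60))  # Over 50 Lakhs

# Feature construction: replace (length, breadth) with a single area feature.
sites = [(30, 40), (40, 32), (30, 30)]
site_areas = [length * breadth for length, breadth in sites]
print(site_areas)       # [1200, 1280, 900]
```

For quantile- or percentile-based bins you would compute the cut points from the data itself (e.g. with a sorted copy of the column) instead of hard-coding them.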
Note here that binning can be carried out on both categorical and numerical data. For example, you would have seen a BMI chart that classifies you in various fitness categories based on your body weight and height. Hence, you are binning the categorical data (fitness category). Instead of worrying about each possible combination of BMI = weight in kg / (height in meter)^2, you just combine the data and bin it as per the respective categories.

BMI          Nutritional status
Below 18.5   Underweight

Review Questions

Here are a few review questions to help you gauge your understanding of this chapter. Try to attempt these questions and ensure that you can recall the points mentioned in the chapter.

[A] Introduction and Big Data Overview

Q. 1  Define the term Big Data and give a few examples of Big Data. [4 Marks]
Q. 2  Write a short note on Big Data. [4 Marks]
Q. 3  Explain the five Vs of Big Data. [6 Marks]
Q. 4  Describe the characteristics of Big Data. [6 Marks]
Q. 5  Explain the terms veracity and value with respect to Big Data. [4 Marks]
Q. 6  Explain the terms volume, variety, and velocity with respect to Big Data. [6 Marks]
Q. 7  Explain some of the major applications of Big Data Analytics. [6 Marks]
Q. 8  Describe three applications of Big Data Analytics. [6 Marks]
Q. 9  Explain the various Big Data formats. [6 Marks]
Q. 10 With examples, explain the types of Big Data formats. [6 Marks]
Q. 11 Write a short note on Big Data formats. [4 Marks]
Q. 12 Compare Big Data formats. [6 Marks]
Q. 13 Compare semi-structured, structured, and unstructured data. [6 Marks]
Q. 14 Write a short note on the DIKW pyramid. [5 Marks]
Q. 15 Draw the DIKW pyramid and explain. [6 Marks]
Q. 16 Explain the journey of data enrichment as it moves from hindsight to foresight. [6 Marks]
Q. 17 Describe the various categories of data analytics. [4 Marks]
Q. 23 Compare the various categories of data analytics. [6 Marks]

[B] State of the Practice in Analytics

Q. 24 Write a short note on Business Intelligence. [4 Marks]
Q. 25 What is business intelligence? [4 Marks]
Q. 26 Compare business intelligence with data analytics. [5 Marks]
Q. 27 The terms business intelligence and data analytics could be used interchangeably. Comment. [6 Marks]
Q. 28 Describe the relationship between information science and data science. [4 Marks]
Q. 29 Compare information science and data science. [4 Marks]
Q. 30 Draw a block diagram of the current analytical architecture and explain. [4 Marks]
Q. 31 With a block diagram, explain traditional analytical architecture. [4 Marks]
Q. 32 Explain the challenges of a traditional analytical architecture. [4 Marks]
Q. 33 Write a short note on traditional analytical architecture. [4 Marks]