
Introduction to Data Science and Big Data


Syllabus
At the end of this unit, you should be able to understand and comprehend the following syllabus topics :
Applications of Data Science
Data explosion - 5V's of Big Data
Relationship between Data Science and Information Science
Business intelligence versus Data Science
Data Science Life Cycle
Data
Data Types
Data Collection
Need of Data wrangling
Methods
Data Cleaning
Data Integration
Data Reduction
Data Transformation
Data Discretization

Note : Throughout this book, terms such as Big Data Analytics and Machine Learning Models are used interchangeably. As of today, those two terms involve a similar set of tasks and are used to refer to analytics tasks carried out on a given set of data. So, do not get confused if you see references to machine learning or models anywhere. You are reading the correct book!

1.1 Introduction and Big Data Overview

University Question
Q. What is Big Data ? SPPU - Aug. 18, 2 Marks; May 19, 3 Marks

Increasingly, people and things are getting interconnected. Data is continuously created by devices and users. For example, when you go for online shopping, all your clicks and views, your interaction with the website, your interaction with the competitor website, your addition to the cart, removal from the cart, price comparison, review reading, etc. are all recorded to analyse you as a user and a prospective buyer who could be influenced to make a purchase. The entire agenda of conducting data analytics is based on making informed decisions that can be further used to shape your behaviour and drive the business intentions.

Have you heard about the company Cambridge Analytica? It was a political consulting firm that harvested data of about 87 million US voters during Trump's presidency campaign in 2014. It built a system that could profile individual US voters in order to target them with personalised political advertisements. The result, everyone knows!

Data analytics combined with the right set of data is an extremely powerful mechanism today for businesses and nations. It can be used to derive meaningful predictions and shape user behaviour.
1.1.1 Five Vs of Big Data (Data Explosion)

University Questions
Q. Explain 5 V's of Big Data. SPPU - Aug. 18, 5 Marks
Q. Explain five Vs of Big Data. SPPU - May 19, 3 Marks

Definition : The term Big Data (always capitalized and used as a noun) refers to a massive accumulation of datasets that are too extensive and complex for traditional data storage, management, and processing tools and techniques. Big Data usually requires special, scalable, and efficient multi-tiered architecture and systems for its storage, manipulation, and analysis.

Big Data is usually characterised by the five Vs (characteristics) :
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value

Fig. 1.1.1 : The 5 Vs (characteristics) of Big Data

1. Volume

Volume refers to the size of the dataset. The size of a Big Data dataset typically ranges from terabytes to petabytes (a petabyte consists of 10^15 bytes). Such a dataset could consist of billions of rows in database tables. For example, imagine the data generated over time by the millions of active users of flipkart.com.

2. Velocity

Velocity refers to the rate at which new data is generated and accumulated, and the speed at which it could be stored, processed, and analysed. Many applications require the data to be ingested and analysed in real time. For example, a top news topic could generate 50 million tweets in a day, a scoring application could be consuming millions of clicks per second during a live cricket match, and mobile handset users could be generating millions of comments, "likes" and "shares" per hour on social media.

3. Variety

Variety refers to the wide variety of data types and formats that make up Big Data. The data could be structured, semi-structured, or unstructured in nature. It could mean texts, tweets, comments, images, photos, audio, videos, and data collected from various sources in various formats. Such variety makes the storage, retrieval, and analysis of Big Data complex.

4. Veracity

Veracity refers to the quality and integrity of the data, and it is a measure of how trustworthy and useful the data is for analysis. The data collected could be inaccurate or incomplete, and the objectives of the analysis may not be met if the veracity of the chosen dataset is low. For example, if the historical medical records of cancer and non-cancer patients were carefully compiled, the analysis could produce meaningful insights and show real trends; but an inaccurate dataset could produce misleading reports. Hence, the veracity of the data should always be kept in mind while carrying out the analysis.
Veracity also measures the biases, noises, and abnormality in the data. The veracity of a chosen dataset generally depends upon questions such as
(a) How old or recent is the data?
(b) How accurate is the data?
(c) How was the data stored and preserved?
The proponents of Big Data analysis should ensure that such questions are answered before a dataset is chosen for the analysis.

5. Value

Value refers to the usefulness of the data and the meaningful results that its analysis can generate for the organisation. The value is generally dependent on the degree of veracity; the higher the veracity of the data, the higher the value it can generate. Large data sets with high value can be used to answer the most common business and marketing questions and achieve the biggest objectives of the organisation. In a way, generating high value out of the chosen data is the entire purpose of carrying out Big Data analytics.

1.1.2 Applications of Big Data

Today, Big Data analytics is carried out for various purposes across various sectors. Fig. 1.1.2 shows some of the major application areas of Big Data.

1. Medical and health care
2. E-Commerce and Marketing
3. Financial sector
4. Weather patterns
5. Media and entertainment

Fig. 1.1.2 : Major Applications of Big Data

1. Medical and Health Care

The advancement in Big Data analysis has significantly helped medical research and the health care sector. For example, Big Data analysis is used to
(a) Predict epidemics and find the cure for life threatening diseases such as AIDS and cancer
(b) Carry out genetic sequencing and study genes to identify the patterns of diseases such as Alzheimer's
(c) Identify the early stages of diseases based on the patient's physical parameters and medical data
(d) Predict the chances of survival and improve the life expectancy of patients
(e) Reduce the costs of health care operations

2. E-Commerce and Marketing

The E-Commerce and marketing sector heavily uses Big Data analytics to learn about consumer patterns and build predictions. For example, the analysis is used to
(a) Find out the consumer's likes and build new products and services closer to their demands
(b) Analyse consumer transactions to identify purchase patterns and sell customised products and offers
(c) Push new products and services to you based on your past purchases and shopping style
(d) Maintain adequate inventory and place products in warehouses closer to the likely buyers for faster shipping

For example, based on the purchase patterns of a consumer, a company could even identify a pregnancy at an early stage and then push the products used in the various stages of pregnancy and for the infant's needs at the right time.

3. Financial Sector

The financial sector uses Big Data analytics for purposes such as
(a) Quoting insurance premiums based on your physical and lifestyle parameters
(b) Detecting frauds. For example, if one transaction is carried out in Mumbai and another transaction on the same account is carried out in Delhi within an hour, it is highly unlikely that both were made by you.
(c) Credit score analysis to identify the consumers who are likely or unlikely to pay their dues soon
(d) Selling new financial schemes and products based on the account balance, and placing the surplus balance into investments that yield better returns

Apart from these, the financial sector also uses Big Data for detecting money laundering, shell companies, fraudulent transactions, and reporting. Based on the transactions carried out by individuals, the companies can build a financial health profile of its consumers and identify future spend patterns and requirements.

4. Weather Patterns

Big Data analysis is crucial for detecting changes in the weather patterns. You would have heard about
(a) The rising ocean temperatures
(b) Global warming
(c) Melting glaciers in Antarctica
(d) Reducing oxygen level

There is a huge amount of data that can be used to predict the weather changes and report how it is affecting our environment. It can be used to forecast the weather and to predict natural disasters and any other changes that could affect our well-being.

5. Media and Entertainment

The media and entertainment industry uses Big Data analysis to understand viewing and liking patterns for the media content.

Based on the time of the day, season, device you are on, your personal interests and taste, the content can automatically be recommended for you. I am sure you would have seen YouTube recommendations as you watch YouTube videos. Similarly, companies like Spotify can automatically create curated and customised playlists for you based on your listening profile.

1.1.3 Data Formats (Types)

Big Data comes in different formats. Data can be machine generated (such as log files) or could be human generated (such as tabular data). Overall, the data format is classified as shown in Fig. 1.1.3.

Fig. 1.1.3 : Data Formats : 1. Structured data, 2. Semi-structured data, 3. Unstructured data

1. Structured Data

Structured data exhibits a particular order (also known as model or schema) for storing and working with the data. The data attributes are usually related and are often the basis of analysis.

The structured data is usually generated by machines or compiled by humans.

For example, spreadsheets, customer records, transaction records, sales reports, etc. are all structured data. The structured data is usually stored in relational databases or simple CSV or spreadsheet files.

2. Semi-structured Data

Semi-structured data has some definitive patterns for storage, but the data attributes may not be inter-related.

The data could be hierarchical or graph-based in nature. The semi-structured data is usually stored in text files as XML, JSON or YAML format. The common sources for semi-structured data are usually machines such as sensors, website feeds, or other application programs.

3. Unstructured Data

Unstructured data does not exhibit a fixed pattern or a particular schema. This is the most common format of Big Data.

Examples of unstructured data are video, audio, tweets, likes, shares, text documents, PDFs, and scanned images. Special tools and mechanisms are required to process unstructured data. Also, it is usually cleaned (sanitised) before it can be used for analysis.
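To make the distinction concrete, here is a minimal Python sketch of reading the first two formats. The file names and fields (customers.csv, sensor_feed.json, customer_id, humidity, etc.) are hypothetical, used only for illustration.

import csv
import json

# Structured data : fixed columns, every row follows the same schema.
with open("customers.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["customer_id"], row["city"])

# Semi-structured data : a known overall shape (JSON), but records may
# nest, and attributes may be missing from one record to the next.
with open("sensor_feed.json") as f:
    readings = json.load(f)
for reading in readings:
    # .get() tolerates absent attributes, unlike the fixed CSV columns.
    print(reading["sensor_id"], reading.get("humidity", "n/a"))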




1.1.4 Comparison between Data Formats


Table 1.1.1 : Comparison between Data Formats

Comparison Attribute    | Structured Data      | Semi-structured Data | Unstructured Data
Volume of Data          | Low                  | Medium               | High
Processing Complexity   | Low                  | Medium               | High
Data generated by       | Humans and Machines  | Machines             | Humans
Data usually stored in  | Relational Databases | Textual files        | Binary files
Patterns and Schema     | Fixed                | Flexible             | Random
Specialised Tools       | Not required         | Not required         | Required

1.1.5 Data Collection

This is covered in Unit 3 under the section "Sources of Big Data". Please refer to Section 3.1.1.

1.1.6 DIKW Pyramid


It is important to understand how data can be enriched and the journey it takes with each stage of enrichment. To
understand this journey, typically DIKW Pyramid is referenced. DIKW is an acronym for the four stages of data
enrichment that are
1. Data

2. Information
3. Knowledge, and
4. Wisdom

Fig. 1.1.4 : DIKW Pyramid : Data (events, records and transactions), Information (+ context : hindsight), Knowledge (+ meaning : insight), and Wisdom (+ understanding : foresight), with the value increasing at each stage


1. Data

This is the lowest level of the DIKW pyramid and reflects the raw data at hand. The raw data is generated by machines or humans in the form of events, records, and transactions. By itself, the data does not give you anything meaningful or actionable until it is further enriched.

For example, a company could have collected a list of millions of cancer patients from around the world along with their demographic details. This list, by itself, does not give you any deep insights.

2. Information

When you add context to the data, you gain information. The enriched data starts becoming meaningful and gives you hindsight : you can precisely understand what actually happened.

For example, the list of cancer patients could be enriched with information such as the stage at detection, the medicines used, the chemotherapy dosage given, and whether the patients survived or died. With such information at hand, you can begin to answer high-level questions about the data.

3. Knowledge

When you analyse the information and add meaning to it, you derive knowledge. Knowledge gives you insight : you can analyse the information to derive patterns and understand how its various attributes relate to each other.

For example, you could analyse the effect of the medicines and the chemotherapy dosage on the life expectancy of the cancer patients, and build knowledge about how effective a particular medicine or dosage actually is.

4. Wisdom

Wisdom is the final and topmost stage of the DIKW pyramid. Knowledge combined with a deep understanding gives you wisdom, and wisdom enables foresight.

For example, based on the understanding of how lifestyle, diet, sleep, and exercise relate to the occurrence of cancer, you could answer questions such as what could be done to avoid or delay the occurrence of cancer, or build a formula to improve the life expectancy after chemotherapy.

1.1.7 Categories of Data Analytics

Now that you have a fair understanding of the DIKW pyramid, let's touch upon the different categories of data analytics. As shown in Fig. 1.1.5, there are four categories of data analytics, with the complexity and the value increasing as you move across the spectrum from hindsight to insight to foresight.

1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

Fig. 1.1.5 : Categories of Data Analytics

1. Descriptive Analytics

Descriptive analytics is the simplest form of analytics and answers questions about what has already occurred. It is based on the contextual information about the data and provides hindsight.

Descriptive analytics could answer questions such as
(a) How many units of a particular item were sold in the last few months?
(b) How many patients died of a particular type of cancer?

(c) How many calls did you receive for a particular issue?

This kind of analytics is usually done using database queries or simple spreadsheet filters. You could have periodic dashboards and reports that can be used to visualise results of the descriptive analytics.
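As an illustration, a descriptive question like (a) above can be answered with a single aggregation query. The sketch below uses Python with pandas on a hypothetical sales table; the table contents and column names are made up for illustration.

import pandas as pd

# Hypothetical sales records : one row per sale of an item.
sales = pd.DataFrame({
    "item":  ["soap", "soap", "brush", "soap", "brush"],
    "month": ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "units": [10, 5, 3, 8, 4],
})

# Descriptive analytics : summarise what has already happened.
units_per_month = sales.groupby(["item", "month"])["units"].sum()
print(units_per_month)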
2. Diagnostic Analytics

Diagnostic analytics is done to find out the cause of a phenomenon or derive reasoning behind events.

This analytics goes a level deeper to provide information that can be used to fix a particular situation or event.

Diagnostic analytics usually adds more context to the data to get information about a particular interest.

For example, following are a few questions that can be answered using diagnostic analytics.
(a) Why were the sales in quarter 2 lower than in quarter 1?
(b) Why are people falling ill after eating a particular type of biscuit?
(c) Why is model X of the car preferable over model Y?

Diagnostic analytics requires careful examination of data from multiple sources and is a little more involved and skilful exercise than descriptive analytics.

3. Predictive Analytics

Predictive analytics is carried out to forecast and predict future events.

The information is further enriched by adding meaning to it to derive knowledge.

The predictive data models are carefully created such that they can base future predictions on the past events. Predictive analytics could possibly answer questions such as
(a) What would be the improved life expectancy if choosing medicine A over medicine B?
(b) What would be the sales figure for model X of the car in the third quarter?
(c) Which team would likely win the world cup this year?

Predictive analytics assumes that a certain set of conditions are met or would exist. If there are changes to those conditions, then predictive analytics may not be accurate.

4. Prescriptive Analytics

Prescriptive analytics takes the results from predictive analytics and further adds human judgement to prescribe or advise further actions.

This reflects the wisdom level from the DIKW pyramid that you learnt earlier.

The prescriptive analytics could answer questions such as
(a) What should you do to delay cancer?
(b) What is the best time to leave home to reach the airport on time?
(c) Which medicine would have higher chances of survival for the patient?

Prescriptive analytics is the most difficult out of all the analytics. It requires significant skills and time to give effective actions and results. It could also be dependent on not only the analysed data but external conditions such as political pressure, social acceptability, and personal preferences.

1.1.8 Comparison between Categories of Data Analytics

Table 1.1.2 : Comparison between categories of Data Analytics

Comparison Attribute              | Descriptive | Diagnostic  | Predictive | Prescriptive
Complexity                        | Least       | Medium      | High       | Highest
Time required to produce results  | Low         | Medium      | High       | Very High
Value of results                  | Short term  | Medium term | Long term  | Very long term
Data enrichment level             | Data        | Information | Knowledge  | Wisdom
Analytics frequency               | Most common | Frequent    | Not often  | Rare
1.2 Business Intelligence (BI) vs. Data Science

University Question
Q. Compare Business Intelligence vs Data Science practice. SPPU - Dec. 18, 6 Marks

Business Intelligence (BI) could easily be confused with Data Science. Let's learn a few things about BI and then get into a comparison between BI and data analytics.

Definition : Business intelligence (BI) uses the data collected by an organisation (both internally and externally) to gain insights into its business operations and performance and to make informed business decisions.

Note here that the data available to BI could be from several sources of interest such as periodic sales reports, website traffic data, customer feedback and satisfaction surveys, and existing systems such as ERP or CRM.

At a high level, BI is typically used to
1. Report on the current and historical business performance
2. Fix business issues and problems
3. Optimise existing business processes and practices
4. Predict business trends and demands

Whereas, Data Science tends to significantly use analytics to
1. Build prediction models and forecast future trends
2. Build new and innovative products and services
3. Drive business growth and improve user experience
4. Reduce the cost of operations

These applications could belong to any industry, such as commercial, government, medical research, or media and entertainment.

Table 1.2.1 summarises the key differences between business intelligence and data science.

Table 1.2.1 : Comparison between Business Intelligence and Data Science (Analytics)

Sr. No. | Comparison Attribute          | Business Intelligence                  | Data Science (Analytics)
1       | Nature of data                | Mostly structured                      | Mostly unstructured (and structured)
2       | Nature of questions answered  | Close ended (what, when, where)        | Open ended (how, why, what if)
3       | Time horizon                  | Short                                  | Long
4       | Source of data                | Mostly internal                        | Internal and external
5       | Complexity                    | Low                                    | High
6       | Skills required               | Low                                    | High
7       | Tools used (illustrative)     | Database queries, reports, dashboards  | Predictive analytics and statistical tools

1.3 Relationship between Data Science and Information Science

Information science is the discipline that deals with the collection, organisation, storage, retrieval, interpretation, and transferring of information. It brings together concepts and techniques from disciplines such as library science, computer science, engineering, linguistics, and psychology, and uses computers and information handling devices to develop methods that ultimately make information available in order to aid the running of businesses.

Note here that information science is not about running analytics on the data (or information).
Though you can say that for meaningful or successful data analytics you anyways need to handle the data correctly, data science and information science are distinctively two independent fields. They are distinct but complimentary disciplines.

Data science is heavy on computer science and mathematics. Information science is more concerned with areas such as library science, cognitive science, and communications.

Data science is used in business functions such as strategy formation, decision making, and operational processes. It touches on practices such as artificial intelligence, predictive analytics, and algorithm design. Information science is used in areas such as knowledge management, data management, and interaction design.

Table 1.3.1 summarises the comparison between information science and data science.

Table 1.3.1 : Comparison between Information Science and Data Science

Comparison Attribute             | Information Science  | Data Science
Analytics carried out?           | No                   | Yes
Use of Mathematics               | Low                  | High
Computing resource requirements  | Small                | Large
Dependency on data               | Low                  | High
Human judgment involved          | To a certain extent  | Minimal

1.3.1 Current Analytical Architecture

University Questions
Q. Explain current analytics architecture with suitable diagram. SPPU - Aug. 18, 6 Marks
Q. Draw and explain Current Analytical Architecture. SPPU - Oct. 19, 5 Marks
Q. Explain current analytical architecture with diagram. SPPU - Dec. 19, 5 Marks

Organisations typically have some form of keeping their data for business intelligence and reporting. But the traditional data warehouses and tools may not serve well for the purposes of modern data analytics; they could be useful for business intelligence but may fall short for the requirements of data analytics. For example, Fig. 1.3.1 gives a high-level view of a traditional analytical architecture.

Fig. 1.3.1 : High level view of a traditional analytical architecture : Data sources feed a department level data warehouse and an organization level data warehouse, which serve reports, dashboards, and data analytics users
The traditional analytical architecture poses the following challenges for modern data analytics :

1. Rigid data structure : The traditional data warehouses are built for structured data with a pre-established, rigid layout. The raw data is cleaned, normalised, and pre-processed into the desired format before it is loaded into the warehouse. Hence, the architecture may not be feasible for the analytics of unstructured data (for example, video, audio, apps data, or tweets).

2. Controlled environment : The data warehouse is usually controlled by the organisation's IT team. The data analysts may not be allowed to explore the data freely, and any changes to the data structure or layout require additional time and effort. This restricts the flexibility and the agile practices required for data analytics.

3. Multiple data copies : The data is often copied locally (for example, into departmental data warehouses or spreadsheets) before further analysis is carried out on it. Copying the data into multiple places wastes storage and computing resources and complicates the data management.

4. Not built for real time analytics : The data warehouses are largely built for business-level reporting. The data collection, ingestion, and pre-processing take long, and thus the data may not be available for real time analytics.

1.3.2 Drivers of Big Data Analytics (Motivation)

A few decades back, internet connectivity was not common, the use of computers and smart phones was negligible, and the data generated by users and devices was insignificant. This has changed with the growth of the internet and connectivity. Today, governments, multinational organisations, and businesses generate and consume huge amounts of data. Fig. 1.3.2 shows the major drivers or motivations of Big Data analytics.

1. Marketplace dynamics
2. Business architecture
3. Business process management
4. Information technology
5. New approach to analytics

Fig. 1.3.2 : Major drivers (motivations) of Big Data analytics

1. Marketplace Dynamics

Today we live in a global economy where everything is increasingly linked with everything else. The business dynamics of one country affect the economy of other countries. For example, the effect of the recession of 2008 was felt by almost all major economies. Companies are trying to gauge and predict the parameters of the next economic slowdown or recession, and to appropriately reduce costs, streamline and optimise operations, retain existing customers, and acquire new ones. Businesses need meaningful analytics from historical and forecast data to make decisions in an uncertain economy. Almost all major companies today invest in Big Data analytics to stay ahead of the market trends and be more agile.

2. Business Architecture

Business architecture is the structure that articulates how an organisation operates. It links the organisation's broad vision, mission, and business motivations to the key perspectives from which the business can be looked at. Data analytics is increasingly becoming a means for businesses to gauge their performance, transform their operations, and design for success.
3. Business Process Management

Big Data analytics ties into an organisation's key business processes at every level. At the executive layer, the analytics can help define the right business decisions at the right time; at the bottom layer, it can help establish and monitor the key performance parameters for each business unit and continuously improve the business processes.

For example, a production unit needs to know precisely how much to produce and when, and the reports from data analytics can help it meet the demands of, say, an upcoming festival season. Similarly, hospitals and laboratories that collect blood and urine samples of diabetic patients during each visit can use the data to improve the check-up process and reduce the patients' waiting time.

4. Information Technology

The world has moved towards digitisation, and technology has become hyperscale. Almost half of the planet's population is connected to the internet, and almost 90% of the world's data has been generated within the last few years. There is a huge volume of data around almost everything : social media services such as Facebook, services such as Google, public video surveillance using CCTV cameras, biometric data such as fingerprints and face scans, and personal demographic details. Such vast information, when analysed, can make a great impact on how businesses perform.

5. New Approach

Advancements in technology have made computing cheaper and easier. Commodity hardware, cloud services, and the availability of vast amounts of data have truly enabled a new approach to analytics. Organisations can now process huge amounts of data comparatively cheaper and faster, derive meaningful insights in almost real time, build predictive models, and use them for targeted ads and marketing campaigns.

Also, the Internet of Everything is emerging : devices such as the TV, AC, refrigerator, toaster, microwave, geyser, music player, smart watch, cars, and CCTV cameras are getting connected to the internet. These connected devices collect a huge amount of data that can be correlated and analysed. For example, your smart watch can monitor your heart rate day in and day out; when the heart rate passes an abnormal pattern, it could automatically make an appointment with a cardiologist, perhaps with one who deems the situation fit for consultation, and share the collected data for medical advice.

1.3.3 Emerging Big Data Ecosystem and a New Approach

University Questions
Q. Explain Big Data Ecosystem. SPPU - Aug. 18, 4 Marks
Q. Draw and explain Big Data Ecosystem. SPPU - Dec. 19, 5 Marks
Q. Discuss the emerging Big Data ecosystem. SPPU - Oct. 19, 5 Marks

In the emerging Big Data ecosystem, there are four main groups that work together seamlessly, as shown in Fig. 1.3.3. Each of these groups collects, provides, or uses data in the ecosystem.
The four major groups of the emerging Big Data ecosystem are

1. Data devices
2. Data collectors
3. Data aggregators
4. Data consumers

Fig. 1.3.3 : Emerging Big Data ecosystem : major groups

1. Data Devices

Data devices are the wide variety of physical devices and equipment that collect data. These could be smart phones, computers, sensors, CCTV cameras, audio and video streaming equipment, and other connected digital things. The devices collect data of various types and forms and send it to the data collectors.

2. Data Collectors

Data collectors are the individuals or organisations that collect data from the data devices or the users for a desired purpose. For example, an ecommerce portal could record what you shop and look at, your buying habits, preferences, and viewing patterns; similarly, a retail store could collect data about your local purchases. Based on the collected data, these organisations could target you with products of your preference and facilitate further analytics.

3. Data Aggregators

Data aggregators accumulate the data from the data collectors and aggregate it into a larger form that can be used for various types of analysis. The aggregated data could ultimately be used to recommend products or be sold to other organisations. For example, Google could aggregate your data across its services, such as Gmail and Google Maps, and automatically suggest products related to your browsing presence and home location, irrespective of which individual service you use. Similarly, VISA could aggregate the credit card and debit card data across users, and a bank could use the aggregated purchase data of an individual when he or she visits the bank's website for a loan.

4. Data Consumers

Finally, the data consumers are the organisations or individuals who make use of the data collected and aggregated by the data collectors and aggregators, irrespective of the size of the company or the city and country it operates from.

Note here that the roles really depend on the capability of a particular company; the same company could be a data collector, a data aggregator, and a data consumer as well. Fig. 1.3.4 shows a high-level relationship between the data devices, data collectors, data aggregators, and data consumers.

Fig. 1.3.4 : High-level relationship between the various groups of the Big Data ecosystem
Taking the example of the Cambridge Analytica case of 2014 : the users with their smartphones and computers are the "Data Devices" in the Big Data ecosystem. Facebook, which had access to the vast information of its users, their timelines and friends, is the "Data Collector". Cambridge Analytica, which aggregated and profiled the data of millions of individuals, is the "Data Aggregator". Finally, the political parties that used such aggregated data for targeting individuals with personal ad campaigns in the US Presidential elections are the "Data Consumers" in the ecosystem.

1.3.5 Key Roles for the New Big Data Ecosystem

There are three key roles in the new Big Data ecosystem, as shown in Fig. 1.3.5.

1. Data Scientists
2. Data Explorers
3. Technology and Data Enablers

Fig. 1.3.5 : Key roles in the new Big Data ecosystem

1. Data Scientists : These are individuals with the deep analytical talent, training, and expertise required to understand and work with Big Data. They have an advanced background in disciplines such as mathematics, statistics, and economics. They can build sophisticated data models that make predictions and carry out the required analytics.

2. Data Explorers : This is a category of individuals who are less technical than data scientists but have a broad understanding of the business and the data at hand. They are typically business users, project managers, and market, sales, and financial analysts. Their primary role is asking and answering the right business questions for the analytics.

3. Technology and Data Enablers : This group of individuals provides the technical expertise and the tools required for carrying out analytics at scale. They are typically data engineers, data architects, database administrators, programmers, etc. Their role is ensuring that the data is collected, cleaned, loaded, and stored for the analysis, and that the analytics solution has the right hardware and software support.

1.3.6 Key Roles for a Successful Analytics Project

University Question
Q. Enlist and explain various roles for a successful analytics project. SPPU - Oct. 19, 5 Marks

A successful analytics project requires several individuals fitting various key roles. Fig. 1.3.6 shows the typical roles required for a successful analytics project.

1. Business user
2. Project sponsor
3. Project manager
4. Business intelligence analyst
5. Database administrator
6. Data engineer
7. Data scientist

Fig. 1.3.6 : Key roles for a successful analytics project
1. Business User

Business user is an individual who uses the results of data analytics to meet the business objectives. This is usually a business analyst, an operations manager, or a subject matter expert.

2. Project Sponsor

This is usually an executive or a senior management staff that provides approval and funding for the project. She sets the business problems to solve. She monitors the health of the project and ensures that it is progressing towards established goals and desired business outputs.

3. Project Manager

Project manager is responsible for day to day execution of the project. She ensures that the project milestones are achieved, and the overall project is running timely.

4. Business Intelligence Analyst

Business intelligence analysts provide business domain expertise. They understand the business and its key performance parameters and metrics.

5. Database Administrator

Database administrators setup, operate and maintain the databases that hold the actual data for analytics and the results of analytics. They are responsible for ensuring that the database is up and running and only the right set of people have access to it.

6. Data Engineer

Data engineers understand the software tools and techniques required to analyse the data to meet the desired outcomes. They know how to extract, transform, load, and analyse the data at scale and implement programs for the given data models.

7. Data Scientists

These are individuals with deep analytical talent required to understand and work with data. They build sophisticated data models that can make predictions or carry out the required analytics. They closely work with data engineers to ensure that the data models are correctly implemented. They also choose the right data analytics approaches and ensure that the overall objectives are met.
1.4 Data Analytics Life Cycle (Data Science Life Cycle)
University Questions
Q. Explain different phases of data analytics life cycle. SPPU - Aug. 18, 6 Marks
Q. Explain Data Analytics Life cycle. SPPU - Dec. 18, 8 Marks
Q. Draw Data Analytics Lifecycle and give brief description about all phases. SPPU - May 19, 5 Marks
Q. Demonstrate the overview of Data Analytics Life Cycle. SPPU - Oct. 19, 5 Marks

The data analytics life cycle broadly has six phases. Each of these phases is worked through iteratively with the previous phase before moving to the next phase.

Fig. 1.4.1 : Data Analytics Life Cycle : 1. Discovery, 2. Data preparation, 3. Model planning, 4. Model building, 5. Communicate results, 6. Operationalise



1.4.1 Phase 1 : Discovery

In the Discovery phase, the data science team
1. Learns about the business problem to solve,
2. Investigates the problem,
3. Develops context and understanding,
4. Examines the available data sources, and
5. Formulates the initial hypothesis

The team learns about the business domain in which the problem is to be solved. It assesses the resources available for the project and carries out the feasibility analysis. It spends time in framing the right problem.

Definition : Framing is the process of stating the analytics problem to be solved.

As part of the framing activity, the main objectives of the project are ascertained and the success criteria for the project is clearly defined. The team also develops the initial hypothesis that can later be substantiated with the data.

1.4.2 Phase 2 : Data Preparation

The data preparation phase explores, pre-processes and conditions the data before modelling and analysis could be carried out. In this phase, the following activities are carried out; a minimal code sketch follows this list.

1. Preparing the analytics environment : In this step, an isolated workspace is created in which the team can explore the data without interfering with the live data. The data from various data sources is collected in the isolated workspace.

2. Perform ETL process : ETL stands for Extract, Transform and Load. In this step, the raw data is extracted from the datastore, transformed as deemed right (removing noise, outliers, and biases from data) and then loaded into the datastore again for analysis.

3. Learn about the data : Once the ETL process is complete, the team spends time in learning about the data and its attributes. Understanding the data itself is the key to building a good data model in the subsequent phase.

4. Data conditioning : In this step, the data is further cleaned and normalized by performing further transformations as required. The data from several sources could be joined or combined as required. The actual data attributes that would be used for analytics are decided.

5. Data visualisation : Once the data is in a clean state and ready to be analysed, it is a good idea to visualise it to identify patterns and explore data characteristics. Understanding patterns about the data enables building a perspective about the data model.
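Here is a minimal sketch of steps 2 to 4 above, using Python with pandas. The file names, column names, and the outlier rule are illustrative assumptions, not a prescribed recipe.

import pandas as pd

# Extract : read the raw data pulled into the isolated workspace.
raw = pd.read_csv("raw_patient_records.csv")   # hypothetical file

# Transform : drop obvious noise. Here, readings outside a plausible
# range are treated as outliers (illustrative rule only).
clean = raw[(raw["heart_rate"] > 30) & (raw["heart_rate"] < 220)]

# Condition : keep only the attributes chosen for the analysis.
clean = clean[["patient_id", "age", "heart_rate"]].dropna()

# Load : write the conditioned data back for the modelling phase.
clean.to_csv("conditioned_patient_records.csv", index=False)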
Some of the common tools used in this phase are as following. Note here that the following list is not exhaustive. The choice of tools largely depends on the problem at hand, desired outcomes, and the team's skills.

1. Apache Hadoop : The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

2. Apache Kafka : Apache Kafka is a distributed streaming platform. You can publish and subscribe to streams of records, store streams of records, and process streams of records as they occur.

3. Alpine Miner : Alpine Miner provides a graphical interface for creating analytics workflows and is optimised for fast experimentation, collaboration, and an ability to work within the database itself.

4. OpenRefine : OpenRefine is a powerful tool for working with messy data. It cleans the data and transforms it from one format into another.

1.4.3 Phase 3 : Model Planning

In this phase, the team explores and evaluates the possible data models that could be applied to the given datasets to get the desired results. The team can try several models before finalising. Some of the major activities carried out in this phase are as following.

1. Data Exploration : The team spends time in understanding the available data and the various patterns and relationships amongst its attributes.

2. Model Selection : Based on the type of the dataset at hand (structured, semi-structured, or unstructured) and the desired outcome, the team chooses the analytical technique and the data model that could be applied. The team could consult data analysts, subject matter experts, and other stakeholders who might have a different viewpoint and knowledge of the given dataset.

Some of the common tools used in this phase are as following. Note here that the list is not exhaustive.

1. R : R is a free software environment for statistical computing and graphics. It provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible.

2. SQL Analysis Services : SQL Server Analysis Services is an analytical data engine used in decision support and business intelligence (BI) solutions. It provides analytical data for business reports and client applications such as Power BI, Excel, and SharePoint reports, and supports tabular, multidimensional, and data mining models.

3. SAS/ACCESS : SAS/ACCESS provides the integration between SAS and third-party databases and platforms by providing data access connectors.

1.4.4 Phase 4 : Model Building

In this phase, the team starts building the data model. The given dataset is divided into
1. Training dataset, and
2. Testing (and production) dataset

The model is built (designed) and executed on the training dataset. The team then runs the model on the testing dataset and compares the outcomes with the desired outcomes. After several test executions, the model could be statistically proven, and the team could be confident that the model matches the desired outcomes and can be applied in production (go live).

Some of the common tools used in this phase are as following. Note here that the list is not exhaustive; the choice of tools largely depends on the problem at hand, desired outcomes, the team's skills, and your budget.

Free or open source tools :

1. R : As discussed earlier, R is a free software environment for statistical computing and graphics.

2. GNU Octave : GNU Octave is free software featuring a mathematics-oriented syntax with built-in plotting and visualization functions. It runs on GNU/Linux, macOS, and Windows, and its syntax is largely compatible with MATLAB.

3. WEKA : WEKA is a free and open source data mining software package written in Java. It provides a collection of machine learning algorithms and an analytics workbench, and the related functions could be used within your own Java code.

4. Python : Python is a free and open source programming language that provides various toolkits for machine learning and analysis, such as scikit-learn, numpy, scipy, pandas, and the data visualization package matplotlib.

Commercial tools :

1. SAS Enterprise Miner
2. SPSS Modeler
3. MATLAB
4. Alpine Miner

1.4.5 Phase 5 : Communicate Results

University Question
Q. Why is the "communicate results" phase important in the data analytics lifecycle for analytics projects ? SPPU - May 19, 8 Marks

After executing the model, the team compares the outcomes of the modeling to the pre-established success criteria and determines whether the results are statistically valid and the project objectives were met.
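To tie Phases 4 and 5 together, here is a minimal sketch of the train/test workflow described above, assuming Python with scikit-learn (one of the free tools listed in Phase 4). The dataset and the accuracy threshold are made up for illustration.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Divide the given dataset into a training and a testing dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Build (design) the model and execute it on the training dataset.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare outcomes on the testing dataset against the success criteria.
accuracy = model.score(X_test, y_test)
print("accuracy:", accuracy)
if accuracy >= 0.9:            # illustrative success criterion
    print("Model meets the pre-established success criteria.")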
The team then articulates the findings and documents the results. The findings are communicated to the project stakeholders.

Note here that the model building exercise could be unsuccessful. The findings are still documented and reported before the team goes on to try and build another model. Recall from the data analytics life cycle diagram that each phase is an iterative process and works with the previous phase.

1.4.6 Phase 6 : Operationalise

In the final phase, the model is deployed in the staging environment before it goes live on a wider scale. The staging environment is very similar to the production environment. The idea is to ensure that the model sustains the performance requirements and other execution constraints, and that any issues are identified before the model is deployed in the production environment. If any changes are required, they are carried out and tested again.

The project outcome is shared with the key stakeholders such as

1. Business user : The business user ascertains the benefits and implications of the project findings.

2. Project sponsor : The project sponsor asks questions around ROI (return on investment) and any potential risks to maintaining the project.

3. Project manager : The project manager determines if the project was timely completed and the goals were met.

4. Business intelligence analyst : The business intelligence analyst determines if any of the reports or dashboards needs to be changed to accommodate the new findings.

5. Database administrator : The database administrator needs to plan for backup of datasets and any other code that was written to be run on the database for the analytics project.

6. Data engineer : The data engineer needs to share the code, version control it, and maintain it. Any issues or bugs found in the code should be fixed.

7. Data scientist : The data scientist could explain the model to her peers and other stakeholders. She also documents the model and how it was implemented.

1.5 Data Wrangling

You understand that even readymade jeans that you buy need some form of alteration before you can wear them, isn't it? Similarly, in any real-world scenario, the data that you collect for analysis and build models on is usually not in a form where you can consume it directly. That is precisely where data wrangling comes into picture.

Definition : Data wrangling is the process of cleaning and unifying messy and complex data sets to make them more appropriate and valuable for a variety of downstream purposes such as analytics.

Data wrangling is also called as data munging or data pre-processing. Before you proceed, let's take a step back and look at a few concepts.

1.5.1 Need for Data Wrangling

So, why do you need data wrangling at all? Let's revise.

1.5.1(A) Data

The raw data, or just data, is a collection of observations of real-world phenomena. For instance, stock market data might involve observations of daily stock prices, announcements of earnings by individual companies, and even opinion articles from pundits.

Sports data could have information on matches, the environment in which those matches were played, player performances, and several other observations. Similarly, personal biometric data can include measurements of your minute-by-minute heart rate, blood sugar level, blood pressure, oxygen level, etc. You can come up with endless examples of data across different domains.

Each piece of data provides a small window into a limited aspect of reality. The collection of all of these observations gives you a picture of the whole. But the picture is messy because it is composed of a thousand little pieces, and there's always measurement noise and missing pieces.
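As a preview of what wrangling such messy observations looks like in practice, here is a minimal pandas sketch. The columns and cleaning rules are illustrative assumptions only.

import pandas as pd

# Messy biometric observations : duplicates, gaps, and noise.
obs = pd.DataFrame({
    "heart_rate":  [72, 72, None, 68, 400],   # 400 is measurement noise
    "blood_sugar": [90, 90, 85, None, 88],
})

obs = obs.drop_duplicates()                    # remove repeated records
obs = obs[obs["heart_rate"].isna() | (obs["heart_rate"] < 250)]
obs["blood_sugar"] = obs["blood_sugar"].fillna(obs["blood_sugar"].median())
print(obs.dropna())                            # drop rows still missing data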
1.5.1(B) Tasks

Why do you collect the data in the first place? Because there are questions you would like to answer with it. For instance,
- How likely is it that a customer will buy product A over product B?
- Which stock prices are likely to go up the next month?
- What is the risk of a user getting diabetes, based on the biometric data?
- What should you eat to be healthier?
- Which team is likely to win the world cup this year?

The path from data to answers is full of false starts and dead ends. What starts out as a promising approach may not pan out, and what was just a hunch at one moment may end up leading to the best solution. Workflows with data are frequently multistage and iterative processes.

For instance, stock prices are observed on the exchange, aggregated by an intermediary like Thomson Reuters, stored in a database, bought by a company, converted into a Hive store on a Hadoop cluster, pulled out of the store by a script, subsampled, massaged and cleaned by another script, dumped to a file, and converted into a format that you can try out in your favourite modelling library in R, Python, or Scala. The predictions are then dumped back out to a CSV file and parsed by an evaluator. The models are iterated multiple times, rewritten in C++ or Java by your production team, and run on all of the data before the final predictions are pumped out to another database. (Fig. 1.5.1 depicts such a multistage workflow.)

1.5.1(C) Mathematical Models

To build anything meaningful out of data, you require mathematical models. A mathematical model of data describes the relationships between the different aspects of the data.

For instance, a model that predicts stock prices might be a formula that maps a company's earning history, past stock prices, and the industry to the predicted price. A model that recommends music might measure the similarity between users (based on their listening habits) and recommend the same artists to users who have listened to a lot of the same songs.

Mathematical formulas relate numeric quantities to each other. But raw data is often not numeric. (For example, the action "Rohit bought a Motorola handset" is not numeric, and neither is the review that Rohit may write about the handset.) There must be a piece that connects the raw data with the formulas; this is where features come in.

1.5.1(D) Features

A feature describes a particular characteristic or aspect of the data in a form that can be used for modelling and computation. The features are typically numeric, but the raw data may contain aspects that are not. For instance, a day-of-week variable with values "Monday," "Tuesday," ... "Sunday" is categorical and not numeric, and a Boolean variable with values like true/false is not numeric either. Such values could be converted (for example, to the integers 1-7, or to 0 and 1) so that the mathematical models can consume them.

Also, the features present in the data may not always be the right ones. A feature could be wrong (a mistake in measurement), noisy (extra, meaningless variations), redundant (multiple features conveying the same information), or missing (not present for some data points). If a bunch of wrong, noisy, or redundant features is used, the result of the analysis may be wrong as well.
So, let's redefine features as,

Definition : A feature is a numeric representation of raw data.

As you know, the features in a data set are also called its dimensions. So a data set having n features is called an n-dimensional data set. For example, consider the following data set.

Gender    Marks
Girl      65
Girl      46
Boy       56
Boy       43
Boy       53
Boy       49
Girl      42
Boy       84
Boy       44
Girl      42
Girl      40

How many features or dimensions does it have? Two, right? Yes, this is a two-dimensional data set, or you could also say that this data set has 2 features.

I know you are saying out loud that "hey look, the gender field is not numeric". I understand; that is where feature engineering would come in, where you would understand how you could convert this raw data set into something more meaningful and computationally more appropriate. For example, you could assign a value of "0" for boys and a value of "1" for girls. Now that is numeric, isn't it?

1.5.1(E) Feature Engineering

There are many ways to turn raw data into numeric measurements (I just showed you one earlier), which is why features can end up looking like a lot of things. Naturally, features must derive from the type of data that is available. Features are also tied to the model. Some models are more appropriate for some types of features, and vice versa.
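As a quick illustration, here is a minimal pandas sketch of the gender encoding discussed above (this is just one of several possible approaches):

    import pandas as pd

    # The small Gender/Marks data set shown above
    df = pd.DataFrame({
        "Gender": ["Girl", "Girl", "Boy", "Boy", "Boy", "Boy",
                   "Girl", "Boy", "Boy", "Girl", "Girl"],
        "Marks": [65, 46, 56, 43, 53, 49, 42, 84, 44, 42, 40],
    })

    # Encode the categorical values: 0 for boys, 1 for girls
    df["Gender"] = df["Gender"].map({"Boy": 0, "Girl": 1})

    print(df.head())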

The right features are relevant to the task at hand and should be easy for the model to ingest.
Definition : Feature engineering is the process of formulating the most appropriate features given the data,
the model, and the task.
The Fig. 1.5.2 depicts where feature engineering sits in the machine learning pipeline.
Source 1

Source 2 Modeling
Raw
Features Insights
Data

Sourcen

Clean
and
transform

Feature Engineering
Fig. 1.5.2
Features and models sit between raw data and the desired insights. In a machine learning workflow, you pick not only
the model, but also the features.
This is a double-jointed lever, and the choice of one affects the other. Good features make the subsequent modelling step easy and the resulting model more capable of completing the desired task.


Bad features may require a much more complicated model to achieve the same level of performance.

The number of features is also important. If there are not enough informative features, then the model will be unable to perform the ultimate task. If there are too many features, or if most of them are irrelevant, then the model will be more expensive and trickier to train. Something might go wrong in the training process that impacts the model's performance.

Feature engineering typically includes feature creation, feature transformation, feature extraction, and feature selection.

Feature creation identifies the features in the dataset that are relevant to the problem at hand.

Feature transformation manages replacing missing features or features that are not valid.

Feature extraction is the process of creating new features from existing features, typically with the goal of reducing the dimensionality of the features.

Feature selection is the filtering of irrelevant or redundant features from your dataset. This is usually done by observing variance or correlation thresholds to determine which features to remove.

Fig. 1.5.3 : The four types of feature engineering tasks (feature creation, feature transformation, feature extraction, and feature selection)

At a high level, the feature engineering process looks like shown in Fig. 1.5.4.

Fig. 1.5.4 : The feature engineering process (Start → Brainstorm or test the features → Decide what features to create/select → Create or tune the desired features in the feature set → If the feature set is not good enough, repeat; otherwise stop)

1.5.1(F) Data Engineering -vs- Feature Engineering

Often raw data engineering (data pre-processing) is confused with feature engineering. Data engineering is the process of converting raw data into prepared data. Feature engineering then tunes the prepared data to create the features expected by the machine learning model. These terms have specific meanings as outlined here.

1. Raw data (or just data) : This refers to the data in its source form, without any prior preparation for machine learning. Note that in this context, the data might be in its raw form (in a data lake) or in a transformed form (in a data warehouse). Transformed data in a data warehouse might have been converted from its original raw form to be used for analytics, but in this context, it means that the data was not prepared specifically for your machine learning task. In addition, data sent from streaming systems that eventually call machine learning models for prediction is considered to be data in its raw form.

2. Prepared data : This refers to the dataset in the form ready for your machine learning task. Data sources have been parsed, joined, and put into a tabular form. Data has been aggregated and summarized to the right granularity. For example, each row in the dataset represents a unique customer, and each column represents summary information about the customer, like the total spent in the last six weeks. In the case of supervised learning tasks, the target feature is present. Irrelevant columns have been dropped, and invalid records have been filtered out.
3. Engineered features : This refers to the dataset with the tuned features expected by the model, that is, the result of performing certain machine learning specific operations on the columns in the prepared dataset and creating new features for your model during training and prediction. Some of the common examples are scaling numerical columns to a value between 0 and 1, clipping values, and one-hot-encoding categorical features. In practice, data from the same source is often at different stages of readiness. For example, a field from a table in your data warehouse could be used directly as an engineered feature. At the same time, another field in the same table might need to go through transformations before becoming an engineered feature. Similarly, data engineering and feature engineering operations might be combined in the same data pre-processing step. The Fig. 1.5.5 highlights the placement of data engineering and feature engineering tasks.

Fig. 1.5.5 : (Raw) Data → Data Engineering → Prepared Data → Feature Engineering → Engineered Features → Machine Learning
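To make these terms concrete before moving on, here is a minimal pandas sketch of two of the engineered-feature operations mentioned above, min-max scaling a numerical column to a value between 0 and 1 and one-hot-encoding a categorical column (the column names and values are hypothetical):

    import pandas as pd

    # Hypothetical prepared data: one numerical and one categorical column
    prepared = pd.DataFrame({
        "total_spent": [1200.0, 350.0, 9800.0, 40.0],
        "city": ["Mumbai", "Bangalore", "Mumbai", "Kolkata"],
    })

    # Min-max scaling: squeeze total_spent into the range [0, 1]
    col = prepared["total_spent"]
    prepared["total_spent_scaled"] = (col - col.min()) / (col.max() - col.min())

    # One-hot encoding: one 0/1 indicator column per city value
    engineered = pd.get_dummies(prepared, columns=["city"])

    print(engineered)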

1.6 Data Wrangling Methods

There are several data pre-processing (wrangling) techniques, as following :
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
5. Data discretization

Fig. 1.6.1 : Data wrangling methods

Note here that these techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format. Let's learn about each of them in a little more detail.

1. Data Cleaning

Data cleaning can be applied to remove noise and correct inconsistencies in data. Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If the data is dirty, then it could lead to inaccurate results. For example, the day Monday could be represented as "Mon", "M", "1", or "Monday". You need to fix such inconsistencies before you proceed with using your data. Another very common example of inconsistent data could be country names. You could find USA represented as "America", "The US", "US", or "United States of America". Such inconsistencies need to be fixed, for instance as shown in the sketch below.

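A minimal pandas sketch (the column name and values here are just illustrative) of normalising such mixed day representations to one canonical form:

    import pandas as pd

    # A dirty column where Monday appears in several formats
    df = pd.DataFrame({"day": ["Mon", "M", "1", "Monday", "Tue"]})

    # Map every known variant of Monday to one canonical label
    monday_variants = {"Mon": "Monday", "M": "Monday", "1": "Monday"}
    df["day"] = df["day"].replace(monday_variants)

    print(df["day"].tolist())  # ['Monday', 'Monday', 'Monday', 'Monday', 'Tue']

The same replace()-based approach works for the country name variants mentioned above.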
2. Data Integration

Data integration merges data from multiple sources into a coherent data store such as a data warehouse. However, data integration could also lead to inconsistencies. For example, the attribute for customer identification may be referred to as customer_id in one data store and cust_id in another. Naming inconsistencies may also occur for attribute values. For example, the same first name could be registered as "Bill" in one database, "William" in another, and "B." in a third. Furthermore, you may suspect that some attributes can be inferred from others (e.g., annual revenue). Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies during data integration. Typically, data cleaning and data integration are performed as a pre-processing step when preparing data for a data warehouse. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.

3. Data Reduction

Data reduction can reduce data size by aggregating, eliminating redundant features, or clustering. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. Data reduction strategies include dimensionality reduction and numerosity reduction.

In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or "compressed" representation of the original data.

In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models (e.g., regression or log-linear models) or nonparametric models (e.g., histograms, clusters, sampling, or data aggregation).

Let's take an example. Assume that you are watching cricket on your 4K TV. A 4K TV has 3,840 horizontal pixels and 2,160 vertical pixels, for a total of about 8.3 million pixels! Most of the time, you are focusing on the ball. Say that the ball occupies 10,000 pixels to form a 3D image. Your brain is filtering out the rest of the pixels and helping you to focus on 10,000 pixels out of the total 8.3 million pixels presented to it. That is close to just 0.12% of the entire set of pixels in front of you. This is precisely what happens in dimensionality reduction. You reduce the number of dimensions in your dataset to just what matters the most.

Often, you find that your dataset could have 100s of features (or dimensions). Practically, you know that not all dimensions are equally important for analysis or classification of data. Also, it becomes computationally intensive and visually difficult to understand which dimensions have the most influence on the dataset if you have 100s of dimensions.

Definition : Dimensionality reduction techniques help you to reduce the number of dimensions to only keep important dimensions of data and discard all other dimensions.

In most learning algorithms, the complexity depends on the number of input dimensions, d, as well as on the size of the data sample, N. As you increase the number of dimensions, you would also require collecting an increasing number of samples to support those many dimensions (in order to ensure that every combination of features is well represented in the dataset). As the number of dimensions increases, working with the data becomes increasingly harder. This problem is often cited as "the curse of dimensionality".

To make the problem less computationally intensive, you reduce its dimensionality. Decreasing d also decreases the overall complexity of the problem and makes it more plausible to understand the most important dimensions of the data.

Fig. 1.6.2 : Dimensionality reduction turns a dataset with a large number of dimensions into a dataset with a lower number of dimensions
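To make this concrete, here is a minimal sketch using scikit-learn's PCA, which is one common dimensionality reduction technique (the library choice and the toy numbers are assumptions for illustration, not something this section prescribes):

    import numpy as np
    from sklearn.decomposition import PCA

    # A toy dataset with N = 6 samples and d = 4 dimensions
    X = np.array([
        [50000, 720, 34, 1],
        [75000, 580, 33, 2],
        [80000, 710, 37, 1],
        [90000, 715, 29, 3],
        [65000, 690, 41, 2],
        [72000, 600, 36, 3],
    ], dtype=float)

    # Project the data onto the 2 directions of highest variance
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (6, 2): fewer dimensions, same samples
    print(pca.explained_variance_ratio_)  # fraction of variance each kept dimension retains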

Let's take a simple example.

Income    Credit Score    Age    Location     Give Loan
50,000    High            34     Mumbai       Yes
75,000    Low             33     Bangalore    No
80,000    High            37     Mumbai       Yes
90,000    High            29     Kolkata      Yes

This dataset has 4 dimensions (Income, Credit Score, Age, and Location) based on which loan approval seems to be granted. But, if you look closely, then you would find that the other dimensions do not influence the decision as much as the Credit Score dimension. So, the same dataset with reduced dimensions could be as following.

Credit Score    Give Loan
High            Yes
Low             No
High            Yes
High            Yes

This dimensional reduction not only makes the algorithms computationally less intensive but also makes it simple to understand and visualise the dataset as well as the results.

Note here that the goal of dimensionality reduction is NOT to reduce (or compromise on) the quality of data when discarding unnecessary dimensions but to only keep the dimensions that matter the most without any significant loss of quality. This is very similar to how you compress a file without losing information or how you switch to a lower resolution for a video if your internet speed is not optimal. Lowering the resolution does not change the video much, it is just that it does not look that sharp!

4. Data Transformation

Data transformations (e.g., normalisation) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of modelling algorithms involving distance measurements. For example, your customer data may contain the attributes "age" and "annual salary". The "annual salary" attribute usually takes much larger values than age. Therefore, if the attributes are left unnormalized, the distance measurements taken on "annual salary" will generally outweigh distance measurements taken on age. One such common transformation technique is the log transform.

Log Transform

The log transform is a powerful tool for dealing with large positive numbers with a heavy-tailed distribution. A heavy-tailed distribution places more entries towards the tail end of the plot rather than the centre. The log transform compresses the long tail in the high end of the distribution into a shorter tail and expands the low end into a longer head. Let's understand how.

The log function is the inverse of the exponential function. It is defined such that log_a(a^x) = x, where a is a positive constant and x can be any positive number. Since a^0 = 1, you get log_a(1) = 0. This means that the log function maps the small range of numbers between (0, 1) to the entire range of negative numbers (-∞, 0). The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on. In other words, the log function compresses the range of large numbers and expands the range of small numbers. The larger x is, the slower log(x) increments.

For example, in the Fig. 1.6.3, note how the horizontal x values from 100 to 1,000 get compressed into just 2.0 to 3.0 in the vertical y range, while the tiny horizontal portion of x values less than 100 is mapped to the rest of the vertical range.

Fig. 1.6.3 : Plot of log10(x) for x up to 1,000
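A minimal numpy sketch of this compression (the numbers are just illustrative review counts, anticipating the example that follows):

    import numpy as np

    # Heavy-tailed values: most are small, a few are very large
    review_counts = np.array([3, 12, 25, 90, 2000, 15000])

    # log10 compresses the large values and spreads out the small ones
    log_counts = np.log10(review_counts)

    print(np.round(log_counts, 2))  # [0.48 1.08 1.4  1.95 3.3  4.18]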
Log transform is commonly used when you are dealing with large numbers. For example, some businesses have a lot of reviews (say over 2000) and some have only a few (say in the 20s). In such a scenario, it becomes difficult to compare and correlate one business with the other because a large count in one element of the data set would outweigh the similarity in all other elements, which could throw off the entire similarity measurement for various machine learning algorithms. Log transformation helps to normalise skewed data.

5. Data Discretisation

Data discretisation and concept hierarchy generation can also be useful, where raw data values for attributes are replaced by ranges or higher conceptual levels. For example, raw values for age may be replaced by higher level concepts, such as youth, adult, or senior. One common data discretisation technique is quantization or binning.

6. Quantization or Binning

Quantization or binning is a feature construction technique where you could combine features to create segments or bins of information. In other words, you group the counts into bins, and get rid of the actual count values.

For example, consider the following data set for real estate sites.

Site Length    Site Breadth    Site Price
30             40              40 Lakhs
40             32              40 Lakhs
30             30              30 Lakhs
35             45              45 Lakhs
40             60              60 Lakhs
60             80              90 Lakhs

Instead of having site length and site breadth that are not very useful for establishing a modelling pattern, you could create a new feature such as "site area". The "site area" feature could then provide a good estimate of site price.

Site Length    Site Breadth    Site Area    Site Price
30             40              1200         40 Lakhs
40             32              1280         40 Lakhs
30             30              900          30 Lakhs
35             45              1575         45 Lakhs
40             60              2400         60 Lakhs
60             80              4800         90 Lakhs

There could be several other examples of binning. For example, it is common to see custom-designed age ranges that better correspond to stages of life, such as :
0-12 years old
12-17 years old
18-24 years old
25-34 years old
35-44 years old
45-54 years old
55-64 years old
65-74 years old
75 years or older

You could get rid of the actual age in a large data set and create bins based on age groups. You could then model things such as product or service preferences, reviews, demographics, eating habits, etc. more elegantly.

It is up to your requirements to create bins as you need. For example, you may need to create bins for income. You could then have something like the following (remember how income tax or electricity bills have various slabs?).
Below 5 Lakhs
5-10 Lakhs
10-20 Lakhs
20-50 Lakhs
Over 50 Lakhs

You can create bins on a fixed value range or take the value range based on other mathematical derivations such as quantile, percentile, median, etc.
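A minimal pandas sketch of fixed-range income binning using the slabs above (pd.qcut would create quantile-based bins instead):

    import pandas as pd

    # Incomes in lakhs
    income = pd.Series([3.2, 7.5, 12.0, 18.5, 35.0, 80.0])

    # Fixed value-range bins matching the slabs above
    bins = [0, 5, 10, 20, 50, float("inf")]
    labels = ["Below 5 Lakhs", "5-10 Lakhs", "10-20 Lakhs",
              "20-50 Lakhs", "Over 50 Lakhs"]

    income_slab = pd.cut(income, bins=bins, labels=labels)
    print(income_slab.tolist())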
Note here that binning can be carried out on both categorical and numerical data. For example, you would have seen a BMI chart that classifies you into various fitness categories based on your body weight and height. Hence, you are binning the categorical data (fitness category). Instead of worrying about each possible value of BMI = weight in kg / (height in metre)^2, you just combine the data and bin it as per the respective categories.

BMI            Nutritional status
Below 18.5     Underweight
18.5 - 24.9    Normal weight
25.0 - 29.9    Pre-obesity
30.0 - 34.9    Obesity class I
35.0 - 39.9    Obesity class II
Above 40       Obesity class III

Review Questions

Here are a few review questions to help you gauge your understanding of this chapter. Try to attempt these questions and ensure that you can recall the points mentioned in the chapter.

[A] Introduction and Big Data Overview

Q. 1  Define the term Big Data and give a few examples of Big Data. [4 Marks]
Q. 2  Write a short note on Big Data. [4 Marks]
Q. 3  Explain the five Vs of Big Data. [6 Marks]
Q. 4  Describe the characteristics of Big Data. [6 Marks]
Q. 5  Explain the terms veracity and value with respect to Big Data. [4 Marks]
Q. 6  Explain the terms volume, variety, and velocity with respect to Big Data. [6 Marks]
Q. 7  Explain some of the major applications of Big Data Analytics. [6 Marks]
Q. 8  Describe three applications of Big Data Analytics. [6 Marks]
Q. 9  Explain the various Big Data formats. [6 Marks]
Q. 10 With examples, explain the types of Big Data formats. [6 Marks]
Q. 11 Write a short note on Big Data formats. [4 Marks]
Q. 12 Compare Big Data formats. [6 Marks]
Q. 13 Compare structured, semi-structured, and unstructured data. [6 Marks]
Q. 14 Write a short note on the DIKW pyramid. [5 Marks]
Q. 15 Draw the DIKW pyramid and explain it. [6 Marks]
Q. 16 Explain the journey of data enrichment as it moves from hindsight to foresight. [6 Marks]
Q. 17 Describe the various categories of data analytics. [6 Marks]
Q. 18 What types of analytics can be performed on a dataset? [6 Marks]
Q. 19 Write a short note on Descriptive Analytics. [4 Marks]
Q. 20 Write a short note on Diagnostic Analytics. [4 Marks]
Q. 21 Write a short note on Predictive Analytics. [4 Marks]
Q. 22 Write a short note on Prescriptive Analytics. [4 Marks]
Q. 23 Compare the various categories of data analytics. [6 Marks]

[B] State of the Practice in Analytics

Q. 24 Write a short note on Business Intelligence. [4 Marks]
Q. 25 What is business intelligence? [4 Marks]
Q. 26 Compare business intelligence with data analytics. [5 Marks]
Q. 27 The terms business intelligence and data analytics could be used interchangeably. Comment. [6 Marks]
Q. 28 Describe the relationship between information science and data science. [4 Marks]
Q. 29 Compare information science and data science. [4 Marks]
Q. 30 Draw a block diagram of the current analytical architecture and explain it. [4 Marks]
Q. 31 With a block diagram, explain the traditional analytical architecture. [4 Marks]
Q. 32 Explain the challenges of a traditional analytical architecture. [4 Marks]
Q. 33 Write a short note on traditional analytical architecture. [4 Marks]
