Big-Data-Unit1 NOTES


UNIT I

INTRODUCTION TO BIG DATA
Big Data – Definition, Characteristic Features – Big Data Applications - Big Data vs. Traditional Data - Risks of Big Data - Structure of Big Data - Challenges of Conventional Systems - Web Data – Evolution of Analytic Scalability - Evolution of Analytic Processes, Tools and methods - Analysis vs. Reporting - Modern Data Analytic Tools.

1.1 DEFINITION:

Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.

1.2 CHARACTERISTICS OF BIG DATA:
(i) Volume
The name 'Big Data' itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value that can be drawn out of it. Also, whether a particular data set can actually be considered Big Data or not depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big Data'.
(ii) Variety
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
(iii) Velocity
The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demand determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks and social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability
This refers to the inconsistency which the data can show at times, hampering the process of handling and managing the data effectively.
1.3 BIG DATA APPLICATIONS:
Some of the industries propelled by big data analytics are:

· Public Sector Services.
· Healthcare contributions.
· Learning Services.
· Insurance Services.
· Industrialized and Natural Resources.
· Transportation Services.
· Banking Sectors and Fraud Detection.
Real Time Big data Applications:
a) Procurement with Big data
Demand can be forecast properly for the different conditions captured in Big Data.
b) Big data in Product development
Big data helps decide which product should be developed to increase sales.
c) Big data in manufacturing sector
Big data can be used to identify machinery and process variations that may be indicators of quality problems.
d) Big data for product distribution
The available data can be analysed to ensure proper distribution in the proper market.
e) Big data in Marketing field
Big data helps in identifying a better marketing strategy that could increase sales.
f) Price Management using Big data
To maintain its position in the market, a business depends on price management, and Big data helps the business know the market trend for it.
g) Merchandising
Big Data also plays a major role in sales for the retail market.

h) Big data in Sales
It helps in increasing sales for the business. It also helps in optimizing the assignment of sales resources and accounts, product mix and other operations.
i) Store Operations using Big Data
Different tools can be used to monitor store operations, which reduces manual work. Big data helps in adjusting inventory levels on the basis of predicted buying patterns, study of demographics, weather, key events, and other factors.
j) Big data in Human Resources
Big Data has changed the way recruitment and other HR operations are carried out. You can also find out the characteristics and behaviors of successful and effective employees, as well as other employee insights, to manage talent better.
k) Big data in Banking
Big Data has given companies like Citibank their biggest opportunity to see the big picture, balancing the sensitive nature of the data against delivering value to clients while prioritizing the privacy and protection of information. It has been fully adopted by many companies to drive business growth and enhance the services they provide to customers. Understand further how Income tax has benefited from Big data.
l) Big data in Finance sector
Financial services have widely adopted big data analytics to inform better investment decisions with consistent returns. The big data pendulum for financial services has swung from passing fad to large deployments last year.
m) Big data in Telecom
A recent report, “Global Big Data Analytics Market in Telecom Industry 2014-2018,” found that use of data analytics tools in the telecom sector is expected to grow at a compound annual growth rate of 28.28 percent over the next four years. Mobile Telecom harnesses Big Data with a combined Actuate and Hadoop solution.
n) Big data in retail sector
Retailers harness Big Data to offer consumers personalized shopping experiences. Analyzing how a customer came to make a purchase, or the path to purchase, is one way big data tech is making a mark in retail. 66% of retailers have made financial gains in customer relationship management through big data.

o) Big data in Health Care
Big data is used for analyzing data in the electronic medical record (EMR) system with the goal of reducing costs and improving patient care. This data includes the unstructured data from physician notes, pathology reports etc. Big Data and healthcare analytics have the power to predict, prevent & cure diseases.
p) Big data in Media and Entertainment
Big data is changing the media and entertainment industry, giving users and viewers a much more personalized and enriched experience. Big data is used for increasing revenues, understanding real-time customer sentiment, increasing marketing effectiveness and ratings and viewership.
q) Big Data in tourism
Big data is transforming the global tourism industry. People know more about the world than ever before. People have much more detailed itineraries these days with the help of Big data.
r) Big data in Airlines
Big Data and Analytics give wings to the Aviation Industry. An airline now knows where a plane is headed, where a passenger is sitting, and what a passenger is viewing on the IFE or connectivity system.
s) Big data in Social Media
Big data is a driving factor behind every marketing decision made by social media companies and it is driving personalization to the extreme.
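
Item (m) above quotes a compound annual growth rate of 28.28 percent over four years. The short sketch below works out what that rate implies in total, assuming the rate compounds once per year. The function name `compound_growth_factor` is made up for this illustration and does not come from the report.

```python
def compound_growth_factor(annual_rate_percent, years):
    """Total growth factor after compounding an annual rate for `years` years."""
    return (1 + annual_rate_percent / 100) ** years


if __name__ == "__main__":
    factor = compound_growth_factor(28.28, 4)
    # 1.2828 ** 4 is roughly 2.71, i.e. usage would be about 2.7 times its starting level.
    print(f"Growth factor over four years: {factor:.2f}")
```

In other words, under this assumption a 28.28 percent annual rate means the market would grow to roughly 2.7 times its starting size in four years.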

1.4 BIG DATA VS TRADITIONAL DATA:

The major differences between traditional data and big data are discussed below.
Data architecture
Traditional data uses a centralized database architecture in which large and complex problems are solved by a single computer system. Centralised architecture is costly and ineffective for processing large amounts of data. Big data is based on the distributed database architecture, where a large block of data is handled by dividing it into several smaller pieces. The solution to a problem is then computed by several different computers present in a given computer network. The computers communicate with each other in order to find the solution to the problem. The distributed database provides better computing, lower price and improved performance as compared to the centralized database system. This is because centralized architecture is based on mainframes, which are not as economical as the microprocessors used in a distributed database system. The distributed database also has more computational power than the centralized database system which is used to manage traditional data.
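
The divide-and-combine idea described above can be illustrated with a small sketch. It is only an analogy: worker threads on one machine stand in for the separate computers of a distributed system, and the names `split_into_chunks`, `partial_total` and `distributed_total` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Divide one large block of data into several smaller blocks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def partial_total(chunk):
    """Work done independently by one 'computer' on its own block."""
    return sum(chunk)


def distributed_total(records, chunk_size=4, workers=3):
    """Compute a total by combining the partial results of all workers."""
    chunks = split_into_chunks(records, chunk_size)
    # Threads stand in for networked machines (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_total, chunks))
    return sum(partials)


if __name__ == "__main__":
    sales = list(range(1, 101))  # hypothetical data: the numbers 1 to 100
    print(distributed_total(sales))  # same answer as sum(sales)
```

Each worker sees only its own block, and only the small partial results travel back to be combined, which is the reason the approach can be spread over many inexpensive machines.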
Types of data
Traditional database systems are based on structured data, i.e. traditional data is stored in a fixed format or in fields in a file. Examples of structured data include the Relational Database System (RDBMS) and spreadsheets, which only answer questions about what happened. A traditional database only provides insight into a problem at a small level. However, in order to enhance the ability of an organization to gain more insight into the data, and also to know about metadata, unstructured data is used. Big data uses semi-structured and unstructured data and improves the variety of the data gathered from different sources like customers, audience or subscribers. After the collection, big data transforms it into knowledge-based information.
Volume of data
The traditional database system can store only a small amount of data, ranging from gigabytes to terabytes. However, big data helps to store and process large amounts of data, consisting of hundreds of terabytes or petabytes of data and beyond. The storage of a massive amount of data would reduce the overall cost of storing data and help in providing business intelligence.
Data schema
Big data uses a dynamic schema for data storage. Both unstructured and structured information can be stored, and any schema can be used, since the schema is applied only after a query is generated. Big data is stored in raw format and the schema is applied only when the data is to be read. This process is beneficial in preserving the information present in the data. The traditional database is based on a fixed schema which is static in nature. In a traditional database the structure of the data cannot be changed once it is saved, because the schema is applied during write operations.
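
The contrast between applying a schema at write time and at read time can be sketched as follows. This is a minimal illustration under stated assumptions, not how any particular database implements it: `FIXED_SCHEMA`, the function names and the sample records are all hypothetical.

```python
import json

# Hypothetical fixed schema of a traditional table: field name -> required type.
FIXED_SCHEMA = {"id": int, "name": str}


def write_with_schema(table, record):
    """Schema-on-write: validate the record before it is stored."""
    if set(record) != set(FIXED_SCHEMA):
        raise ValueError("record does not match the fixed schema")
    for field, expected_type in FIXED_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} has the wrong type")
    table.append(record)


def write_raw(store, raw_line):
    """Schema-on-read: store the raw text untouched."""
    store.append(raw_line)


def read_with_schema(store, fields):
    """Apply a schema only at read time; missing fields become None."""
    for raw_line in store:
        record = json.loads(raw_line)
        yield {field: record.get(field) for field in fields}


if __name__ == "__main__":
    raw_store = []
    write_raw(raw_store, '{"id": 1, "name": "Asha", "email": "asha@example.com"}')
    write_raw(raw_store, '{"id": 2, "clicks": 17}')  # different shape, still accepted
    print(list(read_with_schema(raw_store, ["id", "email"])))
```

The raw store keeps every field that arrived, so a later query can ask for `email` or `clicks` even though nobody planned for them when the data was written. The fixed-schema table would have rejected the second record.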
Data relationship
In the traditional database system, relationships between the data items can be explored easily because the amount of information stored is small. However, big data contains massive or voluminous data, which increases the difficulty of figuring out the relationships between the data items.
Scaling
Scaling refers to the demand for the resources and servers required to carry out the computation. Big data is based on the scale-out architecture, under which distributed approaches to computing are employed with more than one server. So, the load of the computation is shared across servers rather than carried by a single application-based system. However, achieving scalability in the traditional database is very difficult because the traditional database runs on a single server and requires expensive servers to scale up.
Higher cost of traditional data
A traditional database system requires complex and expensive hardware and software in order to manage a large amount of data. Moving the data from one system to another also requires a larger number of hardware and software resources, which increases the cost significantly. In the case of big data, the massive amount of data is segregated between various systems, so the amount of data held on each system decreases. So the use of big data is quite simple: it makes use of commodity hardware and open source software to process the data.
Accuracy and confidentiality
Under the traditional database system it is very expensive to store a massive amount of data, so not all the data can be stored. This decreases the amount of data to be analyzed, which in turn decreases the accuracy of, and confidence in, the results. In big data systems the cost of storing voluminous data is lower, so the data is stored and the points of correlation are identified, which provides highly accurate results.

1.5 RISKS OF BIG DATA:
Data Security
This risk is obvious and often uppermost in our minds when we are considering the logistics of data collection and analysis. Data theft is a rampant and growing area of crime, and attacks are getting bigger and more damaging. The bigger your data, the bigger the target it presents to criminals with the tools to steal and sell it. In the case of Target, hackers stole credit and debit card information of 40 million customers, as well as personal identifying information such as email and geographical addresses of up to 110 million people.
In March, a federal judge approved a settlement in which Target would pay $10 million into a settlement fund, from which payments of up to $10,000 would be made to everyone affected by the breach.
Data Privacy
Closely related to the issue of security is privacy. But in addition to ensuring that people’s personal data are safe from criminals, you need to be sure that the sensitive information you are storing and collecting isn’t going to be divulged through less malevolent but equally damaging misuse by yourself or by people to whom you have delegated responsibility for analyzing and reporting on it.
Failing to follow applicable data protection laws can lead to expensive lawsuits and even prison, depending on what sort of data you are using and the jurisdiction you are in. Last year, private hire and car sharing service Uber stirred up controversy when one of its executives was caught using the service’s “God mode” to track the movements of BuzzFeed journalist Johana Bhuiyan.
Costs
Data collection, aggregation, storage, analysis, and reporting all cost money. On top of this, there will be compliance costs, to avoid falling foul of the issues I raised in the previous point. These costs can be mitigated by careful budgeting during the planning stages, but getting it wrong at that point can lead to spiralling costs, potentially negating any value added to your bottom line by your data-driven initiative. This is why “starting with strategy” is so vital. A well-developed strategy will clearly set out what you intend to achieve and the benefits that can be gained so they can be balanced against the resources allocated to the project. One bank that I worked with was worried about the costs of storing and maintaining all the data it was collecting, to the point that it was considering pulling the plug on one particular analytics project, as the costs looked likely to exceed any potential savings. By identifying and eliminating irrelevant data from the project, the bank was able to bring costs back under control and achieve its objectives.
Bad Analytics
Aka “getting it wrong.” Misinterpreting the patterns shown by your data and drawing causal links where there is in fact merely random coincidence is an obvious pitfall. Sales data may show a rise following a major sporting event, prompting you to draw a link between sports fans and your products or services, when in fact the rise is based on there being more people in town, and the rise would be equally dramatic after a large live music event.
In addition, care must be taken to avoid confirmation bias, easily imposed when an analyst comes to a project with predetermined ideas about what they are looking for and is blinded to insights from the data that go against these preconceived notions. The only way to protect against this is to ensure that you are implementing all best practice procedures from top to bottom throughout your project.
Google’s Flu Trends project serves as a good example. Designed to produce accurate maps of flu outbreaks based on the searches being made by Google users, at first it provided compelling results. But as time went on, its predictions began to diverge increasingly from reality. It turned out that the algorithms behind the project just weren’t accurate enough to pick up anomalies such as the 2009 H1N1 pandemic, vastly reducing the value that could be gained from them.
Bad Data
I’ve come across many data projects that start off on the wrong foot by collecting irrelevant, out of date, or erroneous data. This usually comes down to insufficient time being spent on designing the project strategy. The big data gold rush has led to a “collect everything and think about analyzing it later” approach at many organizations. This not only adds to the growing cost of storing the data and ensuring compliance, it leads to large amounts of data that can become outdated very quickly.
The real danger here is falling behind your competition. If you are not analyzing the right data, you won’t be drawing the right insights that will provide value. Meanwhile, your competitors most likely will be running their own data projects. And if they are getting it right, they’ll take the lead.
A healthcare client I recently worked with created a 217-page report for senior management. A lot of the data in the report would have been useful, but it was drowned out by irrelevant background noise. Working with them, I was able to show them how to cut the report down to 20 pages, mostly infographics, which clearly showed the relevant data while omitting a lot of the noise.
That’s just a simple checklist of the risks that every big data project needs to account for before one cent is spent on infrastructure or data collecting. Businesses of all sizes should engage wholeheartedly with big data projects. If they don’t, they run the serious risk of being left behind. But they also should be aware of the risks and enter into big data projects with their eyes wide open.

1.6 STRUCTURE OF BIG DATA:

Figure: Big Data structures, models and their linkage at different processing stages.

1.7 CHALLENGES OF CONVENTIONAL SYSTEMS:

In the past, the term ‘Analytics' has been used in the business intelligence world to provide tools and intelligence to gain insight into the data through fast, consistent, interactive access to a wide variety of possible views of information. Data mining has been used in enterprises to keep pace with the critical monitoring and analysis of mountains of data. The main challenge in the traditional approach is how to unearth all the hidden information in the vast amount of data.
· Traditional analytics works on the known data terrain, and only on data that is well understood. It cannot work on unstructured data efficiently.
· Traditional analytics is built on top of the relational data model: relationships between the subjects of interest are created inside the system and the analysis is done based on them. This approach is not adequate for big data analytics.
· Traditional analytics is batch oriented, and we need to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained.
· Parallelism in a traditional analytics system is achieved through costly hardware like MPP (Massively Parallel Processing) systems.
· Inadequate support for aggregated summaries of data.

Apart from these challenges, others are categorized as
Data challenges
-Volume, velocity, veracity, variety
-Data discovery and comprehensiveness
-Scalability
Process challenges
-Capturing data
-Aligning data from different sources
-Transforming data into suitable form for data analysis
-Modeling data (mathematically, simulation)
-Understanding output, visualizing results and display issues on mobile devices
Management challenges
-Security
-Privacy
-Governance
-Ethical issues
Traditional/RDBMS challenges
-Designed to handle well structured data
-Traditional storage vendor solutions are very expensive
-Shared block-level storage is too slow
-Read data in 8k or 16k block size
-Schema-on-write requires data be validated before it can be written to disk
-Software licenses are too expensive
-Get data from disk and load into memory requires application

1.8 WEB DATA:
In the world of Big Data, there's a lot of talk about unstructured data -- after all, "variety" is one of the three Vs. Often these discussions dwell on log file data, sensor output or media content. But what about data on the Web itself -- not data from Web APIs, but data on Web pages that were designed more for eyeballing than machine-driven query and storage? How can this data be read, especially at scale?
Recently, I had a chat with the CTO and Founder of Kapow Software, Stefan Andreasen, who showed me how the company's Katalyst product tames data-rich Web sites not designed for machine-readability.
Scraping the Web:
If you're a programmer, you know that Web pages are simply visualizations of HTML markup -- in effect every visible Web page is really just a rendering of a big string of text. And because of that, the data you may want out of a Web page can usually be extracted by looking for occurrences of certain text immediately preceding and following that data, and taking what's in between.
Code that performs data extraction through this sort of string manipulation is sometimes said to be performing Web "scraping." The term pays homage to "screen scraping," a similar, though much older, technique used to extract data from mainframe terminal screen text. Web scraping has significant relevance to Big Data. Even in cases where the bulk of a Big Data set comes from flat files or databases, augmenting that with up-to-date reference data from the Web can be very attractive, if not outright required.
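
The extraction technique described above, finding the text that precedes and follows the data and taking what lies in between, can be sketched in a few lines. The helper name `extract_between` and the sample page markup are invented for this illustration. A real page would first have to be downloaded, which is left out here.

```python
def extract_between(text, before, after):
    """Return the text between the first `before` marker and the next `after` marker.

    Returns None when either marker is missing.
    """
    start = text.find(before)
    if start == -1:
        return None
    start += len(before)
    end = text.find(after, start)
    if end == -1:
        return None
    return text[start:end]


if __name__ == "__main__":
    # Hypothetical page markup, invented for this example.
    page = '<html><body><span class="price">Rs. 499</span></body></html>'
    print(extract_between(page, '<span class="price">', "</span>"))  # Rs. 499
    # A small markup change defeats the same markers and nothing is found.
    changed = '<html><body><span class="cost">Rs. 499</span></body></html>'
    print(extract_between(changed, '<span class="price">', "</span>"))  # None
```

The second call shows how much this approach depends on the exact markup: renaming a single attribute is enough for the markers to stop matching.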
Unlocking Important Data:
But not all data is available through downloads, feeds or APIs. This is especially true of government data, various Open Data initiatives notwithstanding. Agencies like the US Patent and Trademark Office (USPTO) and the Federal Securities and Exchange Commission (SEC) have tons of data available online, but API access may require subscriptions from third parties.
Similarly, there's lots of commercial data available online that may not be neatly packaged in code-friendly formats either. Consider airline and hotel frequent flyer/loyalty program promotions. You can log into your account and read about them, but just try getting a list of all such promotions that may apply to a specific property or geographic area, and keeping the list up-to-date. If you're an industry analyst wanting to perform ad hoc analytical queries across such offers, you may be really stuck.
Downside Risk:
So it's Web scraping to the rescue, right? Not exactly, because Web scraping code can be brittle. If the layout of a data-containing Web page changes -- even by just a little -- the text patterns being searched may be rendered incorrect, and a mission critical process may completely break down. Fixing the broken code may involve manual inspection of the page's new markup, then updating the delimiting text fragments, which would hopefully be stored in a database, but might even be in the code itself.
Such an approach is neither reliable, nor scalable. Writing the code is expensive and updating it is too. What is really needed for this kind of work is a scripting engine which determines the URLs it needs to visit, the data it needs to extract and the processing it must subsequently perform on the data. What's more, allowing the data desired for extraction, and the delimiters around it, to be identified visually, would allow for far faster authoring and updating than would manual inspection of HTML markup.
An engine like this has really been needed for years, but the rise of Big Data has increased the urgency, because this data is no longer needed just for simple and quick updates. In the era of Big Data, we need to collect lots of this data and analyze it.
Making it Real:
Kapow Software's Katalyst product meets the spec, and then some. It provides all the wish list items above: visual and interactive declaration of desired URLs, data to extract and delimiting entities in the page.
So far, so good. But Katalyst doesn't just build a black box that grabs the data for you. Instead, it actually exposes an API around its extraction processes, thus enabling other code and other tools to extract the data directly.
That's great for public Web sites that you wish to extract data from, but it's also good for adding an API to your own internal Web applications without having to write any code. In effect, Katalyst builds data services around existing Web sites and Web applications, does so without required coding, and makes any breaking layout changes in those products minimally disruptive.
Maybe the nicest thing about Katalyst is that it's designed with data extraction and analysis in mind, and it provides a manageability layer atop all of its data integration processes, making it perfect for Big Data applications where repeatability, manageability, maintainability and scalability are all essential.
Web Data is BI, and Big Data:
Katalyst isn't just a tweaky programmer's toolkit. It's a real, live data integration tool. Maybe that's why Informatica, a big name in BI which just put out its 9.5 release this week, announced a strategic partnership with Kapow Software. As a result, Informatica PowerExchange for Kapow Katalyst will be made available as part of Informatica 9.5.
Version 9.5 is the Big Data release of Informatica, with the ability to treat Hadoop as a standard data source and destination. Integrating with this version of Informatica makes the utility of Katalyst in Big Data applications not merely a provable idea, but a product reality.

1.9 ANALYSIS Vs REPORTING:
There are five differences between reporting and analysis:
1. Purpose
Reporting has helped companies monitor their data since even before digital technology boomed. Various organizations have been dependent on the information it brings to their business, as reporting extracts that information and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link cross-channel data, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools and not analysis reports), analysis interprets this information and provides recommendations on actions.
2. Tasks
As reporting and analysis have a very fine line dividing them, it’s sometimes easy to confuse tasks that are labeled analysis when all they do is reporting. Hence, ensure that your analytics team has a healthy balance of doing both.
Here’s a great differentiator to keep in mind when deciding whether what you’re doing is reporting or analysis:
Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It is very similar to the examples mentioned above, like turning data into charts and graphs and linking data across multiple channels.
Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big data, predicting is possible as well.
3. Outputs
Reporting and analysis have a push and pull effect on their users through their outputs. Reporting has a push approach, as it pushes information to users, and outputs come in the form of canned reports, dashboards, and alerts.
Analysis has a pull approach, where a data analyst draws information to probe further and to answer business questions. Outputs from this can be in the form of ad hoc responses and analysis presentations. Analysis presentations are comprised of insights, recommended actions, and a forecast of their impact on the company, all in a language that’s easy to understand at the level of the user who’ll be reading and deciding on it.
This is important for organizations to truly realize the value of data: a standard report is not the same as meaningful analytics.
4. Delivery
Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It’s not surprising that the first things outsourced are data entry services, since outsourcing companies are perceived as data reporting experts.
Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps towards accomplishing a specific goal. This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations so that leaders or business executives can make decisions about their businesses.
5. Value
This isn’t about identifying which one brings more value, but rather understanding that both are indispensable when looking at the big picture. It should help businesses grow, expand, move forward, and make more profit or increase their value.

1.10 MODERN DATA ANALYTIC TOOLS:
Following are some of the prominent big data analytics tools and techniques that are used by analytics developers.
Cassandra:
This is one of the most applauded and widely used big data tools because it offers effective management of large and intricate amounts of data. It is a database which offers high availability and scalability on commodity hardware and cloud infrastructure without compromising performance. Cassandra has many advantages, among them fault tolerance, decentralization, durability, performance, professional support, elasticity, and scalability. Because it has so many of these qualities, the tool is popular with analytics developers. Companies using the Cassandra big data analytics tool include eBay and Netflix.
Hadoop:
This is a striking product from Apache which has been used by many eminent companies. Hadoop is basically an open-source software framework, written in the Java language, for working with large data sets. It is designed so that it can scale up from a single server to hundreds of machines. The most prominent feature of this advanced software library is superior processing of voluminous data sets. Many companies choose the big data tool Hadoop because of its great processing capabilities. The developers of this tool provide regular updates and improvements to the product.
Knime:
This is an open source big data analytics tool. Knime is a leading analytics platform which provides an open solution for data-driven innovation. With the help of this tool, you can discover the hidden potential of your data, mine for fresh insights, and predict new futures by analysing the data. With nearly 1000 modules, hundreds of ready-to-run examples, a complete range of integrated tools, and a large set of advanced algorithms available, the Knime analytics platform makes a strong toolbox for any data scientist who wants to get the job done in a hassle-free way. This tool can support any type of data like XML, JSON, images, documents, and more. It also possesses advanced predictive and machine learning algorithms.

OpenRefine:
Are you stuck with large and voluminous datasets? Then this tool is ideal for you, because it helps you explore huge, untidy datasets easily. OpenRefine helps you organize data that was previously a mess and a muddle. It helps in cleaning data and transforming it from one format into another. It can also be used to link and extend your datasets with web services and other external data. OpenRefine was earlier known as Google Refine; Google stopped supporting the project in 2012, and it was then rebranded as OpenRefine.
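The kind of cleaning and format conversion described above can be illustrated with a small sketch using only the Python standard library. It is not OpenRefine itself; the sample data, the `city` column and the function names `clean_rows` and `to_json` are assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical messy input: stray spaces, inconsistent case, a duplicate row.
SAMPLE = "name,city\n Asha , chennai\nAsha,Chennai\nRavi,  MUMBAI \n"


def clean_rows(csv_text, title_case_columns=("city",)):
    """Trim whitespace, normalise case in chosen columns, drop duplicate rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    seen = set()
    for row in reader:
        record = {key.strip(): (value or "").strip() for key, value in row.items()}
        for column in title_case_columns:
            if column in record:
                record[column] = record[column].title()
        fingerprint = tuple(record.items())
        if fingerprint in seen:
            continue  # same record already kept once
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


def to_json(records):
    """Transform the cleaned records from CSV rows into JSON text."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(to_json(clean_rows(SAMPLE)))
```

The two steps mirror the description in the notes: first tidy the values, then convert the result from one format (CSV) into another (JSON).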

R language:
R is an open source programming language that helps organizations manage and analyse large amounts of data effectively. The language was initially written by Ross Ihaka and Robert Gentleman, and it has since won immense appreciation from mathematicians, statisticians, data scientists and data miners working in data analytics. R comes packed with a host of data analysis tools that make analysis easier and simpler for users. With R, businesses do not need to develop customized tools, and they can avoid writing time-consuming code. R is a prime data analysis environment, offering innumerable algorithms designed for data retrieval, processing, analysis and high-end statistical graphics.

Plotly:
As a successful big data analytics tool, Plotly is used to create great dynamic visualizations even when an organization has too little time or skill to meet its big data needs. With the help of this tool you can create stunning and informative graphics with very little effort. Plotly is used for composing, editing, and sharing interactive data visualizations via the web.

Bokeh:
This tool closely resembles Plotly. It is very effective and useful if you want to create easy and informative visualizations. Bokeh is a Python interactive visualization library that helps you create striking and meaningful visual presentations of data in web browsers. It is therefore widely used by experienced big data analytics practitioners to create interactive data applications, dashboards, and plots quickly and easily. Many data analytics experts regard Bokeh as one of the most progressive and effective visual data representation tools.

Neo4j:
Neo4j is one of the leading big data analytics tools, as it takes the big data business to the next level. Neo4j is a graph database management system developed by Neo4j Inc. It helps you work with data items and the connections between them. The connections between data drive modern intelligent applications, and Neo4j is the tool that turns these connections into a competitive advantage. According to the DB-Engines ranking, Neo4j is the most popular graph database.
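The idea of treating connections as first-class data can be sketched without a database at all. The following plain-Python example is not Neo4j or its query language; the people, the "knows" edges and the function names `build_graph` and `within_hops` are hypothetical, and the traversal is an ordinary breadth-first search.

```python
from collections import deque

# Hypothetical data: each pair means the two people know each other.
EDGES = [("Asha", "Ravi"), ("Ravi", "Meena"), ("Meena", "John"), ("Asha", "Kiran")]


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Return {node: hop distance} for nodes reachable from start in at most max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]  # the start node is not its own connection
    return distances


if __name__ == "__main__":
    print(within_hops(build_graph(EDGES), "Asha", 2))
```

A question such as "who is within two introductions of Asha?" is answered by following edges directly, which is the kind of query a graph database is built to make convenient.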

Rapidminer:
This is one of the favourite tools of data specialists. Like Knime, it is an open source data science platform that operates through visual programming. The tool can manipulate, analyse and model data and integrate it into business processes. RapidMiner helps data science teams become more productive by providing an open source platform for data preparation, machine learning, and model deployment. Its unified data science platform accelerates the building of complete analytical workflows: from data preparation to machine learning to model deployment, everything can be done in a single environment. This improves efficiency and shortens the time needed for data science projects.

Wolfram Alpha:
If you want to do something new with your data, this could be an ideal tool for you, as it can give you every minute detail of your data. This well-known tool was developed by Wolfram Alpha LLC, a subsidiary of Wolfram Research. It is well suited to advanced research in financial, historical, social, and other professional areas. For example, if you type Microsoft, you will receive a range of information including the input interpretation, fundamentals, financials, latest trade, price, performance comparisons, data return analysis, and much more relevant information.

Orange:
Orange is an open source data visualization and data analysis tool that can be used by both novices and experts in the field of data analytics. It provides a large toolbox with which you can create interactive workflows to analyse and visualize data. Orange comes packed with many different visualizations: from scatter plots, bar charts and trees to dendrograms, networks and heat maps, you can find them all in this tool.

NodeXL:
This is a data visualization and analysis software tool for relationships and networks, and it offers exact calculations to its users. It is a free and open-source network analysis and visualization tool with a wide range of applications. It is considered one of the best and most recent statistical tools for data analysis, offering advanced network metrics, automation, access to social media network data importers, and more.
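As an illustration of what a network metric is, the sketch below computes degree centrality (the share of other nodes each node is directly connected to) in plain Python. It does not use NodeXL; the function name `degree_centrality` and the sample edges are hypothetical, and the code assumes a simple undirected network with no repeated edges or self-loops.

```python
def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    node_count = len(degree)
    if node_count < 2:
        # With fewer than two nodes there is nobody else to connect to.
        return {node: 0.0 for node in degree}
    return {node: count / (node_count - 1) for node, count in degree.items()}


if __name__ == "__main__":
    sample_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
    for node, score in sorted(degree_centrality(sample_edges).items()):
        print(node, round(score, 2))
```

A score of 1.0 means the node is directly linked to every other node in the network, so higher values point to the best-connected participants.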

Storm:
Storm has made its name as one of the popular data analytics tools because of its superior real-time processing of streaming data. You can also integrate it with many other tools, such as Apache Slider, in order to manage and secure your data. An organization can use Storm in many cases, such as data monetization, cybersecurity analytics, threat detection, operational dashboards, and real-time customer management. All of these functions can support business growth and open up opportunities to improve your business.
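Processing a stream "in real time" means updating results as each event arrives rather than waiting for a complete batch. The following plain-Python sketch shows that idea with a sliding time window; it is not the Storm API, the name `rolling_counts` and the sample events are hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import Counter, deque


def rolling_counts(events, window_seconds):
    """Yield (timestamp, counts) after each event, counting only events inside the window."""
    window = deque()
    counts = Counter()
    for timestamp, key in events:
        window.append((timestamp, key))
        counts[key] += 1
        # Evict events that have fallen out of the time window.
        while window and window[0][0] <= timestamp - window_seconds:
            _, old_key = window.popleft()
            counts[old_key] -= 1
            if counts[old_key] == 0:
                del counts[old_key]
        yield timestamp, dict(counts)


if __name__ == "__main__":
    sample_events = [(0, "login"), (1, "error"), (2, "login"), (12, "error")]
    for timestamp, counts in rolling_counts(sample_events, window_seconds=10):
        print(timestamp, counts)
```

An operational dashboard or a threat detector could read these running counts as they are produced, for example to raise an alert when errors within the window exceed a threshold.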

We hope the list above gives you enough information about some of the best data analytics tools, which are likely to remain prominent in the coming years. If you want to establish your business firmly, build up your knowledge of these data analytics tools.
