Professional Documents
Culture Documents
L1 DS Overview en
L1 DS Overview en
L1 DS Overview en
Data Science
(IT4142E)
Contents
q Lecture 1: Overview of Data Science
q Lecture 2: Data crawling and preprocessing
q Lecture 3: Data cleaning and integration
q Lecture 4: Exploratory data analysis
q Lecture 5: Data visualization
q Lecture 6: Multivariate data visualization
q Lecture 7: Machine learning
q Lecture 8: Big data analysis
q Lecture 9: Capstone Project guidance
q Lecture 10+11: Text, image, graph analysis
q Lecture 12: Evaluation of analysis results
Introduction
Goals of data science
4
Some questions
Data science is
the science of learning from data.
Methodology: insight-driven
8
Methodology: product-driven
Business Analytic
understanding approach
Data
Feedback
requirements
Data
Deployment
collection
Data
Evaluation
understanding
Data
Modeling
preparation
(http://www.theta.co.nz/)
• Application
11
12
Where is the data? Mobile messages
13
• In the US:
• https://www.internetlivestats.com
14
Where is the data? And more
15
Introduction
What can we do with the data?
16
What can we do with the data?
Data description through visualization
17
Data description
18
What can we do with the data?
Customer segmentation
19
Data segmentation
20
What can we do with the data?
Amazon’s recommendation (association)
21
Association rules
22
What can we do with the data?
FIFA predictions (2014)
Prediction
24
What can we do with the data?
Much more!!!
to produce:
Big data
What is it?
26
Big data – in 2008
27
28
Big data – today
29
30
Big data – today: sources
[Source: TDWI]
31
Big data
Challenges
32
The 10 Vs of Big data
[Source: houseofbots.com]
33
34
The 10 Vs of Big data: Velocity
35
36
The 10 Vs of Big data: Validity
• When there is so much data, it of course poses the question of data validity
• And hence, one has to check the quality of the data
• Check its coherence with other sources of data
• Remove outliers
• This is pre-processing, led before integrating it for data analysis
37
• Venue in big data refers to the multiplicity of data sources (e.g. Excel files,
OLTP databases, …)
• Hence the need for data integration (see Chapter 3)
38
The 10 Vs of Big data: Variability
39
contributed articles
The 10 Vs of Big data: Variety
From an engineering perspective, not discovery of patterns in massive is usually
scale matters in that it renders the tra- swaths of data when users lack a well- actionabl
ditional database models somewhat formulated query. Unlike database expected
inadequate for knowledge discovery. querying, which asks “What data sat- What m
• Variety refers to the different kinds
Traditional of data
database onearehasisfies
methods to handle:
this pattern (query)?” discovery Other tha
• Structured data: fromnot OLTP
suited for datasets
knowledge of Excel
discovery files
asks forpatterns
“What instance
satisfy this data?”
because they are optimized for fast ac- Specifically, our concern is finding
it is its p
distributi
• Unstructured datacess increases extremely
and summarization fast: texts,
of data, given images,
interesting and robusttags,
patterns that can be re
what the user wants to ask, or a query, satisfy the data, where “interesting” data and
links, likes, emotions, … high degr
Figure 1. Projected growth of unstructured and structured data. The e
particula
Total archived capacity, by content type, worldwide, 2008–2015 (petabytes) learning
ery in da
(Vasant Dhar, CACM, 2013)
350,000
nities. U
300,000
predictiv
250,000 with skep
200,000
ing the v
century A
150,000
Karl Pop
100,000 for evalu
scientific
50,000
Popper a
0 sought o
2008 2009 2010 2011 2012 2013 2014 2015
enon wer
π Unstructured 11,430 16,737 25,127 39,237 59,600 92,536 147,885 226,716 made “bo
π Database 1,952 2,782 4,065 6,179 9,140 13,824 21,532 32,188 the test o
π Email 1,652 2,552 4,025 6,575 10,411 16,796 27,817 4044,091 ily falsifi
seriously
treatise
tures and
acterized
The 10 Vs of Big data: Vocabulary
41
42
The 10 Vs of Big data: Veracity
43
46
Data Science - early days
Howard
Dresner
I keep saying the sexy job in the next ten years will be
statisticians. People think I’m joking, but who would’ve
guessed that computer engineers would’ve been the sexy
job of the 1990s?
48
Data scientist - nowadays
49
Skillset
(source: http://datasciencedojo.com/)
50
Roles / talents of a data scientist
Storyteller
Data
Scientist
Analyst Engineer
51
Further reading
52
References
• John Dickerson. Lectures on Introduction to Data Science. University of Maryland, 2017.
• Longbing Cao. Data science: a comprehensive overview. ACM Computing Surveys
(CSUR), 50(3), 43, 2017.
• Longbing Cao. Data science: nature and pitfalls. IEEE Intelligent Systems, 31(5), 66-75,
2016.
• David Donoho. "50 years of Data Science." In Princeton NJ, Tukey Centennial Workshop.
2015.
• L. Duan, Y. Xiong. Big data analytics and business analytics. Journal of Management
Analytics, vol 2 (2), pp 1-21, 2015.
• X. Wu, X. Zhu, G. Wu, W. Ding. Data mining with Big Data. IEEE Transactions on
Knowledge and Data Engineering, vol 26 (1), pp 97-107, 2014.
• Rafael Irizarry & Verena Kaynig-Fittau. Lectures on Data Science. Harvard Univ., 2014.
• John Canny. Lectures on Introduction to Data Science. University of California, Berkeley,
2014.
• Vasant Dhar. Data Science and Prediction. Communication of the ACM, vol 56 (12), pp
64-73, 2013.
• Michael Perrone. What is Watson – an overview. 2011.
53
Thank you
for your
attentions!