
CHAPTER ONE

Introduction to Data Science

By
Irandufa Indebu (MSc)

irandufa.indebu@haramaya.edu.et

Outline
• Definition of data science

• Brief history of data science

• Emergence of data science

• Application of data science

• What is not data science

Definition of data science (DS)
• Before defining data science, it is better to start from the definition of data itself.
• It is sometimes difficult to differentiate between data, information, and knowledge, because data for one person may be information for another.
• Let us discuss them in terms of the DIKW (data, information, knowledge, wisdom) paradigm.

Definition of data science (DS)
Data-to-information-to-knowledge-to-intelligence-to-wisdom
cognitive progression

Definition of data science (DS)

• Data is a collection of facts and figures.
• It represents discrete or objective facts and signals.
• Data is the level of conceptualization and sits at the lowest level of cognitive systems.
• It has a value, but no meaning is attached to it. E.g.:
  • What does “Alex” mean?
  • What does “8 years old” mean?
  • What does “young” mean?

Details of data and data sets are given in chapter 2.

Definition of data science (DS)
• Information is processed data.
• It represents a description of relevant data (objects) in an organized and structured way, for a certain purpose, or having a certain meaning.
• It is the level of contextualization.
• It answers simple questions such as what, when, where, and who.
• E.g. “Alex is 8 years old” is information.
• “Alex is young” is another piece of information.

Definition of data science (DS)
• Knowledge represents processed information in the form of an information mixture, procedural actions, or propositional rules.
• It is the level of patterning (creating relationships between different pieces of information), which helps in decision making.
• It helps to answer “how” questions.
• E.g. “Tea and medicine are supposed to be taken at the same time” is knowledge that helps in decision making in the medical domain.

Definition of data science (DS)

• Intelligence represents the ability to inform, think, reason, infer, or process information and knowledge.
• Intelligence can be high or low, hierarchical, general or specific.
• E.g. “Chala is probably in Year 3” (a reasoning outcome based on the fact that he is 8 years old and a child of that age is usually at school).

Definition of data science (DS)

• Wisdom is about principles that go beyond knowledge.
• It means thinking rationally (understanding the effect of one knowledge-driven decision over another).
• Does what we are doing affect others or not? …
• It means determining the right thing to do at the right time for the right purpose.
• It is the intelligence to know best how to act (or why to act) on the basis of validated ability or understanding.
• It is the level of understanding and helps to answer “why” questions.

Definition of data science (DS)

• With these concepts in mind, let us now define data science in the next section.

Definition of data science (DS)
1. High-Level Definition
• Data science is the science of data, or the study of data.

2. Trans-disciplinary Definition
Data science = { statistics ∩ information ∩ computing ∩ communication ∩ sociology ∩ management ∩ data ∩ domain ∩ thinking }

Definition of data science (DS)
• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• It is the systematic study of raw data in order to make insightful observations.
• From those observations one can take relevant actions to achieve a goal.
• Data acquisition, data cleaning, feature engineering, modelling, and visualization are some of the major parts of this field.
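
The stages named on this slide (acquisition, cleaning, feature engineering, modelling, visualization) can be sketched as a tiny pipeline. This is only an illustrative sketch using the Python standard library: the records, the field names `minutes` and `score`, and the choice of a least-squares line as the model are all assumptions made for the example, not part of the course material.

```python
from statistics import mean


def acquire():
    # Data acquisition: hypothetical raw records (study time in minutes, exam score).
    return [
        {"minutes": "120", "score": "52"},
        {"minutes": "240", "score": "61"},
        {"minutes": "", "score": "70"},  # incomplete record
        {"minutes": "360", "score": "73"},
        {"minutes": "480", "score": "84"},
    ]


def clean(rows):
    # Data cleaning: drop incomplete records and convert text to numbers.
    return [
        {"minutes": float(r["minutes"]), "score": float(r["score"])}
        for r in rows
        if r["minutes"] and r["score"]
    ]


def engineer(rows):
    # Feature engineering: derive an "hours" feature from the raw minutes.
    return [{**r, "hours": r["minutes"] / 60} for r in rows]


def fit_line(rows):
    # Modelling: least-squares line, score = slope * hours + intercept.
    xs = [r["hours"] for r in rows]
    ys = [r["score"] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def visualize(rows):
    # Visualization: a plain-text bar chart, one '#' per 10 score points.
    for r in rows:
        print(f"{r['hours']:4.1f} h | {'#' * int(r['score'] // 10)}")


rows = engineer(clean(acquire()))
slope, intercept = fit_line(rows)
visualize(rows)
print(f"score ~ {slope:.2f} * hours + {intercept:.2f}")
```

In practice each stage would use dedicated tools (databases, data-frame libraries, plotting packages); the point here is only the order in which the stages feed each other.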

Definition of data science (DS)
• From the DIKIW-processing perspective, data science is a systematic approach to “thinking with wisdom”, “understanding the domain”, “managing data”, “computing with data”, “discovering knowledge”, “communicating with stakeholders”, “acting on insights”, and “delivering products”.

Definition of data science (DS)

• As an academic discipline and profession, data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals.
• Today, successful data professionals understand that they must advance past the traditional skills of analyzing large amounts of data, data mining, and programming.
• The data science profession has been described as the sexiest job of the 21st century, and it is expected to stay in demand over the next 10 years.

Historical Background of DS

• Its history dates back to the 1990s.
• Two historical threads contribute to the history of data science:

1. History of data collection
2. History of data analysis

1. History of data collection

• The development of writing contributed to the recording of experiences and events in our world.
• This increased the amount of data collected.
• Early forms of writing and record keeping started around 3200 BC in Mesopotamia.
• This record keeping captured transactional data.

Cont’d…
• Transactional data include event information such as:
  • The sale of an item,
  • The issuing of an invoice,
  • The delivery of goods,
  • Credit card payments,
  • Insurance claims, etc.
• Non-transactional data, such as demographic data, also have a long history.
• The development of computers, digitization, and the use of electronic sensors have contributed greatly to the development of data collection and storage.

Cont’d…
• In 1970, Edgar F. Codd introduced the relational data model for collecting and storing data.
• Codd’s paper laid the foundation for SQL.
• SQL (Structured Query Language) is the international standard for defining database queries.
• Relational databases store data in tables with a structure of one row per instance and one column per attribute.
• Databases are the natural technology to use for storing and retrieving structured transactional or operational data.
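
The "one row per instance, one column per attribute" structure can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch: the `sale` table, its columns, and the sample rows are hypothetical and chosen only to resemble the transactional data discussed earlier.

```python
import sqlite3

# An in-memory database, so the example leaves no file behind.
conn = sqlite3.connect(":memory:")

# One column per attribute of a sale (hypothetical schema).
conn.execute(
    """
    CREATE TABLE sale (
        sale_id    INTEGER PRIMARY KEY,
        item       TEXT    NOT NULL,
        quantity   INTEGER NOT NULL,
        unit_price REAL    NOT NULL
    )
    """
)

# One row per instance: each tuple is a single sale event.
conn.executemany(
    "INSERT INTO sale (item, quantity, unit_price) VALUES (?, ?, ?)",
    [("notebook", 3, 2.5), ("pen", 10, 0.5), ("notebook", 1, 2.5)],
)

# A declarative SQL query: total revenue per item.
revenue = conn.execute(
    "SELECT item, SUM(quantity * unit_price) FROM sale "
    "GROUP BY item ORDER BY item"
).fetchall()

for item, total in revenue:
    print(item, total)

conn.close()
```

The query states what result is wanted (revenue grouped by item) and leaves how to compute it to the database engine, which is the main idea behind SQL.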

Cont’d…
• However, as companies have become larger and more automated, the amount and variety of data generated have dramatically increased.
• Analyzing these data and applying them in decision making became another headache for these companies.
• Another problem was that the data were often stored in numerous separate databases within the organization.
• Only simple operations such as SELECT, INSERT, UPDATE, and DELETE were used.
• These challenges led to the development of data warehouses.

Cont’d…
• In a data warehouse, data are taken from across the organization and integrated, thereby providing a more comprehensive data set for analysis.
• The challenge is not only that the amount of data collected has grown dramatically, but also that the variety of data has grown.
• Emails, blogs, photos, tweets, likes, shares, web searches, video uploads, online purchases, and podcasts are a few of the data sources.
• If we look at the metadata of these events, we can begin to understand the meaning of the term big data.

Cont’d…
• Big data are often defined in terms of three Vs:
  • Volume: the extreme volume of data
  • Variety: the variety of the data types
  • Velocity: the speed at which the data must be processed
• The existence of big data led to the development of new data-processing frameworks.
• Why?
• Because these data cannot be processed with an ordinary database management system.

We will continue … Big data is covered in detail in chapter 3.

2. History of data analysis
• When we talk about data analysis, the field of statistics always comes to mind.
• Statistics is the branch of science that deals with the collection and analysis of data.
• The simplest form of statistical analysis is the summarization of a data set.
• This is done in terms of summary (descriptive) statistics, including measures of central tendency, such as the arithmetic mean, and measures of variation, such as the range.
• In the 19th century, statisticians began to use probability distributions as analysis tools.
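
The summary statistics mentioned on this slide can be computed with Python's standard `statistics` module. The sketch below uses a small made-up sample (`ages`); the sample variance is included as one more common measure of variation alongside the range.

```python
import statistics

ages = [8, 12, 9, 15, 11]  # hypothetical sample, for illustration only

summary = {
    # Measures of central tendency
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    # Measures of variation
    "range": max(ages) - min(ages),
    "sample_variance": statistics.variance(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Each number compresses the whole data set into a single value, which is why summarization is described as the simplest form of statistical analysis.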

Cont’d…
• These enabled statisticians to move beyond descriptive statistics and to start doing statistical learning.
• Between 1780 and 1820, William Playfair invented statistical graphics and laid the foundations for modern data visualization and exploratory data analysis.
• He invented the line chart and area chart for time-series data, and the pie chart to illustrate proportions within a set.
• Admittedly, it is difficult to visualize large or complex (many attributes) data sets, but data visualization is still an important part of data science.

Cont’d…
• Data visualization helps data scientists explore and understand the data they are working with.
• It can also be useful for communicating the results of a data science project.
• In 1943, Warren McCulloch and Walter Pitts proposed the first mathematical model of a neural network.
• In 1948, Claude Shannon published “A Mathematical Theory of Communication” and by doing so founded information theory.
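
To give a feel for the McCulloch and Pitts model mentioned on this slide, here is a simplified threshold unit in its spirit: binary inputs, a fixed threshold, and inhibitory inputs that block firing. The function name `mp_neuron` and the gate constructions are illustrative assumptions, not a reproduction of the 1943 paper.

```python
def mp_neuron(excitatory, inhibitory, threshold):
    """Simplified McCulloch-Pitts style unit with binary inputs and output.

    The unit outputs 1 when no inhibitory input is active and the number
    of active excitatory inputs reaches the threshold; otherwise 0.
    """
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0


def logical_and(a, b):
    # Both excitatory inputs must be active.
    return mp_neuron([a, b], [], threshold=2)


def logical_or(a, b):
    # One active excitatory input is enough.
    return mp_neuron([a, b], [], threshold=1)


def logical_not(a):
    # A constant excitatory input that is vetoed by the inhibitory input.
    return mp_neuron([1], [a], threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
print("NOT 0:", logical_not(0), "NOT 1:", logical_not(1))
```

Unlike modern neural networks, nothing is learned here: the threshold is set by hand, and the unit only shows that simple logic can be expressed with neuron-like elements.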

Cont’d…
• In 1951, Evelyn Fix and Joseph Hodges proposed a model for discriminatory analysis (a classification or pattern-recognition problem) that became the basis for modern nearest-neighbor models.
• These postwar developments culminated in 1956 in the establishment of the field of artificial intelligence at a workshop at Dartmouth College.
• In the mid-1960s, three important contributions to machine learning (ML) were made.
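
The nearest-neighbor idea mentioned on this slide can be sketched in a few lines: a new point receives the label of the closest labelled point. This is a minimal 1-nearest-neighbor sketch with Euclidean distance, not Fix and Hodges' original formulation; the training points and the labels "A" and "B" are made up for the example.

```python
import math


def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query`.

    `train` is a list of (point, label) pairs, where each point is a
    tuple of numbers with the same length as `query`.
    """
    _, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label


# Hypothetical labelled data: two small clusters in the plane.
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

print(nearest_neighbor(train, (1.2, 1.4)))
print(nearest_neighbor(train, (5.5, 5.0)))
```

There is no training step: the "model" is the stored data itself, and all the work happens when a new point is classified.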

Emergence and Evolution of Data Science

• DS was introduced in the 1990s.
• The aim was to encourage statisticians to work with computer scientists on the computational analysis of large data sets.
• In 1997, C. F. Jeff Wu’s public lecture “Statistics = Data Science?” highlighted a number of promising trends for statistics, including the availability of large data sets and the growing use of computational algorithms and models.
• He suggested that statistics should be renamed “data science” and that statisticians should be known as “data scientists”.

Cont’d…
• In 2001, William S. Cleveland published an action plan
for creating a university department in the
field of data science.
• In 2015, a statement about the role of statistics in data
science was released by a number of ASA leaders, saying
that “statistics and machine learning play a central role
in data science.”
• In recent years, data science has been elaborated beyond statistics. Why?
• Because statistics alone cannot own data science: its broader capability requirements, such as computational issues, go beyond statistics.

Cont’d…
• A multidisciplinary view has thus been increasingly
accepted:

Application of Data Science

• Fraud and Risk Detection: the earliest applications of data science were in finance.
• Companies were fed up with bad debts and losses every year. However, they had a lot of data that had been collected during the initial paperwork when sanctioning loans.
• They decided to bring in data scientists to rescue them from these losses.

Application of Data Science
Healthcare: The healthcare sector, in particular, benefits greatly from data science applications.
Medical Image Analysis
Procedures such as detecting tumors, artery stenosis, and organ delineation employ various methods and frameworks, like MapReduce, to find optimal parameters for tasks like lung texture classification.
These procedures apply machine learning methods, support vector machines (SVM), content-based medical image indexing, and wavelet analysis for solid texture classification.

Application of Data Science

Transport
In the transportation sector, data science is actively making its mark by creating safer driving environments for drivers.
It is also playing a key role in optimizing vehicle performance and adding greater autonomy for drivers.
The role of data science has grown manifold with the introduction of self-driving cars.

Application of Data Science
Healthcare
In the healthcare industry, data science is making great leaps.
The areas of healthcare making use of data science include:
• Medical Image Analysis
• Genetics and Genomics
• Drug Discovery
• Predictive Modeling for Diagnosis
• Health bots or virtual assistants

Application of Data Science
DATA SCIENCE APPLICATIONS AND EXAMPLES

• Health: Identifying and predicting disease
• Health: Personalized healthcare recommendations
• Sport: Getting the most value out of soccer rosters
• Sport: Finding the next slew of world-class athletes
• Finance: Stamping out tax fraud
• E-Commerce: Identifying a potential customer base
• Banking: Detecting fraud involving credit cards, insurance, accounting, etc.

END OF CHAPTER ONE

Thanks
