
Foundation of Data Science
Harish Sharma
Asst. Professor, AIML, SCSE, MUJ
Introduction
• Data is a collection of facts in a raw or unorganized form, such as numbers or characters.
• Data Science is a multidisciplinary field that combines various techniques, methods, and tools to extract knowledge and insights from structured and unstructured data.
• It encompasses a wide range of activities, including data collection, data cleaning, data analysis, data visualization, and the creation of predictive models and algorithms.
Goal

• The primary goal of Data Science is to turn raw data into actionable information that can be used to make informed decisions, solve complex problems, and drive business or research outcomes.
• It involves the application of statistical
analysis, machine learning, and
computational techniques to gain
valuable insights and patterns from large
datasets.
Key components

• Data Collection: Gathering and sourcing relevant data from various sources, such as databases, websites, sensors, or APIs.
• Data Cleaning and Preprocessing: Preparing the data by handling missing values, removing noise, and transforming it into a consistent format suitable for analysis.
• Exploratory Data Analysis (EDA): Conducting initial data exploration to understand the distribution, relationships, and patterns in the data.
• Data Visualization: Creating visual
representations of data to help
understand trends, outliers, and
patterns, which aids in communication
and decision-making.
• Statistical Analysis: Applying statistical
techniques to infer meaningful insights
from the data and validate
hypotheses.
• Machine Learning: Utilizing algorithms
and models to build predictive and
descriptive models, such as regression,
classification, clustering, and
recommendation systems.
• Deep Learning: A subfield of machine
learning that focuses on using artificial
neural networks to handle complex tasks
like image recognition, natural language
processing, and speech recognition.
• Big Data: Managing and processing large-
scale datasets that traditional data
processing methods cannot handle
effectively.
• Data Ethics and Privacy: Ensuring that
data is handled responsibly, and individual
privacy is respected in the data-driven
processes.
Data Collection

• Data collection is the process of collecting, measuring and analyzing different types of information using a set of standard validated techniques.
• There are two main methods of data
collection:
• Primary Data Collection
• Secondary Data Collection
Primary Data
• Primary data refers to data collected first-hand, directly from the main source. It is data that has never been used in the past. Data gathered by primary data collection methods are generally regarded as the best kind of data in research.
• The methods of collecting primary data can be further divided into quantitative data collection methods (which deal with factors that can be counted) and qualitative data collection methods (which deal with factors that are not necessarily numerical in nature).
• Here are some of the most common primary data collection methods:
• Interviews
• Observations
• Surveys and Questionnaires
• Focus Groups
Secondary Data
• Secondary data refers to data that has already been collected by someone else. It is much less expensive and easier to collect than primary data.
• Here are some of the most common secondary data collection methods:
• Internet
• Government Archives
• Libraries
Structured/Unstructured Data
• There are several types of data within the world of big data. Here's a guide to structured and unstructured data.
• When it comes to data, files can come in many different forms. There are two main types of data: structured and unstructured.

Structured data vs. unstructured data
• Main characteristics
  Structured: searchable; usually text format; quantitative
  Unstructured: difficult to search; many data formats; qualitative
• Storage
  Structured: relational databases; data warehouses
  Unstructured: data lakes; non-relational databases; data warehouses; NoSQL databases
• Used for
  Structured: inventory control; CRM systems; ERP systems
  Unstructured: presentation or word processing software; tools for viewing or editing media
• Examples
  Structured: dates, phone numbers, bank account numbers, product SKUs
  Unstructured: emails, songs, videos, photos, reports, presentations
Elements of Structured Data
• There are two basic types of structured data: numeric and categorical.
• Numeric data comes in two forms: continuous, such as wind speed or time duration, and discrete, such as the count of the occurrence of an event.
• Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false.
• Another useful type of categorical data is ordinal data, in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
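The data types described above can be sketched in pandas roughly as follows. The column names and values are hypothetical and chosen only to mirror the examples on the slide; this is one possible encoding, not the only one.

```python
import pandas as pd

# Hypothetical sample records; names and values are illustrative only.
df = pd.DataFrame({
    "wind_speed": [3.2, 7.5, 5.1, 0.0],               # numeric, continuous
    "event_count": [0, 2, 1, 4],                      # numeric, discrete
    "screen_type": ["LCD", "LED", "plasma", "LED"],   # categorical
    "is_returned": [True, False, False, True],        # binary
    "rating": [5, 3, 4, 1],                           # ordinal
})

# Unordered categorical: a fixed set of labels with no ranking.
df["screen_type"] = df["screen_type"].astype("category")

# Ordered categorical: the categories carry a ranking (1 < 2 < ... < 5).
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

print(df.dtypes)
print(df["rating"].min(), df["rating"].max())
```

Marking the rating column as ordered lets pandas compare and sort its values by rank, which an unordered categorical does not allow.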
Rectangular vs. Non-rectangular Data
• Data present themselves in many forms, but at a basic level all data can be categorized into two structures: rectangular data and non-rectangular data.
• Rectangular data are shaped like a rectangle, where every value corresponds to some row and column. Most data frames store rectangular data.
• Non-rectangular data are not neatly arranged in rows and columns. Instead, they are often a collection of separate data structures where there is some similarity among members of the same data structure. Usually non-rectangular data are stored in lists.
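As a rough illustration of the distinction, the sketch below puts a small rectangular table next to a list of records that do not share the same fields. All names and values are made up for the example.

```python
import pandas as pd

# Rectangular: every value sits at one (row, column) position.
rect = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [31, 27]})

# Non-rectangular: a list of records whose members have different shapes.
non_rect = [
    {"name": "Asha", "phones": ["555-0100", "555-0101"]},
    {"name": "Ravi", "address": {"city": "Jaipur"}},
]

print(rect.shape)                               # (rows, columns)
print([sorted(record) for record in non_rect])  # field names differ per record
```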
Data Frames and Indexes
• Traditional database tables have one or more columns designated as an index, essentially a row number.
• In Python, with the pandas library, the basic rectangular data structure is a DataFrame object.
• By default, an automatic integer index is created for a DataFrame based on the order of the rows.
• In pandas, it is also possible to set multilevel/hierarchical indexes to improve the efficiency of certain operations.
• Example: creating a pandas Series with the default index and with an explicit index:
    import numpy as np
    import pandas as pd
    # General form: s = pd.Series(data, index=index)
    s = pd.Series(np.random.randn(5))  # default integer index 0..4
    s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])  # explicit labels
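The default index and a hierarchical index on a DataFrame might look like the sketch below. The table contents are hypothetical; `set_index` is the standard pandas call for promoting columns to an index.

```python
import pandas as pd

# Hypothetical data; names and values are illustrative only.
df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "year": [2020, 2021, 2020, 2021],
    "value": [1.5, 1.7, 0.4, 0.5],
})
print(df.index)  # automatic integer index based on row order

# Promote two columns to a hierarchical (multilevel) index.
indexed = df.set_index(["state", "year"])
print(indexed.index.nlevels)

# Look up one row by its (state, year) key.
print(indexed.loc[("Alaska", 2021), "value"])
```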
Nonrectangular Data Structures
• There are other data structures besides rectangular data.
• Time series data records successive measurements of the same variable. It is the raw material for statistical forecasting methods.
• Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures.
• Graph (or network) data structures are used to represent physical, social, and abstract relationships.
Estimates of Location
• Variables with measured or count data might have thousands of distinct values.
• A basic step in exploring your data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).
Summary of Measures
• Central Tendency: Mean, Median, Mode
• Quartile
• Variation: Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
• A measure of central tendency is a descriptive statistic that describes the average, or typical, value of a set of scores.
• There are three common measures of central tendency:
• the mean
• the median
• the mode
The Mean

The mean is:
• the arithmetic average of all the scores: $\sum X / N$
• the number, $m$, that makes $\sum (X - m)$ equal to 0
• the number, $m$, that makes $\sum (X - m)^2$ a minimum
The mean of a population is represented by the Greek letter $\mu$; the mean of a sample is represented by $\bar{X}$.
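The three characterizations of the mean can be checked numerically. The sketch below uses made-up scores and compares the sum of squared deviations at the mean with the same sum at two nearby values.

```python
scores = [2, 4, 4, 6, 9]  # illustrative scores, not from the slides
mean = sum(scores) / len(scores)

# Property 1: deviations from the mean sum to zero.
deviation_sum = sum(x - mean for x in scores)

# Property 2: the sum of squared deviations is smallest at the mean.
def sum_sq(m):
    return sum((x - m) ** 2 for x in scores)

print(mean, deviation_sum)
print(sum_sq(mean), sum_sq(mean - 0.5), sum_sq(mean + 0.5))
```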
Calculating the Mean for Grouped Data

$$\bar{X} = \frac{\sum fX}{N}$$

where: $fX$ = a score multiplied by its frequency, and $N$ = the total number of scores (the sum of the frequencies)

Mean affected by extreme values
[Figure: two dot plots on number lines. The first, on a 0 to 10 axis, has Mean = 5; the second, on an axis extending to 14, has Mean = 6.]
When To Use the Mean
• You should use the mean when
  • the data are interval or ratio scaled (many people will use the mean with ordinally scaled data too)
  • and the data are not skewed
• The mean is preferred because it is sensitive to every score
  • If you change one score in the data set, the mean will change
Calculating the Mean
• Calculate the mean of the following data:
1 5 4 3 2
• Sum the scores ($\sum X$):
1 + 5 + 4 + 3 + 2 = 15
• Divide the sum ($\sum X = 15$) by the number of scores ($N = 5$):
15 / 5 = 3
• Mean = $\bar{X}$ = 3
Calculating the Mean for Grouped Data
• Find the mean of the following data:

SCORE    NUMBER OF STUDENTS
10       3
9        10
8        9
7        8
6        10
5        2

• Mean = [3(10) + 10(9) + 9(8) + 8(7) + 10(6) + 2(5)] / 42 = 318 / 42 = 7.57
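The grouped-mean calculation for this score table can be reproduced with NumPy. The sketch multiplies each score by its frequency, as in the formula, and then cross-checks the result with `np.average` and its `weights` argument.

```python
import numpy as np

scores = np.array([10, 9, 8, 7, 6, 5])
freq = np.array([3, 10, 9, 8, 10, 2])   # number of students per score

n = freq.sum()                           # N = total number of scores
grouped_mean = (freq * scores).sum() / n

print(n, round(grouped_mean, 2))
print(np.average(scores, weights=freq))  # same quantity as a weighted average
```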
The Median

• The median is simply another name for the 50th percentile
• It is the score in the middle; half of the scores are larger than the median and half of the scores are smaller than the median
• Not affected by extreme values
[Figure: two dot plots on number lines. The first, on a 0 to 10 axis, has Median = 5; the second, on an axis extending to 14, also has Median = 5.]
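A small sketch of this robustness: the two data sets below are made up, chosen so that replacing the largest value with a more extreme one moves the mean from 5 to 6 while the median stays at 5.

```python
import statistics

base = [1, 3, 5, 7, 9]            # illustrative values
with_extreme = [1, 3, 5, 7, 14]   # same data with one extreme value

for label, data in [("base", base), ("with extreme value", with_extreme)]:
    print(label, statistics.mean(data), statistics.median(data))
```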
How To Calculate the Median
• Conceptually, it is easy to calculate the median
  • Many minor problems can occur in practice; it is best to let a computer do it
• Sort the data from highest to lowest
• Find the score in the middle
  • middle = (n + 1) / 2
  • If n, the number of scores, is even, the median is the average of the middle two scores
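The steps above can be sketched as a short function. The name `median_of` and the sample inputs are hypothetical; in practice the standard library's `statistics.median` does the same job.

```python
def median_of(scores):
    """Median via the sort-and-pick-the-middle procedure."""
    if not scores:
        raise ValueError("median of empty data")
    ordered = sorted(scores, reverse=True)   # highest to lowest
    n = len(ordered)
    middle = (n + 1) / 2                     # 1-based position of the middle
    if n % 2 == 1:
        return ordered[int(middle) - 1]      # odd n: the single middle score
    lower = int(middle) - 1                  # 0-based index of the first middle score
    return (ordered[lower] + ordered[lower + 1]) / 2  # even n: average the two

print(median_of([7, 1, 5]))
print(median_of([4, 1, 3, 2]))
```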
Calculating the Median for Grouped Data

$$\text{Median} = l + \frac{N/2 - cf}{f} \times h$$

• To use this formula, first determine the median class. The median class is the class whose less-than-type cumulative frequency is just more than $N/2$.
• $l$ = lower limit of the median class
• $cf$ = less-than-type cumulative frequency of the class preceding the median class
• $f$ = frequency of the median class
• $h$ = class width
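A minimal sketch of the grouped-median formula, assuming equal-width classes given by their lower limits. The function name `grouped_median` and the frequency table are made up for illustration.

```python
import numpy as np

def grouped_median(lower_limits, freqs, width):
    """Median = l + ((N/2 - cf) / f) * h for equal-width classes."""
    freqs = np.asarray(freqs, dtype=float)
    cum = np.cumsum(freqs)                   # less-than-type cumulative frequencies
    half = freqs.sum() / 2                   # N / 2
    # Median class: first class whose cumulative frequency exceeds N/2.
    k = int(np.searchsorted(cum, half, side="right"))
    cf = cum[k - 1] if k > 0 else 0.0        # cumulative frequency before that class
    return lower_limits[k] + (half - cf) / freqs[k] * width

# Illustrative table: classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower = [0, 10, 20, 30, 40]
freq = [5, 8, 12, 10, 5]
print(round(grouped_median(lower, freq, 10), 2))
```

Here N/2 = 20 and the cumulative frequencies are 5, 13, 25, 35, 40, so the median class is 20-30 with l = 20, cf = 13, f = 12 and h = 10.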
When To Use the Median
• The median is often used when the distribution of scores is either positively or negatively skewed
  • The few really large scores (positively skewed) or really small scores (negatively skewed) will not overly influence the median
Median Example
• What is the median of the following scores:
10 8 14 15 7 3 3 8 12 10 9
• Sort the scores:
15 14 12 10 10 9 8 8 7 3 3
• Determine the middle score:
middle = (n + 1) / 2 = (11 + 1) / 2 = 6
• Middle score = 6th score = median = 9
Median Example
• What is the median of the following scores:
24 18 19 42 16 12
• Sort the scores:
42 24 19 18 16 12
• Determine the middle score:
middle = (n + 1) / 2 = (6 + 1) / 2 = 3.5
• Median = average of 3rd and 4th scores:
(19 + 18) / 2 = 18.5
The Mode
• The mode is the score that occurs most frequently in a set of data
• Not affected by extreme values
• There may not be a mode
• There may be several modes
• Used for either numerical or categorical data
[Figure: two dot plots on number lines. The first, on a 0 to 14 axis, has Mode = 9; the second, on a 0 to 6 axis, has no mode.]
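The three cases (one mode, several modes, no mode) can be sketched with a frequency count. The helper name `modes` and the sample data are hypothetical, and "no mode" is taken here to mean that all distinct values occur equally often, which is one common convention.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); [] when there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    # Assumed convention: if all distinct values tie, report no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print(modes([1, 3, 5, 7, 9, 9, 11]))                  # one mode
print(modes([1, 2, 3, 4]))                            # no mode
print(modes(["LCD", "LED", "LED", "LCD", "plasma"]))  # several modes, categorical
```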
Calculating the Mode for Grouped Data

$$\text{Mode} = l + \left( \frac{f_m - f_1}{2 f_m - f_1 - f_2} \right) \times h$$

• To use this formula, first determine the modal class. The modal class is the class that has the maximum frequency.
• $l$ = lower limit of the modal class
• $f_m$ = maximum frequency
• $f_1$ = frequency of the class preceding the modal class
• $f_2$ = frequency of the class following the modal class
• $h$ = class width
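A minimal sketch of the grouped-mode formula for equal-width classes. The function name `grouped_mode` and the frequency table are made up, and a missing neighbouring class (when the modal class is first or last) is assumed to have frequency 0, which the slide does not specify.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = l + ((fm - f1) / (2*fm - f1 - f2)) * h for equal-width classes."""
    # Modal class: the class with maximum frequency (first one if tied).
    k = max(range(len(freqs)), key=lambda i: freqs[i])
    fm = freqs[k]
    f1 = freqs[k - 1] if k > 0 else 0               # assumed 0 when absent
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0  # assumed 0 when absent
    # Note: the denominator is zero when fm == f1 == f2; not handled here.
    return lower_limits[k] + (fm - f1) / (2 * fm - f1 - f2) * width

# Illustrative table: classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower = [0, 10, 20, 30, 40]
freq = [5, 8, 12, 10, 5]
print(round(grouped_mode(lower, freq, 10), 2))
```

In this table the modal class is 20-30, with fm = 12, f1 = 8, f2 = 10 and h = 10.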
When To Use the Mode
• The mode is not a very useful measure of central tendency
  • It is insensitive to large changes in the data set
  • That is, two data sets that are very different from each other can have the same mode
• The mode is primarily used with nominally scaled data
  • It is the only measure of central tendency that is appropriate for nominally scaled data
Calculate Mean, Median & Mode

Problem 1: Wages (in Rs) paid to workers of an organization are given below. Calculate Mean, Median and Mode.

Wages (C.I.)             40-60   60-80   80-100   100-120   120-140   140-160
No. of workers (freq)    50      80      30       20        50        20

Problem 2: Weekly demand for marine fish (in kg) (x) for 100 families is given below. Calculate Mean, Median and Mode.

X                        1    2    3    4   5   Total
No. of families (freq)   20   50   20   5   5   100
Relation Between
Mean, Median & Mode
• In symmetrical distributions, the median and mean are equal
  • For normal distributions, mean = median = mode
• In positively skewed distributions, the mean is greater than the median
• In negatively skewed distributions, the mean is smaller than the median
Variance
• Important measure of variation
• Shows variation about the mean
• For the population:

$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$

• For the sample:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

• For the population, use $N$ in the denominator; for the sample, use $n - 1$ in the denominator.
Standard Deviation
• Important measure of variation
• Shows variation about the mean
• For the population:

$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$

• For the sample:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
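The population and sample versions differ only in the denominator. The sketch below computes both from the sum of squared deviations on made-up data and cross-checks them against Python's `statistics` module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations

pop_var = ss / n                  # population variance: divide by N
sample_var = ss / (n - 1)         # sample variance: divide by n - 1
pop_sd = math.sqrt(pop_var)       # standard deviation = square root of variance
sample_sd = math.sqrt(sample_var)

print(pop_var, pop_sd)
print(round(sample_var, 3), round(sample_sd, 3))

# Cross-check against the standard library.
print(statistics.pvariance(data), statistics.variance(data))
```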
Coefficient of Variation
• Measure of relative variation
• Always expressed as a percentage
• Shows variation relative to the mean
• Used to compare 2 or more groups
• Formula (for sample):

$$CV = \left( \frac{SD}{\bar{X}} \right) \times 100\%$$
Comparing Coefficient of Variation

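• Stock A: average price last year = $50, standard deviation = $5
• Stock B: average price last year = $100, standard deviation = $5

Coefficient of Variation:
• Stock A: CV = (5 / 50) × 100% = 10%
• Stock B: CV = (5 / 100) × 100% = 5%
Both stocks have the same standard deviation, but Stock A varies more relative to its average price.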
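The stock comparison can be reproduced in a few lines. The function name `coefficient_of_variation` is a hypothetical helper; it applies the CV formula directly to the stated averages and standard deviations.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: (SD / mean) * 100."""
    return sd / mean * 100

cv_a = coefficient_of_variation(sd=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(sd=5, mean=100)   # Stock B

print(cv_a, cv_b)
print("Stock A is relatively more variable" if cv_a > cv_b else "Stock B is relatively more variable")
```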
Shape of Curve
• Describes how data are distributed
• Measures of shape: symmetric or skewed
  • Left-skewed: Mean < Median < Mode
  • Symmetric: Mean = Median = Mode
  • Right-skewed: Mode < Median < Mean
Find the Variance, SD & CV
• 1. Five test scores for Calculus I: 95, 83, 92, 81, 75
• 2. The retirement age of 11 people, in whole years: 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• 3. A set of 10-point quizzes from MAT117: 9, 6, 7, 10, 9, 4, 9, 2, 9, 10, 7, 7, 5, 6, 7
• 4. 11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165
Find the Variance, SD & CV

Class Interval        Frequency
2 to less than 4      3
4 to less than 6      18
6 to less than 8      9
8 to less than 10     7
Find the Mean, Median, Mode, Variance, SD & CV
• Example A: 3, 10, 8, 8, 7, 8, 10, 3, 3, 3
• Example B: 2, 5, 1, 5, 1, 2
• Example C: 5, 7, 9, 1, 7, 5, 0, 4
• Exam marks for 60 students (marked out of 65)
• mean = 30.3, sd = 14.46

Group Frequency Table

Marks                  Frequency   Percent
0 but less than 10     4           6.7
10 but less than 20    9           15.0
20 but less than 30    17          28.3
30 but less than 40    15          25.0
40 but less than 50    9           15.0
50 but less than 60    5           8.3
60 or over             1           1.7
Total                  60          100.0
