
DATA ANALYSIS INDUSTRY

TRAINING PROGRAM

Elnaz Gholipour
2022 - 2023

▪ Introduction to Data Analysis, Data Cleaning & Transformation, Types of Data

▪ Different Statistics for Analysis: Descriptive, Inferential

▪ Practical part 1 (SPSS)

▪ Practical part 2 (SPSS)

▪ Normal Distribution

▪ Practical part 3 (SPSS)

▪ When data is not normally distributed, Interpretation of the normality result & Z-Score

▪ Practical part 4

▪ Visual Representation of data, Conclusion of normality test

▪ Sampling


Data analytics is all about helping organizations make decisions based on data. Page visits can inform marketing strategies, housing costs can affect policy changes, and patient outcomes can impact a hospital’s operations. Data analytics helps us find patterns and tell stories from the large amounts of data that organizations collect. To do that, data analysts take a business question and translate it into a data question. Part of their job is collecting and reformatting data, analyzing it with statistics and probability, and sharing actionable insights in the form of visuals and reports. “Every company is collecting some data. And a lot of companies need to leverage their data to make good data-driven decisions. There’s a huge opportunity for data analysts to really put that data to work.”

Data science is a broad field that includes data analytics. It also covers making predictions with machine learning, working with big data, and developing artificial intelligence. Data scientists create algorithms to automate data processes, recognize patterns in new information, and make recommendations based on past behavior. They work on things like forecasting the financial future, creating customer-facing chatbots, detecting tumors in X-ray images, and making suggestions of things you might like. “Data science tends to be more specialized than data analytics because not every company needs to make predictive data decisions, and not every company needs to leverage big data.”


[Figure: Global job statistics and income of data analysts (DA). Source: OECD Data, World Bank]



Data Cleaning refers to removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data from a dataset. The possibility of duplicating or mislabeling data increases when multiple data sources are combined. Incorrect data can lead to unreliable outcomes and algorithms, even if they seem to be correct.

The Data Transformation process extracts data from a source, converts it into a usable format, and delivers it to a destination. Extraction involves gathering data from different locations and sources and integrating them into one place. The entire process is known as ETL (Extract, Transform, Load).

Steps of data cleaning (Excel file: teacher1):

Step 1: Remove irrelevant data.
Step 2: Deduplicate your data.
Step 3: Fix structural errors. For example, you may find
“N/A” and “Not Applicable”, but they should be in the
same category.
Step 4: Deal with missing data.
Step 5: Filter out data outliers.
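
The five steps above can be sketched in plain Python. This is a minimal illustration, not the course's Excel workflow: the rows, the column names (`id`, `age`, `status`, `note`), the choice of the median to fill gaps, and the plausible age range of 0 to 120 are all assumptions made for the example.

```python
from statistics import median

# Hypothetical rows; column names and values are made up for illustration.
raw_rows = [
    {"id": 1, "age": 34, "status": "N/A", "note": "x"},
    {"id": 2, "age": 29, "status": "Not Applicable", "note": "y"},
    {"id": 2, "age": 29, "status": "Not Applicable", "note": "y"},  # duplicate
    {"id": 3, "age": None, "status": "Employed", "note": "z"},      # missing age
    {"id": 4, "age": 31, "status": "Employed", "note": "w"},
    {"id": 5, "age": 420, "status": "Employed", "note": "v"},       # implausible age
]


def clean(rows, min_age=0, max_age=120):
    # Step 1: remove irrelevant data (here, the free-text "note" column).
    rows = [{k: v for k, v in r.items() if k != "note"} for r in rows]

    # Step 2: deduplicate, keeping the first occurrence of each row.
    seen, unique = set(), []
    for r in rows:
        key = tuple(r.items())
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Step 3: fix structural errors by merging equivalent labels.
    for r in unique:
        if r["status"] == "Not Applicable":
            r["status"] = "N/A"

    # Step 4: deal with missing data (assumed strategy: fill with the median).
    fill = median(r["age"] for r in unique if r["age"] is not None)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill

    # Step 5: filter out outliers (assumed rule: a plausible age range).
    return [r for r in unique if min_age <= r["age"] <= max_age]


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

The order matters in this sketch: duplicates are removed before the median is computed so that a repeated row does not count twice.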


Nominal: No meaningful order can be given to these categories. We can only compute a frequency distribution, because the scale does nothing more than categorize. For example, for the nominal variable of a preferred mode of transportation, you may have the categories of car, bus, train, or bicycle.

Ordinal: An ordinal data set is a type of statistical data in which the variables are categorized by natural, ordered groups and the distance between the groups is unknown. Examples: socio-economic status ("low income", "middle income", "high income"), education level ("high school", "BS", "MS", "Ph.D.").

Interval and Ratio: Interval and ratio data can both be categorized, ranked, and have equal spacing between adjacent values; however, only ratio scales have a true zero. The temperature in Celsius or Fahrenheit has an interval scale because zero is not the lowest possible value.


▪ Descriptive Statistics: Uses numerical and graphical methods to look for patterns in a data set. Descriptive statistics provides tools to describe a sample: it summarizes the information revealed in a data set and presents that information in a convenient form, e.g. Average, Range, Frequency, Histogram, Median, Scatter Plot, Mode, … (Excel file: teacher2)

Frequency Distribution: how many observations are in each group
Central Tendency: Mean = average; Mode = most frequent number; Median = the middle value of the sorted data
Dispersion: Standard Deviation, Range, Variance

▪ Inferential Statistics: Uses sample data to make estimates, decisions, predictions, or other generalizations about a larger set of data. Starting from the sample, inferential statistics are used to make a statement about the population, e.g. Hypothesis test, Z score, ANOVA, Correlation, Regression, Confidence Interval, …
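
To make the descriptive measures concrete, here is a small sketch using Python's standard `statistics` module. The `scores` list is hypothetical data invented for the example, and the sample (n - 1) formulas for variance and standard deviation are an assumption about which version is wanted.

```python
from collections import Counter
from statistics import mean, median, mode, stdev, variance

# Hypothetical sample of eight observations.
scores = [2, 3, 3, 4, 5, 5, 5, 7]

summary = {
    # Frequency distribution: how many observations fall in each group.
    "frequency": dict(Counter(scores)),
    # Central tendency.
    "mean": mean(scores),
    "median": median(scores),
    "mode": mode(scores),
    # Dispersion (sample formulas, dividing by n - 1).
    "range": max(scores) - min(scores),
    "variance": variance(scores),
    "std_dev": stdev(scores),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Inferential techniques such as hypothesis tests or regression are not covered by this sketch; it only describes the sample at hand.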


1) In this part, I am going to introduce the environment and version of SPSS, how to enter data into SPSS (entering data: csv file (bank)), and how to define the variables.

Free trial of SPSS: 30 days; use the link below:

https://www.ibm.com/analytics/spss-trials

Academic version for students: it may be $35 for 6 months.

2) Secondly, how to get sample files from a hidden part of SPSS:

For Macintosh: /Applications/IBM/SPSS/Statistics/22/Samples/English

For Windows: C:\Program Files\IBM\SPSS\Statistics\22\Samples\English


Three Steps of Cleaning Data:

1. Missing Value Analysis

2. Out-of-Range Values: Find the maximum and minimum values of each variable and check for out-of-range values. If you have access to the original observation that is out of range, you should correct it; otherwise, it must be deleted.

3. Detecting and Removing Outliers: An outlier is an observation in a given dataset that lies far from the rest of the observations. That means an outlier is vastly larger or smaller than the remaining values in the set. Outliers that reflect natural variation in the population should be left as they are in your data set. Other types of outliers must be removed because they represent measurement errors, data entry errors, or poor sampling. (To do so, go to Analyze > Descriptive Statistics > Explore; the box plot shows the outliers.) (Excel file: teacher3)
(To compute the Z-Score: Analyze > Descriptive Statistics > Descriptives > Save standardized values as variables.)
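
Outside SPSS, the same two checks can be sketched in Python: a Z-score for every value and the box plot rule based on the interquartile range. The data are hypothetical, and both cut-offs (an absolute Z-score above 2.5, and fences at 1.5 times the IQR) are common rules of thumb assumed for this example rather than values taken from the course.

```python
from statistics import mean, quantiles, stdev

# Hypothetical measurements with one suspicious value.
values = [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]


def z_scores(xs):
    """Standardize each value: (x - mean) / sample standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


def z_outliers(xs, threshold=2.5):
    """Values whose absolute Z-score exceeds the assumed threshold."""
    return [x for x, z in zip(xs, z_scores(xs)) if abs(z) > threshold]


def iqr_outliers(xs, k=1.5):
    """Values outside the box plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]


print("Z-score outliers:", z_outliers(values))
print("Box plot outliers:", iqr_outliers(values))
```

In small samples the largest possible Z-score is limited by the sample size, so a cut-off of 3 can miss even an extreme value; this is why the sketch uses a lower threshold.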


1. Check whether the outlier is a data entry mistake.

2. Test the result with and without the outlier. Being transparent about this in the final report is a great way to make sure that your final analysis is reliable.

3. Examine whether the data still follow a normal distribution when the outlier(s) are included.


Handling missing values: If you don't filter or identify these data, your analysis may not provide accurate results. The share of missing responses per variable may range from 4% to 10%. Such a range is often considered normal (Hair & Anderson, 2010).

Patterns of missing data, in order of increasing certainty about why the values are missing (Sav file: teacher1):

Missing completely at random (MCAR): the missing data of one variable is totally unrelated to any other variable in the data set.
Missing at random (MAR): the missing data of one variable can be explained by another variable in the data set.
Missing non-random (MNR): the missing data cannot be ignored. (Sav file: teacher1)

If no pattern is detected in the missing values (MCAR) and their percentage exceeds 10%, then pairwise or listwise deletion could be done. However, if the missing values show a pattern, imputation is required. First, run descriptive statistics to check whether the missing values are related to a specific variable, case, or setting (MAR). Then run a skewness test to check whether the distribution of the variable is normal or skewed: if the distribution is approximately normal, you can use the mean to replace the missing values; otherwise, use the median.

As certainty about the missing data increases (MNR), that is, when we are certain about the reason for the missing values, we should apply models like KNN, EM, Regression, … to replace the missing ones.
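
The mean-or-median rule described above can be sketched as follows. The skewness formula is the standard moment-based one; the cut-off of 1.0 for "approximately normal" and both data sets are assumptions made for this illustration, and `None` stands in for a missing value.

```python
from statistics import mean, median, pstdev


def skewness(xs):
    """Moment-based skewness: the average cubed standardized deviation."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def impute(values, skew_limit=1.0):
    """Fill None with the mean if the observed data look roughly symmetric,
    otherwise with the median (skew_limit is an assumed rule of thumb)."""
    observed = [v for v in values if v is not None]
    if abs(skewness(observed)) <= skew_limit:
        fill = mean(observed)
    else:
        fill = median(observed)
    return [fill if v is None else v for v in values]


symmetric = [10, 12, None, 14, 16]   # hypothetical, roughly symmetric
skewed = [1, 2, 2, 3, None, 40]      # hypothetical, strongly right-skewed

print(impute(symmetric))  # missing value filled with the mean
print(impute(skewed))     # missing value filled with the median
```

The median is preferred for the skewed variable because a single extreme value pulls the mean far away from the typical observation.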


5 ways of dealing with missing data:

Listwise deletion: every case that has a missing value on any variable is deleted from the data set.
Pairwise deletion (deleting cases analysis by analysis): in each analysis, only the cases with missing data on the variables used in that analysis are omitted; for another analysis, the cases with missing data on the variables selected there are removed instead.
Mean substitution: missing data are replaced by the mean of the variable (SPSS: Transform > Replace Missing Values).
Regression-based imputation: missing data are replaced with the predicted value generated by multiple regression on the non-missing data of other variables (SPSS: Analyze > Missing Value Analysis).
Expectation-Maximization (EM) algorithm: maximum likelihood estimation (SPSS: Analyze > Missing Value Analysis).

To deselect missing values: Data > Select Cases > If condition (Sav file: teacher1)

NOTE! After replacing the missing values, we can use a paired-samples t-test to check whether the old and new values are alike or not.

SPSS Syntax:
a) It shows what is going on behind SPSS.
b) If you forget after a while how a particular analysis was done, you only need to run the saved syntax.
c) It lets you redo the same analysis over and over again quickly.
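
The difference between listwise and pairwise deletion is easiest to see on a tiny data set. The sketch below uses hypothetical rows with made-up variables (`age`, `income`, `score`) and `None` for a missing value; it illustrates the idea only and is not what SPSS does internally.

```python
# Hypothetical cases; None marks a missing value.
rows = [
    {"age": 25, "income": 30, "score": 7},
    {"age": 31, "income": None, "score": 8},
    {"age": None, "income": 45, "score": 6},
    {"age": 40, "income": 50, "score": None},
]


def listwise(rows):
    """Keep only cases that are complete on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, variables):
    """Keep cases that are complete on the variables used in one analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


print(len(listwise(rows)))                     # cases usable everywhere
print(len(pairwise(rows, ["age", "score"])))   # cases usable for age vs. score
print(len(pairwise(rows, ["income", "age"])))  # cases usable for income vs. age
```

Pairwise deletion keeps more data per analysis, at the cost that different analyses are based on different subsets of cases.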


Types of Descriptive Analysis

Measures of Frequency
An essential part of descriptive analysis is knowing how frequently a certain event or response occurs. This is the prime purpose of measures of frequency, such as a count or percent. Think about a survey in which 500 participants are asked which football team they support. A database with 500 responses would be hard to handle, but measuring how often each football team was selected can make the data more accessible.

Measures of Central Tendency
An important part of any descriptive analysis is finding out what the central (or average) tendency or response is. To determine central tendency, three averages are used: the mean, the median, and the mode. Take the case of a survey that measures the weight of 1,000 people as an example. The mean would be an excellent descriptive metric in this case.

Measures of Dispersion
Often, it is useful to know how data is spread across a range. A sample of two people's weights can be used to illustrate this. The average weight will be 60 kg if both people weigh 60 kg. The average weight will still be 60 kg if one individual weighs 50 kg and the other 70 kg.

Measures of Position
Identifying a value's position or its response in relation to others is another aspect of descriptive analysis. Measures like percentiles, quartiles, and cross-tabulation become very useful in this area.
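
The two-person weight example can be checked in a few lines of Python: both samples share the same mean, and only a dispersion measure tells them apart. The range and the population standard deviation are used here as assumed choices of dispersion measure.

```python
from statistics import mean, pstdev

# The two samples from the weight example, in kg.
same = [60, 60]
spread = [50, 70]

for label, weights in (("same", same), ("spread", spread)):
    print(
        label,
        "mean:", mean(weights),
        "range:", max(weights) - min(weights),
        "std dev:", pstdev(weights),
    )
```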

Your aim is to find out what activities are most popular among women and men. Participants are asked to rate how often they have done these things during the past year (Sav file: teacher6):

Go to a library
Watch a movie at a theater
Visit a national park

A survey's responses make up your data set. Now you can use descriptive statistics to find out the overall frequency of each activity (frequency), the averages for each activity (central tendency), the spread of responses for each activity (dispersion), and a measure that tells us whether a value is about average or unusually high or low (position).

Percentiles are values that separate the data into 100 equal parts. For example, the 95th percentile separates the lowest 95% of the values from the top 5%. The 25th percentile is the 1st quartile, and the 50th percentile is the 2nd quartile and the median.

$$\text{Percentile rank of } x = \frac{\text{number of values below } x}{n} \times 100$$

What value exists at the percentile ranking of 25%?

$$\text{Value \#} = \frac{\text{percentile}}{100}\,(n + 1)$$

SPSS: Analyze > Descriptive Statistics > Frequencies > Statistics (percentile & quartile).
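
The two percentile formulas can be tried out in Python: the percentile rank of x is the number of values below x divided by n, times 100, and the position of a given percentile in the sorted data is percentile / 100 × (n + 1). The data are hypothetical, and the linear interpolation used when the position is not a whole number is an assumption added for this sketch; the formulas themselves do not say how to handle that case.

```python
def percentile_rank(values, x):
    """Percentage of values that lie strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


def value_at_percentile(values, percentile):
    """Value at the 1-based position percentile / 100 * (n + 1)."""
    ordered = sorted(values)
    position = percentile / 100 * (len(ordered) + 1)
    lower = int(position)
    fraction = position - lower
    if lower < 1:                 # position falls before the first value
        return ordered[0]
    if lower >= len(ordered):     # position falls at or after the last value
        return ordered[-1]
    # Assumed: interpolate linearly between the two neighbouring values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Hypothetical data set with n = 7 observations.
data = [12, 15, 18, 20, 22, 25, 30]

print(value_at_percentile(data, 25))  # 1st quartile: position 2 of 7
print(value_at_percentile(data, 50))  # median: position 4 of 7
print(percentile_rank(data, 20))      # share of values below 20
```

With seven observations the 25th percentile sits at position 0.25 × 8 = 2, so no interpolation is needed; with other sample sizes the position is usually fractional.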


Cross-Tabulation: It is very common for market researchers to use crosstabs to compare customers or products. A crosstabulation table is a simple way to examine the relationship between two categorical (nominal or ordinal) variables, e.g. which age group prefers which insurance?

Crosstabs can be used to investigate the relationship between two variables. Since a crosstab is a descriptive statistic, it can only report on a sample. The chi-square test must be used when describing the population. In general, independent variables are placed in columns and dependent variables in rows. (Excel file: teacher2)

SPSS: Analyze > Descriptive Statistics > Crosstabs
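
A crosstab is just a count of every combination of two categorical variables. The sketch below builds one with Python's `collections.Counter` for the age group versus insurance question; the responses, the age bands, and the plan names are hypothetical. Following the convention above, the independent variable (age group) goes in the columns and the dependent variable (insurance plan) in the rows.

```python
from collections import Counter

# Hypothetical survey responses: (age group, preferred insurance plan).
responses = [
    ("18-34", "Basic"), ("18-34", "Basic"), ("18-34", "Premium"),
    ("35-54", "Basic"), ("35-54", "Premium"), ("35-54", "Premium"),
    ("55+", "Premium"), ("55+", "Premium"),
]

counts = Counter(responses)
age_groups = sorted({age for age, _ in responses})   # columns (independent)
plans = sorted({plan for _, plan in responses})      # rows (dependent)

# table[plan][age_group] holds the number of respondents in that cell.
table = {plan: {age: counts[(age, plan)] for age in age_groups} for plan in plans}

print("Plan".ljust(10) + "".join(age.rjust(8) for age in age_groups))
for plan in plans:
    print(plan.ljust(10) + "".join(str(table[plan][age]).rjust(8) for age in age_groups))
```

The table only describes this sample; deciding whether the pattern holds in the population would still require a chi-square test.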
