
DS-221

Inferential Statistics
Assignment# 2
Dua Amjad 2021142
Aaisha Jalal 2021002
Hamza Asif 2021196
Introduction:
The main objective of this assessment is to cover the main stages of statistical data analysis, a basic step in
any research project. By analyzing and visualizing data, we can get a true sense of what the data looks like and
what sorts of questions we can answer from it. It is also a way to find patterns, spot outliers, and catch other
inconsistencies. We explored these concepts by analyzing a dataset in Python. The dataset contains client data
for an insurance company: each of its 10,000 rows corresponds to a single client, with 19 variables recording a
variety of client-specific data.

Cleaning the data:


The practice of correcting or deleting inaccurate, damaged, improperly formatted, duplicate, or incomplete data
from a dataset is known as data cleaning. Even though our dataset doesn't seem to have any major problems,
we still need to perform some simple cleaning and transformation before using it for the main statistical analysis
task. The isnull() method checks the dataset for missing or null values. The mean credit score varies
significantly across income categories, so the mean credit score of each income category can be used to impute the
missing values in the "credit_score" column. Writing a function is the most straightforward way to accomplish
this, so that we don't have to duplicate code for each income category. We then tackled the missing data in the
"annual_mileage" column in the same way, this time grouping by the values of the "driving_experience" column.
Finally, since neither the "id" nor the "postal_code" column is relevant to our analysis, we removed both using
the drop() method. With the data preparation phase finished, we can focus on our main task: data analysis.
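The cleaning steps above can be sketched with pandas on a tiny made-up frame. The column names follow the report; the values and groupings are invented for illustration:

```python
import pandas as pd
import numpy as np

# Small synthetic frame mimicking the report's columns (values are made up).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "income": ["upper class", "poverty", "upper class", "poverty"],
    "credit_score": [0.8, 0.4, np.nan, np.nan],
    "driving_experience": ["0-9y", "10-19y", "0-9y", "10-19y"],
    "annual_mileage": [12000.0, np.nan, 11000.0, 14000.0],
    "postal_code": [10238, 10238, 32765, 92101],
})

# 1. Check each column for missing values.
print(df.isnull().sum())

# 2. Impute "credit_score" with the mean score of each income category.
df["credit_score"] = df["credit_score"].fillna(
    df.groupby("income")["credit_score"].transform("mean")
)

# 3. Impute "annual_mileage" with the mean mileage per driving-experience group.
df["annual_mileage"] = df["annual_mileage"].fillna(
    df.groupby("driving_experience")["annual_mileage"].transform("mean")
)

# 4. Drop the columns that are not relevant to the analysis.
df = df.drop(columns=["id", "postal_code"])
```

Using groupby().transform("mean") inside fillna() replaces the per-category helper function the report describes with a single vectorized step.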

Statistics Summary:
A statistics summary gives a quick and simple description of the data. It includes the count, mean, standard
deviation, minimum value, maximum value, and quartiles (including the median). In Python, this can be
achieved using describe(), which provides a statistical summary of columns with a numeric datatype such
as int or float.
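A minimal sketch of describe() on a toy frame (the column names follow the report; the values are illustrative, not the real dataset's):

```python
import pandas as pd

# Toy numeric columns standing in for the real dataset.
df = pd.DataFrame({
    "credit_score": [0.45, 0.62, 0.71, 0.58, 0.90],
    "annual_mileage": [9000, 12000, 11000, 14000, 10000],
    "speeding_violations": [0, 2, 1, 5, 0],
})

# describe() summarizes numeric columns only by default:
# count, mean, std, min, 25%, 50% (median), 75%, max.
summary = df.describe()
print(summary)

# Individual statistics can be read off by label, e.g. the mean credit score:
print(summary.loc["mean", "credit_score"])
```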

From this summary, we get the following findings:

 There are 19 columns and 10,000 rows in the dataset.
 The maximum and minimum credit scores are 0.960819 and 0.053358 respectively.
 The mean and standard deviation of annual mileage are 11697.003207 and 2680.167384 respectively.
 The maximum number of speeding violations is 22.
 The maximum income is earned by the ‘working class’ category.
 The maximum claim rate is 1 and the minimum is 0.
 The distribution of gender in the dataset shows that the total count of ‘female’ clients is larger than that
of ‘male’ clients.
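Category counts such as the gender distribution come from value_counts(); a tiny made-up example:

```python
import pandas as pd

# Toy "gender" column (fabricated values; the real dataset has more females).
gender = pd.Series(["female", "male", "female", "female", "male"])

# value_counts() returns the frequency of each category, most common first.
counts = gender.value_counts()
print(counts)
```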

Univariate Analysis:
In a univariate analysis, every variable in a data set is examined independently.

 Categorical ordered data: this kind of variable has a logical hierarchy and progression; our dataset
includes "education" and "income". We used a pie chart to investigate the "income" variable: the largest
category is ‘upper class’, representing 43% of the total, followed by ‘middle class’ (21%), ‘poverty’ (18%),
and ‘working class’ (17%). We then used a count plot to investigate the ‘education’ variable.
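Both charts can be sketched with matplotlib (a plain bar chart standing in for Seaborn's countplot). The category shares are taken from the report; the individual rows are fabricated:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Fabricated rows whose shares roughly match the report's 43/21/18/17 split.
income = pd.Series(
    ["upper class"] * 43 + ["middle class"] * 21
    + ["poverty"] * 18 + ["working class"] * 17
)
counts = income.value_counts()

# Pie chart of the "income" variable.
fig, ax = plt.subplots()
ax.pie(counts, labels=counts.index, autopct="%1.0f%%")
ax.set_title("Income distribution")
fig.savefig("income_pie.png")

# A count plot is essentially a bar chart of the same category counts.
fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)
ax.set_ylabel("count")
fig.savefig("income_counts.png")
```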

 Numeric: we computed the mean, mode, maximum, minimum, standard deviation, etc. of the ‘credit_score’
column. Using Seaborn's histplot() method, we can plot its distribution: the “credit_score” column follows
a normal distribution, or bell curve.
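A sketch of that histogram on simulated scores (the bell shape is built in by construction here; matplotlib's hist() stands in for Seaborn's histplot()):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

# Simulated credit scores: roughly bell-shaped in [0, 1] (made up, seeded).
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.5, scale=0.13, size=10_000).clip(0, 1)

# Equivalent of sns.histplot(scores) with plain matplotlib.
fig, ax = plt.subplots()
ax.hist(scores, bins=40, edgecolor="white")
ax.set_xlabel("credit_score")
ax.set_ylabel("count")
fig.savefig("credit_score_hist.png")

print(scores.mean())
```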

Bivariate analysis:
 Numeric-Numeric: We analyzed the association between ‘annual_mileage’ and ‘speeding violations’. A
negative correlation between annual mileage and the number of speeding violations can be deduced
from the graph: the more miles a client drives per year, the fewer speeding violations they
commit.
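A scatter plot of this kind of relationship can be sketched on fabricated data that reproduces the downward trend described above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

# Fabricated data: violations decrease with mileage, plus noise.
rng = np.random.default_rng(1)
annual_mileage = rng.normal(12_000, 2_500, size=500)
speeding_violations = (
    20 - annual_mileage / 1_000 + rng.normal(0, 1.5, size=500)
).clip(min=0).round()

fig, ax = plt.subplots()
ax.scatter(annual_mileage, speeding_violations, alpha=0.3)
ax.set_xlabel("annual_mileage")
ax.set_ylabel("speeding_violations")
fig.savefig("mileage_vs_violations.png")

# The correlation coefficient confirms the downward trend in this toy data.
r = np.corrcoef(annual_mileage, speeding_violations)[0, 1]
print(r)
```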

 Correlation: to obtain more detailed information about the links between these variables, we computed a
correlation matrix over the columns "past accidents," "DUIs," and "speeding violations." All three
variables show a positive correlation with one another, which means that when one increases, the
others tend to increase too, and vice versa. A correlation coefficient between 0.3 and 0.5, as is the
case for most of our pairs, suggests weak correlation, whereas a coefficient between 0.5 and 0.7
indicates variables that can be regarded as moderately correlated. This means there is a mild, positive
correlation between the number of past accidents and DUIs, and a moderate, positive correlation
between the number of past accidents and speeding violations.

 A heatmap is the most effective tool for visualizing correlation. By providing the correlation matrix to
Seaborn's heatmap() method, we simply generated one.
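A minimal sketch of the correlation matrix and heatmap on fabricated counts (a shared Poisson component induces the positive correlations; matplotlib's imshow() stands in for sns.heatmap()):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fabricated count columns sharing a common component, so all pairs
# are positively correlated by construction.
rng = np.random.default_rng(2)
base = rng.poisson(2, size=300)
df = pd.DataFrame({
    "past_accidents": base + rng.poisson(1, size=300),
    "DUIs": base + rng.poisson(2, size=300),
    "speeding_violations": base + rng.poisson(3, size=300),
})

corr = df.corr()
print(corr)

# Heatmap of the correlation matrix; sns.heatmap(corr, annot=True) also works.
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.tight_layout()
fig.savefig("correlation_heatmap.png")
```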

 Numeric-Categorical: box plots display a five-number summary of a set of data: minimum, first quartile,
median, third quartile, and maximum. We compared annual mileage across the two categories of
"outcome". Both groups have similar medians (denoted by the middle line that runs through the box);
however, clients who filed claims have a somewhat greater median annual mileage than those who did
not. The same goes for the first and third quartiles (denoted by the lower and upper borders of the
box respectively).
 We compared the distributions of the two categories in "outcome" by credit score in the same way.
 We can do the same for “vehicle_year”: owners of older vehicles are far more likely to submit a claim.
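The mileage-by-outcome box plot above can be sketched on fabricated data with the described gap between the groups (matplotlib's boxplot() standing in for sns.boxplot()):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fabricated mileage split by claim outcome (0 = no claim, 1 = claim);
# the claim group is given a slightly higher mileage by construction.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "outcome": [0] * 200 + [1] * 200,
    "annual_mileage": np.concatenate([
        rng.normal(11_000, 2_500, 200),   # no claim
        rng.normal(12_500, 2_500, 200),   # claim
    ]),
})

# Equivalent of sns.boxplot(data=df, x="outcome", y="annual_mileage").
groups = [g["annual_mileage"].values for _, g in df.groupby("outcome")]
fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticklabels(["no claim", "claim"])
ax.set_ylabel("annual_mileage")
fig.savefig("mileage_by_outcome.png")

medians = df.groupby("outcome")["annual_mileage"].median()
print(medians)
```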

Multivariate analysis:
It is the type of statistical analysis of data where each experimental unit is subjected to multiple measurements.

 With a third variable, claim rate, we can quickly determine how factors in our data set such as "education"
and "income" are related to one another. We started by making a pivot table and then passed it as input
to the heatmap() method in Seaborn.
 We built a similar heatmap to depict gender, family status, and claim rate. Males without kids are the most
likely to file a claim, while women with kids are the least likely to do so.
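The pivot-table-plus-heatmap step can be sketched as follows. The rows are fabricated, and the "children" column name is an assumed stand-in for the report's family-status variable:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Fabricated rows: gender, family status ("children" is an assumed name),
# and claim outcome (0 = no claim, 1 = claim).
df = pd.DataFrame({
    "gender":   ["male", "male", "male", "male",
                 "female", "female", "female", "female"],
    "children": ["no", "no", "no", "yes", "no", "no", "yes", "yes"],
    "outcome":  [1, 1, 0, 0, 1, 0, 0, 0],
})

# Mean outcome per cell = claim rate for that gender / family-status group.
pivot = df.pivot_table(index="gender", columns="children",
                       values="outcome", aggfunc="mean")
print(pivot)

# Feed the pivot table to a heatmap; sns.heatmap(pivot, annot=True) also works.
fig, ax = plt.subplots()
im = ax.imshow(pivot, vmin=0, vmax=1, cmap="viridis")
ax.set_xticks(range(len(pivot.columns)), pivot.columns)
ax.set_yticks(range(len(pivot.index)), pivot.index)
fig.colorbar(im, label="claim rate")
fig.savefig("claim_rate_heatmap.png")
```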

Conclusion: In this assignment, we explored the basics of statistical analysis by conducting univariate,
bivariate, and multivariate investigations of a dataset. We learned to clearly outline the sorts of questions to
tackle, the visualizations to create, and the analyses to perform while exploring a dataset.

Appendix:
Python Code:

Dataset Link: https://raw.githubusercontent.com/siglimumuni/Datasets/master/customer-data.csv
