Assignment2 Stats
Inferential Statistics
Assignment# 2
Dua Amjad 2021142
Aaisha Jalal 2021002
Hamza Asif 2021196
Introduction:
The main objective of this assignment is to cover the stages of statistical data analysis, which is a basic step in
any research analysis. By analyzing and visualizing data, we can get a true sense of what the data looks like and
what sorts of questions we can answer from it. It is also a way to find patterns and to spot outliers and other
inconsistencies. We explored this concept by analyzing a dataset using Python. The dataset contains client data
for an insurance company. Each of its 10,000 rows corresponds to a single client, with 19 variables recording a
variety of client-specific data.
Statistics Summary:
A statistics summary gives a quick and simple description of the data. It includes the count, mean, standard
deviation, median, minimum value, and maximum value of each column, among others. In Python, this can be
achieved using describe(), which by default summarizes columns of a numerical datatype such as int or float.
Measures that describe() does not report for numeric columns, such as the mode and the range, can be computed
separately.
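As a minimal sketch of the summary step, the snippet below calls describe() on a small DataFrame. The rows are made-up stand-ins, not the insurance dataset, and the column names are assumptions chosen to resemble the ones discussed in this report.

```python
import pandas as pd

# Hypothetical stand-in for the insurance data; all values are made up.
df = pd.DataFrame({
    "credit_score": [0.63, 0.36, 0.49, 0.21, 0.39],
    "annual_mileage": [12000, 16000, 11000, 11000, 12000],
    "education": ["high school", "none", "high school", "university", "none"],
})

# By default describe() summarizes only the numeric columns.
summary = df.describe()
print(summary)

# The median is the "50%" row; mode and range are computed separately.
value_range = df["annual_mileage"].max() - df["annual_mileage"].min()
modes = df["annual_mileage"].mode().tolist()
print("range:", value_range, "mode(s):", modes)
```

The text column "education" is left out of the summary because it is not numeric.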
Univariate Analysis:
In a univariate analysis, every variable in a data set is examined independently.
Numeric: We computed the mean, mode, maximum, minimum, standard deviation, etc. on the ‘credit_score’
column.
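The univariate measures listed above could be computed roughly as follows. This is a sketch on made-up values, not the actual ‘credit_score’ column, and the dictionary name stats is only illustrative.

```python
import pandas as pd

# Made-up values standing in for the 'credit_score' column.
credit_score = pd.Series([0.2, 0.4, 0.4, 0.6, 0.9], name="credit_score")

stats = {
    "mean": credit_score.mean(),
    "median": credit_score.median(),
    "mode": credit_score.mode().iloc[0],  # first mode if there are several
    "min": credit_score.min(),
    "max": credit_score.max(),
    "std": credit_score.std(),  # sample standard deviation (ddof=1)
}

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```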
Bivariate analysis:
Numeric-Numeric: We analyzed the association between ‘annual_mileage’ and ‘speeding_violations’. A
negative correlation between annual mileage and the number of speeding violations may be deduced
from the graph, which means that, in this dataset, the more miles a client drives per year, the fewer
speeding violations they tend to commit.
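The direction of such a relationship can also be checked numerically with the Pearson correlation coefficient. The sketch below uses made-up values chosen to show a negative trend; the column name "speeding_violations" (with an underscore) is an assumption about how the column is spelled in the data.

```python
import pandas as pd

# Made-up values chosen so that violations fall as mileage rises.
df = pd.DataFrame({
    "annual_mileage": [8000, 10000, 12000, 14000, 16000],
    "speeding_violations": [4, 3, 2, 2, 0],
})

# Series.corr() computes the Pearson coefficient by default.
r = df["annual_mileage"].corr(df["speeding_violations"])
print(f"Pearson r = {r:.3f}")
```

A negative value of r corresponds to the downward trend seen in a scatter plot of the two columns.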
Correlation: To obtain more detail on the relationships between variables, we computed a correlation
matrix using the columns "past accidents," "DUIs," and "speeding violations" as variables. All of our
variables show a positive correlation with one another, which means that when one increases, the others
tend to increase as well. A correlation coefficient between 0.3 and 0.5, as is the case with the majority of
our variables, suggests a weak correlation, whereas a coefficient between 0.5 and 0.7 often indicates a
moderate correlation. This indicates that there is a weak, positive correlation between the number of past
accidents and DUIs and a moderate, positive correlation between the number of past accidents and
speeding violations.
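The rule of thumb above can be written as a small helper. This is only a sketch: the function name is hypothetical, and assigning the shared boundary value 0.5 to "moderate" is an assumption, since the text places it in both ranges.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the ranges quoted in the text."""
    magnitude = abs(r)  # the sign gives direction, not strength
    if 0.3 <= magnitude < 0.5:
        return "weak"
    if 0.5 <= magnitude <= 0.7:  # assumption: 0.5 itself counts as moderate
        return "moderate"
    return "outside the ranges discussed"


for r in (0.35, 0.5, -0.62, 0.1):
    print(r, correlation_strength(r))
```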
A heatmap is an effective tool for visualizing correlation. We generated one by passing the correlation
matrix to Seaborn's heatmap() function.
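A sketch of this step is shown below, using made-up rows and assumed column names. The helper plot_corr_heatmap is hypothetical; it imports the plotting libraries only when called, so the correlation matrix can be computed without them.

```python
import pandas as pd

# Made-up rows; the column names are assumptions based on the text.
df = pd.DataFrame({
    "past_accidents": [0, 1, 2, 3, 4, 0],
    "DUIs": [0, 0, 1, 1, 2, 0],
    "speeding_violations": [0, 2, 1, 4, 5, 1],
})

cols = ["past_accidents", "DUIs", "speeding_violations"]
corr = df[cols].corr()  # pairwise Pearson correlations
print(corr.round(2))


def plot_corr_heatmap(corr_matrix):
    """Draw an annotated heatmap of a correlation matrix (hypothetical helper)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    plt.tight_layout()
    plt.show()
```

Calling plot_corr_heatmap(corr) draws the figure. Fixing vmin and vmax at -1 and 1 keeps the color scale comparable between plots.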
We can do the same for “vehicle_year”. Older-vehicle owners are far more likely to submit a claim.
Multivariate analysis:
It is the type of statistical analysis in which multiple measurements are taken on each experimental unit.
With a third variable, claim rate, we can quickly determine how factors in our data set like "education"
and "income" are related to one another. We started by making a pivot table and then passed it to the
heatmap() function in Seaborn, which accepts a pivot table as input.
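The pivot-table step might look like the sketch below. The rows are made up, and the column name "outcome" (1 if the client filed a claim, 0 otherwise) is an assumption; the mean of such a 0/1 column within each group is that group's claim rate. The helper plot_claim_rate_heatmap is hypothetical.

```python
import pandas as pd

# Made-up rows; "outcome" is an assumed 0/1 claim indicator.
df = pd.DataFrame({
    "education": ["none", "none", "university", "university", "none", "university"],
    "income": ["poverty", "poverty", "upper class", "upper class", "upper class", "poverty"],
    "outcome": [1, 0, 0, 0, 1, 1],
})

# Mean of the 0/1 indicator per (education, income) cell = claim rate.
claim_rate = pd.pivot_table(
    df, index="education", columns="income", values="outcome", aggfunc="mean"
)
print(claim_rate)


def plot_claim_rate_heatmap(table):
    """Draw the pivot table as an annotated heatmap (hypothetical helper)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(table, annot=True, cmap="Blues")
    plt.tight_layout()
    plt.show()
```

Calling plot_claim_rate_heatmap(claim_rate) draws one cell per combination of education and income, colored by claim rate.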
To depict gender, family status, and claim rate, we built a heatmap. Men without children are more likely
to file a claim than women with children.
Conclusion: In this assignment, we explored the basics of statistical analysis by conducting univariate,
bivariate, and multivariate analyses on a dataset. We learned to clearly outline the sorts of questions to ask, the
sorts of visualizations to create, and the different analyses to run while exploring a dataset.
Appendix:
Python Code: