Professional Documents
Culture Documents
A Population Is The Entire Group That You Want To Draw Conclusions About
A Population Is The Entire Group That You Want To Draw Conclusions About
about. A sample is the specific group that you will collect data from.
The size of the sample is always less than the total size of the
population.
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness
We will begin by loading the dataset to be used in this guide.
Data
In this guide, we will be using fictitious data of loan applicants containing
600 observations and 10 variables, as described below:
Mean
Mean represents the arithmetic average of the data. The line of code below
prints the mean of the numerical variables in the data. From the output, we
can infer that the average age of the applicant is 49 years, the average
annual income is USD 705,541, and the average tenure of loans is 183
months. The command df.mean(axis = 0) will also give the same output.
1df.mean()
python
Output:
1 Dependents 0.748333
2 Income 705541.333333
3 Loan_amount 323793.666667
4 Term_months 183.350000
5 Age 49.450000
6 dtype: float64
It is also possible to calculate the mean of a particular variable in a data, as
shown below, where we calculate the mean of the variables 'Age' and
'Income'.
1print(df.loc[:,'Age'].mean())
2print(df.loc[:,'Income'].mean())
python
Output:
1 49.45
2 705541.33
In the previous sections, we computed the column-wise mean. It is also
possible to calculate the mean of the rows by specifying the (axis =
1) argument. The code below calculates the mean of the first five rows.
1df.mean(axis = 1)[0:5]
python
Output:
1 0 70096.0
2 1 161274.0
3 2 125113.4
4 3 119853.8
5 4 120653.8
6 dtype: float64
Median
In simple terms, median represents the 50th percentile, or the middle value
of the data, that separates the distribution into two halves. The line of code
below prints the median of the numerical variables in the data. The
command df.median(axis = 0) will also give the same output.
1df.median()
python
Output:
1 Dependents 0.0
2 Income 508350.0
3 Loan_amount 76000.0
4 Term_months 192.0
5 Age 51.0
6 dtype: float64
From the output, we can infer that the median age of the applicants is 51
years, the median annual income is USD 508,350, and the median tenure
of loans is 192 months. There is a difference between the mean and the
median values of these variables, which is because of the distribution of the
data. We will learn more about this in the subsequent sections.
It is also possible to calculate the median of a particular variable in a data,
as shown in the first two lines of code below. We can also calculate the
median of the rows by specifying the (axis = 1) argument. The third
line below calculates the median of the first five rows.
1#to calculate a median of a particular column
2print(df.loc[:,'Age'].median())
3print(df.loc[:,'Income'].median())
4
5df.median(axis = 1)[0:5]
python
Output:
1 51.0
2 508350.0
3
4 0 102.0
5 1 192.0
6 2 192.0
7 3 192.0
8 4 192.0
9 dtype: float64
Mode
Mode represents the most frequent value of a variable in the data. This is
the only central tendency measure that can be used with categorical
variables, unlike the mean and the median which can be used only with
quantitative data.
The line of code below prints the mode of all the variables in the data.
The .mode() function returns the most common value or most repeated
value of a variable. The command df.mode(axis = 0) will also give the same
output.
1df.mode()
python
Output:
1| | Marital_status | Dependents |
Is_graduate | Income | Loan_amount |
Term_months | Credit_score | approval_status
| Age | Sex |
2|--- |---------------- |------------ |---------
---- |-------- |------------- |------------- |---
----------- |----------------- |----- |-----
|
3| 0 | Yes | 0 | Yes
| 333300 | 70000 | 192.0 |
Satisfactory | Yes | 55 | M
|
The interpretation of the mode is simple. The output above shows that most
of the applicants are married, as depicted by the 'Marital_status' value of
"Yes". Similar interpreation could be done for the other categorical
variables like 'Sex' and 'Credit-Score'. For numerical variables, the mode
value represents the value that occurs most frequently. For example, the
mode value of 55 for the variable 'Age' means that the highest number (or
frequency) of applicants are 55 years old.
Measures of Dispersion
In the previous sections, we have discussed the various measures of
central tendency. However, as we have seen in the data, the values of
these measures differ for many variables. This is because of the extent to
which a distribution is stretched or squeezed. In statistics, this is measured
by dispersion which is also referred to as variability, scatter, or spread. The
most popular measures of dispersion are standard deviation, variance, and
the interquartile range.
Standard Deviation
Standard deviation is a measure that is used to quantify the amount of
variation of a set of data values from its mean. A low standard deviation for
a variable indicates that the data points tend to be close to its mean, and
vice versa. The line of code below prints the standard deviation of all the
numerical variables in the data.
1df.std()
python
Output:
1 Dependents 1.026362
2 Income 711421.814154
3 Loan_amount 724293.480782
4 Term_months 31.933949
5 Age 14.728511
6 dtype: float64
While interpreting standard deviation values, it is important to understand
them in conjunction with the mean. For example, in the above output, the
standard deviation of the variable 'Income' is much higher than that of the
variable 'Dependents'. However, the unit of these two variables is different
and, therefore, comparing the dispersion of these two variables on the
basis of standard deviation alone will be incorrect. This needs to be kept in
mind.
It is also possible to calculate the standard deviation of a particular
variable, as shown in the first two lines of code below. The third
line calculates the standard deviation for the first five rows.
1print(df.loc[:,'Age'].std())
2print(df.loc[:,'Income'].std())
3
4#calculate the standard deviation of the first five
rows
5df.std(axis = 1)[0:5]
python
Output:
1 14.728511412020659
2 711421.814154101
3
4 0 133651.842584
5 1 305660.733951
6 2 244137.726597
7 3 233466.205060
8 4 202769.786470
9 dtype: float64
Variance
Variance is another measure of dispersion. It is the square of the standard
deviation and the covariance of the random variable with itself. The line of
code below prints the variance of all the numerical variables in the dataset.
The interpretation of the variance is similar to that of the standard deviation.
1df.var()
python
Output:
1 Dependents 1.053420e+00
2 Income 5.061210e+11
3 Loan_amount 5.246010e+11
4 Term_months 1.019777e+03
5 Age 2.169290e+02
6 dtype: float64
Skewness
Another useful statistic is skewness, which is the measure of the symmetry,
or lack of it, for a real-valued random variable about its mean. The
skewness value can be positive, negative, or undefined. In a perfectly
symmetrical distribution, the mean, the median, and the mode will all have
the same value. However, the variables in our data are not symmetrical,
resulting in different values of the central tendency.
We can calculate the skewness of the numerical variables using
the skew() function, as shown below.
1print(df.skew())
python
Output:
1 Dependents 1.169632
2 Income 5.344587
3 Loan_amount 5.006374
4 Term_months -2.471879
5 Age -0.055537
6 dtype: float64
The skewness values can be interpreted in the following manner:
Highly skewed distribution: If the skewness value is less than −1 or
greater than +1.
Moderately skewed distribution: If the skewness value is between −1
and −½ or between +½ and +1.
Approximately symmetric distribution: If the skewness value is
between −½ and +½.
Matplotlib
Seaborn
Conceptualized and built originally at the Stanford University, this library sits on top
of matplotlib. In a sense, it has some flavors of matplotlib while from the visualization
point, it is much better than matplotlib and has added features as well. Below are its
advantages
Built-in themes aid better visualization
Statistical functions aiding better data insights
Better aesthetics and built-in plots
Helpful documentation with effective examples
Nature of Visualization
Depending on the number of variables used for plotting the visualization and the type
of variables, there could be different types of charts which we could use to understand
the relationship. Based on the count of variables, we could have
Univariate plot(involves only one variable)
Bivariate plot(more than one variable in required)
A Univariate plot could be for a continuous variable to understand the spread and
distribution of the variable while for a discrete variable it could tell us the count
Similarly, a Bivariate plot for continuous variable could display essential statistic like
correlation, for a continuous versus discrete variable could lead us to very important
conclusions like understanding data distribution across different levels of a categorical
variable. A bivariate plot between two discrete variables could also be developed.
Box plot
A boxplot, also known as a box and whisker plot, the box and the whisker are clearly
displayed in the below image. It is a very good visual representation when it comes to
measuring the data distribution. Clearly plots the median values, outliers and the
quartiles. Understanding data distribution is another important factor which leads to
better model building. If data has outliers, box plot is a recommended way to identify
them and take necessary actions.
Syntax: seaborn.boxplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, orient=None, color=None, palette=None, saturation=0.75,
width=0.8, dodge=True, fliersize=5, linewidth=None, whis=1.5, ax=None, **kwargs)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting. If x and y are absent, this is interpreted as wide-form.
color: Color for all of the elements.
Returns: It returns the Axes object with the plot drawn onto it.
The box and whiskers chart shows how data is spread out. Five pieces of information
are generally included in the chart
1. The minimum is shown at the far left of the chart, at the end of the left ‘whisker’
2. First quartile, Q1, is the far left of the box (left whisker)
3. The median is shown as a line in the center of the box
4. Third quartile, Q3, shown at the far right of the box (right whisker)
5. The maximum is at the far right of the box
As could be seen in the below representations and charts, a box plot could be plotted
for one or more than one variable providing very good insights to our data.
Representation of box plot.
Python3
Python3
sns.boxplot(data=diabetes.select_dtypes(include='number'))
Scatter Plot
Scatter plots or scatter graphs is a bivariate plot having greater resemblance to line
graphs in the way they are built. A line graph uses a line on an X-Y axis to plot a
continuous function, while a scatter plot relies on dots to represent individual pieces of
data. These plots are very useful to see if two variables are correlated. Scatter plot
could be 2 dimensional or 3 dimensional.
Syntax: seaborn.scatterplot(x=None, y=None, hue=None, style=None, size=None,
data=None, palette=None, hue_order=None, hue_norm=None, sizes=None,
size_order=None, size_norm=None, markers=True, style_order=None,
x_bins=None, y_bins=None, units=None, estimator=None, ci=95, n_boot=1000,
alpha=’auto’, x_jitter=None, y_jitter=None, legend=’brief’, ax=None, **kwargs)
Parameters:
x, y: Input data variables that should be numeric.
data: Dataframe where each column is a variable and each row is an observation.
size: Grouping variable that will produce points with different sizes.
style: Grouping variable that will produce points with different markers.
palette: Grouping variable that will produce points with different markers.
markers: Object determining how to draw the markers for different levels.
alpha: Proportional opacity of the points.
Returns: This method returns the Axes object with the plot drawn onto it.
Python3
# import module
plt.scatter(diabetes['DiabetesPedigreeFunction'], diabetes['BMI'])
Output 2D Scattered Plot
Python3
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 6, 2, 3, 13, 4, 1, 2, 4, 8]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# display illustration
plt.show()
Histogram
Histograms display counts of data and are hence similar to a bar chart. A histogram
plot can also tell us how close a data distribution is to a normal curve. While working
out statistical method, it is very important that we have a data which is normally or
close to a normal distribution. However, histograms are univariate in nature and bar
charts bivariate.
A bar graph charts actual counts against categories e.g. height of the bar indicates the
number of items in that category whereas a histogram displays the same categorical
variables in bins.
Bins are integral part while building a histogram they control the data points which are
within a range. As a widely accepted choice we usually limit bin to a size of 5-20,
however this is totally governed by the data points which is present.
Python3
# illustrate histogram
diabetes[features].hist(figsize=(10, 4))
Output Histogram
Countplot
A countplot is a plot between a categorical and a continuous variable. The continuous
variable in this case being the number of times the categorical is present or simply the
frequency. In a sense, count plot can be said to be closely linked to a histogram or a
bar graph.
Syntax : seaborn.countplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, orient=None, color=None, palette=None, saturation=0.75,
dodge=True, ax=None, **kwargs)
Parameters : This method is accepting the following parameters that are described
below:
Output Countplot
Correlation plot
Correlation plot is a multi-variate analysis which comes very handy to have a look at
relationship with data points. Scatter plots helps to understand the affect of one
variable over the other. Correlation could be defined as the affect which one variable
has over the other.
Correlation could be calculated between two variables or it could be one versus many
correlations as well which we could see the below plot. Correlation could be positive,
negative or neutral and the mathematical range of correlations is from -1 to 1.
Understanding the correlation could have a very significant effect on the model
building stage and also understanding the model outputs.
Python3
# adjust plot
# assign data
sns.heatmap(diabetes.select_dtypes(include='number').corr(),
Heat Maps
Heat map is a multi-variate data representation. The color intensity in a heat map
displays becomes an important factor to understand the affect of data points. Heat
maps are easier to understand and easier to explain as well. When it comes to data
analysis using visualization, its very important that the desired message gets conveyed
with the help of plots.
Syntax:
seaborn.heatmap(data, *, vmin=None, vmax=None, cmap=None, center=None, robus
t=False, annot=None, fmt=’.2g’, annot_kws=None, linewidths=0, linecolor=’white’,
cbar=True, cbar_kws=None, cbar_ax=None, square=False, xticklabels=’auto’, ytick
labels=’auto’, mask=None, ax=None, **kwargs)
Parameters : This method is accepting the following parameters that are described
below:
Python3
import numpy as np
# assign data
Pie Chart
Pie chart is a univariate analysis and are typically used to show percentage or
proportional data. The percentage distribution of each class in a variable is provided
next to the corresponding slice of the pie. The python libraries which could be used to
build a pie chart is matplotlib and seaborn.
Syntax: matplotlib.pyplot.pie(data, explode=None, labels=None, colors=None,
autopct=None, shadow=False)
Parameters:
data represents the array of data values to be plotted, the fractional area of each slice
is represented by data/sum(data). If sum(data)<1, then the data values returns the
fractional area directly, thus resulting pie will have empty wedge of size 1-sum(data).
labels is a list of sequence of strings which sets the label of each wedge.
color attribute is used to provide color to the wedges.
autopct is a string used to label the wedge with their numerical value.
shadow is used to create shadow of wedge.
Below are the advantages of a pie chart
Easier visual summarization of large data points
Effect and size of different classes can be easily understood
Percentage points are used to represent the classes in the data points
Python3
# import required module
# Creating dataset
# Creating plot
plt.pie(data, labels=cars)
# Show plot
plt.show()
import numpy as np
# Creating dataset
# Wedge properties
# Creating plot
shadow=True, colors=colors,
startangle=90, wedgeprops=wp,
textprops=dict(color="magenta"))
# Adding legend
# Show plot
plt.show()
Output
Error Bars
Error bars could be defined as a line through a point on a graph, parallel to one of the
axes, which represents the uncertainty or error of the corresponding coordinate of the
point. These types of plots are very handy to understand and analyze the deviations
from the target. Once errors are identified, it could easily lead to deeper analysis of the
factors causing them.
Deviation of data points from the threshold could be easily captured
Easily captures deviations from a larger set of data points
It defines the underlying data
Python3
import numpy as np
# Assign axes
x = np.linspace(0,5.5,10)
y = 10*np.exp(-x)
xerr = np.random.random_sample(10)
yerr = np.random.random_sample(10)
# Adjust plot
fig, ax = plt.subplots()
ax.errorbar(x, y, xerr=xerr, yerr=yerr, fmt='-o')
# Assign labels
ax.set_xlabel('x-axis'), ax.set_ylabel('y-axis')
plt.show()
So, if you were wondering as to why use Seaborn when you already have
Matplotlib, here is the answer to it.
“If Matplotlib “tries to make easy things easy and hard things possible”,
seaborn tries to make a well-defined set of hard things easy too” –
Michael Waskom (Creator of Seaborn).
Factually, Matplotlib is good but Seaborn is better. There are basically two
shortcomings of Matplotlib that Seaborn fixes:
Matplotlib can be personalized but it’s difficult to figure out what settings
are required to make plots more attractive. On the other hand, Seaborn
comes with numerous customized themes and high-level interfaces to
solve this issue.
When working with Pandas, Matplotlib doesn’t serve well when it comes
to dealing with DataFrames, while Seaborn functions actually work on
DataFrames.
relplot():
This is a figure-level-function that makes use of two other axes functions
for Visualizing Statistical Relationships which are:
import numpy as np
import pandas as pd
sns.set(style="darkgrid")
Please note that the style attribute is also customizable and can take any
value such as darkgrid, ticks, etc which I will discuss later in the plot-
aesthetics section. Let’s now take a look at a small example:
EXAMPLE:
f = sns.load_dataset("flights")
OUTPUT:
As you can see, the points are plotted in 2-dimensions. However, you can
add another dimension using the ‘hue’ semantic. Let’s take a look at an
example of the same:
EXAMPLE:
f = sns.load_dataset("flights")
OUTPUT:
However, there are many more customizations that you can try out such
as colors, styles, size, etc. Let me just show how you can change the
color in the following example:
EXAMPLE:
sns.set(style="darkgrid")
f = sns.load_dataset("flights")
OUTPUT:
lineplot():
This function will allow you to draw a continuous line for your data. You
can use this function by changing the ‘kind’ parameter as follows:
EXAMPLE:
a=pd.DataFrame({'Day':[1,2,3,4,5,6,7],'Grocery':[30,80,45,23,51,46
,76],'Clothes':[13,40,34,23,54,67,98],'Utensils':[12,32,27,56,87,5
4,34]},index=[1,2,3,4,5,6,7])
OUTPUT:
This approach comes into the picture when our main variable is further
divided into discrete groups (categorical). This can be achieved using the
catplot() function.
catplot():
This is a figure-level-function like relplot(). It can be characterized by three
families of axes level functions namely:
Scatterplots – These include stripplot(), swarmplot()
EXAMPLE:
sns.set(style="ticks", color_codes=True)
a = sns.load_dataset("tips")
OUTPUT:
As you can see, in the above example I have not set the ‘kind’ parameter.
Therefore it has returned the graph as the default scatterplot. You can
specify any of the axes level function to change the graph as need be.
Let’s take an example of this as well:
EXAMPLE:
sns.set(style="ticks", color_codes=True)
a = sns.load_dataset("tips")
OUTPUT:
The above output shows the violinplot for the tips dataset. Now let us try
to find how to visualize the distribution of a dataset.
import numpy as np
import pandas as pd
sns.set(color_codes=True)
Once this is done, you can continue plotting univariate and bivariate
distributions.
EXAMPLE:
a = np.random.normal(loc=5,size=100,scale=2)
sns.distplot(a);
OUTPUT:
As you can see in the above example, we have plotted a graph for the
variable a whose values are generated by the normal() function using
distplot.
EXAMPLE:
x=pd.DataFrame({'Day':[1,2,3,4,5,6,7],'Grocery':[30,80,45,23,51,46
,76],'Clothes':[13,40,34,23,54,67,98],'Utensils':[12,32,27,56,87,5
4,34]},index=[1,2,3,4,5,6,7])
y=pd.DataFrame({'Day':[8,9,10,11,12,13,14],'Grocery':[30,80,45,23,
51,46,76],'Clothes':[13,40,34,23,54,67,98],'Utensils':[12,32,27,56
,87,54,34]},index=[8,9,10,11,12,13,14])
with sns.axes_style("white"):
OUTPUT:
Now that you have understood the various functions in Python Seaborn,
let’s move on to build structured multi-plot grids.