Descriptive Statistics Fundamentals 1

Descriptive statistics:

In statistical analysis, three fundamental concepts are used to describe data: location (central tendency), dispersion (spread), and shape (distribution). A raw dataset is difficult to describe directly; descriptive statistics summarize it in a much simpler way through:

The measure of central tendency (Mean, Median, Mode)

Measure of spread (Range, Quartiles, Percentiles, absolute deviation, variance and standard deviation)

Measure of symmetry (Skewness)

Measure of Peakedness (Kurtosis)

Let’s go through these one by one using Python.

Code Implementation: Basic Statistics In Python

import pandas as pd
import matplotlib.pyplot as plt

# Load the first sheet of the workbook into a DataFrame
dataset = pd.read_excel('3. Descriptive Statistics.xlsx', sheet_name=0)

Measures of central tendency:


The goal of central tendency is to come up with a single value that best describes a distribution of scores. Three basic measures are used: the mean (the average value), the median (the middle value), and the mode (the most frequent value).

Let’s calculate each of these for the dataset loaded above.

Mean:
The arithmetic mean of a set of data is the average score or value, computed by adding all the scores and dividing by the number of scores.

It uses information from every single score.
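As a minimal sketch of this definition, the snippet below computes the mean by hand (sum divided by count) and compares it with the pandas result. The salary values are hypothetical and are not taken from the dataset used in this article.

```python
import pandas as pd

# Hypothetical salaries, made up for illustration only.
salaries = pd.Series([21000, 24000, 27000, 30000, 48000])

# Mean by definition: add all scores, divide by the number of scores.
manual_mean = salaries.sum() / len(salaries)

# The same quantity using the built-in pandas method.
pandas_mean = salaries.mean()

print(manual_mean, pandas_mean)
```

Because every score enters the sum, a single very large or very small value shifts the mean.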

Here we use the pandas library to calculate most of our statistical parameters, so we don’t need to write code from scratch; it takes only a few lines of code:

dataset[['CurrentSalary', 'After6Months', 'SalBegin']].mean()

This returns, for the 475 employees, the average salary at the beginning, after six months, and currently. There are other types of mean, such as the weighted mean and the trimmed mean, but the arithmetic mean is the most commonly used.

Median:
Whenever we need to find a middle value, we use the median. To calculate it, we first arrange the values in ascending order. The median also attempts to define a typical value for the dataset, but unlike the mean it depends on the position of the values rather than on their sum, so it needs little or no arithmetic. Two cases need care when calculating it:

If there is an odd number of observations in your dataset, the median is simply the middle value of the column after sorting it in ascending order.

If there is an even number of observations, the median is the average of the two middle values.
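The two cases above can be sketched as a small function. The helper name `manual_median` and the sample lists are hypothetical, introduced only to illustrate the rule; the result is compared with the pandas `median()` method.

```python
import pandas as pd

def manual_median(values):
    """Median by the odd/even rule (hypothetical helper for illustration)."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [7, 1, 5]        # made-up data, 3 observations
even_sample = [7, 1, 5, 3]    # made-up data, 4 observations

print(manual_median(odd_sample), pd.Series(odd_sample).median())
print(manual_median(even_sample), pd.Series(even_sample).median())
```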

Because we are using the pandas library for the calculation, these cases are handled automatically; still, we should understand the methodology behind them.

dataset[['CurrentSalary', 'After6Months', 'SalBegin']].median()

The above values indicate that at least half of the employees have a current salary of 28875 or less; the other two columns are interpreted in the same way.

Mode:
The mode is the value that appears most frequently in the dataset. The intuition behind the mode is not as immediate as for the mean or median, but there is a clear rationale. The mode is usually calculated for categorical variables. We can calculate it by simply calling .mode() on a pandas DataFrame object. Below is another way of calculating the mode.

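from collections import Counter

# Count how often each value occurs in the 'Job Time' column
job_time = dataset['Job Time'].values
counts = Counter(job_time)

# Keep every value whose count equals the highest count (this handles ties)
highest = max(counts.values())
mode = [k for k, v in counts.items() if v == highest]
mode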

The above code returns two mode values, 93 and 81, which may look confusing at first. This happens because there is a tie: 93 and 81 occur the same number of times.
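The same tie behaviour can be seen with the pandas `.mode()` method, which returns every tied value rather than just one. The sample below is made up so that 93 and 81 tie; it is not the article's actual 'Job Time' column.

```python
import pandas as pd

# Hypothetical sample constructed so that 93 and 81 each occur twice.
job_time = pd.Series([93, 81, 93, 81, 64, 70])

# Series.mode() returns a Series containing all of the most frequent values.
modes = job_time.mode()

print(modes.tolist())
```

A dataset with more than one mode is often described as bimodal or multimodal.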

These are the main measures of central tendency.
