
Python as a Statistical Package

Tate Knight
12/4/16
COMP112
Section 5
Abstract
For this project, I used Python to create a small statistical package. My goals for this
program were threefold: (i) to create a visual representation of data, (ii) to return basic
descriptive values for the dataset, and (iii) to calculate the probability of a certain observation
belonging to the dataset. The user inputs a dataset, and the program returns the following: a
histogram of the dataset, a list of descriptive statistics of the dataset, and an opportunity to
calculate the probability that an observation belongs to the dataset. Though these techniques are
only a fraction of the full range of statistical capabilities and analyses, they are nonetheless
important for understanding the basic structure of a dataset.

Introduction
The analysis of data is a common practice in all disciplines. Whether calculating grades
for a test or analyzing cell samples, statistics are a powerful tool for understanding the driving
forces behind data. Understanding the structure of a sample dataset allows for inferences about
the population it describes. Statistics, at its core, is a method of predicting population-scale
trends. The two main values that account for the structure of a dataset are the mean and standard
deviation. The mean describes the most likely observation within the dataset, while the standard
deviation describes the spread of the data. Many sample datasets are approximately normally
distributed; that is, the frequency of observations is greatest near the mean of the data and
lowest towards the tails. Normally distributed data thus resembles a bell-shaped curve, or more
formally, a Gaussian distribution. The Gaussian distribution defines a probability density
function based on the mean and standard deviation of a dataset (Figure 1). The "mean" value of
the function determines the center of the distribution, while the "standard deviation" value
determines its width. The probability of belonging to a dataset is derived from the Gaussian
distribution; for example, about 95.4% of observations in a normally distributed dataset fall
within two standard deviations of the mean. This relationship between units of standard
deviation and probability is later discussed as z-scores in the methods section.
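The two-standard-deviation figure quoted above can be checked directly from the Gaussian density. A minimal sketch (the function names here are mine, not part of the project's code):

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def within_k_sd(k):
    """Fraction of a normally distributed population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

print(round(within_k_sd(2) * 100, 1))  # 95.4
```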
First, it is important to visualize a dataset before diving into the statistics. A simple
histogram of a dataset describes its general distribution, and reveals any outliers or anomalies
that may need addressing. Second, and most importantly, a user must know the values which
describe the data. These values include, for example, the number of observations, the mean, and
the standard deviation. There are other descriptive values as well, but the majority of statistical
analyses are derived from these three important values. Third, it is helpful to know the likelihood
of an observation falling within your dataset. This technique uses the z-scores of
observations, explained in the methods section. For example, if only 10 out of 20 people show up
to my club meeting for two weeks in a row, what is the likelihood that 15 people will show up on
the third week? Probability is a powerful tool for hypothesizing about future events. Using
Python to create a statistical package yields an efficient, simple approach to basic statistics;
this program would be most helpful for students in an introductory statistics course, where basic
descriptive values are frequently calculated.

Methods: Data Entry


The preliminary step before plotting is to have the user input data. Using a
try/except structure, the program greets the user with a prompt to enter data (Figure 2). If the
data is entered incorrectly (e.g., it includes letters or is not separated by commas), the program
tells the user that the format is not readable and asks for the data again. Once the user enters
valid data, it is assigned to the variable "x". To save time while writing the code for the whole
program, I created a stand-in dataset using numpy.random.normal(10,3,100). This was helpful
when writing the code for the descriptive statistics, since I knew whether the output values
(mean = 10, sd = 3, obs = 100) were approximately correct. Once I finished, I replaced the
placeholder data with the variable assigned to the user input.
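The report's actual entry code is shown in Figure 2; a minimal sketch of the kind of try/except loop described above (the function names here are hypothetical) might look like:

```python
def parse_dataset(raw):
    """Turn a comma-separated string into a list of floats; raises ValueError on bad input."""
    return [float(v) for v in raw.split(",")]

def read_dataset():
    """Prompt the user until the entered data is readable."""
    while True:
        raw = input("Enter your data, separated by commas: ")
        try:
            return parse_dataset(raw)
        except ValueError:
            print("That format is not readable. Please enter the data again.")
```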

Methods: Plotting
I first import numpy and pylab in the program. This allows Python to open Python
Launcher and use its plotting capabilities. I bin the data relative to the number of observations
(number of bins = observations / 3), and plot the data as a histogram. If the bins are too wide,
the plot shows no structure; if they are too narrow, the observation frequencies in each bin are
difficult to differentiate from one another. The x-axis is the binned observation value, and the
y-axis is the frequency of observations in each bin. Although it is not especially complicated, it
is important for a user to visualize the data before proceeding with analysis.
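A sketch of this plotting step, using the stand-in dataset from the data-entry section (written here with matplotlib.pyplot, the modern equivalent of pylab; the axis labels are my own):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(10, 3, 100)   # stand-in dataset, as in the Methods
bins = max(1, len(x) // 3)         # the binning rule: one bin per three observations
plt.hist(x, bins=bins)
plt.xlabel("Observation value")
plt.ylabel("Frequency")
plt.savefig("histogram.png")       # or plt.show() in an interactive session
```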

Methods: Descriptive Statistics


Descriptive statistics are values which describe the structure of a dataset. The values I
have used most often in my studies are: number of observations, mean, average deviation,
variance, standard deviation, coefficient of variation, and standard error of the mean. To
return these values to the user, I first created separate functions that operate on the input
dataset list "x". For example, the function for the mean was sum(x)/obs; that is, the sum of all
observations in the dataset divided by the number of observations (Figure 3). Variance and
standard deviation were a bit trickier: I had to replicate and modify "x" to create a list of the
deviations of each value from the mean of "x", and then use that deviation list to calculate the
variance and standard deviation. Once all of the individual functions were written, I created a
new function which gathers the descriptive statistic values and prints each one in the format
"DESCRIPTIVE STATISTIC = VALUE". I rounded each value to two decimal places to avoid
clutter; I rarely need more than two decimal places in my work.
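A sketch of the individual statistic functions and the printing function described above (the report's actual code is in Figure 3; I use the sample n − 1 form for variance here, though the report does not say which convention was used):

```python
import math

def mean(x):
    return sum(x) / len(x)

def variance(x):
    m = mean(x)
    # sample variance: squared deviations from the mean, divided by n - 1
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

def std_dev(x):
    return math.sqrt(variance(x))

def std_error(x):
    # standard error of the mean
    return std_dev(x) / math.sqrt(len(x))

def basic_stats(x):
    """Print each descriptive statistic, rounded to two decimal places."""
    print(f"OBSERVATIONS = {len(x)}")
    print(f"MEAN = {round(mean(x), 2)}")
    print(f"VARIANCE = {round(variance(x), 2)}")
    print(f"STANDARD DEVIATION = {round(std_dev(x), 2)}")
    print(f"STANDARD ERROR = {round(std_error(x), 2)}")
```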

Methods: Probability of an Observation


Each observation in a sample has a unique “z-score”; that is, the number of standard
deviations an observation lies from the mean. Take a dataset with mean = 10 and sd = 3: a value
of 10 has a z-score of zero, a value of 16 has a z-score of 2, and so on. This also means that
observations with lower z-scores, i.e. values close to the mean, are more probable than
observations with larger z-scores, i.e. values far from the mean. Thus, a dataset can be translated
into a list of z-scores, which are then normally distributed around a mean z-score of 0 (Figure 4).
Each z-score has a probability attached to it. For example, an observation with a z-score of 0
belongs to the distribution with a probability of 100% because it describes the mean of the data.
An observation with a z-score of 2 has a lower probability of belonging to the distribution
because it falls further from the mean. Because each z-score corresponds to a probability, I
created a dictionary containing z-scores from 0 to 4 in 0.01 step increments. Using a downloaded
Excel table that contains these correspondences, I manipulated the table into a dictionary
structure that Python could read. This dictionary was assigned the variable “z” (Figure 5).
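The same kind of dictionary can also be generated from the standard normal distribution itself. A sketch using math.erf as a substitute for the report's Excel-derived table, mapping each rounded z-score to the two-tailed probability of lying at least that far from the mean:

```python
import math

# z-scores from 0.00 to 4.00 in 0.01 steps, each mapped to the probability of an
# observation lying at least that many standard deviations from the mean
z = {round(k * 0.01, 2): 1 - math.erf(k * 0.01 / math.sqrt(2))
     for k in range(0, 401)}

print(z[0.0])            # 1.0: an observation exactly at the mean
print(round(z[2.0], 3))  # about 0.046: two standard deviations out
```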
Now, my task was to create a function which takes a value and returns the probability
that it belongs to the dataset that the user initially entered. First, the function takes the entered
value, and calculates its z-score based on the mean and standard deviation of the dataset (z-score
= [value – mean] / sd). Second, using a manipulation of the “z” dictionary, the function
calculates the probability that the value belongs to the dataset based on the z-score. Third, the
function prints the probability for the user to see. If a calculated z-score is not contained in the
“z” dictionary (that is, if the z-score exceeds 4), then the function will let the user know that
there is a 0% chance of that observation belonging to the dataset. In order to accommodate this
anomaly, I used a try/except structure for the function based on “KeyError”.
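A sketch of such a function. The signature is my own (the report's prob() presumably closes over the user's dataset), and the table is rebuilt here with math.erf so the example is self-contained; z-scores beyond the table fall to the KeyError branch and report 0%:

```python
import math

# stand-in for the report's "z" dictionary (z-score -> two-tailed probability)
z = {round(k * 0.01, 2): 1 - math.erf(k * 0.01 / math.sqrt(2))
     for k in range(0, 401)}

def prob(value, data_mean, data_sd, z_table=z):
    """Probability that `value` belongs to a dataset with the given mean and sd."""
    z_score = round(abs(value - data_mean) / data_sd, 2)
    try:
        p = z_table[z_score]
    except KeyError:
        # z-scores beyond 4 are not in the table: effectively a 0% chance
        p = 0.0
    print(f"Probability = {round(p * 100, 2)}%")
    return p
```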

Results: Data Entry


Verifying the user input was straightforward. The program simply will not run unless the
user inputs a recognizable data structure (floats or integers separated by commas) (Figure 6).
Large datasets are time-consuming to type, so in the future, I would like to adjust the program to
read a list of data from an Excel file. This would let the user supply large datasets without the
hassle of typing them in.

Results: Plotting
As expected, Python Launcher is opened, and displays a histogram of the data (Figure 7).
The plot is nothing particularly exciting; perhaps allowing the user to choose the number of bins
or axis titles would be helpful.

Results: Descriptive Statistics


The program prints each descriptive statistic for the user to view and record by calling
upon the basic_stats function (Figure 8). However, the function will not be called until the plot
window is closed by the user. For my uses, I rarely need more than two decimal places, so I
rounded each value accordingly. These descriptive stats are the main building blocks for
confidence limits, t-tests, ANOVAs, etc.

Results: Probability of an Observation


The program prints an invitation for the user to call the prob() function to calculate the
probability that a value belongs to their dataset. The prob() function then prints the probability
for the user to view and record (Figure 9). The z-table has many uses beyond finding the
probability that an observation belongs to a dataset. For example, with proper manipulation of
values, a user may calculate the limits in which 50% of their data lie. However, my program
offers just a taste of the possibilities of statistics and probability.

Discussion
This was a very fulfilling project; it was great to create a program to do basic calculations
that regularly consume my time when analyzing data. This project contains a range of techniques
we learned in class. The program takes user input (a dataset), and returns a plot of the data, a list
of descriptive statistics, and the probability of a value belonging to the dataset. This program
relies on a comprehensible data structure; I used a try/except block to ensure that the data was in
the correct format in order to be manipulated later on. I plotted the data, a technique we used in
one of our labs. I called upon and manipulated a list (the dataset) to calculate the descriptive
statistics. I created a dictionary for the z-values to calculate probabilities of certain observations.
Finally, I used a series of functions and return and print statements to create straightforward
results for the user. The “z” dictionary was the most difficult part. Instead of creating it myself, I
could have downloaded and imported a module to execute the probability calculations. For
example, instead of calling on my dictionary “z”, I would have called the module with the
appropriate values and gathered the result. However, I wanted to make this program
accessible without the need for many external downloads, so I opted out of that method, and
instead embedded the dictionary into the code. Something I wish I had more time to complete
was the creation of t-tests. This would have required the input of two datasets and a table of
critical values that are compared against the variation in the datasets. However, my program
gives the basic values needed to calculate t-tests separately. I enjoyed this project because there
is little uncertainty to statistics. I had no need for if/else statements, because the math was pure
and straightforward, and there is little choice behind statistics. I knew what needed to be done
with the numbers, and I created a program to do just that.

Conclusion
My project did not create any significant advancements in the field of statistics. Instead,
my project offers an alternate and simple approach to dealing with datasets, which may be
preferable for certain users. However, my project is a glimpse at what Python is capable of; there
is much more to statistics that could be addressed with proper code. For example, there is a
range of tests you can execute on either one dataset or a group of datasets: t-tests, F-tests,
ANOVA, etc. There is software (e.g., MATLAB, SAS) that manipulates and analyzes data much
more comprehensively than my program does; however, it was enlightening to discover what
types of code structures are needed to create such powerful programs. I certainly enjoyed
creating this program, and will expand and refine it for a more tailored use in my classes and
work.
Figure 1: Sample plots of Gaussian distributions, according to mean and sd values

Figure 2: Code for verifying user input

Figure 3: A sample of coding the descriptive statistics


Figure 4: Plot of Gaussian distribution in standard deviation units

Figure 5: Dictionary “z” that contains z-score to probability relationships

Figure 6: Verifying user input

Figure 7: Histogram of the dataset

Figure 8: Displaying the descriptive statistics of the dataset

Figure 9: Probability of an observation belonging to the distribution
