
Published on Explorable.com (https://explorable.com)

Statistics Beginners Guide

Table of Contents
1 Statistics Tutorial
2 Why Statistics Matter?
2.1 What is Statistics?
2.2 Learn Statistics

3 Probability and Statistics


4 Branches of Statistics
5 Descriptive Statistics
6 Parameters and Statistics
7 Statistical Data Sets
7.1 Statistical Treatment Of Data
7.2 Raw Data Processing
7.3 Statistical Outliers
7.4 Data Output

8 Statistical Analysis
9 Measurement Scales
10 Variables and Statistics
11 Discrete Variables

Copyright Notice

Copyright © Explorable.com 2014. All rights reserved, including the right of reproduction in
whole or in part in any form. No parts of this book may be reproduced in any form without
written permission of the copyright owner.

Notice of Liability
The author(s) and publisher both used their best efforts in preparing this book and the
instructions contained herein. However, the author(s) and the publisher make no warranties of
any kind, either expressed or implied, with regard to the information contained in this book,
and specifically disclaim, without limitation, any implied warranties of merchantability and
fitness for any particular purpose.

In no event shall the author(s) or the publisher be responsible or liable for any loss of profits or
other commercial or personal damages, including but not limited to special, incidental,
consequential, or any other damages, in connection with or arising out of the furnishing,
performance or use of this book.

Trademarks
Throughout this book, trademarks may be used. Rather than put a trademark symbol in every
occurrence of a trademarked name, we state that we are using the names in an editorial
fashion only and to the benefit of the trademark owner with no intention of infringement of the
trademarks. Thus, copyrights on individual photographic, trademark and clip art images
reproduced in this book are retained by their respective owners.

Information
Published by Explorable.com.

Cover design by Explorable / Gilli.me.

1 Statistics Tutorial

This statistics tutorial is a guide to help you understand key concepts of statistics and
how these concepts relate to the scientific method and research.

Scientists frequently use statistics to analyze their results. Why do researchers use statistics?
Statistics can help us understand a phenomenon by confirming or rejecting a hypothesis. It is
vital to how we acquire knowledge and underpins most scientific theories.

You don't need to be a scientist, though; anyone who wants to learn how researchers use
statistics may want to read this statistics tutorial for the scientific method.

What is Statistics?

Research Data
This section of the statistics tutorial is about understanding how data is acquired and used.

The results of a scientific investigation often contain much more data or information than the
researcher needs. This data-material, or information, is called raw data.

To be able to analyze the data sensibly, the raw data is processed into "output data". There
are many methods to process the data, but basically the scientist organizes and summarizes
the raw data into a more sensible chunk of data. Any type of organized information may be
called a "data set".

Then, researchers may apply different statistical methods to analyze and understand the data
better (and more accurately). Depending on the research, the scientist may also want to use
statistics descriptively or for exploratory research.

What is great about raw data is that you can go back and check things if you suspect
something different is going on than you originally thought. This happens after you have
analyzed the meaning of the results.

The raw data can give you ideas for new hypotheses, since you get a better view of what is
going on. You can also control the variables which might influence the conclusion (e.g.
third variables).

Central Tendency and Normal Distribution


This part of the statistics tutorial will help you understand distribution, central tendency and
how it relates to data sets.

Much data from the real world is normally distributed, that is, it follows a frequency curve, or
frequency distribution, with the most frequent values near the middle. Many experiments rely on
assumptions of a normal distribution. This is a reason why researchers very often measure
the central tendency in statistical research, such as the mean (arithmetic mean or geometric
mean), median or mode.

The central tendency may give a fairly good idea about the nature of the data (mean, median
and mode show the "middle value"), especially when combined with measurements of how
the data is distributed. Scientists normally calculate the standard deviation to measure how
the data is distributed.

But there are various methods to measure how data is distributed: variance, standard
deviation, standard error of the mean, standard error of the estimate or "range" (which states
the extremities in the data).
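
As a rough illustration, here is how these measures of central tendency and spread can be computed with Python's built-in statistics module; the sample values below are invented for demonstration.

```python
import statistics

sample = [4.1, 4.8, 5.0, 5.0, 5.3, 5.9, 6.2]  # invented measurements

mean = statistics.mean(sample)             # arithmetic mean
median = statistics.median(sample)         # middle value
mode = statistics.mode(sample)             # most frequent value
variance = statistics.variance(sample)     # sample variance
stdev = statistics.stdev(sample)           # sample standard deviation
sem = stdev / len(sample) ** 0.5           # standard error of the mean
data_range = max(sample) - min(sample)     # range: the extremities of the data

print(mean, median, mode, variance, stdev, sem, data_range)
```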

To create the graph of the normal distribution for something, you'll normally use the arithmetic
mean of a "big enough sample" and you will have to calculate the standard deviation.

However, the sampling distribution will not be normally distributed if the distribution is skewed
(naturally) or has outliers (often rare outcomes or measurement errors) messing up the data.
One example of a distribution which is not normally distributed is the F-distribution, which is
skewed to the right.

So, researchers often double-check that their results are normally distributed, using the range,
median and mode. If the distribution is not normal, this will influence which
statistical test/method to choose for the analysis.

Other Tools

Quartile
Trimean

Hypothesis Testing - Statistics Tutorial
How do we know whether a hypothesis is correct or not?

Why use statistics to determine this?

Using statistics in research involves much more than making use of statistical formulas or
getting to know statistical software.

Making use of statistics in research basically involves

1. Learning basic statistics
2. Understanding the relationship between probability and statistics
3. Comprehension of the two major branches of statistics: descriptive statistics and
inferential statistics.
4. Knowledge of how statistics relates to the scientific method.

Statistics in research is not just about formulas and calculation. (Many wrong conclusions
have been drawn from not understanding basic statistical concepts.)

Statistical inference helps us draw conclusions from samples of a population.

When conducting experiments, a critical part is to test hypotheses against each other. Thus, it
is an important part of the statistics tutorial for the scientific method.

Hypothesis testing is conducted by formulating an alternative hypothesis which is tested
against the null hypothesis, the common view. The hypotheses are tested statistically against
each other.

The researcher can work out a confidence interval, which defines the limits within which a
result is regarded as supporting the null hypothesis and beyond which the alternative research
hypothesis is supported.

This means that not all differences between the experimental group and the control group can
be accepted as supporting the alternative hypothesis - the result needs to be statistically
significant for the researcher to accept the alternative hypothesis. This is done using a
significance test (another article).

Caution, though: data dredging, data snooping or fishing for data without later testing your
hypothesis in a controlled experiment may lead you to conclude that there is cause and effect
even though no real relationship exists.

Depending on the hypothesis, you will have to choose between one-tailed and two-tailed tests.

Sometimes the control group is replaced with experimental probability - often if the research
treats a phenomenon which is ethically problematic, economically too costly or overly time-
consuming, then the true experimental design is replaced by a quasi-experimental approach.

There is often a publication bias in favor of research that finds the alternative hypothesis
correct, rather than research with a "null result" concluding that the null hypothesis provides
the best explanation.

If applied correctly, statistics can be used to understand cause and effect between research
variables.

It may also help identify third variables, although statistics can also be used to manipulate and
cover up third variables if the person presenting the numbers does not have honest intentions
(or sufficient knowledge).

Misuse of statistics is a common phenomenon, and will probably continue as long as people
have an interest in influencing others. Proper statistical treatment of experimental
data can thus help avoid unethical use of statistics. The philosophy of statistics involves justifying
proper use of statistics, ensuring statistical validity and establishing the ethics of statistics.

Here is another great statistics tutorial which integrates statistics and the scientific method.

Reliability and Experimental Error


Statistical tests make use of data from samples. These results are then generalized to the
general population. How can we know that they reflect the correct conclusion?

Contrary to what some might believe, errors in research are an essential part of significance
testing. Ironically, the possibility of a research error is what makes the research scientific in
the first place. If a hypothesis cannot be falsified (e.g. the hypothesis has circular logic), it is
not testable, and thus not scientific, by definition.

If a hypothesis is testable, it must be open to the possibility of going wrong. Statistically, this
opens up the possibility of experimental errors in your results due to random errors or other
problems with the research. Experimental errors may also be broken down into Type-I errors
and Type-II errors. ROC curves plot the trade-off between true positives and
false positives.

A power analysis of a statistical test can determine how many samples a test will need to
have acceptable power to reject a false null hypothesis.

The margin of error is related to the confidence interval and to the relationship between
statistical significance, sample size and expected results. The effect size estimates the strength
of the relationship between two variables in a population. It may help determine the sample
size needed to generalize the results to the whole population.
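
As a hedged sketch of what such a power analysis might look like in practice, the following uses TTestIndPower from statsmodels; the effect size, alpha and power values are illustrative assumptions, not figures from the text.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # assumed medium effect (Cohen's d)
    alpha=0.05,       # significance level
    power=0.8,        # desired probability of detecting a true effect
)
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 64
```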

Replicating the research of others is also essential for understanding whether the results of the
research can be generalized or were just due to a random "outlier experiment".
Replication can help identify both random errors and systematic errors (test validity).

Cronbach's Alpha is used to measure the internal consistency or reliability of a test score.

Replicating the experiment/research ensures the reliability of the results statistically.

What you often see if the results contain outliers is a regression toward the mean, which can
then make the result no longer statistically different between the experimental and control groups.

Statistical Tests
Here we will introduce a few commonly used statistical tests/methods, often used by
researchers.

Relationship Between Variables

The relationship between variables is very important to scientists. It helps them to
understand the nature of what they are studying. A linear relationship is when two variables
vary proportionally, that is, if one variable goes up, the other variable will also go up in the
same ratio. A non-linear relationship is when variables do not vary proportionally. Correlation
is a way to express the relationship between two data sets or between two variables.

Measurement scales are used to classify, categorize and (if applicable) quantify variables.

The Pearson correlation coefficient (or Pearson product-moment correlation) expresses only the
linear relationship between two variables. Spearman's rho is mostly used for monotonic
relationships, when dealing with ordinal variables. Kendall's tau (τ) coefficient is another
rank-based measure that can be used for nonlinear (monotonic) relationships.

Partial Correlation (and Multiple Correlation) may be used when controlling for a third variable.
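
For illustration, the three coefficients mentioned above can be computed with scipy.stats; the paired data below is invented.

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 8, 6, 9]

pearson_r, _ = stats.pearsonr(x, y)       # linear relationship
spearman_rho, _ = stats.spearmanr(x, y)   # monotonic relationship, ordinal data
kendall_tau, _ = stats.kendalltau(x, y)   # rank-based association

print(pearson_r, spearman_rho, kendall_tau)
```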

Predictions

The goal of predictions is to understand causes. Correlation does not necessarily mean
causation. With linear regression, you often study the effect of a manipulated (independent)
variable on an outcome.

What is the difference between correlation and linear regression? Basically, a correlational
study looks at the strength of the relationship between the variables, whereas linear regression
is about finding the best-fit line in a graph.
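
A small sketch of this distinction with scipy.stats on invented data: pearsonr quantifies the strength of the association, while linregress fits the best-fit line.

```python
from scipy import stats

hours_studied = [1, 2, 3, 4, 5, 6]
test_score = [52, 55, 61, 64, 70, 73]  # invented scores

corr, _ = stats.pearsonr(hours_studied, test_score)  # strength of association
fit = stats.linregress(hours_studied, test_score)    # best-fit line

print(f"correlation: {corr:.3f}")
print(f"score = {fit.slope:.2f} * hours + {fit.intercept:.2f}")
```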

Regression analysis and other modeling tools

Linear Regression
Multiple Regression
A Path Analysis is an extension of the regression model.
A Factor Analysis attempts to uncover the underlying factors behind a set of observed variables.
A Meta-Analysis frequently makes use of effect sizes.

Bayesian Probability is a way of predicting the likelihood of future events in an iterative way,
updating the prediction as new evidence arrives, rather than starting to measure and only then
getting results/predictions.

Testing Hypotheses Statistically

Student's t-test is a test which can indicate whether the null hypothesis is correct or not. In
research it is often used to test differences between two groups (e.g. between a control group 
and an experimental group).

The t-test assumes that the data is more or less normally distributed and that the variance is
equal (this can be tested by the F-test).

Student's t-test:

Independent One-Sample T-Test
Independent Two-Sample T-Test
Dependent T-Test for Paired Samples

Wilcoxon Signed Rank Test may be used for non-parametric data.

A Z-Test is similar to a t-test, but will usually not be used on sample sizes below 30.

A Chi-Square test can be used if the data is qualitative rather than quantitative.
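
As a rough sketch of how such tests are run in practice, here is an independent two-sample t-test and a chi-square test using scipy.stats; all of the data below is invented.

```python
from scipy import stats

control = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8, 5.3, 5.1]
treatment = [5.6, 5.8, 5.5, 6.0, 5.7, 5.9, 5.4, 6.1]

# Independent two-sample t-test (assumes roughly normal data, equal variances)
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Chi-square test for qualitative (count) data: do observed category
# counts deviate from a uniform expectation?
observed = [18, 22, 20, 40]  # invented category counts
chi2, p_chi = stats.chisquare(observed)
print(f"chi2 = {chi2:.2f}, p = {p_chi:.4f}")
```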

Comparing More Than Two Groups

An ANOVA, or Analysis of Variance, is used to test whether the means of several groups differ,
by comparing the variability between groups with the variability within groups. Analysis of
Variance can be applied to more than two groups. The F-distribution is used to calculate
p-values for the ANOVA.
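
A minimal sketch of a one-way ANOVA with scipy.stats on three invented groups; f_oneway compares between-group to within-group variability and returns an F statistic and a p-value based on the F-distribution.

```python
from scipy import stats

group_a = [23, 25, 21, 24, 26]
group_b = [30, 28, 31, 29, 32]
group_c = [22, 24, 23, 25, 21]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```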

Analysis of Variance

One-way ANOVA
Two-way ANOVA
Factorial ANOVA
Repeated Measures ANOVA

Nonparametric Statistics

Some common methods using nonparametric statistics:

Cohen's Kappa
Mann-Whitney U-test
Spearman's Rank Correlation Coefficient

Other Important Terms in Statistics

Discrete Variables

How to cite this article: 

Oskar Blakstad (Feb 13, 2008). Statistics Tutorial. Retrieved from Explorable.com:  
https://explorable.com/statistics-tutorial

2 Why Statistics Matter?

The question of why statistics matters in science needs to be answered before its full
potential can be unleashed.

Scientific History
In the very early stages of the development of science, starting with Kepler, Galileo and Newton,
scientific theories were not statistical in nature but hard physical rules that had to be followed
by all objects. Gravity, for example, was not treated as a statistical phenomenon: everything in
the universe follows the laws of gravity.

However, even in those early times, there were scientific laws that were statistical in nature,
most notably the second law of thermodynamics. The question of why statistics was involved
in such a scientific law has puzzled scientists and philosophers alike.

For example, suppose I have a two-chambered piston separated by a wall, with nitrogen on one
side of the partition and oxygen on the other. You would know that if I remove the partition and
wait long enough, the composition will become homogeneous. You would not expect,
for example, the gases to remain as they are or to interchange their places. The reason is
that such a phenomenon would be statistically far too improbable.

Modern Standing: Social Sciences


In today’s scientific research, statistics is an integral part of physical sciences and social
sciences. The question of why statistics is used in social sciences is easy to answer – most
social phenomena are statistical in nature.

Thus, if a social scientist concludes that people in America spend more time on the
internet than people in China, it doesn't mean that every American spends more time
on the internet than every Chinese person. It simply means that, on average, an American
spends more time on the internet than a Chinese person.

Modern Standing: Physical Sciences


Statistics is also used in a number of areas of the physical sciences. In communication devices,
which are central to today's technological revolution, statistics is used to filter out noise and
enable better communication. The study of turbulent flow, such as in the wakes of airplanes,
involves the use of statistical methods because the flow is chaotic. Most quantum mechanical
phenomena are statistical by nature.

Students of science need to ask themselves why statistics is used in their disciplines and how
they can learn and benefit from it. Statistical analysis provides credibility to a theory and is
central to the general acceptance of most statements. Statistics also helps condense the data
and present it in a manner understandable by everybody. Statistics today is central to almost
all scientific disciplines.

How to cite this article: 

Siddharth Kalla (May 12, 2011). Why Statistics Matter?. Retrieved from Explorable.com:  
https://explorable.com/why-statistics-matter

2.1 What is Statistics?

"What is statistics?" This is a fair question to ask especially because it is so central to


most scientific disciplines today. Statistics is a collection of mathematical techniques
that help to analyze and present data. Statistics is also used in associated tasks such
as designing experiments and surveys and planning the collection and analysis of data
from these.

To understand what statistics is, it is important to look at the broad categories of problems that
are tackled with the help of statistics. It also helps to understand why statistics is central to
current scientific methodology.

Two Faced Coin: Same Word, Different Context


Statistics can also carry a second meaning: a quantity computed with the help of
statistical methods. Thus, it could be said that the main statistics of a particular study are the
median age and income of the group. Statistics (singular: statistic) can therefore refer to a
computed summary value as well. However, these two usages usually occur in distinct contexts,
and though there is scope for confusion, a careful study of the usage context should clear matters up.

Broad Usage, Same Concepts


Statistics can be applied to various different problems and situations but the underlying
concepts all remain the same. Thus it is important to understand what statistics is, not only
from an application point of view but also from an interpretation point of view. This is required
because of the diverse applications of statistics, from social science experiments to studying
quantum mechanical phenomena.

Statistics can be broadly classified into descriptive statistics and inferential statistics.

To understand statistics, one needs to study and understand probability theory. The two are
closely connected and inseparable in most cases. In fact, historically, the foundations of
statistics were laid with the development of probability theory.

Ability to Draw Conclusions


The ideas of presenting data and drawing relevant inferences are central to the successful
use of statistical theory. In the end, the statistical analysis should be able to tell us something
concrete about the sample that we are studying. A number of errors are possible in the
interpretation of statistical results and a careful analysis needs to be made to prevent these
errors.

In some rare cases, statistics can be used to draw conclusions that appear to be statistically
relevant but on careful examination, are not. When such practices are intentional, they can be
hard to detect. One good example of such statistical misconduct is data dredging. Therefore
one should also be able to spot the scope and relevance of a statistical study and understand
it in the context of the study within which it was intended.

How to cite this article: 

Siddharth Kalla (Feb 6, 2011). What is Statistics?. Retrieved from Explorable.com:
https://explorable.com/what-is-statistics

2.2 Learn Statistics

It is important to learn statistics because so many of the decisions that we make in
everyday life are based on statistics. People may not realize it, but statistics permeates
most of the decision-making we do each day. Everyone has an intuitive understanding
of the principles of statistics, but it greatly helps to understand the concepts formally.

A Practical Example
For example, imagine a scientific study that aims to find earth-like planets in the hope of finding
life elsewhere in the universe. This is a tricky study with various uncertain parameters. The best
that can be done is to do a thorough statistical analysis and see which planets have the
greatest potential.

A very crude analysis would take just the number of stars in the universe. However, not all
stars have planets around them. Thus a step further in the statistical analysis would be to
consider only those stars with planets orbiting them.

With the advent of more powerful telescopes and better resolution images of the universe, we
are able to spot the size and distance of the planets as well. Thus now scientists are
interested only in those planets that are approximately the size of the earth at approximately
the same distance from their stars as earth is from the sun. Still, there seem to be billions of
earth-like planets just in our galaxy.

The above example illustrates why it is important to learn statistics. Not everything is
intuitively obvious to us and a simple study of statistics will go a long way in helping you make
proper decisions about your life.

Making Sense of the World Around


It also helps to learn statistics because so much of what we talk about involves statistics,
even if we don't realize it. All polls contain a confidence interval, which we
loosely know to be a measure of how sure the pollster is about extrapolating the results of the
poll or survey. We hear weather forecasts like "there is a 25% probability that it will rain
today", and we understand what they mean and decide whether we need to carry our umbrella to
work or postpone the picnic that we planned for our family.

When you learn statistics, you will be able to understand the deeper meaning of these
concepts, which will again help you make more sense of the world that we live in. Almost all
the principles that we follow are based on data and statistics and it would greatly help to
understand the inner meaning behind the numbers.

Today, statistics is increasingly becoming important in a number of professions, and people
from all walks of life actively use statistics, from politicians and business leaders to engineers
and biologists.

How to cite this article: 

Siddharth Kalla (Nov 19, 2011). Learn Statistics. Retrieved from Explorable.com:  
https://explorable.com/learn-statistics

3 Probability and Statistics

Probability and statistics are closely related, and each depends on the other in a
number of different ways. The two have traditionally been studied together, and justifiably
so.

For example, consider a statistical experiment that studies how effective a drug is against a
particular pathogen. After the experiment has been performed and the results tabulated, what
then?

Surely, there should be something useful and tangible that comes out of the experiment. This
is usually in the form of probability. Assuming the sample size was large enough and
represented the entire population of applicability, the statistics should be able to predict what
the probability is of the drug being effective against a pathogen if a person takes it. Thus the
experimenter should be able to tell a patient - “If you take this drug, the probability that you will
be cured is x%”. This shows the interrelation between probability and statistics.

A lot of statistical analysis and experimental results depend on probability distributions that
are either inherently assumed or found through the experiment. For example, in many social
science experiments and indeed many experiments in general, we assume a normal
distribution for the sample and population. The normal distribution is nothing but a probability
distribution.

Thus the relationship between probability and statistics cuts both ways - statistical analysis
makes use of probability and probability calculation makes use of statistical analysis.

In general, we are interested in knowing the chance of an event occurring. For example,
what are the chances that it will rain today? The answer is quite complex and involves a lot of
calculations, experiments and observations. After all the analysis, the answer can still be
only a probability because the event is so complex that despite the best tools available to us,
it is next to impossible to predict it with certainty. Thus one can take data from satellites, data
from measuring instruments, etc. to arrive at a probability of whether it will rain today.

However, the same problem can also be approached in a different manner. Someone might
look at past data and surrounding conditions. If it hasn't rained for many days, and the
temperatures have been consistently higher while humidity has been consistently lower, he
might conclude that the probability of rain today is low.

These two methods can go hand in hand, and usually do. Most predictions are based not just
on the bare facts but also on past trends. This is why, in sports, analysts look at past records to
see how well a team played against another in the past, in addition to looking at the
individual players and their records. A lot of predictions, therefore, involve statistics. Probability
and statistics are therefore intertwined, and much of the analysis and prediction that we see
daily involves both of them.

How to cite this article: 

Siddharth Kalla (Sep 10, 2010). Probability and Statistics. Retrieved from Explorable.com:  
https://explorable.com/probability-and-statistics

4 Branches of Statistics
Descriptive Statistics and Inferential Statistics

Every student of statistics should know about the different branches of statistics to
correctly understand statistics from a more holistic point of view. Often, the kind of job
or work one is involved in hides the other aspects of statistics, but it is very important
to know the overall idea behind statistical analysis to fully appreciate its importance
and beauty.

The two main branches of statistics are descriptive statistics and inferential statistics. Both of
these are employed in scientific analysis of data and both are equally important for the student
of statistics.

Descriptive Statistics
Descriptive statistics deals with the presentation and collection of data. This is usually the first
part of a statistical analysis. It is usually not as simple as it sounds, and the statistician needs
to be aware of designing experiments, choosing the right focus group and avoiding biases that
can so easily creep into the experiment.

Different areas of study require different kinds of analysis using descriptive statistics. For
example, a physicist studying turbulence in the laboratory needs the average quantities that
vary over small intervals of time. The nature of this problem requires that physical quantities
be averaged from a host of data collected through the experiment.

Inferential Statistics
Inferential statistics, as the name suggests, involves drawing the right conclusions from the
statistical analysis that has been performed using descriptive statistics. In the end, it is the
inferences that make studies important and this aspect is dealt with in inferential statistics.

Most predictions of the future and generalizations about a population by studying a smaller
sample come under the purview of inferential statistics. Most social science experiments deal
with studying a small sample population that helps determine how the population in general
behaves. By designing the right experiment, the researcher is able to draw conclusions
relevant to his study.

While drawing conclusions, one needs to be very careful so as not to draw the wrong or biased
conclusions. Even though this appears like a science, there are ways in which one can
manipulate studies and results through various means. For example, data dredging is
increasingly becoming a problem as computers hold loads of information and it is easy, either
intentionally or unintentionally, to use the wrong inferential methods.

Both descriptive and inferential statistics go hand in hand and one cannot exist without the
other. Good scientific methodology needs to be followed in both these steps of statistical
analysis and both these branches of statistics are equally important for a researcher.

How to cite this article: 

Siddharth Kalla (Aug 6, 2011). Branches of Statistics. Retrieved from Explorable.com:
https://explorable.com/branches-of-statistics

5 Descriptive Statistics

Descriptive statistics implies a simple quantitative summary of a data set that has been
collected. It helps us understand the experiment or data set in detail and tells us all
about the required details that help put the data in perspective.

In descriptive statistics, we simply state what the data shows and tells us. Interpreting the
results and trends beyond this involves inferential statistics, which is a separate branch altogether.

For example, if an experiment is conducted to understand the effect of news stories on a
person's risk-taking behavior, the experimenter might start by making one control group read
stories involving huge risks and great payoffs while the other group reads success
stories about investing with minimal risk. Both groups might then be given virtual or real
money to invest in stocks, and the experimenter can note the differences in risk-taking
behavior, if any, between the two groups.

The data that is collected can be represented in several ways. For example, it might be seen
that the first group engaged in higher-risk behavior, and this might be quantified in some way.
The description of this behavior, its mean, the corresponding graphical representation of the
data, etc. all fall under the purview of descriptive statistics. Concluding from this data that
what one reads in the daily newspaper is likely to influence the kind of investment decisions
one makes in the future would come under inferential statistics.

Basic Descriptive Statistics


Many of the statistical averages and numbers we quote are in effect descriptive averages. For
example the Dow Jones Industrial Average tells us about the average performance of select
companies. The grade point average tells us about the average performance of a student at
university. The GDP growth rate tells us about the average performance of a country.

Thus descriptive statistics tries to capture a large set of observations and gives us some idea
about the data set. The measures of central tendency like mean, median, mode all come
under this category, as do data distributions like normal distribution and corresponding
standard deviations.

For example, in our previous example, the relevant data such as the sample size, the
demographics of the people involved in the study, their previous financial exposure, their
average age, gender, etc. might all be relevant to someone who wants to understand the
experiment and perhaps replicate it in another different experiment.

Data can also be represented in the form of graphs or histograms to better understand what is
happening in the experiment. These too come under descriptive statistics.

How to cite this article: 

Explorable.com (Nov 2, 2010). Descriptive Statistics. Retrieved from Explorable.com:
https://explorable.com/descriptive-statistics

6 Parameters and Statistics

A parameter in statistics is an important component of any statistical analysis. In simple
words, a parameter is any numerical quantity that characterizes a given population or
some aspect of it. This means the parameter tells us something about the whole
population.

Commonly Encountered Parameters


The most common statistical parameters are the measures of central tendency. These tell us
how the data behaves on an average basis. For example, mean, median and mode are
measures of central tendency that give us an idea about where the data concentrates.
Standard deviation tells us how the data is spread from the central tendency, i.e. whether the
distribution is wide or narrow. Such parameters are often very useful in analysis.

We can have many different statistical data and models built on the same parameter. In the
very simple case, consider data sets that have the same mean. We can create infinitely many
distributions of data that have the same mean. For example, the data set 23, 27, 31, 35, 39
has a mean of 31 and so does the data set 1, 31, 61. The same parameter of mean can lead
to different distributions.
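
This can be checked directly; a quick sketch in Python:

```python
import statistics

set_a = [23, 27, 31, 35, 39]
set_b = [1, 31, 61]

print(statistics.mean(set_a), statistics.mean(set_b))    # both 31
print(statistics.stdev(set_a), statistics.stdev(set_b))  # spreads differ widely
```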

The above is a very simple example, but the concept of a parameter in statistics gains more
importance when you study different distributions that occur in nature. The best known
example is the normal distribution that occurs in all forms of analysis, from human behavior to
studies related to the universe. This diversity is surprising and forms a cornerstone for a lot of
statistical analysis.

In the normal distribution, there are two parameters that can characterize a distribution - the
mean and standard deviation. By varying these two parameters, you can get different kinds of
normal distributions.

Variables are Not Parameters


As a researcher, it is important to differentiate between variables and parameters in statistics.
Variables, like the name suggests, are quantities that can be changed by the experimenter.

For example, the number of cases to study for a given problem is a variable. Thus a
researcher might choose a population of 100 people or 150 people, depending on various
statistical requirements. This would count as a variable.

A parameter, on the other hand, will be independent of the variable and the number of cases
that are taken to study. In fact, the parameters will fix the distribution irrespective of the total
number of cases under study.

Different statistical studies require different kinds of parameters for the characterization of
data. In many simple cases, the mean or median is a very good indication of the data. For
example, if a professor wants to determine the performance of students on a test, the median
score is a very good indication of this.

How to cite this article: 

Siddharth Kalla (May 6, 2011). Parameters and Statistics. Retrieved from Explorable.com:  
https://explorable.com/parameters-and-statistics

7 Statistical Data Sets

Statistical data sets are collections of data maintained in an organized form. Any
statistical analysis has to start with the collection of data, which is then
analyzed using statistical tools.

Therefore statistical data sets form the basis from which statistical inferences can be drawn.

Statistical data sets may record as much information as is required by the experiment.

For example, to study the relationship between height and age, only
these two parameters might be recorded in the data set.
However, if a more comprehensive study is required, then the
experimenter might want to record the height at birth, weight, nutritional
background, family history, etc.

Therefore the researcher needs to determine beforehand what kinds of data are required to
be recorded in the statistical data sets.

Certain things are common to all statistical data sets. For example, the order of the data does
not matter, which means the arrangement of the data within the data set is not important.
Therefore the researcher has the freedom to organize the subjects under study in whichever
order she finds it convenient.

Creating a statistical data set is only the first step in research. The interpretation and validity
of the inferences drawn from the data is what is most important. However, this task is not
possible without the data sets. Hence these are the starting point for most research in social
sciences, medical sciences and physical sciences.

Huge statistical data sets are already available for many areas.

For example, the international genealogical index contains the family
history of many people from the past. If a researcher needs to study
patterns and statistical data, she can simply make use of these data
sets. This makes the job of the researcher much simpler.

A particular statistical data set can be used for a number of research studies. Census data, for
example, contains comprehensive data about the demographics of a country, which can then
be utilized by a number of social scientists to study family structures, incomes, etc. within the
country.

A statistical data set is therefore not an end in itself - it is merely the starting point where all
the data is stored. How the data is collected and interpreted depends on the researcher
studying the data.

How to cite this article: 

Siddharth Kalla (Nov 27, 2009). Statistical Data Sets. Retrieved from Explorable.com:  
https://explorable.com/statistical-data-sets

7.1 Statistical Treatment Of Data

Statistical treatment of data is essential in order to make use of the data in the right
form. Raw data collection is only one aspect of any experiment; the organization of
data is equally important so that appropriate conclusions can be drawn. This is what
statistical treatment of data is all about.

There are many techniques involved in statistics that treat data in the required manner.
Statistical treatment of data is essential in all experiments, whether social, scientific or any
other form. Statistical treatment of data greatly depends on the kind of experiment and the
desired result from the experiment.

For example, in a survey regarding the election of a Mayor, parameters
like age, gender, occupation, etc. would be important in influencing a
person's decision to vote for a particular candidate. Therefore the data
needs to be treated in these reference frames.

An important aspect of statistical treatment of data is the handling of errors. All experiments
invariably produce errors and noise. Both systematic and random errors need to be taken into
consideration.

Depending on the type of experiment being performed, Type-I and Type-II errors also need to
be handled. These are the cases of false positives and false negatives that are important to
understand and eliminate in order to make sense from the result of the experiment.

Treatment of Data and Distribution


Trying to classify data into commonly known patterns is a tremendous help and is intricately
related to the statistical treatment of data. This is because distributions such as the normal
probability distribution occur so commonly in nature that they are the underlying distributions
in most medical, social and physical experiments.

Therefore if a given sample is known to be normally distributed, the statistical
treatment of data is made easier for the researcher, as a lot of supporting theory is already
available. Care should always be taken, however, not to assume all data to be
normally distributed; this should always be confirmed with appropriate testing.

Statistical treatment of data also involves describing the data. The best way to do this is
through the measures of central tendency like mean, median and mode. These help the
researcher explain in brief how the data are concentrated. Range, uncertainty and standard
deviation help us understand the distribution of the data. Two distributions with the
same mean can have wildly different standard deviations, which shows how differently the data
points are concentrated around the mean.

Statistical treatment of data is an important aspect of all experimentation today, and a
thorough understanding is necessary to conduct the right experiments with the right
inferences from the data obtained.

How to cite this article: 

Siddharth Kalla (Apr 10, 2009). Statistical Treatment Of Data. Retrieved from
Explorable.com:  https://explorable.com/statistical-treatment-of-data

7.2 Raw Data Processing

Raw data processing refers to the refining of raw data that has been collected from the
experiment.

Statistical data that is used to draw conclusions and inferences should be accurate and
consistent. This is important in order to ensure the validity of all the inferences drawn on the
basis of the data.

"Raw data is a term for data collected on source which has not been
subjected to processing or any other manipulation." (wikipedia.org)

Organizing the Data


Raw data is unprocessed/unorganized source data, such as the data from an eyetracker
which records the coordinates and movement of the eye every millisecond. Output data is the
processed/summarized/categorized data such as the output of the mean position for a
participant immediately after a stimulus was presented.

Raw data processing is required in most surveys and experiments. At the individual level, data
may need to be processed because there are several reasons why it might be an aberration.

The raw data collected often contains too much information to analyze sensibly. This is
especially so for research using computers, as this may produce large amounts of data. The
data needs to be organized or manipulated using deconstruction/analysis techniques.

Removal of Outliers

While measuring the current flow through a resistor with the help of an
ammeter, there may be one data point that is far away from the rest: a
statistical outlier.
This may be due to a sudden surge in voltage in the source, making this
data point a deviant.

Statistical raw data processing needs to be carried out in this case to eliminate this data point,
in order to ensure the accuracy of the conclusions drawn from the experiment.

Data Manipulation
In social experiments involving surveys, there are a number of reasons why a given data
set might need to be edited or processed. For example, the researcher may find an error in a
question which makes it invalid. Participants may also have checked the wrong answer or
may simply have misunderstood or skipped a question.

It is also important to extract exactly the information that is needed from the overall
experiment.

For example, census data provides a wealth of geographic and
demographic data, but a researcher might need only certain segments
of the data from certain locations.

Therefore raw data processing would be required in order to correctly extract the information
required without errors.

Raw data processing can be a time-consuming task, and it is not always easy to catch
anomalies. Therefore simple checks should be run, which are often quite effective in
eliminating abnormalities.

For example, a predefined range can be defined for most parameters
that can be obtained in theory. Suppose a researcher is studying the
amount of salts in a lake by averaging measurements at different locations. At one
particular location, there may be a sudden surge in salt
levels, which is an anomaly and can happen if, say, someone at a picnic
dumped some salt there.
However, this anomaly can be caught by determining a predefined
range for the value of salt content in the lake; such ranges are usually
available in the literature.

Such small tests often are very effective in raw data processing.
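
A minimal sketch of such a range check, with invented salt readings and assumed bounds (all names and values below are hypothetical):

```python
readings = [3.2, 3.5, 3.1, 3.4, 18.9, 3.3, 3.6]  # salt content in g/L, invented
PLAUSIBLE_MIN, PLAUSIBLE_MAX = 2.0, 6.0          # assumed range from the literature

clean = [r for r in readings if PLAUSIBLE_MIN <= r <= PLAUSIBLE_MAX]
flagged = [r for r in readings if not (PLAUSIBLE_MIN <= r <= PLAUSIBLE_MAX)]

print("kept:", clean)      # the anomalous 18.9 g/L reading is removed
print("flagged:", flagged)
```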

How to cite this article: 

Siddharth Kalla (Jan 22, 2009). Raw Data Processing. Retrieved from Explorable.com:  
https://explorable.com/raw-data-processing

7.3 Statistical Outliers

Statistical outliers are data points that are far removed and numerically distant from the
rest of the points. Outliers occur frequently in many statistical analyses and it is
important to understand them and their occurrence in the right context of the study to
be able to deal with them.

An outlier can be a chance phenomenon, a measurement error, or the result of an experimental
error. It can also occur in special cases with a heavy-tailed distribution, in which case the
assumption of a normal distribution may not hold.

Certain statistical estimators are robust and able to deal with statistical outliers, while
others are not. A typical example is the median, which deals with
outliers well, since it does not matter whether the extreme point is far away or near the other
data points, as long as the central value is unchanged.

The mean, on the other hand, is affected by outliers as it increases or decreases in value
depending on the position of the outlier.

One should be careful when dealing with outliers and not always mistake them for experimental
errors or exceptions. Outliers can indicate a different property and may indicate that they
belong to a different population.

Many times, outliers should be given special attention until their cause is known, which is not
always random or chance. Therefore a study needs to be made before an outlier is discarded.

Statistical outliers are common in distributions that do not follow the traditional normal
distribution. For example, in a distribution with a long tail, the presence of statistical outliers is
more common than in the case of a normal distribution.

In the case of a normal distribution, it is easy to see that, at random, about 1 in 370 observations
will deviate from the mean by more than three times the standard deviation. This ratio
decreases drastically for more distant values. Therefore, if data points far from the mean occur
more often than this, the cause needs to be examined.

For example, if out of 1000 data points, 5 points are at a distance of four times the standard
deviation or more, then these outliers need to be examined.
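
As a sketch of this idea on invented data: with small samples, a naive three-sigma check can mask the outlier (the outlier itself inflates the standard deviation), so the version below uses the median and the median absolute deviation (MAD) as robust stand-ins. This MAD-based variant is a common practice, not a method prescribed by the text.

```python
import statistics

data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 17.5, 10.1]  # invented readings

center = statistics.median(data)  # robust center, unaffected by the outlier
mad = statistics.median([abs(x - center) for x in data])
robust_sigma = 1.4826 * mad       # MAD scaled to approximate a standard deviation

outliers = [x for x in data if abs(x - center) > 3 * robust_sigma]
print(outliers)  # [17.5]
```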

How to cite this article: 

Siddharth Kalla (Nov 7, 2009). Statistical Outliers. Retrieved from Explorable.com:  
https://explorable.com/statistical-outliers

7.4 Data Output

Data output is the process and method by which data can be studied under different
circumstances and manipulated as required by the researcher. Any statistical analysis
produces output data that needs to be studied.

This data needs to be modified in a presentable form so that further conclusions and
inferences can be drawn from this data. Therefore the researcher needs to study different
data output methods for this purpose.

With the increased use of computers in statistics, there are today many software packages and
programs that help with data output. These tools can be used by the researcher to present
findings in different formats and also help her perform the required calculations on the data.

Spreadsheets are very handy tools in data output that can help the researcher quickly do
simple computations and checks on the data. Simple statistical analysis and statistical
parameters like mean, median, mode, range, etc. can easily be found using spreadsheets.

For example, in a physical experiment where the time interval between two
events is noted, it is always best to average out the readings to
eliminate random errors. When these data points are entered into a
spreadsheet, their average, standard deviation, etc. can easily be found.
This facilitates easy recording of results and also helps to identify
any deviant points and anomalies.
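
As a rough code equivalent of such spreadsheet checks, pandas' describe() produces the same kind of summary in one call; the timing readings below are invented.

```python
import pandas as pd

# Invented time intervals between two events, in seconds
intervals = pd.Series([2.31, 2.28, 2.35, 2.30, 2.33, 2.29, 2.32])

print(intervals.describe())  # count, mean, std, min, quartiles and max
print("range:", intervals.max() - intervals.min())
```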

Data output also involves representation of the data. For example, if a researcher is studying
the effect of a particular disease in people of different age groups, she may make use of a pie
chart to indicate the percentage of people affected in different age slabs.

This would immediately give a graphic representation of which age group is most prone to that
disease. If the researcher needs to include absolute numbers, then she may choose to take
the help of a bar chart.

The choice of the format of data output and presentation should be driven by the inference
that is being drawn from the research. In the above case, if the research aims to show that
children are most prone to the disease, then a pie chart may be the best option. However, if
the aim of the research is to show that the disease is spreading most rapidly among the older
people, then a bar graph may be the best option.

Data output is central to statistical analysis and is an integral part of the experiment. When
done right, data output can bring out the strengths of the research in an easy-to-understand
fashion.

How to cite this article: 

Siddharth Kalla (Feb 20, 2010). Data Output. Retrieved from Explorable.com:  
https://explorable.com/data-output

8 Statistical Analysis

Statistical analysis is fundamental to all experiments that use statistics as a research
methodology. Most experiments in the social sciences and many important experiments in
natural science and engineering need statistical analysis.

Statistical analysis is also a very useful tool to get approximate solutions when the actual
process is highly complex or unknown in its true form.

Example: The study of turbulence relies heavily on statistical analysis
derived from experiments. Turbulence is highly complex and almost
impossible to study at a purely theoretical level. Scientists therefore
need to rely on statistical analysis of turbulence through experiments
to confirm the theories they propound.

In social sciences, statistical analysis is at the heart of most experiments. It is very hard to
obtain general theories in these areas that are universally valid. In addition, it is through
experiments and surveys that a social scientist is able to confirm his theory.

What is the link between money and happiness? Does having more money make you
happier? This is an age-old question that scientists are now trying to answer. Such
experiments are highly complex in nature. After various studies, it turns out that there is a
direct relationship between money and happiness until you reach a certain income level,
corresponding to the minimum basic requirements of food, shelter and clothing (about
$60,000/year in the US); beyond this level, money and happiness seem independent of each other.

Beware the Pitfalls


Students of science need to know statistical analysis as so many areas use it. There are also
many pitfalls to avoid. Statistics can be used, intentionally or unintentionally, to reach
faulty conclusions. Misleading information is unfortunately the norm in advertising. The drug
companies, for example, are well known to indulge in misleading information.

Knowledge of statistics therefore will help you look behind the numbers and know the truth
instead of being misled to believe something that is not true. Data dredging is another huge
problem, especially in this internet era where numbers and data are so easy to come by. Only
by knowing the robust elements of statistical analysis can one really harness the potential of
this incredible tool.

Survey questions are another favorite area that can easily be manipulated. This happens
all the time, from presidential election surveys to market surveys by corporations. It can
also happen unintentionally, which means you need to be even more careful. Such bias is
hard to detect because it doesn't come out easily in the statistical analysis, and there is no
mathematical technique that will determine whether a question is biased.

It is therefore important that you understand not just the numbers but the meaning behind the
numbers. Statistics is a tool, not a substitute for in-depth reasoning and analysis. It should
supplement your knowledge of the area that you are studying.

How to cite this article: 

Siddharth Kalla (Apr 12, 2011). Statistical Analysis. Retrieved from Explorable.com:  
https://explorable.com/statistical-analysis

9 Measurement Scales

Measurement scales are ubiquitous throughout scientific research, especially in
the social sciences. They are useful for recording data and thus for applying
statistical or other scientific analysis to the data. In fact, all data can be classified
into the four major measurement scales described below.

1. Nominal
This type of measurement scale is used for mutually exclusive and exhaustive categories.
This means that the variable under measurement can take one and only one value out of the
given options. In addition, every observation must fall into one of the categories.

Examples:

In a survey, the variable 'sex' is a nominal scale of measurement because there are two
possibilities, male and female, which cover the entire population under study.
In a medical test, a lab animal may be either dead or alive. Every animal under study is
in one of the two states and there is no animal that cannot be described by these two
states.
In Quantum mechanics, the measured spin of an electron is either +1/2 or -1/2. The
measurement cannot yield any other value and this is true for any electron under study.

The states of a nominal measurement scale can be assigned numerical values, e.g. 1 for
female and 0 for male in the first example. These are usually arbitrary values and do not
correspond to an inherent numerical value that is universally assignable.

2. Ordinal
Unlike in the nominal case, here the numerical values associated with the measurement have
some relevance in terms of the ranking of the system. In a nominal measurement, the values are
arbitrary: in our previous example, assigning 1 to female and 0 to male does not in any way
mean that a female participant is "more" or "higher" than a male one. However, in an ordinal
measurement, there is a ranking involved.

Examples:

In an Olympic race, the participants are ranked according to the ascending order of the
time taken to finish the race. In this, the number tells us something about the relative
performance of an athlete.
The grading system used at university is an ordinal measurement scale
because there is a hierarchy involved.

3. Interval
The interval measurement scale gives us some quantitative information about the difference
between measurements. In an ordinal measurement scale, we only get qualitative information
about the relative ranking. In our previous example of a race, we only know who was first,
second and third, but we know nothing about how close the second was to the first or the
second to the third. For this information, we need an interval scale.

Examples:

In temperature measurement, we use scales that are interval measurement scales. These
scales are also uniform, in that the difference between 20°C and 40°C is the same as
the difference between 40°C and 60°C.
The number system that we use is another example of a uniform measurement scale.
Not all interval measurement scales are uniform. A log scale is commonly used for
plotting data, which is not uniform in nature.

In an interval scale, the ratio of values doesn't make sense. You cannot say, for example, that
20°C is twice as cold as 40°C (it would be absurd, for example, to say that 0.001°C is 1000
times colder than 1°C). The main reason for this is that the zero point is chosen arbitrarily.

4. Ratio
The ratio measurement scale is most commonly used in physical sciences and engineering
applications. Most physical measurements are on ratio scales. Like the name suggests, and
unlike the interval scale, here the ratio between two values makes perfect sense. The zero
point is not chosen arbitrarily in this case.

For example, when we say that the mass of a body is 2kg, it means that it is twice as heavy
as a 1kg object that is defined in some scientific way.

The ratio measurement reflects our physical world and is thus very common in science and
engineering. On the other hand, it is very rare in social sciences and surveys.

How to cite this article: 

Siddharth Kalla (Feb 20, 2011). Measurement Scales. Retrieved from Explorable.com:
https://explorable.com/measurement-scales

10 Variables and Statistics

In statistics, variables are central to any analysis and they need to be understood well
by the researcher. Even though the concept looks deceptively simple, many studies
and experienced researchers can go wrong by using the wrong variables.

Like any variable in mathematics, variables can vary, unlike mathematical constants like pi or
e. In statistics, variables contain a value or description of what is being studied in the sample
or population.

For example, if a researcher aims to find the average height of a tribe in Colombia, the
variable would simply be the height of each person in the sample. This is a simple measure for
a simple statistical study. However, most statistical analyses are not as straightforward.

In many cases, statistical variables do not contain numerical values but rather something
descriptive, such as the color of fins of a fish or the kind of species in a given natural habitat.

Qualitative to Quantitative Conversion


In many studies, the qualitative aspects of study are converted into numerical data for
statistical analysis. In this case, the final variable used in the statistical analysis is a number
instead of an attribute. This is central to good design of experiments.

For example, a research study might aim to find out how happy the sample group of students
is before and after eating a bar of chocolate. In this case, it is very difficult to describe
happiness, which is a very subjective and qualitative attribute. Converting this into a numerical
scale of say 1-10 is what will give the study some credibility and help the researchers draw
the right conclusions and inferences.

Quantitative Variables: Discrete and Continuous


Quantitative variables in statistics can be of different types, but the most commonly used
classification is that they can either be discrete or continuous. These types of variables have
some inherent differences that the researcher needs to be aware of.

For example, the test scores of a sample of students are clearly a discrete type of data that will
be represented by discrete variables.

On the other hand, the signal noise produced in a study involving communicating devices is
continuous in nature because the noise can take any value. This would be represented by a
continuous variable.

Discrete to Continuous Conversion


These different types of variables require different kinds of analysis. For example, a number of
continuous variables can be described using the normal distribution. In fact, many discrete
variables can also be represented in this manner when the sample space is so large that the
data looks like a continuous variable.

For example, the results of a coin toss involving 5 tosses are clearly discrete, but if the number of
flips is increased to 5000, the data looks smooth and continuous, and the probability of
obtaining, say, at least 2300 heads can be computed to a very good level of accuracy using
continuous variables (the number of heads follows a binomial distribution, which is closely
approximated by a normal distribution for large numbers of tosses).
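
This can be checked with a short computation; the sketch below compares the exact binomial answer with its normal approximation using scipy.

```python
from scipy import stats

n, p, k = 5000, 0.5, 2300  # 5000 fair flips, at least 2300 heads

exact = stats.binom.sf(k - 1, n, p)  # P(X >= 2300), exact binomial
approx = stats.norm.sf(k - 0.5, loc=n * p,
                       scale=(n * p * (1 - p)) ** 0.5)  # normal approximation

print(f"exact: {exact:.6f}, normal approximation: {approx:.6f}")
```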

How to cite this article: 

Siddharth Kalla (Apr 23, 2011). Variables and Statistics. Retrieved from Explorable.com:  
https://explorable.com/variables-and-statistics

11 Discrete Variables

A discrete variable is a kind of statistical variable that can only take on specific, discrete
values. The variable is not continuous, which means there are infinitely many values
between the maximum and minimum that simply cannot be attained.

For example, the test scores on a standardized test are discrete because there are only so
many values that can be obtained on the test. It would be impossible, for example, to obtain a
score of 342.34 on the SAT.

Practical Cases
A lot of studies involve the use of discrete variables. In particular, many studies involving
human subjects, where qualitative experience is converted to quantitative data, use discrete
variables.

For example, suppose a company is launching a new line of potato chips. To get a sense of
how these new chips rate compared to the ones already on the market, the
company needs to perform tests involving human tasters. These tasters will rate the new
product and an old product in the same category on a scale, typically of 1-10. In this case,
the score given by each taster for each of the products is a discrete variable.

There are also simpler cases of statistics that involve discrete variables for study. For
example, a coin toss can either be a heads or tails. If you want to quantify this data, you can
assign 1 for heads and 0 for tails and compute the total score of a random coin tossing
experiment. In this case, the variable that keeps track of the outcome is a discrete variable.

Discrete variables are frequently encountered in probability calculations. The above example
of a coin tossing experiment is just one simple case. Suppose you go to a casino and want to
play the roulette. There are generally two different types of roulettes in most casinos - the
American and European. If you want to calculate which one gives you a higher probability of a
win, you will need to consider all possible outcomes. This is clearly a discrete variable since
on each play, there is a slot in which the ball lands. (As it turns out, the European roulette
offers better odds than the American roulette).

Discrete vs. Continuous Variables
The opposite of a discrete variable is a continuous variable, which can take on all possible
values between the extremes. Thus this variable can vary in a continuous manner.

For example, consider the length of a stretched rubber band. Its length can be any value from
its initial size to the maximum possible stretched size before it breaks. The length variable can
be 10.0 cm or 15.435 cm. The variation is continuous in nature.

How to cite this article: 

Siddharth Kalla (Sep 19, 2011). Discrete Variables. Retrieved from Explorable.com:  
https://explorable.com/discrete-variables

Thanks for reading!


Explorable.com Team

Explorable.com - Copyright © 2008-2015 All Rights Reserved.
