Basic Concepts Lecture Notes

Basics Statistics and Techniques [Key Statistical Concepts]

Statistics:
Statistics is the body of techniques used to facilitate the collection, organization, presentation,
analysis, and interpretation of data for the purpose of making effective decisions.
Numerical facts, or data, such as those quoted in news items (46%, $100,000, 6.9%, 2.4%) are
referred to as statistics in everyday usage. The field of statistics, however, involves much more
than simply the calculation and presentation of numerical data.

Descriptive statistics describe or summarize a set of data. Measures of central tendency and measures
of dispersion are the two types of descriptive statistics. The mean, median, and mode are three types
of measures of central tendency.
When analyzing data, such as the grades earned by 100 students, it is possible to use both descriptive
and inferential statistics in your analysis. Typically, in most research conducted on groups of
people, you will use both descriptive and inferential statistics to analyze your results and draw
conclusions.
Descriptive statistics uses the data to provide descriptions of the population, either through numerical
calculations or graphs or tables. Inferential statistics makes inferences and predictions about a
population based on a sample of data taken from the population in question.
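
As a minimal sketch of the descriptive measures named above, the following Python snippet summarizes a small set of grades with the standard library. The grade values are made up for illustration and do not come from these notes.

```python
import statistics

# Hypothetical grades for five students (illustrative values only).
grades = [55, 60, 60, 70, 85]

# Measures of central tendency.
mean_grade = statistics.mean(grades)
median_grade = statistics.median(grades)
mode_grade = statistics.mode(grades)

# A simple measure of dispersion: the range (largest value minus smallest).
grade_range = max(grades) - min(grades)

print("mean:", mean_grade)
print("median:", median_grade)
print("mode:", mode_grade)
print("range:", grade_range)
```

The range is used here only because it is the simplest measure of dispersion; other measures could be substituted.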

Populations and samples


A population is a group of persons or objects of a certain kind (called population units) that
have some characteristics, or properties, in common. The objects may be concrete, eg. factories,
or abstract, eg. car accidents.
A sample is a part of a population, ie. a set of population units that have been selected from a
population. The members of the sample are often referred to as sample units.
The main purpose of choosing a sample is to enable statements to be made about the population
more cheaply, quickly, etc.

In order to do that, the sample must be representative of the population, ie. the sample must
present a small-scale image of the population. In other words, the proportion of sample units
having a certain characteristic should be approximately the same as the proportion of the
population units having that characteristic.

Datasets
A variable is a characteristic, or property, of a population unit that varies from one unit to
another, ie. its value varies from one unit to another. For a given variable, one may obtain the
value of that variable for every unit in a population or for every unit in a sample only. Each such
value may be called a data value, a piece of data, a raw data value, an observed value, an
observation, a score or a measurement. The set of data values constitutes a simple dataset. If
values for two or more variables are obtained for every unit in a population or in a sample, the
set of values constitutes a bi-variate or multi-variate dataset.

Distribution
The pattern of variation of data, which may be described as symmetrical, positively skewed or
negatively skewed.
Outliers
Data values that are either much larger or much smaller than the general body of data; they
should be included in analysis unless they are the result of human or other known error.

SAMPLE SURVEY VERSUS CENSUS


SAMPLING ERRORS
For any statistical study, statisticians define the term population to describe the collection of
every unit to whom the study pertains. For example, if we wanted to compile some statistics on
the annual salary of managing directors of PNG companies, then the population for the study
would be the managing directors of each and every PNG company. (A unit would be an
individual managing director.)

A census uses information from every unit in the study’s population.


A sample survey uses information from only a sample of the population.

Common sense would tell us that a census should be more accurate than a sample survey.
Whenever a statistician decides to perform a sample survey rather than a census, he/she is
potentially going to introduce errors into any findings. These errors are called sampling errors.
The size of these errors usually decreases as the size of the sample increases. However, the size of
the errors also depends on the way in which the sample was chosen.
In practice, as long as the sample is not too small and the sample is well constructed, the
information from a sample survey should be reasonably accurate and reliable. Weigh this
against the enormous cost savings of gathering information from a sample rather than from the
whole population, and you can understand why a sample survey rather than a census is used
for most statistical studies when the population is large.
Determining the size of a sample to use with a sample survey is beyond the scope of this course.
If you take a later course in statistics you will find that statisticians are able to specify the size
needed to obtain results with required accuracy. But it is worth noting here the rather obvious
fact that a larger sample (generally) gives more accurate information than a smaller sample.
More precisely, sampling error is inversely proportional to the
square root of the sample size. For example, errors from using a sample of size 48 are likely to
be only half as large as those from using a sample of size 12 [since 48 = 12 × 4, errors
decrease in size by a factor of 1/√4 = 1/2].
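
The square-root rule above can be checked with a short calculation. This is only a sketch: the function name relative_sampling_error is hypothetical, and it gives the ratio of typical error sizes, not the error itself.

```python
import math

def relative_sampling_error(n_reference, n_new):
    """Ratio of the typical sampling error at size n_new to that at size n_reference,
    assuming error is inversely proportional to the square root of the sample size."""
    return math.sqrt(n_reference / n_new)

# Going from a sample of 12 to a sample of 48 (four times as large).
factor = relative_sampling_error(12, 48)
print(factor)
```

Under the same assumption, halving the error always requires a sample four times as large.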

Sampling methods
There are many methods of selecting a sample from a population; some are quite complex.
The basic method is called simple random sampling, where every unit in a population has the
same chance of being selected in a sample. Simple random samples are often drawn by using
random numbers.
We wish to find out about the number of times Unitech students visit the clinic, and for what
reasons they go there. A census is impractical (since finding out information from 1700 students
is going to take a very long time). We decide instead on a sample survey using 75 students.
We choose the sample by standing outside the clinic, and when a student comes along we
persuade them to answer a short questionnaire. When we have recorded information from 75
students we stop. The problem is that, because we are selecting students in front of the clinic, we
are likely to pick students who are sicker than average, and so the results of our
survey would be inaccurate; they would be biased.
Instead of selecting our sample in front of the clinic, we select 75 students studying in the
library. Perhaps it is the case that a higher than average number of females (compared to males)
study in the library and this could bias our clinic data. Or, perhaps students who study in the
library are studious, non-drinking, non-smoking and rather healthier than the average student.
Again this could bias our clinic data.
We require a fair way of selecting students for the survey – a method that attempts to make the
sample representative of the entire student population. A random sample tries to achieve this.
Essentially, the units in a random sample are chosen by lottery. Choosing a random sample is
often surprisingly difficult, but use of random samples can reduce the size of sampling errors.
One type of random sample is a simple random sample.

In a simple random sample, each unit in the population has an equal chance of being included in the sample.
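
A simple random sample of 75 students from a population of 1700 might be drawn as sketched below. The numbering of students from 1 to 1700 and the seed value are assumptions made for illustration; in practice the numbers would come from a list such as an enrolment register.

```python
import random

# Hypothetical sampling frame: students numbered 1 to 1700.
population = list(range(1, 1701))

# A seeded generator makes the draw repeatable; the seed value is arbitrary.
rng = random.Random(2024)

# Draw 75 distinct students, each with the same chance of selection.
sample = rng.sample(population, 75)

print(sorted(sample)[:10])
```

Because random.sample draws without replacement, no student can appear in the sample twice.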

A very similar technique is called systematic random sampling, where the population units are
arranged in order and units at regular intervals are chosen in the sample.
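
One common way to carry out systematic sampling is sketched below: pick a random starting unit, then take every k-th unit after it. The function name systematic_sample and the student numbering are hypothetical, and the sketch assumes the interval is the population size divided by the sample size, rounded down.

```python
import random

def systematic_sample(units, sample_size, rng):
    """Select units at regular intervals from an ordered list, after a random start."""
    interval = len(units) // sample_size   # spacing between selected units
    start = rng.randrange(interval)        # random start within the first interval
    return [units[start + i * interval] for i in range(sample_size)]

# Hypothetical ordered list of students numbered 1 to 1700.
students = list(range(1, 1701))
chosen = systematic_sample(students, 75, random.Random(7))

print(chosen[:5])
```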
Stratified sampling involves dividing the population into strata, eg. provinces, so that the units
in each stratum are similar to each other, and then selecting a separate sample from each stratum.
Cluster sampling involves dividing the population into clusters, eg. villages in a province, so
that each cluster is as representative of the population as possible, and then selecting some of the
clusters to represent all of them.
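
The difference between stratified and cluster sampling can be sketched as follows. The province and village names, the unit labels and the sample sizes are all invented for illustration. The sketch also assumes an equal-sized sample from each stratum and that every unit in a chosen cluster is surveyed; other designs are possible.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Draw a separate simple random sample from every stratum."""
    sample = []
    for units in strata.values():
        sample.extend(rng.sample(units, per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick some clusters at random and keep every unit in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(clusters[name])
    return chosen, sample

rng = random.Random(1)

# Hypothetical strata: three provinces with 100 units each.
strata = {p: [f"{p}-{i}" for i in range(100)] for p in ("North", "Central", "South")}
strat = stratified_sample(strata, 5, rng)

# Hypothetical clusters: six villages with 20 units each.
clusters = {v: [f"{v}-{i}" for i in range(20)] for v in ("V1", "V2", "V3", "V4", "V5", "V6")}
picked, clus = cluster_sample(clusters, 2, rng)

print(len(strat), picked, len(clus))
```

Every stratum contributes to a stratified sample, whereas a cluster sample leaves most clusters out entirely.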
These and other techniques can be combined into complex sample designs, especially for large
or dispersed populations. For our purposes, however, we shall from now onwards assume that
samples referred to are simple random samples unless otherwise stated.

Dependence/independence of variables
For a given dataset, two variables are independent if the value of one does not depend on the
value of the other, ie. if particular values of one variable do not tend to be associated with
particular values of the other variable, eg. income and religious denomination of a group of
people.
For a given dataset, two variables are dependent if they are not independent, ie. if particular
values of one variable do tend to be associated with particular values of the other variable, eg.
education level and income of a group of people.

Percentiles
A percentile is a value of the variable that cuts off a specified percentage of the area under the
frequency polygon or curve, measuring from the left-hand end. It is often denoted by P with a
subscript, eg. P47 for the 47th percentile. For example, the 10th percentile is the value of the
variable that separates the lowest 10 percent of the values in the dataset from the remaining 90
percent; the 73rd percentile is the value of the variable that separates the lowest 73 percent of the
values in the dataset from the remaining 27 percent.
If the percentile relates to a multiple of 10 percent, the percentile is also called a decile. For
example, the 10th percentile is the first decile, the 20th percentile is the second decile, and so on.
Similarly, if the percentile relates to a multiple of 25 percent, the percentile is also called a
quartile. Thus the 25th percentile is the first quartile (Q1), the 50th percentile is the second
quartile (Q2), and the 75th percentile is the third quartile (Q3).
A term that is equivalent to percentile is fractile, which refers to a fraction of a distribution.
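
Percentiles can be computed in several slightly different ways. The sketch below uses the nearest-rank rule, which is an assumption and not necessarily the rule intended in these notes: the k-th percentile is the smallest data value with at least k percent of the values at or below it. The dataset is made up for illustration.

```python
def percentile(values, k):
    """Nearest-rank percentile for a whole-number k between 0 and 100."""
    ordered = sorted(values)
    n = len(ordered)
    # Ceiling of k * n / 100, done in integer arithmetic to avoid rounding problems.
    rank = max(1, (k * n + 99) // 100)
    return ordered[rank - 1]

# Hypothetical dataset: the whole numbers 1 to 20.
data = list(range(1, 21))

p10 = percentile(data, 10)   # first decile
q1 = percentile(data, 25)    # first quartile
q2 = percentile(data, 50)    # second quartile
q3 = percentile(data, 75)    # third quartile

print(p10, q1, q2, q3)
```

Other rules interpolate between neighbouring values, so statistical software may report slightly different figures for the same data.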

There are two general types of data.


Quantitative data is information about quantities; that is, information that can be measured and
written down with numbers. Some examples of quantitative data are your height, your shoe size,
and the length of your fingernails.
Qualitative data is information about qualities; information that can't actually be measured.
Some examples of qualitative data are the softness of your skin, the grace with which you run,
and the color of your eyes.

Here's a quick look at the difference between qualitative and quantitative data.
o The age of your car. (Quantitative.)

o The number of hairs on your knuckle. (Quantitative.)

o The softness of a cat. (Qualitative.)

o The color of the sky. (Qualitative.)

o The number of pennies in your pocket. (Quantitative.)

Remember, if we're measuring a quantity, we're making a statement about quantitative data. If
we're describing qualities, we're making a statement about qualitative data. Keep your L's and
N's together and it shouldn't be too tough to keep straight.
Measurement refers to the process of determining what is the value of a given variable for a
certain person or object, ie. which of the possible values of the variable applies to the person or
object. We can distinguish four types of measurement scale: nominal, ordinal, interval and
ratio.
A nominal scale is one for which the possible values consist of a list of labels or categories that
cannot be arranged in order of size. The categories may, however, be given numerical codes for
convenience.
An ordinal scale is one in which the categories can be arranged in order of size in some sense,
but where it is not possible to measure the difference between any two categories.
An interval scale is one for which the possible values can be arranged in order of size, are
equally spaced and may be negative. Since the values are spaced at equal intervals, one can
determine the difference between any two values.
A ratio scale is like an interval scale but the values are non-negative, ie. the scale has an absolute
zero. One can calculate products and ratios for values measured on a ratio-type scale.
Variables that are measured on a nominal or ordinal type of scale are called qualitative or
category-type variables or just attributes. Those measured on an interval or ratio-type scale are
called quantitative.
These scales are regarded as being ordered from the simplest or lowest (nominal) to the highest
(ratio). A variable measured on one scale-type may be converted into a lower scale
measurement, eg. age in years (ratio scale) may be converted into “young”, “middle-aged” and
“old” (ordinal scale).
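
The conversion from a ratio scale to an ordinal scale can be sketched as below. The cut-off ages (under 30 is young, under 60 is middle-aged) are assumptions chosen only for illustration; the notes do not specify them.

```python
def age_group(age_years, young_below=30, old_from=60):
    """Convert age in years (ratio scale) to an ordered category (ordinal scale).

    The default cut-offs are hypothetical."""
    if age_years < young_below:
        return "young"
    if age_years < old_from:
        return "middle-aged"
    return "old"

ages = [19, 34, 61, 45, 72]
groups = [age_group(a) for a in ages]
print(groups)
```

The conversion loses information: the categories cannot be turned back into the original ages.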

CATEGORY, RANK AND MEASUREMENT DATA


Most data in statistics is numerical, and much of it consists of measurements – like the weights
of children, the incomes of workers, or the monthly amounts of rainfall of a city. However, not
all data are measurements, and not all data can be analyzed the same way, so we begin by
looking at the properties of various types of data.
There are three broad categories of data that we may call category data, rank data and
measurement data. The sex of an individual is an example of category data. The position that a
student finishes in a race (eg, 6th) is an example of rank data. The number of computers in an
organization and the weight of a computer monitor are examples of measurement data.

Measurement
Most data in statistics is numerical, and much of it consists of measurements like the weights of
children, the incomes of workers, or the monthly amounts of rainfall of a city. However, not all
data are measurements, and not all data can be analysed the same way, so we begin by looking at
the properties of various types of data. There are three broad categories of data that we may call
category data, rank data and measurement data.

Rank Data
This means arranging a set of persons or objects in order according to the value of a variable and
then assigning a number to each person or object on an ordinal scale (ie. 1st, 2nd, 3rd, etc.). The
order may be from lowest to highest or vice versa; the assigned number is the rank number, or
position number, of the person or object.
A tie occurs whenever two or more values in a dataset are the same. When the values are
ranked, tied values that are equal to each other must, of course, be given the same rank number.
There are various ways of doing this. From a statistical point of view, the best way to deal with
this case is to assign to each tied value the rank number that is the average of the rank numbers
covered by the tied values (even if it is not an integer).
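
The averaging rule for tied ranks can be sketched as follows. The function name average_ranks and the scores are hypothetical; ranks run from lowest value (rank 1) to highest.

```python
def average_ranks(values):
    """Rank values from lowest to highest, giving tied values the average of the ranks they cover."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at position pos.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end correspond to rank numbers pos+1..end+1; average them.
        shared_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

scores = [30, 10, 20, 20]
ranks = average_ranks(scores)
print(ranks)
```

Here the two scores of 20 cover rank numbers 2 and 3, so each is given rank 2.5.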

PRIMARY AND SECONDARY DATA


Primary data is collected for a particular purpose. If we measured each Unitech female
student’s height (so that we could calculate the average height of Unitech female students), then
we would be gathering primary data.
Secondary data is data that was collected at some other time for some other purpose. For
example, another way of calculating the average height of Unitech female students would be to
approach the Health Centre and see if they would allow us to use their database of student
information, which includes student heights. This would be secondary data.
Care has to be taken to ensure the accuracy of both primary and (particularly) secondary data.
We will look shortly at the problems of gathering primary data. Here we will mention a further
possible problem of re-using secondary data. Since the data was collected for another purpose, it
is possible that it may not be exactly the information required. For example, we might be
interested (for some reason) in the average height of Unitech female students without shoes.
But when the information was collected by the Health Centre workers some students were
measured wearing shoes.
As a second example, suppose we wished to know the price of a retail commodity in shops five
years ago (to compare it with today’s prices). We go to the library and find newspapers five
years old, and look for an advertisement giving the previous price of the commodity. The price
we get from this source may very well be a discounted price – since usually sale prices are
advertised in the papers. So any statistics derived from this data may well be in error.

Discrete and continuous variables


Quantitative variables are classified as either discrete or continuous.
A discrete variable can take only certain separated values, usually integers, eg. number of
children in a family; it cannot take intermediate values. A continuous variable can in theory
take any possible value in a range, eg. age.
In practice, continuous variables are measured discretely (depending on the desired accuracy and
the precision of the measuring instrument), eg. age in completed years, length to nearest metre.
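
The two examples above, age in completed years and length to the nearest metre, can be sketched as follows. The function names and the sample values are hypothetical. Halves are rounded up here, which is an assumption, since Python's built-in round() sends exact halves to the nearest even number.

```python
import math

def completed_years(age):
    """Record a continuous age as a whole number of completed years."""
    return math.floor(age)

def nearest_metre(length_m):
    """Record a continuous, non-negative length to the nearest metre (halves round up)."""
    return math.floor(length_m + 0.5)

print(completed_years(19.9))   # still 19 completed years
print(nearest_metre(2.4))
print(nearest_metre(2.5))
```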
