MODULE 1 - Data Management


LECTURE NOTES 1 – DATA MANAGEMENT

Statistics is a branch of mathematics mainly concerned with describing and interpreting a collection of data, and
with drawing conclusions about populations from a knowledge of the properties of a sample.
Biostatistics is Statistics applied to biological and medical sciences.
Definition:
• The term population, as used in statistics, refers to a group of people, objects or events.
• The population is the collection of all elements under consideration in a statistical inquiry.
• A variable is a characteristic or attribute of the elements in a collection that can assume different values
for the different elements.
Examples:
1. All the students in the College of Engineering of the University of the East.
2. All the bus companies in Metro Manila.
3. All the birds that fly over the Philippine islands.
• All three examples above are populations. Each is a well-defined collection of people, places or things.
Definition:
• A sample is a portion or part of a population.
• The sample is a subset of the population.
• The parameter is a summary measure describing a specific characteristic of the population.
• The statistic is a summary measure describing a specific characteristic of the sample.
Examples:
1. All female students of the College of Engineering are a sample of the first population above.
2. All bus companies in Quezon City are a sample of the second population.
3. All the birds that fly over Bulacan are a sample of the third population.
Populations are considered finite or infinite.
Summary measures computed from a population are known as parameters. They are designated by Greek letters such as
α, σ, μ.
Summary measures computed from samples are known as estimates or statistics. They are designated by letters of the
English alphabet such as s, x̄, d.
Definitions:
➢ Data are information we gather about the sample or the population.
➢ Data may be classified into two major types: quantitative or qualitative.
1. Qualitative data refer to the attributes or characteristics of the samples.
2. Quantitative data refer to the numerical information gathered about the samples.
➢ Numerical data gathered about the samples are either discrete or continuous.
1. Discrete data are obtained through counting.
2. Continuous data are the result of measurement.

Types of Data
Definitions:
1. Nominal variables are variables that can be classified into two or more categories that have no inherent
order. A nominal variable can be either real or artificial.
a) A real nominal variable is one that is classified based on a naturally occurring attribute. Examples:
gender, breeds of cattle, color of hair, color of eyes, color of skin, breeds of dogs.
b) An artificial nominal variable is one that is classified based on a man-made attribute following certain
rules. Examples: citizenship, religion.
2. Ordinal variables are those grouped according to the rank or order of the categories.
Examples: military rank, birth order (e.g. second of four children), rank in a beauty contest.
3. Interval data are data for which addition and subtraction have meaning.
Examples: grades in an examination, temperature. (Interval data do not have a true zero, e.g. degrees Celsius.
Although you can say that 60 degrees is hotter than 30 degrees, you cannot say that it is twice as hot.)

4. A ratio variable refers to a variable where equality of ratio or proportion has meaning.
Examples: height, weight, age, time. (You can say that 20 seconds is twice as long as 10 seconds.)

Almost all variables in the natural sciences can be classified as ratio, whereas in the social sciences most variables
are classified as interval, ordinal or nominal.
Categories of Statistics
1. Descriptive Statistics is a field of study that systematizes the presentation, description and interpretation
of the data gathered. This also includes the study of relationships between and among variables.
2. Inferential Statistics is the branch of statistics whose purpose is to draw inferences about a population based
on a study of samples.
Collection of Data
Measurement is the process of determining the value or label of the variable based on what has been observed.
Levels of Measurement
1. The ratio level of measurement has all of the following properties:
a) The numbers in the system are used to classify a person/object into distinct and nonoverlapping
categories.
b) The system arranges the categories according to magnitude
c) The system has a fixed unit of measurement representing a set size throughout the scale
d) The system has an absolute zero
2. The interval level of measurement satisfies only the first three properties of the ratio level.
3. The ordinal level of measurement satisfies only the first two properties of the ratio level.
4. The nominal level of measurement satisfies only the first property of the ratio level.
Types of Data According to Source:
1. Primary Data: data documented from a primary source, that is, data taken directly from where they arise.
2. Secondary Data: data documented from a secondary source, that is, data taken from a published source.
Ways of Collecting Primary Data

1. By direct observation or measurement: data on the phenomenon of interest are obtained by recording the
observations made about the phenomenon as it actually happens.
2. By interview
Three elements of an interview:
a) Interviewer or enumerator
b) Interviewee or respondent
c) Schedule or agenda
3. By mail inquiry
4. By survey: data on the variable of interest are obtained by asking people questions.
5. By experiment: there is direct human intervention on the conditions that may affect the values of
the variables of interest.

SAMPLING DESIGNS
Definitions:
1. Sampling Frame is the list of the units in the population
Example: list of all the students in the University of the East
2. Sample Design is a procedure or plan specified before any data are collected to obtain a sample from a
given population.

The primary objective in sample design is to minimize both sampling and non-sampling errors. Errors are costly, not
only in terms of time and money spent in collecting the sample, but also in terms of the potential in making a
wrong
3. Sampling Errors refer to the differences which can occur between a sample statistic and the population
parameter being estimated.
4. Non-sampling Errors include all kinds of “human errors”. These include mistakes in collecting, analyzing, or
reporting data such as incorrectly adding a column of numbers or the failure of a respondent to provide truthful
Importance of Sampling Method
The entire population of a study is often unavailable to the researcher. A subset or a sample from the target population
may be available. But in order to use this sample to infer the characteristics of the population, the sample must be
representative of the population about which inferences are to be made. The selection of a random sample from
the population is one way to satisfy this requirement. A random sample of size n experimental units is selected in
such a way that every different sample of size n has an equal chance of being selected.
Sample size refers to the number of elements to be chosen from the given population.
Sample size may be chosen as follows:
1. If the size of the population is < 500, sample size = 10% of the population size but at least 30.
2. If the size of the population is >500, sample size = 10% of the population size but at most 300.
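The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name `rule_of_thumb_sample_size` is hypothetical, 10% is rounded to the nearest whole unit, a population of exactly 500 (which the rule does not cover) is treated like the larger case, and the sample is never allowed to exceed the population. All of these are assumptions, not part of the notes.

```python
def rule_of_thumb_sample_size(population_size: int) -> int:
    """Sample size = 10% of N, at least 30 when N < 500, at most 300 otherwise."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    # 10% of N rounded to the nearest whole unit, using integer arithmetic.
    ten_percent = (population_size + 5) // 10
    if population_size < 500:
        # At least 30, but a sample cannot be larger than the population (assumption).
        return min(population_size, max(ten_percent, 30))
    # N = 500 is handled here as well (assumption: the notes leave this case open).
    return min(ten_percent, 300)


print(rule_of_thumb_sample_size(480))   # 10% of 480 is 48, which is already above 30
print(rule_of_thumb_sample_size(200))   # 10% would be 20, so the minimum of 30 applies
print(rule_of_thumb_sample_size(5000))  # 10% would be 500, so the maximum of 300 applies
```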
You may also use Slovin's Formula to get the sample size:

$$n = \frac{N}{1 + N\alpha^{2}}$$

Where:
n is the sample size
α is the level of significance (alpha)
N is the population size.
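As a worked illustration of Slovin's Formula, n = N / (1 + N·α²), the sketch below computes the sample size for a population of 480 at α = 0.05. The function name `slovin_sample_size` is hypothetical, and rounding the result up to the next whole unit is an assumption (the notes do not state a rounding rule).

```python
import math


def slovin_sample_size(population_size: int, alpha: float) -> int:
    """Slovin's formula n = N / (1 + N * alpha**2), rounded up (assumed rounding rule)."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    n = population_size / (1 + population_size * alpha ** 2)
    return math.ceil(n)


# N = 480, alpha = 0.05: n = 480 / (1 + 480 * 0.0025) = 480 / 2.2, about 218.2
print(slovin_sample_size(480, 0.05))  # 219 after rounding up
```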

SAMPLING AND SAMPLING TECHNIQUES

• The target population is the population we want to study.
• The sampled population is the population from where we actually select the sample.
Probability sampling is a method of selecting a sample wherein each element in the population has a known,
nonzero chance of being included in the sample; otherwise it is nonprobability sampling.
Probability Sampling refers to the random selection of the sample.
➢ Types of Sampling

1. Simple Random Sampling: Randomly choose the sample members using numbers taken from a random number table
or generated by the random number function of a calculator.
2. Systematic Sampling: Divide the population size N by the sample size n, that is, k = N/n.
Choose every kth member from the sampling frame, using a random number for the start.

Example: With a population of 300, choose a sample of 30: k = 300/30 = 10.
Using 205 as a random start, the other samples are: 215, 225, 235, 245, 255, 265, 275, 285, 295, 5, 15, 25, 35,
45, 55, … etc.
3. Stratified Random Sampling: Group the population into strata (homogeneous groups), then use
proportional or equal allocation to get the sample.

Example: Given the population of a small high school distributed as follows, use stratified sampling to
get 48 samples.
Strata      Number of Population    Equal Allocation    Proportional Allocation
Grade 9             150                    12                     15
Grade 10            120                    12                     12
Grade 11            110                    12                     11
Grade 12            100                    12                     10
Total               480                    48                     48
The strata are the four levels of high school: Grade 9, Grade 10, Grade 11 and Grade 12. To get the 48 samples using
equal allocation, simply divide 48 by 4 to get 12 samples from each stratum. Then use simple random sampling
to choose the samples from each stratum. To get the sample using proportional allocation, first divide n by N. In
this case 48/480 = 0.1. Multiply this number by the population in each stratum to get the values in the
table. Then again use simple random sampling to choose the sample from each stratum.
4. Cluster Sampling
Here the population is subdivided into groups (clusters). Elements within a cluster must be heterogeneous
(dissimilar), so that each cluster has the same characteristics as the parent population. Clusters are picked at
random from the total number of clusters and all elements in the selected clusters are taken as sample units. This
type of sampling is used for convenience when sampling from a large geographical region.
Example: From a population of 300 households in a municipality, choose 30 households.
Decide on the number of clusters, say 10, so that there are 30 households in each cluster. Randomly choose
5 clusters from the 10 clusters. Then choose 6 households from each selected cluster. (Because only some households
are taken from each selected cluster, this example is a two-stage variant of cluster sampling.)
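The systematic and stratified examples above can be reproduced with a short Python sketch. It is a minimal illustration under stated assumptions: the function names are hypothetical, units in the sampling frame are assumed to be numbered 1 to N, the Grade 9 stratum is taken as 150 (the value implied by the total of 480 and the proportional allocation of 15), and proportional shares are simply rounded, which in general need not add up exactly to n.

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Every k-th unit (k = N // n) from a random start, wrapping around the frame."""
    k = population_size // sample_size
    if start is None:
        start = random.randint(1, population_size)  # random start in 1..N
    # Units are numbered 1..N; after passing N, continue from the top of the list.
    return [(start - 1 + i * k) % population_size + 1 for i in range(sample_size)]


def equal_allocation(strata, sample_size):
    """Same number of sample units from every stratum."""
    share = sample_size // len(strata)
    return {name: share for name in strata}


def proportional_allocation(strata, sample_size):
    """Sample units in proportion to stratum size (n / N times each stratum)."""
    total = sum(strata.values())
    return {name: round(sample_size * size / total) for name, size in strata.items()}


def draw_stratified(strata, allocation):
    """Simple random sample (without replacement) inside each stratum."""
    return {
        name: sorted(random.sample(range(1, strata[name] + 1), allocation[name]))
        for name in strata
    }


# Stratum sizes from the high school example (Grade 9 assumed to be 150).
STRATA = {"Grade 9": 150, "Grade 10": 120, "Grade 11": 110, "Grade 12": 100}

print(systematic_sample(300, 30, start=205)[:12])
print(equal_allocation(STRATA, 48))
print(proportional_allocation(STRATA, 48))
print(draw_stratified(STRATA, proportional_allocation(STRATA, 48))["Grade 12"])
```

With a start of 205 the first call lists 205, 215, ..., 295 and then wraps to 5 and 15, as in the systematic sampling example.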

➢ Nonprobability Sampling


The probability of a unit from the population being selected into the sample is unknown.
1. Convenience Sampling (Haphazard)
Available individuals are used in the study.

2. Purposive Sampling (Judgment)
Deliberate selection of individuals is done for the study.

3. Quota Sampling
Selection is done according to predetermined quotas. Each person gathering observations is given a
specified number of elements to sample. The decision as to whom to interview is usually left to the
discretion of the interviewer. A danger exists when the interviewer’s judgment and convenience contain
biases not conducive to a representative sample.

Presentation of Data
1. Textual presentation of data incorporates important figures in a paragraph of text.
2. Tabular presentation of data arranges figures in a systematic manner in rows and columns.
3. Graphical presentation of data portrays numerical figures or relationships among variables in pictorial
form.

Organization of Data
• Raw data are data in their original form.
• The array is an ordered arrangement of data according to magnitude.
• The frequency distribution is a way of summarizing data by showing the number of observations that
belong in the different categories or classes.

➢ Measures of Central Tendency


Measures of Central Tendency are descriptive measures that are used to describe the center of a set of data,
arranged numerically.
1. The arithmetic mean is the most common type of average. It is the sum of all the observed values divided
by the number of observations.
2. The median is the value that divides the array into two equal parts.
3. The mode is the observed value that occurs with the greatest frequency in a data set.

➢ Measures of Location
Measures of location (also called measures of position, or fractiles) are used to specify the location of specific
data in relation to the rest of the sample. They divide the distribution into an equal number of parts.
1. The percentiles divide the ordered observations into 100 equal parts.
2. The quartiles divide the ordered observations into 4 equal parts.
3. The deciles divide the ordered observations into 10 equal parts.
Consider the following sets of data:
Set A: 9, 12, 13, 15, 15, 17, 24
Set B: 7, 11, 15, 15, 17, 19, 21
Set C: 11, 11, 15, 15, 15, 18, 20
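As a quick check on these three sets, the sketch below computes the mean, median and mode of each using Python's standard `statistics` module (the dictionary name `SETS` is just an illustrative choice). All three sets turn out to share the same center, so measures of central tendency alone cannot tell them apart.

```python
import statistics

# The three data sets given in the notes.
SETS = {
    "Set A": [9, 12, 13, 15, 15, 17, 24],
    "Set B": [7, 11, 15, 15, 17, 19, 21],
    "Set C": [11, 11, 15, 15, 15, 18, 20],
}

for name, values in SETS.items():
    mean = statistics.mean(values)      # sum of the values divided by their count
    median = statistics.median(values)  # middle value of the ordered array
    mode = statistics.mode(values)      # most frequent value
    print(f"{name}: mean={mean}, median={median}, mode={mode}")
```

Each set sums to 105 over 7 observations, so every line reports a mean, median and mode of 15.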

➢ Measures of Dispersion
Measures of dispersion or variability describe the spread or scatter of the values around the mean.
1. The range is the distance between the maximum value and the minimum value.
2. The variance is the average squared difference of each observation from the mean.
3. The standard deviation is the positive square root of the variance.
4. The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage.
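The four measures can be computed for Sets A, B and C as in the sketch below. One assumption is labeled here and in the code: the variance is computed as the sample variance (dividing by n − 1) through `statistics.variance`; if the data were a whole population, `statistics.pvariance` (dividing by n) would be used instead. The function name `dispersion_summary` is hypothetical.

```python
import statistics

SETS = {
    "Set A": [9, 12, 13, 15, 15, 17, 24],
    "Set B": [7, 11, 15, 15, 17, 19, 21],
    "Set C": [11, 11, 15, 15, 15, 18, 20],
}


def dispersion_summary(values):
    """Range, sample variance, sample standard deviation and CV (in percent)."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # assumption: sample variance, divisor n - 1
    std_dev = statistics.stdev(values)      # positive square root of the variance
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "std_dev": std_dev,
        "cv_percent": std_dev / mean * 100,
    }


for name, values in SETS.items():
    summary = dispersion_summary(values)
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Although the three sets have the same mean of 15, Set C (range 9, variance 11) is clearly less spread out than Sets A and B (ranges 15 and 14, variances of about 22.3 and 22.7).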

➢ Measures of Skewness and Kurtosis


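The measure of skewness describes the degree of asymmetry of a distribution.
Sk = 0, symmetric distribution
Sk > 0, positively skewed distribution
Sk < 0, negatively skewed distribution

$$Sk_1 = \frac{\bar{x} - Mo}{s}$$

$$Sk_2 = \frac{3(\bar{x} - Md)}{s}$$

where x̄ is the mean, Mo is the mode, Md is the median and s is the standard deviation.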

➢ Measures of Skewness
1. Symmetrical or Normal Distribution
• The mean, median and mode all fall at the same point, that is, they are equal.
2. Positively Skewed Distribution
• The extreme scores are larger, thus the mean is larger than the median, which is larger than the mode.

3. Negatively Skewed Distribution
• The order of the measures of central tendency is the opposite of that of the positively skewed
distribution, with the mean being smaller than the median, which is smaller than the mode.
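The two skewness coefficients, Sk1 = (mean − mode) / s and Sk2 = 3(mean − median) / s, can be sketched as follows. The function names and the small right-skewed data set `scores` are hypothetical and used only for illustration; s is taken to be the sample standard deviation, which is an assumption.

```python
import statistics


def pearson_skewness(values):
    """Return (Sk1, Sk2): mode-based and median-based skewness coefficients."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: sample standard deviation
    sk1 = (mean - statistics.mode(values)) / s
    sk2 = 3 * (mean - statistics.median(values)) / s
    return sk1, sk2


def describe(sk):
    """Interpret the sign of a skewness coefficient as in the notes."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetric"


scores = [2, 3, 3, 4, 10]  # hypothetical data: one large extreme value
sk1, sk2 = pearson_skewness(scores)
print(round(sk1, 3), round(sk2, 3), describe(sk2))

set_a = [9, 12, 13, 15, 15, 17, 24]  # Set A from the notes: mean = median = mode = 15
print(pearson_skewness(set_a))
```

For `scores` the mean (4.4) exceeds the median and mode (both 3), so both coefficients are positive. For Set A both coefficients are zero, because the mean, median and mode coincide. The data themselves are not perfectly symmetric, so this also shows that the coefficients measure only the relative position of the averages.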

➢ Measure of Kurtosis
The measure of kurtosis refers to the peakedness or flatness of the curve of the distribution.

$$K = \frac{\sum (x - \bar{x})^{4}}{n s^{4}}$$

K > 3, the distribution is Leptokurtic
K = 3, the distribution is Mesokurtic
K < 3, the distribution is Platykurtic

➢ Measure of Kurtosis
• Leptokurtic. The curve is more peaked and the hump is narrower or sharper than the normal curve.
• Platykurtic. The curve is less peaked and the hump is flatter than the normal curve.
• Mesokurtic. The hump is the same as that of the normal curve. It is neither too flat nor too peaked.
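A minimal sketch of the kurtosis measure K = Σ(x − x̄)⁴ / (n·s⁴) and its classification is given below, applied to Set C from the notes. The function names are hypothetical, and s is assumed to be the sample standard deviation (divisor n − 1); using the population standard deviation would give a somewhat different value of K.

```python
import statistics


def kurtosis(values):
    """K = sum((x - mean)**4) / (n * s**4), with s the sample standard deviation."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: sample standard deviation
    return sum((x - mean) ** 4 for x in values) / (n * s ** 4)


def classify(k):
    """Compare K with 3, the value for the normal curve."""
    if k > 3:
        return "leptokurtic"
    if k < 3:
        return "platykurtic"
    return "mesokurtic"


set_c = [11, 11, 15, 15, 15, 18, 20]
k = kurtosis(set_c)
# Fourth powers of the deviations sum to 1218, and n * s**4 = 7 * 11**2 = 847.
print(round(k, 3), classify(k))
```

In practice K is rarely exactly 3 for real data, so a value close to 3 is usually read as approximately mesokurtic.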

Normal Distribution
• The normal distribution is a pattern for the distribution of a set of data that follows a bell-shaped curve.
• The graph of a normal distribution is called a normal curve.

Properties of Normal Distribution


1. The normal curve is bell-shaped.
2. The mean, median and mode are equal and located at the center of the distribution, which is unimodal.
3. It is symmetrical about the mean.
4. It is continuous and is asymptotic with respect to the x-axis.
5. The total area under the curve is 1.00 or 100%.

The Standard Normal Distribution


• The standard normal distribution is a normal distribution with a
  • mean of 0
  • standard deviation of 1
• The z-score measures how many standard deviations an observed value is above or below the mean.
• The sample z-score is given by the formula

$$z = \frac{X - \bar{x}}{s}$$

• The standard score is useful when we want to compare two or more observed values from different data
sets.
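The sketch below shows how z-scores, z = (X − x̄) / s, can be used to compare observed values from different data sets. The function name `z_score` and the two exam summaries (means and standard deviations) are hypothetical figures chosen only for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (x - mean) / std_dev


# Hypothetical example: the same student takes two different exams.
z_math = z_score(85, mean=75, std_dev=5)      # 2 standard deviations above the mean
z_english = z_score(90, mean=80, std_dev=10)  # 1 standard deviation above the mean

print(z_math, z_english)
```

Although the raw English score (90) is higher, the math score is further above its class mean in standard-deviation units, so it is the better relative performance.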
Area under the Standard Normal Curve

Given                                               Steps
Between zero and any number                         Look up the area in the table.
Between two positives, or between two negatives     Look up both areas in the table and subtract the smaller from the larger.
Between a negative and a positive                   Look up both areas in the table and add them together.
Less than a negative, or greater than a positive    Look up the area in the table and subtract it from 0.5000.
Greater than a negative, or less than a positive    Look up the area in the table and add it to 0.5000.
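The steps in the table can be mirrored in code. In the sketch below, `table_area` stands in for the printed z-table: it returns the area between 0 and z, computed with the error function `math.erf` from Python's standard library instead of being looked up. The function names are hypothetical, and the table is assumed to be of the "area from 0 to z" type, which is what the steps above imply.

```python
import math


def table_area(z):
    """Area under the standard normal curve between 0 and z (what the z-table lists)."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))


def area_between(a, b):
    """Area between two z-values, following the table of steps."""
    if a * b >= 0:
        # Same side of zero (or one value is zero): subtract the smaller from the larger.
        return abs(table_area(b) - table_area(a))
    # Opposite sides of zero: add the two areas together.
    return table_area(a) + table_area(b)


def area_less_than(z):
    """Area to the left of z."""
    if z < 0:
        return 0.5 - table_area(z)  # less than a negative: subtract from 0.5000
    return 0.5 + table_area(z)      # less than a positive: add to 0.5000


def area_greater_than(z):
    """Area to the right of z; by symmetry this equals the area to the left of -z."""
    return area_less_than(-z)


print(round(table_area(1.96), 4))         # area between 0 and 1.96
print(round(area_between(-1, 1), 4))      # area within one standard deviation of the mean
print(round(area_less_than(-1.96), 4))    # lower tail below -1.96
print(round(area_greater_than(1.96), 4))  # upper tail above 1.96
```

The printed values agree with the usual table entries of 0.4750 for z = 1.96 and about 0.6827 for the area between z = −1 and z = 1.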
