

Data Processing and Analysis

Introduction
❑ Processing: editing, coding, classification and tabulation
of collected data so that analysis can be performed.
❑ Analysis:
❑ The computation of certain measures along with searching for
patterns of relationship that exist among data-groups.
❑ Relationships or differences supporting or conflicting with original or new hypotheses should be subjected to statistical tests of significance to determine with what validity the data can be said to indicate any conclusions.

❑ Some researchers are of the opinion that processing and analysis are not different processes.
2
Data Processing and Analysis
Types of analysis
❑ Correlation analysis: studies the joint variation of two or more variables in order to determine the amount of correlation between them.
❑ Causal analysis: concerned with the study of how one or more variables
affect changes in another variable. It is thus a study of functional
relationships existing between two or more variables.
❑ Multivariate analysis: all statistical methods which simultaneously
analyze more than two variables on a sample of observations.
❑ Inferential analysis: Concerned with the various tests of significance for
testing hypotheses in order to determine with what validity data can be
said to indicate some conclusion or conclusions.

3
Data Processing and Analysis
Statistics in research
The important statistical measures that are used to summarize the
survey/research data are:
1. measures of central tendency or statistical averages;

2. measures of dispersion;

3. measures of asymmetry (skewness);

4. measures of relationship;

5. other measures.

4
Data Processing and Analysis
Measures of Central Tendency
Mean:
❑ Also known as arithmetic average.
❑ The most common measure of central tendency.
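❑ In symbols (standard definition): x̅ = (x₁ + x₂ + … + xₙ) / n = Σxᵢ / n, where n is the number of observations.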

5
Data Processing and Analysis
Measures of Central Tendency
Median:
❑ The value of the middle item of series when it is arranged in
ascending or descending order of magnitude.
❑ It divides the series into two halves; in one half all items are less than the median, whereas in the other half all items have values higher than the median.

❑ Median is a positional average and is used only in the context of qualitative phenomena, for example, in estimating intelligence, etc., which are often encountered in sociological fields.

❑ Median is not useful where items need to be assigned relative importance and weights.
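❑ For an ordered series of n items, the median is the value of the (n + 1)/2-th item (standard rule); e.g. with 11 sorted observations it is the 6th value, and with an even n it is taken as the average of the two middle items.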
Data Processing and Analysis
Measures of Central Tendency
Mode:
❑ Mode is the most commonly or frequently occurring value in a series.
❑ The mode in a distribution is that item around which there is
maximum concentration.
❑ Mode is particularly useful in the study of popular sizes.
❑ For example, a manufacturer of shoes is usually interested in finding
out the size most in demand so that he may manufacture a larger
quantity of that size.

7
Data Processing and Analysis
Measures of dispersion
❑ Measures of central tendency fail to give any idea about the
scatter of the values of items of a variable in the series around
the true value of average.
❑ In order to measure this scatter, statistical devices called
measures of dispersion are calculated.
❑ The most important measures of dispersion are:
1. Range,
2. Mean deviation,

3. Standard deviation.

8
Data Processing and Analysis
Measures of dispersion
Range:
❑ The difference between the values of the extreme items of a
series.

❑ Advantage: gives an idea of the variability very quickly.

❑ Drawback: the range is affected very greatly by fluctuations of sampling; its value is never stable, being based on only two values of the variable.
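❑ In symbols: Range = highest value of the series − lowest value of the series.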

9
Data Processing and Analysis
Measures of dispersion
Mean deviation:
❑ The average of the differences of the item values from some average of the series (mean, median or mode).
❑ Only absolute differences are taken (i.e. the minus sign is ignored).
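❑ In symbols (taking deviations from the mean): mean deviation = Σ|xᵢ − x̅| / n, where n is the number of items.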

11
Data Processing and Analysis
Measures of dispersion
Standard deviation:

❑ Coefficient of standard deviation = the standard deviation divided by the arithmetic average of the series.

❑ When this coefficient of standard deviation is multiplied by 100, the resulting figure is known as the coefficient of variation.

❑ Variance = the square of the standard deviation.
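❑ In symbols (standard definitions): σ = √( Σ(xᵢ − x̅)² / n ), coefficient of standard deviation = σ / x̅, coefficient of variation = (σ / x̅) × 100, and variance = σ².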


12
Data Processing and Analysis
Measures of asymmetry (skewness)
❑ When the distribution of items in a series happens to be perfectly symmetrical, we then have the following type of curve for the distribution (a symmetrical, bell-shaped curve), and:

❑ Mean (x̅) = Mode (Z) = Median (M)

❑ Such a curve is the normal curve, and the related distribution is the normal distribution.


Data Processing and Analysis
Measures of asymmetry (skewness)
❑ If the curve is distorted on the right side (a longer right tail), we have positive skewness.

❑ When the curve is distorted towards the left (a longer left tail), we have negative skewness.

15
Data Processing and Analysis
Measures of asymmetry (skewness)
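❑ One standard measure here is Karl Pearson's coefficient of skewness: skewness = (x̅ − Z) / σ, or 3(x̅ − M) / σ when the mode is ill-defined; it is positive for a positively skewed series and negative for a negatively skewed one.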

16
Data Processing and Analysis
Measures of relationships
Correlation
❑ A correlation is a statistical measure of the relationship
between two variables.
❑ The measure is best used for variables that demonstrate a linear relationship with each other.
❑ The fit of the data can be visually represented in a
scatterplot. Using a scatterplot, we can generally assess the
relationship between the variables and determine whether
they are correlated or not.

17
Data Processing and Analysis
Measures of relationships
Correlation
❑ Pearson correlation coefficient:
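❑ For n pairs of observations (xᵢ, yᵢ), the standard sample formula is:
r = Σ(xᵢ − x̅)(yᵢ − ȳ) / √( Σ(xᵢ − x̅)² · Σ(yᵢ − ȳ)² )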

18
Data Processing and Analysis
Measures of relationships
Correlation Example

19
Data Processing and Analysis
Measures of relationships
Correlation Example
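❑ As an illustrative sketch (made-up numbers, not the data from the original example), the coefficient can be computed directly in Python:

    # Pearson correlation coefficient for a small, hypothetical data set.
    import math

    x = [1, 2, 3, 4, 5]   # hypothetical values of the first variable
    y = [2, 4, 5, 4, 5]   # hypothetical values of the second variable

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Numerator: sum of products of deviations from the two means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) *
                    sum((yi - mean_y) ** 2 for yi in y))

    r = num / den
    print(round(r, 3))    # prints 0.775 for these numbers: a fairly strong positive correlation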

20
Data Processing and Analysis
Measures of relationships
Correlation

❑ The correlation coefficient is a value that indicates the strength of the relationship between variables.

❑ The coefficient values and the respective interpretations of the values are:

❑ -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e., when one variable increases, the other variable decreases).

❑ 0: No correlation. The variables do not have a relationship with each other.

❑ 1: Perfect positive correlation. The variables tend to move in the same direction (i.e., when one variable increases, the other variable also increases).
21
Data Processing and Analysis
Measures of relationships
Correlation

22
Data Processing and Analysis
Measures of relationships
Correlation
❑ For any two correlated events, A and B, their possible
relationships include:

❑ A causes B (direct causation);

❑ B causes A (reverse causation);

❑ A and B are both caused by C

❑ A causes B and B causes A (bidirectional or cyclic causation);

❑ There is no connection between A and B; the correlation is a


coincidence. 23
Data Processing and Analysis
Measures of relationships
Correlation
❑ Correlation must not be confused with causality.

❑ The famous expression “correlation does not mean causation” is crucial to the understanding of the two statistical concepts.

❑ If two variables are correlated, it does not imply that one variable causes the changes in the other variable.

❑ Correlation only assesses relationships between variables, and there may be different factors that lead to the relationships.

❑ Causation may be a reason for the correlation, but it is not the only possible explanation.

❑ Is there an example of correlation and causation from your experience/work?
24
Data Processing and Analysis
Measures of relationships
Correlation vs Causation
Young children who sleep with the light on are much more likely to develop myopia in
later life.

Therefore, sleeping with the light on causes myopia.

❑ This is a scientific example that resulted from a study at the University of Pennsylvania Medical Center. Published in the May 13, 1999 issue of Nature, the study received much coverage at the time in the popular press.

❑ However, a later study at Ohio State University did not find that infants sleeping
with the light on caused the development of myopia. It did find a strong link
between parental myopia and the development of child myopia, also noting that
myopic parents were more likely to leave a light on in their children's bedroom. In
this case, the cause of both conditions is parental myopia, and the above-stated
conclusion is false.
25
Data Processing and Analysis
Measures of relationships
Simple regression
❑ Regression is the determination of a statistical
relationship between two or more variables.
❑ In simple regression, we have only two variables: one variable (defined as the independent variable) is the cause of the behaviour of the other one (defined as the dependent variable).
❑ Regression can only interpret what exists physically, i.e., there must be a physical way in which the independent variable X can affect the dependent variable Y.

26
Data Processing and Analysis
Measures of relationships
Simple regression
❑ The basic relationship between x and y is given by:
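❑ In its usual form (the slide's own notation may differ): y = a + bx, where a is the intercept and b the slope (regression coefficient). Least squares gives b = Σ(xᵢ − x̅)(yᵢ − ȳ) / Σ(xᵢ − x̅)² and a = ȳ − b·x̅.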

27
Data Processing and Analysis
Measures of relationships
Simple regression

28
Data Processing and Analysis
Measures of relationships
Simple regression

29
Data Processing and Analysis
Measures of relationships
Simple regression

30
Data Processing and Analysis
Measures of relationships
Simple regression
❑ In Excel, use Data Analysis from the Data tab.

❑ If you cannot find the Data Analysis button, load the Analysis ToolPak using the link below:
https://www.excel-easy.com/data-analysis/analysis-toolpak.html
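❑ Outside Excel, the same fit can be reproduced in a few lines of Python; this is a minimal sketch on made-up data, not the regression from the original slides:

    # Ordinary least-squares fit of y = a + b*x on a small, hypothetical data set.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical independent variable
    y = [2.1, 3.9, 6.2, 8.1, 9.8]   # hypothetical dependent variable

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2), a = ȳ - b·x̄
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x

    print(f"y = {a:.2f} + {b:.2f} x")   # fitted line; approximately y = 0.14 + 1.96 x here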
31
Data Processing and Analysis
Measures of relationships
Simple regression

33
Data Processing and Analysis
Measures of relationships
Multiple regression
❑ When there are two or more independent variables, the analysis concerning their relationship is known as multiple correlation.
❑ The equation describing such a relationship is known as the multiple regression equation.
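❑ In general form (standard notation): y = a + b₁x₁ + b₂x₂ + … + bₖxₖ, where x₁, …, xₖ are the independent variables and b₁, …, bₖ their regression coefficients.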

34
Data Processing and Analysis
Measures of relationships
Multiple regression
❑ In Excel, all the steps are the same, except that the two columns X1 and X2 are selected in “Input X Range”.

35
Sampling Fundamentals
Sampling
❑ Sampling: the selection of some part of an aggregate or
totality on the basis of which a judgement or inference
about the aggregate or totality is made.
❑ Sampling is needed because:
❑ It saves time and money

❑ It is the only way to analyze data when the population contains infinitely many members

❑ It is the only way when a test involves the destruction of the item being tested

36
Sampling Fundamentals
Important terms
Population:

❑ The total of items about which information is desired

❑ The population or universe can be finite or infinite (e.g. Number of stars in the sky,
Number of grains of sand in a sample)

Sampling error

❑ The inaccuracy in the information collected when a study of a small portion of the
population is used.

Confidence level and significance level:

❑ The expected percentage of times that the actual value will fall within the stated limits. Thus, if we take a confidence level of 95%, then we mean that there are 95 chances in 100 (or .95 in 1) that the sample results represent the true condition of the population within a specified precision range, against 5 chances in 100 (or .05 in 1) that they do not. The significance level is the complement of the confidence level (here 5%, or .05): the likelihood that the result falls outside the stated limits.
37
Sampling Fundamentals
Sampling Distributions
Sampling distribution of mean

❑ The probability distribution of all the possible means of random samples of a given size that we take from a population.

❑ If samples are taken from a normal population N(µ = mean, σp = standard deviation), the sampling distribution of the mean would also be normal, with mean and standard deviation given by:

❑ mean = µ and standard deviation (standard error) = σp / √n

❑ n is the number of items in the sample.

❑ But when sampling is from a population which is not normal (it may be positively or negatively skewed), even then the central limit theorem states that the sampling distribution of the mean tends quite close to the normal distribution as the sample size increases (i.e. beyond about 30).
38
Sampling Fundamentals
Central limit theorem
❑ Normal distributions are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Because of the central limit theorem, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal.

❑ In case we want to reduce the sampling distribution of the mean to the unit normal distribution, i.e. N(µ = 0, σp = 1), we can write the standard normal variate as:

z = (x̅ − µ) / (σp / √n)

❑ This characteristic of the sampling distribution of the mean is very useful in several decision situations for accepting or rejecting hypotheses.
Sampling Fundamentals
Sampling Theory
❑ Sampling theory is a study of relationships existing between a population
and samples drawn from the population.

❑ Sampling theory is applicable only to random samples.

❑ This sort of movement from particular (sample) towards general (universe) is what is known as statistical induction or statistical inference.

❑ The sampling theory for large samples is not applicable in small samples
because when samples are small, we cannot assume that the sampling
distribution is approximately normal.

40
Sampling Fundamentals
Sampling Distribution
Student’s t-test, based on t-distribution

❑ For cases where sample size is 30 or less and the population variance is not
known.

❑ To test the significance of the mean of a random sample:
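❑ The test statistic (standard form) is t = (x̅ − µ) / (s / √n), with n − 1 degrees of freedom, where s is the sample standard deviation and µ the hypothesised population mean.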

41
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

42
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

❑ For applying the t-test, we work out the value of the test statistic (i.e. ‘t’) and then compare it with the table value of t (based on the ‘t’ distribution) at a certain level of significance for the given degrees of freedom.

❑ If the calculated value of ‘t’ equals or exceeds the table value, we infer that the difference is significant; but if the calculated value of t is less than the corresponding table value of t, the difference is not treated as significant.

43
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

❑ Example: In a plant that manufactures computer ICs, the processing speed of the chips is inferred from a sample of 15 ICs (n = 15). The sample average is x̅ = 8.2267 and the sample standard deviation is s = 1.6722. Construct a 95% confidence interval for the processing speed.

❑ Degrees of freedom = n − 1 = 15 − 1 = 14

❑ Confidence interval (two-sided): the tail area on each side is (1 − 95%)/2 = 5%/2 = 0.025
(95% of the area in the centre of the t-distribution, 2.5% in each tail)

44
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

❑ At a tail area of 0.025 with 14 degrees of freedom, t = 2.145

❑ Error bound of the mean = t × (s / √n) = 2.145 × (1.6722 / √15) ≈ 0.926

❑ Thus, with 95% confidence,

Mean lower bound = 8.2267 − 0.926 ≈ 7.30

Mean upper bound = 8.2267 + 0.926 ≈ 9.15
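❑ This interval can be checked quickly in Python; a sketch assuming SciPy is available (not part of the original example):

    # Reproducing the 95% t-based confidence interval for the IC example.
    from math import sqrt
    from scipy.stats import t

    n, xbar, s = 15, 8.2267, 1.6722
    t_crit = t.ppf(0.975, df=n - 1)        # two-sided 95% => 0.025 in each tail
    bound = t_crit * s / sqrt(n)           # error bound of the mean
    print(round(xbar - bound, 2), round(xbar + bound, 2))   # 7.3 9.15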


45
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

46
Sampling Fundamentals
Estimation
Population mean (Normal distribution)

❑ Example: From a random sample of 36 Karachi civil service personnel, the mean age and the sample standard deviation were found to be 40 years and 4.5 years respectively. Construct a 95 per cent confidence interval for the mean age of civil servants in Karachi.
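❑ Solution sketch (using the usual normal approximation for a large sample): x̅ ± z·(s/√n) = 40 ± 1.96 × (4.5/√36) = 40 ± 1.47, i.e. approximately 38.53 to 41.47 years.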

47
Sampling Fundamentals
Estimation

48
Sampling Fundamentals
Estimation

49
Sampling Fundamentals
Estimation
Tables of normal distribution:

50
Sampling Fundamentals
Estimation

51
Sampling Fundamentals
Estimation

52
Sampling Fundamentals
Estimation
Suggested problems (Walpole):

53
Sampling Fundamentals
Estimation
Suggested problems (Walpole):

54
