

Data Processing and Analysis

Introduction
❑ Processing: editing, coding, classification and tabulation
of collected data so that analysis can be performed.
❑ Analysis:
❑ The computation of certain measures along with searching for
patterns of relationship that exist among data-groups.
❑ Relationships or differences supporting or conflicting with original or new hypotheses should be subjected to statistical tests of significance to determine with what validity the data can be said to indicate any conclusions.

❑ Some researchers are of the opinion that processing and analysis are not different processes.
2
Data Processing and Analysis
Types of analysis
❑ Correlation analysis: studies the joint variation of two or more variables in order to determine the amount of correlation between them.
❑ Causal analysis: concerned with the study of how one or more variables
affect changes in another variable. It is thus a study of functional
relationships existing between two or more variables.
❑ Multivariate analysis: all statistical methods which simultaneously
analyze more than two variables on a sample of observations.
❑ Inferential analysis: Concerned with the various tests of significance for
testing hypotheses in order to determine with what validity data can be
said to indicate some conclusion or conclusions.

3
Data Processing and Analysis
Statistics in research
The important statistical measures that are used to summarize the
survey/research data are:
1. measures of central tendency or statistical averages;

2. measures of dispersion;

3. measures of asymmetry (skewness);

4. measures of relationship;

5. other measures.

4
Data Processing and Analysis
Measures of Central Tendency
Mean:
❑ Also known as arithmetic average.
❑ The most common measure of central tendency.
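❑ In symbols (standard definition): x̅ = (x₁ + x₂ + … + xₙ) / n = Σxᵢ / n, where n is the number of observations.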

5
Data Processing and Analysis
Measures of Central Tendency
Median:
❑ The value of the middle item of series when it is arranged in
ascending or descending order of magnitude.
❑ It divides the series into two halves; in one half all items are less than the median, whereas in the other half all items have values higher than the median.

❑ Median is a positional average and is used only in the context of qualitative phenomena, for example, in estimating intelligence, etc., which are often encountered in sociological fields.

❑ Median is not useful where items need to be assigned relative importance and weights.
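❑ For an ordered series of n items, the median is the value of the (n + 1)/2-th item (standard rule); e.g. with 11 sorted observations it is the 6th value, and with an even n it is taken as the average of the two middle items.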
Data Processing and Analysis
Measures of Central Tendency
Mode:
❑ Mode is the most commonly or frequently occurring value in a series.
❑ The mode in a distribution is that item around which there is
maximum concentration.
❑ Mode is particularly useful in the study of popular sizes.
❑ For example, a manufacturer of shoes is usually interested in finding
out the size most in demand so that he may manufacture a larger
quantity of that size.

7
Data Processing and Analysis
Measures of dispersion
❑ Measures of central tendency fail to give any idea about the
scatter of the values of items of a variable in the series around
the true value of average.
❑ In order to measure this scatter, statistical devices called
measures of dispersion are calculated.
❑ The most important measures of dispersion are:
1. Range,
2. Mean deviation,

3. Standard deviation.

8
Data Processing and Analysis
Measures of dispersion
Range:
❑ The difference between the values of the extreme items of a
series.

❑ Advantage: gives an idea of the variability very quickly.

❑ Drawback: the range is affected very greatly by fluctuations of sampling; its value is never stable, being based on only two values of the variable.
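❑ In symbols: Range = highest value of the series − lowest value of the series.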

9
Data Processing and Analysis
Measures of dispersion
Mean deviation:
❑ The average of the differences of the item values from some average of the series (mean, median or mode).
❑ Only absolute differences are taken (i.e. the minus sign is ignored).
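❑ In symbols (taking deviations from the mean): mean deviation = Σ|xᵢ − x̅| / n, where n is the number of items.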

11
Data Processing and Analysis
Measures of dispersion
Standard deviation:

❑ Coefficient of standard deviation = the standard deviation divided by the arithmetic average of the series.

❑ When this coefficient of standard deviation is multiplied by 100, the resulting figure is known as the coefficient of variation.

❑ Variance = the square of the standard deviation.
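❑ In symbols (standard definitions): σ = √( Σ(xᵢ − x̅)² / n ), coefficient of standard deviation = σ / x̅, coefficient of variation = (σ / x̅) × 100, and variance = σ².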


12
Data Processing and Analysis
Measures of asymmetry (skewness)
❑ When the distribution of items in a series happens to be perfectly symmetrical, we then have the following type of curve for the distribution (a symmetrical, bell-shaped curve), and:

❑ Mean (x̅) = Mode (Z) = Median (M)

❑ Such a curve is the normal curve, and the related distribution is the normal distribution.


Data Processing and Analysis
Measures of asymmetry (skewness)
❑ If the curve is distorted on the right side (a longer right tail), we have positive skewness.

❑ When the curve is distorted towards the left (a longer left tail), we have negative skewness.

15
Data Processing and Analysis
Measures of asymmetry (skewness)
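❑ One standard measure here is Karl Pearson's coefficient of skewness: skewness = (x̅ − Z) / σ, or 3(x̅ − M) / σ when the mode is ill-defined; it is positive for a positively skewed series and negative for a negatively skewed one.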

16
Data Processing and Analysis
Measures of relationships
Correlation
❑ A correlation is a statistical measure of the relationship
between two variables.
❑ The measure is best used for variables that demonstrate a linear relationship with each other.
❑ The fit of the data can be visually represented in a
scatterplot. Using a scatterplot, we can generally assess the
relationship between the variables and determine whether
they are correlated or not.

17
Data Processing and Analysis
Measures of relationships
Correlation
❑ Pearson correlation coefficient:
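❑ For n pairs of observations (xᵢ, yᵢ), the standard sample formula is:
r = Σ(xᵢ − x̅)(yᵢ − ȳ) / √( Σ(xᵢ − x̅)² · Σ(yᵢ − ȳ)² )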

18
Data Processing and Analysis
Measures of relationships
Correlation Example

19
Data Processing and Analysis
Measures of relationships
Correlation Example
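❑ As an illustrative sketch (made-up numbers, not the data from the original example), the coefficient can be computed directly in Python:

    # Pearson correlation coefficient for a small, hypothetical data set.
    import math

    x = [1, 2, 3, 4, 5]   # hypothetical values of the first variable
    y = [2, 4, 5, 4, 5]   # hypothetical values of the second variable

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Numerator: sum of products of deviations from the two means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) *
                    sum((yi - mean_y) ** 2 for yi in y))

    r = num / den
    print(round(r, 3))    # prints 0.775 for these numbers: a fairly strong positive correlation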

20
Data Processing and Analysis
Measures of relationships
Correlation

❑ The correlation coefficient is a value that indicates the strength of the relationship between variables.

❑ The coefficient values and the respective interpretations of the values are:

❑ -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e., when one variable increases, the other variable decreases).

❑ 0: No correlation. The variables do not have a relationship with each other.

❑ 1: Perfect positive correlation. The variables tend to move in the same direction (i.e., when one variable increases, the other variable also increases).
21
Data Processing and Analysis
Measures of relationships
Correlation

22
Data Processing and Analysis
Measures of relationships
Correlation
❑ For any two correlated events, A and B, their possible
relationships include:

❑ A causes B (direct causation);

❑ B causes A (reverse causation);

❑ A and B are both caused by C

❑ A causes B and B causes A (bidirectional or cyclic causation);

❑ There is no connection between A and B; the correlation is a


coincidence. 23
Data Processing and Analysis
Measures of relationships
Correlation
❑ Correlation must not be confused with causality.

❑ The famous expression “correlation does not mean causation” is crucial to the understanding of the two statistical concepts.

❑ If two variables are correlated, it does not imply that one variable causes the changes in the other variable.

❑ Correlation only assesses relationships between variables, and there may be different factors that lead to the relationships.

❑ Causation may be a reason for the correlation, but it is not the only possible explanation.

❑ Is there an example of correlation and causation from your experience/work?
24
Data Processing and Analysis
Measures of relationships
Correlation vs Causation
Young children who sleep with the light on are much more likely to develop myopia in
later life.

Therefore, sleeping with the light on causes myopia.

❑ This is a scientific example that resulted from a study at the University of Pennsylvania Medical Center. Published in the May 13, 1999 issue of Nature, the study received much coverage at the time in the popular press.

❑ However, a later study at Ohio State University did not find that infants sleeping
with the light on caused the development of myopia. It did find a strong link
between parental myopia and the development of child myopia, also noting that
myopic parents were more likely to leave a light on in their children's bedroom. In
this case, the cause of both conditions is parental myopia, and the above-stated
conclusion is false.
25
Data Processing and Analysis
Measures of relationships
Simple regression
❑ Regression is the determination of a statistical
relationship between two or more variables.
❑ In simple regression, we have only two variables: one variable (defined as the independent variable) is the cause of the behaviour of the other one (defined as the dependent variable).
❑ Regression can only interpret what exists physically, i.e., there must be a physical way in which the independent variable X can affect the dependent variable Y.

26
Data Processing and Analysis
Measures of relationships
Simple regression
❑ The basic relationship between x and y is given by:
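❑ In its usual form (the slide's own notation may differ): y = a + bx, where a is the intercept and b the slope (regression coefficient). Least squares gives b = Σ(xᵢ − x̅)(yᵢ − ȳ) / Σ(xᵢ − x̅)² and a = ȳ − b·x̅.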

27
Data Processing and Analysis
Measures of relationships
Simple regression

28
Data Processing and Analysis
Measures of relationships
Simple regression

29
Data Processing and Analysis
Measures of relationships
Simple regression

30
Data Processing and Analysis
Measures of relationships
Simple regression
❑ In Excel, use Data Analysis from the Data tab.

❑ If you cannot find the Data Analysis button, load the Analysis ToolPak using the link below:
https://www.excel-easy.com/data-analysis/analysis-toolpak.html
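❑ Outside Excel, the same fit can be reproduced in a few lines of Python; this is a minimal sketch on made-up data, not the regression from the original slides:

    # Ordinary least-squares fit of y = a + b*x on a small, hypothetical data set.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical independent variable
    y = [2.1, 3.9, 6.2, 8.1, 9.8]   # hypothetical dependent variable

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2), a = ȳ - b·x̄
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x

    print(f"y = {a:.2f} + {b:.2f} x")   # fitted line; approximately y = 0.14 + 1.96 x here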
31
Data Processing and Analysis
Measures of relationships
Simple regression

33
Data Processing and Analysis
Measures of relationships
Multiple regression
❑ When there are two or more independent variables, the analysis concerning their relationship is known as multiple correlation.
❑ The equation describing such a relationship is known as the multiple regression equation.
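❑ In general form (standard notation): y = a + b₁x₁ + b₂x₂ + … + bₖxₖ, where x₁, …, xₖ are the independent variables and b₁, …, bₖ their regression coefficients.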

34
Data Processing and Analysis
Measures of relationships
Multiple regression
❑ In Excel, all the steps are the same, except that the two columns X1 and X2 are selected in “Input X Range”.

35
Sampling Fundamentals
Sampling
❑ Sampling: the selection of some part of an aggregate or
totality on the basis of which a judgement or inference
about the aggregate or totality is made.
❑ Sampling is needed because:
❑ It saves time and money

❑ It is the only way to analyze data when the population contains infinitely many members

❑ It is the only way when a test involves the destruction of the item being tested

36
Sampling Fundamentals
Important terms
Population:

❑ The total of items about which information is desired

❑ The population or universe can be finite or infinite (e.g. Number of stars in the sky,
Number of grains of sand in a sample)

Sampling error

❑ The inaccuracy in the information collected when a study of a small portion of the
population is used.

Confidence level and significance level:

❑ The expected percentage of times that the actual value will fall within the stated limits. Thus, if we take a confidence level of 95%, then we mean that there are 95 chances in 100 (or .95 in 1) that the sample results represent the true condition of the population within a specified precision range, against 5 chances in 100 (or .05 in 1) that they do not. The significance level is the complement of the confidence level (here 5%, or .05): the likelihood that the result falls outside the stated limits.
37
Sampling Fundamentals
Sampling Distributions
Sampling distribution of mean

❑ The probability distribution of all the possible means of random samples of a given size that we take from a population.

❑ If samples are taken from a normal population N(µ = mean, σp = standard deviation), the sampling distribution of the mean would also be normal, with mean and standard deviation given by:

❑ mean = µ and standard deviation (standard error) = σp / √n

❑ n is the number of items in the sample.

❑ But when sampling is from a population which is not normal (it may be positively or negatively skewed), even then the central limit theorem states that the sampling distribution of the mean tends quite close to the normal distribution as the sample size increases (i.e. beyond about 30).
38
Sampling Fundamentals
Central limit theorem
❑ Normal distributions are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Because of the central limit theorem, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal.

❑ In case we want to reduce the sampling distribution of the mean to the unit normal distribution, i.e. N(µ = 0, σp = 1), we can write the standard normal variate as:

z = (x̅ − µ) / (σp / √n)

❑ This characteristic of the sampling distribution of the mean is very useful in several decision situations for accepting or rejecting hypotheses.
Sampling Fundamentals
Sampling Theory
❑ Sampling theory is a study of relationships existing between a population
and samples drawn from the population.

❑ Sampling theory is applicable only to random samples.

❑ This sort of movement from particular (sample) towards general (universe) is what is known as statistical induction or statistical inference.

❑ The sampling theory for large samples is not applicable in small samples
because when samples are small, we cannot assume that the sampling
distribution is approximately normal.

40
Sampling Fundamentals
Sampling Distribution
Student’s t-test, based on t-distribution

❑ For cases where sample size is 30 or less and the population variance is not
known.

❑ To test the significance of the mean of a random sample:
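❑ The test statistic (standard form) is t = (x̅ − µ) / (s / √n), with n − 1 degrees of freedom, where s is the sample standard deviation and µ the hypothesised population mean.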

41
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

42
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

❑ For applying the t-test, we work out the value of the test statistic (i.e. ‘t’) and then compare it with the table value of t (based on the ‘t’ distribution) at a certain level of significance for the given degrees of freedom.

❑ If the calculated value of ‘t’ equals or exceeds the table value, we infer that the difference is significant; but if the calculated value of t is less than the corresponding table value of t, the difference is not treated as significant.

43
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

❑ Example: In a plant that manufactures computer ICs, the processing speed of the chips is inferred from a sample of 15 ICs (n = 15). The sample average is x̅ = 8.2267 and the sample standard deviation is s = 1.6722. Construct a 95% confidence interval for the processing speed.

❑ Degrees of freedom = n − 1 = 15 − 1 = 14

❑ Confidence interval (two-sided): the tail area on each side is (1 − 95%)/2 = 5%/2 = 0.025
(95% of the area in the centre of the t-distribution, 2.5% in each tail)

44
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

❑ At a tail area of 0.025 with 14 degrees of freedom, t = 2.145

❑ Error bound of the mean = t × (s / √n) = 2.145 × (1.6722 / √15) ≈ 0.926

❑ Thus, with 95% confidence,

Mean lower bound = 8.2267 − 0.926 ≈ 7.30

Mean upper bound = 8.2267 + 0.926 ≈ 9.15
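❑ This interval can be checked quickly in Python; a sketch assuming SciPy is available (not part of the original example):

    # Reproducing the 95% t-based confidence interval for the IC example.
    from math import sqrt
    from scipy.stats import t

    n, xbar, s = 15, 8.2267, 1.6722
    t_crit = t.ppf(0.975, df=n - 1)        # two-sided 95% => 0.025 in each tail
    bound = t_crit * s / sqrt(n)           # error bound of the mean
    print(round(xbar - bound, 2), round(xbar + bound, 2))   # 7.3 9.15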


45
Sampling Fundamentals
Estimation
Student’s t-test, based on t-distribution

46
Sampling Fundamentals
Estimation
Population mean (Normal distribution)

❑ Example: From a random sample of 36 Karachi civil service personnel, the mean age and the sample standard deviation were found to be 40 years and 4.5 years respectively. Construct a 95 per cent confidence interval for the mean age of civil servants in Karachi.
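❑ Solution sketch (using the usual normal approximation for a large sample): x̅ ± z·(s/√n) = 40 ± 1.96 × (4.5/√36) = 40 ± 1.47, i.e. approximately 38.53 to 41.47 years.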

47
Sampling Fundamentals
Estimation

48
Sampling Fundamentals
Estimation

49
Sampling Fundamentals
Estimation
Tables of normal distribution:

50
Sampling Fundamentals
Estimation

51
Sampling Fundamentals
Estimation

52
Sampling Fundamentals
Estimation
Suggested problems (Walpole):

53
Sampling Fundamentals
Estimation
Suggested problems (Walpole):

54
