Factors Affecting On Students Test Scores
A.H.S.S.S. Ariyarathne
(S/09/807)
(sakitha.sakitha@gmail.com)
4th December 2013
FACULTY OF SCIENCE
UNIVERSITY OF PERADENIYA
ACKNOWLEDGEMENT
Any accomplishment requires the effort of many people, and this report is no
exception. It is the result of a collective effort, with innumerable helping hands
behind the study.
First of all, special thanks are due to Mr. C. Manoj for providing this opportunity
and for helping throughout the study. Mr. N.D. Rupasinghe was also a great help in
completing this research. Thanks are also due to all the academic and non-academic
staff members of the Department of Statistics and Computer Science.
ABSTRACT
Contents
1. INTRODUCTION
1.1 Background of the study
1.2 Importance of the study
1.3 Objectives
2. LITERATURE REVIEW
2.1 Student role performance
2.2 School environment
2.3 Family background
3. METHODOLOGY
3.1 Data Collecting
3.2 Descriptive Statistics
3.2.1 Mean
3.2.2 Standard deviation
3.2.3 Median
3.2.4 Histogram
3.2.5 Box plot
3.2.6 Scatter plot
3.3 Correlation
3.4 Hypothesis Tests
3.4.1 Paired Sample t-test
3.4.2 Shapiro-Wilk test of normality
3.5 Regression Analysis
3.5.1 Regression Model
3.5.2 Model Selection
3.5.3 Multicollinearity
4. RESULTS AND DISCUSSION
4.1 Primary Analysis
4.2 Confirmatory Analysis
5. SUGGESTIONS AND CONCLUSIONS
REFERENCE
Appendix
List of tables
List of figures
1. INTRODUCTION
Several topical areas are commonly linked to academic performance, including
student role performance (SRP), school factors, family factors and peer factors.
Student role performance is how well an individual fulfills the role of a student in
an educational setting. Sex, race, school effort, extra-curricular activities,
deviance and disabilities are all important influences on SRP and have been shown to
affect test scores. School environment factors, such as school size, neighborhood
and relationships between teachers and students, also influence test scores,
according to a previous study by Crosnoe, Johnson and Elder (2004). One’s family
background has also been found to influence student test scores. Majoribanks (1996)
found that socioeconomic status, parental involvement and family size are
particularly important family factors. Peer influence can also affect students’
performance. According to Santor, Messervey and Kusumaker (2000), peer pressure
and peer conformity can lead to an individual participating in risk-taking
behaviors, which have been found to have a negative effect on test scores.
1.3 Objectives
This study takes a holistic approach to analyzing the influences on test scores by
building a regression model. The model includes many of the factors that have
previously been linked to test scores. It consists of a student role performance
(SRP) factor, English; family-level factors such as income and expenditure; and
school factors such as the number of teachers and the number of students. With this
model the researcher intends to identify the factors that influence test scores.
Finally, the researcher is interested in finding the subject in which students
perform well and the subject in which they should improve. Students need to identify
not only their strengths but their weaknesses as well. By understanding their
weaknesses they can take the necessary steps to improve. Therefore the researcher
also intends to find the subject on which students should spend more time.
2. LITERATURE REVIEW
2.1 Student role performance
Student role performance (SRP) is how well an individual fulfills the role of a
student in an educational institution. SRP involves factors such as the sex of the
student, the student’s race, school effort, extracurricular activities, deviant
behavior and student disabilities. Past research (Eitle, 2004) found an academic
achievement gap between the sexes, with boys ahead of girls. However, more recent
research (Chambers and Schreiber, 2005) has shown that the achievement gap has been
narrowing and that in some instances girls have higher academic achievement than
boys. For example, according to Ceballo, McLoyd and Toyokawa (2004), girls have been
found to put extra effort into school, leading to better school performance.
Additionally, studies show that girls perform better in reading than boys, while
boys are found to outperform girls in mathematics and science.
Past research has found many more factors that influence test scores besides sex.
According to Tam and Basset (2004) and Seyfried (1998), race has been shown to play
a major role in the life of a student. Hunt (2005) has shown from a theoretical
point of view that extracurricular activities are viewed as boosting academic
performance. Murdock, Anderman and Hodge (2000) also concluded that student deviance
and delinquency are linked to academic outcomes.
performance and more access to resources such as computers, which have been
shown to enhance academic achievement. Smaller class sizes create more intimate
settings and therefore can increase teacher-student bonding which has also been
shown to have a positive effect on student success. According to Eamon (2005), the
relative social class of a student body also affects academic achievement. Students
from low socioeconomic backgrounds who attend poorly funded schools do not
perform as well as students from higher social classes.
2.3 Family background
Family background is key to a student’s life and, outside of school, is the most
important influence on student learning. It includes factors such as socioeconomic
status, two-parent versus single-parent households, divorce, parenting practices and
aspirations, mental characteristics, family size and neighborhood (Majoribanks
1996). The environment at home is a primary socialization agent and influences a
child’s interest in school and aspirations for the future. According to Teynes
(2002), the socio-economic status (SES) of a child is most commonly determined by
combining parents’ educational level, occupational status and income level. Studies
have repeatedly found that SES affects student outcomes. Students who have a low
SES earn lower scores and are more likely to drop out of school.
Majoribanks (1996) also showed that children from single-parent households do not
perform as well in school as children from two-parent households. There are several
different explanations for this achievement gap. Single-parent households have less
income, and there is a lack of support for the single parent, which increases stress
and conflicts. Single parents often struggle with time management due to balancing
many different areas of life on their own. Some research has also shown that single
parents are less involved with their children and therefore give less encouragement
and have lower expectations of their children than two-parent households.
This study proposes a holistic alternative model that combines the effects of student
role performance (SRP), family background and school environment on students’ test
scores. SRP is a set of behaviors and personal characteristics that affect how well a
student performs in school. English knowledge, sex and race are a few examples of
SRP factors. It is expected that the higher a student’s SRP, the higher the student’s
test score will be. School is the institutional environment that sets the parameters
of a student’s learning environment. The school’s grade span, the number of students
in the school, the number of teachers and the number of computers per classroom are a
few examples of important school factors. School can have a direct effect on test
scores. For example, the teacher-to-student ratio affects all students directly.
School can also have an indirect effect on test scores. For example, only some
students at a particular school might take college preparatory classes, which will
most likely increase their student role performance and thereby indirectly affect
test scores. It is expected that students’ test scores will increase as the quality
of the school increases. Family provides connections to the resources that are needed
to be a successful student. Family income level, expenditure per child and family
size influence the performance of students. This study predicts that as family income
level increases, so will students’ test scores.
3. METHODOLOGY
In order to fulfill the objectives of the study, several statistical techniques were
used. Statistics is the science of conducting studies to collect, organize, analyze,
summarize and draw conclusions from data. The basic foundation of statistics is data.
3.1 Data Collecting
The first step of the statistical process is collecting data. Common methods of data
collection are:
1. Direct observation – Data are collected without any interaction with the
respondent.
2. Questionnaires – Data are collected through a set of written questions that are
answered by respondents.
3. Personal interviews – Data are collected by contacting the respondent personally
and asking questions.
4. Telephone interviews – Data are collected by asking questions over the phone.
5. Indirect oral interviews – Data about a person are collected by asking questions
of another person.
6. Secondary data – Data that have already been collected are used.
Data can be divided into two categories based on their characteristics. Observations
that consist of words or codes are called qualitative data, while observations that
consist of numbers representing amounts or counts are called quantitative data.
Quantitative data can be further divided into two types, discrete and continuous.
Observations which can take any value within a specific range are called continuous
data, and observations which have gaps between consecutive possible values are called
discrete data.
In this study the researcher has used secondary data collected from 420 schools in
California. The data include both quantitative and qualitative variables, and both
discrete and continuous variables, under the following 14 variables: district code,
school name, county, grade span of district, number of students, number of teachers,
percent qualifying for CalWorks (income assistance), percent qualifying for
reduced-price lunch, number of computers, expenditure per student, district average
income in $1000, percent of English learners, average reading score and average math
score.
Grade is the only categorical data that has been collected. In order to reduce the
complexity of the study, the average of the read and math scores was taken as the
overall result of the test. To keep the units of the variables consistent, income was
multiplied by 1000 to obtain the actual dollar value. There were no missing values,
or they had been replaced with 0’s. Math test scores, read test scores and average
test scores were used as dependent variables and the others were used as independent
variables. The school’s grade and the numbers of teachers, students and computers per
classroom were used as school environment variables. Average income level,
expenditure, calWorks and lunch were used as family background variables, while
English was used as the SRP variable.
3.2 Descriptive Statistics
Initially, basic descriptive analyses were carried out. Descriptive statistics are
used to summarize the data in a clear and understandable way. Tables and graphs were
used to present the summary statistics.
3.2.1 Mean
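The mean is computed by considering each and every observation in the series.
$$\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx \quad ; \; X \text{ continuous}$$
$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i \quad ; \; X \text{ discrete} \tag{1}$$
µ – Mean
Xi – ith observation
N – Number of observations
f(x) – Probability density function of X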
3.2.2 Standard deviation
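Standard deviation is the square root of the mean of the squared deviations from the
mean.
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i-\mu)^2} \quad ; \; X \text{ discrete}$$
$$\sigma = \sqrt{\int_{-\infty}^{\infty}(x-\mu)^2 f(x)\,dx} \quad ; \; X \text{ continuous} \tag{2}$$
σ – Standard deviation
Xi – ith observation
µ – Mean
N – Number of observations
f(x) – Probability density function of X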
3.2.3 Median
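The median is the value which divides the series into two equal parts after the
observations have been arranged in either ascending or descending order.
$$M = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ item} \quad ; \; N \text{ is odd}$$
$$M = \text{average of the } \left(\frac{N}{2}\right)^{\text{th}} \text{ and } \left(\frac{N}{2}+1\right)^{\text{th}} \text{ items} \quad ; \; N \text{ is even} \tag{3}$$
M – Median
N – Number of observations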
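The discrete forms of equations (1) to (3) can be checked with a short sketch. The values below are hypothetical and are not taken from the study's data set; the function name `describe` is also an assumption made for illustration. Python's standard `statistics` module is used, with `pstdev` chosen because it divides by N as in equation (2).

```python
from statistics import mean, median, pstdev

# Hypothetical observations, not taken from the study's data set.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

def describe(values):
    """Return the mean, population standard deviation and median of values."""
    return {
        "mean": mean(values),        # sum of the observations divided by N
        "std_dev": pstdev(values),   # divides by N, matching equation (2)
        "median": median(values),    # averages the two middle items when N is even
    }

summary = describe(observations)
print(summary)
```

Here N = 8 is even, so the median is the average of the 4th and 5th ordered items.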
3.2.4 Histogram
size. The rectangles of a histogram are drawn so that they touch each other to
indicate that the original variable is continuous.
3.2.5 Box plot
Box plots display differences between populations without making any assumptions
about the underlying statistical distribution; they are non-parametric. The spacing
between the different parts of the box helps indicate the degree of dispersion
(spread) and skewness in the data, and helps identify outliers. In addition to the
points themselves, box plots allow one to visually estimate various L-estimators,
notably the interquartile range, midhinge, range, mid-range and trimean. Box plots
can be drawn either horizontally or vertically.
3.2.6 Scatter plot
A scatter plot displays the data as a collection of points, with the value of one
variable determining the position on the horizontal axis and the value of the other
variable determining the position on the vertical axis.
A scatter plot can be used when one variable is under the control of the
experimenter. If a parameter exists that is systematically incremented and/or
decremented by the other, it is called the control parameter or independent variable
and is customarily plotted along the horizontal axis. The measured or dependent
variable is customarily plotted along the vertical axis. If no dependent variable
exists, either variable can be plotted on either axis, and the scatter plot
illustrates only the degree of correlation between the two variables.
3.3 Correlation
$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} \tag{4}$$
ρ – Correlation coefficient
cov(X,Y) – Covariance of X and Y
σX – Standard deviation of X
σY – Standard deviation of Y
The correlation coefficient can only take values between -1 and 1. Negative values
indicate that the two variables are inversely associated and positive values indicate
direct association. Values of ρ > 0.7 are considered strong positive relationships,
values of ρ < -0.7 are considered strong negative relationships, and ρ = 0 indicates
that there is no linear relationship between the two variables. Overall, the
correlation coefficient measures the strength and direction of the linear association
between two variables.
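Equation (4) can be sketched directly from its definition. The data below are hypothetical, and the function name `correlation` is an assumption made for illustration; population (divide by N) forms are used for both the covariance and the standard deviations so that the factors of N cancel.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation: cov(X, Y) divided by the product of the standard deviations."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Population covariance of X and Y.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical paired observations, not taken from the study's data set.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
rho = correlation(x, y)
print(round(rho, 4))
```

For these values the covariance is 1.2 and the product of the standard deviations is about 1.549, so ρ is roughly 0.77, which the thresholds above would classify as a strong positive relationship.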
3.4 Hypothesis Tests
Various hypothesis tests were used to make decisions about the data. A statistical
hypothesis test is a method of making decisions using data from a scientific study.
In statistics, a result is called statistically significant if it is unlikely to have
occurred by chance alone, according to a pre-determined threshold probability, the
significance level. These tests are used in determining what outcomes of a study
would lead to a rejection of the null hypothesis for a pre-specified level of
significance; this can help to decide whether results contain enough information to
cast doubt on conventional wisdom, given that conventional wisdom has been used to
establish the null hypothesis. The critical region of a hypothesis test is the set of
all outcomes which cause the null hypothesis to be rejected in favour of the
alternative hypothesis.
Hypothesis
1. Two tailed
H0: µ1-µ2 = 0 vs H1: µ1-µ2 ≠ 0
2. Right tailed
H0: µ1-µ2 ≤ 0 vs H1: µ1-µ2 > 0
3. Left tailed
H0: µ1-µ2 ≥ 0 vs H1: µ1-µ2 < 0
3.4.1 Paired Sample t-test
The paired sample t-test is a more powerful alternative to a two-sample procedure,
such as the two-sample t-test, but it can only be used when the samples are matched.
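A minimal sketch of the paired test statistic follows, assuming the usual form t = d̄ / (s_d / √n) computed on the pairwise differences, with n − 1 degrees of freedom. The scores are hypothetical and the function name is an assumption; the p-value lookup against the t distribution is left to a statistics package.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(first, second):
    """Return (t, degrees of freedom) for matched samples first and second."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # stdev() is the sample standard deviation (divides by n - 1).
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Hypothetical matched scores for five schools, not taken from the study's data.
read_scores = [660, 655, 670, 648, 662]
math_scores = [655, 652, 668, 650, 657]
t_value, df = paired_t_statistic(read_scores, math_scores)
print(round(t_value, 3), df)
```

Because each school contributes one difference, variation between schools cancels out, which is the source of the extra power mentioned above.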
3.4.2 Shapiro-Wilk test of normality
The Shapiro-Wilk test of normality is a hypothesis test used to check whether a given
data set follows a normal distribution.
Hypothesis
H0: Data set follows a normal distribution. vs
H1: Data set doesn’t follow a normal distribution.
3.5 Regression Analysis
Finally, regression models were used to identify the factors affecting test scores.
Regression analysis is a process of estimating the relationship between two or more
variables when it is believed that some form of association exists between them.
Usually X denotes the independent variable and Y denotes the dependent variable.
Regression helps to understand how the dependent variable changes when one
independent variable is varied while the other independent variables are held fixed.
Regression analysis is widely used for prediction and forecasting. It can also be
used to understand which of the independent variables are related to the dependent
variable, and to explore the forms of these relationships.
3.5.1 Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad ; \; i = 1, 2, \ldots, n \tag{5}$$
Yi = dependent variable
Xi = independent variable
β0 = intercept parameter
β1 = slope parameter
εi = random error
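The parameters of model (5) are commonly estimated by ordinary least squares. The sketch below assumes that estimator (the report does not state the fitting method explicitly) and uses hypothetical data; `fit_line` is an illustrative name.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for Y = b0 + b1 * X + error."""
    mean_x, mean_y = mean(xs), mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx                    # estimate of the slope parameter
    intercept = mean_y - slope * mean_x    # estimate of the intercept parameter
    return intercept, slope

# Hypothetical observations, not taken from the study's data set.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(b0, b1)
```

The fitted line always passes through the point of means, which is why the intercept is obtained from the slope and the two sample means.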
After fitting the model to the data, hypothesis tests were used to confirm the
relationship.
Hypothesis
H0: There is no regression relationship between X & Y vs
H1: There is a regression relationship between X & Y
Hypothesis
H0 : β i = 0 vs
H1 : β i ≠ 0
3.5.2 Model Selection
In order to determine which of the available independent variables yield the simplest
adequate model, the backward elimination model selection procedure was used. Model
selection is the process of selecting, from a given set of models, the model which
describes the data set most accurately. There are several standard procedures that
help in the model selection process.
1. Forward selection
In this procedure variables are added to the model one at a time until the
addition of another variable does not significantly improve the model.
2. Backward elimination
This procedure begins with a model that includes all the potential independent
variables. Variables are deleted from the model one at a time until the further
deletion of a variable results in a rejection of the reduced model. This is the
method used by the researcher to obtain the best model for the data.
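The backward elimination procedure can be sketched as follows. This is an illustration under stated assumptions, not the researcher's actual procedure: it uses AIC in the form n·ln(RSS/n) + 2k as the stopping rule, solves the least-squares normal equations in plain Python, and runs on hypothetical data in which only `x1` is related to `y`. All function and variable names are illustrative.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(row[r] * row[c] for row in design) for c in range(p)]
         + [sum(design[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gaussian elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(a[r][k]))
        a[k], a[pivot] = a[pivot], a[k]
        for r in range(k + 1, p):
            factor = a[r][k] / a[k][k]
            for c in range(k, p + 1):
                a[r][c] -= factor * a[k][c]
    # Back substitution for the coefficient estimates.
    beta = [0.0] * p
    for k in reversed(range(p)):
        beta[k] = (a[k][p] - sum(a[k][c] * beta[c] for c in range(k + 1, p))) / a[k][k]
    return sum((y[i] - sum(b * v for b, v in zip(beta, design[i]))) ** 2 for i in range(n))

def aic(columns, y):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * (number of coefficients)."""
    n = len(y)
    return n * math.log(ols_rss(columns, y) / n) + 2 * (len(columns) + 1)

def backward_eliminate(predictors, y):
    """Start from the full model and drop one variable at a time while AIC decreases."""
    kept = dict(predictors)
    current = aic(list(kept.values()), y)
    while kept:
        trials = {name: aic([v for other, v in kept.items() if other != name], y)
                  for name in kept}
        best = min(trials, key=trials.get)
        if trials[best] >= current:
            break                      # no single deletion improves the model
        del kept[best]
        current = trials[best]
    return list(kept)

# Hypothetical data: y depends on x1 only; x2 is unrelated to y.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, -1, 1, 1, -1, -1, 1]
noise = [0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1]
y = [3 + 2 * a + e for a, e in zip(x1, noise)]
selected = backward_eliminate({"x1": x1, "x2": x2}, y)
print(selected)
```

In this example removing `x2` leaves the residual sum of squares unchanged while saving one parameter, so AIC falls and `x2` is dropped; removing `x1` would raise AIC sharply, so the procedure stops with `x1` retained.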
3.5.3 Multicollinearity
reliability of the model as a whole, at least within the sample data themselves; it only
affects calculations regarding individual predictors.
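Measuring multicollinearity
$$\mathrm{VIF}(\beta_j) = \frac{1}{1 - R_j^2} \tag{6}$$
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \tag{7}$$
Here R_j^2 is the R^2 obtained by regressing the j-th independent variable on the
remaining independent variables, SS_res is the residual sum of squares and SS_tot is
the total sum of squares.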
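As a worked illustration of equation (6): when one predictor is regressed on a single other predictor, R² equals the square of their correlation coefficient. The sketch below applies this to the correlation of 0.997116 between students and teachers reported in the correlation table of this report. The two-predictor simplification and the function name are assumptions made for illustration; the VIF values in the full model may differ, although adding further predictors cannot lower R² and so cannot lower the VIF.

```python
def vif_from_correlation(r):
    """VIF of a predictor regressed on one other predictor with correlation r."""
    r_squared = r ** 2            # R^2 of a one-predictor regression equals r^2
    return 1.0 / (1.0 - r_squared)

# Correlation between students and teachers, as reported in the correlation table.
students_teachers_r = 0.997116
vif = vif_from_correlation(students_teachers_r)
print(round(vif, 1))
```

The result is far above the conventional cut-off of 10, which is consistent with the multicollinearity between students and teachers discussed in the results.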
Several remedial measures are available for addressing the problem of
multicollinearity.
1. Deleting variables
Delete the variables which exhibit multicollinearity. The major disadvantage of this
method is that some of the information contained in those variables could be lost.
2. Expanding data
Expand the data with new observations which are specifically designed to break up the
linear dependencies that currently exist among the independent variables.
3. Ridge regression
Ridge regression differs in that it uses a biased estimator for the β values.
All the null hypotheses were tested at the 5% significance level using p-values
obtained from the statistical software packages RStudio version 0.97.316, EasyFit
version 5.5 and Minitab version 16.1.1.
4. RESULTS AND DISCUSSION
According to the above table, most of the variables have high deviations, which helps
to produce a better analysis but also indicates the inconsistency of the data. The
variables computer, students and teachers have standard deviations higher than their
means. Since these variables are non-negative, this means that the data may contain
large outliers or missing values replaced by 0’s. Skewed data can cause this kind of
behavior and can also lead to heteroskedasticity. Differences between the mean and
the median also indicate skewness in the data.
The math and read test scores have low standard deviations, which implies consistency
in the data, and their mean and median values are close compared with the other
variables. This signifies low skewness.
Along with the above results, graphical representations such as histograms, box plots
and scatter plots were used to identify and compare the distributions of the
variables.
The above graph is a histogram of the number of students, with the x-axis
representing the number of students and the y-axis representing the number of
schools. According to the histogram, a large number of schools have a low number of
students, and a small number of schools have a high number of students. It is
therefore a right-skewed histogram in which the frequency (the number of schools)
decreases as the number of students increases.
In this histogram the x-axis represents the number of teachers and the y-axis
represents the number of schools. As shown in the graph, a large number of schools
have a low number of teachers and a small number of schools have a high number of
teachers. It is therefore also a right-skewed histogram in which the frequency (the
number of schools) decreases as the number of teachers increases.
Figure 3: Histogram of calWorks
This is the histogram of the number of computers per classroom. The x-axis and y-axis
represent the number of computers per classroom and the number of schools
respectively. Here also the graph is right skewed, because the frequency (the number
of schools) decreases as the number of computers per classroom increases.
Figure 5: Histogram of expenditure
This graph represents the histogram of expenditure per student. Here also the y-axis
represents the number of schools while the x-axis represents the amount of
expenditure per student. The graph indicates that most of the schools cluster around
the average, $5000-$5500 expenditure per student, and that there are a few schools
with low or high expenditure.
This is also a right-skewed histogram, of district average income. Most of the data
cluster around the left side of the graph. Here the x-axis denotes the average income
of the district in $1000 and the y-axis denotes the number of schools.
The above graph is a histogram of the percentage of students who are English
learners, with the x-axis representing the percentage of students and the y-axis
representing the number of schools. According to the histogram, a large number of
schools have low percentages and a small number of schools have high percentages. It
is therefore a right-skewed histogram in which the frequency (the number of schools)
decreases as the percentage increases.
This histogram represents the average test scores of read and math. Most of the data
cluster around the center of the graph, which confirms the conclusions about
consistency drawn from Table 1. Here the test scores are represented by the x-axis
while the number of schools is represented by the y-axis.
[Figure: Boxplot of calworks, lunch, english; vertical axis: Data]
The above box plots confirm the results obtained from the histograms. According to
the graph, the percentage of students who need income assistance and the percentage
of English learners behave in a similar manner. Their median values are also almost
the same, and both box plots have outliers on the same side. However, the box plot of
the percentage of students who need reduced-price lunch deviates from the other two
box plots. As shown in the diagram, it is spread evenly across the range.
[Figure: Boxplot of read, math; vertical axis: Data]
According to the above box plots, there is not much difference between the read
scores and the math scores, except that the math score box plot has a few outliers.
The scores of both subjects are therefore similarly distributed within the district.
Since box plots do not give a clear picture of how expenditure and income are
related, a scatter plot was used to identify the relationship between the two
variables. Here the x-axis denotes expenditure and the y-axis denotes income.
According to the graph, there is a weak relationship between expenditure and income,
since the observations are spread out all over the graph. Income also varies over a
wide range ($10000-$50000) while expenditure stays within a small range
($4000-$7000).
Figure 12: Scatter plot between students and teachers
Since the number of students and the number of teachers differ greatly in scale, box
plots cannot be used for comparison. Therefore a scatter plot was drawn for the
comparison. In the above graph the x-axis denotes the number of students while the
y-axis denotes the number of teachers. According to the graph, there is a strong
relationship between the two variables. The deviation of the observations from the
fitted line is very low.
              lunch      math       read       students   teachers
Average       -0.86877   0.979143   0.981882   -0.15399   -0.14486
calworks      0.739422   -0.61769   -0.61185   0.090161   0.092645
computer      0.061386   -0.03295   -0.10901   0.928882   0.937242
English       0.653061   -0.56868   -0.69029   0.354879   0.351421
expenditure   -0.06104   0.154989   0.217927   -0.11228   -0.09519
Income        -0.68444   0.699398   0.697819   0.028392   0.043007
lunch         1          -0.82301   -0.87881   0.129234   0.124296
math          -0.82301   1          0.922902   -0.11089   -0.1023
read          -0.87881   0.922901   1          -0.1884    -0.17911
students      0.129234   -0.11089   -0.1884    1          0.997116
teachers      0.124296   -0.1023    -0.17911   0.997116   1
The above table confirms the results obtained from the scatter plots. According to
Figure 11, there is a weak relationship between district average income and
expenditure per student. The table gives ρ = 0.314484 as the correlation coefficient
of expenditure and income, which indicates a weak positive relationship. There are
also weak positive relationships between average test scores and expenditure; income
assistance and English learners; number of computers and English learners; English
learners and number of students; English learners and number of teachers;
expenditure with read and math scores; and lunch with number of students and number
of teachers.
There are weak negative relationships between average test score and the numbers of
teachers and students; income assistance and income; read test score and number of
computers; expenditure and number of students; and math and read test scores with
number of students and number of teachers. There are also values close to zero
(ρ ≈ 0), which indicate that there is almost no association between the two
variables. Average test scores and number of computers; income assistance and number
of computers; income assistance with expenditure, number of students and number of
teachers; expenditure with reduced-price lunch; number of teachers and English
learners; and income with number of students and number of teachers have such
relationships.
The most important relationships, however, are the strong positive and strong
negative relationships. According to the table, the average test score has strong
positive relationships with income and with the math and read test scores. It is
obvious that the average test score has strong positive relationships with the read
and math test scores, as the average is calculated from those two scores. It is
interesting, however, that income and average test score have a correlation
coefficient of ρ = 0.712431. This value indicates that students in districts with a
high average income tend to get high marks in tests. When the test scores are
considered separately, the table shows a reduction in the ρ values. Yet these are
still strong positive relationships, because the read and math test scores are also
highly correlated. It is also no surprise that the percentage of students who need
income assistance and the percentage who need reduced-price lunch have a correlation
of ρ = 0.739422. The number of computers also has strong positive relationships with
both the number of teachers and the number of students. As shown in Figure 12, the
number of students and the number of teachers have a strong positive relationship.
The relationship between English learners and average test scores is strongly
negative, as are the relationships with the read and math scores separately. These
test scores also have strong negative relationships with income assistance, even
though there is a weak positive relationship between income assistance and English
learners. It is obvious that income and reduced-price lunch have a strong negative
relationship. There are also negative relationships between reduced-price lunch and
test scores.
Therefore, according to the above table, test scores are mainly affected by income
and English knowledge. Since the other variables, lunch and calWorks, depend on
income, they also affect test scores.
Using the paired sample t-test, the difference between the two test scores and the
difference between the reduced-price lunch percentages and the income assistance
percentages were examined.
Hypothesis
H0: Mean of read score – Mean of math score ≤ 0 vs
H1: Mean of read score – Mean of math score > 0
Test statistic
p-value = 1.029e-05 < 0.05
Decision
H0 is rejected at the 5% significance level. Therefore the two mean values are
significantly different at the 5% significance level. According to the alternative
hypothesis, the average read test score is higher than the average math test score.
Hypothesis
H0: Mean of lunch – Mean of calWorks ≤ 0 vs
H1: Mean of lunch – Mean of calWorks > 0
Test statistic
p-value = 2.2e-16 < 0.05
Decision
H0 is rejected at the 5% significance level. Therefore the two mean values are
significantly different at the 5% significance level. According to the alternative
hypothesis, the lunch average is higher than the calWorks average.
After that, the Shapiro-Wilk test of normality was used to examine the distributions
of the read test scores, math test scores and average test scores.
Hypothesis
H0: Data set follows a normal distribution. vs
H1: Data set doesn’t follow a normal distribution.
Table 3: Shapiro-Wilk test results
Variable p-value α H0
The above table shows that the variables read test score, math test score and average
test score follow a normal distribution.
Then the following regression model was fitted to the data in order to identify the
factors affecting read test scores.
According to the above model, number of teachers, math test scores, income and
expenditure per student have positive relationships with read test scores, and all
the other variables have negative relationships with the dependent variable.
After fitting the model, it was tested whether there is a regression relationship
between the dependent variable and the independent variables.
Hypothesis
H0: There is no regression relationship between X & Y vs
H1: There is a regression relationship between X & Y
Test statistic
p-value = 2.2e-16 < 0.05
Decision
H0 is rejected at the 5% significance level. Therefore there is a regression
relationship between the dependent variable and the independent variables. Since H0
is rejected, it is necessary to test the individual regression parameters.
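The two-stage testing described above (an overall F-test followed by a t-test for each parameter) can be sketched in R as follows. The data and the three predictors are simulated assumptions. The study's full model contains more variables.

```r
# Simulated stand-in data; coefficients and ranges are invented for illustration.
set.seed(3)
n <- 420
d <- data.frame(
  english = runif(n, 0, 80),
  income  = rnorm(n, mean = 15, sd = 5),
  lunch   = runif(n, 0, 100)
)
d$read <- 680 - 0.2 * d$english + 0.5 * d$income - 0.3 * d$lunch + rnorm(n, sd = 8)

fit <- lm(read ~ english + income + lunch, data = d)
s <- summary(fit)

# Overall F-test. H0: there is no regression relationship (all slopes are zero)
f <- s$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
overall_p

# Individual t-tests. H0: beta_i = 0 for each coefficient
coef(s)[, c("Estimate", "Pr(>|t|)")]
```

The last line gives the same two columns, Estimate and Pr(>|t|), that appear in the parameter table of the report.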
Hypothesis
H0 : β i = 0 vs
H1 : β i ≠ 0
Estimate Pr(>|t|) H0
Table 2 and regression model 8 give contradictory results. In the correlation table,
number of teachers and read test score have a negative relationship, ρ = -0.17911,
but regression model 8 gives them a positive regression coefficient, β = 0.010. This
could be due to multicollinearity between the independent variables. Similarly, the
correlation coefficient between read test score and calWorks is ρ = -0.61185, but
the regression coefficient is β = 0.027. Hence VIF was used to check for
multicollinearity in model 8.
Variable VIF
According to the above two tables, multicollinearity exists among the independent
variables, because the VIF values of students and teachers are greater than 10.
Therefore regression model 8 is not a suitable model for the read test score data.
So the backward elimination model selection procedure with AIC was used to find the
simplest adequate regression model that describes the dependent variable, read test
score, accurately.
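The VIF check with the cut-off of 10 can be illustrated with base R alone, using the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The data below are simulated so that teachers is almost proportional to students. This is an assumption made for the example, and the report itself may have used a packaged VIF function instead.

```r
# Simulated stand-in predictors; the values are invented for illustration.
set.seed(4)
n <- 420
students <- rlnorm(n, meanlog = 7, sdlog = 1)
teachers <- students / 20 + rnorm(n, sd = 5)   # nearly proportional to students
income   <- rnorm(n, mean = 15, sd = 5)
X <- data.frame(students, teachers, income)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)

# Predictors flagged by the VIF > 10 rule
names(vif)[vif > 10]
```

In this sketch only the two nearly collinear predictors are flagged, while the unrelated predictor has a VIF close to 1.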
Read = 261.6 - 0.1841*English + 0.002753*expenditure + 0.0001212*income
       - 0.2031*lunch + 0.5958*math - 0.0001204*students (9)
This model implies that a one-unit increase in English decreases the read test score
by 0.1841 units when the other variables are held fixed. Hence read test score and
percentage of English learners have a negative relationship, as mentioned in table 2.
Read test score also has negative relationships with the percentage of students who
need price reduced lunch and the number of students. A negative relationship, also
known as an inverse relationship, means that the dependent variable decreases when
the independent variable increases. Since the coefficients of expenditure, income and
math are positive, they have positive relationships with the dependent variable.
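Backward elimination with AIC, the procedure that led to the reduced model, can be sketched with `step()` from base R. The data are simulated, and `noise1` and `noise2` are hypothetical predictors with no real effect, included only so that the procedure has candidates to drop.

```r
# Simulated stand-in data; coefficients are invented for illustration.
set.seed(5)
n <- 420
d <- data.frame(
  english = runif(n, 0, 80),
  lunch   = runif(n, 0, 100),
  income  = rnorm(n, mean = 15, sd = 5),
  noise1  = rnorm(n),      # hypothetical irrelevant predictor
  noise2  = rnorm(n)       # hypothetical irrelevant predictor
)
d$read <- 680 - 0.2 * d$english - 0.3 * d$lunch + 0.5 * d$income + rnorm(n, sd = 8)

# Start from the full model and drop terms while AIC keeps decreasing
full    <- lm(read ~ ., data = d)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
kept <- attr(terms(reduced), "term.labels")
kept
```

The reduced model never has a higher AIC than the full model, and predictors with a strong effect stay in it. Whether a pure-noise predictor is dropped depends on the sample.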
To identify the factors affecting math test scores, the same steps were followed as
above. Using math test score as the dependent variable and all the other variables as
independent variables, a model was fitted to the data.
The above model implies that math test scores have direct relationships with number
of students, read test scores, average income, percentage of English learners and
number of computers. Number of teachers, expenditure and income assistance have
inverse relationships.
Hypothesis
H0: There is no regression relationship between X & Y vs
H1: There is a regression relationship between X & Y
Test statistic
p-value = 2.2e-16 < 0.05
Decision
H0 is rejected at the 5% significance level. Therefore there is a regression
relationship between the dependent variable and the independent variables. Since H0
is rejected, it is necessary to test the individual regression parameters.
Hypothesis
H0 : β i = 0 vs
H1 : β i ≠ 0
Estimate Pr(>|t|) H0
The above table shows that there is no significant relationship between the dependent
variable, math test score, and the independent variables number of teachers, number
of students, percentage of students who need price reduced lunch, schools’ grade,
number of computers and income assistance program. The relationships between math
test score and the independent variables read test score, expenditure and percentage
of English learners are significant.
Again, table 2 and regression model 10 give contradictory results. In the correlation
table, expenditure and math test score have a positive relationship, ρ = 0.154989,
but regression model 10 gives them a negative regression coefficient, β = -0.0019.
Similarly, the correlation coefficient between math test score and computer is
ρ = -0.03295, but the regression coefficient is β = 0.0029. Hence VIF was used to
check for multicollinearity in model 10.
Variable VIF
According to the above two tables, multicollinearity exists among the independent
variables, because the VIF values of students and teachers are greater than 10.
Therefore regression model 10 is not a suitable model for the math test score data.
So, as above, backward elimination with AIC was used to find the most appropriate
regression model for the math test score data.
This model implies that a one-unit increase in English increases the math test score
by 0.0953 units when the other variables are held fixed. Hence math test score and
percentage of English learners have a direct relationship. Math test score also has
negative relationships with the percentage of students who need income assistance
and with expenditure. Since the coefficients of percentage of English learners,
income, read and number of computers are positive, they have positive relationships
with the dependent variable.
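The disagreement in sign between a correlation coefficient and a regression coefficient, which the report attributes to multicollinearity, can be reproduced on simulated data. In the sketch below the score is built to fall with the number of students and to rise with the number of teachers once students is held fixed. All numbers are assumptions chosen to make the effect visible.

```r
# Simulated stand-in data; coefficients are invented for illustration.
set.seed(6)
n <- 420
students <- rlnorm(n, meanlog = 7, sdlog = 1)
teachers <- students / 20 + rnorm(n, sd = 5)   # highly correlated with students
score    <- 660 - 0.03 * students + 0.5 * teachers + rnorm(n, sd = 5)

# Pairwise correlation: teachers appears to lower the score
rho <- cor(score, teachers)
rho

# Multiple regression: with students held fixed, the teachers coefficient is positive
fit  <- lm(score ~ students + teachers)
beta <- unname(coef(fit)["teachers"])
beta
```

The correlation mixes the effect of teachers with that of students, because large schools have more of both, while the regression coefficient measures the effect of teachers for a fixed number of students. Opposite signs are therefore possible without either number being wrong.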
The same procedure was then applied to the average test score. The following reduced
model was obtained from the backward elimination model selection procedure and AIC
values.
The model shows that test scores depend on the variables calWorks, English,
expenditure, income and lunch. Among them, calWorks, English and lunch have a
negative impact on test scores, i.e. when the percentages of those variables increase,
the test scores of students decrease. Expenditure and income have positive
relationships with the test scores.
5. SUGGESTIONS AND CONCLUSIONS
According to table 1, the maximum percentage of students who need income assistance
is 78.9942 and the maximum percentage of students who need price reduced lunch is
100. Also, according to the paired sample t-test, the mean percentage of students
who need price reduced lunch is higher than the mean percentage of students who need
income assistance. These results indicate that students prefer price reduced lunch
even though they don’t need income assistance. Therefore reducing the price of food
would be a more appropriate plan than income assistance.
The average number of computers per classroom has a standard deviation higher than
its mean value, which suggests that there are large outliers. Also, the minimum value
of computer is 0, which means that there are some schools with no computers, while
the maximum value, 3324, implies that some schools have more computers than needed. A
problem in the distribution of resources can be identified clearly. In modern society
computer knowledge is a must. Therefore a proper plan of resource distribution is
needed.
Figure 12, the scatter plot of number of teachers against number of students, implies
that there is a strong relationship between those two variables. Also, according to
table 2, the correlation coefficient is ρ = 0.997116, which indicates a strong
positive relationship between number of teachers and number of students. Figure 12
also shows that no points deviate much from the fitted line, which implies that the
ratio between number of students and number of teachers is maintained throughout the
state, i.e. every school has an appropriate number of teachers for its number of
students.
The paired sample t-test confirms that the average read test score is higher than the
average math test score. But table 1 indicates that the maximum values of both
variables imply otherwise, i.e. the maximum read test score, 704, is lower than the
maximum math test score, 709.5. The minimum values imply the same, i.e. the minimum
read test score, 604.5, is lower than the minimum math test score, 605.4. Therefore
the math test score data are right skewed, even though table 3 confirms that both
variables follow normal distributions. So students need to spend more time on
mathematics than on reading.
Regression model 12 indicates that students’ average test scores are associated with
income, expenditure, percentage of students who need income assistance, percentage
of students who need price reduced lunch and percentage of English learners. It is
to be expected that expenditure, and also calWorks and lunch, depend on income, even
though this is not confirmed by the scatter plot in figure 11. Hence students’ test
scores depend basically on income and English knowledge. Income has a positive
relationship with test scores and English has a negative relationship, which means
that when the income level is high students tend to get high scores, and when English
knowledge is low students get low scores. But analyzing the two test scores
separately gives different results. According to regression model 9, read test
scores are related to number of students. Math test scores have the highest impact
on read test scores because they are highly correlated, as shown in table 2,
ρ = 0.922902. Regression model 11 gives more unusual results, such as β = 0.0953
between math test scores and English and β = -0.0018 between expenditure and math
test scores. In addition, math test scores have a considerable relationship with
number of computers per classroom. These results could be due to multicollinearity
among the independent variables, such as the relationships between expenditure and
income, lunch and calWorks, and number of students and number of teachers.
However, this study, along with numerous others, has found that income level and
English knowledge are extremely important factors that influence student
achievement. Therefore calWorks is a good project that helps students who need
income assistance so that they can improve their test scores. With income
assistance, students can reach resources which they can’t get in schools. It would
also be better if students had good English learning materials such as text books,
good teachers and access to the internet. Reducing the price of lunch would be
another good way of improving the test scores of students, apart from income
assistance.
The results of the study would be more accurate if there were more details about the
families, such as the average number of members in a family, and about the schools,
such as the number of male and female students and whether the school is private or
public. Besides understanding the factors which affect students’ test results, this
study could be used to identify the factors which are more important to a school and
how a school’s status can be improved.
REFERENCES
Battle, J., & Lewis, M. (2002). The increasing significance of class: The relative effects of
race and socioeconomic status on academic achievement. Journal of poverty, 6(2), 21-35.
Ceballo, R., McLoyd, V. C., & Toyokawa, T. (2004). The influence of neighborhood quality on
adolescents’ educational values and school effort. Journal of Adolescent Research, 19(6),
716-739.
Crosnoe, R., Johnson, M. K., & Elder, G. H. (2004). Intergenerational bonding in school: The
behavioral and contextual correlates of student-teacher relationships. Sociology of
Education, 77(1), 60-81.
Hunt, S. A., Abraham, W. T., Chin, M. H., Feldman, A. M., Francis, G. S., Ganiats, T. G., ... &
Riegel, B. (2005). ACC/AHA 2005 guideline update for the diagnosis and management of
chronic heart failure in the adult a report of the American College of Cardiology/American
Heart Association Task Force on Practice Guidelines (Writing Committee to Update the 2001
Guidelines for the Evaluation and Management of Heart Failure): developed in collaboration
with the American College of Chest Physicians and the International Society for Heart and
Lung Transplantation: endorsed by the Heart Rhythm Society. Circulation, 112(12),
e154-e235.
Stolzenberg, L., D’Alessio, S. J., & Eitle, D. (2004). A multilevel test of racial threat
theory. Criminology, 42, 673-698.
Tam, M. Y. S., & Bassett, G. W. (2004). Does diversity matter? Measuring the impact of high
school diversity on freshman GPA. Policy studies journal, 32(1), 129-143.
Appendix
Data set
Used R codes
Descriptive statistics
Graphs
Histogram
Box plot
Scatter plot
Correlation
Rcmdr> cor(Dataset[,c("Average","calworks","computer","district","english",
Rcmdr+ "expenditure","Income1","lunch","math","read")], use="complete.obs")
Rcmdr> shapiro.test(Dataset$Average)
Regression model