
FACTORS AFFECTING STUDENTS’ TEST SCORES

ST306- Data analysis and report writing

A.H.S.S.S. Ariyarathne
(S/09/807)
(sakitha.sakitha@gmail.com)
4th December 2013

DEPARTMENT OF STATISTICS AND COMPUTER SCIENCE

FACULTY OF SCIENCE

UNIVERSITY OF PERADENIYA

ACKNOWLEDGEMENT

Any accomplishment requires the effort of many people, and this study is no exception.
The report being submitted today is the result of a collective effort. There are
innumerable helping hands behind this study.

First of all, special thanks are due to Mr. C. Manoj for giving this opportunity and
helping throughout the study. Mr. N.D. Rupasinghe was also a great help in completing
this research. Thanks are also due to all the academic and non-academic staff members
of the Department of Statistics and Computer Science.

ABSTRACT

This research addresses the increasing importance of students’ test scores by
examining different factors that influence them. Average test scores of schools in
California during the period 1998-1999 are examined using three models. To this end,
data on 14 variables were collected from 420 schools under three categories:
students’ test performance, school characteristics and district characteristics.
Initially, basic descriptive analyses were carried out on all numerical variables.
Charts, graphs and tables were used to describe the data in each category. The
Pearson product-moment correlation was the first analytical tool used to identify the
relationships among variables. Finally, linear regression models were fitted to
identify which factors affect students’ test results and which do not. Hypothesis
tests were used throughout the analysis in order to make decisions more accurately
and with more confidence.

Key words: Test scores, Correlation, Regression, Hypothesis test

Contents

1. INTRODUCTION
   1.1 Background of the study
   1.2 Importance of the study
   1.3 Objectives
2. LITERATURE REVIEW
   2.1 Student role performance
   2.2 School environment
   2.3 Family background
3. METHODOLOGY
   3.1 Data Collecting
   3.2 Descriptive Statistics
       3.2.1 Mean
       3.2.2 Standard deviation
       3.2.3 Median
       3.2.4 Histogram
       3.2.5 Box plot
       3.2.6 Scatter plot
   3.3 Correlation
   3.4 Hypothesis Tests
       3.4.1 Paired Sample t-test
       3.4.2 Shapiro-Wilk test of normality
   3.5 Regression Analysis
       3.5.1 Regression Model
       3.5.2 Model Selection
       3.5.3 Multicollinearity
4. RESULTS AND DISCUSSION
   4.1 Primary Analysis
   4.2 Confirmatory Analysis
5. SUGGESTIONS AND CONCLUSIONS
REFERENCE
Appendix

List of tables

Table 1: Summary of descriptive statistics
Table 2: Pearson product-moment correlation
Table 3: Shapiro-Wilk test results
Table 4: Test results of regression parameters I
Table 5: VIF values of model 8
Table 6: Test results of regression parameters II
Table 7: VIF values of model 10

List of figures

Figure 1: Histogram of students
Figure 2: Histogram of teachers
Figure 3: Histogram of calWorks
Figure 4: Histogram of computers
Figure 5: Histogram of expenditure
Figure 6: Histogram of income
Figure 7: Histogram of English
Figure 8: Histogram of average
Figure 9: Box plots of calWorks, lunch and English
Figure 10: Box plots of read and maths scores
Figure 11: Scatter plot between expenditure and income
Figure 12: Scatter plot between students and teachers

1. INTRODUCTION

1.1 Background of the study

Education is the process of receiving or giving systematic instruction. Equal access
to education is among the basic human rights to which everyone is entitled. Thomas,
Wang and Fan (2001) show that educational gaps between various groups in
many countries are staggering. According to Battle and Lewis (2002), a person’s
education is closely linked to their life chances, income and well-being. Therefore it
is important to have a clear understanding of what benefits or hinders one’s
educational attainment. Because a person’s educational progress depends largely on
test scores, it is important to examine what factors influence students’ test scores.
A better understanding of students’ test scores could help determine what should be
done in order to improve them.

1.2 Importance of the study

Several topical areas are commonly linked to academic performance, including student
role performance (SRP), school factors, family factors and peer factors. Student role
performance is how well an individual fulfills the role of a student in an
educational setting. Sex, race, school effort, extra-curricular activities, deviance
and disabilities are all important influences on SRP and have been shown to affect
test scores. School environment factors, such as school size, neighborhood and
relationships between teachers and students, also influence test scores, according
to Crosnoe, Johnson and Elder (2004). One’s family background has also been found to
influence student test scores. Majoribanks (1996) found that socioeconomic status,
parental involvement and family size are particularly important family factors. Peer
influence can also affect students’ performance. According to Santor, Messervey and
Kusumaker (2000), peer pressure and peer conformity can lead to an individual
participating in risk-taking behaviors, which have been found to have a negative
effect on test scores.

1.3 Objectives

This study takes a holistic approach to analyzing the influences on test scores by
creating a regression model. This model includes many of the factors that have
previously been linked to test scores. It consists of a student role performance
(SRP) factor (English), family-level factors such as income and expenditure, and
school factors such as the number of teachers and the number of students. With this
model the researcher intends to identify the factors that influence test scores.

The relationship between the number of students and the number of teachers is also
examined in the study. In order to give students a proper education, it is necessary
to give full attention to every student. Hence there should be a common ratio
between teachers and students. As a secondary objective, the researcher therefore
investigates the relationship between the number of teachers and the number of
students.

Finally, the researcher is interested in finding in which subject students perform
well and in which subject they should improve. Students need to identify not only
their strengths but their weaknesses as well. By understanding their weaknesses they
can take the necessary steps to improve themselves. Therefore the researcher also
intends to find the subject on which students should spend more time.

2. LITERATURE REVIEW

2.1 Student role performance

Student role performance (SRP) is how well an individual fulfills the role of a
student in an educational institution. SRP involves factors such as the sex of the
student, the student’s race, school effort, extracurricular activities, deviant
behavior and student disabilities. Past research (Eitle, 2004) found an academic
achievement gap between the sexes, with boys ahead of girls. However, more recent
research (Chambers and Schreiber, 2005) has shown that the achievement gap has been
narrowing and that in some instances girls have higher academic achievement than
boys. For example, according to Ceballo, McLoyd and Toyokawa (2004), girls have been
found to exert extra effort at school, leading to better school performance.
Additionally, studies show that girls perform better in reading than boys, whereas
boys are found to outperform girls in mathematics and science.

Past research has found many more factors, other than sex, that influence test
scores. According to Tam and Basset (2004) and Seyfried (1998), race has been
shown to play a major role in the life of a student. Hunt (2005) argued from a
theoretical point of view that extracurricular activities boost academic
performance. Murdock, Anderman and Hodge (2000) also concluded that student deviance
and delinquency are linked to academic outcomes.

2.2 School environment

Students’ educational outcomes and academic success are greatly influenced by the
type of school that they attend. School factors include school structure, school
composition and school climate. The school one attends is the institutional
environment that sets the parameters of a student’s learning experience. Depending
on the environment, a school can either open or close the doors that lead to academic
achievement. Crosnoe, Johnson and Elder (2004) suggested that school sector (public
or private) and class size are two important components of schools. Furthermore,
private schools tend to have both better funding and smaller class sizes than public
schools. The additional funding of private schools leads to better academic
performance and more access to resources such as computers, which have been
shown to enhance academic achievement. Smaller class sizes create more intimate
settings and can therefore increase teacher-student bonding, which has also been
shown to have a positive effect on student success. According to Eamon (2005), the
relative social class of a student body also affects academic achievement. Students
from low socioeconomic backgrounds who attend poorly funded schools do not
perform as well as students from higher social classes.

Crosnoe et al. (2004) define school climate as “the general atmosphere of a school”.
School climate is closely related to the interpersonal relations between students and
teachers. Trust between students and teachers increases if a school encourages
teamwork. Research shows that students who trust their teachers are more motivated
and as a result perform better in school.

2.3 Family background

Family background is key to a student’s life and, outside of school, is the most
important influence on student learning. It includes factors such as socioeconomic
status, two-parent versus single-parent households, divorce, parenting practices and
aspirations, mental characteristics, family size and neighborhood (Majoribanks
1996). The environment at home is a primary socialization agent and influences a
child’s interest in school and aspirations for the future. According to Teynes (2002),
the socioeconomic status (SES) of a child is most commonly determined by
combining the parents’ educational level, occupational status and income level.
Studies have repeatedly found that SES affects student outcomes. Students who have a
low SES earn lower scores and are more likely to drop out of school.

Majoribanks (1996) also showed that children from single-parent households do not
perform as well in school as children from two-parent households. There are several
different explanations for this achievement gap. Single-parent households have less
income, and there is a lack of support for the single parent, which increases stress
and conflict. Single parents often struggle with time management issues due to
balancing many different areas of life on their own. Some research has also shown
that single parents are less involved with their children and therefore give less
encouragement and have lower expectations of their children than two-parent
households.

This study proposes a holistic alternative model that combines the effects of student
role performance (SRP), family background and school environment on students’ test
scores. SRP is a set of behaviors and personal characteristics that affect how well a
student performs in school. English knowledge, sex and race are a few examples of
SRP factors. It is expected that the higher a student’s SRP, the higher the student’s
test score will be. School is the institutional environment that sets the parameters
of a student’s learning environment. The school’s grade span, the number of students
in the school, the number of teachers and the number of computers per classroom are
a few examples of important school factors. School can have a direct effect on test
scores. For example, the teacher to student ratio affects all students directly.
School can also have an indirect effect on test scores. For example, only some
students at a particular school might take college preparatory classes, which will
most likely increase their student role performance and thereby indirectly affect
test scores. It is expected that students’ test scores will increase as the quality
of the school increases. Family provides connections to the resources that are needed
to be a successful student. Family income level, expenditure per child and family
size influence the performance of students. This study predicts that as family income
level increases, so will students’ test scores.

3. METHODOLOGY

In order to fulfill the objectives of the study, several statistical techniques were
used. Statistics is the science of conducting studies to collect, organize, analyze,
summarize and draw conclusions from data. The basic foundation of statistics is data.
The first step of the statistical process is collecting data.

3.1 Data Collecting

There are several types of data collection methods.

1. Direct observation – Data are collected without any interaction with
the respondent.
2. Questionnaires – Data are collected by presenting written questions that are
to be answered by respondents.
3. Personal interviews – Data are collected by contacting the respondent personally
and asking questions.
4. Telephone interviews – Data are collected by asking questions over the phone.
5. Indirect oral interviews – Data about a person are collected by asking questions
of another person.
6. Secondary data – Data that have already been collected are used.

Data can be divided into two categories based on their characteristics. Observations
that consist of words or codes are called qualitative data, while observations that
consist of numbers representing an amount or a count are called quantitative data.
Quantitative data can in turn be divided into two types, discrete and continuous.
Observations which can take any value within a specific range are called continuous
data, and observations which contain gaps between two consecutive values are called
discrete data.

In this study the researcher has used secondary data collected from 420 schools in
California. The data include both quantitative and qualitative variables, and both
discrete and continuous variables, under the following 14 variables: district code,
school name, county, grade span of district, number of students, number of teachers,
percent qualifying for CalWorks (income assistance), percent qualifying for reduced
price lunch, number of computers, expenditure per student, district average income
in $1000, percent of English learners, average reading score and average math score.
Grade span is the only categorical variable used in the analysis. In order to reduce
the complexity of the study, the average of the read and math scores was taken as the
overall result of the test. Also, to keep the units of the variables in line, income
was multiplied by 1000 to obtain the actual dollar value. There were no missing
values, or they have been replaced with 0’s. Math test scores, read test scores and
average test scores were used as dependent variables and the others were used as
independent variables. The school’s grade span and the numbers of teachers, students
and computers per classroom were used as school environment variables. Average income
level, expenditure, calWorks and lunch were used as family background variables,
while English was used as the SRP variable.

3.2 Descriptive Statistics

Initially basic descriptive analyses were carried out. Descriptive statistics are used to
summarize the data in a clear and understandable way. Tables and graphs were used
to present the summary statistics.

3.2.1 Mean

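The mean is computed by considering each and every observation in the series.

$$\mu = \int_{-\infty}^{\infty} x \, f(x) \, dx \quad ; \; X \text{ continuous}$$

$$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i \quad ; \; X \text{ discrete} \tag{1}$$

µ – Mean
Xi – ith observation
N – Number of observations
f(x) – Probability density function of X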

3.2.2 Standard deviation

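Standard deviation is the square root of the mean of the squared deviations from the
mean.

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2} \quad ; \; X \text{ discrete}$$

$$\sigma = \sqrt{\int_{-\infty}^{\infty} (x - \mu)^2 \, f(x) \, dx} \quad ; \; X \text{ continuous} \tag{2}$$

σ – Standard deviation
Xi – ith observation
µ – Mean
N – Number of observations
f(x) – Probability density function of X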

3.2.3 Median

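The median is the value which divides the series into two equal parts after the
observations have been arranged in either ascending or descending order.

$$M = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ item} \quad ; \; N \text{ is odd}$$

$$M = \text{average of the } \left(\frac{N}{2}\right)^{\text{th}} \text{ and } \left(\frac{N}{2}+1\right)^{\text{th}} \text{ items} \quad ; \; N \text{ is even} \tag{3}$$

M – Median
N – Number of observations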
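
The three summary measures defined in sections 3.2.1 to 3.2.3 can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: the function name describe and the sample values are hypothetical, and the standard deviation uses the divide-by-N form implied by the wording "mean of the squared deviations".

```python
def describe(values):
    """Return (mean, standard deviation, median) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Divide by N (not N - 1): the text defines the standard deviation as the
    # square root of the mean of the squared deviations from the mean.
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    ordered = sorted(values)
    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]                            # ((N + 1) / 2)th item
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2   # mean of the two middle items
    return mean, std, median


# Hypothetical observations, chosen only to make the arithmetic easy to follow.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mean, std, median = describe(sample)
print(mean, std, median)
```

Statistical packages often report the sample standard deviation (dividing by N - 1) instead, so values obtained this way may differ slightly from software output.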

3.2.4 Histogram

A histogram is a representation of tabulated frequencies, shown as adjacent
rectangles erected over discrete intervals, each with an area equal to the
frequency of the observations in the interval. The height of a rectangle is therefore
equal to the frequency density of the interval, i.e., the frequency divided by the
width of the interval. The total area of the histogram is equal to the number of
observations. A histogram may also be normalized to display relative frequencies. It
then shows the proportion of cases that fall into each of several categories, with
the total area equalling 1. The categories are usually specified as consecutive,
non-overlapping intervals of a variable. The intervals must be adjacent, and often
are chosen to be of the same size. The rectangles of a histogram are drawn so that
they touch each other to indicate that the original variable is continuous.
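
The relation between frequency, interval width and frequency density described above can be checked with a short sketch. The function name, the observations and the interval edges are hypothetical; the convention that each interval includes its lower edge but not its upper edge (except the last interval) is an assumption made for the illustration.

```python
def histogram(values, edges):
    """Return (counts, widths, densities) for the intervals defined by edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            is_last = k == len(counts) - 1
            # Intervals are [lower, upper); the last one also includes its upper edge.
            if edges[k] <= v < edges[k + 1] or (is_last and v == edges[k + 1]):
                counts[k] += 1
                break
    widths = [edges[k + 1] - edges[k] for k in range(len(counts))]
    # Rectangle height = frequency density = frequency / interval width.
    densities = [c / w for c, w in zip(counts, widths)]
    return counts, widths, densities


# Hypothetical observations and unequal interval widths.
values = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]
edges = [0, 4, 6, 10]
counts, widths, densities = histogram(values, edges)

# The total area (height x width summed over rectangles) equals the number of observations.
total_area = sum(d * w for d, w in zip(densities, widths))
print(counts, densities, total_area)
```

With equal-width intervals the density and the raw frequency differ only by a constant factor, which is why many histograms simply plot counts on the vertical axis.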

3.2.5 Box plot

In descriptive statistics, a box plot is a convenient way of graphically depicting
groups of numerical data through their quartiles. Box plots may also have lines
extending vertically from the boxes (whiskers), indicating variability outside the
upper and lower quartiles. Outliers may be plotted as individual points.

Box plots display differences between populations without making any assumptions
about the underlying statistical distribution: they are non-parametric. The spacing
between the different parts of the box helps indicate the degree of dispersion
(spread) and skewness in the data, and helps identify outliers. In addition to the
points themselves, they allow one to visually estimate various L-estimators, notably
the interquartile range, midhinge, range, mid-range and trimean. Box plots can be
drawn either horizontally or vertically.
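
The quantities that a box plot displays can be computed as in the sketch below. It is a minimal illustration under stated assumptions: the text does not say how quartiles are interpolated or how outliers are flagged, so the "inclusive" quartile method of Python's statistics module and the common 1.5 x IQR fence rule are assumed here, and the data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, interquartile range, midhinge and fence-based outliers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "midhinge": (q1 + q3) / 2,
        "outliers": outliers,
    }


# Hypothetical data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(data)
print(summary)
```

Different software packages use slightly different quartile definitions, so the box edges drawn by a plotting tool may not match these values exactly.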

3.2.6 Scatter plot

A plot in which the data are displayed as a collection of points, each having the
value of one variable determining the position on the horizontal axis and the value
of the other variable determining the position on the vertical axis, is called a
scatter plot.

A scatter plot is used when a variable exists that is under the control of the
experimenter. If a parameter exists that is systematically incremented and/or
decremented by the experimenter, it is called the control parameter or independent
variable and is customarily plotted along the horizontal axis. The measured
or dependent variable is customarily plotted along the vertical axis. If no dependent
variable exists, either type of variable can be plotted on either axis. A scatter
plot illustrates only the degree of correlation between two variables, not causation.

3.3 Correlation

After obtaining descriptive statistics for all numerical variables, the correlation
coefficient was used to identify the dependency between variables. Correlation is a
measure of the strength and direction of the linear relationship between two
variables, defined as the covariance of the variables divided by the product of their
standard deviations.

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \, \sigma_Y} \tag{4}$$

ρ – Correlation coefficient
cov(X,Y) – Covariance of X and Y
σX – Standard deviation of X
σY – Standard deviation of Y

The correlation coefficient can only take values between -1 and 1. Negative values
indicate that the two variables are inversely associated and positive values indicate
direct association. Values of ρ > 0.7 are considered strong positive relationships,
values of ρ < -0.7 are considered strong negative relationships, and ρ = 0 indicates
that there is no linear relationship between the two variables. Overall, the
correlation coefficient measures how strongly one variable is linearly related to
another.
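
The definition in equation (4), covariance divided by the product of the standard deviations, translates directly into code. The sketch below is illustrative only; the function name pearson_r and the three short series are hypothetical and are not taken from the study data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and standard deviations use the same divisor (N), which cancels.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)


# Hypothetical series: y rises with x, z falls as x rises.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [10, 8, 6, 4, 2]
r_xy = pearson_r(x, y)
r_xz = pearson_r(x, z)
print(r_xy, r_xz)
```

Because the coefficient captures only linear association, two variables with a strong curved relationship can still produce a value close to 0.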

3.4 Hypothesis Tests

Various hypothesis tests were used to make decisions about the data. A statistical
hypothesis test is a method of making decisions using data from a scientific study.
In statistics, a result is called statistically significant if it is unlikely to have
occurred by chance alone, according to a pre-determined threshold probability, the
significance level. These tests are used to determine what outcomes of a study would
lead to a rejection of the null hypothesis for a pre-specified level of
significance; this can help to decide whether results contain enough information to
cast doubt on conventional wisdom, given that conventional wisdom has been used
to establish the null hypothesis. The critical region of a hypothesis test is the set
of all outcomes which cause the null hypothesis to be rejected in favour of the
alternative hypothesis.

3.4.1 Paired Sample t-test

A paired sample t-test is used to determine whether there is a significant difference
between the average values of the same measurement made under two different
conditions. Both measurements are made on each unit in a sample, and the test is
based on the paired differences between these two values (µ1-µ2). The form of the
null hypothesis varies according to the situation.
µ1 = Mean of the 1st sample
µ2 = Mean of the 2nd sample

Hypothesis

1. Two tailed
H0: µ1-µ2 = 0 vs H1: µ1-µ2 ≠ 0
2. Right tailed
H0: µ1-µ2 ≤ 0 vs H1: µ1-µ2 > 0
3. Left tailed
H0: µ1-µ2 ≥ 0 vs H1: µ1-µ2 < 0

The paired sample t-test is a more powerful alternative to a two sample procedure,
such as the two sample t-test, but can only be used when we have matched samples.
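
The text does not state the test statistic, so the sketch below assumes the standard textbook form: the mean of the paired differences divided by its standard error, compared against a t distribution with n - 1 degrees of freedom. The function name and the paired values are hypothetical.

```python
import math


def paired_t_statistic(first, second):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (divisor n - 1).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# Hypothetical paired scores for the same four units under two conditions.
first = [10, 12, 9, 11]
second = [8, 11, 8, 7]
t_stat, df = paired_t_statistic(first, second)
print(t_stat, df)
```

The p-value is then read from the t distribution with df degrees of freedom, using one or both tails depending on which of the three hypothesis forms above is being tested.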

3.4.2 Shapiro-Wilk test of normality

The Shapiro-Wilk test of normality is a hypothesis test used to check whether a given
data set follows a normal distribution or not.
Hypothesis
H0: Data set follows a normal distribution. vs
H1: Data set doesn’t follow a normal distribution.
3.5 Regression Analysis

3.5.1 Regression Model

Finally, regression models were used to identify the factors that affect test scores.
Regression analysis is a process of estimating the relationship between two or more
variables when it is believed that some form of association exists between these
variables. Usually X denotes the independent variable and Y denotes the dependent
variable. Regression helps to understand how the dependent variable changes when
one independent variable is varied while the other independent variables are held
fixed. Regression analysis is widely used for prediction and forecasting. It can also
be used to understand which of the independent variables are related to the dependent
variable, and to explore the forms of these relationships.

Simple linear regression model with one independent variable,

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad ; \; i = 1, 2, \ldots, n \tag{5}$$

Yi = dependent variable
Xi = independent variable
β0 = intercept parameter
β1 = slope parameter
εi = random error

After fitting the model to the data, hypothesis tests were used to confirm the
relationship.

Hypothesis
H0: There is no regression relationship between X & Y vs
H1: There is a regression relationship between X & Y

If H0 is rejected, it is necessary to test the individual regression parameters.

Hypothesis
H0 : β i = 0 vs
H1 : β i ≠ 0
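
As an illustration of model (5) and of the parameter test above, the sketch below fits a simple linear regression by ordinary least squares and computes the t statistic for the slope. The estimation method is not named in the text, so least squares is an assumption here; the function name and the data are hypothetical.

```python
import math


def fit_simple_regression(x, y):
    """Least-squares estimates of the intercept and slope, plus the slope's t statistic."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx                 # slope estimate
    b0 = mean_y - b1 * mean_x      # intercept estimate
    # Residual sum of squares and standard error of the slope (n - 2 degrees of freedom).
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    t_b1 = b1 / se_b1              # statistic for H0: slope = 0
    return b0, b1, t_b1


# Hypothetical data lying close to a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1, t_b1 = fit_simple_regression(x, y)
print(b0, b1, t_b1)
```

A large absolute t value, judged against the t distribution with n - 2 degrees of freedom, leads to rejecting H0 that the slope is zero.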

3.5.2 Model Selection

In order to determine which of the available independent variables yield the simplest
adequate model, the backward elimination model selection procedure was used. Model
selection is the process of selecting, from a given set of models, the one which
describes the entire data set most accurately. There are several standard procedures
that help in the model selection process.
1. Forward selection
In this procedure variables are added to the model one at a time until the
addition of another variable does not significantly improve the model.
2. Backward elimination
This procedure begins with a model that includes all the potential independent
variables. Variables are deleted from the model one at a time until further deletion
of a variable results in a rejection of the reduced model. This is the method used
by the researcher in order to obtain the best model for the data.

3.5.3 Multicollinearity

Collinearity is a linear relationship between two explanatory variables. Two
variables are perfectly collinear if there is an exact linear relationship between
the two. Multicollinearity is a statistical phenomenon in which two or more
explanatory (predictor) variables in a multiple regression model are highly linearly
related. In this situation the coefficient estimates of the multiple regression may
change erratically in response to small changes in the model or the data.
Multicollinearity does not reduce the predictive power or reliability of the model as
a whole, at least within the sample data themselves; it only affects calculations
regarding individual predictors.

Measuring multicollinearity

1. Variance Inflation Factor (VIF)

Formal method of detecting multicollinearity,

$$\mathrm{VIF}(\beta_j) = \frac{1}{1 - R_j^2} \tag{6}$$

βj = jth regression coefficient
Rj2 = Coefficient of multiple determination resulting from regressing Xj on the
other regressor variables

$$R^2 = 1 - \frac{SSE}{SST} \tag{7}$$

SSE = Error (residual) sum of squares
SST = Total sum of squares

If VIF > 10, multicollinearity exists.

2. Condition index and condition number


3. Variance proportion
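
The variance inflation factor in equation (6) can be illustrated for the simplest case of two predictors. This is a simplified sketch, not the general procedure: with only two regressors, the coefficient of determination from regressing one on the other equals their squared Pearson correlation, so no multiple regression is needed. The function name and the three series are hypothetical.

```python
def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    # With one other regressor, R_j^2 is the squared correlation between the two.
    r_squared = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r_squared)


# Hypothetical predictors: x2 is moderately related to x1, x3 is almost a multiple of x1.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
x3 = [2.1, 3.9, 6.0, 8.1, 9.9]
vif_moderate = vif_two_predictors(x1, x2)
vif_high = vif_two_predictors(x1, x3)
print(vif_moderate, vif_high)
```

Under the VIF > 10 rule quoted above, only the second pair would be flagged. With more than two predictors, each Rj2 has to come from a full regression of Xj on all the other regressors.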

Several remedial measures are available for solving the problem of multicollinearity.

1. Deleting variables
Delete the variables which exhibit multicollinearity. The major disadvantage of this
method is that some of the information contained in those variables could be lost.

2. Expanding data
Expand the data with new observations which are specifically designed to
break up the linear dependencies that currently exist among independent variables.

3. Ridge regression
Ridge regression differs in that it uses a biased estimator for the β values.

All the null hypotheses were tested at the 5% significance level using p-values
obtained from the statistical software packages RStudio version 0.97.316, EasyFit
version 5.5 and Minitab version 16.1.1.

4. RESULTS AND DISCUSSION

4.1 Primary Analysis

At the beginning of the analysis, descriptive statistics such as the mean, standard
deviation, median, maximum and minimum were obtained for each quantitative
variable.

Table 1: Summary of descriptive statistics

Variable       Mean       Stdev      Median     Min        Max
calworks       13.246     11.4548    10.5205    0          78.9942
computer       303.383    441.341    117.5      0          3324
English        15.7682    18.2859    8.77763    0          85.5397
expenditure    5312.41    633.937    5214.52    3926.07    7711.51
income         15.3166    7.22589    13.7278    5.335      55.328
lunch          44.7052    27.1234    41.7507    0          100
math           653.343    18.7542    652.45     605.4      709.5
read           654.97     20.108     655.75     604.5      704
students       2628.79    3913.1     950.5      81         27176
teachers       129.067    187.913    48.565     4.85       1429
Average        654.157    19.0534    654.45     605.55     706.75

According to the above table, most of the variables show considerable spread, which
helps the analysis, but it also indicates inconsistency in the data. The variables
computer, students and teachers have standard deviations larger than their means.
Since these variables are non-negative, this suggests that the data may contain large
outliers or missing values replaced by 0s. Skewed data can produce this kind of
behavior and can also lead to heteroskedasticity. Differences between the mean and
the median also indicate skewness.
The math and read test scores have low standard deviations relative to their means,
which implies consistency of the data, and their means and medians are close
compared with the other variables. This signifies low skewness.
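
The comparison made above can be reproduced directly from Table 1. The sketch below
copies the reported means, standard deviations and medians for five variables and
computes the coefficient of variation and a simple standardised gap between mean and
median. The column names cv and gap are illustrative choices, not quantities reported
in the study.

```r
# Values copied from Table 1 for a few variables
tab1 <- data.frame(
  variable = c("computer", "students", "teachers", "math", "read"),
  mean     = c(303.383, 2628.79, 129.067, 653.343, 654.97),
  sd       = c(441.341, 3913.1, 187.913, 18.7542, 20.108),
  median   = c(117.5, 950.5, 48.565, 652.45, 655.75)
)

# Coefficient of variation: standard deviation relative to the mean
tab1$cv <- tab1$sd / tab1$mean

# Crude skewness indicator: (mean - median) / sd; a positive value suggests right skew
tab1$gap <- (tab1$mean - tab1$median) / tab1$sd

print(tab1, digits = 3)
```

For computer, students and teachers the coefficient of variation is above 1, while
for math and read it is about 0.03, which matches the discussion above.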

Along with the above results graphical representations such as histograms, box plots
and scatter plots were used to identify and compare the distributions of variables.

Figure 1: Histogram of students

The above graph is a histogram of the number of students, with the x-axis representing
the number of students and the y-axis the number of schools. According to the
histogram, many schools have a low number of students and few schools have a high
number of students. It is therefore a right-skewed histogram: the frequency (number
of schools) decreases as the number of students increases.

Figure 2: Histogram of teachers

In this histogram the x-axis represents the number of teachers and the y-axis the
number of schools. As shown in the graph, many schools have a low number of teachers
and few schools have a high number of teachers. It is therefore also a right-skewed
histogram: the frequency (number of schools) decreases as the number of teachers
increases.

Figure 3: Histogram of calWorks

The above graph is a histogram of the percentage of students who need income
assistance, with the x-axis representing the percentage of qualifying students and the
y-axis the number of schools. According to the histogram, there are many schools in
which a low percentage of students need income assistance and few schools in which a
high percentage need it. It is therefore a right-skewed histogram.

Figure 4: Histogram of computers

This is the histogram of the number of computers per classroom. The x-axis and y-axis
represent the number of computers per classroom and the number of schools
respectively. This graph is also right skewed, because the frequency (number of
schools) decreases as the number of computers per classroom increases.

Figure 5: Histogram of expenditure

This graph is the histogram of expenditure per student. Here the y-axis again
represents the number of schools, while the x-axis represents the expenditure per
student. The graph indicates that most of the schools cluster around the average,
$5000-$5500 per student, and that there are a few schools with low or high
expenditure.

Figure 6: Histogram of income

This is also a right-skewed histogram, of district average income. Most of the data
cluster in the left part of the graph. Here the x-axis denotes the average income of
the district in $1000 and the y-axis denotes the number of schools.

Figure 7: Histogram of English

The above graph is a histogram of the percentage of English learners, with the x-axis
representing the percentage of students and the y-axis the number of schools.
According to the histogram, many schools have low percentages and few schools have
high percentages. It is therefore a right-skewed histogram: the frequency (number of
schools) decreases as the percentage increases.

Figure 8: Histogram of average

This histogram shows the average of the read and math test scores. Most of the data
cluster around the center of the graph, which confirms the conclusions about
consistency drawn from Table 1. Here the test scores are represented by the x-axis
and the number of schools by the y-axis.

Figure 9: Box plots of calWorks, lunch and English

The above box plots confirm the results obtained from the histograms. According to the
graph, the percentage of students who need income assistance and the percentage of
English learners behave in a similar manner. Their median values are almost the same,
and both box plots have outliers on the same side. The box plot of the percentage of
students who need price reduced lunch, however, deviates from the other two: as shown
in the diagram, it is spread evenly over the whole range.

Figure 10: Box plots of read and maths scores

According to the above box plots, there is not much difference between the read scores
and the math scores, except that the math score box plot has a few outliers. So the
scores of both subjects are distributed similarly within the district.

Figure 11: Scatter plot between expenditure and income

Since box plots do not give a clear picture of how expenditure and income are related,
a scatter plot was used to identify the relationship between the two variables. Here
the x-axis denotes expenditure and the y-axis denotes income. According to the graph,
there is a weak relationship between expenditure and income, since the observations
are spread all over the graph. Income also varies over a wide range ($10000-$50000),
while expenditure stays within a narrow range ($4000-$7000).

Figure 12: Scatter plot between students and teachers

Since the number of students and the number of teachers differ greatly in scale, box
plots cannot be used for comparison. A scatter plot was therefore drawn instead. In
the above graph the x-axis denotes the number of students and the y-axis the number
of teachers. According to the graph, there is a strong relationship between the two
variables. The deviation of the observations from the fitted line is very low.

In order to understand the dependency among the variables, the Pearson product-moment
correlation was used.

Table 2: Pearson product-moment correlation

                 Average    calworks    computer     English expenditure      Income
Average                1    -0.62685    -0.07374    -0.64412    0.191273    0.712431
calworks        -0.62685           1     0.05916    0.319576    0.067889    -0.51265
computer        -0.07374     0.05916           1    0.291339    -0.07131    0.094343
English         -0.64412    0.319576    0.291339           1     -0.0714    -0.30742
expenditure     0.191273    0.067889    -0.07131     -0.0714           1    0.314484
Income          0.712431    -0.51265    0.094343    -0.30742    0.314484           1
lunch           -0.86877    0.739422    0.061386    0.653061    -0.06104    -0.68444
math            0.979143    -0.61769    -0.03295    -0.56868    0.154989    0.699398
read            0.981882    -0.61185    -0.10901    -0.69029    0.217927    0.697819
students        -0.15399    0.090161    0.928882    0.354879    -0.11228    0.028392
teachers        -0.14486    0.092645    0.937242    0.351421    -0.09519    0.043007

Table 2 (continued)

                   lunch        math        read    students    teachers
Average         -0.86877    0.979143    0.981882    -0.15399    -0.14486
calworks        0.739422    -0.61769    -0.61185    0.090161    0.092645
computer        0.061386    -0.03295    -0.10901    0.928882    0.937242
English         0.653061    -0.56868    -0.69029    0.354879    0.351421
expenditure     -0.06104    0.154989    0.217927    -0.11228    -0.09519
Income          -0.68444    0.699398    0.697819    0.028392    0.043007
lunch                  1    -0.82301    -0.87881    0.129234    0.124296
math            -0.82301           1    0.922902    -0.11089     -0.1023
read            -0.87881    0.922902           1     -0.1884    -0.17911
students        0.129234    -0.11089     -0.1884           1    0.997116
teachers        0.124296     -0.1023    -0.17911    0.997116           1

The above table confirms the results obtained from the scatter plots. According to
figure 11, there is a weak relationship between district average income and
expenditure per student. The table gives ρ = 0.314484 as the correlation coefficient
between expenditure and income, which indicates a weak positive relationship. There
are also weak positive relationships between average test scores and expenditure,
income assistance and English learners, number of computers and English learners,
English learners and number of students, English learners and number of teachers,
expenditure and the read and math scores, and lunch and the numbers of students and
teachers.

The following pairs have weak negative relationships: average test score with number
of teachers and number of students, income assistance with income, read test score
with number of computers, expenditure with number of students, and the math and read
test scores with number of students and number of teachers. There are also values
close to zero (ρ ≈ 0), which indicate that there is almost no association between
the two variables. This is the case for average test scores and number of computers;
income assistance and number of computers; income assistance with expenditure, number
of students and number of teachers; expenditure with price reduced lunch, number of
teachers and English learners; and income with number of students and number of
teachers.

The most important relationships, however, are the strong positive and strong negative
ones. According to the table, average has strong positive relationships with income
and with the math and read test scores. It is obvious that the average test score is
strongly and positively related to the read and math test scores, since the average
is calculated from those two scores. It is more interesting that income and average
test score have a correlation coefficient of ρ = 0.712431. This value indicates that
students in districts with high average income tend to get high marks in tests. When
the two test scores are considered separately, the table shows slightly lower ρ
values. They are still strong positive relationships, because the read and math test
scores are themselves highly correlated. It is also no surprise that the percentage
of students who need income assistance and the percentage who need price reduced
lunch have a correlation of ρ = 0.739422. Number of computers also has strong
positive relationships with both number of teachers and number of students. As shown
in figure 12, number of students and number of teachers have a strong positive
relationship.
The relationship between English learners and average test scores is strongly
negative, and the same holds for the read and math scores separately. These test
scores also have strong negative relationships with income assistance, even though
there is only a weak positive relationship between income assistance and English
learners. As expected, income and price reduced lunch have a strong negative
relationship. There are also strong negative relationships between price reduced
lunch and the test scores.
Therefore, according to the above table, test scores are mainly associated with income
and English knowledge. Since the other variables, lunch and calWorks, depend on
income, they are also associated with test scores.
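
The verbal grading of the correlations can be made explicit with a small sketch. The
values below are the correlations with Average copied from Table 2. The cut-offs 0.1
and 0.6 are hypothetical thresholds chosen only to mirror roughly the wording used
above; the study itself does not state numerical cut-offs.

```r
# Correlations of each variable with Average, copied from Table 2
r_avg <- c(calworks = -0.62685, computer = -0.07374, English = -0.64412,
           expenditure = 0.191273, income = 0.712431, lunch = -0.86877,
           students = -0.15399, teachers = -0.14486)

# Hypothetical cut-offs: |r| < 0.1 "negligible", |r| < 0.6 "weak", otherwise "strong"
strength <- cut(abs(r_avg), breaks = c(0, 0.1, 0.6, 1),
                labels = c("negligible", "weak", "strong"),
                include.lowest = TRUE, right = FALSE)
direction <- ifelse(r_avg < 0, "negative", "positive")

result <- data.frame(r = r_avg, strength = strength, direction = direction)

# Strongest associations first
print(result[order(-abs(result$r)), ])
```

Under these assumed cut-offs lunch, income, English and calworks fall in the strong
group, which agrees with the variables singled out in the discussion.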

4.2 Confirmatory Analysis

A paired sample t-test was used to assess the difference between the two test scores
and the difference between the price reduced lunch percentages and the income
assistance percentages.

Hypothesis
H0: Mean of read score – Mean of math score ≤ 0 vs
H1: Mean of read score – Mean of math score > 0

Test statistic
p-value = 1.029e-05 < 0.05

Decision
H0 is rejected at the 5% significance level. Therefore the two means differ
significantly at the 5% significance level. According to the alternative hypothesis,
the average read test score is higher than the average math test score.

Hypothesis
H0: Mean of lunch – Mean of calWorks ≤ 0 vs
H1: Mean of lunch – Mean of calWorks > 0

Test statistic
p-value < 2.2e-16 < 0.05

Decision
H0 is rejected at the 5% significance level. Therefore the two means differ
significantly at the 5% significance level. According to the alternative hypothesis,
the lunch average is higher than the calWorks average.
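
A one-sided paired test of this form can be sketched as follows. The scores are
simulated (the real data set is not reproduced here), and the built-in mean difference
of 3 points is an arbitrary assumption, so the p-value printed will not match the one
reported above.

```r
set.seed(42)
n    <- 420                                   # same number of schools as in the study
math <- rnorm(n, mean = 653, sd = 19)         # simulated, not the real scores
read <- math + rnorm(n, mean = 3, sd = 8)     # assumed positive mean difference

# H0: mean(read - math) <= 0   vs   H1: mean(read - math) > 0
res_paired <- t.test(read, math, alternative = "greater", paired = TRUE)

# A paired test is the same as a one-sample t-test on the differences
res_diff <- t.test(read - math, mu = 0, alternative = "greater")

print(res_paired$p.value)
print(all.equal(res_paired$statistic, res_diff$statistic))
```

Rejecting H0 here supports the alternative hypothesis, which is why the decision is
stated in terms of H1 rather than H0.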

After that, the Shapiro-Wilk test of normality was used to assess the distributions of
the read test scores, math test scores and average test scores.

Hypothesis
H0: Data set follows a normal distribution. vs
H1: Data set doesn’t follow a normal distribution.

Table 3: Shapiro-Wilk test results

Variable   p-value    Compared with α   H0
math       0.07593    > 0.05            Not reject
read       0.1597     > 0.05            Not reject
average    0.4361     > 0.05            Not reject

The above table shows that, for the read, math and average test scores, the hypothesis
of normality is not rejected at the 5% level, so these variables are treated as
following a normal distribution.
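
The decision rule used in Table 3 can be sketched in a few lines. Both samples below
are simulated for illustration; the helper name decide and the parameters of the two
distributions are assumptions, not part of the study.

```r
set.seed(7)
scores_normal <- rnorm(420, mean = 654, sd = 19)   # simulated bell-shaped scores
scores_skewed <- rexp(420, rate = 1 / 2600)        # simulated right-skewed counts

sw_normal <- shapiro.test(scores_normal)
sw_skewed <- shapiro.test(scores_skewed)

# H0: the data follow a normal distribution; reject when p-value <= alpha
decide <- function(test, alpha = 0.05) {
  if (test$p.value > alpha) "do not reject H0" else "reject H0"
}

cat("normal sample:", decide(sw_normal), "\n")
cat("skewed sample:", decide(sw_skewed), "\n")
```

A p-value above α only means that the test found no evidence against normality; it
does not prove that the data are exactly normal.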

Then the following regression model was fitted to the data in order to identify the
factors that affect read test scores.

Read = 260.6 + 0.010*teachers - 0.001*students + 0.598*math - 0.214*lunch +
0.0001*income + 0.003*expenditure - 0.178*English - 0.001*computer +
0.027*calWorks (8)

According to the above model, number of teachers, math test scores, income,
expenditure per student and calWorks have direct relationships with read test scores,
and all the other variables have negative relationships with the dependent variable.

After fitting the model, it was tested whether there is a regression relationship
between the dependent variable and the independent variables.

Hypothesis
H0: There is no regression relationship between X & Y vs
H1: There is a regression relationship between X & Y

Test statistic
p-value < 2.2e-16 < 0.05

Decision
H0 is rejected at the 5% significance level. Therefore there is a regression
relationship between the dependent variable and the independent variables. Since H0
is rejected, it is necessary to test the individual regression parameters.

Hypothesis
H0: β_i = 0 vs
H1: β_i ≠ 0

Table 4: Test results of regression parameters I

              Estimate   Pr(>|t|)    H0
(Intercept)   260.6      < 2e-16     Reject
teachers      0.010      0.6382      Not reject
students      -0.001     0.5027      Not reject
math          0.598      < 2e-16     Reject
lunch         -0.214     2.47e-15    Reject
Income        0.0001     0.0624      Not reject
expenditure   0.003      9.74e-08    Reject
English       -0.178     2.80e-13    Reject
computer      -0.001     0.6496      Not reject
calworks      0.027      0.4995      Not reject

According to the above table, there is no significant relationship between the
dependent variable, read test score, and the independent variables number of teachers,
number of students, average income, number of computers and income assistance
program. The relationships between the dependent variable, read test score, and the
independent variables math test score, percentage of students who need price
reduced lunch, expenditure and percentage of English learners are significant.
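
The two-stage testing used above (an overall F-test followed by t-tests on the
individual parameters) can be sketched on simulated data. The data frame dat, the
predictor noise and the true coefficients are assumptions chosen for illustration and
do not come from the study's data set.

```r
set.seed(3)
n   <- 200
dat <- data.frame(math  = rnorm(n, 653, 19),
                  lunch = runif(n, 0, 100),
                  noise = rnorm(n))                 # predictor unrelated to read
dat$read <- 260 + 0.6 * dat$math - 0.2 * dat$lunch + rnorm(n, sd = 5)

fit <- lm(read ~ math + lunch + noise, data = dat)
s   <- summary(fit)

# Overall F-test: H0 says that all slope coefficients are zero
f      <- s$fstatistic
p_over <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Individual t-tests: H0 says beta_i = 0
coefs    <- as.data.frame(s$coefficients)
coefs$H0 <- ifelse(coefs[["Pr(>|t|)"]] < 0.05, "Reject", "Not reject")

print(unname(p_over))
print(coefs[, c("Estimate", "Pr(>|t|)", "H0")])
```

The last column reproduces the layout of Table 4: a predictor is kept as significant
only when its own p-value falls below 0.05.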

Table 2 and regression model 8 give clearly contradictory results. In the correlation
table, number of teachers and read test score have a negative relationship,
ρ = -0.17911, but regression model 8 indicates that they have a positive regression
coefficient, β = 0.010. This could be due to multicollinearity among the independent
variables. Also, between the variables read test score and calWorks the correlation
coefficient is ρ = -0.61185, but the regression coefficient is β = 0.027. Hence VIF
was used to check for multicollinearity in model 8.

Table 5: VIF values of model 8

Variable      VIF          Compared with 10   Conclusion
calworks      2.627860     < 10               No multicollinearity exists
computer      8.871393     < 10               No multicollinearity exists
English       2.459760     < 10               No multicollinearity exists
expenditure   1.280277     < 10               No multicollinearity exists
Income        2.741747     < 10               No multicollinearity exists
lunch         6.540446     < 10               No multicollinearity exists
math          3.636677     < 10               No multicollinearity exists
students      195.901414   > 10               Multicollinearity exists
teachers      218.684951   > 10               Multicollinearity exists

According to the above table, multicollinearity exists among the independent
variables, because the VIF values of students and teachers are greater than 10.
Therefore regression model 8 is not a suitable model for the read test score data. So
the backward elimination model selection procedure with AIC was used to determine the
simplest adequate regression model that describes the dependent variable, read test
score, accurately.
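
The VIF check can be reproduced by hand from its definition, VIF_j = 1 / (1 - R_j^2),
where R_j^2 is obtained by regressing predictor j on the remaining predictors. The
sketch below uses simulated data in which teachers is almost proportional to students;
the function name vif_manual and all numbers are assumptions for illustration, so the
values will differ from Table 5.

```r
set.seed(11)
n        <- 420
students <- round(rexp(n, rate = 1 / 2600)) + 80   # simulated enrolment
teachers <- students / 20 + rnorm(n, sd = 5)        # almost proportional to students
income   <- rnorm(n, mean = 15, sd = 7)             # unrelated to the other two
X <- data.frame(students, teachers, income)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the other predictors
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- vif_manual(X)
print(round(vif_values, 2))
print(vif_values > 10)        # TRUE flags multicollinearity under the VIF > 10 rule
```

As in Table 5, the two nearly proportional variables are flagged together, while the
unrelated predictor stays close to 1.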

Read = 261.6 - 0.1841*English + 0.002753*expenditure + 0.0001212*income -
0.2031*lunch + 0.5958*math - 0.0001204*students (9)

This model implies that a one-unit increase in English decreases the read test score
by 0.1841 units, holding the other variables fixed. Hence read test score and
percentage of English learners have a negative relationship, as mentioned in table 2.
Read test score also has negative relationships with the percentage of students who
need price reduced lunch and with number of students. Negative relationships, also
known as inverse relationships, mean that when the independent variable increases,
the value of the dependent variable decreases. Since the coefficients of expenditure,
income and math are positive, they have positive relationships with the dependent
variable.
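
The interpretation of the coefficients can be checked with simple arithmetic. The
sketch below stores the coefficients of model (9) as reported and computes the
predicted change in read score for a given change in one or more predictors, holding
the rest fixed. The helper predicted_change and the example changes are hypothetical.

```r
# Coefficients of model (9) as reported in the text
coef9 <- c(English = -0.1841, expenditure = 0.002753, income = 0.0001212,
           lunch = -0.2031, math = 0.5958, students = -0.0001204)

# Predicted change in read score for the changes in `delta`,
# holding every predictor not listed at its current value
predicted_change <- function(delta, coefs = coef9) {
  sum(coefs[names(delta)] * delta)
}

# A 10-point rise in the percentage of English learners
print(predicted_change(c(English = 10)))

# A hypothetical combined change: math score up 5, lunch percentage down 10
print(predicted_change(c(math = 5, lunch = -10)))
```

The first call gives a change of -1.841, i.e. ten times the English coefficient, which
is the per-unit reading stated above applied to a 10-point increase.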

In order to identify the factors that affect math test scores, the same steps were
followed as above. Using math test score as the dependent variable and all the other
variables as independent variables, a model was fitted to the data.

Math = 81.22 - 0.0066*teachers + 0.0001*students + 0.8804*read + 0.0318*lunch +
0.0002*income - 0.0019*expenditure + 0.0875*English + 0.0029*computer -
0.0867*calWorks (10)

The above model implies that math test score has direct relationships with number of
students, read test scores, percentage of students who need price reduced lunch,
average income, percentage of English learners and number of computers. Number of
teachers, expenditure and income assistance have inverse relationships.

Then a hypothesis test was used to check for the existence of a regression relationship.

Hypothesis
H0: There is no regression relationship between X & Y vs
H1: There is a regression relationship between X & Y

Test statistic
p-value < 2.2e-16 < 0.05

Decision
H0 is rejected at the 5% significance level. Therefore there is a regression
relationship between the dependent variable and the independent variables. Since H0
is rejected, it is necessary to test the individual regression parameters.

Hypothesis
H0: β_i = 0 vs
H1: β_i ≠ 0

Table 6: Test results of regression parameters II

              Estimate   Pr(>|t|)   H0
(Intercept)   81.62      0.00289    Reject
calworks      -0.0867    0.06648    Not reject
computer      0.0029     0.19939    Not reject
English       0.0875     0.00409    Reject
expenditure   -0.0019    0.0012     Reject
Income        0.0002     0.00213    Reject
lunch         0.0318     0.35051    Not reject
students      0.0001     0.09951    Not reject
teachers      -0.0066    0.80267    Not reject
read          0.8804     <2e-16     Reject

The above table shows that there is no significant relationship between the dependent
variable, math test score, and the independent variables number of teachers, number
of students, percentage of students who need price reduced lunch, number of computers
and income assistance program. The relationships between the dependent variable, math
test score, and the independent variables read test score, expenditure, income and
percentage of English learners are significant.

Again, table 2 and regression model 10 give clearly contradictory results. In the
correlation table, expenditure and math test score have a positive relationship,
ρ = 0.154989, but regression model 10 indicates that they have a negative regression
coefficient, β = -0.0019. Also, between the variables math test score and computer
the correlation coefficient is ρ = -0.03295, but the regression coefficient is
β = 0.0029. Hence VIF was used to check for multicollinearity in model 10.

Table 7: VIF values of model 10

Variable      VIF          Compared with 10   Conclusion
calworks      2.609378     < 10               No multicollinearity exists
computer      8.840243     < 10               No multicollinearity exists
english       2.746130     < 10               No multicollinearity exists
expenditure   1.341051     < 10               No multicollinearity exists
Income        2.707182     < 10               No multicollinearity exists
lunch         7.605457     < 10               No multicollinearity exists
read          6.154010     < 10               No multicollinearity exists
students      196.032205   > 10               Multicollinearity exists
teachers      218.769709   > 10               Multicollinearity exists

According to the above table, multicollinearity exists among the independent
variables, because the VIF values of students and teachers are greater than 10.
Therefore regression model 10 is not a suitable model for the math test score data.
So, as above, backward elimination and AIC were used to determine the most
appropriate regression model for the math test score data.

Math = 93.58 + 0.0953*English - 0.0018*expenditure + 0.0002*income +
0.8626*read - 0.0642*calWorks + 0.0013*computer (11)

This model implies that a one-unit increase in English increases the math test score
by 0.0953 units, holding the other variables fixed. Hence math test score and
percentage of English learners have a direct relationship. Math test score also has
negative relationships with the percentage of students who need income assistance and
with expenditure. Since the coefficients of percentage of English learners, income,
read and number of computers are positive, they have positive relationships with the
dependent variable.

The same procedure was then applied to the average test score. The following reduced
model was obtained as a result of the backward elimination model selection procedure
and the AIC values.

Average = 654 - 0.0893*calworks - 0.2111*english + 0.0022*expenditure +
0.0006*Income - 0.3727*lunch (12)

The model shows that test scores depend on the variables calWorks, English,
expenditure, income and lunch. Among them, calWorks, English and lunch have a
negative impact on test scores, i.e. when the percentages measured by those variables
increase, the students' test scores decrease. Expenditure and income have positive
relationships with the test scores.
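
The backward elimination with AIC that led to the reduced models can be sketched with
the base R function step(). The data below are simulated, and the predictors junk1 and
junk2 are hypothetical variables with no true effect; the sketch only shows the
mechanics and does not reproduce models 9, 11 or 12.

```r
set.seed(5)
n   <- 300
dat <- data.frame(english = runif(n, 0, 85),
                  lunch   = runif(n, 0, 100),
                  junk1   = rnorm(n),               # no true effect on average
                  junk2   = rnorm(n))               # no true effect on average
dat$average <- 690 - 0.2 * dat$english - 0.4 * dat$lunch + rnorm(n, sd = 8)

# Start from the full model and drop terms while the AIC keeps decreasing
full    <- lm(average ~ english + lunch + junk1 + junk2, data = dat)
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(c(AIC_full = AIC(full), AIC_reduced = AIC(reduced)))
```

Backward elimination removes a term only when doing so lowers the AIC, so the reduced
model never has a higher AIC than the full model it started from.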

5. SUGGESTIONS AND CONCLUSIONS

According to table 1, the maximum percentage of students who need income assistance
is 78.9942, while the maximum percentage of students who need price reduced lunch is
100. Also, according to the paired sample t-test, the mean percentage of students who
need price reduced lunch is higher than the mean percentage of students who need
income assistance. These results indicate that students prefer price reduced lunch
even when they do not need income assistance. Therefore price reduction of food would
be a more appropriate plan than income assistance.

The number of computers per classroom has a standard deviation higher than its mean
value, which implies that there are large outliers. The minimum value of computer is
0, which means that there are some schools with no computers, while the maximum
value, 3324, suggests that some schools have more computers than needed. A problem in
the distribution of resources can be clearly identified. In modern society computer
knowledge is a must. Therefore a proper plan for resource distribution is needed.

According to figure 11, there is a weak relationship between expenditure and income.
The correlation coefficient between expenditure and income in table 2, ρ = 0.314484,
also confirms the weak relationship, even though a strong relationship was expected.
This could be because the average number of members in a family is missing from the
data. If there were data about the number of members in a family, a more accurate
result could be obtained.

Figure 12, the scatter plot of number of teachers against number of students, implies
that there is a strong relationship between those two variables. According to table
2, the correlation coefficient is ρ = 0.997116, which indicates a strong positive
relationship between number of teachers and number of students. Figure 12 also shows
that no points deviate greatly from the fitted line, which implies that the ratio
between number of students and number of teachers is maintained throughout the state,
i.e. every school has an appropriate number of teachers for its number of students.

The paired sample t-test confirms that the average read test score is higher than the
average math test score. But the maximum and minimum values in table 1 suggest
otherwise: the maximum read test score, 704, is lower than the maximum math test
score, 709.5, and the minimum read test score, 604.5, is lower than the minimum math
test score, 605.4. The math test score data therefore appear right skewed, even
though table 3 indicates that both variables follow normal distributions. So students
need to spend more time on mathematics than on reading.

Regression model 12 indicates that students' average test scores are associated with
income, expenditure, the percentage of students who need income assistance, the
percentage of students who need price reduced lunch and the percentage of English
learners. Expenditure is expected to depend on income, even though this is not
supported by the scatter plot in figure 11, and so are calWorks and lunch. Hence
students' test scores basically depend on income and English knowledge. Income has a
positive relationship with test scores and English has a negative relationship; that
is, when the income level is high students tend to get high scores, and when English
knowledge is low students get low scores. But analyzing the two test scores
separately gives different results. According to regression model 9, read test scores
are also related to number of students. Math test scores have the largest impact on
read test scores because the two are highly correlated, as shown in table 2
(ρ = 0.922902). Regression model 11 gives more unexpected results, such as β = 0.0953
between math test scores and English and β = -0.0018 between expenditure and math
test scores. In addition, math test scores have a considerable relationship with
number of computers per classroom. These results could be due to multicollinearity
among the independent variables, such as the relationships between expenditure and
income, lunch and calWorks, and number of students and number of teachers.

However, this study, along with numerous others, has found that income level and
English knowledge are extremely important factors that influence student
achievement. Therefore calWorks is a good project that helps students who need
income assistance so that they can improve their test scores. With income assistance,
students can reach resources that they cannot get in school. It would also be better
if students had good English learning materials such as text books, good teachers
and access to the internet. Reducing the price of lunch would also be a good way of
improving students' test scores, in addition to income assistance.

The results of the study would be more accurate if there were more details about the
families, such as the average number of members in a family, and about the schools,
such as the numbers of male and female students and whether the school is private or
public. Besides helping to understand the factors that affect students' test results,
this study could be used to identify the factors that are most important to a school
and how a school's status can be improved.

REFERENCE

Battle, J., & Lewis, M. (2002). The increasing significance of class: The relative effects of
race and socioeconomic status on academic achievement. Journal of poverty, 6(2), 21-35.

Ceballo, R., McLoyd, V. C., & Toyokawa, T. (2004). The influence of neighborhood quality on
adolescents’ educational values and school effort.Journal of Adolescent Research, 19(6),
716-739.

Chambers*, E. A., & Schreiber, J. B. (2004). Girls' academic achievement: varying


associations of extracurricular activities. Gender and Education, 16(3), 327-346.

Crosnoe, R., Johnson, M. K., & Elder, G. H. (2004). Intergenerational bonding in school: The
behavioral and contextual correlates of student-teacher relationships. Sociology of
Education, 77(1), 60-81.

Eamon, M. K. (2005). Social-demographic, school, neighborhood, and parenting influences


on the academic achievement of Latino young adolescents. Journal of youth and
adolescence, 34(2), 163-174.

Hunt, S. A., Abraham, W. T., Chin, M. H., Feldman, A. M., Francis, G. S., Ganiats, T. G., ... &
Riegel, B. (2005). ACC/AHA 2005 guideline update for the diagnosis and management of
chronic heart failure in the adult a report of the American College of Cardiology/American
Heart Association Task Force on Practice Guidelines (Writing Committee to Update the 2001
Guidelines for the Evaluation and Management of Heart Failure): developed in collaboration
with the American College of Chest Physicians and the International Society for Heart and
Lung Transplantation: endorsed by the Heart Rhythm Society.Circulation, 112(12), e154-
e235.

Marks, G. N. (2005). Accounting for immigrant non-immigrant differences in reading and


mathematics in twenty countries. Ethnic and racial studies, 28(5), 925-946.

Murdock, T. B., Anderman, L. H., & Hodge, S. A. (2000). Middle-grade predictors of


students’ motivation and behavior in high school. Journal of Adolescent Research, 15(3),
327-351.

Stolzenberg, L., & Stewart, D. A. David Eitle 2004 A multilevel test of racial threat
theory. Criminology, 42, 673-698.

Tam, M. Y. S., & Bassett, G. W. (2004). Does diversity matter? Measuring the impact of high
school diversity on freshman GPA. Policy studies journal, 32(1), 129-143.

41
Appendix

Data set

district  school                       county   grades  students  teachers  calworks
75119     Sunol Glen Unified           Alameda  KK-08        195      10.9    0.5102
61499     Manzanita Elementary         Butte    KK-08        240     11.15   15.4167
61549     Thermalito Union Elementary  Butte    KK-08       1550      82.9   55.0323

lunch   computer  expenditure  income  english  read  math
2.041         67       6384.9   22.69        0   692   690
47.92        101       5099.4   9.824    4.583   661   662
76.32        169         5502   8.978       30   636   651
Continued...

Used R codes

Descriptive statistics

Rcmdr> numSummary(Dataset[,c("Average", "calworks", "computer", "district",
Rcmdr+ "english", "expenditure", "Income1", "lunch", "math", "read", "students",
Rcmdr+ "teachers")], statistics=c("mean", "sd", "IQR", "quantiles"), quantiles=c(0,
Rcmdr+ .25,.5,.75,1))

Graphs

Histogram

Rcmdr> Hist(Dataset$calworks, scale="frequency", breaks="Sturges",
Rcmdr+ col="darkgray")

Box plot

Rcmdr> Boxplot( ~ district, data=Dataset, id.method="y")

Scatter plot

Rcmdr> scatterplot(teachers~students, reg.line=lm, smooth=TRUE, spread=TRUE,
Rcmdr+ boxplots='xy', span=0.5, data=Dataset)

Correlation

Rcmdr> cor(Dataset[,c("Average","calworks","computer","district","english",
Rcmdr+ "expenditure","Income1","lunch","math","read")], use="complete.obs")

Paired sample t-test

Rcmdr> t.test(Dataset$read, Dataset$math, alternative='greater', conf.level=.95,
Rcmdr+ paired=TRUE)

Shapiro-Wilk test of normality

Rcmdr> shapiro.test(Dataset$Average)

Regression model

Rcmdr> LinearModel.1 <- lm(Average ~ calworks + computer + district + english +
Rcmdr+ expenditure + Income1 + students + teachers, data=Dataset)
