TO: California Department of Education FROM: Maria Cristina Coello Recalde

TO: California Department of Education

FROM: Maria Cristina Coello Recalde


Data Analyst Consultant

DATE: October 8, 2018

______________________________________________________________________________

Many factors can help explain the variation in academic performance at the school level. Student income, the percentage of ESL students per school, tutoring services at school, homework hours assigned per week, teacher credentials, and parental involvement in school can all help predict and understand school performance statewide.

Unfortunately, this analysis could not measure student income, tutoring services at school, homework hours assigned per week, or parental involvement, because these were not available in the data set provided by the California Department of Education. However, the data set did contain variables that served as proxies for student income and parental involvement. The remaining factors had no proxy and were omitted from the analysis.

The variables in this statistical analysis that help explain academic performance per school are the % of students receiving free meals (a proxy for income, which is absent from the data), the % of English Language Learners, the % of parents with no high school diploma (a proxy for parental involvement), and the % of teachers with emergency credentials. These variables are statistically significant. The exception is enrollment of students per school, which is not significantly related to academic performance but does improve the model by reducing error and increasing the variance explained.

In this statistical model, 84% of the variation in academic performance can be explained by all of these factors, except for enrollment of students.

Data collection needs to be improved in order to better understand how other factors at the student level and the school level affect educational outcomes. Data such as tutoring services at school and homework hours assigned per week are key factors that can help us better understand the performance of schools.


MODEL BUILDING TECHNICAL APPENDIX:

• Step 1: Variables that are likely to explain school performance:

1. Income of students
2. Percentage of English as a Second Language (ESL) students per school
3. Tutoring services at school
4. Homework hours assigned per week
5. Teacher credentials
6. Parental involvement in school

• Step 2: Variables that cannot be measured because they are not available in the data set “calschooldist.csv”:
1. Income of school districts
2. Tutoring services at school
3. Homework hours assigned per week
4. Parental involvement

• Step 3: Variables available in “calschooldist.csv” that are likely to explain school performance:
1. Meals: Percentage of students receiving free meals, used as a proxy for low-income school districts.
2. Enroll: Number of students per school
3. Ell: Percentage of English language learners
4. Full: Percentage of teachers with full credentials
5. Col_grad: Percentage of parents who are college graduates, used as a proxy for parental involvement, on the assumption that parents with at least a college diploma are more involved in their children's school.

• Step 4: Basic check of the data and variables chosen for the model:

First, plots were created to get a feel for the data. The dependent variable, “acadperf”, was plotted against “meals”, “enroll”, “ell”, “full” and “col_grad”. The results follow:

1. > plot(calschool$acadperf,calschool$meals, col=66)

There is a linear relationship here, possibly a strong negative correlation: as the standardized test score goes up, the percentage of students receiving free meals goes down.

2. > plot(calschool$acadperf,calschool$enroll, col=66)

Here there is no clear relationship between the number of students enrolled and academic performance; there does not seem to be much of a correlation. Some schools with low enrollment have low standardized test results, and some schools with high enrollment have high test results. There are also outliers.
3. > plot(calschool$acadperf,calschool$ell, col=66)

Here there seems to be a negative relationship between the percentage of ESL students per school and academic performance. As the percentage of ESL students goes down, the test results go up.

4. > plot(calschool$acadperf,calschool$full, col=66)

There seems to be a non-linear relationship; the data are heavily skewed and not normally distributed.
5. > plot(calschool$acadperf,calschool$col_grad, col=66)

The percentage of parents with a college degree has a positive relationship with academic performance. As the percentage of parents with a college degree goes up, academic performance appears to go up as well, but only mildly and with plenty of outliers.

Some basic statistical analysis:

Here we can see that N=400 for most variables, except for two: “mobility” (N=399) and “acs” (N=398). Hence, the next step was to remove the missing data with “na.omit”, which creates a new object without any missing data.
The cleaned copy of “calschooldist” was saved as “calschooldist2”, and in the table we can see that all variables now have the same N. Then “calschooldist2” was renamed “calschool” to simplify the procedure and avoid mistakes.
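
The effect of “na.omit” on the per-variable N can be sketched as follows. This is a minimal illustration, not the original session: the data frame `schools` and all of its values are invented, and only the column names are taken from the data set described above.

```r
# Toy stand-in for the school data; the values are invented for illustration.
schools <- data.frame(
  acadperf = c(693, 570, 546, 571, 478, 858),
  meals    = c(67, 92, 97, 90, 89, NA),
  mobility = c(11, 33, NA, 27, 44, 10)
)

# Count the non-missing observations per variable (the N of a summary table).
n_before <- colSums(!is.na(schools))

# na.omit() drops every row that contains at least one NA.
schools_complete <- na.omit(schools)
n_after <- colSums(!is.na(schools_complete))

print(n_before)
print(n_after)
print(nrow(schools_complete))
```

Because “na.omit” removes whole rows, a row that is missing only one variable is lost for every variable, which is why all columns end up with the same N.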

• Step 5: Correlation matrix, add/drop variables and multicollinearity considerations:

> cor(calschool)
> calschoolcor=cor(calschool)
> calschoolcor
> correlation_table_allvariable<- calschoolcor
> View(correlation_table_allvariable)
> write.csv(correlation_table_allvariable, file = "Correlation Matric California School - All Variables Included.csv")

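> cor.prob <- function (X, dfr = nrow(X) - 2) {
+ # Correlations stay below the diagonal; p-values are written above it.
+ R <- cor(X, use="pairwise.complete.obs")
+ above <- row(R) < col(R)
+ r2 <- R[above]^2
+ # F-test of each correlation against zero, with 1 and dfr degrees of freedom.
+ Fstat <- r2 * dfr/(1 - r2)
+ R[above] <- 1 - pf(Fstat, 1, dfr)
+ R[row(R) == col(R)] <- NA
+ R
+ }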

> correlation_table_calschool <- cor.prob(calschool)


> View(correlation_table_calschool)
> write.csv(correlation_table_calschool, file = "Correlation Matrix California Schools.csv")

Surprisingly, all correlations are statistically significant, which suggests there might be a problem with the data. Nevertheless, there are many noteworthy correlations that might help in building the model for academic performance. There are strong correlations between some independent variables: “meals” with “ell” (0.77106), “meals” with “not_hsg” (0.68268), and “ell” with “not_hsg” (0.72030). These correlations are high, but all are below 0.8, which reduces the concern about multicollinearity in the model.
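
The 0.8 screen described above can be sketched in base R. The three correlations are the ones quoted in the text; the helper `high_pairs` and its `cutoff` argument are hypothetical names introduced only for this illustration.

```r
# Correlations quoted in the text, arranged as a symmetric matrix.
vars <- c("meals", "ell", "not_hsg")
r <- matrix(1, nrow = 3, ncol = 3, dimnames = list(vars, vars))
r["meals", "ell"]     <- r["ell", "meals"]     <- 0.77106
r["meals", "not_hsg"] <- r["not_hsg", "meals"] <- 0.68268
r["ell", "not_hsg"]   <- r["not_hsg", "ell"]   <- 0.72030

# Hypothetical helper: list predictor pairs whose |r| exceeds a cutoff.
high_pairs <- function(r, cutoff = 0.8) {
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
}

print(high_pairs(r, cutoff = 0.8))   # empty: no pair exceeds 0.8
print(high_pairs(r, cutoff = 0.7))   # pairs that exceed a stricter cutoff
```

A pairwise screen like this only catches collinearity between two variables at a time, so it complements rather than replaces a variance inflation factor check on the fitted model.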

The variables with the strongest relationships with academic performance are:

- Meals: very strong negative relationship with academic performance (-0.90021).
- Ell: very strong negative relationship with academic performance (-0.76580).
- Not_hsg: moderately strong negative relationship with academic performance (-0.68246).
- Grad_sch: moderately strong positive relationship with academic performance (0.63329).
- Emer: moderate negative relationship with academic performance (-0.58827).
- Full: moderate positive relationship with academic performance (0.57661).
- Col_grad: moderate positive relationship with academic performance (0.52672).

“Not_hsg” has a stronger relationship with academic performance than “grad_sch” and “col_grad”. Therefore, “col_grad” was dropped from consideration and “not_hsg” is used instead. Likewise, because “emer” has a stronger relationship than “full”, “full” is replaced with “emer” in the model.

After checking the correlation matrix, the analysis continues with the following variables, in decreasing order of correlation: “Meals”, “Ell”, “Not_hsg”, “Emer” and “Enroll”.

The “enroll” variable does not have a strong relationship with academic performance, but it is still kept for consideration to see whether it affects the overall model or becomes significant at some point during model building.
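
The selection logic above, ranking candidates by the absolute size of their correlation with “acadperf” and keeping the stronger of two competing variables, can be sketched as follows. The correlations are the ones listed in this step; `pick_stronger` is a hypothetical helper written for this illustration.

```r
# Correlations with acadperf as quoted in the list above.
r_with_acadperf <- c(
  meals = -0.90021, ell = -0.76580, not_hsg = -0.68246,
  grad_sch = 0.63329, emer = -0.58827, full = 0.57661, col_grad = 0.52672
)

# Order by absolute strength, strongest first.
ranked <- r_with_acadperf[order(abs(r_with_acadperf), decreasing = TRUE)]
print(round(ranked, 3))

# Of two competing variables, keep the one with the larger |r|.
pick_stronger <- function(a, b) {
  if (abs(r_with_acadperf[[a]]) >= abs(r_with_acadperf[[b]])) a else b
}
print(pick_stronger("emer", "full"))
print(pick_stronger("not_hsg", "col_grad"))
```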

• Step 7: Regression, bivariate/unadjusted model.

This model is in line with our expectations.

The intercept (889.60032) is the expected academic performance when the percentage of students receiving free meals is zero. The meals coefficient shows that a 1-unit increase in meals (one percentage point) is associated with an expected change of -4.00810 in academic performance.

Pr(>|t|) tells us that meals is statistically significant.

The residual standard error of 61.95 is the standard deviation of the residuals: predictions of academic performance are typically off by about 61.95 points. Relative to the intercept (61.95 / 889.60032), the percentage error is 6.96%.
The multiple R-squared tells us the proportion of variance in the data that is explained by the model. Here, 81% of the variation in academic performance is explained by meals.

The F-statistic indicates whether there is a relationship between the predictor and the response variable. F is higher than 1, which points to a relationship between the two variables.
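
The quantities interpreted above can be pulled out of a fitted model as in the following sketch. The data are synthetic (generated to resemble the reported coefficients, not the real school data), and the names `toy`, `fit` and `pct_error` are assumptions made for this example.

```r
set.seed(1)
# Synthetic stand-in data (not the real school data): meals is a percentage,
# and acadperf falls by about 4 points per percentage point of meals.
meals <- runif(200, min = 0, max = 100)
acadperf <- 890 - 4 * meals + rnorm(200, sd = 60)
toy <- data.frame(acadperf, meals)

fit <- lm(acadperf ~ meals, data = toy)
s <- summary(fit)

intercept <- coef(fit)[["(Intercept)"]]  # predicted score when meals = 0
slope     <- coef(fit)[["meals"]]        # change in score per 1-point rise in meals
rse       <- s$sigma                     # residual standard error
r2        <- s$r.squared                 # share of variance explained
pct_error <- 100 * rse / intercept       # RSE relative to the intercept

print(round(c(intercept = intercept, slope = slope, rse = rse,
              r2 = r2, pct_error = pct_error), 3))

# The same ratio with the figures reported in the text.
print(round(100 * 61.95 / 889.60032, 2))
```

In a bivariate model the multiple R-squared equals the squared correlation between the two variables, which matches the reported figures: (-0.90021)^2 is about 0.81.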

• Step 8: Check for regression violations

Histogram of residuals:
> hist(scale(regression_1$residuals))
Graphs matrix:

> layout(matrix(c(1,2,3,4),2,2))
> plot(regression_1)

The diagnostic plots for regression_1, together with the histogram, show that the residuals are centered on zero and roughly symmetric, which indicates that the errors are approximately normally distributed and that the relationship is linear. The model therefore meets the assumptions of linear regression.
The residuals and fitted values in the output are raw values. We standardized the residuals to see how many fall more than two standard deviations from the mean, as a check on their normality. The output tells us that less than 5% of our data fall outside that range.

The Durbin-Watson statistic always lies between 0 and 4. A value of 2 means that there is no autocorrelation in the sample. Our value is 1.44, which is below 2 and so leans toward positive autocorrelation, but it lies within the range of 1 to 3 that is commonly treated as acceptable, so autocorrelation is not a serious concern in our sample.
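
To show how the Durbin-Watson statistic behaves, the sketch below computes it directly from its definition in base R. The memo does not show the command used to obtain 1.44, so this is an illustration under assumptions: `durbin_watson` is a hypothetical helper and both residual series are simulated.

```r
# Durbin-Watson statistic from a vector of residuals taken in observation order.
durbin_watson <- function(e) {
  sum(diff(e)^2) / sum(e^2)
}

set.seed(42)
n <- 500
# Independent errors: the statistic should land near 2.
e_indep <- rnorm(n)
# Positively autocorrelated AR(1) errors (rho = 0.7): well below 2.
e_ar1 <- as.numeric(stats::filter(rnorm(n), filter = 0.7, method = "recursive"))

dw_indep <- durbin_watson(e_indep)
dw_ar1 <- durbin_watson(e_ar1)
print(dw_indep)
print(dw_ar1)
```

For a fitted model the same function can be applied to `residuals(model)`. The statistic is roughly 2 * (1 - rho), where rho is the lag-1 autocorrelation of the residuals, so values below 2 point toward positive autocorrelation.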

• Step 9: Model building

After adding “ell”, the model improves. The standard errors and the residual standard error decrease, the multiple R-squared increases, and the variables remain statistically significant.
After adding “ell” + “not_hsg”, the model stays roughly the same. The variables remain significant and the multiple R-squared is almost the same as in the previous model. The estimates and the residual standard error decrease slightly, while the F-statistic continues to decrease.
After adding “ell” + “not_hsg” + “emer”, the model again stays roughly the same. The variables remain significant and the multiple R-squared increases slightly over the previous model. The estimates and the residual standard error decrease slightly, while the F-statistic continues to decrease.

After adding “ell” + “not_hsg” + “emer” + “enroll”, the model again stays roughly the same. The variables remain significant except for “enroll”, as expected. The multiple R-squared increases slightly again. The estimates decrease slightly and the standard errors increase slightly, while the F-statistic continues to decrease but remains higher than one.
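
The step-by-step comparison above can be organized as in this sketch, which adds one predictor at a time and tabulates the fit statistics. It runs on synthetic data: the column names follow the data set, but the values, the object names (`toy`, `m1` to `m5`, `comparison`) and the coefficients used to simulate the outcome are all assumptions.

```r
set.seed(7)
n <- 300
# Synthetic, correlated predictors standing in for the real columns.
meals   <- runif(n, 0, 100)
ell     <- pmin(100, pmax(0, 0.5 * meals + rnorm(n, sd = 15)))
not_hsg <- pmin(100, pmax(0, 0.3 * meals + rnorm(n, sd = 10)))
emer    <- pmin(100, pmax(0, 0.1 * meals + rnorm(n, sd = 8)))
enroll  <- round(runif(n, 100, 1500))   # unrelated to the outcome by construction
acadperf <- 880 - 3 * meals - 1 * ell - 1 * not_hsg - 1.5 * emer + rnorm(n, sd = 55)
toy <- data.frame(acadperf, meals, ell, not_hsg, emer, enroll)

# Add one predictor at a time, as in the steps above.
m1 <- lm(acadperf ~ meals, data = toy)
m2 <- update(m1, . ~ . + ell)
m3 <- update(m2, . ~ . + not_hsg)
m4 <- update(m3, . ~ . + emer)
m5 <- update(m4, . ~ . + enroll)
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5)

comparison <- data.frame(
  r2     = sapply(models, function(m) summary(m)$r.squared),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  rse    = sapply(models, function(m) summary(m)$sigma),
  f_stat = sapply(models, function(m) summary(m)$fstatistic[["value"]])
)
print(round(comparison, 3))

# Partial F-test: does enroll add anything once the other four are in?
print(anova(m4, m5))
```

Multiple R-squared can never decrease when a predictor is added, so a small rise on its own is weak evidence; adjusted R-squared and the partial F-test from `anova()` give a fairer picture of whether a variable such as “enroll” earns its place.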

• Step 10: Recheck model assumptions

> hist(scale(regression_2final$residuals))

This histogram shows that the residuals are approximately normally distributed.

> layout(matrix(c(1,2,3,4),2,2))
> plot(regression_2final)
> regression_2final$standardized.residuals <- rstandard(regression_2final)
> regression_2final$large_residual <- regression_2final$standardized.residuals >2 |
+ regression_2final$standardized.residuals < -2
> sum(regression_2final$large_residual)
[1] 19
The standardized residual analysis counts the residuals that lie more than two standard deviations from the mean: 19 of the 398 observations (about 4.8%), which is less than 5% of the sample. The plots show linearity and normally distributed residuals.
There is no need to worry about multicollinearity: the VIF values for all variables are in an acceptable range. None of them is above 10, so intercorrelation among the predictors is not a concern.
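
The memo does not show how the VIF values were obtained, so the following is only a sketch of the underlying calculation in base R on synthetic data. For predictor j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The helper `vif_manual` and the simulated variables are assumptions.

```r
# Hypothetical helper: VIF for each predictor in a fitted lm, using base R only.
vif_manual <- function(fit) {
  x <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(x), function(v) {
    others <- x[, colnames(x) != v, drop = FALSE]
    r2 <- summary(lm(x[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

set.seed(3)
n <- 300
meals <- runif(n, 0, 100)
ell   <- 0.5 * meals + rnorm(n, sd = 15)   # moderately correlated with meals
emer  <- rnorm(n, mean = 12, sd = 5)       # unrelated to the other predictors
y     <- 880 - 3 * meals - ell - 2 * emer + rnorm(n, sd = 50)

fit <- lm(y ~ meals + ell + emer)
vifs <- vif_manual(fit)
print(round(vifs, 2))
print(all(vifs < 10))
```

A VIF of 1 means a predictor is uncorrelated with the others; the threshold of 10 used above is a common rule of thumb, and some analysts prefer a stricter cutoff.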

The Durbin-Watson value of 1.41 suggests that the errors are approximately independent and uncorrelated, because it is greater than 1 and less than 3.

The factors that help explain academic performance per school are the % of students receiving free meals (a proxy for income, which is absent from the data), the % of English Language Learners, the % of parents with no high school diploma (a proxy for parental involvement), and the % of teachers with emergency credentials. These variables are statistically significant. The exception is enrollment of students per school, which is not significantly related to academic performance but does improve the model by reducing error and increasing the variance explained.

84% of the variation in academic performance can be explained by all of these factors, except for enrollment of students.

Advanced Extensions
1. Visualization Extensions

> library(corrplot)
> library(stargazer)
> corrmatrix <- cor(calschool, use = "complete.obs")
> corrplot.mixed(corrmatrix, number.cex = 0.70, tl.cex = 0.4)
> corrplot(corrmatrix, type="lower")
> stargazer(regression_1, regression_2final, title="Regression Results", dep.var.labels=c("Academic Performance"), type="text")

2. Additional Diagnostics
