TO: California Department of Education FROM: Maria Cristina Coello Recalde
______________________________________________________________________________
There are many factors that can help explain the variation in academic performance at
the school level. Factors such as student income, percentage of ESL students per school,
tutoring services at school, homework hours sent per week, teacher credentials and parental
involvement in school can help predict and understand school performance statewide.
Tutoring services at school, homework hours sent per week and parental involvement were not
measured because they were not available in the data set provided by the California Department
of Education. However, some variables available in the data set served as proxies for student
income and parental involvement; the remaining factors had no proxy and were omitted from the
analysis.
The variables in this statistical analysis that help explain academic performance per
school are the % of students receiving free meals (a proxy for income, which was absent from
the data), the % of English Language Learners, the % of parents with no high school diploma
(a proxy for parental involvement) and the % of teachers with emergency credentials. These
variables are statistically significant, except for enrollment of students per school, which
does not show a relationship but does improve the model by diminishing error and increasing
explained variance.
Data collection needs to be improved in order to better understand how other factors at
the student level and school level affect educational outcomes.
Data such as tutoring services at school and homework hours sent per week are key factors that
Step 2: Variables that cannot be measured because they are not available in the data set
“calschooldist.csv”:
1. Income of school districts
2. Tutoring services at school
3. Homework hours sent per week
4. Parental Involvement
Step 4: Basic check of the data and variables chosen for the model:
First, plots were created to get a feel for the data. The dependent variable, “acadperf”, was
plotted against “meals”, “enroll”, “ell”, “full” and “col_grad”. The results follow:
Here there is no clear relationship between the number of students enrolled and academic
performance; the two do not appear to be strongly correlated. Some schools with low enrollment
have low standardized test results, and some schools with high enrollment have high test
results. There are outliers as well.
3. > plot(calschool$acadperf,calschool$ell, col=66)
Here there seems to be a negative relationship between the percentage of ESL students per
school and academic performance. As the percentage of ESL students goes down, test results go
up. The relationship appears non-linear, and the data are heavily skewed and not normally
distributed.
5. > plot(calschool$acadperf,calschool$col_grad, col=66)
The percentage of parents with a college degree has a positive relationship with academic
performance. As the percentage of parents with a college degree goes up, academic performance
goes up as well, but only mildly and with plenty of outliers.
Here we can see that N=400 for most variables; the two exceptions are “mobility” (N=399)
and “acs” (N=398). Hence, the next step was to remove the rows with missing data using
“na.omit”, which creates a new object without any missing data.
The result of applying “na.omit” to “calschooldist” was stored as “calschooldist2”, and in the
table we can see that all variables now have the same N. Then “calschooldist2” was copied to
“calschool” to simplify the procedure and avoid mistakes.
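The missing-data step can be illustrated with a minimal sketch. The data frame `toy` and its values are invented for illustration and only borrow a few column names from the text; the real file is not reproduced here.

```r
# Toy data frame standing in for calschooldist.csv (values are made up).
toy <- data.frame(
  acadperf = c(693, 570, 546, 571, 478, 858),
  meals    = c(67, 92, 97, 90, 89, 10),
  mobility = c(11, 33, NA, 27, 44, 10),
  acs      = c(16, 15, 17, NA, 18, 20)
)

# Non-missing count per variable: shows which columns have a smaller N.
n_per_variable <- colSums(!is.na(toy))
print(n_per_variable)

# na.omit() drops every row that has at least one missing value.
toy_complete <- na.omit(toy)
print(nrow(toy_complete))
```

Because `na.omit` removes whole rows, every variable ends up with the same N afterwards, which is the behavior described above.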
> cor(calschool)
> calschoolcor=cor(calschool)
> calschoolcor
> correlation_table_allvariable<- calschoolcor
> View(correlation_table_allvariable)
> write.csv(correlation_table_allvariable, file = "Correlation Matric California School - All Variables Included.csv")
Surprisingly, all correlations are statistically significant, which suggests there might be a
problem with the data. Nevertheless, there are many noteworthy correlations that might help in
building the model for academic performance. There are strong correlations between independent
variables: “meals” with “ell” (0.77106), “meals” with “not_hsg” (0.68268), and “ell” with
“not_hsg” (0.72030). These correlations are high, but all are below 0.8, which reduces the
concern about multicollinearity in the model.
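The 0.8 cutoff check can be written out as a short sketch. It uses only the three correlations quoted above; the object names `r`, `idx` and `report` are illustrative and not part of the original analysis.

```r
# Correlations among predictors as reported in the text.
vars <- c("meals", "ell", "not_hsg")
r <- matrix(1, nrow = 3, ncol = 3, dimnames = list(vars, vars))
r["meals", "ell"]     <- r["ell", "meals"]     <- 0.77106
r["meals", "not_hsg"] <- r["not_hsg", "meals"] <- 0.68268
r["ell", "not_hsg"]   <- r["not_hsg", "ell"]   <- 0.72030

# Keep each pair once (upper triangle) and compare against the 0.8 cutoff.
idx <- which(upper.tri(r), arr.ind = TRUE)
report <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
report$above_cutoff <- abs(report$r) > 0.8
print(report)
```

With the full matrix from `cor(calschool)`, the same indexing would apply to every pair of predictors rather than just these three.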
“Not_hsg” has a stronger relationship with academic performance than “grad_sch” and
“col_grad”. Therefore, “col_grad” was dropped from consideration and “not_hsg” is used
instead. Likewise, because “emer” has a stronger relationship than “full”, “full” is replaced
with “emer” in the model.
Based on the correlation matrix, the analysis continues with the following variables, ordered
by strength of correlation: “Meals”, “Ell”, “Not_hsg”, “Emer” and “Enroll”.
The “enroll” variable does not have a strong relationship with academic performance, but it is
still taken into consideration to see whether, during the regression model building, it affects
the overall model or becomes significant at some point.
The residual standard error of 61.95 is the standard deviation of the residuals: when academic
performance is predicted from meals, the prediction is typically off by about 61.95 points. The
reported percentage error is 6.96.
The multiple R-squared tells us the proportion of variance in the data that is explained by the
model. It tells us that 81% of the variation in academic performance can be explained by meals
provided.
The F-statistic is a good indicator of whether there is a relationship between our predictor and
the response variable. F is far higher than 1, which points to a relationship between the two;
the p-value reported with F gives the formal test.
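The three summary quantities discussed above can be recomputed by hand, which makes their definitions concrete. This is a sketch on simulated data: the coefficients 900 and -4, the noise level and the sample size are invented, and only the variable names are borrowed from the text.

```r
set.seed(1)
n <- 200
# Simulated stand-in for the school data (invented values).
meals    <- runif(n, min = 0, max = 100)
acadperf <- 900 - 4 * meals + rnorm(n, sd = 60)
fit <- lm(acadperf ~ meals)

rss <- sum(residuals(fit)^2)               # residual sum of squares
tss <- sum((acadperf - mean(acadperf))^2)  # total sum of squares
p   <- 1                                   # number of predictors

# Residual standard error: sqrt(RSS / residual degrees of freedom).
rse <- sqrt(rss / df.residual(fit))
# Multiple R-squared: share of total variation explained by the model.
r2 <- 1 - rss / tss
# F-statistic: explained variation per predictor over residual variance.
f_stat <- ((tss - rss) / p) / (rss / df.residual(fit))
f_pval <- pf(f_stat, p, df.residual(fit), lower.tail = FALSE)

# The hand-computed values agree with what summary() reports.
s <- summary(fit)
print(c(rse = rse, r2 = r2, f = f_stat))
print(c(sigma = s$sigma, r.squared = s$r.squared, f = s$fstatistic[["value"]]))
```

The F value on its own is hard to judge, so `f_pval` is the quantity to check when deciding whether the relationship is statistically significant.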
Histogram of residuals:
> hist(scale(regression_1$residuals))
Graphs matrix:
> layout(matrix(c(1,2,3,4),2,2))
> plot(regression_1)
The diagnostic plots for the first regression, together with the histogram, show that the
residuals are centered around zero, which is consistent with normally distributed errors and a
linear relationship. On that basis, this model meets the assumptions of linear regression.
The residuals and fitted values in the output are raw values. We standardized the residuals in
order to count how many fall more than two standard deviations from zero, as a check on their
normality. The output tells us that less than 5% of our data fall outside two standard
deviations.
After adding “ell”, the model improves. The coefficient standard errors and the residual
standard error decrease, the multiple R-squared increases, and the variables remain
statistically significant.
After adding “ell” + “not_hsg”, the model stays roughly constant. The variables remain
significant and the multiple R-squared is almost the same as in the previous model. The
estimates and the residual standard error decrease slightly, while F keeps diminishing.
After adding “ell” + “not_hsg” + “emer”, the model again stays roughly constant. The variables
remain significant and the multiple R-squared increases slightly over the previous model. The
estimates and the residual standard error decrease slightly, while F keeps diminishing.
After adding “ell” + “not_hsg” + “emer” and “enroll”, the model again stays roughly constant.
The variables remain significant, except for “enroll”, as expected. The multiple R-squared
increases a little further. The estimates decrease slightly and the standard errors increase
slightly, while F keeps diminishing but remains above one.
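The step-by-step model building described above can be sketched with `update()`, collecting the statistics that are compared at each step. The data are simulated: every coefficient and range below is invented, and the model names `m1` to `m5` are hypothetical rather than the names used in the original session.

```r
set.seed(42)
n <- 300
# Simulated predictors and response; none of these values come from the real data.
toy <- data.frame(meals = runif(n, 0, 100), ell = runif(n, 0, 80),
                  not_hsg = runif(n, 0, 60), emer = runif(n, 0, 40),
                  enroll = round(runif(n, 100, 1500)))
toy$acadperf <- 880 - 3 * toy$meals - 1 * toy$ell - 0.5 * toy$not_hsg -
  1.5 * toy$emer + rnorm(n, sd = 50)

# Add one predictor at a time, in the order used in the text.
m1 <- lm(acadperf ~ meals, data = toy)
m2 <- update(m1, . ~ . + ell)
m3 <- update(m2, . ~ . + not_hsg)
m4 <- update(m3, . ~ . + emer)
m5 <- update(m4, . ~ . + enroll)
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5)

# One row per model: the fit statistics compared step by step.
comparison <- data.frame(
  r.squared = sapply(models, function(m) summary(m)$r.squared),
  rse       = sapply(models, function(m) summary(m)$sigma),
  f         = sapply(models, function(m) summary(m)$fstatistic[["value"]])
)
print(round(comparison, 3))

# Sequential F tests: does each added predictor reduce the residual sum of squares?
print(anova(m1, m2, m3, m4, m5))
```

The multiple R-squared can never decrease when a predictor is added to a nested model, so a small increase by itself says little; the sequential tests from `anova()` are a more direct check of whether an added variable such as “enroll” earns its place.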
> hist(scale(regression_2final$residuals))
> layout(matrix(c(1,2,3,4),2,2))
> plot(regression_2final)
> regression_2final$standardized.residuals <- rstandard(regression_2final)
> regression_2final$large_residual <- regression_2final$standardized.residuals >2 |
regression_2final$standardized.residuals < -2
> sum(regression_2final$large_residual)
[1] 19
The standardized residual analysis tells us how many residuals are more than two standard
deviations away from zero: 19, which is less than 5% of the sample of 398. The plots are
consistent with linearity and normally distributed residuals.
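The 5% benchmark can be checked against what a normal distribution would predict. The sketch below uses only the two counts reported above; the object names are illustrative.

```r
n_large <- 19    # standardized residuals beyond +/- 2, from the output above
n_total <- 398   # complete cases used in the model

# Observed share of large residuals.
share_observed <- n_large / n_total

# Share expected beyond +/- 2 standard deviations if residuals were exactly normal.
share_expected <- 2 * pnorm(-2)
count_expected <- share_expected * n_total

print(round(100 * c(observed = share_observed, expected = share_expected), 2))
print(round(count_expected, 1))
```

An observed share close to the normal-theory share is what one would hope to see; a much larger share would point to heavy tails or outliers.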
There is no need to worry about multicollinearity: the VIF values for all variables are in an
acceptable range. None of the variables has a VIF above 10, so intercorrelation among the
predictors is not a concern.
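For reference, a VIF can be computed in base R without any add-on package, which shows what the number measures. This is a sketch on simulated, deliberately correlated predictors; all values are invented and `vif_by_hand` is a hypothetical name.

```r
set.seed(7)
n <- 300
# Simulated, deliberately correlated predictors (invented values).
meals   <- runif(n, 0, 100)
ell     <- 0.6 * meals + rnorm(n, sd = 15)
not_hsg <- 0.4 * meals + 0.2 * ell + rnorm(n, sd = 10)
emer    <- runif(n, 0, 40)
predictors <- data.frame(meals, ell, not_hsg, emer)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_by_hand, 2))
print(all(vif_by_hand < 10))
```

A VIF of 1 means a predictor is uncorrelated with the others, and the value grows as the other predictors explain more of it.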
The value of 1.41 tells us that the errors are independent and uncorrelated, because this value
is greater than 1 and less than 3.
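The 1 to 3 range quoted above is a common rule of thumb for the Durbin-Watson statistic, so the sketch below assumes that this is the test behind the 1.41; the original output is not shown. The data are simulated with independent errors, and all values are invented.

```r
set.seed(3)
# Simulated data with independent errors (invented values).
x <- runif(200, 0, 100)
y <- 900 - 4 * x + rnorm(200, sd = 60)
fit <- lm(y ~ x)

e <- residuals(fit)
# Durbin-Watson statistic: sum of squared successive differences
# of the residuals divided by the sum of squared residuals.
dw <- sum(diff(e)^2) / sum(e^2)
print(round(dw, 2))
```

The statistic lies between 0 and 4, with values near 2 indicating no first-order autocorrelation. It depends on the order of the rows, so it is most meaningful when the observations have a natural ordering.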
The factors that help explain academic performance per school are the % of students
receiving free meals (a proxy for income, which was absent from the data), the % of English
Language Learners, the % of parents with no high school diploma (a proxy for parental
involvement) and the % of teachers with emergency credentials. These variables are
statistically significant, except for enrollment of students per school, which does not show a
relationship but does improve the model by diminishing error and increasing explained
variance.
84% of the variation in academic performance can be explained by all these factors, except for
enrollment of students.
Advanced Extensions
1. Visualization Extensions
> library(corrplot)
> library(stargazer)
> corrplot(calschoolcor, type="lower")
> stargazer(regression_1, regression_2final, title="Regression Results", dep.var.labels=c("Academic Performance"), type="text")
2. Additional Diagnostics