141 HW 1
Professor Gupta
STA141A
4/21/18
Homework 1
Question 1
How many observations are recorded in the dataset? How many colleges are recorded?
There are 3312 observations in the dataset. However, only 2431 colleges are recorded.
Question 2
How many features are there? How many of these are categorical? How many are discrete? Are
there any other kinds of features in this dataset?
Question 3
How many missing values are in the dataset? Which feature has the most missing values? Are
there any patterns?
There are 23398 missing values in the dataset. The feature with the most missing values is the
"other race" feature, at 51 missing values. One pattern is that if a single race feature is missing,
the other race features are most likely missing as well. Similarly, if one of the gender features
is missing, the other is missing too.
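The counting described above can be sketched as follows. This is a minimal example on a made-up data frame (the column names and values are hypothetical stand-ins, not the scorecard data), using only base R.

```r
# Toy data frame standing in for the scorecard data (values are made up).
toy <- data.frame(
  race_white = c(0.5, NA, 0.7, NA),
  race_other = c(0.1, NA, NA, NA),
  men        = c(0.4, 0.5, NA, 0.6),
  women      = c(0.6, 0.5, NA, 0.4)
)

na_counts <- colSums(is.na(toy))         # missing values per column
total_na  <- sum(na_counts)              # missing values overall
most_na   <- names(which.max(na_counts)) # column with the most missing values

# Number of rows where both race columns are missing together
both_race_na <- sum(is.na(toy$race_white) & is.na(toy$race_other))
```

Comparing both_race_na with the per-column counts is one rough way to see whether the missing values tend to occur together.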
Question 4
Are there more public colleges or private colleges recorded? For each of these, what are the
proportions of highest degree awarded? Display this information in one graph and comment on
what you see.
There are 716 public colleges and 2596 private colleges (non-profit and for-profit combined).
Therefore, more private colleges are recorded than public ones. The proportion of highest degree for public is 13.7% and for
private is 86.3%. The highest degree for both public and nonpublic colleges.
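One way to get the proportions of highest degree within each ownership type is a row-wise prop.table on a two-way table. The sketch below uses a made-up data frame; the is_public column and all values are hypothetical and only illustrate the technique.

```r
# Toy data (made up); ownership and highest_degree mimic the scorecard columns.
toy <- data.frame(
  ownership = c("Public", "Public", "Nonprofit",
                "For Profit", "For Profit", "Public"),
  highest_degree = c("Graduate", "Bachelor", "Graduate",
                     "Certificate", "Bachelor", "Graduate")
)

# Collapse ownership into public vs. private (hypothetical helper column)
toy$is_public <- ifelse(toy$ownership == "Public", "Public", "Private")

counts <- table(toy$is_public, toy$highest_degree)
props  <- prop.table(counts, margin = 1)  # proportions within each row

# mosaicplot(counts) would show both groups in one graph
```

With margin = 1 each row sums to 1, so the public and private distributions can be compared directly even though the group sizes differ.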
Question 5
What is the average undergraduate population? What is the median? What are the deciles?
Display these statistics and the distribution graphically. Do you notice anything unusual?
Quartiles of undergraduate population:
0%   25%   50%    75%    100%
0    428   1295   3372   166816
Deciles of undergraduate population:
0%   10%   20%     30%   40%     50%    60%      70%      80%      90%      100%
0    153   319.2   536   847.6   1295   1811.8   2674.5   4550.8   9629.8   166816
The mean is 3599.502, or roughly 3600 (since there cannot be a fraction of a person). The
median is 1295. In the deciles, the maximum value (166816) appears to be an outlier.
However, removing that outlier would remove valuable information. Upon further inspection, the
outlier is University of Phoenix Online, so it is plausible that a very large number of people
take classes there.
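The effect of one very large value on the mean can be sketched with a small made-up vector. The numbers below are hypothetical and only loosely shaped like the deciles above; quantile with probs = seq(0, 1, by = 0.1) is the base R way to get deciles.

```r
# Made-up population sizes with one extreme value at the end
pop <- c(0, 150, 300, 500, 800, 1300, 1800, 2700, 4500, 9600, 166000)

pop_mean   <- mean(pop, na.rm = TRUE)
pop_median <- median(pop, na.rm = TRUE)

# Deciles: the 0%, 10%, ..., 100% quantiles
deciles <- quantile(pop, probs = seq(0, 1, by = 0.1), na.rm = TRUE)

# A mean far above the median indicates a strong right skew
skewed <- pop_mean > pop_median
```

Because the median and the deciles depend only on rank, they are barely affected by the single extreme value, while the mean is pulled far upward.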
Question 6
Compare tuition graphically in the 5 most populous states. Discuss conclusions you can draw
from your results.
Apart from outliers in CA, IL and FL, the states CA, IL, FL and TX all appear to have roughly
equal variances. The exception is NY, which has an extremely large variance.
Question 7
Question 8
For schools that are for-profit in ownership and issue Bachelor’s degrees as their
primary_degree, do the following:
a. Visualize revenue_per_student and spending_per_student and describe the relationship. What
issues may arise when fitting a linear regression model?
b. Create a new variable called total_net_income. Think carefully about how this variable would
be calculated. Visualize the top 5 earning schools.
University                  Tuition
University Phoenix-Online   2103883392
Ashford                     542588295
Kaplan                      333051147
DeVry                       315034368
Grand Canyon                275134706
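A ranking like the one above can be computed roughly as follows. This is a minimal sketch on a made-up data frame (the school names and numbers are hypothetical). It assumes net income is the per-student margin multiplied by undergrad_pop, which is the formula used in the code for ##8 at the end of this document.

```r
# Toy data (made up); column names follow the ##8 code in this document.
toy <- data.frame(
  name                = c("A", "B", "C"),
  undergrad_pop       = c(1000, 200, 50),
  revenue_per_student = c(12000, 15000, 9000),
  spend_per_student   = c(7000, 16000, 4000)
)

# Net income = number of students * (revenue - spending) per student
toy$total_net_income <- toy$undergrad_pop *
  (toy$revenue_per_student - toy$spend_per_student)

# Sort by net income, largest first, and keep the top rows
top <- head(toy[order(toy$total_net_income, decreasing = TRUE), ], 2)
```

Note that a school spending more per student than it takes in (school B here) ends up with a negative total_net_income.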
For most schools, the lower the revenue per student, the less they spend per student. The
outliers that spend more on fewer students are most likely from private colleges. One problem
that might arise when fitting a linear regression model is outliers skewing the fit.
Question 9
Now, examine the relationship between avg_sat and admission for all schools.
a. Use an appropriate plot to visualize the relationship. Split the data into two groups based on
their combination of avg_sat and admission. Justify your answer.
b. Using code to justify your answers, comment on how the following continuous variables
change depending on group:
(a) med_10yr_salary
(b) The percentage of race_white and race_asian combined
(c) The percentage of graduate students enrolled at a university
c. Using code to justify your answers, comment on whether the categorical variables are
dependent or independent of group:
(a) open_admission
(b) main_campus
(c) ownership
(d) Whether the university has more than 1 branch or not
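One possible way to check whether a categorical variable depends on group is a chi-squared test of independence on the cross-tabulation. The sketch below uses made-up counts (not results from the scorecard data) and base R's chisq.test. Choosing this test is an assumption, not something the assignment specifies.

```r
# Made-up 2x2 table of counts: group (rows) by open_admission (columns)
counts <- matrix(c(30, 10,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("above", "below"),
                                 open_admission = c("No", "Yes")))

# Chi-squared test of independence (Yates correction is applied for 2x2 tables)
test <- chisq.test(counts)

# A small p-value suggests the variable is not independent of group
dependent <- test$p.value < 0.05
```

With real data the table would come from something like table(data$group, data$open_admission), once a group column has been defined.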
Question 10
Categorical variables that would improve the regression are ownership and
open_admission. highest_degree would show which degrees lead to higher salaries.
Adding open_admission to average family income would help explain the
differences between salaries coming from different colleges. Ownership often shows that
students from private colleges earn higher salaries than those from public colleges.
data=readRDS("~/Desktop/RStudio/Data141a/college_scorecard_2013.rds")
library(tidyverse)
##1
nrow(data)
length(data$main_campus)
#Two different ways to find the number of observations:
#one is nrow and the other is the length of a single column
sum(data$main_campus)
#Some rows may not be distinct colleges, so the row count can differ from the number of colleges;
#summing main_campus counts only the main campuses
##2
table1 <- sapply(data, class)
dmy<- 'integer'
sapply(data, class)==dmy
(sapply(data,class)==dmy)[sapply(data, class)==dmy]
table1==dmy
table1[table1==dmy]
names(table1[table1==dmy])
#Lecture
##3
is.na(data)
dataNA=colSums(is.na(data))
which.max(dataNA)
#First we flag the NA values with the is.na function, then colSums counts them for each column
#The which.max function finds the index of the column with the most NA values.
NA_counts<-colSums(is.na(data))
sum(NA_counts)
hist(NA_counts)
table(NA_counts)
which(NA_counts==490)
##4
table(data$ownership=="Public")
table(factor(data$ownership=="Public",labels=c("Not Public","Public")))
degree=as.numeric((data$highest_degree))
ownership=as.numeric((data$ownership))
table(data$ownership,data$highest_degree)
#Lecture
#From there, we can count the degrees for public against nonpublic
data$IsPublic<-factor(data$ownership=="Public",labels=c("Not Public","Public"))
dim(data)
#Next we plot the data; mosaic plots and barplots relay the same information,
#however, a mosaic plot gives all the data on the same graph.
mat1<-is.na(data)
mat2<-!is.na(data)
dim(mat1)
dim(mat2)
sum(mat1)/(3312*51)
sum(mat2)/(3312*51)
#Lecture
#Use the sum function to find the number of missing values, then divide by the total number of entries
##5
undergrad_mean = mean(data$undergrad_pop, na.rm = TRUE)
undergrad_median = median(data$undergrad_pop, na.rm = TRUE)
#First we calculate the mean and median using the mean and median functions
#https://stackoverflow.com/questions/26273892/r-splitting-dataset-into-quartiles-deciles-what-is-the-right-method
#Next we find the quartiles and deciles with the quantile function; for deciles,
#we need to specify the probabilities.
x = is.na(data$undergrad_pop)
y=data[x == FALSE,]
table(is.na(y$undergrad_pop))
table(x)
dataomit=na.omit(data)
hist(dataomit$undergrad_pop)
abline(v = undergrad_mean, col = "red")
abline(v = undergrad_median, col = "blue")
hist(dataomit$undergrad_pop)
abline(v=quantile(dataomit$undergrad_pop, c(.25,.5,.75)), col = "brown" )
hist(dataomit$undergrad_pop)
abline(v=quantile(dataomit$undergrad_pop, probs=seq(0,1,length.out = 11)), col = "purple" )
#We omit the NA values since changing them to 0 or any value could skew the data result/graph
#Afterwards, we draw ablines for the mean, median, quartiles, and deciles.
#https://www.r-bloggers.com/quartiles-deciles-and-percentiles/
##6
top5= data[data$state %in% c("CA","TX", "NY", "IL", "FL"),]
topcollege5 = droplevels(top5$state)
#California, Texas, New York, Illinois, Florida are given as the 5 most populous states
#Droplevels to remove data points which are not in use
#Use boxplot to show the top populous states and their tuition
##7
data_satavg=data[,c(6,14)]
which.max(data$avg_sat)
data[105,]
#Using the which function to find the index where the avg_sat is the largest,
#then calling it in the data
data_openpop=data[,c(6,5,15)]
which.max(data$undergrad_pop)
data[2371,]
#Using the which function to find the index where the undergrad_pop is the largest,
#then calling it in the data
data_zipfamily=data[,c(6,9,13,29)]
a=data %>% filter(ownership == "Public")
b=data %>% filter(ownership == "Public") %>% summarize(which.min(avg_family_inc))
a[348,]
#subset the data to filter only public, then from the subset data, use which to find the smallest
#avg_family_inc index, call the index from the subset data to find the zip code.
data_gradpop=data[,c(6,5,15,16)]
which.max(data$grad_pop)
data[248,]
#Using the which function to find the index where the grad_pop is the largest,
#then calling it in the data
##8
newdata1=data %>% filter(ownership == "For Profit", primary_degree == "Bachelor")
ggplot(newdata1, aes(x=revenue_per_student, y=spend_per_student))+
geom_point()+geom_smooth(method = "lm", se = FALSE, lty=2)
newdata1$total_net_income = newdata1$undergrad_pop*
(newdata1$revenue_per_student - newdata1$spend_per_student)
##9
data$avg_sat
data$admission
plot(data$avg_sat,data$admission)
#Subset of schools with high avg_sat and low admission rate
data_high = data %>% filter(avg_sat > 1200 & admission < .5)
#Label every row by group; TRUE maps to "above" and FALSE to "below"
data$group <- factor(data$avg_sat > 1200 & data$admission < .4, levels=c(TRUE, FALSE), labels=c("above","below"))
#First subset the data by avg_sat and admission; subsetting on one variable alone will not help,
#so to provide a better subset we put conditions on both variables
#Then plot the data using the subsetted data, continue using the subsetted data for the other graphs
##10
ggplot(data, aes(x=avg_family_inc, y=avg_10yr_salary))+geom_point()+
geom_smooth(method = "lm", se = FALSE, lty=2)
data4= data[-c(1789,2030,1956,682,1331,891,1085,1577,1241,1337,2180,1635,2204),]
#Plot the data using the original data frame, afterwards, find the outliers and remove them
#replot the data