C 177 Final
Abby Vogel
c177 Final Project
3 May 2017
East Bay Gas Prices
Introduction
On a global scale, gasoline and fuel prices are determined by a range of factors: national
tariffs, taxes, and regulations, as well as state taxes and ever-changing market supply and
demand. At the local level, gas prices are more nuanced, seemingly shaped by an array of social,
political, and spatial factors. The goal of this project is to determine the factors driving
differences in gas prices across four East Bay cities: Berkeley, Emeryville, Oakland, and Piedmont.
The relative strength of each explanatory variable is assessed using two forms of regression,
Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR).
Overview:
Research Question: What factors contribute to differences in East Bay gas prices?
Are these trends spatially dependent?
Null Hypothesis: Variation in gas prices is random and not spatially dependent.
Any variation is due to chance.
Study Area: Cities of the East San Francisco Bay Area in California: Berkeley,
Emeryville, Oakland, and Piedmont. [Figure 1]
Variable | Description | Source | Coordinate system (format) | Spatial unit
price / station location | Price as of April 27, 2017 | Gas Buddy, Google Maps | WGS84 (KML) | Point data
popdens | Population Density | ACS 2015 (5 Year Estimates) | NAD83 GCS (TIGER) | Census Block Group
med_income | Median Income | ACS 2015 (5 Year Estimates) | NAD83 GCS (TIGER) | Census Block Group
med_rent | Median Rent | ACS 2015 (5 Year Estimates) | NAD83 GCS (TIGER) | Census Block Group
brand | Gas station brand, aggregated to 76, ARCO, Chevron, Mobil, Quik Stop, Shell, Valero and Other | Gas Buddy, Google Maps | WGS84 (KML) | Point data
Methods
Linear Regression: $Y = c + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$
Linear regression was used to compute the significance of each of the explanatory
variables. Exploratory regression and Ordinary Least Squares were both used to determine the
factors responsible for variation in gas prices by station. Modeling was performed in
both ArcMap and R. OLS was run in R, using the linear modeling function (lm), so that brand
could be included as an explanatory variable. Additionally, R was used to test a multiplicative
model that incorporated the effects of interaction terms.
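The difference between the additive and multiplicative specifications can be sketched in R as follows. This is an illustration only: the data are simulated rather than taken from the project, and the brand levels, units, and coefficient values are assumptions chosen so that the income effect differs by brand.

```r
# Synthetic data only: prices are simulated, not the project's station data.
set.seed(1)
n <- 120
stations <- data.frame(
  brand      = factor(sample(c("ARCO", "Chevron", "Other"), n, replace = TRUE)),
  med_income = runif(n, 30, 150),   # hypothetical units: thousands of dollars
  near_dist  = runif(n, 0, 2000)    # hypothetical units: meters
)

# Assumed data-generating process with a brand-specific income slope.
intercepts <- c(ARCO = 2.70, Chevron = 3.10, Other = 2.90)
slopes     <- c(ARCO = 0.0005, Chevron = 0.0030, Other = 0.0015)
b <- as.character(stations$brand)
stations$price <- unname(intercepts[b] + slopes[b] * stations$med_income) +
  rnorm(n, sd = 0.05)

# Additive model: one intercept shift per brand, slopes shared across brands.
additive <- lm(price ~ brand + med_income + near_dist, data = stations)

# Multiplicative model: `*` expands to main effects plus every interaction.
multiplicative <- lm(price ~ brand * med_income * near_dist, data = stations)

adj_r2 <- c(additive       = summary(additive)$adj.r.squared,
            multiplicative = summary(multiplicative)$adj.r.squared)
print(round(adj_r2, 3))

# F test for nested models: do the interaction terms add explanatory power?
print(anova(additive, multiplicative))
```

With three brand levels, the additive model estimates 5 coefficients and the multiplicative model 12, so the adjusted R² (which penalizes extra terms) is a fairer basis for comparison than the raw R².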
Geographically Weighted Regression: GWR was used to assess the spatial significance
of each explanatory variable. GWR fits local regression estimates and is better suited to
predicting spatially dependent phenomena. If the OLS model produces spatially autocorrelated
standardized residuals, indicated by a statistically significant Moran's I, then GWR is expected
to perform better. However, if the GWR model also produces spatially autocorrelated standardized
residuals, then the model is still missing key explanatory variables. GWR was first run without
brand as an explanatory variable, then run again with brand added as a dummy variable, so that
the relative strengths of the two models could be compared.
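The idea of local regression estimates can be sketched in base R. This is not the ArcMap GWR tool used in this project: the data are simulated, and the function name gwr_sketch, the Gaussian kernel, and the bandwidth of 0.15 are assumptions made for illustration. At each location the sketch fits a weighted least-squares regression in which nearby observations receive more weight.

```r
# Synthetic data only; coordinates, bandwidth, and coefficients are assumptions.
set.seed(2)
n <- 150
x <- runif(n)                      # hypothetical easting, rescaled to 0..1
y <- runif(n)                      # hypothetical northing, rescaled to 0..1
income <- rnorm(n)
# The income effect drifts from 0.5 in the west to 1.5 in the east.
price <- 3 + (0.5 + x) * income + rnorm(n, sd = 0.1)
d <- data.frame(price, income, x, y)

# Hypothetical helper: one weighted regression per observation location.
gwr_sketch <- function(data, bandwidth) {
  dists <- as.matrix(dist(data[, c("x", "y")]))    # pairwise distances
  t(sapply(seq_len(nrow(data)), function(i) {
    w <- exp(-0.5 * (dists[i, ] / bandwidth)^2)    # Gaussian kernel weights
    coef(lm(price ~ income, data = data, weights = w))
  }))
}

local_coefs  <- gwr_sketch(d, bandwidth = 0.15)    # one row per location
global_coefs <- coef(lm(price ~ income, data = d)) # single OLS estimate

print(round(global_coefs, 2))
print(round(range(local_coefs[, "income"]), 2))
```

The global OLS fit returns a single income coefficient, while the local fits return one per location. When the true relationship varies over space, as it does in this simulation, the local coefficients track that variation and the global coefficient averages over it.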
Results
OLS: The exploratory regression of all variables (except brand) returned extremely low
R² values. Running this entire field of variables under the OLS model:
price ~ t30_prop + dalone_prop + popdens + med_income + med_rent + near_dist + comp
resulted in an adjusted R² value of 0.0025. This model failed to explain the variation in gas
prices, and resulted in highly spatially autocorrelated standardized residuals. Finally, this
model yielded significant Koenker and Jarque-Bera statistics, indicating non-stationarity
and non-normally distributed residuals.
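For reference, the Jarque-Bera statistic can be computed by hand from the skewness S and kurtosis K of the residuals as JB = n/6 * (S² + (K - 3)²/4), compared against a chi-squared distribution with 2 degrees of freedom. The sketch below uses this textbook formula on simulated residuals; the function name jarque_bera is hypothetical, and the output is not meant to reproduce the ArcMap report.

```r
# Hypothetical helper implementing the textbook Jarque-Bera formula.
jarque_bera <- function(res) {
  n    <- length(res)
  z    <- res - mean(res)
  m2   <- mean(z^2)
  skew <- mean(z^3) / m2^1.5        # sample skewness
  kurt <- mean(z^4) / m2^2          # sample kurtosis (normal = 3)
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat, skewness = skew, kurtosis = kurt,
    p_value = pchisq(stat, df = 2, lower.tail = FALSE))
}

# Simulated residuals: one normal sample, one heavy-tailed sample.
set.seed(3)
jb_normal <- jarque_bera(rnorm(500))
jb_heavy  <- jarque_bera(rt(500, df = 3))

print(signif(jb_normal, 3))
print(signif(jb_heavy, 3))
```

A small p-value, as expected for the heavy-tailed sample, indicates that the residuals depart from normality.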
Next, brand was added to the linear model to see whether the performance of the OLS
changed. Additionally, the model was limited to strictly significant terms. Using
R, the OLS model became:
price ~ brand + med_income + near_dist
This addition improves the predictive power of the OLS model, raising the adjusted R² to 0.4336.
To test the normality of the standardized residuals, both a quantile-quantile (QQ) plot and a
histogram were generated. Excluding one extremely low outlier, the QQ plot is generally linear
but departs markedly from the normal line at the low and high quantiles. This indicates a roughly
normal curve with positive excess kurtosis (heavy tails), which is also visible in the histogram.
The standardized residuals were tested for spatial autocorrelation and returned a statistically
significant Moran's I, suggesting that this model is missing a key explanatory variable.
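A heavy-tailed QQ pattern of this kind can be reproduced on simulated data. The sketch below is an illustration under stated assumptions: the model and its t-distributed errors are invented, and it uses qqnorm and qqline, which compare against theoretical normal quantiles, rather than the random normal draws used in the R Code section.

```r
# Avoid writing Rplots.pdf when this is run as a script rather than interactively.
if (!interactive()) pdf(NULL)

# Synthetic regression with heavy-tailed (t-distributed) errors.
set.seed(5)
sim <- data.frame(x = runif(200))
sim$y <- 1 + 2 * sim$x + 0.2 * rt(200, df = 3)
fit <- lm(y ~ x, data = sim)
std_res <- rstandard(fit)

# Normal QQ plot: points bending away from the line at both ends
# indicate tails heavier than a normal distribution.
qq <- qqnorm(std_res, main = "Normal QQ Plot of Std Residuals")
qqline(std_res)

# Excess kurtosis: 0 for a normal distribution, positive for heavy tails.
z <- std_res - mean(std_res)
excess_kurtosis <- mean(z^4) / mean(z^2)^2 - 3
print(round(excess_kurtosis, 2))
```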
Finally, the last linear model tested was a multiplicative Ordinary Least Squares
model. This model incorporates interaction terms for each of the variables:
price ~ brand * med_income * near_dist
This model yields the strongest predictive power of all of the OLS models, with an adjusted R²
of 0.4998. Its standardized residuals are closer to normal, as shown in the QQ plot and
histogram. The spatial autocorrelation of the standardized residuals was calculated in ArcMap.
This model has the least spatially autocorrelated standardized residuals, with a z-score of 3.07
for Moran's I. However, this value is still significant, with a p-value of 0.002.
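To make the Moran's I test concrete, the sketch below computes the statistic by hand on simulated values and assesses it with a permutation test. This is a swapped-in technique for illustration: ArcMap reports an analytical z-score, whereas this sketch uses random relabeling. The function name morans_i, the binary distance-band weights, and the 0.2 threshold are all assumptions. The final lines only convert the reported z-score of 3.07 into a two-sided normal p-value.

```r
# Hypothetical helper: global Moran's I for a vector of values and a
# spatial weights matrix w (w[i, j] > 0 when i and j are neighbors).
morans_i <- function(values, w) {
  z <- values - mean(values)
  (length(values) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Synthetic locations and values; nothing here comes from the project data.
set.seed(4)
n <- 100
coords <- cbind(runif(n), runif(n))
w <- (as.matrix(dist(coords)) < 0.2) * 1   # neighbors within 0.2 units
diag(w) <- 0                               # a point is not its own neighbor

# Simulated "residuals" with a west-to-east trend, so nearby values are similar.
res_sim  <- coords[, 1] + rnorm(n, sd = 0.1)
observed <- morans_i(res_sim, w)

# Permutation test: shuffle the values over the locations 999 times.
permuted <- replicate(999, morans_i(sample(res_sim), w))
pseudo_p <- (sum(permuted >= observed) + 1) / (999 + 1)
pseudo_z <- (observed - mean(permuted)) / sd(permuted)
print(round(c(I = observed, z = pseudo_z, p = pseudo_p), 3))

# Two-sided p-value implied by the reported z-score of 3.07.
p_reported <- 2 * pnorm(-abs(3.07))
print(round(p_reported, 3))   # 0.002
```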
GWR: First, a GWR of all variables excluding brand was modeled in ArcMap. This
model has an adjusted R² of 0.3337, an improvement over the OLS model of the
same variables. However, this model yields spatially autocorrelated standardized residuals,
indicating that the model is under-performing. [Figure 12]
Gasoline brand was added to the model in the form of dummy variables
indicating which corporation owned the station. The levels of this variable were aggregated to
76, ARCO, Chevron, Mobil, Quik Stop, Shell, Valero and Other to keep unique gas
brands, such as Berkeley Smog and Gas, from over-fitting the model. Using the model price ~
brand + med_income + near_dist, the adjusted R² improved to 0.7385. This model accounts for
the largest share of variation in gas prices over the study area. However, this model also yields
spatially autocorrelated standardized residuals, with a z-score of 5.62 for Moran's I. [Figure 13]
Conclusion
With this final result, it is evident that key variables are missing from this regression.
Further iteration of this analysis is needed to build a model that has strong explanatory power
without spatially autocorrelated or non-normal standardized residuals. Overall, the
multiplicative OLS model with interaction terms,
price ~ brand * med_income * near_dist
produces the least autocorrelated standardized residuals. The GWR model of the same
explanatory variables yields the highest adjusted R², but with autocorrelated residuals. From this
we can conclude that Brand, Median Income, and Distance to Major Transit Roads are the most
important explanatory variables considered in this analysis. We can reject the null hypothesis that
variation in gas price data is random and spatially independent.
Support Software
ArcGIS version 10.4 and R version 3.3.2 were used in this analysis.
Figures
Figure 1. Map of East Bay
Figures 12 and 13
R Code
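# Load the station price table. stringsAsFactors = TRUE keeps Name as a factor
# (the default in R 3.3.2, but not in R >= 4.0).
gas <- read.csv("~/Desktop/gas_27_TableToExcel.csv", stringsAsFactors = TRUE)
hist(gas$price, main="Distribution of East Bay Gas Prices", xlab="Price (US Dollars)")
plot(price~Name, data=gas, main="Price by Brand", ylab="US Dollars")
# Set the brand of row 64 to Valero.
gas$Name[64] <- "Valero"
levels(gas$Name)
# Collapse every brand outside the seven major chains into "Other".
# The "Other" level is added first, because assigning a value that is not
# already a level of the factor would produce NA.
major <- c("76", "ARCO", "Chevron", "Mobil", "Quik Stop", "Shell", "Valero")
levels(gas$Name) <- c(levels(gas$Name), "Other")
gas$Name[!(gas$Name %in% major)] <- "Other"
gas$Name <- droplevels(gas$Name)
levels(gas$Name)
write.csv(gas, "~/Desktop/gas2.csv")
gas2 <- read.csv("~/Desktop/gas2.csv")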
full <- read.csv("~/Desktop/last.csv", stringsAsFactors = TRUE)
full$comp <- as.factor(full$comp)
# Full model with every candidate variable, including brand.
ols1 <- lm(price ~ t30_prop + popdens + dalone_pro + med_income + med_rent +
             brand + NEAR_DIST + comp, data = full)
# Same model without brand.
ols2 <- lm(price ~ t30_prop + popdens + dalone_pro + med_income + med_rent +
             NEAR_DIST + comp, data = full)
# Aggregate brand to the seven major chains plus "Other".
major <- c("76", "ARCO", "Chevron", "Mobil", "Quik Stop", "Shell", "Valero")
full$brand <- as.factor(full$brand)
levels(full$brand) <- c(levels(full$brand), "Other")
full$brand[!(full$brand %in% major)] <- "Other"
full$brand <- droplevels(full$brand)
# Additive model restricted to the significant terms.
ols3 <- lm(price ~ brand + med_income + NEAR_DIST, data = full)
summary(ols3)
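# Standardized residuals of the additive model. They must be computed before
# plotting. Observation 42 is dropped from the plots (the Results exclude an
# extremely low outlier).
ols.stdres <- rstandard(ols3)
ols.stdres1 <- ols.stdres[-42]
qqplot(x=rnorm(length(ols.stdres1)),y=ols.stdres1, xlab="Generated Normal Values", ylab="Std Residuals", main="QQ Plot")
abline(b=1, a=0)
hist(ols.stdres1, breaks=40, main="Histogram of Std Residuals", xlab="StdResid")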
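# Multiplicative model: main effects plus all interaction terms.
ols4 <- lm(price ~ brand * med_income * NEAR_DIST, data = full)
summary(ols4)
# Standardized residuals are computed before plotting; observation 42 is
# dropped from the plots, as for the additive model.
ols.stdres2 <- rstandard(ols4)
ols.stdres3 <- ols.stdres2[-42]
qqplot(x=rnorm(length(ols.stdres3)),y=ols.stdres3, xlab="Generated Normal Values", ylab="Std Residuals", main="QQ Plot")
abline(b=1, a=0)
hist(ols.stdres3, breaks=40, main="Histogram of Std Residuals", xlab="StdResid")
# Attach both sets of standardized residuals to the station table and export it.
# The table was read into `full`; no object named `last` is defined in this script.
full$multiplic <- ols.stdres2
full$addit <- ols.stdres
write.csv(full, "~/Desktop/last2.csv")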