HW1 - Predicting Airfares On New Routes - JJ - (2024 Spring)
1) An MS Word file that contains your answers to the questions. You can use screenshots to
report the required results if it is convenient for you.
2) An R script that contains your code.
Note: Please do NOT submit any other file format. Failure to submit in the correct file
format will result in lost homework points!
The file CSV_Airfares.csv contains real data that were collected for the third quarter of a year.
They consist of the following predictors and responses (i.e., the target variable):
Note that some cities are served by more than one airport, and in those cases the airports are
distinguished by their 3-letter code.
For this homework, the categorical variables Vacation, SW, Slot, and Gate have been
transformed into the following dummy variables:
- VACATION_YES: =1 if Vacation is YES; and =0 otherwise;
- VACATION_NO: =1 if Vacation is NO; and =0 otherwise;
- SW_YES: =1 if SW is YES; and =0 otherwise;
- SW_NO: =1 if SW is NO; and =0 otherwise;
- SLOT_FREE: =1 if Slot is FREE; and =0 otherwise;
- SLOT_CTRL: =1 if Slot is CONTROLLED; and =0 otherwise;
- GATE_FREE: =1 if Gate is FREE; and =0 otherwise;
- GATE_CONS: =1 if Gate is CONSTRAINED; and =0 otherwise.
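As an illustration only, the dummy coding defined above could be reproduced in R roughly as follows. The toy data frame, its values, and the raw level labels ("YES", "NO", "FREE", "CONTROLLED") are assumptions made for this sketch. The dummy columns have already been created for this homework, so this step should not be needed.

```r
# Hypothetical toy data; the values and level labels are made up for illustration.
routes <- data.frame(
  VACATION = c("YES", "NO", "NO", "YES"),
  SLOT     = c("FREE", "CONTROLLED", "FREE", "FREE")
)

# One 0/1 indicator per level, following the definitions in the list above.
routes$VACATION_YES <- as.integer(routes$VACATION == "YES")
routes$VACATION_NO  <- as.integer(routes$VACATION == "NO")
routes$SLOT_FREE    <- as.integer(routes$SLOT == "FREE")
routes$SLOT_CTRL    <- as.integer(routes$SLOT == "CONTROLLED")

print(routes)
```

Within each pair, the two dummies always sum to 1. A regression that includes an intercept can therefore use only one dummy from each pair.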
Complete the following tasks (write necessary R code):
a. Partition the original dataset into training (60%) and validation (40%) sets. The model
will be fit to the training data and evaluated on the validation set. (1 point)
b. Build a multiple linear regression model for predicting the average fare on a new route.
Include all numerical predictors in the regression. For the four categorical variables (i.e.,
Vacation, SW, Slot, Gate), do NOT use the original variables. Instead, use the four
dummy variables: VACATION_YES, SW_YES, SLOT_CTRL, and GATE_CONS.
Finally, do not use S_CODE, S_CITY, E_CODE, and E_CITY in the regression, because
they are not numeric. (1 point)
c. Report the model estimation results. Based on the estimated values in the results, write
out what the linear regression model looks like. Note: you need to put the estimated
coefficient values in the square brackets.
e.g., FARE = [Intercept value] + [coefficient 1]*SW_YES + [coefficient 2]*DISTANCE+…
Also, interpret the meanings of the two model coefficients for SW_YES and
DISTANCE.
Provide the corresponding estimation results to support your answer.
(Model estimation results: 1 point)
(Coefficients interpretation: 1 point)
d. Use “Backward” variable selection to reduce the number of predictors. How many
variables are selected? Report all the variables selected. Provide the estimation
results. (Estimation results: 1 point, variables selected (or removed): 1 point)
e. Compare the predictive accuracy of the full model in (c) and the “Backward” model in
(d). Focus on measures such as RMSE and adjusted R². Which model performs better,
and why? (Prediction accuracy for two models: 1 point, performance comparison: 1
point)
f. What suggestions/insights can you provide to the airline company? For example, how
should they price new routes? (2 points)
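The overall workflow in tasks (a) through (e) can be sketched in R as shown below. This is a minimal illustration, not a reference solution. It uses a small synthetic data frame (`toy`, with a made-up `NOISE` predictor and made-up coefficients) instead of CSV_Airfares.csv, and the seed is arbitrary. It also assumes that base R's `step()`, which drops terms by AIC, is an acceptable form of backward selection. The course may intend a different tool or criterion.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for the airfare data; all values are made up.
n <- 200
toy <- data.frame(
  DISTANCE = runif(n, 100, 2500),
  SW_YES   = rbinom(n, 1, 0.3),
  NOISE    = rnorm(n)  # hypothetical predictor unrelated to FARE
)
toy$FARE <- 50 + 0.08 * toy$DISTANCE - 40 * toy$SW_YES + rnorm(n, sd = 20)

# (a) 60% training / 40% validation split
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# (b)-(c) full multiple linear regression, fit on the training set only
full_fit <- lm(FARE ~ ., data = train)
print(summary(full_fit)$coefficients)

# (d) backward elimination; step() removes terms while AIC improves
back_fit <- step(full_fit, direction = "backward", trace = 0)
print(names(coef(back_fit)))

# (e) validation RMSE and training adjusted R-squared for each model
rmse <- function(fit, data) {
  sqrt(mean((data$FARE - predict(fit, newdata = data))^2))
}
comparison <- data.frame(
  model      = c("full", "backward"),
  valid_rmse = c(rmse(full_fit, valid), rmse(back_fit, valid)),
  adj_r2     = c(summary(full_fit)$adj.r.squared,
                 summary(back_fit)$adj.r.squared)
)
print(comparison)
```

With the real data, `toy` would be replaced by the data frame read from CSV_Airfares.csv (for example with `read.csv()`), and the model formula would list only the predictors named in task (b).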
Submission checklist
[ ] Answers to questions c, d, e, f
[ ] R script that includes your code