Data Analysis Methods
Mid Term
Abstract
Onkar Deshmukh
M06156153
deshmuop@mail.uc.edu
Table of Contents
1. Purpose
3. Data Cleansing
8. Summary
1. Purpose:
The purpose of this project is to analyze the given landing.csv data and study which factors affect the
landing distance and how they affect it. A further goal is to come up with a best-fit model for the
factors that affect landing distance.
R Code
> summary(FlightData)
> attach(FlightData)
Output:
> par(mfrow = c(2, 4))
> hist(duration)
> hist(no_pasg)
> hist(speed_ground)
> hist(speed_air)
> hist(height)
> hist(pitch)
> hist(distance)
Observations:
Decision:
We need to cleanse the data: keep the observations that abide by the rules given in the document and
filter out the observations that contain a value beyond the threshold for any variable.
3. Data Cleansing
Objective: As noted in the previous section, the data given in the file has a few observations that are
not in line with the recommended threshold values. For our analysis we need to clean up this data. The
data cleansing rules specified in the requirement document are:
R Code:
> rule1 <- which(duration < 40)
> rule2 <- which(speed_ground < 30)
> rule4 <- which(speed_air < 30)
> rule5 <- which(speed_air > 140)
> rule6 <- which(speed_ground > 140)
> rule7 <- which(height < 6)
> rule8 <- which(distance > 6000)
> rulei1 <- union(rule1, rule2)
> rulei2 <- union(rulei1, rule4)
> rulei3 <- union(rulei2, rule5)
> rulei4 <- union(rulei3, rule6)
> rulei5 <- union(rulei4, rule7)
> rulei6 <- union(rulei5, rule8)
> FlightDataClean <- FlightData[-rulei6, ]
> dim(FlightDataClean)
> detach(FlightData)
> attach(FlightDataClean)
R Output:
> dim(FlightDataClean)
[1] 781   8
Observations:
19 observations have been filtered out. 5 observations have a duration of less than 40 minutes. 2
observations have speed_ground less than 30 MPH and 1 observation has speed_ground more
than 140 MPH. 10 observations have height less than 6 meters, matching the height < 6 rule in the
code. 1 observation has a landing distance greater than 6000 meters.
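The filtering step above can also be sketched with a logical mask instead of index unions. This is only an illustration under stated assumptions: the data frame toy and its values are invented, and the helper violates() is a hypothetical name, not part of the original analysis. The thresholds are the ones used in the R code above.

```r
# Hypothetical toy data; every value here is invented for illustration.
toy <- data.frame(
  duration     = c(120, 35, 150, 90, 200),
  speed_ground = c(80, 95, 145, 70, 25),
  speed_air    = c(NA, 101, NA, NA, NA),
  height       = c(30, 25, 40, 4, 20),
  distance     = c(1500, 2500, 6500, 1200, 900)
)

# TRUE only when a rule is actually violated. A missing value (NA) counts as
# "not violated", which mirrors how which() skips NA in the original code.
violates <- function(cond) !is.na(cond) & cond

abnormal <- with(toy,
  violates(duration < 40) |
  violates(speed_ground < 30) | violates(speed_ground > 140) |
  violates(speed_air < 30)    | violates(speed_air > 140) |
  violates(height < 6) |
  violates(distance > 6000)
)

toy_clean <- toy[!abnormal, ]   # keep only rows that pass every rule
n_removed <- sum(abnormal)      # number of rows filtered out
```

One reason to prefer the mask: if no row violated any rule, negative indexing with an empty index vector would return zero rows, whereas toy[!abnormal, ] would keep all of them.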
Conclusion:
Output:
Conclusion:
Preliminary graphical analysis suggests that speed_air and speed_ground both have
an effect on landing distance.
cor(FlightDataClean[,2:8])
cor(FlightDataClean[,2:8],use = "pairwise.complete.obs")
plot(speed_ground,speed_air)
speed_diff = speed_air - speed_ground
summary(speed_diff)
hist(speed_diff)
R Output:
Conclusion:
The missing values in speed_air are definitely an issue for the data analysis, and we cannot simply
drop this variable. However, speed_ground and speed_air have a correlation coefficient of .989, which
means that we can use speed_ground as a substitute for speed_air during our analysis. This removes the
missing-value issue without losing significant information.
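The effect of missing values on cor() can be sketched on synthetic data. Everything below is an assumption made for illustration: the vectors ground and air are simulated, and the number of blanked-out values is arbitrary. It only shows why the second cor() call above needs use = "pairwise.complete.obs".

```r
set.seed(1)
n <- 200

# Simulated stand-ins for speed_ground and speed_air (not the real data).
ground <- rnorm(n, mean = 80, sd = 15)
air    <- ground + rnorm(n, mean = 0, sd = 2)

# Blank out an arbitrary 150 values to mimic a column with many NAs.
air[sample(n, 150)] <- NA

# With the default use = "everything", any NA makes the result NA.
r_default <- cor(ground, air)

# Using only the rows where both values are present gives a usable estimate.
r_pairwise <- cor(ground, air, use = "pairwise.complete.obs")
n_complete <- sum(complete.cases(ground, air))
```

The pairwise estimate rests only on the n_complete rows that have both values, so it is worth reporting that count next to the coefficient.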
Code:
aircraftmodel <- rep(0,length(aircraft))
aircraftmodel[which(aircraft == "boeing")] <- 1
plot(distance~aircraftmodel)
Output:
Observation:
From the graphs and the correlation, it can be concluded that speed_ground has an effect on landing
distance. In the speed_ground range of 80-120, distance seems to increase linearly with speed_ground.
In the range 40-80, distance seems to have a nonlinear relationship with speed_ground.
Conclusion: To explain the nonlinear component in the graph, a quadratic term needs to be included
when we fit a model for the relationship between speed_ground and distance.
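A minimal sketch of adding a quadratic term, on simulated data rather than the landing data. The variables x and y and the coefficients used to generate them are assumptions chosen only so that the curve is flat at low x and rises steeply at high x.

```r
set.seed(42)

# Simulated predictor and response (hypothetical, not the landing data).
x <- runif(300, min = 40, max = 120)
y <- 2000 - 60 * x + 0.5 * x^2 + rnorm(300, sd = 100)

# Straight-line model versus a model with an added squared term.
# I() is needed so that ^ is treated as arithmetic inside the formula.
fit_linear <- lm(y ~ x)
fit_quad   <- lm(y ~ x + I(x^2))

r2_linear <- summary(fit_linear)$r.squared
r2_quad   <- summary(fit_quad)$r.squared
quad_coef <- coef(fit_quad)[["I(x^2)"]]
```

Here r2_quad exceeds r2_linear because the simulated relationship really is curved. On real data the gain should be checked with adjusted R-squared and residual plots as well.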
Observation:
Null hypothesis: Variables (regressors) have no impact on the response (landing distance).
The p-values for aircraftmodel, speed_ground and height are less than 0.05. This means that, at
the 5% significance level, we can reject the null hypothesis for these variables. In other words,
aircraftmodel, speed_ground and height appear to have an impact on landing distance.
In the ideal scenario, an R-squared value of 1 would mean that the model fits the given data
perfectly. The R-squared value given here is .856, which is fairly close to 1.
R-squared value of .856 indicates that 85.6% of the variability in landing distance is explained by
the variables and model that we have come up with.
At the 5% significance level we have enough evidence to believe that aircraftmodel, speed_ground
and height do have an impact on landing distance. So we need to consider the effect of each of
these variables separately. We also need to monitor adjusted R-squared to decide whether a model
improves on explaining the variability.
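The quantities discussed above can be recomputed by hand on a small simulated example. The data and the names x1, x2 and y are assumptions for illustration, not the landing data. The sketch shows how R-squared and adjusted R-squared follow from the residual and total sums of squares.

```r
set.seed(7)
n <- 100

# Simulated regressors and response (hypothetical).
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# R-squared: share of the total variability explained by the model.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalises the number of regressors p.
p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# p-values of the t-tests for each coefficient.
p_values <- coef(summary(fit))[, "Pr(>|t|)"]
```

The hand-computed r2 and adj_r2 agree with summary(fit)$r.squared and summary(fit)$adj.r.squared.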
Output:
Observations:
The residual plot for speed_ground shows a nonlinear, roughly quadratic pattern: it is U-shaped.
Because aircraftmodel takes only the two discrete values 0 and 1, its residuals fall into two
separate bands. No meaningful conclusion can be drawn from this plot at this point.
The residual plot for height has a random, non-symmetric pattern, so it is difficult to draw a
meaningful conclusion from it.
Conclusion:
We need to improve our model by addressing the nonlinear residual plot for speed_ground. To
capture the nonlinearity shown in the curved graph, we need to include a nonlinear component
in our model so that it can better explain the variability.
We can keep the height and aircraftmodel variables in the model as they are.
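The U-shaped residual pattern and its disappearance after adding a squared term can be sketched on simulated data. The variables x and y and their generating coefficients are assumptions for illustration. The curvature measure used here, the correlation between the residuals and the squared centered predictor, is an ad hoc check and not a step from the original analysis.

```r
set.seed(3)

# Simulated curved relationship (hypothetical, not the landing data).
x <- runif(200, min = 40, max = 120)
y <- 2000 - 60 * x + 0.5 * x^2 + rnorm(200, sd = 100)

# Residuals of a straight-line fit keep the curvature the model missed.
res_linear <- residuals(lm(y ~ x))

# Residuals after adding the squared term.
res_quad <- residuals(lm(y ~ x + I(x^2)))

# Ad hoc numeric check: a U shape shows up as a positive correlation
# between the residuals and the squared centered predictor.
x_sq_centered   <- (x - mean(x))^2
curvature_before <- cor(res_linear, x_sq_centered)
curvature_after  <- cor(res_quad, x_sq_centered)

# Draw the two residual plots only in an interactive session.
if (interactive()) {
  par(mfrow = c(1, 2))
  plot(x, res_linear, main = "Linear fit", ylab = "Residual")
  abline(h = 0, lty = 2)
  plot(x, res_quad, main = "With squared term", ylab = "Residual")
  abline(h = 0, lty = 2)
}
```

After the squared term is added, curvature_after is zero up to rounding, because the squared centered predictor lies in the space spanned by the model terms.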
Observations:
The R-squared and adjusted R-squared values have gone up. Both values are now 0.9776.
The p-values for all the variables in the model are less than 0.05.
The residual plots for speed_ground and speed_ground_sqr are randomly distributed.
Conclusion:
The R-squared value of 0.9776 indicates that 97.76% of the variability in the landing distance data
is explained by the model that we have come up with.
This model is the best choice among all the models that we have discussed so far.
8. Summary
Based on the analysis, we can conclude that:
speed_ground and speed_air are highly correlated and they both seem to have an impact on
landing distance.
From the data and the regression analysis, we cannot rule out the possibility that height has an
impact on landing distance.
Referring to the plots, we can conclude that speed_ground has a strong relationship with
landing distance. Part of the graph points to a linear relationship and part of the graph
indicates a nonlinear relationship. The nonlinear, U-shaped residual plot for
speed_ground reinforces that there is a nonlinear or quadratic relationship between
speed_ground and landing distance. Hence, we need to incorporate a nonlinear component in our
model to find the most accurately fitting model.
In the end, the model that includes a squared term of speed_ground has a very high R-squared
value (0.9776). This means that the nonlinear model that we came up with in section 7.3 is a
better fit than the other models that we discussed, and it explains most of the variability in the
landing distance.