
DATA ANALYSIS METHODS

Mid Term
Abstract

Onkar Deshmukh
M06156153
deshmuop@mail.uc.edu


Table of Contents
1. Purpose
2. Understanding the data
3. Data Cleansing
4. Plots and Data Visualization
5. Correlation and Variable Selection
6. Effect of Variables on landing distance
   6.1 Effect of Aircraft make on landing distance
   6.2 Effect of speed_ground on landing distance
7. Model Fitting and Regression Analysis
   7.1 Initial Model with all variables
   7.2 Model with only aircraftmodel, speed_ground and height
   7.3 Squaring the values for speed_ground and Improving Model
8. Summary


1. Purpose:
The purpose of this project is to analyze the given Landing.csv data, study which factors affect the
landing distance and how they affect it, and come up with a best-fit model that explains landing
distance in terms of those factors.

2. Understanding the data


Objective: Understand the data given to us. This section gives a basic overview of the data: what kind
of values we have, how many variables we have, and what their summary statistics (mean, min and max)
are. Histograms help us understand the frequency distribution of the values of each variable.
R Code:
> FlightData <- read.csv("Landing.csv")
> dim(FlightData)
Output:
[1] 800   8

R Code
> summary(FlightData)
> attach(FlightData)
Output:

> par(mfrow = c(2, 4))
> hist(duration)
> hist(no_pasg)
> hist(speed_ground)
> hist(speed_air)
> hist(height)
> hist(pitch)
> hist(distance)


Observations:

- The given dataset contains 8 variables and 800 observations.
- Two types of aircraft exist in the dataset.
- All the variables except Speed_air are completely populated. Speed_air has 600 N/A values,
  which we might need to handle separately.
- Looking at the minimum and maximum values of these variables, we see that some observations
  fall outside the recommended ranges for Duration, Speed_ground, Speed_air, Height and distance.
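
The missing-value and range checks behind these observations can be sketched as follows. This is a minimal illustration only: the data frame `toy` and every value in it are made up and are not taken from Landing.csv.

```r
# Hypothetical toy data; values are illustrative, not from Landing.csv
toy <- data.frame(
  duration     = c(120, 35, 90, 150),
  speed_ground = c(95, 28, 110, 141),
  speed_air    = c(NA, NA, 109, 142),
  height       = c(30, 5, 25, 40)
)

# Number of missing values per variable
na_counts <- colSums(is.na(toy))

# Minimum (row 1) and maximum (row 2) per variable, ignoring missing values
ranges <- sapply(toy, range, na.rm = TRUE)

print(na_counts)
print(ranges)
```

Comparing each column of `ranges` with the recommended thresholds shows which variables contain out-of-range values before any filtering is done.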

Decision:
We need to cleanse the data: keep the observations that abide by the rules given in the requirement
document and filter out those that contain a value beyond the threshold for any variable.


3. Data Cleansing
Objective: As noted in the previous section, the data given in the file has a few observations that are
not in line with the recommended threshold values. For our analysis we need to clean up this data. The
data cleansing rules specified in the requirement document are:

- Duration of a normal flight should be greater than 40 minutes
- Speed_ground and Speed_air should be between 30 MPH and 140 MPH
- Height should be at least 6 meters at the threshold of the runway
- Landing distance should be less than 6000 feet, the stated length limit of the airport runway

R Code:
> rule1 <- which(duration < 40)        # flights shorter than 40 minutes
> rule2 <- which(speed_ground < 30)    # ground speed below 30 MPH
> rule4 <- which(speed_air < 30)       # which() ignores the N/A values
> rule5 <- which(speed_air > 140)
> rule6 <- which(speed_ground > 140)   # ground speed above 140 MPH
> rule7 <- which(height < 6)           # height below 6 meters
> rule8 <- which(distance > 6000)      # landing distance above 6000 feet
> rulei1 <- union(rule1, rule2)
> rulei2 <- union(rulei1, rule4)
> rulei3 <- union(rulei2, rule5)
> rulei4 <- union(rulei3, rule6)
> rulei5 <- union(rulei4, rule7)
> rulei6 <- union(rulei5, rule8)       # rows that break at least one rule
> FlightDataClean <- FlightData[-rulei6, ]
> dim(FlightDataClean)
> detach(FlightData)
> attach(FlightDataClean)

R Output:
> dim(FlightDataClean)
[1] 781   8

Observations:

- 19 observations have been filtered out: 5 observations have a duration of less than 40 minutes,
  2 observations have speed_ground less than 30 MPH, 1 observation has speed_ground more than
  140 MPH, 10 observations have height less than 6 meters, and 1 observation has a landing
  distance greater than 6000 feet.
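
The per-rule counts above can also be obtained with logical flags rather than index vectors. The sketch below is an alternative formulation, not the code used in this report; the data frame `toy`, its values and the flag names are hypothetical.

```r
# Hypothetical toy data; values are illustrative, not from Landing.csv
toy <- data.frame(
  duration     = c(120, 35, 90, 150, 100),
  speed_ground = c(95, 60, 110, 141, 28),
  speed_air    = c(NA, NA, 109, 142, NA),
  height       = c(30, 20, 5, 40, 25),
  distance     = c(1500, 900, 2500, 6500, 400)
)

# One logical flag per rule; TRUE marks a violation.
# A missing speed_air is treated as "no violation", as which() does.
violations <- with(toy, data.frame(
  short_duration = duration < 40,
  slow_ground    = speed_ground < 30,
  fast_ground    = speed_ground > 140,
  slow_air       = !is.na(speed_air) & speed_air < 30,
  fast_air       = !is.na(speed_air) & speed_air > 140,
  low_height     = height < 6,
  long_distance  = distance > 6000
))

# Violations per rule, and rows that break at least one rule
per_rule <- colSums(violations)
bad_row  <- rowSums(violations) > 0

toy_clean <- toy[!bad_row, ]
print(per_rule)
print(nrow(toy_clean))
```

One reason to prefer logical indexing is that it stays correct when no row breaks a rule, whereas negative indexing with an empty index vector selects zero rows.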

Conclusion:

We have cleaner data for our analysis


4. Plots and Data Visualization


Objective: Graphically understand the impact of the different variables on landing distance.
R Code:
> pairs(FlightDataClean)

Output:



Observation:

- Speed_air and speed_ground each show a prominent pattern when plotted against landing distance.
- The other variables don't show a meaningful pattern.

Conclusion:

- Preliminary graphical analysis suggests that speed_air and speed_ground have an effect on
  landing distance.

5. Correlation and Variable Selection


Objective: Understand the correlation between all the variables and select the variables to use in our
linear model. If two variables are highly correlated, we can use either one of them. Also, as described
in section 2, speed_air has too many missing values, so we need to find out whether another variable
that is highly correlated with speed_air can be used in its place.
R Code:
> cor(FlightDataClean[, 2:8])
> cor(FlightDataClean[, 2:8], use = "pairwise.complete.obs")
> plot(speed_ground, speed_air)
> speed_diff <- speed_air - speed_ground
> summary(speed_diff)
> hist(speed_diff)

R Output:



Observation:

- Speed_ground and Speed_air have a very strong positive correlation.
- Because of the 600 missing values in Speed_air, its correlation with the other variables can't
  be determined directly. We need to drop the N/A values and then compute the correlation.
- The plot of Speed_air against Speed_ground is linear.
- Speed_air is N/A for values less than 90 MPH. It is populated only when it is greater than
  90 MPH.

Conclusion:
The missing values in Speed_air are definitely an issue for the analysis, and we can't simply ignore
this variable. However, Speed_ground has a correlation coefficient of 0.989 with Speed_air, which
means that we can use Speed_ground as a substitute for Speed_air in our analysis. This eliminates the
missing-value issue without losing significant information.
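
The effect of the `use = "pairwise.complete.obs"` argument can be seen on a small example. The two vectors below are hypothetical values chosen only to mimic the pattern in which speed_air is missing at lower speeds.

```r
# Hypothetical values; speed_air is missing at low speeds, as in the report
speed_ground <- c(60, 75, 88, 95, 102, 110, 120)
speed_air    <- c(NA, NA, NA, 96, 101, 111, 119)

# Default behaviour: any N/A in the inputs makes the result N/A
r_default <- cor(speed_ground, speed_air)

# Use only the rows where both values are present
r_complete <- cor(speed_ground, speed_air, use = "pairwise.complete.obs")

print(r_default)
print(r_complete)
```

With the default setting the coefficient comes back as N/A, which is why the second `cor()` call in this section is the one that yields usable correlations for speed_air.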

6. Effect of Variables on landing distance


6.1 Effect of Aircraft make on landing distance:
Objective: Understand the effect of aircraft make on landing distance.
R Code:
> aircraftmodel <- rep(0, length(aircraft))
> aircraftmodel[which(aircraft == "boeing")] <- 1
> plot(distance ~ aircraftmodel)

Output:



Observation: Based on the given data, the landing distances for boeing (aircraftmodel = 1) are shifted
upward compared to those for airbus (aircraftmodel = 0).
Conclusion: It seems that the range of landing distances for boeing aircraft is greater than the range
of landing distances for airbus aircraft.
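
The upward shift can also be put into numbers by comparing group means. The following is a minimal sketch on made-up values, not on Landing.csv; it treats the make as a factor instead of the hand-built 0/1 vector used above.

```r
# Hypothetical toy data; values are illustrative, not from Landing.csv
toy <- data.frame(
  aircraft = c("airbus", "boeing", "airbus", "boeing", "airbus", "boeing"),
  distance = c(1200, 1800, 1500, 2100, 900, 2400)
)

# Treat the make as a categorical variable
toy$aircraft <- factor(toy$aircraft, levels = c("airbus", "boeing"))

# Mean landing distance per make
group_means <- tapply(toy$distance, toy$aircraft, mean)
print(group_means)
```

With the make stored as a factor, `boxplot(distance ~ aircraft, data = toy)` draws one box per make, which shows the shift and the spread of each group more clearly than a scatter plot against 0 and 1.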

6.2 Effect of speed_ground on landing distance:


Objective: To understand effect of speed_ground on landing distance.
Rcode:
plot(distance~speed_ground)

Output:

Observation:
From the graphs and the correlation, it can be concluded that speed_ground has an effect on landing
distance. In the speed_ground range of 80-120, distance seems to increase linearly with speed_ground.
In the range 40-80, distance seems to have a nonlinear relationship with speed_ground.
Conclusion: To explain the nonlinear part of the graph, a quadratic component needs to be included
when we fit a model for the relationship between speed_ground and distance.


7. Model Fitting and Regression Analysis


7.1 Initial Model with all variables
Objective: The goal of this section is to fit a model that uses all the variables present in the
cleaned-up dataset.
R Code:
> Model1 <- lm(distance ~ aircraftmodel + duration + no_pasg + speed_ground + height + pitch)
> summary(Model1)

Observation:

- Null hypothesis: the variables (regressors) have no impact on the response (landing distance).
- The p-values for aircraftmodel, speed_ground and height are less than 0.05, so we can reject
  the null hypothesis for these variables at the 5% significance level. That is, aircraftmodel,
  speed_ground and height appear to have an impact on landing distance.
- The R-squared value is 0.856. In the ideal scenario this value is 1, meaning the model fits the
  given data perfectly, so 0.856 is reasonably close to a perfect fit.



Conclusion: Based on the above observations we can conclude that:

- The R-squared value of 0.856 indicates that 85.6% of the variability in landing distance is
  explained by the variables and model that we have come up with.
- At the 5% significance level we have enough evidence to believe that aircraftmodel,
  speed_ground and height do have an impact on landing distance, so we need to consider the
  effect of these variables separately. We also need to monitor adjusted R-squared to decide
  whether a model improves in explaining variability.
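
Reading the p-values and R-squared values off a fitted model can be automated. The sketch below uses simulated data with assumed coefficients (the numbers 40, 14 and the noise level are arbitrary choices, not estimates from Landing.csv), and `pitch` is deliberately given no effect.

```r
# Simulated data under assumed coefficients; not the Landing.csv data
set.seed(1)
n <- 200
speed_ground <- runif(n, 40, 130)
height       <- runif(n, 10, 50)
pitch        <- rnorm(n, 4, 0.5)   # has no effect in this simulation
distance     <- 40 * speed_ground + 14 * height + rnorm(n, sd = 100)

fit <- lm(distance ~ speed_ground + height + pitch)
s   <- summary(fit)

# Coefficient table columns: estimate, standard error, t value, p-value
p_values    <- s$coefficients[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]

print(round(p_values, 4))
print(significant)
print(c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared))
```

Adjusted R-squared penalizes extra regressors, so it is the value to compare when variables are added to or removed from a model.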

7.2 Model with only aircraftmodel, speed_ground and height


Objective: Reduce the number of variables from the previously built model and analyze the impact on
the goodness of fit of the new model. For that we consider only 3 variables: aircraftmodel,
speed_ground and height. We also plot the residuals against each of these 3 variables.
R Code:
> Model2 <- lm(distance ~ aircraftmodel + speed_ground + height)
> summary(Model2)
> Residuals1 <- residuals(Model2)
> par(mfrow = c(1, 3))
> plot(Residuals1 ~ aircraftmodel)
> plot(Residuals1 ~ speed_ground)
> plot(Residuals1 ~ height)

Output:


Observations:

- The residual plot for speed_ground has a nonlinear, U-shaped pattern, which suggests a
  quadratic relationship.
- Because aircraftmodel takes only the two discrete values 0 and 1, its residuals fall into two
  vertical bands. No meaningful conclusion can be drawn from this plot at this point.
- The residual plot for height shows a random, non-symmetric scatter, so it is difficult to draw
  a meaningful conclusion from it.

Conclusion:

- We need to improve our model by addressing the nonlinear residual plot for speed_ground. To
  capture the curvature shown in that plot, we need to include a nonlinear component in the
  model so that it explains more of the variability.
- We can keep the height and aircraftmodel variables in the model as they are.
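
Why a U-shaped residual plot points to a missing squared term can be illustrated with simulated data. In the sketch below the curved relationship and all its constants are assumptions made for illustration; nothing is taken from Landing.csv.

```r
# Simulated data with an assumed curved relationship; not the Landing.csv data
set.seed(2)
x <- runif(300, 40, 130)
y <- 0.7 * (x - 60)^2 + rnorm(300, sd = 100)

# Fit a straight line although the true relationship is curved
linear_fit <- lm(y ~ x)
res <- residuals(linear_fit)

# Least-squares residuals are uncorrelated with the regressor by construction
cor_linear <- cor(res, x)

# A U-shape shows up as a strong correlation with the squared, centered regressor
cor_curved <- cor(res, (x - mean(x))^2)

print(c(linear = cor_linear, curved = cor_curved))
```

Calling `plot(res ~ x)` on these residuals gives the same kind of U-shaped picture as described for speed_ground above: positive residuals at both ends of the range and negative ones in the middle.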


7.3 Squaring the values for speed_ground and Improving Model


Objective: As concluded in the previous section, we add the square of speed_ground to the model to
capture the nonlinear nature of the curve. We also need to monitor the R-squared and adjusted
R-squared values for this new model.
R Code:
> speed_ground_sqr <- speed_ground^2
> model3 <- lm(distance ~ aircraftmodel + speed_ground + height + speed_ground_sqr)
> summary(model3)
> residuals2 <- residuals(model3)
> par(mfrow = c(1, 4))
> plot(residuals2 ~ aircraftmodel)
> plot(residuals2 ~ height)
> plot(residuals2 ~ speed_ground)
> plot(residuals2 ~ speed_ground_sqr)


Observations:

- The R-squared and adjusted R-squared values have gone up. Both are now 0.9776.
- The p-values for all the variables in the model are less than 0.05.
- The residual plots for speed_ground and speed_ground_sqr are randomly distributed.

Conclusion:

- The R-squared value of 0.9776 indicates that 97.76% of the variability in the landing distance
  data is explained by the model that we have come up with.
- This model is the best choice among all the models that we have discussed so far.
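
Besides comparing adjusted R-squared, the model with and without the squared term can be compared with an F-test on the nested models. This is a sketch of that check on simulated data with assumed coefficients, not a result from Landing.csv; `I(speed_ground^2)` is used here in place of a separate `speed_ground_sqr` variable.

```r
# Simulated data under assumed coefficients; not the Landing.csv data
set.seed(3)
n <- 300
speed_ground <- runif(n, 40, 130)
height       <- runif(n, 10, 50)
distance     <- 0.7 * (speed_ground - 60)^2 + 14 * height + rnorm(n, sd = 100)

fit_linear <- lm(distance ~ speed_ground + height)
fit_quad   <- lm(distance ~ speed_ground + I(speed_ground^2) + height)

# Adjusted R-squared of both models
adj_r2 <- c(linear    = summary(fit_linear)$adj.r.squared,
            quadratic = summary(fit_quad)$adj.r.squared)

# F-test of the nested models: does the squared term improve the fit?
cmp    <- anova(fit_linear, fit_quad)
p_quad <- cmp[["Pr(>F)"]][2]

print(adj_r2)
print(p_quad)
```

A small p-value in the second row of the `anova()` table supports keeping the squared term, which complements the comparison of R-squared values made above.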


8. Summary
Based on the analysis, we can conclude that:

- speed_ground and speed_air are highly correlated, and they both seem to have an impact on
  landing distance.
- From the data and regression analysis, we can't rule out the possibility that height has an
  impact on landing distance.
- Referring to the plots, speed_ground has a strong relationship with landing distance. Part of
  the graph points to a linear relationship and part of it to a nonlinear one. The U-shaped
  residual plot for speed_ground reinforces that there is a nonlinear, quadratic relationship
  between speed_ground and landing distance. Hence, we need to incorporate a nonlinear component
  in our model to find the most accurately fitting model.
- In the end, the model that includes a squared term of speed_ground has a very high R-squared
  value (0.9776), which means that the nonlinear model from section 7.3 is a better fit than the
  other models that we discussed and explains most of the variability in landing distance.
