Prediction of Flight Delay Analysis

A PROJECT REPORT
ON
PREDICTION OF FLIGHT DELAY ANALYSIS
Submitted in partial fulfilment of the requirements for the award of the degree of
MASTER OF COMPUTER APPLICATIONS
SUBMITTED BY:
K.TRIVIKRAM ( 18MCA043L)
UNDER THE GUIDANCE OF
Ms.R. JAYAMMA, MCA, M.TECH
Assistant Professor, Dept. of M.Sc(CS)

DEPARTMENT OF COMPUTER SCIENCE


Re-accredited at ‘A’ by NAAC

KAKARAPARTI BHAVANARAYANA COLLEGE


(Approved by AICTE, AFFILIATED TO KRISHNA UNIVERSITY, MACHILIPATNAM)
Kothapet, Vijayawada, Krishna(DT), Pin-520001
2019-2021

DEPARTMENT OF MCA K. B. N PG COLLEGE


KAKARAPARTI BHAVANARAYANA COLLEGE


(Approved by AICTE, AFFILIATED TO KRISHNA UNIVERSITY, MACHILIPATNAM)
Kothapet, Vijayawada, Krishna (DT), Pin-520001

DEPARTMENT OF COMPUTER SCIENCE

CERTIFICATE
This is to certify that the project work entitled “PREDICTION OF FLIGHT DELAY ANALYSIS”
is a bonafide work carried out by K.TRIVIKRAM(18MCA043) in partial fulfilment for the award of
the degree in MASTER OF COMPUTER APPLICATIONS of KRISHNA UNIVERSITY,
MACHILIPATNAM during the academic year 2019-2021. All corrections / suggestions indicated for
internal assessment have been incorporated in the report. The project work has been approved as it
satisfies the academic requirements in respect of project work prescribed for the above degree.

Project Guide Head of the Department

External Examiner

ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task would be incomplete without
mentioning the people who made it possible and whose constant guidance and encouragement crown
all the efforts with success. This acknowledgement goes beyond formality: I would
like to express my deep gratitude and respect to all those behind the scenes who guided, inspired
and helped me to complete this work. I wish to place on record my deep sense of gratitude to
my project guide, Ms. R. JAYAMMA, Assistant Professor, Department of M.Sc(CS), for her
constant motivation and valuable help throughout the project work.

My sincere thanks to Mrs. SHAMIM, Head of the Department of M.Sc(CS), for her guidance
regarding the project. I also extend my thanks to Dr. P. BHARATHI DEVI, Head of the
Department of MCA, for her valuable help throughout the project. I also extend my thanks to
Dr. MAZHARUNNISA BEGUM, DIRECTOR, P.G. CENTRE, and my gratitude to SRI
S. VENKATESH, DIRECTOR, P.G. COURSES, for his valuable suggestions.

K.TRIVIKRAM

(Regd.NO:18MCA043)

DECLARATION

I hereby declare that the project work entitled “PREDICTION OF FLIGHT DELAY ANALYSIS”,
submitted to K.B.N P.G COLLEGE affiliated to KRISHNA UNIVERSITY, has been done under the
guidance of Ms. R. JAYAMMA, Assistant Professor, Department of M.Sc(CS), during the period of
study, and that it has not formed the basis for the award of any degree/diploma or other similar title to
any candidate of any University.

Signature of Student
Name: K.Trivikram
Regd.No:18MCA043
College name: KBN PG COLLEGE

DATE:
PLACE: VIJAYAWADA

ABSTRACT
Flight delay prediction has been heavily investigated over the last few decades. Flight delays hurt
airlines, airports, and passengers. Developing accurate prediction models for flight delays
is difficult because of the complexity of the air transportation system, the number of available
prediction methods, and the deluge of flight data. The flight delay analysis here is based on the
scheduled arrival and departure times and the actual times. In this context, this paper presents a
thorough literature review of approaches used to build flight delay prediction models. We propose a
taxonomy and summarize the initiatives used to address the flight delay prediction problem according
to scope, data, and computational methods, giving particular attention to the increased usage of
machine learning methods. Finally, we check the accuracy metrics of the flight delay predictions.

INDEX

S.NO   CONTENTS                            PAGE NO

1.     INTRODUCTION                        1 – 3
       1.1 PROBLEM STATEMENT
       1.2 EXISTING SYSTEM
       1.3 PROPOSED SYSTEM
2.     SYSTEM REQUIREMENTS                 4 – 19
       2.1 HARDWARE REQUIREMENTS
       2.2 SOFTWARE REQUIREMENTS
       2.3 SYSTEM ENVIRONMENT
       2.4 FEASIBILITY STUDY
3.     REVIEW OF LITERATURE                20 – 21
4.     DESIGN AND IMPLEMENTATION           22 – 39
       4.1 DESIGN
       4.2 UML DIAGRAM
       4.3 IMPLEMENTATION
5.     SAMPLE CODE                         40 – 43
6.     SCREENSHOTS                         44 – 55
7.     SYSTEM TESTING                      56 – 63
       7.1 TYPES OF TESTS
       7.2 TESTING METHODOLOGIES
8.     RESULT ANALYSIS                     64 – 66
9.     CONCLUSION AND FUTURE SCOPE         67 – 68
10.    REFERENCES                          69 – 70

1. INTRODUCTION

Flight delay prediction has been heavily investigated over the last few decades. Flight delays hurt
airlines, airports, and passengers. Developing accurate prediction models for flight delays
is difficult because of the complexity of the air transportation system, the number of available
prediction methods, and the deluge of flight data. The flight delay analysis here is based on the
scheduled arrival and departure times and the actual times. In this context, this paper presents a
thorough literature review of approaches used to build flight delay prediction models. We propose a
taxonomy and summarize the initiatives used to address the flight delay prediction problem according
to scope, data, and computational methods, giving particular attention to the increased usage of
machine learning methods. Finally, we check the accuracy metrics of the flight delay predictions.

1.1 PROBLEM STATEMENT


Air transportation plays a vital role in the transportation infrastructure and contributes
significantly to the economy. Airports tend to increase business activity in their vicinity
and hence drive economic development. The aviation industry also provides a huge number
of jobs. A record 3.7 billion passengers used air transport in 2016, and this number is
expected to keep increasing every year. The worldwide air traffic report [6] released by the
International Air Transport Association showed that the demand for air travel increased by 6.3 percent
in 2016 compared with 2015. This volume of air traffic needs to be constantly
monitored and checked to prevent problems from occurring.

An aircraft is said to be delayed when it departs and/or arrives later than its scheduled time. There
are several causes of delay, such as weather changes, maintenance problems,
previous delays propagating down the line, traffic congestion and many more. These delays are a
huge challenge for the aviation industry as well as for its customers. In the USA alone,
these delays result in a loss of about 22 billion US dollars every year. This is because aviation companies
are forced to pay the government authorities when they keep aircraft on hold for more than a
stipulated time. Delays also cause many problems for travelling passengers: they prevent them
from fulfilling their commitments and attending preplanned events, which can cost passengers
a lot of money and leave them frustrated and angry.

Several models have already been proposed to forecast flight delays. We use a
machine learning technique called logistic regression to predict aircraft delays. This technique
takes various independent parameters and trains a model to classify whether an aircraft is going to be
delayed or not. We implemented the algorithm on the Microsoft Azure Machine Learning Studio platform.
We also used a weather dataset and joined it with the airport dataset at the respective locations to
determine the effect of weather conditions on flight delays and to make the prediction more
accurate for real-world scenarios. We trained the model on 70 percent of the dataset and tested it
on the remaining 30 percent. The model predicted the correct
outcome in more than 80 percent of the cases.
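
The 70/30 train/test procedure described above can be sketched with scikit-learn. This is only a minimal illustration: the report's model was built in Azure, the data below is synthetic, and the feature names (dep_hour, wind_speed, prev_delay) and the rule that generates the labels are assumptions made up for this example, not the project's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined flight/weather data (NOT the report's dataset).
rng = np.random.default_rng(seed=0)
n_flights = 1000
dep_hour = rng.integers(0, 24, size=n_flights)      # scheduled departure hour
wind_speed = rng.uniform(0, 60, size=n_flights)     # hypothetical weather feature (km/h)
prev_delay = rng.exponential(10, size=n_flights)    # hypothetical inbound delay (minutes)

# Hypothetical labelling rule: strong wind and a late inbound aircraft
# make a delay more likely; Gaussian noise keeps the problem non-trivial.
score = 0.05 * wind_speed + 0.08 * prev_delay - 3.0 + rng.normal(0, 1, n_flights)
delayed = (score > 0).astype(int)                   # 1 = delayed, 0 = on time

X = np.column_stack([dep_hour, wind_speed, prev_delay])

# 70 percent of the rows for training, 30 percent held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, delayed, test_size=0.30, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

The accuracy printed here depends entirely on the synthetic labelling rule and says nothing about the accuracy reported for the real data.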

1.2 Existing System:


Yufeng et al. propose a new model for calculating distributions of delay in the departure time of airplanes.
These distributions are used to determine congestion in air traffic. The paper studies the important
determinants affecting the departure time of airplanes.
Michael et al. propose a model for evaluating the characteristics of queueing networks that are not static
and have arrival times based on fixed schedules as well as continuously changing service times.
Beatty et al. propose the idea of using a Delay Multiplier to determine the effect of an initial delay
in aircraft times on an operating timetable.
Yufeng et al. also model the distribution of takeoff delays by utilising component mechanisms that were
trained using a genetic algorithm. However, this technique is resource heavy and has not been fully tested.

1.3 Proposed System:


Flight delays are calculated from the scheduled times, i.e. the scheduled arrival and departure
times of the flight, and its actual times. The difference between the actual and the scheduled time
is used as the target variable. Given a dataset for flight delay analysis, we first
preprocess it to bring it into a format suitable for machine learning.
Because the target is a time difference, flight delay analysis is a regression problem, so we use
regression-based models such as linear regression; logistic regression applies when the target is
reduced to a delayed/not-delayed label. If the data shows collinearity or interdependencies, we use
lasso or ridge regression. We then check accuracy metrics such as RMSE to validate the model.
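
The two steps described above, deriving the delay target from scheduled and actual times and comparing linear, ridge and lasso regression by RMSE, might look roughly as follows. The column names, the sample timestamps, the synthetic features and the alpha values are all assumptions chosen for illustration; they are not taken from the project's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: target variable = actual time minus scheduled time, in minutes.
# Tiny hand-made schedule with hypothetical column names.
flights = pd.DataFrame({
    "scheduled_arrival": pd.to_datetime(
        ["2021-01-01 10:00", "2021-01-01 12:30", "2021-01-01 18:15"]),
    "actual_arrival": pd.to_datetime(
        ["2021-01-01 10:20", "2021-01-01 12:25", "2021-01-01 19:00"]),
})
flights["arrival_delay_min"] = (
    (flights["actual_arrival"] - flights["scheduled_arrival"]).dt.total_seconds() / 60
)
print(flights["arrival_delay_min"].tolist())  # [20.0, -5.0, 45.0] (negative = early)

# Step 2: regression on synthetic data with one deliberately collinear feature.
rng = np.random.default_rng(seed=1)
n = 500
dep_delay = rng.exponential(15, size=n)
taxi_out = 0.5 * dep_delay + rng.normal(0, 1, size=n)   # collinear with dep_delay
distance = rng.uniform(200, 2500, size=n)
X = np.column_stack([dep_delay, taxi_out, distance])
y = 0.9 * dep_delay + 0.002 * distance + rng.normal(0, 5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

candidates = [
    ("linear", LinearRegression()),
    ("ridge", Ridge(alpha=1.0)),    # L2 penalty shrinks correlated coefficients
    ("lasso", Lasso(alpha=0.1)),    # L1 penalty can drop redundant features
]
rmse_by_model = {}
for name, model in candidates:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    rmse_by_model[name] = float(np.sqrt(mse))   # RMSE in minutes
print(rmse_by_model)
```

RMSE is reported in the same unit as the target (minutes), which makes it easy to compare against the size of a typical delay.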

2.SYSTEM REQUIREMENTS

2.1 HARDWARE REQUIREMENTS


The hardware requirements specify each interface between the software elements and the hardware elements
of the system. These hardware requirements include configuration characteristics.

• Operating system: Windows or Linux
• Processor: minimum Intel i3
• RAM: minimum 4 GB
• Hard disk: minimum 250 GB

2.2 SOFTWARE REQUIREMENTS


The software requirements specify all required software products, such as the data management
system. Each required software product is specified with its name and version number. Each interface
specifies the purpose of the interfacing software as related to this software product.

• Python IDLE, version 3.7 (or)
• Anaconda (Python 3.7) (or)
• Jupyter (or)
• Google Colab

Libraries:

• Matplotlib
• NumPy
• Pandas
• Regex
• Requests
• Scikit-learn (sklearn)
• SciPy

Language: Python
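
One way to confirm that the libraries listed above are present in the chosen environment is a small check script. This is a sketch, not part of the project code; it assumes that "Regex" refers to Python's standard-library re module and uses the usual import names (scikit-learn is imported as sklearn).

```python
import importlib


def check_libraries(module_names):
    """Return {module name: version string, or "not installed"}."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "not installed"
        else:
            # Not every module exposes __version__, so fall back to a placeholder.
            report[name] = getattr(module, "__version__", "version unknown")
    return report


# Import names for the libraries in the list above (assumed mapping).
REQUIRED = ["matplotlib", "numpy", "pandas", "re", "requests", "sklearn", "scipy"]

for name, version in check_libraries(REQUIRED).items():
    print(f"{name:12s} {version}")
```

Any entry reported as "not installed" can then be added through the environment's package manager before running the project.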

2.3 INTRODUCTION TO SYSTEM ENVIRONMENT

ANACONDA
Anaconda is a complete, open-source data science distribution with a community of over 6 million users. It
is easy to download and install, and it is supported on Linux, macOS, and Windows.
The distribution comes with more than 1,000 data packages as well as the conda package and virtual
environment manager, so it eliminates the need to learn to install each library independently.
As Anaconda’s website says, “The Python and R conda packages in the Anaconda Repository are
curated and compiled in our secure environment so you get optimized binaries that ‘just work’ on your
system”.

Fig: 2.1 Anaconda Distribution

What is Anaconda Navigator?


Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda® distribution
that allows you to launch applications and easily manage conda packages, environments, and channels
without using command-line commands. Navigator can search for packages on Anaconda Cloud or in
a local Anaconda Repository. It is available for Windows, macOS, and Linux.

Why use Navigator?


In order to run, many scientific packages depend on specific versions of other packages. Data scientists
often use multiple versions of many packages and use multiple environments to separate these
different versions.
The command-line program conda is both a package manager and an environment manager. This helps
data scientists ensure that each version of each package has all the dependencies it requires and works
correctly.
Navigator is an easy, point-and-click way to work with packages and environments without needing to
type conda commands in a terminal window. You can use it to find the packages you want, install them
in an environment, run the packages, and update them – all inside Navigator.
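
Since several environments may coexist, it can help to check from inside Python which interpreter and environment a script is actually running in. The sketch below relies on the CONDA_DEFAULT_ENV environment variable, which conda sets when an environment is activated; the function name describe_environment is made up for this example.

```python
import os
import sys


def describe_environment():
    """Summarise which Python interpreter and conda environment are active."""
    return {
        "python_version": sys.version.split()[0],
        "interpreter_prefix": sys.prefix,
        # Set by `conda activate`; absent when no conda environment is active.
        "conda_env": os.environ.get(
            "CONDA_DEFAULT_ENV", "(no conda environment active)"
        ),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```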

What applications can we access using Navigator?


The following applications are available by default in Navigator:

• Jupyter Notebook
• Spyder
• PyCharm
• VSCode
• Glueviz
• Orange 3 App
• RStudio
• Anaconda Prompt (Windows only)
• Anaconda PowerShell (Windows only)
• JupyterLab

• JupyterLab: An extensible working environment for interactive and reproducible
computing, based on the Jupyter Notebook and its architecture.

• Qt Console: A PyQt GUI that supports inline figures, proper multiline editing with
syntax highlighting, graphical calltips and more.

• Spyder: A scientific Python development environment. It is a powerful Python
IDE with advanced editing, interactive testing, debugging and introspection features.

• VS Code: A streamlined code editor with support for development operations like
debugging, task running and version control.

• Glueviz: Used for multidimensional data visualization across files. It explores
relationships within and among related datasets.

• Orange 3: A component-based data mining framework that can be used for data
visualization and data analysis. The workflows in Orange 3 are very interactive and provide
a large toolbox.

• RStudio: A set of integrated tools designed to help you be more productive with R. It
includes R essentials and notebooks.

• Jupyter Notebook: A web-based, interactive computing notebook environment.
We can edit and run human-readable documents while describing the data analysis.

The Jupyter Notebook is an open-source web application that you can use to create and share
documents that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained
by the people at Project Jupyter.
Jupyter Notebooks are a spin-off from the IPython project, which used to have an IPython
Notebook project itself. The name Jupyter comes from the core programming languages
that it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write
your programs in Python, but there are over 100 other kernels that you can also use.
The Jupyter Notebook is not included with Python, so to try it out you need to install
Jupyter.
There are many distributions of the Python language. For installing Jupyter Notebook, two of them
are relevant here: the Anaconda distribution described above and CPython, the most popular
distribution and the reference implementation of Python, available from the official Python website.

• PyCharm: One of the most popular IDEs for Python. It includes features such as
code completion and inspection, an advanced debugger, and support for web
programming and various frameworks. PyCharm is created by the Czech company JetBrains,
which focuses on integrated development environments for various languages, including
web development languages such as JavaScript and PHP. PyCharm offers its users and
developers strong support in the following areas:
   • Code completion and inspection.
   • Advanced debugging.
   • Support for web programming and frameworks such as Django and Flask.

Features of PyCharm
A developer will also find PyCharm comfortable to work with because of the features listed
below:

• Code Completion: PyCharm provides smooth code completion for both
built-in and external packages.

• SQLAlchemy as Debugger: You can set a breakpoint, pause in the debugger and
see the SQL representation of the user expression for SQL language code.

• Git Visualization in Editor: While coding, a developer often needs to check what has changed.
You can check the last commit easily in PyCharm, as blue markers in the editor
show the difference between the last commit and the current state.

• Code Coverage in Editor: You can run .py files outside the PyCharm editor as well,
with code coverage details shown elsewhere in the project tree, in the summary section,
etc.

• Package Management: All installed packages are displayed with a proper visual
representation. This includes the list of installed packages and the ability to search for and
add new packages.

• Local History: PyCharm always keeps track of changes in a way that complements
Git. Local history gives complete details of what is needed to roll back and
what is to be added.

• Refactoring: Restructuring code without changing its behaviour, for example renaming one or
more symbols or files at a time. PyCharm includes various shortcuts for a smooth refactoring process.

• WAMP Server: WAMP (Windows, Apache, MySQL, PHP) packages are bundles of independently
created programs installed on computers that use a Microsoft Windows operating system.
Apache is a web server. MySQL is an open-source database. PHP is a scripting language that can
manipulate information held in a database and generate web pages dynamically each time content is
requested by a browser. Other programs may also be included in a package, such as
phpMyAdmin, which provides a graphical user interface for the MySQL database manager, or the
alternative scripting languages Python or Perl.

LIBRARIES
Matplotlib:

• Matplotlib is a Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats and interactive environments across platforms.

• Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter
notebook, web application servers, and four graphical user interface toolkits.

Fig: 2.3 Matplotlib images

• Matplotlib tries to make easy things easy and hard things possible.
• You can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, etc., with
just a few lines of code.
• For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when
combined with IPython.
• For the power user, you have full control of line styles, font properties, axes properties, etc., via an
object-oriented interface or via a set of functions familiar to MATLAB users.
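
The two interfaces mentioned above can be compared on the same data. The sketch below draws one histogram with the pyplot interface and one with the object-oriented interface; the delay values are synthetic, the Agg backend and the temporary output directory are choices made so the script runs without a display, and the file names are made up for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: works without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=2)
delays = rng.exponential(scale=20, size=500)   # synthetic delays in minutes

out_dir = tempfile.mkdtemp()
pyplot_path = os.path.join(out_dir, "delay_hist_pyplot.png")
oo_path = os.path.join(out_dir, "delay_hist_oo.png")

# pyplot (MATLAB-like) interface: a quick histogram in a few lines.
plt.figure()
plt.hist(delays, bins=30)
plt.xlabel("Delay (minutes)")
plt.ylabel("Number of flights")
plt.title("Synthetic departure delays")
plt.savefig(pyplot_path)
plt.close()

# Object-oriented interface: explicit Figure and Axes objects give finer control.
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(delays, bins=30, color="tab:orange", edgecolor="black")
ax.set_xlabel("Delay (minutes)")
ax.set_ylabel("Number of flights")
ax.set_title("Synthetic departure delays")
fig.savefig(oo_path)
plt.close(fig)

print("Saved:", pyplot_path, oo_path)
```

The pyplot form is convenient for quick exploration in a notebook, while the object-oriented form scales better once a figure has several axes.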

NumPy:
NumPy is the fundamental package for scientific computing with Python. It contains, among other
things:
• a powerful N-dimensional array object
• sophisticated (broadcasting) functions
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data types can be defined. This allows NumPy to seamlessly
and speedily integrate with a wide variety of databases.
NumPy is licensed under the BSD license, enabling reuse with few restrictions.
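
A short sketch of the array object, broadcasting and linear algebra features listed above. The delay values are made up for illustration.

```python
import numpy as np

# Rows = flights, columns = (departure delay, arrival delay) in minutes; made-up values.
delays = np.array([[10.0, 12.0],
                   [0.0, -3.0],
                   [35.0, 40.0],
                   [5.0, 7.0]])

# Broadcasting: the (2,) vector of column means is subtracted from
# every row of the (4, 2) array without an explicit loop.
col_means = delays.mean(axis=0)
centered = delays - col_means
print(col_means.tolist())  # [12.5, 14.0]

# Linear algebra: least-squares fit of arrival_delay ~ a * departure_delay + b.
A = np.column_stack([delays[:, 0], np.ones(len(delays))])
(a, b), *_ = np.linalg.lstsq(A, delays[:, 1], rcond=None)
print(f"slope={a:.3f}, intercept={b:.3f}")
```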

Pandas:

History of development
In 2008, pandas development began at AQR Capital Management. By the end of 2009 it had been open
sourced, and it is actively supported today by a community of like-minded individuals around the world
who contribute their valuable time and energy to help make open source pandas possible.
Since 2015, pandas has been a NumFOCUS-sponsored project. This helps ensure the success of the
development of pandas as a world-class open-source project.

Timeline

2008: Development of pandas started

2009: pandas becomes open source

2012: First edition of Python for Data Analysis is published

2015: pandas becomes a NumFOCUS sponsored project

2018: First in-person core developer sprint

Library Highlights
• A fast and efficient DataFrame object for data manipulation with integrated indexing.
• Tools for reading and writing data between in-memory data structures and different formats:
CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
• Intelligent data alignment and integrated handling of missing data: gain automatic label-based
alignment in computations and easily manipulate messy data into an orderly form.
• Flexible reshaping and pivoting of data sets.
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
• Columns can be inserted and deleted from data structures for size mutability.
• Aggregating or transforming data with a powerful group-by engine allowing
split-apply-combine operations on data sets.
• High-performance merging and joining of data sets.
• Hierarchical axis indexing provides an intuitive way of working with high-dimensional data
in a lower-dimensional data structure.
• Time series functionality: date range generation and frequency conversion, moving window
statistics, date shifting and lagging. Even create domain-specific time offsets and join time
series without losing data.
• Highly optimized for performance, with critical code paths written in Cython or C.
• Python with pandas is in use in a wide variety of academic and
commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising,
Web Analytics, and more.
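
Three of the highlights above (missing-data handling, merging, and split-apply-combine) can be shown on a toy table. The column names and all values below are invented for illustration and do not come from the project's dataset.

```python
import numpy as np
import pandas as pd

flights = pd.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "UA"],
    "origin": ["JFK", "ORD", "JFK", "ATL", "ORD"],
    "dep_delay": [12.0, np.nan, 30.0, 5.0, 0.0],   # one missing value
})
airports = pd.DataFrame({
    "origin": ["JFK", "ORD", "ATL"],
    "city": ["New York", "Chicago", "Atlanta"],
})

# Missing data: fill the unknown delay with the column median (8.5 here).
flights["dep_delay"] = flights["dep_delay"].fillna(flights["dep_delay"].median())

# Merge/join: attach the city name to each flight via the shared "origin" key.
merged = flights.merge(airports, on="origin", how="left")

# Split-apply-combine: split by carrier, apply mean, combine into one Series.
mean_delay = merged.groupby("carrier")["dep_delay"].mean()
print(mean_delay.to_dict())  # {'AA': 10.25, 'DL': 17.5, 'UA': 0.0}
```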

Mission
Pandas aims to be the fundamental high-level building block for doing practical, real world data
analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible
open source data analysis / manipulation tool available in any language.

Vision

Accessible to everyone

Free for users to use and modify

Flexible

Powerful

Easy to use

Fast

Values
It is at the core of pandas to be respectful and welcoming to everybody: users, contributors and the
broader community, regardless of level of experience, gender, gender identity and expression, sexual
orientation, disability, personal appearance, body size, race, ethnicity, age, religion, or nationality.

Regex:
• A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of
characters that defines a search pattern.
• Usually such patterns are used by string-searching algorithms for "find" or "find and replace"
operations on strings, or for input validation.
• It is a technique developed in theoretical computer science and formal language theory.
• Regular expressions are used in search engines, search and replace dialogs of word processors
and text editors, in text processing utilities such as sed and AWK, and in lexical
analysis.
• Many programming languages provide regex capabilities either built-in or via libraries.
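
The "find", "find and replace" and input-validation uses listed above can be sketched with Python's built-in re module. The flight-number format assumed here (two capital letters followed by one to four digits) is a simplification chosen for the example.

```python
import re

# Assumed format: 2-letter carrier code followed by 1 to 4 digits.
FLIGHT_NUMBER = re.compile(r"\b([A-Z]{2})(\d{1,4})\b")

text = "AI202 departed late; UA1234 and DL45 were on time."

# Find: every (carrier, number) pair in the text.
matches = FLIGHT_NUMBER.findall(text)
print(matches)  # [('AI', '202'), ('UA', '1234'), ('DL', '45')]

# Find and replace: insert a hyphen between carrier code and number.
hyphenated = FLIGHT_NUMBER.sub(r"\1-\2", text)
print(hyphenated)  # AI-202 departed late; UA-1234 and DL-45 were on time.

# Input validation: an airport code must be exactly three capital letters.
def is_airport_code(value):
    return re.fullmatch(r"[A-Z]{3}", value) is not None

print(is_airport_code("JFK"), is_airport_code("jfk1"))  # True False
```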
Requests:
• Requests is a Python HTTP library, released under the Apache License 2.0.
• The goal of the project is to make HTTP requests simpler and more human-friendly.
• The version current at the time of writing is 2.22.0.
• The requests library is the de facto standard for making HTTP requests in Python.
• It abstracts the complexities of making requests behind a simple API so that you can
focus on interacting with services and consuming data in your application.
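
As a rough illustration of that API, the sketch below builds a GET request and inspects it without sending anything over the network. The endpoint URL and the query parameters are hypothetical placeholders, not a real flight-data service.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
request = requests.Request(
    method="GET",
    url="https://example.com/api/flights",
    params={"origin": "JFK", "date": "2021-01-01"},
    headers={"Accept": "application/json"},
)

# prepare() encodes the parameters into the final URL but sends nothing.
prepared = request.prepare()
print(prepared.method, prepared.url)
# GET https://example.com/api/flights?origin=JFK&date=2021-01-01

# Actually sending it would look like this (not executed here):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     data = response.json()
```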
Scikit-learn:
• Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning
library for the Python programming language.
• It features various classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy
and SciPy.
• Scikit-learn is largely written in Python, and uses NumPy extensively for high-performance linear
algebra and array operations.
• Furthermore, some core algorithms are written in Cython to improve performance.
• Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic
regression and linear support vector machines by a similar wrapper around LIBLINEAR.
• In such cases, extending these methods with Python may not be possible.
• Scikit-learn integrates well with many other Python libraries, such as Matplotlib and Plotly for
plotting, NumPy for array vectorization, pandas DataFrames, SciPy, and many more.
• Scikit-learn is one of the most popular machine learning libraries on GitHub.
SciPy:
• SciPy is a free and open-source Python library used for scientific and technical
computing.
• SciPy contains modules for optimization, linear algebra, integration, interpolation, special
functions, FFT, signal and image processing, ODE solvers and other tasks common in science
and engineering.
• SciPy builds on the NumPy array object and is part of the NumPy stack, which includes tools like
Matplotlib, pandas and SymPy, and an expanding set of scientific computing libraries.
• This NumPy stack has similar users to other applications such as MATLAB, GNU Octave, and
Scilab.
• The NumPy stack is also sometimes referred to as the SciPy stack.
• SciPy is also a family of conferences for users and developers of these tools: SciPy (in the United
States), EuroSciPy (in Europe) and SciPy.in (in India).
• Enthought originated the SciPy conference in the United States and continues to sponsor many of
the international conferences as well as host the SciPy website.
• The SciPy library is currently distributed under the BSD license, and its development is
sponsored and supported by an open community of developers.
• It is also supported by NumFOCUS, a community foundation for supporting reproducible and
accessible science.
• The basic data structure used by SciPy is a multidimensional array provided by
the NumPy module.
• NumPy provides some functions for linear algebra, Fourier transforms, and random number
generation, but not with the generality of the equivalent functions in SciPy.
• NumPy can also be used as an efficient multidimensional container of data with arbitrary
data types.
• This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
• Older versions of SciPy used Numeric as an array type, which is now deprecated in favor of the
newer NumPy array code.
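
A few of the SciPy modules named above (integration, optimization, statistics) can be exercised on NumPy arrays as follows. The functions are small textbook examples and the delay values are invented for illustration.

```python
import numpy as np
from scipy import integrate, optimize, stats

# Integration: area under exp(-x) from 0 to infinity (the exact value is 1).
area, _abs_error = integrate.quad(lambda x: np.exp(-x), 0, np.inf)

# Optimization: the minimum of (x - 3)^2 + 2 lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 2.0)

# Statistics: Pearson correlation between made-up departure and arrival delays.
dep = np.array([0.0, 5.0, 10.0, 20.0, 40.0])
arr = np.array([-2.0, 6.0, 9.0, 25.0, 38.0])
r, p_value = stats.pearsonr(dep, arr)

print(f"area={area:.6f}, argmin={result.x:.4f}, r={r:.3f}")
```

All three calls accept and return ordinary NumPy arrays or Python floats, which is what allows SciPy to sit on top of the NumPy array object.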

PYTHON:
• Python is a general-purpose, dynamic, high-level, interpreted programming language. It
supports the object-oriented programming approach to developing applications. It is simple and easy to
learn and provides many high-level data structures.
• Python is an easy-to-learn yet powerful and versatile scripting language, which makes it attractive for
application development.
• Python's syntax and dynamic typing, together with its interpreted nature, make it an ideal language for
scripting and rapid application development.
• Python supports multiple programming paradigms, including object-oriented, imperative, and
functional or procedural styles.
• Python is not restricted to a particular area such as web programming. It is known
as a multipurpose language because it can be used for web, enterprise, 3D CAD and other applications.
• Variables do not need type declarations because Python is dynamically typed, so we can
write a=10 to assign an integer value to a variable.
• Python makes development and debugging fast because there is no separate compilation step
in Python development, so the edit-test-debug cycle is very fast.
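
The dynamic typing point above can be seen in a few lines. The helper function describe is made up for this illustration.

```python
a = 10                      # no type declaration: the name a is bound to an int
print(type(a).__name__)     # int

a = "ten"                   # the same name can later be bound to a str
print(type(a).__name__)     # str


def describe(value):
    """Accept any argument; its type is only known at run time."""
    return f"{value!r} is a {type(value).__name__}"


print(describe(3.5))        # 3.5 is a float
print(describe([1, 2]))     # [1, 2] is a list
```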
Python features:
Python provides many features, which are listed below.
• Easy to Learn and Use: Python is easy to learn and use. It is a developer-friendly, high-level
programming language.
• Expressive Language: Python is expressive, meaning that its code is understandable
and readable.
• Interpreted Language: Python is an interpreted language, i.e. the interpreter executes the code line by
line. This makes debugging easy and thus suitable for beginners.
• Cross-platform Language: Python runs equally well on different platforms such as Windows,
Linux, Unix and Macintosh. So we can say that Python is a portable language.
• Free and Open Source: Python is freely available from its official website. The
source code is also available, so it is open source.
• Object-Oriented Language: Python supports object-oriented programming, with the concepts of
classes and objects.
• Extensible: Code written in other languages such as C/C++ can be compiled and
then used from our Python code.
• Large Standard Library: Python has a large and broad standard library that provides a rich set of
modules and functions for rapid application development.
• GUI Programming Support: Graphical user interfaces can be developed using Python.
• Integrated: It can be easily integrated with languages like C, C++, Java etc.

Python applications:
Python is known for its general-purpose nature, which makes it applicable in almost every domain of
software development.
The application areas where Python can be applied are listed below.

• Web Applications:
We can use Python to develop web applications. It provides libraries to handle internet formats and
protocols such as HTML, XML, JSON and email processing, including requests, BeautifulSoup,
Feedparser etc. It also provides frameworks such as Django, Pyramid and Flask to design and develop
web-based applications. Some important developments are PythonWikiEngines, Pocoo,
PythonBlogSoftware etc.

• Desktop GUI Applications:
Python provides the Tk GUI library to develop user interfaces in Python-based applications. Other
useful toolkits such as wxWidgets, Kivy and PyQt are usable on several platforms. Kivy is popular for
writing multi-touch applications.

• Software Development:
Python is helpful in the software development process. It works as a support language and can be used for
build control and management, testing etc.

• Scientific and Numeric:
Python is popular and widely used in scientific and numeric computing. Some useful libraries and
packages are SciPy, Pandas, IPython etc. SciPy is a group of packages for engineering, science and
mathematics.

• Business Applications:
Python is used to build business applications like ERP and e-commerce systems. Tryton is a high-level
application platform.

• Console Based Application:
We can use Python to develop console-based applications. For example: IPython.

• Audio or Video based Applications:
Python can be used to develop multimedia applications.
Some real applications are TimPlayer, cplay etc.

• 3D CAD Applications:
Fandango is a real application for creating CAD applications that provides the full features of CAD.

• Enterprise Applications:
Python can be used to create applications for use within an enterprise or an organization.
Some real applications are OpenErp, Tryton, Picalo etc.

• Applications for Images:
Several image applications can be developed using Python, for example VPython,
Gogh, imgSeek etc.
2.4 FEASIBILITY STUDY
An important outcome of the preliminary investigation is the determination that the requested system is
feasible, which is the case only if it can be built within the limited resources and time available. The
feasibilities that have to be analyzed are:
Operational Feasibility
Economic Feasibility
Technical Feasibility

Operational Feasibility
Operational feasibility deals with the prospects of the system to be developed. This system
relieves the Admin of operational burdens and helps him or her track the project
progress effectively. This kind of automation will reduce the time and energy that were previously consumed
in manual work. Based on the study, the system is found to be operationally feasible.


Economic Feasibility
Economic feasibility, or cost-benefit analysis, is an assessment of the economic justification for a
computer-based project. As the hardware was installed from the beginning and serves many purposes, the
hardware cost of the project is low. Since the system is network based, any number of employees connected
to the LAN within the organization can use this tool at any time. The Virtual Private Network is
to be developed using the existing resources of the organization. So the project is economically
feasible.

Technical Feasibility
According to Roger S. Pressman, Technical Feasibility is the assessment of the technical resources of
the organization. The organization needs IBM compatible machines with a graphical web browser
connected to the Internet and Intranet. The system is developed for a platform-independent environment.
Java Server Pages, JavaScript, HTML, SQL server and WebLogic Server are used to develop the
system. The technical feasibility has been carried out. The system is technically feasible for
development and can be developed with the existing facility.


3.REVIEW OF THE LITERATURE

Flight delays hurt airlines, airports, and passengers. Their prediction is crucial during the
decision-making process for all players of commercial aviation. Moreover, the development of
accurate prediction models for flight delays became cumbersome due to the complexity of air
transportation system, the number of methods for prediction, and the deluge of flight data. In this
context, this paper presents a thorough literature review of approaches used to build flight delay
prediction models from the Data Science perspective. We propose a taxonomy and summarize the
initiatives used to address the flight delay prediction problem, according to scope, data, and
computational methods, giving particular attention to an increased usage of machine learning methods.
Besides, we also present a timeline of significant works that depicts relationships between flight delay
prediction problems and research trends to address them.
The expected growth in air travel demand and the positive correlation with the economic factors
highlight the significant contribution of the aviation community to the U.S. economy. On‐time
operations play a key role in airline performance and passenger satisfaction. Thus, an accurate
investigation of the variables that cause delays is of major importance. The application of machine
learning techniques in data mining has seen explosive growth in recent years and has garnered interest
from a broadening variety of research domains including aviation. This study employed a support
vector machine (SVM) model to explore the non-linear relationship between flight delay outcomes.
Individual flight data were gathered from 20 days in 2018 to investigate causes and patterns of air
traffic delay at three major New York City airports. Considering the black box characteristic of the
SVM, a sensitivity analysis was performed to assess the relationship between dependent and
explanatory variables. The impacts of various explanatory variables are examined in relation to delay,
weather information, airport ground operation, demand-capacity, and flow management
characteristics. The variable impact analysis reveals that factors such as pushback delay, taxi-out
delay, ground delay program, and demand-capacity imbalance with the probabilities of 0.506, 0.478,
0.339, and 0.338, respectively, are significantly associated with flight departure delay. These findings
provide insight for better understanding of the causes of departure delays and the impacts of various
explanatory factors on flight delay patterns.


4.DESIGN AND IMPLEMENTATION

4.1 INTRODUCTION TO DESIGN

Systems design is the process of defining elements of a system like modules, architecture, components
and their interfaces and data for a system based on the specified requirements. It is the process of
defining, developing and designing systems which satisfies the specific needs and requirements of a
business or organization.
This system is designed as a single-platform web application for multiple users. The
existing system increases the chance of errors and places much more stress on the
people engaged in the work.

4.2 UML (Unified Modeling Language) DIAGRAMS

UML is a method for describing the system architecture in detail using a blueprint. UML represents
a collection of best engineering practices that have proven successful in the modeling of large and
complex systems. The UML is a very important part of developing object-oriented software and of the
software development process. The UML uses mostly graphical notations to express the design of
software projects. Using the UML helps project teams communicate, explore potential designs
and validate the architectural design of the software.

UML offers a set of standardized diagram types with which complex data, processes and systems can
easily be arranged in a clear, intuitive manner.

UML is neither a procedure nor a process; rather, it provides a "dictionary" of symbols,
each of which has a specific meaning. It offers diagram types for object-oriented analysis, design and
programming, thereby ensuring a seamless transition from requirements placed on a system to final
implementation. Structure and system behaviour are likewise shown, thereby offering clear reference
points for solution optimization.

One major aspect of UML is the ability to use diagrams as a part of project documentation. These can
be utilised in various ways in the most diverse kinds of documents; for example, Use Case Diagrams
used in describing functional requirements can be specified in the requirements definition. Class or
component diagrams can be used as software architecture in a design document. As a matter of
principle, UML diagrams can be used in practically any technical documentation (e.g. test plans) while
also serving as part of the user handbook.

1. Use Case Diagram:

A use case diagram represents the functionality of the system. Use cases focus on the behavior of the
system from an external point of view. Actors are external entities that interact with the system.

USECASE DIAGRAM

Use cases:
A use case describes a sequence of actions that provide something of measurable value to an actor and
is drawn as a horizontal ellipse.

Actors:
An actor is a person, organization, or external system that plays a role in one or more interactions with
the system.


System boundary boxes:


A rectangle is drawn around the use cases, called the system boundary box, to indicate the scope of
the system. Anything within the box represents functionality that is in scope and anything outside the box
is not.
Four relationships among use cases are often used in practice.

Include:
In one form of interaction, a given use case may include another. "Include" is a directed relationship
between two use cases, implying that the behaviour of the included use case is inserted into the
behaviour of the including use case.
The first use case often depends on the outcome of the included use case. This is useful for extracting
truly common behaviours from multiple use cases into a single description. The notation is a dashed
arrow from the including to the included use case, with the label "«include»". There are no
parameters or return values. To specify the location in a flow of events in which the base use case
includes the behaviour of another, you simply write include followed by the name of use case you want
to include, as in the following flow for track order.

Extend:
In another form of interaction, a given use case (the extension) may extend another. This relationship
indicates that the behaviour of the extension use case may be inserted in the extended use case under
some conditions. The notation is a dashed arrow from the extension to the extended use case, with the
label "«extend»". Modellers use the «extend» relationship to indicate use cases that are "optional" to
the base use case.

Generalization:
In the third form of relationship among use cases, a generalization/specialization relationship exists. A
given use case may share common behaviours, requirements, constraints, and assumptions with a more
general use case. In this case, describe them once in the general use case and let the specialized
cases inherit them. The notation is a solid line ending in a hollow triangle drawn from the specialized
to the more general use case (following the standard generalization notation).

Associations:
Associations between actors and use cases are indicated in use case diagrams by solid lines. An
association exists whenever an actor is involved with an interaction described by a use case.
Associations are modelled as lines connecting use cases and actors to one another, with an optional
arrowhead on one end of the line. The arrowhead is often used to indicate the direction of the initial
invocation of the relationship or to indicate the primary actor within the use case.

Identified Use Cases:


The “user model view” encompasses a problem and solution from the perspective of those
individuals whose problem the solution addresses. The view presents the goals and objectives of the
problem owners and their requirements of the solution. This view is composed of “use case diagrams”.
These diagrams describe the functionality provided by a system to external actors. These
diagrams contain actors, use cases, and their relationships.

2. Class Diagram
Class-based Modeling, or more commonly class-orientation, refers to the style of object-oriented
programming in which inheritance is achieved by defining classes of objects; as opposed to the objects
themselves (compare Prototype-based programming).
The most popular and developed model of OOP is a class-based model, as opposed to an object-based
model. In this model, objects are entities that combine state (i.e., data), behavior (i.e., procedures, or
methods) and identity (unique existence among all other objects). The structure and behavior of an
object are defined by a class, which is a definition, or blueprint, of all objects of a specific type. An
object must be explicitly created based on a class and an object thus created is considered to be an
instance of that class. An object is similar to a structure, with the addition of method pointers, member
access control, and an implicit data member which locates instances of the class (i.e. actual objects of
that class) in the class hierarchy (essential for runtime features).

Class Diagram


3. Sequence Diagram:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that shows
how processes operate with one another and in what order. It is a construct of a Message Sequence
Chart.
Sequence diagrams are sometimes called event diagrams, event scenarios, and timing diagrams. A
sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects that live
simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in
which they occur. This allows the specification of simple runtime scenarios in a graphical manner. If
the lifeline is that of an object, it demonstrates a role. Note that leaving the instance name blank can
represent anonymous and unnamed instances. In order to display interaction, messages are used. These
are horizontal arrows with the message name written above them. Solid arrows with full heads are
synchronous calls, solid arrows with stick heads are asynchronous calls and dashed arrows with stick
heads are return messages. This definition is true as of UML 2, considerably different from UML 1.x.
Activation boxes, or method-call boxes, are opaque rectangles drawn on top of lifelines to represent
that processes are being performed in response to the message (Execution Specifications in UML).
Objects calling methods on themselves use messages and add new activation boxes on top of any
others to indicate a further level of processing. When an object is destroyed (removed from memory),
an X is drawn on top of the lifeline, and the dashed line ceases to be drawn below it (this is not the case
in the first example though). It should be the result of a message, either from the object itself, or
another.
A message sent from outside the diagram can be represented by a message originating from a filled-in
circle (found message in UML) or from a border of sequence diagram (gate in UML).


Sequence Diagram

4. Collaboration Diagram:
A Sequence diagram is dynamic, and, more importantly, is time ordered. A Collaboration diagram is
very similar to a Sequence diagram in the purpose it achieves; in other words, it shows the dynamic
interaction of the objects in a system. A distinguishing feature of a Collaboration diagram is that it
shows the objects and their association with other objects in the system apart from how they interact
with each other. The association between objects is not represented in a Sequence diagram.
A Collaboration diagram is easily represented by modeling objects in a system and representing the
associations between the objects as links. The interaction between the objects is denoted by arrows. To
identify the sequence of invocation of these objects, a number is placed next to each of these arrows.


Defining a Collaboration Diagram:


A sophisticated modeling tool can easily convert a collaboration diagram into a sequence diagram and
vice versa. Hence, the elements of a Collaboration diagram are essentially the same as those of a
Sequence diagram.

Collaboration diagram

Activity Diagram:
Activity diagrams are graphical representations of workflows of stepwise activities and actions with
support for choice, iteration and concurrency. In the Unified Modeling Language, activity diagrams
can be used to describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control. Activity diagrams are constructed from
a limited repertoire of shapes, connected with arrows.

The most important shape types:


• rounded rectangles represent activities;
• diamonds represent decisions;
• bars represent the start (split) or end (join) of concurrent activities;
• a black circle represents the start (initial state) of the workflow;
• an encircled black circle represents the end (final state).


Arrows run from the start towards the end and represent the order in which activities happen. However,
the join and split symbols in activity diagrams only resolve this for simple cases; the meaning of the
model is not clear when they are arbitrarily combined with the decisions or loops.

Activity diagram

5. State Chart Diagram:


Objects have behaviors and states. The state of an object depends on its current activity or condition. A
state chart diagram shows the possible states of the object and the transitions that cause a change in
state. A state diagram, also called a state machine diagram or state chart diagram, is an illustration of
the states an object can attain as well as the transitions between those states in the Unified Modeling
Language. A state diagram resembles a flowchart in which the initial state is represented by a large
black dot and subsequent states are portrayed as boxes with rounded corners. There may be one or two
horizontal lines through a box, dividing it into stacked sections. In that case, the upper section contains
the name of the state, the middle section (if any) contains the state variables and the lower section
contains the actions performed in that state. If there are no horizontal lines through a box, only the
name of the state is written inside it. External straight lines, each with an arrow at one end, connect
various pairs of boxes. These lines define the transitions between states. The final state is portrayed as
a large black dot with a circle around it. Historical states are denoted as circles with the letter H
inside.

State Chart Diagram


4.3 IMPLEMENTATION

System Architecture

Introduction
A delay of an aircraft can be problematic for the travelling passengers as it prevents them from
fulfilling their commitments and attending preplanned events. This can cost the passenger a
lot of money as well as leave him or her frustrated and angry. Several models have already been
proposed to forecast flight delays. We utilize a machine learning technique called Lasso
regression to predict aircraft delays. This technique takes various independent parameters and
trains a model to predict whether an aircraft is going to be delayed or not. We implemented the
algorithm on the Microsoft Azure Learning Studio platform. We also utilised a weather dataset and
joined it with the airport dataset at the respective locations to determine the effect of weather
conditions on flight delays as well as make the prediction more accurate for real world scenarios. We
train the model using 70 percent of the dataset and then test it with the remaining 30 percent of the data.
The model was able to successfully predict the correct outcome in more than 80 percent of the
scenarios.

Dataset Description:
• The sample data has been collected from the Department of Transportation and consists of
records of flight details and weather data.
• Dataset: 2015 Flight Delays and Cancellations from Kaggle.
• The dataset consists of 23,123 entries and 31 columns.
• The dataset contains data on on-time, delayed, cancelled and diverted flights, flight details,
and the arrival, departure and scheduled times of flights.

Features
YEAR: Year of the Flight Trip
MONTH: Month of the Flight Trip
DAY: Day of the Flight Trip
DAY_OF_WEEK: Day of week of the Flight Trip
AIRLINE: Airline Identifier
FLIGHT_NUMBER: Flight Identifier
TAIL_NUMBER: Aircraft Identifier
ORIGIN_AIRPORT: Starting Airport
DESTINATION_AIRPORT: Destination Airport
SCHEDULED_DEPARTURE: Planned Departure Time
DEPARTURE_TIME: WHEELS_OFF - TAXI_OUT
DEPARTURE_DELAY: Total Delay on Departure
TAXI_OUT: The time duration elapsed between departure from the origin airport gate and wheels off
WHEELS_OFF: The time point that the aircraft's wheels leave the ground
SCHEDULED_TIME: Planned time amount needed for the flight trip
ELAPSED_TIME: AIR_TIME+TAXI_IN+TAXI_OUT
AIR_TIME: The time duration between wheels_off and wheels_on time
DISTANCE: Distance between two airports
WHEELS_ON: The time point that the aircraft's wheels touch on the ground
TAXI_IN: The time duration elapsed between wheels-on and gate arrival at the destination airport
SCHEDULED_ARRIVAL: Planned arrival time
ARRIVAL_TIME: WHEELS_ON+TAXI_IN
ARRIVAL_DELAY: ARRIVAL_TIME-SCHEDULED_ARRIVAL
DIVERTED: Aircraft landed at an airport other than the scheduled destination
CANCELLED: Flight Cancelled (1 = cancelled)
CANCELLATION_REASON: Reason for Cancellation of flight: A - Airline/Carrier; B - Weather; C -
National Air System; D - Security
AIR_SYSTEM_DELAY: Delay caused by air system
SECURITY_DELAY: Delay caused by security
AIRLINE_DELAY: Delay caused by the airline
LATE_AIRCRAFT_DELAY: Delay caused by late arrival of the aircraft
WEATHER_DELAY: Delay caused by weather.
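
The derived fields in the list above can be sanity-checked with a few lines of pandas. The sketch below is only an illustration: the rows are hand-made (not taken from the dataset), the durations are assumed to be in minutes, and the ELAPSED_OK column is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical rows for illustration only; durations are assumed to be in minutes.
flights = pd.DataFrame({
    "TAXI_OUT": [15, 20],
    "AIR_TIME": [120, 95],
    "TAXI_IN": [5, 10],
    "ELAPSED_TIME": [140, 125],
})

# ELAPSED_TIME = AIR_TIME + TAXI_IN + TAXI_OUT, as defined in the feature list.
computed = flights["AIR_TIME"] + flights["TAXI_IN"] + flights["TAXI_OUT"]

# Flag rows where the recorded value agrees with the recomputed one.
flights["ELAPSED_OK"] = computed == flights["ELAPSED_TIME"]
print(flights)
```

Rows where the flag is False would be candidates for closer inspection during cleaning.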

Project Modules
Pre-processing:
Data pre-processing is a technique used to convert raw data into a clean dataset. Data
gathered from different sources is in a raw format that is not suitable for analysis. Pre-processing
for this approach takes four simple yet effective steps.

Cleaning missing values:


In some cases the dataset contains missing values, and we need to be equipped to handle them when
we come across them. Removing the affected rows outright is rarely desirable. One of the most
common ways to handle the problem is to take the mean of all the values in the same column and use it
to replace the missing data. The library used for the task is scikit-learn. Older releases provided
a class called Imputer in sklearn.preprocessing; current releases provide SimpleImputer in
sklearn.impute, which takes care of the missing data in the same way.
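
Mean imputation as described above can be sketched as follows. This assumes a recent scikit-learn release, where the imputer is SimpleImputer in sklearn.impute (the Imputer class in sklearn.preprocessing was removed in later releases). The small array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical delay columns with gaps (np.nan marks a missing value).
X = np.array([[10.0, 2.0],
              [np.nan, 4.0],
              [30.0, np.nan]])

# strategy="mean" replaces each gap with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Here the first column's gap becomes 20.0 (the mean of 10 and 30) and the second column's gap becomes 3.0 (the mean of 2 and 4).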

Training and Test data:


The next step is to split the dataset into two
parts, a training set and a test set. We train our machine learning models on the training set, i.e.
the models try to learn any correlations in the training set, and then we
test the models on the test set to examine how accurately they predict. A general rule of thumb is
to assign 80% of the dataset to the training set and the remaining 20% to the test set.
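
A minimal sketch of the 80/20 split, assuming scikit-learn's train_test_split. The arrays are placeholders and the random_state value is an arbitrary choice made only so the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (10 samples, 2 features) and target vector.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 keeps 80% for training; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Changing test_size to 0.3 would give the 70/30 split mentioned in the introduction to this section.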

Feature Scaling:
The final step of data pre- processing is feature scaling. But what is it? It is a method used to
standardize the range of independent variables or features of data. But why is it necessary? A lot of

DEPARTMENT OF MCA 33 K. B. N PG COLLEGE


Prediction of Flight Delay Analysis

machine learning models are based on Euclidean distance. If, for example, the values in one column
(x) is much higher than the value in another column (y), (x2-x1) squared will give a far greater value
than (y2-y1) squared. So clearly, one square distinction dominates over the other square distinction. In
the machine learning equations, the square difference with the lower value in comparison to the far
greater value will almost be treated as if it does not exist. We do not want that to happen. That is why
it’s necessary to transform all our variables into the same scale.
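
The dominance effect described above can be seen numerically. The sketch below uses made-up values (a distance column in miles and a taxi-out column in minutes) and standardizes by hand with NumPy. The helper name squared_terms is hypothetical.

```python
import numpy as np

# Hypothetical features on very different scales:
# column 0 = distance in miles, column 1 = taxi-out time in minutes.
X = np.array([[2500.0, 10.0],
              [300.0, 25.0],
              [1200.0, 15.0]])

def squared_terms(a, b):
    """Per-feature squared differences that sum to the squared Euclidean distance."""
    return (a - b) ** 2

# Before scaling, the distance column dominates the total.
print(squared_terms(X[0], X[1]))

# Standardize each column to zero mean and unit variance (z-score).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling, both columns contribute on a comparable scale.
print(squared_terms(X_scaled[0], X_scaled[1]))
```

scikit-learn's StandardScaler applies the same z-score transformation column by column.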

Label Encoding
In machine learning, we usually deal with datasets that contain multiple labels in one or more
columns. These labels can be in the form of words or numbers. To make the data understandable or
human readable, the training data is often labeled in words.
Label encoding refers to converting the labels into numeric form so that they are
machine readable. Machine learning algorithms can then decide in a better way how those
labels must be operated on. It is an important pre-processing step for structured datasets in supervised
learning.

Limitation of label Encoding:


Label encoding converts the data into machine-readable form, but it assigns a unique number (starting
from 0) to each class of data. This may lead to a priority issue during training: a
label with a high value may be considered to have higher priority than a label with a lower value.
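
A short sketch of label encoding and the ordering issue, assuming scikit-learn's LabelEncoder. The airline codes are hypothetical stand-ins for a categorical column such as AIRLINE.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical airline codes standing in for the AIRLINE column.
airlines = ["UA", "AA", "DL", "AA", "UA"]

encoder = LabelEncoder()
codes = encoder.fit_transform(airlines)

# Classes are sorted before numbering, so the integers carry no real ranking.
print(encoder.classes_.tolist())  # ['AA', 'DL', 'UA']
print(codes.tolist())             # [2, 0, 1, 0, 2]
```

"UA" receives the largest code only because it sorts last alphabetically, which is exactly the artificial priority described above.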

Min Max Scaling


An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max
scaling (often also simply called "normalization" - a common cause for ambiguities). In this approach,
the data is scaled to a fixed range - usually 0 to 1. The cost of having this bounded range - in contrast to
standardization - is that we will end up with smaller standard deviations, which can suppress the effect
of outliers.
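
Min-Max scaling can be written directly from its definition, x' = (x - min) / (max - min). The sketch below uses made-up delay values, including one outlier, to show how the data is mapped onto the range 0 to 1.

```python
import numpy as np

# Hypothetical departure delays in minutes, including one outlier.
delays = np.array([0.0, 10.0, 20.0, 200.0])

# Min-max scaling: x' = (x - min) / (max - min), mapping the data onto [0, 1].
scaled = (delays - delays.min()) / (delays.max() - delays.min())
print(scaled)
```

The outlier maps to 1.0 and the remaining values are compressed near 0. scikit-learn's MinMaxScaler applies the same formula column by column.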

Feature Selection
Feature selection is also called variable selection or attribute selection.
It is the automatic selection of attributes in your data (such as columns in tabular data) that are most
relevant to the predictive modeling problem you are working on. In other words,
feature selection is the process of selecting a subset of relevant features for use in model
construction.
Feature selection is different from dimensionality reduction. Both methods seek to reduce the number
of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations
of attributes, whereas feature selection methods include and exclude attributes present in the data
without changing them.

Correlation matrix:
A correlation matrix is a table showing correlation coefficients between sets of variables. Each random
variable (X_i) in the table is correlated with each of the other variables in the table (X_j). This allows you
to see which pairs have the highest correlation.
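
A correlation matrix can be computed with pandas as sketched below. The values are invented for illustration and df_demo is a hypothetical name. Only the column names come from the feature list.

```python
import pandas as pd

# Hypothetical values for three columns from the feature list.
df_demo = pd.DataFrame({
    "DEPARTURE_DELAY": [5, 20, 0, 45, 10],
    "ARRIVAL_DELAY": [3, 25, -2, 50, 8],
    "DISTANCE": [500, 1200, 300, 800, 2500],
})

# Pearson correlation between every pair of columns.
corr = df_demo.corr()
print(corr.round(2))

# Column with the strongest (absolute) correlation to ARRIVAL_DELAY.
strongest = corr["ARRIVAL_DELAY"].drop("ARRIVAL_DELAY").abs().idxmax()
print(strongest)
```

In this toy data the departure delay is the column most strongly correlated with the arrival delay, which is the kind of pairing a correlation matrix is meant to expose.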

Applying Algorithms
The dataset is split into training and test data, and the model is then trained with regression algorithms such as
Support Vector Regression and LASSO regression.
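
The training step might look like the sketch below, assuming scikit-learn's Lasso and SVR estimators. The data is synthetic, and the hyperparameter values (alpha, C, kernel) are placeholders rather than the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data: 200 samples, 3 features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyperparameter values here are placeholders, not the project's settings.
models = {
    "LASSO": Lasso(alpha=0.01),
    "SVR": SVR(kernel="rbf", C=10.0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R-squared on the held-out data
    print(name, round(scores[name], 3))
```

For regressors, the score method returns R-squared on the data passed to it, so the two models can be compared on the same test set.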

Validation of Model
Model validation is the process of checking whether the user input is suitable for model binding and if
not it should provide useful error messages to the user. The first part is to ensure that only valid entries
are made. This should filter inputs which don’t make any sense.

Calculating R-squared metrics:


R-Squared (R² or the coefficient of determination) is a statistical measure in a regression model that
determines the proportion of variance in the dependent variable that can be explained by the
independent variables. In other words, R-squared tells how well the data fit the regression model (the
goodness of fit).
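
R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up observed and predicted values.

```python
import numpy as np

# Hypothetical observed delays and model predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```

With these numbers the residual sum of squares is 26 and the total sum of squares is 500, so R-squared is about 0.948. scikit-learn's r2_score function computes the same quantity.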

Algorithms

Elastic net regression:


Linear regression refers to a model that assumes a linear relationship between input variables and the
target variable.
With a single input variable, this relationship is a line, and with higher dimensions, this relationship
can be thought of as a hyperplane that connects the input variables to the target variable. The
coefficients of the model are found via an optimization process that seeks to minimize the sum squared
error between the predictions (yhat) and the expected target values (y).


loss = sum i=0 to n (y_i – yhat_i)^2


A problem with linear regression is that estimated coefficients of the model can become large, making
the model sensitive to inputs and possibly unstable. This is particularly true for problems with few
observations (samples) or fewer samples (n) than input predictors (p) or variables (so-called p >> n
problems).
One approach to addressing the stability of regression models is to change the loss function to include
additional costs for a model that has large coefficients. Linear regression models that use these
modified loss functions during training are referred to collectively as penalized linear regression.
One popular penalty is to penalize a model based on the sum of the squared coefficient values. This is
called an L2 penalty. An L2 penalty minimizes the size of all coefficients, although it prevents any
coefficients from being removed from the model.

l2_penalty = sum j=0 to p beta_j^2


Another popular penalty is to penalize a model based on the sum of the absolute coefficient values.
This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some
coefficients to be minimized to the value zero, which removes the predictor from the model.

l1_penalty = sum j=0 to p abs(beta_j)


Elastic net is a penalized linear regression model that includes both the L1 and L2
penalties during training.
Using the terminology from “The Elements of Statistical Learning,” a hyperparameter
“alpha” is provided to assign how much weight is given to each of the L1 and L2
penalties. Alpha is a value between 0 and 1 and is used to weight the contribution of
the L1 penalty and one minus the alpha value is used to weight the L2 penalty.

elastic_net_penalty = (alpha * l1_penalty) + ((1 – alpha) * l2_penalty)


For example, an alpha of 0.5 would provide a 50 percent contribution of each penalty
to the loss function. An alpha value of 0 gives all weight to the L2 penalty and a value
of 1 gives all weight to the L1 penalty.
The benefit is that elastic net allows a balance of both penalties, which can result in better performance
than a model with either one or the other penalty on some
problems.

Another hyperparameter, called “lambda”, controls the weighting of the sum of both
penalties in the loss function. A default value of 1.0 is used to apply the fully weighted penalty; a value of
0 excludes the penalty. Very small values of lambda, such as 1e-3 or smaller, are common.

elastic_net_loss = loss + (lambda * elastic_net_penalty)
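
The formulas above can be checked with a direct NumPy translation. This is a sketch of the loss as written in this section, not of any library's internal objective. The function name and the sample values are hypothetical.

```python
import numpy as np

def elastic_net_loss(y, yhat, beta, alpha=0.5, lam=1.0):
    """Squared-error loss plus the weighted L1/L2 penalty from the formulas above."""
    loss = np.sum((y - yhat) ** 2)
    l1_penalty = np.sum(np.abs(beta))
    l2_penalty = np.sum(beta ** 2)
    penalty = alpha * l1_penalty + (1.0 - alpha) * l2_penalty
    return loss + lam * penalty

# Hypothetical targets, predictions and coefficients.
y = np.array([1.0, 2.0, 3.0])
yhat = np.array([1.5, 2.0, 2.5])
beta = np.array([2.0, -1.0])

# loss = 0.5, l1_penalty = 3, l2_penalty = 5, so the result is 0.5 + (1.5 + 2.5) = 4.5
print(elastic_net_loss(y, yhat, beta, alpha=0.5, lam=1.0))
```

Note that scikit-learn's ElasticNet names these hyperparameters differently: its alpha argument plays the role of lambda here, and its l1_ratio argument plays the role of alpha. It also scales the terms of its objective differently, so its numeric values are not directly comparable with this sketch.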

LASSO Regression:
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are
shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models
(i.e. models with fewer parameters). This particular type of regression is well-suited for models
showing high levels of multicollinearity or when you want to automate certain parts of model selection,
like variable selection/parameter elimination.
The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator.
A tuning parameter, λ, controls the strength of the L1 penalty. λ is basically the amount of shrinkage:
• When λ = 0, no parameters are eliminated. The estimate is equal to the one found with linear
regression.
• As λ increases, more and more coefficients are set to zero and eliminated (theoretically, when λ
= ∞, all coefficients are eliminated).
• As λ increases, bias increases.
• As λ decreases, variance increases.

Lasso regression is one of the regularization methods that create parsimonious models in the
presence of a large number of features, where large means either of the two things below:
• Large enough to enhance the tendency of the model to over-fit. As few as ten variables can
cause overfitting.
• Large enough to cause computational challenges. This situation can arise with millions or
billions of features.
Lasso regression performs L1 regularization, that is, it adds a penalty equivalent to the
absolute value of the magnitude of the coefficients. Here the minimization objective is as follows:

Minimization objective = LS Obj + λ (sum of absolute value of coefficients)


Where LS Obj stands for Least Squares Objective, which is nothing but the linear regression objective
without regularization, and λ is the tuning factor that controls the amount of regularization. The bias
will increase with the increasing value of λ and the variance will decrease as the amount of shrinkage
(λ) increases.
It is basically an alternative to the classic least squares estimate to avoid many of the problems with
overfitting when we have a large number of independent variables.
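
The minimization objective can be written out directly as a small function. This is a sketch only: it takes LS Obj to be the plain sum of squared errors with no intercept, and the toy data and the name `lasso_objective` are assumptions made for illustration.

```python
def lasso_objective(rows, targets, coefs, lam):
    """LS Obj + lam * (sum of absolute values of the coefficients)."""
    # Least squares objective: sum of squared prediction errors
    ls_obj = 0.0
    for row, target in zip(rows, targets):
        prediction = sum(c * v for c, v in zip(coefs, row))
        ls_obj += (target - prediction) ** 2
    # L1 penalty: sum of absolute coefficient values
    l1_penalty = sum(abs(c) for c in coefs)
    return ls_obj + lam * l1_penalty


# Toy data in which the coefficients [2, 1] fit the targets exactly
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [2.0, 1.0, 3.0]

perfect_fit = lasso_objective(rows, targets, [2.0, 1.0], lam=0.5)
all_zero = lasso_objective(rows, targets, [0.0, 0.0], lam=0.5)
print(perfect_fit, all_zero)
```

The perfect fit pays only the penalty term, while the all-zero model pays only the squared-error term; λ decides how the two are traded off.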


A large coefficient puts heavy emphasis on the corresponding feature as a predictor of the
outcome. When a coefficient becomes too large, the algorithm starts modeling intricate relations to
calculate the output and ends up overfitting to the particular data. Lasso regression counters this by
adding the sum of the absolute values of the coefficients to the optimization objective.


5. SAMPLE CODE


SOURCE CODE:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv("flightsdelay.csv")
print("Data read successfully")
df

EXPLORATORY DATA ANALYSIS


# Column names, data types and non-null counts
df.info()
# Number of rows and columns
df.shape
# Summary statistics of the numeric columns
df.describe()
# Number of unique values in each column
df.nunique()
# Number of missing values in each column
df.isnull().sum()

DATA PRE-PROCESSING

Missing value correction

# Two kinds of missing values are handled here:
# 1. Numerical columns: missing entries are replaced with the column mean.
# 2. String columns: missing entries are replaced with the most frequent value
#    (the top entry of value_counts()).
cd = df['CarrierDelay'].astype('float').mean(axis=0)
df['CarrierDelay'] = df['CarrierDelay'].fillna(cd)


wd = df['WeatherDelay'].astype('float').mean(axis=0)
df['WeatherDelay'] = df['WeatherDelay'].fillna(wd)
nas = df['NASDelay'].astype('float').mean(axis=0)
df['NASDelay'] = df['NASDelay'].fillna(nas)
sd = df['SecurityDelay'].astype('float').mean(axis=0)
df['SecurityDelay'] = df['SecurityDelay'].fillna(sd)
lad = df['LateAircraftDelay'].astype('float').mean(axis=0)
df['LateAircraftDelay'] = df['LateAircraftDelay'].fillna(lad)
# Confirm that the delay columns no longer contain missing values
df.isnull().sum()
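
The listing applies the mean rule to the numeric delay columns, while the most-frequent-value rule for string columns is described but not shown. A minimal, library-free sketch of both rules follows; the helper names and the sample values are hypothetical, and missing entries are represented by `None` rather than NaN.

```python
from collections import Counter
from statistics import mean


def fill_numeric_with_mean(values):
    """Replace missing numeric entries (None) with the mean of the rest."""
    present = [v for v in values if v is not None]
    fill_value = mean(present)
    return [fill_value if v is None else v for v in values]


def fill_text_with_mode(values):
    """Replace missing string entries (None) with the most frequent value."""
    present = [v for v in values if v is not None]
    fill_value = Counter(present).most_common(1)[0][0]
    return [fill_value if v is None else v for v in values]


# Hypothetical column contents
carrier_delay = [10.0, None, 20.0, 30.0]
cancellation_code = ["A", None, "B", "A"]

filled_delay = fill_numeric_with_mean(carrier_delay)
filled_code = fill_text_with_mode(cancellation_code)
print(filled_delay)
print(filled_code)
```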

LABEL ENCODING

Converting categorical variables to numerical

df
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Each fit_transform call refits the encoder on that column
df['TailNum'] = le.fit_transform(df['TailNum'])
df['Origin'] = le.fit_transform(df['Origin'])
df['Dest'] = le.fit_transform(df['Dest'])
df['CancellationCode'] = le.fit_transform(df['CancellationCode'])
df.info()

# Data splitting: LateAircraftDelay is the target, all other columns are features
x = df.drop('LateAircraftDelay', axis=1)
y = df['LateAircraftDelay']
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
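
To make the label encoding step concrete, the following library-free sketch mimics the mapping that `LabelEncoder` produces: the distinct labels are sorted and each is replaced by its position in that sorted list. The function name and the airport codes are hypothetical.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Hypothetical origin airports
origins = ["LAX", "JFK", "ORD", "JFK"]
codes, classes = label_encode(origins)
print(classes)
print(codes)
```

The integer codes carry no real ordering, so a linear model will still treat a larger code as a larger value; this is worth keeping in mind when interpreting coefficients on encoded columns.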

LINEAR REGRESSION

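# Import the model, fit it on the training set and score it on the test set
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(x_train, y_train)
from sklearn.metrics import r2_score
# r2_score returns the coefficient of determination (R²)
r2_score(y_test, model_lr.predict(x_test))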


Here we got an R² score of about 0.64 (64%) with LinearRegression.
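
The value returned by `r2_score` is the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a classification accuracy. A minimal sketch of that computation, with hypothetical delay values, is given here for reference; the function name is an assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observed delays (minutes) and three sets of predictions
observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, [1.0, 2.0, 3.0, 4.0])
mean_only = r_squared(observed, [2.5, 2.5, 2.5, 2.5])
one_miss = r_squared(observed, [1.0, 2.0, 3.0, 5.0])
print(perfect, mean_only, one_miss)
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that.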

LASSO REGRESSION

# Import the model
from sklearn.linear_model import Lasso
model_lso = Lasso()
model_lso.fit(x_train, y_train)
from sklearn.metrics import r2_score
# Evaluate the Lasso model on the held-out test set
r2_score(y_test, model_lso.predict(x_test))
Here we got an R² score of about 0.65 (65%) with the Lasso regressor.

RIDGE REGRESSION

# Import the model
from sklearn.linear_model import Ridge
model_rdg = Ridge()
model_rdg.fit(x_train, y_train)
from sklearn.metrics import r2_score
# Evaluate the Ridge model on the held-out test set
r2_score(y_test, model_rdg.predict(x_test))
The R² score reported for the Ridge regressor is about 0.65 (65%).

ELASTIC NET REGRESSION

# Import the model
from sklearn.model_selection import train_test_split
# Note: the data is re-split here with test_size=0.1, unlike the 0.3 split used for the
# models above, so the score below is not directly comparable with theirs
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=10)
from sklearn.linear_model import ElasticNet
model_en = ElasticNet(alpha=1.0)
model_en.fit(x_train, y_train)
from sklearn.metrics import r2_score
r2_score(y_test, model_en.predict(x_test))
Here we got an R² score of about 0.70 (70%) with the Elastic Net regressor.


6. SCREENSHOTS


Raw Dataset

Instance of Dataset


7.SYSTEM TESTING


The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality of
components, subassemblies, assemblies and/or a finished product. It is the process of exercising
software with the intent of ensuring that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. There are various types of tests. Each test
type addresses a specific testing requirement.

7.1 TYPES OF TESTS

Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning
properly, and that program inputs produce valid outputs. All decision branches and internal code flow
should be validated. It is the testing of individual software units of the application. It is done after the
completion of an individual unit and before integration. This is structural testing that relies on
knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level
and test a specific business process, application, and/or system configuration. Unit tests ensure that each
unique path of a business process performs accurately to the documented specifications and contains
clearly defined inputs and expected results.
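
As an illustration of a unit test with clearly defined inputs and expected results, the sketch below tests a small mean-imputation helper with Python's standard `unittest` module. The helper `fill_missing_with_mean` and its test cases are hypothetical and are not taken from the project code.

```python
import unittest


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot compute a mean when every value is missing")
    mean_value = sum(present) / len(present)
    return [mean_value if v is None else v for v in values]


class FillMissingWithMeanTest(unittest.TestCase):
    def test_missing_entries_take_the_mean(self):
        self.assertEqual(fill_missing_with_mean([10.0, None, 30.0]),
                         [10.0, 20.0, 30.0])

    def test_complete_input_is_unchanged(self):
        self.assertEqual(fill_missing_with_mean([5.0, 7.0]), [5.0, 7.0])

    def test_all_missing_input_is_rejected(self):
        # The error-handling path is exercised as well
        with self.assertRaises(ValueError):
            fill_missing_with_mean([None, None])


def run_suite():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FillMissingWithMeanTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
```

Each test method covers one path through the unit: the normal case, the no-op case and the error case.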

Integration testing
Integration tests are designed to test integrated software components to determine if they actually run
as one program. Testing is event driven and is more concerned with the basic outcome of screens or
fields. Integration tests demonstrate that although the components were individually satisfactory, as
shown by successful unit testing, the combination of components is correct and consistent.
Integration testing is specifically aimed at exposing the problems that arise from the combination of
components.

Functional test
Functional tests provide systematic demonstrations that functions tested are available as specified by
the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or special
test cases. In addition, systematic coverage of identified business process flows, data fields,
predefined processes, and successive processes must be considered for testing. Before functional
testing is complete, additional tests are identified and the effective value of current tests is determined.

System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process descriptions and
flows, emphasizing pre-driven process links and integration points.

White Box Testing


White Box Testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is used to test
areas that cannot be reached from a black box level.

Black Box Testing


Black Box Testing is testing the software without any knowledge of the inner workings, structure or
language of the module being tested. Black box tests, like most other kinds of tests, must be written from
a definitive source document, such as a specification or requirements document. It is testing in which
the software under test is treated as a black box: you cannot “see” into it. The test provides inputs and
responds to outputs without considering how the software works.

Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.

Test strategy and approach


Field testing will be performed manually and functional tests will be written in detail.


Test objectives

• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.

Features to be tested

• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.

Integration Testing
Software integration testing is the incremental integration testing of two or more integrated software
components on a single platform to produce failures caused by interface defects. The task of the
integration test is to check that components or software applications (for example, components in a
software system or, one step up, software applications at the company level) interact without error.

Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the
end user. It also ensures that the system meets the functional requirements.

Test Results
All the test cases mentioned above passed successfully. No defects were encountered.

7.2 TESTING METHODOLOGIES


The following are the Testing Methodologies:
• Unit Testing.
• Integration Testing.
• User Acceptance Testing.
• Output Testing.
• Validation Testing.

Unit Testing
Unit testing focuses verification effort on the smallest unit of software design, that is, the module. Unit
testing exercises specific paths in a module’s control structure to ensure complete coverage and
maximum error detection. This test focuses on each module individually, ensuring that it functions
properly as a unit. Hence the name Unit Testing.
During this testing, each module is tested individually and the module interfaces are verified for
consistency with the design specification. All the important processing paths are tested for the expected
results. All error handling paths are also tested.

Integration Testing
Integration testing addresses the issues associated with the dual problems of verification and program
construction. After the software has been integrated, a set of high order tests is conducted. The main
objective in this testing process is to take unit tested modules and build a program structure that has
been dictated by design.

The following are the types of Integration Testing:

• Top Down Integration
This method is an incremental approach to the construction of program structure. Modules are
integrated by moving downward through the control hierarchy, beginning with the main program
module. Modules subordinate to the main program module are incorporated into the structure in
either a depth-first or breadth-first manner.
In this method, the software is tested from the main module, and individual stubs are replaced as the test
proceeds downwards.
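
The role of a stub in top-down integration can be sketched as follows. A hypothetical main-module function is tested first against a stub that stands in for a regressor that has not been integrated yet; the class and function names, the fixed delay of 15 minutes and the sample rows are all assumptions made for illustration.

```python
class DelayModelStub:
    """Stub standing in for a trained regressor that is not integrated yet."""

    def predict(self, rows):
        # Return a fixed, known delay for every row so the caller is easy to check
        return [15.0 for _ in rows]


def summarize_predictions(model, rows):
    """Main-module logic under test: average predicted delay in minutes."""
    predictions = model.predict(rows)
    if not predictions:
        raise ValueError("no rows to predict")
    return sum(predictions) / len(predictions)


# Test the main module first, with the stub in place of the real model
rows = [{"Origin": "JFK"}, {"Origin": "LAX"}]
average_delay = summarize_predictions(DelayModelStub(), rows)
print(average_delay)
```

Once the real model is available, it replaces the stub without any change to `summarize_predictions`, provided it exposes the same `predict` method.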

• Bottom-up Integration
This method begins the construction and testing with the modules at the lowest level in the program
structure. Since the modules are integrated from the bottom up, processing required for modules
subordinate to a given level is always available and the need for stubs is eliminated. The bottom-up
integration strategy may be implemented with the following steps:
• The low-level modules are combined into clusters that perform a specific software
sub-function.
• A driver (the control program) for testing is written to coordinate test case input and output.
• The cluster is tested.
• Drivers are removed and clusters are combined moving upward in the program structure.

The bottom-up approach tests each module individually, and then each module is
integrated with a main module and tested for functionality.

User Acceptance Testing


User Acceptance of a system is the key factor for the success of any system. The system under
consideration is tested for user acceptance by constantly keeping in touch with the prospective system
users at the time of developing and making changes wherever required. The system developed
provides a friendly user interface that can easily be understood even by a person who is new to the
system.

Output Testing
After performing the validation testing, the next step is output testing of the proposed system, since no
system could be useful if it does not produce the required output in the specified format. The outputs
generated or displayed by the system under consideration are tested by asking the users about the format
they require. The output format is considered in two ways: on screen and in printed format.

Validation Checking
Validation checks are performed on the following fields.

• Text Field:
The text field can contain only a number of characters less than or equal to its size. The text fields
are alphanumeric in some tables and alphabetic in other tables. An incorrect entry always flashes an error
message.

• Numeric Field:
The numeric field can contain only digits from 0 to 9. An entry of any other character flashes an error
message.

The individual modules are checked for accuracy and for what they have to perform. Each module
is subjected to a test run along with sample data. The individually tested modules are integrated into a
single system. Testing involves executing the program with real data; the
existence of any program defect is inferred from the output. The testing should be planned so that all
the requirements are individually tested.
A successful test is one that brings out the defects for inappropriate data and produces an output
revealing the errors in the system.
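
The two field checks described above can be sketched as small validator functions. The function names, the `alphabetic_only` flag and the sample entries are hypothetical; the sketch only shows the rules (maximum length, alphanumeric or alphabetic content, digits 0 to 9) and leaves the error message to the caller.

```python
def validate_text_field(value, max_length, alphabetic_only=False):
    """Accept text no longer than max_length that is alphanumeric (or alphabetic)."""
    if len(value) > max_length:
        return False
    return value.isalpha() if alphabetic_only else value.isalnum()


def validate_numeric_field(value):
    """Accept a non-empty entry made up only of the digits 0 to 9."""
    return value != "" and all(ch in "0123456789" for ch in value)


# Hypothetical entries: (description, result of the check)
checks = [
    ("tail number within size", validate_text_field("N123AA", 6)),
    ("tail number too long", validate_text_field("N123AAX", 6)),
    ("alphabetic field with digit", validate_text_field("D3lta", 10, alphabetic_only=True)),
    ("numeric field with digits", validate_numeric_field("0945")),
    ("numeric field with letter", validate_numeric_field("9a45")),
]
for description, passed in checks:
    print(f"{description}: {'accepted' if passed else 'error message shown'}")
```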

Preparation of Test Data


The above testing is carried out using various kinds of test data. Preparation of test data plays a vital
role in system testing. After preparing the test data, the system under study is tested using that data.
While testing the system with test data, errors are again uncovered and corrected using the above testing
steps, and the corrections are noted for future use.

Using Live Test Data


Live test data are those that are actually extracted from organization files. After a system is partially
constructed, programmers or analysts often ask users to key in a set of data from their normal activities.
Then, the systems person uses this data as a way to partially test the system. In other instances,
programmers or analysts extract a set of live data from the files and have them entered themselves.
It is difficult to obtain live data in sufficient amounts to conduct extensive testing. And, although it is
realistic data that will show how the system will perform for the typical processing requirement,
assuming that the live data entered are in fact typical, such data generally will not test all combinations
or formats that can enter the system. This bias toward typical values then does not provide a true
systems test and in fact ignores the cases most likely to cause system failure.

Using Artificial Test Data


Artificial test data are created solely for test purposes, since they can be generated to test all
combinations of formats and values. In other words, the artificial data, which can quickly be prepared
by a data generating utility program in the information systems department, make possible the testing
of all logic and control paths through the program.
The most effective test programs use artificial test data generated by persons other than those who
wrote the programs. Often, an independent team of testers formulates a testing plan, using the systems
specifications.
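
A small sketch of how artificial test data covering all combinations of values might be generated is given here. The field names and the candidate values (two carriers, two origins, and a missing, zero and positive delay) are hypothetical and chosen only to show the idea.

```python
from itertools import product

# Hypothetical candidate values for each field, including a missing delay
carriers = ["AA", "DL"]
origins = ["JFK", "LAX"]
carrier_delays = [None, 0.0, 45.0]


def generate_test_records():
    """Build one record for every combination of the candidate values."""
    return [
        {"carrier": carrier, "origin": origin, "carrier_delay": delay}
        for carrier, origin, delay in product(carriers, origins, carrier_delays)
    ]


records = generate_test_records()
print(len(records), "artificial records generated")
print(records[0])
```

Because every combination is enumerated, cases that rarely occur in live data, such as a missing delay for a particular carrier and origin, are guaranteed to be exercised.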
The developed system has satisfied all the requirements specified in the software
requirement specification and was accepted.


USER TRAINING
Whenever a new system is developed, user training is required to educate them about the working of
the system so that it can be put to efficient use by those for whom the system has been primarily
designed. For this purpose the normal working of the project was demonstrated to the prospective
users. Its working is easily understandable and since the expected users are people who have good
knowledge of computers, the use of this system is very easy.

MAINTENANCE
This covers a wide range of activities including correcting code and design errors. To reduce the need
for maintenance in the long run, we have more accurately defined the user’s requirements during the
process of system development. Depending on the requirements, this system has been developed to
satisfy the needs to the largest possible extent. With development in technology, it may be possible to
add many more features based on the requirements in future. The coding and designing is simple and
easy to understand which will make maintenance easier.

TESTING STRATEGY
A strategy for system testing integrates system test cases and design techniques into a well-planned
series of steps that results in the successful construction of software. The testing strategy must
incorporate test planning, test case design, test execution, and the resultant data collection and
evaluation. A strategy for software testing must accommodate low-level tests that are necessary to
verify that a small source code segment has been correctly implemented as well as high level tests that
validate major system functions against user requirements.
Software testing is a critical element of software quality assurance and represents the ultimate review
of specification design and coding. Testing represents an interesting anomaly for the software. Thus, a
series of testing are performed for the proposed system before the system is ready for user acceptance
testing.


8.RESULT ANALYSIS


9.CONCLUSION AND FUTURE SCOPE


9.CONCLUSION

Overall, our models are only of limited utility since none were capable of correctly predicting flight
delays with both precision and recall greater than 50%. This seemingly low performance is likely due
to the many causes of flight delays being outside the scope of our data. It is unclear if it is even possible
to predict whether or not a flight will be delayed so far in advance, as we have set up the problem,
because so many of the causes of delays (e.g. mechanical issues and weather) cannot be known in
advance. Despite this, we were successful in creating models that outperform baseline models, and
perform at least about as well as prior work, even when we often use less information, and generalize
to more airports.

Although imperfect, this model still makes potentially useful predictions about which flights are more
or less likely to be delayed.

FUTURE SCOPE

To improve our model, it is essential to understand which features are important to the model. This
can be done for logistic regression. It can inspire new feature ideas in both high-bias and
high-variance cases, reveal the top features, and expose data leakage, which can occur when a
column that directly affects the output label is included. These insights will benefit future work: the
model can be updated as new problems arise, and most such problems can be addressed within this model.


