prashant major project final
Declaration
I hereby declare that the Project Report (BCA-605) entitled "Flight Delay Analysis System by Machine Learning" is done by me and is an authentic work carried out by me at the Department of Computer Application, ASB. To the best of my knowledge and belief, the matter embodied in this project work has not been submitted earlier for the award of any degree or diploma.
Certificate
The project has been done under my supervision and guidance, and it has not formed the basis for the award of any degree, diploma, or other similar title to any candidate.
Acknowledgement
I would like to express my deepest gratitude to all those who have contributed to the successful completion of the "Flight Delay Analysis System by Machine Learning". This project has been a challenging yet rewarding journey, and it would not have been possible without their support.
First and foremost, I would like to thank my project mentor, Prof. Shilpa Narula, and the Dean for their invaluable guidance, insightful feedback, and continuous encouragement throughout the development of this project. Their expertise and patience, and their teaching and mentorship, have provided me with the foundational knowledge and skills necessary to undertake this project. Their dedication to education and research has been an inspiration.
A special thanks to my friends and classmates, who have provided moral support; their willingness to help with testing and provide feedback has been crucial in refining the application.
I would also like to acknowledge the contributions of the developers and communities behind the tools and technologies used in this project, including the developers of Python and various open-source libraries. Their hard work and innovation have made it possible to create a robust and efficient application.
Finally, I am deeply appreciative of my family for their unwavering support and
encouragement. Their belief in my abilities has been a constant source of motivation and
strength.
Thank you all for your support and contributions to the Flight Delay Analysis System by Machine Learning.
Sincerely,
Prashant
BCA (2021-2024)
Abstract
Flight delays are a significant issue affecting airlines, airports, and passengers, leading to operational inefficiencies and passenger dissatisfaction. Over the past few decades, the prediction of flight delays has
been a heavily researched area due to the complexity of the air transportation system, the
variety of prediction methods available, and the massive amounts of flight data generated.
Developing accurate models to predict these delays has proven to be a challenging task.
The difficulty arises from the intricate nature of the factors influencing flight operations, such as weather conditions, air traffic control restrictions, and technical issues, as well as the need to process and analyze large datasets.
In this context, our paper presents a thorough literature review of the approaches used to predict flight delays, examining the different strategies and techniques employed in this domain. To achieve this, we propose a detailed taxonomy that categorizes the various initiatives based on their scope, the types of data they utilize, and the computational methods they implement. This taxonomy provides a structured overview of the field, making it easier to identify trends, strengths, and weaknesses in the existing research.
A particular emphasis is placed on the increasing use of machine learning methods for flight
delay prediction. Machine learning has gained prominence due to its ability to handle large
volumes of data and uncover complex patterns that traditional statistical methods might
miss. By reviewing the state-of-the-art machine learning techniques applied in this field,
we highlight the advancements and potential these methods offer for improving prediction
accuracy.
Furthermore, our paper goes beyond merely reviewing the existing approaches by also
evaluating the accuracy metrics used in flight delay prediction. Understanding and
comparing the performance of different models is crucial for identifying the most effective
strategies and guiding future research efforts. By providing this comprehensive review and analysis, we aim to contribute to the ongoing efforts to mitigate the impact of flight delays through better prediction and more informed decision-making.
Chapter 1:
INTRODUCTION
Flight delays are a significant issue affecting airlines, airports, and passengers, leading to operational inefficiencies and passenger dissatisfaction. Over the past few decades, the prediction
of flight delays has been a heavily researched area due to the complexity of the air
transportation system, the variety of prediction methods available, and the massive
amounts of flight data generated. Developing accurate models to predict these delays
has proven to be a challenging task. The difficulty arises from the intricate nature of
the factors influencing flight operations, such as weather conditions, air traffic control
restrictions, and technical issues, as well as the need to process and analyze large
datasets.
In this context, our paper presents a thorough literature review of the approaches used to predict flight delays. To achieve this, we propose a detailed taxonomy that categorizes the various initiatives based on their scope, the types of data they utilize, and the computational methods they implement, making it easier to identify trends, strengths, and weaknesses in the existing research.
A particular emphasis is placed on the increasing use of machine learning methods for
flight delay prediction. Machine learning has gained prominence due to its ability to
handle large volumes of data and uncover complex patterns that traditional statistical methods might miss. By reviewing the state-of-the-art machine learning techniques applied in this field, we highlight the advancements and potential these methods offer for improving prediction accuracy.
Furthermore, our paper goes beyond merely reviewing the existing approaches by also
evaluating the accuracy metrics used in flight delay prediction. Understanding and
comparing the performance of different models is crucial for identifying the most effective strategies and guiding future research efforts. Through this review, we aim to contribute to the ongoing efforts to mitigate the impact of flight delays through better prediction and more informed
decision-making.
Air transportation plays a vital role in the transportation infrastructure and contributes
significantly to the economy. Airports are known for their capability to increase
business activities in their vicinity, thus driving economic development. The aviation
industry also provides a substantial number of jobs. In 2016, a record 3.7 billion
passengers used air transport, and this number is expected to increase annually.
According to the worldwide air traffic report released by the International Air Transport
Association, the demand for air travel grew by 6.3 percent in 2016 compared to 2015.
Such a high volume of air traffic needs to be constantly monitored and managed to
prevent problems.
An aircraft is considered delayed when it departs and/or arrives later than its scheduled
time. There are several causes of flight delays, including weather changes, maintenance
issues, cascading delays from previous flights, and air traffic congestion. These delays
present a significant challenge for the aviation industry and its customers. In the USA
alone, flight delays cost approximately 22 billion US dollars annually. Airlines incur penalties from government authorities when aircraft are held for extended periods, and passengers bear the cost of lost time and missed connections.
Numerous models have been proposed to accurately forecast flight delays. In our study, we use supervised machine learning. This technique uses various independent parameters to train a model that classifies whether a flight will be delayed. By combining flight data with airport data, we assessed the impact of weather conditions on flight delays, thereby enhancing prediction accuracy for real-world scenarios. The model was trained with 70 percent of the dataset and tested with the remaining 30 percent, and its accuracy was evaluated on the held-out data.
Various models have been developed to predict flight delays. Yufeng et al. propose a
model for calculating distributions of departure delay times, which helps in determining
air traffic congestion. This study identifies key factors influencing departure times.
Michael et al. present a model for evaluating the characteristics of queuing networks
with variable arrival times and dynamic service schedules. Beatty et al. introduce the
concept of a Delay Multiplier to predict initial delays in flight schedules. Yufeng et al. also utilize genetic algorithms to train component mechanisms for predicting takeoff delays.
Our proposed system calculates flight delays based on scheduled and actual arrival and
departure times. By calculating the time differences, we derive the target variable for
delay prediction. We preprocess the flight delay dataset to make it suitable for machine learning, and model performance is validated using error metrics like Root Mean Square Error (RMSE).
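The delay target and the RMSE check described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the column names, time values, and prediction vector are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical sample of the flight dataset: times in minutes past midnight.
df = pd.DataFrame({
    "scheduled_dep": [600, 615, 700],
    "actual_dep":    [605, 660, 700],
})

# Target variable: departure delay = actual minus scheduled time.
df["dep_delay"] = df["actual_dep"] - df["scheduled_dep"]

# RMSE between a hypothetical prediction vector and the derived delays.
predicted = np.array([3, 40, 2])
errors = df["dep_delay"].to_numpy() - predicted
rmse = np.sqrt(np.mean(errors ** 2))
print(df["dep_delay"].tolist(), round(rmse, 2))
```

A negative `dep_delay` would simply indicate an early departure; the same subtraction yields the arrival-delay target when applied to arrival times.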
Objectives
1. To review the existing literature on flight delay prediction models.
2. To propose a taxonomy that categorizes these approaches based on scope, data, and computational methods.
3. To examine the machine learning techniques used to predict flight delays.
4. To evaluate and compare the accuracy metrics of different flight delay prediction models.
Limitations
1. Data Quality and Availability: The accuracy of prediction models heavily depends on the quality and availability of historical flight data, which can sometimes be incomplete or inconsistent.
2. Complexity of Influencing Factors: Delays result from many interacting factors, some of which are difficult to quantify or predict accurately, such as sudden weather changes.
3. Model Generalizability: Models trained on data from specific regions or time periods may not transfer well to other regions or periods.
4. Real-Time Processing: Handling large volumes of live flight data poses challenges related to data processing speed and integration with existing air traffic systems.
Feasibility Study
Feasibility studies assess the practicality of a project. The main aspects to consider are:
Operational Feasibility
This system aims to streamline and automate administrative tasks, reducing the time and effort required of its users; it is therefore operationally feasible.
Economic Feasibility
The system is built primarily with freely available open-source tools, keeping development and operating costs low, so the project is considered economically feasible.
Technical Feasibility
The system requires IBM-compatible machines with graphical web browsers connected to
the Internet and Intranet. It is platform-independent and developed using Java Server Pages,
JavaScript, HTML, SQL Server, and WebLogic Server. The technical feasibility has been
assessed, confirming that the project can be developed with the existing resources.
The Software Development Life Cycle (SDLC) for a flight delay analysis system using machine learning involves several phases to ensure the successful development and deployment of the system.
1. Requirement Analysis
Objective:
Gather and analyze the requirements for the flight delay analysis system using machine
learning.
Activities:
- Conduct a feasibility study covering operational, technical, and economic viability.
- Define the scope of the system, outlining the features, functionalities, and limitations.
Deliverables:
- Feasibility Report
2. System Design
Objective:
Design the system architecture and components based on the gathered requirements.
Activities:
- Create a high-level design (HLD) specifying the overall system architecture, including major modules and data flow.
- Develop a low-level design (LLD) detailing the components, algorithms, and data storage mechanisms.
- Select appropriate technologies for data ingestion, preprocessing, model training, and
deployment.
Deliverables:
- Design Documents (HLD and LLD)
3. Implementation
Objective:
Develop the flight delay analysis system according to the design specifications.
Activities:
- Code the data ingestion pipeline to collect flight data from various sources such as airline and airport databases.
- Preprocess the collected data, including cleaning and normalization.
- Integrate the models into the system and deploy them to a scalable infrastructure.
Deliverables:
- Source Code
4. Testing
Testing is the process of identifying errors and ensuring the software operates correctly. This phase involves evaluating individual components, integrated components, and the final product to confirm that the system meets its requirements.
Types of Tests
Unit Testing aims to verify that individual units of code, such as functions or methods,
operate as intended. It focuses on a single unit of code, involving the creation of test cases
for each function or method to ensure all decision branches and internal code flows yield
valid outputs. Unit testing is structural and invasive, requiring an understanding of the
code’s construction.
Integration Testing is conducted after unit testing and involves evaluating the interactions between integrated components. This type of testing focuses on event-driven scenarios to validate that the combined components produce the expected outcomes and is designed to identify issues that arise when components are combined.
Functional Testing validates that the software’s functions work according to specified
requirements. This type of testing focuses on the business and technical requirements,
system documentation, and user manuals. It includes checking for valid and invalid inputs,
ensuring all identified functions and outputs are exercised. Functional testing is a form of
black box testing, meaning it does not require knowledge of the internal code structure.
System Testing aims to ensure that the entire integrated system meets specified
requirements. This comprehensive testing covers the entire system and is configuration-
oriented, focusing on the system's process descriptions and flows. It validates the complete, integrated system against the stated requirements.
White Box Testing requires detailed knowledge of the software's internal code and involves testing specific code paths, loops, and logical decisions. This code-based testing ensures that internal operations perform as specified.
Black Box Testing tests the software's functionality without knowledge of its internal code. It supplies inputs and examines outputs without considering how the software processes the input. This requirement-based testing focuses on what the software should do rather than how it does it.
Unit Testing
Unit testing is conducted as part of the software lifecycle, focusing on verifying that each unit functions correctly before integration. Testing is performed manually with detailed functional tests. Key objectives include ensuring the proper functioning of all field entries, activation of pages from identified links, and timely responses of entry screens and messages.
Features to be Tested
Key features to be tested include verifying that entries are in the correct format, preventing
duplicate entries, and confirming that links navigate to the correct pages.
Integration Testing
Integration testing focuses on verifying that integrated components work together without
errors. It addresses issues arising from combining individual components and includes
methods like Top Down and Bottom-Up Integration. Top Down Integration begins with the
main module and integrates sub-modules incrementally. Bottom-Up Integration starts with the lowest-level modules and integrates upwards, ensuring all subordinate modules are combined and tested together.
Acceptance Testing
User Acceptance Testing (UAT) involves end users to ensure the system meets functional
requirements. It ensures that the system is user-friendly and meets user expectations.
Test Results
5. Deployment
Objective:
Deploy the flight delay analysis system into a production environment for real-world use.
Activities:
- Plan the deployment process, including server setup, configuration, and scalability
considerations.
- Deploy the system to the production environment while ensuring minimal downtime and
optimal performance.
- Provide user training sessions to stakeholders to familiarize them with the system's features and workflows.
- Create documentation including user manuals and deployment guides for reference.
Deliverables:
- Deployment Plan
- User Manuals
- Deployment Guide
6. Maintenance
Maintenance keeps the software operational and up-to-date after its initial deployment. It ensures that the software continues to meet user needs and adapts to changing requirements and environments. The main types of maintenance are:
1. Corrective Maintenance
• Purpose: To fix bugs and errors that are discovered in the software after it has been
deployed.
• Scope: Includes both minor and major fixes, such as correcting a misplaced decimal point in a calculation that produces incorrect results.
2. Adaptive Maintenance
• Purpose: To modify the software so it continues to work in a changed environment, such as a new version of an operating system.
3. Perfective Maintenance
• Purpose: To improve the software's performance or maintainability, including the addition of new functionalities.
• Example: Adding a new reporting feature or optimizing the code to run faster.
4. Preventive Maintenance
• Purpose: To make changes that reduce the likelihood of future problems.
5. Emergency Maintenance
• Purpose: To address urgent and unexpected problems that need immediate attention.
• Scope: Involves quick fixes and patches to resolve critical issues that can cause system failure or data loss.
Each type of maintenance plays a critical role in the software lifecycle, ensuring that the software remains functional, efficient, and relevant over time. Effective maintenance extends the useful life of the system.
Deliverables:
- Maintenance Logs
- Update/Enhancement Documentation
- Support Logs
Additional Considerations
- Data Security: Ensure data privacy and security measures are in place to protect sensitive
flight information.
- Regulatory Compliance: Ensure compliance with aviation regulations and data protection
laws.
- Model Interpretability: Consider methods to interpret and explain machine learning model
predictions to stakeholders.
- Scalability: Design the system to handle large volumes of flight data and accommodate
future growth.
- User Interface: Develop a user-friendly interface for stakeholders to interact with the system.
By following this SDLC for flight delay analysis using machine learning, developers can systematically plan, develop, test, deploy, and maintain a robust and efficient system that meets user needs.
Requirement Elicitation
Requirements are gathered from users, stakeholders, and other sources. Requirement elicitation is about understanding the needs and constraints of the users and stakeholders to ensure that the final product meets their expectations. Here are the primary tools and techniques used for requirement elicitation:
1. Interviews
Structured Interviews: These follow a predetermined set of questions. This approach ensures consistency in the information collected across different stakeholders and is useful for gathering specific details about the system requirements.
Unstructured Interviews: These are more conversational and do not follow a strict agenda. This allows stakeholders to express their needs and concerns more freely, which can reveal requirements that prepared questions would miss.
Semi-structured Interviews: Here the interviewer has a set of prepared questions but is free to explore topics in more depth, combining the strengths of both approaches effectively.
2. Surveys and Questionnaires
Online Surveys: These can be distributed via email or through survey platforms. They are efficient for reaching a broad audience and can provide quick responses. Online tools often include built-in features for collecting and summarizing results.
Paper Surveys: These are used in environments where digital access is limited or not preferred. They are particularly useful for reaching participants who may not be tech-savvy.
3. Workshops
Workshops are collaborative sessions where stakeholders come together to discuss and
define requirements. They are effective for generating ideas, solving problems, and
building consensus.
Brainstorming Sessions: These are structured activities where participants generate ideas
and solutions without immediate criticism or evaluation, fostering creativity and a broad
range of ideas.
Focus Groups: In focus groups, a small, diverse group of stakeholders discusses specific topics in depth. This method provides qualitative insights and helps in understanding different perspectives.
4. Observation
Observation involves watching users interact with the current system or perform tasks to understand their actual workflows and needs.
Participant Observation: The analyst actively engages in the user’s activities, gaining
firsthand experience and deeper insights into their tasks and environment.
Non-participant Observation: The analyst observes without engaging in the user’s
activities, minimizing the influence on the user’s natural behavior and providing an
unbiased view.
5. Document Analysis
Manuals: User manuals and guides offer detailed information about how the existing
system operates.
System Logs: Logs can reveal common issues, usage patterns, and areas that need
improvement.
Reports: Business and operational reports can highlight key metrics, performance issues, and areas for improvement.
6. Prototyping
Prototypes give stakeholders a tangible preview of the proposed system and help refine requirements early.
Low-fidelity Prototypes: Simple sketches or wireframes that illustrate basic concepts and
layout.
High-fidelity Prototypes: More detailed and interactive models that closely resemble the
final product, helping stakeholders understand the functionality and user interface.
7. Use Cases
Use cases describe how users will interact with the system to achieve specific goals, clarifying the functional requirements.
Use Case Descriptions: Detailed narratives that describe each use case, including the steps involved, the actors, and the expected outcomes.
8. Focus Groups
Focus groups involve guided discussions with a selected group of stakeholders to gather detailed feedback.
Homogeneous Groups: Participants with similar backgrounds and roles, providing consistent insights.
Heterogeneous Groups: Participants with varied backgrounds and roles, offering a range of perspectives.
9. Brainstorming
Brainstorming sessions generate a wide range of ideas and potential requirements from stakeholders.
Individual Brainstorming: Participants first write down their ideas privately before sharing them with the group, ensuring a wide range of ideas without initial influence from others.
Group Brainstorming: Ideas are generated and refined collectively in focused discussions.
10. User Stories
User Stories: Short, simple descriptions of a feature or requirement from the perspective of the user.
11. Competitor Analysis
Analyzing similar systems or products in the market to understand their features, strengths, and weaknesses.
Feature Comparison: Identifying the features offered by competitors and assessing their relevance to the proposed system.
Gap Analysis: Determining what is missing or could be improved in the current system
compared to competitors.
12. Mind Mapping
Mind mapping is a visual tool that helps organize and structure information, making it easier to see relationships between requirements.
Central Theme: The main idea or problem is placed at the center of the mind map.
Branches: Related requirements, features, and ideas radiate out from the central theme, forming a structured overview.
By using a combination of these tools and techniques, you can effectively gather and document comprehensive requirements, ensuring the final system meets the needs of its users and stakeholders.
Requirement Analysis
During requirement analysis, the collected requirements are reviewed, refined, and organized to ensure they are clear, complete, and feasible. The goal is to define and document what the system should do and how well it should do it.
1. Classification of Requirements
• Functional Requirements: Describe what the system must do. For example, "The system must allow users to log in using a username and password."
• Non-functional Requirements: Describe quality attributes such as performance, security, and usability criteria.
2. Prioritization of Requirements
Not all requirements are equally important. Prioritizing them helps in focusing on the most critical ones first.
• Kano Model: Categorizes requirements into Basic needs, Performance needs, and Excitement needs.
3. Feasibility Analysis
This involves evaluating whether the requirements can be realistically implemented within the available time, budget, and technology.
• Operational Feasibility: Ensuring the proposed system will function within the organization's existing processes and environment.
4. Conflict Resolution
Conflicts between requirements from different stakeholders are identified and resolved to arrive at an agreed solution. Negotiation and trade-off analysis help balance multiple criteria.
5. Requirement Modeling
• Use Case Diagrams: Illustrate how different users will interact with the system.
• Data Flow Diagrams (DFD): Show how data moves through the system.
• State Diagrams: Describe the states of the system and how it transitions from one
state to another.
6. Validation
• Reviews and Inspections: Peer reviews and inspections to check for completeness, consistency, and correctness.
7. Documentation
• User Stories: Short descriptions of features from the user’s perspective, often used
in agile methodologies.
• Simulations: Running scenarios to see how the system behaves under different conditions.
8. Stakeholder Communication
Regular communication keeps stakeholders informed and their expectations aligned:
• Review Meetings: Walkthroughs of the documented requirements.
• Progress Reports: Updates on the status of requirement analysis and any changes.
By systematically analyzing requirements, you ensure that the final system is well-defined,
feasible, and aligned with the stakeholders' needs and expectations. This thorough analysis
helps in minimizing misunderstandings, reducing the risk of project failures, and ensuring the project's success.
1. Introduction
1.1 Purpose
The purpose of this document is to outline the software requirements for the Flight Delay Analysis System (FDAS). The system aims to analyze flight delay data to identify patterns and predict delays.
1.2 Intended Audience
- Project Managers: To gain insight into the project scope and deliverables.
- Testers: To create test plans and test cases based on the requirements.
1.4 Project Scope
The FDAS will collect, process, and analyze flight delay data from various sources. It will provide insights through data visualizations and reports, helping airlines and airports reduce delays and improve operations.
1.5 References
2. Overall Description
The FDAS is a standalone system that integrates with airline and airport databases. It will utilize APIs to fetch real-time and historical flight data, process this data, and present the results through interactive dashboards and reporting tools. Its main user classes are:
- Airline Operations Staff: Require insights to improve scheduling and minimize delays.
- Data Analysts: Require access to raw data and analysis tools for in-depth study.
- Executives: Need high-level reports and dashboards for decision-making.
- Web-based application accessible via modern browsers (Chrome, Firefox, Safari, Edge)
3. System Features
- Interactive dashboards
- No specific hardware interfaces required; system runs on standard web and server
infrastructure.
This document provides a comprehensive overview of the requirements for the Flight Delay
Analysis System. It serves as a guide for the development and implementation phases,
ensuring all stakeholders have a clear understanding of the project's objectives and
constraints.
Hardware Requirements
The hardware requirements detail the specifications for the interfaces between the software and hardware:
- RAM: Minimum 4 GB
Software Requirements
The software requirements specify the necessary software products, including their
versions and the purpose of each interfacing software as it relates to the main software
product:
- Python IDE: Version 3.7 or higher (alternatively, Anaconda 3.7, Jupyter, or Google Colab)
- Libraries:
o Matplotlib
o NumPy
o Pandas
o Regex
o Requests
o Scikit-learn
o SciPy
Anaconda
Anaconda is a comprehensive, open-source data science package used by a
community of over 6 million users. It supports easy installation and is
compatible with Linux, macOS, and Windows. The distribution includes over 1,000 data packages, along with the Conda package and environment manager.
According to Anaconda's website, "The Python and R conda packages in the Anaconda Repository are curated and compiled in our secure environment so you get optimized binaries."
Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface included in the Anaconda distribution. It allows users to launch applications and manage conda packages, environments, and channels without
using command-line commands. Navigator can search for packages on Anaconda Cloud or
in a local Anaconda Repository and is available for Windows, macOS, and Linux.
o Jupyter Notebook
o Spyder
o PyCharm
o VSCode
o Glueviz
o Orange 3
o RStudio
o JupyterLab
- Jupyter Notebook: A web-based environment for creating documents that combine live code, equations, and visualizations.
- JupyterLab: A next-generation web interface built on Jupyter Notebook.
- Qt Console: A PyQt GUI supporting inline figures, multiline editing with syntax highlighting, and graphical calltips.
- Spyder: A powerful Python IDE for scientific development with features for advanced editing, interactive testing, and debugging.
- VS Code: A streamlined code editor supporting debugging, task running, and version
control.
- Glueviz: Used for multidimensional data visualization across files to explore relationships
within datasets.
- Orange 3: A data mining framework for data visualization and analysis with interactive
workflows.
Libraries
Matplotlib
Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of formats and environments. It supports Python scripts, IPython shells, Jupyter notebooks, web application servers, and more. Matplotlib aims to make simple tasks easy and complex tasks possible.
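As a minimal sketch of how Matplotlib might chart delay statistics in this project (the hourly values and filename are hypothetical, and the non-interactive Agg backend is used so the script runs headless):

```python
import os
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to file, no display needed
import matplotlib.pyplot as plt

# Hypothetical average departure delay (minutes) by hour of day.
hours = list(range(6, 12))
avg_delay = [4, 6, 9, 14, 11, 8]

plt.figure()
plt.bar(hours, avg_delay)
plt.xlabel("Hour of day")
plt.ylabel("Average delay (min)")
plt.title("Average departure delay by hour")
plt.savefig("delay_by_hour.png")
saved = os.path.exists("delay_by_hour.png")
```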
NumPy
NumPy is the fundamental package for scientific computing with Python. It provides a powerful N-dimensional array object, sophisticated functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. NumPy arrays can also serve as efficient multi-dimensional containers of generic data.
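A small sketch of the vectorized operations NumPy enables, using a hypothetical set of delay values:

```python
import numpy as np

# Delays (minutes) for a hypothetical day of flights.
delays = np.array([5, 0, 45, 12, 0, 30])

# Vectorized operations replace explicit Python loops.
mean_delay = delays.mean()
delayed_mask = delays > 15           # boolean array: which flights exceed 15 min
share_delayed = delayed_mask.mean()  # fraction of flights that were delayed
print(mean_delay, share_delayed)
```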
Pandas
Pandas is an open-source data analysis and manipulation library for Python, sponsored by NumFOCUS since 2015. It provides a fast and efficient DataFrame object for data manipulation with integrated indexing, tools for reading/writing data, intelligent data alignment, and handling of missing data.
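A minimal example of the kind of DataFrame aggregation used in delay analysis; the carrier codes and delay values are hypothetical:

```python
import pandas as pd

# Hypothetical flight records.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "UA"],
    "dep_delay": [5, 45, 0, 12, 30],
})

# Group-by aggregation: average departure delay per carrier.
per_carrier = flights.groupby("carrier")["dep_delay"].mean()
print(per_carrier.to_dict())
```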
Seaborn
Seaborn is a statistical data visualization library built on top of Matplotlib.
• Common Uses: Drawing attractive statistical graphics such as distribution plots, heatmaps, and categorical plots.
Scikit-learn
- A machine learning library for Python featuring various classification, regression, and clustering algorithms, built on NumPy, SciPy, and Matplotlib.
SciPy
- An open-source Python library for scientific and technical computing. It includes modules
for optimization, linear algebra, integration, interpolation, special functions, FFT, signal
processing, and more. It builds on the NumPy array object and is part of the NumPy stack.
matplotlib.pyplot
A collection of functions that provide a MATLAB-like plotting interface to Matplotlib.
Common Uses:
• Creating basic plots like line graphs, scatter plots, and histograms.
sklearn.model_selection
This module in scikit-learn provides tools for model selection and evaluation, including functions for splitting datasets into training and testing sets, cross-validation, and hyperparameter tuning.
Common Uses:
• Splitting data into training and testing sets with train_test_split.
• Cross-validating models with cross_val_score.
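The 70/30 split mentioned earlier in the report can be sketched with train_test_split; the feature matrix here is random placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (100 flights, 3 features) and binary delay labels.
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# 70/30 split as described in the report; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(len(X_train), len(X_test))
```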
catboost.CatBoostClassifier, catboost.Pool
CatBoost is a gradient boosting library with native support for categorical features. CatBoostClassifier implements classification, and Pool is CatBoost's container for features and labels.
Common Uses:
• Training gradient boosted tree classifiers on tabular flight data with categorical columns.
sklearn.metrics.confusion_matrix
This function from scikit-learn computes a confusion matrix to evaluate the performance
of a classification model.
Common Uses:
• Counting true positives, false positives, true negatives, and false negatives.
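A minimal confusion-matrix example with hypothetical true and predicted delay labels (1 = delayed, 0 = on time):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs predicted delay labels.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns predicted: [[TN, FP], [FN, TP]] for labels [0, 1].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```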
sklearn.preprocessing
The preprocessing module in scikit-learn provides functions for preprocessing data before model training, such as scaling and encoding.
Common Uses:
• Standardizing numeric features and encoding categorical labels.
sklearn.naive_bayes.GaussianNB
Gaussian Naive Bayes is a simple probabilistic classifier based on Bayes' theorem with the assumption that features follow a Gaussian (normal) distribution within each class.
Common Uses:
• Building fast baseline classifiers for continuous-valued features.
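A toy GaussianNB sketch; the two features (departure hour and a weather severity score) and all values are hypothetical, chosen only to make the classes clearly separable:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny hypothetical training set: [departure hour, weather severity score].
X = np.array([[6, 0], [7, 1], [18, 4], [19, 5]])
y = np.array([0, 0, 1, 1])  # 1 = delayed

model = GaussianNB().fit(X, y)
preds = model.predict([[8, 1], [20, 5]])
print(preds)
```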
sklearn.ensemble.RandomForestClassifier
A Random Forest is an ensemble method that builds multiple decision trees during training and combines their predictions to improve accuracy and reduce overfitting.
Common Uses:
• Classification on tabular data with estimates of feature importance.
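A minimal Random Forest sketch on the same kind of hypothetical features (hour, airline id, weather score); values are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: [departure hour, airline id, weather score].
X = np.array([[6, 0, 0], [7, 1, 1], [18, 0, 4], [19, 1, 5],
              [8, 0, 0], [20, 1, 5]])
y = np.array([0, 0, 1, 1, 0, 1])  # 1 = delayed

# An ensemble of 50 trees; random_state fixes the bootstrap sampling.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
preds = clf.predict([[7, 0, 0]])
train_acc = clf.score(X, y)
print(preds, train_acc)
```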
sklearn.neighbors.KNeighborsClassifier
K-Nearest Neighbors (KNN) classifies data points based on the majority class among their nearest neighbors in the feature space.
Common Uses:
• Simple, instance-based classification without an explicit training phase.
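A KNN sketch showing the majority-vote idea; the points (hour, weather score) are hypothetical:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical points: [departure hour, weather score] with delay labels.
X = np.array([[6, 0], [7, 1], [18, 4], [19, 5], [8, 0]])
y = np.array([0, 0, 1, 1, 0])

# Each query point is labeled by the majority class of its 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
preds = knn.predict([[7, 0], [18, 5]])
print(preds)
```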
sklearn.exceptions.DataConversionWarning
This warning is raised when scikit-learn implicitly converts input data, for example when a column vector is passed where a 1-D array was expected.
Common Uses:
• Ignoring warnings that may not be critical for the current workflow.
warnings
The warnings module in Python provides functions to control how warnings are displayed and handled.
Common Uses:
• Filtering or suppressing specific warning categories during analysis.
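A short sketch of scoped warning suppression. Using catch_warnings keeps the filter local, rather than silencing warnings for the whole program; record=True lets us verify that the warning was indeed filtered:

```python
import warnings

# Record warnings inside the block to show the filter in action.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", category=UserWarning)
    warnings.warn("non-critical notice", UserWarning)  # filtered out, not recorded

suppressed = len(caught) == 0
print(suppressed)
```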
These libraries and modules collectively provide a robust ecosystem for data analysis, visualization, machine learning, and warning handling in Python. Python itself is a high-level, interpreted language known for its readability, simplicity, and broad standard library, and it is used in various domains such as web development, scientific computing, data analysis, and machine learning.
Systems Design
Systems design is the process of defining the architecture, components, modules, interfaces, and data for a system to meet specified requirements. This process involves developing and designing systems that satisfy the specific needs of a business or organization. The goal of this design is to create a single-platform web application for multiple users, aiming to reduce errors and alleviate the stress experienced by individuals performing these tasks manually. The following sections describe the system architecture in detail.
UML is a collection of best engineering practices that have proven effective in modeling
complex and large systems. It is essential for developing object-oriented software and
enhances the software development process. By using UML, project teams can communicate ideas clearly, explore potential designs, and validate the architectural design.
UML provides a standardized set of diagram types to arrange complex data, processes, and
systems clearly and intuitively. Although UML is not a process or procedure, it serves as a
"dictionary" of symbols, each with a specific meaning. It supports object-oriented analysis,
design, and programming, ensuring a smooth transition from system requirements to final
implementation. UML diagrams illustrate both structure and behavior, providing clear visual models of the system.
UML diagrams play a significant role in project documentation. They can be used in various documents, such as requirements definitions, design documents, test plans, and user manuals.
- Use Case Diagrams: Describe functional requirements and interactions with external entities.
A use case diagram represents the system's functionality from an external perspective. It focuses on the behavior of the system by depicting the interactions between actors (external entities) and the system. Its key elements are:
- Use Cases: These describe a sequence of actions providing measurable value to an actor.
- Actors: These are entities (people, organizations, or external systems) that interact with the system.
- System Boundary Boxes: A rectangle surrounding the use cases, indicating the scope of
the system. Everything inside the box is within the system's scope, while anything outside
is not.
- Include: Indicates that one use case incorporates the behavior of another, represented by a dashed arrow labeled «include».
- Extend: Suggests that a use case can be extended by the behavior of another under certain conditions, represented by a dashed arrow labeled «extend».
- Associations: Represented by solid lines between actors and use cases, indicating their
interaction. Optional arrowheads can denote the direction of the initial interaction or
primary actor.
The "user model view" provides a perspective on the problem and solution from the
viewpoint of individuals whose problem the solution addresses. This view outlines the
goals and objectives of problem owners and their solution requirements, composed of use
case diagrams. These diagrams describe the functionality provided by a system to external users (actors).
Class Diagram
A class diagram models the static structure of a system in class-based object-oriented languages, where inheritance is achieved through the definition of classes of objects rather than the objects themselves. Class diagrams are used and developed within the realm of Object-Oriented Programming (OOP), wherein objects encapsulate state (data), behavior (procedures or methods), and identity (unique existence among all other objects). The structure and behavior of an object are determined by a class, which serves as a blueprint for all objects of a specific type. An object is instantiated explicitly based on a class, and once created, it is considered an instance of that class. Objects resemble structures, enhanced with method pointers, member access control, and an implicit data member that locates instances of the class within the class hierarchy, which is essential for supporting inheritance.
Sequence Diagram
A sequence diagram, within the Unified Modeling Language (UML), is a type of interaction diagram illustrating how processes interact with each other and the order in which these interactions occur. Sequence diagrams are also referred to as event diagrams, event scenarios, and timing diagrams.
These diagrams depict different processes or objects living concurrently as parallel vertical
lines (lifelines), with horizontal arrows indicating the messages exchanged between them
in chronological order. This graphical representation allows for the specification of runtime
scenarios in a visual manner. Lifelines, when representing objects, denote roles. Messages
are used to display interactions, with solid arrows indicating synchronous calls, solid
arrows with stick heads representing asynchronous calls, and dashed arrows with stick
heads signifying return messages. Activation boxes, or method-call boxes, are opaque rectangles drawn atop lifelines to represent the time during which a process is active. Objects invoking methods on themselves utilize messages and add new activation boxes atop existing ones to indicate a further level of processing. Object deletion, shown as an X drawn atop the lifeline, occurs when an object is removed from memory, typically resulting from a message either from the object itself or another. Messages originating from outside the diagram are depicted by a filled-in circle (a found message in UML) or by an arrow from the diagram's border (a gate in UML).
4. Collaboration Diagram
A collaboration diagram serves a similar purpose in illustrating the dynamic interaction of objects within a system. However, it emphasizes the structural organization of the objects apart from their interactions with each other. Unlike sequence diagrams,
collaboration diagrams depict object associations. These diagrams can be easily converted
into sequence diagrams and vice versa using sophisticated modeling tools. The elements
within a collaboration diagram are essentially the same as those found in a sequence
diagram.
Activity Diagram
An activity diagram is a graphical representation of workflows of stepwise activities and actions, supporting choice, iteration, and concurrency. These diagrams, within the Unified Modeling Language (UML), describe the business and operational step-by-step workflows of components in a system. Activities are represented by rounded rectangles, decisions by diamonds, and the start (split) and end (join) of concurrent activities by bars. An initial state is denoted by a black circle, while a final state is depicted as an encircled black circle. Arrows indicate the flow of control, with solid lines illustrating simple cases of activity sequences. The combination of join and split symbols with decisions allows more complex control flows to be expressed.
State Chart Diagram
Objects possess behaviors and states that depend on their current activity or condition. A state chart diagram, also known as a state machine diagram, illustrates the possible states an object can attain and the transitions that cause a change in state. Resembling a flowchart, an initial state is represented by a large black dot, subsequent states by boxes with rounded corners, and transitions between states by straight lines with arrows. Historical states are designated by circles containing the letter "H," while the final state is depicted as a circle enclosing a smaller solid black dot.
SYSTEMS DEVELOPMENT
Introduction
Flight delays can significantly inconvenience passengers, preventing them from fulfilling
their commitments and attending preplanned events. This disruption can lead to financial
losses, frustration, and anger. To address this issue, several predictive models have been
proposed to forecast flight delays accurately. Among these, we employ a machine learning
technique known as Lasso regression to predict aircraft delays. Lasso regression leverages
various independent parameters to train a model that classifies whether an aircraft will
experience a delay. Our implementation of this algorithm was carried out using Microsoft
To enhance the accuracy of our predictions and account for real-world conditions, we
integrated a weather dataset with our flight data. This integration was performed by
matching weather conditions to the respective airport locations. We trained our model using
70 percent of the dataset and evaluated its performance on the remaining 30 percent.
Impressively, the model predicted the correct outcome in more than 80 percent of the cases.
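The Lasso approach described above can be sketched as follows. This is a minimal illustration on synthetic data; the features, the 0.5 decision threshold, and the alpha value are assumptions, not values from the project. Lasso produces a continuous score, which is thresholded to obtain a delayed/on-time label:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flight data: two numeric features, 0/1 delay label
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500) > 0).astype(int)

# 70/30 split, matching the report
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Lasso fits a sparse linear model; thresholding its continuous output
# at 0.5 turns the regression score into a class label
model = Lasso(alpha=0.01).fit(X_train, y_train)
preds = (model.predict(X_test) >= 0.5).astype(int)
accuracy = (preds == y_test).mean()
print(f"Held-out accuracy: {accuracy:.2f}")
```

On real flight data the same pattern applies, with the label-encoded carrier and airport columns standing in for the synthetic features.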
Dataset Description
The sample data used in our study was sourced from the Department of Transportation; we utilized the "2015 Flight Delays and Cancellations" dataset available on Kaggle. This dataset contains records of delayed, canceled, and diverted flights, along with detailed flight schedules and operational times.
Key Features of the Dataset
• MONTH - Month
• DAY_OF_MONTH - Day of Month
• DAY_OF_WEEK - Day of Week
• OP_UNIQUE_CARRIER - Unique Carrier Code
• ORIGIN - Origin airport location
• DEST - Destination airport location
• DEP_TIME - Actual Departure Time (local time: hhmm)
• DEP_DEL15 - Departure Delay Indicator, 15 Minutes or More (1=Yes, 0=No)
[TARGET VARIABLE]
• DISTANCE - Distance between airports (miles)
Project Modules
Data Preprocessing
Data preprocessing is crucial for converting raw data into a clean, analyzable dataset. The data collected from various sources is often in raw format, which is unsuitable for analysis.
1. Handling Missing Values
Raw data frequently contains missing values that need to be addressed. A common approach is to replace missing values with the mean of the respective column. We used the Scikit-Learn library's preprocessing utilities for this purpose.
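Mean imputation of this kind can be sketched with scikit-learn's SimpleImputer; the column values below are illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A column with missing departure delays (NaN); mean imputation fills the gaps
col = np.array([[10.0], [np.nan], [30.0], [np.nan]])
imputer = SimpleImputer(strategy='mean')
filled = imputer.fit_transform(col)
print(filled.ravel())  # NaNs replaced by the column mean, 20.0
```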
2. Splitting the Dataset
The next step is to split the dataset into training and test sets. Typically, 80% of the dataset is used for training the model, while the remaining 20% is reserved for testing. This split ensures that the model can learn from one subset and be evaluated on another to gauge its accuracy.
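A split of this kind can be sketched with scikit-learn's train_test_split (synthetic data, using the 80/20 ratio mentioned above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y = np.arange(50)

# 80% for training, 20% held out for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 40 10
```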
3. Feature Scaling
Feature scaling standardizes the range of independent variables, making them comparable. Many machine learning models rely on Euclidean distance, and without feature scaling, variables with larger values can disproportionately influence the model. Scaling ensures that each feature contributes proportionately to the result.
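Standardization can be sketched with StandardScaler, which rescales each feature to zero mean and unit variance; the values below are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: distance in miles, departure hour
X = np.array([[2475.0,  9.0],
              [ 300.0, 17.0],
              [1100.0,  6.0]])
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0))  # each column now has (near-)zero mean
print(scaled.std(axis=0))   # and unit standard deviation
```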
4. Label Encoding
Datasets often contain categorical labels, which must be converted into numeric form for machine learning algorithms. Label encoding transforms these labels into a machine-readable format. While label encoding assigns a unique number to each class, it may inadvertently introduce priority issues if higher numeric values are interpreted as higher priority.
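Label encoding can be sketched with scikit-learn's LabelEncoder; the carrier codes are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

carriers = ['AA', 'DL', 'UA', 'AA', 'DL']
encoder = LabelEncoder()
encoded = encoder.fit_transform(carriers)  # classes are sorted, then numbered
print(list(encoded))           # [0, 1, 2, 0, 1]
print(list(encoder.classes_))  # ['AA', 'DL', 'UA']
```

The source code later in this report uses an equivalent hand-rolled `label_encoding` mapping for the carrier and airport columns.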
Feature Selection
Feature selection, also known as variable or attribute selection, involves identifying the most relevant attributes for predictive modeling. Unlike dimensionality reduction, which creates new combinations of attributes, feature selection retains existing attributes and simply discards the irrelevant ones.
Correlation Matrix
A correlation matrix shows the correlation coefficients between pairs of variables, making it easy to identify pairs with the highest correlations. This matrix helps in understanding the relationships between features and with the target variable.
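A correlation matrix can be computed directly with pandas; the columns below are synthetic stand-ins for the flight features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
distance = rng.uniform(100, 2500, size=200)
df = pd.DataFrame({
    'DISTANCE': distance,
    'AIR_TIME': distance / 8 + rng.normal(scale=5, size=200),  # tied to distance
    'DAY_OF_WEEK': rng.integers(1, 8, size=200),               # unrelated
})
corr = df.corr()
print(corr.round(2))
# DISTANCE and AIR_TIME correlate strongly; DAY_OF_WEEK does not
```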
Applying Algorithms
The processed dataset is divided into training and test sets, and regression algorithms such as Support Vector Regression and Lasso Regression are applied. These algorithms help model the relationship between the input features and flight delays.
Model Validation
Model validation ensures that the input data is suitable for model binding and provides useful error messages for invalid entries. This step filters nonsensical inputs and validates the data before it reaches the model.
R-squared
R-squared (R²) is a statistical measure that indicates the proportion of variance in the dependent variable explained by the independent variables. It assesses the goodness of fit of the regression model, telling us how well the data fits the model.
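R² can be computed with scikit-learn's r2_score; the values below are illustrative:

```python
from sklearn.metrics import r2_score

y_true = [10, 20, 30, 40]
y_pred = [12, 18, 31, 39]
# R^2 = 1 - SS_res / SS_tot = 1 - 10/500 = 0.98 here
print(r2_score(y_true, y_pred))
```

A value near 1.0 indicates the model explains most of the variance; 0 means it does no better than predicting the mean.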
Algorithms
CatBoostClassifier
CatBoost is a gradient boosting library particularly well-suited for handling categorical data. It was developed by Yandex and can
be used for classification and regression tasks. CatBoost's primary strength lies in its ability
to handle categorical features without the need for extensive preprocessing (e.g., one-hot
encoding).
Principle:
- Gradient Boosting: CatBoost is based on the gradient boosting framework, which builds
an ensemble of trees sequentially. Each tree tries to correct the errors of the previous ones.
- Categorical Feature Handling: CatBoost converts categorical features into numerical values internally using various encoding techniques. This reduces the preprocessing effort required.
Advantages:
- Automatic Handling of Categorical Data: Saves time and effort in data preprocessing.
- Robustness: Built-in regularization techniques, such as ordered boosting, help reduce overfitting.
Limitations:
- Training Time: Training can be slower compared to simpler models like linear regression or decision trees.
- Memory Usage: Can consume significant memory on large datasets.
GaussianNB
Gaussian Naive Bayes is a probabilistic classifier based on applying Bayes' theorem. It assumes that the features are normally distributed (Gaussian distribution). This model is simple, fast, and often serves as a strong baseline.
Principles:
- Bayes' Theorem: Uses Bayes' theorem to calculate the probability of each class given the
input features and assigns the class with the highest probability.
- Naive Assumption: Assumes that all features are independent of each other given the class label.
Advantages:
- Simplicity: Easy to implement and understand.
- Efficiency: Fast training and prediction times, making it suitable for real-time
applications.
- Works Well with Small Datasets: Particularly effective when the dataset is small and the independence assumption approximately holds.
Limitations:
- Not Suitable for All Distributions: Assumes a Gaussian distribution for features, which may not hold for real-world data.
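As a minimal sketch (on synthetic Gaussian clusters, not the project's flight data), GaussianNB learns a mean and variance per class and assigns new points to the most probable class:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two Gaussian clusters: class 0 centered at (0, 0), class 1 at (3, 3)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(3, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

gnb = GaussianNB().fit(X, y)
print(gnb.predict([[0, 0], [3, 3]]))  # [0 1]
```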
RandomForestClassifier
A random forest is an ensemble method that constructs multiple decision trees during training. It combines the predictions from these trees to improve accuracy and
control over-fitting. It is highly robust and versatile, suitable for a wide range of
classification tasks.
Principles:
- Ensemble Learning: Builds multiple decision trees (forest) and merges their predictions.
- Bagging: Uses bootstrap aggregating (bagging) to train each tree on a random subset of the training data.
- Random Feature Selection: Each tree is built using a random subset of features, which reduces correlation between the trees.
Advantages:
- High Accuracy: Often achieves higher accuracy compared to individual decision trees.
Limitations:
- Interpretability: Less interpretable than single decision trees due to the ensemble nature.
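A random forest also exposes feature importances, which complements the feature-selection step discussed earlier. A sketch on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)  # only the first feature determines the label

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(rf.feature_importances_)  # the first importance dominates
```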
KNeighborsClassifier
K-Nearest Neighbors (KNN) is a simple, instance-based algorithm used for classification and regression. In the classification context, it predicts the class of a data point by looking at the 'k' closest training examples in the feature space. It is simple and effective for smaller datasets.
Principles:
- Lazy Learning: KNN is a lazy learner; it stores all training instances and delays the learning process until a query is made.
- Distance Metrics: Commonly uses Euclidean distance to measure the closeness of data
points.
- Voting Mechanism: For classification, it uses a majority voting mechanism where the most common class among the k nearest neighbors is assigned to the query point.
Advantages:
- No Training Phase: Little computation happens at training time; work is deferred to the prediction phase.
- Flexibility: Can handle multi-class classification and can be used for regression as well.
Limitations:
- Computational Cost: Prediction can be slow on large datasets, particularly with many features.
- Memory Usage: Requires storing all training data, which can be memory intensive.
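The majority-vote mechanism can be sketched with KNeighborsClassifier on a few toy points:

```python
from sklearn.neighbors import KNeighborsClassifier

# Four labeled points: class 0 near the origin, class 1 near (5, 5)
X = [[0, 0], [1, 0], [5, 5], [6, 5]]
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# A query near the origin: its 3 nearest neighbors are (0,0), (1,0), (5,5),
# so the majority vote is class 0
print(knn.predict([[0.5, 0.5]]))  # [0]
```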
Each of these classifiers has its own strengths and is suited to different types of problems and datasets. Choosing the right model depends on the specific characteristics of your data and the goals of the analysis.
SOURCE CODE:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.exceptions import DataConversionWarning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from catboost import CatBoostClassifier

warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns', None)

# Load the "2015 Flight Delays and Cancellations" data (file path assumed)
data = pd.read_csv('flight_data.csv')
print(data.head())

Data preprocessing

# Drop the empty trailing column produced by the export
data = data.drop(['Unnamed: 9'], axis=1)
print(data['DEP_DEL15'].value_counts())

# Separate delayed (positive) and on-time (negative) rows
positive_rows = data['DEP_DEL15'] == 1
data_pos = data.loc[positive_rows]
data_neg = data.loc[~positive_rows]

# Remove rows with missing values and fix the target's type
data = data.dropna(axis=0)
print(data.info())
data['DEP_DEL15'] = data['DEP_DEL15'].astype(int)
print(data.shape)
data.describe()

# Distribution of flight distance
plt.figure(figsize=(15,5))
sns.histplot(data['DISTANCE'])
plt.xlabel("Distance")
plt.ylabel("Frequency")
plt.title("Distribution of distance")
plt.show()

print(f"Mean distance of delayed flights: {data[data['DEP_DEL15'] == 1]['DISTANCE'].values.mean()} miles")
print(f"Mean distance of on-time flights: {data[data['DEP_DEL15'] == 0]['DISTANCE'].values.mean()} miles")

# Flights per carrier
plt.figure(figsize=(15,5))
sns.countplot(x=data['OP_UNIQUE_CARRIER'], data=data)
plt.xlabel("Carriers")
plt.ylabel("Count")
plt.show()

# Flights per origin airport
plt.figure(figsize=(10,70))
sns.countplot(y=data['ORIGIN'], data=data)
plt.xlabel("Airport")
plt.ylabel("Count")
plt.show()

# Flights per destination airport
plt.figure(figsize=(10,70))
sns.countplot(y=data['DEST'], data=data)
plt.xlabel("Airport")
plt.ylabel("Count")
plt.show()

data = data.rename(columns={'DEP_DEL15':'TARGET'})

Encoding the categorical variable

def label_encoding(categories):
    # Map each distinct category to an integer index
    categories = list(set(list(categories.values)))
    mapping = {}
    for idx in range(len(categories)):
        mapping[categories[idx]] = idx
    return mapping

data['OP_UNIQUE_CARRIER'] = data['OP_UNIQUE_CARRIER'].map(
    label_encoding(data['OP_UNIQUE_CARRIER']))
data['ORIGIN'] = data['ORIGIN'].map(label_encoding(data['ORIGIN']))
data['DEST'] = data['DEST'].map(label_encoding(data['DEST']))
data.head()
data['TARGET'].value_counts()

# Features and target; hold out 30% for validation
X = data.drop(['MONTH','TARGET'], axis=1)
y = data[['TARGET']].values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=41)

# Calculating accuracy from the confusion matrix
def get_accuracy(y_true, y_preds):
    true_negative, false_positive, false_negative, true_positive = confusion_matrix(
        y_true, y_preds).ravel()
    accuracy = (true_negative + true_positive) / (true_negative + false_positive +
                                                  false_negative + true_positive)
    return accuracy

#__Logistic Regression__
lr = LogisticRegression(random_state=0).fit(X_train, y_train)

#__ CatBoostClassifier __
catboost = CatBoostClassifier(random_state=0)
catboost.fit(X_train, y_train, verbose=False)

#__ GaussianNB __
gnb = GaussianNB()
gnb.fit(X_train, y_train)

#__ RandomForestClassifier __
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

#__KNNClassifier__
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)

# Validation accuracy for each fitted model
acc = []
for model in [lr, catboost, gnb, rf, knn]:
    preds_val = model.predict(X_val)
    accuracy = get_accuracy(y_val, preds_val)
    acc.append(accuracy)

# Bar chart of accuracies, annotated with the value above each bar
ax = sns.barplot(x=['LR', 'CatBoost', 'GNB', 'RF', 'KNN'], y=acc)
for p in ax.patches:
    _x = p.get_x() + p.get_width() / 2
    ax.text(_x, p.get_height(), f"{p.get_height():.3f}", ha='center')
plt.xlabel("Models")
plt.ylabel("Accuracy")
plt.show()
CHAPTER 6
Raw Dataset
Data Format
Fig 6.4- data format
Data Preprocessing
Fig 6.5
Fig 6.6
Fig 6.7
Fig 6.8
Exploratory data analysis
Fig 6.9
Fig 6.10
Fig 6.11- count of carriers in the dataset
Fig 6.11- count of origin and destination airport
Fig 6.12- count of origin and destination airport
Modelling
Fig 6.13
Fig 6.15
Fig 6.16
In concluding our study on predicting flight delays, it's evident that while our models exhibit a level of usefulness, they fall short of achieving the precision and recall rates we had hoped for. This is largely due to the complexity of the factors influencing flight delays, many of which are beyond the scope of the data we've utilized. The inherent unpredictability of issues like mechanical failures and other operational disruptions limits how much historical data alone can predict.
Despite these challenges, it's noteworthy that our models have surpassed baseline performance. This is particularly commendable considering that we've often utilized less detailed
information and have sought to generalize our findings across a wider range of airports.
Although our models may not provide foolproof predictions, they still offer valuable
insights into the likelihood of flight delays. By identifying patterns and trends in
historical data, our models can effectively highlight which flights are more prone to
delays, thus enabling airlines and travelers to make more informed decisions.
Looking ahead, there are several avenues for enhancing the performance of our models. One is deeper feature-importance analysis: by discerning which features play the most influential role in predicting delays, we can refine our models and potentially uncover new variables that may have been
overlooked.
Furthermore, addressing issues such as data leakage, where inadvertent inclusion of
certain columns may skew results, is paramount. By implementing robust checks and
balances to detect and mitigate data leakage, we can ensure the integrity and accuracy
of our predictions.
In summary, while our models may not be perfect, they represent a significant step forward in the realm of flight delay prediction. By continuing to refine and improve upon our methodologies, we can strive towards more accurate and reliable predictions that benefit airlines and travelers alike.
Future Scope
In the future, we plan to make our flight delay analysis model better in several ways:
1. Use Real-Time Data: We want to include real-time data like current weather conditions, air traffic, and runway availability. This will help our model give more accurate and timely predictions.
2. Try Advanced Algorithms: We plan to explore more advanced machine learning techniques, such as gradient boosting and neural networks, to improve our model's accuracy.
3. Improve Features: We will focus on finding and using the most important factors
(features) that affect flight delays. This will help make our predictions more reliable.
4. Seasonal and Regional Variations: We will create models that account for different patterns in different seasons and regions. This will help us provide more precise predictions.
5. Understand Causes: By looking into why delays happen, not just when they happen, we can offer deeper insights into preventing them.
6. Make It Easy to Use: We will present the model's output in a way that is simple to understand for everyone. This might include clear explanations and visual aids.
7. Support for Decision-Making: Our model will be integrated into systems used by
airlines and airports to help them make better decisions and reduce delays.
8. Ethical and Fair: We will make sure our model is fair and does not have biases. We
will check and correct any issues to ensure it treats all flights equally.
9. Continuous Improvement: We will keep an eye on how our model performs and update it regularly as new data becomes available.
10. Backend and Frontend Development: We will also work on both the backend (the technical infrastructure) and the frontend (the user interface) of our model. This means creating a strong system for processing data and training models, as well as an easy-to-use interface for viewing predictions.
By working on these areas, we aim to make our flight delay analysis model more accurate, reliable, and useful in practice.