Abstract
Flight delays are a persistent problem in the aviation industry. They cost airlines and passengers billions of dollars every year and create serious difficulties for airports and air traffic controllers. The ability to predict whether a flight will be delayed - even approximately - has clear practical value for airlines, airports and passengers alike.
In this project, we built a machine learning
pipeline to tackle this prediction problem as a binary classification task:
will a given flight be On Time or Delayed? We worked with a publicly available
airline dataset that, after cleaning and preprocessing, contained 39,750
records and 21 features. One thing that became clear early on is that the raw
dataset alone was not enough - we needed to engineer additional features to
give the models something more meaningful to learn from.
We tested eight different classifiers: Logistic
Regression, Random Forest, XGBoost, LightGBM, Gradient Boosting, CatBoost, a
hyperparameter-tuned version of XGBoost and finally a Stacking ensemble that
combined the best individual models. The Stacking model came out on top with an
F1 Score of 0.6842 and an AUC-ROC of 0.7698. Perhaps the most interesting
finding was that a single engineered feature - the historical delay rate of the
departure airport - turned out to account for more than half of the model's
predictive power.
1. Introduction
1.1. Why this problem
matters
Anyone who has waited for hours at the airport due
to a delayed flight knows how bad the experience can be. But the problem goes
beyond mere personal inconvenience. Flight delays disrupt connecting flights,
cause passengers to miss important appointments and events and create
successive operational problems for airlines.
The economic scale of the problem is staggering. Studies estimate that flight delays cost the U.S. aviation system alone over 28 billion dollars per year when you account for airline costs, passenger time losses and indirect economic effects. Globally, the numbers are even larger.
Despite decades of investment in air traffic management technology, delays
remain stubbornly common - particularly during peak travel seasons and at
congested hub airports.
Predicting delays in advance - even imperfectly -
would allow airlines to reroute aircraft proactively, notify passengers before
they reach the airport and optimize crew scheduling. For passengers, a reliable
delay prediction tool could help with travel planning and reduce the stress of
uncertainty. This is what motivated us to take on this prediction problem as
our graduation project.
1.2. The prediction
problem
We treated flight delay prediction as a binary
classification problem. For each flight record in our dataset, the goal is to
predict one of two outcomes: On Time (class 0) or Delayed (class 1). The
challenge is that flight delays result from a complex mix of factors - weather
at the origin and destination, air traffic congestion, mechanical issues,
late-arriving aircraft, crew availability and so on. Many of these factors are
either not recorded in standard passenger datasets or are not known far enough
in advance to be useful for prediction.
Our dataset contains passenger demographic
information and airport details, but not real-time weather or traffic data.
This is a realistic constraint - in practice, a prediction system that relies
only on scheduled flight information and historical airport data would be far
more deployable than one requiring live weather feeds. So, we deliberately
worked within these constraints and tried to see how well we could do with what
was available.
1.3. What we did
Our work involved several distinct steps. We started
by cleaning and preprocessing the raw dataset. We then spent considerable time
on feature engineering, creating new variables that we believed would help the
models learn better patterns. After splitting the data and balancing the
classes using SMOTE, we trained and compared eight machine learning models. We
tuned the best-performing individual model using RandomizedSearchCV and then
built a Stacking ensemble as a final step. Throughout, we paid attention to multiple evaluation metrics rather than just accuracy, since accuracy alone can be misleading when the classes are imbalanced.
1.4. Paper organization
The remainder of this paper is divided as follows.
Section 2 reviews related work. Section 3 describes the dataset and
preprocessing steps. Section 4 explains our decisions regarding feature design.
Section 5 covers experiment setup, including models, class balancing and
evaluation metrics. Section 6 presents our results. Section 7 discusses what
these results mean and where the approach falls short. Section 8 concludes with
a summary and suggestions for future work.
2. Related Work
2.1. Early statistical
approaches
Research on flight delay prediction has a fairly
long history, going back to at least the early 2000s when large-scale flight
records became publicly available. Early work tended to rely on statistical
regression models and simple rule-based systems. These studies were useful for
identifying which factors - weather, congestion, time of day, season - were
most strongly associated with delays, but they struggled to capture the complex
interactions between variables.
One consistent finding from this early literature,
which our own results reinforce, is the importance of airport-level factors.
Some airports are simply more delay-prone than others, due to their geographic
location, runway capacity, traffic volume and local weather patterns. This
motivated our decision to include historical airport delay rates as engineered
features.
2.2. Machine learning
takes over
Starting around 2015, machine learning methods began
to dominate the flight delay prediction literature. Random Forest and Gradient
Boosting classifiers became popular choices because they handle non-linear
relationships well and are relatively robust to irrelevant features. Studies by Choi et al. [1] on U.S. domestic flights reported AUC-ROC scores in the 0.72-0.78 range using these methods - numbers very close to what we achieved, which gives us some confidence that our results are reasonable.
More recent work has explored deep learning
approaches. Ye et al. [2] used LSTM
networks to capture temporal dependencies between consecutive flights on the
same aircraft. The idea is that a delayed inbound flight will almost certainly
cause a delayed outbound flight, so modelling these sequences adds useful
information. Our current approach does not model these sequences - each flight
is treated independently - which is one of its main limitations.
2.3. The role of feature engineering
Several papers have highlighted how much feature engineering can improve results on aviation datasets. Kim and Hansen [3] showed that adding network centrality measures and route-level delay statistics led to substantial performance gains over models trained on raw flight data alone. This is consistent with our experience: the single biggest improvement in our pipeline came not from choosing a better algorithm, but from adding AirportDelayRate as a feature.
The general lesson seems to be that domain knowledge matters a lot in this problem. The best features are not necessarily the ones that sound most sophisticated - they are the ones that capture what actually causes delays, even if only indirectly.
2.4. Ensemble methods
Ensemble methods have consistently outperformed individual classifiers across a wide range of prediction tasks, and flight delay prediction is no exception. The theoretical reason is well understood: by combining models that make different kinds of errors, you can reduce both variance and bias compared to any single model. Wolpert's Stacking framework is particularly appealing because it lets the meta-learner figure out which base models to trust on which types of inputs.
We were encouraged by findings from Kim and Hansen [3] and others showing that Stacking ensembles consistently beat individual gradient boosting models on aviation data. Our results confirm this pattern - the Stacking model achieved the highest F1 Score and AUC-ROC in our experiments, beating even the tuned XGBoost model.
3. Dataset and Preprocessing
3.1. The dataset
We used the Airline Dataset (Updated v2), which is publicly available on Kaggle. It was originally collected to support research on airline passenger behaviour and flight operations. Each row in the dataset represents one flight record associated with a specific passenger and includes a mix of demographic information about the passenger and operational details about the flight. Table 1 summarizes the key numbers.
Table 1: Dataset Overview.

| Attribute | Details |
|---|---|
| Dataset Name | Airline Dataset Updated - v2 |
| Total Records (Raw) | 98,619 |
| Records After Filtering | 65,677 |
| Records After Cleaning | 39,750 |
| Number of Features (Final) | 21 |
| Target Variable | Flight Status (0 = On Time, 1 = Delayed) |
| Source | Kaggle - flight-delay-prediction-abdulmohsen |

The raw dataset has three possible values for the target variable Flight Status: On Time, Delayed and Cancelled. We excluded the Cancelled class from our analysis. Cancellations are fundamentally different from delays - they typically result from severe weather events, security incidents or major mechanical failures - and mixing them with delays would confuse the classifier. After removing cancelled flights, we were left with 65,677 records to work with.
3.2. Preprocessing steps
3.2.1. Target encoding:
The Flight Status variable was converted to a binary
integer: 0 for On Time and 1 for Delayed. This is required for compatibility
with the scikit-learn API and enables probability-based metrics like AUC-ROC.
3.2.2. Dropping useless
columns: We removed three columns immediately: Passenger ID,
First Name and Last Name. These are personally identifiable fields that clearly
cannot generalize to new passengers - a model that memorizes passenger names
would be useless in practice.
3.2.3. Parsing the
departure date: The Departure Date column was
stored as a string. We parsed it into a proper datetime format and extracted
three new columns: Month, Day and DayOfWeek. The original date column was then
dropped since the raw date string is not useful to a machine learning model in
its original form. Some rows had dates that could not be parsed correctly and
were dropped along with any other rows containing missing values. After this
step, the dataset contained 39,750 records.
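The date-parsing step can be sketched as follows; the toy rows and the exact parsing options are our assumptions, but the column names follow the dataset:

```python
import pandas as pd

# Toy rows standing in for the raw Departure Date strings
df = pd.DataFrame({"Departure Date": ["2022-07-15", "2022-12-03", "not a date"]})

# errors="coerce" turns unparseable strings into NaT so they can be dropped
parsed = pd.to_datetime(df["Departure Date"], errors="coerce")
df["Month"] = parsed.dt.month
df["Day"] = parsed.dt.day
df["DayOfWeek"] = parsed.dt.dayofweek  # 0 = Monday ... 6 = Sunday

# Drop the raw string column, then drop rows that failed to parse
df = df.drop(columns=["Departure Date"]).dropna()
print(df)
```

In our pipeline the same dropna step also removed rows with missing values in other columns, which is what reduced the dataset to 39,750 records.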
3.2.4. Encoding
categorical variables: Nine categorical
columns were encoded using Label Encoding: Gender, Nationality, Airport Name,
Country Name, Airport Country Code, Airport Continent, Continents, Arrival
Airport and Pilot Name. We chose Label Encoding rather than One-Hot Encoding
because several of these columns have very high cardinality - for example,
Pilot Name has hundreds of unique values. One-Hot Encoding would have produced
an unmanageably wide feature matrix. Tree-based models, which make up the bulk
of our model comparison, handle label-encoded categoricals well since they can find the relevant split points regardless of the numeric values assigned.
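A minimal sketch of this encoding step (the column values below are made up; in the real pipeline the loop runs over all nine categorical columns):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Airport Name": ["JFK", "LAX", "JFK"],
})

encoders = {}
for col in ["Gender", "Airport Name"]:
    le = LabelEncoder()
    # LabelEncoder assigns integers to the sorted unique values of the column
    df[col] = le.fit_transform(df[col])
    encoders[col] = le  # kept so new data can be encoded consistently

print(df)
```

Keeping the fitted encoders around matters in practice: a deployed model must map a new flight's categorical values with the same integer assignments used during training.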
4. Feature Engineering
4.1. Why we needed new
features
After initial preprocessing, the dataset had 14
features. When we ran a quick baseline XGBoost model on these features, the
results were underwhelming. Digging into the data, it became clear that most of
the raw features - passenger age, gender, nationality - do not have a strong
direct relationship with flight delays. Delays are fundamentally an operational
and geographic phenomenon, not a passenger one.
This led us to think about what information a domain
expert would use to assess delay risk. The answer seemed obvious: the
historical delay behaviour of the airport. An airport that has been running
on-time 95% of the time in the past is much more likely to run on time tomorrow
than one that has been delaying 60% of its flights. This reasoning motivated the
creation of our delay rate features.
4.2. The engineered
features
We created seven new features, in addition to the three date-derived columns (Month, Day and DayOfWeek) produced during preprocessing. Table 2 describes all ten along with our reasoning for including them.
Table 2: Engineered Features.

| Feature | Description | Why We Added It |
|---|---|---|
| Month | Month extracted from departure date | Flight delays change a lot by season |
| Day | Day of the month | Some days are busier than others |
| DayOfWeek | 0=Monday to 6=Sunday | Weekends have very different traffic patterns |
| Season | Winter/Spring/Summer/Fall grouping | Smoother seasonal signal than raw month |
| IsWeekend | 1 if Friday or Saturday, else 0 | Peak travel days tend to have more delays |
| IsInternational | 1 if nationality differs from airport country | International flights behave differently |
| AirportDelayRate | Historical average delay rate per airport | By far the strongest predictor we found |
| CountryDelayRate | Historical average delay rate per country | Captures country-level delay patterns |
| ContinentDelayRate | Historical average delay rate per continent | Useful when country data is sparse |
| MonthDelayRate | Historical average delay rate per month | Captures monthly seasonality in delays |
4.3. The most important
feature: AirportDelayRate
AirportDelayRate deserves special attention because it turned out to be more important than any other feature - indeed, more important than the remaining 20 features combined. It is computed simply by
grouping the dataset by Airport Name and taking the mean of the binary Flight
Status variable. The result is a number between 0 and 1 representing the
fraction of flights from that airport that were delayed in our dataset.
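The computation described above amounts to a grouped mean broadcast back onto every row; a small sketch with made-up data:

```python
import pandas as pd

# Toy data: 1 = Delayed, 0 = On Time; the real pipeline groups the
# full dataset by "Airport Name" in exactly the same way.
df = pd.DataFrame({
    "Airport Name": ["A", "A", "A", "B", "B"],
    "Flight Status": [1, 0, 1, 0, 0],
})

# Mean of the binary target per airport = fraction of delayed flights,
# aligned back to every row with transform
df["AirportDelayRate"] = (
    df.groupby("Airport Name")["Flight Status"].transform("mean")
)
print(df)
```

Airport A gets a rate of 2/3 on every row and airport B gets 0, which is exactly the "number between 0 and 1" described above.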
When we ran feature importance analysis on the tuned
XGBoost model, AirportDelayRate had an importance score of approximately 0.52.
In other words, this single feature accounts for more than half of what the
model is actually using to make its predictions. The next most important
feature, CountryDelayRate, had an importance of only about 0.08.
One important caveat: because we compute
AirportDelayRate from the same dataset we use for training and testing, there
is a potential data leakage issue. In a real production system, you would
compute these rates from historical data that strictly predates the flights you
are trying to predict. We acknowledge this limitation but note that it is
common in research settings where longitudinal data is not available.
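A leakage-safe variant is possible when departure dates are available: for each flight, average only strictly earlier flights at the same airport. This is a hypothetical sketch, not what our pipeline did, and it assumes rows are sorted chronologically:

```python
import pandas as pd

df = pd.DataFrame({
    "Airport Name": ["A", "A", "A", "B", "B"],
    "Flight Status": [1, 0, 1, 0, 1],
})  # rows assumed sorted by departure date

grp = df.groupby("Airport Name")["Flight Status"]
# shift(1) excludes the current flight; expanding().mean() averages the past
df["HistDelayRate"] = grp.transform(lambda s: s.shift(1).expanding().mean())
print(df)  # the first flight at each airport has no history, hence NaN
```

The NaN rows would need a fallback in practice (e.g. the global delay rate), but the key property holds: no row's feature depends on its own label or any later flight.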
4.4. Temporal features
The Season feature groups months into four seasons
and provides a smoother signal than raw Month values. IsWeekend flags Fridays
and Saturdays as peak travel days - from our own experience as travellers,
these days tend to have more congestion and therefore more delays. DayOfWeek
gives the model access to the full weekly pattern, not just the weekend/weekday
distinction.
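These two features can be sketched as follows; the month-to-season mapping is our assumed reading of the Winter/Spring/Summer/Fall grouping in Table 2:

```python
import pandas as pd

df = pd.DataFrame({"Month": [1, 4, 7, 10], "DayOfWeek": [0, 4, 5, 6]})

# Assumed calendar-season grouping of months
season_map = {12: "Winter", 1: "Winter", 2: "Winter",
              3: "Spring", 4: "Spring", 5: "Spring",
              6: "Summer", 7: "Summer", 8: "Summer",
              9: "Fall", 10: "Fall", 11: "Fall"}
df["Season"] = df["Month"].map(season_map)

# Friday (4) and Saturday (5) flagged as peak travel days, per the text
df["IsWeekend"] = df["DayOfWeek"].isin([4, 5]).astype(int)
print(df)
```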
4.5. IsInternational
This binary feature is a rough proxy for whether a
flight is international or domestic, based on whether the passenger's
nationality matches the country of the departure airport. It is an imperfect
measure - a Saudi passenger flying within Saudi Arabia would be correctly
flagged as domestic, but a foreign resident flying domestically would be
incorrectly flagged as international. Despite this imprecision, the feature
contributes some useful signal, showing up as the fifth most important feature
in our XGBoost model.
5. Methodology
5.1. Train/Test split
We divided the dataset into training and test sets at a 70/30 ratio using stratified random sampling. Stratification ensured that both splits contained the same proportion of delayed and on-time flights. The test set was kept aside until the end and was only used to report the final results. We were careful not to leak any information from the test set into the training process.
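The split can be sketched with scikit-learn; the synthetic X and y below are placeholders for the real feature matrix and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))          # stand-in feature matrix
y = np.array([0] * 600 + [1] * 400)     # stand-in target, 60/40 mix

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
# stratify=y preserves the class ratio in both splits
print(y_tr.mean(), y_te.mean())  # both 0.40
```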
5.2. Handling class
imbalance with SMOTE
Even though the overall class distribution is
roughly balanced after filtering, we applied SMOTE to the training set as a
precaution. SMOTE creates synthetic minority class samples by interpolating
between existing real samples in feature space. The key rule is that SMOTE must
only be applied to the training set - applying it to the test set would give an
unrealistically optimistic picture of performance on real data.
To check whether SMOTE actually helped, we ran
XGBoost with and without it. Without SMOTE: F1=0.6102. With SMOTE: F1=0.6584.
That is a meaningful improvement of about 7.9%, so we kept SMOTE in the
pipeline.
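SMOTE's core mechanism is interpolation between a minority sample and one of its nearest neighbours. The sketch below illustrates just that idea in plain numpy; the actual pipeline used the full SMOTE implementation from the imbalanced-learn library:

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 3))  # toy minority-class samples

def smote_like_sample(X, rng):
    """Generate one synthetic point on the segment between a random
    sample and its nearest neighbour (a simplified SMOTE step)."""
    i = rng.integers(len(X))                 # pick a real sample
    dists = np.linalg.norm(X - X[i], axis=1)
    j = np.argsort(dists)[1]                 # nearest other sample
    lam = rng.random()                       # random point on the segment
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_sample(minority, rng) for _ in range(10)])
print(synthetic.shape)  # (10, 3)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies, which is why SMOTE tends to be safer than naive random oversampling with duplicates.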
5.3. Models we compared
We trained eight models in total. Table 3 gives a quick overview of each one.
Table 3: Models Used in This Study.

| Model | Type | Notes |
|---|---|---|
| Logistic Regression | Linear | Used as a simple baseline |
| Random Forest | Bagging Ensemble | 200 trees, good general performance |
| XGBoost | Boosting | Default settings first, then tuned |
| LightGBM | Boosting | Faster than XGBoost on large data |
| Gradient Boosting | Boosting | Sklearn implementation |
| CatBoost | Boosting | Handles categoricals well |
| XGBoost (Tuned) | Boosting | After RandomizedSearchCV optimization |
| Stacking Ensemble | Meta-learning | Combines GB + LGB + RF + XGB (Tuned) |
Logistic Regression was included as a linear
baseline - if it performs competitively, that suggests the problem is mostly
linear and simpler models might suffice. Random Forest provides a strong non-linear
baseline through bagging. The four gradient boosting variants (XGBoost,
LightGBM, Gradient Boosting, CatBoost) represent the current state of the art
for tabular classification problems. Finally, the Stacking ensemble combines
the predictions of the four strongest individual models using Logistic
Regression as a meta-learner.
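The stacking setup can be sketched with scikit-learn's StackingClassifier. Two sklearn ensembles stand in for the four base learners named in Table 3 (the exact configurations are not reproduced here), and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner, as in the text
    cv=3,  # base models produce out-of-fold predictions for the meta-learner
)
stack.fit(X, y)
print(stack.predict_proba(X[:2]).shape)  # (2, 2)
```

The cv argument is the important detail: the meta-learner is trained on out-of-fold base-model predictions, which prevents it from simply learning to trust whichever base model overfits the training data most.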
5.4. Hyperparameter tuning
We selected XGBoost for hyperparameter tuning
because it is one of the most widely used and well-understood gradient boosting
implementations, with a rich set of hyperparameters to explore. We used
RandomizedSearchCV with 30 random configurations and 3-fold cross-validation,
optimizing for F1 Score. The search covered seven hyperparameters:
n_estimators, learning_rate, max_depth, subsample, colsample_bytree,
min_child_weight and gamma.
The best configuration found was: n_estimators=200,
learning_rate=0.05, max_depth=4, subsample=0.7, colsample_bytree=0.7,
min_child_weight=5, gamma=0.1. Interestingly, the optimal depth (4) was
shallower than our default setting (8), suggesting that the default model was
overfitting. The higher regularization parameters (min_child_weight=5,
gamma=0.1) also point in this direction.
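The shape of the search can be sketched as follows. To keep the example self-contained, sklearn's GradientBoostingClassifier stands in for XGBoost (it shares the n_estimators, learning_rate, max_depth and subsample parameters), the data is synthetic, and the search is shrunk to n_iter=5:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Distributions to sample from, mirroring part of the search in the text
param_distributions = {
    "n_estimators": randint(50, 200),
    "learning_rate": uniform(0.01, 0.19),   # samples from [0.01, 0.20)
    "max_depth": randint(2, 8),
    "subsample": uniform(0.6, 0.4),         # samples from [0.6, 1.0)
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,            # 30 in the actual experiment
    cv=3,
    scoring="f1",        # optimize for F1, as in the text
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The real run used the same pattern with XGBoost's additional parameters (colsample_bytree, min_child_weight, gamma) included in the distributions.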
5.5. Evaluation metrics
We evaluated all models using eight metrics:
Accuracy, Balanced Accuracy, Precision, Recall, F1 Score, AUC-ROC, Matthews
Correlation Coefficient (MCC) and Log Loss. We used F1 Score as our primary
ranking metric because it balances Precision and Recall - in a delay prediction
context, both false positives (predicting delay when the flight is actually on
time) and false negatives (predicting on time when the flight is actually
delayed) have real costs. AUC-ROC served as our secondary metric because it
measures discrimination ability across all classification thresholds, not just
the default 0.5 cutoff.
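All of these metrics are available in scikit-learn; a toy illustration with made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import (f1_score, log_loss, matthews_corrcoef,
                             roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0])           # made-up ground truth
y_prob = np.array([0.2, 0.6, 0.7, 0.4, 0.9, 0.1])  # model's P(Delayed)
y_pred = (y_prob >= 0.5).astype(int)            # default 0.5 cutoff

print("F1:", f1_score(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_prob))   # threshold-free
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("Log Loss:", log_loss(y_true, y_prob))
```

Note that F1 and MCC score the hard predictions while AUC-ROC and Log Loss score the probabilities, which is why the latter two are computed from y_prob rather than y_pred.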
6. Results
6.1. Overall Model
Comparison
Table 4 shows the performance of all eight models on the test set, ranked by F1 Score. The results paint a clear picture.
Table 4: Model Performance on Test Set (ranked by F1 Score).

| Model | Accuracy | F1 Score | AUC-ROC | MCC | Log Loss |
|---|---|---|---|---|---|
| Stacking (Best) ★ | 0.6846 | 0.6842 | 0.7698 | 0.3692 | 0.6021 |
| Gradient Boosting | 0.6787 | 0.6800 | 0.7574 | 0.3575 | 0.5661 |
| XGBoost (Tuned) | 0.6793 | 0.6787 | 0.7635 | 0.3587 | 0.6004 |
| CatBoost | 0.6752 | 0.6768 | 0.7526 | 0.3505 | 0.5892 |
| LightGBM | 0.6741 | 0.6755 | 0.7523 | 0.3483 | 0.5719 |
| Random Forest | 0.6660 | 0.6657 | 0.7369 | 0.3320 | 0.5901 |
| XGBoost | 0.6573 | 0.6584 | 0.7309 | 0.3146 | 0.6004 |
| Logistic Regression | 0.5754 | 0.5726 | 0.6230 | 0.1509 | 0.6673 |
The Stacking ensemble came out on top with F1=0.6842
and AUC-ROC=0.7698. Gradient Boosting was a close second (F1=0.6800) and the
tuned XGBoost ranked third (F1=0.6787). Logistic Regression was by far the weakest model, with F1=0.5726 and AUC-ROC=0.6230 - only modestly better than random guessing on the AUC metric.
One result that surprised us a bit was that CatBoost
(F1=0.6768) performed similarly to LightGBM (F1=0.6755) despite CatBoost's
reputation for handling categorical features well. We think this is because we
label-encoded all categoricals before training, which removes the advantage
CatBoost would normally have from handling raw categoricals internally.
6.2. Feature importance
Table 5 shows the top 10 features by importance in the tuned XGBoost model. The dominance of AirportDelayRate is striking.
Table 5: Top 10 Feature Importances (Tuned XGBoost).

| Rank | Feature | Importance | Type |
|---|---|---|---|
| 1 | AirportDelayRate | ~0.52 | Engineered |
| 2 | CountryDelayRate | ~0.08 | Engineered |
| 3 | Month | ~0.06 | Temporal |
| 4 | Nationality | ~0.05 | Demographic |
| 5 | IsInternational | ~0.04 | Engineered |
| 6 | Pilot Name | ~0.04 | Operational |
| 7 | Arrival Airport | ~0.03 | Operational |
| 8 | Airport Name | ~0.03 | Operational |
| 9 | MonthDelayRate | ~0.03 | Engineered |
| 10 | Day | ~0.02 | Temporal |
AirportDelayRate alone accounts for approximately 52% of the total importance; all the remaining features combined account for the other 48%. This tells us something important: for this dataset, the
strongest signal for whether a flight will be delayed is not anything about the
passenger - it is about which airport the flight is departing from and that
airport's historical delay behaviour.
The low importance of demographic features like Age
(~0.02) and Gender (~0.01) is consistent with our intuition. There is no
particular reason why a passenger's age or gender should influence whether
their flight is delayed - delays are an operational phenomenon driven by
airport capacity, weather and network congestion.
6.3. Effect of
hyperparameter tuning
The tuning process improved XGBoost's F1 Score from
0.6584 (default) to 0.6787, a gain of about 3.1%. The AUC-ROC improved from
0.7309 to 0.7635. These are meaningful improvements, though not transformative.
The direction of improvement - toward shallower trees with more regularization -
suggests that the default XGBoost configuration was overfitting to the training
data.
6.4. The stacking
ensemble
The Stacking ensemble, which combines Gradient
Boosting, LightGBM, Random Forest and tuned XGBoost as base learners with
Logistic Regression as the meta-learner, achieved the best results overall. The
improvement over the best individual model (Gradient Boosting) is modest in
absolute terms - about 0.004 F1 points and 0.012 AUC points - but consistent
across all metrics. This is typical for stacking: the gains are usually
incremental rather than dramatic, but they are reliable.
7. Discussion
7.1. Why the results
are as they are
Our best model achieved an F1 Score of 0.6842 and an AUC-ROC of 0.7698. These numbers are good but not excellent. To put them in context: an AUC of 0.77 means that our model ranks a randomly selected delayed flight higher than a randomly selected on-time flight about 77% of the time - much better than random (0.50) but far from perfect (1.00).
We believe the main reason for the performance ceiling is the absence of weather data. Weather is widely regarded as a primary cause of flight delays, and our dataset does not include it. No amount of model complexity can make up for the absence of the most important predictors; studies that include weather features usually report AUC scores of 0.85 or higher. A secondary reason is that our model treats each flight individually. In reality, delays propagate through the network - if a plane is delayed at one airport, it delays all subsequent flights on that plane's schedule.
7.2. The airport effect
One of the most interesting findings of this study
is just how dominant the airport-level delay rate is as a predictor. When we
first saw that AirportDelayRate accounted for 52% of XGBoost's feature
importance, we double-checked our code to make sure we had not made an error.
The result held up.
This finding has a practical implication: if you
want a simple, interpretable rule for predicting flight delays, just look at
the historical delay rate of the departure airport. A flight departing from a
high-delay airport is much more likely to be delayed than one departing from a
low-delay airport, regardless of anything else. Of course, this is not the full
story - the remaining 48% of importance comes from other features - but the
airport effect is by far the strongest single signal in the data.
7.3. Ensemble vs. individual
models
The consistent advantage of the Stacking ensemble
over individual models is worth reflecting on. The base learners (Gradient
Boosting, LightGBM, Random Forest, XGBoost) each make somewhat different
errors. Some may be better at handling certain airport types, certain passenger
demographics or certain seasonal patterns. The meta-learner - the Logistic
Regression trained on the base model predictions - learns to weight these
models optimally, producing a combined prediction that is more reliable than
any individual one.
That said, the gains from stacking in this case are
modest. If we were deploying this in a real system where inference speed
matters, we might opt for the tuned XGBoost alone rather than the full Stacking
ensemble, since the performance difference is small and the ensemble is more
complex to maintain.
7.4. Limitations
We want to be honest about the limitations of this work:
· No weather data: This is the biggest gap. Weather is the primary cause of delays and its absence puts a ceiling on achievable performance.
· Data leakage risk in delay rates: Computing AirportDelayRate from the same dataset used for training and testing is not ideal. In production, historical rates should be computed from data that predates the prediction period.
· Static model: We trained one model on the full dataset. A deployed system would need to be retrained regularly as delay patterns change over time.
· No sequence modelling: Each flight is treated independently, ignoring the cascading nature of real-world delays.
· Single dataset: Results may not generalize to other airlines, time periods or geographic regions.
8. Conclusion
This project set out to build a machine learning pipeline for predicting flight delays as a binary classification problem. Working with a cleaned dataset of 39,750 flight records, we compared eight classification models and found that a Stacking ensemble combining Gradient Boosting, LightGBM, Random Forest and tuned XGBoost achieved the best results, with an F1 Score of 0.68 and an AUC-ROC of 0.77.
Several things became clear to us through this project. First, feature engineering mattered more than model selection: the single most impactful step in our entire pipeline was adding AirportDelayRate - a simple historical average computed from the data. No algorithmic choice came close to matching the effect of that one feature. Second, ensemble methods consistently beat individual classifiers, confirming what the literature has been saying for years. Third, hyperparameter tuning produced meaningful but not dramatic gains, suggesting that the default configurations of modern gradient boosting libraries are already quite good.
Looking ahead, adding weather data at the origin and
destination airports would likely be the single highest-impact improvement.
Modelling flight sequences rather than treating each flight independently would
capture delay propagation effects. Using SHAP values for interpretability would make it easier to explain individual predictions to end users. And deploying the model as a real-time API - one that updates its
airport delay rates from a live feed - would turn this from a research exercise
into a practical tool.
Despite its limitations, we believe this project
demonstrates that useful delay prediction is achievable even with relatively
simple, readily available data - as long as you invest in the right features.
9. References
1. Choi S, Kim YJ, Kim S. Flight delay prediction using machine learning algorithms. Journal of Air Transport Management. 2021;90:101837.
3. Kim J, Hansen M. Ensemble methods for aviation delay prediction. Transportation Research Record. 2022;2676:54-65.
7. Gui G, Liu F, Sun J, et al. Flight delay prediction based on aviation big data and machine learning. IEEE Transactions on Vehicular Technology. 2020;69:140-150.
Wolpert DH. Stacked generalization. Neural Networks. 1992;5:241-259.