Abstract
This study develops and evaluates
machine-learning models to classify road-traffic injury severity among young
drivers using Israeli Central Bureau of Statistics data (2009-2019; N=37,499).
After extensive preprocessing, feature selection and hyperparameter tuning, an
Extra Trees Classifier achieved the best performance on a held-out test set:
accuracy = 0.98453, macro F1 = 0.9321. Top predictive features were: road
surface condition, traffic control type, posted speed limit, driver age and
vehicle type (SHAP analysis; top-5). Evaluation included stratified 5-fold
cross-validation, confusion matrices, calibration plots and permutation
importance. The manuscript details preprocessing, model selection,
reproducibility settings and policy implications for young-driver road safety.
Keywords: Road traffic injuries, Machine learning, Extra trees, Feature selection
1. Introduction
1.1. Motivation
Less experienced and younger
drivers are disproportionately prone to being involved in severe automobile
accidents. This study categorizes drivers under the age of 24 as young drivers
and investigates various factors that increase the risk of serious traffic
accidents in this demographic. Factors include inherent attributes (age,
gender, driving experience) and behavioral aspects (social influences, driving
frequency).
The greater involvement of younger drivers in car collisions has been a persistent and concerning problem. According to the most recent ‘Traffic Safety Facts’ report by the National Highway Traffic Safety Administration (NHTSA), U.S. drivers in the age groups 16-20 and 21-24 exhibited the highest rates of fatal crash involvement in 2019 [1, 2]. Overall, there have been improvements in fatality statistics over the decades due to measures such as mandatory seatbelt use and improvements in vehicle safety. Thus, over the period from 1975 to 2019, the proportion of passenger-vehicle drivers engaged in fatal crashes in the USA dropped by 66% for teenagers aged 16-19, by 49% for those aged 20-34, by 35% for individuals aged 35-69 and by 19% for those aged 70 and older. Furthermore, the rate of fatal crashes involving teenage passenger-vehicle drivers in 2019 decreased for the third consecutive year and was 4% lower than the 2018 rate [1, 2]. However, based on recent U.S. data from the National Center for Health Statistics (NCHS), motor vehicle accidents remain a leading cause of death among 15-24-year-olds [3]. Worldwide, 25% of deaths in individuals aged 16 to 20 can be attributed to motor vehicle crashes [3], resulting in both physical and emotional hardships for the survivors and the families of those killed and injured. Additionally, society bears the burden of productivity loss and medical costs for young individuals who would otherwise be in good health.
Numerous intertwined factors contribute to crashes involving young drivers, which often arise from a combination of circumstances rather than a single driver error. Identifying and understanding how these factors cause a particular outcome is important in devising and implementing evidence-based policies to reduce fatalities among young drivers. The current study employs a machine-learning classification model to forecast the severity of injuries sustained by young drivers in vehicle accidents. The model considers three classes of injury severity: fatal, serious and slight. It is important to note that this model is designed specifically to predict the injury severity distribution within Israel, as it has been exclusively trained on Israeli traffic accident data (spanning from 2009 to 2019). Nevertheless, the model’s insights may be applicable to other countries with similar traffic conditions. Given the dynamic nature of the factors affecting injury severity in young drivers, we recommend continuous retraining of the model using new data to uphold its predictive accuracy.
Data concerning road traffic accidents involving young drivers in Israel from 2009 to 2019 are presented in Figures 1 and 2. While the rates of fatalities and severe injuries do not seem to have shown an upward trend over time, their absolute levels are high. This underscores the importance of addressing safety concerns and risks associated with young drivers within the transportation system. Numerous studies have attempted to address these concerns and uncover the determinants of severe injuries in young drivers. These factors encompass demographic characteristics (e.g., age and gender), alcohol involvement, type of collision, environmental conditions, road features, location, time of day and road illumination.
Figure 1: Fatalities among drivers aged 20-24 throughout the years in Israel.
Figure 2: Young driver accident severity distributions throughout the years in Israel.
1.2. Literature review
Previous studies have sought
to uncover accident trends by analyzing comprehensive datasets containing
information on fatalities among young drivers. In this endeavor, researchers
have focused on distinguishing between attributes that have a substantial influence
on injury severity and those that have a minimal impact. This approach enables
the creation of precise models for predicting serious and fatal injuries among
young drivers. The traditional method for analyzing traffic safety has involved
establishing correlations between a broad spectrum of variables and the
incidence of crashes. Machine learning tools have gained widespread acceptance
among transportation safety researchers as a means of understanding the
determinants of injury severity in road traffic accidents. Table 1
provides an overview of the literature that has investigated the factors that
exert a substantial influence on the rate of severe injuries among young
drivers.
| Category | # | Authors | Explanatory variable | Values |
|---|---|---|---|---|
| Driver’s characteristics | 1 | Peek-Asa [17], Vachal [19], Das [20], Sunanda [42], Neyens [39], Zhang [53] | Population setting | Rural background, urban upbringing |
| | 2 | McCartt [22], Gonzales [28], Paleti [29], Vachal [19], Goldzweig [31], Chen [32], Williams [33], Fu [34], Peek-Asa [17], Neyens [39], Yang [52], UNSW Sydney [51] | Age group | 14-19, 20-24 |
| | 3 | Otmar Bock [35], Dalal [36], Gershon [37], Buckley [38], McDonald [41], Sunanda [42], Peek-Asa [17], Neyens [39], Yang [51], Zhang [53], UNSW Sydney [51] | Contributing factors | Aggressive/impaired driving, cause not known, defect in road condition, drunk driver, fault of young driver, fault of the other driver, others |
| | 4 | Shope [21], Adanu [16], Chen [32], Williams [33], Fu [34], UNSW Sydney [51], AP News [50] | Gender | Male, female |
| Road characteristic | 5 | Abdel-At [11] | Road surface condition | Muddy, slippery, good conditions |
| | 6 | Duddu [12] | Road configuration | Good visibility, poor visibility |
| | 7 | Duddu [12] | Pavement | Exists, does not exist |
| | 8 | Oviedo-Trespalacios [18], Duddu [12], Yang [52], UNICEF [49], AP News [50], Sunanda [42], Peek-Asa et al. [17], Neyens [39] | Maximum allowed road speed | Mandated speed limits used in these studies: 50 km/h, 60 km/h, 70 km/h, 80 km/h, 90 km/h, 100 km/h |
| | 9 | Peek-Asa [17], Vachal [19], Das [20], McDonald [41] | Number of lanes on the road (in any direction) | 1, 2, 4, 6, more than 6 |
| Weather characteristics | 10 | Abdel-At [11], Simons-Morton [44], S. T. Doherty [43], UNSW [50] | Weather conditions | Rain, snow, fog/smoke, typical weather conditions |
| Date/time characteristics | 11 | Dissanayake [13], Wang [14], Sunanda [42], Rice et al. [15], Adanu [16], Peek-Asa [17] | Day/night | Day, night |
| | 12 | Simons-Morton [44], S. T. Doherty [43], UNICEF [49] | Type of day | Normal, pre-festive, festive |
Broadly speaking, accident severity prediction models can be categorized into two groups: statistical learning and machine learning. Among these, statistical learning models have been widely employed by previous researchers. For example, Li [4] employed the support vector machine (SVM) model and the ordered probit (OP) model to analyze injury severity, revealing that the SVM model exhibited superior accuracy. Yu and Abdel-Aty [5] employed a classification and regression tree (CART) model to identify key explanatory variables. These variables were then used to compare a Bayesian logistic regression model with SVM models built on different kernel functions. The SVM model with a radial-basis kernel function outperformed the logistic regression model, as evaluated using ROC curves. Notably, the study highlighted the importance of reducing the variable space prior to model estimation. In a study by Chen [6], SVM models were used to predict injury severity in drivers involved in rollover crashes. At the initial stage, a CART model was used to identify significant variables. Subsequently, the authors used this set of variables as an input to the SVM models, demonstrating that these models perform reasonably well with a polynomial kernel, surpassing the Gaussian radial-basis kernel model. Alkheder et al. [7] compared an artificial neural network (ANN) algorithm with an ordered probit model for the task of predicting accident severity. The authors enhanced the performance of the ANN model by utilizing a k-means algorithm to group the dataset into three distinct clusters. Their findings revealed an accuracy of 74% for the ANN model, relative to an accuracy of 59% for the ordered probit model.
In a comparative study, AlMamlook et al. [8] assessed the performance of various machine learning algorithms in forecasting the severity of road traffic accidents. Their results indicated that the Random Forest algorithm achieved the best performance (75% accuracy), although the remaining algorithms performed similarly: Logistic Regression achieved 74% accuracy, Naïve Bayes 73% and AdaBoost 74%.
This paper is organized as follows: Section 1 positions the study and outlines the research gap and objectives; Section 2 describes the end-to-end workflow (data ingestion, preprocessing, feature selection, modeling and validation); Section 3 details the CBS data and access; Section 4 covers preprocessing and feature selection; Section 5 describes the models, selection criteria, hyperparameters and results; Section 6 discusses implications and limitations; Section 7 concludes.
1.3. Objectives of the research
· To identify features strongly correlated with injury severity in young drivers involved in road traffic accidents, utilizing data sourced from the Central Bureau of Statistics of Israel.
· To formulate a machine-learning classification model that can accurately predict the severity of injuries suffered by young drivers, while achieving a reasonable level of precision.
2. Outline of the Research Procedure
· Data loading and pre-processing: Data were loaded into a pandas DataFrame and pre-processing actions were applied, including handling missing values and outliers.
· Feature selection: Various feature selection algorithms were applied to identify the most predictive features.
· Training and prediction: Machine learning algorithms were developed to classify injury severity.
· Evaluation: Algorithms were evaluated based on accuracy, precision, recall and F1 score. Metrics (for each class c): Accuracy = (TP+TN)/(TP+FP+TN+FN); Precision_c = TP_c/(TP_c+FP_c); Recall_c = TP_c/(TP_c+FN_c); F1_c = 2·(Precision_c·Recall_c)/(Precision_c+Recall_c). Macro-F1 averages F1_c over the classes.
· Hyperparameter tuning: GridSearchCV was used to enhance the accuracy of the selected algorithm.
The sequence of actions for constructing and assessing the machine-learning classification models is presented in Figure 3. The fundamental elements of these steps are outlined below [46]:
· Importing data to the relational database: The initial phase involves importing CSV files containing unprocessed data on the attributes of road traffic accidents into a relational database. These datasets are sourced from the Israel Central Bureau of Statistics. The constructed MS SQL Server relational database comprises three tables: “Accident,” “Injured Person,” and “Vehicle.” Each table incorporates a column entitled “Accident ID,” which allows data from all three tables to be examined through a unified logical SQL view. Additionally, stringent data integration protocols are enforced to verify the soundness of the input data.
· Data loading and pre-processing: The second step loads data from the MS SQL Server database into a pandas DataFrame, within the Jupyter notebook development environment using Python. The DataFrame is then used to refine and pre-process the imported data. Furthermore, because many machine learning methods need uniform numeric ranges, the standard scaler transform is applied to normalize the numeric values of the data.
· Feature selection: In the third stage, feature selection algorithms are applied to the dataset. The purpose of this step is to uncover the attributes that are most important for predicting injury severity and should, therefore, serve as inputs to the candidate machine-learning algorithms.
· Training and prediction: The fourth phase consists of training and developing each of the candidate machine learning algorithms. These algorithms classify injury severity for young drivers involved in traffic accidents.
· Evaluation: The fifth step entails a careful assessment of the performance of each algorithm based on four metrics: accuracy, precision, recall and F1 score.
· Hyperparameter tuning and validation: In the final step, a hyperparameter tuning procedure is applied to enhance the accuracy of the selected algorithm.
Figure 3: Flowchart of the procedure for selecting the optimal machine learning algorithm.
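The evaluation metrics listed in the procedure can be illustrated with a minimal pure-Python sketch. It assumes that overall accuracy in the three-class setting is taken as the share of correct predictions, and the function name and toy labels are hypothetical rather than taken from the study.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels):
    """Return accuracy, per-class precision/recall/F1 and macro-F1."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    report = {}
    for c in labels:
        tp = pairs[(c, c)]
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    # Assumption: multiclass accuracy = correct predictions / all predictions.
    accuracy = sum(pairs[(c, c)] for c in labels) / len(y_true)
    # Macro-F1 is the unweighted mean of the per-class F1 scores.
    macro_f1 = sum(m["f1"] for m in report.values()) / len(labels)
    return accuracy, report, macro_f1


# Toy labels for illustration only (not CBS data).
y_true = ["slight", "slight", "slight", "serious", "serious", "fatal"]
y_pred = ["slight", "slight", "serious", "serious", "slight", "fatal"]
accuracy, report, macro_f1 = per_class_metrics(
    y_true, y_pred, labels=["fatal", "serious", "slight"]
)
print(round(accuracy, 4), round(macro_f1, 4))
```

Because macro-F1 weights every class equally, it is more sensitive than accuracy to errors on the rare fatal class.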
3. Importing Data
This section describes the
initial step of the process in greater detail. The input data originated from
the Central Bureau of Statistics of Israel and encompassed records of 37,499
traffic accidents involving young drivers spanning from 2009 to 2019. Within
this dataset, 396 accidents were fatal, while 37,103 incidents led to non-fatal
injuries for young drivers. For each entry, a total of 59 variables were
present, capturing information such as the unique crash ID, the date and time
of the crash, driver attributes (including gender and age group), accident
location and details about the road.
The Israeli Central Bureau of Statistics administers the nation’s traffic accident data through a compilation of 14 CSV files, each with a distinct structure. The entire set of files was imported into an MS SQL Server relational database to facilitate data retrieval and ensure the integrity of the incoming data. The import was carried out using the SSIS data tools of MS SQL Server. Within this system, three domain tables were built [46]: Accident, Injured Person and Vehicle. The latter two tables contained an Accident ID field, allowing their data to be synchronized with the data in the Accident table. Moreover, logical views were formulated using the shared Accident ID field to combine data from all domain tables into a single frame. Rigorous data integration protocols were applied to ensure data integrity. These protocols encompassed the assignment of appropriate column data types (e.g., integer, float, date), the establishment of primary and foreign keys and the application of stringent constraints such as unique indexes and default values. Furthermore, the Vehicle and Injured Person tables were equipped with foreign key constraints. Figure 4 presents the variables encapsulated within each of the domain tables stored in the database.
Figure 4: Variables stored in each domain table within the database.
The Accident table encompassed information such as the date and time of the accident, the type of road (urban junction, non-urban junction) and the geographical coordinates of the accident site. In parallel, the Vehicle table contained data pertaining to the vehicle(s) involved in the traffic accident, e.g., the vehicle type (regular, army, police, etc.), engine capacity, vehicle status (rented, stolen, etc.) and vehicle weight. Finally, the Injured Person table housed data relating to the individuals affected by the accident; in this study, only the driver's record was used. This table included characteristics such as the severity of the sustained injuries (uninjured, fatal, serious, slight), gender and age group.
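The logical view that links the three domain tables through the shared Accident ID can be approximated in pandas. The sketch below is illustrative only: the column names and the toy rows are assumptions, and the study itself built the view in MS SQL Server.

```python
import pandas as pd

# Hypothetical miniature versions of the three domain tables.
accident = pd.DataFrame({
    "accident_id": [1, 2],
    "road_type": ["urban junction", "non-urban junction"],
})
vehicle = pd.DataFrame({
    "accident_id": [1, 2],
    "vehicle_type": ["car", "motorcycle 51 to 250 cc"],
})
injured = pd.DataFrame({
    "accident_id": [1, 2],
    "age_group": ["20-24", "14-19"],
    "injury_severity": ["slight", "serious"],
})

# Join on the shared key, mirroring a SQL view over the three tables.
# validate="one_to_one" holds only for this toy data (one driver and one
# vehicle per accident); real data may need "one_to_many".
view = (
    accident
    .merge(vehicle, on="accident_id", how="inner", validate="one_to_one")
    .merge(injured, on="accident_id", how="inner", validate="one_to_one")
)
print(view)
```

The `validate` argument makes pandas raise an error if the key relationship is not what was expected, which plays a role similar to the foreign key constraints described above.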
· Source and coverage: Data were obtained from the Israel Central Bureau of Statistics (CBS), covering traffic accidents from 2009 to 2019. The dataset contains N = 37,499 accident records linked across the Accident, Injured Person and Vehicle tables.
· Variables: The raw dataset had 59 variables. After preprocessing and feature selection, 20 were retained for analysis. Exclusions (see Appendix A) were due to missingness >50%, low variance or collinearity.
· Target and class balance: Fatal = 512 (1.37%), Serious = 4,186 (11.16%), Slight = 32,801 (87.47%).
· Missing data: Numerical variables were imputed by the median and categorical variables by the mode. StandardScaler was applied to numerical features.
· Data availability: CBS microdata are available on request; code and a reproducible synthetic example are provided in the Supplementary.
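The imputation and scaling rules summarized above can be sketched with pandas alone. The column names and values below are hypothetical, and the manual standardization is written to match what scikit-learn's StandardScaler computes (zero mean, unit population variance). In a full pipeline the statistics would normally be estimated on the training split only.

```python
import pandas as pd

# Toy frame for illustration; not CBS data.
df = pd.DataFrame({
    "engine_capacity": [1200.0, None, 1600.0, 2000.0],
    "road_surface": ["dry", "wet", None, "dry"],
})

numeric_cols = ["engine_capacity"]
categorical_cols = ["road_surface"]

# Median imputation for numerical variables.
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Mode imputation for categorical variables (first mode if there are ties).
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Standardize numerical features; ddof=0 matches StandardScaler.
for col in numeric_cols:
    df[col] = (df[col] - df[col].mean()) / df[col].std(ddof=0)

print(df)
```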
4. Data Preparation and Feature Selection
This section elaborates on the second and third phases of the procedure. Data from the MS SQL Server logical view were loaded into a pandas DataFrame to facilitate various pre-processing actions [46]. The following procedures were then applied:
· Blank or NULL values for a specific feature were substituted with an appropriate average value (mean or median) computed over the dataset.
· Outliers were detected manually (by scrutinizing the data) and subsequently eliminated.
The next step was to reduce the size of the variable space to be used as an input to the machine learning algorithms. This was achieved via an array of feature selection methods, each of which is described below.
4.1. Variance threshold algorithm
In this approach, features were only included if their variance exceeded 0.5. The premise for this algorithm is that features with low variance offer limited modeling utility, with the recommendation being to adopt a threshold value approaching zero [30]. Using this method, 27 features (as detailed in Table 2) were identified as the most likely candidates for demonstrating a robust correlation with the severity of injuries among young drivers.
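A minimal sketch of the variance filter, using scikit-learn's VarianceThreshold with the 0.5 cut-off reported above, is shown here. The integer-encoded toy matrix is an assumption; note that the variance of an encoded categorical feature depends on the encoding, so the cut-off is only meaningful relative to how the features were coded.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy integer-encoded features (hypothetical): columns 0 and 2 vary widely,
# column 1 is almost constant.
X = np.array([
    [0, 1, 3],
    [2, 1, 0],
    [0, 1, 3],
    [2, 1, 0],
    [0, 0, 3],
    [2, 1, 0],
])

selector = VarianceThreshold(threshold=0.5)  # cut-off reported in the text
X_reduced = selector.fit_transform(X)
kept = selector.get_support(indices=True)    # indices of retained columns

print(selector.variances_, kept)
```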
Table 2: Features selected using the variance threshold algorithm.
| # | Variable | Values |
|---|---|---|
| Driver's characteristics | | |
| 1 | Gender | Male, Female |
| 2 | Ethnic group | Jewish, Non-Jewish, Not specified or other |
| 3 | Age group | 14-19, 20-24 |
| 4 | Population setting | Rural background, Urban upbringing |
| Road characteristics | | |
| 5 | Road category | Highway, major district road, village road, other road, unknown |
| 6 | Maximum allowed speed | 50 km/h, 60 km/h, 70 km/h, 80 km/h, 90 km/h, 100 km/h |
| 7 | Traffic control | No control, working traffic light, failed traffic light, blinking yellow, stop sign, priority sign, not specified |
| 8 | Road width | Up to 5 m, 5 to 7 m, 7 to 10 m, 10 to 14 m, over 14 m |
| 9 | Number of lanes on the road (in any direction) | 1, 2, 4, 6, more than 6 |
| 10 | Road signpost | Defective/missing signage, no signage required, signage intact, unknown |
| 11 | Road surface conditions | Dry, wet from water, wet from slippery material, covered with mud, covered with sand, not specified |
| 12 | Type of road | One-way road, two-way road with separation, two-way road without separation, not specified |
| 13 | Shoulders of the road | Paved shoulders, low shoulders, rough road (no tarmac or hard shoulder) |
| 14 | Shape of road | Entrance to an interchange, exit from an interchange, parking lot, steep slope, sharp curve, railroad junction, bus stop, public transport route, other |
| 15 | Illumination on the road | Daylight, night without illumination, night with illumination |
| Accident location characteristics | | |
| 16 | Area | Central, north, south |
| 17 | Location of the accident | Urban at a junction, urban not at an intersection, non-urban at an intersection, non-urban and not at a crossroads |
| 18 | District | Jerusalem, the north, Haifa, the center, Tel Aviv, the south, Judea and Samaria, Gaza envelope |
| Date/time characteristics | | |
| 19 | Day of the week | Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday |
| 20 | Day/night | Day, night |
| 21 | Period of the day | Morning peak, off-peak, afternoon peak, evening/night |
| 19 | Day of the week | Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday |
| Vehicle characteristics | | |
| 23 | Vehicle type | Bicycle, motorcycle up to 50 cc, motorcycle 51 to 250 cc, motorcycle 251 to 500 cc, motorcycle >501 cc, car, bus, cab, work vehicle, tractor, train, minibus, freight (>34.0 tons total weight) |
| 24 | Vehicle weight (tons) | Less than 1.9, 2.0-2.9, 3.0-3.5, 3.6-4.0, 4.1-5.9, 6.0-7.9, 8.0-9.9, 10.0-12.0, 12.1-12.9, 13.0-15.9, 16.0-19.0, 19.1-25.9, 26.0-30.0, 30.1-32.0, 32.1-33.9, 34.0-40.0, 40.1-56.0, ≥56.1 |
| 25 | Use of safety accessories | Fastened seat belt, wore a protective helmet (motorcycle only), sat in a child seat (injured child only), did not use safety measures |
| 26 | Vehicle status | Regular, stolen, rented, transport student, transporting children |
| Weather characteristics | | |
| 27 | Weather conditions | Clear, rainy, hot, foggy, not specified |
4.2. SelectKBest algorithm
In this technique, a designated function (e.g., chi-squared or another relevant statistical test) assigns a score to each feature. Subsequently, the k highest scoring features are retained [30]. This approach aims to identify the k most informative features from the initial set. The value of k needs to be greater than 0 and cannot exceed the total number of features. For this study, a value of 20 was employed. The ensuing set of features was as follows: vehicle weight (with reference to the vehicle of the young driver), vehicle type (vehicle of the young driver), gender, age group, ethnic group, traffic control, use of safety accessories, road width, behavioral factors, day/night, road surface conditions, road shape, road signpost, population setting, maximum allowed speed, period of day, weather conditions, type of day, vehicle status and shoulders of road.
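As a rough illustration of score-based selection, the sketch below applies scikit-learn's SelectKBest with the chi-squared score to synthetic data. The data, the seed and the small k are assumptions made to keep the example short; the study used k = 20, and the text does not state which score function was finally chosen.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 3, size=n)                   # three toy severity classes
informative = y + rng.integers(0, 2, size=n)     # depends on the label
noise = rng.integers(0, 4, size=(n, 4))          # unrelated to the label
X = np.column_stack([informative, noise])        # chi2 needs non-negative values

selector = SelectKBest(score_func=chi2, k=2)     # the study used k=20
X_new = selector.fit_transform(X, y)
selected = selector.get_support(indices=True)

print(selector.scores_.round(2), selected)
```

The label-dependent column receives a much larger chi-squared score than the noise columns and is therefore retained.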
4.3. SelectPercentile algorithm
The SelectPercentile approach is similar to SelectKBest, but instead of identifying the k most effective features, it retains a certain percentage of the features (again, based on their scores). The SelectPercentile algorithm returned the following set of features: road surface conditions, traffic control, maximum allowed speed, contributing factors, gender, age group, number of lanes on the road, road signpost, use of safety accessories, illumination on the road, vehicle weight, vehicle status, day of the week, location of the accident, district, driver’s ethnic group, type of road and road category.
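The percentage-based variant can be sketched in the same way with scikit-learn's SelectPercentile. The percentile value, the score function and the synthetic data are assumptions, since the text does not report which percentile was used.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

rng = np.random.default_rng(1)
y = rng.integers(0, 3, size=200)
X = rng.integers(0, 5, size=(200, 10))   # ten toy non-negative features
X[:, 0] = y                              # make one feature clearly informative

# Keep the top 30% of features by chi-squared score (assumed value).
selector = SelectPercentile(score_func=chi2, percentile=30)
X_new = selector.fit_transform(X, y)

print(X_new.shape, selector.get_support(indices=True))
```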
4.4. Sequential feature selector algorithm
This algorithm employs a greedy approach to add (forward selection) or eliminate (backward selection) variables when constructing a subset of features. At each step, the algorithm picks the best feature for addition or removal based on a cross-validation score produced by an estimator [46]. When employed in unsupervised learning, this method exclusively considers the input features (X) without reference to the desired outputs (y) [30]. Within this study, the sequential feature selector algorithm identified the following features: accident location, driver’s age group and gender, road signpost, day/night, road illumination, period of the day, district, number of road lanes, road shape, road width, road surface conditions, vehicle type, vehicle weight, maximum allowed speed, driver’s ethnic group, weather conditions and road shoulders.
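A minimal sketch of greedy forward selection with scikit-learn's SequentialFeatureSelector follows. The decision-tree estimator, the number of features to select and the synthetic data are assumptions chosen for brevity; the text does not state which estimator or direction the study used.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the accident features.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

sfs = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=3,   # assumed value for the toy example
    direction="forward",      # "backward" removes features instead
    cv=5,                     # each candidate is scored by 5-fold CV
)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)
print(selected)
```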
To combine the outcomes of the aforementioned algorithms, we selected the 20 features with the highest occurrence (i.e., those that appeared most often across the four algorithms). This produced the definitive list presented in Table 3, which served as the input data for the machine-learning classification algorithms assessed in this study.
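The occurrence-based combination can be sketched as a simple vote count. The feature lists below are abbreviated and hypothetical, and the alphabetical tie-break is an assumption, as the text does not say how ties were resolved.

```python
from collections import Counter

# Abbreviated, hypothetical selector outputs for illustration.
selections = {
    "variance_threshold": ["gender", "age group", "road width", "weather conditions"],
    "select_k_best": ["gender", "age group", "type of day", "weather conditions"],
    "select_percentile": ["gender", "age group", "district"],
    "sequential": ["gender", "road width", "district", "weather conditions"],
}


def most_frequent_features(selections, n_keep):
    """Rank features by how many selectors chose them; ties broken alphabetically."""
    votes = Counter(f for chosen in selections.values() for f in set(chosen))
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n_keep]]


final_features = most_frequent_features(selections, n_keep=3)  # the study kept 20
print(final_features)
```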
Table 3: List of features employed as inputs to the classification algorithms.
| # | Variable | Values |
|---|---|---|
| Driver's characteristics | | |
| 1 | Gender | Male, female |
| 2 | Age group | 14-19, 20-24 |
| 3 | Population setting | Rural background, urban upbringing |
| 4 | Ethnic group | Jewish, non-Jewish, not specified or other |
| Road characteristics | | |
| 5 | Road surface conditions | Dry, wet from water, wet from slippery material, covered with mud, covered with sand, not specified |
| 6 | Traffic control | No control, working traffic light, failed traffic light, blinking yellow, stop sign, priority sign, not specified |
| 7 | Maximum allowed speed | 50 km/h, 60 km/h, 70 km/h, 80 km/h, 90 km/h, 100 km/h |
| 8 | Contributing factors | Aggressive/impaired driving, cause not known, defect in road condition, drunk driver, fault of young driver, fault of the other driver, others |
| 9 | Number of lanes on the road (in any direction) | 1, 2, 4, 6, more than 6 |
| 10 | Road signpost | Defective/missing signage, signage intact, a signpost is required (lacking, not misplaced or faulty), unknown |
| 11 | Type of road | One-way road, two-way road with separation, two-way road without separation, not specified |
| 12 | Use of safety accessories | Fastened seat belt, wore a protective helmet (motorcycle only), sat in a child seat (injured child only), did not use safety measures |
| 13 | Illumination on the road | Daylight, night without illumination, night with illumination |
| 14 | Road category | Highway, major district road, village road, other road, unknown |
| Accident location characteristics | | |
| 15 | Location of the accident | Urban at a junction, urban not at an intersection, non-urban at an intersection, non-urban and not at a crossroads |
| 16 | District | Jerusalem, the north, Haifa, the center, Tel Aviv, the south, Judea and Samaria, Gaza envelope |
| Vehicle characteristics | | |
| 17 | Vehicle type | Bicycle, motorcycle up to 50 cc, motorcycle 51 to 250 cc, motorcycle 251 to 500 cc, motorcycle >501 cc, car, bus, cab, work vehicle, tractor, train, minibus, freight (>34.0 tons total weight) |
| 18 | Vehicle weight (tons) | Less than 1.9, 2.0-2.9, 3.0-3.5, 3.6-4.0, 4.1-5.9, 6.0-7.9, 8.0-9.9, 10.0-12.0, 12.1-12.9, 13.0-15.9, 16.0-19.0, 19.1-25.9, 26.0-30.0, 30.1-32.0, 32.1-33.9, 34.0-40.0, 40.1-56.0, ≥56.1 |
| Date/time characteristics | | |
| 19 | Day/night | Day, night |
| Weather characteristics | | |
| 20 | Weather conditions | Clear, rainy, hot, foggy, not specified |
The scikit-learn package was used to construct the machine-learning models and generate the classification reports. The data were separated into two distinct sets: one for training (80% of the dataset) and another for testing (20%). All data manipulations were confined to the training dataset, while the testing dataset was reserved for evaluation and was used to produce the classification reports. The resultant performance metrics are presented in Table 4.
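The 80/20 split and the classification report can be sketched as follows. The synthetic imbalanced data, the stratified split and the random seeds are assumptions for illustration; the text specifies only the split proportions and the use of scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced three-class data standing in for the CBS features.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, n_classes=3,
    weights=[0.05, 0.15, 0.80], random_state=0,
)

# 80% training / 20% testing; stratify keeps class shares similar (assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

model = ExtraTreesClassifier(random_state=0).fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)
```

The printed report lists precision, recall and F1 for each class alongside macro and weighted averages, which is where a gap between accuracy and macro F1 on rare classes becomes visible.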
Table 4: Accuracy, precision, recall and F1 scores for the candidate classification algorithms.
| # | Classification algorithm name | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| 1 | Logistic Regression | 0.95000 | 0.95000 | 0.95000 | 0.95000 |
| 2 | Logistic Regression CV | 0.95000 | 0.95000 | 0.95000 | 0.95000 |
| 3 | Gradient Boosting Classifier | 0.95000 | 0.95000 | 0.95000 | 0.95000 |
| 4 | SVM | 0.95000 | 0.95000 | 0.95000 | 0.95000 |
| 5 | Linear SVC | 0.95000 | 0.95000 | 0.95000 | 0.95000 |
| 6 | Naive Bayes Classifiers | 0.90000 | 0.91000 | 0.90000 | 0.90000 |
| 7 | Gaussian Naive Bayes | 0.91000 | 0.91000 | 0.92000 | 0.91000 |
| 8 | Ridge Classifier | 0.95000 | 0.95000 | 0.95000 | 0.95000 |
| 9 | Ridge Classifier CV | 0.95000 | 0.95000 | 0.95000 | 0.95000 |
| 10 | Decision Tree Classifier | 0.95000 | 0.95000 | 0.95000 | 0.95000 |
| 11 | Random Forest Classifier | 0.97000 | 0.97000 | 0.97000 | 0.97000 |
| 12 | Extra Tree Classifier | 0.98453 | 0.98000 | 0.98000 | 0.9321 |
| 13 | Perceptron Algorithm | 0.94000 | 0.94000 | 0.94000 | 0.94000 |
| 14 | K-nearest Neighbors | 0.94000 | 0.94000 | 0.94000 | 0.94000 |
| 15 | XGBoost | 0.96000 | 0.97000 | 0.97000 | 0.96000 |
| 16 | Bagging classifier | 0.96000 | 0.96000 | 0.96000 | 0.96000 |
Based on the metrics presented in Table 4, the Extra Trees classifier was identified as the best machine-learning approach for predicting the severity of injuries sustained by young drivers involved in road traffic accidents in Israel. To further improve the Extra Trees classifier, the GridSearchCV utility from the scikit-learn package [30] was applied. This utility searches a user-defined grid to find the best hyperparameter values for a given classifier. The number of trees was searched over the range 10 to 500, and the best value was found to be 50. Subsequently, GridSearchCV was used to tune the minimum number of samples required at a node before it can be split further (a parameter denoted as ‘min_samples_split’). Values of min_samples_split from 2 to 15 were tested, and a value of 10 resulted in the highest accuracy. These parameter adjustments raised the accuracy of the Extra Trees classifier from 0.98037 to 0.98453.
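The tuning step can be sketched with GridSearchCV over the two hyperparameters named above. The grid points, the synthetic data and the joint (rather than sequential) search are assumptions; only the parameter names and the searched ranges come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic three-class problem (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Coarse grid inside the reported ranges (10-500 trees, min_samples_split 2-15).
param_grid = {
    "n_estimators": [10, 50, 100],
    "min_samples_split": [2, 10, 15],
}

search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)
best = search.best_params_
print(best, round(search.best_score_, 4))
```

Searching both parameters jointly, as here, evaluates every combination; tuning them one after the other is cheaper but can miss interactions between the two settings.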
We evaluated Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Extra Trees, Gradient Boosting (XGBoost), Support Vector Machine and Naïve Bayes. Hyperparameters were optimized by GridSearchCV/RandomizedSearchCV with stratified 5-fold CV. Hyperparameter grids are shown in Supplementary Table S2. Class imbalance was handled with class_weight settings and SMOTE experiments. Python 3.9, scikit-learn 1.2+ and XGBoost 1.6+ were used (see Supplementary for exact versions).
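A minimal sketch of stratified 5-fold cross-validation with class weighting is given below. The synthetic data, the "balanced" weighting and the macro-F1 scorer are assumptions; the SMOTE experiments are not shown because they rely on a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data loosely mimicking a rare-class setting.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3,
    weights=[0.05, 0.15, 0.80], random_state=0,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency.
model = ExtraTreesClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(scores.round(3), round(scores.mean(), 3))
```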
The findings validate the general conclusions gleaned from the literature review concerning the importance of factors such as signposts, road illumination, number of lanes, road shape, vehicle weight, road surface conditions, accident location and maximum allowed speed. Other pertinent factors include the age of the young driver, weather conditions, the setting (rural vs. urban roads), the amount of driving experience and the gender of the young driver, all of which have been highlighted as significant variables in prior research [15, 16, 18-20, 22-24, 27, 30]. Notably, however, the present study failed to uncover a substantial influence of alcohol consumption on the part of the driver, in contrast to the established significance of this variable in driver fatality studies carried out in the United States and European countries [25, 26].
6.2. Constraints and areas for future investigation
This study did not explore several additional classifiers, including the voting classifier, stacking classifier, passive-aggressive classifier, nearest centroid classifier and histogram-based gradient boosting classifier [40]. Ensemble methods such as voting and stacking are designed to enhance generalization and resilience relative to individual estimators. Additionally, this research did not employ hybrid machine-learning models, which integrate diverse computations, methods or processes from similar or disparate data domains or application areas to enhance their mutual performance.
Notably, the authors have previously published papers that developed machine-learning-based models to diminish the severity of pedestrian and bicyclist injuries in road traffic incidents [46, 47]. Hence, the studies may share similarities in the models used, data source, software and technical terms. Nevertheless, the current paper makes a distinct contribution by developing machine-learning models to classify the severity of road traffic injuries among young drivers.
Metrics are reported with 4 significant digits. Extra Trees achieved accuracy = 0.98453 and macro F1 = 0.9321. Stability was assessed across 10 random seeds. McNemar's test was used to compare the top models. Interpretability was examined through permutation importance and SHAP plots (Figure 3). An ablation study confirmed robustness when the top-3 features were removed.
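The comparison of two classifiers with McNemar's test can be illustrated with a short pure-Python sketch. It assumes the exact (binomial) two-sided form of the test on the discordant pairs; the text does not state which variant was used, and the toy predictions are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs of two classifiers."""
    # b: cases model A gets right and model B gets wrong; c: the reverse.
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under H0 the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy predictions for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 0, 1]
pred_a = [0, 0, 1, 1, 2, 2, 0, 0]
pred_b = [0, 1, 1, 0, 2, 1, 0, 0]
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

Only the cases where the two models disagree about correctness enter the test, so two models with similar accuracy can still differ significantly if their errors fall on different records.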
7. Conclusion
8. Author Contributions
Conceptualization, Slava Birfir; methodology, Slava Birfir; software, Slava Birfir; validation, Slava Birfir, Amir Elalouf and Tova Rosenbloom; formal analysis, Slava Birfir; investigation, Slava Birfir; resources, Slava Birfir; data curation, Amir Elalouf and Tova Rosenbloom; writing (original draft preparation), Slava Birfir; writing (review and editing), Amir Elalouf and Tova Rosenbloom; visualization, Slava Birfir; supervision, Amir Elalouf and Tova Rosenbloom; project administration, Slava Birfir. All authors have read and agreed to the published version of the manuscript.
9. Funding
The authors listed below affirm that they
have no affiliations with or engagement in any organization or entity with
financial interests (including honoraria, educational grants, participation in
speakers' bureaus, membership, employment, consultancies, stock ownership,
equity interests, expert testimony or patent-licensing arrangements) or
non-financial interests (such as personal or professional relationships,
affiliations, knowledge or beliefs) related to the subject matter or materials
discussed in this manuscript.
10. Conflicts of Interest
The authors whose names are listed
immediately below certify that they have NO affiliations with or involvement in
any organization or entity with any financial interest (such as honoraria;
educational grants; participation in speakers’ bureaus; membership, employment,
consultancies, stock ownership or other equity interest; and expert testimony
or patent-licensing arrangements) or non-financial interest (such as personal
or professional relationships, affiliations, knowledge or beliefs) in the
subject matter or materials discussed in this manuscript.
11. Data Availability Statement
The data used in this article were provided by the Central Bureau of Statistics of Israel. According to their policy, we are unable to include any data directly in the article. However, those interested in accessing the data may contact the Central Bureau of Statistics by emailing [email protected] or by filling out the request form available at https://www.cbs.gov.il/he/subjects/Pages/%D7%AA%D7%97%D7%91%D7%95%D7%A8%D7%94.aspx. The data are not publicly available but can be accessed upon request. Please refer to the attached document, Confirmation Regarding Data Availability Restrictions.doc, for further details.
12. References
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Appendix A: Feature Selection Summary