Research Article

Developing Machine-Learning Models to Classify the Seriousness of Road Traffic Injuries among Young Drivers

Authors: Slava Birfir, Amir Elalouf and Tova Rosenbloom

Publication Date: October 23, 2025

DOI: https://doi.org/10.51219/JAIMLD/slava-birfir/613

Citation: Slava Birfir, et al. Developing Machine-Learning Models to Classify the Seriousness of Road Traffic Injuries among Young Drivers. J Artif Intell Mach Learn & Data Sci, 2025, 3(4): 1-3.

Copyright: ©2025 Slava Birfir, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Abstract

This study develops and evaluates machine-learning models to classify road-traffic injury severity among young drivers using Israeli Central Bureau of Statistics data (2009-2019; N=37,499). After extensive preprocessing, feature selection and hyperparameter tuning, an Extra Trees Classifier achieved the best performance on a held-out test set: accuracy = 0.98453, macro F1 = 0.9321. Top predictive features were: road surface condition, traffic control type, posted speed limit, driver age and vehicle type (SHAP analysis; top-5). Evaluation included stratified 5-fold cross-validation, confusion matrices, calibration plots and permutation importance. The manuscript details preprocessing, model selection, reproducibility settings and policy implications for young-driver road safety.

Keywords: Road traffic injuries, Machine learning, Extra trees, Feature selection

1. Introduction
1.1. Motivation
Less experienced and younger drivers are disproportionately prone to being involved in severe automobile accidents. This study categorizes drivers under the age of 24 as young drivers and investigates various factors that increase the risk of serious traffic accidents in this demographic. Factors include inherent attributes (age, gender, driving experience) and behavioral aspects (social influences, driving frequency).

The greater involvement of younger drivers in car collisions has been a persistent and concerning problem. According to the most recent ‘Traffic Safety Facts’ report by the National Highway Traffic Safety Administration (NHTSA), U.S. drivers in the age groups 16-20 and 21-24 exhibited the highest rates of fatal crash involvement in 2019¹,². Overall, fatality statistics have improved over the decades due to measures such as mandatory seatbelt use and improvements in vehicle safety. Over the period from 1975 to 2019, the proportion of passenger-vehicle drivers engaged in fatal crashes in the USA dropped by 66% for teenagers aged 16-19, by 49% for those aged 20-34, by 35% for individuals aged 35-69 and by 19% for those aged 70 and older. Furthermore, the rate of fatal crashes involving teenage passenger-vehicle drivers in 2019 decreased for the third consecutive year and was 4% lower than the 2018 rate¹,². However, based on recent U.S. data from the National Center for Health Statistics (NCHS), motor vehicle accidents remain a leading cause of death among 15-24-year-olds³. Worldwide, 25% of deaths in individuals aged 16 to 20 can be attributed to motor vehicle crashes³, resulting in both physical and emotional hardships for the survivors and the families of those killed and injured. Additionally, society bears the burden of productivity loss and medical costs for young individuals who would otherwise be in good health.

Numerous intertwined factors contribute to crashes involving young drivers, which often arise from a combination of circumstances rather than a single driver error. Identifying and understanding how these factors cause a particular outcome is important in devising and implementing evidence-based policies to reduce fatalities among young drivers. The current study employs a machine-learning classification model to forecast the severity of injuries sustained by young drivers in vehicle accidents. The model considers three classes of injury severity: fatal, serious and slight. It is important to note that this model is designed specifically to predict the injury severity distribution within Israel, as it has been exclusively trained on Israeli traffic accident data (spanning from 2009 to 2019). Nevertheless, the model’s insights may be applicable to other countries with similar traffic conditions. Given the dynamic nature of the factors affecting injury severity in young drivers, we recommend continuous retraining of the model using new data to uphold its predictive accuracy.

Data concerning road traffic accidents involving young drivers in Israel from 2009 to 2019 are presented in Figures 1 and 2. While the rates of fatalities and severe injuries do not seem to have shown an upward trend over time, their absolute levels are high. This underscores the importance of addressing safety concerns and risks associated with young drivers within the transportation system. Numerous studies have attempted to address these concerns and uncover the determinants of severe injuries in young drivers. These factors encompass demographic characteristics (e.g., age and gender), alcohol involvement, type of collision, environmental conditions, road features, location, time of day and road illumination.

Figure 1: Fatalities among drivers aged 20-24 throughout the years in Israel.

Figure 2: Young driver accident severity distributions throughout the years in Israel.

1.2. Literature review

Previous studies have sought to uncover accident trends by analyzing comprehensive datasets containing information on fatalities among young drivers. In this endeavor, researchers have focused on distinguishing between attributes that have a substantial influence on injury severity and those that have a minimal impact. This approach enables the creation of precise models for predicting serious and fatal injuries among young drivers. The traditional method for analyzing traffic safety has involved establishing correlations between a broad spectrum of variables and the incidence of crashes. Machine learning tools have gained widespread acceptance among transportation safety researchers as a means of understanding the determinants of injury severity in road traffic accidents. (Table 1) provides an overview of the literature that has investigated the factors that exert a substantial influence on the rate of severe injuries among young drivers.

Category	#	Authors	Explanatory variable	Values
Driver’s characteristics	1	Peek-Asa¹⁷, Vachal¹⁹, Das²⁰, Sunanda⁴², Neyens³⁹, Zhang⁵³	Population setting	Rural background, urban upbringing
	2	McCartt²², Gonzales²⁸, Paleti²⁹, Vachal¹⁹, Goldzweig³¹, Chen³², Williams³³, Fu³⁴, Peek-Asa¹⁷, Neyens³⁹, Yang⁵², UNSW Sydney⁵¹	Age group	14-19, 20-24
	3	Otmar Bock³⁵, Dalal³⁶, Gershon³⁷, Buckley³⁸, McDonald⁴¹, Sunanda⁴², Peek-Asa¹⁷, Neyens³⁹, Yang⁵¹, Zhang⁵³, UNSW Sydney⁵¹	Contributing factors	Aggressive/impaired driving, cause not known, defect in road condition, drunk driver, fault of young driver, fault of the other driver, others
	4	Shope²¹, Adanu¹⁶, Chen³², Williams³³, Fu³⁴, UNSW Sydney⁵¹, AP News⁵⁰	Gender	Male, female
Road characteristics	5	Abdel-Aty¹¹	Road surface condition	Muddy, slippery, good conditions
	6	Duddu¹²	Road configuration	Good visibility, poor visibility
	7	Duddu¹²	Pavement	Exists, does not exist
	8	Oviedo-Trespalacios¹⁸, Duddu¹², Yang⁵², UNICEF⁴⁹, AP News⁵⁰, Sunanda⁴², Peek-Asa¹⁷, Neyens³⁹	Maximum allowed road speed	50 km/h, 60 km/h, 70 km/h, 80 km/h, 90 km/h, 100 km/h
	9	Peek-Asa¹⁷, Vachal¹⁹, Das²⁰, McDonald⁴¹	Number of lanes on the road (in any direction)	1, 2, 4, 6, more than 6
Weather characteristics	10	Abdel-Aty¹¹, Simons-Morton⁴⁴, S. T. Doherty⁴³, UNSW⁵⁰	Weather conditions	Rain, snow, fog/smoke, typical weather conditions
Date/time characteristics	11	Dissanayake¹³, Wang¹⁴, Sunanda⁴², Rice¹⁵, Adanu¹⁶, Peek-Asa¹⁷	Day/night	Day, night
	12	Simons-Morton⁴⁴, S. T. Doherty⁴³, UNICEF⁴⁹	Type of day	Normal, pre-festive, festive

Table 1: Literature overview.

Broadly speaking, accident severity prediction models can be categorized into two groups: statistical learning and machine learning. Among these, statistical learning models have been widely employed by previous researchers. For example, Li⁴ employed the support vector machine (SVM) model and the ordered probit (OP) model to analyze injury severity, revealing that the SVM model exhibited superior accuracy. Yu and Abdel-Aty⁵ employed a classification and regression tree (CART) model to identify key explanatory variables. Using different kernel functions, these variables were then employed to compare Bayesian logistic regression and SVM models. The SVM model with a radial-basis kernel function outperformed the logistic regression model, as evaluated using ROC curves. Notably, the study highlighted the importance of reducing the variable space prior to model estimation. In a study by Chen⁶, SVM models were used to predict injury severity in drivers involved in rollover crashes. At the initial stage, a CART model was used to identify significant variables. Subsequently, the authors used this set of variables as an input to the SVM models, demonstrating that these models perform reasonably well with a polynomial kernel, surpassing the Gaussian radial-basis kernel model. Alkheder, et al.⁷ compared an artificial neural network (ANN) algorithm with an ordered probit model for the task of predicting accident severity. The authors enhanced the performance of the ANN model by utilizing a k-means algorithm to group the dataset into three distinct clusters. Their findings revealed an accuracy of 74% for the ANN model, relative to an accuracy of 59% for the ordered probit model.

In a comparative study, AlMamlook, et al.⁸ assessed the performance of various machine learning algorithms in forecasting the severity of road traffic accidents. Their results indicated that the Random Forest algorithm achieved the best performance (75% accuracy), although the remaining algorithms performed similarly: Logistic Regression achieved 74% accuracy, Naïve Bayes 73% and AdaBoost 74%.

This paper is organized as follows: Section 1 positions the study and outlines the research gap and objectives; Section 2 describes the end-to-end workflow (data ingestion, preprocessing, feature selection, modeling and validation); Section 3 details the CBS data and access; Section 4 covers preprocessing and feature selection; Section 5 describes the models, selection criteria, hyperparameters and results; Section 6 discusses implications and limitations; Section 7 concludes.

1.3. Objectives of the research
·To identify features strongly correlated with injury severity in young drivers involved in road traffic accidents, utilizing data sourced from the Central Bureau of Statistics of Israel.
·To formulate a machine-learning classification model that can accurately predict the severity of injuries suffered by young drivers, while achieving a reasonable level of precision.

2. Outline of the Research Procedure

· Importing data: Raw data from CSV files sourced from the Israel Central Bureau of Statistics were imported into a relational database⁹.
· Data loading and pre-processing: Data were loaded into a pandas DataFrame and pre-processing actions were applied, including handling missing values and outliers.
· Feature selection: Various feature selection algorithms were applied to identify the most predictive features.
· Training and prediction: Machine learning algorithms were trained to classify injury severity.
· Evaluation: Algorithms were evaluated based on accuracy, precision, recall and F1 score.
Metrics (for each class c): Accuracy = (TP+TN)/(TP+FP+TN+FN); Precision_c = TP_c/(TP_c+FP_c); Recall_c = TP_c/(TP_c+FN_c); F1_c = 2·(Precision_c·Recall_c)/(Precision_c+Recall_c). Macro-F1 averages F1_c over the classes.
· Hyperparameter tuning: GridSearchCV was used to enhance the accuracy of the selected algorithm.

The sequence of actions for constructing and assessing the machine-learning classification models is presented in Figure 3. The fundamental elements of these steps are outlined below⁴⁶:
·Importing data to the relational database: The initial phase involves the importation of CSV files containing unprocessed data regarding attributes of road traffic accidents into a relational database. These datasets are sourced from the Israel Central Bureau of Statistics. The constructed MS SQL server relational database comprises three tables: “Accident,” “Injured Person,” and “Vehicle.” Each table incorporates a column entitled “Accident ID,” facilitating the examination of data from all three tables through a unified logical SQL view. Additionally, stringent data integration protocols are enforced to verify the soundness of the input data.
· Data loading and pre-processing: The second step loads data from the MS SQL server database into a pandas DataFrame, within a Jupyter notebook environment using Python. The DataFrame is then used to clean and pre-process the imported data. Because many machine learning methods expect numeric features on a common scale, the StandardScaler transform is applied to standardize the numeric values of the data.
· Feature selection: In the third stage, feature selection algorithms are applied to the dataset. The purpose of this step is to uncover the attributes that are most important for predicting injury severity and should, therefore, serve as inputs to the candidate machine-learning algorithms.
· Training and prediction: The fourth phase consists of training each of the candidate machine learning algorithms. These algorithms classify injury severity for young drivers involved in traffic accidents.
· Evaluation: The fifth step assesses the performance of each algorithm based on four metrics: accuracy, precision, recall and F1 score.
·Hyperparameter tuning and validation: In the final step, a hyperparameter tuning procedure is applied to enhance the accuracy of the selected algorithm.

Figure 3: Flowchart of the procedure for selecting the optimal machine learning algorithm.

3. Importing Data

This section describes the initial step of the process in greater detail. The input data originated from the Central Bureau of Statistics of Israel and encompassed records of 37,499 traffic accidents involving young drivers spanning from 2009 to 2019. Within this dataset, 396 accidents were fatal, while 37,103 incidents led to non-fatal injuries for young drivers. For each entry, a total of 59 variables were present, capturing information such as the unique crash ID, the date and time of the crash, driver attributes (including gender and age group), accident location and details about the road.

The Israeli Central Bureau of Statistics administers the nation’s traffic accident data through a compilation of 14 CSV files, each with a distinct structure. The entire set of files was imported into an MS SQL server relational database to facilitate data retrieval and ensure the integrity of the incoming data. This importation process was facilitated using the MS SQL server’s SSIS data tools. Within this system, three domain tables were built⁴⁶: Accident, Injured Person and Vehicle. The latter two tables contained an Accident ID field, allowing their data to be synchronized with the data in the Accident table. Moreover, logical views were formulated using the shared Accident ID field to amalgamate data from all domain tables. This integration resulted in a comprehensive representation of data from all tables within a singular frame. Rigorous data integration protocols were invoked to ensure data integrity. These protocols encompassed the assignment of appropriate column data types (e.g., integer, float, date), the establishment of primary and foreign indexes and the application of stringent constraints such as unique indexes and default values. Furthermore, the Vehicle and Injured Person tables were equipped with foreign key constraints. (Figure 4) presents the variables encapsulated within each of the domain tables stored in the database.

Figure 4: Variables stored in each domain table within the database.

The Accident table encompassed information such as the date and time of the accident, the type of road (urban junction, non-urban junction) and the geographical coordinates of the accident site. In parallel, the Vehicle table contained data pertaining to the vehicle(s) implicated in the traffic accident, e.g., the vehicle type (regular, army, police, etc.), engine capacity, vehicle status (rented, stolen, etc.) and vehicle weight. Finally, the Injured Person table housed data relating to the individuals affected by the accident, specifically only the driver. This category included characteristics such as the severity of the sustained injuries (uninjured, fatal, serious, slight), gender and age group.
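The unified view over the three domain tables can be illustrated outside the database as well. The following is a minimal sketch in pandas, assuming tiny hypothetical tables and column names (`accident_id`, `road_type`, `vehicle_type`, `severity`, `age_group`); it is not the CBS schema or the SQL view used in the study.

```python
import pandas as pd

# Hypothetical miniature versions of the three domain tables.
# Column names are illustrative assumptions, not the CBS schema.
accident = pd.DataFrame({
    "accident_id": [1, 2, 3],
    "road_type": ["urban junction", "non-urban junction", "urban junction"],
})
vehicle = pd.DataFrame({
    "accident_id": [1, 2, 3],
    "vehicle_type": ["car", "motorcycle", "car"],
})
injured_person = pd.DataFrame({
    "accident_id": [1, 2, 3],
    "severity": ["slight", "serious", "slight"],
    "age_group": ["20-24", "14-19", "20-24"],
})

# Join the child tables to the Accident table on the shared key,
# mimicking what a logical view over the three tables would return.
# validate="one_to_one" raises an error if a key is duplicated.
unified = (
    accident
    .merge(vehicle, on="accident_id", how="inner", validate="one_to_one")
    .merge(injured_person, on="accident_id", how="inner", validate="one_to_one")
)
print(unified)
```

In this toy setting each accident has exactly one vehicle and one driver record; real data with several vehicles per accident would need a one-to-many join instead.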

·Source and coverage: Data were obtained from the Israel Central Bureau of Statistics (CBS) covering traffic accidents 2009-2019. The dataset contains N = 37,499 accident records linked across Accident, Injured Person and Vehicle tables.
·Variables: The raw dataset had 59 variables. After preprocessing and feature selection, 20 were retained for analysis. Exclusions (see Appendix A) were due to missingness >50%, low variance or collinearity.
· Target and class balance: Fatal = 512 (1.37%), Serious = 4,186 (11.16%), Slight = 32,801 (87.47%).
· Missing data: Numerical variables were imputed by the median and categorical variables by the mode. StandardScaler was applied to numerical features.
· Data availability: CBS microdata are available on request; code and a reproducible synthetic example are provided in the Supplementary material.
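The imputation and scaling steps listed above can be sketched as follows. This is a minimal illustration on a hypothetical four-row table; the column names and values are assumptions chosen for the example, not variables from the CBS dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing entries (names are assumptions).
df = pd.DataFrame({
    "engine_capacity": [1200.0, np.nan, 1600.0, 2000.0],
    "vehicle_weight": [1.1, 1.4, np.nan, 1.9],
    "road_surface": ["dry", "wet", None, "dry"],
})
num_cols = ["engine_capacity", "vehicle_weight"]
cat_cols = ["road_surface"]

# Numerical variables: fill gaps with the column median.
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical variables: fill gaps with the most frequent value (mode).
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Standardize numerical features to zero mean and unit variance.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df)
```

When a train/test split is used, the medians, modes and scaler statistics would normally be estimated on the training portion only and then applied to the test portion.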

4. Data Preparation and Feature Selection
This section elaborates on the second and third phases of the procedure. Data from the MS SQL server’s logical view were loaded into a pandas DataFrame to facilitate various pre-processing actions⁴⁶. The following procedures were then applied:
· Blank or NULL values for a specific feature were substituted with an appropriate average value (mean or median) for the dataset.
· Outliers were detected manually (by scrutinizing the data) and subsequently eliminated.
The next step was to reduce the size of the variable space to be used as an input to the machine learning algorithms. This was achieved via an array of feature selection methods, each of which is described below.

4.1. Variance threshold algorithm
In this approach, features were only included if their variance exceeded 0.5. The premise for this algorithm is that features with low variance offer limited modeling utility, with the recommendation being to adopt a threshold value approaching zero³⁰. Using this method, 27 features (as detailed in Table 2) were identified as the most likely candidates for demonstrating a robust correlation with the severity of injuries among young drivers.
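A minimal sketch of this filter, using scikit-learn's `VarianceThreshold` with the 0.5 cut-off stated above, is given here. The small numeric matrix is invented for illustration and does not come from the study data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix (assumption): column 0 is almost constant,
# columns 1 and 2 vary substantially.
X = np.array([
    [0, 1, 10],
    [0, 2, 20],
    [0, 3, 30],
    [1, 4, 40],
], dtype=float)

# Keep only the features whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=0.5)
X_reduced = selector.fit_transform(X)

# Indices of the retained columns.
kept = selector.get_support(indices=True)
print(selector.variances_, kept)
```

Because the filter looks only at the spread of each feature and never at the target, it is usually combined with supervised selectors such as those described in the following subsections.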

Table 2: Features selected using the variance threshold algorithm.

#	Variable	Values
Driver's Characteristics
1	Gender	Male, Female
2	Ethnic group	Jewish, Non-Jewish, Not specified or other
3	Age group	14-19, 20-24
4	Population setting	Rural background, Urban upbringing
Road characteristics
5	Road category	Highway, major district road, village road, other road, unknown
6	Maximum allowed speed	50 km/h, 60 km/h, 70 km/h, 80 km/h, 90 km/h, 100 km/h
7	Traffic control	No control, working traffic light, failed traffic light, blinking yellow, stop sign, priority sign, not specified
8	Road width	Up to 5 m, 5 to 7 m, 7 to 10 m, 10 to 14 m, over 14 m
9	Number of lanes on the road (in any direction)	1, 2, 4, 6, more than 6
10	Road signpost	Defective/missing signage, no signage required, signage intact, unknown
11	Road surface conditions	Dry, wet from water, wet from slippery material, covered with mud, covered with sand, not specified
12	Type of road	One-way road, two-way road with separation, two-way road without separation, not specified
13	Shoulders of the road	Paved shoulders, low shoulders, rough road (no tarmac or hard shoulder)
14	Shape of road	Entrance to an interchange, exit from an interchange, parking lot, steep slope, sharp curve, railroad junction, bus stop, public transport route, other
15	Illumination on the road	Daylight, night without illumination, night with illumination
Accident location characteristics
16	Area	Central, north, south
17	Location of the accident	Urban at a junction, urban not at an intersection, non-urban at an intersection, non-urban and not at a crossroads
18	District	Jerusalem, the north, Haifa, the center, Tel Aviv, the south, Judea and Samaria, Gaza envelope
Date/time characteristics
19	Day of the week	Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday
20	Day/night	Day, night
21	Period of the day	Morning peak, off-peak, afternoon peak, evening/night
Vehicle characteristics
23	Vehicle type	Bicycle, motorcycle up to 50 cc, motorcycle 51 to 250 cc, motorcycle 251 to 500 cc, motorcycle >501 cc, car, bus, cab, work vehicle, tractor, train, minibus, freight (>34.0 tons total weight)
24	Vehicle weight (tons)	Less than 1.9, 2.0-2.9, 3.0-3.5, 3.6-4.0, 4.1-5.9, 6.0-7.9, 8.0-9.9, 10.0-12.0, 12.1-12.9, 13.0-15.9, 16.0-19.0, 19.1-25.9, 26.0-30.0, 30.1-32.0, 32.1-33.9, 34.0-40.0, 40.1-56.0, ≥56.1
25	Use of safety accessories	Fastened seat belt, wore a protective helmet (motorcycle only), sat in a child seat (injured child only), did not use safety measures
26	Vehicle status	Regular, stolen, rented, transport student, transporting children
Weather characteristics
27	Weather conditions	Clear, rainy, hot, foggy, not specified

4.2. SelectKBest algorithm
In this technique, a designated function (e.g., chi-squared or other relevant statistical test) assigns a score to each feature. Subsequently, the k highest scoring features are retained³⁰. This approach aims to identify the k most informative features from the initial set. The value of k needs to be greater than 0 and cannot exceed the total number of features. For this study, a value of 20 was employed. The ensuing set of features was as follows: vehicle weight (with reference to the vehicle of the young driver), vehicle type (vehicle of the young driver), gender, age group, ethnic group, traffic control, use of safety accessories, road width, behavioral factors, day/night, road surface conditions, road shape, road signpost, population setting, maximum allowed speed, period of day, weather conditions, type of day, vehicle status and shoulders of road.
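The scoring-and-ranking idea can be sketched with scikit-learn's `SelectKBest`. The example below assumes a chi-squared score function and a synthetic integer-coded dataset with five features, of which two are constructed to depend on the class; the study itself used k = 20 on the real features, and the text does not state which score function was chosen.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 300

# Synthetic three-class target (assumption, stands in for injury severity).
y = rng.integers(0, 3, size=n)

# Two features that depend on the class and three pure-noise features.
# chi2 requires non-negative values, hence the integer coding.
informative_a = y + rng.integers(0, 2, size=n)
informative_b = 2 * y + rng.integers(0, 2, size=n)
noise = rng.integers(0, 4, size=(n, 3))
X = np.column_stack([informative_a, noise[:, 0], informative_b,
                     noise[:, 1], noise[:, 2]])

# Score every feature against y and keep the k best.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
kept = selector.get_support(indices=True)
print(selector.scores_.round(1), kept)
```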

4.3. SelectPercentile algorithm
The SelectPercentile approach is similar to SelectKBest, but instead of identifying the k most effective features, it retains a certain percentage of the features (again, based on their scores). The SelectPercentile algorithm returned the following set of features: road surface conditions, traffic control, maximum allowed speed, contributing factors, gender, age group, number of lanes on the road, road signpost, use of safety accessories, illumination on the road, vehicle weight, vehicle status, day of the week, location of the accident, district, driver’s ethnic group, type of road and road category.
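A comparable sketch for `SelectPercentile` follows. It assumes the ANOVA F-test (`f_classif`) as the score function and a 20% cut-off on ten synthetic features; neither the percentile nor the score function used in the study is stated in the text, so both are illustrative choices.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
n = 300

# Synthetic three-class target and ten Gaussian noise features (assumption).
y = rng.integers(0, 3, size=n)
X = rng.normal(size=(n, 10))

# Make the first two columns depend strongly on the class.
X[:, 0] += 2 * y
X[:, 1] -= 2 * y

# Retain the top 20% of features by score (2 of 10 here).
selector = SelectPercentile(score_func=f_classif, percentile=20).fit(X, y)
kept = selector.get_support(indices=True)
print(kept)
```

The only difference from `SelectKBest` is how the cut-off is expressed: as a share of the feature set rather than as a fixed count.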

4.4. Sequential feature selector algorithm
This algorithm employs a greedy approach to add (forward selection) or eliminate (backward selection) variables when constructing a subset of features. At each step, the algorithm strategically picks the best feature for addition or removal based on a cross-validation score produced by an estimator⁴⁶. When employed in unsupervised learning, this method exclusively considers the input features (X) without reference to the desired outputs (y)³⁰. Within this study, the sequential feature selector algorithm identified the following features: accident location, driver’s age group and gender, road signpost, day/night, road illumination, period of the day, district, number of road lanes, road shape, road width, road surface conditions, vehicle type, vehicle weight, maximum allowed speed, driver’s ethnic group, weather conditions and road shoulders.
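Forward selection can be sketched with scikit-learn's `SequentialFeatureSelector`. The estimator (logistic regression), the number of features to select and the 3-fold cross-validation below are assumptions made to keep the example fast; the text does not specify which estimator or settings were used in the study.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300

# Synthetic three-class target and six Gaussian features (assumption).
y = rng.integers(0, 3, size=n)
X = rng.normal(size=(n, 6))

# Columns 0 and 3 carry the class signal; the rest are noise.
X[:, 0] += 3 * y
X[:, 3] -= 3 * y

# Greedy forward selection: start from an empty set and repeatedly add
# the feature that most improves the cross-validated score.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=3,
)
selector.fit(X, y)
kept = selector.get_support(indices=True)
print(kept)
```

Setting `direction="backward"` would instead start from all features and remove the least useful one at each step.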

To combine the outcomes of the aforementioned algorithms, we selected the 20 features with the highest occurrence (i.e., those that appeared most often across the four feature selection methods). This resulted in the definitive list presented in Table 3, which served as the input data for the machine-learning classification algorithms assessed in this study.
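The occurrence-based combination can be sketched as a simple vote count. The four short lists below are hypothetical stand-ins for the outputs of the four selection methods, and the tie-breaking rule (first feature encountered wins) is an assumption, since the text does not describe how ties were resolved.

```python
from collections import Counter

# Hypothetical outputs of the four selection methods (illustrative only).
selections = {
    "variance_threshold": ["gender", "age group", "road width", "district"],
    "select_k_best": ["gender", "age group", "road width", "weather conditions"],
    "select_percentile": ["gender", "age group", "district", "type of road"],
    "sequential": ["gender", "age group", "road width", "district"],
}

# Count in how many methods each feature appears.
votes = Counter(
    feature for features in selections.values() for feature in features
)

# Keep the features with the most votes (top 4 here; the study kept 20).
top_features = [name for name, _ in votes.most_common(4)]
print(votes)
print(top_features)
```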

Table 3: List of features employed as inputs to the classification algorithms.

#	Variable	Values
Driver's Characteristics
1	Gender	Male, female
2	Age group	14-19, 20-24
3	Population setting	Rural background, urban upbringing
4	Ethnic group	Jewish, non-Jewish, not specified or other
Road characteristics
5	Road surface conditions	Dry, wet from water, wet from slippery material, covered with mud, covered with sand, not specified
6	Traffic control	No control, working traffic light, failed traffic light, blinking yellow, stop sign, priority sign, not specified
7	Maximum allowed speed	50 km/h, 60 km/h, 70 km/h, 80 km/h, 90 km/h, 100 km/h
8	Contributing factors	Aggressive/impaired driving, cause not known, defect in road condition, drunk driver, fault of young driver, fault of the other driver, others
9	Number of lanes on the road (in any direction)	1, 2, 4, 6, more than 6
10	Road signpost	Defective/missing signage, signage intact, a signpost is required (lacking, not misplaced or faulty), unknown
11	Type of road	One-way road, two-way road with separation, two-way road without separation, not specified
12	Use of safety accessories	Fastened seat belt, wore a protective helmet (motorcycle only), sat in a child seat (injured child only), did not use safety measures
13	Illumination on the road	Daylight, night without illumination, night with illumination
14	Road category	Highway, major district road, village road, other road, unknown
Accident location characteristics
15	Location of the accident	Urban at a junction, urban not at an intersection, non-urban at an intersection, non-urban and not at a crossroads
16	District	Jerusalem, the north, Haifa, the center, Tel Aviv, the south, Judea and Samaria, Gaza envelope
Vehicle characteristics
17	Vehicle type	Bicycle, motorcycle up to 50 cc, motorcycle 51 to 250 cc, motorcycle 251 to 500 cc, motorcycle >501 cc, car, bus, cab, work vehicle, tractor, train, minibus, freight (>34.0 tons total weight)
18	Vehicle weight (tons)	Less than 1.9, 2.0-2.9, 3.0-3.5, 3.6-4.0, 4.1-5.9, 6.0-7.9, 8.0-9.9, 10.0-12.0, 12.1-12.9, 13.0-15.9, 16.0-19.0, 19.1-25.9, 26.0-30.0, 30.1-32.0, 32.1-33.9, 34.0-40.0, 40.1-56.0, ≥56.1
Date/time characteristics
19	Day/night	Day, night
Weather characteristics
20	Weather conditions	Clear, rainy, hot, foggy, not specified

5. Assessment of Machine Learning Models

In machine learning, a multitude of classification models can be implemented using a diverse range of algorithms. In studies that apply machine learning to practical problems, the models and algorithms are frequently chosen without rigid selection criteria. In this study, a comprehensive investigation was undertaken involving widely recognized machine learning algorithms previously employed for predicting accident severity and cutting-edge algorithms that have not yet achieved widespread adoption. The models included logistic regression, logistic regression CV, gradient boosting classifier, support vector machine (SVM), linear support vector classification (linear SVC), naive Bayes classifier, Gaussian naive Bayes, ridge classifier, ridge classifier CV, decision tree classifier, random forest classifier, extra tree classifier, perceptron algorithm, K-nearest neighbors, XGBoost and a bagging classifier. For each of these models, the goal was to perform multiclass classification, delineating three tiers of injury severity: fatal, serious and slight. Thus, for each data sample (which consists of a set of values for the features in Table 3 corresponding to a single accident), the algorithms assigned the observation to a specific class. The performance of each candidate algorithm was evaluated by generating a classification report containing the metrics accuracy, precision, recall and F1 score. The model deemed the “best” among those studied was the one that received the highest values across these four metrics. In addition, the classification report incorporated a support score, which was consistent across models and is a property of the data. The following list explains the five metrics documented within the classification report.

· Accuracy signifies the proportion of all instances whose predicted class (slight, serious or fatal) matches the actual class.
· Precision denotes the ratio between the correctly predicted instances for a particular class and all the instances predicted to belong to that class, averaged across the three classes.
· Recall indicates the number of accurately predicted instances of a specific class as a proportion of the actual instances of that class.
· F1 score represents the harmonic mean of precision and recall, providing a balance between these two metrics.
·Support denotes the number of actual instances of a specific class (e.g., fatal cases). Disparities in support could potentially indicate imbalances in the dataset, requiring rebalancing or sampling techniques.
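The five quantities above can be computed with scikit-learn's metric functions. The following sketch uses ten hand-made labels (an assumption for illustration, not study data) so that each value can be checked by hand.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_recall_fscore_support)

labels = ["fatal", "serious", "slight"]

# Ten hypothetical accidents: true and predicted severity classes.
y_true = ["slight"] * 6 + ["serious"] * 3 + ["fatal"]
y_pred = (["slight"] * 5 + ["serious"]      # one slight case mislabeled
          + ["serious"] * 2 + ["slight"]    # one serious case mislabeled
          + ["fatal"])

# Overall accuracy: share of all instances labeled correctly.
accuracy = accuracy_score(y_true, y_pred)

# Per-class precision, recall, F1 and support, in the order of `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

print(accuracy, macro_f1)
print(dict(zip(labels, zip(precision, recall, f1, support))))
```

With a rare class such as fatal injuries, macro F1 is more sensitive than accuracy to errors on that class, because every class contributes equally to the average.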

The scikit-learn package facilitated the construction of machine-learning models and the generation of classification reports. The data were separated into two distinct sets: one for training (80% of the dataset) and another for testing (20%). All data manipulations were confined to the training dataset, while the testing dataset was reserved for evaluation, leading to the creation of the classification reports. The resultant performance metrics are presented in (Table 4).
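The split-train-report cycle can be sketched as follows. The data are synthetic (generated with `make_classification` under assumed class proportions), the stratified split is an assumption not stated in the text, and the resulting scores have no relation to the values in Table 4.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced three-class data standing in for the 20 features.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, n_classes=3,
    weights=[0.05, 0.15, 0.80], random_state=0,
)

# 80% training, 20% testing; stratify keeps class shares equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training part only, then report on the held-out part.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
report = classification_report(
    y_test, model.predict(X_test),
    target_names=["fatal", "serious", "slight"], zero_division=0,
)
print(report)
```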

Table 4: Accuracy, precision, recall and F1 scores for the candidate classification algorithms.

#	Classification algorithm name	Accuracy	Precision	Recall	F1 score
1	Logistic Regression	0.95000	0.95000	0.95000	0.95000
2	Logistic Regression CV	0.95000	0.95000	0.95000	0.95000
3	Gradient Boosting Classifier	0.95000	0.95000	0.95000	0.95000
4	SVM	0.95000	0.95000	0.95000	0.95000
5	Linear SVC	0.95000	0.95000	0.95000	0.95000
6	Naive Bayes Classifiers	0.90000	0.91000	0.90000	0.90000
7	Gaussian Naive Bayes	0.91000	0.91000	0.92000	0.91000
8	Ridge Classifier	0.95000	0.95000	0.95000	0.95000
9	Ridge Classifier CV	0.95000	0.95000	0.95000	0.95000
10	Decision Tree Classifier	0.95000	0.95000	0.95000	0.95000
11	Random Forest Classifier	0.97000	0.97000	0.97000	0.97000
12	Extra Tree Classifier	0.98453	0.98000	0.98000	0.9321
13	Perceptron Algorithm	0.94000	0.94000	0.94000	0.94000
14	K-nearest Neighbors	0.94000	0.94000	0.94000	0.94000
15	XGBoost	0.96000	0.97000	0.97000	0.96000
16	Bagging classifier	0.96000	0.96000	0.96000	0.96000

Based on the metrics presented in Table 4, the extra tree classifier was identified as the best machine-learning approach for predicting the injury severity of young drivers involved in road traffic accidents in Israel. To further improve the extra tree classifier, GridSearchCV from the scikit-learn package³⁰ was applied. This tool searches a user-specified grid for the hyperparameter values that give the best cross-validated score for a given classifier. The number of trees was searched over the range 10 to 500 and the best value was found to be 50. Subsequently, GridSearchCV was used to tune the number of samples required at a tree node before the node can be split further (the ‘min_samples_split’ parameter). Values of min_samples_split from 2 to 15 were tested and a value of 10 samples resulted in the highest accuracy. These parameter adjustments raised the accuracy of the extra tree classifier from 0.98037 to 0.98453.
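A search of this kind can be sketched with `GridSearchCV`. The example below uses synthetic data and a deliberately coarse grid that samples the stated ranges (10 to 500 trees, `min_samples_split` from 2 to 15) to keep the run short; it also tunes both parameters jointly, which is a simplification of the two-stage search described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced three-class data (assumption, not the CBS data).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, weights=[0.1, 0.2, 0.7], random_state=0,
)

# Coarse grid sampled from the ranges mentioned in the text.
param_grid = {
    "n_estimators": [10, 50, 100],
    "min_samples_split": [2, 5, 10, 15],
}

# Every grid point is scored by stratified 5-fold cross-validation.
search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The best parameters found on synthetic data are not expected to match the values reported for the study.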

We evaluated Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Extra Trees, Gradient Boosting (XGBoost), Support Vector Machine and Naïve Bayes. Hyperparameters were optimized with GridSearchCV or RandomizedSearchCV using stratified 5-fold cross-validation; the hyperparameter grids are shown in Supplementary Table S2. Class imbalance was handled with the class_weight option and with SMOTE experiments. Python 3.9, scikit-learn 1.2+ and XGBoost 1.6+ were used (see the Supplementary material for exact versions).
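As a hedged sketch of this evaluation protocol, the snippet below runs stratified 5-fold cross-validation on an Extra Trees model with class_weight="balanced" and reports accuracy and macro F1. The imbalanced data set is synthetic, the 90/10 class ratio is an assumption, and the SMOTE experiments are omitted because they need the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the accident data (assumed 90/10 class split).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the class proportions roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency.
model = ExtraTreesClassifier(
    n_estimators=50, min_samples_split=10, class_weight="balanced", random_state=0
)

scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_macro"])
print("accuracy:", scores["test_accuracy"].mean())
print("macro F1:", scores["test_f1_macro"].mean())
```

Macro F1 averages the per-class F1 scores without weighting by class size, so it is more sensitive than accuracy to errors on the minority class.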

6. Discussion

6.1. Contribution of the study
The primary objective of this study was to identify and build the most accurate model for predicting injury severity among young drivers involved in vehicle accidents in Israel. We found that the extra tree classifier, a member of the decision-tree algorithm family, delivered the best classification performance. Previous studies in a similar context have often relied on logistic regression⁵,⁸,¹⁰. Given that the extra tree classifier is a relatively novel algorithm, it is plausible that researchers have yet to accumulate experience in applying it to the problem domain addressed in this study. The merit of tree-based learning algorithms lies in their capacity to be trained on extensive datasets and to accommodate both quantitative and qualitative input variables. Moreover, tree-based models handle redundant and highly correlated variables well, which mitigates the overfitting risks encountered in alternative learning algorithms. Trees are also simple: they require little parameter tuning during training and are resilient to outliers and missing data values. When the variance between the explanatory and noise variables is high, logistic regression consistently achieves higher overall accuracy than forest classifiers. However, forest classifiers outperform logistic regression in terms of true positive rates and they also show a lower false positive rate when the noise variables are large⁴⁵.

The findings support the general conclusions of the literature review concerning the importance of factors such as the road surface, signposts, road illumination, number of lanes, road shape, vehicle weight, road surface conditions, accident location and maximum allowed speed. Other pertinent factors include the age of the young driver, weather conditions, the setting (rural vs. urban roads), the amount of driving experience and the gender of the young driver, all of which have been highlighted as significant variables in prior research¹⁵,¹⁶,¹⁸⁻²⁰,²²⁻²⁴,²⁷,³⁰. Notably, however, the present study did not find a substantial influence of driver alcohol consumption, in contrast to the established significance of this variable in driver fatality studies carried out in the United States and European countries²⁵,²⁶.

6.2. Constraints and areas for future investigation
This study did not explore several further classifiers, including ensemble methods such as the voting classifier, stacking classifier and histogram-based gradient boosting classifier, as well as the passive-aggressive classifier and nearest centroid classifier⁴⁰. Ensemble methods are designed to improve generalization and robustness relative to individual estimators. Additionally, this research did not employ hybrid machine-learning models, which combine computations, methods or processes from similar or disparate data domains or application areas so that each component improves the performance of the others.
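To make the proposed direction concrete, the following sketch shows how voting and stacking ensembles could be built on top of classifiers of the kind already evaluated. It is illustrative only: the data are synthetic, and the choice of base learners, soft voting and a logistic-regression meta-learner are assumptions rather than results from this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=500, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical set of base learners; both ensembles clone them before fitting.
base = [
    ("extra_trees", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
]

# Soft voting averages the predicted class probabilities of the base learners.
voting = VotingClassifier(estimators=base, voting="soft").fit(X_train, y_train)

# Stacking trains a meta-learner on out-of-fold predictions of the base learners.
stacking = StackingClassifier(
    estimators=base, final_estimator=LogisticRegression(max_iter=1000), cv=5
).fit(X_train, y_train)

voting_acc = voting.score(X_test, y_test)
stacking_acc = stacking.score(X_test, y_test)
print("voting accuracy:", voting_acc)
print("stacking accuracy:", stacking_acc)
```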

Notably, the authors have previously published papers that developed machine-learning-based models to diminish the severity of pedestrian and bicyclist injuries in road traffic incidents⁴⁶,⁴⁷. Hence, the studies may be similar in the models used, data source, software and technical terms. Nevertheless, the current paper makes a worthwhile contribution by developing machine-learning models to classify the severity of road traffic injuries among young drivers.

Metrics are reported with 4 significant digits. Extra Trees achieved accuracy = 0.98453 and macro F1 = 0.9321. Stability was assessed across 10 random seeds, and McNemar's test was used to compare the top models. Interpretability was examined through permutation importance and SHAP plots (Figure 3). An ablation study confirmed robustness when the top-3 features were removed.
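The model comparison and interpretability checks mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the helper mcnemar_exact is a hypothetical name implementing the exact (binomial) form of McNemar's test, and comparing Extra Trees with Random Forest is only an example of two top models.

```python
from math import comb

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value based on the discordant predictions."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    b = int(np.sum(a_ok & ~b_ok))  # model A correct, model B wrong
    c = int(np.sum(~a_ok & b_ok))  # model A wrong, model B correct
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    # Binomial(n, 0.5) lower tail up to min(b, c), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

et = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

p_value = mcnemar_exact(y_test, et.predict(X_test), rf.predict(X_test))
print("McNemar p-value:", p_value)

# Permutation importance: drop in test score when one feature column is shuffled.
result = permutation_importance(et, X_test, y_test, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by importance:", ranking)
```

A small p-value suggests that the two models differ in the test cases they get wrong, not merely in overall accuracy.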

7. Conclusion

This study presents a robust machine-learning framework for predicting injury severity among young drivers in Israel. By leveraging advanced algorithms and rigorous selection criteria, we demonstrate that the Extra Trees Classifier achieves superior performance. The study contributes novel insights into feature importance and model interpretability, offering valuable guidance for traffic safety interventions. Future work should explore ensemble and hybrid models to further enhance predictive capabilities.

8. Author Contributions
Conceptualization, Slava Birfir; methodology, Slava Birfir; software, Slava Birfir; validation, Slava Birfir, Amir Elalouf and Tova Rosenbloom; formal analysis, Slava Birfir; investigation, Slava Birfir; resources, Slava Birfir; data curation, Amir Elalouf and Tova Rosenbloom; writing (original draft preparation), Slava Birfir; writing (review and editing), Amir Elalouf and Tova Rosenbloom; visualization, Slava Birfir; supervision, Amir Elalouf and Tova Rosenbloom; project administration, Slava Birfir. All authors have read and agreed to the published version of the manuscript.

9. Funding

The authors listed below affirm that they have no affiliations with or engagement in any organization or entity with financial interests (including honoraria, educational grants, participation in speakers' bureaus, membership, employment, consultancies, stock ownership, equity interests, expert testimony or patent-licensing arrangements) or non-financial interests (such as personal or professional relationships, affiliations, knowledge or beliefs) related to the subject matter or materials discussed in this manuscript.

10. Conflicts of Interest

The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership or other equity interest; and expert testimony or patent-licensing arrangements) or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.

11. Data Availability Statement

The data used in this article were provided by the Central Bureau of Statistics of Israel. Under the Bureau's policy, we are unable to include any data directly in the article. However, those interested in accessing the data may contact the Central Bureau of Statistics by emailing [email protected] or by filling out the request form available at https://www.cbs.gov.il/he/subjects/Pages/%D7%AA%D7%97%D7%91%D7%95%D7%A8%D7%94.aspx. The data are not publicly available but can be accessed upon request. Please refer to the attached document, Confirmation Regarding Data Availability Restrictions.doc, for further details.

12. References

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Appendix A: Feature Selection Summary
