Machine Learning to predict car accident severity

James Sopkin
Sep 25, 2020 · 5 min read

There are roughly 6 million car accidents in the United States annually. The CDC lists unintentional injuries among the top five causes of death, with car accidents making up a significant portion of those injuries.

I recently found a dataset containing car accident data on Kaggle and wanted to create models that could accurately predict the severity of each car accident.

Data wrangling and cleaning

The first step to creating a predictive model is data wrangling, a process in which the dataset is transformed to make it more usable. Initially, the dataset contained nearly 50 columns, or “features”, that could be used to make predictions. These features ranged from GPS coordinates, to the time of the accident, to the presence of stop signs. High-cardinality features, features containing many null or NaN values, and features that contained only one unique value were dropped. This greatly improved the efficiency and runtime of my models without much sacrifice in accuracy.
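A minimal sketch of this pruning step, using pandas (the column names and thresholds here are illustrative, not the exact ones from the dataset):

```python
import numpy as np
import pandas as pd

def prune_features(df, max_cardinality=50, max_null_frac=0.5):
    """Drop constant, mostly-null, and high-cardinality categorical columns."""
    drop = []
    for col in df.columns:
        if df[col].nunique(dropna=True) <= 1:            # only one unique value
            drop.append(col)
        elif df[col].isna().mean() > max_null_frac:      # mostly null/NaN
            drop.append(col)
        elif df[col].dtype == object and df[col].nunique() > max_cardinality:
            drop.append(col)                             # high cardinality
    return df.drop(columns=drop)

# Toy frame with one column of each problem type
df = pd.DataFrame({
    "city": [f"c{i}" for i in range(100)],               # high cardinality
    "state": ["CA"] * 100,                               # constant
    "temp": [np.nan] * 80 + list(range(20)),             # mostly null
    "severity": [1, 2] * 50,
})
print(prune_features(df).columns.tolist())  # ['severity']
```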

Next, I converted the start time of each accident and the end time of the traffic disruption it caused into datetime format. From there, I was able to create new features. I added columns for the year, month, day, and hour that the accidents occurred. I then added columns for the day of the week the accident occurred, the season it occurred in, and the overall length of the traffic disruption caused by each accident.
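A sketch of this feature engineering with pandas (the column names `Start_Time` and `End_Time` are assumptions about the dataset's schema):

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Time": ["2019-02-08 07:15:00", "2019-07-04 18:30:00"],
    "End_Time":   ["2019-02-08 08:00:00", "2019-07-04 19:00:00"],
})
df["Start_Time"] = pd.to_datetime(df["Start_Time"])
df["End_Time"] = pd.to_datetime(df["End_Time"])

# Calendar features derived from the accident start time
df["Year"] = df["Start_Time"].dt.year
df["Month"] = df["Start_Time"].dt.month
df["Day"] = df["Start_Time"].dt.day
df["Hour"] = df["Start_Time"].dt.hour
df["Weekday"] = df["Start_Time"].dt.day_name()

# Map month to meteorological season
season = {12: "Winter", 1: "Winter", 2: "Winter",
          3: "Spring", 4: "Spring", 5: "Spring",
          6: "Summer", 7: "Summer", 8: "Summer",
          9: "Fall", 10: "Fall", 11: "Fall"}
df["Season"] = df["Month"].map(season)

# Total traffic disruption in minutes
df["Disruption_Minutes"] = (df["End_Time"] - df["Start_Time"]).dt.total_seconds() / 60
```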

I wanted to make sure that I picked the best features for my predictive models, so I computed permutation importances to rank them. If a feature has an extremely high importance, that may indicate data leakage, but that does not appear to be true of any of my top features. I dropped any feature with a permutation importance of zero from my dataset.
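A sketch of this selection step using scikit-learn's `permutation_importance` on synthetic data (the model and data here are stand-ins, not the article's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the validation set and measure the score drop;
# a mean drop of zero (or less) means the feature carries no signal.
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=0)
keep = [i for i, imp in enumerate(result.importances_mean) if imp > 0]
```

Computing importances on the validation set (rather than training data) is what makes an unusually large value a red flag for leakage.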

Finally, I removed some outliers from the traffic disruption column. Initially, the data was heavily skewed, with traffic disruption reaching up to 700,000 minutes despite the mean disruption lasting less than an hour. Below is a histogram of the traffic disruption before and after removing the upper and lower 0.5% of the data.
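This trimming can be sketched with pandas quantiles (the synthetic exponential data here just mimics the skew described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Skewed disruption times: mean under an hour, plus a few extreme outliers
disruption = pd.Series(rng.exponential(scale=45, size=10_000))
disruption.iloc[:5] = 700_000

# Keep only values between the 0.5th and 99.5th percentiles
lo, hi = disruption.quantile([0.005, 0.995])
trimmed = disruption[(disruption >= lo) & (disruption <= hi)]
```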

Before and after filtering out outliers

Model Building

Now that our data is cleaned and wrangled, it’s time to build our model. First, let’s establish a baseline accuracy so we can judge how effective our predictive models are. Since we are predicting whether an accident is severe or not, we should use a classifier. A baseline accuracy for a classification problem can be established by setting every prediction to the most frequent class.

Since the majority of accidents are not severe (68.43%), that is our baseline accuracy.
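A quick sketch of the majority-class baseline (the class counts below are chosen to reproduce the 68.43% figure):

```python
import numpy as np

# 0 = not severe (majority class), 1 = severe
y = np.array([0] * 6843 + [1] * 3157)

# Predict the most frequent class for every observation
majority = np.bincount(y).argmax()
baseline_pred = np.full_like(y, majority)
baseline_acc = (baseline_pred == y).mean()
print(f"{baseline_acc:.2%}")  # 68.43%
```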

Linear Model

Our first model is a logistic regression with an ordinal encoder and a simple imputer. A prediction of true indicates a severe accident, and false indicates a non-severe accident. While the model is quite good at predicting when an accident is not severe, it is very poor at predicting when one is severe: it incorrectly predicted a non-severe accident 5,202 times, compared to the 914 times it correctly predicted a severe accident. That gives a false negative rate of 5,202/6,116 = 85.06%, leaving a true positive rate of only 14.94%. The overall accuracy of this model was 71.16%, not much higher than our baseline.
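A minimal sketch of such a pipeline with scikit-learn, assuming the features are categorical or mixed with missing values (the toy `Weather`/`Hour` columns are illustrative, not the dataset's real schema):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({
    "Weather": ["Rain", "Clear", np.nan, "Snow", "Clear", "Rain"] * 20,
    "Hour":    [7, 12, 18, 7, np.nan, 23] * 20,
})
y = np.array([1, 0, 0, 1, 0, 1] * 20)  # 1 = severe

# Fill missing values, then ordinal-encode categories for the linear model
pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X, y)
acc = pipe.score(X, y)
```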

Tree Based Classifiers

I then decided to create a tree-based model to boost the accuracy of my predictions. With a decision tree classifier, I saw a slight increase in validation accuracy, but the model was definitely overfitting the data. A random forest classifier further boosted validation accuracy, but was still overfitting.
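The overfitting described above shows up as a gap between training and validation accuracy, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

results = {}
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    # An unconstrained tree memorizes training data (train accuracy ~1.0)
    results[type(model).__name__] = (model.score(X_train, y_train),
                                     model.score(X_val, y_val))

for name, (train_acc, val_acc) in results.items():
    print(f"{name}: train={train_acc:.3f} val={val_acc:.3f}")
```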

Decision Tree vs Random Forest Accuracy

XGBClassifier

The final predictive model I used was the “eXtreme Gradient Boosting” model. It further increased validation accuracy while reducing overfitting. This model has an ROC AUC score of 0.88, indicating that it can differentiate between severe and non-severe accidents very effectively. As you can see in the plot below, the blue ROC curve shows the tradeoff between the true positive rate and the false positive rate (a perfect ROC AUC score of 1 would mean the model can achieve a 100% true positive rate with a 0% false positive rate).

As a basic visualization of how my XGB classifier arrives at its predictions, I created a SHAP plot to demonstrate how each feature contributes a certain amount to each prediction.

Conclusion

The final model was an XGBoost classifier. After data wrangling and model fitting, it achieved an accuracy of about 82% with relatively little overfitting. The model could be improved further through hyperparameter tuning, but the result is satisfactory.
