CrashRisk-QLD — Fatal binary classifier

LightGBM binary classifier estimating P(fatal) given pre-/at-crash conditions in a Queensland road crash. Severe class imbalance (1.90% positive on the held-out test year); we evaluate primarily on PR-AUC and report a full threshold sweep.

© State of Queensland (Department of Transport and Main Roads) 2026. Licensed CC-BY 4.0. Source: data.qld.gov.au, dataset 'Crash data from Queensland roads', version rqC45037 (2025-06), retrieved 2026-04-30.

Disclaimer. This is a population-level statistical model trained on publicly reported crash data. It is NOT suitable for individual driver risk assessment, insurance underwriting, pre-incident law enforcement targeting, or any decision with legal or financial consequence to an individual. Use it for research, road-safety analysis, and education.

Quick start

from huggingface_hub import hf_hub_download
import joblib, lightgbm as lgb

REPO = "Mattysmittttt/crashrisk-qld-fatal"
booster = lgb.Booster(model_file=hf_hub_download(REPO, "model.txt"))
pre = joblib.load(hf_hub_download(REPO, "preprocessor.joblib"))
# pre.transform(X)  →  booster.predict(...)  →  P(fatal)
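The commented pipeline on the last line can be spelled out roughly as follows. This is a minimal sketch, not the author's reference code: it assumes `pre` exposes a scikit-learn-style `transform`, and that `X` is your own table holding the raw columns the preprocessor was fitted on (this card does not list them). `predict_fatal_proba` and `flag_above` are hypothetical helper names, and 0.062 is simply the best-F1 threshold reported in the evaluation section.

```python
def predict_fatal_proba(pre, booster, X):
    """Return P(fatal) for each row of X.

    Assumes `pre.transform` yields the feature matrix the booster was
    trained on. With a binary objective, LightGBM's `Booster.predict`
    returns probabilities by default.
    """
    features = pre.transform(X)
    return booster.predict(features)


def flag_above(probs, threshold):
    """Turn probabilities into boolean flags at a chosen threshold."""
    return [float(p) >= threshold for p in probs]


# Usage with the objects loaded in the quick start (X is your own data):
# p_fatal = predict_fatal_proba(pre, booster, X)
# flags = flag_above(p_fatal, threshold=0.062)
```

Keeping the threshold as an explicit argument makes it harder to fall back on 0.5 by accident.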

Intended uses

Same as the multi-class severity head: research, education, descriptive analysis. Pair the two heads — the severity head gives a full distribution, the fatal head gives a sharper imbalanced-binary signal.

Out of scope

Insurance underwriting — using population-level statistical patterns to set individual premiums creates fairness concerns and is outside this model's intended scope.
Individual driver risk assessment — these features describe road conditions and aggregate vehicle context, not driver behaviour or identity.
Pre-incident law enforcement targeting — geographic patterns may reflect reporting biases as much as actual risk; using them to pre-target locations creates feedback loops.
Any decision with legal or financial consequence to a single individual — full stop.

Training data

See the dataset card.

Train: 152,842
Test: 14,358 (positives: 273, rate: 1.901%)
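The imbalance figures follow directly from the test counts. The short check below reproduces the positive rate and the negatives-per-positive ratio. Note that it uses the test year only; treating that ratio as a proxy for the training split is an assumption, since the training positive count is not given here.

```python
n_test = 14_358
n_pos = 273
n_neg = n_test - n_pos

positive_rate = n_pos / n_test   # share of fatal crashes in the test year
neg_per_pos = n_neg / n_pos      # non-fatal crashes per fatal crash

print(f"positive rate: {positive_rate:.3%}")        # positive rate: 1.901%
print(f"negatives per positive: {neg_per_pos:.1f}")  # negatives per positive: 51.6
```

A ratio of about 52 to 1 is the same order of magnitude as the ~50× class weight discussed under Training details.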

Training details

Objective: binary log-loss (no scale_pos_weight — at ~50× it caused early stopping at iteration 1; imbalance is handled at threshold-selection time via the threshold sweep instead).
Optuna: 20 trials, TPE sampler, early stopping on validation log-loss
Best iteration: 23
Best params:

{
  "num_leaves": 83,
  "learning_rate": 0.05082341959721458,
  "feature_fraction": 0.6563696899899051,
  "bagging_fraction": 0.9208787923016158,
  "bagging_freq": 0,
  "min_data_in_leaf": 480,
  "lambda_l1": 0.08916674715636552,
  "lambda_l2": 6.143857495033091e-07
}

Features after preprocessing: 155
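For readers who want to reuse the tuned values, the fragment below merges them into a LightGBM parameter dictionary. It is a sketch under assumptions: `"objective": "binary"` follows from the binary log-loss objective stated above, while `"metric": "binary_logloss"` is inferred from the early-stopping description and is not confirmed by the card. One detail worth knowing: in LightGBM, `bagging_freq = 0` disables bagging, so the tuned `bagging_fraction` has no effect in this configuration.

```python
import json

BEST_PARAMS_JSON = """
{
  "num_leaves": 83,
  "learning_rate": 0.05082341959721458,
  "feature_fraction": 0.6563696899899051,
  "bagging_fraction": 0.9208787923016158,
  "bagging_freq": 0,
  "min_data_in_leaf": 480,
  "lambda_l1": 0.08916674715636552,
  "lambda_l2": 6.143857495033091e-07
}
"""

best = json.loads(BEST_PARAMS_JSON)

# Objective from the card; metric is an assumption (see lead-in).
params = {"objective": "binary", "metric": "binary_logloss", **best}

# Bagging only runs when bagging_freq > 0 and bagging_fraction < 1.0.
bagging_active = params["bagging_freq"] > 0 and params["bagging_fraction"] < 1.0
print("bagging active:", bagging_active)  # bagging active: False
```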

Evaluation (held-out test = 2024)

PR-AUC (primary): 0.0737 [95% CI 0.0579, 0.0958]
ROC-AUC: 0.7858 [0.7623, 0.8093] (sanity check: must NOT exceed 0.95 — observed: ok)
Best-F1 operating point: precision = 0.092, recall = 0.282, F1 = 0.139 [95% CI 0.112, 0.165] (threshold = 0.0620)
Recall @ Precision = 0.20: 0.026 (threshold = 0.1744)
Recall @ Precision = 0.50: 0.007 (threshold = 0.2448)
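Two pieces of context help when reading these numbers. First, a random ranking has an expected PR-AUC close to the positive rate, so 0.0737 should be compared with about 0.019 rather than with 0.5. Second, the card does not say how the intervals were computed; a percentile bootstrap over test rows is one common choice and is sketched here purely as an assumption. The code is dependency-free for illustration, and `average_precision` and `bootstrap_ci` are hypothetical helper names.

```python
import random


def average_precision(y_true, scores):
    """Average precision: mean of precision@k over the ranks k of the positives."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank by descending score; ties keep input order (sorted() is stable).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank
    return total / n_pos


def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for average precision."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if sum(ys) == 0:
            continue  # resample drew no positives; AP is undefined
        stats.append(average_precision(ys, [scores[i] for i in idx]))
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Chance-level PR-AUC is roughly the prevalence on the test year.
prevalence = 273 / 14_358
lift = 0.0737 / prevalence  # roughly 3.9x over a random ranking
```

With only 273 positives, resampled estimates vary a lot, which is consistent with the fairly wide interval reported for PR-AUC.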

Threshold sweep (compact)

The full table is available as the artifact threshold_table.csv; the rows below are a compact subset.

threshold	precision	recall	f1
0.006	0.019	1.000	0.037
0.007	0.020	0.996	0.040
0.007	0.021	0.993	0.042
0.007	0.022	0.989	0.043
0.007	0.023	0.989	0.045
0.008	0.024	0.985	0.047
0.008	0.025	0.982	0.048
0.008	0.025	0.974	0.050
0.008	0.026	0.974	0.051
0.008	0.027	0.974	0.053
0.008	0.028	0.974	0.054
0.009	0.028	0.963	0.055
0.009	0.029	0.960	0.056
0.009	0.030	0.956	0.058
0.009	0.030	0.941	0.059
0.010	0.031	0.930	0.060
0.010	0.032	0.927	0.061
0.010	0.032	0.916	0.062
0.010	0.033	0.912	0.064
0.011	0.034	0.901	0.065
0.011	0.035	0.890	0.067
0.011	0.036	0.886	0.069
0.012	0.037	0.875	0.071
0.012	0.037	0.850	0.071
0.013	0.038	0.839	0.073
0.013	0.040	0.835	0.076
0.014	0.041	0.821	0.077
0.014	0.041	0.799	0.078
0.015	0.042	0.777	0.080
0.016	0.043	0.747	0.080
0.016	0.044	0.736	0.083
0.017	0.044	0.689	0.082
0.018	0.044	0.656	0.082
0.020	0.045	0.637	0.085
0.021	0.048	0.630	0.089
0.022	0.049	0.597	0.090
0.024	0.050	0.575	0.093
0.026	0.051	0.538	0.093
0.028	0.054	0.516	0.097
0.030	0.058	0.509	0.104
0.033	0.060	0.473	0.106
0.036	0.064	0.447	0.111
0.039	0.066	0.403	0.113
0.044	0.070	0.370	0.118
0.049	0.077	0.337	0.125
0.056	0.085	0.297	0.132
0.067	0.092	0.242	0.133
0.083	0.088	0.154	0.112
0.111	0.117	0.103	0.109
1.000	1.000	0.000	0.000

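Pick a threshold for your use case rather than relying on the default 0.5, which is rarely optimal under heavy class imbalance. Here, precision 0.50 is already reached at a threshold of 0.2448 with recall 0.007, so a 0.5 cutoff would catch at most 0.7% of fatal crashes.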
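Choosing an operating point from the sweep can be automated. The sketch below parses a few rows copied from the compact table and picks either the best-F1 row or the highest-recall row that meets a precision floor. It assumes the column names in threshold_table.csv match the header shown above; for a comma-separated file, pass `delimiter=","`. The helper names are hypothetical.

```python
import csv
import io

# A few rows copied from the compact sweep above (tab-separated).
SWEEP = (
    "threshold\tprecision\trecall\tf1\n"
    "0.030\t0.058\t0.509\t0.104\n"
    "0.049\t0.077\t0.337\t0.125\n"
    "0.056\t0.085\t0.297\t0.132\n"
    "0.067\t0.092\t0.242\t0.133\n"
    "0.083\t0.088\t0.154\t0.112\n"
    "0.111\t0.117\t0.103\t0.109\n"
)


def load_sweep(text, delimiter="\t"):
    """Parse the sweep into a list of dicts with float values."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{key: float(value) for key, value in row.items()} for row in reader]


def best_f1(rows):
    """Row with the highest F1."""
    return max(rows, key=lambda r: r["f1"])


def max_recall_at_precision(rows, min_precision):
    """Highest-recall row whose precision meets the floor, or None."""
    eligible = [r for r in rows if r["precision"] >= min_precision]
    return max(eligible, key=lambda r: r["recall"]) if eligible else None


rows = load_sweep(SWEEP)
print(best_f1(rows)["threshold"])                        # 0.067
print(max_recall_at_precision(rows, 0.08)["threshold"])  # 0.056
```

On the compact table the best-F1 row sits at 0.067; the full table places it at 0.0620, as reported in the evaluation section.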

Top 20 features (LightGBM gain)

loc_suburb (gain = 28660)
count_unit_car (gain = 5619)
crash_speed_limit (gain = 5061)
loc_abs_statistical_area_2 (gain = 4840)
loc_post_code (gain = 4436)
count_unit_pedestrian (gain = 2189)
count_unit_motorcycle_moped (gain = 1875)
crash_hour (gain = 1678)
count_unit_truck (gain = 1453)
crash_year (gain = 1014)
loc_state_electorate (gain = 975)
crash_lighting_condition_daylight (gain = 878)
crash_latitude (gain = 748)
loc_abs_statistical_area_3 (gain = 619)
crash_longitude (gain = 608)
crash_road_horiz_align_straight (gain = 601)
crash_roadway_feature_no roadway feature (gain = 554)
loc_local_government_area (gain = 513)
crash_road_horiz_align_curved - view open (gain = 471)
crash_lighting_condition_darkness - not lighted (gain = 400)
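Raw gain values are easier to compare as shares. The snippet below normalises the listed gains and groups the location features. Two caveats: the shares are relative to the sum of these 20 features only, because gains for the remaining features are not listed here, and the grouping rule (`loc_` prefix plus latitude and longitude) is an illustrative choice, not something defined by the model.

```python
TOP20_GAIN = {
    "loc_suburb": 28660,
    "count_unit_car": 5619,
    "crash_speed_limit": 5061,
    "loc_abs_statistical_area_2": 4840,
    "loc_post_code": 4436,
    "count_unit_pedestrian": 2189,
    "count_unit_motorcycle_moped": 1875,
    "crash_hour": 1678,
    "count_unit_truck": 1453,
    "crash_year": 1014,
    "loc_state_electorate": 975,
    "crash_lighting_condition_daylight": 878,
    "crash_latitude": 748,
    "loc_abs_statistical_area_3": 619,
    "crash_longitude": 608,
    "crash_road_horiz_align_straight": 601,
    "crash_roadway_feature_no roadway feature": 554,
    "loc_local_government_area": 513,
    "crash_road_horiz_align_curved - view open": 471,
    "crash_lighting_condition_darkness - not lighted": 400,
}


def is_location(name):
    """Illustrative grouping: loc_* fields plus raw coordinates."""
    return name.startswith("loc_") or name in {"crash_latitude", "crash_longitude"}


total = sum(TOP20_GAIN.values())
suburb_share = TOP20_GAIN["loc_suburb"] / total
location_share = sum(g for n, g in TOP20_GAIN.items() if is_location(n)) / total

print(f"loc_suburb share of top-20 gain: {suburb_share:.1%}")    # 45.4%
print(f"location share of top-20 gain:   {location_share:.1%}")  # 65.5%
```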

SHAP explainability

See reports/shap_fatal/ in the source repository for global and local SHAP plots. The same set of physical drivers (speed_limit, lighting, surface, roadway_feature) consistently dominates, which is the expected sanity-check signal.

Geographic surface

See reports/maps/model_lga_risk_surface.png (and the interactive HTML in the same folder) for a per-LGA map of mean P(fatal) under a fixed conditions grid.
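One way to read "mean P(fatal) under a fixed conditions grid" is: score every combination of a fixed set of conditions once per LGA, then average per LGA. The sketch below illustrates that idea only; it is not the code behind the published map. `predict_fn`, the grid values, and the LGA labels are hypothetical, and the column names are borrowed from the feature list above. How the other location fields (suburb, postcode, coordinates) were set for the real map is not stated in this card.

```python
from itertools import product
from statistics import mean


def lga_risk_surface(predict_fn, lgas, grid):
    """Mean predicted P(fatal) per LGA over every combination in `grid`.

    predict_fn: callable taking a list of row dicts and returning one
                probability per row (e.g. a wrapper around the model).
    grid:       dict mapping a column name to the values to sweep.
    """
    keys = list(grid)
    combos = [dict(zip(keys, values)) for values in product(*grid.values())]
    surface = {}
    for lga in lgas:
        rows = [{**combo, "loc_local_government_area": lga} for combo in combos]
        surface[lga] = mean(predict_fn(rows))
    return surface


# Toy stand-in for the real model, for illustration only.
def toy_predict(rows):
    return [
        r["crash_speed_limit"] / 10_000
        + (0.02 if r["loc_local_government_area"] == "LGA-B" else 0.0)
        for r in rows
    ]


grid = {"crash_speed_limit": [50, 100], "crash_hour": [0, 12]}
surface = lga_risk_surface(toy_predict, ["LGA-A", "LGA-B"], grid)
```

Because every LGA is scored on the same grid, differences in the surface come from the location inputs alone rather than from differing local mixes of conditions.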

Limitations & biases

Same set as the severity card. The binary head is more sensitive to under-reporting bias than the multi-class one because the positive class is small enough that a few mis-coded rows shift metrics noticeably.

Ethical considerations

Population-level, not causal: The model encodes correlations between pre-crash conditions and recorded outcomes. It does not assign fault and cannot be read as a statement about individual responsibility.
Geographic predictions can stigmatise: We publish per-LGA aggregates only, never per-address. Even at LGA level, higher predicted risk reflects historical reporting and demographics as much as it does inherent road danger.
Demographic features deliberately excluded: We do not include gender, age, or any demographic field, even though some are present in the casualties aggregates. This is to avoid encoding protected-class proxies. Vehicle-type counts are kept because they describe the crash configuration, not the people involved.
Reporting bias: This is a model of reported crashes, not true crashes. Under-reporting differs by severity (property-damage-only, or PDO, crashes are under-reported; fatal crashes are generally fully reported) and by region.

Citation

@software{crashrisk_qld_fatal_2026,
  title  = {CrashRisk-QLD fatal binary classifier},
  author = {Mattysmittttt},
  year   = {2026},
  url    = {https://huggingface.co/Mattysmittttt/crashrisk-qld-fatal},
  note   = {Trained on Mattysmittttt/qld-traffic-crashes-clean; source data CC-BY 4.0 © State of Queensland (Department of Transport and Main Roads).}
}

License

Released under CC-BY 4.0.
