# YOLOv11s-cls · driver phone-use binary classifier
Binary classifier that predicts whether a driver is holding / using a phone (talking or texting, left or right hand) from a cropped driver RGB image.
Part of the ktk-studio traffic-violation analytics stack (DeepStream 9.0 + Triton + B200).
## Summary

| Field | Value |
|---|---|
| Architecture | YOLOv11s-cls (Ultralytics) |
| Input | 224×224 RGB |
| Output | logits over 2 classes: `no_phone`, `phone` |
| Parameters | 5.4 M |
| GFLOPs | 12.0 |
| Weights | `best.pt` (PyTorch, 11 MB) / `best.onnx` (21 MB, opset 19) |
| Val top-1 | 99.91 % at epoch 21 (30 epochs total) |
## Training data

Source: gymprathap/Driver-Distracted-Dataset (State-Farm-style 10-class).
The original 10 classes were collapsed to binary:

- `phone` = c1 (texting right) ∪ c2 (talking right) ∪ c3 (texting left) ∪ c4 (talking left)
- `no_phone` = c0 (safe driving) ∪ c5 (radio) ∪ c6 (drinking) ∪ c7 (reaching) ∪ c8 (hair/makeup) ∪ c9 (passenger)
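The class collapse above can be sketched as a small folder remap, assuming the usual Ultralytics classification layout (one folder per class, `c0`…`c9` under the source split). Paths and the function name are illustrative, not part of this repo.

```python
import shutil
from pathlib import Path

# Classes counted as "phone": texting/talking, right or left hand (c1-c4).
PHONE_CLASSES = {"c1", "c2", "c3", "c4"}

def collapse_to_binary(src: Path, dst: Path) -> None:
    """Copy c0..c9 class folders under src into dst/phone and dst/no_phone."""
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        label = "phone" if class_dir.name in PHONE_CLASSES else "no_phone"
        out = dst / label
        out.mkdir(parents=True, exist_ok=True)
        for img in class_dir.glob("*.jpg"):
            # Prefix with the original class id so file names stay unique.
            shutil.copy2(img, out / f"{class_dir.name}_{img.name}")

# Example (source checkout path is an assumption):
# collapse_to_binary(Path("Driver-Distracted-Dataset/train"), Path("phone_binary/train"))
```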
| Split | phone | no_phone | Total |
|---|---|---|---|
| train | 7 870 | 11 196 | 19 066 |
| val (15 % holdout) | 1 386 | 1 972 | 3 358 |
## Usage

### Ultralytics

```python
from ultralytics import YOLO

model = YOLO("best.pt")
r = model("driver_crop.jpg")
print(r[0].probs.top1, r[0].names[r[0].probs.top1])
```
### ONNX Runtime

The model emits raw logits, so apply a softmax before reporting a confidence:

```python
import cv2
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("best.onnx", providers=["CUDAExecutionProvider"])
img = cv2.cvtColor(cv2.imread("driver_crop.jpg"), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
x = np.ascontiguousarray(img.transpose(2, 0, 1)[None])  # HWC -> NCHW, batch of 1
logits = sess.run(None, {"images": x})[0][0]
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over the two logits
print(["no_phone", "phone"][int(probs.argmax())], float(probs.max()))
```
## Intended use
- Real-time phone-use violation flagging on road-traffic video after car detection + tracking.
- Run on the driver region of a detected car bbox (typically top-left half for right-hand-drive / top-right half for left-hand-drive, or the full driver crop from an in-cabin camera).
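A minimal sketch of the driver-region crop described above, for the outside (windshield-through) case: take the top half of the car bbox on the driver's side. The `(x1, y1, x2, y2)` pixel convention and the function name are assumptions for illustration.

```python
def driver_region(bbox, right_hand_drive=True):
    """Return the driver-side top portion of a car bbox as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    mid_x = x1 + (x2 - x1) // 2
    mid_y = y1 + (y2 - y1) // 2
    if right_hand_drive:
        # Right-hand-drive: crop the top-left half of the bbox.
        return (x1, y1, mid_x, mid_y)
    # Left-hand-drive: mirror horizontally, crop the top-right half.
    return (mid_x, y1, x2, mid_y)
```

Feed the resulting crop (resized to 224×224) to the classifier; for an in-cabin camera, skip this step and use the full driver crop.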
## Out-of-scope / limitations

- Trained on a single-source in-cabin dataset; the domain gap on windshield-through views is not measured. Fine-tune on target footage for production use.
- Binary only: does not distinguish "talking" from "texting"; holding and using are collapsed into a single `phone` class.
- May occasionally confuse other close-to-ear gestures (adjusting hair, scratching the face) with phone use; not tested at adversarial scale.
## License
AGPL-3.0 (inherits Ultralytics YOLOv11 weight license).