# YOLOv11s-cls · driver phone-use binary classifier
Binary classifier that predicts whether a driver is holding / using a phone (talking or texting, left or right hand) from a cropped driver RGB image.
Part of the ktk-studio traffic-violation analytics stack (DeepStream 9.0 + Triton + B200).
## Summary

| Field | Value |
|---|---|
| Architecture | YOLOv11s-cls (Ultralytics) |
| Input | 224×224 RGB |
| Output | logits over 2 classes: `no_phone`, `phone` |
| Parameters | 5.4 M |
| GFLOPs | 12.0 |
| Weights | `best.pt` (PyTorch, 11 MB) / `best.onnx` (21 MB, opset 19) |
| Val top-1 | 99.91 % at epoch 21 (30 epochs total) |
## Training data

Source: gymprathap/Driver-Distracted-Dataset (State-Farm-style 10-class).
The original 10 classes were collapsed to binary:

- `phone` = c1 (texting right) ∪ c2 (talking right) ∪ c3 (texting left) ∪ c4 (talking left)
- `no_phone` = c0 (safe driving) ∪ c5 (radio) ∪ c6 (drinking) ∪ c7 (reaching) ∪ c8 (hair/makeup) ∪ c9 (passenger)
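The class collapse above can be sketched as a small folder remap, assuming the usual Ultralytics classification layout (one folder per class, `c0`…`c9` under the source split). Paths and the function name are illustrative, not part of this repo.

```python
import shutil
from pathlib import Path

# Classes counted as "phone": texting/talking, right or left hand (c1-c4).
PHONE_CLASSES = {"c1", "c2", "c3", "c4"}

def collapse_to_binary(src: Path, dst: Path) -> None:
    """Copy c0..c9 class folders under src into dst/phone and dst/no_phone."""
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        label = "phone" if class_dir.name in PHONE_CLASSES else "no_phone"
        out = dst / label
        out.mkdir(parents=True, exist_ok=True)
        for img in class_dir.glob("*.jpg"):
            # Prefix with the original class id so file names stay unique.
            shutil.copy2(img, out / f"{class_dir.name}_{img.name}")

# Example (source checkout path is an assumption):
# collapse_to_binary(Path("Driver-Distracted-Dataset/train"), Path("phone_binary/train"))
```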
| Split | phone | no_phone | Total |
|---|---|---|---|
| train | 7 870 | 11 196 | 19 066 |
| val (15 % holdout) | 1 386 | 1 972 | 3 358 |
## Usage

### Ultralytics

```python
from ultralytics import YOLO

model = YOLO("best.pt")
r = model("driver_crop.jpg")
print(r[0].probs.top1, r[0].names[r[0].probs.top1])
```
### ONNX Runtime

The model emits raw logits, so apply a softmax before reporting a confidence:

```python
import cv2
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("best.onnx", providers=["CUDAExecutionProvider"])
img = cv2.cvtColor(cv2.imread("driver_crop.jpg"), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
x = np.ascontiguousarray(img.transpose(2, 0, 1)[None])  # HWC -> NCHW, batch of 1
logits = sess.run(None, {"images": x})[0][0]
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over the two logits
print(["no_phone", "phone"][int(probs.argmax())], float(probs.max()))
```
## Intended use
- Real-time phone-use violation flagging on road-traffic video after car detection + tracking.
- Run on the driver region of a detected car bbox (typically top-left half for right-hand-drive / top-right half for left-hand-drive, or the full driver crop from an in-cabin camera).
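A minimal sketch of the driver-region crop described above, for the outside (windshield-through) case: take the top half of the car bbox on the driver's side. The `(x1, y1, x2, y2)` pixel convention and the function name are assumptions for illustration.

```python
def driver_region(bbox, right_hand_drive=True):
    """Return the driver-side top portion of a car bbox as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    mid_x = x1 + (x2 - x1) // 2
    mid_y = y1 + (y2 - y1) // 2
    if right_hand_drive:
        # Right-hand-drive: crop the top-left half of the bbox.
        return (x1, y1, mid_x, mid_y)
    # Left-hand-drive: mirror horizontally, crop the top-right half.
    return (mid_x, y1, x2, mid_y)
```

Feed the resulting crop (resized to 224×224) to the classifier; for an in-cabin camera, skip this step and use the full driver crop.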
## Out-of-scope / limitations

- Trained on a single-source in-cabin dataset; the domain gap on windshield-through views is not measured. Fine-tune on target footage for production use.
- Binary only: does not distinguish "talking" from "texting"; holding and using are collapsed into a single `phone` class.
- May occasionally confuse other close-to-ear gestures (adjusting hair, scratching the face) with phone use; not tested at adversarial scale.
## License
AGPL-3.0 (inherits Ultralytics YOLOv11 weight license).