YOLOv11s-cls · driver phone-use binary classifier

Binary classifier that predicts whether a driver is holding / using a phone (talking or texting, left or right hand) from a cropped driver RGB image.

Part of the ktk-studio traffic-violation analytics stack (DeepStream 9.0 + Triton + B200).

Summary

Architecture: YOLOv11s-cls (Ultralytics)
Input:        224×224 RGB
Output:       logits over 2 classes (no_phone, phone)
Parameters:   5.4 M
GFLOPs:       12.0
Weights:      best.pt (PyTorch, 11 MB) / best.onnx (21 MB, opset 19)
Val top-1:    99.91 % at epoch 21 (30 epochs total)

Training data

Source: gymprathap/Driver-Distracted-Dataset (State-Farm-style 10-class).

Original 10 classes were collapsed to binary:

  • phone = c1 (texting right) ∪ c2 (talking right) ∪ c3 (texting left) ∪ c4 (talking left)
  • no_phone = c0 (safe) ∪ c5 (radio) ∪ c6 (drinking) ∪ c7 (reaching) ∪ c8 (hair/makeup) ∪ c9 (passenger)
Split               phone    no_phone   Total
train               7 870    11 196     19 066
val (15 % holdout)  1 386    1 972      3 358
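
The 10-to-2 class collapse above can be sketched as a small relabeling script. This is a hypothetical helper, not the script used to build the dataset: the c0..c9 folder names come from the source dataset, while `src_root`/`dst_root` and the copy strategy are assumptions.

```python
# Hypothetical relabeling sketch: collapse the 10 State-Farm-style class
# folders (c0..c9) into the binary phone / no_phone layout described above.
import shutil
from pathlib import Path

# c1/c2 = texting/talking right hand, c3/c4 = texting/talking left hand
PHONE_CLASSES = {"c1", "c2", "c3", "c4"}

def binary_label(class_dir: str) -> str:
    """Map an original class folder name (c0..c9) to the binary label."""
    return "phone" if class_dir in PHONE_CLASSES else "no_phone"

def collapse(src_root: str, dst_root: str) -> None:
    """Copy images from src_root/c*/ into dst_root/{phone,no_phone}/."""
    for class_dir in sorted(Path(src_root).glob("c[0-9]")):
        dst = Path(dst_root) / binary_label(class_dir.name)
        dst.mkdir(parents=True, exist_ok=True)
        for img in class_dir.iterdir():
            # Prefix with the original class name to avoid filename clashes.
            shutil.copy2(img, dst / f"{class_dir.name}_{img.name}")
```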

Usage

Ultralytics

from ultralytics import YOLO

model = YOLO("best.pt")
r = model("driver_crop.jpg")      # single-image inference on a driver crop
top1 = r[0].probs.top1            # index of the highest-probability class
print(r[0].names[top1], float(r[0].probs.top1conf))

ONNX Runtime

import cv2, numpy as np, onnxruntime as ort
sess = ort.InferenceSession(
    "best.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
img = cv2.cvtColor(cv2.imread("driver_crop.jpg"), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0   # HWC, [0, 1]
x = np.ascontiguousarray(img.transpose(2, 0, 1)[None])         # NCHW, batch of 1
logits = sess.run(None, {"images": x})[0][0]
probs = np.exp(logits - logits.max())                          # softmax, so the
probs /= probs.sum()                                           # score is a probability
print(["no_phone", "phone"][int(probs.argmax())], float(probs.max()))
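For violation flagging it can help to decouple the decision from the raw argmax. The sketch below, a minimal NumPy-only helper, softmaxes the 2-class logits and flags phone only above a tunable threshold; the `phone_decision` name and the 0.5 default are assumptions, not part of this model's tooling.

```python
import numpy as np

def phone_decision(logits, threshold=0.5):
    """Softmax 2-class logits; flag 'phone' only if its probability
    meets the threshold. Returns (label, phone_probability)."""
    z = np.asarray(logits, dtype=np.float64)
    p = np.exp(z - z.max())        # numerically stable softmax
    p /= p.sum()
    label = "phone" if p[1] >= threshold else "no_phone"
    return label, float(p[1])
```

Raising the threshold trades recall for precision, which is usually the right direction for automated ticketing.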

Intended use

  • Real-time phone-use violation flagging on road-traffic video after car detection + tracking.
  • Run on the driver region of a detected car bbox (typically top-left half for right-hand-drive / top-right half for left-hand-drive, or the full driver crop from an in-cabin camera).
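The driver-region rule of thumb above can be sketched as a bbox helper. This is an illustrative assumption, not part of the pipeline: it takes a car bbox as (x1, y1, x2, y2) pixels and returns the top half on the driver's side.

```python
def driver_region(car_box, right_hand_drive=True):
    """Crop rectangle for the driver region of a detected car bbox:
    the top-left half for right-hand-drive (driver appears on the
    image-left side), top-right half otherwise."""
    x1, y1, x2, y2 = car_box
    xm = (x1 + x2) // 2   # vertical midline of the car
    ym = (y1 + y2) // 2   # keep only the top half (windshield area)
    return (x1, y1, xm, ym) if right_hand_drive else (xm, y1, x2, ym)
```

The returned rectangle is what gets resized to 224×224 and fed to the classifier.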

Out-of-scope / limitations

  • Trained on a single-source in-cabin dataset; domain gap on windshield-through views is not measured. Fine-tune on target footage for production use.
  • Binary only: does not distinguish talking from texting, and collapses holding and using into a single phone class.
  • Occasionally confuses other close-to-ear gestures (adjusting hair, scratching the face) with phone use; robustness has not been tested at adversarial scale.

License

AGPL-3.0 (inherits Ultralytics YOLOv11 weight license).
