Hierarchical command understanding with state-aware runtime behavior for practical assistant workflows.
- Parameters: 7.95M
- Runtime turns: 82
- Errors: 0
- Mean latency: 25.3 ms
- OOD precision: 100%
- Checkpoint: 30.6 MB
Quickstart (2 minutes)
Install + first prediction
pip install -r requirements.txt
from janegpt_v2_janus.inference import JaneGPTv3NLU
nlu = JaneGPTv3NLU(
    model_path="weights/janegpt_v2_janus.pt",
    tokenizer_path="weights/tokenizer.json",
)
state = {}
result = nlu.predict("set volume", state=state)
print(result)
if result.get("type") == "command":
    state = nlu.update_state(result, state)
Runtime wrapper (recommended for assistant flows)
from runtime.jane_nlu_runtime import JaneNLURuntime
rt = JaneNLURuntime(base_dir=".")
state = {}
out, state = rt.handle_turn("set volume", state)
print(out) # expected: clarify prompt for missing VALUE
out, state = rt.handle_turn("55", state)
print(out) # expected: resolved local command
Run bundled demos
python examples/demo_inference.py
python examples/demo_runtime.py
python examples/demo_runtime_suite.py
What You Get
- Single-pass multitask prediction: domain + action + BIO slots.
- Runtime-safe clarification loops for missing required slots.
- Stateful follow-ups (for example, "that is not enough" after a volume change).
- Local command routing with controlled chat fallback.
- Compact deployment footprint: ~30.62 MB checkpoint.
Benchmark Results
Runtime Reliability
Runtime Stability: 82-turn suite with zero runtime errors.
Predict Latency (CUDA, batch=1)
Latency Profile: CUDA, batch size = 1 (lower is better).
OOD Rejection Quality
Schema-agnostic safety benchmark.
Comprehensive Benchmark Summary
Full Benchmark Evidence: all values come from real holdout evaluations, with no synthetic or inflated numbers.
| Metric | Detail | Jane v2 | Janus |
|---|---|---|---|
| Speed (mean latency) | CUDA, batch=1 | 31.60 ms | 25.31 ms |
| Throughput | CUDA, single GPU | 32 pred/sec | Stable across 82 turns, 0 errors |
| OOD F1 | BANKING77 | 94.31% | 87.80% |
| OOD F1 | CLINC OOS | 89.16% | 79.23% |
| OOD Precision | BANKING77 | 99.35% | 100.00% |
| OOD Precision | CLINC OOS | 99.14% | 100.00% |
| OOD Recall | BANKING77 | 89.75% | 78.25% |
| OOD Recall | CLINC OOS | 81.00% | 65.60% |
| Validation Accuracy | Domain (best epoch) | n/a | 99.83% |
| Validation Accuracy | Action (best epoch) | n/a | 99.87% |
| Validation Accuracy | Domain+Action pair (best epoch) | n/a | 99.83% |
| Slot Extraction F1 | All 15 BIO slot labels | n/a | 1.000 (100%) |
| Training Loss | Epoch 1 → 4 | n/a | 0.060 → 0.020 → 0.002 → 0.001 |
| Validation Loss | Epoch 1 → 3 | n/a | 0.0153 → 0.0116 → 0.0115 (stable) |
| Runtime Reliability | 82-turn conversation test | n/a | 0 errors, 0 crashes |
| Domain Confusion | 10 domains | n/a | 99%+ per-domain, minimal cross-confusion |
| Action Confusion | 33 actions | n/a | Perfect diagonal, no commonly confused action pairs |
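The OOD F1 rows in the table can be cross-checked against the precision and recall rows, since F1 is the harmonic mean of the two. The sketch below recomputes F1 from the table's own figures; the helper name `f1_score` and the `rows` dictionary are illustrative and not part of the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the benchmark table above.
rows = {
    ("Jane v2", "BANKING77"): (0.9935, 0.8975),
    ("Jane v2", "CLINC OOS"): (0.9914, 0.8100),
    ("Janus", "BANKING77"): (1.0000, 0.7825),
    ("Janus", "CLINC OOS"): (1.0000, 0.6560),
}

for (model, dataset), (p, r) in rows.items():
    # Print F1 as a percentage with two decimals, matching the table's format.
    print(f"{model:8s} {dataset:10s} F1 = {100 * f1_score(p, r):.2f}%")
```

The recomputed values agree with the reported F1 column to two decimal places, which also makes the trade-off visible: Janus reaches 100% precision at the cost of lower recall, and that lowers its F1 relative to Jane v2.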
Live Output Shapes
Command output
{
"type": "command",
"domain": "apps",
"action": "launch",
"slots": {
"APP_NAME": {
"text": "chrome",
"start": 5,
"end": 11,
"confidence": 0.999
}
},
"confidence": 0.97,
"route": "local"
}
Clarification output
{
"type": "clarify",
"question": "What value should I set it to?",
"debug": {
"domain": "volume",
"action": "set",
"reason": "missing_VALUE"
}
}
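A caller would typically branch on the `type` field of these outputs. The following is a minimal sketch of that dispatch, assuming only the two shapes shown above; the function `handle_output` and its return strings are hypothetical, and any other `type` value is treated as a generic fallback because its shape is not documented here.

```python
def handle_output(out: dict) -> str:
    """Turn an NLU output dict into a short description of the next step."""
    kind = out.get("type")
    if kind == "command":
        # Keep only the surface text of each slot; offsets and confidences are dropped.
        slots = {name: info["text"] for name, info in out.get("slots", {}).items()}
        return f"run {out['domain']}.{out['action']} with {slots}"
    if kind == "clarify":
        # The runtime is asking for a missing required slot.
        return f"ask user: {out['question']}"
    # Assumption: anything else is handed to a fallback path.
    return "fallback: unrecognized output type"


command_out = {
    "type": "command",
    "domain": "apps",
    "action": "launch",
    "slots": {"APP_NAME": {"text": "chrome", "start": 5, "end": 11, "confidence": 0.999}},
    "confidence": 0.97,
    "route": "local",
}
clarify_out = {
    "type": "clarify",
    "question": "What value should I set it to?",
    "debug": {"domain": "volume", "action": "set", "reason": "missing_VALUE"},
}

print(handle_output(command_out))
print(handle_output(clarify_out))
```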
Label schema
- Domains (10): volume, brightness, media, apps, browser, productivity, screen, window, system, conversation
- Actions (33): up, down, set, mute, unmute, play, pause, next, previous, launch, close, switch, search, set_reminder, screenshot, read, explain, undo, quit, chat, minimize, maximize, restore, focus, copy, paste, cut, lock, sleep, wifi_on, wifi_off, bluetooth_on, bluetooth_off
- Slot labels (BIO, 15): B- and I- tags for 7 slot types (VALUE, APP_NAME, QUERY, DURATION, TIME, WINDOW_NAME, TEXT), plus O
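The count of 15 slot labels is consistent with a standard BIO scheme over the 7 listed slot types: one `B-` and one `I-` tag per type, plus `O`. The sketch below builds such a label set and decodes a tagged token sequence into slot spans. The label ordering, the helper names, and the whitespace-joined span text are assumptions for illustration; the actual definitions live in `janegpt_v2_janus/labels.py` and may differ.

```python
SLOT_TYPES = ["VALUE", "APP_NAME", "QUERY", "DURATION", "TIME", "WINDOW_NAME", "TEXT"]


def build_bio_labels(slot_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given slot types (ordering is assumed)."""
    labels = ["O"]
    for name in slot_types:
        labels.append(f"B-{name}")
        labels.append(f"I-{name}")
    return labels


def decode_bio(tokens, tags):
    """Group tokens into (slot_type, text) spans according to their BIO tags."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the open span, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # 'O', or an I- tag that does not continue the open span: treat as outside.
            flush()
    flush()
    return spans


print(len(build_bio_labels(SLOT_TYPES)))
print(decode_bio(["open", "google", "chrome"], ["O", "B-APP_NAME", "I-APP_NAME"]))
```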
Visual Benchmark Evidence
Confusion Matrix: Per-Class True vs Predicted
[Figure: one stacked bar per head (domain and action), with segment width proportional to the sample ratio. The classes shown are the 10 domains and 33 actions listed under Label schema.]
Upload-Ready Layout
.
|- README.md
|- .gitattributes
|- LICENSE
|- requirements.txt
|- assets/
| |- jane-janus-glitch.webp
|- janegpt_v2_janus/
| |- __init__.py
| |- architecture.py
| |- dataset.py
| |- inference.py
| |- labels.py
| |- multitask.py
|- runtime/
| |- jane_nlu_runtime.py
|- examples/
| |- demo_inference.py
| |- demo_runtime.py
| |- demo_runtime_suite.py
|- weights/
| |- janegpt_v2_janus.pt
| |- tokenizer.json
|- reports/
| |- fair_benchmarks.json
| |- fair_benchmarks.md
| |- janus_model_report.json
| |- janus_model_report.md
| |- public_benchmarks.json
| |- *.png benchmark visuals
Limitations
- English-focused command language.
- Command NLU model, not an open-domain generative chatbot.
- MASSIVE and SNIPS mapped-intent accuracy is excluded from headline claims because mapping coverage is partial.
License
Apache-2.0 (see LICENSE).
Evaluation results
- OOD Precision on BANKING77 (self-reported): 1.000
- OOD F1 on BANKING77 (self-reported): 0.878
- OOD Recall on BANKING77 (self-reported): 0.782
- OOD Precision on CLINC OOS (self-reported): 1.000
- OOD F1 on CLINC OOS (self-reported): 0.792
- OOD Recall on CLINC OOS (self-reported): 0.656
- Validation Domain Accuracy (self-reported): 0.998
- Validation Action Accuracy (self-reported): 0.999
- Slot Extraction F1 (self-reported): 1.000