Jane Janus animated hero banner

Hierarchical command understanding with state-aware runtime behavior for practical assistant workflows.

  • Parameters: 7.95M
  • Runtime turns: 82
  • Errors: 0
  • Mean latency: 25.3 ms
  • OOD precision: 100%
  • Checkpoint: 30.6 MB

Quickstart (2 minutes)

Install + first prediction

pip install -r requirements.txt

from janegpt_v2_janus.inference import JaneGPTv3NLU

nlu = JaneGPTv3NLU(
    model_path="weights/janegpt_v2_janus.pt",
    tokenizer_path="weights/tokenizer.json",
)

state = {}
result = nlu.predict("set volume", state=state)
print(result)

if result.get("type") == "command":
    state = nlu.update_state(result, state)

Runtime wrapper (recommended for assistant flows)

from runtime.jane_nlu_runtime import JaneNLURuntime

rt = JaneNLURuntime(base_dir=".")
state = {}

out, state = rt.handle_turn("set volume", state)
print(out)  # expected: clarify prompt for missing VALUE

out, state = rt.handle_turn("55", state)
print(out)  # expected: resolved local command

Run bundled demos

python examples/demo_inference.py
python examples/demo_runtime.py
python examples/demo_runtime_suite.py

What You Get

  • Single-pass multitask prediction: domain + action + BIO slots.
  • Runtime-safe clarification loops for missing required slots.
  • Stateful follow-ups (for example, "that is not enough" after a volume change).
  • Local command routing with controlled chat fallback.
  • Compact deployment footprint: ~30.62 MB checkpoint.
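
The clarification and follow-up behavior listed above can be pictured with a toy state machine. This is a minimal sketch under stated assumptions, not the Janus runtime: `REQUIRED_SLOTS`, `parse`, `handle_turn_sketch`, the `pending` state key, and the `chat`/`fallback` values are hypothetical, and a keyword parser stands in for the model.

```python
# Toy illustration only: a hand-written parser stands in for the NLU model.
# REQUIRED_SLOTS, the "pending" key, and handle_turn_sketch are hypothetical.
REQUIRED_SLOTS = {("volume", "set"): ["VALUE"]}


def parse(text):
    """Very small stand-in for model prediction."""
    words = text.lower().split()
    slots = {}
    for word in words:
        if word.isdigit():
            slots["VALUE"] = word
    if "volume" in words and "set" in words:
        return {"domain": "volume", "action": "set", "slots": slots}
    return {"domain": None, "action": None, "slots": slots}


def handle_turn_sketch(text, state):
    """Return (output, new_state); ask for a missing required slot first."""
    state = dict(state)
    pred = parse(text)
    pending = state.pop("pending", None)
    if pending is not None and pred["domain"] is None:
        # Treat this turn as the answer to the earlier clarification.
        pending["slots"].update(pred["slots"])
        pred = pending
    if pred["domain"] is None:
        # Nothing recognized: hand the turn to a chat fallback (assumed shape).
        return {"type": "chat", "route": "fallback"}, state
    key = (pred["domain"], pred["action"])
    missing = [s for s in REQUIRED_SLOTS.get(key, []) if s not in pred["slots"]]
    if missing:
        state["pending"] = pred
        return {"type": "clarify", "reason": f"missing_{missing[0]}"}, state
    return {"type": "command", "route": "local", **pred}, state


state = {}
out1, state = handle_turn_sketch("set volume", state)
out2, state = handle_turn_sketch("55", state)
print(out1["type"], out2["type"], out2["slots"])
```

The first turn yields a clarification because `VALUE` is missing; the second turn fills the slot from the pending prediction and resolves to a local command.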

Benchmark Results

Runtime Reliability

Runtime stability: 82-turn suite with zero runtime errors.

  • Total turns tested: 82
  • Local command resolutions: 67
  • Clarification turns: 12
  • Runtime errors: 0

Source: reports/fair_benchmarks.json
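
As a quick reading aid, the turn counts above can be turned into shares of the suite. The snippet below only restates the reported figures; the three turns that are neither local resolutions nor clarifications are not itemized in the summary, so the code leaves them unlabeled.

```python
# Figures copied from the runtime stability summary above.
total_turns = 82
local_resolutions = 67
clarification_turns = 12
runtime_errors = 0

# Turns the summary does not break out further.
other_turns = total_turns - local_resolutions - clarification_turns

print(f"local: {local_resolutions / total_turns:.1%}")
print(f"clarify: {clarification_turns / total_turns:.1%}")
print(f"other turns (not itemized): {other_turns}")
```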

Predict Latency (CUDA, batch=1)

Latency profile (CUDA, batch size = 1; lower is better):

  • Predict mean: 25.31 ms
  • Predict p95: 34.60 ms
  • Forward mean: 35.37 ms
  • Forward p95: 36.71 ms

Source: reports/janus_model_report.json
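
For readers who want to reproduce mean and p95 figures like these on their own hardware, one common approach is sketched below. It is not the procedure behind the report (the source does not describe it): `measure_latency_ms` and `fake_predict` are hypothetical names, the warm-up and run counts are arbitrary, and the percentile uses the nearest-rank method.

```python
import math
import statistics
import time


def measure_latency_ms(fn, arg, warmup=5, runs=50):
    """Time fn(arg) repeatedly and return (mean_ms, p95_ms)."""
    for _ in range(warmup):
        fn(arg)  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    rank = max(1, math.ceil(0.95 * len(samples)))
    return statistics.fmean(samples), samples[rank - 1]


def fake_predict(text):
    """Placeholder workload; swap in a real predict call when measuring."""
    return sum(ord(ch) for ch in text)


mean_ms, p95_ms = measure_latency_ms(fake_predict, "set volume")
print(f"mean={mean_ms:.4f} ms, p95={p95_ms:.4f} ms")
```

When timing a model on CUDA, the GPU should be synchronized (for example with `torch.cuda.synchronize()`) before each clock read; otherwise asynchronous kernel launches make the numbers look faster than they are.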

OOD Rejection Quality

Schema-agnostic safety benchmark:

  • BANKING77: OOD F1 87.80%, OOD precision 100.00%, OOD recall 78.25%
  • CLINC OOS: OOD F1 79.23%, OOD precision 100.00%, OOD recall 65.60%

Source: reports/fair_benchmarks.json
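
The reported F1 values follow from the reported precision and recall through the standard harmonic-mean formula, F1 = 2PR / (P + R). The short check below uses only the numbers listed above; the helper name `f1` is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the OOD rejection results above.
reported = {
    "BANKING77": {"precision": 1.0000, "recall": 0.7825, "f1": 0.8780},
    "CLINC OOS": {"precision": 1.0000, "recall": 0.6560, "f1": 0.7923},
}

for name, m in reported.items():
    computed = f1(m["precision"], m["recall"])
    print(f"{name}: computed F1 = {computed:.2%}, reported = {m['f1']:.2%}")
```

With precision at 100%, F1 is governed entirely by recall: the model never flags an in-scope query as out-of-domain on these sets, but it lets some out-of-domain queries through.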

Comprehensive Benchmark Summary

Full benchmark evidence. All values come from real holdout evaluations; no synthetic or inflated numbers.

| Metric | Detail | Jane v2 | Janus |
|---|---|---|---|
| Speed (mean latency) | CUDA, batch=1 | 31.60 ms | 25.31 ms |
| Throughput | CUDA, single GPU | 32 pred/sec | Stable across 82 turns, 0 errors |
| OOD F1 | BANKING77 | 94.31% | 87.80% |
| OOD F1 | CLINC OOS | 89.16% | 79.23% |
| OOD Precision | BANKING77 | 99.35% | 100.00% |
| OOD Precision | CLINC OOS | 99.14% | 100.00% |
| OOD Recall | BANKING77 | 89.75% | 78.25% |
| OOD Recall | CLINC OOS | 81.00% | 65.60% |
| Validation Accuracy | Domain (best epoch) | - | 99.83% |
| Validation Accuracy | Action (best epoch) | - | 99.87% |
| Validation Accuracy | Domain+Action pair (best epoch) | - | 99.83% |
| Slot Extraction F1 | All 15 slot types | - | 1.000 (100%) |
| Training Loss | Epoch 1 → 4 | - | 0.060 → 0.020 → 0.002 → 0.001 |
| Validation Loss | Epoch 1 → 3 | - | 0.0153 → 0.0116 → 0.0115 (stable) |
| Runtime Reliability | 82-turn conversation test | - | 0 errors, 0 crashes |
| Domain Confusion | 10 domains | - | 99%+ per-domain, minimal cross-confusion |
| Action Confusion | 33 actions | - | Perfect diagonal, no action commonly confused |

Live Output Shapes

Command output
{
  "type": "command",
  "domain": "apps",
  "action": "launch",
  "slots": {
    "APP_NAME": {
      "text": "chrome",
      "start": 5,
      "end": 11,
      "confidence": 0.999
    }
  },
  "confidence": 0.97,
  "route": "local"
}

Clarification output

{
  "type": "clarify",
  "question": "What value should I set it to?",
  "debug": {
    "domain": "volume",
    "action": "set",
    "reason": "missing_VALUE"
  }
}
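
A caller will typically branch on the `type` field of these outputs. The sketch below shows one way to do that using only the fields visible in the two examples. It rests on two assumptions: that `start`/`end` are character offsets with an exclusive end, and that the original utterance was "open chrome" (the source does not show it). `describe` is a hypothetical helper.

```python
# Outputs copied from the examples above.
command = {
    "type": "command",
    "domain": "apps",
    "action": "launch",
    "slots": {
        "APP_NAME": {"text": "chrome", "start": 5, "end": 11, "confidence": 0.999}
    },
    "confidence": 0.97,
    "route": "local",
}
clarify = {
    "type": "clarify",
    "question": "What value should I set it to?",
    "debug": {"domain": "volume", "action": "set", "reason": "missing_VALUE"},
}


def describe(result, utterance=""):
    """Turn a result dict into a short human-readable action summary."""
    kind = result.get("type")
    if kind == "command":
        parts = []
        for name, slot in result.get("slots", {}).items():
            # Assumed: offsets index into the utterance, end exclusive.
            span = utterance[slot["start"]:slot["end"]] or slot["text"]
            parts.append(f"{name}={span}")
        target = f"{result['domain']}.{result['action']}"
        return f"run {target} ({', '.join(parts)}) via {result['route']}"
    if kind == "clarify":
        return f"ask user: {result['question']}"
    return "unhandled result type"


print(describe(command, "open chrome"))
print(describe(clarify))
```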

Label schema
  • Domains (10): volume, brightness, media, apps, browser, productivity, screen, window, system, conversation
  • Actions (33): up, down, set, mute, unmute, play, pause, next, previous, launch, close, switch, search, set_reminder, screenshot, read, explain, undo, quit, chat, minimize, maximize, restore, focus, copy, paste, cut, lock, sleep, wifi_on, wifi_off, bluetooth_on, bluetooth_off
  • Slot labels (BIO, 15): VALUE, APP_NAME, QUERY, DURATION, TIME, WINDOW_NAME, TEXT
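
The schema lists seven slot names alongside a count of 15. One reading, assumed here rather than confirmed by the source, is that each slot type has a `B-` and an `I-` tag and there is a single `O` tag (2 × 7 + 1 = 15). The sketch below builds that tag set and decodes a tagged token sequence into slot spans; the tag spelling and the `decode_bio` helper are illustrative, not the package's API.

```python
SLOT_TYPES = ["VALUE", "APP_NAME", "QUERY", "DURATION", "TIME", "WINDOW_NAME", "TEXT"]

# Assumed tag naming: "O" plus B-/I- variants for every slot type.
BIO_TAGS = ["O"] + [f"{prefix}-{slot}" for slot in SLOT_TYPES for prefix in ("B", "I")]


def decode_bio(tokens, tags):
    """Group tokens into (slot_type, text) spans from BIO tags."""
    spans = []
    current = None  # (slot_type, [tokens]) while inside a span
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(slot, " ".join(words)) for slot, words in spans]


tokens = ["search", "for", "cat", "videos"]
tags = ["O", "O", "B-QUERY", "I-QUERY"]
print(len(BIO_TAGS), decode_bio(tokens, tags))
```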

Visual Benchmark Evidence

Train and validation loss

Smoothed train loss

Validation slot F1

Confusion Matrix Breakdown

Per-class true vs. predicted, one breakdown per head.

Domain sample distribution (3,110 total samples). Every domain shows 100% accuracy with 0 misclassified samples:

  • volume: 430 samples (13.8%)
  • brightness: 250 samples (8.0%)
  • media: 340 samples (10.9%)
  • apps: 250 samples (8.0%)
  • browser: 120 samples (3.9%)
  • productivity: 340 samples (10.9%)
  • screen: 120 samples (3.9%)
  • window: 340 samples (10.9%)
  • system: 580 samples (18.6%)
  • conversation: 340 samples (10.9%)

Source: validation set confusion matrix
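
The percentage shares in the domain distribution can be recomputed from the raw counts, which also makes the class imbalance easy to see. The snippet uses only the counts listed above; the variable names are illustrative.

```python
# Validation sample counts per domain, copied from the distribution above.
domain_counts = {
    "volume": 430, "brightness": 250, "media": 340, "apps": 250,
    "browser": 120, "productivity": 340, "screen": 120, "window": 340,
    "system": 580, "conversation": 340,
}

total = sum(domain_counts.values())
shares = {name: round(100 * n / total, 1) for name, n in domain_counts.items()}

largest = max(domain_counts, key=domain_counts.get)
smallest = min(domain_counts.values())
print(total, shares[largest], round(domain_counts[largest] / smallest, 2))
```

The largest domain (system) has almost five times as many validation samples as the smallest ones (browser and screen), so per-domain accuracy is more informative than a single pooled figure.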
Action sample distribution (3,205 total samples). Every action shows 100% accuracy with 0 misclassified samples:

  • up: 170 samples (5.3%)
  • down: 165 samples (5.1%)
  • set: 170 samples (5.3%)
  • 90 samples (2.8%) each: mute, unmute, play, pause, next, previous, launch, close, switch, search, set_reminder, screenshot, read, explain, undo, quit, chat, minimize, maximize, restore, focus, copy, paste, cut, lock, sleep, wifi_on, wifi_off, bluetooth_on, bluetooth_off

Source: validation set confusion matrix
Original confusion matrix images

Domain confusion matrix

Action confusion matrix

Additional diagnostics

Learning rate schedule

Epoch time profile

Raw training loss


Upload-Ready Layout

.
|- README.md
|- .gitattributes
|- LICENSE
|- requirements.txt
|- assets/
|  |- jane-janus-glitch.webp
|- janegpt_v2_janus/
|  |- __init__.py
|  |- architecture.py
|  |- dataset.py
|  |- inference.py
|  |- labels.py
|  |- multitask.py
|- runtime/
|  |- jane_nlu_runtime.py
|- examples/
|  |- demo_inference.py
|  |- demo_runtime.py
|  |- demo_runtime_suite.py
|- weights/
|  |- janegpt_v2_janus.pt
|  |- tokenizer.json
|- reports/
|  |- fair_benchmarks.json
|  |- fair_benchmarks.md
|  |- janus_model_report.json
|  |- janus_model_report.md
|  |- public_benchmarks.json
|  |- *.png benchmark visuals

Limitations

  • English-focused command language.
  • Command NLU model, not an open-domain generative chatbot.
  • MASSIVE and SNIPS mapped-intent accuracy is excluded from headline claims because mapping coverage is partial.

License

Apache-2.0 (see LICENSE).
