RavinduSen/JaneGPT-v2-Janus

Hierarchical command understanding with state-aware runtime behavior for practical assistant workflows.

Parameters: 7.95M
Runtime Turns: 82
Errors: 0
Mean Latency: 25.3 ms
OOD Precision: 100%
Checkpoint: 30.6 MB

Quickstart (2 minutes)

Install + first prediction

pip install -r requirements.txt

from janegpt_v2_janus.inference import JaneGPTv3NLU

nlu = JaneGPTv3NLU(
    model_path="weights/janegpt_v2_janus.pt",
    tokenizer_path="weights/tokenizer.json",
)

state = {}
result = nlu.predict("set volume", state=state)
print(result)

if result.get("type") == "command":
    state = nlu.update_state(result, state)

Runtime wrapper (recommended for assistant flows)

from runtime.jane_nlu_runtime import JaneNLURuntime

rt = JaneNLURuntime(base_dir=".")
state = {}

out, state = rt.handle_turn("set volume", state)
print(out)  # expected: clarify prompt for missing VALUE

out, state = rt.handle_turn("55", state)
print(out)  # expected: resolved local command

Run bundled demos

python examples/demo_inference.py
python examples/demo_runtime.py
python examples/demo_runtime_suite.py

What You Get

Single-pass multitask prediction: domain + action + BIO slots.
Runtime-safe clarification loops for missing required slots.
Stateful follow-ups (for example, "that is not enough" after a volume change).
Local command routing with controlled chat fallback.
Compact deployment footprint: ~30.62 MB checkpoint.
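
The clarification loop and stateful follow-up described above can be pictured with a small stand-alone sketch. It does not use the model or the bundled runtime: the `REQUIRED_SLOTS` table, the `missing_slots` and `handle_turn` helpers, the question wording, and the `pending` state key are all hypothetical names chosen for illustration, and the actual runtime may work differently.

```python
# Hypothetical required-slot table for illustration; not the runtime's actual rules.
REQUIRED_SLOTS = {
    ("volume", "set"): ["VALUE"],
    ("apps", "launch"): ["APP_NAME"],
}


def missing_slots(prediction):
    """List required slots that the prediction did not fill."""
    key = (prediction["domain"], prediction["action"])
    filled = prediction.get("slots", {})
    return [name for name in REQUIRED_SLOTS.get(key, []) if name not in filled]


def handle_turn(prediction, text, state):
    """Process one turn and return (output, new_state)."""
    pending = state.get("pending")
    if pending is not None:
        # A clarification is open: treat this utterance as the missing slot value.
        resolved = dict(pending["prediction"])
        resolved["slots"] = {**resolved.get("slots", {}), pending["slot"]: {"text": text}}
        return resolved, {}

    missing = missing_slots(prediction)
    if missing:
        slot = missing[0]
        clarify = {
            "type": "clarify",
            "question": f"What {slot.lower()} should I use?",
            "debug": {
                "domain": prediction["domain"],
                "action": prediction["action"],
                "reason": f"missing_{slot}",
            },
        }
        # Remember what was asked so the next turn can complete the command.
        return clarify, {"pending": {"prediction": prediction, "slot": slot}}

    return prediction, state


# Turn 1: "set volume" arrives without a VALUE slot (prediction is hand-written here).
first = {"type": "command", "domain": "volume", "action": "set", "slots": {}}
out1, state = handle_turn(first, "set volume", {})

# Turn 2: the follow-up "55" fills the pending slot; no new prediction is needed.
out2, state = handle_turn(None, "55", state)
```

In this sketch the first turn yields a clarify output with reason `missing_VALUE`, and the second turn returns the completed command and clears the pending state.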

Benchmark Results

Runtime Reliability

82-turn suite with zero runtime errors.

Total turns tested: 82
Runtime errors: 0

The suite also tracks local command resolutions and clarification turns.

Source: reports/fair_benchmarks.json

Predict Latency (CUDA, batch=1)

Lower is better.

Metric	Mean	p95
Predict	25.31 ms	34.60 ms
Forward	35.37 ms	36.71 ms

Source: reports/janus_model_report.json
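
For readers who want to collect mean and p95 figures of the same kind on their own hardware, here is a minimal timing sketch. It is not the procedure behind the report: the `measure_latency` helper, the warmup count, and the nearest-rank p95 definition are assumptions, and the workload is a stand-in callable rather than the model.

```python
import math
import statistics
import time


def measure_latency(fn, n_runs=50, warmup=5):
    """Time repeated calls to fn and return mean and p95 latency in milliseconds."""
    # Warmup calls are excluded so one-time setup cost does not skew the numbers.
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        # Note: for GPU work the device would need to be synchronized before
        # reading the clock (for example with torch.cuda.synchronize()).
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(samples_ms)))
    return {"mean_ms": statistics.fmean(samples_ms), "p95_ms": samples_ms[rank - 1]}


# Stand-in workload; swap in a call such as nlu.predict(...) to time the model.
stats = measure_latency(lambda: sum(range(1000)), n_runs=20, warmup=2)
print(stats)
```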

OOD Rejection Quality

Schema-agnostic safety benchmark.

Dataset	OOD F1	OOD Precision	OOD Recall
BANKING77	87.80%	100.00%	78.25%
CLINC OOS	79.23%	100.00%	65.60%

Source: reports/fair_benchmarks.json
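
The F1 figures follow from the reported precision and recall through the usual harmonic mean, which the sketch below checks. It assumes OOD is the positive class. The `ood_metrics` and `f1_from` helpers are illustrative names, and the raw counts in the example are hypothetical values chosen only to reproduce the BANKING77 recall; they are not taken from the report.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ood_metrics(tp, fp, fn):
    """Precision, recall, and F1 with OOD as the positive class.

    tp: OOD inputs correctly rejected
    fp: in-scope inputs wrongly rejected
    fn: OOD inputs wrongly accepted
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1_from(precision, recall)}


# Recompute F1 from the reported precision and recall.
banking_f1 = round(100 * f1_from(1.0, 0.7825), 2)
clinc_f1 = round(100 * f1_from(1.0, 0.6560), 2)
print(banking_f1, clinc_f1)

# Hypothetical counts: 313 of 400 OOD inputs rejected, no in-scope input rejected.
example = ood_metrics(tp=313, fp=0, fn=87)
print(example)
```

Read this way, 100% precision means every rejected input was truly out of scope, while the recall figure gives the share of out-of-scope inputs that were caught.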

Comprehensive Benchmark Summary

Full Benchmark Evidence

All values come from real holdout evaluations; none are synthetic or inflated.

Metric	Detail	Jane v2	Janus
Speed (mean latency)	CUDA, batch=1	31.60 ms	25.31 ms
Throughput	CUDA, single GPU	32 pred/sec	Not reported; runtime stable across 82 turns, 0 errors
OOD F1	BANKING77	94.31%	87.80%
OOD F1	CLINC OOS	89.16%	79.23%
OOD Precision	BANKING77	99.35%	100.00%
OOD Precision	CLINC OOS	99.14%	100.00%
OOD Recall	BANKING77	89.75%	78.25%
OOD Recall	CLINC OOS	81.00%	65.60%
Validation Accuracy	Domain (best epoch)	—	99.83%
Validation Accuracy	Action (best epoch)	—	99.87%
Validation Accuracy	Domain+Action pair (best epoch)	—	99.83%
Slot Extraction F1	All 7 slot types (15 BIO labels)	—	1.000 (100%)
Training Loss	Epoch 1 → 4	—	0.060 → 0.020 → 0.002 → 0.001
Validation Loss	Epoch 1 → 3	—	0.0153 → 0.0116 → 0.0115 (stable)
Runtime Reliability	82-turn conversation test	—	0 errors, 0 crashes
Domain Confusion	10 domains	—	99%+ per-domain, minimal cross-confusion
Action Confusion	33 actions	—	Perfect diagonal, no action commonly confused
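
The latency and throughput rows can be related with simple arithmetic. The sketch below assumes strictly sequential batch=1 calls, so throughput is the reciprocal of mean latency. Under that assumption the Jane v2 latency matches its listed 32 pred/sec. The Janus throughput printed here is a derived estimate, not a measured or reported value, and `throughput_per_sec` is an illustrative helper.

```python
def throughput_per_sec(mean_latency_ms):
    """Predictions per second when calls run one after another at batch size 1."""
    return 1000.0 / mean_latency_ms


jane_v2 = throughput_per_sec(31.60)  # mean latency from the table
janus = throughput_per_sec(25.31)    # mean latency from the table

# Relative reduction in mean latency from Jane v2 to Janus.
latency_reduction = (31.60 - 25.31) / 31.60

print(f"Jane v2: {jane_v2:.1f} pred/sec")
print(f"Janus (derived estimate): {janus:.1f} pred/sec")
print(f"Mean latency reduction: {latency_reduction:.1%}")
```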

Live Output Shapes

Command output

{
  "type": "command",
  "domain": "apps",
  "action": "launch",
  "slots": {
    "APP_NAME": {
      "text": "chrome",
      "start": 5,
      "end": 11,
      "confidence": 0.999
    }
  },
  "confidence": 0.97,
  "route": "local"
}

Clarification output

{
  "type": "clarify",
  "question": "What value should I set it to?",
  "debug": {
    "domain": "volume",
    "action": "set",
    "reason": "missing_VALUE"
  }
}
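
A caller will typically branch on the `type` field of these outputs. The sketch below shows one way to do that using only the fields visible in the two examples above. The `dispatch` helper and its return tuples are hypothetical, treating any other `type` as chat fallback is an assumption, and the utterance "open chrome" is assumed in order to illustrate the `start` and `end` offsets.

```python
def dispatch(out):
    """Route an NLU output dict by its "type" field."""
    kind = out.get("type")
    if kind == "command":
        # Keep only the surface text of each slot for the executor.
        slots = {name: slot["text"] for name, slot in out.get("slots", {}).items()}
        return ("execute", out["domain"], out["action"], slots)
    if kind == "clarify":
        return ("ask", out["question"])
    # Assumption: anything else is handed to the chat fallback.
    return ("chat", None)


command_out = {
    "type": "command",
    "domain": "apps",
    "action": "launch",
    "slots": {
        "APP_NAME": {"text": "chrome", "start": 5, "end": 11, "confidence": 0.999}
    },
    "confidence": 0.97,
    "route": "local",
}
clarify_out = {
    "type": "clarify",
    "question": "What value should I set it to?",
    "debug": {"domain": "volume", "action": "set", "reason": "missing_VALUE"},
}

print(dispatch(command_out))
print(dispatch(clarify_out))

# With an assumed utterance, start/end behave as character offsets (end exclusive).
utterance = "open chrome"
span = command_out["slots"]["APP_NAME"]
slot_text = utterance[span["start"]:span["end"]]
```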

Label schema

Domains (10): volume, brightness, media, apps, browser, productivity, screen, window, system, conversation
Actions (33): up, down, set, mute, unmute, play, pause, next, previous, launch, close, switch, search, set_reminder, screenshot, read, explain, undo, quit, chat, minimize, maximize, restore, focus, copy, paste, cut, lock, sleep, wifi_on, wifi_off, bluetooth_on, bluetooth_off
Slot labels (BIO, 15 tags over 7 slot types): VALUE, APP_NAME, QUERY, DURATION, TIME, WINDOW_NAME, TEXT
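
The count of 15 is consistent with a standard BIO scheme over the seven listed slot types: one `B-` and one `I-` tag per type, plus `O`. The sketch below builds that tag set and decodes a tag sequence into character spans. It assumes word-level tokens with character offsets, which may not match the bundled tokenizer, and `bio_labels` and `decode_bio` are illustrative helpers rather than functions from this repository.

```python
SLOT_TYPES = ["VALUE", "APP_NAME", "QUERY", "DURATION", "TIME", "WINDOW_NAME", "TEXT"]


def bio_labels(slot_types):
    """Build the BIO tag set: O plus B-/I- for every slot type."""
    labels = ["O"]
    for slot_type in slot_types:
        labels += [f"B-{slot_type}", f"I-{slot_type}"]
    return labels


def decode_bio(tokens, tags):
    """Merge BIO tags into spans.

    tokens: list of (text, start, end) with character offsets, end exclusive
    tags: one BIO tag per token
    """
    spans = []
    current = None
    for (_, start, end), tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = {"type": tag[2:], "start": start, "end": end}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            # Continuation of the open span: extend its right edge.
            current["end"] = end
        else:
            # "O", or an I- tag with no matching open span, closes any open span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans


labels = bio_labels(SLOT_TYPES)
tokens = [("open", 0, 4), ("chrome", 5, 11)]  # assumed word-level tokens
spans = decode_bio(tokens, ["O", "B-APP_NAME"])
print(len(labels), spans)
```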

Visual Benchmark Evidence

Plots: train and validation loss, smoothed train loss, validation slot F1.

Confusion Matrix Breakdown

Per-Class True vs Predicted

Domain sample distribution (3,110 total samples)

Domain	Samples	Share	Accuracy	Misclassified
volume	430	13.8%	100%	0
brightness	250	8.0%	100%	0
media	340	10.9%	100%	0
apps	250	8.0%	100%	0
browser	120	3.9%	100%	0
productivity	340	10.9%	100%	0
screen	120	3.9%	100%	0
window	340	10.9%	100%	0
system	580	18.6%	100%	0
conversation	340	10.9%	100%	0

Source: validation set confusion matrix

Action sample distribution (3,205 total samples)

Action	Samples	Share	Accuracy	Misclassified
up	170	5.3%	100%	0
down	165	5.1%	100%	0
set	170	5.3%	100%	0
mute	90	2.8%	100%	0
unmute	90	2.8%	100%	0
play	90	2.8%	100%	0
pause	90	2.8%	100%	0
next	90	2.8%	100%	0
previous	90	2.8%	100%	0
launch	90	2.8%	100%	0
close	90	2.8%	100%	0
switch	90	2.8%	100%	0
search	90	2.8%	100%	0
set_reminder	90	2.8%	100%	0
screenshot	90	2.8%	100%	0
read	90	2.8%	100%	0
explain	90	2.8%	100%	0
undo	90	2.8%	100%	0
quit	90	2.8%	100%	0
chat	90	2.8%	100%	0
minimize	90	2.8%	100%	0
maximize	90	2.8%	100%	0
restore	90	2.8%	100%	0
focus	90	2.8%	100%	0
copy	90	2.8%	100%	0
paste	90	2.8%	100%	0
cut	90	2.8%	100%	0
lock	90	2.8%	100%	0
sleep	90	2.8%	100%	0
wifi_on	90	2.8%	100%	0
wifi_off	90	2.8%	100%	0
bluetooth_on	90	2.8%	100%	0
bluetooth_off	90	2.8%	100%	0

Source: validation set confusion matrix
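
Per-class sample counts, accuracy, and misclassification counts like those listed above can be derived from a confusion matrix as sketched below. The sketch assumes rows are true classes and columns are predicted classes. The `per_class_summary` helper is illustrative, and the 3x3 matrix is a toy example built from three of the listed action counts, not the full validation matrix.

```python
def per_class_summary(confusion, labels):
    """Summarize each true class from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    summary = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        correct = confusion[i][i]
        summary[label] = {
            "samples": row_total,
            "accuracy": correct / row_total if row_total else 0.0,
            "misclassified": row_total - correct,
        }
    return summary


# Toy matrix: purely diagonal, using the listed counts for up, down, and set.
labels = ["up", "down", "set"]
confusion = [
    [170, 0, 0],
    [0, 165, 0],
    [0, 0, 170],
]
summary = per_class_summary(confusion, labels)

# Share of the full action set (3,205 samples) held by the "up" class.
up_share = round(100 * summary["up"]["samples"] / 3205, 1)
print(summary, up_share)
```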

Confusion matrix images: domain confusion matrix, action confusion matrix.

Additional diagnostic plots: learning rate schedule, epoch time profile, raw training loss.

Upload-Ready Layout

.
|- README.md
|- .gitattributes
|- LICENSE
|- requirements.txt
|- assets/
|  |- jane-janus-glitch.webp
|- janegpt_v2_janus/
|  |- __init__.py
|  |- architecture.py
|  |- dataset.py
|  |- inference.py
|  |- labels.py
|  |- multitask.py
|- runtime/
|  |- jane_nlu_runtime.py
|- examples/
|  |- demo_inference.py
|  |- demo_runtime.py
|  |- demo_runtime_suite.py
|- weights/
|  |- janegpt_v2_janus.pt
|  |- tokenizer.json
|- reports/
|  |- fair_benchmarks.json
|  |- fair_benchmarks.md
|  |- janus_model_report.json
|  |- janus_model_report.md
|  |- public_benchmarks.json
|  |- *.png benchmark visuals

Limitations

English-focused command language.
Command NLU model, not an open-domain generative chatbot.
MASSIVE and SNIPS mapped-intent accuracy is excluded from headline claims because mapping coverage is partial.

License

Apache-2.0 (see LICENSE).

Evaluation results (self-reported)

Metric	Value
OOD Precision on BANKING77	1.000
OOD F1 on BANKING77	0.878
OOD Recall on BANKING77	0.782
OOD Precision on CLINC OOS	1.000
OOD F1 on CLINC OOS	0.792
OOD Recall on CLINC OOS	0.656
Validation Domain Accuracy	0.998
Validation Action Accuracy	0.999
Slot Extraction F1	1.000