Qwen3.5-4B Capability Vector v1 (cross-model contrast, 2026-05-14)

STATUS: null result in aggregate. This was the first attempt at a CAA-style additive "capability direction". Across ~80 Docker runs of terminal-bench-2 sweeps, the vector produced no statistically significant pass-rate lift. The direction separates the two trace groups perfectly in the residual stream (discriminator AUC=1.0), but it encodes output style (a parse_fail ↔ no_cmd trade-off), not task-solving capability. See the capvec-v2-samemodel and abliterated repos for the follow-up experiments that confirmed this.

What this is

A residual-stream direction tensor for Qwen/Qwen3.5-4B, computed as the mean difference between residual activations from 5 SFT-successful agent traces and 12 failing traces (base, cp600, DPO) on the terminal-bench-2 sprint. The approach is adapted from NousResearch/llm-abliteration and failspy's ortho cookbook, but inverted: we add the direction at inference instead of orthogonally removing a refusal direction from the weights.
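The mean-difference step can be illustrated with a dependency-free sketch. It assumes each trace has already been reduced to a single activation vector per layer (how the repo pools over token positions is not stated here), and the helper names and toy numbers are hypothetical, not taken from scripts/.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_difference_direction(pos, neg):
    """Unit-norm direction pointing from the negative-class mean to the positive-class mean."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]

# Toy 3-d vectors standing in for per-trace residual activations (made-up numbers).
pos = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]   # "successful" traces
neg = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]   # "failing" traces
direction = mean_difference_direction(pos, neg)
```

In the toy data the classes differ only along the first coordinate, so the extracted direction is the first basis vector.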

Per-layer AUC ranking

Layers 12–22 all reach AUC=1.000 on per-trace projection separation. Layer 22 was picked for the published dir.pt (max margin among 1.0-AUC layers).

layer  AUC    margin
22     1.000  1.95
19     1.000  1.44
26     0.98   2.65

See vectors/ranking.csv for all 32 layers.
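For readers unfamiliar with the metric, per-trace projection AUC can be sketched as follows: project each trace's activation onto the direction and count how often a positive trace scores above a negative one. This is a minimal illustration with made-up vectors and hypothetical function names; the margin column is not reproduced because its definition is not given here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def projection_auc(pos, neg, direction):
    """Fraction of (pos, neg) pairs where the positive projects higher; ties count 0.5."""
    pos_scores = [dot(x, direction) for x in pos]
    neg_scores = [dot(x, direction) for x in neg]
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

direction = [1.0, 0.0]
pos = [[2.0, 0.0], [3.0, 1.0]]                     # projections: 2.0, 3.0
neg_separable = [[0.0, 5.0], [1.0, 0.0]]           # projections: 0.0, 1.0
neg_overlapping = neg_separable + [[2.5, 0.0]]     # adds a projection of 2.5

auc_separable = projection_auc(pos, neg_separable, direction)
auc_overlapping = projection_auc(pos, neg_overlapping, direction)
```

AUC=1.0 means every positive trace projects above every negative trace. With 5 positives and 12 negatives that is only 60 pairwise comparisons.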

Behavioural results (terminal-bench-2)

sweep                             configuration                    pass / N  Fisher's p vs base
α=4 sweep, log-summary            steered-L22-α4                   1/3       0.21
α-grid log-summary                α ∈ {2,4,6,8}, L26-α4, N=5 each  0/20
wide-task sweep (5 sprint tasks)  steered-L22-α4                   1/5       1.0 (same as base)
multi-layer L13/16/19/22          α=0.5, 1.0                       0/9
negative α (sub)                  α=−2, α=−4                       1/12      0.62

Net: null lift across all sweeps. The single early α=4 win on log-summary did not replicate.
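The p-values come from Fisher's exact test on pass/fail counts. Below is a standard-library sketch of the two-sided test, not the repo's own analysis code. The base-model counts are not listed for most rows, so the only row checked is the wide-task one, under the assumption that "same as base" means the base model also passed 1 of 5.

```python
from math import comb

def fisher_exact_two_sided(pass_a, n_a, pass_b, n_b):
    """Two-sided Fisher exact p-value for two pass/N outcomes."""
    total_pass = pass_a + pass_b
    total = n_a + n_b
    denom = comb(total, n_a)

    def prob(k):
        # Hypergeometric probability that arm A holds k of the total passes.
        return comb(total_pass, k) * comb(total - total_pass, n_a - k) / denom

    lo = max(0, n_a - (total - total_pass))
    hi = min(total_pass, n_a)
    p_obs = prob(pass_a)
    # Sum every table at least as unlikely as the observed one
    # (small relative tolerance guards against floating-point ties).
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

# Wide-task sweep: steered 1/5 vs an assumed base of 1/5.
p_wide = fisher_exact_two_sided(1, 5, 1, 5)

# For scale: even a perfect 3/3 vs 0/3 split only reaches p = 0.1.
p_extreme = fisher_exact_two_sided(3, 3, 0, 3)
```

The second call shows why single wins at N=3 or N=5 carry little weight: at these sample sizes even a clean sweep would not reach conventional significance.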

Quick use (still functional as a research artifact)

import torch
from transformers import AutoTokenizer, AutoModelForImageTextToText
from huggingface_hub import hf_hub_download

tok = AutoTokenizer.from_pretrained('Qwen/Qwen3.5-4B')
model = AutoModelForImageTextToText.from_pretrained(
    'Qwen/Qwen3.5-4B', dtype=torch.bfloat16, device_map={'':0})

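# Download the published direction file from the Hub.
vec_path = hf_hub_download('AlexWortega/qwen3.5-4b-capability-vector-20260514', 'vectors/dir.pt')
# weights_only=False unpickles arbitrary objects; only load files you trust.
vec = torch.load(vec_path, weights_only=False)
# Best layer = 22. Apply at inference via a residual-stream hook.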

See scripts/steer.py for the hook implementation.
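For orientation, additive steering with a PyTorch forward hook might look like the sketch below. It is not the contents of scripts/steer.py. It assumes the plain update h' = h + α·v with no rescaling, and it uses a toy nn.Identity module in place of a decoder layer so that it runs without downloading the model.

```python
import torch
from torch import nn

def make_steering_hook(direction, alpha):
    """Forward hook that adds alpha * direction to a layer's hidden-state output."""
    def hook(module, args, output):
        # Decoder layers may return a tensor or a tuple whose first item is the hidden state.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + tuple(output[1:])
        return steered
    return hook

# Toy stand-in for a decoder layer (hypothetical; the real target would be layer 22).
layer = nn.Identity()
direction = torch.tensor([1.0, 0.0, 0.0, 0.0])   # unit-norm toy direction
handle = layer.register_forward_hook(make_steering_hook(direction, alpha=4.0))

hidden_in = torch.zeros(1, 2, 4)                 # (batch, seq, hidden)
hidden_out = layer(hidden_in)                    # every position shifted by 4 * direction

handle.remove()                                  # detach the hook when done
hidden_after = layer(hidden_in)                  # unsteered again
```

On the real model, the hook would be registered on the decoder layer at index 22 with that layer's tensor from dir.pt. The module path to that layer and the exact layout of dir.pt depend on the model class and the file format, so both are left out here.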

Files

  • vectors/dir.pt — 32 unit-norm direction tensors (one per decoder layer)
  • vectors/ranking.csv — per-layer AUC/margin
  • scripts/* — full reproducer (collect/capture/compute/steer/serve/sweep)
  • TASK.md RESEARCH.md PLAN.md RESULTS.md VERIFY.md — full report bundle

Caveats

  • 5 positive traces is a small sample. AUC=1.0 at that n is a real measurement but rests on very few points, so it is fragile.
  • The direction conflates "agent-trace style" with "task-solving": the extracted contrast was between different LoRAs producing different traces, not the same model under different conditions. v2 (same-model contrast) addresses this confound.
  • α≥6 induces "I am done" early-bail behavior. α≤2 has no measurable effect.
  • α=4 sits at the over-steering boundary on log-summary, capable of one-trial flips that don't replicate.