---
tags:
- sparse-autoencoder
- crosscoder
- interpretability
- qwen2
- mechanistic-interpretability
- dictionary-learning
license: mit
---
# cc-D16k-k90
A sparse **CrossCoder** trained to compare layer-13 activations between:
- **Model A (ToolRL)**: `chengq9/ToolRL-Qwen2.5-3B` — fine-tuned with tool-use reinforcement learning
- **Model B (Base)**: `Qwen/Qwen2.5-3B` — vanilla base model
## What is this?
This model learns a sparse dictionary of features from the internal representations of two language models. By comparing which features activate for which model (see the sketch after this list), we can identify:
- **What the ToolRL fine-tuning changed** (A-exclusive features)
- **What remained the same** (shared features)
- **What the base model does that ToolRL suppressed** (B-exclusive features)
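Because this model has no partition masks, these three groups are not baked into the architecture; they have to be recovered by analysis. One common approach in the crosscoder literature is to compare each feature's per-model decoder norms. Below is a minimal sketch of that idea (run after loading `dfc` as shown under Usage). It assumes the decoder weights are exposed as `dfc.W_dec` with shape `(dict_size, 2, 2048)` and model A at index 0; the attribute name, layout, and thresholds are illustrative assumptions, not a confirmed API of this repo:

```python
import torch

# ASSUMPTION: dfc.W_dec has shape (dict_size, 2, 2048), with model A (ToolRL)
# at index 0 and model B (base) at index 1 along the model dimension.
dec_norms = dfc.W_dec.norm(dim=-1)                         # (dict_size, 2)
rel_norm = dec_norms[:, 0] / (dec_norms.sum(dim=1) + 1e-8)

# Thresholds below are illustrative choices, not values from this repo.
a_exclusive = (rel_norm > 0.9).nonzero(as_tuple=True)[0]   # decoded almost only into A
b_exclusive = (rel_norm < 0.1).nonzero(as_tuple=True)[0]   # decoded almost only into B
shared = ((rel_norm >= 0.4) & (rel_norm <= 0.6)).nonzero(as_tuple=True)[0]
print(f"A-excl: {len(a_exclusive)}, B-excl: {len(b_exclusive)}, shared: {len(shared)}")
```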
## Model Architecture
Standard CrossCoder — all 16384 features shared, no partition masks
| Parameter | Value |
|-----------|-------|
| Dictionary size | 16384 |
| Top-k active features | 90 |
| Layer | 13 (an intermediate layer; Qwen2.5-3B has 36 decoder layers) |
| Activation dimension | 2048 |
| Partitioning | None — all features shared |
### How it works
1. **Encode**: Takes stacked activations `(batch, 2, 2048)` from both models, applies per-model encoder weights, sums across models, and selects the top-90 features via ReLU + top-k.
2. **Decode**: Reconstructs per-model activations from the sparse feature vector using per-model decoder weights (see the sketch after this list).
3. **Partition masks** (DFC variants only; unused in this model): Hard binary masks zero out encoder/decoder weights to enforce that exclusive features cannot be used by the wrong model.
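A minimal sketch of the encode/decode steps above, with weight shapes assumed for illustration (the authoritative implementation is the `DFCCrossCoder` class in `dfc.py`):

```python
import torch

def crosscoder_forward(x, W_enc, b_enc, W_dec, b_dec, k=90):
    """Sketch of a top-k crosscoder forward pass (assumed shapes).

    x:     (batch, 2, d_act)      stacked activations from both models
    W_enc: (2, d_act, dict_size)  per-model encoder weights
    b_enc: (dict_size,)           encoder bias
    W_dec: (dict_size, 2, d_act)  per-model decoder weights
    b_dec: (2, d_act)             per-model decoder bias
    """
    # Encode: per-model projection, summed across models, then ReLU + top-k.
    pre = torch.einsum("bmd,mdf->bf", x, W_enc) + b_enc     # (batch, dict_size)
    acts = torch.relu(pre)
    topk = acts.topk(k, dim=-1)
    feats = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
    # Decode: reconstruct each model's activations from the sparse code.
    recon = torch.einsum("bf,fmd->bmd", feats, W_dec) + b_dec  # (batch, 2, d_act)
    return recon, feats
```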
## Training
| Parameter | Value |
|-----------|-------|
| Loss function | MSE + L1 sparsity (coef 1e-3; sketched below) |
| Training steps | 9000 |
| Learning rate | 1e-4 |
| Batch size | 1024 |
| Sparsity coefficient (shared) | 1e-3 |
| Exclusive sparsity coefficient | 0 |
| Optimizer | Adam (grad clip 1.0) |
| W&B project | `dfc-crosscoder-sweep` |
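A minimal sketch of one optimization step implied by the table (MSE + L1, Adam at lr 1e-4, grad clip 1.0). Here `batch` is a hypothetical `(1024, 2, 2048)` activation tensor, and the sketch assumes `dfc` is a standard `nn.Module`; this is an illustration of the listed settings, not the repo's actual training loop:

```python
import torch

L1_COEF = 1e-3  # shared sparsity coefficient from the table

def step(dfc, opt, batch):
    # batch: (1024, 2, 2048) stacked, normalized activations (hypothetical input).
    recon, feats = dfc(batch)  # forward returns (reconstruction, features), as in Usage below
    mse = torch.nn.functional.mse_loss(recon, batch)
    l1 = feats.abs().sum(dim=-1).mean()  # L1 on the (non-negative) feature activations
    loss = mse + L1_COEF * l1
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(dfc.parameters(), 1.0)  # grad clip 1.0
    opt.step()
    return loss.item()

# opt = torch.optim.Adam(dfc.parameters(), lr=1e-4)  # optimizer settings from the table
```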
### Training Data
- **FineWeb**: ~40,000 general web text samples (from `HuggingFaceFW/fineweb` sample-10BT)
- **ToolRL**: ~40,000 tool-use conversation samples (from `emrecanacikgoz/ToolRL`, cycled)
- Activations extracted from layer 13, last token per sample
- Both datasets concatenated and z-score normalized (sketched below)
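The z-score normalization mentioned above is the standard per-dimension transform. A minimal sketch, assuming the collected activations live in a tensor `acts` of shape `(n_samples, 2, 2048)` (the variable name is illustrative):

```python
import torch

# acts: (n_samples, 2, 2048) last-token, layer-13 activations from both models.
mean = acts.mean(dim=0, keepdim=True)       # per model and per dimension
std = acts.std(dim=0, keepdim=True)
acts_normed = (acts - mean) / (std + 1e-6)  # small epsilon guards against dead dims
```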
## Usage
### Quick Start
```python
import torch
from huggingface_hub import hf_hub_download

# Download model files
repo_id = "antebe1/cc-D16k-k90"
for fname in ["model.pt", "config.json", "dfc.py"]:
    hf_hub_download(repo_id=repo_id, filename=fname, local_dir="./model")

# Load the crosscoder
import sys; sys.path.insert(0, "./model")
from dfc import DFCCrossCoder

dfc = DFCCrossCoder.load("./model", device="cuda")
print(f"Loaded: dict_size={dfc.dict_size}, k={dfc.k}")
```
### Extract Features from Real Models
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load both models
model_a = AutoModelForCausalLM.from_pretrained("chengq9/ToolRL-Qwen2.5-3B", device_map="cuda:0")
model_b = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", device_map="cuda:1")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

# Get activations from layer 13
# NOTE: hidden_states[0] = embeddings, hidden_states[i] = output of layer i-1,
# so layer 13's output is at index 13 + 1
text = "Use the search tool to find recent papers on RLHF"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out_a = model_a(**inputs.to("cuda:0"), output_hidden_states=True)
    out_b = model_b(**inputs.to("cuda:1"), output_hidden_states=True)
act_a = out_a.hidden_states[13 + 1][:, -1, :]  # last token, layer 13
act_b = out_b.hidden_states[13 + 1][:, -1, :]

# Stack and encode
activations = torch.stack([act_a.cpu(), act_b.cpu()], dim=1)  # (1, 2, 2048)
features = dfc.encode(activations.to(dfc.W_enc.device))
print(f"Active features: {(features > 0).sum().item()} / {dfc.dict_size}")
```
### Analyze Partitions (DFC only)
```python
stats = dfc.feature_stats(features)
print(f"L0 total: {stats['l0_total']:.1f}")
print(f"L0 A-excl: {stats['l0_a_excl']:.1f}")
print(f"L0 B-excl: {stats['l0_b_excl']:.1f}")
print(f"L0 shared: {stats['l0_shared']:.1f}")
# Check reconstruction quality
recon, feats = dfc(activations.to(dfc.W_enc.device))
mse = torch.nn.functional.mse_loss(recon.cpu(), activations)
print(f"Reconstruction MSE: {mse.item():.6f}")
```
## Files
| File | Description |
|------|-------------|
| `model.pt` | PyTorch state dict (encoder/decoder weights + partition masks) |
| `config.json` | Architecture config: dict_size, k, partition sizes (n_a, n_b) |
| `hparams.json` | Full training hyperparameters including loss, lr, steps, etc. |
| `dfc.py` | `DFCCrossCoder` class definition — required to load model.pt |
| `demo.py` | Feature extraction demo (works with downloaded model) |
| `requirements.txt` | Python dependencies |
## Part of a Sweep
This model is one of 48 in a hyperparameter sweep over the following axes; see the full collection:
| Axis | Values |
|------|--------|
| k (top-k) | 45, 90, 160 |
| dict_size | 8192, 16384 |
| Architecture | DFC (partitioned) / CrossCoder (all shared) |
| Exclusive % (DFC) | 3%, 5%, 10% |
| Exclusive sparsity | 1e-3 (penalized) / 0 (free) |
| CrossCoder L1 | with / without |
## Citation
```bibtex
@misc{cc-D16k-k90,
  title={CrossCoder: ToolRL vs Base Qwen2.5-3B},
  author={Andre Shportko},
  year={2026},
  url={https://huggingface.co/antebe1/cc-D16k-k90}
}
```