---
tags:
  - sparse-autoencoder
  - crosscoder
  - interpretability
  - qwen2
  - mechanistic-interpretability
  - dictionary-learning
license: mit
---

# cc-D16k-k90

A sparse **CrossCoder** trained to compare layer-13 activations between:
- **Model A (ToolRL)**: `chengq9/ToolRL-Qwen2.5-3B` — fine-tuned with tool-use reinforcement learning
- **Model B (Base)**: `Qwen/Qwen2.5-3B` — vanilla base model

## What is this?

This model learns a sparse dictionary of features from the internal representations of two language models. By comparing which features activate for which model, we can identify (a classification sketch follows this list):
- **What the ToolRL fine-tuning changed** (A-exclusive features)
- **What remained the same** (shared features)
- **What the base model does that ToolRL suppressed** (B-exclusive features)
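
One concrete way to make that comparison is to classify features by their relative per-model decoder norms. The sketch below is illustrative, not the repo's API: it assumes `dfc` has been loaded as in the Quick Start further down, that the decoder weights are exposed as `dfc.W_dec` with shape `(dict_size, 2, 2048)` (check `dfc.py` for the actual layout), and that 0.9/0.1 are reasonable exclusivity thresholds.

```python
import torch

# Hypothetical sketch: classify features by relative decoder norm.
# Assumed layout: dfc.W_dec is (dict_size, n_models=2, d_model=2048).
norms = dfc.W_dec.norm(dim=-1)                   # (16384, 2) per-model norms
rel_a = norms[:, 0] / (norms.sum(dim=1) + 1e-9)  # 1.0 -> A-only, 0.0 -> B-only
a_exclusive = (rel_a > 0.9)                      # thresholds are illustrative
b_exclusive = (rel_a < 0.1)
shared = ~(a_exclusive | b_exclusive)
print(a_exclusive.sum().item(), b_exclusive.sum().item(), shared.sum().item())
```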

## Model Architecture

Standard CrossCoder — all 16384 features shared, no partition masks

| Parameter | Value |
|-----------|-------|
| Dictionary size | 16384 |
| Top-k active features | 90 |
| Layer | 13 (middle layer of Qwen2.5-3B) |
| Activation dimension | 2048 |
| Partitioning | None — all features shared |

### How it works

1. **Encode**: Takes stacked activations `(batch, 2, 2048)` from both models, applies per-model encoder weights, sums across models, and selects the top-90 features via ReLU + top-k (sketched after this list).
2. **Decode**: Reconstructs per-model activations from the sparse feature vector using per-model decoder weights.
3. **Partition masks** (DFC variants only; unused in this all-shared checkpoint): hard binary masks zero out encoder/decoder weights so that exclusive features cannot be used by the wrong model.
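
A minimal sketch of the encode step, assuming a `(n_models, d_model, dict_size)` encoder weight layout; the authoritative implementation lives in `dfc.py`:

```python
import torch

def encode_sketch(x, W_enc, b_enc, k=90):
    # x: (batch, 2, 2048) stacked activations from both models
    # W_enc: (2, 2048, 16384) per-model encoder weights (assumed layout)
    pre = torch.einsum("bmd,mdf->bf", x, W_enc) + b_enc  # sum across models
    pre = torch.relu(pre)
    vals, idx = pre.topk(k, dim=-1)                      # keep top-90 features
    return torch.zeros_like(pre).scatter_(-1, idx, vals) # sparse (batch, 16384)
```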

## Training

| Parameter | Value |
|-----------|-------|
| Loss function | MSE + L1 sparsity (coef: 1e-3) |
| Training steps | 9000 |
| Learning rate | 1e-4 |
| Batch size | 1024 |
| Sparsity coefficient (shared) | 1e-3 |
| Exclusive sparsity coefficient | 0 |
| Optimizer | Adam (grad clip 1.0) |
| W&B project | `dfc-crosscoder-sweep` |
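
Putting the rows above together, a single training step plausibly looks like the sketch below. The forward signature `recon, feats = dfc(batch)` matches the usage demo later in this card; the optimizer wiring is an assumption:

```python
import torch

optimizer = torch.optim.Adam(dfc.parameters(), lr=1e-4)

def train_step(batch):                          # batch: (1024, 2, 2048)
    recon, feats = dfc(batch)
    mse = torch.nn.functional.mse_loss(recon, batch)
    l1 = feats.abs().sum(dim=-1).mean()         # L1 on feature activations
    loss = mse + 1e-3 * l1                      # sparsity coef from the table
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(dfc.parameters(), 1.0)  # grad clip 1.0
    optimizer.step()
    return loss.item()
```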

### Training Data

- **FineWeb**: ~40,000 general web text samples (from `HuggingFaceFW/fineweb` sample-10BT)
- **ToolRL**: ~40,000 tool-use conversation samples (from `emrecanacikgoz/ToolRL`, cycled)
- Activations extracted from layer 13, last token per sample
- Both datasets concatenated and z-score normalized (sketched below)
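
A sketch of that preparation step with illustrative placeholders; the real extraction pipeline is not part of this repo:

```python
import torch

# Placeholders standing in for activations extracted at layer 13, last token.
fineweb_acts = torch.randn(40_000, 2, 2048)
toolrl_acts = torch.randn(40_000, 2, 2048)

acts = torch.cat([fineweb_acts, toolrl_acts], dim=0)
mean, std = acts.mean(dim=0, keepdim=True), acts.std(dim=0, keepdim=True)
acts = (acts - mean) / (std + 1e-6)  # z-score; per-dimension axis is assumed
```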

## Usage

### Quick Start

```python
import torch
from huggingface_hub import hf_hub_download

# Download model files
repo_id = "antebe1/cc-D16k-k90"
for fname in ["model.pt", "config.json", "dfc.py"]:
    hf_hub_download(repo_id=repo_id, filename=fname, local_dir="./model")

# Load the crosscoder
import sys; sys.path.insert(0, "./model")
from dfc import DFCCrossCoder

dfc = DFCCrossCoder.load("./model", device="cuda")
print(f"Loaded: dict_size={dfc.dict_size}, k={dfc.k}")
```

### Extract Features from Real Models

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load both models
model_a = AutoModelForCausalLM.from_pretrained("chengq9/ToolRL-Qwen2.5-3B", device_map="cuda:0")
model_b = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", device_map="cuda:1")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

# Get activations from layer 13
# NOTE: hidden_states[0] = embeddings, hidden_states[i] = output of layer i-1
#       so layer 13 activations are at index 13+1
text = "Use the search tool to find recent papers on RLHF"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out_a = model_a(**inputs.to("cuda:0"), output_hidden_states=True)
    out_b = model_b(**inputs.to("cuda:1"), output_hidden_states=True)
    act_a = out_a.hidden_states[13 + 1][:, -1, :]  # last token, layer 13
    act_b = out_b.hidden_states[13 + 1][:, -1, :]

# Stack and encode
activations = torch.stack([act_a.cpu(), act_b.cpu()], dim=1)  # (1, 2, 2048)
features = dfc.encode(activations.to(dfc.W_enc.device))

print(f"Active features: {(features > 0).sum().item()} / {dfc.dict_size}")
```

### Analyze Partitions (DFC only)

```python
stats = dfc.feature_stats(features)
print(f"L0 total:    {stats['l0_total']:.1f}")
print(f"L0 A-excl:   {stats['l0_a_excl']:.1f}")
print(f"L0 B-excl:   {stats['l0_b_excl']:.1f}")
print(f"L0 shared:   {stats['l0_shared']:.1f}")

# Check reconstruction quality
recon, feats = dfc(activations.to(dfc.W_enc.device))
mse = torch.nn.functional.mse_loss(recon.cpu(), activations)
print(f"Reconstruction MSE: {mse.item():.6f}")
```

## Files

| File | Description |
|------|-------------|
| `model.pt` | PyTorch state dict (encoder/decoder weights + partition masks) |
| `config.json` | Architecture config: dict_size, k, partition sizes (n_a, n_b) |
| `hparams.json` | Full training hyperparameters including loss, lr, steps, etc. |
| `dfc.py` | `DFCCrossCoder` class definition — required to load model.pt |
| `demo.py` | Feature extraction demo (works with downloaded model) |
| `requirements.txt` | Python dependencies |

## Part of a Sweep

This model is one of 48 models in a hyperparameter sweep. See the full collection:

| Axis | Values |
|------|--------|
| k (top-k) | 45, 90, 160 |
| dict_size | 8,192 / 16,384 |
| Architecture | DFC (partitioned) / CrossCoder (all shared) |
| Exclusive % (DFC) | 3%, 5%, 10% |
| Exclusive sparsity | 1e-3 (penalized) / 0 (free) |
| CrossCoder L1 | with / without |

## Citation

```bibtex
@misc{cc-D16k-k90,
  title={CrossCoder: ToolRL vs Base Qwen2.5-3B},
  author={Andre Shportko},
  year={2026},
  url={https://huggingface.co/antebe1/cc-D16k-k90}
}
```