---
tags:
- sparse-autoencoder
- crosscoder
- interpretability
- qwen2
- mechanistic-interpretability
- dictionary-learning
license: mit
---

# cc-D16k-k90

A sparse **CrossCoder** trained to compare layer-13 activations between:

- **Model A (ToolRL)**: `chengq9/ToolRL-Qwen2.5-3B` — fine-tuned with tool-use reinforcement learning
- **Model B (Base)**: `Qwen/Qwen2.5-3B` — the vanilla base model

## What is this?

This model learns a sparse dictionary of features from the internal representations of two language models. By comparing which features activate for which model, we can identify:

- **What the ToolRL fine-tuning changed** (A-exclusive features)
- **What remained the same** (shared features)
- **What the base model does that ToolRL suppressed** (B-exclusive features)

## Model Architecture

Standard CrossCoder — all 16,384 features shared, no partition masks.

| Parameter | Value |
|-----------|-------|
| Dictionary size | 16,384 |
| Top-k active features | 90 |
| Layer | 13 (middle layer of Qwen2.5-3B) |
| Activation dimension | 2048 |
| Partitioning | None — all features shared |

### How it works

1. **Encode**: takes stacked activations `(batch, 2, 2048)` from both models, applies per-model encoder weights, sums across models, and selects the top-90 features via ReLU + top-k.
2. **Decode**: reconstructs per-model activations from the sparse feature vector using per-model decoder weights.
3. **Partition masks** (DFC variants only): hard binary masks zero out encoder/decoder weights to enforce that exclusive features cannot be used by the wrong model. This checkpoint uses no masks.

A minimal code sketch of the encode/decode path appears in the appendix at the end of this card.

## Training

| Parameter | Value |
|-----------|-------|
| Loss function | MSE + L1 sparsity (coef: 1e-3) |
| Training steps | 9000 |
| Learning rate | 1e-4 |
| Batch size | 1024 |
| Sparsity coefficient (shared) | 1e-3 |
| Exclusive sparsity coefficient | 0 |
| Optimizer | Adam (grad clip 1.0) |
| W&B project | `dfc-crosscoder-sweep` |

### Training Data

- **FineWeb**: ~40,000 general web text samples (from `HuggingFaceFW/fineweb`, sample-10BT)
- **ToolRL**: ~40,000 tool-use conversation samples (from `emrecanacikgoz/ToolRL`, cycled)
- Activations extracted from layer 13, last token per sample
- Both datasets concatenated and z-score normalized

## Usage

### Quick Start

```python
import sys
import torch
from huggingface_hub import hf_hub_download

# Download model files
repo_id = "antebe1/cc-D16k-k90"
for fname in ["model.pt", "config.json", "dfc.py"]:
    hf_hub_download(repo_id=repo_id, filename=fname, local_dir="./model")

# Load the crosscoder (dfc.py defines the DFCCrossCoder class)
sys.path.insert(0, "./model")
from dfc import DFCCrossCoder

dfc = DFCCrossCoder.load("./model", device="cuda")
print(f"Loaded: dict_size={dfc.dict_size}, k={dfc.k}")
```

### Extract Features from Real Models

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load both models
model_a = AutoModelForCausalLM.from_pretrained("chengq9/ToolRL-Qwen2.5-3B", device_map="cuda:0")
model_b = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", device_map="cuda:1")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

# Get activations from layer 13
# NOTE: hidden_states[0] = embeddings, hidden_states[i] = output of layer i-1,
# so layer-13 activations are at index 13 + 1
text = "Use the search tool to find recent papers on RLHF"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out_a = model_a(**inputs.to("cuda:0"), output_hidden_states=True)
    out_b = model_b(**inputs.to("cuda:1"), output_hidden_states=True)

act_a = out_a.hidden_states[13 + 1][:, -1, :]  # last token, layer 13
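# Optional sanity check (an assumption, not part of the original demo):
# hidden_states should hold num_hidden_layers + 1 entries, embeddings first,
# so index 13 + 1 is indeed the output of layer 13.
assert len(out_a.hidden_states) == model_a.config.num_hidden_layers + 1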
act_b = out_b.hidden_states[13 + 1][:, -1, :]

# Stack per-model activations and encode them into sparse features
activations = torch.stack([act_a.cpu(), act_b.cpu()], dim=1)  # (1, 2, 2048)
features = dfc.encode(activations.to(dfc.W_enc.device))
print(f"Active features: {(features > 0).sum().item()} / {dfc.dict_size}")
```

### Analyze Partitions (DFC only)

Note that this checkpoint has no partitions (all features are shared), so the A- and B-exclusive L0 counts will be zero.

```python
stats = dfc.feature_stats(features)
print(f"L0 total:  {stats['l0_total']:.1f}")
print(f"L0 A-excl: {stats['l0_a_excl']:.1f}")
print(f"L0 B-excl: {stats['l0_b_excl']:.1f}")
print(f"L0 shared: {stats['l0_shared']:.1f}")

# Check reconstruction quality
recon, feats = dfc(activations.to(dfc.W_enc.device))
mse = torch.nn.functional.mse_loss(recon.cpu(), activations)
print(f"Reconstruction MSE: {mse.item():.6f}")
```

## Files

| File | Description |
|------|-------------|
| `model.pt` | PyTorch state dict (encoder/decoder weights + partition masks) |
| `config.json` | Architecture config: dict_size, k, partition sizes (n_a, n_b) |
| `hparams.json` | Full training hyperparameters (loss, lr, steps, etc.) |
| `dfc.py` | `DFCCrossCoder` class definition — required to load `model.pt` |
| `demo.py` | Feature extraction demo (works with the downloaded model) |
| `requirements.txt` | Python dependencies |

## Part of a Sweep

This model is one of 48 models in a hyperparameter sweep; the full collection covers the following axes:

| Axis | Values |
|------|--------|
| k (top-k) | 45, 90, 160 |
| dict_size | 8,192 / 16,384 |
| Architecture | DFC (partitioned) / CrossCoder (all shared) |
| Exclusive % (DFC) | 3%, 5%, 10% |
| Exclusive sparsity | 1e-3 (penalized) / 0 (free) |
| CrossCoder L1 | with / without |

## Citation

```bibtex
@misc{cc-D16k-k90,
  title={CrossCoder: ToolRL vs Base Qwen2.5-3B},
  author={Andre Shportko},
  year={2026},
  url={https://huggingface.co/antebe1/cc-D16k-k90}
}
```
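## Appendix: Encode/Decode Sketch

For readers who want the "How it works" steps as code, below is a minimal, self-contained sketch of a two-model top-k CrossCoder forward pass, written only from this card's description. The class and attribute names (`ToyCrossCoder` and its randomly initialized weights) are illustrative assumptions and are not guaranteed to match the actual `DFCCrossCoder` in `dfc.py`.

```python
import torch
import torch.nn as nn


class ToyCrossCoder(nn.Module):
    """Illustrative top-k CrossCoder: two models, one shared dictionary."""

    def __init__(self, d_act=2048, dict_size=16384, k=90, n_models=2):
        super().__init__()
        self.k = k
        # Per-model encoder and decoder weights ("How it works", steps 1-2)
        self.W_enc = nn.Parameter(torch.randn(n_models, d_act, dict_size) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(dict_size))
        self.W_dec = nn.Parameter(torch.randn(n_models, dict_size, d_act) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(n_models, d_act))

    def encode(self, x):
        # x: (batch, n_models, d_act); apply per-model encoders, sum across models
        pre = torch.einsum("bmd,mdf->bf", x, self.W_enc) + self.b_enc
        acts = torch.relu(pre)
        # ReLU + top-k: keep the k largest features per example, zero the rest
        vals, idx = acts.topk(self.k, dim=-1)
        return torch.zeros_like(acts).scatter_(-1, idx, vals)

    def decode(self, feats):
        # feats: (batch, dict_size) -> per-model reconstructions (batch, n_models, d_act)
        return torch.einsum("bf,mfd->bmd", feats, self.W_dec) + self.b_dec

    def forward(self, x):
        feats = self.encode(x)
        return self.decode(feats), feats


# Shapes from this card: batch of 4, two models, 2048-dim activations
x = torch.randn(4, 2, 2048)
cc = ToyCrossCoder()
recon, feats = cc(x)
# MSE + L1 sparsity, mirroring the loss in the training table (coef 1e-3)
loss = torch.nn.functional.mse_loss(recon, x) + 1e-3 * feats.abs().sum(-1).mean()
print(recon.shape, int((feats > 0).sum(-1).max()), loss.item())
```

The `scatter_` step materializes the sparse feature vector densely (at most 90 nonzero entries per example), which keeps the decode a plain matrix product.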