---
license: mit
tags:
- sparse-autoencoder
- biology
- computer-vision
- dinov2
- morphological-traits
- insects
- biodiversity
- feature-extraction
datasets:
- osunlp/bioscan-traits
pipeline_tag: feature-extraction
---

# SAE Trait Annotation for Organismal Images

Sparse Autoencoder (SAE) checkpoint from the ICLR 2026 paper:
[**Automatic Image-Level Morphological Trait Annotation for Organismal Images**](https://arxiv.org/pdf/2604.01619)

**Authors:** Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su

- Website: [osu-nlp-group.github.io/sae-trait-annotation](https://osu-nlp-group.github.io/sae-trait-annotation/)
- Code: [OSU-NLP-Group/sae-trait-annotation](https://github.com/OSU-NLP-Group/sae-trait-annotation)
- Dataset: [osunlp/bioscan-traits](https://huggingface.co/datasets/osunlp/bioscan-traits)

## Model Description

This SAE is trained on penultimate-layer activations of a DINOv2 ViT-B/14 model applied to insect images from [BIOSCAN-5M](https://github.com/zahrag/BIOSCAN-5M). Its latents capture interpretable visual features that correspond to species-level morphological traits (e.g., wing venation, body coloration, antennal structure). These latents are used to steer a multimodal LLM (Qwen2.5-VL-72B) into generating natural-language trait annotations.
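For orientation, a standard ReLU sparse autoencoder with the dimensions listed below looks roughly like the following. This is a minimal sketch, not the `saev` implementation; the layer structure, class name, and use of a plain linear encoder/decoder are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Illustrative ReLU SAE with d_vit=768 and expansion factor 32.

    NOTE: a sketch only -- the actual checkpoint is loaded via saev.nn.load,
    and its internals may differ (e.g., bias handling, weight normalization).
    """

    def __init__(self, d_vit: int = 768, expansion: int = 32):
        super().__init__()
        d_sae = d_vit * expansion  # 24,576 latent dimensions
        self.enc = nn.Linear(d_vit, d_sae)
        self.dec = nn.Linear(d_sae, d_vit)

    def forward(self, x: torch.Tensor):
        f_x = torch.relu(self.enc(x))  # sparse latent activations
        x_hat = self.dec(f_x)          # reconstruction of the ViT activations
        return x_hat, f_x


sae = SparseAutoencoder()
acts = torch.randn(4, 256, 768)  # (batch, patches, d_vit) DINOv2 activations
x_hat, f_x = sae(acts)
print(x_hat.shape, f_x.shape)
# torch.Size([4, 256, 768]) torch.Size([4, 256, 24576])
```

The sparsity penalty on `f_x` during training (see Training Details below) pushes most latents to zero per patch, so each active latent can be tied to a localized visual feature.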
**Architecture:**

- Base encoder: DINOv2 ViT-B/14 (frozen), activations from layer `-2`
- SAE input dimension (`d_vit`): 768
- Expansion factor: 32 → **24,576 latent dimensions**
- Training data: patch-level activations from BIOSCAN-5M

## Usage

Clone the [code repository](https://github.com/OSU-NLP-Group/sae-trait-annotation) (which vendors the `saev` library), then load and run the SAE as follows:

```python
import torch
import saev.nn
import saev.activations
from torchvision import datasets
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the image transform and DINOv2 ViT-B/14 backbone
img_transform = saev.activations.make_img_transform("dinov2", "sae.pt")
vit = saev.activations.make_vit("dinov2", "dinov2_vitb14")

# Wrap the ViT to record activations from layer 10 (penultimate), 256 patches
recorded_vit = saev.activations.RecordedVisionTransformer(
    vit, n_patches=256, cls_token=True, layers=[10]
).to(device)

# Load the SAE checkpoint
sae = saev.nn.load("sae.pt").to(device)
sae.eval()

# --- Encode a batch of images ---
# dataset: torchvision ImageFolder with images at 224x224
dataset = datasets.ImageFolder(root="/path/to/images/train")

def collate_fn(batch):
    images, labels = zip(*batch)
    return list(images), torch.tensor(labels)

loader = DataLoader(dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

with torch.no_grad():
    for images, labels in loader:
        images_t = torch.stack(img_transform(images)).to(device)

        # vit_acts: (batch, n_layers, n_patches+1, d_vit)
        _, vit_acts = recorded_vit(images_t)

        # Select layer 0 of the recorded layers, drop the CLS token
        vit_acts = vit_acts[:, 0, 1:, :]  # (batch, 256, 768)

        # SAE forward: returns (reconstruction, features, aux)
        _, f_x, _ = sae(vit_acts)  # f_x: (batch, 256, 24576)

        # Threshold activations to find active latents (default thresh=0.9)
        active = (f_x > 0.9)  # (batch, 256, 24576) bool
```

The active latent indices per patch identify which SAE dimensions fire on
each image region. These are used downstream to find species-prominent latents and generate trait annotations via an MLLM. See [`create_trait_dataset_mllm_sae.py`](https://github.com/OSU-NLP-Group/sae-trait-annotation/blob/main/create_trait_dataset_mllm_sae.py) for the full pipeline.

## Training Details

- **Training data:** BIOSCAN-5M insect images preprocessed into `ImageFolder` layout
- **Learning rate:** 1e-3
- **Sparsity coefficient (alpha):** 4e-4
- **Data patches:** patch-level (256 patches/image), unscaled mean and norm

## Intended Use

- Generating morphological trait annotations for organismal (insect) images
- Interpretability research on vision foundation models via SAE latent analysis
- Downstream fine-tuning of classifiers using trait-annotated data (e.g., with BioCLIP)

## Citation

```bibtex
@inproceedings{pahuja2026automatic,
  title={Automatic Image-Level Morphological Trait Annotation for Organismal Images},
  author={Vardaan Pahuja and Samuel Stevens and Alyson East and Sydne Record and Yu Su},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=oFRbiaib5Q}
}
```

## Acknowledgments

**Code**

- [SAEV](https://github.com/OSU-NLP-Group/saev) for sparse autoencoder training infrastructure.
- [BioCLIP](https://github.com/Imageomics/bioclip) for downstream training/evaluation tooling.

**Funding**

This research was supported in part by NSF CAREER \#2443149, NSF OAC 2118240, and an Alfred P. Sloan Foundation Fellowship. Computational resources were provided by the Ohio Supercomputer Center. S. Record and A. East were additionally supported by NSF Award No. 242918 (EPSCOR Research Fellows: Advancing NEON-Enabled Science and Workforce Development at the University of Maine with AI) and Hatch project Award \#MEO-022425 from the USDA National Institute of Food and Agriculture.

**People**

We thank colleagues in the OSU NLP group for valuable feedback.
This work was conceived in part at [Funcapalooza](https://github.com/Imageomics/FuncaPalooza-2025/wiki/).