VIBE: Multimodal Brain Encoding from Video, Audio, and Text

VIBE (Video-Input Brain Encoder) is a pretrained multimodal fMRI encoding model for predicting whole-brain fMRI responses from aligned movie transcripts, audio, and video. The model is integrated with the BERG (Brain Encoding Response Generator) library and was trained on the CNeuroMod dataset used for Algonauts 2025 challenge preparation.

This model card corresponds to the VIBE-Gigantic variant. Additional VIBE variants are available separately through the Hugging Face collection.

For full model documentation, BERG integration details, metadata structure, and API usage, see the BERG model page:

https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html

Model summary

VIBE predicts parcel-wise fMRI activity from multimodal movie stimuli. It combines transcript, audio, and video features aligned to fMRI TRs and produces predicted brain responses in Schaefer parcel space.

  • Modality: fMRI
  • Species: Human
  • Stimuli: Video + Audio + Text
  • Atlas: Schaefer 2018, 1000 parcels, 7-network parcellation
  • Training data: CNeuroMod (Algonauts 2025 challenge preparation)
  • Subjects: 4 subjects (Algonauts-style IDs: 1, 2, 3, 5)

Model architecture

VIBE uses a two-stage Transformer architecture for multimodal brain encoding.

  • In the first stage, text, audio, and video features are linearly projected into a shared 256-dimensional space together with a learned subject embedding.
  • A modality-fusion Transformer performs cross-attention across modalities independently at each TR.
  • The fused per-TR representations are then passed to a prediction Transformer with 2 layers to model temporal dependencies across TRs using Rotary Positional Embeddings (RoPE).
  • A final feed-forward layer maps the resulting representations to the 1000-parcel Schaefer output space.
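The stages above can be sketched schematically in PyTorch. This is an illustrative sketch, not the official implementation: the per-modality input dimensions, head counts, layer count of the fusion stage, and the mean-pooling over modality tokens are assumptions, and a standard encoder stands in for the RoPE-based prediction Transformer.

```python
import torch
import torch.nn as nn

D = 256           # shared embedding dimension (from the description above)
N_PARCELS = 1000  # Schaefer 1000-parcel output space

class VibeSketch(nn.Module):
    """Schematic two-stage encoder; dims and pooling are illustrative."""

    def __init__(self, d_text=768, d_audio=512, d_video=512, n_subjects=4):
        super().__init__()
        # Linear projections into the shared space, plus a subject embedding
        self.proj_text = nn.Linear(d_text, D)
        self.proj_audio = nn.Linear(d_audio, D)
        self.proj_video = nn.Linear(d_video, D)
        self.subject_emb = nn.Embedding(n_subjects, D)
        # Stage 1: modality-fusion Transformer, applied independently per TR
        fusion_layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=1)
        # Stage 2: 2-layer prediction Transformer across TRs
        # (RoPE omitted; a vanilla encoder stands in for it here)
        pred_layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.prediction = nn.TransformerEncoder(pred_layer, num_layers=2)
        self.head = nn.Linear(D, N_PARCELS)

    def forward(self, text, audio, video, subject):
        # text/audio/video: [T, d_*] per-TR features; subject: scalar id tensor
        T = text.shape[0]
        toks = torch.stack(
            [self.proj_text(text),
             self.proj_audio(audio),
             self.proj_video(video),
             self.subject_emb(subject).expand(T, -1)],
            dim=1)                                 # [T, 4 modality tokens, D]
        fused = self.fusion(toks).mean(dim=1)      # fuse per TR -> [T, D]
        temporal = self.prediction(fused.unsqueeze(0)).squeeze(0)  # across TRs
        return self.head(temporal)                 # [T, N_PARCELS]
```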

The model is trained using a combined Pearson-correlation + MSE loss and was ensembled across multiple random seeds in the original work.
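A combined Pearson + MSE objective of this kind can be written as follows. This is a minimal sketch: the `alpha` weighting between the two terms is an assumption, not the value used in the original work.

```python
import torch

def pearson_mse_loss(pred, target, alpha=0.5):
    """pred, target: [T, P] tensors (TRs x parcels); alpha is an assumed weight."""
    mse = torch.mean((pred - target) ** 2)
    # parcel-wise Pearson correlation across time
    p = pred - pred.mean(dim=0, keepdim=True)
    t = target - target.mean(dim=0, keepdim=True)
    corr = (p * t).sum(dim=0) / (p.norm(dim=0) * t.norm(dim=0) + 1e-8)
    # minimize (1 - r) alongside the MSE term
    return alpha * (1.0 - corr.mean()) + (1.0 - alpha) * mse
```

When predictions equal the targets, both terms vanish and the loss approaches zero.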

These BERG-integrated VIBE models are modified from the original release to use fewer feature extractors for faster inference and lower memory usage.

For full details, see:

Schad, Dixit, Keck et al. (2025), arXiv:2507.17958

Temporal resolution

The model was trained with a TR of 1.49 s, which is also the prediction resolution.

The transcript input must contain exactly one string per TR, and the number of transcript strings must match the number of TRs derived from the video duration:

floor(video_duration / 1.49)

A mismatch between transcript length and derived video TRs will raise an error.
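The expected transcript length can be derived with a small helper like the one below (the function name is illustrative, not part of the BERG API):

```python
import math

TR = 1.49  # repetition time in seconds

def expected_num_trs(video_duration_s: float) -> int:
    """Number of TRs (and transcript strings) for a video of this duration."""
    return math.floor(video_duration_s / TR)
```

For example, a 60 s clip yields floor(60 / 1.49) = 40 TRs, so the `stimulus` list must contain exactly 40 strings.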

Input and output

Input

Two inputs are required:

  1. stimulus: a list[str] containing one transcript string per fMRI TR
  2. video_path: a str pointing to the source video file used for audio/video feature extraction

Example:

stimulus = ["Hello, are you", "awake? Yes,"]
video_path = "/path/to/movie.mp4"

Output

A torch.Tensor of shape:

[num_timepoints, num_parcels]

where:

  • num_timepoints is the number of predicted TRs
  • num_parcels is the number of Schaefer parcels (1000 by default, or fewer if output selection is used)

Usage with BERG

from berg import BERG

berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

# Inspect available pretrained variants
variants = berg.get_model_variants("fmri-cneuromod_algo2025-vibe")

# Load this model variant
model = berg.get_encoding_model(
    "fmri-cneuromod_algo2025-vibe",
    subject=1,
    device="auto",
    model_variant="ShreyDixit/VIBE-Gigantic",
    low_mem_use=True
)

stimulus = ["Hello, are you", "awake? Yes,"]
video_path = "/path/to/movie.mp4"

responses = berg.encode(
    model,
    stimulus,
    video_path=video_path
)

print(responses.shape)

Optional output selection

VIBE supports optional output filtering through the selection argument in get_encoding_model().

You can select:

  • specific Schaefer network labels via roi
  • specific parcel indices via parcel_index

Valid ROI labels are:

  • "Vis"
  • "SomMot"
  • "DorsAttn"
  • "SalVentAttn"
  • "Limbic"
  • "Cont"
  • "Default"

Example:

model = berg.get_encoding_model(
    "fmri-cneuromod_algo2025-vibe",
    subject=1,
    model_variant="ShreyDixit/VIBE-Gigantic",
    selection={"roi": ["Vis"]}
)

Evaluation

  • In-distribution (Friends S07): 0.3129

  • Out-of-distribution (6 films): 0.2028

Metric:

  • Mean parcel-wise Pearson correlation

This repository contains the VIBE-Gigantic variant released for BERG-compatible inference.

Note that this model is not directly comparable to the winning entries of the Algonauts 2025 Challenge: all winning teams (including ours) used ensembles, whereas this is a single model. It nevertheless achieves competitive scores and is easily accessible to the community.

Metadata

The model exposes ROI mask metadata for the 7 Schaefer networks:

  • Vis
  • SomMot
  • DorsAttn
  • SalVentAttn
  • Limbic
  • Cont
  • Default

Atlas files for glass brain visualization (Schaefer 1000-parcel MNI coordinates) are provided separately in the BERG directory and are not part of the per-subject metadata files.

References

If you use this model, please cite:

@article{schad2025vibe,
  author = {Schad, Daniel Carlström and Dixit, Shrey and Keck, Janis and Studenyak, Viktor and Shpilevoi, Aleksandr and Bicanski, Andrej},
  title = {VIBE: Video-Input Brain Encoder for fMRI Response Modeling},
  journal = {arXiv preprint arXiv:2507.17958},
  year = {2025}
}
