{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SF-Cluster \u2014 frustration-guided MSA subset builder\n", "\n", "**What this notebook does.** Installs the open-source `sf_cluster` package, downloads a small KaiB demo bundle (a 364-sequence MSA + a per-residue Frustration Index matrix from FrustrAI-Seq), and builds two flavours of stratified MSA subsets (`mosaic` and `gradient`) using the contrast-HV/LV score. Everything runs on CPU in roughly two minutes.\n", "\n", "**Who it is for.** Biologists who want reproducible, frustration-stratified MSA slices to feed into an AF-Cluster-style multi-conformer prediction loop.\n", "\n", "**What you do next.** Take the 12 mosaic or 12 gradient A3M subsets emitted at the end of this notebook, run each through ColabFold AF2, and aggregate per the SF-Cluster \u00a79.1 hit criterion.\n", "\n", "---\n", "\n", "> ## LIMITATIONS \u2014 please read\n", "> A controlled comparison on the Main-21 cases shows that **uniform random subsampling performs equivalently on most cases**. The frustration signal is **not** the active ingredient here \u2014 depth reduction is. See the OSS README for the full ablation.\n", ">\n", "> Use this tool when you want **stratified, reproducible MSA subsets** with a clear provenance story \u2014 not as a guaranteed conformational diversity engine. It is a research baseline, not a turnkey accuracy improvement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Install the package\n", "\n", "Pulls the OSS release from Hugging Face. Pure-Python; only depends on `numpy` and `scipy`." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "!pip install -q git+https://huggingface.co/ChatterjeeLab/SF-Cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Download the KaiB demo bundle\n", "\n", "Three files, ~200 KB total: a filtered MSA, a per-residue FI matrix from FrustrAI-Seq, and the parallel sequence-ID list." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "from huggingface_hub import hf_hub_download\n", "from pathlib import Path\n", "import os\n", "\n", "REPO = 'ChatterjeeLab/SF-Cluster'\n", "FILES = ['examples/data/KaiB_filtered.a3m',\n", " 'examples/data/KaiB_fi_matrix.npy',\n", " 'examples/data/KaiB_seq_ids.txt']\n", "\n", "local = {}\n", "for fname in FILES:\n", " p = hf_hub_download(repo_id=REPO, filename=fname, repo_type='model')\n", " local[fname] = p\n", " print(f'{fname:50s} {os.path.getsize(p)/1024:7.1f} KB -> {p}')\n", "\n", "A3M = local['examples/data/KaiB_filtered.a3m']\n", "FI = local['examples/data/KaiB_fi_matrix.npy']\n", "IDS = local['examples/data/KaiB_seq_ids.txt']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Build the pool and stratified subsets\n", "\n", "The `pool_msa` call ties the MSA records to their per-residue FI vectors. `contrast_hvlv` computes the per-sequence high-variance / low-variance FI contrast (see README for the formula). `method_mosaic` and `method_gradient` then deterministically draw 12 subsets of 32 sequences each." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import numpy as np\n", "from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient\n", "\n", "pool = pool_msa(A3M, FI)\n", "print(f'pool: N_seq={pool.n_seq}, L={pool.n_cols}, query={pool.headers[0]!r}')\n", "\n", "score = contrast_hvlv(pool.fi_matrix)\n", "print(f'contrast_hvlv: shape={score.shape}, '\n", " f'min={score.min():+.3f}, median={np.median(score):+.3f}, max={score.max():+.3f}')\n", "\n", "mosaic_subsets = method_mosaic(score)\n", "gradient_subsets = method_gradient(score)\n", "\n", "def summarize(name, subsets):\n", " print(f'\\n[{name}] {len(subsets)} subsets')\n", " print(f'{\"subset_id\":>10} {\"n_seqs\":>7} {\"mean_contrast\":>14}')\n", " for i, sub in enumerate(subsets):\n", " m = float(np.mean(score[sub]))\n", " print(f'{i:>10d} {len(sub):>7d} {m:>+14.4f}')\n", "\n", "summarize('mosaic', mosaic_subsets)\n", "summarize('gradient', gradient_subsets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Visualise\n", "\n", "Three plots: the contrast score distribution with tercile / quartile boundaries marked, the per-subset mean contrast score for both methods, and the pairwise sequence-overlap heatmap between mosaic and gradient subsets." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n", "\n", "# (a) score histogram with tercile + quartile lines\n", "ax = axes[0]\n", "ax.hist(score, bins=40, color='#4477AA', edgecolor='white', alpha=0.85)\n", "sorted_s = np.sort(score)\n", "N = len(sorted_s)\n", "terciles = [sorted_s[N//3], sorted_s[2*N//3]]\n", "quartiles = [sorted_s[N//4], sorted_s[N//2], sorted_s[3*N//4]]\n", "for t in terciles:\n", " ax.axvline(t, color='#CC6677', linestyle='--', label='tercile (mosaic)' if t==terciles[0] else None)\n", "for q in quartiles:\n", " ax.axvline(q, color='#117733', linestyle=':', label='quartile (gradient)' if q==quartiles[0] else None)\n", "ax.set_xlabel('contrast_hvlv')\n", "ax.set_ylabel('count')\n", "ax.set_title('(a) per-sequence contrast score')\n", "ax.legend(fontsize=8)\n", "\n", "# (b) per-subset mean contrast\n", "ax = axes[1]\n", "x = np.arange(12)\n", "m_means = np.array([score[s].mean() for s in mosaic_subsets])\n", "g_means = np.array([score[s].mean() for s in gradient_subsets])\n", "w = 0.4\n", "ax.bar(x - w/2, m_means, width=w, label='mosaic', color='#4477AA')\n", "ax.bar(x + w/2, g_means, width=w, label='gradient', color='#CC6677')\n", "ax.axhline(0, color='black', lw=0.5)\n", "ax.set_xlabel('subset id')\n", "ax.set_ylabel('mean contrast_hvlv')\n", "ax.set_title('(b) per-subset mean score')\n", "ax.legend(fontsize=8)\n", "\n", "# (c) pairwise overlap heatmap (mosaic x gradient)\n", "ax = axes[2]\n", "M = np.zeros((12, 12), dtype=int)\n", "for i, si in enumerate(mosaic_subsets):\n", " set_i = set(si)\n", " for j, sj in enumerate(gradient_subsets):\n", " M[i, j] = len(set_i & set(sj))\n", "im = ax.imshow(M, cmap='magma', aspect='auto')\n", "ax.set_xlabel('gradient subset')\n", "ax.set_ylabel('mosaic subset')\n", "ax.set_title('(c) sequence overlap (count)')\n", "plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Write subsets to A3M files\n", "\n", "Each subset is written as a ColabFold-compatible A3M with the query as the first record. Downstream you would feed one A3M per AF2 run." ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "from pathlib import Path\n", "from sf_cluster import build_subsets\n", "\n", "out_mosaic = Path('./subsets_mosaic')\n", "out_gradient = Path('./subsets_gradient')\n", "\n", "_, _, _, mosaic_paths = build_subsets(A3M, FI, method='mosaic', out_dir=out_mosaic)\n", "_, _, _, gradient_paths = build_subsets(A3M, FI, method='gradient', out_dir=out_gradient)\n", "\n", "print(f'mosaic -> {len(mosaic_paths):2d} files in {out_mosaic}/')\n", "print(f'gradient -> {len(gradient_paths):2d} files in {out_gradient}/')\n", "\n", "sample = mosaic_paths[0]\n", "print(f'\\nFirst 3 records of {sample.name}:')\n", "with open(sample) as f:\n", " lines = f.read().splitlines()\n", "shown = 0\n", "i = 0\n", "while i < len(lines) and shown < 3:\n", " if lines[i].startswith('>'):\n", " print(' ', lines[i])\n", " if i+1 < len(lines):\n", " seq = lines[i+1]\n", " print(' ', seq[:80] + ('...' if len(seq) > 80 else ''))\n", " shown += 1\n", " i += 2\n", " else:\n", " i += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Bring your own protein\n", "\n", "The demo bundle is tiny and CPU-friendly. For your own target:\n", "\n", "1. **Build an MSA.** Use the official [ColabFold notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) (`mmseqs2_uniref_env` mode) to generate a deep `.a3m`, then filter it (e.g. 25%-gap filter) to obtain `your_msa.a3m`.\n", "2. **Compute the FI matrix.** Run [FrustrAI-Seq](https://huggingface.co/leuschj/FrustrAI-Seq) on `your_msa.a3m` to obtain a per-residue Frustration Index matrix `your_fi.npy` of shape `(N_seq, L)`. **A GPU is required for this step.** See the FrustrAI-Seq model card for inference details.\n", "3. **Re-run the cells above.** Just point `A3M` and `FI` at your files and re-execute from \u00a73 onward. The package will raise a `ValueError` if `N_seq` disagrees between the two." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Next: run AF2 on each subset\n", "\n", "Feed each subset A3M into the official [ColabFold AlphaFold2 notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) \u2014 one subset per AF2 run. Aggregate per the SF-Cluster \u00a79.1 hit criterion:\n", "\n", "- C\u03b1 RMSD \u2264 3.0 \u00c5 on the `common_core` residues vs. each reference state,\n", "- mean pLDDT \u2265 70 overall,\n", "- mean pLDDT \u2265 70 inside the `switch_region`.\n", "\n", "**Compute budget disclosure (per `docs/protocol_lock.md`).** The SF-Cluster paper locks AF2 at 3 recycles \u00d7 4 seeds \u00d7 5 models for KaiB / Mpt53, and 0 recycles \u00d7 8 seeds \u00d7 5 models for the GA/GB cases. The GA/GB row was further trimmed to **4 subsets per case** during refinement to stay within the compute envelope. Global seed: `20260422`. Per-case seed = `hash(case_name) mod 2^31`; per-subset seed = `base_seed + subset_index`. All inference uses `templates=OFF`, `relax=OFF`, `dropout=OFF`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Citation, license, companion repo\n", "\n", "```bibtex\n", "@misc{sf_cluster_2026,\n", " title = {SF-Cluster: frustration-guided MSA subset builders for AF2 multi-conformer prediction},\n", " author = {Cao, Hanqun and {Chatterjee Lab}},\n", " year = {2026},\n", " note = {Workshop release. Companion code: https://huggingface.co/ChatterjeeLab/SF-Cluster},\n", " url = {https://huggingface.co/ChatterjeeLab/SF-Cluster}\n", "}\n", "```\n", "\n", "**License:** MIT. See `LICENSE` in the OSS repo.\n", "\n", "**Companion private dev repo.** Full Phase II benchmark code (DBSCAN baselines, all four arms, evaluation harness, region partition ablation) lives in the SF-Cluster private dev repository. The OSS release here is a slim, dependency-light subset \u2014 only the `mosaic` and `gradient` arms and their scoring function \u2014 intended for reuse, not full reproduction of the benchmark." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10" } }, "nbformat": 4, "nbformat_minor": 5 }