{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SF-Cluster \u2014 frustration-guided MSA subset builder\n",
    "\n",
    "**What this notebook does.** Installs the open-source `sf_cluster` package, downloads a small KaiB demo bundle (a 364-sequence MSA + a per-residue Frustration Index matrix from FrustrAI-Seq), and builds two flavours of stratified MSA subsets (`mosaic` and `gradient`) using the contrast-HV/LV score. Everything runs on CPU in roughly two minutes.\n",
    "\n",
    "**Who it is for.** Biologists who want reproducible, frustration-stratified MSA slices to feed into an AF-Cluster-style multi-conformer prediction loop.\n",
    "\n",
    "**What you do next.** Take the 12 mosaic or 12 gradient A3M subsets emitted at the end of this notebook, run each through ColabFold AF2, and aggregate per the SF-Cluster \u00a79.1 hit criterion.\n",
    "\n",
    "---\n",
    "\n",
    "> ## LIMITATIONS \u2014 please read\n",
    "> A controlled comparison on the Main-21 cases shows that **uniform random subsampling performs equivalently on most cases**. The frustration signal is **not** the active ingredient here \u2014 depth reduction is. See the OSS README for the full ablation.\n",
    ">\n",
    "> Use this tool when you want **stratified, reproducible MSA subsets** with a clear provenance story \u2014 not as a guaranteed conformational diversity engine. It is a research baseline, not a turnkey accuracy improvement."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install the package\n",
    "\n",
    "Pulls the OSS release from Hugging Face. Pure-Python; only depends on `numpy` and `scipy`."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "!pip install -q git+https://huggingface.co/ChatterjeeLab/SF-Cluster"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Download the KaiB demo bundle\n",
    "\n",
    "Three files, ~200 KB total: a filtered MSA, a per-residue FI matrix from FrustrAI-Seq, and the parallel sequence-ID list."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "from pathlib import Path\n",
    "import os\n",
    "\n",
    "REPO = 'ChatterjeeLab/SF-Cluster'\n",
    "FILES = ['examples/data/KaiB_filtered.a3m',\n",
    "         'examples/data/KaiB_fi_matrix.npy',\n",
    "         'examples/data/KaiB_seq_ids.txt']\n",
    "\n",
    "local = {}\n",
    "for fname in FILES:\n",
    "    p = hf_hub_download(repo_id=REPO, filename=fname, repo_type='model')\n",
    "    local[fname] = p\n",
    "    print(f'{fname:50s}  {os.path.getsize(p)/1024:7.1f} KB  ->  {p}')\n",
    "\n",
    "A3M = local['examples/data/KaiB_filtered.a3m']\n",
    "FI  = local['examples/data/KaiB_fi_matrix.npy']\n",
    "IDS = local['examples/data/KaiB_seq_ids.txt']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Build the pool and stratified subsets\n",
    "\n",
    "The `pool_msa` call ties the MSA records to their per-residue FI vectors. `contrast_hvlv` computes the per-sequence high-variance / low-variance FI contrast (see README for the formula). `method_mosaic` and `method_gradient` then deterministically draw 12 subsets of 32 sequences each."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient\n",
    "\n",
    "pool = pool_msa(A3M, FI)\n",
    "print(f'pool: N_seq={pool.n_seq}, L={pool.n_cols}, query={pool.headers[0]!r}')\n",
    "\n",
    "score = contrast_hvlv(pool.fi_matrix)\n",
    "print(f'contrast_hvlv: shape={score.shape}, '\n",
    "      f'min={score.min():+.3f}, median={np.median(score):+.3f}, max={score.max():+.3f}')\n",
    "\n",
    "mosaic_subsets   = method_mosaic(score)\n",
    "gradient_subsets = method_gradient(score)\n",
    "\n",
    "def summarize(name, subsets):\n",
    "    print(f'\\n[{name}]  {len(subsets)} subsets')\n",
    "    print(f'{\"subset_id\":>10}  {\"n_seqs\":>7}  {\"mean_contrast\":>14}')\n",
    "    for i, sub in enumerate(subsets):\n",
    "        m = float(np.mean(score[sub]))\n",
    "        print(f'{i:>10d}  {len(sub):>7d}  {m:>+14.4f}')\n",
    "\n",
    "summarize('mosaic',   mosaic_subsets)\n",
    "summarize('gradient', gradient_subsets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Visualise\n",
    "\n",
    "Three plots: the contrast score distribution with tercile / quartile boundaries marked, the per-subset mean contrast score for both methods, and the pairwise sequence-overlap heatmap between mosaic and gradient subsets."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
    "\n",
    "# (a) score histogram with tercile + quartile lines\n",
    "ax = axes[0]\n",
    "ax.hist(score, bins=40, color='#4477AA', edgecolor='white', alpha=0.85)\n",
    "sorted_s = np.sort(score)\n",
    "N = len(sorted_s)\n",
    "terciles = [sorted_s[N//3], sorted_s[2*N//3]]\n",
    "quartiles = [sorted_s[N//4], sorted_s[N//2], sorted_s[3*N//4]]\n",
    "for t in terciles:\n",
    "    ax.axvline(t, color='#CC6677', linestyle='--', label='tercile (mosaic)' if t==terciles[0] else None)\n",
    "for q in quartiles:\n",
    "    ax.axvline(q, color='#117733', linestyle=':',  label='quartile (gradient)' if q==quartiles[0] else None)\n",
    "ax.set_xlabel('contrast_hvlv')\n",
    "ax.set_ylabel('count')\n",
    "ax.set_title('(a) per-sequence contrast score')\n",
    "ax.legend(fontsize=8)\n",
    "\n",
    "# (b) per-subset mean contrast\n",
    "ax = axes[1]\n",
    "x = np.arange(12)\n",
    "m_means = np.array([score[s].mean() for s in mosaic_subsets])\n",
    "g_means = np.array([score[s].mean() for s in gradient_subsets])\n",
    "w = 0.4\n",
    "ax.bar(x - w/2, m_means, width=w, label='mosaic',   color='#4477AA')\n",
    "ax.bar(x + w/2, g_means, width=w, label='gradient', color='#CC6677')\n",
    "ax.axhline(0, color='black', lw=0.5)\n",
    "ax.set_xlabel('subset id')\n",
    "ax.set_ylabel('mean contrast_hvlv')\n",
    "ax.set_title('(b) per-subset mean score')\n",
    "ax.legend(fontsize=8)\n",
    "\n",
    "# (c) pairwise overlap heatmap (mosaic x gradient)\n",
    "ax = axes[2]\n",
    "M = np.zeros((12, 12), dtype=int)\n",
    "for i, si in enumerate(mosaic_subsets):\n",
    "    set_i = set(si)\n",
    "    for j, sj in enumerate(gradient_subsets):\n",
    "        M[i, j] = len(set_i & set(sj))\n",
    "im = ax.imshow(M, cmap='magma', aspect='auto')\n",
    "ax.set_xlabel('gradient subset')\n",
    "ax.set_ylabel('mosaic subset')\n",
    "ax.set_title('(c) sequence overlap (count)')\n",
    "plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Write subsets to A3M files\n",
    "\n",
    "Each subset is written as a ColabFold-compatible A3M with the query as the first record. Downstream you would feed one A3M per AF2 run."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from sf_cluster import build_subsets\n",
    "\n",
    "out_mosaic   = Path('./subsets_mosaic')\n",
    "out_gradient = Path('./subsets_gradient')\n",
    "\n",
    "_, _, _, mosaic_paths   = build_subsets(A3M, FI, method='mosaic',   out_dir=out_mosaic)\n",
    "_, _, _, gradient_paths = build_subsets(A3M, FI, method='gradient', out_dir=out_gradient)\n",
    "\n",
    "print(f'mosaic   -> {len(mosaic_paths):2d} files in {out_mosaic}/')\n",
    "print(f'gradient -> {len(gradient_paths):2d} files in {out_gradient}/')\n",
    "\n",
    "sample = mosaic_paths[0]\n",
    "print(f'\\nFirst 3 records of {sample.name}:')\n",
    "with open(sample) as f:\n",
    "    lines = f.read().splitlines()\n",
    "shown = 0\n",
    "i = 0\n",
    "while i < len(lines) and shown < 3:\n",
    "    if lines[i].startswith('>'):\n",
    "        print(' ', lines[i])\n",
    "        if i+1 < len(lines):\n",
    "            seq = lines[i+1]\n",
    "            print(' ', seq[:80] + ('...' if len(seq) > 80 else ''))\n",
    "        shown += 1\n",
    "        i += 2\n",
    "    else:\n",
    "        i += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Bring your own protein\n",
    "\n",
    "The demo bundle is tiny and CPU-friendly. For your own target:\n",
    "\n",
    "1. **Build an MSA.** Use the official [ColabFold notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) (`mmseqs2_uniref_env` mode) to generate a deep `.a3m`, then filter it (e.g. 25%-gap filter) to obtain `your_msa.a3m`.\n",
    "2. **Compute the FI matrix.** Run [FrustrAI-Seq](https://huggingface.co/leuschj/FrustrAI-Seq) on `your_msa.a3m` to obtain a per-residue Frustration Index matrix `your_fi.npy` of shape `(N_seq, L)`. **A GPU is required for this step.** See the FrustrAI-Seq model card for inference details.\n",
    "3. **Re-run the cells above.** Just point `A3M` and `FI` at your files and re-execute from \u00a73 onward. The package will raise a `ValueError` if `N_seq` disagrees between the two."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Next: run AF2 on each subset\n",
    "\n",
    "Feed each subset A3M into the official [ColabFold AlphaFold2 notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) \u2014 one subset per AF2 run. Aggregate per the SF-Cluster \u00a79.1 hit criterion:\n",
    "\n",
    "- C\u03b1 RMSD \u2264 3.0 \u00c5 on the `common_core` residues vs. each reference state,\n",
    "- mean pLDDT \u2265 70 overall,\n",
    "- mean pLDDT \u2265 70 inside the `switch_region`.\n",
    "\n",
    "**Compute budget disclosure (per `docs/protocol_lock.md`).** The SF-Cluster paper locks AF2 at 3 recycles \u00d7 4 seeds \u00d7 5 models for KaiB / Mpt53, and 0 recycles \u00d7 8 seeds \u00d7 5 models for the GA/GB cases. The GA/GB row was further trimmed to **4 subsets per case** during refinement to stay within the compute envelope. Global seed: `20260422`. Per-case seed = `hash(case_name) mod 2^31`; per-subset seed = `base_seed + subset_index`. All inference uses `templates=OFF`, `relax=OFF`, `dropout=OFF`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Citation, license, companion repo\n",
    "\n",
    "```bibtex\n",
    "@misc{sf_cluster_2026,\n",
    "  title  = {SF-Cluster: frustration-guided MSA subset builders for AF2 multi-conformer prediction},\n",
    "  author = {Cao, Hanqun and {Chatterjee Lab}},\n",
    "  year   = {2026},\n",
    "  note   = {Workshop release. Companion code: https://huggingface.co/ChatterjeeLab/SF-Cluster},\n",
    "  url    = {https://huggingface.co/ChatterjeeLab/SF-Cluster}\n",
    "}\n",
    "```\n",
    "\n",
    "**License:** MIT. See `LICENSE` in the OSS repo.\n",
    "\n",
    "**Companion private dev repo.** Full Phase II benchmark code (DBSCAN baselines, all four arms, evaluation harness, region partition ablation) lives in the SF-Cluster private dev repository. The OSS release here is a slim, dependency-light subset \u2014 only the `mosaic` and `gradient` arms and their scoring function \u2014 intended for reuse, not full reproduction of the benchmark."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}