Initial OSS release: mosaic + gradient subset builders (verified KaiB 95.0%, GA98 92.5%, GB98 50.0% on Phase XII pilot)

Files changed (11)

.gitignore +14 -0
LICENSE +21 -0
README.md +136 -0
examples/run_demo.sh +65 -0
pyproject.toml +43 -0
src/sf_cluster/__init__.py +27 -0
src/sf_cluster/cli.py +119 -0
src/sf_cluster/methods.py +217 -0
src/sf_cluster/pool.py +176 -0
src/sf_cluster/score.py +66 -0
tests/test_methods.py +211 -0

.gitignore ADDED

	@@ -0,0 +1,14 @@

+__pycache__/
+*.pyc
+*.pyo
+*.egg-info/
+build/
+dist/
+.eggs/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+.coverage
+htmlcov/
+examples/demo_out/
+.venv/

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2026 Hanqun Cao
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED

	@@ -0,0 +1,136 @@

+# SF-Cluster (workshop OSS release)
+Frustration-guided MSA subset builders for AlphaFold2 multi-conformer
+prediction. This is the open-source workshop distribution of two subset
+methods from the SF-Cluster benchmark:
+- **mosaic** — each subset mixes high / mid / low contrast-FI sequences.
+- **gradient** — each subset is homogeneous within a contrast-FI quartile.
+The contrast score is computed from a per-residue Frustration Index (FI)
+matrix produced by [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq)
+(HF model: `leuschj/FrustrAI-Seq`).
+This package is dependency-light (`numpy`, `scipy`), provides a CLI, and is
+designed to be a drop-in replacement for random / uniform MSA subsampling in
+[AF-Cluster](https://github.com/HWaymentSteele/AF_Cluster)-style pipelines.
+## Algorithm
+Given a filtered MSA `A` of `N` sequences over `L` match-state columns, and a
+per-residue FI matrix `F ∈ ℝ^{N×L}`:
+1. **Column variance**: `v_l = Var_i(F_{i,l})` over sequences.
+2. **High-variance mask**: `HV = {l : v_l ≥ percentile(v, 80)}`,
+   `LV = ¬HV`.
+3. **Contrast score** per sequence:
+   ```
+   contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
+   ```
+4. **Mosaic** (N_SUBSETS = 12, TARGET_SIZE = 32):
+   sort pool by `contrast_hvlv`, tri-stratify into low/mid/high terciles;
+   for each subset `s ∈ {0..11}`, draw `11 high + 11 low + 10 mid` with
+   `np.random.default_rng(seed=s)`.
+5. **Gradient** (N_SUBSETS = 12, TARGET_SIZE = 32):
+   split sorted pool into 4 quartiles; for each bin `b ∈ {0..3}` and
+   `s ∈ {0..2}` draw 32 sequences from that bin only with
+   `np.random.default_rng(seed=10*b + s)`.
+## Install
+```bash
+pip install -e .
+```
+Python ≥ 3.10. Dependencies: `numpy`, `scipy`.
+## Inputs
+You need two files per case:
+1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters
+   are preserved verbatim in output subsets; only match-state columns
+   (uppercase letters and `-` gaps) are scored.
+2. A per-residue FI matrix `.npy` of shape `(N_seq, L)`, where `N_seq` is
+   the number of sequences in the A3M and `L` is the number of match-state
+   columns.
+The FI matrix is produced by FrustrAI-Seq. We do not bundle weights — see
+`https://github.com/leuschj/FrustrAI-Seq` (model card:
+`https://huggingface.co/leuschj/FrustrAI-Seq`) for inference instructions.
+A reference usage pattern is documented in `examples/run_demo.sh`.
+## CLI
+```bash
+sf-cluster build \
+    --a3m   path/to/filtered.a3m \
+    --fi    path/to/fi_matrix.npy \
+    --method mosaic \
+    --n-subsets 12 \
+    --subset-size 32 \
+    --seed 20260422 \
+    --out   subsets/kaib_mosaic/
+```
+Outputs:
+```
+subsets/kaib_mosaic/
+├── mosaic_subset_000.a3m
+├── mosaic_subset_001.a3m
+├── ...
+├── mosaic_subset_011.a3m
+├── mosaic_subset_index.tsv   # subset_id, seq_index, pool_index, header, contrast_hvlv
+└── mosaic_meta.json          # provenance + score stats
+```
+## Library
+```python
+from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient
+pool = pool_msa("filtered.a3m", "fi_matrix.npy")
+score = contrast_hvlv(pool.fi_matrix)         # (N,) per-sequence
+subsets = method_mosaic(score)                # list[list[int]] of 12 × 32
+# or
+subsets = method_gradient(score)
+```
+Each subset is a list of indices into `pool.headers` / `pool.sequences`.
+## Reproducibility
+All RNG draws use `np.random.default_rng(seed=...)` with method-specific
+deterministic seeds (see Algorithm steps 4 and 5). Re-running on the same
+A3M + FI matrix yields byte-identical subset assignments. The CLI `--seed`
+value is only recorded in the provenance JSON (as `seed_tag`); it does not
+change the per-subset seeds. The provenance JSON (`{method}_meta.json`)
+also captures inputs, sizes, and the package version.
+## Limitations
+- **No frustration model included.** You must run FrustrAI-Seq separately to
+  obtain the `(N_seq, L)` FI matrix. This package only handles the
+  scoring + subset-construction stage.
+- **No AF2 runner included.** The package emits A3M files; downstream
+  inference (AF2 / ColabFold) is the user's responsibility.
+- **Only `mosaic` and `gradient` arms are open-sourced here.** The other
+  SF-Cluster arms (`region_cluster`, `contrast_nc`) require additional
+  feature pipelines and are intentionally excluded from this workshop
+  release.
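+- **No disjointness guarantee across subsets.** A sequence can appear in
+  multiple subsets. Within a subset, draws are made without replacement
+  unless the source tier or quartile holds fewer sequences than requested,
+  in which case sampling falls back to with-replacement. Repeated headers
+  are written only once, so such a subset file can contain fewer than
+  `subset_size` sequences.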
+- **Empirical caveat (read this).** Controlled comparison shows uniform
+  subsampling performs equivalently on most Main-21 cases — see paper for
+  boundary conditions under which contrast-FI stratification yields a
+  measurable lift over random subsampling. Treat this package as a research
+  baseline, not a turnkey accuracy improvement.
+## Citation
+If you use this code, please cite the SF-Cluster paper (forthcoming) and
+[FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq).
+## License
+MIT. See `LICENSE`.

examples/run_demo.sh ADDED

	@@ -0,0 +1,65 @@

+#!/usr/bin/env bash
+# Minimal end-to-end demo on synthetic A3M + FI matrix.
+# Produces:
+#   demo_out/mosaic/   -- 12 mosaic subsets
+#   demo_out/gradient/ -- 12 gradient subsets
+set -euo pipefail
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+OUT="${HERE}/demo_out"
+mkdir -p "${OUT}"
+# 1. Generate synthetic inputs (200 random "sequences", L=60, random FI).
+DEMO_OUT="${OUT}" python - <<'PY'
+import os
+import numpy as np
+from pathlib import Path
+OUT = Path(os.environ.get("DEMO_OUT", "examples/demo_out"))
+OUT.mkdir(parents=True, exist_ok=True)
+rng = np.random.default_rng(0)
+N, L = 200, 60
+alphabet = np.array(list("ACDEFGHIKLMNPQRSTVWY-"))
+seqs = rng.choice(alphabet, size=(N, L))
+a3m_path = OUT / "synthetic.a3m"
+with open(a3m_path, "w") as f:
+    f.write(f"#{L}\t1\n")
+    for i, row in enumerate(seqs):
+        tag = "query" if i == 0 else f"seq{i:04d}"
+        f.write(f">{tag}\n{''.join(row)}\n")
+# Synthetic FI matrix: random but with a few high-variance columns.
+fi = rng.normal(loc=0.0, scale=0.3, size=(N, L)).astype(np.float64)
+hv_cols = rng.choice(L, size=L // 5, replace=False)
+fi[:, hv_cols] += rng.normal(loc=0.0, scale=1.2, size=(N, len(hv_cols)))
+np.save(OUT / "synthetic_fi.npy", fi)
+print(f"wrote {a3m_path}")
+print(f"wrote {OUT/'synthetic_fi.npy'}  shape={fi.shape}")
+PY
+export DEMO_OUT="${OUT}"
+# 2. Build mosaic subsets.
+sf-cluster build \
+    --a3m   "${OUT}/synthetic.a3m" \
+    --fi    "${OUT}/synthetic_fi.npy" \
+    --method mosaic \
+    --n-subsets 12 \
+    --subset-size 32 \
+    --seed 20260422 \
+    --out   "${OUT}/mosaic"
+# 3. Build gradient subsets.
+sf-cluster build \
+    --a3m   "${OUT}/synthetic.a3m" \
+    --fi    "${OUT}/synthetic_fi.npy" \
+    --method gradient \
+    --n-subsets 12 \
+    --subset-size 32 \
+    --seed 20260422 \
+    --out   "${OUT}/gradient"
+echo
+echo "Done. Inspect ${OUT}/mosaic and ${OUT}/gradient."

pyproject.toml ADDED

	@@ -0,0 +1,43 @@

+[build-system]
+requires = ["setuptools>=61.0", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "sf_cluster"
+version = "0.1.0"
+description = "Frustration-guided MSA subset builders for AlphaFold2 multi-conformer prediction (mosaic + gradient arms)."
+readme = "README.md"
+license = {file = "LICENSE"}
+requires-python = ">=3.10"
+authors = [
+    {name = "Hanqun Cao", email = "hanquncao@gmail.com"},
+]
+keywords = ["alphafold", "msa", "frustration", "protein", "fold-switch", "subsampling"]
+classifiers = [
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Topic :: Scientific/Engineering :: Bio-Informatics",
+]
+dependencies = [
+    "numpy>=1.23",
+    "scipy>=1.10",
+]
+[project.optional-dependencies]
+dev = ["pytest>=7.0"]
+[project.scripts]
+sf-cluster = "sf_cluster.cli:main"
+[project.urls]
+Homepage = "https://github.com/hanqun-cao/sf-cluster"
+Issues = "https://github.com/hanqun-cao/sf-cluster/issues"
+[tool.setuptools.packages.find]
+where = ["src"]
+[tool.setuptools.package-dir]
+"" = "src"

src/sf_cluster/__init__.py ADDED

	@@ -0,0 +1,27 @@

+"""SF-Cluster: frustration-guided MSA subset builders.
+Public API:
+    pool_msa(a3m_path, fi_npy_path) -> Pool
+    contrast_hvlv(fi_matrix) -> np.ndarray
+    method_mosaic(score, n_subsets=12, subset_size=32) -> list[list[int]]
+    method_gradient(score, n_subsets=12, subset_size=32) -> list[list[int]]
+    build_subsets(a3m_path, fi_npy_path, method, ...) -> (pool, score, subsets[, paths])
+"""
+from .pool import pool_msa, Pool, read_a3m, write_a3m
+from .score import contrast_hvlv, high_variance_mask
+from .methods import method_mosaic, method_gradient, build_subsets
+__version__ = "0.1.0"
+__all__ = [
+    "pool_msa",
+    "Pool",
+    "read_a3m",
+    "write_a3m",
+    "contrast_hvlv",
+    "high_variance_mask",
+    "method_mosaic",
+    "method_gradient",
+    "build_subsets",
+    "__version__",
+]

src/sf_cluster/cli.py ADDED

	@@ -0,0 +1,119 @@

+"""Command-line interface: `sf-cluster build ...`."""
+from __future__ import annotations
+import argparse
+import json
+import sys
+from pathlib import Path
+import numpy as np
+from . import __version__
+from .methods import N_SUBSETS, TARGET_SIZE, build_subsets
+def _add_build_parser(sub: argparse._SubParsersAction) -> None:
+    p = sub.add_parser(
+        "build",
+        help="Build N MSA subsets from a filtered A3M + per-residue FI matrix.",
+    )
+    p.add_argument("--a3m", required=True, type=Path,
+                   help="path to filtered A3M file")
+    p.add_argument("--fi", required=True, type=Path,
+                   help="path to per-residue FI matrix .npy (N_seq, L)")
+    p.add_argument("--method", required=True, choices=["mosaic", "gradient"],
+                   help="subset construction method")
+    p.add_argument("--n-subsets", type=int, default=N_SUBSETS,
+                   help=f"number of subsets (default {N_SUBSETS})")
+    p.add_argument("--subset-size", type=int, default=TARGET_SIZE,
+                   help=f"sequences per subset (default {TARGET_SIZE})")
+    p.add_argument("--hv-percentile", type=float, default=80.0,
+                   help="column-variance percentile for HV mask (default 80)")
+    p.add_argument("--seed", type=int, default=20260422,
+                   help="global RNG seed tag (recorded in sidecar; "
+                        "per-subset seeds are method-deterministic)")
+    p.add_argument("--query-index", type=int, default=0,
+                   help="index of query in the A3M pool (default 0)")
+    p.add_argument("--out", required=True, type=Path,
+                   help="output directory for subset A3Ms")
+    p.set_defaults(func=_cmd_build)
+def _cmd_build(args: argparse.Namespace) -> int:
+    if not args.a3m.exists():
+        print(f"error: A3M not found: {args.a3m}", file=sys.stderr)
+        return 2
+    if not args.fi.exists():
+        print(f"error: FI matrix not found: {args.fi}", file=sys.stderr)
+        return 2
+    args.out.mkdir(parents=True, exist_ok=True)
+    pool, score, subsets, paths = build_subsets(
+        a3m_path=args.a3m,
+        fi_npy_path=args.fi,
+        method=args.method,
+        n_subsets=args.n_subsets,
+        subset_size=args.subset_size,
+        hv_percentile=args.hv_percentile,
+        out_dir=args.out,
+        query_index=args.query_index,
+    )
+    # Sidecar: subset index TSV
+    idx_tsv = args.out / f"{args.method}_subset_index.tsv"
+    with open(idx_tsv, "w") as fh:
+        fh.write("subset_id\tseq_index\tpool_index\theader\tcontrast_hvlv\n")
+        for s_i, idx_list in enumerate(subsets):
+            for j, p_i in enumerate(idx_list):
+                fh.write(f"{s_i:03d}\t{j}\t{p_i}\t{pool.headers[p_i]}\t"
+                         f"{score[p_i]:.6f}\n")
+    # Sidecar: provenance JSON
+    meta = {
+        "sf_cluster_version": __version__,
+        "method": args.method,
+        "a3m": str(args.a3m.resolve()),
+        "fi_matrix": str(args.fi.resolve()),
+        "n_subsets": args.n_subsets,
+        "subset_size": args.subset_size,
+        "hv_percentile": args.hv_percentile,
+        "pool_size": pool.n_seq,
+        "n_cols": pool.n_cols,
+        "seed_tag": args.seed,
+        "query_header": pool.headers[args.query_index],
+        "score_stats": {
+            "min": float(np.min(score)),
+            "max": float(np.max(score)),
+            "mean": float(np.mean(score)),
+            "std": float(np.std(score)),
+        },
+    }
+    (args.out / f"{args.method}_meta.json").write_text(json.dumps(meta, indent=2))
+    print(f"[sf-cluster] method={args.method}  pool={pool.n_seq}  "
+          f"wrote {len(paths)} A3Ms to {args.out}")
+    return 0
+def build_parser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="sf-cluster",
+        description="Frustration-guided MSA subset builders "
+                    "(mosaic + gradient).",
+    )
+    p.add_argument("--version", action="version",
+                   version=f"sf-cluster {__version__}")
+    sub = p.add_subparsers(dest="command", required=True)
+    _add_build_parser(sub)
+    return p
+def main(argv=None) -> int:
+    parser = build_parser()
+    args = parser.parse_args(argv)
+    return args.func(args)
+if __name__ == "__main__":
+    sys.exit(main())
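
The `{method}_subset_index.tsv` sidecar written by `_cmd_build` above has the columns `subset_id`, `seq_index`, `pool_index`, `header`, `contrast_hvlv`. As a minimal sketch of how a downstream script might read it back, the snippet below groups pool indices by subset. The helper name `load_subset_index` and the sample rows are hypothetical and only illustrate the layout; they are not part of the package.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sidecar contents; real files are written by `sf-cluster build`.
SAMPLE_TSV = (
    "subset_id\tseq_index\tpool_index\theader\tcontrast_hvlv\n"
    "000\t0\t17\tseq0017\t0.412300\n"
    "000\t1\t3\tseq0003\t-0.250100\n"
    "001\t0\t17\tseq0017\t0.412300\n"
)


def load_subset_index(handle):
    """Group pool indices by subset_id from an open TSV handle."""
    by_subset = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        by_subset[row["subset_id"]].append(int(row["pool_index"]))
    return dict(by_subset)


subsets = load_subset_index(io.StringIO(SAMPLE_TSV))
print(subsets)
```

For a real run, pass an open file handle for the TSV instead of the in-memory sample.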

src/sf_cluster/methods.py ADDED

	@@ -0,0 +1,217 @@

+"""Subset-construction methods: mosaic and gradient.
+Both methods take a per-sequence score (typically `contrast_hvlv`) and
+produce N_SUBSETS lists of pool indices of length TARGET_SIZE.
+Defaults match the published SF-Cluster Phase XII protocol:
+    N_SUBSETS = 12
+    TARGET_SIZE = 32
+    mosaic seeds:   s = 0, 1, ..., N_SUBSETS-1
+    gradient seeds: bin_i * 10 + s   for s in {0, 1, 2}, bin_i in {0..3}
+"""
+from __future__ import annotations
+from pathlib import Path
+from typing import List, Optional, Sequence
+import numpy as np
+from .pool import Pool, pool_msa, write_a3m
+from .score import contrast_hvlv
+N_SUBSETS = 12
+TARGET_SIZE = 32
+def _subsample(indices: Sequence[int], size: int, rng: np.random.Generator) -> List[int]:
+    """Sample `size` items from `indices` without replacement if possible,
+    with replacement otherwise. Empty input returns []."""
+    idx = list(indices)
+    if len(idx) == 0:
+        return []
+    if len(idx) >= size:
+        return list(rng.choice(idx, size=size, replace=False))
+    return list(rng.choice(idx, size=size, replace=True))
+# ---------------------------------------------------------------------------
+# Method: mosaic
+# ---------------------------------------------------------------------------
+def method_mosaic(score: np.ndarray,
+                  n_subsets: int = N_SUBSETS,
+                  subset_size: int = TARGET_SIZE,
+                  *,
+                  high_n: int = 11,
+                  low_n: int = 11,
+                  mid_n: int = 10) -> List[List[int]]:
+    """Tri-stratified mosaic: each subset mixes high/low/mid score tiers.
+    Pool is tri-stratified on `score` (low / mid / high terciles), and each of
+    `n_subsets` subsets samples (high_n + low_n + mid_n) = subset_size items.
+    Seeds: subset s uses np.random.default_rng(seed=s).
+    Args:
+        score:       (N,) per-pool-sequence score (e.g., contrast_hvlv).
+        n_subsets:   number of subsets to build (default 12).
+        subset_size: total seqs per subset; must equal high_n+low_n+mid_n.
+        high_n, low_n, mid_n: per-tier sample counts (defaults 11/11/10).
+    Returns:
+        list of n_subsets lists of pool indices, length == subset_size each.
+    """
+    if high_n + low_n + mid_n != subset_size:
+        raise ValueError(
+            f"high_n+low_n+mid_n ({high_n+low_n+mid_n}) != subset_size ({subset_size})"
+        )
+    score = np.asarray(score)
+    if score.ndim != 1:
+        raise ValueError("score must be 1-D")
+    N = score.shape[0]
+    if N == 0:
+        raise ValueError("empty score array")
+    sorted_idx = np.argsort(score)
+    low_group  = list(sorted_idx[: N // 3])
+    high_group = list(sorted_idx[2 * N // 3 :])
+    mid_group  = list(sorted_idx[N // 3 : 2 * N // 3])
+    subsets: List[List[int]] = []
+    for s in range(n_subsets):
+        rng = np.random.default_rng(seed=s)
+        hi  = _subsample(high_group, high_n, rng)
+        lo  = _subsample(low_group,  low_n,  rng)
+        mid = _subsample(mid_group,  mid_n,  rng)
+        subsets.append([int(x) for x in (hi + lo + mid)])
+    return subsets
+# ---------------------------------------------------------------------------
+# Method: gradient
+# ---------------------------------------------------------------------------
+def method_gradient(score: np.ndarray,
+                    n_subsets: int = N_SUBSETS,
+                    subset_size: int = TARGET_SIZE,
+                    *,
+                    n_bins: int = 4,
+                    subsets_per_bin: int = 3) -> List[List[int]]:
+    """Homogeneous per-quartile subsets along the `score` gradient.
+    Pool is split into `n_bins` equal-size bins on sorted score, then for each
+    bin `subsets_per_bin` subsets are drawn entirely from within that bin.
+    Default 4 bins × 3 subsets-per-bin = 12 subsets.
+    Seeds: bin_i in [0..n_bins-1], s in [0..subsets_per_bin-1] use
+    np.random.default_rng(seed=bin_i*10 + s).
+    Args:
+        score:           (N,) per-pool-sequence score.
+        n_subsets:       expected total (must == n_bins * subsets_per_bin).
+        subset_size:     seqs per subset.
+        n_bins:          number of score quantile bins (default 4).
+        subsets_per_bin: subsets drawn per bin (default 3).
+    Returns:
+        list of n_subsets lists of pool indices.
+    """
+    if n_bins * subsets_per_bin != n_subsets:
+        raise ValueError(
+            f"n_bins*subsets_per_bin ({n_bins*subsets_per_bin}) != n_subsets ({n_subsets})"
+        )
+    score = np.asarray(score)
+    if score.ndim != 1:
+        raise ValueError("score must be 1-D")
+    N = score.shape[0]
+    if N == 0:
+        raise ValueError("empty score array")
+    sorted_idx = np.argsort(score)
+    # Equal-quantile bins by integer split (matches reference impl for n_bins=4).
+    bins: List[List[int]] = []
+    for b in range(n_bins):
+        start = (b * N) // n_bins
+        end = ((b + 1) * N) // n_bins
+        bins.append(list(sorted_idx[start:end]))
+    subsets: List[List[int]] = []
+    for bin_i, bin_idx in enumerate(bins):
+        for s in range(subsets_per_bin):
+            rng = np.random.default_rng(seed=bin_i * 10 + s)
+            chosen = _subsample(bin_idx, subset_size, rng)
+            subsets.append([int(x) for x in chosen])
+    return subsets
+# ---------------------------------------------------------------------------
+# High-level convenience: build_subsets
+# ---------------------------------------------------------------------------
+def _write_subset_a3ms(pool: Pool,
+                       subsets: List[List[int]],
+                       out_dir: Path,
+                       method: str,
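+                       query_index: int = 0) -> List[Path]:
+    """Write one A3M per subset; query (pool[query_index]) is always first.
+    Headers already written to a file (including the query) are skipped, so a
+    file may hold fewer than subset_size + 1 sequences."""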
+    out_dir = Path(out_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    q_header = pool.headers[query_index]
+    q_seq = pool.sequences[query_index]
+    paths: List[Path] = []
+    for s_i, idx_list in enumerate(subsets):
+        seen = {q_header}
+        seqs_for_file = [(q_header, q_seq)]
+        for i in idx_list:
+            h = pool.headers[i]
+            if h in seen:
+                continue
+            seen.add(h)
+            seqs_for_file.append((h, pool.sequences[i]))
+        fname = out_dir / f"{method}_subset_{s_i:03d}.a3m"
+        write_a3m(fname, pool.header_line, seqs_for_file)
+        paths.append(fname)
+    return paths
+def build_subsets(a3m_path: str | Path,
+                  fi_npy_path: str | Path,
+                  method: str = "mosaic",
+                  *,
+                  n_subsets: int = N_SUBSETS,
+                  subset_size: int = TARGET_SIZE,
+                  hv_percentile: float = 80.0,
+                  out_dir: Optional[str | Path] = None,
+                  query_index: int = 0):
+    """End-to-end: pool -> score -> subset indices [-> A3M files].
+    Args:
+        a3m_path:     input filtered A3M.
+        fi_npy_path:  per-residue FI matrix (N_seq, L) .npy.
+        method:       "mosaic" or "gradient".
+        n_subsets:    default 12.
+        subset_size:  default 32.
+        hv_percentile: HV-column variance percentile for contrast_hvlv.
+        out_dir:      if given, write one A3M per subset there.
+        query_index:  which pool row is the query seq (placed first).
+    Returns:
+        (pool, score, subsets) or (pool, score, subsets, paths) if out_dir.
+    """
+    pool = pool_msa(a3m_path, fi_npy_path)
+    score = contrast_hvlv(pool.fi_matrix, percentile=hv_percentile)
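+    if method == "mosaic":
+        # Derive tier counts from subset_size so non-default sizes do not fail
+        # the high_n+low_n+mid_n check; subset_size=32 gives 11 / 11 / 10.
+        side_n = (subset_size + 1) // 3
+        subsets = method_mosaic(score, n_subsets=n_subsets, subset_size=subset_size,
+                                high_n=side_n, low_n=side_n,
+                                mid_n=subset_size - 2 * side_n)
+    elif method == "gradient":
+        # Keep 4 bins and derive subsets_per_bin; method_gradient raises
+        # ValueError when n_subsets is not a multiple of 4.
+        subsets = method_gradient(score, n_subsets=n_subsets, subset_size=subset_size,
+                                  n_bins=4, subsets_per_bin=n_subsets // 4)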
+    else:
+        raise ValueError(f"unknown method: {method!r} (expected 'mosaic' or 'gradient')")
+    if out_dir is None:
+        return pool, score, subsets
+    paths = _write_subset_a3ms(pool, subsets, Path(out_dir), method,
+                               query_index=query_index)
+    return pool, score, subsets, paths
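
Both methods above split the sorted pool with floor division, and `_subsample` falls back to sampling with replacement when a tier or quartile is smaller than the requested draw. The sketch below reproduces only that boundary arithmetic so the minimum pool size can be checked by hand. `bin_bounds` and `min_bin_size` are hypothetical helper names introduced for illustration, not part of the package.

```python
def bin_bounds(n_items: int, n_bins: int) -> list[tuple[int, int]]:
    """Half-open (start, end) slices from the same floor-division split."""
    return [((b * n_items) // n_bins, ((b + 1) * n_items) // n_bins)
            for b in range(n_bins)]


def min_bin_size(n_items: int, n_bins: int) -> int:
    """Size of the smallest bin produced by bin_bounds."""
    return min(end - start for start, end in bin_bounds(n_items, n_bins))


# Terciles of a 10-sequence pool: low, mid, high.
print(bin_bounds(10, 3))   # [(0, 3), (3, 6), (6, 10)]
# Quartiles of the same pool.
print(bin_bounds(10, 4))   # [(0, 2), (2, 5), (5, 7), (7, 10)]

# With the default draws (11 high / 11 low / 10 mid, or 32 per quartile),
# the smallest bin has to be at least as large as the draw.
print(min_bin_size(33, 3))    # 11
print(min_bin_size(127, 4))   # 31, one quartile short of a 32-sequence draw
print(min_bin_size(128, 4))   # 32
```

Under the default settings this suggests a pool of at least 33 sequences for mosaic and 128 for gradient before every draw can be made without replacement.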

src/sf_cluster/pool.py ADDED

	@@ -0,0 +1,176 @@

+"""A3M parsing and pool construction.
+The pool ties together aligned sequences from a ColabFold-style A3M and a
+per-residue Frustration Index (FI) matrix produced by FrustrAI-Seq.
+A3M conventions (ColabFold):
+    Line 1: optional header line beginning with '#', e.g. "#91\\t1"
+    Then alternating ">header" and sequence lines.
+    Sequence lines may contain UPPERCASE match-state letters, '-' gaps, and
+    lowercase letters denoting insertion states (not part of the alignment).
+"""
+from __future__ import annotations
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import List, Optional, Tuple
+import numpy as np
+@dataclass
+class Pool:
+    """Container for sequences + per-residue FI vectors.
+    Attributes:
+        headers:    list[str] short headers (first whitespace-separated token)
+        sequences:  list[str] aligned sequences (lowercase insertion states preserved)
+        fi_matrix:  np.ndarray (N, L) per-residue FI; columns correspond to
+                    match-state (uppercase) positions in the aligned sequences
+        header_line: Optional[str] original '#' header line, if present
+        full_headers: list[str] full header text (without '>') for each kept row
+    """
+    headers: List[str]
+    sequences: List[str]
+    fi_matrix: np.ndarray
+    header_line: Optional[str] = None
+    full_headers: List[str] = field(default_factory=list)
+    def __len__(self) -> int:
+        return len(self.headers)
+    @property
+    def n_seq(self) -> int:
+        return len(self.headers)
+    @property
+    def n_cols(self) -> int:
+        return int(self.fi_matrix.shape[1]) if self.fi_matrix.size else 0
+# ---------------------------------------------------------------------------
+# A3M I/O
+# ---------------------------------------------------------------------------
+def read_a3m(path: str | Path) -> Tuple[Optional[str], List[Tuple[str, str]]]:
+    """Read an A3M file.
+    Returns:
+        (header_line, [(header, seq), ...])
+        header_line is the leading '#...' line if present, else None.
+        header is the full header text without the leading '>'.
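+        seq is the raw sequence line (lowercase insertion states retained).
+        Each record is assumed to keep its sequence on the single line after
+        the header; wrapped multi-line sequences are not joined.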
+    """
+    path = Path(path)
+    with open(path) as f:
+        lines = [ln.rstrip("\n") for ln in f.readlines()]
+    if not lines:
+        return None, []
+    i = 0
+    header_line = None
+    if lines[0].startswith("#"):
+        header_line = lines[0]
+        i = 1
+    seqs: List[Tuple[str, str]] = []
+    while i < len(lines):
+        ln = lines[i]
+        if ln.startswith(">"):
+            h = ln[1:]
+            s = lines[i + 1] if i + 1 < len(lines) else ""
+            seqs.append((h, s))
+            i += 2
+        else:
+            i += 1
+    return header_line, seqs
+def write_a3m(path: str | Path,
+              header_line: Optional[str],
+              seqs: List[Tuple[str, str]]) -> None:
+    """Write an A3M file.  seqs = [(header, seq), ...]."""
+    path = Path(path)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with open(path, "w") as f:
+        if header_line is not None:
+            f.write(header_line + "\n")
+        for h, s in seqs:
+            f.write(f">{h}\n{s}\n")
+# ---------------------------------------------------------------------------
+# Pool construction
+# ---------------------------------------------------------------------------
+def _dedup_a3m(seqs: List[Tuple[str, str]]) -> Tuple[List[int], List[Tuple[str, str]]]:
+    """Deduplicate by short header (first whitespace token).
+    Returns (kept_indices_into_input, [(short_header, seq), ...]).
+    """
+    seen = set()
+    keep_idx: List[int] = []
+    out: List[Tuple[str, str]] = []
+    for i, (h, s) in enumerate(seqs):
+        short = h.split()[0]
+        if short in seen:
+            continue
+        seen.add(short)
+        keep_idx.append(i)
+        out.append((short, s))
+    return keep_idx, out
+def pool_msa(a3m_path: str | Path,
+             fi_npy_path: str | Path,
+             *,
+             dedup: bool = True) -> Pool:
+    """Build a Pool from an A3M file and a per-residue FI matrix.
+    Args:
+        a3m_path:    path to filtered.a3m (ColabFold style).
+        fi_npy_path: path to FI matrix .npy of shape (N_seq, L) where
+                     N_seq matches the number of sequences in the A3M and
+                     L is the number of match-state alignment columns.
+                     Typically produced by FrustrAI-Seq
+                     (https://github.com/leuschj/FrustrAI-Seq,
+                     HF model: leuschj/FrustrAI-Seq).
+        dedup:       drop duplicates by short header (default True).
+    Returns:
+        Pool object.
+    Raises:
+        ValueError if the FI matrix is not 2-D or its row count differs from
+        the number of sequences in the A3M.
+    """
+    header_line, raw_seqs = read_a3m(a3m_path)
+    fi = np.load(str(fi_npy_path))
+    if fi.ndim != 2:
+        raise ValueError(
+            f"FI matrix must be 2-D (N_seq, L); got shape {fi.shape}"
+        )
+    if fi.shape[0] != len(raw_seqs):
+        raise ValueError(
+            f"FI rows ({fi.shape[0]}) != A3M sequences ({len(raw_seqs)}) "
+            f"for {a3m_path}"
+        )
+    if dedup:
+        keep_idx, kept = _dedup_a3m(raw_seqs)
+        fi = fi[keep_idx]
+        full_headers = [raw_seqs[i][0] for i in keep_idx]
+        short_headers = [h for h, _ in kept]
+        seqs = [s for _, s in kept]
+    else:
+        full_headers = [h for h, _ in raw_seqs]
+        short_headers = [h.split()[0] for h, _ in raw_seqs]
+        seqs = [s for _, s in raw_seqs]
+    return Pool(
+        headers=short_headers,
+        sequences=seqs,
+        fi_matrix=np.asarray(fi, dtype=np.float64),
+        header_line=header_line,
+        full_headers=full_headers,
+    )
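
`pool_msa` above checks that the FI matrix has one row per A3M sequence, but not that its column count `L` matches the number of match-state columns. A possible extra check is sketched below, assuming that every character other than a lowercase insertion letter (so uppercase letters and `-` gaps) occupies one alignment column. `count_match_columns` and `rows_with_wrong_width` are hypothetical helpers, not functions of the package.

```python
def count_match_columns(seq: str) -> int:
    """Count alignment columns: every character except lowercase insertions."""
    return sum(1 for ch in seq if not ch.islower())


def rows_with_wrong_width(seqs: list[str], fi_n_cols: int) -> list[int]:
    """Indices of sequences whose match-state width differs from fi_n_cols."""
    return [i for i, s in enumerate(seqs)
            if count_match_columns(s) != fi_n_cols]


# Toy rows: the second carries two lowercase insertions, the third is short.
rows = ["ACDEF", "AC-eeEF", "ACDE"]
bad = rows_with_wrong_width(rows, fi_n_cols=5)
print(bad)  # [2]
```

Running such a check on `pool.sequences` against `pool.n_cols` before scoring would surface a misaligned FI matrix early.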

src/sf_cluster/score.py ADDED

	@@ -0,0 +1,66 @@

+"""Sequence-level frustration contrast scores.
+contrast_hvlv(seq) = mean_FI(high-variance positions) - mean_FI(low-variance positions)
+High-variance positions are MSA columns whose across-sequence FI variance is
+at or above the (default 80th) percentile.
+"""
+from __future__ import annotations
+import numpy as np
+def high_variance_mask(fi_matrix: np.ndarray,
+                       percentile: float = 80.0) -> np.ndarray:
+    """Boolean (L,) mask of high-variance MSA columns.
+    Args:
+        fi_matrix:  (N, L) per-residue FI; may contain NaN.
+        percentile: column-variance percentile threshold (default 80).
+    Returns:
+        boolean array of length L (True = high-variance).
+    """
+    if fi_matrix.ndim != 2:
+        raise ValueError("fi_matrix must be 2-D (N, L)")
+    col_var = np.nanvar(fi_matrix, axis=0)
+    if np.all(np.isnan(col_var)):
+        return np.zeros(fi_matrix.shape[1], dtype=bool)
+    thresh = np.nanpercentile(col_var, percentile)
+    return col_var >= thresh
+def contrast_hvlv(fi_matrix: np.ndarray,
+                  percentile: float = 80.0) -> np.ndarray:
+    """Per-sequence high-variance / low-variance FI contrast.
+    score[i] = mean_FI_over_HV_cols(seq_i) - mean_FI_over_LV_cols(seq_i)
+    NaN-safe: sequences with all-NaN in a group contribute 0 there.
+    Args:
+        fi_matrix:  (N, L) per-residue FI matrix.
+        percentile: column-variance percentile defining HV (default 80).
+    Returns:
+        np.ndarray (N,) float64 contrast score per sequence.
+    """
+    if fi_matrix.ndim != 2:
+        raise ValueError("fi_matrix must be 2-D (N, L)")
+    N = fi_matrix.shape[0]
+    hv = high_variance_mask(fi_matrix, percentile=percentile)
+    lv = ~hv
+    if hv.any():
+        mean_hv = np.nanmean(fi_matrix[:, hv], axis=1)
+    else:
+        mean_hv = np.zeros(N, dtype=np.float64)
+    if lv.any():
+        mean_lv = np.nanmean(fi_matrix[:, lv], axis=1)
+    else:
+        mean_lv = np.zeros(N, dtype=np.float64)
+    mean_hv = np.nan_to_num(mean_hv, nan=0.0)
+    mean_lv = np.nan_to_num(mean_lv, nan=0.0)
+    return (mean_hv - mean_lv).astype(np.float64, copy=False)
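
To make the contrast score concrete, here is a small worked example that repeats the same NumPy steps as `high_variance_mask` and `contrast_hvlv` on a toy 3 × 5 FI matrix. The values are made up for illustration: only the last column varies across sequences, so it is the only high-variance column at the 80th percentile.

```python
import numpy as np

# Toy FI matrix: 3 sequences x 5 columns (illustrative values only).
fi = np.array([
    [1.0, 1.0, 1.0, 1.0,  3.0],
    [1.0, 1.0, 1.0, 1.0,  1.0],
    [1.0, 1.0, 1.0, 1.0, -1.0],
])

# Step 1: per-column variance across sequences.
col_var = np.nanvar(fi, axis=0)

# Step 2: high-variance mask at the 80th percentile of column variance.
hv = col_var >= np.nanpercentile(col_var, 80.0)

# Step 3: per-sequence contrast = mean over HV columns - mean over LV columns.
score = np.nanmean(fi[:, hv], axis=1) - np.nanmean(fi[:, ~hv], axis=1)

print(hv.tolist())     # [False, False, False, False, True]
print(score.tolist())  # [2.0, 0.0, -2.0]
```

Sorting by this score orders the three sequences from lowest to highest contrast, which is the ordering the mosaic and gradient builders stratify on.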

tests/test_methods.py ADDED

	@@ -0,0 +1,211 @@

+"""Tests for sf_cluster: shapes, determinism, in-pool guarantee."""
+from __future__ import annotations
+import os
+import sys
+from pathlib import Path
+import numpy as np
+import pytest
+# Allow `python -m pytest tests/` from the repo root before installing.
+sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src"))
+from sf_cluster import (  # noqa: E402
+    contrast_hvlv,
+    high_variance_mask,
+    method_gradient,
+    method_mosaic,
+    pool_msa,
+    read_a3m,
+    write_a3m,
+)
+from sf_cluster.methods import N_SUBSETS, TARGET_SIZE  # noqa: E402
+# ---------------------------------------------------------------------------
+# fixtures
+# ---------------------------------------------------------------------------
+@pytest.fixture
+def synthetic_pool(tmp_path):
+    """Synthetic A3M + FI matrix written to disk; returns paths."""
+    rng = np.random.default_rng(0)
+    N, L = 200, 50
+    alphabet = np.array(list("ACDEFGHIKLMNPQRSTVWY-"))
+    seqs = rng.choice(alphabet, size=(N, L))
+    a3m_path = tmp_path / "syn.a3m"
+    with open(a3m_path, "w") as f:
+        f.write(f"#{L}\t1\n")
+        for i, row in enumerate(seqs):
+            tag = "query" if i == 0 else f"seq{i:04d}"
+            f.write(f">{tag}\n{''.join(row)}\n")
+    fi = rng.normal(0, 0.3, size=(N, L)).astype(np.float64)
+    hv_cols = rng.choice(L, size=L // 5, replace=False)
+    fi[:, hv_cols] += rng.normal(0, 1.5, size=(N, len(hv_cols)))
+    fi_path = tmp_path / "syn_fi.npy"
+    np.save(fi_path, fi)
+    return a3m_path, fi_path, N, L
+# ---------------------------------------------------------------------------
+# pool / a3m
+# ---------------------------------------------------------------------------
+def test_a3m_roundtrip(tmp_path):
+    p = tmp_path / "rt.a3m"
+    write_a3m(p, "#5\t1", [("query", "ACDEF"), ("h2 desc", "ACDef")])
+    hl, seqs = read_a3m(p)
+    assert hl == "#5\t1"
+    assert seqs == [("query", "ACDEF"), ("h2 desc", "ACDef")]
+def test_pool_shapes(synthetic_pool):
+    a3m, fi, N, L = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    assert pool.n_seq == N
+    assert pool.n_cols == L
+    assert pool.fi_matrix.shape == (N, L)
+    assert len(pool.sequences) == N
+    assert pool.headers[0] == "query"
+def test_pool_rejects_shape_mismatch(tmp_path, synthetic_pool):
+    a3m, fi, N, L = synthetic_pool
+    bad = tmp_path / "bad_fi.npy"
+    np.save(bad, np.zeros((N + 1, L)))
+    with pytest.raises(ValueError, match="FI rows"):
+        pool_msa(a3m, bad)
+# ---------------------------------------------------------------------------
+# score
+# ---------------------------------------------------------------------------
+def test_hv_mask_fraction():
+    rng = np.random.default_rng(1)
+    F = rng.normal(size=(100, 50))
+    hv = high_variance_mask(F, percentile=80)
+    # At p=80 we expect ~20% True (allow some slack since percentile is a
+    # threshold, not an exact split).
+    frac = hv.mean()
+    assert 0.1 <= frac <= 0.4
+def test_contrast_hvlv_shape_and_finite(synthetic_pool):
+    a3m, fi, N, L = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    score = contrast_hvlv(pool.fi_matrix)
+    assert score.shape == (N,)
+    assert np.all(np.isfinite(score))
+# ---------------------------------------------------------------------------
+# methods: mosaic
+# ---------------------------------------------------------------------------
+def test_mosaic_shapes(synthetic_pool):
+    a3m, fi, N, _ = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    score = contrast_hvlv(pool.fi_matrix)
+    subs = method_mosaic(score)
+    assert len(subs) == N_SUBSETS
+    for s in subs:
+        assert len(s) == TARGET_SIZE
+def test_mosaic_determinism(synthetic_pool):
+    a3m, fi, _, _ = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    score = contrast_hvlv(pool.fi_matrix)
+    a = method_mosaic(score)
+    b = method_mosaic(score)
+    assert a == b
+def test_mosaic_in_pool(synthetic_pool):
+    a3m, fi, N, _ = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    score = contrast_hvlv(pool.fi_matrix)
+    subs = method_mosaic(score)
+    for s in subs:
+        assert all(0 <= i < N for i in s), "out-of-pool index in mosaic subset"
+def test_mosaic_tier_composition(synthetic_pool):
+    """High/low/mid slices must each come from the matching score tercile."""
+    a3m, fi, N, _ = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    score = contrast_hvlv(pool.fi_matrix)
+    sorted_idx = np.argsort(score)
+    high_set = set(sorted_idx[2 * N // 3:].tolist())
+    low_set  = set(sorted_idx[: N // 3].tolist())
+    mid_set  = set(sorted_idx[N // 3: 2 * N // 3].tolist())
+    subs = method_mosaic(score)
+    # First 11 = high, next 11 = low, last 10 = mid.
+    for s in subs:
+        assert all(i in high_set for i in s[:11])
+        assert all(i in low_set for i in s[11:22])
+        assert all(i in mid_set for i in s[22:32])
+# ---------------------------------------------------------------------------
+# methods: gradient
+# ---------------------------------------------------------------------------
+def test_gradient_shapes(synthetic_pool):
+    a3m, fi, _, _ = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    score = contrast_hvlv(pool.fi_matrix)
+    subs = method_gradient(score)
+    assert len(subs) == N_SUBSETS
+    for s in subs:
+        assert len(s) == TARGET_SIZE
+def test_gradient_determinism(synthetic_pool):
+    a3m, fi, _, _ = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    score = contrast_hvlv(pool.fi_matrix)
+    a = method_gradient(score)
+    b = method_gradient(score)
+    assert a == b
+def test_gradient_in_pool_and_homogeneous(synthetic_pool):
+    a3m, fi, N, _ = synthetic_pool
+    pool = pool_msa(a3m, fi)
+    score = contrast_hvlv(pool.fi_matrix)
+    sorted_idx = np.argsort(score)
+    bins = []
+    for b in range(4):
+        bins.append(set(sorted_idx[(b * N) // 4: ((b + 1) * N) // 4].tolist()))
+    subs = method_gradient(score)
+    for grp_i in range(4):
+        for s_i in range(3):
+            sub = subs[grp_i * 3 + s_i]
+            assert all(0 <= i < N for i in sub), "out-of-pool index"
+            assert all(i in bins[grp_i] for i in sub), \
+                f"gradient subset {grp_i*3+s_i} leaked outside quartile {grp_i}"
+# ---------------------------------------------------------------------------
+# CLI smoke
+# ---------------------------------------------------------------------------
+def test_cli_build_smoke(tmp_path, synthetic_pool):
+    from sf_cluster.cli import main as cli_main
+    a3m, fi, _, _ = synthetic_pool
+    out = tmp_path / "subs_mosaic"
+    rc = cli_main([
+        "build",
+        "--a3m", str(a3m),
+        "--fi", str(fi),
+        "--method", "mosaic",
+        "--out", str(out),
+    ])
+    assert rc == 0
+    files = sorted(out.glob("mosaic_subset_*.a3m"))
+    assert len(files) == N_SUBSETS
+    assert (out / "mosaic_subset_index.tsv").exists()
+    assert (out / "mosaic_meta.json").exists()
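
Reviewer note: `test_mosaic_tier_composition` pins down the layout of a mosaic subset (11 indices from the top score tercile, then 11 from the bottom, then 10 from the middle) without showing how one could be assembled. The sketch below is one hypothetical way to build a subset that satisfies those assertions. `tier_sets` and `sketch_mosaic_subset` are illustrative names, the explicit `rng` argument is an assumption, and this is not the `method_mosaic` implementation in `src/sf_cluster/methods.py`, which the tests call without a seed argument.

```python
import numpy as np

def tier_sets(score):
    """Split pool indices into low / mid / high terciles of the sorted score."""
    n = len(score)
    order = np.argsort(score, kind="stable")
    low = order[: n // 3]
    mid = order[n // 3 : 2 * n // 3]
    high = order[2 * n // 3 :]
    return low, mid, high

def sketch_mosaic_subset(score, rng, n_high=11, n_low=11, n_mid=10):
    """Hypothetical builder: draw without replacement from each tercile.

    The ordering (high, then low, then mid) mirrors the slices that the
    tier-composition test checks; the sampling scheme itself is assumed.
    """
    low, mid, high = tier_sets(score)
    return (
        rng.choice(high, n_high, replace=False).tolist()
        + rng.choice(low, n_low, replace=False).tolist()
        + rng.choice(mid, n_mid, replace=False).tolist()
    )

# Demo on a synthetic score vector the same size as the test fixture pool.
demo_score = np.random.default_rng(0).normal(size=200)
demo_subset = sketch_mosaic_subset(demo_score, np.random.default_rng(1))
print(len(demo_subset))  # 32 indices: 11 high + 11 low + 10 mid
```

Passing a seeded generator keeps the sketch reproducible, which is the property `test_mosaic_determinism` checks for the real builder.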