CASSANDRA β€” ASL configuration on TRAM2

Fine-tuned CTI-BERT models for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports. This repository contains the ASL configuration of the CASSANDRA recipe trained on TRAM2 (50 ATT&CK sub-techniques), comprising 6 ensemble members trained with seeds {42, 123, 456, 789, 2024, 3141}.

This is the highest-F1 configuration in the CASSANDRA paper on TRAM2. The asymmetric loss + 6-seed ensemble adds +3.30 F1 over the BCE 3-seed configuration (cassandra-bce-tram2).

Anonymous artifact for ACM CCS 2026 review. Author information will be added after the review period.

Headline result

On the TRAM2 test set (30 scored documents):

  • 6-seed ensemble per-document F1 at dev-tuned Ο„=0.59: 77.17%
  • 6-seed ensemble per-document F1 at uniform Ο„=0.5: 76.87%
  • Bootstrap 95% CI (2000 resamples over the 30 test documents): [70.72%, 82.42%]
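The exact resampling code is not reproduced here; a minimal percentile-bootstrap sketch consistent with the description above (resampling the 30 scored documents with replacement, 2000 times) looks as follows. That the CI is taken over the mean per-document F1 is an assumption; function and argument names are illustrative.

```python
import numpy as np

def bootstrap_ci(per_doc_f1, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-document F1.

    per_doc_f1: one F1 score per scored test document.
    """
    rng = np.random.default_rng(seed)
    per_doc_f1 = np.asarray(per_doc_f1, dtype=float)
    n = len(per_doc_f1)
    # Resample documents with replacement; recompute the mean each time.
    means = np.array([
        rng.choice(per_doc_f1, size=n, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```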

Comparison numbers from Buchel et al. (2025) on the same protocol:

  • Llama 3.1 8B (generative): 72.50% β€” this configuration: 77.17% (+4.67 F1, 73Γ— smaller)
  • CySecBERT 110M: 69.74%
  • CTI-BERT 110M (SoK baseline): 69.07%

The per-seed table below shows this artifact's individual seed F1s and the ensemble F1; small deviations from the headline numbers (≤0.5 F1) reflect differences in floating-point summation order across hardware at inference time. Full per-seed and ensemble metrics are in results.json.

Architecture

LabelAttentionClassifier with asymmetric loss training:

  • Encoder: ibm-research/CTI-BERT (110M params, 768 hidden)
  • Head: 50 learned 768-dim label queries that attend over the encoder's last_hidden_state, followed by a shared 1-output linear layer applied per-label
  • Loss: Asymmetric Loss (Ridnik et al. 2021) with Ξ³_neg=4, Ξ³_pos=0, clip=0.05 β€” designed for long-tail multi-label classification, suppresses easy negatives so the model can attend to rare-class signal
  • Regularization / training tricks: layer-wise learning rate decay (Ξ±=0.85), exponential moving average (Ξ²=0.999), stochastic weight averaging (last 25% of epochs), per-seed best-of-{base, EMA, SWA} selection on validation macro-F1, multi-seed probability averaging at inference
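For reference, the asymmetric-loss formula with the hyperparameters above can be sketched in NumPy. This is an illustrative reimplementation of the Ridnik et al. (2021) formulation; the repo's actual training loss presumably lives in the PyTorch training code.

```python
import numpy as np

def asymmetric_loss(probs, targets, gamma_neg=4.0, gamma_pos=0.0,
                    clip=0.05, eps=1e-8):
    """Asymmetric Loss for multi-label classification (Ridnik et al. 2021).

    probs:   sigmoid probabilities, shape (batch, num_labels)
    targets: binary labels, same shape
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # Probability shifting: negatives with p <= clip contribute zero loss.
    probs_shifted = np.clip(probs - clip, 0.0, 1.0 - eps)
    # gamma_pos = 0 leaves positives as plain log-loss.
    loss_pos = targets * (1.0 - probs) ** gamma_pos * np.log(probs)
    # gamma_neg = 4 strongly down-weights easy negatives.
    loss_neg = (1.0 - targets) * probs_shifted ** gamma_neg \
        * np.log(1.0 - probs_shifted)
    return -(loss_pos + loss_neg).mean()
```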

The architecture is custom (not derived from transformers.PreTrainedModel), so loading requires the modeling.py file shipped with this repo.
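modeling.py is the authoritative implementation; purely as an illustration, a forward pass through a label-attention head of the shape described above could look like the sketch below. Scaled dot-product attention is an assumption (the exact attention variant may differ), and all names are illustrative.

```python
import numpy as np

def label_attention_head(hidden_states, label_queries, w, b):
    """Label-attention classification head (sketch).

    hidden_states: (seq_len, hidden)    encoder last_hidden_state
    label_queries: (num_labels, hidden) learned per-label queries
    w, b:          shared 1-output linear layer: (hidden,), scalar
    Returns per-label logits, shape (num_labels,).
    """
    # Each label query attends over the token representations.
    scores = label_queries @ hidden_states.T / np.sqrt(hidden_states.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # (num_labels, seq_len)
    label_repr = attn @ hidden_states           # (num_labels, hidden)
    return label_repr @ w + b                   # (num_labels,)
```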

Training data

  • TRAM2 (Threat Report ATT&CK Mapping v2): 151 reports, 19,178 sentences, 50 ATT&CK sub-techniques. Mean of ~82 positive examples per class.
  • Splits: report-level train/test split from Buchel et al. (2025) (120 train reports, 31 test reports β€” one test report excluded from per-document F1 due to empty in-vocabulary ground truth).
  • Validation: 80:20 sentence-level random split within the training reports for early stopping and threshold selection.

Intended use

Map free-text CTI sentences to ATT&CK techniques. The model takes a single sentence and outputs a probability for each of 50 techniques.

Aggregation to document level: run per-sentence inference, take the per-class max over each document's sentences, apply the threshold, and report the resulting set of predicted techniques per document.
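A minimal NumPy sketch of this aggregation rule (function and argument names are illustrative):

```python
import numpy as np

def aggregate_document(sentence_probs, threshold=0.5, label_names=None):
    """Document-level prediction from per-sentence probabilities.

    sentence_probs: (num_sentences, num_labels) ensemble sigmoid probabilities
    Returns the set of predicted label indices (or names) for the document.
    """
    doc_probs = sentence_probs.max(axis=0)        # per-class max over sentences
    predicted = np.flatnonzero(doc_probs >= threshold)
    if label_names is not None:
        return {label_names[i] for i in predicted}
    return set(predicted.tolist())
```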

Threshold note: the headline 77.17% uses a dev-tuned global threshold Ο„=0.59 selected on the held-out validation split (NOT on test). Using Ο„=0.5 uniform yields 76.87% on the same test set. See results.json for the full threshold-sweep curve.
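The dev-tuned threshold can be reproduced in spirit with a simple grid sweep over the validation split. This sketch scores micro-F1, which is an assumption (the paper may select on per-document or macro-F1); names are illustrative.

```python
import numpy as np

def tune_threshold(dev_probs, dev_labels, grid=None):
    """Pick the global threshold maximizing F1 on the dev split (sketch).

    dev_probs:  (num_examples, num_labels) sigmoid probabilities
    dev_labels: (num_examples, num_labels) binary ground truth
    """
    if grid is None:
        grid = np.arange(0.05, 0.96, 0.01)
    best_tau, best_f1 = 0.5, -1.0
    for tau in grid:
        preds = dev_probs >= tau
        tp = (preds & (dev_labels == 1)).sum()
        fp = (preds & (dev_labels == 0)).sum()
        fn = (~preds & (dev_labels == 1)).sum()
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1
```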

Limitations:

  • Trained on English-language CTI; behavior on other languages is not characterized.
  • 50 TRAM2 sub-techniques only; sentences describing techniques outside this set produce all-zero predictions.
  • The asymmetric loss is tuned for label-dense problems (mean ~82 samples/technique). For sparser benchmarks like AnnoCTR (mean 15.5/technique), the BCE configuration is preferable β€” see cassandra-bce-annoctr and the paper Β§3.2 (RQ4 results) for the label-density transfer analysis.

How to load and run

import glob
import os

from modeling import load_ensemble, predict_ensemble

# Discover the six per-seed checkpoint directories shipped with the repo.
seed_dirs = sorted(glob.glob(os.path.join(os.path.dirname(__file__), "seeds", "seed-*")))
seeds = load_ensemble(seed_dirs, device="cuda")

sentences = [
    "The malware uses Windows Command Shell to execute encoded scripts.",
    "After initial access, persistence was established via Registry Run Keys.",
]
# Ο„=0.5 is the simplest threshold; for the headline 77.17% use Ο„=0.59
results = predict_ensemble(seeds, sentences, threshold=0.5)
for sentence, techniques in results:
    print(sentence, "->", techniques)

A complete CLI example is in inference_example.py:

pip install -r requirements.txt
python inference_example.py --threshold 0.59

Per-seed members

Seed   Tag    Per-document F1 (τ=0.5)   Selected weights
42     orig   72.33%                    SWA
123    orig   74.72%                    EMA
456    orig   73.51%                    base
789    repl   71.31%                    SWA
2024   repl   70.35%                    SWA
3141   repl   69.81%                    EMA

6-seed ensemble, τ=0.5:       76.87%
6-seed ensemble, dev τ=0.59:  77.17%

The "orig" / "repl" tag distinguishes the two 3-seed runs reported in the paper (originals and replication seeds for the per-seed variance study). For the headline 6-seed ensemble F1 they are pooled by averaging sigmoid probabilities.

For verification without re-running the model, each seed directory contains a seed_probs.npz file with the model's per-sentence sigmoid probabilities β€” sufficient to recompute every F1 number in the model card and reproduce the bootstrap CI.
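Assuming each seed_probs.npz stores its probability matrix under a key such as "probs" (the actual key name is not documented here; check the shipped files), the ensemble pooling described above reduces to averaging and thresholding:

```python
import numpy as np

def ensemble_average(seed_prob_arrays, tau=0.59):
    """Pool per-seed sigmoid probabilities by averaging, then threshold.

    seed_prob_arrays: list of (num_sentences, num_labels) arrays, one per
    seed, e.g. [np.load(p)["probs"] for p in sorted(glob.glob(
    "seeds/seed-*/seed_probs.npz"))] (the key name "probs" is an assumption).
    """
    probs = np.mean(np.stack(seed_prob_arrays, axis=0), axis=0)
    return probs, probs >= tau
```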

Citation

@inproceedings{cassandra2026,
  title  = {CASSANDRA: How Many Parameters Suffice to Automate TTP Extractions from CTI Reports---Pushing Towards the Lower Bound},
  author = {Anonymous},
  booktitle = {Proceedings of the 2026 ACM SIGSAC Conference on Computer and Communications Security (CCS)},
  year   = {2026},
  note   = {Under review β€” anonymous submission}
}

Please also cite TRAM2, the CTI-BERT encoder, and the asymmetric-loss work (Ridnik et al. 2021).

License

MIT β€” see LICENSE.
