---
license: mit
library_name: pytorch
tags:
- cti
- attack-classification
- mitre-attack
- cybersecurity
- text-classification
- multi-label-classification
language:
- en
base_model: ibm-research/CTI-BERT
---

# CASSANDRA: BCE configuration on TRAM2

Fine-tuned CTI-BERT models for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports. This repository contains the **BCE configuration** of the CASSANDRA recipe trained on **TRAM2** (50 ATT&CK sub-techniques), comprising **3 ensemble members** trained with seeds {42, 123, 456}.

> Anonymous artifact for ACM CCS 2026 review. Final author identification will be added after review.

## Headline result

On the **TRAM2** test set (30 scored documents):

- **3-seed ensemble per-document F1 (τ=0.5): 73.87%**
- Exceeds Llama 3.1 8B (72.50%, Buchel et al. 2025) with roughly 73× fewer parameters (110M vs. 8B).

The per-seed table below lists the F1 of each individual seed and of the ensemble as measured on the live artifact; small deviations from the headline number (≤0.3 F1) reflect floating-point operation ordering at inference time on different hardware. Full per-seed and ensemble metrics are in [`results.json`](./results.json).

## Architecture

`LabelAttentionClassifier`: a 110M-parameter CTI-BERT encoder followed by a per-label attention head.

- Encoder: [`ibm-research/CTI-BERT`](https://huggingface.co/ibm-research/CTI-BERT) (110M params, 768 hidden)
- Head: 50 learned 768-dim label queries that attend over the encoder's `last_hidden_state`, followed by a shared 1-output linear layer applied per-label
- Loss: BCE with `pos_weight=5.0`
- Regularization / training tricks: layer-wise learning rate decay (α=0.85), exponential moving average (β=0.999), multi-seed probability averaging at inference

The architecture is custom (not derived from `transformers.PreTrainedModel`), so loading requires the [`modeling.py`](./modeling.py) file shipped with this repo.

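
The loss and training settings listed under Architecture can be illustrated with a small pure-Python sketch. It is not the repository's training code. The decay direction (top layer at the base rate, lower layers scaled down), the 12-layer depth, the base learning rate of 2e-5, and all function names are assumptions made for illustration. Only α=0.85, β=0.999, and `pos_weight=5.0` come from this card. The weighted BCE follows the same form as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`, which is assumed to be what the setting refers to.

```python
import math

ALPHA = 0.85       # layer-wise LR decay factor (from the Architecture list)
BETA = 0.999       # EMA decay (from the Architecture list)
POS_WEIGHT = 5.0   # BCE positive-class weight (from the Architecture list)


def layerwise_lrs(base_lr, num_layers, decay=ALPHA):
    """Per-layer learning rates, bottom layer first.

    Assumption: the top encoder layer trains at base_lr and each layer
    below it is scaled by one more factor of `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def ema_update(shadow, params, beta=BETA):
    """One EMA step per weight: shadow <- beta * shadow + (1 - beta) * param."""
    return [beta * s + (1.0 - beta) * p for s, p in zip(shadow, params)]


def weighted_bce(prob, target, pos_weight=POS_WEIGHT):
    """BCE for one label; the positive term is scaled by pos_weight."""
    eps = 1e-12  # guards against log(0)
    return -(pos_weight * target * math.log(prob + eps)
             + (1.0 - target) * math.log(1.0 - prob + eps))


# Hypothetical settings: 12 encoder layers, base learning rate 2e-5.
lrs = layerwise_lrs(base_lr=2e-5, num_layers=12)
print(f"bottom layer lr={lrs[0]:.2e}, top layer lr={lrs[-1]:.2e}")
```

With `pos_weight=5.0`, a missed positive costs five times as much as an equally confident false positive, which pushes the classifier toward recall on sparse labels.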
## Training data

- **TRAM2** (Threat Report ATT&CK Mapping v2): 151 reports, 19,178 sentences, 50 ATT&CK sub-techniques. Mean of ~82 positive examples per class.
- **Splits**: report-level train/test split from Buchel et al. (2025), "SoK: A Survey of Approaches for ATT&CK Classifier Construction" (120 train reports, 31 test reports; one test report is excluded from per-document F1 because its ground truth contains no in-vocabulary techniques, leaving 30 scored documents).
- **Validation**: 80:20 sentence-level random split within the training reports for early stopping and threshold selection.

## Intended use

Map free-text CTI sentences (analyst reports, incident write-ups, vendor advisories) to ATT&CK techniques. The model takes a single sentence and outputs a probability for each of the 50 techniques.

**Aggregation to document level (paper convention):** run inference on each sentence, take the per-class maximum probability across the sentences of a document, threshold that maximum at τ, and report the resulting set of techniques for the document (equivalently, the union of the per-sentence predictions). F1 is computed against the document-level technique set.

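
The document-level aggregation described above can be sketched in a few lines of plain Python. The function names, the use of `>=` at the threshold, and the choice to average F1 across documents are assumptions for illustration and are not taken from the repository's evaluation code. The probabilities in the example are made up.

```python
def document_predictions(sentence_probs, threshold=0.5):
    """Max-pool per-sentence probabilities per class, then threshold.

    sentence_probs: list of rows, one row of class probabilities per sentence.
    Returns the set of predicted class indices for the document.
    """
    num_classes = len(sentence_probs[0])
    doc_max = [max(row[c] for row in sentence_probs) for c in range(num_classes)]
    return {c for c, p in enumerate(doc_max) if p >= threshold}


def f1(predicted, gold):
    """Set-based F1 between predicted and gold technique sets."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def per_document_f1(documents):
    """Mean F1 over (sentence_probs, gold_set) pairs (assumed averaging)."""
    scores = [f1(document_predictions(probs), gold) for probs, gold in documents]
    return sum(scores) / len(scores)


# Toy document: 2 sentences, 3 classes; class 1 is the only gold technique.
toy_probs = [[0.1, 0.9, 0.2],
             [0.6, 0.3, 0.2]]
print(document_predictions(toy_probs))   # classes whose max probability >= 0.5
print(per_document_f1([(toy_probs, {1})]))
```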
**Limitations:**
- Trained on English-language CTI; behavior on other languages is not characterized.
- The label vocabulary is fixed at the 50 TRAM2 sub-techniques.
- Within TRAM2, the rarest techniques have ~7 positive examples; predictions for these classes are noisier than for densely populated techniques.

## How to load and run

Run the snippet from a script in the repository root, so that `modeling.py` is importable and `__file__` is defined:

```python
import glob
import os

from modeling import load_ensemble, predict_ensemble

# Collect the per-seed directories located next to this script.
seed_dirs = sorted(glob.glob(os.path.join(os.path.dirname(__file__), "seeds", "seed-*")))
seeds = load_ensemble(seed_dirs, device="cuda")  # as written, this expects a CUDA GPU

sentences = [
    "The malware uses Windows Command Shell to execute encoded scripts.",
    "After initial access, persistence was established via Registry Run Keys.",
]
# Ensemble prediction with decision threshold 0.5.
results = predict_ensemble(seeds, sentences, threshold=0.5)
for sentence, techniques in results:
    print(sentence, "->", techniques)
```

A complete CLI example is in [`inference_example.py`](./inference_example.py):

```bash
pip install -r requirements.txt
python inference_example.py
```

## Per-seed members

| Seed | Per-document F1 (τ=0.5) | Selected weights |
|---|---|---|
| 42 | 73.78% | EMA |
| 123 | 71.97% | EMA |
| 456 | 75.59% | EMA |
| **3-seed ensemble** | **73.87%** | n/a |

For verification without re-running the model, each seed directory contains a `seed_probs.npz` file with the model's per-sentence sigmoid probabilities on the test and dev splits, which is sufficient to recompute every F1 number in this model card.

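
Recomputing the ensemble row from the stored per-seed probabilities amounts to averaging them across seeds before thresholding. The sketch below shows only that averaging step on nested lists; the function name and the toy numbers are made up for illustration, and it is not the repository's `predict_ensemble`.

```python
def average_probabilities(per_seed_probs):
    """Element-wise mean over seeds.

    per_seed_probs: one [sentence][class] probability matrix per seed,
    all with the same shape. Returns a single [sentence][class] matrix.
    """
    n_seeds = len(per_seed_probs)
    return [
        [sum(values) / n_seeds for values in zip(*rows)]
        for rows in zip(*per_seed_probs)
    ]


# Toy input: 3 seeds, 2 sentences, 2 classes (made-up numbers).
seed_a = [[0.2, 0.8], [0.9, 0.1]]
seed_b = [[0.4, 0.6], [0.7, 0.3]]
seed_c = [[0.6, 0.1], [0.8, 0.2]]
ensemble = average_probabilities([seed_a, seed_b, seed_c])
print(ensemble)
```

The `.npz` files can be opened with `numpy.load`. The array names inside them are not listed in this card, so inspect the `.files` attribute of the loaded object first and convert the arrays with `.tolist()` before passing them to a helper like this one.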
## Citation

```bibtex
@inproceedings{cassandra2026,
  title     = {CASSANDRA: How Many Parameters Suffice to Automate TTP Extractions from CTI Reports---Pushing Towards the Lower Bound},
  author    = {Anonymous},
  booktitle = {Proceedings of the 2026 ACM SIGSAC Conference on Computer and Communications Security (CCS)},
  year      = {2026},
  note      = {Under review --- anonymous submission}
}
```

Please also cite the TRAM2 dataset and the CTI-BERT encoder.

## License

MIT; see [`LICENSE`](./LICENSE).