	---
	license: mit
	library_name: pytorch
	tags:
	- cti
	- attack-classification
	- mitre-attack
	- cybersecurity
	- text-classification
	- multi-label-classification
	language:
	- en
	base_model: ibm-research/CTI-BERT
	---

	# CASSANDRA — BCE configuration on TRAM2

	Fine-tuned CTI-BERT models for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports. This repository contains the BCE configuration of the CASSANDRA recipe trained on TRAM2 (50 ATT&CK sub-techniques), comprising 3 ensemble members trained with seeds {42, 123, 456}.

	> Anonymous artifact for ACM CCS 2026 review. Final author identification will be added after review.

	## Headline result

	On the TRAM2 test set (30 scored documents):

	- 3-seed ensemble per-document F1 (τ=0.5): 73.87%
	- Exceeds Llama 3.1 8B (72.50%, Buchel et al. 2025) with roughly 73× fewer parameters (110M vs. 8B).

	The per-seed table below reports the individual seed F1s and the ensemble F1 measured on the released artifact. Deviations from the headline of up to 0.3 F1 points reflect differences in floating-point operation ordering across hardware at inference time. Full per-seed and ensemble metrics are in [`results.json`](./results.json).

	## Architecture

	`LabelAttentionClassifier`: a 110M-parameter CTI-BERT encoder followed by a per-label attention head.

	- Encoder: [`ibm-research/CTI-BERT`](https://huggingface.co/ibm-research/CTI-BERT) (110M params, 768 hidden)
	- Head: 50 learned 768-dim label queries that attend over the encoder's `last_hidden_state`, followed by a shared 1-output linear layer applied per-label
	- Loss: BCE with `pos_weight=5.0`
	- Regularization / training tricks: layer-wise learning rate decay (α=0.85), exponential moving average (β=0.999), multi-seed probability averaging at inference
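The loss and training tricks listed above can be illustrated with a small pure-Python sketch. This is not the repository's training code: the 12-layer depth, the base learning rate of 2e-5, the choice to give the top layer the base rate, and all function names are assumptions made for illustration. Only `pos_weight=5.0`, α=0.85, and β=0.999 come from this card.

```python
import math


def weighted_bce(prob, target, pos_weight=5.0):
    """Binary cross-entropy for one label, with positives up-weighted.

    Follows the usual pos_weight convention: the positive term is
    multiplied by pos_weight, the negative term is left unchanged.
    """
    eps = 1e-12  # guards against log(0)
    pos_term = pos_weight * target * math.log(prob + eps)
    neg_term = (1.0 - target) * math.log(1.0 - prob + eps)
    return -(pos_term + neg_term)


def layerwise_learning_rates(base_lr, num_layers, decay=0.85):
    """Learning rate per encoder layer, index 0 = lowest layer.

    Assumed convention: the top layer trains at base_lr and each layer
    below it is scaled by one more factor of `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def ema_update(shadow, current, beta=0.999):
    """One exponential-moving-average step over a flat list of weights."""
    return [beta * s + (1.0 - beta) * c for s, c in zip(shadow, current)]


if __name__ == "__main__":
    # Missed positive (p=0.1, y=1) versus equally confident false alarm (p=0.9, y=0).
    print(weighted_bce(0.1, 1.0), weighted_bce(0.9, 0.0))

    rates = layerwise_learning_rates(base_lr=2e-5, num_layers=12)  # hypothetical values
    print(f"bottom layer: {rates[0]:.2e}, top layer: {rates[-1]:.2e}")

    print(ema_update([1.0, 2.0], [0.0, 4.0]))
```

With `pos_weight=5.0`, a missed positive is penalized five times as heavily as an equally confident false alarm, which pushes the model toward recall on sparsely labeled techniques.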

	The architecture is custom (not derived from `transformers.PreTrainedModel`), so loading requires the [`modeling.py`](./modeling.py) file shipped with this repo.

	## Training data

	- TRAM2 (Threat Report ATT&CK Mapping v2): 151 reports, 19,178 sentences, 50 ATT&CK sub-techniques. Mean of ~82 positive examples per class.
	- Splits: report-level train/test split from Buchel et al. (2025) "SoK: A Survey of Approaches for ATT&CK Classifier Construction" (120 train reports, 31 test reports). One test report is excluded from per-document F1 because its ground truth contains no in-vocabulary techniques.
	- Validation: 80:20 sentence-level random split within the training reports for early stopping and threshold selection.

	## Intended use

	Map free-text CTI sentences (analyst reports, incident write-ups, vendor advisories) to ATT&CK techniques. The model takes a single sentence and outputs a probability for each of 50 techniques.

	Aggregation to document level (paper convention): run inference on each sentence, take the per-class maximum probability across all sentences in the document, threshold it (τ=0.5 in the reported results), and report the resulting set of techniques for the document. F1 is computed against the document-level technique set.
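The aggregation convention above can be sketched in a few lines of plain Python. The function names, the inclusive threshold comparison, and the three technique IDs are assumptions for illustration, not the API of `modeling.py`. The sketch scores a single document; how per-document scores are combined into the reported figure is not specified in this card.

```python
def aggregate_document(sentence_probs, labels, threshold=0.5):
    """Collapse per-sentence probabilities into one technique set.

    sentence_probs: one list of per-class probabilities per sentence.
    labels: technique IDs, in the same order as the probability columns.
    """
    if not sentence_probs:
        return set()
    # Per-class maximum across all sentences of the document.
    doc_scores = [max(column) for column in zip(*sentence_probs)]
    # Whether the boundary is inclusive is an assumption of this sketch.
    return {label for label, score in zip(labels, doc_scores) if score >= threshold}


def document_f1(predicted, gold):
    """F1 between a predicted and a gold technique set for one document."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    labels = ["T1059.003", "T1547.001", "T1566.001"]  # illustrative subset and order
    probs = [
        [0.91, 0.05, 0.10],  # sentence 1
        [0.20, 0.62, 0.30],  # sentence 2
    ]
    predicted = aggregate_document(probs, labels)
    print(sorted(predicted))
    print(document_f1(predicted, gold={"T1059.003", "T1566.001"}))
```

Taking the maximum before thresholding yields the same set as thresholding each sentence and taking the union, but it keeps a single document-level score per technique that can be re-thresholded later.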

	Limitations:
	- Trained on English-language CTI; behavior on other languages is not characterized.
	- The label vocabulary is fixed at the 50 TRAM2 sub-techniques.
	- Within TRAM2, the rarest techniques have ~7 positive examples; predictions for these classes are noisier than for densely populated techniques.

	## How to load and run

	```python
	import glob
	import os

	from modeling import load_ensemble, predict_ensemble

	# Collect the per-seed checkpoint directories matching seeds/seed-*.
	seed_dirs = sorted(glob.glob(os.path.join(os.path.dirname(__file__), "seeds", "seed-*")))
	seeds = load_ensemble(seed_dirs, device="cuda")

	sentences = [
	    "The malware uses Windows Command Shell to execute encoded scripts.",
	    "After initial access, persistence was established via Registry Run Keys.",
	]
	# Run the ensemble and apply the 0.5 decision threshold.
	results = predict_ensemble(seeds, sentences, threshold=0.5)
	for sentence, techniques in results:
	    print(sentence, "->", techniques)
	```

	A complete CLI example is in [`inference_example.py`](./inference_example.py):

	```bash
	pip install -r requirements.txt
	python inference_example.py
	```

	## Per-seed members

	| Seed | Per-document F1 (τ=0.5) | Selected weights |
	|---|---|---|
	| 42 | 73.78% | EMA |
	| 123 | 71.97% | EMA |
	| 456 | 75.59% | EMA |
	| 3-seed ensemble | 73.87% | n/a |

	For verification without re-running the model, each seed directory contains a `seed_probs.npz` file with the model's per-sentence sigmoid probabilities on the test and dev splits — sufficient to recompute every F1 number in the model card.
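As a rough sketch of the first verification step, the per-seed probabilities can be averaged element-wise to obtain the ensemble probabilities. The example uses nested lists so that it runs without dependencies. In practice the matrices would come from each seed's `seed_probs.npz` (for instance via `numpy.load`), whose array names are not documented in this card and are therefore not guessed here. The function name is hypothetical.

```python
def average_seed_probs(per_seed_probs):
    """Element-wise mean of per-seed probability matrices.

    per_seed_probs: one [sentences x classes] nested list per seed,
    all with identical shape and sentence order.
    """
    num_seeds = len(per_seed_probs)
    return [
        # rows holds the same sentence's probability row from every seed.
        [sum(values) / num_seeds for values in zip(*rows)]
        for rows in zip(*per_seed_probs)
    ]


if __name__ == "__main__":
    # Toy values: 3 seeds, 2 sentences, 2 classes.
    seed_a = [[0.9, 0.2], [0.4, 0.6]]
    seed_b = [[0.7, 0.4], [0.5, 0.3]]
    seed_c = [[0.8, 0.3], [0.6, 0.6]]
    for row in average_seed_probs([seed_a, seed_b, seed_c]):
        print([round(p, 3) for p in row])
```

The averaged matrix can then be aggregated per document and thresholded as described under Intended use.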

	## Citation

	```bibtex
	@inproceedings{cassandra2026,
	title = {CASSANDRA: How Many Parameters Suffice to Automate TTP Extractions from CTI Reports---Pushing Towards the Lower Bound},
	author = {Anonymous},
	booktitle = {Proceedings of the 2026 ACM SIGSAC Conference on Computer and Communications Security (CCS)},
	year = {2026},
	note = {Under review — anonymous submission}
	}
	```

	Please also cite the TRAM2 dataset and the CTI-BERT encoder.

	## License

	MIT — see [`LICENSE`](./LICENSE).