---
language: en
license: apache-2.0
library_name: setfit
tags:
- text-classification
- transaction-classification
- banking
- finance
- setfit
- sentence-transformers
- few-shot-learning
- contrastive-learning
datasets:
- mitulshah/transaction-categorization
base_model: sentence-transformers/all-MiniLM-L6-v2
pipeline_tag: text-classification
model-index:
- name: transaction-classifier-setfit
  results:
  - task:
      type: text-classification
      name: Transaction Classification
    metrics:
    - name: Real-World Accuracy (Weighted)
      type: accuracy
      value: 0.805
    - name: SetFit-Only Accuracy
      type: accuracy
      value: 0.667
    - name: Validation Accuracy
      type: accuracy
      value: 0.98
---

# Transaction Classifier: SetFit (v3)

A [SetFit](https://github.com/huggingface/setfit) model built on `sentence-transformers/all-MiniLM-L6-v2` that classifies bank transaction strings into 10 budget categories using contrastive few-shot learning.

This is **version 3** in a progressive model development series. It showed that pre-trained semantic embeddings substantially outperform traditional NLP approaches for transaction classification: weighted real-world accuracy rose from 55.7% (v2, FastText) to 80.5%.

## Model Details

| Property | Value |
|---|---|
| Base model | `sentence-transformers/all-MiniLM-L6-v2` (22M params) |
| Framework | SetFit (contrastive learning + logistic head) |
| Task | Multi-class text classification (10 categories) |
| Training samples | 8,000 |
| Contrastive iterations | 20 |
| Epochs | 1 |
| Batch size | 32 |
| Format | SafeTensors + model_head.pkl |
| Trained | 2026-03-28 |

## Categories

| ID | Category |
|---|---|
| 0 | Food & Dining |
| 1 | Transportation |
| 2 | Shopping & Retail |
| 3 | Entertainment & Recreation |
| 4 | Healthcare & Medical |
| 5 | Utilities & Services |
| 6 | Financial Services |
| 7 | Income |
| 8 | Government & Legal |
| 9 | Charity & Donations |

## Performance

Evaluated on 505 unique real-world RBC transactions (3,113 weighted) covering 2019-2026.

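
The card does not spell out how the weighting works. One plausible reading, stated here as an assumption, is that each unique transaction string counts as many times as it occurs in the statements (505 unique strings, 3,113 occurrences). A minimal sketch of such an occurrence-weighted accuracy follows; the function name `weighted_accuracy` and the sample rows are hypothetical and not taken from the evaluation set.

```python
def weighted_accuracy(rows):
    """Occurrence-weighted accuracy.

    rows: iterable of (true_label, predicted_label, occurrence_count).
    Assumption: the weight of a transaction is how often it occurs.
    """
    rows = list(rows)
    total = sum(count for _, _, count in rows)
    if total == 0:
        raise ValueError("no transactions to score")
    correct = sum(count for true, pred, count in rows if true == pred)
    return correct / total


# Hypothetical rows for illustration only.
rows = [
    ("Food & Dining", "Food & Dining", 40),
    ("Utilities & Services", "Shopping & Retail", 5),
    ("Income", "Income", 5),
]
score = weighted_accuracy(rows)
print(f"{score:.1%}")  # 45 of 50 weighted occurrences correct -> 90.0%
```

Under this reading, a frequently repeated merchant that is classified correctly lifts the weighted score more than a one-off transaction does.
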
### Overall

| Metric | Score |
|---|---|
| Real-world accuracy (weighted) | **80.5%** |
| SetFit-only accuracy (model predictions alone) | **66.7%** |
| Validation accuracy | 98.0% |

### Per-Category Accuracy

| Category | Accuracy |
|---|---|
| Income | 97.8% |
| Healthcare & Medical | 100.0% |
| Financial Services | 91.2% |
| Entertainment & Recreation | 88.6% |
| Food & Dining | 83.9% |
| Transportation | 83.3% |
| Shopping & Retail | 74.6% |
| Government & Legal | 54.5% |
| Utilities & Services | 34.2% |
| Charity & Donations | 0.0% |

## Usage

```python
from setfit import SetFitModel

# Download the model from the Hugging Face Hub.
model = SetFitModel.from_pretrained("maaz-zaidi/transaction-classifier-setfit")

# Classify raw transaction description strings.
predictions = model.predict([
    "STARBUCKS STORE 12345",
    "SHELL GAS STATION",
    "NETFLIX.COM",
])
print(predictions)
```

## Training Data

- **Primary**: [mitulshah/transaction-categorization](https://huggingface.co/datasets/mitulshah/transaction-categorization), 8K samples drawn from 3.6M records (gated dataset)
- **Evaluation**: 505 real-world RBC bank transactions (2019-2026)

## Key Breakthrough

SetFit's contrastive learning approach was the breakthrough in this project:

- **v2 (FastText) -> v3 (SetFit)**: overall accuracy rose from 55.7% to 80.5%.
- FastText's ML-only accuracy was 14.8%, with a severe bias toward the Income category. SetFit's ML-only accuracy was 66.7%.
- Pre-trained sentence embeddings capture real-world merchant concepts that character n-grams cannot.

## Part of a Series

See the [Transaction Classifier collection](https://huggingface.co/collections/maaz-zaidi/transaction-classifier) for all 7 model versions.

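
## Mapping Predictions to Category Names

The card does not state whether `model.predict` returns integer IDs or category strings for this checkpoint; that depends on how the labels were encoded at training time. As a hedged sketch, the helper below (the names `CATEGORIES` and `to_category_name` are hypothetical) maps IDs to names using the Categories table and passes string labels through unchanged.

```python
# Category IDs as listed in the Categories table of this card.
CATEGORIES = {
    0: "Food & Dining",
    1: "Transportation",
    2: "Shopping & Retail",
    3: "Entertainment & Recreation",
    4: "Healthcare & Medical",
    5: "Utilities & Services",
    6: "Financial Services",
    7: "Income",
    8: "Government & Legal",
    9: "Charity & Donations",
}


def to_category_name(prediction):
    """Return a category name for an integer-like ID; pass strings through."""
    if isinstance(prediction, str):
        return prediction
    # int() also accepts integer-valued NumPy or tensor scalars.
    return CATEGORIES[int(prediction)]
```

With a model loaded as in the Usage section, `[to_category_name(p) for p in model.predict(texts)]` would then yield readable labels in either case.
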
## Limitations

- Contrastive learning with a logistic regression head is outperformed by standard cross-entropy fine-tuning at this data scale (see v4).
- Some categories remain weak: Utilities & Services reaches only 34.2% accuracy, and Charity & Donations 0.0%.
- The model is specific to Canadian banking transaction formats.

## Citation

```bibtex
@misc{zaidi2026txnclassifier,
  title={Transaction Classifier: Multi-Stage Bank Transaction Categorization},
  author={Maaz Zaidi},
  year={2026},
  url={https://huggingface.co/maaz-zaidi/transaction-classifier-setfit}
}
```