tanaos-spam-detection-german: A small but performant base spam detection model specifically designed for German text

This model was created by Tanaos with the Artifex Python library.

This is a spam detection model based on distilbert-base-multilingual-cased and fine-tuned on a synthetic dataset to classify text as spam or not_spam.

This model is intended to be used as a first-layer spam filter for email systems, messaging applications or any other text-based communication platform, and it was specifically fine-tuned to perform well on German text.

The following categories are considered spam:

Unsolicited commercial advertisement or non-commercial proselytizing.
Fraudulent schemes, including get-rich-quick and pyramid schemes.
Phishing attempts, unrealistic offers or announcements.
Content with deceptive or misleading information.
Malware or harmful links.
Adult content or explicit material.
Excessive use of capitalization or punctuation to grab attention.

Languages

The main model language is German, but spam detection models specialized in other languages are available as well; they are listed under Model Description.

How to Use

Use this model through the Artifex library:

Install Artifex with:

pip install artifex

Then use the model with:

from artifex import Artifex

spam_detection = Artifex().spam_detection(language="german")

print(spam_detection("Sie haben ein iPhone 16 gewonnen! Klicken Sie hier, um Ihren Preis zu beanspruchen."))

# >>> [{'label': 'spam', 'score': 0.9989}]
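
The call above returns a list containing one dict with a `label` and a `score`. As a minimal sketch of how a first-layer filter might apply that output to a batch of messages, the helper below splits texts into kept and flagged lists. `split_by_label` and `stub_classifier` are hypothetical names introduced here for illustration, not part of Artifex; the classifier is passed in as a callable, and the sketch assumes it returns the output shape shown above for a single string.

```python
from typing import Callable, Dict, List, Tuple

# Shape of the classifier output shown above: [{'label': 'spam', 'score': 0.9989}]
Prediction = List[Dict[str, object]]


def split_by_label(
    messages: List[str],
    classify: Callable[[str], Prediction],
) -> Tuple[List[str], List[str]]:
    """Split messages into (kept, flagged) using the top prediction's label."""
    kept: List[str] = []
    flagged: List[str] = []
    for text in messages:
        top = classify(text)[0]  # assumption: one prediction dict per input text
        if top["label"] == "spam":
            flagged.append(text)
        else:
            kept.append(text)
    return kept, flagged


def stub_classifier(text: str) -> Prediction:
    """Stand-in for the real model (illustration only): flags texts containing 'gewonnen'."""
    is_spam = "gewonnen" in text.lower()
    return [{"label": "spam" if is_spam else "not_spam", "score": 0.99}]


kept, flagged = split_by_label(
    ["Sie haben ein iPhone 16 gewonnen!", "Das Meeting beginnt um 10 Uhr."],
    stub_classifier,
)
```

In practice, the `spam_detection` object created above would presumably be passed in place of `stub_classifier`; the stub only exists so the sketch runs without downloading the model.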

Model Description

Base model: distilbert/distilbert-base-multilingual-cased
Task: Text classification (spam detection)
Languages: German; for other languages, see:
- English: tanaos-spam-detection-v1
- Spanish: tanaos-spam-detection-spanish
- Italian: tanaos-spam-detection-italian
Fine-tuning data: A synthetic, custom dataset of spam and not spam examples.

Training Details

This model was trained using the Artifex Python library

pip install artifex

by providing the following instructions and generating 10,000 synthetic training samples:

from artifex import Artifex

spam_detection = Artifex().spam_detection()

spam_detection.train(
    spam_content=[
        "Unaufgeforderte kommerzielle Werbung oder nicht-kommerzielle Missionierung",
        "Betrügerische Machenschaften, einschließlich Schnell-reich-werden- und Schneeballsysteme",
        "Phishing-Versuche, unrealistische Angebote oder Ankündigungen",
        "Inhalte mit täuschenden oder irreführenden Informationen",
        "Malware oder schädliche Links",
        "Inhalte für Erwachsene oder explizites Material",
        "Übermäßige Verwendung von Großbuchstaben oder Satzzeichen, um Aufmerksamkeit zu erregen",
    ],
    language="german",
    num_samples=10000
)

Intended Uses

This model is intended to:

Serve as a first-layer spam filter for email systems, messaging applications, or any other text-based communication platform, if the text is in German.
Help reduce unwanted or harmful messages by classifying text as spam or not spam.

Not intended for:

Use in high-stakes scenarios where misclassification could lead to significant consequences without further human review.
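
One way to keep a human in the loop, sketched under assumptions: act automatically only on confident predictions and send everything else to manual review. `route_message` and the 0.95 threshold are hypothetical choices made for this example, not values from the model or from Artifex; the input is assumed to have the `[{'label': ..., 'score': ...}]` shape shown in How to Use.

```python
from typing import Dict, List


def route_message(
    prediction: List[Dict[str, object]],
    auto_threshold: float = 0.95,  # hypothetical cut-off; tune on your own data
) -> str:
    """Return 'deliver', 'quarantine', or 'review' for one classifier output."""
    top = prediction[0]
    score = float(top["score"])
    if score < auto_threshold:
        # Low-confidence predictions are not acted on automatically.
        return "review"
    return "quarantine" if top["label"] == "spam" else "deliver"
```

For example, `route_message([{'label': 'spam', 'score': 0.9989}])` returns `'quarantine'`, while a prediction with a score of 0.6 is routed to `'review'` regardless of its label.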

Model size: 0.1B params
Tensor type: F32 (Safetensors)
