## Architecture

```
Input (integer token sequences, max_length=100)
→ Embedding(vocab_size=10000, embedding_dim=128, input_length=100)
→ SpatialDropout1D(rate=0.2)
→ LSTM(units=64, dropout=0.2, recurrent_dropout=0.2)
→ Dense(units=32, activation='relu')
→ Dropout(rate=0.5)
→ Dense(units=1, activation='sigmoid')
```

Total parameters: ~1.3M (dominated by the embedding matrix: 10000 × 128 = 1.28M)
Compiled with:
- Optimizer: Adam (learning_rate=0.001)
- Loss: Binary cross-entropy
- Callbacks: EarlyStopping (patience=3, restore_best_weights=True), ReduceLROnPlateau (factor=0.5, patience=2)
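The stack above can be reconstructed in a few lines. A minimal sketch using the hyperparameters listed in this card (`input_length` is omitted in favour of an explicit `Input` layer, since newer Keras versions deprecate it on `Embedding`):

```python
import tensorflow as tf

def build_model(vocab_size=10000, embedding_dim=128, max_length=100):
    """Rebuild the architecture described in this card."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(max_length,)),
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.SpatialDropout1D(0.2),
        tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
# model.count_params() -> 1,331,521, i.e. the ~1.3M stated above
```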
## Training Data

- Dataset: Sentiment140 (Go, A., Bhayani, R., & Huang, L., 2009)
- Training subset used: ~1,600 samples drawn via `df.sample(0.30)` from a Spark DataFrame on Google Colab
- Label encoding: 0 = negative (original polarity 0), 1 = positive (original polarity 4)
- Text preprocessing: raw tweet text, tokenised with `tf.keras.preprocessing.text.Tokenizer`, padded to `max_length=100`
Note: Training was performed on a small subset (~1,600 samples). The full Sentiment140 dataset contains 1.6M tweets. The preprocessing pipeline in this repository (`src/spark/preprocessing.py`) operates on the full dataset using PySpark, but the Keras model was trained on a Colab subset for prototyping purposes.
## Training Procedure

```python
model.fit(
    X_train, y_train,
    validation_split=0.2,  # ~320 validation samples
    epochs=5,              # training ran all 5 epochs
    batch_size=64,
    callbacks=[early_stopping, reduce_lr],
)
```
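The `early_stopping` and `reduce_lr` objects passed to `model.fit` are not defined in this card; a sketch reconstructed from the callback configuration listed under "Compiled with" (monitoring `val_loss` is an assumption):

```python
import tensorflow as tf

# Stop after 3 epochs without val_loss improvement and roll back
# to the best weights seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

# Halve the learning rate after 2 stagnant epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=2
)
```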
### Training log (Google Colab run)
| Epoch | Train Accuracy | Val Accuracy | Val Loss | Learning Rate |
|---|---|---|---|---|
| 1 | 68.5% | 100.0% | 0.173 | 0.001 |
| 2 | 99.7% | 100.0% | 0.093 | 0.001 |
| 3 | 99.9% | 100.0% | 0.071 | 0.001 |
| 4 | 100.0% | 100.0% | 0.043 | 0.001 |
| 5 | 100.0% | 100.0% | 0.023 | 0.001 |
Best checkpoint saved at epoch 5 (lowest val_loss = 0.023).
## Evaluation

### Important caveats

These results should not be interpreted as production accuracy:

- **Small training set:** ~1,280 training samples and ~320 validation samples. The 100% validation accuracy is consistent with a model that has memorised a small, potentially non-representative validation split.
- **AUC metric not reliable:** the AUC metric reported `0.0` throughout all training epochs. This is a known incompatibility between `tf.keras.metrics.AUC()` and certain TensorFlow/Keras version combinations when used with a sigmoid output and binary cross-entropy loss without explicit threshold configuration. AUC values are therefore excluded from this card.
- **No held-out test set evaluation:** evaluation was not performed on a separate, never-seen test set after training.
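Given the unreliable in-training AUC, one workaround (not part of the original pipeline) is to compute ROC AUC after training directly from the sigmoid outputs with scikit-learn. The labels and probabilities below are dummy stand-ins for `y_test` and `model.predict(X_test).flatten()`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Dummy values for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.9, 0.3, 0.4, 0.6])

auc = roc_auc_score(y_true, y_prob)
print(f"ROC AUC: {auc:.3f}")  # prints "ROC AUC: 0.833" for these dummy values
```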
For reliable performance estimates on binary Twitter sentiment, refer to the Spark MLlib classifiers evaluated on the full dataset (see README.md):
| Model | Accuracy | F1 (weighted) | Dataset size |
|---|---|---|---|
| Random Forest | 70.3% | 70.3% | ~233K rows |
| Gradient Boosted Trees | 69.9% | 69.8% | ~233K rows |
| Logistic Regression | 68.5% | 68.5% | ~233K rows |
| Naive Bayes | 67.2% | 67.3% | ~233K rows |
## Files in this Repository

| File | Description | Size |
|---|---|---|
| `best_LSTM_pipeline_model.h5` | Best checkpoint by validation loss | ~17 MB |
| `pipeline_lstm_model.h5` | Final epoch weights | ~17 MB |
Both files are in HDF5 format (legacy Keras format). The native `.keras` format is recommended for new training runs, but these weights are fully loadable with `tf.keras.models.load_model`.
## How to Use

### Load and run inference

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load model
model = tf.keras.models.load_model("best_LSTM_pipeline_model.h5")

# Reproduce the tokeniser (must match training): it must be fitted on the
# same training texts used during training, because the tokenizer state is
# not saved alongside the .h5 file.
tokenizer = Tokenizer(num_words=10000)

# Preprocess new text
texts = ["I love this product!", "Terrible experience, would not recommend."]
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=100)

# Predict
predictions = model.predict(padded)
labels = ["positive" if p > 0.5 else "negative" for p in predictions.flatten()]
print(labels)
```
**Important:** The `Tokenizer` vocabulary is not saved in the `.h5` file. To reproduce predictions, you must refit the tokenizer on the same training texts. This is a known limitation of this prototype; a future improvement would be to save the tokenizer alongside the model weights (e.g., as `tokenizer.json`).
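One way to close that gap in a future revision is to serialise the fitted tokenizer with `to_json()` and restore it with `tokenizer_from_json`. A sketch; the fitting texts and the `tokenizer.json` file name are illustrative, not files shipped in this repository:

```python
from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json

# Fit on (stand-in) training texts, exactly as during training
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(["I love this product!", "Terrible experience, would not recommend."])

# Save the full tokenizer state next to the model weights
with open("tokenizer.json", "w") as f:
    f.write(tokenizer.to_json())

# At inference time, restore it instead of refitting
with open("tokenizer.json") as f:
    tokenizer = tokenizer_from_json(f.read())
```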
### Download via `huggingface_hub`

```python
import tensorflow as tf
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="alivafaei/sentiment140-lstm",
    filename="best_LSTM_pipeline_model.h5",
)
model = tf.keras.models.load_model(path)
```
## Limitations

- Trained on ~1,600 samples, far below what is needed for a reliable general-purpose sentiment classifier
- No tokenizer state saved; inference requires refitting the tokenizer on the original training data
- HDF5 format is legacy; convert to the `.keras` format for use with Keras 3+
- No evaluation on a held-out test set
- Not suitable for production use without retraining on the full dataset with proper evaluation
## Intended Use
This model was created as a learning artifact for an NLP/ML research prototype. It is appropriate for:
- Understanding LSTM architecture for text classification
- Demonstrating end-to-end ML pipeline design (PySpark preprocessing → Keras training)
- Academic / portfolio reference
It is not appropriate for production sentiment analysis, business decision-making, or any application requiring reliable accuracy.
## Citation
If you use this model or the MICAP pipeline in your work:
```bibtex
@misc{vafaei2024micap,
  author = {Vafaei, Ali},
  title  = {MICAP: Market Intelligence \& Competitor Analysis Platform},
  year   = {2024},
  url    = {https://github.com/itsalivafaei/micap}
}
```
Dataset citation:
```bibtex
@inproceedings{go2009twitter,
  title     = {Twitter Sentiment Classification using Distant Supervision},
  author    = {Go, Alec and Bhayani, Richa and Huang, Lei},
  booktitle = {CS224N Project Report, Stanford},
  year      = {2009}
}
```