## Architecture

```
Input (integer token sequences, max_length=100)
→ Embedding(vocab_size=10000, embedding_dim=128, input_length=100)
→ SpatialDropout1D(rate=0.2)
→ LSTM(units=64, dropout=0.2, recurrent_dropout=0.2)
→ Dense(units=32, activation='relu')
→ Dropout(rate=0.5)
→ Dense(units=1, activation='sigmoid')
```

Total parameters: ~1.3M (dominated by the embedding matrix: 10000 × 128 = 1.28M)
Compiled with:
- Optimizer: Adam (learning_rate=0.001)
- Loss: Binary cross-entropy
- Callbacks: EarlyStopping (patience=3, restore_best_weights=True), ReduceLROnPlateau (factor=0.5, patience=2)
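The stack above can be reconstructed in a few lines. A minimal sketch using the hyperparameters listed in this card (`input_length` is omitted in favour of an explicit `Input` layer, since newer Keras versions deprecate it on `Embedding`):

```python
import tensorflow as tf

def build_model(vocab_size=10000, embedding_dim=128, max_length=100):
    """Rebuild the architecture described in this card."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(max_length,)),
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.SpatialDropout1D(0.2),
        tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
# model.count_params() -> 1,331,521, i.e. the ~1.3M stated above
```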
## Training Data

- Dataset: Sentiment140 (Go, A., Bhayani, R., & Huang, L., 2009)
- Training subset used: ~1,600 samples drawn via `df.sample(0.30)` from a Spark DataFrame on Google Colab
- Label encoding: 0 = negative (original polarity 0), 1 = positive (original polarity 4)
- Text preprocessing: raw tweet text, tokenised with `tf.keras.preprocessing.text.Tokenizer`, padded to `max_length=100`
Note: Training was performed on a small subset (~1,600 samples). The full Sentiment140 dataset contains 1.6M tweets. The preprocessing pipeline in this repository (`src/spark/preprocessing.py`) operates on the full dataset using PySpark, but the Keras model was trained on a Colab subset for prototyping purposes.
## Training Procedure

```python
model.fit(
    X_train, y_train,
    validation_split=0.2,  # ~320 validation samples
    epochs=5,              # training ran all 5 epochs
    batch_size=64,
    callbacks=[early_stopping, reduce_lr],
)
```
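The `early_stopping` and `reduce_lr` objects passed to `model.fit` are not defined in this card; a sketch reconstructed from the callback configuration listed under "Compiled with" (monitoring `val_loss` is an assumption):

```python
import tensorflow as tf

# Stop after 3 epochs without val_loss improvement and roll back
# to the best weights seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

# Halve the learning rate after 2 stagnant epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=2
)
```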
### Training log (Google Colab run)
| Epoch | Train Accuracy | Val Accuracy | Val Loss | Learning Rate |
|---|---|---|---|---|
| 1 | 68.5% | 100.0% | 0.173 | 0.001 |
| 2 | 99.7% | 100.0% | 0.093 | 0.001 |
| 3 | 99.9% | 100.0% | 0.071 | 0.001 |
| 4 | 100.0% | 100.0% | 0.043 | 0.001 |
| 5 | 100.0% | 100.0% | 0.023 | 0.001 |
Best checkpoint saved at epoch 5 (lowest val_loss = 0.023).
## Evaluation

### Important caveats

These results should not be interpreted as production accuracy:

- **Small training set:** ~1,280 training samples and ~320 validation samples. The 100% validation accuracy is consistent with a model that has memorised a small, potentially non-representative validation split.
- **AUC metric not reliable:** the AUC metric reported `0.0` throughout all training epochs. This is a known incompatibility between `tf.keras.metrics.AUC()` and certain TensorFlow/Keras version combinations when used with a sigmoid output and binary cross-entropy loss without explicit threshold configuration. AUC values are therefore excluded from this card.
- **No held-out test set evaluation:** evaluation was not performed on a separate, never-seen test set after training.
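Given the unreliable in-training AUC, one workaround (not part of the original pipeline) is to compute ROC AUC after training directly from the sigmoid outputs with scikit-learn. The labels and probabilities below are dummy stand-ins for `y_test` and `model.predict(X_test).flatten()`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Dummy values for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.9, 0.3, 0.4, 0.6])

auc = roc_auc_score(y_true, y_prob)
print(f"ROC AUC: {auc:.3f}")  # prints "ROC AUC: 0.833" for these dummy values
```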
For reliable performance estimates on binary Twitter sentiment, refer to the Spark MLlib classifiers evaluated on the full dataset (see README.md):
| Model | Accuracy | F1 (weighted) | Dataset size |
|---|---|---|---|
| Random Forest | 70.3% | 70.3% | ~233K rows |
| Gradient Boosted Trees | 69.9% | 69.8% | ~233K rows |
| Logistic Regression | 68.5% | 68.5% | ~233K rows |
| Naive Bayes | 67.2% | 67.3% | ~233K rows |
## Files in this Repository

| File | Description | Size |
|---|---|---|
| `best_LSTM_pipeline_model.h5` | Best checkpoint by validation loss | ~17 MB |
| `pipeline_lstm_model.h5` | Final epoch weights | ~17 MB |
Both files are in HDF5 format (legacy Keras format). The native `.keras` format is recommended for new training runs, but these weights are fully loadable with `tf.keras.models.load_model`.
## How to Use

### Load and run inference

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load model
model = tf.keras.models.load_model("best_LSTM_pipeline_model.h5")

# Reproduce the tokeniser (must match training): it must be fitted on the
# same training texts used during training, because the tokenizer state is
# not saved alongside the .h5 file.
tokenizer = Tokenizer(num_words=10000)

# Preprocess new text
texts = ["I love this product!", "Terrible experience, would not recommend."]
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=100)

# Predict
predictions = model.predict(padded)
labels = ["positive" if p > 0.5 else "negative" for p in predictions.flatten()]
print(labels)
```
**Important:** The `Tokenizer` vocabulary is not saved in the `.h5` file. To reproduce predictions, you must refit the tokenizer on the same training texts. This is a known limitation of this prototype; a future improvement would be to save the tokenizer alongside the model weights (e.g., as `tokenizer.json`).
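One way to close that gap in a future revision is to serialise the fitted tokenizer with `to_json()` and restore it with `tokenizer_from_json`. A sketch; the fitting texts and the `tokenizer.json` file name are illustrative, not files shipped in this repository:

```python
from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json

# Fit on (stand-in) training texts, exactly as during training
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(["I love this product!", "Terrible experience, would not recommend."])

# Save the full tokenizer state next to the model weights
with open("tokenizer.json", "w") as f:
    f.write(tokenizer.to_json())

# At inference time, restore it instead of refitting
with open("tokenizer.json") as f:
    tokenizer = tokenizer_from_json(f.read())
```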
### Download via `huggingface_hub`

```python
import tensorflow as tf
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="alivafaei/sentiment140-lstm",
    filename="best_LSTM_pipeline_model.h5",
)
model = tf.keras.models.load_model(path)
```
## Limitations

- Trained on ~1,600 samples, far below what is needed for a reliable general-purpose sentiment classifier
- No tokenizer state saved; inference requires refitting the tokenizer on the original training data
- HDF5 format is legacy; convert to the `.keras` format for use with Keras 3+
- No evaluation on a held-out test set
- Not suitable for production use without retraining on the full dataset with proper evaluation
## Intended Use
This model was created as a learning artifact for an NLP/ML research prototype. It is appropriate for:
- Understanding LSTM architecture for text classification
- Demonstrating end-to-end ML pipeline design (PySpark preprocessing → Keras training)
- Academic / portfolio reference
It is not appropriate for production sentiment analysis, business decision-making, or any application requiring reliable accuracy.
## Citation
If you use this model or the MICAP pipeline in your work:
```bibtex
@misc{vafaei2024micap,
  author = {Vafaei, Ali},
  title  = {MICAP: Market Intelligence \& Competitor Analysis Platform},
  year   = {2024},
  url    = {https://github.com/itsalivafaei/micap}
}
```
Dataset citation:
```bibtex
@inproceedings{go2009twitter,
  title     = {Twitter Sentiment Classification using Distant Supervision},
  author    = {Go, Alec and Bhayani, Richa and Huang, Lei},
  booktitle = {CS224N Project Report, Stanford},
  year      = {2009}
}
```