Architecture

Input (integer token sequences, max_length=100)
  β†’ Embedding(vocab_size=10000, embedding_dim=128, input_length=100)
  β†’ SpatialDropout1D(rate=0.2)
  β†’ LSTM(units=64, dropout=0.2, recurrent_dropout=0.2)
  β†’ Dense(units=32, activation='relu')
  β†’ Dropout(rate=0.5)
  β†’ Dense(units=1, activation='sigmoid')

Total parameters: ~1.3M (dominated by the embedding matrix: 10000 Γ— 128 = 1.28M)
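
A minimal Keras sketch matching this architecture (variable names are illustrative, not taken from the original training script):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128, input_length=100),
    layers.SpatialDropout1D(0.2),
    layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])
model.summary()  # ~1.3M parameters, dominated by the 10000 x 128 embedding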

Compiled with:

  • Optimizer: Adam (learning_rate=0.001)
  • Loss: Binary cross-entropy
  • Callbacks: EarlyStopping (patience=3, restore_best_weights=True), ReduceLROnPlateau (factor=0.5, patience=2)
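
Continuing the sketch above, an equivalent compile and callback setup (the accuracy metric is an assumption based on the training log below; the callback variable names match the fit() call shown later):

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy'],  # assumption: the training log reports accuracy
)

early_stopping = EarlyStopping(patience=3, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(factor=0.5, patience=2)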

Training Data

  • Dataset: Sentiment140 β€” Go, A., Bhayani, R., & Huang, L. (2009)
  • Training subset used: ~1,600 samples drawn via df.sample(0.30) from a Spark DataFrame on Google Colab
  • Label encoding: 0 = negative (original polarity 0), 1 = positive (original polarity 4)
  • Text preprocessing: Raw tweet text, tokenised with tf.keras.preprocessing.text.Tokenizer, padded to max_length=100

Note: Training was performed on a small subset (~1,600 samples). The full Sentiment140 dataset contains 1.6M tweets. The preprocessing pipeline in this repository (src/spark/preprocessing.py) operates on the full dataset using PySpark, but the Keras model was trained on a Colab subset for prototyping purposes.
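
A hedged sketch of how such a subset could be drawn and prepared (the file path, column layout, and seed are assumptions; the real pipeline lives in src/spark/preprocessing.py):

from pyspark.sql import SparkSession
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

spark = SparkSession.builder.getOrCreate()

# Assumed Sentiment140 CSV layout: polarity, id, date, query, user, text
df = spark.read.csv("sentiment140.csv", inferSchema=True).toDF(
    "polarity", "id", "date", "query", "user", "text"
)

# Draw a random fraction of rows, then collect to pandas for Keras training
subset = df.sample(fraction=0.30, seed=42).toPandas()
subset["label"] = (subset["polarity"] == 4).astype(int)  # 0 = negative, 1 = positive

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(subset["text"])
X_train = pad_sequences(tokenizer.texts_to_sequences(subset["text"]), maxlen=100)
y_train = subset["label"].values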


Training Procedure

model.fit(
    X_train, y_train,
    validation_split=0.2,   # ~320 validation samples
    epochs=5,               # training ran all 5 epochs
    batch_size=64,
    callbacks=[early_stopping, reduce_lr]
)

Training log (Google Colab run)

Epoch   Train Accuracy   Val Accuracy   Val Loss   Learning Rate
1       68.5%            100.0%         0.173      0.001
2       99.7%            100.0%         0.093      0.001
3       99.9%            100.0%         0.071      0.001
4       100.0%           100.0%         0.043      0.001
5       100.0%           100.0%         0.023      0.001

Best checkpoint saved at epoch 5 (lowest val_loss = 0.023).
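
The card does not show how this checkpoint was written; one conventional way, sketched here, is a ModelCheckpoint callback added to the callbacks list (this callback is an assumption, not confirmed by the original training script):

from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    "best_LSTM_pipeline_model.h5",
    monitor="val_loss",
    save_best_only=True,
)
# e.g. model.fit(..., callbacks=[early_stopping, reduce_lr, checkpoint])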


Evaluation

Important caveats

These results should not be interpreted as production accuracy:

  1. Small training set: ~1,280 training samples and ~320 validation samples. The 100% validation accuracy is consistent with a model that has memorised a small, potentially non-representative validation split.

  2. AUC metric not reliable: The AUC metric reported 0.0 throughout all training epochs. This is a known incompatibility between tf.keras.metrics.AUC() and certain TensorFlow/Keras version combinations when used with a sigmoid output and binary cross-entropy loss without explicit threshold configuration. AUC values are therefore excluded from this card; see the sanity-check sketch after this list.

  3. No held-out test set evaluation: Evaluation was not performed on a separate, never-seen test set after training.
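
As flagged in caveat 2, here is a sketch of an out-of-band AUC check with scikit-learn (assuming X_val and y_val hold the validation split):

from sklearn.metrics import roc_auc_score

val_probs = model.predict(X_val).flatten()  # sigmoid outputs in [0, 1]
print("Validation AUC:", roc_auc_score(y_val, val_probs))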

For reliable performance estimates on binary Twitter sentiment, refer to the Spark MLlib classifiers evaluated on the full dataset (see README.md):

Model                    Accuracy   F1 (weighted)   Dataset size
Random Forest            70.3%      70.3%           ~233K rows
Gradient Boosted Trees   69.9%      69.8%           ~233K rows
Logistic Regression      68.5%      68.5%           ~233K rows
Naive Bayes              67.2%      67.3%           ~233K rows

Files in this Repository

File                          Description                          Size
best_LSTM_pipeline_model.h5   Best checkpoint by validation loss   ~17 MB
pipeline_lstm_model.h5        Final epoch weights                  ~17 MB

Both files are in HDF5 format (the legacy Keras format). The native Keras .keras format is recommended for new training runs, but these weights remain fully loadable with tf.keras.models.load_model.
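
A one-off conversion to the native format could look like this (the output filename is illustrative):

import tensorflow as tf

model = tf.keras.models.load_model("best_LSTM_pipeline_model.h5")
model.save("best_LSTM_pipeline_model.keras")  # native Keras format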


How to Use

Load and run inference

import tensorflow as tf

# Load model
model = tf.keras.models.load_model("best_LSTM_pipeline_model.h5")

# Reproduce the tokeniser (must match training)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=10000)
# The tokenizer state is not saved alongside the .h5 file, so it must be
# refitted on the exact texts used during training before inference:
# tokenizer.fit_on_texts(train_texts)  # train_texts: the original training corpus

# Preprocess new text
texts = ["I love this product!", "Terrible experience, would not recommend."]
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=100)

# Predict
predictions = model.predict(padded)
labels = ["positive" if p > 0.5 else "negative" for p in predictions.flatten()]
print(labels)

Important: The Tokenizer vocabulary is not saved in the .h5 file. To reproduce predictions, you must refit the tokenizer on the same training texts. This is a known limitation of this prototype β€” a future improvement would be to save the tokenizer alongside the model weights (e.g., as tokenizer.json).
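
tf.keras's Tokenizer already supports JSON round-tripping, so that improvement could be sketched as:

from tensorflow.keras.preprocessing.text import tokenizer_from_json

# After fitting, save the tokenizer state next to the model weights
with open("tokenizer.json", "w") as f:
    f.write(tokenizer.to_json())

# In a fresh session, restore it without access to the training texts
with open("tokenizer.json") as f:
    tokenizer = tokenizer_from_json(f.read())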

Download via huggingface_hub

import tensorflow as tf
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="alivafaei/sentiment140-lstm",
    filename="best_LSTM_pipeline_model.h5"
)
model = tf.keras.models.load_model(path)

Limitations

  • Trained on ~1,600 samples β€” far below what is needed for a reliable general-purpose sentiment classifier
  • No tokenizer state saved β€” inference requires refitting the tokenizer on original training data
  • HDF5 format is legacy; convert to .keras format for use with Keras 3+
  • No evaluation on a held-out test set
  • Not suitable for production use without retraining on the full dataset with proper evaluation

Intended Use

This model was created as a learning artifact for an NLP/ML research prototype. It is appropriate for:

  • Understanding LSTM architecture for text classification
  • Demonstrating end-to-end ML pipeline design (PySpark preprocessing β†’ Keras training)
  • Academic / portfolio reference

It is not appropriate for production sentiment analysis, business decision-making, or any application requiring reliable accuracy.


Citation

If you use this model or the MICAP pipeline in your work:

@misc{vafaei2024micap,
  author = {Vafaei, Ali},
  title  = {MICAP: Market Intelligence \& Competitor Analysis Platform},
  year   = {2024},
  url    = {https://github.com/itsalivafaei/micap}
}

Dataset citation:

@inproceedings{go2009twitter,
  title     = {Twitter Sentiment Classification using Distant Supervision},
  author    = {Go, Alec and Bhayani, Richa and Huang, Lei},
  booktitle = {CS224N Project Report, Stanford},
  year      = {2009}
}