# ImagineClassification (fine-tuned ViT)
Fine-tuned Vision Transformer (ViT-B/16, patch 16, 224×224) for coarse fashion product classification into four masterCategory labels from the Fashion Product Images (small) dataset.
## Model summary
| Item | Detail |
|---|---|
| Base checkpoint | google/vit-base-patch16-224-in21k |
| Task | Multi-class image classification (4 classes) |
| Labels | Apparel, Accessories, Footwear, Personal Care |
| Input | RGB images, 224×224 (use the bundled ViTImageProcessor / AutoImageProcessor) |
| Framework | PyTorch + Transformers |
This repository was produced by comparing three candidate image classifiers (same data and training recipe), then packaging the best checkpoint by test accuracy.
## Training procedure (from experiment notebook)
Training and evaluation follow the pipeline described in Pipeline_1_fine_tuning_models.ipynb:
- Data: `ashraq/fashion-product-images-small` (train split), restricted to rows whose `masterCategory` is one of the four classes above.
- Balanced sampling: for each class, 2,100 images sampled with `random_state=5` (SEED = 5):
  - 100 images per class held out as a stratified out-of-sample test set (400 images total).
  - Remaining 2,000 per class form the train/val pool; stratified train/validation split (same seed).
- Optimization: AdamW, learning rate 2e-5, batch size 8, 1 fine-tuning epoch, cross-entropy loss.
- Candidates fine-tuned (same recipe): `google/vit-base-patch16-224-in21k`, `facebook/deit-tiny-patch16-224`, `google/mobilenet_v2_1.0_224`.
- Selection rule: highest test accuracy on the held-out 400-sample test set; ties broken by the loop order in the notebook.
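The balanced sampling and stratified split above can be sketched as follows. This is a minimal illustration with synthetic `(image_id, masterCategory)` rows standing in for the real dataset, and the 20% validation fraction is an assumption (the notebook fixes its own); the per-class counts and seed mirror the description above:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

SEED = 5
CLASSES = ["Apparel", "Accessories", "Footwear", "Personal Care"]
PER_CLASS = 2100          # images sampled per class
TEST_PER_CLASS = 100      # per-class hold-out -> 400-image test set

# Stand-in rows: (image_id, label) pairs instead of real images.
ids = [f"{c}-{i}" for c in CLASSES for i in range(PER_CLASS)]
labels = [c for c in CLASSES for _ in range(PER_CLASS)]

# Stratified hold-out: exactly 100 images per class land in the test set.
pool_ids, test_ids, pool_y, test_y = train_test_split(
    ids, labels,
    test_size=TEST_PER_CLASS * len(CLASSES),
    stratify=labels,
    random_state=SEED,
)

# Remaining 2,000 per class: stratified train/validation split (same seed).
# The 0.2 validation fraction here is an assumption for illustration.
train_ids, val_ids, train_y, val_y = train_test_split(
    pool_ids, pool_y,
    test_size=0.2,
    stratify=pool_y,
    random_state=SEED,
)

print(Counter(test_y))  # 100 images per class
```

Stratifying both splits keeps every subset class-balanced, which is what makes the 400-image test set a fair four-way benchmark.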
## Reported results (one Colab run, post fine-tuning)
Metrics below are from the 400-image balanced test set; runtime is the mean seconds per image during evaluation in that run (device-dependent).
| Model | Test accuracy | Runtime (s / image) |
|---|---|---|
| `google/vit-base-patch16-224-in21k` | 0.9975 | 0.000953 |
| `facebook/deit-tiny-patch16-224` | 0.9950 | 0.000886 |
| `google/mobilenet_v2_1.0_224` | 0.9375 | 0.001365 |
Selected model: `google/vit-base-patch16-224-in21k` (test accuracy 0.9975, per the table above).

Note: Near-perfect accuracy on 400 samples does not guarantee generalization to all real-world product photos. Performance depends on image quality, viewpoint, and domain shift relative to the dataset.
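The accuracy and per-image runtime figures above come from a simple evaluation loop of roughly this shape (a sketch, not the notebook's code; `predict` stands in for a model forward pass, and the dummy classifier below is purely illustrative):

```python
import time

def evaluate(predict, images, labels):
    """Return (accuracy, mean seconds per image) for a predict function."""
    correct = 0
    start = time.perf_counter()
    for image, label in zip(images, labels):
        if predict(image) == label:
            correct += 1
    elapsed = time.perf_counter() - start
    n = len(images)
    return correct / n, elapsed / n

# Dummy stand-in for a fine-tuned model: always predicts "Apparel".
images = list(range(400))
labels = ["Apparel"] * 100 + ["Footwear"] * 300
accuracy, sec_per_image = evaluate(lambda img: "Apparel", images, labels)
print(round(accuracy, 4))  # 0.25
```

Because runtime is averaged over the whole loop, it includes preprocessing and Python overhead, which is why the numbers are device- and run-dependent.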
## Intended use
- Primary: Fast coarse category tagging for fashion e-commerce assets (four-way classification).
- Out of scope: Fine-grained SKU/subcategory prediction, non-fashion images, or classes outside the four labels above.
## Limitations and bias
- Trained only on four frequent `masterCategory` values for a class-balanced setup; other categories from the original catalog are not represented.
- The source dataset may reflect commercial catalog biases (presentation, demographics, geography).
- Do not use for high-stakes decisions (e.g., safety, compliance, or financial outcomes) without further validation.
## How to use
```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "Leoinhouse/ImagineClassification-finetuned-model"  # or local path
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)
model.eval()

# Load any RGB product photo.
image = Image.open(
    requests.get("https://example.com/product.jpg", stream=True).raw
).convert("RGB")

# Preprocess to the 224x224 input the model expects, then classify.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_id = outputs.logits.argmax(-1).item()
label = model.config.id2label[predicted_id]
print(label)
```
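To turn the logits into per-label confidence scores, apply a softmax over the four classes. When running the model above, `torch.softmax(outputs.logits, dim=-1)` is the one-liner; the dependency-free sketch below shows the same computation on illustrative logit values (not real model output):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for the four labels, in id2label order.
labels = ["Apparel", "Accessories", "Footwear", "Personal Care"]
probs = softmax([4.2, 0.1, -1.3, 0.4])
for label, p in sorted(zip(labels, probs), key=lambda x: -x[1]):
    print(f"{label}: {p:.3f}")
```

Reporting the full probability vector (rather than just the argmax) is useful when you want to threshold low-confidence predictions before tagging a product.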
## Citation
If you use this model, cite the Transformers library and the dataset you rely on, for example:
```bibtex
@inproceedings{wolf-etal-2020-transformers,
  title     = {Transformers: State-of-the-Art Natural Language Processing},
  author    = {Wolf, Thomas and others},
  booktitle = {EMNLP 2020: System Demonstrations},
  year      = {2020}
}
```
## Model card contact
Maintained for coursework / project use under Hugging Face user Leoinhouse. For the upstream ViT architecture and weights, see the base model card: google/vit-base-patch16-224-in21k.