# gemma3-4b-kinetic3K_FT
Gemma 3 4B-IT fine-tuned on 3,115 Kinetics video action-recognition samples (Stage 1 — vision alignment).
## Model Description
This is a Stage 1 fine-tune of google/gemma-3-4b-it. In Stage 1, the LLM backbone is fully frozen; only the vision encoder and image/video projector are trained. The goal is to align the visual representations with action-recognition vocabulary before full instruction tuning.
| Item | Value |
|---|---|
| Base model | google/gemma-3-4b-it |
| Architecture | Gemma3ForConditionalGeneration |
| Training stage | Stage 1 (vision alignment) |
| Trainable components | Vision tower + image projector (embed_vision) |
| Frozen components | LLM backbone (language_model) |
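The freezing scheme in the table above can be sketched as follows. This is a minimal illustration using a toy module with Gemma 3-style submodule names (an assumption for demonstration); in practice the same loop would run over the loaded `Gemma3ForConditionalGeneration`:

```python
import torch.nn as nn

# Toy stand-in mirroring Gemma 3's top-level module names (illustrative only).
class ToyGemma3(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(4, 4)
        self.multi_modal_projector = nn.Linear(4, 4)
        self.language_model = nn.Linear(4, 4)

def apply_stage1_freeze(model: nn.Module) -> nn.Module:
    # Stage 1: freeze the LLM backbone; keep vision tower + projector trainable.
    for name, param in model.named_parameters():
        param.requires_grad = not name.startswith("language_model")
    return model

model = apply_stage1_freeze(ToyGemma3())
trainable = sorted({n.split(".")[0] for n, p in model.named_parameters()
                    if p.requires_grad})
print(trainable)  # only the vision tower and projector remain trainable
```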
## Training Details

### Dataset
| Property | Value |
|---|---|
| Dataset | Kinetics-400/600/700 (curated subset) |
| Samples | 3,115 video–text pairs |
| Task | Video action recognition |
| Format | Video + "What is the main action in this video?" → short action phrase |
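A training pair in this format might look like the following (a hypothetical sample; the field names are illustrative, not the dataset's actual schema):

```python
# Hypothetical sample record; field names are illustrative assumptions.
sample = {
    "video": "clips/riding_a_bike_00017.mp4",
    "prompt": "What is the main action in this video?",
    "response": "riding a bike",  # short action phrase target
}
```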
### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Total steps | 390 |
| Per-device batch size | 1 |
| Gradient accumulation | 8 (effective batch size = 8) |
| LLM learning rate | 1e-5 |
| Projector learning rate | 2e-5 |
| Vision encoder learning rate | 0.0 (frozen) |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Optimizer | paged_adamw_8bit |
| Precision | bfloat16 |
| Max sequence length | 4096 |
| Gradient checkpointing | Yes |
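The step count follows directly from the table: 3,115 samples at an effective batch size of 8 give 390 optimizer steps per epoch, and a 0.03 warmup ratio corresponds to roughly 11 warmup steps:

```python
import math

samples = 3115
effective_batch = 1 * 8  # per-device batch size × gradient accumulation
steps_per_epoch = math.ceil(samples / effective_batch)
warmup_steps = int(0.03 * steps_per_epoch)
print(steps_per_epoch, warmup_steps)  # matches the 390 total steps above
```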
### Infrastructure
| Property | Value |
|---|---|
| Parallelism | DeepSpeed ZeRO Stage 2 |
| Hardware | 1 × GPU |
| Training time | ~2.4 hours (8,589 s) |
| Framework | Transformers 5.5.0 + DeepSpeed |
### Training Curve
| Step | Loss |
|---|---|
| 10 | 6.33 |
| 50 | 5.80 |
| 100 | 4.52 |
| 200 | 3.30 |
| 300 | 3.04 |
| 390 (final) | 3.04 |
Final training loss: 3.04, down from 6.33 at step 10.
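The learning-rate schedule behind these steps can be sketched as a generic warmup-then-cosine decay (shown here with the projector learning rate; this is a sketch of the standard formula, not the trainer's exact implementation):

```python
import math

def cosine_lr(step, base_lr=2e-5, total_steps=390, warmup_ratio=0.03):
    """Linear warmup followed by cosine decay to zero (standard HF-style shape)."""
    warmup = int(total_steps * warmup_ratio)  # ~11 steps here
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```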
## Usage
```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model = Gemma3ForConditionalGeneration.from_pretrained(
    "bear7011/gemma3-4b-kinetic3K_FT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("bear7011/gemma3-4b-kinetic3K_FT")

# Illustrative inference sketch: sample frames from the video yourself
# (e.g. with decord or OpenCV); `frames` is a list of PIL images.
messages = [{"role": "user", "content": [
    *[{"type": "image", "image": f} for f in frames],
    {"type": "text", "text": "What is the main action in this video?"}]}]
inputs = processor.apply_chat_template(messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
print(processor.decode(model.generate(**inputs, max_new_tokens=20)[0],
                       skip_special_tokens=True))
```
## Limitations
- Stage 1 only — LLM reasoning capability is unchanged; this checkpoint is intended as a base for Stage 2 instruction tuning.
- Trained on a small 3K subset of Kinetics; generalisation to out-of-distribution actions may be limited.
## Citation
If you use this model, please cite the original Kinetics dataset and Gemma 3 model card.