# gemma3-4b-kinetic3K_FT

Gemma 3 4B-IT fine-tuned on 3,115 Kinetics video action-recognition samples (Stage 1 — vision alignment).

## Model Description

This is a Stage 1 fine-tune of google/gemma-3-4b-it. In Stage 1, the LLM backbone is fully frozen; only the vision encoder and image/video projector are trained. The goal is to align the visual representations with the action-recognition vocabulary before full instruction tuning.

| Item | Value |
|---|---|
| Base model | google/gemma-3-4b-it |
| Architecture | `Gemma3ForConditionalGeneration` |
| Training stage | Stage 1 (vision alignment) |
| Trainable components | Vision tower + image projector (`embed_vision`) |
| Frozen components | LLM backbone (`language_model`) |

## Training Details

### Dataset

| Property | Value |
|---|---|
| Dataset | Kinetics-400/600/700 (curated subset) |
| Samples | 3,115 video–text pairs |
| Task | Video action recognition |
| Format | Video + "What is the main action in this video?" → short action phrase |
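The dataset's internal schema is not published; a single record in the format described above might look like the following sketch (all field names and the file path are illustrative, not the actual dataset layout):

```python
# Hypothetical example of one video–text training pair. The "video" path
# and the field names are stand-ins for whatever the real loader expects.
sample = {
    "video": "clips/example_clip.mp4",  # hypothetical path to one Kinetics clip
    "conversations": [
        {"role": "user", "content": "What is the main action in this video?"},
        {"role": "assistant", "content": "riding a bike"},  # short action phrase
    ],
}
```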

### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Total steps | 390 |
| Per-device batch size | 1 |
| Gradient accumulation | 8 (effective batch size = 8) |
| LLM learning rate | 1e-5 |
| Projector learning rate | 2e-5 |
| Vision encoder learning rate | 0.0 (frozen) |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Optimizer | paged_adamw_8bit |
| Precision | bfloat16 |
| Max sequence length | 4096 |
| Gradient checkpointing | Yes |
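The per-component learning rates above are typically implemented as optimizer parameter groups. A minimal sketch of the mechanism, using toy `nn.Linear` modules as stand-ins (the real attribute names in `Gemma3ForConditionalGeneration` differ, and the actual training used bitsandbytes' paged 8-bit AdamW rather than plain AdamW):

```python
import torch
from torch import nn

# Toy stand-ins for the three components of the VLM.
vision_tower = nn.Linear(16, 16)
projector = nn.Linear(16, 16)
language_model = nn.Linear(16, 16)

# Stage 1: freeze the LLM backbone so only vision-side weights update.
for p in language_model.parameters():
    p.requires_grad = False

# One optimizer param group per trainable component, each with its own LR
# (values taken from the hyperparameter table above).
optimizer = torch.optim.AdamW([
    {"params": vision_tower.parameters(), "lr": 0.0},  # 0.0 per the table: effectively frozen
    {"params": projector.parameters(), "lr": 2e-5},    # projector LR
])
```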

### Infrastructure

| Property | Value |
|---|---|
| Parallelism | DeepSpeed ZeRO Stage 2 |
| Hardware | 1 × GPU |
| Training time | ~2.4 hours (8,589 s) |
| Framework | Transformers 5.5.0 + DeepSpeed |

### Training Curve

| Step | Loss |
|---|---|
| 10 | 6.33 |
| 50 | 5.80 |
| 100 | 4.52 |
| 200 | 3.30 |
| 300 | 3.04 |
| 390 (final) | 3.04 |

Final train loss: 3.04 (initial: 6.33)

## Usage

```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

# Load the fine-tuned checkpoint in bfloat16 (the training precision).
model = Gemma3ForConditionalGeneration.from_pretrained(
    "bear7011/gemma3-4b-kinetic3K_FT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("bear7011/gemma3-4b-kinetic3K_FT")
```
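Gemma 3 consumes a video as a list of sampled frames. A minimal uniform frame-index sampler (illustrative only; the sampling strategy used in training is not documented on this card):

```python
def sample_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Pick `num_samples` frame indices spread uniformly across a clip."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

# For a 300-frame clip, pick 8 indices roughly every 37-38 frames.
indices = sample_frame_indices(300, 8)
```

The frames at these indices can then be decoded (e.g. with decord or PyAV) and passed to the processor as images alongside the action-recognition prompt.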

## Limitations

- Stage 1 only — LLM reasoning capability is unchanged; this checkpoint is intended as a base for Stage 2 instruction tuning.
- Trained on a small 3K subset of Kinetics; generalisation to out-of-distribution actions may be limited.

## Citation

If you use this model, please cite the original Kinetics dataset and Gemma 3 model card.
