# gemma3-4b-kinetic3K_FT
Gemma 3 4B-IT fine-tuned on 3,115 Kinetics video action-recognition samples (Stage 1 — vision alignment).
## Model Description
This is a Stage 1 fine-tune of google/gemma-3-4b-it. In Stage 1, the LLM backbone is fully frozen; only the vision encoder and image/video projector are trained. The goal is to align the visual representations with action-recognition vocabulary before full instruction tuning.
| Item | Value |
|---|---|
| Base model | google/gemma-3-4b-it |
| Architecture | Gemma3ForConditionalGeneration |
| Training stage | Stage 1 (vision alignment) |
| Trainable components | Vision tower + image projector (embed_vision) |
| Frozen components | LLM backbone (language_model) |
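The freezing scheme in the table above can be sketched as follows. This is a minimal illustration using a toy module with Gemma 3-style submodule names (an assumption for demonstration); in practice the same loop would run over the loaded `Gemma3ForConditionalGeneration`:

```python
import torch.nn as nn

# Toy stand-in mirroring Gemma 3's top-level module names (illustrative only).
class ToyGemma3(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(4, 4)
        self.multi_modal_projector = nn.Linear(4, 4)
        self.language_model = nn.Linear(4, 4)

def apply_stage1_freeze(model: nn.Module) -> nn.Module:
    # Stage 1: freeze the LLM backbone; keep vision tower + projector trainable.
    for name, param in model.named_parameters():
        param.requires_grad = not name.startswith("language_model")
    return model

model = apply_stage1_freeze(ToyGemma3())
trainable = sorted({n.split(".")[0] for n, p in model.named_parameters()
                    if p.requires_grad})
print(trainable)  # only the vision tower and projector remain trainable
```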
## Training Details

### Dataset
| Property | Value |
|---|---|
| Dataset | Kinetics-400/600/700 (curated subset) |
| Samples | 3,115 video–text pairs |
| Task | Video action recognition |
| Format | Video + "What is the main action in this video?" → short action phrase |
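A training pair in this format might look like the following (a hypothetical sample; the field names are illustrative, not the dataset's actual schema):

```python
# Hypothetical sample record; field names are illustrative assumptions.
sample = {
    "video": "clips/riding_a_bike_00017.mp4",
    "prompt": "What is the main action in this video?",
    "response": "riding a bike",  # short action phrase target
}
```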
### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Total steps | 390 |
| Per-device batch size | 1 |
| Gradient accumulation | 8 (effective batch size = 8) |
| LLM learning rate | 1e-5 |
| Projector learning rate | 2e-5 |
| Vision encoder learning rate | 0.0 (frozen) |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Optimizer | paged_adamw_8bit |
| Precision | bfloat16 |
| Max sequence length | 4096 |
| Gradient checkpointing | Yes |
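The step count follows directly from the table: 3,115 samples at an effective batch size of 8 give 390 optimizer steps per epoch, and a 0.03 warmup ratio corresponds to roughly 11 warmup steps:

```python
import math

samples = 3115
effective_batch = 1 * 8  # per-device batch size × gradient accumulation
steps_per_epoch = math.ceil(samples / effective_batch)
warmup_steps = int(0.03 * steps_per_epoch)
print(steps_per_epoch, warmup_steps)  # matches the 390 total steps above
```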
### Infrastructure
| Property | Value |
|---|---|
| Parallelism | DeepSpeed ZeRO Stage 2 |
| Hardware | 1 × GPU |
| Training time | ~2.4 hours (8,589 s) |
| Framework | Transformers 5.5.0 + DeepSpeed |
### Training Curve
| Step | Loss |
|---|---|
| 10 | 6.33 |
| 50 | 5.80 |
| 100 | 4.52 |
| 200 | 3.30 |
| 300 | 3.04 |
| 390 (final) | 3.04 |
Final training loss: 3.04, down from 6.33 at step 10.
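The learning-rate schedule behind these steps can be sketched as a generic warmup-then-cosine decay (shown here with the projector learning rate; this is a sketch of the standard formula, not the trainer's exact implementation):

```python
import math

def cosine_lr(step, base_lr=2e-5, total_steps=390, warmup_ratio=0.03):
    """Linear warmup followed by cosine decay to zero (standard HF-style shape)."""
    warmup = int(total_steps * warmup_ratio)  # ~11 steps here
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```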
## Usage
```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model = Gemma3ForConditionalGeneration.from_pretrained(
    "bear7011/gemma3-4b-kinetic3K_FT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("bear7011/gemma3-4b-kinetic3K_FT")

# Illustrative inference sketch: sample frames from the video yourself
# (e.g. with decord or OpenCV); `frames` is a list of PIL images.
messages = [{"role": "user", "content": [
    *[{"type": "image", "image": f} for f in frames],
    {"type": "text", "text": "What is the main action in this video?"}]}]
inputs = processor.apply_chat_template(messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
print(processor.decode(model.generate(**inputs, max_new_tokens=20)[0],
                       skip_special_tokens=True))
```
## Limitations
- Stage 1 only — LLM reasoning capability is unchanged; this checkpoint is intended as a base for Stage 2 instruction tuning.
- Trained on a small 3K subset of Kinetics; generalisation to out-of-distribution actions may be limited.
## Citation
If you use this model, please cite the original Kinetics dataset and Gemma 3 model card.