minimind-63M-full-sft-Junhan
This repository contains a 63.9M-parameter dense MiniMind chat model, converted to a Transformers-compatible checkpoint so it can be loaded directly with the Hugging Face transformers library.
Model Summary
- Architecture: dense decoder-only causal LM
- Exported architecture name: Qwen3ForCausalLM
- Original training codebase: MiniMind
- Parameters: 63.9M
- Hidden size: 768
- Layers: 8
- Attention heads: 8
- KV heads: 4
- Vocab size: 6400
- Max position embeddings: 32768
- RoPE theta: 1e6
- MoE: no
- Checkpoint type: full-parameter SFT
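As a rough sanity check on the shape of the model, the attention projection sizes implied by the numbers above can be worked out directly. This is a sketch covering only the grouped-query attention projections and the token embedding (no biases assumed); it does not account for the MLP blocks, norms, or the LM head, so it is not a full parameter count:

```python
# Config values taken from the summary above.
hidden_size = 768
num_heads = 8
num_kv_heads = 4
vocab_size = 6400

head_dim = hidden_size // num_heads  # 96

# Per-layer attention projections under grouped-query attention,
# assuming no biases (typical for this architecture family).
q_proj = hidden_size * num_heads * head_dim      # 768 * 768
k_proj = hidden_size * num_kv_heads * head_dim   # 768 * 384
v_proj = hidden_size * num_kv_heads * head_dim   # 768 * 384
o_proj = num_heads * head_dim * hidden_size      # 768 * 768

attn_per_layer = q_proj + k_proj + v_proj + o_proj
embedding = vocab_size * hidden_size

print(f"attention params per layer: {attn_per_layer:,}")  # 1,769,472
print(f"token embedding params:     {embedding:,}")       # 4,915,200
```

With 4 KV heads against 8 query heads, the key/value projections are half the size of the query projection, which is the memory saving grouped-query attention is designed for.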
This model was trained from a MiniMind pretraining checkpoint and then fully fine-tuned on the MiniMind SFT pipeline. The exported folder was produced from the local full_sft_768.pth checkpoint using scripts/convert_model.py.
Training Notes
- Base training pipeline: MiniMind
- SFT training script: trainer/train_full_sft.py
- SFT data used locally: sft_t2t_mini.jsonl
- Typical SFT sequence length in this setup: max_seq_len=768
The upstream MiniMind SFT data mixes general instruction-following samples with some tool-calling and reasoning-style samples. As a result, this checkpoint is mainly a lightweight chat model, not a specialized tool-use or reasoning model.
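The exact schema of sft_t2t_mini.jsonl is not documented here; chat-SFT data is commonly stored as one JSON object per line. The record below is purely illustrative (the "conversations" field name and structure are assumptions, not confirmed from the MiniMind pipeline):

```python
import json

# Illustrative chat-SFT record; field names are hypothetical, not taken
# from the actual sft_t2t_mini.jsonl schema.
record = {
    "conversations": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# One JSON object per line is the usual JSONL convention.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
print(parsed["conversations"][1]["content"])
```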
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "YOUR_USERNAME/minimind-63M-full-sft-Junhan"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "你好,介绍一下你自己。"}  # "Hello, please introduce yourself."
]

# apply_chat_template with tokenize=True returns the tokenized prompt
# as a tensor of input ids ready for generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
Intended Use
- Lightweight chat experiments
- Small-model SFT baselines
- Educational and debugging purposes
- Simple local inference and deployment tests
Limitations
- This is a very small model, so factuality, planning, and reasoning ability are limited.
- Tool-use style may appear in some responses, but robustness is limited.
- The model is not suitable for high-stakes medical, legal, financial, or safety-critical use.
- The training mixture includes distilled or synthetic components, so behavior may inherit teacher-model style artifacts.
Source
- Upstream codebase: https://github.com/jingyaogong/minimind
License
This model card uses cc-by-nc-4.0 conservatively because the upstream MiniMind dataset documentation mentions mixed source licenses, including non-commercial terms in parts of the training pipeline. Review your exact data provenance before using or relicensing this model for commercial scenarios.