Qwen3-14B SFT Model
Model Description
This model is a fine-tuned version of Qwen3-14B, trained with Supervised Fine-Tuning (SFT) using FSDP (Fully Sharded Data Parallel) and QLoRA (Quantized Low-Rank Adaptation).
Training Details
Base Model
- Model: Qwen/Qwen3-14B
- Architecture: Transformer-based causal language model
- Parameters: 14 billion
Training Configuration
- Method: FSDP + QLoRA
- Quantization: 4-bit QLoRA
- LoRA Parameters:
  - r: 64
  - alpha: 16
  - dropout: 0.1
  - target: linear layers
- Hardware: 8x H100 80GB HBM3
- Precision: bfloat16
- Flash Attention: Enabled
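To make the LoRA settings above concrete, here is a minimal sketch of what r and alpha imply for one linear layer. It assumes the standard LoRA scaling of alpha / r, and the 5120 x 5120 layer shape is a hypothetical value chosen for illustration, not a figure from this card.

```python
# LoRA settings from the training configuration above.
LORA_R = 64
LORA_ALPHA = 16

# Standard LoRA scales the low-rank update by alpha / r (assumed here;
# rank-stabilized LoRA would use alpha / sqrt(r) instead).
lora_scale = LORA_ALPHA / LORA_R


def lora_param_count(d_in: int, d_out: int, r: int = LORA_R) -> int:
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical layer shape for illustration only; not taken from the model card.
EXAMPLE_D_IN = 5120
EXAMPLE_D_OUT = 5120

added = lora_param_count(EXAMPLE_D_IN, EXAMPLE_D_OUT)
frozen = EXAMPLE_D_IN * EXAMPLE_D_OUT
print(f"LoRA scale: {lora_scale}")
print(f"Adapter params per layer: {added:,} ({added / frozen:.2%} of the frozen weight)")
```

With alpha smaller than r, the adapter update is scaled down (here by a factor of 0.25) before it is added to the frozen weight.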
Training Hyperparameters
- Epochs: 1
- Micro Batch Size: 1
- Gradient Accumulation Steps: 16
- Learning Rate: 1e-4
- Scheduler: Cosine with warmup
- Warmup Ratio: 0.03
- Optimizer: AdamW
- Sequence Length: 1024
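The hyperparameters above determine the effective batch size and the shape of the learning-rate curve. The sketch below works both out under two assumptions that the card does not confirm: all 8 GPUs act as data-parallel ranks, and the schedule is linear warmup followed by cosine decay to zero. `TOTAL_STEPS` is a hypothetical value, since the number of optimizer steps is not stated.

```python
import math

# Values from the hyperparameter list above.
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16
NUM_GPUS = 8  # from the hardware line; assumes every GPU is a data-parallel rank
SEQ_LEN = 1024
PEAK_LR = 1e-4
WARMUP_RATIO = 0.03

# Sequences and (at most) tokens consumed per optimizer step.
effective_batch = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_GPUS
max_tokens_per_step = effective_batch * SEQ_LEN


def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    warmup_steps = round(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; the card does not state the step count
print(f"effective batch: {effective_batch} sequences, up to {max_tokens_per_step} tokens")
for step in (0, 15, 30, 515, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

The exact warmup rounding and the final learning rate depend on the training framework, so treat the curve as an approximation.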
Dataset
- Custom SFT dataset (SFT_004_origin_4.parquet)
- Validation split: 10%
- Sample packing enabled for training efficiency
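Sample packing concatenates several short examples into one training row so that less of each 1024-token sequence is padding. The sketch below illustrates the idea with a simple first-fit-decreasing heuristic over sample lengths. It is a simplified stand-in, not the packing algorithm the training framework actually used, and the sample lengths are hypothetical.

```python
from typing import List

MAX_SEQ_LEN = 1024  # sequence length from the hyperparameters above


def pack_first_fit_decreasing(lengths: List[int], capacity: int = MAX_SEQ_LEN) -> List[List[int]]:
    """Greedily group sample lengths into bins of at most `capacity` tokens."""
    bins: List[List[int]] = []
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"sample of {length} tokens exceeds capacity {capacity}")
        for b in bins:
            if sum(b) + length <= capacity:
                b.append(length)
                break
        else:
            bins.append([length])  # no existing bin had room
    return bins


# Hypothetical token counts for seven samples.
sample_lengths = [700, 300, 500, 500, 200, 900, 100]
packed = pack_first_fit_decreasing(sample_lengths)

packed_tokens = len(packed) * MAX_SEQ_LEN  # one full-length row per bin
print(f"{len(sample_lengths)} rows unpacked vs {len(packed)} rows packed")
print(f"fill rate: {sum(sample_lengths) / packed_tokens:.0%}")
```

Real implementations also have to keep attention from crossing sample boundaries inside a packed row, which this sketch does not model.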
Model Performance
The model was fine-tuned for instruction following on a custom dataset. It is intended to retain the capabilities of the original Qwen3-14B while being adapted to the custom tasks; no benchmark results are reported in this card.
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "u-10bei/qwen3-14b-sft-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "u-10bei/qwen3-14b-sft-merged",
    trust_remote_code=True,
)
# Chat format
messages = [
    {"role": "user", "content": "Hello! How can I help you today?"}
]
# Format conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Tokenize and move the tensors to the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(response)
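Loading in bfloat16 with `device_map="auto"` needs enough accelerator memory for the weights. As a rough back-of-the-envelope check, the sketch below estimates the weight footprint from the 14 billion parameter count given above. It covers the weights only and ignores the KV cache, activations, and framework overhead.

```python
NUM_PARAMS = 14e9  # "14 billion" from the base model section
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 16 bits


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GiB (excludes KV cache and activations)."""
    return num_params * bytes_per_param / 2**30


print(f"bf16 weights: ~{weight_memory_gib(NUM_PARAMS, BYTES_PER_PARAM_BF16):.1f} GiB")
```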
Direct Chat Format
# Manual chat formatting (reuses the model and tokenizer loaded above)
prompt = "<|im_start|>user\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n"
# Move the tensors to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
# The decoded text includes the prompt and the special tokens
response = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(response)
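The manual prompt above follows a ChatML-style layout, with each turn wrapped in `<|im_start|>` and `<|im_end|>`. A minimal sketch of a helper that builds such a prompt from a message list is shown below. `build_chatml_prompt` is a hypothetical name, and the helper only mirrors the layout of the manual example; `tokenizer.apply_chat_template` remains the authoritative source for the model's template.

```python
from typing import Dict, List


def build_chatml_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render each message as one <|im_start|> ... <|im_end|> block per turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # leave the assistant turn open
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Hello! How are you?"}])
print(prompt)
```

For a single user message this reproduces the hand-written prompt string used in the manual example.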
Special Tokens
- BOS Token: <|im_start|>
- EOS Token: <|im_end|>
- UNK Token: <|endoftext|>
- PAD Token: <|endoftext|>
Technical Specifications
Model Architecture
- Attention: Flash Attention 2 (training and inference)
- Precision: bfloat16 native support
- Context Length: 1024 tokens (training), extensible for inference
- Vocabulary Size: 151,669 tokens
Optimization Features
- Memory Efficient: FSDP sharding reduces memory footprint
- Quantization Ready: QLoRA-compatible for efficient fine-tuning
- Multi-GPU: Optimized for distributed inference
Training Infrastructure
- Distributed Training: FSDP (Fully Sharded Data Parallel)
- Communication: NCCL with Ethernet backend
- Memory Management: Expandable segments, optimized allocation
- Monitoring: Weights & Biases integration
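The infrastructure items above are typically controlled through environment variables. The sketch below shows one way such a setup might look. The variable names are real PyTorch and NCCL settings, but the specific values are assumptions (the card does not list them), and `eth0` is a hypothetical interface name.

```python
import os
from typing import Dict, MutableMapping

# Assumed settings; the card names the features but not the exact values.
TRAINING_ENV: Dict[str, str] = {
    # PyTorch CUDA caching allocator: let segments grow instead of using fixed blocks.
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    # NCCL over TCP sockets: skip InfiniBand and pin the network interface.
    "NCCL_IB_DISABLE": "1",
    "NCCL_SOCKET_IFNAME": "eth0",  # hypothetical interface name
}


def apply_env(env: MutableMapping[str, str], settings: Dict[str, str] = TRAINING_ENV) -> None:
    """Set each variable unless the caller's environment already defines it."""
    for key, value in settings.items():
        env.setdefault(key, value)


# Apply before CUDA and NCCL are initialized; doing it before importing torch is the safe choice.
apply_env(os.environ)
```

Using `setdefault` keeps any value already exported by the launcher or the cluster scheduler.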
Limitations
- This model is optimized for the specific training dataset and may not generalize to all use cases
- Training used a maximum sequence length of 1024 tokens, so output quality on longer inputs may be less reliable
- Performance may vary depending on the specific task and input format
Ethical Considerations
This model inherits the capabilities and limitations of the base Qwen3-14B model. Users should be aware of potential biases and use the model responsibly.
Citation
If you use this model, please cite:
@misc{qwen3-14b-sft-merged,
  title={Qwen3-14B SFT Model with FSDP+QLoRA},
  author={u-10bei},
  year={2025},
  url={https://huggingface.co/u-10bei/qwen3-14b-sft-merged}
}
Model Card Authors
- u-10bei
Training Date
August 2025
This model was trained with FSDP + QLoRA on 8x H100 80GB GPUs.