How to use u-10bei/qwen3-14b-sft-merged with libraries and local apps.

Libraries

How to use u-10bei/qwen3-14b-sft-merged with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="u-10bei/qwen3-14b-sft-merged")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# Qwen3-14B is a text-only causal language model, so AutoModelForCausalLM is the matching class
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("u-10bei/qwen3-14b-sft-merged")
model = AutoModelForCausalLM.from_pretrained("u-10bei/qwen3-14b-sft-merged")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use u-10bei/qwen3-14b-sft-merged with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "u-10bei/qwen3-14b-sft-merged"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "u-10bei/qwen3-14b-sft-merged",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
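
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8000 (for the SGLang server described later, the port would be 30000). The helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "u-10bei/qwen3-14b-sft-merged"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000"):
    """Send the prompt to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the reply as a string.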

Use Docker

docker model run hf.co/u-10bei/qwen3-14b-sft-merged

SGLang

How to use u-10bei/qwen3-14b-sft-merged with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "u-10bei/qwen3-14b-sft-merged" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "u-10bei/qwen3-14b-sft-merged",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "u-10bei/qwen3-14b-sft-merged" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "u-10bei/qwen3-14b-sft-merged",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Qwen3-14B SFT Model

Model Description

This model is Qwen3-14B fine-tuned with Supervised Fine-Tuning (SFT), using FSDP (Fully Sharded Data Parallel) together with QLoRA (Quantized Low-Rank Adaptation).

Training Details

Base Model

Model: Qwen/Qwen3-14B
Architecture: Transformer-based causal language model
Parameters: 14 billion

Training Configuration

Method: FSDP + QLoRA
Quantization: 4-bit QLoRA
LoRA Parameters:
- r: 64
- alpha: 16
- dropout: 0.1
- target: linear layers
Hardware: 8x H100 80GB HBM3
Precision: bfloat16
Flash Attention: Enabled
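
To make the LoRA settings concrete: in the standard LoRA formulation (not the rank-stabilized variant), the low-rank update is scaled by alpha / r, and each adapted linear layer gains r × (d_in + d_out) trainable parameters. The sketch below computes both. The 5120 × 5120 layer shape is an illustrative assumption, not a value taken from this model card.

```python
def lora_scaling(r, alpha):
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_trainable_params(d_in, d_out, r):
    """Parameters added to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


R, ALPHA = 64, 16  # values from the training configuration above

print(lora_scaling(R, ALPHA))                # 0.25
# Hypothetical square projection, chosen only for illustration.
print(lora_trainable_params(5120, 5120, R))  # 655360
```

With alpha smaller than r, the adapter update is scaled down to a quarter of its raw magnitude.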

Training Hyperparameters

Epochs: 1
Micro Batch Size: 1
Gradient Accumulation Steps: 16
Learning Rate: 1e-4
Scheduler: Cosine with warmup
Warmup Ratio: 0.03
Optimizer: AdamW
Sequence Length: 1024
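
Two quantities follow from these settings. Assuming all 8 GPUs act as data-parallel ranks (the usual FSDP arrangement), the effective batch size is micro batch × accumulation steps × GPUs. The learning rate rises linearly during warmup and then follows a cosine decay. The sketch below computes both. The total of 1000 optimizer steps is a hypothetical value, since the card does not state it, and decaying all the way to zero is also an assumption.

```python
import math

MICRO_BATCH, GRAD_ACCUM, NUM_GPUS, SEQ_LEN = 1, 16, 8, 1024
PEAK_LR, WARMUP_RATIO = 1e-4, 0.03


def effective_batch_size(micro_batch, grad_accum, num_gpus):
    """Sequences contributing to one optimizer step under data parallelism."""
    return micro_batch * grad_accum * num_gpus


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


batch = effective_batch_size(MICRO_BATCH, GRAD_ACCUM, NUM_GPUS)
print(batch)            # 128 sequences per optimizer step
print(batch * SEQ_LEN)  # 131072 tokens per step at most (fully packed sequences)

TOTAL_STEPS = 1000  # hypothetical; the real step count depends on dataset size
for step in (0, 15, 30, 515, 1000):
    print(step, lr_at(step, TOTAL_STEPS))
```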

Dataset

Custom SFT dataset (SFT_004_origin_4.parquet)
Validation split: 10%
Sample packing enabled for training efficiency
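
Sample packing places several short examples into one fixed-length training sequence, so less of the 1024-token window is spent on padding. The sketch below is a simple first-fit packer over token counts. It only illustrates the idea: the packing algorithm used by the actual training framework may differ, and the example lengths are made up.

```python
def pack_first_fit(lengths, max_len=1024):
    """Group example indices into bins whose total token count fits in max_len."""
    bins = []  # each entry: [remaining_capacity, [example indices]]
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"example {idx} has {n} tokens, more than {max_len}")
        for b in bins:
            if b[0] >= n:  # first bin with enough room
                b[0] -= n
                b[1].append(idx)
                break
        else:
            bins.append([max_len - n, [idx]])  # no bin fits: open a new one
    return [members for _, members in bins]


lengths = [700, 300, 512, 512, 100, 24]  # hypothetical token counts
print(pack_first_fit(lengths))  # [[0, 1, 5], [2, 3], [4]]
```

Here six examples fit into three packed sequences instead of six padded ones.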

Model Performance

The model was fine-tuned for instruction following on a custom dataset and is intended to retain the capabilities of the original Qwen3 model. No benchmark results are reported in this card.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "u-10bei/qwen3-14b-sft-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "u-10bei/qwen3-14b-sft-merged",
    trust_remote_code=True
)

# Chat format
messages = [
    {"role": "user", "content": "Hello! How can I help you today?"}
]

# Format conversation
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and move the tensors to the device the model was loaded on
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode only the newly generated tokens (skip the prompt)
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(response)

Direct Chat Format

# Manual chat formatting
prompt = "<|im_start|>user\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>")
)

response = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(response)
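
For multi-turn conversations, the same string can be assembled programmatically. The helper below, `build_chatml_prompt`, is a hypothetical name and covers only the plain turn format shown above. The tokenizer's own chat template may add further elements (for example a default system prompt or reasoning tags), so `apply_chat_template` remains the safer choice when in doubt.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the plain ChatML turn format."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Hello! How are you?"}])
print(repr(prompt))
```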

Special Tokens

BOS Token: <|im_start|>
EOS Token: <|im_end|>
UNK Token: <|endoftext|>
PAD Token: <|endoftext|>
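
These values can be checked against the tokenizer that ships with the model. The sketch below reads the four attributes from any tokenizer-like object. `special_token_report` is a hypothetical helper, and the stand-in object simply mirrors the list above so that the example runs without downloading the model.

```python
from types import SimpleNamespace


def special_token_report(tokenizer):
    """Collect the special-token attributes exposed by a Hugging Face tokenizer."""
    names = ("bos_token", "eos_token", "unk_token", "pad_token")
    return {name: getattr(tokenizer, name, None) for name in names}


# Stand-in mirroring the values listed above (not loaded from the model).
stub = SimpleNamespace(
    bos_token="<|im_start|>",
    eos_token="<|im_end|>",
    unk_token="<|endoftext|>",
    pad_token="<|endoftext|>",
)
print(special_token_report(stub))
```

Passing the object returned by `AutoTokenizer.from_pretrained("u-10bei/qwen3-14b-sft-merged")` instead of the stand-in reports the values actually configured for this model.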

Technical Specifications

Model Architecture

Attention: Flash Attention 2 (training and inference)
Precision: bfloat16 native support
Context Length: 1024 tokens (training), extensible for inference
Vocabulary Size: 151,669 tokens
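
As a rough guide to hardware needs, the memory taken by the weights alone can be estimated from the parameter count and the precision. The sketch below uses the "14 billion" figure stated above as an approximation. It ignores activations, the KV cache, optimizer state, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 14e9  # approximate parameter count from this card

print(weight_memory_gb(NUM_PARAMS, 16))  # bfloat16 weights: 28.0
print(weight_memory_gb(NUM_PARAMS, 4))   # 4-bit quantized weights: 7.0
```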

Optimization Features

Memory Efficient: FSDP sharding reduces the per-GPU memory footprint during training
Quantization Ready: QLoRA-compatible for efficient fine-tuning
Multi-GPU: Can be sharded across GPUs for inference (for example with device_map="auto")

Training Infrastructure

Distributed Training: FSDP (Fully Sharded Data Parallel)
Communication: NCCL with Ethernet backend
Memory Management: Expandable segments, optimized allocation
Monitoring: Weights & Biases integration

Limitations

This model is optimized for the specific training dataset and may not generalize to all use cases
Context length is limited to 1024 tokens during training
Performance may vary depending on the specific task and input format

Ethical Considerations

This model inherits the capabilities and limitations of the base Qwen3-14B model. Users should be aware of potential biases and use the model responsibly.

Citation

If you use this model, please cite:

@misc{qwen3-14b-sft-merged,
  title={Qwen3-14B SFT Model with FSDP+QLoRA},
  author={u-10bei},
  year={2025},
  url={https://huggingface.co/u-10bei/qwen3-14b-sft-merged}
}

Model Card Authors

u-10bei

Training Date

August 2025

This model was trained with FSDP + QLoRA on 8x H100 GPUs.
