Qianfan VL Demo
Domain-Enhanced Universal Vision-Language Models
Github: https://github.com/baidubce/Qianfan-VL (Welcome to star/fork/watch!)
Huggingface: https://huggingface.co/collections/baidu/qianfan-vl-68d0b9b0be8575c17267c85c
Tech Report: https://arxiv.org/abs/2509.18189
Online Demo: https://huggingface.co/spaces/baidu/Qianfan-VL
Current Vision-Language Models (VLMs) face a dilemma: either pursue generality but lack professional depth, or specialize in specific domains but lose general capabilities. Enterprise applications need both - broad multimodal understanding AND excellent performance in critical areas like document processing, OCR, and mathematical reasoning. The Qianfan-VL team proposes an elegant solution: a four-stage progressive training pipeline that achieves "having your cake and eating it too" through carefully designed training strategies and data ratios.

Technical Innovation: The Art of Four-Stage Progressive Training
Qianfan-VL adopts a classic and effective three-component architecture:
Vision Encoder: Based on InternViT, supports dynamic image tiling, handles up to 4K resolution
Language Model Backbone: Llama 3.1 for 8B/70B versions, Qwen2.5 for 3B version
Cross-modal Adapter: Two-layer MLP structure, simple and efficient
The brilliance of this design lies in fully leveraging pre-trained model capabilities while achieving efficient cross-modal alignment through adapters.
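To make the dynamic image tiling idea concrete, here is a minimal sketch of how a tile grid could be chosen for an input image. It is an illustration only, not Qianfan-VL's actual preprocessing: the function name `choose_tile_grid`, the 448-pixel tile size, the 12-tile budget, and the tie-breaking rule are all assumptions made for this example.

```python
def choose_tile_grid(width, height, tile=448, max_tiles=12):
    """Pick a (cols, rows) grid for splitting an image into square tiles.

    Assumed rule (hypothetical, for illustration): among all grids with at
    most `max_tiles` tiles, prefer the one whose aspect ratio is closest to
    the image's; break ties by how closely the tiled area matches the image
    area.
    """
    target_ratio = width / height
    image_area = width * height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles // cols + 1)
    ]

    def cost(grid):
        cols, rows = grid
        ratio_gap = abs(target_ratio - cols / rows)
        area_gap = abs(image_area - cols * rows * tile * tile)
        return (ratio_gap, area_gap)

    return min(candidates, key=cost)


def tile_boxes(cols, rows, tile=448):
    """Crop boxes (left, top, right, bottom) after resizing the image
    to (cols * tile) x (rows * tile) pixels."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


grid = choose_tile_grid(3840, 2160)  # a 4K frame -> (4, 2) under these assumptions
boxes = tile_boxes(*grid)            # 8 crop boxes, one per tile
```

Under these assumed settings a 4K frame is split into a 4 x 2 grid, so the encoder sees eight fixed-size tiles rather than one heavily downscaled image.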
The training strategy proceeds in four stages:
Stage 1 - Cross-modal Alignment (100B tokens): Only updates the adapter, keeping the encoder and language model frozen. Like building a bridge between two languages, this establishes a solid basic connection first.
Stage 2 - General Knowledge Injection (2.66T tokens): Full parameter updates with massive data training. Interestingly, OCR and Caption tasks account for 85%, laying a solid foundation for subsequent domain enhancement.
Stage 3 - Domain Enhancement (0.32T tokens): Trains on a mix of 70% domain data and 30% general data. This stage is the core of the pipeline - it strengthens professional capabilities while maintaining generality.
Stage 4 - Instruction Fine-tuning (1B tokens): Introduces Long Chain-of-Thought (CoT) training, significantly improving reasoning capabilities.
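The four stages can be written down as a small schedule to see how the token budget is distributed. This is a sketch for illustration, not the team's training configuration: the `Stage` class, the component names, and the `pick_source` sampler are hypothetical, and the trainable components for Stages 3 and 4 are left as `None` because the text does not state them.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    tokens_b: int                         # token budget in billions
    trainable: Optional[Tuple[str, ...]]  # None = not stated in the text


ALL_PARTS = ("vision_encoder", "adapter", "language_model")

STAGES = [
    Stage("cross-modal alignment", 100, ("adapter",)),
    Stage("general knowledge injection", 2660, ALL_PARTS),
    Stage("domain enhancement", 320, None),
    Stage("instruction fine-tuning", 1, None),
]


def total_tokens_b(stages):
    """Total token budget across all stages, in billions."""
    return sum(s.tokens_b for s in stages)


def token_share(stage, stages):
    """Fraction of the total token budget spent in one stage."""
    return stage.tokens_b / total_tokens_b(stages)


def pick_source(rng, domain_ratio=0.7):
    """Stage 3 data mix: draw 'domain' with probability 0.7, else 'general'."""
    return "domain" if rng.random() < domain_ratio else "general"


for stage in STAGES:
    print(f"{stage.name}: {token_share(stage, STAGES):.1%} of all tokens")
```

The budgets sum to 3,081B tokens, of which Stage 3 uses roughly a tenth, which is the basis for the later point that domain enhancement is comparatively cheap.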
The Qianfan team built six major data synthesis pipelines covering document OCR, math problems, chart understanding, table recognition, formula recognition, and scene OCR. Each pipeline features carefully designed quality control mechanisms ensuring data accuracy and diversity.
Particularly noteworthy is the math problem synthesis pipeline: it spans K-12 to university level, includes detailed solution steps, and simulates real-world conditions such as handwriting and varied paper backgrounds. This attention to detail is key to the model's breakthrough in mathematical reasoning.
Performance: Comprehensive Excellence

Qianfan-VL excels on 14 standard benchmarks:
ScienceQA: 98.76% accuracy (70B version), demonstrating strong scientific reasoning
CCBench: 80.98%, outstanding Chinese comprehension
SEEDBench_IMG: 79.13%, excellent visual perception
In OCR and document understanding tasks, Qianfan-VL shows stunning capabilities:
DocVQA: 94.75% (70B version), top-tier document Q&A ability
ChartQA: 89.60%, chart understanding leads peer models
OCRBench: 873 points, strong comprehensive OCR capabilities
The scene OCR examples demonstrate the model's ability to accurately recognize text in complex real-world scenarios, including different angles and lighting conditions.
After introducing long chain-of-thought training, math reasoning capabilities improved dramatically:
MathVista: 78.60% (70B version), SOTA among open-source models
MathVision: 50.29%, outstanding complex visual math problem handling
MathVerse: 61.04%, excellent multi-step reasoning performance
In the chart understanding example, the model not only accurately identifies multiple trend lines but also uses contextual information to analyze changes in support for the UK Conservative and Labour parties, demonstrating strong understanding of data visualizations.
In the China heating map example, the model correctly reads the map legend and can answer questions about heating coverage in specific regions (like Kunming, Yunnan), showing excellent spatial reasoning.

Infrastructure: A Milestone for Domestic AI Chips
Qianfan-VL was trained entirely on Baidu's Kunlun P800 chips, a milestone achievement:
Scaling Efficiency: Over 90%, reaching world-class levels
Optimization Strategy: 3D parallelism (data parallel + tensor parallel + pipeline parallel)
Communication-Computation Fusion: Leveraging Kunlun chip's unique hardware architecture for true parallel communication and computation
Particularly commendable is the communication-computation fusion optimization: the Kunlun P800's physically separate communication and matrix computation units allow data transfer and matrix operations to run truly in parallel, reducing end-to-end latency by 40%.
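A simple timing model shows why overlapping communication with computation can cut latency by this much, and how scaling efficiency is usually computed. The sketch below is an idealized illustration: all function names are made up for this example, and the millisecond and throughput figures are hypothetical numbers chosen to reproduce the quoted percentages, not measurements from the report.

```python
def serial_step_ms(compute_ms, comm_ms):
    """Step time when communication waits for computation to finish."""
    return compute_ms + comm_ms


def overlapped_step_ms(compute_ms, comm_ms):
    """Step time under ideal overlap: the slower of the two hides the other."""
    return max(compute_ms, comm_ms)


def latency_reduction(compute_ms, comm_ms):
    """Fraction of step time saved by overlapping instead of serializing."""
    serial = serial_step_ms(compute_ms, comm_ms)
    return (serial - overlapped_step_ms(compute_ms, comm_ms)) / serial


def scaling_efficiency(throughput_n, throughput_1, n_chips):
    """Measured throughput on n chips relative to perfect linear scaling."""
    return throughput_n / (n_chips * throughput_1)


def world_size(data_parallel, tensor_parallel, pipeline_parallel):
    """Chips used by a 3D-parallel layout: the product of the three degrees."""
    return data_parallel * tensor_parallel * pipeline_parallel


# Hypothetical step: 60 ms of matrix compute plus 40 ms of communication.
print(latency_reduction(60, 40))          # 0.4, i.e. a 40% shorter step
# Hypothetical cluster: 64 chips delivering 5800 samples/s vs 100 on one chip.
print(scaling_efficiency(5800, 100, 64))  # 0.90625
print(world_size(8, 4, 2))                # 64
```

In this model the saving is bounded by the share of the step spent on whichever of the two activities is shorter, so a 40% reduction implies communication previously took about 40% of each step.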
The team demonstrated the importance of Stage 3 (domain enhancement) through rigorous ablation experiments:
OCR Tasks: Handwriting recognition improved by 8.20%, complex HTML table recognition by 3.67%
Mathematical Reasoning: Internal datasets improved up to 18%, public benchmarks showed 2-6% gains
No Performance Degradation: All 16 evaluation tasks showed positive improvements
This shows that the domain enhancement strategy is not only effective but also complements general training well.
Qianfan-VL introduces a pair of special tokens that mark the start and end of chain-of-thought reasoning, allowing the model to perform explicit step-by-step reasoning. Users can activate reasoning mode by adding these tokens - the model generates detailed reasoning steps within the thinking boundaries but only shows final answers to users. This design ensures both reasoning transparency and output conciseness.
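On the application side, separating the reasoning span from the final answer could look like the sketch below. The marker strings `[REASONING_START]` and `[REASONING_END]` are hypothetical placeholders, not the model's real tokens, and `split_reasoning` is an illustrative helper rather than part of any Qianfan-VL API.

```python
# Hypothetical placeholder markers; substitute the model's actual tokens.
OPEN_TAG = "[REASONING_START]"
CLOSE_TAG = "[REASONING_END]"


def split_reasoning(text, open_tag=OPEN_TAG, close_tag=CLOSE_TAG):
    """Return (reasoning, answer) from raw model output.

    If no complete reasoning span is found, the reasoning is empty and the
    whole text is treated as the answer.
    """
    start = text.find(open_tag)
    if start == -1:
        return "", text.strip()
    end = text.find(close_tag, start + len(open_tag))
    if end == -1:
        return "", text.strip()
    reasoning = text[start + len(open_tag):end].strip()
    answer = (text[:start] + text[end + len(close_tag):]).strip()
    return reasoning, answer


raw = "[REASONING_START]2 apples + 3 apples = 5 apples[REASONING_END] The answer is 5."
reasoning, answer = split_reasoning(raw)
print(answer)  # The answer is 5.
```

Keeping the reasoning as a separate return value lets an application log or audit it while displaying only the answer.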
Context Length Extension: From 32K to 128K and beyond
Computational Efficiency Optimization: Integrating NaViT technology for native resolution processing
Capability Boundary Expansion: Video understanding, 3D spatial reasoning, temporal analysis
Domain-Specific Versions: Medical imaging, scientific charts, technical drawings and other vertical domains
Qianfan-VL is not just an excellent multimodal model but an important milestone in China's AI self-innovation. It proves:
Domain Enhancement and General Capabilities Can Coexist: Through carefully designed training strategies, professional breakthroughs can be achieved while maintaining generality
Data Quality Over Quantity: Stage 3 achieved significant improvements with just 0.32T tokens - the key is precise data design
Domestic Chips Have Matured: Training SOTA models on Kunlun chips demonstrates China's AI infrastructure strength
Equal Emphasis on Engineering and Research: From data synthesis to training optimization, every aspect reflects industrial-grade rigor
The takeaway from this technical report: in the era of large models, the right positioning, careful design, and solid execution can still carve out space in a field dominated by giants. Qianfan-VL points to a practical path for enterprise-level AI applications.