Qianfan-VL: A Milestone Achievement in Chinese Multimodal AI with Domestic Chips

Community Article Published September 24, 2025

Dear AI research enthusiasts, let's dive into Qianfan-VL, a technical report from the Baidu AI Cloud team. Qianfan-VL is a series of multimodal large language models with strong industrial value: it performs well on general benchmarks and also achieves SOTA results on domain-specific tasks such as OCR, document understanding, and mathematical reasoning. Most excitingly, the models were trained entirely on Baidu's Kunlun chips, with over 90% scaling efficiency on a cluster of 5000+ chips!

Related Resources

GitHub: https://github.com/baidubce/Qianfan-VL (Welcome to star/fork/watch!)

Hugging Face: https://huggingface.co/collections/baidu/qianfan-vl-68d0b9b0be8575c17267c85c

Tech Report: https://arxiv.org/abs/2509.18189

Online Demo: https://huggingface.co/spaces/baidu/Qianfan-VL

Core Challenge: How to Excel in Domains While Maintaining General Capabilities?

Current Vision-Language Models (VLMs) face a dilemma: either pursue generality but lack professional depth, or specialize in specific domains but lose general capabilities. Enterprise applications need both - broad multimodal understanding abilities AND excellent performance in critical areas like document processing, OCR recognition, and mathematical reasoning. The Qianfan-VL team proposes an elegant solution: a four-stage progressive training pipeline that successfully achieves "having your cake and eating it too" through carefully designed training strategies and data ratios.

Technical Innovation: The Art of Four-Stage Progressive Training

1. Model Architecture: The Wisdom of Modular Design

Qianfan-VL adopts a classic and effective three-component architecture:

Vision Encoder: Based on InternViT, supports dynamic image tiling, handles up to 4K resolution

Language Model Backbone: Llama 3.1 for 8B/70B versions, Qwen2.5 for 3B version

Cross-modal Adapter: Two-layer MLP structure, simple and efficient

The brilliance of this design lies in fully leveraging pre-trained model capabilities while achieving efficient cross-modal alignment through adapters.
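To make "dynamic image tiling" a bit more concrete, here is a minimal sketch of one common way to do it: pick a grid of fixed-size tiles whose aspect ratio is closest to the image's, then cut the resized image into those tiles. The 448-pixel tile size, the 12-tile cap, and the tie-breaking rule are assumptions for illustration, and the function names are hypothetical; the article does not state the exact algorithm Qianfan-VL uses.

```python
from itertools import product


def choose_tile_grid(width, height, max_tiles=12):
    """Pick the (cols, rows) grid whose aspect ratio best matches the image.

    max_tiles is an assumed cap, not a value taken from the article.
    """
    image_ratio = width / height
    candidates = [
        (cols, rows)
        for cols, rows in product(range(1, max_tiles + 1), repeat=2)
        if cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with more tiles,
    # which keeps more detail for high-resolution inputs.
    return min(
        candidates,
        key=lambda g: (abs(image_ratio - g[0] / g[1]), -(g[0] * g[1])),
    )


def tile_boxes(width, height, tile_size=448, max_tiles=12):
    """Return the grid and the crop boxes (left, top, right, bottom).

    The boxes refer to the image after it has been resized to
    (cols * tile_size, rows * tile_size). tile_size=448 is an assumption.
    """
    cols, rows = choose_tile_grid(width, height, max_tiles)
    boxes = [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]
    return (cols, rows), boxes


if __name__ == "__main__":
    grid, boxes = tile_boxes(3840, 2160)  # a 4K frame
    print(grid, len(boxes))
```

Under these assumptions a 4K (3840x2160) image maps to a 4x2 grid, so the encoder sees eight tiles rather than one heavily downscaled image.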

2. Four-Stage Training Pipeline: Progressive Wisdom

This training strategy is textbook-level design:

Stage 1 - Cross-modal Alignment (100B tokens): Only the adapter is updated; the encoder and language model stay frozen. This is like building a bridge between two languages: the basic connection has to be solid first.

Stage 2 - General Knowledge Injection (2.66T tokens): Full parameter updates on massive data. Interestingly, OCR and caption tasks account for 85% of this data, laying a solid foundation for the later domain enhancement.

Stage 3 - Domain Enhancement (0.32T tokens): A mix of 70% domain data and 30% general data. This stage is the heart of the whole pipeline: it strengthens professional capabilities while maintaining generality.

Stage 4 - Instruction Fine-tuning (1B tokens): Introduces long Chain-of-Thought (CoT) training, significantly improving reasoning capabilities.
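The stage budgets are easier to compare side by side. The sketch below writes the four stages down as plain data and computes each stage's share of the roughly 3.08T total tokens (100B + 2.66T + 0.32T + 1B), plus a simple 70/30 batch split for Stage 3. The names `Stage`, `token_shares`, and `stage3_batch_split` are hypothetical, and the trainable modules listed for Stages 3 and 4 are an assumption, since the article only states them for Stages 1 and 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    tokens_b: float   # token budget in billions, as quoted in the article
    trainable: tuple  # modules that receive gradient updates


ALL_MODULES = ("vision_encoder", "adapter", "language_model")

STAGES = [
    Stage("cross_modal_alignment", 100.0, ("adapter",)),
    Stage("general_knowledge_injection", 2660.0, ALL_MODULES),
    # Assumption: the article does not say which modules Stages 3 and 4 update.
    Stage("domain_enhancement", 320.0, ALL_MODULES),
    Stage("instruction_fine_tuning", 1.0, ALL_MODULES),
]


def token_shares(stages):
    """Return each stage's fraction of the total token budget."""
    total = sum(s.tokens_b for s in stages)
    return {s.name: s.tokens_b / total for s in stages}


def stage3_batch_split(batch_size, domain_fraction=0.7):
    """Split a batch into (domain, general) sample counts for Stage 3."""
    domain = round(batch_size * domain_fraction)
    return domain, batch_size - domain


if __name__ == "__main__":
    for name, share in token_shares(STAGES).items():
        print(f"{name}: {share:.1%}")
    print(stage3_batch_split(1000))
```

Seen this way, Stage 2 dominates the budget, and Stage 3 gets its domain gains from about a tenth of the total tokens.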

3. Data Synthesis Pipeline: Industrial-Grade Data Production Line

The Qianfan team built six major data synthesis pipelines covering document OCR, math problems, chart understanding, table recognition, formula recognition, and scene OCR. Each pipeline features carefully designed quality control mechanisms to ensure data accuracy and diversity. Particularly noteworthy is the math problem synthesis pipeline: it spans K-12 to university level, includes detailed solution steps, and simulates real-world conditions such as handwriting and different paper backgrounds. This attention to detail is key to the model's breakthrough in mathematical reasoning.

Performance: Comprehensive Excellence

General Multimodal Benchmarks

Qianfan-VL excels on 14 standard benchmarks:

ScienceQA: 98.76% accuracy (70B version), demonstrating strong scientific reasoning

CCBench: 80.98%, outstanding Chinese comprehension

SEEDBench_IMG: 79.13%, excellent visual perception

OCR and Document Understanding: True Domain Mastery

In OCR and document understanding tasks, Qianfan-VL shows stunning capabilities:

DocVQA: 94.75% (70B version), top-tier document Q&A ability

ChartQA: 89.60%, chart understanding leads peer models

OCRBench: 873 points, strong comprehensive OCR capabilities

The scene OCR examples demonstrate the model's ability to accurately recognize text in complex real-world scenarios, including different angles and lighting conditions.

Mathematical Reasoning: The Power of Chain-of-Thought

After introducing long chain-of-thought training, math reasoning capabilities improved dramatically:

MathVista: 78.60% (70B version), SOTA among open-source models

MathVision: 50.29%, outstanding handling of complex visual math problems

MathVerse: 61.04%, excellent multi-step reasoning performance

Real-World Applications: From Political Analysis to Geographic Information

In the chart analysis example, the model not only identifies the multiple trend lines accurately but also uses contextual information to analyze how support for the UK Conservative and Labour parties changed, demonstrating strong understanding of data visualizations. In the China heating map example, the model correctly reads the map legend and can answer questions about heating coverage in specific regions (like Kunming, Yunnan), showing excellent spatial reasoning.

Infrastructure: A Milestone for Domestic AI Chips

Qianfan-VL was trained entirely on Baidu's Kunlun P800 chips, a milestone achievement:

Cluster Scale: 5000+ chips parallel training

Scaling Efficiency: Over 90%, reaching world-class levels

Optimization Strategy: 3D parallelism (data parallel + tensor parallel + pipeline parallel)

Communication-Computation Fusion: Leveraging Kunlun chip's unique hardware architecture for true parallel communication and computation
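Two of the terms above have simple definitions worth writing down. Scaling efficiency is the measured cluster throughput divided by the ideal throughput (chip count times single-chip throughput). In 3D parallelism, the chip count factors into data-parallel, pipeline-parallel, and tensor-parallel degrees. The sketch below is illustrative only: the 80 x 8 x 8 layout, the throughput numbers, and the rank ordering are hypothetical and are not the configuration reported for Qianfan-VL.

```python
def scaling_efficiency(throughput_n, throughput_1, n_chips):
    """Measured throughput on n_chips relative to perfect linear scaling."""
    return throughput_n / (n_chips * throughput_1)


def rank_to_coords(rank, dp, tp, pp):
    """Map a flat rank to (data, pipeline, tensor) coordinates.

    Tensor-parallel index varies fastest, so the chips in one
    tensor-parallel group get neighbouring ranks. This ordering is a
    common convention, assumed here for illustration.
    """
    if not 0 <= rank < dp * tp * pp:
        raise ValueError("rank out of range for this layout")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


if __name__ == "__main__":
    # Hypothetical layout: 80 data-parallel x 8 pipeline x 8 tensor = 5120 chips.
    dp, tp, pp = 80, 8, 8
    print(dp * tp * pp, rank_to_coords(5119, dp, tp, pp))
    # Hypothetical throughputs in arbitrary units.
    print(scaling_efficiency(4608.0, 1.0, 5120))
```

With these made-up numbers, a cluster that delivers 4608 units of throughput on 5120 chips sits at 90% of linear scaling, which is the level the article reports as exceeded.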

Particularly commendable is the communication-computation fusion optimization: because the Kunlun P800 chip has physically separate communication and matrix computation units, data transfer and matrix operations can genuinely run in parallel, reducing end-to-end latency by 40%.
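A back-of-the-envelope model shows why overlap can matter this much. If communication and computation run one after the other, a step costs their sum; if they overlap perfectly, it costs only the longer of the two. The sketch below is an idealized model with made-up timings, chosen so the result lands on 40%. The article does not say how its 40% figure was measured, so this is not a reconstruction of that measurement.

```python
def sequential_latency(compute_ms, comm_ms):
    """Step time when communication waits for computation (or vice versa)."""
    return compute_ms + comm_ms


def overlapped_latency(compute_ms, comm_ms):
    """Step time under ideal overlap: the slower of the two hides the other."""
    return max(compute_ms, comm_ms)


def latency_reduction(compute_ms, comm_ms):
    """Fraction of step time saved by ideal overlap."""
    seq = sequential_latency(compute_ms, comm_ms)
    return 1.0 - overlapped_latency(compute_ms, comm_ms) / seq


if __name__ == "__main__":
    # Hypothetical timings: 6 ms of matrix compute, 4 ms of communication.
    print(f"{latency_reduction(6.0, 4.0):.0%}")
```

In this model the saving equals the share of the step that the shorter phase used to occupy, so the benefit is largest when communication and computation take similar amounts of time.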

Ablation Studies: Validating Domain Enhancement Necessity

The team proved Stage 3 (domain enhancement) importance through rigorous ablation experiments:

OCR Tasks: Handwriting recognition improved by 8.20%, complex HTML table recognition by 3.67%

Mathematical Reasoning: Internal datasets improved up to 18%, public benchmarks showed 2-6% gains

No Performance Degradation: All 16 evaluation tasks showed positive improvements

This shows that the domain enhancement strategy is not only effective on its own but also complements general training well.

Interesting Discovery: The Magic of Thinking Tokens

Qianfan-VL introduces a pair of special chain-of-thought tokens that mark the start and end of a thinking span, allowing the model to perform explicit chain reasoning. Users can activate reasoning mode by adding these tokens: the model generates detailed reasoning steps within the thinking boundaries but shows only the final answer to users. This design ensures both reasoning transparency and output conciseness.
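On the application side, "only show the final answer" usually comes down to splitting the raw model output at the thinking boundaries. Here is a minimal sketch of that post-processing step. The token strings `<think>` and `</think>` are placeholders assumed for illustration, since the article's text does not preserve the actual token names, and `split_reasoning` is a hypothetical helper rather than part of any Qianfan-VL API.

```python
import re

# Assumption: placeholder token strings; check the model card for the real ones.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"

_THINK_SPAN = re.compile(
    re.escape(THINK_OPEN) + r"(.*?)" + re.escape(THINK_CLOSE),
    re.DOTALL,  # reasoning may span multiple lines
)


def split_reasoning(raw_output):
    """Separate reasoning spans from the user-facing answer.

    Returns (reasoning_steps, answer), where reasoning_steps is a list of
    the text found inside each thinking span and answer is what remains.
    """
    reasoning = [span.strip() for span in _THINK_SPAN.findall(raw_output)]
    answer = _THINK_SPAN.sub("", raw_output).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>2 + 2 = 4</think>The answer is 4."
    steps, answer = split_reasoning(raw)
    print(steps)
    print(answer)
```

Keeping the reasoning as a separate return value, rather than discarding it, lets an application log it or show it on demand, which matches the transparency goal described above.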

Future Outlook

Upcoming improvements:

Context Length Extension: From 32K to 128K and beyond

Computational Efficiency Optimization: Integrating NaViT technology for native resolution processing

Capability Boundary Expansion: Video understanding, 3D spatial reasoning, temporal analysis

Domain-Specific Versions: Medical imaging, scientific charts, technical drawings and other vertical domains

Summary: A Model of Industry-Academia Collaboration

Qianfan-VL is not just an excellent multimodal model but an important milestone in China's AI self-innovation. It proves:

Domain Enhancement and General Capabilities Can Coexist: Through carefully designed training strategies, professional breakthroughs can be achieved while maintaining generality

Data Quality Over Quantity: Stage 3 achieved significant improvements with just 0.32T tokens - the key is precise data design

Domestic Chips Have Matured: Training SOTA models on Kunlun chips demonstrates China's AI infrastructure strength

Equal Emphasis on Engineering and Research: From data synthesis to training optimization, every aspect reflects industrial-grade rigor

This technical report's insight: In the era of large models, finding the right positioning, careful design, and solid execution can still carve out your own space in a field dominated by giants. Qianfan-VL points to a practical path for enterprise-level AI applications.
