arxiv:2604.07413

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Published on Apr 8

University of Waterloo


Authors:

Xiangru Jian, Wei Pang

Abstract

FORGE introduces a high-quality multimodal manufacturing dataset with fine-grained domain semantics to evaluate MLLMs on real-world tasks, revealing that domain-specific knowledge rather than visual grounding limits performance, and demonstrating that supervised fine-tuning on structured annotations significantly improves accuracy.

AI-generated summary

The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
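To make the "relative improvement" figure concrete, here is a minimal sketch, assuming the usual definition (fine-tuned accuracy minus baseline accuracy, divided by baseline accuracy). The function name and the accuracy values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_acc: float, tuned_acc: float) -> float:
    """Return the relative improvement of tuned_acc over baseline_acc, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (tuned_acc - baseline_acc) / baseline_acc * 100.0


# Hypothetical accuracies for illustration only (not taken from the paper).
baseline = 0.40
tuned = 0.60
print(f"{relative_improvement(baseline, tuned):.1f}% relative improvement")
```

Under this definition, a relative gain is not the same as a gain in percentage points: moving from 40% to 60% accuracy is a 20-point absolute gain but a 50% relative gain.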


Community

HideOnBush

Paper author 5 minutes ago

🏭 Releasing FORGE — a fine-grained multimodal benchmark for manufacturing scenarios
📊 ~12,000 samples across 14 workpiece categories and 90 model numbers
🔍 Built to test whether MLLMs can handle industrial precision, not just general perception

General multimodal models are getting good at broad visual understanding. But manufacturing is a different standard: the

FORGE is a manufacturing benchmark for evaluating whether MLLMs can move from coarse perception to fine-grained industrial reasoning. It includes:

🧩 ~12K samples
🏗️ 14 workpiece categories and 90 model numbers
🖼️ Dual modality: 2D images + 3D point clouds (rendered)
✅ Three core tasks: Workpiece Verification, Structural Surface Inspection, and Assembly Verification

The main takeaway is clear: macro perception is increasingly solved, but micro reasoning is still the bottleneck.

Across experiments, models are much better at workpiece-level recognition than model-number-level decisions. Visual grounding is near-perfect, but decision accuracy still lags, suggesting the real problem is not simply “seeing” the right part but understanding manufacturing semantics, tolerances, and rules.
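The gap between workpiece-level recognition and model-number-level decisions can be measured by scoring the same predictions at two granularities. The sketch below is illustrative only: the `Prediction` record, its field names, and the placeholder labels are assumptions, not the FORGE data schema or evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    # Hypothetical record layout; FORGE's actual schema may differ.
    true_category: str
    pred_category: str
    true_model: str
    pred_model: str


def accuracy(pairs):
    """Fraction of (truth, prediction) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)


def coarse_and_fine_accuracy(preds):
    """Score predictions at the category level and the model-number level."""
    coarse = accuracy((p.true_category, p.pred_category) for p in preds)
    fine = accuracy((p.true_model, p.pred_model) for p in preds)
    return coarse, fine


# Placeholder labels for illustration only.
records = [
    Prediction("category_1", "category_1", "model_1a", "model_1a"),
    Prediction("category_1", "category_1", "model_1a", "model_1b"),
    Prediction("category_2", "category_2", "model_2a", "model_2c"),
    Prediction("category_2", "category_1", "model_2b", "model_1a"),
]
coarse, fine = coarse_and_fine_accuracy(records)
print(f"category accuracy: {coarse:.2f}, model-number accuracy: {fine:.2f}")
```

In this toy example the model identifies the right category in three of four cases but the exact model number in only one, which is the kind of coarse-versus-fine gap described above.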

We also find that:

📌 Surface inspection remains difficult even for strong models
📌 Reference-conditioned prompting does not reliably help in 3D settings
📌 Raw point-cloud-as-text reasoning is still extremely weak
📌 Domain-specific fine-tuning can significantly improve smaller open models
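For intuition on the point-cloud-as-text finding, here is one possible way a point cloud could be serialized into a text prompt. This is an assumption for illustration, not necessarily the serialization used in FORGE; `point_cloud_to_text` and its parameters are hypothetical.

```python
import random


def point_cloud_to_text(points, max_points=256, decimals=3, seed=0):
    """Serialize (x, y, z) points as one 'x y z' line per point.

    Randomly subsamples to at most max_points so the prompt stays bounded.
    """
    rng = random.Random(seed)
    if len(points) > max_points:
        points = rng.sample(points, max_points)
    return "\n".join(
        " ".join(f"{coord:.{decimals}f}" for coord in point) for point in points
    )


# Synthetic cloud for illustration only (uniform random points, not real data).
rng = random.Random(42)
cloud = [
    (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
    for _ in range(10_000)
]
text = point_cloud_to_text(cloud)
print(len(text.splitlines()), "points,", len(text), "characters")
```

Even after aggressive subsampling, the model receives only a long list of coordinates with no explicit surface or part structure, which may help explain why reasoning over this representation is difficult.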

FORGE is intended to serve both as a diagnostic benchmark and as a practical resource for industrial multimodal AI.

📄 Paper: https://arxiv.org/abs/2604.07413
🌐 Website: https://ai4manufacturing.github.io/forge-web/
🤗 Dataset: https://huggingface.co/datasets/AI4Manufacturing/forge
💻 Code: https://github.com/AI4Manufacturing/FORGE

