---
license: apache-2.0
language:
- ja
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- vlm
- llava
- llava-onevision
- japanese
- siglip
- llm-jp
- finance
- multimodal
base_model:
- llm-jp/llm-jp-4-8b-instruct
- google/siglip2-so400m-patch14-384
datasets:
- shunk031/STAIR-Captions
- Yana/ft-llm-2026-ocr-dataset
- Yana/ft-llm-2026-qa-dataset
- llm-jp/ja-vg-vqa-conversation
- SakanaAI/JA-VG-VQA-500
---
# COMPASS-VLM Phase 1
**Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension**
(推論強化と文書読解の統合による日本語金融VLMの開発)
This model is the **Phase 1 checkpoint** of the COMPASS project — a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning. It serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.
Developed by [Atsushi Yanagisawa](https://atsushiyanaigsawa768.github.io/mysite/en/) and [Genshin Kakimoto](https://github.com/kakimoto0225) as part of the FT-LLM 2026 free-form task.
- 📦 **Code**: [github.com/AtsushiYanaigsawa768/Compass](https://github.com/AtsushiYanaigsawa768/Compass)
- 📚 **Collection**: [Yana/compass](https://huggingface.co/collections/Yana/compass)
- 📝 **Blog (EN)**: [atsushiyanaigsawa768.github.io/mysite/en/blog/compass](https://atsushiyanaigsawa768.github.io/mysite/en/blog/compass/)
---
## Model Details
| Item | Value |
|------|-------|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see [License](#license)) |
### Architecture
```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```
| Component | Model | Role in Phase 1 |
|-----------|-------|-----------------|
| Vision Encoder | `google/siglip2-so400m-patch14-384` | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~21M params (sketched below) | Trainable in both stages |
| LLM | `llm-jp/llm-jp-4-8b-instruct` (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
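Since the projector is the only component built from scratch in Phase 1, a minimal PyTorch sketch may help. It mirrors the Linear → GELU → Linear stack from the table above; the class name and constructor defaults are ours, not the repository's:
```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps SigLIP-v2 patch features (1152-d)
    into the LLM embedding space (4096-d), per the table above."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)
```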
---
## Training Procedure
Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.
### Stage 1-1 — Image Caption Pretraining
- **Goal**: Align vision tokens with the LLM embedding space.
- **Trainable**: MLP projector only (see the freezing sketch after this list).
- **Datasets**:
  - STAIR Captions (license_id = 4 only; one of the five reference captions per image is sampled at random, giving 5× effective caption diversity)
  - [Yana/ft-llm-2026-ocr-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-ocr-dataset)
- **Learning rate**: 1e-3 · **Epochs**: 2 · **Effective batch size**: 128
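A minimal sketch of the Stage 1-1 freezing scheme, assuming the assembled model exposes the projector as a `projector` attribute (that attribute name, and the helper itself, are hypothetical; the actual training script is in the GitHub repository):
```python
import torch

def configure_stage_1_1(model):
    """Freeze everything except the MLP projector for caption pretraining.
    The `projector` attribute name is an assumption, not the repo's API."""
    for p in model.parameters():
        p.requires_grad = False   # vision encoder and LLM stay frozen
    for p in model.projector.parameters():
        p.requires_grad = True    # only the projector receives gradients
    # Stage 1-1 learning rate from the recipe above.
    return torch.optim.AdamW(model.projector.parameters(), lr=1e-3)
```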
### Stage 1-2 — Visual Instruction Tuning
- **Goal**: Enable VQA and instruction following in Japanese.
- **Trainable**: MLP projector + LLM (via LoRA, r = 64, α = 128) + Vision Encoder (lr = 2e-6); see the LoRA sketch after this list.
- **Datasets**:
  - [Yana/ft-llm-2026-qa-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-qa-dataset)
  - [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation) (~90k conversations over Visual Genome images)
  - [SakanaAI/JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
- **Learning rate**: 2e-5 · **Epochs**: 1 · **Effective batch size**: 128
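The LoRA setup can be sketched with PEFT, taking r and α from the recipe above; `target_modules` and the dropout value are assumptions (typical attention projections), not repository-confirmed settings:
```python
from peft import LoraConfig, get_peft_model

def add_stage_1_2_lora(llm):
    """Attach LoRA adapters (r = 64, alpha = 128) to the LLM backbone."""
    config = LoraConfig(
        r=64,
        lora_alpha=128,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
        lora_dropout=0.05,  # assumed; not stated in the recipe
        task_type="CAUSAL_LM",
    )
    llm = get_peft_model(llm, config)
    llm.print_trainable_parameters()
    return llm
```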
### Common Hyperparameters
| Parameter | Value |
|-----------|-------|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |
Training uses NCCL and supports `torchrun`, SLURM, and OpenMPI. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.
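Expressed as Hugging Face `TrainingArguments`, the table corresponds to roughly the following (the output path is a placeholder, and the learning rate shown is the Stage 1-2 value):
```python
from transformers import TrainingArguments

# Effective batch size = 2 (per device) × 64 (accumulation) = 128 on one GPU,
# matching the table above.
args = TrainingArguments(
    output_dir="./compass-phase1",      # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=64,
    learning_rate=2e-5,                 # 1e-3 in Stage 1-1
    num_train_epochs=1,                 # 2 in Stage 1-1
    warmup_ratio=0.03,
    weight_decay=0.0,
    bf16=True,
    gradient_checkpointing=True,
    seed=42,
)
```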
---
## Chat Template
The model uses the LLM-JP v4 instruct template:
```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。
### 指示:
<image>
この画像を見て、質問に答えてください。
{user_question}
### 応答:
{assistant_answer}<|eos|>
```
Special tokens:
| Token | Purpose |
|-------|---------|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<|eos|>` | End-of-turn token |
Typical prompts used during training:
- Stage 1-1 caption prompt: `この画像を端的に説明してください。` ("Please briefly describe this image.")
- Stage 1-2 VQA prompt: `この画像を見て、質問に答えてください。` ("Look at this image and answer the question.")
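For reference, a small helper that assembles the template into a prompt string; the function name is ours, and the exact blank-line placement is our reading of the template rather than an official API:
```python
def build_prompt(question: str) -> str:
    """Render the LLM-JP v4 instruct template for a single VQA turn."""
    return (
        "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
        "### 指示:\n"
        "<image>\n"
        "この画像を見て、質問に答えてください。\n"
        f"{question}\n\n"
        "### 応答:\n"
    )

prompt = build_prompt("この画像には何が写っていますか?")  # "What is in this image?"
```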
---
## Intended Use
### Direct Use
- Japanese image captioning
- Japanese visual question answering (VQA)
- Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)
### Downstream Use
This checkpoint is intended as the starting point for:
- **Phase 2** — reasoning enhancement via SFT + DPO distilled from Qwen3-30B → [Yana/compass-vlm-phase2](https://huggingface.co/Yana/compass-vlm-phase2)
- **Phase 3** — Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → [Yana/compass-vlm](https://huggingface.co/Yana/compass-vlm)
### Out-of-Scope Use
- High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
- Generation of factual claims without verification; the model can hallucinate.
- Use in languages other than Japanese and English is not evaluated.
---
## Evaluation
Phase 1 is evaluated qualitatively, by inspecting automatically generated raw outputs on:
- STAIR Captions **License ID 5** held-out samples
- OCR held-out samples from the training OCR corpus
Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than Phase 1 alone. See the project repository for numbers.
---
## Limitations and Biases
- The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
- The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
- OCR quality on small-font or low-resolution documents is limited.
- This Phase 1 checkpoint has **not** received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than the final COMPASS model.
---
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Yana/compass-vlm-phase1"

# The checkpoint ships custom model code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the model was trained in BF16
    device_map="auto",
    trust_remote_code=True,
)
```
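Continuing from the snippet above, a text-side sketch of one generation call. Injecting actual image features requires the repository's preprocessing, so treat this as illustrative rather than a complete multimodal call:
```python
prompt = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n<image>\nこの画像を見て、質問に答えてください。\n"
    "この画像には何が写っていますか?\n\n### 応答:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Without image features the <image> placeholder is not expanded; see below
# for the full pipeline.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```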
For the full inference pipeline (image preprocessing with SigLIP-v2, `<image>` token expansion, and AnyRes handling), please refer to the [`phase1/` directory](https://github.com/AtsushiYanaigsawa768/Compass/tree/main/phase1) in the GitHub repository.
---
## Citation
If you use this model, please cite the COMPASS project:
```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through
                  Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```
Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa) as appropriate.
---
## License
This model is released under the **Apache License 2.0**.
**Note on training data and Japanese copyright law:**
Under **Article 30-4 of the Japanese Copyright Act**, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.
Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.
---
## Acknowledgements
Built on top of outstanding open-source work, including:
- [LLM-JP-4-8B-Instruct](https://huggingface.co/llm-jp/llm-jp-4-8b-instruct)
- [SigLIP-v2](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [LLaVA-1.5](https://arxiv.org/abs/2310.03744) and [LLaVA-OneVision](https://arxiv.org/abs/2408.03326)
- [LLaVA-JP](https://github.com/tosiyuki/LLaVA-JP)
- [STAIR Captions](https://huggingface.co/datasets/shunk031/STAIR-Captions) and [ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)