---
license: apache-2.0
language:
- ja
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- vlm
- llava
- llava-onevision
- japanese
- siglip
- llm-jp
- finance
- multimodal
base_model:
- llm-jp/llm-jp-4-8b-instruct
- google/siglip2-so400m-patch14-384
datasets:
- shunk031/STAIR-Captions
- Yana/ft-llm-2026-ocr-dataset
- Yana/ft-llm-2026-qa-dataset
- llm-jp/ja-vg-vqa-conversation
- SakanaAI/JA-VG-VQA-500
---

# COMPASS-VLM Phase 1

**Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension**
(推論強化と文書読解の統合による日本語金融VLMの開発)

This model is the **Phase 1 checkpoint** of the COMPASS project — a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning. It serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.

Developed by [Atsushi Yanagisawa](https://atsushiyanaigsawa768.github.io/mysite/en/) and [Genshin Kakimoto](https://github.com/kakimoto0225) as part of the FT-LLM 2026 free-form task.

- 📦 **Code**: [github.com/AtsushiYanaigsawa768/Compass](https://github.com/AtsushiYanaigsawa768/Compass)
- 📚 **Collection**: [Yana/compass](https://huggingface.co/collections/Yana/compass)
- 📝 **Blog (EN)**: [atsushiyanaigsawa768.github.io/mysite/en/blog/compass](https://atsushiyanaigsawa768.github.io/mysite/en/blog/compass/)

---

## Model Details

| Item | Value |
|------|-------|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see [License](#license)) |

### Architecture

```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```

| Component | Model | Role in Phase 1 |
|-----------|-------|-----------------|
| Vision Encoder | `google/siglip2-so400m-patch14-384` | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~21.5M params | Trainable in both stages |
| LLM | `llm-jp/llm-jp-4-8b-instruct` (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
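
The projector described in the table above can be sketched in PyTorch (a minimal illustration, not the project's actual implementation; 1152 is the SigLIP-so400m feature width and 4096 is assumed to match the LLM-JP-4-8B embedding dimension):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps SigLIP vision features (1152-d) into the LLM embedding space (4096-d)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, 1152) -> (batch, num_patches, 4096)
        return self.proj(vision_feats)

proj = MLPProjector()
```

The projected patch embeddings are then spliced into the LLM input sequence at the `<image>` placeholder position.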

---

## Training Procedure

Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.

### Stage 1-1 — Image Caption Pretraining

- **Goal**: Align vision tokens with the LLM embedding space.
- **Trainable**: MLP projector only.
- **Datasets**:
  - STAIR Captions (license_id = 4 only, with multi-caption random sampling providing 5× effective diversity)
  - [Yana/ft-llm-2026-ocr-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-ocr-dataset)
- **Learning rate**: 1e-3 · **Epochs**: 2 · **Effective batch size**: 128
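
The multi-caption random sampling mentioned above can be sketched as follows (a hypothetical illustration: each STAIR Captions image carries five reference captions, and drawing one at random per pass means repeated epochs see different targets):

```python
import random

def sample_caption(captions: list[str], rng: random.Random) -> str:
    """Pick one of the (typically five) reference captions at random,
    so each epoch over the same image can train on a different target."""
    return rng.choice(captions)

rng = random.Random(42)
captions = [
    "犬が走っている。",
    "芝生の上を犬が駆けている。",
    "一匹の犬が外にいる。",
    "犬が遊んでいる。",
    "屋外にいる犬の写真。",
]
chosen = sample_caption(captions, rng)
```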

### Stage 1-2 — Visual Instruction Tuning

- **Goal**: Enable VQA and instruction following in Japanese.
- **Trainable**: MLP projector + LLM (via LoRA, r = 64, α = 128) + Vision Encoder (lr = 2e-6).
- **Datasets**:
  - [Yana/ft-llm-2026-qa-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-qa-dataset)
  - [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation) (~90k conversations grounded in Visual Genome images)
  - [SakanaAI/JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
- **Learning rate**: 2e-5 · **Epochs**: 1 · **Effective batch size**: 128
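
The LoRA setup above (r = 64, α = 128) can be illustrated in plain PyTorch (a minimal sketch of the low-rank update W' = W + (α/r)·BA, not the project's training code; the init scheme and layer shapes here are assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update,
    sketching the Stage 1-2 setup (r=64, alpha=128 from the card)."""

    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the LLM weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r  # = 2.0 with the card's values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # lora_B starts at zero, so the adapter is a no-op before training.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
```

Only `lora_A` and `lora_B` receive gradients, which is what keeps Stage 1-2 feasible without full-parameter LLM training.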

### Common Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |

Training uses NCCL and supports `torchrun`, SLURM, and OpenMPI. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.
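
The effective batch size of 128 quoted in both stages follows directly from the table above (assuming a single GPU, since the card does not state a world size):

```python
per_device_batch = 2     # from the hyperparameter table
grad_accum_steps = 64    # from the hyperparameter table
num_gpus = 1             # assumption: not stated in the card

effective_batch = per_device_batch * grad_accum_steps * num_gpus  # 128
```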

---

## Chat Template

The model uses the LLM-JP v4 instruct template:

```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
<image>
この画像を見て、質問に答えてください。
{user_question}

### 応答:
{assistant_answer}<|eos|>
```

Special tokens:

| Token | Purpose |
|-------|---------|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<|eos|>` | End-of-turn token |

Typical prompts used during training:

- Stage 1-1 caption prompt: `この画像を端的に説明してください。` ("Please briefly describe this image.")
- Stage 1-2 VQA prompt: `この画像を見て、質問に答えてください。` ("Look at this image and answer the question.")
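
Assembling the template above in code looks roughly like this (a minimal sketch; the helper name is hypothetical, and the model's vision embeddings replace `<image>` at inference time):

```python
SYSTEM = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
VQA_PROMPT = "この画像を見て、質問に答えてください。"

def build_prompt(user_question: str) -> str:
    """Build the LLM-JP v4 instruct prompt with an <image> placeholder.
    The assistant's answer is generated after the final '### 応答:' header."""
    return (
        f"{SYSTEM}\n\n"
        "### 指示:\n"
        "<image>\n"
        f"{VQA_PROMPT}\n"
        f"{user_question}\n\n"
        "### 応答:\n"
    )

prompt = build_prompt("この画像には何が写っていますか?")
```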

---

## Intended Use

### Direct Use

- Japanese image captioning
- Japanese visual question answering (VQA)
- Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)

### Downstream Use

This checkpoint is specifically intended to be continued into:

- **Phase 2** — reasoning enhancement via SFT + DPO distilled from Qwen3-30B → [Yana/compass-vlm-phase2](https://huggingface.co/Yana/compass-vlm-phase2)
- **Phase 3** — Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → [Yana/compass-vlm](https://huggingface.co/Yana/compass-vlm)

### Out-of-Scope Use

- High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
- Generation of factual claims without verification; the model can hallucinate.
- Use in languages other than Japanese and English is not evaluated.

---

## Evaluation

Phase 1 is evaluated qualitatively by inspecting raw model outputs on:

- STAIR Captions **License ID 5** held-out samples
- OCR held-out samples from the training OCR corpus

Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than Phase 1 alone. See the project repository for numbers.

---

## Limitations and Biases

- The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
- The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
- OCR quality on small-font or low-resolution documents is limited.
- This Phase 1 checkpoint has **not** received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than the final COMPASS model.

---

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Yana/compass-vlm-phase1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

For the full inference pipeline (image preprocessing with SigLIP-v2, `<image>` token expansion, and AnyRes handling), please refer to the [`phase1/` directory](https://github.com/AtsushiYanaigsawa768/Compass/tree/main/phase1) in the GitHub repository.

---

## Citation

If you use this model, please cite the COMPASS project:

```bibtex
@misc{compass2026,
  title  = {COMPASS: Development of a Japanese Financial VLM through
            Integration of Reasoning Enhancement and Document Comprehension},
  author = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year   = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note   = {FT-LLM 2026 free-form task}
}
```

Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa) as appropriate.

---

## License

This model is released under the **Apache License 2.0**.

**Note on training data and Japanese copyright law:**
Under **Article 30-4 of the Japanese Copyright Act**, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.

Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.

---

## Acknowledgements

Built on top of outstanding open-source work, including:

- [LLM-JP-4-8B-Instruct](https://huggingface.co/llm-jp/llm-jp-4-8b-instruct)
- [SigLIP-v2](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [LLaVA-1.5](https://arxiv.org/abs/2310.03744) and [LLaVA-OneVision](https://arxiv.org/abs/2408.03326)
- [LLaVA-JP](https://github.com/tosiyuki/LLaVA-JP)
- [STAIR Captions](https://huggingface.co/datasets/shunk031/STAIR-Captions) and [ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)