---
license: apache-2.0
language:
- ja
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- vlm
- llava
- llava-onevision
- japanese
- siglip
- llm-jp
- finance
- multimodal
base_model:
- llm-jp/llm-jp-4-8b-instruct
- google/siglip2-so400m-patch14-384
datasets:
- shunk031/STAIR-Captions
- Yana/ft-llm-2026-ocr-dataset
- Yana/ft-llm-2026-qa-dataset
- llm-jp/ja-vg-vqa-conversation
- SakanaAI/JA-VG-VQA-500
---

# COMPASS-VLM Phase 1

**Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension**
(推論強化と文書読解の統合による日本語金融VLMの開発)

This model is the **Phase 1 checkpoint** of the COMPASS project, a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning. It serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.

Developed by [Atsushi Yanagisawa](https://atsushiyanaigsawa768.github.io/mysite/en/) and [Genshin Kakimoto](https://github.com/kakimoto0225) as part of the FT-LLM 2026 free-form task.

- 📦 **Code**: [github.com/AtsushiYanaigsawa768/Compass](https://github.com/AtsushiYanaigsawa768/Compass)
- 📚 **Collection**: [Yana/compass](https://huggingface.co/collections/Yana/compass)
- 📝 **Blog (EN)**: [atsushiyanaigsawa768.github.io/mysite/en/blog/compass](https://atsushiyanaigsawa768.github.io/mysite/en/blog/compass/)

---

## Model Details

| Item | Value |
|------|-------|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see [License](#license)) |

### Architecture

```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```

| Component | Model | Role in Phase 1 |
|-----------|-------|-----------------|
| Vision Encoder | `google/siglip2-so400m-patch14-384` | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~21M params | Trainable in both stages |
| LLM | `llm-jp/llm-jp-4-8b-instruct` (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
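
For reference, the projector described in the table corresponds roughly to the following PyTorch module. This is an illustrative sketch using the stated dimensions; the class and attribute names are not taken from the released implementation.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Illustrative two-layer MLP mapping SigLIP-v2 patch features (1152-d)
    into the LLM embedding space (4096-d), as described in the table above.
    Parameter count: 1152*4096 + 4096 + 4096*4096 + 4096 ≈ 21.5M."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, 1152) from the vision encoder
        return self.fc2(self.act(self.fc1(patch_features)))
```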

---

## Training Procedure

Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.

### Stage 1-1: Image Caption Pretraining

- **Goal**: Align vision tokens with the LLM embedding space.
- **Trainable**: MLP projector only (see the sketch after this list).
- **Datasets**:
  - STAIR Captions (license_id = 4 only, with multi-caption random sampling providing 5× effective diversity)
  - [Yana/ft-llm-2026-ocr-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-ocr-dataset)
- **Learning rate**: 1e-3 · **Epochs**: 2 · **Effective batch size**: 128
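
A minimal sketch of the Stage 1-1 freezing scheme, assuming a plain PyTorch model object with a `projector` attribute (the attribute name is hypothetical; the actual logic lives in the training code in the GitHub repository):

```python
import torch.nn as nn


def freeze_for_stage_1_1(model: nn.Module) -> None:
    """Stage 1-1: train the MLP projector only; vision encoder and LLM stay frozen.
    `projector` is a hypothetical attribute name, not the released code's."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
```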

### Stage 1-2: Visual Instruction Tuning

- **Goal**: Enable VQA and instruction following in Japanese.
- **Trainable**: MLP projector + LLM (via LoRA, r = 64, α = 128) + Vision Encoder (lr = 2e-6); a LoRA configuration sketch follows this list.
- **Datasets**:
  - [Yana/ft-llm-2026-qa-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-qa-dataset)
  - [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation) (~90k conversations on Visual Genome images)
  - [SakanaAI/JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
- **Learning rate**: 2e-5 · **Epochs**: 1 · **Effective batch size**: 128
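
A hedged sketch of a matching LoRA setup with the Hugging Face PEFT library is shown below; the `target_modules` list is an assumption about typical attention-projection names and is not taken from the COMPASS training code.

```python
from peft import LoraConfig, get_peft_model

# LoRA settings matching Stage 1-2 (r = 64, alpha = 128).
# target_modules is an assumption about Llama-style projection names;
# the actual module list is defined in the project's training code.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# `language_model` is a hypothetical attribute holding the LLM backbone:
# peft_llm = get_peft_model(model.language_model, lora_config)
#
# The vision encoder trains with its own, much smaller learning rate (2e-6),
# typically expressed as a separate optimizer parameter group.
```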

### Common Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |

Training uses NCCL and supports `torchrun`, SLURM, and OpenMPI. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.
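
For orientation, the shared settings above map roughly onto Hugging Face `TrainingArguments` as sketched below. This assumes the HF Trainer is used; `output_dir` and the stage-specific learning rate and epoch count (shown here with Stage 1-2 values) are illustrative, not the project's actual launch configuration.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./compass-phase1-stage1-2",   # illustrative path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=64,           # 2 x 64 = 128, matching the stated effective batch size on a single GPU
    learning_rate=2e-5,                       # Stage 1-2 value; Stage 1-1 uses 1e-3
    num_train_epochs=1,
    warmup_ratio=0.03,
    weight_decay=0.0,
    bf16=True,
    gradient_checkpointing=True,
    seed=42,
)
# The 2048-token max sequence length is enforced during tokenization/collation,
# not through TrainingArguments.
```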

---

## Chat Template

The model uses the LLM-JP v4 instruct template:

```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
<image>
この画像を見て、質問に答えてください。
{user_question}

### 応答:
{assistant_answer}<|eos|>
```

The fixed preamble translates to "Below is an instruction that describes a task. Write a response that appropriately fulfills the request."; `### 指示:` marks the instruction turn and `### 応答:` marks the response turn.

Special tokens:

| Token | Purpose |
|-------|---------|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<|eos|>` | End-of-turn token |

Typical prompts used during training:

- Stage 1-1 caption prompt: `この画像を端的に説明してください。` ("Please briefly describe this image.")
- Stage 1-2 VQA prompt: `この画像を見て、質問に答えてください。` ("Look at this image and answer the question.")
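
As a minimal illustration, a single-turn Stage 1-2 style prompt can be assembled from the template above as follows; the helper function and the example question are ours, not part of the released code.

```python
# Assemble a single-turn VQA prompt following the template above.
SYSTEM_PREAMBLE = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
VQA_PROMPT = "この画像を見て、質問に答えてください。"


def build_vqa_prompt(user_question: str) -> str:
    """Return the text fed to the tokenizer; '<image>' is later replaced
    by vision embeddings inside the model."""
    return (
        f"{SYSTEM_PREAMBLE}\n\n"
        f"### 指示:\n<image>\n{VQA_PROMPT}\n{user_question}\n\n"
        f"### 応答:\n"
    )


print(build_vqa_prompt("この画像には何が写っていますか？"))  # "What is shown in this image?"
```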

---

## Intended Use

### Direct Use

- Japanese image captioning
- Japanese visual question answering (VQA)
- Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)

### Downstream Use

This checkpoint is specifically intended to be continued into:

- **Phase 2**: reasoning enhancement via SFT + DPO distilled from Qwen3-30B → [Yana/compass-vlm-phase2](https://huggingface.co/Yana/compass-vlm-phase2)
- **Phase 3**: Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → [Yana/compass-vlm](https://huggingface.co/Yana/compass-vlm)

### Out-of-Scope Use

- High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
- Generation of factual claims without verification; the model can hallucinate.
- Use in languages other than Japanese and English is not evaluated.

---

## Evaluation

Phase 1 is evaluated qualitatively by inspecting automatically generated raw outputs on:

- STAIR Captions **License ID 5** held-out samples
- OCR held-out samples from the training OCR corpus

Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than Phase 1 alone. See the project repository for numbers.

---

## Limitations and Biases

- The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
- The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
- OCR quality on small-font or low-resolution documents is limited.
- This Phase 1 checkpoint has **not** received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than the final COMPASS model.

---

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch

model_id = "Yana/compass-vlm-phase1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

For the full inference pipeline (image preprocessing with SigLIP-v2, `<image>` token expansion, and AnyRes handling), please refer to the [`phase1/` directory](https://github.com/AtsushiYanaigsawa768/Compass/tree/main/phase1) in the GitHub repository.

---

## Citation

If you use this model, please cite the COMPASS project:

```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through
                  Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```

Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa) as appropriate.

---

## License

This model is released under the **Apache License 2.0**.

**Note on training data and Japanese copyright law:**
Under **Article 30-4 of the Japanese Copyright Act**, the use of copyrighted works for the purpose of information analysis, including machine learning model training, is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.

Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.

---

## Acknowledgements

Built on top of outstanding open-source work, including:

- [LLM-JP-4-8B-Instruct](https://huggingface.co/llm-jp/llm-jp-4-8b-instruct)
- [SigLIP-v2](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [LLaVA-1.5](https://arxiv.org/abs/2310.03744) and [LLaVA-OneVision](https://arxiv.org/abs/2408.03326)
- [LLaVA-JP](https://github.com/tosiyuki/LLaVA-JP)
- [STAIR Captions](https://huggingface.co/datasets/shunk031/STAIR-Captions) and [ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)