---
license: apache-2.0
language:
- ja
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- vlm
- llava
- llava-onevision
- japanese
- siglip
- llm-jp
- finance
- multimodal
base_model:
- llm-jp/llm-jp-4-8b-instruct
- google/siglip2-so400m-patch14-384
datasets:
- shunk031/STAIR-Captions
- Yana/ft-llm-2026-ocr-dataset
- Yana/ft-llm-2026-qa-dataset
- llm-jp/ja-vg-vqa-conversation
- SakanaAI/JA-VG-VQA-500
---
# COMPASS-VLM Phase 1
**Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension**
(推論強化と文書読解の統合による日本語金融VLMの開発)
This model is the **Phase 1 checkpoint** of the COMPASS project — a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning. It serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.
Developed by [Atsushi Yanagisawa](https://atsushiyanaigsawa768.github.io/mysite/en/) and [Genshin Kakimoto](https://github.com/kakimoto0225) as part of the FT-LLM 2026 free-form task.
- 📦 **Code**: [github.com/AtsushiYanaigsawa768/Compass](https://github.com/AtsushiYanaigsawa768/Compass)
- 📚 **Collection**: [Yana/compass](https://huggingface.co/collections/Yana/compass)
- 📝 **Blog (EN)**: [atsushiyanaigsawa768.github.io/mysite/en/blog/compass](https://atsushiyanaigsawa768.github.io/mysite/en/blog/compass/)
---
## Model Details
| Item | Value |
|------|-------|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see [License](#license)) |
### Architecture
```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```
| Component | Model | Role in Phase 1 |
|-----------|-------|-----------------|
| Vision Encoder | `google/siglip2-so400m-patch14-384` | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~21M params (sketched below) | Trainable in both stages |
| LLM | `llm-jp/llm-jp-4-8b-instruct` (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
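Since the projector is the only component built from scratch in Phase 1, a minimal PyTorch sketch may help. It mirrors the Linear → GELU → Linear stack from the table above; the class name and constructor defaults are ours, not the repository's:
```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps SigLIP-v2 patch features (1152-d)
    into the LLM embedding space (4096-d), per the table above."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)
```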
---
## Training Procedure
Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.
### Stage 1-1 — Image Caption Pretraining
- **Goal**: Align vision tokens with the LLM embedding space.
- **Trainable**: MLP projector only (see the freezing sketch after this list).
- **Datasets**:
  - STAIR Captions (license_id = 4 only; one of the five reference captions per image is sampled at random, giving 5× effective caption diversity)
  - [Yana/ft-llm-2026-ocr-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-ocr-dataset)
- **Learning rate**: 1e-3 · **Epochs**: 2 · **Effective batch size**: 128
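A minimal sketch of the Stage 1-1 freezing scheme, assuming the assembled model exposes the projector as a `projector` attribute (that attribute name, and the helper itself, are hypothetical; the actual training script is in the GitHub repository):
```python
import torch

def configure_stage_1_1(model):
    """Freeze everything except the MLP projector for caption pretraining.
    The `projector` attribute name is an assumption, not the repo's API."""
    for p in model.parameters():
        p.requires_grad = False   # vision encoder and LLM stay frozen
    for p in model.projector.parameters():
        p.requires_grad = True    # only the projector receives gradients
    # Stage 1-1 learning rate from the recipe above.
    return torch.optim.AdamW(model.projector.parameters(), lr=1e-3)
```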
### Stage 1-2 — Visual Instruction Tuning
- **Goal**: Enable VQA and instruction following in Japanese.
- **Trainable**: MLP projector + LLM (via LoRA, r = 64, α = 128) + Vision Encoder (lr = 2e-6); see the LoRA sketch after this list.
- **Datasets**:
  - [Yana/ft-llm-2026-qa-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-qa-dataset)
  - [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation) (~90k conversations over Visual Genome images)
  - [SakanaAI/JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
- **Learning rate**: 2e-5 · **Epochs**: 1 · **Effective batch size**: 128
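The LoRA setup can be sketched with PEFT, taking r and α from the recipe above; `target_modules` and the dropout value are assumptions (typical attention projections), not repository-confirmed settings:
```python
from peft import LoraConfig, get_peft_model

def add_stage_1_2_lora(llm):
    """Attach LoRA adapters (r = 64, alpha = 128) to the LLM backbone."""
    config = LoraConfig(
        r=64,
        lora_alpha=128,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
        lora_dropout=0.05,  # assumed; not stated in the recipe
        task_type="CAUSAL_LM",
    )
    llm = get_peft_model(llm, config)
    llm.print_trainable_parameters()
    return llm
```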
### Common Hyperparameters
| Parameter | Value |
|-----------|-------|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |
Training uses NCCL and supports `torchrun`, SLURM, and OpenMPI. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.
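Expressed as Hugging Face `TrainingArguments`, the table corresponds to roughly the following (the output path is a placeholder, and the learning rate shown is the Stage 1-2 value):
```python
from transformers import TrainingArguments

# Effective batch size = 2 (per device) × 64 (accumulation) = 128 on one GPU,
# matching the table above.
args = TrainingArguments(
    output_dir="./compass-phase1",      # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=64,
    learning_rate=2e-5,                 # 1e-3 in Stage 1-1
    num_train_epochs=1,                 # 2 in Stage 1-1
    warmup_ratio=0.03,
    weight_decay=0.0,
    bf16=True,
    gradient_checkpointing=True,
    seed=42,
)
```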
---
## Chat Template
The model uses the LLM-JP v4 instruct template:
```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。
### 指示:
<image>
この画像を見て、質問に答えてください。
{user_question}
### 応答:
{assistant_answer}<|eos|>
```
Special tokens:
| Token | Purpose |
|-------|---------|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<|eos|>` | End-of-turn token |
Typical prompts used during training:
- Stage 1-1 caption prompt: `この画像を端的に説明してください。` ("Please briefly describe this image.")
- Stage 1-2 VQA prompt: `この画像を見て、質問に答えてください。` ("Look at this image and answer the question.")
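For reference, a small helper that assembles the template into a prompt string; the function name is ours, and the exact blank-line placement is our reading of the template rather than an official API:
```python
def build_prompt(question: str) -> str:
    """Render the LLM-JP v4 instruct template for a single VQA turn."""
    return (
        "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
        "### 指示:\n"
        "<image>\n"
        "この画像を見て、質問に答えてください。\n"
        f"{question}\n\n"
        "### 応答:\n"
    )

prompt = build_prompt("この画像には何が写っていますか?")  # "What is in this image?"
```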
---
## Intended Use
### Direct Use
- Japanese image captioning
- Japanese visual question answering (VQA)
- Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)
### Downstream Use
This checkpoint is intended as the starting point for:
- **Phase 2** — reasoning enhancement via SFT + DPO distilled from Qwen3-30B → [Yana/compass-vlm-phase2](https://huggingface.co/Yana/compass-vlm-phase2)
- **Phase 3** — Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → [Yana/compass-vlm](https://huggingface.co/Yana/compass-vlm)
### Out-of-Scope Use
- High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
- Generation of factual claims without verification; the model can hallucinate.
- Use in languages other than Japanese and English is not evaluated.
---
## Evaluation
Phase 1 is evaluated qualitatively, by inspecting automatically generated raw outputs on:
- STAIR Captions **License ID 5** held-out samples
- OCR held-out samples from the training OCR corpus
Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than Phase 1 alone. See the project repository for numbers.
---
## Limitations and Biases
- The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
- The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
- OCR quality on small-font or low-resolution documents is limited.
- This Phase 1 checkpoint has **not** received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than the final COMPASS model.
---
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Yana/compass-vlm-phase1"

# The checkpoint ships custom model code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the model was trained in BF16
    device_map="auto",
    trust_remote_code=True,
)
```
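Continuing from the snippet above, a text-side sketch of one generation call. Injecting actual image features requires the repository's preprocessing, so treat this as illustrative rather than a complete multimodal call:
```python
prompt = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n<image>\nこの画像を見て、質問に答えてください。\n"
    "この画像には何が写っていますか?\n\n### 応答:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Without image features the <image> placeholder is not expanded; see below
# for the full pipeline.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```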
For the full inference pipeline (image preprocessing with SigLIP-v2, `<image>` token expansion, and AnyRes handling), please refer to the [`phase1/` directory](https://github.com/AtsushiYanaigsawa768/Compass/tree/main/phase1) in the GitHub repository.
---
## Citation
If you use this model, please cite the COMPASS project:
```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through
                  Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```
Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa) as appropriate.
---
## License
This model is released under the **Apache License 2.0**.
**Note on training data and Japanese copyright law:**
Under **Article 30-4 of the Japanese Copyright Act**, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.
Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.
---
## Acknowledgements
Built on top of outstanding open-source work, including:
- [LLM-JP-4-8B-Instruct](https://huggingface.co/llm-jp/llm-jp-4-8b-instruct)
- [SigLIP-v2](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [LLaVA-1.5](https://arxiv.org/abs/2310.03744) and [LLaVA-OneVision](https://arxiv.org/abs/2408.03326)
- [LLaVA-JP](https://github.com/tosiyuki/LLaVA-JP)
- [STAIR Captions](https://huggingface.co/datasets/shunk031/STAIR-Captions) and [ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)