---
license: apache-2.0
language:
- ja
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- vlm
- llava
- llava-onevision
- japanese
- siglip
- llm-jp
- finance
- multimodal
base_model:
- llm-jp/llm-jp-4-8b-instruct
- google/siglip2-so400m-patch14-384
datasets:
- shunk031/STAIR-Captions
- Yana/ft-llm-2026-ocr-dataset
- Yana/ft-llm-2026-qa-dataset
- llm-jp/ja-vg-vqa-conversation
- SakanaAI/JA-VG-VQA-500
---

# COMPASS-VLM Phase 1

**Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension**
(推論強化と文書読解の統合による日本語金融VLMの開発)

This model is the **Phase 1 checkpoint** of the COMPASS project: a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning, and serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.

Developed by [Atsushi Yanagisawa](https://atsushiyanaigsawa768.github.io/mysite/en/) and [Genshin Kakimoto](https://github.com/kakimoto0225) as part of the FT-LLM 2026 free-form task.

- 📦 **Code**: [github.com/AtsushiYanaigsawa768/Compass](https://github.com/AtsushiYanaigsawa768/Compass)
- 📚 **Collection**: [Yana/compass](https://huggingface.co/collections/Yana/compass)
- 📝 **Blog (EN)**: [atsushiyanaigsawa768.github.io/mysite/en/blog/compass](https://atsushiyanaigsawa768.github.io/mysite/en/blog/compass/)

---

## Model Details

| Item | Value |
|------|-------|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see [License](#license)) |

### Architecture

```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```

| Component | Model | Role in Phase 1 |
|-----------|-------|-----------------|
| Vision Encoder | `google/siglip2-so400m-patch14-384` | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~8M params | Trainable in both stages |
| LLM | `llm-jp/llm-jp-4-8b-instruct` (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
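The projector is compact enough to sketch directly. Below is a minimal PyTorch sketch matching the Linear → GELU → Linear specification in the table; the class name, argument names, and defaults are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Two-layer MLP that maps SigLIP-v2 patch features (dim 1152) into the
    LLM embedding space (dim 4096). Illustrative sketch based on the spec in
    the table above; names and defaults are not the repository's actual API.
    """

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the encoder;
        # the output is consumed as soft "vision tokens" by the LLM.
        return self.proj(vision_features)
```

In Stage 1-1 this module is the only trainable component (see Training Procedure below), which keeps the alignment step cheap relative to full fine-tuning.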
---

## Training Procedure

Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.

### Stage 1-1: Image Caption Pretraining

- **Goal**: Align vision tokens with the LLM embedding space.
- **Trainable**: MLP projector only.
- **Datasets**:
  - STAIR Captions (license_id = 4 only, with multi-caption random sampling providing 5× effective diversity)
  - [Yana/ft-llm-2026-ocr-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-ocr-dataset)
- **Learning rate**: 1e-3 · **Epochs**: 2 · **Effective batch size**: 128

### Stage 1-2: Visual Instruction Tuning

- **Goal**: Enable VQA and instruction following in Japanese.
- **Trainable**: MLP projector + LLM (via LoRA, r = 64, α = 128) + vision encoder (lr = 2e-6).
- **Datasets**:
  - [Yana/ft-llm-2026-qa-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-qa-dataset)
  - [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation) (~90k conversations grounded in Visual Genome images)
  - [SakanaAI/JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
- **Learning rate**: 2e-5 · **Epochs**: 1 · **Effective batch size**: 128

### Common Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |

On a single GPU, the effective batch size of 128 follows from the per-device batch size of 2 multiplied by 64 gradient accumulation steps. Training uses NCCL and supports the `torchrun`, SLURM, and OpenMPI launchers. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.

---

## Chat Template

The model uses the LLM-JP v4 instruct template:

```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
この画像を見て、質問に答えてください。
{user_question}

### 応答:
{assistant_answer}<|eos|>
```

(The preamble translates to: "Below is an instruction that describes a task. Write a response that appropriately satisfies the request.")

Special tokens:

| Token | Purpose |
|-------|---------|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<\|eos\|>` | End-of-turn token |

Typical prompts used during training (assembled into the full template as in the sketch below):

- Stage 1-1 caption prompt: `この画像を端的に説明してください。` ("Please briefly describe this image.")
- Stage 1-2 VQA prompt: `この画像を見て、質問に答えてください。` ("Look at this image and answer the question.")
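To make the template concrete, here is a minimal Python sketch that assembles a single-turn VQA prompt in this format. The function name is hypothetical, and the placement of the `<image>` placeholder is an assumption borrowed from common LLaVA-style preprocessing; the code in the repository's `phase1/` directory is authoritative:

```python
def build_vqa_prompt(user_question: str, include_image_token: bool = True) -> str:
    """Assemble a single-turn VQA prompt in the LLM-JP v4 instruct format
    shown above. Hypothetical helper for illustration only."""
    system = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    instruction = "この画像を見て、質問に答えてください。\n" + user_question
    if include_image_token:
        # Assumption: the <image> placeholder is prepended to the instruction,
        # as in common LLaVA-style preprocessing; verify the exact position
        # against the phase1/ code before relying on it.
        instruction = "<image>\n" + instruction
    # Generation begins after "### 応答:" and stops at the <|eos|> token.
    return f"{system}\n\n### 指示:\n{instruction}\n\n### 応答:\n"


print(build_vqa_prompt("この画像には何が写っていますか?"))
```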
---

## Intended Use

### Direct Use

- Japanese image captioning
- Japanese visual question answering (VQA)
- Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)

### Downstream Use

This checkpoint is specifically intended to be continued into:

- **Phase 2**: reasoning enhancement via SFT + DPO distilled from Qwen3-30B → [Yana/compass-vlm-phase2](https://huggingface.co/Yana/compass-vlm-phase2)
- **Phase 3**: Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → [Yana/compass-vlm](https://huggingface.co/Yana/compass-vlm)

### Out-of-Scope Use

- High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
- Generation of factual claims without verification; the model can hallucinate.
- Use in languages other than Japanese and English is not evaluated.

---

## Evaluation

Phase 1 is evaluated qualitatively via automatically generated raw outputs on:

- STAIR Captions **license_id = 5** held-out samples
- OCR held-out samples from the training OCR corpus

Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than for Phase 1 alone; see the project repository for numbers.

---

## Limitations and Biases

- The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
- The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
- OCR quality on small-font or low-resolution documents is limited.
- This Phase 1 checkpoint has **not** received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than that of the final COMPASS model.

---

## How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yana/compass-vlm-phase1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

For the full inference pipeline (image preprocessing with SigLIP-v2, `<image>` token expansion, and AnyRes handling), please refer to the [`phase1/` directory](https://github.com/AtsushiYanaigsawa768/Compass/tree/main/phase1) in the GitHub repository.

---

## Citation

If you use this model, please cite the COMPASS project:

```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```

Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa-conversation) as appropriate.

---

## License

This model is released under the **Apache License 2.0**.

**Note on training data and Japanese copyright law:** Under **Article 30-4 of the Japanese Copyright Act**, the use of copyrighted works for the purpose of information analysis, including machine learning model training, is permitted without authorization from, and without triggering the license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0. Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.

---

## Acknowledgements

Built on top of outstanding open-source work, including:

- [LLM-JP-4-8B-Instruct](https://huggingface.co/llm-jp/llm-jp-4-8b-instruct)
- [SigLIP-v2](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [LLaVA-1.5](https://arxiv.org/abs/2310.03744) and [LLaVA-OneVision](https://arxiv.org/abs/2408.03326)
- [LLaVA-JP](https://github.com/tosiyuki/LLaVA-JP)
- [STAIR Captions](https://huggingface.co/datasets/shunk031/STAIR-Captions) and [ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)