---
license: apache-2.0
language:
- ja
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- vlm
- llava
- llava-onevision
- japanese
- siglip
- llm-jp
- finance
- multimodal
base_model:
- llm-jp/llm-jp-4-8b-instruct
- google/siglip2-so400m-patch14-384
datasets:
- shunk031/STAIR-Captions
- Yana/ft-llm-2026-ocr-dataset
- Yana/ft-llm-2026-qa-dataset
- llm-jp/ja-vg-vqa-conversation
- SakanaAI/JA-VG-VQA-500
---

# COMPASS-VLM Phase 1

**Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension**
(推論強化と文書読解の統合による日本語金融VLMの開発)

This model is the **Phase 1 checkpoint** of the COMPASS project — a Japanese Vision-Language Model (VLM) built on a LLaVA-OneVision-style architecture. Phase 1 produces a general-purpose Japanese VLM through image-caption pretraining and visual instruction tuning. It serves as the vision-grounded foundation for the subsequent reasoning enhancement (Phase 2) and financial domain fine-tuning (Phase 3) stages.

Developed by [Atsushi Yanagisawa](https://atsushiyanaigsawa768.github.io/mysite/en/) and [Genshin Kakimoto](https://github.com/kakimoto0225) as part of the FT-LLM 2026 free-form task.

- 📦 **Code**: [github.com/AtsushiYanaigsawa768/Compass](https://github.com/AtsushiYanaigsawa768/Compass)
- 📚 **Collection**: [Yana/compass](https://huggingface.co/collections/Yana/compass)
- 📝 **Blog (EN)**: [atsushiyanaigsawa768.github.io/mysite/en/blog/compass](https://atsushiyanaigsawa768.github.io/mysite/en/blog/compass/)

---

## Model Details

| Item | Value |
|------|-------|
| Model type | Vision-Language Model (LLaVA-OneVision-style) |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English support inherited from the base LLM) |
| License | Apache-2.0 (see [License](#license)) |

### Architecture

```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B-Instruct ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```

| Component | Model | Role in Phase 1 |
|-----------|-------|-----------------|
| Vision Encoder | `google/siglip2-so400m-patch14-384` | Frozen in Stage 1-1, trainable (lr = 2e-6) in Stage 1-2 |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096), ~8M params | Trainable in both stages |
| LLM | `llm-jp/llm-jp-4-8b-instruct` (8B) | Frozen by default; trainable via LoRA in Stage 1-2 |
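
For orientation, a minimal PyTorch sketch of the projector row in the table above (dimensions are taken from the table; the class name is hypothetical and the actual implementation lives in the repository's `phase1/` code):

```python
import torch
from torch import nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps SigLIP-v2 patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1152) -> (batch, num_patches, 4096)
        return self.proj(vision_tokens)
```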

---

## Training Procedure

Phase 1 follows the two-stage recipe popularized by LLaVA-1.5 / LLaVA-OneVision, adapted to Japanese data.

### Stage 1-1 — Image Caption Pretraining

- **Goal**: Align vision tokens with the LLM embedding space.
- **Trainable**: MLP projector only.
- **Datasets**:
  - STAIR Captions (license_id = 4 only, with multi-caption random sampling providing 5× effective diversity)
  - [Yana/ft-llm-2026-ocr-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-ocr-dataset)
- **Learning rate**: 1e-3 · **Epochs**: 2 · **Effective batch size**: 128
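
In code, the Stage 1-1 trainability setup amounts to roughly the following sketch (function and argument names are hypothetical; the actual training loop is in the repository's `phase1/` code):

```python
import torch
from torch import nn

def configure_stage_1_1(vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
    """Stage 1-1: train the MLP projector only; vision encoder and LLM stay frozen."""
    for module in (vision_encoder, llm):
        for p in module.parameters():
            p.requires_grad_(False)
    for p in projector.parameters():
        p.requires_grad_(True)
    # Learning rate from the Stage 1-1 recipe above.
    return torch.optim.AdamW(projector.parameters(), lr=1e-3)
```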

### Stage 1-2 — Visual Instruction Tuning

- **Goal**: Enable VQA and instruction following in Japanese.
- **Trainable**: MLP projector + LLM (via LoRA, r = 64, α = 128) + Vision Encoder (lr = 2e-6).
- **Datasets**:
  - [Yana/ft-llm-2026-qa-dataset](https://huggingface.co/datasets/Yana/ft-llm-2026-qa-dataset)
  - [llm-jp/ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation) (~90k on Visual Genome images)
  - [SakanaAI/JA-VG-VQA-500](https://huggingface.co/datasets/SakanaAI/JA-VG-VQA-500)
- **Learning rate**: 2e-5 · **Epochs**: 1 · **Effective batch size**: 128
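
The LoRA settings above (r = 64, α = 128) map onto a `peft` configuration roughly like the following; the dropout value and target modules are assumptions, since the recipe does not state them, and the exact setup lives in the repository's `phase1/` code:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,  # assumption; not stated in the recipe
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption for the attention projections
    task_type="CAUSAL_LM",
)
# llm = get_peft_model(llm, lora_config)  # wrap the otherwise-frozen LLM with trainable adapters
```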

### Common Hyperparameters

| Parameter | Value |
|-----------|-------|
| Per-device batch size | 2 |
| Gradient accumulation steps | 64 |
| Warmup ratio | 0.03 |
| Weight decay | 0.0 |
| Max sequence length | 2048 |
| Mixed precision | BF16 |
| Seed | 42 |
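
The effective batch size of 128 used in both stages follows directly from these values; a quick sanity check, assuming a single GPU:

```python
per_device_batch_size = 2
gradient_accumulation_steps = 64
num_gpus = 1  # assumption; with more GPUs, gradient accumulation would shrink accordingly

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
assert effective_batch_size == 128
```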

Training uses the NCCL backend and supports `torchrun`, SLURM, and OpenMPI launchers. Gradient checkpointing is enabled by default. An H100 80GB GPU is recommended.

---

## Chat Template

The model uses the LLM-JP v4 instruct template:

```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
<image>
この画像を見て、質問に答えてください。
{user_question}

### 応答:
{assistant_answer}<|eos|>
```

Special tokens:

| Token | Purpose |
|-------|---------|
| `<image>` | Image placeholder replaced by vision embeddings |
| `<\|eos\|>` | End-of-turn token |

Typical prompts used during training:

- Stage 1-1 caption prompt: `この画像を端的に説明してください。` ("Please briefly describe this image.")
- Stage 1-2 VQA prompt: `この画像を見て、質問に答えてください。` ("Look at this image and answer the question.")
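
In Python, assembling the Stage 1-2 VQA prompt from this template looks roughly like the following (the `build_prompt` helper and the example question are illustrative, not part of the released code):

```python
SYSTEM = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
VQA_INSTRUCTION = "この画像を見て、質問に答えてください。"

def build_prompt(question: str) -> str:
    """Builds the VQA prompt up to the point where the model starts generating."""
    return (
        f"{SYSTEM}\n\n"
        "### 指示:\n"
        "<image>\n"
        f"{VQA_INSTRUCTION}\n"
        f"{question}\n\n"
        "### 応答:\n"
    )

# Example: "What is shown in this image?"
print(build_prompt("この画像には何が写っていますか？"))
```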

---

## Intended Use

### Direct Use

- Japanese image captioning
- Japanese visual question answering (VQA)
- Foundation checkpoint for downstream fine-tuning (e.g., document understanding, financial reasoning)

### Downstream Use

This checkpoint is specifically intended to be continued into:

- **Phase 2** — reasoning enhancement via SFT + DPO distilled from Qwen3-30B → [Yana/compass-vlm-phase2](https://huggingface.co/Yana/compass-vlm-phase2)
- **Phase 3** — Japanese financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / domain-specific QA → [Yana/compass-vlm](https://huggingface.co/Yana/compass-vlm)

### Out-of-Scope Use

- High-stakes decision making (medical, legal, financial advisory, etc.) without human oversight.
- Generation of factual claims without verification; the model can hallucinate.
- Use in languages other than Japanese and English is not evaluated.

---

## Evaluation

Phase 1 is evaluated qualitatively by inspecting automatically generated raw outputs on:

- STAIR Captions **License ID 5** held-out samples
- OCR held-out samples from the training OCR corpus

Quantitative benchmarks (GSM8K, JP Harness, EDINET Bench) are reported for the full pipeline rather than Phase 1 alone. See the project repository for numbers.

---

## Limitations and Biases

- The vision encoder was pretrained on web-scale image data and may reflect biases present therein.
- The LLM backbone (LLM-JP-4-8B) was trained primarily on Japanese and English corpora; performance in other languages is not guaranteed.
- OCR quality on small-font or low-resolution documents is limited.
- This Phase 1 checkpoint has **not** received the reasoning enhancement (Phase 2) or financial domain adaptation (Phase 3), so its behavior on multi-step reasoning and financial documents will be weaker than the final COMPASS model.

---

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch

model_id = "Yana/compass-vlm-phase1"

# trust_remote_code is required because the checkpoint ships custom VLM modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
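
Continuing the snippet above, a minimal text-only generation sketch using the chat template (illustrative; real image inputs additionally require the SigLIP-v2 preprocessing and `<image>` token expansion provided by the repository code referenced below):

```python
# Text-only prompt in the LLM-JP v4 instruct format ("Please introduce yourself in Japanese.").
prompt = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
    "### 指示:\n"
    "日本語で自己紹介をしてください。\n\n"
    "### 応答:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```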

For the full inference pipeline (image preprocessing with SigLIP-v2, `<image>` token expansion, and AnyRes handling), please refer to the [`phase1/` directory](https://github.com/AtsushiYanaigsawa768/Compass/tree/main/phase1) in the GitHub repository.

---

## Citation

If you use this model, please cite the COMPASS project:

```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through
                  Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```

Please also cite the upstream works (LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP, STAIR Captions, ja-vg-vqa) as appropriate.

---

## License

This model is released under the **Apache License 2.0**.

**Note on training data and Japanese copyright law:**
Under **Article 30-4 of the Japanese Copyright Act**, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.

Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.

---

## Acknowledgements

Built on top of outstanding open-source work, including:

- [LLM-JP-4-8B-Instruct](https://huggingface.co/llm-jp/llm-jp-4-8b-instruct)
- [SigLIP-v2](https://huggingface.co/google/siglip2-so400m-patch14-384)
- [LLaVA-1.5](https://arxiv.org/abs/2310.03744) and [LLaVA-OneVision](https://arxiv.org/abs/2408.03326)
- [LLaVA-JP](https://github.com/tosiyuki/LLaVA-JP)
- [STAIR Captions](https://huggingface.co/datasets/shunk031/STAIR-Captions) and [ja-vg-vqa-conversation](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation)