---
library_name: transformers
license: apache-2.0
language:
- ko
- en
model_type: llama
tags:
- 3b
- korean
- from-scratch
- orpo
- instruction-tuned
- preference-aligned
- fp8
- b200
- gguf
datasets:
- cc100
- allenai/c4
- heegyu/orca-math-korean-preference-cleaned
- nayohan/preference-collection-ko-full
- maywell/ko_Ultrafeedback_binarized
- HuggingFaceTB/cosmopedia
- wikimedia/wikipedia
pipeline_tag: text-generation
model-index:
- name: FRANKENSTALLM-3B
  results:
  - task:
      type: text-generation
    dataset:
      type: kobest
      name: KoBEST (0-shot)
    metrics:
    - name: Average
      type: accuracy
      value: 52.75
    - name: COPA
      type: accuracy
      value: 63.9
    - name: HellaSwag-KO
      type: accuracy
      value: 38.0
    - name: SentiNeg
      type: accuracy
      value: 62.5
    - name: BoolQ
      type: accuracy
      value: 50.6
    - name: WiC
      type: accuracy
      value: 48.8
  - task:
      type: text-generation
    dataset:
      type: haerae
      name: HAE-RAE (0-shot)
    metrics:
    - name: Average
      type: accuracy
      value: 21.81
  - task:
      type: text-generation
    dataset:
      type: piqa
      name: PIQA (0-shot)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 59.9
  - task:
      type: text-generation
    dataset:
      type: ai2_arc
      name: ARC-Easy (0-shot)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 36.0
---

# FRANKENSTALLM 3B

> **⚠️ v2 Model Replacement Notice (2026-03-26)**
>
> The v2 GGUF and safetensors files were mistakenly published as a **1.2B model (hidden_size=2048, 24 layers)** because of an error in the conversion process.
> On 2026-03-26 they were replaced with the correct **3B ORPO checkpoint (hidden_size=3072, 28 layers, vocab_size=64256, byte-fallback applied)**.
> If you downloaded the v2 files before that date, please download them again.

> **A Korean 3B LLM built entirely from scratch on 8× NVIDIA B200 GPUs: tokenizer training, pretraining, SFT, and ORPO.**

| | |
|---|---|
| **Developer** | [pathcosmos](https://huggingface.co/pathcosmos) |
| **Parameters** | ~2.4B (3B-class, with weight tying) |
| **Languages** | Korean (primary), English (secondary) |
| **License** | Apache 2.0 |
| **Training** | 3 phases: pretraining → SFT → ORPO |
| **Hardware** | 8× NVIDIA B200 (FP8), ~86 hours total |

---

## Quick Start

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "pathcosmos/frankenstallm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prompt: "Please tell me about kimchi, a traditional Korean food."
inputs = tokenizer(
    "한국의 전통 음식 중 김치에 대해 설명해주세요.",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=1.2,  # recommended
        top_p=0.9,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
    )

# The decoded text contains the prompt followed by the completion.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Ollama (GGUF)

```bash
# Download the GGUF file and its Modelfile
huggingface-cli download pathcosmos/frankenstallm \
  gguf/frankenstallm-3b-v2-Q4_K_M.gguf \
  gguf/Modelfile.3b-v2-Q4_K_M \
  --local-dir ./frankenstallm

# Edit the FROM path in the Modelfile, then create the model
ollama create frankenstallm -f ./frankenstallm/gguf/Modelfile.3b-v2-Q4_K_M

# Run
ollama run frankenstallm
```

---

## File Downloads

### Model Files

| File | Size | Description | Download |
|------|------|-------------|----------|
| [`model.safetensors`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/model.safetensors) | 5.7 GB | HF Transformers native (3B ORPO, byte-fallback) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/model.safetensors) |
| [`config.json`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/config.json) | 1 KB | Model config (hidden=3072, 28L, vocab=64256) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/config.json) |
| [`tokenizer.json`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/tokenizer.json) | 4 MB | Tokenizer (SentencePiece Unigram) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/tokenizer.json) |
| [`tokenizer.model`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/tokenizer.model) | 1.4 MB | SentencePiece model (for GGUF conversion) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/tokenizer.model) |
| [`sampling_config.json`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/sampling_config.json) | 1 KB | Recommended sampling parameters | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/sampling_config.json) |

### GGUF (Ollama / llama.cpp)

| File | Size | Quantization | Download |
|------|------|--------------|----------|
| [`frankenstallm-3b-v2-Q4_K_M.gguf`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/gguf/frankenstallm-3b-v2-Q4_K_M.gguf) | 1.8 GB | **Q4_K_M (Recommended)** | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/gguf/frankenstallm-3b-v2-Q4_K_M.gguf) |
| [`frankenstallm-3b-v2-Q8_0.gguf`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/gguf/frankenstallm-3b-v2-Q8_0.gguf) | 3.0 GB | Q8_0 (High quality) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/gguf/frankenstallm-3b-v2-Q8_0.gguf) |
| [`frankenstallm-3b-v2-f16.gguf`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/gguf/frankenstallm-3b-v2-f16.gguf) | 5.7 GB | F16 (Lossless) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/gguf/frankenstallm-3b-v2-f16.gguf) |
| [`Modelfile.3b-v2-Q4_K_M`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/gguf/Modelfile.3b-v2-Q4_K_M) | 1 KB | Ollama Modelfile (Q4) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/gguf/Modelfile.3b-v2-Q4_K_M) |
| [`Modelfile.3b-v2-Q8_0`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/gguf/Modelfile.3b-v2-Q8_0) | 1 KB | Ollama Modelfile (Q8) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/gguf/Modelfile.3b-v2-Q8_0) |

> The v1 GGUF files (without byte-fallback) are also available as `gguf/frankenstallm-3b-*.gguf`, but **v2 is recommended**.

### Training Data (for SFT / ORPO reproduction)

| File | Size | Purpose | Download |
|------|------|---------|----------|
| [`train_filtered.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/sft_combined/train_filtered.jsonl) | 7.5 GB | SFT training data (24 sources, 2.4M samples, filtered) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/sft_combined/train_filtered.jsonl) |
| [`val_filtered.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/sft_combined/val_filtered.jsonl) | 157 MB | SFT validation data | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/sft_combined/val_filtered.jsonl) |
| [`combined_preference.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/combined_preference.jsonl) | 2.6 GB | ORPO training data (7 sources combined, 630K pairs) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/combined_preference.jsonl) |

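The training files listed above are several gigabytes each, so it is usually safer to read them one line at a time than to load them whole. The sketch below is only an illustration and not part of the repository's pipeline: `summarize_jsonl` is a hypothetical helper, and because this card does not document the record schema, it assumes nothing about field names and simply reports which top-level keys occur.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path, limit=None):
    """Stream a JSONL file and count its records and top-level keys.

    Hypothetical helper for illustration; no schema is assumed.
    Returns (number_of_records, {key: occurrences}).
    """
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
            if limit is not None and n_records >= limit:
                break  # stop early when only a sample is wanted
    return n_records, dict(key_counts)
```

For a first look, something like `summarize_jsonl("frankenstallm/data/sft_combined/val_filtered.jsonl", limit=1000)` (the local path is an assumption about where the file was downloaded) shows the field names without reading the full file.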
Individual ORPO preference data sources (7 datasets)

| File | Size | Download |
|------|------|----------|
| [`nayohan_preference-collection-ko-full.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/nayohan_preference-collection-ko-full.jsonl) | 4.9 GB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/nayohan_preference-collection-ko-full.jsonl) |
| [`heegyu_orca-math-korean-preference-cleaned.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/heegyu_orca-math-korean-preference-cleaned.jsonl) | 1.6 GB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/heegyu_orca-math-korean-preference-cleaned.jsonl) |
| [`kuotient_orca-math-korean-dpo-pairs.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/kuotient_orca-math-korean-dpo-pairs.jsonl) | 750 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/kuotient_orca-math-korean-dpo-pairs.jsonl) |
| [`maywell_ko_Ultrafeedback_binarized.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/maywell_ko_Ultrafeedback_binarized.jsonl) | 394 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/maywell_ko_Ultrafeedback_binarized.jsonl) |
| [`tellang_yeji-preference-ko-v1.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/tellang_yeji-preference-ko-v1.jsonl) | 171 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/tellang_yeji-preference-ko-v1.jsonl) |
| [`jojo0217_korean_rlhf_dataset.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/jojo0217_korean_rlhf_dataset.jsonl) | 137 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/jojo0217_korean_rlhf_dataset.jsonl) |
| [`lemon-mint_korean-realqa-reasoning-v01-preference.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/lemon-mint_korean-realqa-reasoning-v01-preference.jsonl) | 58 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/lemon-mint_korean-realqa-reasoning-v01-preference.jsonl) |

### Data Pipeline Scripts

| File | Description |
|------|-------------|
| [`prepare_sft_data.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/prepare_sft_data.py) | HF datasets → JSONL normalization (Alpaca format) |
| [`filter_sft_v2.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/filter_sft_v2.py) | SFT quality filtering (deduplication, repetition-rate filter) |
| [`prepare_preference_combined.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/prepare_preference_combined.py) | Preference data merging (for DPO/ORPO) |
| [`tokenize_extra.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/tokenize_extra.py) | Parallel tokenization of large datasets |
| [`sft_dataset.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/sft_dataset.py) | SFT dataset loader (Alpaca/conversation format) |
| [`dataset.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/dataset.py) | Pretraining dataset loader (memmap .bin) |
| [`build_korean_dataset.sh`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/build_korean_dataset.sh) | End-to-end Korean data pipeline |

### Phase Reports

| Report | Content |
|--------|---------|
| [`PROJECT_COMPLETION_REPORT`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-10_PROJECT_COMPLETION_REPORT.md) | Final project completion report |
| [`ORPO_EVALUATION_REPORT`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-09_ORPO_EVALUATION_REPORT.md) | ORPO 10-dimension evaluation |
| [`ORPO_TRAINING_JOURNEY`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-08_ORPO_TRAINING_JOURNEY.md) | ORPO training journey (HP sweep, debugging) |
| [`SFT_COMPLETION_AND_EVAL`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md) | SFT completion and evaluation |
| [`3B_BASE_EVALUATION`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-05_3B_BASE_EVALUATION_REPORT.md) | Pretrained base model evaluation |
| [`Phase0_Optimization`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-02_0200_FRANKENSTALLM_phase0_optimization_report.md) | FP8 optimization report |

---

## Model Highlights

- **From-scratch Korean tokenizer**: SentencePiece Unigram, 64K vocab, 99.95% Korean character coverage
- **3-phase training pipeline**: Pretraining (57K steps, ~60B tokens) → SFT (25.5K steps, 2.4M samples) → ORPO (10K steps, 630K preference pairs)
- **B200 FP8 native training**: TransformerEngine MXFP8, with 2× the theoretical throughput of BF16
- **GGUF deployment support**: Q4_K_M (1.8GB), Q8_0 (3.0GB), F16 (5.7GB), plus Ollama Modelfiles

---

## Architecture

| Component | Value |
|-----------|-------|
| Type | Decoder-only Transformer (LLaMA-style) |
| Hidden size | 3,072 |
| Layers | 28 |
| Attention heads | 24 |
| KV heads | 8 (GQA 3:1) |
| FFN dim | 8,192 (SwiGLU) |
| Vocab size | 64,256 (byte-fallback applied) |
| Context length | 4,096 (trained at 2,048) |
| Position encoding | RoPE (θ=500,000) |
| Normalization | Pre-norm RMSNorm |
| Attention impl | FlashAttention-2 |
| Precision | FP8 (TransformerEngine MXFP8) |
| Weight tying | Yes (embedding ↔ lm_head) |

---

## Training Pipeline

### Phase 1: Pretraining

| Detail | Value |
|--------|-------|
| Steps | 57,000 |
| Final loss | 1.466 |
| Tokens seen | ~60B (38.5B unique × ~1.5 epochs) |
| Duration | ~63 hours |
| Data | CC-100 KO, HPLT KO, C4 KO, NamuWiki, Wikipedia KO, Cosmopedia (EN) |
| Batch size | 5 × 8 GPU × 8 accum × 2,048 seq = ~655K tokens/step |

### Phase 2: SFT (Supervised Fine-Tuning)

| Detail | Value |
|--------|-------|
| Steps | 25,500 (early stop at the 77.3% mark) |
| Best val_loss | 1.8851 (step 23,000) |
| Duration | ~15.5 hours |
| Data | 2,439,397 samples from 24 sources (7.48 GB) |
| Mix | 70% SFT + 30% pretraining replay (to prevent catastrophic forgetting) |
| Knowledge forgetting | 0.9% (across 19 datasets) |

### Phase 3: ORPO (Preference Optimization)

| Detail | Value |
|--------|-------|
| Steps | 9,997 (early convergence) |
| Best eval_loss | 1.625 |
| Preference accuracy | 76.02% |
| Reward margin | 0.6100 |
| Duration | ~7 hours |
| Data | ~630K preference pairs from 7 Korean HF datasets |
| Hyperparameters | beta=0.25, lr=1.2e-5, eff_batch=128 |

**Total training time: ~86 hours on 8× B200**

---

## Benchmarks

### Training Phase Progression (Base → SFT → ORPO)

| Benchmark | Base | SFT | ORPO | Δ (Base→ORPO) |
|-----------|:----:|:---:|:----:|:---:|
| **KoBEST Avg (0-shot)** | 43.7% | 43.3% | **52.8%** | **+9.1pp** |
| KoBEST COPA | 49.3% | 48.6% | **63.9%** | +14.6pp |
| KoBEST HellaSwag-KO | 21.6% | 19.8% | **38.0%** | +16.4pp |
| KoBEST SentiNeg | 48.6% | 49.1% | **62.5%** | +13.9pp |
| KoBEST BoolQ | 50.3% | 50.1% | 50.6% | +0.3pp |
| PIQA | 52.5% | 52.6% | **59.9%** | +7.3pp |
| ARC-Easy | 25.6% | 25.9% | **36.0%** | +10.4pp |
| HAE-RAE | 19.7% | 19.9% | 21.8% | +2.1pp |
| HellaSwag EN | 26.2% | 26.1% | 29.2% | +3.0pp |
| Greedy 3-gram repetition | 61.0% | 73.0% | **30.9%** | -30.1pp |
| EOS termination rate | 0% | 60% | **67%** | +67pp |
| PPL forgetting | n/a | 0.9% | 4.1% | within 15% ✅ |

### 3B-class Model Comparison (Ollama, 35 tests)

| Model | Params | Korean NLU | Knowledge | Instruction | Reasoning | Avg Score |
|-------|:------:|:----------:|:---------:|:-----------:|:---------:|:---------:|
| Qwen 2.5 3B | 3B | 100.0 | 20.8 | 55.6 | 62.5 | **63.4** |
| Phi-4 Mini | 3.8B | 66.7 | 29.2 | 33.3 | **87.5** | 60.6 |
| **FRANKENSTALLM 3B** | **3B** | **100.0** | **75.0** | **66.7** | 50.0 | 46.7 |

> FRANKENSTALLM leads in **Korean NLU** (tied with Qwen), **Korean Knowledge** (75.0 vs 20.8/29.2), and **Instruction Following** (66.7 vs 55.6/33.3). It trails both other models on Reasoning (50.0) and on the overall average (46.7).

### Inference Speed (Ollama, Q4_K_M)

| Model | Avg TTFT | TPS | Note |
|-------|:--------:|:---:|------|
| **FRANKENSTALLM 3B** | **16.7ms** | **142.5** | Fastest |
| Phi-4 Mini 3.8B | 25.6ms | 100.4 | |
| Qwen 2.5 3B | 28.2ms | 93.8 | |

### Perplexity Preservation (ORPO Knowledge Retention)

| Dataset | Base PPL | ORPO PPL | Forgetting |
|---------|:--------:|:--------:|:----------:|
| Korean C4 | 5.72 | 5.87 | +2.7% |
| Korean Wiki | 11.84 | 12.21 | +3.2% |
| Max forgetting | n/a | n/a | 4.1% ✅ |

---

## Training Data

### Pretraining (~38.5B tokens)

| Category | Sources | Est. Tokens |
|----------|---------|:-----------:|
| Korean Web Crawl | C4 KO, CC-100 KO, HPLT KO | ~17.2B |
| Korean Encyclopedia | Wikipedia KO, NamuWiki (2 versions) | ~2.8B |
| English Educational | Cosmopedia (Stories, Web, Stanford, WikiHow, OpenStax, Khan) | ~5.7B |
| English Math/Science | AutoMathText, OpenWebMath, Proof-Pile-2 | ~8.5B |
| Code | StarCoder (filtered) | ~4.3B |

### SFT (2.4M samples, 24 sources)

| Domain | Share | Key Datasets |
|--------|:-----:|-------------|
| Reasoning/CoT | 38% | reasoning_r1_1.4m, magpie_reasoning |
| Korean Instructions | 23% | korean_instruction_mix, open_korean_instructions, kullm_v2 |
| English General | 16% | openhermes_2.5, ultrachat_200k |
| Math | 12% | NuminaMath-CoT, orca-math-ko |
| Dialog/Code/Other | 11% | smol-koreantalk, Evol-Instruct-Code-80k-ko |

### ORPO (~630K preference pairs, 7 sources)

| Dataset | Size | Domain |
|---------|:----:|--------|
| nayohan/preference-collection-ko-full | 4.9GB | General preference |
| heegyu/orca-math-korean-preference-cleaned | 1.6GB | Math reasoning |
| kuotient/orca-math-korean-dpo-pairs | 750MB | Math DPO |
| maywell/ko_Ultrafeedback_binarized | 394MB | Feedback alignment |
| tellang/yeji-preference-ko-v1 | 171MB | General preference |
| jojo0217/korean_rlhf_dataset | 137MB | RLHF pairs |
| lemon-mint/korean-realqa-reasoning-v01-preference | 58MB | QA reasoning |

---

## GGUF & Ollama

### Available Quantizations

| File | Size | Description |
|------|:----:|-------------|
| `gguf/frankenstallm-3b-v2-Q4_K_M.gguf` | 1.8GB | **Recommended**: best quality for the size |
| `gguf/frankenstallm-3b-v2-Q8_0.gguf` | 3.0GB | Higher quality |
| `gguf/frankenstallm-3b-v2-f16.gguf` | 5.7GB | Full precision |
| `model.safetensors` | 5.7GB | Transformers native (3B ORPO best, byte-fallback fixed, vocab=64256) |

### Recommended Sampling Parameters

| Parameter | Value | Notes |
|-----------|:-----:|-------|
| `temperature` | 0.7 | Best Korean generation quality |
| `repeat_penalty` | 1.2 | **Required**: without it, greedy decoding shows 30.9% 3-gram repetition |
| `top_p` | 0.9 | Nucleus sampling |
| `top_k` | 50 | Number of top-k candidates |
| `max_tokens` | 512 | Max generation length |
| `num_ctx` | 4096 | Context window (do not exceed) |

> ⚠️ Always use `repeat_penalty >= 1.2` (`repetition_penalty` in Transformers). With it, 3-gram repetition drops to **0%**; without it, greedy decoding produces ~31% 3-gram repetition.

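To make the warning above concrete, here is a minimal sketch of what a repetition penalty does to next-token scores. It follows the commonly used CTRL-style rule (divide positive logits by the penalty, multiply negative ones), which is my understanding of how the `repetition_penalty` option in Transformers behaves. `apply_repetition_penalty` is a hypothetical helper written for this illustration, not code from this repository, and Ollama's `repeat_penalty` may differ in details such as how many recent tokens it considers.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens penalized.

    Positive scores are divided by `penalty` and negative scores are
    multiplied by it, so any penalty > 1 lowers the score of a token
    that has already appeared.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Token 0 and token 1 were already generated; token 2 was not.
scores = apply_repetition_penalty([2.4, -1.2, 0.6], [0, 1, 0], penalty=1.2)
print(scores)
```

A penalty of 1.0 leaves every score unchanged, which is why the default setting offers no protection against the repetition described above.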
---

## Limitations

- **English performance is limited**: MMLU-EN ~23%, HellaSwag-EN ~29%. This is a Korean-focused model.
- **Code generation**: Near zero capability (code makes up a small share of the training data)
- **Greedy repetition**: 30.9% 3-gram repetition without `repeat_penalty`; always use `repeat_penalty >= 1.2`
- **Safety**: No safety alignment data was included in training, so use the model with appropriate guardrails
- **Scale gap**: Commercial 3B models are trained on trillions of tokens, while this model was trained on ~60B tokens, so overall benchmark scores may be lower

---

## Hardware & Training Environment

| Component | Specification |
|-----------|---------------|
| GPU | 8× NVIDIA B200 (183GB HBM3e each, ~1.47TB total) |
| FP8 Compute | 2,250 TFLOPS/GPU (18,000 TFLOPS total) |
| Interconnect | NVLink 5.0, NVSwitch all-to-all mesh |
| CPU | 2× AMD EPYC 9365 (72 cores, Zen 5) |
| RAM | 2.21 TB DDR5 |
| PyTorch | 2.10.0a0+b4e4ee81d3.nv25.12 (NVIDIA custom) |
| TransformerEngine | 2.10.0 |
| FlashAttention | 2.7.4 |
| NCCL | 2.28.9 |
| CUDA | 13.1 |
| Total training time | ~86 hours (pretraining 63h + SFT 15.5h + ORPO 7h) |

---

## Citation

```bibtex
@misc{frankenstallm2026,
  title={FRANKENSTALLM: A Korean 3B LLM Built From Scratch on B200 GPUs},
  author={pathcosmos},
  year={2026},
  url={https://huggingface.co/pathcosmos/frankenstallm},
  note={3-phase training (Pretrain, SFT, ORPO) with FP8 on 8x NVIDIA B200}
}
```

---

## Links & Contact

- **GitHub**: [pathcosmos/FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) (full source code, training scripts, and builder's log)
- **HuggingFace**: [pathcosmos/frankenstallm](https://huggingface.co/pathcosmos/frankenstallm)
- **Contact**: pathcosmos@gmail.com

---

## Acknowledgment

This project was carried out using GPU computing resources provided through the **"Advanced GPU Utilization Support Program"** (MSIT Notice No. 2025-1068) of the **Ministry of Science and ICT**.

> **National AI Computing Resource Support Portal**: https://aiinfrahub.kr
>
> - Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
> - Operated by: Korea Association of Information & Telecommunication (KAIT)

Thanks to the Korean government's AI infrastructure support program, we were able to train a Korean 3B LLM from scratch on 8× NVIDIA B200 GPUs. We are deeply grateful for this national-level support for AI computing resources.

---

---

> 🇺🇸 **English version below**

---

# FRANKENSTALLM 3B

> **⚠️ v2 Model Replacement Notice (2026-03-26)**
>
> The v2 GGUF and safetensors files were incorrectly deployed as a **1.2B model (hidden_size=2048, 24 layers)** due to a conversion pipeline error.
> On 2026-03-26, they were replaced with the correct **3B ORPO checkpoint (hidden_size=3072, 28 layers, vocab_size=64256, byte-fallback applied)**.
> If you downloaded v2 files before this date, please re-download.

> **A Korean 3B LLM built entirely from scratch on 8× NVIDIA B200 GPUs: tokenizer, pretraining, SFT, and ORPO.**

| | |
|---|---|
| **Developer** | [pathcosmos](https://huggingface.co/pathcosmos) |
| **Parameters** | ~2.4B (3B-class with weight tying) |
| **Languages** | Korean (primary), English (secondary) |
| **License** | Apache 2.0 |
| **Training** | 3-phase: Pretrain → SFT → ORPO |
| **Hardware** | 8× NVIDIA B200 (FP8), ~86 hours total |

---

## Quick Start

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "pathcosmos/frankenstallm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prompt: "Please tell me about kimchi, a traditional Korean food."
inputs = tokenizer(
    "한국의 전통 음식 중 김치에 대해 설명해주세요.",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=1.2,  # recommended
        top_p=0.9,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
    )

# The decoded text contains the prompt followed by the completion.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Ollama (GGUF)

```bash
# Download GGUF + Modelfile
huggingface-cli download pathcosmos/frankenstallm \
  gguf/frankenstallm-3b-v2-Q4_K_M.gguf \
  gguf/Modelfile.3b-v2-Q4_K_M \
  --local-dir ./frankenstallm

# Fix the FROM path in the Modelfile, then create the model
ollama create frankenstallm -f ./frankenstallm/gguf/Modelfile.3b-v2-Q4_K_M

# Run
ollama run frankenstallm
```

---

## File Downloads

### Model Files

| File | Size | Description | Download |
|------|------|-------------|----------|
| [`model.safetensors`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/model.safetensors) | 5.7 GB | HF Transformers native (3B ORPO, byte-fallback) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/model.safetensors) |
| [`config.json`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/config.json) | 1 KB | Model config (hidden=3072, 28L, vocab=64256) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/config.json) |
| [`tokenizer.json`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/tokenizer.json) | 4 MB | Tokenizer (SentencePiece Unigram) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/tokenizer.json) |
| [`tokenizer.model`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/tokenizer.model) | 1.4 MB | SentencePiece model (for GGUF conversion) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/tokenizer.model) |

### GGUF (Ollama / llama.cpp)

| File | Size | Quantization | Download |
|------|------|--------------|----------|
| [`frankenstallm-3b-v2-Q4_K_M.gguf`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/gguf/frankenstallm-3b-v2-Q4_K_M.gguf) | 1.8 GB | **Q4_K_M (Recommended)** | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/gguf/frankenstallm-3b-v2-Q4_K_M.gguf) |
| [`frankenstallm-3b-v2-Q8_0.gguf`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/gguf/frankenstallm-3b-v2-Q8_0.gguf) | 3.0 GB | Q8_0 (High quality) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/gguf/frankenstallm-3b-v2-Q8_0.gguf) |
| [`frankenstallm-3b-v2-f16.gguf`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/gguf/frankenstallm-3b-v2-f16.gguf) | 5.7 GB | F16 (Lossless) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/gguf/frankenstallm-3b-v2-f16.gguf) |

### Training Data (for SFT / ORPO reproduction)

| File | Size | Purpose | Download |
|------|------|---------|----------|
| [`train_filtered.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/sft_combined/train_filtered.jsonl) | 7.5 GB | SFT training data (24 sources, 2.4M samples, filtered) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/sft_combined/train_filtered.jsonl) |
| [`val_filtered.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/sft_combined/val_filtered.jsonl) | 157 MB | SFT validation data | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/sft_combined/val_filtered.jsonl) |
| [`combined_preference.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/combined_preference.jsonl) | 2.6 GB | ORPO training data (7 sources, 630K pairs) | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/combined_preference.jsonl) |

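Every download link in the tables above follows the same `resolve/main/<path>` pattern, so a direct URL can be derived from a file's path inside the repository. The sketch below shows that mapping. `resolve_url` is a hypothetical helper introduced here for illustration, and the `revision` parameter is an assumption for readers who want to pin a branch or commit.

```python
from urllib.parse import quote

REPO_ID = "pathcosmos/frankenstallm"


def resolve_url(path_in_repo, repo_id=REPO_ID, revision="main"):
    """Build a direct-download URL for a file in the model repository.

    Mirrors the link pattern used in the tables on this card; slashes in
    the path are kept, other unsafe characters are percent-encoded.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(path_in_repo)}"


print(resolve_url("data/sft_combined/val_filtered.jsonl"))
```

For scripted downloads, the `huggingface-cli download` command shown in Quick Start is likely the more convenient route, since it accepts the same repository-relative paths.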
Individual ORPO Preference Sources (7 datasets)

| File | Size | Download |
|------|------|----------|
| [`nayohan_preference-collection-ko-full.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/nayohan_preference-collection-ko-full.jsonl) | 4.9 GB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/nayohan_preference-collection-ko-full.jsonl) |
| [`heegyu_orca-math-korean-preference-cleaned.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/heegyu_orca-math-korean-preference-cleaned.jsonl) | 1.6 GB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/heegyu_orca-math-korean-preference-cleaned.jsonl) |
| [`kuotient_orca-math-korean-dpo-pairs.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/kuotient_orca-math-korean-dpo-pairs.jsonl) | 750 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/kuotient_orca-math-korean-dpo-pairs.jsonl) |
| [`maywell_ko_Ultrafeedback_binarized.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/maywell_ko_Ultrafeedback_binarized.jsonl) | 394 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/maywell_ko_Ultrafeedback_binarized.jsonl) |
| [`tellang_yeji-preference-ko-v1.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/tellang_yeji-preference-ko-v1.jsonl) | 171 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/tellang_yeji-preference-ko-v1.jsonl) |
| [`jojo0217_korean_rlhf_dataset.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/jojo0217_korean_rlhf_dataset.jsonl) | 137 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/jojo0217_korean_rlhf_dataset.jsonl) |
| [`lemon-mint_korean-realqa-reasoning-v01-preference.jsonl`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/preference/lemon-mint_korean-realqa-reasoning-v01-preference.jsonl) | 58 MB | [Download](https://huggingface.co/pathcosmos/frankenstallm/resolve/main/data/preference/lemon-mint_korean-realqa-reasoning-v01-preference.jsonl) |

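The seven source files above add up to roughly 8 GB, while `combined_preference.jsonl` is 2.6 GB, which suggests the merge step normalizes or filters records, although this card does not describe how. The sketch below shows one plausible way to merge preference files. It is not the repository's `prepare_preference_combined.py`: `merge_preference_files` is a hypothetical function, and the `prompt` / `chosen` / `rejected` field names are an assumption about the record schema.

```python
import hashlib
import json

# ASSUMPTION: each record carries these three string fields.
FIELDS = ("prompt", "chosen", "rejected")


def merge_preference_files(paths, out_path):
    """Concatenate JSONL preference files, dropping unusable or duplicate pairs.

    Keeps only the assumed FIELDS, skips records where a field is missing
    or empty, skips pairs whose chosen and rejected answers are identical,
    and removes exact duplicates. Returns the number of pairs written.
    """
    seen = set()
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    if not line.strip():
                        continue
                    rec = json.loads(line)
                    if not isinstance(rec, dict):
                        continue
                    if not all(isinstance(rec.get(k), str) and rec[k] for k in FIELDS):
                        continue  # missing or empty field
                    if rec["chosen"] == rec["rejected"]:
                        continue  # carries no preference signal
                    joined = "\x1f".join(rec[k] for k in FIELDS)
                    key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
                    if key in seen:
                        continue  # exact duplicate pair
                    seen.add(key)
                    row = {k: rec[k] for k in FIELDS}
                    out.write(json.dumps(row, ensure_ascii=False) + "\n")
                    kept += 1
    return kept
```

Hashing the joined fields keeps memory use proportional to the number of unique pairs rather than to their text length, which matters at the ~630K-pair scale reported here.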
### Data Pipeline Scripts | File | Description | |------|-------------| | [`prepare_sft_data.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/prepare_sft_data.py) | HF datasets → JSONL normalization (Alpaca format) | | [`filter_sft_v2.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/filter_sft_v2.py) | SFT quality filtering (dedup, repetition filter) | | [`prepare_preference_combined.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/prepare_preference_combined.py) | Preference data merging (DPO/ORPO format) | | [`tokenize_extra.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/tokenize_extra.py) | Large-scale parallel tokenization | | [`sft_dataset.py`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/data/sft_dataset.py) | SFT dataset loader (Alpaca/conversation format) | ### Phase Reports | Report | Content | |--------|---------| | [`PROJECT_COMPLETION_REPORT`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-10_PROJECT_COMPLETION_REPORT.md) | Final project completion report | | [`ORPO_EVALUATION_REPORT`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-09_ORPO_EVALUATION_REPORT.md) | ORPO 10-dimension evaluation | | [`ORPO_TRAINING_JOURNEY`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-08_ORPO_TRAINING_JOURNEY.md) | ORPO training journey (HP sweep, debugging) | | [`SFT_COMPLETION_AND_EVAL`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md) | SFT completion and evaluation | | [`3B_BASE_EVALUATION`](https://huggingface.co/pathcosmos/frankenstallm/blob/main/reports/2026-03-05_3B_BASE_EVALUATION_REPORT.md) | Pretrained base model evaluation | --- ## Model Highlights - **From-scratch Korean tokenizer**: SentencePiece Unigram, 64K vocab, 99.95% Korean character coverage - **3-phase training pipeline**: Pretrain (57K steps, ~60B tokens) → SFT 
(25.5K steps, 2.4M samples) → ORPO (10K steps, 630K preference pairs) - **B200 FP8 native training**: TransformerEngine MXFP8 on NVIDIA B200 — 2× theoretical throughput vs BF16 - **GGUF deployment ready**: Q4_K_M (1.8GB), Q8_0 (3.0GB), F16 (5.7GB) with optimized Ollama Modelfiles --- ## Architecture | Component | Value | |-----------|-------| | Type | Decoder-only Transformer (LLaMA-style) | | Hidden size | 3,072 | | Layers | 28 | | Attention heads | 24 | | KV heads | 8 (GQA 3:1) | | FFN dim | 8,192 (SwiGLU) | | Vocab size | 64,256 (byte-fallback applied) | | Context length | 4,096 (trained at 2,048) | | Position encoding | RoPE (θ=500,000) | | Normalization | Pre-norm RMSNorm | | Attention impl | FlashAttention-2 | | Precision | FP8 (MXFP8 via TransformerEngine) | | Weight tying | Yes (embedding ↔ lm_head) | --- ## Training Pipeline ### Phase 1: Pretraining | Detail | Value | |--------|-------| | Steps | 57,000 | | Final loss | 1.466 | | Tokens seen | ~60B (38.5B unique × ~1.5 epochs) | | Duration | ~63 hours | | Data | CC-100 KO, HPLT KO, C4 KO, NamuWiki, Wikipedia KO, Cosmopedia (EN) | | Batch size | 5 × 8 GPU × 8 accum × 2,048 seq = ~655K tok/step | ### Phase 2: Supervised Fine-Tuning (SFT) | Detail | Value | |--------|-------| | Steps | 25,500 (early stop at 77.3%) | | Best val_loss | 1.8851 (step 23,000) | | Duration | ~15.5 hours | | Data | 2,439,397 samples from 24 sources (7.48 GB) | | Mix | 70% SFT + 30% pretrain replay (catastrophic forgetting prevention) | | Knowledge forgetting | 0.9% (19 datasets) | ### Phase 3: ORPO (Odds Ratio Preference Optimization) | Detail | Value | |--------|-------| | Steps | 9,997 (early convergence) | | Best eval_loss | 1.625 | | Preference accuracy | 76.02% | | Reward margin | 0.6100 | | Duration | ~7 hours | | Data | ~630K preference pairs from 7 Korean HF datasets | | Hyperparams | beta=0.25, lr=1.2e-5, eff_batch=128 | **Total training time: ~86 hours on 8× B200** --- ## Benchmarks ### Training Phase Progression (Base → 
SFT → ORPO) | Benchmark | Base | SFT | ORPO | Δ (Base→ORPO) | |-----------|:----:|:---:|:----:|:---:| | **KoBEST Avg (0-shot)** | 43.7% | 43.3% | **52.8%** | **+9.1pp** | | KoBEST COPA | 49.3% | 48.6% | **63.9%** | +14.6pp | | KoBEST HellaSwag-KO | 21.6% | 19.8% | **38.0%** | +16.4pp | | KoBEST SentiNeg | 48.6% | 49.1% | **62.5%** | +13.9pp | | KoBEST BoolQ | 50.3% | 50.1% | 50.6% | +0.3pp | | PIQA | 52.5% | 52.6% | **59.9%** | +7.3pp | | ARC-Easy | 25.6% | 25.9% | **36.0%** | +10.4pp | | HAE-RAE | 19.7% | 19.9% | 21.8% | +2.1pp | | HellaSwag EN | 26.2% | 26.1% | 29.2% | +3.0pp | | Greedy 3-gram repetition | 61.0% | 73.0% | **30.9%** | -30.1pp | | EOS termination rate | 0% | 60% | **67%** | +67pp | | PPL forgetting | — | 0.9% | 4.1% | within 15% ✅ | ### 3B-class Model Comparison (Ollama, 35 tests) | Model | Params | Korean NLU | Knowledge | Instruction | Reasoning | Avg Score | |-------|:------:|:----------:|:---------:|:-----------:|:---------:|:---------:| | Qwen 2.5 3B | 3B | 100.0 | 20.8 | 55.6 | 62.5 | **63.4** | | Phi-4 Mini | 3.8B | 66.7 | 29.2 | 33.3 | **87.5** | 60.6 | | **FRANKENSTALLM 3B** | **3B** | **100.0** | **75.0** | **66.7** | 50.0 | 46.7 | > FRANKENSTALLM leads in **Korean NLU** (tied with Qwen), **Korean Knowledge** (75 vs 20.8/29.2), and **Instruction Following** (66.7 vs 55.6/33.3). ### Inference Speed (Ollama, Q4_K_M) | Model | Avg TTFT | TPS | Note | |-------|:--------:|:---:|------| | **FRANKENSTALLM 3B** | **16.7ms** | **142.5** | Fastest | | Phi-4 Mini 3.8B | 25.6ms | 100.4 | | | Qwen 2.5 3B | 28.2ms | 93.8 | | ### Perplexity Preservation (ORPO Knowledge Retention) | Dataset | Base PPL | ORPO PPL | Forgetting | |---------|:--------:|:--------:|:----------:| | Korean C4 | 5.72 | 5.87 | +2.7% | | Korean Wiki | 11.84 | 12.21 | +3.2% | | Max forgetting | — | — | 4.1% ✅ | --- ## Training Data ### Pretraining (~38.5B tokens) | Category | Sources | Est. 
Tokens |
|----------|---------|:-----------:|
| Korean Web Crawl | C4 KO, CC-100 KO, HPLT KO | ~17.2B |
| Korean Encyclopedia | Wikipedia KO, NamuWiki (2 versions) | ~2.8B |
| English Educational | Cosmopedia (Stories, Web, Stanford, WikiHow, OpenStax, Khan) | ~5.7B |
| English Math/Science | AutoMathText, OpenWebMath, Proof-Pile-2 | ~8.5B |
| Code | StarCoder (filtered) | ~4.3B |

### SFT (2.4M samples, 24 sources)

| Domain | Share | Key Datasets |
|--------|:-----:|-------------|
| Reasoning/CoT | 38% | reasoning_r1_1.4m, magpie_reasoning |
| Korean Instructions | 23% | korean_instruction_mix, open_korean_instructions, kullm_v2 |
| English General | 16% | openhermes_2.5, ultrachat_200k |
| Math | 12% | NuminaMath-CoT, orca-math-ko |
| Dialog/Code/Other | 11% | smol-koreantalk, Evol-Instruct-Code-80k-ko |

### ORPO (~630K preference pairs, 7 sources)

| Dataset | Size | Domain |
|---------|:----:|--------|
| nayohan/preference-collection-ko-full | 4.9GB | General preference |
| heegyu/orca-math-korean-preference-cleaned | 1.6GB | Math reasoning |
| kuotient/orca-math-korean-dpo-pairs | 750MB | Math DPO |
| maywell/ko_Ultrafeedback_binarized | 394MB | Feedback alignment |
| tellang/yeji-preference-ko-v1 | 171MB | General preference |
| jojo0217/korean_rlhf_dataset | 137MB | RLHF pairs |
| lemon-mint/korean-realqa-reasoning-v01-preference | 58MB | QA reasoning |

---

## GGUF & Ollama

### Available Quantizations

| File | Size | Description |
|------|:----:|-------------|
| `gguf/frankenstallm-3b-v2-Q4_K_M.gguf` | 1.8GB | **Recommended**: best size/quality balance |
| `gguf/frankenstallm-3b-v2-Q8_0.gguf` | 3.0GB | Higher quality |
| `gguf/frankenstallm-3b-v2-f16.gguf` | 5.7GB | Full precision |
| `model.safetensors` | 5.7GB | Transformers native (3B ORPO best, byte-fallback fixed, vocab=64256) |

### Recommended Sampling Parameters

| Parameter | Value | Notes |
|-----------|:-----:|-------|
| `temperature` | 0.7 | Optimal for Korean generation quality |
| `repeat_penalty` | 1.2 | **Required**: without it, greedy repetition is 30.9% |
| `top_p` | 0.9 | Nucleus sampling |
| `top_k` | 50 | Top-k candidates |
| `max_tokens` | 512 | Max generation length |
| `num_ctx` | 4096 | Context window (do not exceed) |

> ⚠️ Always use `repeat_penalty >= 1.2`. With it, repetition drops to **0%**. Without it, greedy decoding produces ~31% 3-gram repetition.

---

## Limitations

- **English performance is limited**: MMLU-EN ~23%, HellaSwag-EN ~29%; this is a Korean-focused model
- **Code generation**: Near zero capability (limited code in training data)
- **Greedy repetition**: 30.9% 3-gram repetition without `repeat_penalty`; always use sampling with `repeat_penalty >= 1.2`
- **Safety**: Safety alignment data was not included in training; use with appropriate guardrails
- **Scale gap**: Compared to commercial 3B models trained on trillions of tokens, this model was trained on ~60B tokens, so expect lower overall benchmark scores

---

## Hardware & Training Environment

| Component | Specification |
|-----------|---------------|
| GPU | 8× NVIDIA B200 (183GB HBM3e each, ~1.47TB total) |
| FP8 Compute | 2,250 TFLOPS/GPU (18,000 TFLOPS total) |
| Interconnect | NVLink 5.0, NVSwitch all-to-all mesh |
| CPU | 2× AMD EPYC 9365 (72 cores, Zen 5) |
| RAM | 2.21 TB DDR5 |
| PyTorch | 2.10.0a0+b4e4ee81d3.nv25.12 (NVIDIA custom) |
| TransformerEngine | 2.10.0 |
| FlashAttention | 2.7.4 |
| NCCL | 2.28.9 |
| CUDA | 13.1 |
| Total training | ~86 hours (Pretrain 63h + SFT 15.5h + ORPO 7h) |

---

## Citation

```bibtex
@misc{frankenstallm2026,
  title={FRANKENSTALLM: A Korean 3B LLM Built From Scratch on B200 GPUs},
  author={pathcosmos},
  year={2026},
  url={https://huggingface.co/pathcosmos/frankenstallm},
  note={3-phase training (Pretrain, SFT, ORPO) with FP8 on 8x NVIDIA B200}
}
```

---

## Links & Contact

- **GitHub**: [pathcosmos/FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) (full source code, training scripts, and builder's log)
- **HuggingFace**:
[pathcosmos/frankenstallm](https://huggingface.co/pathcosmos/frankenstallm)
- **Contact**: pathcosmos@gmail.com

---

## Related Projects

- **[EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)** | [🤗 HuggingFace](https://huggingface.co/pathcosmos/EVAFRILL-Mo-3B): Hybrid Mamba-2 + Transformer sister project (2.94B params). While FRANKENSTALLM uses a pure Transformer architecture, EVAFRILL-Mo adopts Mamba-2 SSM + sparse Transformer attention. Both share the same tokenizer and training infrastructure.

---

## Acknowledgment

This project was conducted using GPU computing resources provided through the **"Advanced GPU Utilization Support Program"** (MSIT Notice No. 2025-1068) by the **Ministry of Science and ICT (MSIT)** of the Republic of Korea.

> **National AI Computing Resource Support Portal**: https://aiinfrahub.kr
>
> - Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
> - Operated by: Korea Association of Information & Telecommunication (KAIT)

We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B LLM from scratch on 8× NVIDIA B200 GPUs.
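
---

## Appendix: Sampling Parameter Sketch

As a rough illustration of the Recommended Sampling Parameters table earlier in this card, the sketch below builds a request body for Ollama's `/api/generate` endpoint and measures 3-gram repetition in a token sequence. Several parts are assumptions rather than anything this card specifies: the helper names `build_generate_payload` and `trigram_repetition_rate` are hypothetical, the table's `max_tokens` is assumed to map to Ollama's `num_predict` option, and the repetition formula is one common definition, not necessarily the one behind the 30.9% figure.

```python
import json
from collections import Counter

# Values copied from the "Recommended Sampling Parameters" table.
# Assumption: the table's `max_tokens` corresponds to Ollama's `num_predict`.
RECOMMENDED_OPTIONS = {
    "temperature": 0.7,
    "repeat_penalty": 1.2,
    "top_p": 0.9,
    "top_k": 50,
    "num_predict": 512,
    "num_ctx": 4096,
}


def build_generate_payload(prompt, model="frankenstallm"):
    """Return a JSON-serializable body for POST /api/generate (hypothetical helper)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Copy the dict so callers can tweak options without mutating the defaults.
        "options": dict(RECOMMENDED_OPTIONS),
    }


def trigram_repetition_rate(tokens):
    """Fraction of 3-grams that duplicate an earlier 3-gram in the sequence.

    Assumption: the exact metric used for the reported repetition figures
    is not given in this card; this is only one plausible definition.
    """
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(trigrams)


if __name__ == "__main__":
    payload = build_generate_payload("한국의 전통 음식 중 김치에 대해 설명해주세요.")
    print(json.dumps(payload, ensure_ascii=False, indent=2))
    print(trigram_repetition_rate("a b c a b c a b c".split()))
```

The payload can be sent to a locally running Ollama server (by default at `http://localhost:11434/api/generate`) with any HTTP client. Whitespace splitting is used here only to keep the example dependency-free; for Korean text, tokenizing with the model's own tokenizer would give a more meaningful repetition rate.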