\documentclass[11pt,a4paper]{article} % ============================================================================ % Packages % ============================================================================ \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage{times} \usepackage{geometry} \geometry{margin=1in} \usepackage{amsmath,amssymb} \usepackage{graphicx} \usepackage{booktabs} \usepackage{hyperref} \usepackage{url} \urlstyle{same} \usepackage{natbib} \usepackage{xcolor} \usepackage{array} \usepackage{float} \usepackage{enumitem} \usepackage{fancyvrb} \usepackage{pgfplots} \pgfplotsset{compat=1.18} \hypersetup{ colorlinks=true, linkcolor=blue!60!black, citecolor=blue!60!black, urlcolor=blue!60!black } % ============================================================================ % Title % ============================================================================ \title{ \textbf{Julian: Efficient Training of a Bilingual 600M Parameter \\ Language Model on TPU with JAX} } \author{ Julian Kerignard \\ Independent Research \\ \texttt{github.com/JulianKrgd} \\ \texttt{huggingface.co/JulianKrgd} } \date{February 2026} \begin{document} \maketitle % ============================================================================ % Abstract % ============================================================================ \begin{abstract} We present \textbf{Julian}\footnote{Models available on HuggingFace: \url{https://huggingface.co/JulianKrgd}}, a family of decoder-only language models ranging from 100M to 600M parameters, trained entirely from scratch on up to 39 billion tokens of bilingual English-French data using JAX/Flax on Google Cloud TPUs. Our largest model, Julian-600M, employs a modern transformer architecture with Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm, following the design principles of LLaMA. Despite being trained on significantly fewer tokens than comparable models, Julian-600M achieves 53.5\% normalized accuracy on HellaSwag, outperforming OPT-1.3B (41.5\%) which has over twice the parameters and was trained on 8$\times$ more data. We further fine-tune Julian-600M using supervised fine-tuning (SFT) on 2.47 million instruction-response pairs formatted with the ChatML template, producing instruction-following variants at 30K and 100K training steps. We provide a detailed account of our training infrastructure, data pipeline, and the challenges of multi-host TPU training with JAX. All model weights are released openly under the Apache 2.0 license on HuggingFace. \end{abstract} % ============================================================================ % 1. Introduction % ============================================================================ \section{Introduction} The rapid advancement of large language models (LLMs) has demonstrated remarkable capabilities in natural language understanding and generation \citep{brown2020language, chowdhery2023palm, touvron2023llama}. However, the training of such models typically requires enormous computational resources, often inaccessible to independent researchers and smaller organizations. Recent work has shown that smaller language models, when trained with appropriate data and techniques, can achieve competitive performance on many benchmarks \citep{biderman2023pythia, zhang2022opt}. 
The Chinchilla scaling laws \citep{hoffmann2022training} further suggest that many models are undertrained relative to their size, and that optimal performance requires a careful balance between model size and training data volume. In this work, we present \textbf{Julian}, a family of bilingual (English-French) language models trained from scratch using JAX/Flax on Google Cloud TPU v4-32 pods. Our contributions are: \begin{enumerate}[leftmargin=*] \item \textbf{Efficient training}: We train a 600M parameter model on 39B tokens that outperforms OPT-1.3B on HellaSwag despite using 2$\times$ fewer parameters and 8$\times$ fewer training tokens. \item \textbf{Bilingual capability}: To the best of our knowledge, Julian is among the few openly released small language models trained from scratch on a mixture of English and French data (70\%/30\% ratio). \item \textbf{Complete pipeline}: We describe the full training pipeline including data collection, tokenizer training, pre-training, supervised fine-tuning, and evaluation, providing a practical guide for training LLMs on TPU infrastructure. \item \textbf{Open release}: All model weights, tokenizer, and training code are released under the Apache 2.0 license. \end{enumerate} % ============================================================================ % 2. Related Work % ============================================================================ \section{Related Work} \paragraph{Scaling Laws.} \citet{kaplan2020scaling} established neural scaling laws showing power-law relationships between model size, dataset size, compute budget, and loss. \citet{hoffmann2022training} refined these findings with the Chinchilla scaling laws, demonstrating that many large models are significantly undertrained and that the optimal token-to-parameter ratio is approximately 20:1. Our Julian-600M model is trained on 39B tokens (65:1 ratio), exceeding the Chinchilla-optimal budget. \paragraph{Open Language Models.} GPT-2 \citep{radford2019language} pioneered the release of pre-trained language models, with sizes ranging from 124M to 1.5B parameters. OPT \citep{zhang2022opt} provided models from 125M to 175B parameters trained on 300B tokens with detailed training logs. Pythia \citep{biderman2023pythia} offered a suite of models from 70M to 12B parameters trained on 300B tokens from The Pile, specifically designed for studying model behavior during training. LLaMA \citep{touvron2023llama} introduced architectural improvements (RoPE, SwiGLU, RMSNorm) that have become standard in modern language models. \paragraph{Small Language Models.} TinyLlama \citep{zhang2024tinyllama} demonstrated that a 1.1B model trained on 3T tokens can achieve strong performance. MobileLLM \citep{liu2024mobilellm} explored architecture design for sub-billion parameter models. These works highlight the viability and growing interest in smaller, more efficient models. \paragraph{Multilingual Models.} While large multilingual models like mBERT \citep{devlin2019bert}, XLM-R \citep{conneau2020xlmr}, and BLOOM \citep{workshop2023bloom} cover many languages, few small models are specifically designed for bilingual English-French text generation from scratch. % ============================================================================ % 3. 
Model Architecture % ============================================================================ \section{Model Architecture} Julian follows the LLaMA architecture \citep{touvron2023llama}: a decoder-only transformer with pre-normalization using RMSNorm \citep{zhang2019root}, SwiGLU feed-forward networks \citep{shazeer2020glu}, and Rotary Position Embeddings (RoPE) \citep{su2021roformer}. No bias terms are used in any linear projection. \subsection{Architecture Details} \begin{table}[h] \centering \caption{Julian model configurations. All models use RoPE ($\theta$=10000), SwiGLU, RMSNorm (pre-norm), and no bias terms.} \label{tab:model_configs} \begin{tabular}{lccc} \toprule \textbf{Parameter} & \textbf{Julian-100M} & \textbf{Julian-250M$^\dagger$} & \textbf{Julian-600M} \\ \midrule Hidden size ($d_{\text{model}}$) & 640 & 1024 & 1280 \\ Layers ($L$) & 12 & 14 & 18 \\ Attention heads ($H$) & 10 & 16 & 20 \\ Head dimension ($d_h$) & 64 & 64 & 64 \\ FFN size ($d_{\text{ff}}$) & 2560 & 4096 & 5120 \\ Vocabulary size ($V$) & 50{,}000 & 50{,}000 & 50{,}000 \\ Context length & 2048 & 2048 & 2048 \\ Precision & bfloat16 & bfloat16 & bfloat16 \\ \bottomrule \end{tabular} \end{table} \noindent{\small $^\dagger$ Julian-250M is currently in preparation and has not yet been trained.} \paragraph{Rotary Position Embeddings (RoPE).} We use RoPE \citep{su2021roformer} with base frequency $\theta = 10{,}000$. For each attention head, the query and key vectors are rotated by position-dependent angles: \begin{equation} f_{\theta}(x, m) = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \odot \begin{pmatrix} \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{d/2}) \\ \cos(m\theta_{d/2}) \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \odot \begin{pmatrix} \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{d/2}) \\ \sin(m\theta_{d/2}) \end{pmatrix} \end{equation} where $\theta_i = \theta^{-2i/d}$ and $m$ is the position index. \paragraph{SwiGLU Feed-Forward Network.} Each transformer block uses a SwiGLU \citep{shazeer2020glu} feed-forward network: \begin{equation} \text{FFN}(x) = W_{\text{down}} \cdot (\text{SiLU}(W_{\text{gate}} x) \odot W_{\text{up}} x) \end{equation} where $W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$. The SwiGLU activation introduces an additional projection compared to standard FFNs but improves quality at equivalent compute. \paragraph{RMSNorm.} We use Root Mean Square Layer Normalization \citep{zhang2019root} applied before each attention and feed-forward sub-layer (pre-norm architecture): \begin{equation} \text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma \end{equation} where $\gamma$ is a learned scale parameter and $\epsilon = 10^{-6}$. \subsection{Tokenizer} We train a SentencePiece \citep{kudo2018sentencepiece} BPE tokenizer with a vocabulary of 50{,}000 tokens on a balanced sample of our training corpus. 
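For concreteness, the sketch below shows how a tokenizer with these properties can be trained through the SentencePiece Python API. The corpus path, model prefix, and any flags beyond the settings listed below are illustrative assumptions rather than the exact command used for Julian.
\begin{Verbatim}[fontsize=\small]
import sentencepiece as spm

# Hypothetical input/output paths; the actual run used a balanced EN/FR sample.
spm.SentencePieceTrainer.train(
    input="tokenizer_sample.txt",
    model_prefix="julian_sp50k",
    model_type="bpe",
    vocab_size=50_000,
    character_coverage=0.9999,      # 99.99% character coverage
    byte_fallback=True,             # any UTF-8 input remains encodable
    # Chat/code control tokens reserved as single pieces from the start:
    user_defined_symbols="<|code|>,<|endcode|>,<|im_start|>,<|im_end|>",
)
\end{Verbatim}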
Key settings include:
\begin{itemize}[leftmargin=*]
\item Character coverage: 99.99\%
\item Byte fallback enabled (handles any UTF-8 input)
\item Special tokens: four reserved control tokens (IDs 0--3), \texttt{<|code|>} (4), \texttt{<|endcode|>} (5), \texttt{<|im\_start|>} (6), \texttt{<|im\_end|>} (7)
\end{itemize}
The ChatML-style tokens (\texttt{<|im\_start|>} and \texttt{<|im\_end|>}) are included from the start of pre-training to support later instruction fine-tuning without vocabulary expansion.
% ============================================================================
% 4. Training Data
% ============================================================================
\section{Training Data}
\subsection{Data Sources}
We curate a bilingual training corpus of approximately 39 billion tokens with a 70\% English / 30\% French ratio. Table~\ref{tab:data_sources} lists our data sources.
\begin{table}[H]
\centering
\caption{Training data composition for Julian-600M (39B tokens).}
\label{tab:data_sources}
\begin{tabular}{lccc}
\toprule
\textbf{Source} & \textbf{Languages} & \textbf{Tokens (approx.)} & \textbf{Quality} \\
\midrule
Wikipedia & EN + FR & 5.5B & High \\
OSCAR 2301 & EN + FR & 15B & Medium \\
FineWeb-Edu & EN & 8B & Very High \\
Project Gutenberg & EN + FR & 1B & High \\
The Stack (code) & Multi & 2B & High \\
\midrule
\textbf{Total} & & \textbf{$\sim$39B} & \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Data Processing Pipeline}
Our data processing pipeline consists of the following stages:
\begin{enumerate}[leftmargin=*]
\item \textbf{Download}: Raw data is obtained from HuggingFace datasets (OSCAR, FineWeb-Edu, The Stack), Wikipedia dumps, and Project Gutenberg mirrors.
\item \textbf{Cleaning}: Documents shorter than 100 characters or longer than 500K characters are removed. We enforce a minimum alphanumeric character ratio of 70\%.
\item \textbf{Deduplication}: MinHash Locality-Sensitive Hashing (LSH) with a Jaccard similarity threshold of 0.8 is used for near-duplicate removal.
\item \textbf{Language detection}: We use fastText language identification with a confidence threshold of 0.8 to ensure correct language labeling.
\item \textbf{Tokenization}: The cleaned corpus is tokenized using our SentencePiece tokenizer and packed into sequences of 2048 tokens.
\item \textbf{Sharding}: The tokenized data is split into 359 shards stored on Google Cloud Storage (GCS) for streaming during training.
\end{enumerate}
% ============================================================================
% 5. Training Procedure
% ============================================================================
\section{Training Procedure}
\subsection{Infrastructure}
All training is conducted on Google Cloud TPU v4-32 pods (32 TPU chips across 4 hosts) provided through the TPU Research Cloud (TRC) program. We use the JAX \citep{bradbury2018jax} framework with Flax for model definition and Optax for optimization.
\subsection{Parallelism Strategy}
We employ data parallelism \citep{xu2021gspmd} across the 32 TPU chips using JAX's \texttt{pmap} primitive: model parameters are replicated on every device, the batch dimension is sharded across devices, and gradients are averaged across devices at each step, as sketched below. Gradient accumulation over 8 micro-steps yields an effective batch size of 1{,}024 sequences. All computations use bfloat16 mixed precision \citep{micikevicius2018mixed} for both the forward and backward passes, with optimizer states also stored in bfloat16.
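To make this concrete, the following is a minimal sketch of a \texttt{pmap}-based data-parallel update with Flax-style parameters and Optax. The AdamW settings shown are the peak values from Table~\ref{tab:hyperparams}; the \texttt{model\_apply} function and batch layout are illustrative stand-ins, and gradient accumulation, the learning-rate schedule, gradient clipping, and bfloat16 casting are omitted for brevity.
\begin{Verbatim}[fontsize=\small]
import jax
import optax

# Illustrative AdamW at the peak pre-training settings (no schedule/clipping).
optimizer = optax.adamw(learning_rate=1.2e-3, b1=0.9, b2=0.95, weight_decay=0.1)

def loss_fn(params, batch):
    # `model_apply` stands in for the Flax module's apply function.
    logits = model_apply(params, batch["inputs"])
    return optax.softmax_cross_entropy_with_integer_labels(
        logits, batch["targets"]).mean()

def train_step(params, opt_state, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Average gradients across the 32 devices (data parallelism).
    grads = jax.lax.pmean(grads, axis_name="batch")
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, jax.lax.pmean(loss, axis_name="batch")

# Parameters and optimizer state are replicated on every device; the batch's
# leading axis is split across local devices before each call.
p_train_step = jax.pmap(train_step, axis_name="batch", donate_argnums=(0, 1))
\end{Verbatim}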
\subsection{Optimizer and Schedule}
We use AdamW \citep{loshchilov2019decoupled} with the configuration summarized in Table~\ref{tab:hyperparams}. The total compute budget for Julian-600M is approximately $1.4 \times 10^{20}$ FLOPs (estimated as $6 \times N \times D$ where $N = 600\text{M}$ parameters and $D = 39\text{B}$ tokens). Training was completed in approximately 21 days of wall-clock time on a single TPU v4-32 pod, achieving a Model FLOPs Utilization (MFU) of approximately 38\%.
\begin{table}[h]
\centering
\caption{Pre-training hyperparameters for Julian-600M.}
\label{tab:hyperparams}
\begin{tabular}{lc}
\toprule
\textbf{Hyperparameter} & \textbf{Value} \\
\midrule
Optimizer & AdamW \\
$\beta_1$, $\beta_2$ & 0.9, 0.95 \\
$\epsilon$ & $10^{-8}$ \\
Weight decay & 0.1 \\
Peak learning rate & $1.2 \times 10^{-3}$ \\
Minimum learning rate & $1.2 \times 10^{-4}$ (10\% of peak) \\
Warmup steps & 3{,}000 \\
Total steps & 300{,}000 \\
LR schedule & Cosine annealing \\
Gradient clipping & 1.0 (global norm) \\
Batch size (per device) & 4 \\
Gradient accumulation steps & 8 \\
Effective batch size & 1{,}024 \\
Sequence length & 2{,}048 \\
Tokens per step & $\sim$2.1M \\
Total tokens & $\sim$39B \\
Precision & bfloat16 \\
\bottomrule
\end{tabular}
\end{table}
We follow the Chinchilla cosine learning rate schedule \citep{hoffmann2022training}: linear warmup from 0 to the peak learning rate over 3{,}000 steps, followed by cosine decay to 10\% of the peak value. Optimizer states ($\mu$ and $\nu$) are stored in bfloat16 to reduce memory consumption by approximately 40\%.
\subsection{Robustness}
Training on preemptible TPU instances requires robust checkpoint management. We implement:
\begin{itemize}[leftmargin=*]
\item \textbf{Asynchronous checkpointing} using Orbax, saving every 10{,}000 steps without blocking training.
\item \textbf{SIGTERM handler}: On preemption, an emergency checkpoint is written within the 30-second grace period.
\item \textbf{Health monitoring}: Automatic detection of NaN/Inf values in gradients and loss, with circuit-breaker logic for retries.
\item \textbf{Global synchronization}: JAX barrier synchronization before checkpoint writes to ensure multi-host consistency.
\end{itemize}
% ============================================================================
% 6. Supervised Fine-Tuning
% ============================================================================
\section{Supervised Fine-Tuning}
We perform supervised fine-tuning (SFT) on the pre-trained Julian-600M checkpoint (step 300{,}000) using a large instruction-following dataset.
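As detailed in the following subsections, every training example is rendered in the ChatML template and only assistant-response tokens contribute to the loss. The snippet below is a minimal sketch of that masked objective; the function and array names are illustrative rather than taken from our training code.
\begin{Verbatim}[fontsize=\small]
import jax.numpy as jnp
import optax

def sft_loss(logits, targets, loss_mask):
    # loss_mask is 1.0 on assistant-response tokens and 0.0 on system/user
    # (and padding) tokens, so prompts contribute nothing to the gradient.
    per_token = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
    return (per_token * loss_mask).sum() / jnp.maximum(loss_mask.sum(), 1.0)
\end{Verbatim}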
\subsection{Instruction Dataset}
Our SFT dataset comprises 2.47 million instruction-response pairs drawn from multiple sources:
\begin{table}[H]
\centering
\caption{SFT dataset composition.}
\label{tab:sft_data}
\begin{tabular}{lcc}
\toprule
\textbf{Source} & \textbf{Examples (approx.)} & \textbf{Language} \\
\midrule
Stanford Alpaca & 52K & English \\
Databricks Dolly 15K & 15K & English \\
Code Alpaca & 20K & English \\
GPT4All-J & 20K & English \\
French instruction data & 15K+ & French \\
OpenHermes 2.5 (synthetic) & $\sim$900K & English \\
SlimOrca & $\sim$500K & English \\
Other open-source instruction data & $\sim$900K & Multilingual \\
\midrule
\textbf{Total} & \textbf{2.47M} & \\
\bottomrule
\end{tabular}
\end{table}
\subsection{ChatML Format}
All instruction data is formatted using the ChatML template \citep{openai2023chatml}:
\smallskip\noindent\begin{minipage}{\textwidth}
\begin{Verbatim}[fontsize=\small]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
\end{Verbatim}
\end{minipage}
\smallskip\noindent
During SFT, loss is computed only on assistant response tokens using a binary loss mask. System and user tokens receive zero loss weight, ensuring the model learns to generate responses rather than memorizing prompts.
\subsection{SFT Hyperparameters}
\begin{table}[h]
\centering
\caption{SFT training hyperparameters.}
\label{tab:sft_hyperparams}
\begin{tabular}{lc}
\toprule
\textbf{Hyperparameter} & \textbf{Value} \\
\midrule
Base checkpoint & step 300{,}000 (39B tokens) \\
Learning rate & $2 \times 10^{-5}$ \\
Warmup steps & 1{,}000 \\
Batch size (effective) & 32--256 \\
Sequence length & 2{,}048 \\
Weight decay & 0.01 \\
Gradient clipping & 1.0 \\
\bottomrule
\end{tabular}
\end{table}
We train two SFT variants:
\begin{itemize}[leftmargin=*]
\item \textbf{SFT-30K}: 30{,}000 steps, approximately 2B tokens seen, final loss 1.86
\item \textbf{SFT-100K}: 100{,}000 steps, approximately 6.5B tokens seen ($\sim$2.2 epochs), final loss 1.69
\end{itemize}
An earlier variant, \textbf{Julian-600M-10B-Instruct-v0.1}, was fine-tuned from an intermediate pre-training checkpoint (step 170{,}000, $\sim$10B tokens) on a smaller instruction dataset ($\sim$185K examples). This variant serves as a baseline for comparison.
% ============================================================================
% 7.
Evaluation % ============================================================================ \section{Evaluation} \subsection{Benchmark Suite} We evaluate all Julian models on standard zero-shot benchmarks using the Language Model Evaluation Harness \citep{gao2023framework}: \begin{itemize}[leftmargin=*] \item \textbf{HellaSwag} \citep{zellers2019hellaswag}: Commonsense natural language inference (acc\_norm) \item \textbf{PIQA} \citep{bisk2020piqa}: Physical intuition QA (acc) \item \textbf{LAMBADA} \citep{paperno2016lambada}: Word prediction requiring broad context (acc, perplexity) \item \textbf{ARC-Easy / ARC-Challenge} \citep{clark2018think}: Science question answering (acc / acc\_norm) \item \textbf{WinoGrande} \citep{sakaguchi2020winogrande}: Commonsense coreference resolution (acc) \item \textbf{BoolQ} \citep{clark2019boolq}: Yes/no question answering (acc) \end{itemize} \subsection{Evaluation Infrastructure} Because standard lm-eval with HuggingFace models defaults to PyTorch on CPU when run on TPU VMs (no CUDA available), we implement a custom JAX-based evaluation wrapper that performs inference directly on TPU. This achieves approximately 5.8 items/second with batch size 48, completing the full evaluation suite ($\sim$72K requests) in approximately 3.5 hours on a single TPU v4-32 pod. % ============================================================================ % 8. Results % ============================================================================ \section{Results} \subsection{Julian Model Progression} Table~\ref{tab:julian_results} presents the benchmark results across Julian model variants, illustrating the impact of additional pre-training and supervised fine-tuning. \begin{table}[h] \centering \caption{Benchmark results (0-shot) for Julian model variants. Bold indicates best within Julian models for each benchmark.} \label{tab:julian_results} \begin{tabular}{lccccccc} \toprule \textbf{Model} & \textbf{HS} & \textbf{PIQA} & \textbf{LAM.} & \textbf{ARC-E} & \textbf{ARC-C} & \textbf{WG} & \textbf{BoolQ} \\ \midrule Julian-600M Base & \textbf{53.5} & \textbf{66.8} & 37.3 & --- & --- & --- & --- \\ Julian-600M SFT-30K & 41.7 & \textbf{66.8} & \textbf{37.7} & 53.5 & \textbf{27.1} & \textbf{53.8} & 60.6 \\ Julian-600M SFT-100K & 41.6 & 66.6 & \textbf{37.7} & \textbf{53.8} & 26.7 & 52.8 & \textbf{60.8} \\ Julian-600M-10B-v0.1 & 42.7 & 66.2 & 34.6 & --- & --- & --- & --- \\ \bottomrule \end{tabular} \end{table} \paragraph{SFT Impact.} Supervised fine-tuning causes a notable drop in HellaSwag accuracy ($-$11.8 points), consistent with observations in other models where instruction tuning trades benchmark performance for instruction-following capability. Other benchmarks remain largely stable, with slight improvements in LAMBADA, ARC-Easy, and BoolQ. \paragraph{SFT-30K vs SFT-100K.} The two SFT variants produce near-identical results, suggesting that 30K steps is sufficient for this dataset size. At 100K steps ($\sim$2.2 epochs), WinoGrande begins to degrade, likely due to overfitting. \subsection{Comparison with Existing Models} Table~\ref{tab:comparison} compares Julian-600M with publicly available models of similar or larger scale. \begin{table}[h] \centering \caption{Comparison with existing models (0-shot). 
Julian-600M Base outperforms OPT-1.3B on HellaSwag despite 2$\times$ fewer parameters and 8$\times$ fewer training tokens.} \label{tab:comparison} \resizebox{\textwidth}{!}{ \begin{tabular}{lccccccccc} \toprule \textbf{Model} & \textbf{Params} & \textbf{Tokens} & \textbf{HS} & \textbf{PIQA} & \textbf{LAM.} & \textbf{ARC-E} & \textbf{ARC-C} & \textbf{WG} \\ \midrule GPT-2 Small & 124M & 100B+ & 31.5 & --- & 46.0 & --- & --- & 50.4 \\ OPT-125M & 125M & 300B & 29.2 & 63.0 & 37.9 & 43.5 & 18.9 & 50.3 \\ OPT-350M & 331M & 300B & 32.0 & 64.4 & 45.2 & 44.0 & 20.7 & 52.3 \\ Pythia-410M & 405M & 300B & 33.3 & 66.8 & 50.5 & 50.4 & 21.3 & 53.0 \\ \midrule \textbf{Julian-600M Base} & \textbf{600M} & \textbf{39B} & \textbf{53.5} & \textbf{66.8} & \textbf{37.3} & --- & --- & --- \\ \textbf{Julian-600M SFT-30K} & \textbf{600M} & \textbf{39B+2B} & \textbf{41.7} & \textbf{66.8} & \textbf{37.7} & \textbf{53.5} & \textbf{27.1} & \textbf{53.8} \\ \midrule GPT-2 XL & 1{,}558M & 100B+ & 50.9 & 70.8 & 63.2 & --- & --- & 59.4 \\ Pythia-1B & 1B & 300B & 37.6 & 70.5 & 56.6 & 55.9 & 24.3 & 54.5 \\ OPT-1.3B & 1.3B & 300B & 41.5 & 71.7 & 57.9 & 57.0 & 23.4 & 59.5 \\ \bottomrule \end{tabular} } \end{table} \paragraph{Key Findings.} \begin{itemize}[leftmargin=*] \item \textbf{HellaSwag}: Julian-600M Base achieves 53.5\%, surpassing GPT-2~XL (50.9\%, 1.5B params), OPT-1.3B (41.5\%), and Pythia-1B (37.6\%). This is a remarkable result for a 600M model trained on only 39B tokens. \item \textbf{PIQA}: Julian-600M matches Pythia-410M at 66.8\% and falls only slightly below models 2--3$\times$ larger. \item \textbf{LAMBADA}: Julian-600M achieves 37.3\%, lower than similarly-sized models trained on more data. This likely reflects the smaller training corpus, as LAMBADA is particularly sensitive to the volume and diversity of training text. \item \textbf{Tokens efficiency}: Julian-600M achieves its HellaSwag score with 39B tokens, while OPT and Pythia models were trained on 300B tokens (7.7$\times$ more). \end{itemize} \begin{figure}[t] \centering \begin{tikzpicture} \begin{axis}[ xbar, bar width=7pt, width=0.88\textwidth, height=6cm, xlabel={HellaSwag (acc\_norm, \%)}, ytick={0,1,2,3,4,5,6,7}, yticklabels={ {OPT-125M {\scriptsize(125M, 300B tok)}}, {GPT-2 Small {\scriptsize(124M, 100B+ tok)}}, {OPT-350M {\scriptsize(331M, 300B tok)}}, {Pythia-410M {\scriptsize(405M, 300B tok)}}, {Pythia-1B {\scriptsize(1B, 300B tok)}}, {OPT-1.3B {\scriptsize(1.3B, 300B tok)}}, {GPT-2 XL {\scriptsize(1.5B, 100B+ tok)}}, {\textbf{Julian-600M} {\scriptsize\textbf{(600M, 39B tok)}}} }, xmin=25, xmax=58, nodes near coords, nodes near coords style={font=\scriptsize, anchor=west}, enlarge y limits=0.1, xmajorgrids=true, grid style={gray!20}, y tick label style={font=\footnotesize}, ] \addplot[fill=gray!40, draw=gray!60] coordinates { (29.2,0) (31.5,1) (32.0,2) (33.3,3) (37.6,4) (41.5,5) (50.9,6) (53.5,7) }; \end{axis} \end{tikzpicture} \caption{HellaSwag accuracy (acc\_norm) across models, sorted by score. Numbers in parentheses indicate parameter count and training data volume. Julian-600M achieves the highest score despite having fewer parameters and significantly less training data than most comparison models.} \label{fig:hellaswag_comparison} \end{figure} % ============================================================================ % 9. 
Interpretation of Results % ============================================================================ \section{Interpretation of Results} This section provides an in-depth analysis of the results presented above, examining pre-training dynamics, the impact of SFT, and the saturation phenomena observed. \subsection{Pre-training Progression} The evolution of performance between the two pre-training checkpoints reveals sustained learning dynamics. Between the 10B token checkpoint (step 100{,}000) and the final 39B token checkpoint (step 300{,}000), we observe: \begin{itemize}[leftmargin=*] \item \textbf{HellaSwag}: 45.8\% $\rightarrow$ 53.5\% (+7.7 points) \item \textbf{Loss}: 3.20 $\rightarrow$ 2.33 ($-$27\%) \item \textbf{PIQA}: 67.6\% $\rightarrow$ 66.8\% ($-$0.8 point) \item \textbf{LAMBADA}: 35.0\% $\rightarrow$ 37.3\% (+2.3 points) \end{itemize} The +7.7 point improvement on HellaSwag is particularly significant. This benchmark measures commonsense reasoning, and the continued improvement suggests that the model has not reached its maximum learning capacity at 39B tokens. The loss continuing to decrease substantially (from 3.20 to 2.33) confirms the absence of saturation: the model continues to learn effectively at each additional training step. PIQA remains stable, while LAMBADA shows a modest but encouraging improvement. Extrapolating this trajectory, continued training beyond 39B tokens would likely yield further gains, particularly on LAMBADA where Julian-600M remains behind models trained on 300B tokens. \subsection{Impact of SFT on Benchmarks} Supervised fine-tuning fundamentally transforms the model's behavior: from a text completer that statistically predicts the next token, it becomes an assistant capable of responding to structured instructions. This transformation has a measurable cost on benchmarks. \paragraph{The HellaSwag sacrifice.} The most notable drop is on HellaSwag: $-$11.8 points (53.5\% $\rightarrow$ 41.7\%). This phenomenon is well documented in the literature \citep{ouyang2022training} and is explained by the very nature of SFT. HellaSwag measures the model's ability to naturally complete a text; however, SFT reorients the model toward producing responses in a specific conversational format (ChatML). The model partially ``unlearns'' free completion in favor of instruction following. This is an expected and generally accepted trade-off. \paragraph{Reasoning stability.} In contrast, benchmarks measuring reasoning are remarkably stable after SFT: \begin{itemize}[leftmargin=*] \item \textbf{PIQA} stays at 66.8\% (identical to the base model), indicating that physical intuition is unaffected. \item \textbf{WinoGrande} reaches 53.8\%, comparable to reference models of similar size. \item \textbf{BoolQ} reaches 60.6\%, within the expected range for a 600M model. \end{itemize} These results suggest that SFT does not alter the model's underlying reasoning capabilities but primarily modifies the output distribution (the format of generated responses). \paragraph{LAMBADA improvement.} Notably, LAMBADA slightly improves after SFT (+0.4 points, from 37.3\% to 37.7\%). This counterintuitive result can be explained by the fact that the instruction-response format encourages the model to better exploit provided context to produce a precise answer---exactly what LAMBADA measures (predicting a word from a long context). 
\subsection{Over-SFT: Quantitative Analysis (30K vs 100K)}
The comparison between SFT-30K and SFT-100K constitutes one of the most instructive findings of this work. Table~\ref{tab:sft_delta} presents the detailed differences.
\begin{table}[H]
\centering
\caption{Detailed comparison between SFT-30K and SFT-100K. $\Delta$ represents the difference (100K $-$ 30K). SFT-100K uses 3.3$\times$ more compute for nearly identical results.}
\label{tab:sft_delta}
\begin{tabular}{lccc}
\toprule
\textbf{Benchmark} & \textbf{SFT-30K} & \textbf{SFT-100K} & \textbf{$\Delta$} \\
\midrule
Loss & 1.86 & 1.69 & $-$0.17 \\
HellaSwag & 41.7\% & 41.6\% & $-$0.1 \\
PIQA & 66.8\% & 66.6\% & $-$0.2 \\
LAMBADA & 37.7\% & 37.7\% & 0.0 \\
ARC-Easy & 53.5\% & 53.8\% & +0.3 \\
ARC-Challenge & 27.1\% & 26.7\% & $-$0.4 \\
WinoGrande & 53.8\% & 52.8\% & \textbf{$-$1.0} \\
BoolQ & 60.6\% & 60.8\% & +0.2 \\
\midrule
SFT tokens seen & 1.97B & 6.55B & --- \\
Epochs & 0.66 & 2.20 & --- \\
\bottomrule
\end{tabular}
\end{table}
\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
ybar=8pt, bar width=12pt,
width=\textwidth, height=6.5cm,
ylabel={Accuracy (\%)},
symbolic x coords={HellaSwag, PIQA, LAMBADA},
xtick=data,
ymin=30, ymax=72,
nodes near coords,
nodes near coords style={font=\scriptsize, /pgf/number format/fixed, /pgf/number format/precision=1, anchor=south},
legend style={at={(0.5,-0.15)}, anchor=north, legend columns=3, font=\small},
enlarge x limits=0.35,
ymajorgrids=true, grid style={gray!15},
]
\addplot[fill=blue!25, draw=blue!50] coordinates { (HellaSwag, 53.5) (PIQA, 66.8) (LAMBADA, 37.3) };
\addplot[fill=orange!30, draw=orange!55] coordinates { (HellaSwag, 41.7) (PIQA, 66.8) (LAMBADA, 37.7) };
\addplot[fill=red!20, draw=red!45] coordinates { (HellaSwag, 41.6) (PIQA, 66.6) (LAMBADA, 37.7) };
\legend{Base 39B, SFT-30K (0.66 ep.), SFT-100K (2.2 ep.)}
\end{axis}
\end{tikzpicture}
\caption{Impact of supervised fine-tuning on benchmark performance. SFT causes a significant HellaSwag drop ($-$11.8 points) while preserving PIQA and slightly improving LAMBADA. SFT-30K and SFT-100K achieve near-identical results despite a 3.3$\times$ difference in compute, indicating clear saturation.}
\label{fig:sft_impact}
\end{figure}
\paragraph{Loss is not a good SFT quality indicator.}
The most striking result is the disconnect between loss and benchmark performance. The loss drops significantly from 1.86 to 1.69 ($-$9\%), but benchmarks stagnate or degrade. This reveals that the model learns to better reproduce the \emph{format} of the SFT dataset responses (lower loss on response tokens) without improving its underlying \emph{knowledge} or \emph{reasoning} capabilities. In other words, the model becomes more fluent in the ChatML format without becoming more capable.
\paragraph{Overfitting signal: WinoGrande.}
The degradation of WinoGrande from 53.8\% to 52.8\% ($-$1.0 point) is the clearest overfitting signal. WinoGrande tests commonsense reasoning on ambiguous pronoun resolution, a capability that should not degrade with additional training if the model were generalizing correctly. With 2.47M examples and 2.2 epochs, each example in the SFT dataset has been seen on average more than 2 times. The model begins to memorize dataset-specific patterns rather than generalize, which harms its general reasoning ability.
\paragraph{ARC-Challenge confirms the trend.}
The drop in ARC-Challenge ($-$0.4 points) points in the same direction. This benchmark tests scientific reasoning on difficult questions, and its parallel degradation with WinoGrande reinforces the hypothesis of overfitting that specifically impacts reasoning capabilities.
\paragraph{Practical implication.}
At our effective batch size, one epoch over the SFT mixture corresponds to 45{,}383 steps. SFT-30K (0.66 epochs) has therefore not completed a full pass through the data, yet it already performs on par with SFT-100K across all benchmarks; the additional compute of SFT-100K (3.3$\times$ more) is largely wasted.
\subsection{Importance of the Base Checkpoint}
The comparison between the different fine-tuned variants reveals an apparent paradox:
\begin{itemize}[leftmargin=*]
\item \textbf{Instruct v0.1} (base 10B tokens, 5{,}500 SFT steps, 185K examples): HellaSwag = 42.7\%
\item \textbf{SFT-30K} (base 39B tokens, 30{,}000 SFT steps, 2.47M examples): HellaSwag = 41.7\%
\end{itemize}
The model fine-tuned from a weaker base (10B tokens) achieves a higher post-SFT HellaSwag (+1.0 point) than the one fine-tuned from the stronger base (39B tokens). Several factors may explain this result:
\begin{enumerate}[leftmargin=*]
\item \textbf{Different SFT datasets}: Instruct v0.1 uses 185K examples (likely of higher individual quality), while SFT-30K uses 2.47M examples (more diversity but potentially more noise). The quality of SFT examples has a direct impact on benchmark degradation.
\item \textbf{Different SFT duration}: 5{,}500 steps represent a much lighter SFT exposure than 30{,}000 steps, which preserves more of the base model's capabilities. With fewer steps, the model ``forgets'' less of its text completion abilities.
\item \textbf{Different loss surfaces}: The model at 10B tokens is in a different training regime (loss 3.20 vs 2.33), which may influence how SFT modifies the weights---a model with higher loss may be more ``malleable'' to SFT.
\end{enumerate}
This result underscores that post-SFT quality is not a simple function of the base checkpoint: the combination of base checkpoint, SFT dataset, and SFT duration forms a three-dimensional hyperparameter space that should be optimized jointly.
\subsection{Practical Recommendations}
Based on the entirety of our observations, we formulate the following recommendations for fine-tuning small language models (under 1B parameters):
\begin{enumerate}[leftmargin=*]
\item \textbf{Limit SFT to less than 1 epoch}: For datasets on the order of millions of examples, 0.5--0.7 epochs appears optimal. Beyond that, the risk of overfitting increases with no measurable benefit on benchmarks.
\item \textbf{Monitor WinoGrande and ARC-Challenge}: These two benchmarks are the first to show signs of overfitting during SFT. A degradation of these metrics is a more reliable stopping signal than training loss.
\item \textbf{Do not trust loss for SFT quality}: Unlike pre-training, where loss is a reliable indicator of model quality, SFT loss primarily measures format compliance, not reasoning quality.
\item \textbf{Prefer diversity over volume}: A high-quality SFT dataset with diverse examples is preferable to a large noisy dataset trained over multiple epochs.
\item \textbf{Invest in pre-training}: The progression from 45.8\% to 53.5\% on HellaSwag shows that additional pre-training yields gains that far exceed those from increasing SFT.
\end{enumerate}
% ============================================================================
% 10. Analysis
% ============================================================================
\section{Analysis}
\subsection{Training Efficiency}
The strong HellaSwag performance of Julian-600M despite limited training data suggests that our architecture and training procedure are highly efficient. We hypothesize several contributing factors:
\begin{enumerate}[leftmargin=*]
\item \textbf{Modern architecture}: The combination of RoPE, SwiGLU, and RMSNorm (as in LLaMA) provides better inductive biases than the architectures used in GPT-2 and OPT (learned positional embeddings, standard FFN, LayerNorm).
\item \textbf{Data quality}: FineWeb-Edu and Wikipedia provide high-quality, factual training data, potentially offering more ``learning per token'' than noisier web crawls.
\item \textbf{Bilingual training}: Exposure to both English and French may provide cross-lingual transfer benefits, particularly for commonsense reasoning tasks.
\end{enumerate}
\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
width=0.92\textwidth, height=7cm,
xlabel={Training tokens},
ylabel={HellaSwag (acc\_norm, \%)},
xmode=log,
xmin=2e10, xmax=5e11,
ymin=25, ymax=58,
grid=both, grid style={gray!15},
legend style={at={(0.97,0.97)}, anchor=north east, font=\small},
xtick={5e10, 1e11, 3e11},
xticklabels={50B, 100B, 300B},
]
\addplot[only marks, mark=*, mark size=2.5pt, gray!60] coordinates { (3e11, 29.2) (1e11, 31.5) (3e11, 32.0) (3e11, 33.3) (3e11, 37.6) (3e11, 41.5) (1e11, 50.9) };
\addplot[only marks, mark=*, mark size=3.5pt, black, fill=black!70] coordinates { (3.9e10, 53.5) };
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 29.2) {OPT-125M};
\node[font=\tiny, anchor=south east] at (axis cs:9.5e10, 31.5) {GPT-2 Small};
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 32.0) {OPT-350M};
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 33.3) {Pythia-410M};
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 37.6) {Pythia-1B};
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 41.5) {OPT-1.3B};
\node[font=\tiny, anchor=south east] at (axis cs:9.5e10, 50.9) {GPT-2 XL};
\node[font=\scriptsize, anchor=south west] at (axis cs:4.2e10, 53.5) {\textbf{Julian-600M}};
\legend{Other models, Julian (ours)}
\end{axis}
\end{tikzpicture}
\caption{Token efficiency: HellaSwag accuracy vs.\ training data volume. Julian-600M (upper left, 39B tokens) achieves the highest HellaSwag score with 7.7$\times$ less training data than OPT and Pythia models (300B tokens). The filled black marker highlights Julian's position in the high-accuracy, low-data region.}
\label{fig:token_efficiency}
\end{figure}
\subsection{The HellaSwag Anomaly}
The HellaSwag score of 53.5\% for Julian-600M is remarkably high---surpassing even GPT-2~XL (50.9\%) which has 2.5$\times$ more parameters. Several hypotheses merit investigation:
\begin{itemize}[leftmargin=*]
\item \textbf{Architectural hypothesis}: Modern components (RoPE, SwiGLU, RMSNorm) may be particularly advantageous for text completion tasks measured by HellaSwag. The length-normalized scoring (acc\_norm) could also favor our architecture.
\item \textbf{Data quality hypothesis}: FineWeb-Edu's educational content may provide particularly relevant training signal for the commonsense scenarios tested by HellaSwag.
\item \textbf{Contamination hypothesis}: While we applied rigorous deduplication \citep{lee2022deduplicating}, we cannot fully exclude partial contamination with benchmark-adjacent data, particularly through FineWeb-Edu.
\end{itemize}
% ============================================================================
% 11. Limitations
% ============================================================================
\section{Limitations}
\begin{itemize}[leftmargin=*]
\item \textbf{Model size}: At 600M parameters, Julian has limited reasoning capabilities and factual accuracy compared to larger models.
\item \textbf{Training data volume}: Although 39B tokens exceeds the Chinchilla-optimal budget for a 600M-parameter model ($\sim$12B tokens), it remains far less data than the 300B+ tokens used to train comparable open models, and the still-decreasing training loss suggests the model could benefit from further training.
\item \textbf{English-centric evaluation}: All benchmarks are in English. We lack standardized French evaluation benchmarks for language models of this size.
\item \textbf{Hallucination}: Like all language models, Julian frequently generates incorrect or fabricated information, particularly for factual queries.
\item \textbf{Basic instruction following}: SFT without reinforcement learning from human feedback (RLHF) \citep{christiano2017deep} or direct preference optimization (DPO) \citep{rafailov2023direct} produces instruction-following capabilities that are significantly weaker than those of RLHF-trained models.
\item \textbf{LAMBADA underperformance}: The relatively low LAMBADA accuracy (37.3\% vs.\ 50.5\% for Pythia-410M) indicates that broader text prediction capabilities lag behind the strong commonsense reasoning performance.
\end{itemize}
% ============================================================================
% 12. Conclusion
% ============================================================================
\section{Conclusion}
We have presented Julian, a family of bilingual language models trained from scratch on TPU infrastructure using JAX/Flax. Our flagship Julian-600M model achieves remarkable efficiency on HellaSwag (53.5\%), outperforming models with 2$\times$ more parameters trained on 8$\times$ more data. We have documented the complete training pipeline, from data collection and tokenizer training to pre-training, supervised fine-tuning, and evaluation.
\paragraph{Future Work.}
We plan to: (1) scale Julian to 2B parameters using larger TPU configurations (v6e-64); (2) implement DPO \citep{rafailov2023direct} for improved instruction following; (3) develop French-language evaluation benchmarks; and (4) explore continued pre-training on larger datasets to improve LAMBADA and general text prediction performance.
\paragraph{Open Release.}
All model weights are available at \url{https://huggingface.co/JulianKrgd} under the Apache 2.0 license.
% ============================================================================
% Acknowledgments
% ============================================================================
\section*{Acknowledgments}
This work was supported by the Google TPU Research Cloud (TRC) program, which provided access to Cloud TPU v4-32 pods. We thank the TRC team for their support and the allocation of compute resources that made this research possible.
% ============================================================================
% References
% ============================================================================
\bibliographystyle{plainnat}
\begin{thebibliography}{36}
\bibitem[Biderman et~al.(2023)]{biderman2023pythia}
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van~der Wal.
\newblock Pythia: A suite for analyzing large language models across training and scaling. \newblock In \emph{ICML}, 2023. \newblock \url{https://arxiv.org/abs/2304.01373} \bibitem[Christiano et~al.(2017)]{christiano2017deep} Paul~F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. \newblock Deep reinforcement learning from human preferences. \newblock In \emph{NeurIPS}, 2017. \newblock \url{https://arxiv.org/abs/1706.03741} \bibitem[Bradbury et~al.(2018)]{bradbury2018jax} James Bradbury, Roy Frostig, Peter Hawkins, Matthew~James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander{P}las, Skye Wanderman-{M}ilne, and Qiao Zhang. \newblock {JAX}: Composable transformations of {Python}+{NumPy} programs. \newblock 2018. \newblock \url{https://github.com/jax-ml/jax} \bibitem[Bisk et~al.(2020)]{bisk2020piqa} Yonatan Bisk, Rowan Zellers, Ronan Le~Bras, Jianfeng Gao, and Yejin Choi. \newblock {PIQA}: Reasoning about physical intuition in natural language. \newblock In \emph{AAAI}, 2020. \newblock \url{https://arxiv.org/abs/1911.11641} \bibitem[Brown et~al.(2020)]{brown2020language} Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared~D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et~al. \newblock Language models are few-shot learners. \newblock In \emph{NeurIPS}, 2020. \newblock \url{https://arxiv.org/abs/2005.14165} \bibitem[Chowdhery et~al.(2023)]{chowdhery2023palm} Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung~Won Chung, Charles Sutton, Sebastian Gehrmann, et~al. \newblock {PaLM}: Scaling language modeling with {P}athways. \newblock \emph{JMLR}, 2023. \newblock \url{https://arxiv.org/abs/2204.02311} \bibitem[Clark et~al.(2018)]{clark2018think} Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. \newblock Think you have solved question answering? {T}ry {ARC}, the {AI2} reasoning challenge. \newblock \emph{arXiv preprint arXiv:1803.05457}, 2018. \newblock \url{https://arxiv.org/abs/1803.05457} \bibitem[Clark et~al.(2019)]{clark2019boolq} Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. \newblock {BoolQ}: Exploring the surprising difficulty of natural yes/no questions. \newblock In \emph{NAACL}, 2019. \newblock \url{https://arxiv.org/abs/1905.10044} \bibitem[Conneau et~al.(2020)]{conneau2020xlmr} Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm{\'a}n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. \newblock Unsupervised cross-lingual representation learning at scale. \newblock In \emph{ACL}, 2020. \newblock \url{https://arxiv.org/abs/1911.02116} \bibitem[Devlin et~al.(2019)]{devlin2019bert} Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. \newblock {BERT}: Pre-training of deep bidirectional transformers for language understanding. \newblock In \emph{NAACL}, 2019. \newblock \url{https://arxiv.org/abs/1810.04805} \bibitem[Gao et~al.(2023)]{gao2023framework} Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le~Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 
\newblock A framework for few-shot language model evaluation. \newblock \emph{Zenodo}, 2023. \newblock \url{https://zenodo.org/records/10256836} \bibitem[Hoffmann et~al.(2022)]{hoffmann2022training} Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de~Las~Casas, Lisa~Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van~den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack~W. Rae, Oriol Vinyals, and Laurent Sifre. \newblock Training compute-optimal large language models. \newblock In \emph{NeurIPS}, 2022. \newblock \url{https://arxiv.org/abs/2203.15556} \bibitem[Kaplan et~al.(2020)]{kaplan2020scaling} Jared Kaplan, Sam McCandlish, Tom Henighan, Tom~B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. \newblock Scaling laws for neural language models. \newblock \emph{arXiv preprint arXiv:2001.08361}, 2020. \newblock \url{https://arxiv.org/abs/2001.08361} \bibitem[Lee et~al.(2022)]{lee2022deduplicating} Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. \newblock Deduplicating training data makes language models better. \newblock In \emph{ACL}, 2022. \newblock \url{https://arxiv.org/abs/2107.06499} \bibitem[Kudo and Richardson(2018)]{kudo2018sentencepiece} Taku Kudo and John Richardson. \newblock {SentencePiece}: A simple and language independent subword tokenizer and detokenizer for neural text processing. \newblock In \emph{EMNLP (demo)}, 2018. \newblock \url{https://arxiv.org/abs/1808.06226} \bibitem[Liu et~al.(2024)]{liu2024mobilellm} Zechun Liu, Changlin Li, Barlas O\u{g}uz, et~al. \newblock {MobileLLM}: Optimizing sub-billion parameter language models for on-device use cases. \newblock In \emph{ICML}, 2024. \newblock \url{https://arxiv.org/abs/2402.14905} \bibitem[Micikevicius et~al.(2018)]{micikevicius2018mixed} Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. \newblock Mixed precision training. \newblock In \emph{ICLR}, 2018. \newblock \url{https://arxiv.org/abs/1710.03740} \bibitem[Loshchilov and Hutter(2019)]{loshchilov2019decoupled} Ilya Loshchilov and Frank Hutter. \newblock Decoupled weight decay regularization. \newblock In \emph{ICLR}, 2019. \newblock \url{https://arxiv.org/abs/1711.05101} \bibitem[OpenAI(2023)]{openai2023chatml} OpenAI. \newblock {ChatML}: Chat markup language. \newblock Technical documentation, 2023. \newblock \url{https://github.com/openai/openai-python/blob/v0.28.1/chatml.md} \bibitem[Ouyang et~al.(2022)]{ouyang2022training} Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et~al. \newblock Training language models to follow instructions with human feedback. \newblock In \emph{NeurIPS}, 2022. \newblock \url{https://arxiv.org/abs/2203.02155} \bibitem[Paperno et~al.(2016)]{paperno2016lambada} Denis Paperno, Germ{\'a}n Kruszewski, Angeliki Lazaridou, Quan~Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern{\'a}ndez. \newblock The {LAMBADA} dataset: Word prediction requiring a broad discourse context. \newblock In \emph{ACL}, 2016. 
\newblock \url{https://arxiv.org/abs/1606.06031} \bibitem[Rafailov et~al.(2023)]{rafailov2023direct} Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher~D Manning, and Chelsea Finn. \newblock Direct preference optimization: Your language model is secretly a reward model. \newblock In \emph{NeurIPS}, 2023. \newblock \url{https://arxiv.org/abs/2305.18290} \bibitem[Radford et~al.(2019)]{radford2019language} Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. \newblock Language models are unsupervised multitask learners. \newblock \emph{OpenAI blog}, 2019. \newblock \url{https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf} \bibitem[Sakaguchi et~al.(2020)]{sakaguchi2020winogrande} Keisuke Sakaguchi, Ronan Le~Bras, Chandra Bhagavatula, and Yejin Choi. \newblock {WinoGrande}: An adversarial winograd schema challenge at scale. \newblock In \emph{AAAI}, 2020. \newblock \url{https://arxiv.org/abs/1907.10641} \bibitem[Shazeer(2020)]{shazeer2020glu} Noam Shazeer. \newblock {GLU} variants improve transformer. \newblock \emph{arXiv preprint arXiv:2002.05202}, 2020. \newblock \url{https://arxiv.org/abs/2002.05202} \bibitem[Su et~al.(2021)]{su2021roformer} Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. \newblock {RoFormer}: Enhanced transformer with rotary position embedding. \newblock \emph{arXiv preprint arXiv:2104.09864}, 2021. \newblock \url{https://arxiv.org/abs/2104.09864} \bibitem[Touvron et~al.(2023)]{touvron2023llama} Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth{\'e}e Lacroix, Baptiste Rozi{\`e}re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. \newblock {LLaMA}: Open and efficient foundation language models. \newblock \emph{arXiv preprint arXiv:2302.13971}, 2023. \newblock \url{https://arxiv.org/abs/2302.13971} \bibitem[Xu et~al.(2021)]{xu2021gspmd} Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. \newblock {GSPMD}: General and scalable parallelization for {ML} computation graphs. \newblock \emph{arXiv preprint arXiv:2105.04663}, 2021. \newblock \url{https://arxiv.org/abs/2105.04663} \bibitem[Vaswani et~al.(2017)]{vaswani2017attention} Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan~N Gomez, {\L}ukasz Kaiser, and Illia Polosukhin. \newblock Attention is all you need. \newblock In \emph{NeurIPS}, 2017. \newblock \url{https://arxiv.org/abs/1706.03762} \bibitem[Workshop et~al.(2023)]{workshop2023bloom} BigScience Workshop, Teven Le~Scao, Angela Fan, et~al. \newblock {BLOOM}: A 176B-parameter open-access multilingual language model. \newblock \emph{arXiv preprint arXiv:2211.05100}, 2023. \newblock \url{https://arxiv.org/abs/2211.05100} \bibitem[Zellers et~al.(2019)]{zellers2019hellaswag} Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. \newblock {HellaSwag}: Can a machine really finish your sentence? \newblock In \emph{ACL}, 2019. 
\newblock \url{https://arxiv.org/abs/1905.07830} \bibitem[Zhang et~al.(2022)]{zhang2022opt} Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi~Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit~Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. \newblock {OPT}: Open pre-trained transformer language models. \newblock \emph{arXiv preprint arXiv:2205.01068}, 2022. \newblock \url{https://arxiv.org/abs/2205.01068} \bibitem[Zhang and Sennrich(2019)]{zhang2019root} Biao Zhang and Rico Sennrich. \newblock Root mean square layer normalization. \newblock In \emph{NeurIPS}, 2019. \newblock \url{https://arxiv.org/abs/1910.07467} \bibitem[Zhang et~al.(2024)]{zhang2024tinyllama} Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. \newblock {TinyLlama}: An open-source small language model. \newblock \emph{arXiv preprint arXiv:2401.02385}, 2024. \newblock \url{https://arxiv.org/abs/2401.02385} \end{thebibliography} % ============================================================================ % Appendix % ============================================================================ \appendix \section{Full Hyperparameter Tables} \label{app:hyperparams} \begin{table}[h] \centering \caption{Complete pre-training configuration for Julian-600M.} \begin{tabular}{lc} \toprule \textbf{Category} & \textbf{Value} \\ \midrule \multicolumn{2}{l}{\textit{Model}} \\ Parameters & $\sim$600M \\ Hidden dimension & 1280 \\ Layers & 18 \\ Attention heads & 20 \\ Head dimension & 64 \\ FFN dimension & 5120 \\ Activation & SwiGLU (SiLU gate) \\ Normalization & RMSNorm ($\epsilon = 10^{-6}$) \\ Position encoding & RoPE ($\theta = 10{,}000$) \\ Vocabulary & 50{,}000 (SentencePiece BPE) \\ Context length & 2{,}048 \\ Dropout & 0.1 \\ \midrule \multicolumn{2}{l}{\textit{Optimization}} \\ Optimizer & AdamW \\ $\beta_1, \beta_2$ & 0.9, 0.95 \\ $\epsilon$ & $10^{-8}$ \\ Weight decay & 0.1 \\ Peak LR & $1.2 \times 10^{-3}$ \\ Min LR & $1.2 \times 10^{-4}$ \\ LR schedule & Cosine with linear warmup \\ Warmup steps & 3{,}000 \\ Total steps & 300{,}000 \\ Gradient clipping & 1.0 (global norm) \\ Optimizer state precision & bfloat16 \\ \midrule \multicolumn{2}{l}{\textit{Compute}} \\ Hardware & TPU v4-32 (32 chips, 4 hosts) \\ Batch per device & 4 \\ Gradient accumulation & 8 \\ Effective batch size & 1{,}024 \\ Precision & bfloat16 mixed \\ Tokens per step & $\sim$2.1M \\ Total tokens & $\sim$39B \\ Checkpointing & Orbax async, every 10K steps \\ \bottomrule \end{tabular} \end{table} \section{Model Availability} All Julian models are available on the HuggingFace Hub: \begin{table}[h] \centering \begin{tabular}{ll} \toprule \textbf{Model} & \textbf{HuggingFace Repository} \\ \midrule Julian-600M Base & \texttt{JulianKrgd/julian-600m-40b} \\ Julian-600M-10B-Instruct-v0.1 & \texttt{JulianKrgd/julian-600m-10b-instruct-v0.1} \\ Julian-600M SFT-30K & \texttt{JulianKrgd/julian-600m-40b-instruct-sft30k} \\ Julian-600M SFT-100K & \texttt{JulianKrgd/julian-600m-40b-instruct-sft100k} \\ \bottomrule \end{tabular} \end{table} \end{document}