---
license: mit
datasets:
- teknium/OpenHermes-2.5
- HuggingFaceTB/smollm-corpus
---

# Model Card for LRC-1.5B-Base

LRC-1.5B-Base is a Small Language Model (SLM) with approximately 1.5 billion parameters. It is the base pre-trained version, developed using the **Low-Rank Clone (LRC)** method, before any Supervised Fine-Tuning (SFT). LRC is an efficient knowledge distillation technique designed to construct SLMs that aim for behavioral equivalence with larger, more powerful teacher models. This model was distilled from **Llama-3.2-3B-Instruct**.

The LRC approach trains a set of low-rank projection matrices that compress teacher weights into student weights (enabling soft pruning), together with an "activation clone" mechanism that aligns student activations (including FFN signals) with those of the teacher. LRC-1.5B-Base was trained on **10 billion tokens**, demonstrating significant training efficiency compared to models trained on trillions of tokens.

## Uses

### Direct Use

LRC-1.5B-Base is a base pre-trained model. While it has not undergone Supervised Fine-Tuning (SFT) for instruction following or chat, it was distilled from an instruction-tuned teacher (Llama-3.2-3B-Instruct) and trained on data that includes OpenHermes (synthetic assistant dialogues). Consequently, it may exhibit some nascent instruction-following or conversational capabilities.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('JitaiHao/LRC-1.5B-Base')
model = AutoModelForCausalLM.from_pretrained('JitaiHao/LRC-1.5B-Base')

# Example: text generation (output quality will depend on the base model's capabilities)
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
# Note: add generation parameters as needed (e.g., max_length, num_beams)
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

LRC-1.5B-Base was pre-trained as part of the LRC-1.5B development (which then underwent SFT). The pre-training phase for LRC-1.5B used **10 billion tokens**. The dataset, referred to as "Mixed-1.1" in Table 1 and detailed in Table 10 of the paper, consists of:

* **Fineweb-Edu:** 10B tokens (high-quality educational content, filtered subset)
* **OpenHermes 2.5:** 450M tokens (synthetic data for generalist assistants)

This data was used for distillation from the Llama-3.2-3B-Instruct teacher model.

### Training Procedure

LRC-1.5B-Base was trained using the Low-Rank Clone (LRC) method. Key aspects:

* **Distillation Method:** Low-rank projection of teacher weights and Activation Clone (aligning the student's internal activations, including FFNs, with the teacher's via an MSE loss).
* **Overall Loss:** $\mathcal{L} = \mathcal{L}_\mathrm{KL} + \mathcal{L}_\mathrm{LM} + \alpha\mathcal{L}_\mathrm{clone}$ (KL divergence on output logits, next-token prediction loss, and activation cloning loss); a minimal sketch of this combined objective is given after this list.
* **Teacher Model:** Llama-3.2-3B-Instruct
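For illustration only, the sketch below shows how such a combined objective could be computed in PyTorch. It is not the authors' training code: the function name `lrc_loss`, the tensor shapes, and the assumption that teacher activations have already been matched to the student's hidden size are all hypothetical; the loss weight α = 0.2 and KL temperature 40 are taken from the hyperparameters listed below.

```python
# Hypothetical sketch of the combined LRC training objective (not the authors' code).
# Assumes student/teacher logits of shape (batch, seq, vocab) and lists of matched
# intermediate activations that already share the student's hidden size.
import torch
import torch.nn.functional as F


def lrc_loss(student_logits, teacher_logits, labels,
             student_acts, teacher_acts, alpha=0.2, temperature=40.0):
    # Next-token prediction (language modeling) loss
    lm_loss = F.cross_entropy(
        student_logits[:, :-1].reshape(-1, student_logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # KL divergence between temperature-scaled student and teacher output distributions
    kl_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )

    # Activation clone loss: MSE between matched student and teacher activations
    clone_loss = sum(F.mse_loss(s, t) for s, t in zip(student_acts, teacher_acts))
    clone_loss = clone_loss / len(student_acts)

    # Total objective: L = L_KL + L_LM + alpha * L_clone
    return kl_loss + lm_loss + alpha * clone_loss
```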
#### Training Hyperparameters (for the LRC-1.5B pre-training, of which LRC-1.5B-Base is the result)

* **Total Training Tokens:** **10B**
* **Student Hidden Size:** 1,536
* **Sequence Length:** 2,048
* **Batch Size (tokens):** 49,152
* **Clone Loss Weight (α):** 0.2
* **Learning Rate (Pre-train):** 1.0 × 10⁻⁴
* **LR Scheduler:** Linear decay with a warmup ratio of 0.005
* **Optimizer:** Adam (β₁ = 0.9, β₂ = 0.999)
* **Temperature for $\mathcal{L}_\mathrm{KL}$ (KL divergence loss):** 40
* **RMSNorm ε:** 1.0 × 10⁻⁵
* **Hardware:** 8 × NVIDIA H800 GPUs
* **Training Time (pre-training):** approximately 34 hours (as per Table 8 for LRC-1.5B)

## Evaluation

**Zero-shot** performance of **LRC-1.5B-Base** (pre-SFT base model) on general downstream tasks:

| Benchmark | Metric | Score |
| :--------- | :------------- | :----- |
| ARC-E | Accuracy | 73.40 |
| ARC-C | Accuracy Norm | 42.15 |
| LogiQA | Accuracy Norm | 31.03 |
| CSQA | Accuracy | 64.46 |
| PIQA | Accuracy | 71.60 |
| WinoG | Accuracy | 61.88 |
| BoolQ | Accuracy | 73.27 |
| SciQ | Accuracy | 94.40 |
| MMLU | Accuracy | 50.09 |
| **Avg.** | | **62.48** |

Its SFT version, LRC-1.5B (trained on **10B tokens**), achieves an average of 63.48% on these tasks (Table 13).

Below is a comparison of the **SFT version (LRC-1.5B)** with other publicly available SFT models under 2B parameters (from Table 1 of the paper):

| Model | # Tokens | ARC-E | ARC-C | LogiQA | CSQA | PIQA | WinoG | BoolQ | SciQ | MMLU | **Avg.** |
| :---------------- | :-------- | :---- | :---- | :----- | :---- | :---- | :---- | :---- | :---- | :---- | :------- |
| InternLM2-1.8B | 2T | 71.04 | 42.06 | 28.42 | 70.11 | 74.27 | 63.77 | 75.50 | 94.50 | 43.75 | 62.60 |
| **LRC-1.7B-SFT** | **20B** | **74.62** | **44.20** | **30.88** | **70.19** | **73.07** | **63.30** | **79.82** | **93.80** | **54.93** | **64.98** |
| Qwen3-1.7B | 36T | 72.47 | 43.00 | 28.42 | 64.78 | 72.20 | 61.48 | 77.65 | 93.10 | 55.44 | 63.17 |
| SmolLM2-1.7B | 11T | 69.11 | 43.52 | 28.88 | 51.19 | 76.01 | 68.98 | 68.47 | 89.80 | 48.50 | 60.50 |
| **LRC-1.5B-SFT** | **10B** | **74.75** | **44.97** | **30.72** | **65.77** | **73.07** | **62.25** | **75.78** | **94.60** | **49.42** | **63.48** |
| MiniCPM-1.2B | 1T | 70.16 | 39.68 | 30.88 | 64.29 | 74.65 | 60.77 | 67.58 | 91.50 | 44.23 | 60.42 |

## Technical Specifications

### Model Architecture and Objective

* **Architecture:** Transformer-based decoder-only model, adhering to the Llama architecture (a configuration sketch is given after this list).
  * Number of Layers: 28
  * Hidden Size: 1,536
  * FFN Intermediate Size: 8,192
  * Attention Q Heads: 24
  * Attention KV Heads: 8
  * Head Dimension: 128
  * Vocabulary Size: 128,256
  * Word Embeddings: Tied
* **Objective:** The model is trained via knowledge distillation. The primary objective is next-token prediction (language modeling loss, $\mathcal{L}_\mathrm{LM}$). This is augmented by:
  * A KL divergence loss ($\mathcal{L}_\mathrm{KL}$) between the student's and teacher's output logits.
  * An "Activation Clone" loss ($\mathcal{L}_\mathrm{clone}$) using Mean Squared Error (MSE) to align the student's intermediate hidden states (the attention inputs q, k, v and the FFN inputs gate, up) and output activations (from the attention and FFN modules after projection by the student's output weights) with those of the teacher model. The teacher's weights are compressed into student weights using trainable low-rank projection matrices.
  * The total training objective is $\mathcal{L} = \mathcal{L}_\mathrm{KL} + \mathcal{L}_\mathrm{LM} + \alpha\mathcal{L}_\mathrm{clone}$.
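As a rough illustration of the architecture table above, the sketch below expresses the listed dimensions as a Hugging Face `LlamaConfig`. This is not an official config file: the mapping of the table entries to config fields is an assumption, any field not listed is left at its library default, and the explicit `head_dim` argument requires a recent `transformers` release that supports it.

```python
# Hypothetical sketch of the LRC-1.5B-Base shape as a LlamaConfig (not an official
# config); values are taken from the architecture and hyperparameter tables above.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    num_hidden_layers=28,      # Number of Layers
    hidden_size=1536,          # Hidden Size
    intermediate_size=8192,    # FFN Intermediate Size
    num_attention_heads=24,    # Attention Q Heads
    num_key_value_heads=8,     # Attention KV Heads (grouped-query attention)
    head_dim=128,              # Head Dimension (needs a transformers version with head_dim support)
    vocab_size=128256,         # Vocabulary Size
    tie_word_embeddings=True,  # Word Embeddings: Tied
    rms_norm_eps=1e-5,         # RMSNorm ε from the hyperparameters above
)

# Instantiating from this config yields a randomly initialized model with the same
# shape; load the released checkpoint (JitaiHao/LRC-1.5B-Base) for actual weights.
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```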