---
license: apache-2.0
tags:
- PEFT
- Mixture-of-Experts
- MoE-LoRA
- Multi-Task-Learning
- Large-Language-Models
- LLaMA
- LLaMA-2
- pytorch
---

<a id="top"></a>
<div align="center">
  <h1>🚀 D<sup>2</sup>MoRA: Diversity-Regulated Asymmetric MoE-LoRA Decomposition for Efficient Multi-Task Adaptation</h1>

  <p>
    <b>Jianhui Zuo</b><sup>1</sup>&nbsp;
    <b>Xuemeng Song</b><sup>2✉</sup>&nbsp;
    <b>Haokun Wen</b><sup>3,4</sup>&nbsp;
    <b>Meng Liu</b><sup>5</sup>&nbsp;
    <b>Yupeng Hu</b><sup>1</sup>&nbsp;
    <b>Jiuru Wang</b><sup>6</sup>&nbsp;
    <b>Liqiang Nie</b><sup>3✉</sup>
  </p>

  <p>
    <sup>1</sup>School of Software, Shandong University<br>
    <sup>2</sup>Department of Computer Science and Engineering, Southern University of Science and Technology<br>
    <sup>3</sup>School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)<br>
    <sup>4</sup>School of Data Science, City University of Hong Kong<br>
    <sup>5</sup>School of Computer and Artificial Intelligence, Shandong Jianzhu University<br>
    <sup>6</sup>School of Computer Science and Engineering, Linyi University
  </p>
</div>

This repository hosts the official pre-trained model weights and configuration files for **D<sup>2</sup>MoRA**, a novel **diversity-regulated asymmetric MoE-LoRA decomposition framework** for **parameter-efficient fine-tuning (PEFT)** of large language models in **multi-task adaptation** scenarios.

🔗 **Paper:** [Accepted by AAAI 2026]  
🔗 **GitHub Repository:** [AAAI26-D2MoRA](https://github.com/iLearn-Lab/AAAI26-D2MoRA)

---

## 📌 Model Information

### 1. Model Name
**D<sup>2</sup>MoRA** (**D**iversity-Regulated Asymmetric **MoE-LoRA** Decomposition) Checkpoints.

### 2. Task Type & Applicable Tasks
- **Task Type:** Parameter-Efficient Fine-Tuning (PEFT) / Low-Rank Adaptation (LoRA) / Mixture-of-Experts (MoE) / Multi-Task Learning
- **Applicable Tasks:** Efficient adaptation of large language models for heterogeneous downstream tasks, especially **multi-task commonsense reasoning** and related language understanding tasks.

### 3. Project Introduction
Low-Rank Adaptation (LoRA) has become a powerful parameter-efficient fine-tuning paradigm for adapting large language models. Recent studies further integrate LoRA with the Mixture-of-Experts (MoE) mechanism to improve multi-task adaptation. However, existing knowledge-sharing paradigms among LoRA experts still suffer from two major limitations:

1. **Constrained Functional Specialization**  
   Existing one-to-many sharing paradigms force all experts to operate in a single shared low-rank subspace, limiting the flexibility of expert-specific transformations.

2. **Induced Expert Homogenization**  
   Sharing a single down-projection matrix across experts may cause different experts to become overly similar, weakening expert diversity and reducing the benefit of MoE specialization.

To address these issues, **D<sup>2</sup>MoRA** introduces a **diversity-regulated asymmetric MoE-LoRA decomposition framework**. Instead of treating each LoRA expert as a fixed `(A, B)` pair, D<sup>2</sup>MoRA decomposes LoRA experts into two independent sets of base experts:

- **Down-projection experts:** A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>M</sub>
- **Up-projection experts:** B<sub>1</sub>, B<sub>2</sub>, ..., B<sub>N</sub>

This design enables a novel **asymmetric many-to-many pairing** mechanism between down-projection and up-projection experts, allowing more flexible cross-expert knowledge sharing while preserving expert specialization. In addition, D<sup>2</sup>MoRA introduces:

- **Sample-Aware Down-Projection Expert Mixture**
- **Low-Rank Embedding-Aware Up-Projection Expert Mixture**
- **Dual Orthogonality Regularization**

to explicitly improve the diversity of both the A-experts and the B-experts and mitigate expert homogenization.
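
For concreteness, the following minimal PyTorch sketch illustrates such an asymmetric many-to-many MoE-LoRA layer. It is **not** the authors' implementation: the class name, the linear gating networks, and the initialization are simplifying assumptions, and the two gates only stand in for the sample-aware and low-rank embedding-aware mixtures listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricMoELoRA(nn.Module):
    """Illustrative sketch of an asymmetric many-to-many MoE-LoRA adapter."""

    def __init__(self, d_in, d_out, r=8, num_a=3, num_b=8, dropout=0.05):
        super().__init__()
        # M independent down-projection experts A_1..A_M (d_in -> r)
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(d_in, r) * 0.01) for _ in range(num_a)]
        )
        # N independent up-projection experts B_1..B_N (r -> d_out),
        # zero-initialized so the adapter starts as a no-op, as in vanilla LoRA
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(r, d_out)) for _ in range(num_b)]
        )
        # Gate over A-experts conditioned on the input representation
        # (simplified stand-in for the sample-aware down-projection mixture)
        self.gate_a = nn.Linear(d_in, num_a)
        # Gate over B-experts conditioned on the low-rank embedding
        # (simplified stand-in for the low-rank embedding-aware up-projection mixture)
        self.gate_b = nn.Linear(r, num_b)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq_len, d_in); the returned delta is added to the output
        # of the frozen base projection in the host attention layer.
        w_a = F.softmax(self.gate_a(x), dim=-1)                 # (b, s, M)
        z = sum(w_a[..., i:i + 1] * (x @ self.A[i])             # (b, s, r)
                for i in range(len(self.A)))
        w_b = F.softmax(self.gate_b(z), dim=-1)                 # (b, s, N)
        delta = sum(w_b[..., j:j + 1] * (z @ self.B[j])         # (b, s, d_out)
                    for j in range(len(self.B)))
        return self.dropout(delta)

# Example: adapt a 4096x4096 projection with {M = 3, N = 8, r = 8}
layer = AsymmetricMoELoRA(d_in=4096, d_out=4096, r=8, num_a=3, num_b=8)
delta = layer(torch.randn(2, 16, 4096))   # low-rank update, shape (2, 16, 4096)
```

Because the M down-projections and N up-projections are mixed independently, every up-projection expert can pair with every down-projection expert, which is the asymmetric many-to-many sharing described above.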

> 💡 **Note:** D<sup>2</sup>MoRA is evaluated in both **multi-task** and **single-task** settings, and consistently demonstrates strong effectiveness and generalization ability.

### 4. Training Data Source
The model was primarily trained and evaluated on the **Commonsense 170K** benchmark, which contains eight public commonsense reasoning datasets:
- **BoolQ**
- **PIQA**
- **SIQA**
- **HellaSwag**
- **WinoGrande**
- **ARC-c**
- **ARC-e**
- **OBQA**

---

## 🚀 Usage & Basic Inference

These weights are designed to be used directly with the official **D<sup>2</sup>MoRA** GitHub repository.

### Step 1: Prepare the Environment
Clone the GitHub repository and install dependencies following the official repository instructions:

```bash
git clone https://github.com/iLearn-Lab/AAAI26-D2MoRA.git
cd AAAI26-D2MoRA
```

Please refer to the official repository for the exact environment setup and dependency installation details.

### Step 2: Download Model Weights & Data

Download the checkpoint files (e.g., `best_model.pth`) from this Hugging Face repository and place them into your local checkpoint directory.

You should also prepare the **Commonsense 170K** benchmark and related processed data according to the official repository instructions.
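
As a convenience, the checkpoint can also be fetched programmatically with `huggingface_hub`. The `repo_id` below is a placeholder for this repository's ID and should be replaced accordingly:

```python
from huggingface_hub import hf_hub_download

# Replace the placeholder repo_id with the ID of this model repository
ckpt_path = hf_hub_download(
    repo_id="<user-or-org>/<this-repo>",   # placeholder
    filename="best_model.pth",
)
print("Checkpoint saved to:", ckpt_path)
```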

### Step 3: Training / Evaluation

D<sup>2</sup>MoRA is built for PEFT-based adaptation of large language models such as **LLaMA-7B** and **LLaMA2-7B**.

In the paper, the method fine-tunes the **Query / Key / Value** projections of self-attention layers. Typical experimental settings include:

- **Backbones:** LLaMA-7B, LLaMA2-7B
- **Adapted modules:** Query / Key / Value projections
- **Orthogonality coefficient:** `λ = 1e-4` (see the sketch after this list)
- **Dropout:** `0.05`
- **Learning rate:** `3e-4`
- **Batch size:** `4` per A100 GPU (40GB)
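
To illustrate how the orthogonality coefficient might enter the training objective, the sketch below penalizes pairwise alignment among the A-experts and, separately, among the B-experts. The exact regularizer used in the paper may differ; the function name and formulation here are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_orthogonality_penalty(a_experts, b_experts, lam=1e-4):
    """Hedged sketch: discourage experts within each set from aligning."""
    def pairwise_alignment(experts):
        # Flatten each expert matrix to a vector and normalize it
        w = torch.stack([e.flatten() for e in experts])
        w = F.normalize(w, dim=-1)
        gram = w @ w.t()                                   # cosine similarities
        off_diag = gram - torch.eye(len(experts), device=w.device)
        return (off_diag ** 2).sum()

    # One penalty per expert set ("dual"), scaled by the reported lambda
    return lam * (pairwise_alignment(a_experts) + pairwise_alignment(b_experts))

# Illustrative usage with the sketch layer above:
# loss = task_loss + dual_orthogonality_penalty(layer.A, layer.B)
```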

Representative D<sup>2</sup>MoRA settings reported in the paper include:

- **LLaMA-7B**
  - `{M = 3, N = 8, r = 8}`
  - `{M = 3, N = 4, r = 16}`

- **LLaMA2-7B**
  - `{M = 3, N = 8, r = 8}`
  - `{M = 4, N = 3, r = 16}`
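
As a rough back-of-envelope illustration of what these settings imply, the snippet below estimates the trainable adapter parameters, assuming only the Q/K/V projections of the 32 LLaMA-7B decoder layers (hidden size 4096) are adapted and ignoring the gating networks:

```python
def adapter_params(d=4096, r=8, M=3, N=8, layers=32, proj_per_layer=3):
    # M down-projection experts (d x r) + N up-projection experts (r x d)
    per_projection = (M + N) * d * r
    return per_projection * proj_per_layer * layers

print(adapter_params())   # ~34.6M parameters, well under 1% of a 7B backbone
```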



Please use the official repository scripts for training and evaluation.


## πŸ“β­οΈ Citation

If you find our work or these model weights useful in your research, please consider leaving a **Star** ⭐️ on our GitHub repo and citing our paper:

```bibtex
@inproceedings{zuo2026d2mora,
  title={D2MoRA: Diversity-Regulated Asymmetric MoE-LoRA Decomposition for Efficient Multi-Task Adaptation},
  author={Zuo, Jianhui and Song, Xuemeng and Wen, Haokun and Liu, Meng and Hu, Yupeng and Wang, Jiuru and Nie, Liqiang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={34},
  pages={29286--29294},
  year={2026}
}
```