---
license: mit
base_model:
- jhu-clsp/mmBERT-small
---

# mmBERT-L4H384 / mmBERT-L7H384 / mmBERT-L13H384

Pruned variants of [mmBERT-small](/jhu-clsp/mmBERT-small).

## Models

- [mmBERT-L4H384-pruned](https://huggingface.co/hotchpotch/mmBERT-L4H384-pruned)
- [mmBERT-L7H384-pruned](https://huggingface.co/hotchpotch/mmBERT-L7H384-pruned)
- [mmBERT-L13H384-pruned](https://huggingface.co/hotchpotch/mmBERT-L13H384-pruned)

### ⚠️ Note: Pruning-Only (Not Distilled)

These are **pruning-only** variants—we simply remove layers without any knowledge distillation or fine-tuning. Fully trained or distilled models with the same architecture may outperform these pruned versions.

## Overview

These models are created by **layer pruning** from [mmBERT-small](/jhu-clsp/mmBERT-small) (22 layers, 384 hidden dimensions). We select specific layers to retain while preserving the ModernBERT global/local attention cadence.

### Layer Selection and Evaluation

We fine-tuned the pruned models for information retrieval on the MS MARCO dataset and evaluated them on nanoBEIR (NDCG@10).
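
For reference, NDCG@10 can be computed from a ranked list of graded relevance judgments. This is the standard definition as a minimal sketch, not the exact nanoBEIR evaluation harness:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking.

    `relevances` lists the graded relevance of each retrieved document,
    in the order the model ranked them.
    """
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; pushing the relevant document down lowers it.
print(ndcg_at_k([1, 0, 0]))  # 1.0
print(ndcg_at_k([0, 1, 0]))  # ≈ 0.631
```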

The numbers in model names (e.g., `0_1_2_18`) indicate **which layers are retained** from the original 22-layer model:

- **L4H384 (0_1_2_18)**: Keeps layers 0, 1, 2, and 18 → 4 layers total
- **L7H384 (0_1_2_3_4_5_18)**: Keeps layers 0–5 and 18 → 7 layers total
- **L13H384 (0_1_2_3_4_5_6_7_8_9_10_11_18)**: Keeps layers 0–11 and 18 → 13 layers total
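
The pruning step itself amounts to selecting a subset of the encoder's layer stack. A minimal sketch (our illustration, not the release script; a real run would also update `config.num_hidden_layers` before saving):

```python
import torch.nn as nn

def prune_layers(layers: nn.ModuleList, keep: list[int]) -> nn.ModuleList:
    """Build a new ModuleList holding only the retained layers, in order."""
    return nn.ModuleList(layers[i] for i in keep)

# Dummy stand-ins for mmBERT-small's 22 encoder layers (384 hidden dims).
encoder = nn.ModuleList(nn.Linear(384, 384) for _ in range(22))
pruned = prune_layers(encoder, [0, 1, 2, 18])  # the L4H384 configuration

assert len(pruned) == 4
assert pruned[3] is encoder[18]  # layers are reused, not copied
```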

### Why These Configurations?

We chose these "official" configurations based on two criteria:

1. **Simplicity**: Consecutive layer indices (0, 1, 2, 3, ...) are easier to understand and reproduce than scattered indices like `0_1_2_3_6_8_18`.

2. **Competitive performance**: While not always the absolute best score, these configurations perform competitively within their layer count category.

For example, `L7H384-0_1_2_3_6_8_18` (mean: 0.4722) slightly outperforms our official pick `L7H384-0_1_2_3_4_5_18` (mean: 0.4693), but the consecutive layer pattern is more interpretable and the performance difference is marginal.

### Why Layer 18?

ModernBERT uses an alternating attention pattern:

- **Global attention (g)**: Full self-attention across all tokens
- **Local attention (l)**: Attention within a sliding window

The pattern follows a `g-l-l-g-l-l-...` rhythm. In the original 22-layer mmBERT-small, both layer 18 and layer 21 are global attention layers, with layer 21 being the final layer.
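
The cadence can be derived from the config. A small sketch, assuming ModernBERT's default `global_attn_every_n_layers = 3` (so layers 0, 3, 6, ... are global):

```python
def attention_pattern(num_layers: int, global_every: int = 3) -> str:
    """'g' for global, 'l' for local: one global layer every `global_every` layers."""
    return "".join("g" if i % global_every == 0 else "l" for i in range(num_layers))

full = attention_pattern(22)
global_layers = [i for i, a in enumerate(full) if a == "g"]
print(global_layers)  # [0, 3, 6, 9, 12, 15, 18, 21]

# Keeping layers 0, 1, 2, 18 yields g-l-l-g, preserving the original cadence.
kept = "".join(full[i] for i in [0, 1, 2, 18])
print(kept)  # gllg
```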

However, our experiments showed that **ending with layer 18 outperforms ending with layer 21** at the 7- and 13-layer depths, and the two are essentially tied at 4 layers:

- `L7H384-0_1_2_3_4_5_18` (mean: 0.4693) vs `L7H384-0_1_2_3_4_5_21` (mean: 0.4629)
- `L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18` (mean: 0.4964) vs `L13H384-0_1_2_3_4_5_6_7_8_9_10_11_21` (mean: 0.4800)
- `L4H384-0_1_2_18` (mean: 0.4530) vs `L4H384-0_1_2_21` (mean: 0.4558), the one case where layer 21 is marginally ahead

This suggests that the representations at layer 18 combine more effectively with early layers for retrieval, possibly because layer 18 strikes a better balance between abstraction and retention of fine-grained information than the final layer does.

### Experimental Variations

We explored different pruning strategies by shifting the start positions and coverage:

- **Front-heavy** (e.g., `0_1_2_3_4_5_18`): Retains early layers, skips middle layers
- **Back-heavy** (e.g., `0_16_17_18_19_20_21`): Retains later layers
- **Distributed** (e.g., `0_1_2_3_4_5_6_7_8_10_12_15_18`): Spreads retained layers across depth

This probes the trade-off between **depth** (how many layers) and **coverage** (which parts of the network contribute).
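
The front- and back-heavy families can be generated mechanically. The helper names below are ours, purely to illustrate how the variant names are constructed:

```python
def front_heavy(depth: int, tail: int = 18) -> list[int]:
    """Consecutive early layers plus one late global layer."""
    return list(range(depth - 1)) + [tail]

def back_heavy(depth: int, total_layers: int = 22) -> list[int]:
    """Layer 0 plus the last depth-1 layers of the original stack."""
    return [0] + list(range(total_layers - depth + 1, total_layers))

print(front_heavy(7))  # [0, 1, 2, 3, 4, 5, 18]
print(back_heavy(7))   # [0, 16, 17, 18, 19, 20, 21]
```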

## Scores (NDCG@10) — All L4/L7/L13 Runs

| model | mean | NanoArguAna | NanoClimateFEVER | NanoDBPedia | NanoFEVER | NanoFiQA2018 | NanoHotpotQA | NanoMSMARCO | NanoNFCorpus | NanoNQ | NanoQuoraRetrieval | NanoSCIDOCS | NanoSciFact | NanoTouche2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mmBERT-small (22 layers) | 0.5151 | 0.4345 | 0.2888 | 0.4548 | 0.7534 | 0.4199 | 0.6629 | 0.5853 | 0.2849 | 0.5634 | 0.9367 | 0.2704 | 0.5042 | 0.5378 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_12 | 0.4553 | 0.3908 | 0.2715 | 0.4385 | 0.7290 | 0.3289 | 0.6191 | 0.4702 | 0.2178 | 0.4649 | 0.9198 | 0.2152 | 0.4402 | 0.4129 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_15 | 0.4576 | 0.4395 | 0.2457 | 0.4284 | 0.7472 | 0.3237 | 0.5920 | 0.4918 | 0.2199 | 0.4531 | 0.9195 | 0.1852 | 0.4820 | 0.4208 |
| **L13H384-0_1_2_3_4_5_6_7_8_9_10_11_18** | **0.4964** | 0.4462 | 0.2955 | 0.4907 | 0.7564 | 0.3886 | 0.6469 | 0.5142 | 0.2644 | 0.5268 | 0.9412 | 0.2326 | 0.4840 | 0.4662 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_11_21 | 0.4800 | 0.4162 | 0.2858 | 0.4695 | 0.7197 | 0.3358 | 0.6338 | 0.5512 | 0.2603 | 0.5127 | 0.9305 | 0.2389 | 0.4457 | 0.4393 |
| L13H384-0_1_2_3_4_5_6_7_8_9_10_12_18 | 0.4904 | 0.4594 | 0.2619 | 0.4904 | 0.7481 | 0.3832 | 0.6552 | 0.5476 | 0.2540 | 0.5092 | 0.9183 | 0.2411 | 0.4518 | 0.4551 |
| L13H384-0_1_2_3_4_5_6_7_8_9_12_15_18 | 0.4791 | 0.4401 | 0.2754 | 0.4849 | 0.7384 | 0.3201 | 0.6369 | 0.5059 | 0.2478 | 0.5237 | 0.9190 | 0.2602 | 0.4666 | 0.4099 |
| L13H384-0_1_2_3_4_5_6_7_8_10_12_15_18 | 0.4877 | 0.4384 | 0.2749 | 0.4937 | 0.7299 | 0.3366 | 0.6698 | 0.5314 | 0.2588 | 0.5073 | 0.9264 | 0.2430 | 0.4879 | 0.4414 |
| L13H384-0_1_2_3_4_5_6_7_9_12_15_18_21 | 0.4810 | 0.4007 | 0.2739 | 0.4989 | 0.7180 | 0.3403 | 0.6441 | 0.5257 | 0.2541 | 0.5093 | 0.9187 | 0.2413 | 0.4905 | 0.4371 |
| L13H384-0_10_11_12_13_14_15_16_17_18_19_20_21 | 0.4806 | 0.3938 | 0.2855 | 0.4911 | 0.7974 | 0.3504 | 0.6034 | 0.5211 | 0.2361 | 0.4486 | 0.9144 | 0.2257 | 0.4508 | 0.5294 |
| L13H384-9_10_11_12_13_14_15_16_17_18_19_20_21 | 0.4307 | 0.3901 | 0.2621 | 0.4753 | 0.7185 | 0.2927 | 0.5371 | 0.4487 | 0.2361 | 0.3267 | 0.8605 | 0.1513 | 0.3860 | 0.5143 |
| L7H384-0_1_2_3_4_5_6 | 0.4291 | 0.3635 | 0.2839 | 0.4665 | 0.6299 | 0.2958 | 0.5433 | 0.4692 | 0.1841 | 0.4174 | 0.8800 | 0.2217 | 0.4570 | 0.3660 |
| L7H384-0_1_2_3_4_5_9 | 0.4282 | 0.3929 | 0.2719 | 0.4447 | 0.6674 | 0.2890 | 0.5192 | 0.4847 | 0.2226 | 0.3850 | 0.8870 | 0.2074 | 0.4145 | 0.3804 |
| L7H384-0_1_2_3_4_5_12 | 0.4204 | 0.4035 | 0.2501 | 0.4283 | 0.6245 | 0.3044 | 0.5350 | 0.4518 | 0.1900 | 0.3760 | 0.8763 | 0.2073 | 0.4438 | 0.3748 |
| **L7H384-0_1_2_3_4_5_18** | **0.4693** | 0.3879 | 0.2782 | 0.5046 | 0.7257 | 0.3631 | 0.6139 | 0.4633 | 0.2353 | 0.4623 | 0.8951 | 0.2310 | 0.5111 | 0.4296 |
| L7H384-0_1_2_3_4_5_21 | 0.4629 | 0.4331 | 0.2731 | 0.4958 | 0.7667 | 0.3368 | 0.5943 | 0.4194 | 0.2666 | 0.4428 | 0.8742 | 0.2542 | 0.4220 | 0.4386 |
| L7H384-0_1_2_3_6_7_8 | 0.4236 | 0.3903 | 0.2590 | 0.4613 | 0.6097 | 0.2692 | 0.5962 | 0.4556 | 0.1790 | 0.3755 | 0.8501 | 0.2157 | 0.4596 | 0.3850 |
| L7H384-0_1_2_3_6_7_12 | 0.4149 | 0.3752 | 0.2369 | 0.4489 | 0.5763 | 0.2798 | 0.5630 | 0.4600 | 0.1955 | 0.3881 | 0.8458 | 0.2303 | 0.4260 | 0.3671 |
| L7H384-0_1_2_3_6_8_12 | 0.4171 | 0.3215 | 0.2305 | 0.4491 | 0.5696 | 0.2803 | 0.5615 | 0.4959 | 0.1897 | 0.3790 | 0.8756 | 0.2313 | 0.4600 | 0.3787 |
| L7H384-0_1_2_3_6_8_18 | 0.4722 | 0.3988 | 0.2619 | 0.5002 | 0.7551 | 0.3186 | 0.6438 | 0.5024 | 0.2429 | 0.4259 | 0.8969 | 0.2162 | 0.5054 | 0.4704 |
| L7H384-0_16_17_18_19_20_21 | 0.4589 | 0.3684 | 0.2711 | 0.4949 | 0.7224 | 0.3087 | 0.5750 | 0.4676 | 0.2317 | 0.4541 | 0.8829 | 0.2050 | 0.4668 | 0.5171 |
| L7H384-15_16_17_18_19_20_21 | 0.4299 | 0.3728 | 0.2747 | 0.4572 | 0.6557 | 0.2529 | 0.5594 | 0.4474 | 0.2197 | 0.3528 | 0.8883 | 0.1887 | 0.4160 | 0.5034 |
| L4H384-0_1_2_3 | 0.3329 | 0.2011 | 0.1529 | 0.4820 | 0.3088 | 0.1937 | 0.4178 | 0.3890 | 0.1897 | 0.3238 | 0.8441 | 0.2045 | 0.2912 | 0.3286 |
| **L4H384-0_1_2_18** | **0.4530** | 0.3806 | 0.2544 | 0.4657 | 0.7230 | 0.2793 | 0.5704 | 0.5060 | 0.2270 | 0.4283 | 0.8942 | 0.2246 | 0.4671 | 0.4682 |
| L4H384-0_1_2_21 | 0.4558 | 0.3801 | 0.2553 | 0.4871 | 0.7350 | 0.3097 | 0.5734 | 0.4899 | 0.2510 | 0.4193 | 0.8860 | 0.2249 | 0.4620 | 0.4517 |
| L4H384-0_19_20_21 | 0.4408 | 0.3888 | 0.2651 | 0.4880 | 0.6629 | 0.3018 | 0.6010 | 0.4224 | 0.2342 | 0.4086 | 0.8714 | 0.2027 | 0.4238 | 0.4597 |
| L4H384-18_19_20_21 | 0.4130 | 0.3067 | 0.2546 | 0.4740 | 0.6206 | 0.2363 | 0.5393 | 0.4074 | 0.2233 | 0.2879 | 0.8850 | 0.2015 | 0.4270 | 0.5058 |

**Bold** rows indicate the official picks for each layer count.

## Key Findings

1. **Front-heavy pruning works best**: Retaining early layers (0–N) plus a global attention layer consistently outperforms other strategies.

2. **Layer 18 > Layer 21**: At 7 and 13 layers, ending with layer 18 (global attention) outperforms ending with layer 21 (the final global attention layer); at 4 layers the two are essentially tied. This suggests that intermediate global attention layers provide better representations for retrieval when combined with early layers.

3. **Early layers are critical**: Models that skip early layers (e.g., `9_10_11_...` or `15_16_17_...`) show significant performance degradation.

4. **Diminishing returns with depth**: L13 (0.4964) vs L7 (0.4693) gains only about 2.7 NDCG@10 points (~6% relative) for nearly double the layers.

## License

MIT