ModernBERT Embed Base Legal Fine-tuned

This is a sentence-transformers model fine-tuned from nomic-ai/modernbert-embed-base on the legal-rag-positives-synthetic dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
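The Pooling and Normalize modules above can be sketched in plain NumPy (a conceptual sketch of the two stages, not the library's internals): token embeddings are mean-pooled over non-padding positions, then scaled to unit L2 norm so a dot product between sentence embeddings equals their cosine similarity.

```python
import numpy as np

# Toy stand-in for ModernBERT token outputs: 5 tokens, 768 dims each.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 768))
attention_mask = np.array([1, 1, 1, 1, 0])  # last position is padding

# (1) Pooling (pooling_mode_mean_tokens): average over non-padding tokens.
mask = attention_mask[:, None]
sentence_embedding = (token_embeddings * mask).sum(axis=0) / mask.sum()

# (2) Normalize: unit L2 norm, so dot product == cosine similarity.
sentence_embedding /= np.linalg.norm(sentence_embedding)

print(sentence_embedding.shape)            # (768,)
print(np.linalg.norm(sentence_embedding))  # ~1.0
```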

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("aaa961/modernbert-embed-base-legal-no_MRL_symmetricMNRL_3sets")
# Run inference
sentences = [
    'What sections of the document are referenced in the location Supplement 2, AR?',
    'the Polaris Solicitations as currently drafted do not comply with Section 3306(c)(3).  In its request \nto apply Section 3306(c)(3) to the Polaris Solicitations, GSA stated that \n \n \n  \nSupplement 2, AR at 2907–08.  Because GSA adopted an overly broad understanding of Section \n3306(c)(3)’s scope, GSA stated the Solicitations will include a “full range of order types,”',
    '“Based on this misunderstanding, the CIA attorney incorrectly cited some of the justifications for \nredacting the material to the DOJ attorney, who in turn shared that information with plaintiff.”  \nId. ¶ 9. \nE. \nProcedural History \nThe plaintiff filed the Complaints in each of these three actions on February 28, 2011,',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.3908, 0.0520],
#         [0.3908, 1.0000, 0.0703],
#         [0.0520, 0.0703, 1.0000]])
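Because the model's final Normalize stage produces unit-length vectors, `model.similarity` with cosine similarity reduces to a matrix product of the embedding matrix with its transpose. A self-contained NumPy sketch using random stand-in vectors (real ones come from `model.encode`):

```python
import numpy as np

# Stand-in for model.encode output: 3 vectors, L2-normalized like the model's.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(3, 768))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cosine similarity of unit vectors is just the dot product.
similarities = embeddings @ embeddings.T

print(similarities.shape)                       # (3, 3)
print(np.allclose(np.diag(similarities), 1.0))  # True: each vector vs. itself
```

The diagonal is 1.0 by construction, matching the similarity matrix printed above.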

Evaluation

Metrics

Information Retrieval

| Metric              | ir_eval_test | ir_eval_eval |
|:--------------------|-------------:|-------------:|
| cosine_accuracy@1   | 0.3431       | 0.5796       |
| cosine_accuracy@3   | 0.5131       | 0.7527       |
| cosine_accuracy@5   | 0.575        | 0.8176       |
| cosine_accuracy@10  | 0.6785       | 0.864        |
| cosine_precision@1  | 0.3431       | 0.5796       |
| cosine_precision@3  | 0.171        | 0.2509       |
| cosine_precision@5  | 0.115        | 0.1635       |
| cosine_precision@10 | 0.0679       | 0.0864       |
| cosine_recall@1     | 0.3431       | 0.5796       |
| cosine_recall@3     | 0.5131       | 0.7527       |
| cosine_recall@5     | 0.575        | 0.8176       |
| cosine_recall@10    | 0.6785       | 0.864        |
| cosine_ndcg@10      | 0.5028       | 0.7246       |
| cosine_mrr@10       | 0.4477       | 0.6795       |
| cosine_map@100      | 0.4565       | 0.6842       |
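With exactly one relevant passage per query (which is why accuracy@k and recall@k coincide in the table above), each metric reduces to a simple function of the rank at which the positive is retrieved. A sketch of those per-query formulas:

```python
import math

def single_positive_metrics(rank, k=10):
    """IR metrics for a query whose single relevant doc sits at `rank` (1-based)."""
    hit = rank <= k
    return {
        "accuracy@k": 1.0 if hit else 0.0,
        "mrr@k": 1.0 / rank if hit else 0.0,
        # Ideal DCG is 1 (the one relevant doc at rank 1), so NDCG = DCG.
        "ndcg@k": 1.0 / math.log2(rank + 1) if hit else 0.0,
    }

print(single_positive_metrics(1))  # {'accuracy@k': 1.0, 'mrr@k': 1.0, 'ndcg@k': 1.0}
print(single_positive_metrics(3))  # mrr ≈ 0.333, ndcg = 1/log2(4) = 0.5
```

The reported numbers are these per-query values averaged over the evaluation set.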

Training Details

Training Dataset

legal-rag-positives-synthetic

  • Dataset: legal-rag-positives-synthetic at f11534a
  • Size: 5,175 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    |             | anchor | positive |
    |:------------|:-------|:---------|
    | type        | string | string   |
    | min tokens  | 8      | 44       |
    | mean tokens | 16.61  | 96.79    |
    | max tokens  | 34     | 157      |
  • Samples:
    1. anchor: Where is the similar statement to the one about business judgment and scoring merit found?
       positive: is with each bidder itself, and its own business judgment in forming a team and what score it thinks is enough to merit an award.”) (emphasis in original); VCH MJAR at 21–22 (same); Oral Ar. Tr. at 10:5–7 (“[C]ompetition involves . . . some sort of tradeoff between offerors, some sort of evaluation of how offerors are against one another, and that’s not the case here. The case here is
    2. anchor: Who do lawyers generally employ as assistants in their practice?
       positive: abide by the Rules of Professional Conduct. See rule 4-5.2(a). RULE 4-5.3. RESPONSIBILITIES REGARDING NONLAWYER ASSISTANTS (a) – (c) [No Change] Comment Lawyers generally employ assistants in their practice, including secretaries, investigators, law student interns, and paraprofessionals such as paralegals and legal assistants. Such
    3. anchor: Which court case is cited with a page number of 1327?
       positive: 30; VCH MJAR at 28–30 (same). As noted, this Court applies the same interpretive rules to analyze both statutes and federal regulations. See Boeing, 983 F.3d at 1327 (citing Mass. Mut. Life Ins. Co., 782 F.3d at 1365); see also supra Discussion Section I. It is a “fundamental canon of statutory construction that the words
  • Loss: CachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "mini_batch_size": 32,
        "gather_across_devices": false,
        "directions": [
            "query_to_doc",
            "doc_to_query"
        ],
        "partition_mode": "per_direction",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
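Conceptually, this loss treats each in-batch positive as the correct class for its anchor among all batch positives, applying cross-entropy over cosine similarities multiplied by `scale`; with both `query_to_doc` and `doc_to_query` directions, the same loss is also computed on the transposed score matrix. A NumPy sketch of that symmetric objective (not the cached implementation, which additionally chunks the batch into `mini_batch_size` pieces to save memory):

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """Symmetric multiple-negatives ranking loss over one batch (NumPy sketch)."""
    # Unit-normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch); diagonal holds the true pairs

    def cross_entropy(logits):
        # Softmax cross-entropy with the diagonal as the target class.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the query_to_doc and doc_to_query directions.
    return (cross_entropy(scores) + cross_entropy(scores.T)) / 2

rng = np.random.default_rng(0)
print(mnrl_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16))))
```

When anchors and positives coincide, the diagonal dominates and the loss approaches zero; mismatched pairs drive it up, which is what pushes paired query/passage embeddings together during training.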
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 4
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • gradient_accumulation_steps: 16
  • bf16: True
  • tf32: True
  • eval_strategy: epoch
  • per_device_eval_batch_size: 16
  • load_best_model_at_end: True

All Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 4
  • max_steps: -1
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 16
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: epoch
  • per_device_eval_batch_size: 16
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

| Epoch      | Step   | Training Loss | ir_eval_test_cosine_ndcg@10 | ir_eval_eval_cosine_ndcg@10 |
|:-----------|-------:|--------------:|----------------------------:|----------------------------:|
| -1         | -1     | -             | 0.5028                      | -                           |
| 0.9877     | 10     | 0.9180        | -                           | -                           |
| 1.0        | 11     | -             | -                           | 0.6723                      |
| 1.8889     | 20     | 0.4042        | -                           | -                           |
| 2.0        | 22     | -             | -                           | 0.7082                      |
| 2.7901     | 30     | 0.2940        | -                           | -                           |
| 3.0        | 33     | -             | -                           | 0.7236                      |
| 3.6914     | 40     | 0.2646        | -                           | -                           |
| **4.0**    | **44** | -             | -                           | **0.7246**                  |

  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.3.0
  • Transformers: 5.3.0
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.13.0
  • Datasets: 4.8.2
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}