How to use amentaphd/eu-regulation-embeddings-snowflake-m-v2 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("amentaphd/eu-regulation-embeddings-snowflake-m-v2", trust_remote_code=True)
sentences = [
"What are the anticipated financial effects that could arise from material risks associated with resource use and circular economy, and how might these risks impact the financial position, performance, and cash flows of an undertaking over different time frames?",
"(a)\n\nanticipated financial effects due to material risks arising from material resource use and circular economy -related impacts and dependencies and how these risks have or could reasonably be expected to have) a material influence on the undertaking’s financial position, financial performance performance, and cash flows over the short-, medium- and long-term; and\n\n(b)\n\nanticipated financial effects due to material opportunities related to resource use and circular economy.\n\nThe disclosure shall include:\n\n(a)",
"combination of hydrocarbons obtained as a raffinate from a sulphuric acid treating process. It consists of hydrocarbons having carbon numbers predominantly in the range of C7 through C12 and boiling in the range of approximately 90 °C to 230 °C.) 649-351-00-7 265-115-2 64742-15-0 P Naphtha (petroleum), chemically neutralised heavy; Low boiling point naphtha — unspecified (A complex combination of hydrocarbons produced by a treating process to remove acidic materials. It consists of hydrocarbons having carbon numbers predominantly in the range of C6 through C12 and boiling in the range of approximately 65 °C to 230 °C.) 649-352-00-2 265-122-0 64742-22-9 P Naphtha (petroleum), chemically neutralised light; Low boiling point naphtha —",
"2. Member States shall require any investment firm wishing to establish a branch within the territory of another Member State or to use tied agents established in another Member State in which it has not established a branch, first to notify the competent authority of its home Member State and to provide it with the following information:\n\n(a) the Member States within the territory of which it plans to establish a branch or the Member States in which it has not established a branch but plans to use tied agents established there;\n\n(b) a programme of operations setting out, inter alia, the investment services and/or activities as well as the ancillary services to be offered;\n\n(c) where established, the organisational structure of the branch and indicating whether the branch intends to use tied agents and the identity of those tied agents;\n\n(d) where tied agents are to be used in a Member State in which an investment firm has not established a branch, a description of the intended use of the tied agent(s) and an organisational structure, including reporting lines, indicating how the agent(s) fit into the corporate structure of the investment firm;\n\n(e) the address in the host Member State from which documents may be obtained;\n\n(f) the names of those responsible for the management of the branch or of the tied agent.\n\nWhere an investment firm uses a tied agent established in a Member State outside its home Member State, such tied agent shall be assimilated to the branch, where one is established, and shall in any event be subject to the provisions of this Directive relating to branches."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m-v2.0. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: GteModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
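The architecture above ends with CLS pooling (`pooling_mode_cls_token: True`) followed by `Normalize()`. As a rough illustration of what those two stages do, the sketch below takes the first token vector as the sentence embedding and scales it to unit length, so that a plain dot product equals cosine similarity. The toy token vectors and the helper names (`cls_pool`, `l2_normalize`, `dot`) are assumptions made for this example and are not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Use the first ([CLS]) token vector as the sentence embedding."""
    return token_embeddings[0]

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical token embeddings: one row per token, 4 dimensions instead of 768.
tokens_a = [[3.0, 4.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0], [1.0, 0.0, 0.0, 0.0]]
tokens_b = [[4.0, 3.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# After normalization the dot product is the cosine similarity.
cosine = dot(emb_a, emb_b)
print(round(cosine, 4))  # 0.96
```

Only the first row of each token matrix influences the result, which is the practical difference between CLS pooling and mean pooling.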
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("amentaphd/eu-regulation-embeddings-snowflake-m-v2", trust_remote_code=True)
# Run inference
sentences = [
'How does the text suggest addressing the social aspects related to low- and middle-income transport users in the context of zero-emission vehicle initiatives?',
'(b)\n\nmeasures intended to accelerate the uptake of zero-emission vehicles or to provide financial support for the deployment of fully interoperable refuelling and recharging infrastructure for zero-emission vehicles, or measures to encourage a shift to public transport and improve multimodality, or to provide financial support in order to address social aspects concerning low- and middle-income transport users;\n\n(c)\n\nto finance their Social Climate Plan in accordance with Article 15 of Regulation (EU) 2023/955;\n\n(d)',
'If the planned change is implemented notwithstanding the first and second subparagraphs, or if an unplanned change has taken place pursuant to which the AIFM’s management of the AIF no longer complies with this Directive or the AIFM otherwise no longer complies with this Directive, the competent authorities of the Member State of reference of the AIFM shall take all due measures in accordance with Article 46, including, if necessary, the express prohibition of marketing of the AIF.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
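Because the model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128 and 64, the leading components of each embedding should remain usable on their own. A minimal sketch of that idea, assuming the embeddings are available as plain Python lists: keep the first `dim` values, re-normalize, and rank passages by cosine similarity. The helper names and the 4-dimensional toy vectors are hypothetical. Recent sentence-transformers releases also accept a `truncate_dim` argument when loading a model; it is not used here so that the sketch runs without downloading anything.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def rank_by_cosine(query, passages, dim):
    """Return passage indices ordered from most to least similar to the query."""
    q = truncate_and_normalize(query, dim)
    scored = []
    for idx, passage in enumerate(passages):
        p = truncate_and_normalize(passage, dim)
        scored.append((sum(a * b for a, b in zip(q, p)), idx))
    return [idx for _, idx in sorted(scored, key=lambda pair: -pair[0])]

# Toy stand-ins for model.encode(...) output (4 dimensions instead of 768).
query = [1.0, 0.0, 0.5, 0.5]
passages = [
    [0.0, 1.0, 0.5, 0.5],
    [0.9, 0.1, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

full_ranking = rank_by_cosine(query, passages, dim=4)
short_ranking = rank_by_cosine(query, passages, dim=2)
print(full_ranking, short_ranking)
```

Smaller dimensions trade some retrieval quality for less storage and faster search; how much quality is lost for this model is not reported in this card.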
Evaluated with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.7059 |
| cosine_accuracy@3 | 0.9068 |
| cosine_accuracy@5 | 0.9448 |
| cosine_accuracy@10 | 0.9731 |
| cosine_precision@1 | 0.7059 |
| cosine_precision@3 | 0.3023 |
| cosine_precision@5 | 0.189 |
| cosine_precision@10 | 0.0973 |
| cosine_recall@1 | 0.7059 |
| cosine_recall@3 | 0.9068 |
| cosine_recall@5 | 0.9448 |
| cosine_recall@10 | 0.9731 |
| cosine_ndcg@10 | 0.8513 |
| cosine_mrr@10 | 0.8109 |
| cosine_map@100 | 0.8123 |
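In the table above, accuracy@k and recall@k coincide and precision@k is close to accuracy@k divided by k (for example 0.9068 / 3 ≈ 0.3023). That pattern is what one would expect if each query has exactly one relevant passage, although the card does not state this. The sketch below shows how these metrics can be computed under that assumption; the function names and the toy rankings are illustrative and are not the evaluator's own code.

```python
def accuracy_at_k(ranked, relevant, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical runs: (ranked passage ids, set of relevant ids), one relevant id each.
runs = [
    (["p1", "p7", "p3"], {"p1"}),  # relevant passage at rank 1
    (["p4", "p2", "p9"], {"p2"}),  # relevant passage at rank 2
    (["p5", "p6", "p8"], {"p0"}),  # relevant passage not retrieved
]

def mean(metric, k):
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)

acc3 = mean(accuracy_at_k, 3)
prec3 = mean(precision_at_k, 3)
rec3 = mean(recall_at_k, 3)
mrr3 = mean(mrr_at_k, 3)
print(acc3, prec3, rec3, mrr3)
```

With a single relevant passage per query, `prec3` equals `acc3 / 3` and `rec3` equals `acc3`, mirroring the relationships visible in the reported numbers.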
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
Samples:

| sentence_0 | sentence_1 |
|---|---|
| What is the maximum allowable reduction in excise duty for mixtures used as motor fuels containing biodiesel in Italy until 30 June 2004? | for waste oils which are reused as fuel, either directly after recovery or following a recycling process for waste oils, and where the reuse is subject to duty. |
| What are the minimum indicative share percentages for the years 2023 to 2030, and how do these percentages relate to the interconnectivity levels of the Member States? | Such indicative shares may, in each year, amount to at least 5 % from 2023 to 2026 and at least 10 % from 2027 to 2030, or, where lower, to the level of interconnectivity of the Member State concerned in any given year. |
| What is the significance of the one-month period mentioned in the context? | one month after its notification, in accordance with the arrangements provided for in Article 23. |
MatryoshkaLoss with these parameters:
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
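To make the parameters above concrete, here is a simplified sketch of how a Matryoshka-style objective can wrap an in-batch ranking loss: the inner loss is evaluated on embeddings truncated to each listed dimension, and the results are combined using the listed weights. This is not the library's implementation. The `scale=20.0` value, the re-normalization after truncation, and all function names are assumptions made for illustration, and the toy vectors use 4 dimensions instead of 768. With `n_dims_per_step` set to -1, every listed dimension is assumed to contribute at every step.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def in_batch_ranking_loss(queries, passages, scale=20.0):
    """Cross-entropy where passages[i] is the positive for queries[i]
    and every other passage in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * sum(a * b for a, b in zip(q, p)) for p in passages]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def matryoshka_loss(queries, passages, dims, weights):
    """Weighted sum of the inner loss over truncated, re-normalized embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        q = [normalize(v[:dim]) for v in queries]
        p = [normalize(v[:dim]) for v in passages]
        total += weight * in_batch_ranking_loss(q, p)
    return total

# Toy batch: query i matches passage i.
queries = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
passages = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]

aligned = matryoshka_loss(queries, passages, dims=[4, 2], weights=[1, 1])
swapped = matryoshka_loss(queries, passages[::-1], dims=[4, 2], weights=[1, 1])
print(aligned < swapped)  # True
```

Because every prefix length is trained to rank the positive passage first, shorter prefixes of the final embedding are encouraged to stay useful for retrieval.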
Non-default hyperparameters:

- eval_strategy: steps
- num_train_epochs: 4
- fp16: True
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin

Training logs:

| Epoch | Step | Training Loss | cosine_ndcg@10 |
|---|---|---|---|
| 0.0863 | 500 | 0.225 | - |
| 0.1726 | 1000 | 0.1337 | - |
| 0.2589 | 1500 | 0.1195 | - |
| 0.3452 | 2000 | 0.0803 | - |
| 0.4316 | 2500 | 0.0775 | - |
| 0.5179 | 3000 | 0.0714 | - |
| 0.6042 | 3500 | 0.0852 | - |
| 0.6905 | 4000 | 0.0718 | - |
| 0.7768 | 4500 | 0.0499 | - |
| 0.8631 | 5000 | 0.0665 | 0.8371 |
| 0.9494 | 5500 | 0.0674 | - |
| 1.0 | 5793 | - | 0.8416 |
| 1.0357 | 6000 | 0.0538 | - |
| 1.1220 | 6500 | 0.0606 | - |
| 1.2084 | 7000 | 0.0294 | - |
| 1.2947 | 7500 | 0.0129 | - |
| 1.3810 | 8000 | 0.0101 | - |
| 1.4673 | 8500 | 0.0072 | - |
| 1.5536 | 9000 | 0.0211 | - |
| 1.6399 | 9500 | 0.0133 | - |
| 1.7262 | 10000 | 0.0063 | 0.8513 |
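The Epoch and Step columns in the log above are tied together by the number of optimizer steps per epoch, which the row at epoch 1.0 gives as 5793. The short sketch below reproduces the logged epoch fractions from that value and derives an approximate training-set size from the batch size of 8. It assumes a single device (`gradient_accumulation_steps` is 1 in the listed hyperparameters), and the linear decay function assumes the schedule spans all 4 configured epochs with no warmup. The constant and function names are chosen for this example.

```python
STEPS_PER_EPOCH = 5793   # step logged at epoch 1.0
BATCH_SIZE = 8           # per_device_train_batch_size
NUM_EPOCHS = 4           # num_train_epochs
BASE_LR = 5e-05          # learning_rate

def epoch_at(step):
    """Fractional epoch reached after `step` optimizer steps."""
    return step / STEPS_PER_EPOCH

def linear_lr(step):
    """Linearly decayed learning rate with no warmup (assumed schedule shape)."""
    total_steps = NUM_EPOCHS * STEPS_PER_EPOCH
    return BASE_LR * max(0.0, 1.0 - step / total_steps)

print(round(epoch_at(500), 4))       # 0.0863
print(round(epoch_at(10000), 4))     # 1.7262
print(STEPS_PER_EPOCH * BATCH_SIZE)  # 46344
```

Since `dataloader_drop_last` is False, the final batch of an epoch may be partial, so 46344 is an upper bound on the number of training pairs rather than an exact count.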
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
Snowflake/snowflake-arctic-embed-m-v2.0