Sentence Similarity
sentence-transformers
Safetensors
bert
feature-extraction
Generated from Trainer
dataset_size:46338
loss:MatryoshkaLoss
loss:MultipleNegativesRankingLoss
Eval Results (legacy)
text-embeddings-inference
Instructions to use fjavigv/snoweu_v4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use fjavigv/snoweu_v4 with sentence-transformers:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("fjavigv/snoweu_v4")

sentences = [
    "What are the chemical names and corresponding identifiers for octabromo derivate and 2-Methoxyethanol, including their CAS numbers and EC numbers?",
    "octabromo derivate 602-094-00-4 251-087-9 32536-52-0 2-Methoxyethanol; ethylene glycol monomethyl ether; methylglycol 603-011-00-4 203-713-7 109-86-4 2-Ethoxyethanol; ethylene glycol monoethyl ether; ethylglycol 603-012-00-X 203-804-1 110-80-5 [▼M61](./../../../legal-content/EN/AUTO/?uri=celex:32020R2096 \"32020R2096: INSERTED\") Ethylene oxide; oxirane 603-023-00-X 200-849-9 75-21-8 [▼C1](./../../../legal-content/EN/AUTO/?uri=celex:32006R1907R%2801%29 \"32006R1907R(01): REPLACED\") 1,2-Dimethoxyethane ethylene glycol dimethyl ether EGDME 603-031-00-3 203-794-9 110-71-4 [▼M45](./../../../legal-content/EN/AUTO/?uri=celex:32017R1510 \"32017R1510: INSERTED\") Tetrahydro-2-furyl-methanol; tetrahydrofurfuryl alcohol 603-061-00-7 202-625-6 97-99-4",
    "hydrocarbons produced as the residual fraction from the distillation of heavy coker gas oil and vacuum gas oil. It predominantly consists of hydrocarbons having carbon numbers predominantly greater than C13 and boiling above approximately 230 °C.) 649-026-00-X 270-796-4 68478-17-1 Residues (petroleum), heavy coker and light vacuum; Heavy fuel oil (A complex combination of hydrocarbons produced as the residual fraction from the distillation of heavy coker gas oil and light vacuum gas oil. It consists predominantly of hydrocarbons having carbon numbers predominantly greater than C13 and boiling above approximately 230 °C.) 649-027-00-5 270-983-0 68512-61-8 Residues (petroleum), light vacuum; Heavy fuel oil (A complex residuum from the vacuum distillation of the residuum from the atmospheric distillation of crude oil. It consists of hydrocarbons having carbon numbers predominantly greater than C13 and boiling above approximately 230 °C.) 649-028-00-0 270-984-6 68512-62-9 Residues (petroleum), steam-cracked light; Heavy fuel oil (A complex residuum from the distillation of the products from a steam-cracking process. It consists predominantly of aromatic and unsaturated hydrocarbons having carbon numbers greater than C7 and boiling in the range of approximately 101 to 555 °C.) 649-029-00-6 271-013-9 68513-69-9 Fuel oil, No 6; Heavy fuel oil (A distillate oil having a minimum viscosity of 197 10-6 m2s-1 at 37,7 °C to a maximum of 197 10-5 m2s-1 at 37,7 °C.) 649-030-00-1 271-384-7 68553-00-4 Residues (petroleum), topping plant, low-sulfur; Heavy fuel oil (A low-sulfur complex combination of hydrocarbons produced as the residual fraction from the topping plant distillation of crude oil. It is the residuum after the straight-run gasoline cut, kerosene cut and gas oil cut have been removed.) 649-031-00-7 271-763-7 68607-30-7 Gas oils (petroleum), heavy atmospheric; Heavy fuel oil (A complex combination of hydrocarbons obtained by the distillation of crude oil. It consists of hydrocarbons having carbon numbers predominantly in the range of C7 through C35 and boiling in the range of approximately 121 to 510 °C.) 649-032-00-2 272-184-2 68783-08-4 Residues (petroleum), coker scrubber, Condensed-ring-arom.-contg.; Heavy fuel",
    "(e)\n\nwhere applicable, how the undertaking assesses the effectiveness of its engagement with its own workforce, including, where relevant, any agreements or outcomes that result.\n\nWhere applicable, the undertaking shall disclose the steps it takes to gain insight into the perspectives of people in its own workforce who may be particularly vulnerable to impacts and/or marginalised (for example, women, migrants, people with disabilities).\n\nIf the undertaking cannot disclose the above required information because it has not adopted a general process to engage with its own workforce , it shall disclose this to be the case. It may disclose a timeframe in which it aims to have such a process in place.",
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

- Notebooks
- Google Colab
- Kaggle
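The tags above list MatryoshkaLoss, which trains embeddings so that a leading slice of each vector remains usable on its own. As a minimal sketch of that idea, using only the standard library and toy 4-dimensional vectors rather than real model outputs, the snippet below truncates each vector, re-normalizes it, and builds the same kind of square similarity matrix that `model.similarity` returns. The helper names and the truncation size of 2 are illustrative assumptions; the dimensions this model was actually trained with are not stated here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def truncate_embedding(vec, dim):
    """Keep the first `dim` components, then re-normalize."""
    return l2_normalize(vec[:dim])


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for a list of vectors."""
    normed = [l2_normalize(e) for e in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


# Toy vectors standing in for model outputs (hypothetical values).
toy_embeddings = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.2, 0.1, 0.4],
    [0.1, 0.9, 0.5, 0.3],
]

truncated = [truncate_embedding(e, 2) for e in toy_embeddings]
similarities = cosine_similarity_matrix(truncated)
print(len(similarities), len(similarities[0]))
```

Truncation trades some retrieval quality for smaller vectors; how much is lost depends on the dimensions used during training.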
Upload 12 files
- 1_Pooling/config.json +10 -0
- README.md +871 -3
- config.json +26 -0
- config_sentence_transformers.json +12 -0
- eval/Information-Retrieval_evaluation_results.csv +5 -0
- model.safetensors +3 -0
- modules.json +20 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +37 -0
- tokenizer.json +0 -0
- tokenizer_config.json +63 -0
- vocab.txt +0 -0
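The README added in this commit reports cosine accuracy@k and precision@k under `model-index`. The sketch below, with hypothetical query and document ids, shows how the two metrics can be computed and why precision@k equals accuracy@k divided by k when each query has exactly one relevant passage. The reported values are consistent with that assumption (0.8972898325565337 / 3 ≈ 0.2990966108521779), though the evaluation setup itself is not described in this section.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k slots filled by relevant documents."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / k


def mean_over_queries(metric, rankings, relevant, k):
    """Average a per-query metric over all queries."""
    scores = [metric(rankings[q], relevant[q], k) for q in rankings]
    return sum(scores) / len(scores)


# Hypothetical rankings: one relevant passage per query.
rankings = {
    "q1": ["d1", "d7", "d3"],
    "q2": ["d4", "d2", "d9"],
    "q3": ["d5", "d6", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}

acc_at_3 = mean_over_queries(accuracy_at_k, rankings, relevant, 3)
prec_at_3 = mean_over_queries(precision_at_k, rankings, relevant, 3)
print(acc_at_3, prec_at_3)

# Values copied from the README's model-index section.
reported_acc_at_3 = 0.8972898325565337
reported_prec_at_3 = 0.2990966108521779
print(abs(reported_acc_at_3 / 3 - reported_prec_at_3) < 1e-12)
```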
1_Pooling/config.json
ADDED
@@ -0,0 +1,10 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": true,
+  "pooling_mode_mean_tokens": false,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
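The pooling config above enables only `pooling_mode_cls_token`, so the sentence embedding is taken from the first ([CLS]) token instead of being averaged over all tokens. A minimal standard-library sketch of the difference follows, using toy 2-dimensional token vectors instead of the 768-dimensional ones the config declares. The function names are illustrative and are not the sentence-transformers API.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                totals[i] += value
    return [t / count for t in totals]


# Toy sequence: [CLS], two word tokens, one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

cls_vector = cls_pooling(tokens)
mean_vector = mean_pooling(tokens, mask)
print(cls_vector, mean_vector)
```

With CLS pooling the attention mask plays no role in the pooling step, since only position 0 is read.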
README.md
CHANGED
@@ -1,3 +1,871 @@
----
-
----
+---
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- generated_from_trainer
+- dataset_size:46338
+- loss:MatryoshkaLoss
+- loss:MultipleNegativesRankingLoss
+base_model: Snowflake/snowflake-arctic-embed-m-v1.5
+widget:
+- source_sentence: What are the chemical names and corresponding identifiers for octabromo
+    derivate and 2-Methoxyethanol, including their CAS numbers and EC numbers?
+  sentences:
+  - 'octabromo derivate 602-094-00-4 251-087-9 32536-52-0 2-Methoxyethanol; ethylene
+    glycol monomethyl ether; methylglycol 603-011-00-4 203-713-7 109-86-4 2-Ethoxyethanol;
+    ethylene glycol monoethyl ether; ethylglycol 603-012-00-X 203-804-1 110-80-5 [▼M61](./../../../legal-content/EN/AUTO/?uri=celex:32020R2096
+    "32020R2096: INSERTED") Ethylene oxide; oxirane 603-023-00-X 200-849-9 75-21-8
+    [▼C1](./../../../legal-content/EN/AUTO/?uri=celex:32006R1907R%2801%29 "32006R1907R(01):
+    REPLACED") 1,2-Dimethoxyethane ethylene glycol dimethyl ether EGDME 603-031-00-3
+    203-794-9 110-71-4 [▼M45](./../../../legal-content/EN/AUTO/?uri=celex:32017R1510
+    "32017R1510: INSERTED") Tetrahydro-2-furyl-methanol; tetrahydrofurfuryl alcohol
+    603-061-00-7 202-625-6 97-99-4'
+  - hydrocarbons produced as the residual fraction from the distillation of heavy
+    coker gas oil and vacuum gas oil. It predominantly consists of hydrocarbons having
+    carbon numbers predominantly greater than C13 and boiling above approximately
+    230 °C.) 649-026-00-X 270-796-4 68478-17-1 Residues (petroleum), heavy coker and
+    light vacuum; Heavy fuel oil (A complex combination of hydrocarbons produced as
+    the residual fraction from the distillation of heavy coker gas oil and light vacuum
+    gas oil. It consists predominantly of hydrocarbons having carbon numbers predominantly
+    greater than C13 and boiling above approximately 230 °C.) 649-027-00-5 270-983-0
+    68512-61-8 Residues (petroleum), light vacuum; Heavy fuel oil (A complex residuum
+    from the vacuum distillation of the residuum from the atmospheric distillation
+    of crude oil. It consists of hydrocarbons having carbon numbers predominantly
+    greater than C13 and boiling above approximately 230 °C.) 649-028-00-0 270-984-6
+    68512-62-9 Residues (petroleum), steam-cracked light; Heavy fuel oil (A complex
+    residuum from the distillation of the products from a steam-cracking process.
+    It consists predominantly of aromatic and unsaturated hydrocarbons having carbon
+    numbers greater than C7 and boiling in the range of approximately 101 to 555 °C.)
+    649-029-00-6 271-013-9 68513-69-9 Fuel oil, No 6; Heavy fuel oil (A distillate
+    oil having a minimum viscosity of 197 10-6 m2s-1 at 37,7 °C to a maximum of 197
+    10-5 m2s-1 at 37,7 °C.) 649-030-00-1 271-384-7 68553-00-4 Residues (petroleum),
+    topping plant, low-sulfur; Heavy fuel oil (A low-sulfur complex combination of
+    hydrocarbons produced as the residual fraction from the topping plant distillation
+    of crude oil. It is the residuum after the straight-run gasoline cut, kerosene
+    cut and gas oil cut have been removed.) 649-031-00-7 271-763-7 68607-30-7 Gas
+    oils (petroleum), heavy atmospheric; Heavy fuel oil (A complex combination of
+    hydrocarbons obtained by the distillation of crude oil. It consists of hydrocarbons
+    having carbon numbers predominantly in the range of C7 through C35 and boiling
+    in the range of approximately 121 to 510 °C.) 649-032-00-2 272-184-2 68783-08-4
+    Residues (petroleum), coker scrubber, Condensed-ring-arom.-contg.; Heavy fuel
+  - '(e)
+
+
+    where applicable, how the undertaking assesses the effectiveness of its engagement
+    with its own workforce, including, where relevant, any agreements or outcomes
+    that result.
+
+
+    Where applicable, the undertaking shall disclose the steps it takes to gain insight
+    into the perspectives of people in its own workforce who may be particularly vulnerable
+    to impacts and/or marginalised (for example, women, migrants, people with disabilities).
+
+
+    If the undertaking cannot disclose the above required information because it has
+    not adopted a general process to engage with its own workforce , it shall disclose
+    this to be the case. It may disclose a timeframe in which it aims to have such
+    a process in place.'
+- source_sentence: Under what circumstances can the suspension or removal of a financial
+    instrument or derivative from trading be exempted, despite infringing Articles
+    7 and 17 of Regulation (EU) No 596/2014?
+  sentences:
+  - '(15) Directive 2010/75/EU of the European Parliament and of the Council of 24
+    November 2010 on industrial emissions (integrated pollution prevention and control)
+    (recast) (OJ L 334, 17.12.2010, p. 17).
+
+
+    (16) Directive 2011/92/EU of the European Parliament and of the Council of 13
+    December 2011 on the assessment of the effects of certain public and private projects
+    on the environment (OJ L 26, 28.1.2012, p. 1).
+
+
+    (17) Directive 2012/18/EU of the European Parliament and of the Council of 4 July
+    2012 on the control of major-accident hazards involving dangerous substances,
+    amending and subsequently repealing Council Directive 96/82/EC (OJ L 197, 24.7.2012,
+    p. 1).'
+  - '3.
+
+
+    Where the competent authority of the host Member State of a regulated market,
+    an MTF or OTF has clear and demonstrable grounds for believing that such regulated
+    market, MTF or OTF infringes the obligations arising from the provisions adopted
+    pursuant to this Directive, it shall refer those findings to the competent authority
+    of the home Member State of the regulated market or the MTF or OTF.'
+  - The notified competent authorities of the other Member States shall require that
+    regulated markets, other MTFs, other OTFs and systematic internalisers, which
+    fall under their jurisdiction and trade the same financial instrument or derivatives
+    referred to in points (4) to (10) of Section C of Annex I that relate or are referenced
+    to that financial instrument, also suspend or remove that financial instrument
+    or derivatives from trading, where the suspension or removal is due to suspected
+    market abuse, a take-over bid or the non- disclosure of inside information about
+    the issuer or financial instrument infringing Articles 7 and 17 of Regulation
+    (EU) No 596/2014 except where such suspension or removal could cause significant
+    damage to the
+- source_sentence: How can the limitation period for the Commission's powers be interrupted
+    according to Article 38?
+  sentences:
+  - '2.
+
+
+    That third-country dialogue shall not prevent the Commission from taking action
+    under this Regulation. Individual measures adopted pursuant to this Regulation
+    shall not be addressed within that dialogue.
+
+
+    Article 38
+
+
+    Limitation periods
+
+
+    1.
+
+
+    The powers of the Commission under Articles 10 and 11 shall be subject to a limitation
+    period of 10 years, starting on the day on which a foreign subsidy is granted
+    to an undertaking. Any action taken by the Commission under Article 10, 13, 14
+    or 15 with respect to a foreign subsidy shall interrupt the limitation period.
+    After each interruption, the limitation period of 10 years shall start to run
+    afresh.
+
+
+    2.'
+  - (36) Member States should promote energy efficient means of mobility, including
+    in their public procurement practices, such as rail, cycling, walking or shared
+    mobility, by renewing and decarbonising fleets, encouraging a modal shift and
+    including those modes in urban mobility planning.
+  - air oxidation of petrolatum.) 649-255-00-5 265-206-7 64743-01-7 N Petrolatum (petroleum),
+    alumina-treated; Petrolatum (A complex combination of hydrocarbons obtained when
+    petrolatum is treated with Al2O3 to remove polar components and impurities. It
+    consists predominantly of saturated, crystalline, and liquid hydrocarbons having
+    carbon numbers predominantly greater than C25.) 649-256-00-0 285-098-5 85029-74-9
+    N Petrolatum (petroleum), hydrotreated; Petrolatum (A complex combination of hydrocarbons
+    obtained as a semi-solid from dewaxed paraffinic residual oil treated with hydrogen
+    in the presence of a catalyst. It consists predominantly of saturated, microcrystalline,
+    and liquid hydrocarbons having carbon numbers predominantly greater than
+- source_sentence: What specific sections and points of Annex VIII are included in
+    the registration for high-risk AI systems in the areas of law enforcement, migration,
+    asylum, and border control management?
+  sentences:
+  - '▼M15
+
+
+    Article 18b
+
+
+    Assistance from the Commission, EMSA and other relevant organisations
+
+
+    1.
+
+
+    For the purposes of carrying out its obligations under Article 3c(4) and Articles
+    3g, 3gd, 3ge, 3gf, 3gg and 18a, the Commission, the administering Member State
+    and administering authorities in respect of a shipping company may request the
+    assistance of EMSA or another relevant organisation and may conclude to that effect
+    any appropriate agreements with those organisations.
+
+
+    2.
+
+
+    The Commission, assisted by EMSA, shall endeavour to develop appropriate tools
+    and guidance to facilitate and coordinate verification and enforcement activities
+    related to the application of this Directive to maritime transport. As far as
+    practicable, such guidance and tools shall be made available to the Member States
+    and the verifiers for information-sharing purposes and in order to better ensure
+    robust enforcement of the national measures transposing this Directive.
+
+
+    ▼B
+
+
+    Article 19
+
+
+    Registries
+
+
+    ▼M4
+
+
+    1.
+
+
+    Allowances issued from 1 January 2012 onwards shall be held in the ►M9 Union ◄
+    registry for the execution of processes pertaining to the maintenance of the holding
+    accounts opened in the Member State and the allocation, surrender and cancellation
+    of allowances under the Commission ►M9 Acts ◄ referred to in paragraph 3.
+
+
+    Each Member State shall be able to fulfil the execution of authorised operations
+    under the UNFCCC or the Kyoto Protocol.
+
+
+    ▼B
+
+
+    2.
+
+
+    Any person may hold allowances. The registry shall be accessible to the public
+    and shall contain separate accounts to record the allowances held by each person
+    to whom and from whom allowances are issued or transferred.
+
+
+    ▼M9
+
+
+    3.'
+  - '(35)
+
+
+    ‘recycled carbon fuels’ means liquid and gaseous fuels that are produced from
+    liquid or solid waste streams of non-renewable origin which are not suitable for
+    material recovery in accordance with Article 4 of Directive 2008/98/EC, or from
+    waste processing gas and exhaust gas of non-renewable origin which are produced
+    as an unavoidable and unintentional consequence of the production process in industrial
+    installations;
+
+
+    ▼M2
+
+
+    (36)
+
+
+    ‘renewable fuels of non-biological origin’ means liquid and gaseous fuels the
+    energy content of which is derived from renewable sources other than biomass;
+
+
+    ▼B
+
+
+    (37)'
+  - '4. For high-risk AI systems referred to in points 1, 6 and 7 of Annex III, in
+    the areas of law enforcement, migration, asylum and border control management,
+    the registration referred to in paragraphs 1, 2 and 3 of this Article shall be
+    in a secure non-public section of the EU database referred to in Article 71 and
+    shall include only the following information, as applicable, referred to in:
+
+
+    (a) Section A, points 1 to 10, of Annex VIII, with the exception of points 6,
+    8 and 9; (b) Section B, points 1 to 5, and points 8 and 9 of Annex VIII; --- ---
+    (c) Section C, points 1 to 3, of Annex VIII; --- --- (d) points 1, 2, 3 and 5,
+    of Annex IX. --- ---'
+- source_sentence: The document outlines various chemical substances classified as
+    carcinogenic or toxic for reproduction, detailing their respective categories
+    and regulatory dates. Specific compounds such as diarsenic trioxide, lead chromate,
+    and chromium trioxide are highlighted, indicating their potential health risks
+    and the timeline for their regulation.
+  sentences:
+  - '57(f) – human health) (a) 21 August 2013 (*) (b) By way of derogation from point
+    (a): 14 June 2023 for uses in mixtures containing DIBP at or above 0,1 % and below
+    0,3 % weight by weight. (a) 21 February 2015 (**) (b) By way of derogation from
+    point (a): 14 December 2024 for uses in mixtures containing DIBP at or above 0,1
+    % and below 0,3 % weight by weight. - [▼M15](./../../../legal-content/EN/AUTO/?uri=celex:32012R0125
+    "32012R0125: INSERTED") 8. Diarsenic trioxide EC No: 215-481-4 CAS No: 1327-53-3
+    Carcinogenic (category 1A) 21 November 2013 21 May 2015 — 9. Diarsenic pentaoxide
+    EC No: 215-116-9 CAS No: 1303-28-2 Carcinogenic (category 1A) 21 November 2013
+    21 May 2015 — 10. Lead chromate EC No: 231-846-0 CAS No: 7758-97-6 Carcinogenic
+    (category 1B) Toxic for reproduction (category 1A) 21 November 2013 ►M43 (*1)
+    ◄ 21 May 2015 ►M43 (*2) ◄ — 11. Lead sulfochromate yellow (C.I. Pigment Yellow
+    34) EC No: 215-693-7 CAS No: 1344-37-2 Carcinogenic (category 1B) Toxic for reproduction
+    (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43 (*2) ◄ — 12. Lead
+    chromate molybdate sulphate red (C.I. Pigment Red 104) EC No: 235-759-9 CAS No:
+    12656-85-8 Carcinogenic (category 1B) Toxic for reproduction (category 1A) 21
+    November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43 (*2) ◄ 13. Tris (2-chloroethyl) phosphate
+    (TCEP) EC No: 204-118-5 CAS No: 115-96-8 Toxic for reproduction (category 1B)
+    21 February 2014 21 August 2015 14. 2,4-Dinitrotoluene (2,4-DNT) EC No: 204-450-0
+    CAS No: 121-14-2 Carcinogenic (category 1B) 21 February 2014 ►M43 (*1) ◄ 21 August
+    2015 ►M43 (*2) ◄ [▼M22](./../../../legal-content/EN/AUTO/?uri=celex:32013R0348
+    "32013R0348: INSERTED") 15. Trichloroethylene EC No: 201-167-4 CAS No: 79-01-6
+    Carcinogenic (category 1B) 21 October 2014 ►M43 (*1) ◄ 21 April 2016 ►M43 (*2)
+    ◄ — 16. Chromium trioxide EC No: 215-607-8 CAS No: 1333-82-0 Carcinogenic (category
+    1A) Mutagenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2)
+    ◄ — 17. Acids generated from chromium trioxide and their oligomers Group containing:
+    Chromic acid EC No: 231-801-5 CAS No: 7738-94-5 Dichromic acid EC No: 236-881-5
+    CAS No: 13530-68-2 Oligomers of chromic acid and dichromic acid EC No: not yet
+    assigned CAS No: not yet assigned Carcinogenic (category 1B) 21 March 2016 ►M43
+    (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 18. Sodium dichromate EC No: 234-190-3
+    CAS No: 7789-12-0 10588-01-9 Carcinogenic (category 1B) Mutagenic (category 1B)
+    Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017
+    ►M43 (*2) ◄ — 19. Potassium dichromate EC No: 231-906-6 CAS No: 7778-50-9 Carcinogenic
+    (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21
+    March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 20. Ammonium dichromate
+    EC No: 232-143-1 CAS No: 7789-09-5 Carcinogenic (category 1B) Mutagenic (category
+    1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September
+    2017 ►M43 (*2) ◄ 21. Potassium chromate EC No: 232-140-5 CAS No: 7789-00-6 Carcinogenic
+    (category 1B) Mutagenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017
+    ►M43 (*2) ◄ 22. Sodium chromate EC No: 231-889-5 CAS No: 7775-11-3 Carcinogenic
+    (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21
+    March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ [▼M28](./../../../legal-content/EN/AUTO/?uri=celex:32014R0895
+    "32014R0895: INSERTED") 23. Formaldehyde, oligomeric reaction products with aniline
+    (technical MDA) EC No: 500-036-1 CAS No: 25214-70-4 Carcinogenic (category 1B)
+    22 February 2016 ►M43 (*1) ◄ 22 August 2017 ►M43 (*2) ◄ — 24. Arsenic acid EC
+    No: 231-901-9 CAS No: 7778-39-4 Carcinogenic (category 1A) 22 February 2016 22
+    August 2017 — 25. Bis(2-methoxyethyl) ether (diglyme) EC No: 203-924-4 CAS No:
+    111-96-6 Toxic for reproduction (category 1B) 22 February 2016 ►M43 (*1) ◄ 22
+    August 2017 ►M43 (*2) ◄ — 26. 1,2-dichloroethane (EDC) EC No: 203-458-1 CAS No:
+    107-06-2 Carcinogenic (category 1B) 22 May 2016 22 November 2017 — 27. 2,2′-dichloro-4,4′-methylenedianiline
+    (MOCA) EC No: 202-918-9 CAS No: 101-14-4 Carcinogenic (category 1B) 22 May 2016
+    ►M43 (*1) ◄ 22 November 2017 ►M43 (*2) ◄ — 28. Dichromium tris(chromate) EC No:
+    246-356-2 CAS No: 24613-89-6 Carcinogenic (category 1B) 22 July 2017 ►M43 (*1)
+    ◄ 22 January 2019 ►M43 (*2) ◄ — 29. Strontium chromate EC No: 232-142-6 CAS No:
+    7789-06-2 Carcinogenic (category 1B) 22 July 2017 ►M43 (*1) ◄ 22 January 2019
+    ►M43 (*2) ◄ — 30. Potassium hydroxyoctaoxodizincatedichromate EC'
+  - '(c)
+
+
+    the financial soundness of the proposed acquirer, in particular in relation to
+    the type of business pursued and envisaged in the investment firm in which the
+    acquisition is proposed;
+
+
+    (d)
+
+
+    whether the investment firm will be able to comply and continue to comply with
+    the prudential requirements based on this Directive and, where applicable, other
+    Directives, in particular Directives 2002/87/EC and 2013/36/EU, in particular,
+    whether the group of which it will become a part has a structure that makes it
+    possible to exercise effective supervision, effectively exchange information among
+    the competent authorities and determine the allocation of responsibilities among
+    the competent authorities;
+
+
+    (e)'
+  - No administrative costs or fees related to the implementation of financing and
+    investment operations under the EU guarantee shall be due to the implementing
+    partner by the Commission unless the nature of the policy objectives targeted
+    by the financial product to be implemented and the affordability for the targeted
+    final recipients or the type of financing provided allow the implementing partner
+    to duly justify to the Commission the need for an exception. The coverage of such
+    costs by the Union budget shall be limited to the amount strictly required to
+    implement the relevant financing and investment operations, and shall be provided
+    only to the extent to which the costs are not covered by revenues received by
+    the implementing partners from
| 349 |
+
pipeline_tag: sentence-similarity
|
| 350 |
+
library_name: sentence-transformers
|
| 351 |
+
metrics:
|
| 352 |
+
- cosine_accuracy@1
|
| 353 |
+
- cosine_accuracy@3
|
| 354 |
+
- cosine_accuracy@5
|
| 355 |
+
- cosine_accuracy@10
|
| 356 |
+
- cosine_precision@1
|
| 357 |
+
- cosine_precision@3
|
| 358 |
+
- cosine_precision@5
|
| 359 |
+
- cosine_precision@10
|
| 360 |
+
- cosine_recall@1
|
| 361 |
+
- cosine_recall@3
|
| 362 |
+
- cosine_recall@5
|
| 363 |
+
- cosine_recall@10
|
| 364 |
+
- cosine_ndcg@10
|
| 365 |
+
- cosine_mrr@10
|
| 366 |
+
- cosine_map@100
|
| 367 |
+
model-index:
|
| 368 |
+
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v1.5
|
| 369 |
+
results:
|
| 370 |
+
- task:
|
| 371 |
+
type: information-retrieval
|
| 372 |
+
name: Information Retrieval
|
| 373 |
+
dataset:
|
| 374 |
+
name: Unknown
|
| 375 |
+
type: unknown
|
| 376 |
+
metrics:
|
| 377 |
+
- type: cosine_accuracy@1
|
| 378 |
+
value: 0.6777144829967202
|
| 379 |
+
name: Cosine Accuracy@1
|
| 380 |
+
- type: cosine_accuracy@3
|
| 381 |
+
value: 0.8972898325565337
|
| 382 |
+
name: Cosine Accuracy@3
|
| 383 |
+
- type: cosine_accuracy@5
|
| 384 |
+
value: 0.9390643880545486
|
| 385 |
+
name: Cosine Accuracy@5
|
| 386 |
+
- type: cosine_accuracy@10
|
| 387 |
+
value: 0.9691006387018816
|
| 388 |
+
name: Cosine Accuracy@10
|
| 389 |
+
- type: cosine_precision@1
|
| 390 |
+
value: 0.6777144829967202
|
| 391 |
+
name: Cosine Precision@1
|
| 392 |
+
- type: cosine_precision@3
|
| 393 |
+
value: 0.2990966108521779
|
| 394 |
+
name: Cosine Precision@3
|
| 395 |
+
- type: cosine_precision@5
|
| 396 |
+
value: 0.18781287761090967
|
| 397 |
+
name: Cosine Precision@5
|
| 398 |
+
- type: cosine_precision@10
|
| 399 |
+
value: 0.09691006387018813
|
| 400 |
+
name: Cosine Precision@10
|
| 401 |
+
- type: cosine_recall@1
|
| 402 |
+
value: 0.6777144829967202
|
| 403 |
+
name: Cosine Recall@1
|
| 404 |
+
- type: cosine_recall@3
|
| 405 |
+
value: 0.8972898325565337
|
| 406 |
+
name: Cosine Recall@3
|
| 407 |
+
- type: cosine_recall@5
|
| 408 |
+
value: 0.9390643880545486
|
| 409 |
+
name: Cosine Recall@5
|
| 410 |
+
- type: cosine_recall@10
|
| 411 |
+
value: 0.9691006387018816
|
| 412 |
+
name: Cosine Recall@10
|
| 413 |
+
- type: cosine_ndcg@10
|
| 414 |
+
value: 0.8364282304724784
|
| 415 |
+
name: Cosine Ndcg@10
|
| 416 |
+
- type: cosine_mrr@10
|
| 417 |
+
value: 0.7924261355385132
|
| 418 |
+
name: Cosine Mrr@10
|
| 419 |
+
- type: cosine_map@100
|
| 420 |
+
value: 0.7938274567816883
|
| 421 |
+
name: Cosine Map@100
|
| 422 |
+
---
|
| 423 |
+
|
| 424 |
+
# SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v1.5
|
| 425 |
+
|
| 426 |
+
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-m-v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
|
| 427 |
+
|
| 428 |
+
## Model Details
|
| 429 |
+
|
| 430 |
+
### Model Description
|
| 431 |
+
- **Model Type:** Sentence Transformer
|
| 432 |
+
- **Base model:** [Snowflake/snowflake-arctic-embed-m-v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5) <!-- at revision 8e4eaca09c27ad3d501908636ec7c8bc3561b6de -->
|
| 433 |
+
- **Maximum Sequence Length:** 512 tokens
|
| 434 |
+
- **Output Dimensionality:** 768 dimensions
|
| 435 |
+
- **Similarity Function:** Cosine Similarity
|
| 436 |
+
<!-- - **Training Dataset:** Unknown -->
|
| 437 |
+
<!-- - **Language:** Unknown -->
|
| 438 |
+
<!-- - **License:** Unknown -->
|
| 439 |
+
|
| 440 |
+
### Model Sources
|
| 441 |
+
|
| 442 |
+
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
|
| 443 |
+
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
|
| 444 |
+
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
|
| 445 |
+
|
| 446 |
+
### Full Model Architecture
|
| 447 |
+
|
| 448 |
+
```
|
| 449 |
+
SentenceTransformer(
|
| 450 |
+
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
|
| 451 |
+
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
|
| 452 |
+
(2): Normalize()
|
| 453 |
+
)
|
| 454 |
+
```
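
Because the pipeline above ends in a `Normalize()` module, every embedding it returns has unit length, so cosine similarity and a plain dot product give the same score. The following is a minimal pure-Python sketch of that property, assuming made-up 4-dimensional vectors in place of the real 768-dimensional outputs; the helper names are illustrative and not part of the library.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (the job of a Normalize() step)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy stand-ins for raw CLS-token vectors; the values are invented for illustration.
raw_a = [0.5, -1.0, 2.0, 0.0]
raw_b = [1.5, 0.5, 1.0, -0.5]

unit_a = l2_normalize(raw_a)
unit_b = l2_normalize(raw_b)

# Dot product of the normalized vectors vs. cosine similarity of the raw ones.
print(dot(unit_a, unit_b), cosine(raw_a, raw_b))
```

In practice this means a vector index configured for inner-product search should rank passages the same way as one configured for cosine similarity.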
|
| 455 |
+
|
| 456 |
+
## Usage
|
| 457 |
+
|
| 458 |
+
### Direct Usage (Sentence Transformers)
|
| 459 |
+
|
| 460 |
+
First install the Sentence Transformers library:
|
| 461 |
+
|
| 462 |
+
```bash
|
| 463 |
+
pip install -U sentence-transformers
|
| 464 |
+
```
|
| 465 |
+
|
| 466 |
+
Then you can load this model and run inference.
|
| 467 |
+
```python
|
| 468 |
+
from sentence_transformers import SentenceTransformer
|
| 469 |
+
|
| 470 |
+
# Download from the 🤗 Hub
|
| 471 |
+
model = SentenceTransformer("fjavigv/snoweu_v4")
|
| 472 |
+
# Run inference
|
| 473 |
+
sentences = [
|
| 474 |
+
'The document outlines various chemical substances classified as carcinogenic or toxic for reproduction, detailing their respective categories and regulatory dates. Specific compounds such as diarsenic trioxide, lead chromate, and chromium trioxide are highlighted, indicating their potential health risks and the timeline for their regulation.',
|
| 475 |
+
'57(f) – human health) (a) 21 August 2013 (*) (b) By way of derogation from point (a): 14 June 2023 for uses in mixtures containing DIBP at or above 0,1 % and below 0,3 % weight by weight. (a) 21 February 2015 (**) (b) By way of derogation from point (a): 14 December 2024 for uses in mixtures containing DIBP at or above 0,1 % and below 0,3 % weight by weight. - [▼M15](./../../../legal-content/EN/AUTO/?uri=celex:32012R0125 "32012R0125: INSERTED") 8. Diarsenic trioxide EC No: 215-481-4 CAS No: 1327-53-3 Carcinogenic (category 1A) 21 November 2013 21 May 2015 — 9. Diarsenic pentaoxide EC No: 215-116-9 CAS No: 1303-28-2 Carcinogenic (category 1A) 21 November 2013 21 May 2015 — 10. Lead chromate EC No: 231-846-0 CAS No: 7758-97-6 Carcinogenic (category 1B) Toxic for reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43 (*2) ◄ — 11. Lead sulfochromate yellow (C.I. Pigment Yellow 34) EC No: 215-693-7 CAS No: 1344-37-2 Carcinogenic (category 1B) Toxic for reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43 (*2) ◄ — 12. Lead chromate molybdate sulphate red (C.I. Pigment Red 104) EC No: 235-759-9 CAS No: 12656-85-8 Carcinogenic (category 1B) Toxic for reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43 (*2) ◄ 13. Tris (2-chloroethyl) phosphate (TCEP) EC No: 204-118-5 CAS No: 115-96-8 Toxic for reproduction (category 1B) 21 February 2014 21 August 2015 14. 2,4-Dinitrotoluene (2,4-DNT) EC No: 204-450-0 CAS No: 121-14-2 Carcinogenic (category 1B) 21 February 2014 ►M43 (*1) ◄ 21 August 2015 ►M43 (*2) ◄ [▼M22](./../../../legal-content/EN/AUTO/?uri=celex:32013R0348 "32013R0348: INSERTED") 15. Trichloroethylene EC No: 201-167-4 CAS No: 79-01-6 Carcinogenic (category 1B) 21 October 2014 ►M43 (*1) ◄ 21 April 2016 ►M43 (*2) ◄ — 16. Chromium trioxide EC No: 215-607-8 CAS No: 1333-82-0 Carcinogenic (category 1A) Mutagenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 17. 
Acids generated from chromium trioxide and their oligomers Group containing: Chromic acid EC No: 231-801-5 CAS No: 7738-94-5 Dichromic acid EC No: 236-881-5 CAS No: 13530-68-2 Oligomers of chromic acid and dichromic acid EC No: not yet assigned CAS No: not yet assigned Carcinogenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 18. Sodium dichromate EC No: 234-190-3 CAS No: 7789-12-0 10588-01-9 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 19. Potassium dichromate EC No: 231-906-6 CAS No: 7778-50-9 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 20. Ammonium dichromate EC No: 232-143-1 CAS No: 7789-09-5 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ 21. Potassium chromate EC No: 232-140-5 CAS No: 7789-00-6 Carcinogenic (category 1B) Mutagenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ 22. Sodium chromate EC No: 231-889-5 CAS No: 7775-11-3 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ [▼M28](./../../../legal-content/EN/AUTO/?uri=celex:32014R0895 "32014R0895: INSERTED") 23. Formaldehyde, oligomeric reaction products with aniline (technical MDA) EC No: 500-036-1 CAS No: 25214-70-4 Carcinogenic (category 1B) 22 February 2016 ►M43 (*1) ◄ 22 August 2017 ►M43 (*2) ◄ — 24. Arsenic acid EC No: 231-901-9 CAS No: 7778-39-4 Carcinogenic (category 1A) 22 February 2016 22 August 2017 — 25. Bis(2-methoxyethyl) ether (diglyme) EC No: 203-924-4 CAS No: 111-96-6 Toxic for reproduction (category 1B) 22 February 2016 ►M43 (*1) ◄ 22 August 2017 ►M43 (*2) ◄ — 26. 
1,2-dichloroethane (EDC) EC No: 203-458-1 CAS No: 107-06-2 Carcinogenic (category 1B) 22 May 2016 22 November 2017 — 27. 2,2′-dichloro-4,4′-methylenedianiline (MOCA) EC No: 202-918-9 CAS No: 101-14-4 Carcinogenic (category 1B) 22 May 2016 ►M43 (*1) ◄ 22 November 2017 ►M43 (*2) ◄ — 28. Dichromium tris(chromate) EC No: 246-356-2 CAS No: 24613-89-6 Carcinogenic (category 1B) 22 July 2017 ►M43 (*1) ◄ 22 January 2019 ►M43 (*2) ◄ — 29. Strontium chromate EC No: 232-142-6 CAS No: 7789-06-2 Carcinogenic (category 1B) 22 July 2017 ►M43 (*1) ◄ 22 January 2019 ►M43 (*2) ◄ — 30. Potassium hydroxyoctaoxodizincatedichromate EC',
|
| 476 |
+
'(c)\n\nthe financial soundness of the proposed acquirer, in particular in relation to the type of business pursued and envisaged in the investment firm in which the acquisition is proposed;\n\n(d)\n\nwhether the investment firm will be able to comply and continue to comply with the prudential requirements based on this Directive and, where applicable, other Directives, in particular Directives 2002/87/EC and 2013/36/EU, in particular, whether the group of which it will become a part has a structure that makes it possible to exercise effective supervision, effectively exchange information among the competent authorities and determine the allocation of responsibilities among the competent authorities;\n\n(e)',
|
| 477 |
+
]
|
| 478 |
+
embeddings = model.encode(sentences)
|
| 479 |
+
print(embeddings.shape)
|
| 480 |
+
# (3, 768)
|
| 481 |
+
|
| 482 |
+
# Get the similarity scores for the embeddings
|
| 483 |
+
similarities = model.similarity(embeddings, embeddings)
|
| 484 |
+
print(similarities.shape)
|
| 485 |
+
# torch.Size([3, 3])
|
| 486 |
+
```
|
| 487 |
+
|
| 488 |
+
<!--
|
| 489 |
+
### Direct Usage (Transformers)
|
| 490 |
+
|
| 491 |
+
<details><summary>Click to see the direct usage in Transformers</summary>
|
| 492 |
+
|
| 493 |
+
</details>
|
| 494 |
+
-->
|
| 495 |
+
|
| 496 |
+
<!--
|
| 497 |
+
### Downstream Usage (Sentence Transformers)
|
| 498 |
+
|
| 499 |
+
You can finetune this model on your own dataset.
|
| 500 |
+
|
| 501 |
+
<details><summary>Click to expand</summary>
|
| 502 |
+
|
| 503 |
+
</details>
|
| 504 |
+
-->
|
| 505 |
+
|
| 506 |
+
<!--
|
| 507 |
+
### Out-of-Scope Use
|
| 508 |
+
|
| 509 |
+
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
|
| 510 |
+
-->
|
| 511 |
+
|
| 512 |
+
## Evaluation
|
| 513 |
+
|
| 514 |
+
### Metrics
|
| 515 |
+
|
| 516 |
+
#### Information Retrieval
|
| 517 |
+
|
| 518 |
+
* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
|
| 519 |
+
|
| 520 |
+
| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.6777     |
| cosine_accuracy@3   | 0.8973     |
| cosine_accuracy@5   | 0.9391     |
| cosine_accuracy@10  | 0.9691     |
| cosine_precision@1  | 0.6777     |
| cosine_precision@3  | 0.2991     |
| cosine_precision@5  | 0.1878     |
| cosine_precision@10 | 0.0969     |
| cosine_recall@1     | 0.6777     |
| cosine_recall@3     | 0.8973     |
| cosine_recall@5     | 0.9391     |
| cosine_recall@10    | 0.9691     |
| **cosine_ndcg@10**  | **0.8364** |
| cosine_mrr@10       | 0.7924     |
| cosine_map@100      | 0.7938     |
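
In these results accuracy@k equals recall@k, and precision@k is accuracy@k divided by k (for example, 0.8973 / 3 ≈ 0.2991). That pattern is what one would expect when each evaluation query has exactly one relevant passage; this is an inference from the numbers, not something the card states. The sketch below shows the relationship on invented rankings, using hypothetical helper names rather than the evaluator's own code.

```python
def metrics_at_k(ranked_ids, relevant_id, k):
    """Return (accuracy, precision, recall) at k for a query with ONE relevant document."""
    hit = 1.0 if relevant_id in ranked_ids[:k] else 0.0
    # One relevant document: recall is hit/1, precision is hit/k.
    return hit, hit / k, hit


def reciprocal_rank(ranked_ids, relevant_id, k=10):
    """1/rank of the relevant document within the top k, or 0 if it is missing."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Invented (ranking, gold id) pairs for three queries.
toy_results = [
    (["d1", "d7", "d3"], "d1"),  # relevant passage at rank 1
    (["d4", "d2", "d9"], "d2"),  # relevant passage at rank 2
    (["d5", "d6", "d8"], "d0"),  # relevant passage not retrieved
]

k = 3
per_query = [metrics_at_k(ranked, gold, k) for ranked, gold in toy_results]
accuracy = mean(m[0] for m in per_query)
precision = mean(m[1] for m in per_query)
recall = mean(m[2] for m in per_query)
mrr = mean(reciprocal_rank(ranked, gold) for ranked, gold in toy_results)

print(accuracy, precision, recall, mrr)
```

Under that single-relevant-passage assumption, precision@k cannot exceed 1/k, so its low values at larger k are not a sign of poor retrieval.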
|
| 537 |
+
|
| 538 |
+
<!--
|
| 539 |
+
## Bias, Risks and Limitations
|
| 540 |
+
|
| 541 |
+
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
|
| 542 |
+
-->
|
| 543 |
+
|
| 544 |
+
<!--
|
| 545 |
+
### Recommendations
|
| 546 |
+
|
| 547 |
+
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
|
| 548 |
+
-->
|
| 549 |
+
|
| 550 |
+
## Training Details
|
| 551 |
+
|
| 552 |
+
### Training Dataset
|
| 553 |
+
|
| 554 |
+
#### Unnamed Dataset
|
| 555 |
+
|
| 556 |
+
* Size: 46,338 training samples
|
| 557 |
+
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
|
| 558 |
+
* Approximate statistics based on the first 1000 samples:
|
| 559 |
+
|         | sentence_0                                                                          | sentence_1                                                                         |
|:--------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
| type    | string                                                                              | string                                                                             |
| details | <ul><li>min: 11 tokens</li><li>mean: 35.09 tokens</li><li>max: 214 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 202.2 tokens</li><li>max: 512 tokens</li></ul> |
|
| 563 |
+
* Samples:
|
| 564 |
+
| sentence_0 | sentence_1 |
|
| 565 |
+
|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
| 566 |
+
| <code>How do the Academies support education and training providers in maintaining and ensuring the quality of the training offered?</code> | <code>to in Chapter IV of this Regulation; (b) promoting the voluntary use of the learning programmes, content and materials by education and training providers in the Member States; --- --- (c) offering support to the education and training providers that use the learning programmes, content and materials produced by the Academies to uphold the quality of the training offered and to develop mechanisms to ensure the quality of the training offered; --- --- (d) developing credentials, including, if appropriate, micro-credentials, for voluntary use by Member States and education and training providers on their territories, in order to facilitate the identification of skills and, where appropriate, the recognition of qualifications, to enhance the</code> |
|
| 567 |
+
| <code>The text provides a comprehensive list of various nickel compounds, including their chemical names and associated identifiers. It covers a range of nickel salts, oxides, and other derivatives, highlighting their diverse applications and chemical properties. The compounds mentioned include nickel arsenate, nickel oxalate, and nickel dichromate, among others, indicating their significance in industrial and chemical processes.</code> | <code>[5] 235-688-3 [5] 12519-85-6 [5] Dinickel hexacyanoferrate 028-037-00-8 238-946-3 14874-78-3 Trinickel bis(arsenate); Nickel (II) arsenate 028-038-00-3 236-771-7 13477-70-8 Nickel oxalate; [1] 028-039-00-9 208-933-7 [1] 547-67-1 [1] Oxalic acid, nickel salt; [2] 243-867-2 [2] 20543-06-0 [2] Nickel telluride 028-040-00-4 235-260-6 12142-88-0 Trinickel tetrasulfide 028-041-00-X — 12137-12-1 Trinickel bis(arsenite) 028-042-00-5 — 74646-29-0 Cobalt nickel gray periclase; 028-043-00-0 C.I. Pigment Black 25; C.I. 77332; [1] 269-051-6 [1] 68186-89-0 [1] Cobalt nickel dioxide; [2] 261-346-8 [2] 58591-45-0 [2] Cobalt nickel oxide; [3] - [3] 12737-30-3 [3] Nickel tin trioxide; Nickel stannate 028-044-00-6 234-824-9 12035-38-0 Nickel triuranium decaoxide 028-045-00-1 239-876-6 15780-33-3 Nickel dithiocyanate 028-046-00-7 237-205-1 13689-92-4 Nickel dichromate 028-047-00-2 239-646-5 15586-38-6 Nickel (II) selenite 028-048-00-8 233-263-7 10101-96-9 Nickel selenide 028-049-00-3 215-216-2 1314-05-2 S...</code> |
|
| 568 |
+
| <code>What is the definition of 'Union airport managing body' and how does it relate to the management of centralized infrastructures for fuel distribution systems?</code> | <code>(2)<br><br>‘Union airport managing body’ means, in respect of a Union airport, the ‘airport managing body’ as defined in Article 2, point (2), of Directive 2009/12/EC or, where the Member State concerned has reserved the management of the centralised infrastructures for fuel distribution systems for another body pursuant to Article 8(1) of Council Directive 96/67/EC ( 2 ), that other body;<br><br>(3)<br><br>‘aircraft operator’ means a person that operated at least 500 commercial passenger air transport flights, or 52 commercial all-cargo air transport flights departing from Union airports in the previous reporting period or, where it is not possible for that person to be identified, the owner of the aircraft;<br><br>(4)</code> |
|
| 569 |
+
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
|
| 570 |
+
```json
|
| 571 |
+
{
|
| 572 |
+
"loss": "MultipleNegativesRankingLoss",
|
| 573 |
+
"matryoshka_dims": [
|
| 574 |
+
768,
|
| 575 |
+
512,
|
| 576 |
+
256,
|
| 577 |
+
128,
|
| 578 |
+
64
|
| 579 |
+
],
|
| 580 |
+
"matryoshka_weights": [
|
| 581 |
+
1,
|
| 582 |
+
1,
|
| 583 |
+
1,
|
| 584 |
+
1,
|
| 585 |
+
1
|
| 586 |
+
],
|
| 587 |
+
"n_dims_per_step": -1
|
| 588 |
+
}
|
| 589 |
+
```
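
The parameters above describe a Matryoshka wrapper around an in-batch-negatives ranking loss: the base loss is evaluated on the first 768, 512, 256, 128 and 64 embedding dimensions, and the five results are summed with equal weights. The following pure-Python sketch illustrates that idea on toy 8-dimensional vectors with prefix lengths 8, 4 and 2. It is a simplified illustration, not the library's implementation; the function names, the toy data and the score scale of 20 are assumptions made for the example.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine scores.

    positives[i] is the target for anchors[i]; every other positive in the
    batch acts as a negative. The scale value is an assumption for this sketch.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cos_sim(anchor, p) for p in positives]
        max_s = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over embeddings truncated to each prefix length."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_a = [a[:dim] for a in anchors]
        truncated_p = [p[:dim] for p in positives]
        total += weight * in_batch_ranking_loss(truncated_a, truncated_p)
    return total


# Invented 8-dimensional "embeddings" for a batch of two (query, passage) pairs.
anchors = [
    [1.0, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 0.3],
    [0.0, 1.0, 0.0, 0.2, 0.0, 0.1, 0.3, 0.0],
]
positives = [
    [0.9, 0.1, 0.1, 0.0, 0.2, 0.0, 0.0, 0.2],
    [0.1, 0.8, 0.0, 0.3, 0.0, 0.2, 0.2, 0.0],
]
dims = [8, 4, 2]
weights = [1.0, 1.0, 1.0]

total = matryoshka_loss(anchors, positives, dims, weights)
print(total)
```

Training against every prefix is what allows embeddings to be cut to a shorter prefix at inference time; after truncation the vectors are no longer unit length, so they should be re-normalized or compared with cosine similarity.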
|
| 590 |
+
|
| 591 |
+
### Training Hyperparameters
|
| 592 |
+
#### Non-Default Hyperparameters
|
| 593 |
+
|
| 594 |
+
- `eval_strategy`: steps
|
| 595 |
+
- `per_device_train_batch_size`: 4
|
| 596 |
+
- `per_device_eval_batch_size`: 4
|
| 597 |
+
- `num_train_epochs`: 4
|
| 598 |
+
- `multi_dataset_batch_sampler`: round_robin
|
| 599 |
+
|
| 600 |
+
#### All Hyperparameters
|
| 601 |
+
<details><summary>Click to expand</summary>
|
| 602 |
+
|
| 603 |
+
- `overwrite_output_dir`: False
|
| 604 |
+
- `do_predict`: False
|
| 605 |
+
- `eval_strategy`: steps
|
| 606 |
+
- `prediction_loss_only`: True
|
| 607 |
+
- `per_device_train_batch_size`: 4
|
| 608 |
+
- `per_device_eval_batch_size`: 4
|
| 609 |
+
- `per_gpu_train_batch_size`: None
|
| 610 |
+
- `per_gpu_eval_batch_size`: None
|
| 611 |
+
- `gradient_accumulation_steps`: 1
|
| 612 |
+
- `eval_accumulation_steps`: None
|
| 613 |
+
- `torch_empty_cache_steps`: None
|
| 614 |
+
- `learning_rate`: 5e-05
|
| 615 |
+
- `weight_decay`: 0.0
|
| 616 |
+
- `adam_beta1`: 0.9
|
| 617 |
+
- `adam_beta2`: 0.999
|
| 618 |
+
- `adam_epsilon`: 1e-08
|
| 619 |
+
- `max_grad_norm`: 1
|
| 620 |
+
- `num_train_epochs`: 4
|
| 621 |
+
- `max_steps`: -1
|
| 622 |
+
- `lr_scheduler_type`: linear
|
| 623 |
+
- `lr_scheduler_kwargs`: {}
|
| 624 |
+
- `warmup_ratio`: 0.0
|
| 625 |
+
- `warmup_steps`: 0
|
| 626 |
+
- `log_level`: passive
|
| 627 |
+
- `log_level_replica`: warning
|
| 628 |
+
- `log_on_each_node`: True
|
| 629 |
+
- `logging_nan_inf_filter`: True
|
| 630 |
+
- `save_safetensors`: True
|
| 631 |
+
- `save_on_each_node`: False
|
| 632 |
+
- `save_only_model`: False
|
| 633 |
+
- `restore_callback_states_from_checkpoint`: False
|
| 634 |
+
- `no_cuda`: False
|
| 635 |
+
- `use_cpu`: False
|
| 636 |
+
- `use_mps_device`: False
|
| 637 |
+
- `seed`: 42
|
| 638 |
+
- `data_seed`: None
|
| 639 |
+
- `jit_mode_eval`: False
|
| 640 |
+
- `use_ipex`: False
|
| 641 |
+
- `bf16`: False
|
| 642 |
+
- `fp16`: False
|
| 643 |
+
- `fp16_opt_level`: O1
|
| 644 |
+
- `half_precision_backend`: auto
|
| 645 |
+
- `bf16_full_eval`: False
|
| 646 |
+
- `fp16_full_eval`: False
|
| 647 |
+
- `tf32`: None
|
| 648 |
+
- `local_rank`: 0
|
| 649 |
+
- `ddp_backend`: None
|
| 650 |
+
- `tpu_num_cores`: None
|
| 651 |
+
- `tpu_metrics_debug`: False
|
| 652 |
+
- `debug`: []
|
| 653 |
+
- `dataloader_drop_last`: False
|
| 654 |
+
- `dataloader_num_workers`: 0
|
| 655 |
+
- `dataloader_prefetch_factor`: None
|
| 656 |
+
- `past_index`: -1
|
| 657 |
+
- `disable_tqdm`: False
|
| 658 |
+
- `remove_unused_columns`: True
|
| 659 |
+
- `label_names`: None
|
| 660 |
+
- `load_best_model_at_end`: False
|
| 661 |
+
- `ignore_data_skip`: False
|
| 662 |
+
- `fsdp`: []
|
| 663 |
+
- `fsdp_min_num_params`: 0
|
| 664 |
+
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
|
| 665 |
+
- `fsdp_transformer_layer_cls_to_wrap`: None
|
| 666 |
+
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
|
| 667 |
+
- `deepspeed`: None
|
| 668 |
+
- `label_smoothing_factor`: 0.0
|
| 669 |
+
- `optim`: adamw_torch
|
| 670 |
+
- `optim_args`: None
|
| 671 |
+
- `adafactor`: False
|
| 672 |
+
- `group_by_length`: False
|
| 673 |
+
- `length_column_name`: length
|
| 674 |
+
- `ddp_find_unused_parameters`: None
|
| 675 |
+
- `ddp_bucket_cap_mb`: None
|
| 676 |
+
- `ddp_broadcast_buffers`: False
|
| 677 |
+
- `dataloader_pin_memory`: True
|
| 678 |
+
- `dataloader_persistent_workers`: False
|
| 679 |
+
- `skip_memory_metrics`: True
|
| 680 |
+
- `use_legacy_prediction_loop`: False
|
| 681 |
+
- `push_to_hub`: False
|
| 682 |
+
- `resume_from_checkpoint`: None
|
| 683 |
+
- `hub_model_id`: None
|
| 684 |
+
- `hub_strategy`: every_save
|
| 685 |
+
- `hub_private_repo`: None
|
| 686 |
+
- `hub_always_push`: False
|
| 687 |
+
- `gradient_checkpointing`: False
|
| 688 |
+
- `gradient_checkpointing_kwargs`: None
|
| 689 |
+
- `include_inputs_for_metrics`: False
|
| 690 |
+
- `include_for_metrics`: []
|
| 691 |
+
- `eval_do_concat_batches`: True
|
| 692 |
+
- `fp16_backend`: auto
|
| 693 |
+
- `push_to_hub_model_id`: None
|
| 694 |
+
- `push_to_hub_organization`: None
|
| 695 |
+
- `mp_parameters`:
|
| 696 |
+
- `auto_find_batch_size`: False
|
| 697 |
+
- `full_determinism`: False
|
| 698 |
+
- `torchdynamo`: None
|
| 699 |
+
- `ray_scope`: last
|
| 700 |
+
- `ddp_timeout`: 1800
|
| 701 |
+
- `torch_compile`: False
|
| 702 |
+
- `torch_compile_backend`: None
|
| 703 |
+
- `torch_compile_mode`: None
|
| 704 |
+
- `dispatch_batches`: None
|
| 705 |
+
- `split_batches`: None
|
| 706 |
+
- `include_tokens_per_second`: False
|
| 707 |
+
- `include_num_input_tokens_seen`: False
|
| 708 |
+
- `neftune_noise_alpha`: None
|
| 709 |
+
- `optim_target_modules`: None
|
| 710 |
+
- `batch_eval_metrics`: False
|
| 711 |
+
- `eval_on_start`: False
|
| 712 |
+
- `use_liger_kernel`: False
|
| 713 |
+
- `eval_use_gather_object`: False
|
| 714 |
+
- `average_tokens_across_devices`: False
|
| 715 |
+
- `prompts`: None
|
| 716 |
+
- `batch_sampler`: batch_sampler
|
| 717 |
+
- `multi_dataset_batch_sampler`: round_robin
|
| 718 |
+
|
| 719 |
+
</details>
|
| 720 |
+
|
| 721 |
+
### Training Logs
|
| 722 |
+
| Epoch  | Step  | Training Loss | cosine_ndcg@10 |
|:------:|:-----:|:-------------:|:--------------:|
| 0.0432 | 500   | 0.5169        | 0.7365         |
| 0.0863 | 1000  | 0.1341        | 0.7914         |
| 0.1295 | 1500  | 0.0784        | 0.7992         |
| 0.1726 | 2000  | 0.0782        | 0.8058         |
| 0.2158 | 2500  | 0.0596        | 0.8012         |
| 0.2590 | 3000  | 0.057         | 0.8079         |
| 0.3021 | 3500  | 0.0785        | 0.8086         |
| 0.3453 | 4000  | 0.0423        | 0.8010         |
| 0.3884 | 4500  | 0.0586        | 0.8075         |
| 0.4316 | 5000  | 0.0508        | 0.8008         |
| 0.4748 | 5500  | 0.0764        | 0.7934         |
| 0.5179 | 6000  | 0.0583        | 0.8068         |
| 0.5611 | 6500  | 0.0663        | 0.8008         |
| 0.6042 | 7000  | 0.0344        | 0.8083         |
| 0.6474 | 7500  | 0.0506        | 0.8104         |
| 0.6905 | 8000  | 0.0478        | 0.8089         |
| 0.7337 | 8500  | 0.0509        | 0.8034         |
| 0.7769 | 9000  | 0.0426        | 0.8114         |
| 0.8200 | 9500  | 0.0603        | 0.8097         |
| 0.8632 | 10000 | 0.036         | 0.8142         |
| 0.9063 | 10500 | 0.0581        | 0.8081         |
| 0.9495 | 11000 | 0.0351        | 0.8018         |
| 0.9927 | 11500 | 0.0358        | 0.8082         |
| 1.0    | 11585 | -             | 0.8076         |
| 1.0358 | 12000 | 0.0398        | 0.8093         |
| 1.0790 | 12500 | 0.0197        | 0.8023         |
| 1.1221 | 13000 | 0.0376        | 0.8137         |
| 1.1653 | 13500 | 0.0287        | 0.8136         |
| 1.2085 | 14000 | 0.0269        | 0.8146         |
| 1.2516 | 14500 | 0.0089        | 0.8161         |
| 1.2948 | 15000 | 0.0149        | 0.8126         |
| 1.3379 | 15500 | 0.0457        | 0.8138         |
| 1.3811 | 16000 | 0.0119        | 0.8171         |
| 1.4243 | 16500 | 0.0107        | 0.8105         |
| 1.4674 | 17000 | 0.015         | 0.8171         |
| 1.5106 | 17500 | 0.0208        | 0.8153         |
| 1.5537 | 18000 | 0.0168        | 0.8111         |
| 1.5969 | 18500 | 0.0114        | 0.8171         |
| 1.6401 | 19000 | 0.0188        | 0.8239         |
| 1.6832 | 19500 | 0.01          | 0.8182         |
| 1.7264 | 20000 | 0.0158        | 0.8125         |
| 1.7695 | 20500 | 0.0155        | 0.8201         |
| 1.8127 | 21000 | 0.0276        | 0.8182         |
| 1.8558 | 21500 | 0.0245        | 0.8123         |
| 1.8990 | 22000 | 0.0135        | 0.8223         |
| 1.9422 | 22500 | 0.0334        | 0.8182         |
| 1.9853 | 23000 | 0.0111        | 0.8200         |
| 2.0    | 23170 | -             | 0.8221         |
| 2.0285 | 23500 | 0.0139        | 0.8225         |
| 2.0716 | 24000 | 0.0113        | 0.8237         |
| 2.1148 | 24500 | 0.0072        | 0.8223         |
| 2.1580 | 25000 | 0.0138        | 0.8218         |
| 2.2011 | 25500 | 0.0071        | 0.8200         |
| 2.2443 | 26000 | 0.0091        | 0.8240         |
| 2.2874 | 26500 | 0.013         | 0.8224         |
| 2.3306 | 27000 | 0.008         | 0.8248         |
| 2.3738 | 27500 | 0.0084        | 0.8203         |
| 2.4169 | 28000 | 0.0147        | 0.8255         |
| 2.4601 | 28500 | 0.0067        | 0.8268         |
| 2.5032 | 29000 | 0.0028        | 0.8219         |
| 2.5464 | 29500 | 0.0124        | 0.8234         |
| 2.5896 | 30000 | 0.0051        | 0.8237         |
| 2.6327 | 30500 | 0.0151        | 0.8256         |
| 2.6759 | 31000 | 0.0051        | 0.8207         |
| 2.7190 | 31500 | 0.0086        | 0.8250         |
| 2.7622 | 32000 | 0.0152        | 0.8265         |
| 2.8054 | 32500 | 0.0085        | 0.8297         |
| 2.8485 | 33000 | 0.0097        | 0.8316         |
| 2.8917 | 33500 | 0.0269        | 0.8284         |
| 2.9348 | 34000 | 0.008         | 0.8305         |
| 2.9780 | 34500 | 0.0146        | 0.8309         |
| 3.0    | 34755 | -             | 0.8301         |
| 3.0211 | 35000 | 0.0218        | 0.8326         |
| 3.0643 | 35500 | 0.0152        | 0.8301         |
| 3.1075 | 36000 | 0.0072        | 0.8290         |
| 3.1506 | 36500 | 0.0077        | 0.8270         |
| 3.1938 | 37000 | 0.0155        | 0.8299         |
| 3.2369 | 37500 | 0.0069        | 0.8328         |
| 3.2801 | 38000 | 0.0103        | 0.8364         |
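
The Epoch and Step columns can be cross-checked against the dataset size and batch size reported earlier in this card. The short calculation below assumes a single training device and relies on `gradient_accumulation_steps` being 1 and `dataloader_drop_last` being False, as listed under All Hyperparameters; the variable and function names are illustrative.

```python
import math

train_samples = 46_338  # "Size" reported under Training Dataset
batch_size = 4          # per_device_train_batch_size

# Assumption: one device, no gradient accumulation. With dataloader_drop_last
# set to False, the final partial batch still counts as an optimizer step.
steps_per_epoch = math.ceil(train_samples / batch_size)


def epoch_at(step):
    """Fractional epoch reached after a given number of optimizer steps."""
    return round(step / steps_per_epoch, 4)


print(steps_per_epoch)   # 11585
print(epoch_at(500))     # 0.0432
print(epoch_at(38000))   # 3.2801
```

These values match the logged rows: epochs 1.0, 2.0 and 3.0 fall at steps 11585, 23170 and 34755, and the last logged row (step 38000, about 3.28 epochs) carries the same cosine_ndcg@10 of 0.8364 that is reported in the Evaluation section.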
|
| 803 |
+
|
| 804 |
+
|
| 805 |
+
### Framework Versions
|
| 806 |
+
- Python: 3.10.11
|
| 807 |
+
- Sentence Transformers: 3.4.1
|
| 808 |
+
- Transformers: 4.48.1
|
| 809 |
+
- PyTorch: 2.4.0+cu121
|
| 810 |
+
- Accelerate: 1.4.0
|
| 811 |
+
- Datasets: 3.3.2
|
| 812 |
+
- Tokenizers: 0.21.0
|
| 813 |
+
|
| 814 |
+
## Citation
|
| 815 |
+
|
| 816 |
+
### BibTeX
|
| 817 |
+
|
| 818 |
+
#### Sentence Transformers
|
| 819 |
+
```bibtex
|
| 820 |
+
@inproceedings{reimers-2019-sentence-bert,
|
| 821 |
+
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
|
| 822 |
+
author = "Reimers, Nils and Gurevych, Iryna",
|
| 823 |
+
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
|
| 824 |
+
month = "11",
|
| 825 |
+
year = "2019",
|
| 826 |
+
publisher = "Association for Computational Linguistics",
|
| 827 |
+
url = "https://arxiv.org/abs/1908.10084",
|
| 828 |
+
}
|
| 829 |
+
```
|
| 830 |
+
|
| 831 |
+
#### MatryoshkaLoss
|
| 832 |
+
```bibtex
|
| 833 |
+
@misc{kusupati2024matryoshka,
|
| 834 |
+
title={Matryoshka Representation Learning},
|
| 835 |
+
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
|
| 836 |
+
year={2024},
|
| 837 |
+
eprint={2205.13147},
|
| 838 |
+
archivePrefix={arXiv},
|
| 839 |
+
primaryClass={cs.LG}
|
| 840 |
+
}
|
| 841 |
+
```
|
| 842 |
+
|
| 843 |
+
#### MultipleNegativesRankingLoss
|
| 844 |
+
```bibtex
|
| 845 |
+
@misc{henderson2017efficient,
|
| 846 |
+
title={Efficient Natural Language Response Suggestion for Smart Reply},
|
| 847 |
+
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
|
| 848 |
+
year={2017},
|
| 849 |
+
eprint={1705.00652},
|
| 850 |
+
archivePrefix={arXiv},
|
| 851 |
+
primaryClass={cs.CL}
|
| 852 |
+
}
|
| 853 |
+
```
|
| 854 |
+
|
| 855 |
+
<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->

config.json
ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "Snowflake/snowflake-arctic-embed-m-v1.5",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.48.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

config_sentence_transformers.json
ADDED
@@ -0,0 +1,12 @@
{
  "__version__": {
    "sentence_transformers": "3.4.1",
    "transformers": "4.48.1",
    "pytorch": "2.4.0+cu121"
  },
  "prompts": {
    "query": "Represent this sentence for searching relevant passages: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}

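The `prompts` entry above defines a single `query` prompt, and `default_prompt_name` is `null`, so a prompt is presumably applied only when one is requested explicitly (for example through the `prompt_name` argument of `SentenceTransformer.encode`). The following is a minimal plain-Python sketch of that lookup. `apply_prompt` is a hypothetical helper written for illustration, not a library function, and the two input strings are placeholders.

```python
import json

# The relevant part of config_sentence_transformers.json, copied from the diff above.
CONFIG = json.loads("""
{
  "prompts": {
    "query": "Represent this sentence for searching relevant passages: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
""")


def apply_prompt(text, prompt_name=None, config=CONFIG):
    """Hypothetical helper: prepend the configured prompt, if any, to `text`."""
    name = prompt_name if prompt_name is not None else config["default_prompt_name"]
    if name is None:
        # No prompt requested and no default configured: the text is used as-is.
        return text
    return config["prompts"][name] + text


# Queries get the prompt; passages (no prompt name, no default) do not.
query_text = apply_prompt("example query", prompt_name="query")
passage_text = apply_prompt("example passage")
```

Under this reading, search queries should be encoded with the `query` prompt while passages are encoded unchanged.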
eval/Information-Retrieval_evaluation_results.csv
ADDED
@@ -0,0 +1,5 @@
epoch,steps,cosine-Accuracy@1,cosine-Accuracy@3,cosine-Accuracy@5,cosine-Accuracy@10,cosine-Precision@1,cosine-Recall@1,cosine-Precision@3,cosine-Recall@3,cosine-Precision@5,cosine-Recall@5,cosine-Precision@10,cosine-Recall@10,cosine-MRR@10,cosine-NDCG@10,cosine-MAP@100
1.0,11585,0.6407733471431037,0.8641463835663732,0.913688934921457,0.9535646469877438,0.6407733471431037,0.6407733471431037,0.2880487945221244,0.8641463835663732,0.18273778698429138,0.913688934921457,0.09535646469877437,0.9535646469877438,0.7595666910529687,0.807555539832259,0.7615541387765282
2.0,23170,0.6606248921111687,0.8817538408423959,0.9254272397721388,0.9601242879337131,0.6606248921111687,0.6606248921111687,0.29391794694746537,0.8817538408423959,0.18508544795442772,0.9254272397721388,0.09601242879337128,0.9601242879337131,0.7765838765450377,0.8221377062027736,0.7785590851486591
3.0,34755,0.6658035560158813,0.8953909891248057,0.9352667011910927,0.9675470395304678,0.6658035560158813,0.6658035560158813,0.29846366304160193,0.8953909891248057,0.1870533402382185,0.9352667011910927,0.09675470395304678,0.9675470395304678,0.7845114108708103,0.8300831579250323,0.786055972684675
4.0,46340,0.6759882616951494,0.8984981874676333,0.9373381667529778,0.9696185050923528,0.6759882616951494,0.6759882616951494,0.2994993958225444,0.8984981874676333,0.18746763335059555,0.9373381667529778,0.09696185050923527,0.9696185050923528,0.7910710518167795,0.8355178184076232,0.7924944010948121

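In the rows above, Accuracy@k and Recall@k coincide and Precision@k equals Recall@k divided by k. That pattern is what one would expect if every evaluation query has exactly one relevant passage. This is an inference from the numbers, not something the CSV states. The sketch below checks the relation on the final-epoch row, with values copied from the file.

```python
import math

# Final-epoch row (epoch 4.0, step 46340), values copied from the CSV above.
recall_at_k = {
    1: 0.6759882616951494,
    3: 0.8984981874676333,
    5: 0.9373381667529778,
    10: 0.9696185050923528,
}
precision_at_k = {
    1: 0.6759882616951494,
    3: 0.2994993958225444,
    5: 0.18746763335059555,
    10: 0.09696185050923527,
}


def precision_from_recall(recall, k):
    """Assuming one relevant passage per query, a top-k hit yields precision 1/k."""
    return recall / k


# True when every reported Precision@k matches Recall@k / k up to float rounding.
consistent = all(
    math.isclose(precision_from_recall(recall_at_k[k], k), precision_at_k[k])
    for k in recall_at_k
)

# Change in NDCG@10 between the first and the last epoch.
ndcg_gain = 0.8355178184076232 - 0.807555539832259
```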
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e7c32a0a3c6de3cd1b80ec5d0819e76d0f8094c331a236d4e55e04e3407a4042
size 435588776

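The LFS pointer only records the file size, but that size can be sanity-checked against `config.json`. The sketch below counts parameters for a standard BERT encoder layout (an assumption; the actual tensor list is not shown in this diff) and compares the float32 byte count with the 435588776 bytes recorded above. The arithmetic suggests the checkpoint holds the encoder without a pooler head, with the small remainder plausibly being the safetensors header.

```python
# Values copied from config.json in this commit.
vocab_size = 30522
hidden = 768
intermediate = 3072
layers = 12
max_positions = 512
type_vocab = 2


def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # weight + bias

# Word, position and token-type embeddings plus the embedding LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# One encoder layer: Q, K, V and output projections, feed-forward, two LayerNorms.
per_layer = (
    4 * linear(hidden, hidden)
    + layer_norm
    + linear(hidden, intermediate)
    + linear(intermediate, hidden)
    + layer_norm
)

encoder_params = embeddings + layers * per_layer
pooler_params = linear(hidden, hidden)  # optional BERT pooler head

file_size = 435588776  # from the LFS pointer above
weight_bytes = encoder_params * 4  # float32, per "torch_dtype"
overhead = file_size - weight_bytes  # bytes not explained by the weights
```

Adding the pooler would exceed the recorded file size, which is why the encoder-only reading fits better.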
modules.json
ADDED
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]

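The module list above chains a Transformer, a Pooling step and a final Normalize step. One consequence of the last step is that output embeddings have unit length, so a plain dot product gives the same score as the cosine similarity configured in `similarity_fn_name`. The sketch below illustrates this with two made-up toy vectors. It does not reproduce the pooling mode, which lives in `1_Pooling` and is not shown in this diff.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (what a Normalize step does)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy "pooled" vectors, invented for illustration only.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_hat = l2_normalize(a)
b_hat = l2_normalize(b)

# After normalization the dot product and the cosine similarity agree.
score_dot = dot(a_hat, b_hat)
score_cos = cosine(a, b)
```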
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}

special_tokens_map.json
ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer.json
ADDED
The diff for this file is too large to render.

tokenizer_config.json
ADDED
@@ -0,0 +1,63 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_length": 512,
  "model_max_length": 512,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}

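The tokenizer settings above fix the special-token ids (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102), a maximum length of 512, and right-side padding and truncation. The sketch below shows how those settings shape a single input sequence, assuming the usual `BertTokenizer` single-sequence template `[CLS] tokens [SEP]`. `build_input` is a hypothetical helper, and the content token ids are arbitrary placeholders rather than real vocabulary lookups.

```python
# Special-token ids copied from "added_tokens_decoder" above.
PAD_ID, UNK_ID, CLS_ID, SEP_ID, MASK_ID = 0, 100, 101, 102, 103
MODEL_MAX_LENGTH = 512  # "model_max_length"


def build_input(token_ids, max_length=MODEL_MAX_LENGTH, pad_to=None):
    """Hypothetical helper: wrap ids as [CLS] ... [SEP], truncate and pad on the right."""
    # Reserve two positions for [CLS] and [SEP]; drop overflow from the right.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids)
    if pad_to is not None and pad_to > len(ids):
        n_pad = pad_to - len(ids)
        ids = ids + [PAD_ID] * n_pad  # "padding_side": "right"
        attention_mask = attention_mask + [0] * n_pad
    return ids, attention_mask


# A short sequence padded to length 8.
ids, mask = build_input([2000, 2001, 2002], pad_to=8)

# A 600-token sequence is cut down to the 512-token limit.
long_ids, _ = build_input(list(range(1000, 1600)))
```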
vocab.txt
ADDED
The diff for this file is too large to render.