snoweu_v4 / README.md
metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:46338
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-m-v1.5
widget:
  - source_sentence: >-
      What are the chemical names and corresponding identifiers for octabromo
      derivate and 2-Methoxyethanol, including their CAS numbers and EC numbers?
    sentences:
      - >-
        octabromo derivate 602-094-00-4 251-087-9 32536-52-0 2-Methoxyethanol;
        ethylene glycol monomethyl ether; methylglycol 603-011-00-4 203-713-7
        109-86-4 2-Ethoxyethanol; ethylene glycol monoethyl ether; ethylglycol
        603-012-00-X 203-804-1 110-80-5
        [▼M61](./../../../legal-content/EN/AUTO/?uri=celex:32020R2096
        "32020R2096: INSERTED") Ethylene oxide; oxirane 603-023-00-X 200-849-9
        75-21-8
        [▼C1](./../../../legal-content/EN/AUTO/?uri=celex:32006R1907R%2801%29
        "32006R1907R(01): REPLACED") 1,2-Dimethoxyethane ethylene glycol
        dimethyl ether EGDME 603-031-00-3 203-794-9 110-71-4
        [▼M45](./../../../legal-content/EN/AUTO/?uri=celex:32017R1510
        "32017R1510: INSERTED") Tetrahydro-2-furyl-methanol; tetrahydrofurfuryl
        alcohol 603-061-00-7 202-625-6 97-99-4
      - >-
        hydrocarbons produced as the residual fraction from the distillation of
        heavy coker gas oil and vacuum gas oil. It predominantly consists of
        hydrocarbons having carbon numbers predominantly greater than C13 and
        boiling above approximately 230 °C.) 649-026-00-X 270-796-4 68478-17-1
        Residues (petroleum), heavy coker and light vacuum; Heavy fuel oil (A
        complex combination of hydrocarbons produced as the residual fraction
        from the distillation of heavy coker gas oil and light vacuum gas oil.
        It consists predominantly of hydrocarbons having carbon numbers
        predominantly greater than C13 and boiling above approximately 230 °C.)
        649-027-00-5 270-983-0 68512-61-8 Residues (petroleum), light vacuum;
        Heavy fuel oil (A complex residuum from the vacuum distillation of the
        residuum from the atmospheric distillation of crude oil. It consists of
        hydrocarbons having carbon numbers predominantly greater than C13 and
        boiling above approximately 230 °C.) 649-028-00-0 270-984-6 68512-62-9
        Residues (petroleum), steam-cracked light; Heavy fuel oil (A complex
        residuum from the distillation of the products from a steam-cracking
        process. It consists predominantly of aromatic and unsaturated
        hydrocarbons having carbon numbers greater than C7 and boiling in the
        range of approximately 101 to 555 °C.) 649-029-00-6 271-013-9 68513-69-9
        Fuel oil, No 6; Heavy fuel oil (A distillate oil having a minimum
        viscosity of 197 10-6 m2s-1 at 37,7 °C to a maximum of 197 10-5 m2s-1 at
        37,7 °C.) 649-030-00-1 271-384-7 68553-00-4 Residues (petroleum),
        topping plant, low-sulfur; Heavy fuel oil (A low-sulfur complex
        combination of hydrocarbons produced as the residual fraction from the
        topping plant distillation of crude oil. It is the residuum after the
        straight-run gasoline cut, kerosene cut and gas oil cut have been
        removed.) 649-031-00-7 271-763-7 68607-30-7 Gas oils (petroleum), heavy
        atmospheric; Heavy fuel oil (A complex combination of hydrocarbons
        obtained by the distillation of crude oil. It consists of hydrocarbons
        having carbon numbers predominantly in the range of C7 through C35 and
        boiling in the range of approximately 121 to 510 °C.) 649-032-00-2
        272-184-2 68783-08-4 Residues (petroleum), coker scrubber,
        Condensed-ring-arom.-contg.; Heavy fuel
      - >-
        (e)


        where applicable, how the undertaking assesses the effectiveness of its
        engagement with its own workforce, including, where relevant, any
        agreements or outcomes that result.


        Where applicable, the undertaking shall disclose the steps it takes to
        gain insight into the perspectives of people in its own workforce who
        may be particularly vulnerable to impacts and/or marginalised (for
        example, women, migrants, people with disabilities).


        If the undertaking cannot disclose the above required information
        because it has not adopted a general process to engage with its own
        workforce , it shall disclose this to be the case. It may disclose a
        timeframe in which it aims to have such a process in place.
  - source_sentence: >-
      Under what circumstances can the suspension or removal of a financial
      instrument or derivative from trading be exempted, despite infringing
      Articles 7 and 17 of Regulation (EU) No 596/2014?
    sentences:
      - >-
        (15) Directive 2010/75/EU of the European Parliament and of the Council
        of 24 November 2010 on industrial emissions (integrated pollution
        prevention and control) (recast) (OJ L 334, 17.12.2010, p. 17).


        (16) Directive 2011/92/EU of the European Parliament and of the Council
        of 13 December 2011 on the assessment of the effects of certain public
        and private projects on the environment (OJ L 26, 28.1.2012, p. 1).


        (17) Directive 2012/18/EU of the European Parliament and of the Council
        of 4 July 2012 on the control of major-accident hazards involving
        dangerous substances, amending and subsequently repealing Council
        Directive 96/82/EC (OJ L 197, 24.7.2012, p. 1).
      - >-
        3.


        Where the competent authority of the host Member State of a regulated
        market, an MTF or OTF has clear and demonstrable grounds for believing
        that such regulated market, MTF or OTF infringes the obligations arising
        from the provisions adopted pursuant to this Directive, it shall refer
        those findings to the competent authority of the home Member State of
        the regulated market or the MTF or OTF.
      - >-
        The notified competent authorities of the other Member States shall
        require that regulated markets, other MTFs, other OTFs and systematic
        internalisers, which fall under their jurisdiction and trade the same
        financial instrument or derivatives referred to in points (4) to (10) of
        Section C of Annex I that relate or are referenced to that financial
        instrument, also suspend or remove that financial instrument or
        derivatives from trading, where the suspension or removal is due to
        suspected market abuse, a take-over bid or the non- disclosure of inside
        information about the issuer or financial instrument infringing Articles
        7 and 17 of Regulation (EU) No 596/2014 except where such suspension or
        removal could cause significant damage to the
  - source_sentence: >-
      How can the limitation period for the Commission's powers be interrupted
      according to Article 38?
    sentences:
      - >-
        2.


        That third-country dialogue shall not prevent the Commission from taking
        action under this Regulation. Individual measures adopted pursuant to
        this Regulation shall not be addressed within that dialogue.


        Article 38


        Limitation periods


        1.


        The powers of the Commission under Articles 10 and 11 shall be subject
        to a limitation period of 10 years, starting on the day on which a
        foreign subsidy is granted to an undertaking. Any action taken by the
        Commission under Article 10, 13, 14 or 15 with respect to a foreign
        subsidy shall interrupt the limitation period. After each interruption,
        the limitation period of 10 years shall start to run afresh.


        2.
      - >-
        (36) Member States should promote energy efficient means of mobility,
        including in their public procurement practices, such as rail, cycling,
        walking or shared mobility, by renewing and decarbonising fleets,
        encouraging a modal shift and including those modes in urban mobility
        planning.
      - >-
        air oxidation of petrolatum.) 649-255-00-5 265-206-7 64743-01-7 N
        Petrolatum (petroleum), alumina-treated; Petrolatum (A complex
        combination of hydrocarbons obtained when petrolatum is treated with
        Al2O3 to remove polar components and impurities. It consists
        predominantly of saturated, crystalline, and liquid hydrocarbons having
        carbon numbers predominantly greater than C25.) 649-256-00-0 285-098-5
        85029-74-9 N Petrolatum (petroleum), hydrotreated; Petrolatum (A complex
        combination of hydrocarbons obtained as a semi-solid from dewaxed
        paraffinic residual oil treated with hydrogen in the presence of a
        catalyst. It consists predominantly of saturated, microcrystalline, and
        liquid hydrocarbons having carbon numbers predominantly greater than
  - source_sentence: >-
      What specific sections and points of Annex VIII are included in the
      registration for high-risk AI systems in the areas of law enforcement,
      migration, asylum, and border control management?
    sentences:
      - >-
        ▼M15


        Article 18b


        Assistance from the Commission, EMSA and other relevant organisations


        1.


        For the purposes of carrying out its obligations under Article 3c(4) and
        Articles 3g, 3gd, 3ge, 3gf, 3gg and 18a, the Commission, the
        administering Member State and administering authorities in respect of a
        shipping company may request the assistance of EMSA or another relevant
        organisation and may conclude to that effect any appropriate agreements
        with those organisations.


        2.


        The Commission, assisted by EMSA, shall endeavour to develop appropriate
        tools and guidance to facilitate and coordinate verification and
        enforcement activities related to the application of this Directive to
        maritime transport. As far as practicable, such guidance and tools shall
        be made available to the Member States and the verifiers for
        information-sharing purposes and in order to better ensure robust
        enforcement of the national measures transposing this Directive.


        ▼B


        Article 19


        Registries


        ▼M4


        1.


        Allowances issued from 1 January 2012 onwards shall be held in the ►M9
        Union  registry for the execution of processes pertaining to the
        maintenance of the holding accounts opened in the Member State and the
        allocation, surrender and cancellation of allowances under the
        Commission ►M9 Acts  referred to in paragraph 3.


        Each Member State shall be able to fulfil the execution of authorised
        operations under the UNFCCC or the Kyoto Protocol.


        ▼B


        2.


        Any person may hold allowances. The registry shall be accessible to the
        public and shall contain separate accounts to record the allowances held
        by each person to whom and from whom allowances are issued or
        transferred.


        ▼M9


        3.
      - >-
        (35)


        ‘recycled carbon fuels’ means liquid and gaseous fuels that are produced
        from liquid or solid waste streams of non-renewable origin which are not
        suitable for material recovery in accordance with Article 4 of Directive
        2008/98/EC, or from waste processing gas and exhaust gas of
        non-renewable origin which are produced as an unavoidable and
        unintentional consequence of the production process in industrial
        installations;


        ▼M2


        (36)


        ‘renewable fuels of non-biological origin’ means liquid and gaseous
        fuels the energy content of which is derived from renewable sources
        other than biomass;


        ▼B


        (37)
      - >-
        4. For high-risk AI systems referred to in points 1, 6 and 7 of Annex
        III, in the areas of law enforcement, migration, asylum and border
        control management, the registration referred to in paragraphs 1, 2 and
        3 of this Article shall be in a secure non-public section of the EU
        database referred to in Article 71 and shall include only the following
        information, as applicable, referred to in:


        (a) Section A, points 1 to 10, of Annex VIII, with the exception of
        points 6, 8 and 9; (b) Section B, points 1 to 5, and points 8 and 9 of
        Annex VIII; --- --- (c) Section C, points 1 to 3, of Annex VIII; --- ---
        (d) points 1, 2, 3 and 5, of Annex IX. --- ---
  - source_sentence: >-
      The document outlines various chemical substances classified as
      carcinogenic or toxic for reproduction, detailing their respective
      categories and regulatory dates. Specific compounds such as diarsenic
      trioxide, lead chromate, and chromium trioxide are highlighted, indicating
      their potential health risks and the timeline for their regulation.
    sentences:
      - >-
        57(f) – human health) (a) 21 August 2013 (*) (b) By way of derogation
        from point (a): 14 June 2023 for uses in mixtures containing DIBP at or
        above 0,1 % and below 0,3 % weight by weight. (a) 21 February 2015 (**)
        (b) By way of derogation from point (a): 14 December 2024 for uses in
        mixtures containing DIBP at or above 0,1 % and below 0,3 % weight by
        weight. - [▼M15](./../../../legal-content/EN/AUTO/?uri=celex:32012R0125
        "32012R0125: INSERTED") 8. Diarsenic trioxide EC No: 215-481-4 CAS No:
        1327-53-3 Carcinogenic (category 1A) 21 November 2013 21 May 2015 — 9.
        Diarsenic pentaoxide EC No: 215-116-9 CAS No: 1303-28-2 Carcinogenic
        (category 1A) 21 November 2013 21 May 2015 — 10. Lead chromate EC No:
        231-846-0 CAS No: 7758-97-6 Carcinogenic (category 1B) Toxic for
        reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43
        (*2) ◄ — 11. Lead sulfochromate yellow (C.I. Pigment Yellow 34) EC No:
        215-693-7 CAS No: 1344-37-2 Carcinogenic (category 1B) Toxic for
        reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43
        (*2) ◄ — 12. Lead chromate molybdate sulphate red (C.I. Pigment Red 104)
        EC No: 235-759-9 CAS No: 12656-85-8 Carcinogenic (category 1B) Toxic for
        reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43
        (*2) ◄ 13. Tris (2-chloroethyl) phosphate (TCEP) EC No: 204-118-5 CAS
        No: 115-96-8 Toxic for reproduction (category 1B) 21 February 2014 21
        August 2015 14. 2,4-Dinitrotoluene (2,4-DNT) EC No: 204-450-0 CAS No:
        121-14-2 Carcinogenic (category 1B) 21 February 2014 ►M43 (*1) ◄ 21
        August 2015 ►M43 (*2) ◄
        [▼M22](./../../../legal-content/EN/AUTO/?uri=celex:32013R0348
        "32013R0348: INSERTED") 15. Trichloroethylene EC No: 201-167-4 CAS No:
        79-01-6 Carcinogenic (category 1B) 21 October 2014 ►M43 (*1) ◄ 21 April
        2016 ►M43 (*2) ◄ — 16. Chromium trioxide EC No: 215-607-8 CAS No:
        1333-82-0 Carcinogenic (category 1A) Mutagenic (category 1B) 21 March
        2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 17. Acids generated
        from chromium trioxide and their oligomers Group containing: Chromic
        acid EC No: 231-801-5 CAS No: 7738-94-5 Dichromic acid EC No: 236-881-5
        CAS No: 13530-68-2 Oligomers of chromic acid and dichromic acid EC No:
        not yet assigned CAS No: not yet assigned Carcinogenic (category 1B) 21
        March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 18. Sodium
        dichromate EC No: 234-190-3 CAS No: 7789-12-0 10588-01-9 Carcinogenic
        (category 1B) Mutagenic (category 1B) Toxic for reproduction (category
        1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 19.
        Potassium dichromate EC No: 231-906-6 CAS No: 7778-50-9 Carcinogenic
        (category 1B) Mutagenic (category 1B) Toxic for reproduction (category
        1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 20.
        Ammonium dichromate EC No: 232-143-1 CAS No: 7789-09-5 Carcinogenic
        (category 1B) Mutagenic (category 1B) Toxic for reproduction (category
        1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ 21.
        Potassium chromate EC No: 232-140-5 CAS No: 7789-00-6 Carcinogenic
        (category 1B) Mutagenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21
        September 2017 ►M43 (*2) ◄ 22. Sodium chromate EC No: 231-889-5 CAS No:
        7775-11-3 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for
        reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017
        ►M43 (*2) ◄
        [▼M28](./../../../legal-content/EN/AUTO/?uri=celex:32014R0895
        "32014R0895: INSERTED") 23. Formaldehyde, oligomeric reaction products
        with aniline (technical MDA) EC No: 500-036-1 CAS No: 25214-70-4
        Carcinogenic (category 1B) 22 February 2016 ►M43 (*1) ◄ 22 August 2017
        ►M43 (*2) ◄ — 24. Arsenic acid EC No: 231-901-9 CAS No: 7778-39-4
        Carcinogenic (category 1A) 22 February 2016 22 August 2017 — 25.
        Bis(2-methoxyethyl) ether (diglyme) EC No: 203-924-4 CAS No: 111-96-6
        Toxic for reproduction (category 1B) 22 February 2016 ►M43 (*1) ◄ 22
        August 2017 ►M43 (*2) ◄ — 26. 1,2-dichloroethane (EDC) EC No: 203-458-1
        CAS No: 107-06-2 Carcinogenic (category 1B) 22 May 2016 22 November 2017
        — 27. 2,2′-dichloro-4,4′-methylenedianiline (MOCA) EC No: 202-918-9 CAS
        No: 101-14-4 Carcinogenic (category 1B) 22 May 2016 ►M43 (*1) ◄ 22
        November 2017 ►M43 (*2) ◄ — 28. Dichromium tris(chromate) EC No:
        246-356-2 CAS No: 24613-89-6 Carcinogenic (category 1B) 22 July 2017
        ►M43 (*1) ◄ 22 January 2019 ►M43 (*2) ◄ — 29. Strontium chromate EC No:
        232-142-6 CAS No: 7789-06-2 Carcinogenic (category 1B) 22 July 2017 ►M43
        (*1) ◄ 22 January 2019 ►M43 (*2) ◄ — 30. Potassium
        hydroxyoctaoxodizincatedichromate EC
      - >-
        (c)


        the financial soundness of the proposed acquirer, in particular in
        relation to the type of business pursued and envisaged in the investment
        firm in which the acquisition is proposed;


        (d)


        whether the investment firm will be able to comply and continue to
        comply with the prudential requirements based on this Directive and,
        where applicable, other Directives, in particular Directives 2002/87/EC
        and 2013/36/EU, in particular, whether the group of which it will become
        a part has a structure that makes it possible to exercise effective
        supervision, effectively exchange information among the competent
        authorities and determine the allocation of responsibilities among the
        competent authorities;


        (e)
      - >-
        No administrative costs or fees related to the implementation of
        financing and investment operations under the EU guarantee shall be due
        to the implementing partner by the Commission unless the nature of the
        policy objectives targeted by the financial product to be implemented
        and the affordability for the targeted final recipients or the type of
        financing provided allow the implementing partner to duly justify to the
        Commission the need for an exception. The coverage of such costs by the
        Union budget shall be limited to the amount strictly required to
        implement the relevant financing and investment operations, and shall be
        provided only to the extent to which the costs are not covered by
        revenues received by the implementing partners from
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v1.5
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy@1
            value: 0.6777144829967202
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8972898325565337
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.9390643880545486
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.9691006387018816
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.6777144829967202
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.2990966108521779
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.18781287761090967
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09691006387018813
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.6777144829967202
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.8972898325565337
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.9390643880545486
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.9691006387018816
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.8364282304724784
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7924261355385132
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.7938274567816883
            name: Cosine Map@100

SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v1.5

This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-m-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-m-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
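
Read top to bottom, the three modules above run the BERT encoder, keep the hidden state of the first ([CLS]) token as the sentence vector, and scale that vector to unit length. The following is a minimal NumPy sketch of the last two steps only. The random array stands in for the encoder output, and the helper name `cls_pool_and_normalize` is an illustrative assumption, not part of the library.

```python
import numpy as np


def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Mimic Pooling(pooling_mode_cls_token=True) followed by Normalize().

    token_embeddings has shape (batch, seq_len, hidden).
    """
    cls_vectors = token_embeddings[:, 0, :]  # first token of every sequence
    norms = np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    return cls_vectors / np.clip(norms, 1e-12, None)  # unit-length rows


# Random stand-in for the BertModel output (2 sequences, 16 tokens, 768 dims).
rng = np.random.default_rng(0)
fake_encoder_output = rng.normal(size=(2, 16, 768))

sentence_embeddings = cls_pool_and_normalize(fake_encoder_output)
print(sentence_embeddings.shape)  # (2, 768)
```

Because every output row has unit length, the cosine similarity between two embeddings reduces to a plain dot product.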

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub ("sentence_transformers_model_id" is a placeholder for this model's repository ID)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'The document outlines various chemical substances classified as carcinogenic or toxic for reproduction, detailing their respective categories and regulatory dates. Specific compounds such as diarsenic trioxide, lead chromate, and chromium trioxide are highlighted, indicating their potential health risks and the timeline for their regulation.',
    (
        '57(f) – human health) (a) 21 August 2013 (*) (b) By way of derogation from point (a): 14 June 2023 for uses in mixtures containing DIBP at or above 0,1 % and below 0,3 % weight by weight. (a) 21 February 2015 (**) (b) By way of derogation from point (a): 14 December 2024 for uses in mixtures containing DIBP at or above 0,1 % and below 0,3 % weight by weight. - [▼M15](./../../../legal-content/EN/AUTO/?uri=celex:32012R0125 "32012R0125: INSERTED") 8. Diarsenic trioxide EC No: 215-481-4 CAS No: 1327-53-3 Carcinogenic (category 1A) 21 November 2013 21 May 2015 — 9. Diarsenic pentaoxide EC No: 215-116-9 CAS No: 1303-28-2 Carcinogenic (category 1A) 21 November 2013 21 May 2015 — 10. Lead chromate EC No: 231-846-0 CAS No: 7758-97-6 Carcinogenic (category 1B) Toxic for reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43 (*2) ◄ — 11. Lead sulfochromate yellow (C.I. Pigment Yellow 34) EC No: 215-693-7 CAS No: 1344-37-2 Carcinogenic (category 1B) Toxic for reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43 (*2) ◄ — 12. Lead chromate molybdate sulphate red (C.I. Pigment Red 104) EC No: 235-759-9 CAS No: 12656-85-8 Carcinogenic (category 1B) Toxic for reproduction (category 1A) 21 November 2013 ►M43 (*1) ◄ 21 May 2015 ►M43 (*2) ◄ 13. Tris (2-chloroethyl) phosphate (TCEP) EC No: 204-118-5 CAS No: 115-96-8 Toxic for reproduction (category 1B) 21 February 2014 21 August 2015 14. 2,4-Dinitrotoluene (2,4-DNT) EC No: 204-450-0 CAS No: 121-14-2 Carcinogenic (category 1B) 21 February 2014 ►M43 (*1) ◄ 21 August 2015 ►M43 (*2) ◄ [▼M22](./../../../legal-content/EN/AUTO/?uri=celex:32013R0348 "32013R0348: INSERTED") 15. Trichloroethylene EC No: 201-167-4 CAS No: 79-01-6 Carcinogenic (category 1B) 21 October 2014 ►M43 (*1) ◄ 21 April 2016 ►M43 (*2) ◄ — 16. Chromium trioxide EC No: 215-607-8 CAS No: 1333-82-0 Carcinogenic (category 1A) Mutagenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 17. '
        'Acids generated from chromium trioxide and their oligomers Group containing: Chromic acid EC No: 231-801-5 CAS No: 7738-94-5 Dichromic acid EC No: 236-881-5 CAS No: 13530-68-2 Oligomers of chromic acid and dichromic acid EC No: not yet assigned CAS No: not yet assigned Carcinogenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 18. Sodium dichromate EC No: 234-190-3 CAS No: 7789-12-0 10588-01-9 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 19. Potassium dichromate EC No: 231-906-6 CAS No: 7778-50-9 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ — 20. Ammonium dichromate EC No: 232-143-1 CAS No: 7789-09-5 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ 21. Potassium chromate EC No: 232-140-5 CAS No: 7789-00-6 Carcinogenic (category 1B) Mutagenic (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ 22. Sodium chromate EC No: 231-889-5 CAS No: 7775-11-3 Carcinogenic (category 1B) Mutagenic (category 1B) Toxic for reproduction (category 1B) 21 March 2016 ►M43 (*1) ◄ 21 September 2017 ►M43 (*2) ◄ [▼M28](./../../../legal-content/EN/AUTO/?uri=celex:32014R0895 "32014R0895: INSERTED") 23. Formaldehyde, oligomeric reaction products with aniline (technical MDA) EC No: 500-036-1 CAS No: 25214-70-4 Carcinogenic (category 1B) 22 February 2016 ►M43 (*1) ◄ 22 August 2017 ►M43 (*2) ◄ — 24. Arsenic acid EC No: 231-901-9 CAS No: 7778-39-4 Carcinogenic (category 1A) 22 February 2016 22 August 2017 — 25. Bis(2-methoxyethyl) ether (diglyme) EC No: 203-924-4 CAS No: 111-96-6 Toxic for reproduction (category 1B) 22 February 2016 ►M43 (*1) ◄ 22 August 2017 ►M43 (*2) ◄ — 26. '
        '1,2-dichloroethane (EDC) EC No: 203-458-1 CAS No: 107-06-2 Carcinogenic (category 1B) 22 May 2016 22 November 2017 — 27. 2,2′-dichloro-4,4′-methylenedianiline (MOCA) EC No: 202-918-9 CAS No: 101-14-4 Carcinogenic (category 1B) 22 May 2016 ►M43 (*1) ◄ 22 November 2017 ►M43 (*2) ◄ — 28. Dichromium tris(chromate) EC No: 246-356-2 CAS No: 24613-89-6 Carcinogenic (category 1B) 22 July 2017 ►M43 (*1) ◄ 22 January 2019 ►M43 (*2) ◄ — 29. Strontium chromate EC No: 232-142-6 CAS No: 7789-06-2 Carcinogenic (category 1B) 22 July 2017 ►M43 (*1) ◄ 22 January 2019 ►M43 (*2) ◄ — 30. Potassium hydroxyoctaoxodizincatedichromate EC'
    ),
    '(c)\n\nthe financial soundness of the proposed acquirer, in particular in relation to the type of business pursued and envisaged in the investment firm in which the acquisition is proposed;\n\n(d)\n\nwhether the investment firm will be able to comply and continue to comply with the prudential requirements based on this Directive and, where applicable, other Directives, in particular Directives 2002/87/EC and 2013/36/EU, in particular, whether the group of which it will become a part has a structure that makes it possible to exercise effective supervision, effectively exchange information among the competent authorities and determine the allocation of responsibilities among the competent authorities;\n\n(e)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
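
The model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128 and 64, so the leading components of each embedding are intended to remain useful on their own. The sketch below shows one way to shorten embeddings after encoding: keep the first `dim` components and rescale to unit length. A random array stands in for the output of `model.encode`, and `truncate_and_renormalize` is a hypothetical helper name.

```python
import numpy as np


def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and rescale to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)


# Random stand-in for the (3, 768) array returned by model.encode(sentences).
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(3, 768))

small = truncate_and_renormalize(embeddings, dim=256)
# Rows are unit length, so the dot product equals the cosine similarity.
similarities = small @ small.T
print(small.shape, similarities.shape)  # (3, 256) (3, 3)
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument on the `SentenceTransformer` constructor for the same purpose; check the version you have installed before relying on it.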

Evaluation

Metrics

Information Retrieval

| Metric              | Value  |
|:--------------------|:-------|
| cosine_accuracy@1   | 0.6777 |
| cosine_accuracy@3   | 0.8973 |
| cosine_accuracy@5   | 0.9391 |
| cosine_accuracy@10  | 0.9691 |
| cosine_precision@1  | 0.6777 |
| cosine_precision@3  | 0.2991 |
| cosine_precision@5  | 0.1878 |
| cosine_precision@10 | 0.0969 |
| cosine_recall@1     | 0.6777 |
| cosine_recall@3     | 0.8973 |
| cosine_recall@5     | 0.9391 |
| cosine_recall@10    | 0.9691 |
| cosine_ndcg@10      | 0.8364 |
| cosine_mrr@10       | 0.7924 |
| cosine_map@100      | 0.7938 |
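
In the figures above, accuracy@k equals recall@k and precision@k equals recall@k / k (for example, 0.8973 / 3 ≈ 0.2991). That pattern is what one would expect when every query has exactly one relevant passage. The card does not state this, so treat it as an inference. Under that assumption the metrics can be sketched in plain Python; the document IDs and function names below are hypothetical.

```python
def metrics_at_k(ranked_ids, relevant_id, k):
    """Per-query metrics at cutoff k, assuming exactly one relevant document."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    return {
        "accuracy": 1.0 if hit else 0.0,          # was the relevant doc retrieved?
        "precision": (1.0 if hit else 0.0) / k,   # at most 1 of k results is relevant
        "recall": 1.0 if hit else 0.0,            # one relevant doc, so same as accuracy
        "mrr": 1.0 / (top_k.index(relevant_id) + 1) if hit else 0.0,
    }


def mean_metrics(queries, k):
    """Average the per-query metrics over (ranked_ids, relevant_id) pairs."""
    per_query = [metrics_at_k(ranked, rel, k) for ranked, rel in queries]
    return {
        name: sum(m[name] for m in per_query) / len(per_query)
        for name in per_query[0]
    }


# Toy rankings: a hit at rank 1, a hit at rank 2, and a miss.
toy_queries = [
    (["d1", "d2", "d3"], "d1"),
    (["d5", "d4", "d6"], "d4"),
    (["d7", "d8", "d9"], "d0"),
]
result = mean_metrics(toy_queries, k=3)
print(result)
```

On the toy data, two of three queries are hits, so accuracy@3 and recall@3 are both 2/3, precision@3 is one third of that, and the mean reciprocal rank is (1 + 1/2 + 0) / 3 = 0.5.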

Training Details

Training Dataset

Unnamed Dataset

  • Size: 46,338 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
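    |         | sentence_0                                          | sentence_1                                         |
    |:--------|:----------------------------------------------------|:---------------------------------------------------|
    | type    | string                                              | string                                             |
    | details | min: 11 tokens, mean: 35.09 tokens, max: 214 tokens | min: 4 tokens, mean: 202.2 tokens, max: 512 tokens |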
  • Samples:
    • sentence_0: How do the Academies support education and training providers in maintaining and ensuring the quality of the training offered?
      sentence_1: to in Chapter IV of this Regulation; (b) promoting the voluntary use of the learning programmes, content and materials by education and training providers in the Member States; --- --- (c) offering support to the education and training providers that use the learning programmes, content and materials produced by the Academies to uphold the quality of the training offered and to develop mechanisms to ensure the quality of the training offered; --- --- (d) developing credentials, including, if appropriate, micro-credentials, for voluntary use by Member States and education and training providers on their territories, in order to facilitate the identification of skills and, where appropriate, the recognition of qualifications, to enhance the
    • sentence_0: The text provides a comprehensive list of various nickel compounds, including their chemical names and associated identifiers. It covers a range of nickel salts, oxides, and other derivatives, highlighting their diverse applications and chemical properties. The compounds mentioned include nickel arsenate, nickel oxalate, and nickel dichromate, among others, indicating their significance in industrial and chemical processes.
      sentence_1: [5] 235-688-3 [5] 12519-85-6 [5] Dinickel hexacyanoferrate 028-037-00-8 238-946-3 14874-78-3 Trinickel bis(arsenate); Nickel (II) arsenate 028-038-00-3 236-771-7 13477-70-8 Nickel oxalate; [1] 028-039-00-9 208-933-7 [1] 547-67-1 [1] Oxalic acid, nickel salt; [2] 243-867-2 [2] 20543-06-0 [2] Nickel telluride 028-040-00-4 235-260-6 12142-88-0 Trinickel tetrasulfide 028-041-00-X — 12137-12-1 Trinickel bis(arsenite) 028-042-00-5 — 74646-29-0 Cobalt nickel gray periclase; 028-043-00-0 C.I. Pigment Black 25; C.I. 77332; [1] 269-051-6 [1] 68186-89-0 [1] Cobalt nickel dioxide; [2] 261-346-8 [2] 58591-45-0 [2] Cobalt nickel oxide; [3] - [3] 12737-30-3 [3] Nickel tin trioxide; Nickel stannate 028-044-00-6 234-824-9 12035-38-0 Nickel triuranium decaoxide 028-045-00-1 239-876-6 15780-33-3 Nickel dithiocyanate 028-046-00-7 237-205-1 13689-92-4 Nickel dichromate 028-047-00-2 239-646-5 15586-38-6 Nickel (II) selenite 028-048-00-8 233-263-7 10101-96-9 Nickel selenide 028-049-00-3 215-216-2 1314-05-2 S...
    • sentence_0: What is the definition of 'Union airport managing body' and how does it relate to the management of centralized infrastructures for fuel distribution systems?
      sentence_1: (2)

      ‘Union airport managing body’ means, in respect of a Union airport, the ‘airport managing body’ as defined in Article 2, point (2), of Directive 2009/12/EC or, where the Member State concerned has reserved the management of the centralised infrastructures for fuel distribution systems for another body pursuant to Article 8(1) of Council Directive 96/67/EC ( 2 ), that other body;

      (3)

      ‘aircraft operator’ means a person that operated at least 500 commercial passenger air transport flights, or 52 commercial all-cargo air transport flights departing from Union airports in the previous reporting period or, where it is not possible for that person to be identified, the owner of the aircraft;

      (4)
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
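
The configuration above can be read as: compute the ranking loss once per embedding prefix length and add the results with the given weights. The following is a minimal pure-Python sketch of that idea, not the Sentence Transformers implementation. The function names are hypothetical, and the cosine similarity and scale of 20.0 are assumptions that this card does not state.

```python
import math

# Values taken from the loss configuration above.
MATRYOSHKA_DIMS = [768, 512, 256, 128, 64]
MATRYOSHKA_WEIGHTS = [1, 1, 1, 1, 1]
SCALE = 20.0  # assumption: similarity scale, not stated in this card


def _normalize(vec):
    """Return vec scaled to unit L2 norm (a copy of vec if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def in_batch_ranking_loss(anchors, positives, scale=SCALE):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    For anchor i, positives[i] is the target and every other positive in the
    batch acts as a negative.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


def matryoshka_loss(anchors, positives,
                    dims=MATRYOSHKA_DIMS, weights=MATRYOSHKA_WEIGHTS):
    """Weighted sum of the ranking loss over truncated embedding prefixes."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        loss += weight * in_batch_ranking_loss(
            [v[:dim] for v in anchors], [v[:dim] for v in positives]
        )
    return loss
```

Because every prefix contributes to the loss, the first 64 or 128 coordinates are trained to be usable on their own, which is what allows embeddings to be truncated at inference time.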
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • num_train_epochs: 4
  • multi_dataset_batch_sampler: round_robin
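
The batch size and epoch count above determine the step counts that appear in the training logs. A small sketch of that arithmetic follows, assuming a single training device (the card does not state the device count) and the dataset size of 46338 from the card's metadata tags.

```python
import math

DATASET_SIZE = 46338   # from the card's dataset_size tag
BATCH_SIZE = 4         # per_device_train_batch_size
NUM_EPOCHS = 4         # num_train_epochs
NUM_DEVICES = 1        # assumption: single device

# The last partial batch is kept (dataloader_drop_last is False), hence ceil.
steps_per_epoch = math.ceil(DATASET_SIZE / (BATCH_SIZE * NUM_DEVICES))
total_steps = steps_per_epoch * NUM_EPOCHS


def epoch_at(step):
    """Fractional epoch reached after the given number of optimizer steps."""
    return round(step / steps_per_epoch, 4)


print(steps_per_epoch)  # 11585
print(total_steps)      # 46340
print(epoch_at(500))    # 0.0432
```

These values agree with the training logs, where epoch 1.0 is reached at step 11585 and step 500 corresponds to epoch 0.0432.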

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
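
With lr_scheduler_type set to linear and no warmup, the learning rate decays from 5e-05 to zero in a straight line over training. The sketch below illustrates that schedule. The function name is hypothetical, and the total of 46340 steps is an assumption derived from 4 epochs of 11585 steps each, as reported in the training logs.

```python
LEARNING_RATE = 5e-05  # learning_rate
WARMUP_STEPS = 0       # warmup_steps (warmup_ratio is also 0.0)
TOTAL_STEPS = 46340    # assumption: 4 epochs x 11585 steps per epoch


def linear_lr(step, base_lr=LEARNING_RATE,
              warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Learning rate at a given step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup (unused here: warmup is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 11585, 23170, 34755, 46340):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate at the end of epoch 2 (step 23170) is half of its starting value.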

Training Logs

Epoch Step Training Loss cosine_ndcg@10
0.0432 500 0.5169 0.7365
0.0863 1000 0.1341 0.7914
0.1295 1500 0.0784 0.7992
0.1726 2000 0.0782 0.8058
0.2158 2500 0.0596 0.8012
0.2590 3000 0.057 0.8079
0.3021 3500 0.0785 0.8086
0.3453 4000 0.0423 0.8010
0.3884 4500 0.0586 0.8075
0.4316 5000 0.0508 0.8008
0.4748 5500 0.0764 0.7934
0.5179 6000 0.0583 0.8068
0.5611 6500 0.0663 0.8008
0.6042 7000 0.0344 0.8083
0.6474 7500 0.0506 0.8104
0.6905 8000 0.0478 0.8089
0.7337 8500 0.0509 0.8034
0.7769 9000 0.0426 0.8114
0.8200 9500 0.0603 0.8097
0.8632 10000 0.036 0.8142
0.9063 10500 0.0581 0.8081
0.9495 11000 0.0351 0.8018
0.9927 11500 0.0358 0.8082
1.0 11585 - 0.8076
1.0358 12000 0.0398 0.8093
1.0790 12500 0.0197 0.8023
1.1221 13000 0.0376 0.8137
1.1653 13500 0.0287 0.8136
1.2085 14000 0.0269 0.8146
1.2516 14500 0.0089 0.8161
1.2948 15000 0.0149 0.8126
1.3379 15500 0.0457 0.8138
1.3811 16000 0.0119 0.8171
1.4243 16500 0.0107 0.8105
1.4674 17000 0.015 0.8171
1.5106 17500 0.0208 0.8153
1.5537 18000 0.0168 0.8111
1.5969 18500 0.0114 0.8171
1.6401 19000 0.0188 0.8239
1.6832 19500 0.01 0.8182
1.7264 20000 0.0158 0.8125
1.7695 20500 0.0155 0.8201
1.8127 21000 0.0276 0.8182
1.8558 21500 0.0245 0.8123
1.8990 22000 0.0135 0.8223
1.9422 22500 0.0334 0.8182
1.9853 23000 0.0111 0.8200
2.0 23170 - 0.8221
2.0285 23500 0.0139 0.8225
2.0716 24000 0.0113 0.8237
2.1148 24500 0.0072 0.8223
2.1580 25000 0.0138 0.8218
2.2011 25500 0.0071 0.8200
2.2443 26000 0.0091 0.8240
2.2874 26500 0.013 0.8224
2.3306 27000 0.008 0.8248
2.3738 27500 0.0084 0.8203
2.4169 28000 0.0147 0.8255
2.4601 28500 0.0067 0.8268
2.5032 29000 0.0028 0.8219
2.5464 29500 0.0124 0.8234
2.5896 30000 0.0051 0.8237
2.6327 30500 0.0151 0.8256
2.6759 31000 0.0051 0.8207
2.7190 31500 0.0086 0.8250
2.7622 32000 0.0152 0.8265
2.8054 32500 0.0085 0.8297
2.8485 33000 0.0097 0.8316
2.8917 33500 0.0269 0.8284
2.9348 34000 0.008 0.8305
2.9780 34500 0.0146 0.8309
3.0 34755 - 0.8301
3.0211 35000 0.0218 0.8326
3.0643 35500 0.0152 0.8301
3.1075 36000 0.0072 0.8290
3.1506 36500 0.0077 0.8270
3.1938 37000 0.0155 0.8299
3.2369 37500 0.0069 0.8328
3.2801 38000 0.0103 0.8364
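
The cosine_ndcg@10 column is NDCG@10 computed on rankings produced by cosine similarity. As a rough illustration of the metric itself, here is a minimal sketch assuming binary relevance labels. The function names are hypothetical and this is not the evaluator code used to produce the numbers above.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (ranks start at 1)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(hits, num_relevant, k=10):
    """NDCG@k for one query.

    hits: 0/1 flags for the retrieved documents, in rank order.
    num_relevant: total number of relevant documents for the query.
    """
    ideal = dcg_at_k([1] * num_relevant, k)
    return dcg_at_k(hits, k) / ideal if ideal > 0 else 0.0


def mean_ndcg_at_k(per_query, k=10):
    """Average NDCG@k over (hits, num_relevant) pairs, one pair per query."""
    return sum(ndcg_at_k(h, n, k) for h, n in per_query) / len(per_query)
```

With one relevant passage per query, the score is 1.0 when that passage is ranked first, about 0.63 when it is ranked second, and 0.0 when it falls outside the top 10.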

Framework Versions

  • Python: 3.10.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.1
  • PyTorch: 2.4.0+cu121
  • Accelerate: 1.4.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0
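
To compare a local environment against the versions listed above, something like the following sketch may help. It uses only the standard library. The PyPI distribution names in the dictionary are assumptions, since the card lists display names only, and the helper names are hypothetical.

```python
from importlib import metadata

# Versions listed above, keyed by assumed PyPI distribution name.
EXPECTED = {
    "sentence-transformers": "3.4.1",
    "transformers": "4.48.1",
    "torch": "2.4.0+cu121",
    "accelerate": "1.4.0",
    "datasets": "3.3.2",
    "tokenizers": "0.21.0",
}


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(expected=EXPECTED):
    """Map each package to a tuple (expected, installed, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in version_report().items():
        print(f"{name}: expected {wanted}, installed {found}, match={ok}")
```

A mismatch does not necessarily mean the model will fail to load. The report only shows where the environment differs from the one used for training.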

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}