How to use DataWise/gte-large-en-v1.5_SEC_docs_ft with sentence-transformers:
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("DataWise/gte-large-en-v1.5_SEC_docs_ft", trust_remote_code=True)

    sentences = [
        "That is a happy person",
        "That is a happy dog",
        "That is a very happy person",
        "Today is a sunny day"
    ]
    embeddings = model.encode(sentences)

    similarities = model.similarity(embeddings, embeddings)
    print(similarities.shape)
    # [4, 4]
This is a fine-tuned version of Alibaba-NLP/gte-large-en-v1.5, optimized for retrieval over SEC financial documents. It accepts text input with a context length of up to 8192 tokens and maps it to a 1024-dimensional dense vector space, enabling semantic textual similarity, semantic search, and clustering. Fine-tuning was conducted on a dataset of SEC documents to improve domain-specific retrieval accuracy.
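As an illustration of the semantic search use case, the following minimal sketch ranks passages against a query by cosine similarity. It is not part of the original model card: the helper names `cosine_similarity` and `rank_passages` are hypothetical, and toy 3-dimensional vectors stand in for the model's 1024-dimensional embeddings so that the example runs without downloading the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_embedding, passage_embeddings, top_k=3):
    """Return up to top_k (passage_index, score) pairs, highest score first."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(passage_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (an assumption for illustration); real embeddings would come
# from model.encode(...) and have 1024 dimensions.
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # partially aligned with the query
]
ranking = rank_passages(query, passages, top_k=2)
print(ranking)
```

With the real model, one would presumably pass `model.encode(query_text).tolist()` and `model.encode(passage_texts).tolist()` in place of the toy vectors. For larger collections, the library's own `model.similarity` is likely the better choice, since it scores whole batches at once.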
The dataset consists of 9400 query-context pairs for training, 1770 pairs for validation, and 1190 pairs for testing, generated from a total of 6 PDFs using gpt-4o-mini. Fine-tuning ran for 5 epochs (chosen after experimenting with both fewer and more epochs) using LlamaIndex’s pipeline, optimizing the model for retrieval. The dataset can be found here. For more details about the original model, please refer to its model card.
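To put the split sizes in proportion, the short sketch below computes each split's share of the total from the counts stated above. The variable names are illustrative only; nothing here is taken from the actual fine-tuning code.

```python
# Pair counts as stated in the model card.
splits = {"train": 9400, "validation": 1770, "test": 1190}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} pairs ({shares[name]:.1%})")
print(f"total: {total} pairs")
```

This works out to 12360 pairs in total, split roughly 76% / 14% / 10% across training, validation, and testing.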