VCTI-RoBERTa-Fiber

Model Summary

This model is a domain-adapted RoBERTa-base model fine-tuned using Masked Language Modeling (MLM) on optical communication and photonics data. It is optimized for generating domain-specific embeddings that capture the nuances and technical jargon of the optical domain.

⚠️ Note: This is the basic version of our ongoing development.
A significantly improved version trained on much larger and more diverse optical corpora will be released soon!


Training Data

The model was trained on:

  • 1000+ Optical Wikipedia Articles
  • 120+ Optical Communication & Photonics Textbooks
  • 500+ ITU-T and IEEE Papers
  • 1000+ Web Articles

The training corpus includes content related to:

  • Optical fibers
  • Photonic devices
  • Multiplexing (WDM, TDM, OTN)
  • Optical amplifiers
  • Modulation techniques
  • Communication networks
  • Laser systems, etc.

⚙️ Training Details

| Parameter     | Value | Description                            |
|---------------|-------|----------------------------------------|
| batch_size    | 64    | Number of samples per training batch   |
| epochs        | 15    | Number of training epochs              |
| patience      | 6     | Early stopping patience                |
| learning_rate | 5e-5  | Learning rate for the AdamW optimizer  |
| weight_decay  | 0.01  | Weight decay for regularization        |
| objective     | MLM   | Masked Language Modeling               |

The training was performed using the transformers library by Hugging Face.
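The MLM objective corrupts a fraction of input tokens and trains the model to recover the originals. As a minimal sketch of that corruption step (mirroring the standard 15% / 80-10-10 scheme that `transformers`' `DataCollatorForLanguageModeling` applies; the function name and token ids below are illustrative, not taken from the training code):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Standard MLM corruption: select ~15% of tokens; of those,
    ~80% become the mask token, ~10% a random token, ~10% stay unchanged.
    Labels are -100 (ignored by the loss) everywhere except selected positions."""
    labels = input_ids.clone()
    prob_matrix = torch.full(labels.shape, mlm_prob)
    masked = torch.bernoulli(prob_matrix).bool()
    labels[~masked] = -100  # loss is computed only on masked positions

    corrupted = input_ids.clone()
    # 80% of selected positions -> mask token
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    corrupted[replace] = mask_token_id
    # 10% -> a random vocabulary token (half of the remaining 20%)
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    corrupted[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]
    # remaining 10% keep the original token
    return corrupted, labels
```

In practice `DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)` handles this inside the `Trainer` loop; the sketch only shows what that collator produces.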


Core Use Case: Domain-Specific Embeddings

The fine-tuned model is particularly effective at generating context-aware embeddings for the optical domain. This makes it highly suitable for tasks such as:

  • Semantic Search across technical documents
  • Retrieval-Augmented Generation (RAG) for Q&A systems
  • Topic Modeling and document clustering
  • Similarity Matching between questions, answers, or papers

How to Use

Load the model

```python
import torch
from transformers import RobertaTokenizerFast, RobertaModel

tokenizer = RobertaTokenizerFast.from_pretrained("quantum-leap-vcti/VCTI-RoBERTa-Fiber")
model = RobertaModel.from_pretrained("quantum-leap-vcti/VCTI-RoBERTa-Fiber")
model.eval()

text = "Wavelength-division multiplexing increases the capacity of optical fibers."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
# Mean-pool the token embeddings into a single sentence embedding
embedding = outputs.last_hidden_state.mean(dim=1)
```
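Note that a plain `.mean(dim=1)` also averages over padding tokens when you embed a batch of different-length texts. A small sketch of mask-aware mean pooling plus cosine similarity, useful for the semantic-search and similarity-matching use cases above (the helper names are illustrative):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Mean-pool token embeddings while ignoring padding positions.
    last_hidden_state: (batch, seq, hidden); attention_mask: (batch, seq)."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1), avoid div-by-zero
    return summed / counts

def cosine_sim(a, b):
    """Cosine similarity between embeddings, computed on L2-normalized vectors."""
    a = torch.nn.functional.normalize(a, dim=-1)
    b = torch.nn.functional.normalize(b, dim=-1)
    return (a * b).sum(dim=-1)
```

With the model loaded as above, pass `outputs.last_hidden_state` and `inputs["attention_mask"]` to `mean_pool`, then rank candidate documents by `cosine_sim` against a query embedding.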

MIT License

Copyright (c) 2025 Velankani Communications Technologies Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files, to deal in the Software without restriction.
