---
license: cc-by-nc-sa-4.0
widget:
- text: AGTCCAGTGGACGACCAGCCACGGCTCCGGTCTGTAGAACCATCGCGGAAACGGCTCGCAAAACTCTAAACAGCGCAAACGATGCGCGCGCCGAAGCAACCCGGCTCTACTTATAAAAACGTCCAACGGTGAGCACCGAGCAGCTACTACTCGTACTCCCCCCACCGATC
tags:
- DNA
- biology
- genomics
---
# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes. All models have a comparable size of 90 MB to 150 MB; sequences are tokenized with a BPE tokenizer whose vocabulary contains 8,000 tokens.
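
For a quick look at the tokenization, the sketch below loads the tokenizer of the fine-tuned checkpoint described later in this card and splits a short DNA string into BPE pieces; the checkpoint name and printed values are illustrative.

```python
from transformers import AutoTokenizer

# Load the DNA BPE tokenizer shipped with the checkpoint used later in this card
tokenizer = AutoTokenizer.from_pretrained(
    'zhangtaolab/plant-dnagpt-promoter_strength_protoplast',
    trust_remote_code=True)

print(tokenizer.vocab_size)                        # 8000 tokens in the vocabulary
print(tokenizer.tokenize('AGTCCAGTGGACGACCAGCC'))  # DNA sequence split into BPE pieces
```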

**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA large language models in plant genomes]()

### Architecture

The model is built on the OpenAI GPT-2 architecture with a tokenizer modified for DNA sequences.

This model is fine-tuned to predict promoter strength in the maize protoplast system.

### How to use

Install the runtime library first:
```bash
pip install transformers
```
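
Note that `transformers` also needs a deep-learning backend for inference; if PyTorch is not already present in your environment, install it as well (for example, `pip install torch`).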

Here is a simple example of inference:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'plant-dnagpt-promoter_strength_protoplast'
# load the fine-tuned model and its DNA tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# inference
sequences = ['TACTCTAATCGTATCAGCTGCACTTGCGTACAGGCTACCGGCGTCCTCAGCCACGTAAGAAAAGGCCCAATAAAGGCCCAACTACAACCAGCGGATATATATACTGGAGCCTGGCGAGATCACCCTAACCCCTCACACTCCCATCCAGCCGCCACCAGGTGCAGAGTGTT',
             'ATTTCAAAACTAGTTTTCTATAAACGAAAACTTATATTTATTCCGCTTGTTCCGTTTGATCTGCTGATTCGACACCGTTTTAACGTATTTTAAGTAAGTATCAGAAATATTAATGTGAAGATAAAAGAAAATAGAGTAAATGTAAAGGAAAATGCATAAGATTTTGTTGA']
# function_to_apply="none" keeps the raw model output instead of applying softmax
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, function_to_apply="none")
results = pipe(sequences)
print(results)
```
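
Because `function_to_apply="none"` is passed, the pipeline returns the model's raw regression output rather than a softmax probability, so the `score` field of each result can be read directly as the predicted promoter strength.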

### Training data

We used GPT2ForSequenceClassification to fine-tune the model.
The detailed training procedure can be found in our manuscript.
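
As a rough illustration of that setup, the sketch below fine-tunes a GPT-2 sequence-classification head for regression; the base checkpoint name, toy data, and hyperparameters are assumptions for demonstration, not the settings from the manuscript.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, GPT2ForSequenceClassification,
                          Trainer, TrainingArguments)

base = 'zhangtaolab/plant-dnagpt'  # assumed name of the pre-trained base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
# num_labels=1 gives a single regression head; with float labels the model
# automatically uses an MSE loss ("regression" problem type)
model = GPT2ForSequenceClassification.from_pretrained(base, num_labels=1,
                                                      trust_remote_code=True)
if model.config.pad_token_id is None:  # GPT-2 style models often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

# Toy stand-in data: real training pairs DNA sequences with measured strengths
data = Dataset.from_dict({'sequence': ['ACGT' * 42, 'TTGACA' * 28],
                          'label': [0.8, 1.5]})
data = data.map(lambda x: tokenizer(x['sequence'], truncation=True,
                                    padding='max_length', max_length=128),
                batched=True)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir='ft_out', num_train_epochs=1,
                                         per_device_train_batch_size=2),
                  train_dataset=data)
trainer.train()
```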

#### Hardware

The model was trained on an NVIDIA GTX 1080 Ti GPU (11 GB).