---
license: mit
datasets:
- damlab/uniprot
metrics:
- accuracy
widget:
- text: 'involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'
  example_title: 'Function'
---

# GO-Language model

## Table of Contents

- [Summary](#summary)
- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [How to Use](#how-to-use)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
  - [Preprocessing](#preprocessing)
  - [Training](#training)
- [Evaluation Results](#evaluation-results)
- [BibTeX Entry and Citation Info](#bibtex-entry-and-citation-info)

## Summary

This model was built to encode the Gene Ontology (GO) definition of a protein as a vector representation.
It was trained on a collection of Gene Ontology terms from model organisms.
Each function was sorted by ID number and combined with its annotation description (e.g. `is_a`, `enables`, `located_in`).
The model is tokenized such that each annotation description and GO term is its own token.
This model is intended to be used as a translation model between PROT-BERT and GO-Language.
That type of translation model will be useful for predicting the function of novel genes.

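To illustrate the tokenization scheme, here is a minimal sketch of inspecting the tokenizer; the exact vocabulary is whatever ships with the published checkpoint.

```python
from transformers import AutoTokenizer

# Load the tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")

# Each annotation description (e.g. involved_in) and each GO term
# is expected to map to a single token in the vocabulary.
tokens = tokenizer.tokenize(
    "involved_in GO:0006468 involved_in GO:0007165 located_in GO:0042470"
)
print(tokens)
```
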
## Model Description

This model was trained on the `go` field of the damlab/uniprot dataset using 256-token chunks and a 15% mask rate.

## Intended Uses & Limitations

This model is a useful encapsulation of Gene Ontology functions.
It allows both exploration of gene-level similarities and comparisons between functional terms.

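As a sketch of how such comparisons could be made, the example below mean-pools hidden states from a `feature-extraction` pipeline and computes a cosine similarity between two GO-term strings. The mean-pooling choice is an assumption for illustration, not something specified by this card.

```python
import numpy as np
from transformers import pipeline

extractor = pipeline("feature-extraction", model="damlab/GO-language")

def embed(go_string):
    # Mean-pool the per-token hidden states into a single vector
    # (assumed pooling choice, for illustration only).
    features = np.array(extractor(go_string)[0])
    return features.mean(axis=0)

a = embed("involved_in GO:0006468 involved_in GO:0007165")
b = embed("located_in GO:0042470 involved_in GO:0070372")

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cosine:.3f}")
```
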
## How to use

As this is a BERT-style masked language model, it can be used to determine the most likely token at a masked position.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="damlab/GO-language")

unmasker("involved_in [MASK] involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372")

[{'score': 0.1040298342704773,
  'token': 103,
  'token_str': 'GO:0002250',
  'sequence': 'involved_in GO:0002250 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.018045395612716675,
  'token': 21,
  'token_str': 'GO:0005576',
  'sequence': 'involved_in GO:0005576 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.015035462565720081,
  'token': 50,
  'token_str': 'GO:0000139',
  'sequence': 'involved_in GO:0000139 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01181247178465128,
  'token': 37,
  'token_str': 'GO:0007165',
  'sequence': 'involved_in GO:0007165 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'},
 {'score': 0.01000668853521347,
  'token': 14,
  'token_str': 'GO:0005737',
  'sequence': 'involved_in GO:0005737 involved_in GO:0007165 located_in GO:0042470 involved_in GO:0070372'}]
```

## Training Data

The model was trained from a randomly initialized model using the [damlab/uniprot](https://huggingface.co/datasets/damlab/uniprot) dataset.
The Gene Ontology functions were sorted by ID number and paired with their annotating terms.

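A minimal sketch of pulling the `go` field out of that dataset with the `datasets` library; the `train` split name is an assumption, so check it against the dataset viewer.

```python
from datasets import load_dataset

# Load the UniProt-derived dataset used for training
dataset = load_dataset("damlab/uniprot")

# Inspect the `go` field of the first record; it should contain the sorted
# GO terms paired with their annotation descriptions.
print(dataset["train"][0]["go"])
```
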
## Training Procedure

### Preprocessing

All strings were concatenated and chunked into 256-token chunks for training. A random 20% of chunks were held out for validation.

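The card does not give the preprocessing code; the sketch below shows one plausible concatenate-and-chunk implementation using the standard `datasets` batched-map pattern, continuing from the `dataset` loaded above. The function and variable names are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
CHUNK_SIZE = 256  # chunk length stated in the card

def group_into_chunks(examples):
    # Tokenize each `go` string, concatenate the ids, and split the running
    # stream into fixed 256-token chunks (any remainder is dropped).
    tokenized = tokenizer(examples["go"])["input_ids"]
    stream = [tok for ids in tokenized for tok in ids]
    usable = (len(stream) // CHUNK_SIZE) * CHUNK_SIZE
    chunks = [stream[i:i + CHUNK_SIZE] for i in range(0, usable, CHUNK_SIZE)]
    return {"input_ids": chunks}

chunked = dataset.map(
    group_into_chunks,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Hold out a random 20% of chunks for validation, as described above
splits = chunked["train"].train_test_split(test_size=0.2)
train_chunks, validation_chunks = splits["train"], splits["test"]
```
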
### Training

Training was performed with the HuggingFace training module using the MaskedLM data loader with a 15% masking rate. The learning rate was set at E-5 with 50K warm-up steps and a cosine_with_restarts learning-rate schedule, and training continued until 3 consecutive epochs did not improve the loss on the held-out dataset.

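This setup maps onto the standard HuggingFace `Trainer` masked-LM loop. The sketch below is one plausible configuration: the masking rate, warm-up steps, scheduler, and early-stopping patience follow the card, while the exact learning-rate value, epoch cap, output directory, and dataset variables (taken from the preprocessing sketch above) are assumptions.

```python
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("damlab/GO-language")
# The card states training started from a randomly initialized model,
# so build the architecture from its config rather than loading weights.
config = AutoConfig.from_pretrained("damlab/GO-language")
model = AutoModelForMaskedLM.from_config(config)

# MaskedLM data collator with the 15% masking rate stated in the card
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="go-language-mlm",             # illustrative output path
    learning_rate=1e-5,                       # card says "E-5"; 1e-5 assumed
    warmup_steps=50_000,                      # 50K warm-up steps
    lr_scheduler_type="cosine_with_restarts",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    num_train_epochs=100,                     # upper bound; early stopping ends training
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_chunks,               # 256-token chunks from the preprocessing sketch
    eval_dataset=validation_chunks,           # the 20% held-out chunks
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
# trainer.train()
```
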
## BibTeX Entry and Citation Info

[More Information Needed]
|