ProtBERT-Unmasking

This model is a fine-tuned version of ProtBERT optimized for unmasking protein sequences: it predicts masked amino acids from the surrounding sequence context.

Model Description

  • Base Model: ProtBERT
  • Parameters: ~420M (F32, Safetensors)
  • Task: Protein Sequence Unmasking
  • Training: Fine-tuned on masked protein sequences
  • Use Case: Predicting missing or masked amino acids in protein sequences
  • Optimal Use: Best performance on E. coli sequences with known amino acids K, C, Y, H, S, M

For detailed information about the training methodology and approach, please refer to our paper: https://arxiv.org/abs/2408.00892

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("your-username/protbert-sequence-unmasking")
tokenizer = AutoTokenizer.from_pretrained("your-username/protbert-sequence-unmasking")

# Example: E. coli sequence with masked positions
# ProtBERT expects amino acids separated by spaces; unknown residues use [MASK]
sequence = "M A L N [MASK] K F G P [MASK] L V R K"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits  # shape: (batch, sequence_length, vocab_size)
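
To turn the raw logits into concrete predictions, take the highest-scoring token at each masked position. A minimal continuation of the snippet above (standard Transformers/PyTorch usage, not specific to this model):

# Locate the [MASK] tokens in the encoded input
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Decode the highest-scoring amino acid at each masked position
for pos in mask_positions:
    token_id = predictions[0, pos].argmax(dim=-1).item()
    print(f"Position {pos.item()}: {tokenizer.decode([token_id])}")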

Inference API

The model is optimized for:

  • Organism: E. coli
  • Known Amino Acids: K, C, Y, H, S, M
  • Task: Predicting unknown amino acids in a sequence (see the formatting sketch below)

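Masked inputs must follow ProtBERT's spaced format. A minimal, hypothetical helper (assuming unknown residues are written as "X" in the raw sequence; the function name is illustrative, not part of this model's API):

# Hypothetical helper: convert "X" placeholders into the spaced [MASK] format
def prepare_masked_sequence(seq: str) -> str:
    return " ".join("[MASK]" if aa == "X" else aa for aa in seq)

print(prepare_masked_sequence("KXYHSX"))  # -> "K [MASK] Y H S [MASK]"
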
Example API usage:

from transformers import pipeline

unmasker = pipeline('fill-mask', model='your-username/protbert-sequence-unmasking')
sequence = "K [MASK] Y H S [MASK]"  # spaced ProtBERT format; known amino acids K, Y, H, S
results = unmasker(sequence)

# With multiple [MASK] tokens, the pipeline returns one list of candidates per mask
for mask_candidates in results:
    top = mask_candidates[0]  # candidates are sorted by score, highest first
    print(f"Predicted amino acid: {top['token_str']}, Score: {top['score']:.3f}")
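
To inspect more than the single best candidate per mask, pass the fill-mask pipeline's standard top_k argument at call time:

results = unmasker(sequence, top_k=3)  # three highest-scoring candidates per [MASK]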

Limitations and Biases

  • This model is specifically designed for protein sequence unmasking in E. coli
  • Optimal performance is achieved when working with sequences containing known amino acids K, C, Y, H, S, M
  • The model may not perform optimally for:
    • Sequences from other organisms
    • Sequences without the specified known amino acids
    • Other protein-related tasks

Training Details

The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper: https://arxiv.org/abs/2408.00892
