TCRT5 model (pre-trained)

Model description

This is the pre-trained checkpoint from which TCRT5 was finetuned. The finetuned model is a seq2seq model designed for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). It is a transformers model built on the T5 architecture and implemented with the associated HuggingFace abstraction. It is released alongside the paper cited in the BibTeX entry below.

Intended uses & limitations

This model is released for seq2seq finetuning on custom datasets. It may be useful for sequence generation in either direction: pMHC -> TCR (TCR design) or TCR -> pMHC (TCR de-orphanization). It can also be finetuned for classification or regression-style tasks involving sequence representations of the TCR (CDR3β) and pMHC (peptide and pseudo-sequence), though it has not been tested in this capacity.

How to use

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the pre-trained tokenizer and model from the HuggingFace Hub
tokenizer = T5Tokenizer.from_pretrained("dkarthikeyan1/tcrt5_pre_tcrdb")
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_pre_tcrdb")

# Input layout: [PMHC]<peptide>[SEP]<MHC pseudo-sequence>[EOS]
pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors="pt")

# Encoder hidden states can be useful for classification/regression downstream tasks
enc_outputs = tcrt5.encoder(**encoded_pmhc)
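
The input string above follows the layout [PMHC]<peptide>[SEP]<MHC pseudo-sequence>[EOS]. A minimal sketch of a helper for building such strings is given here. The name format_pmhc and its validation rules (20 canonical amino acids, a 34-residue NetMHCpan-style pseudo-sequence) are assumptions made for illustration and are not part of the released code.

```python
# Hypothetical helper (not part of the TCRT5 release): builds a pMHC input
# string in the same layout as the example above.
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")
PSEUDO_SEQ_LEN = 34  # assumed length of a NetMHCpan MHC pseudo-sequence


def format_pmhc(peptide: str, pseudo_sequence: str) -> str:
    """Return '[PMHC]<peptide>[SEP]<pseudo_sequence>[EOS]' after basic checks."""
    peptide = peptide.strip().upper()
    pseudo_sequence = pseudo_sequence.strip().upper()

    # Reject empty strings and anything outside the 20 canonical amino acids.
    for name, seq in (("peptide", peptide), ("pseudo_sequence", pseudo_sequence)):
        if not seq or set(seq) - CANONICAL_AA:
            raise ValueError(f"{name} is not a valid amino acid string: {seq!r}")

    if len(pseudo_sequence) != PSEUDO_SEQ_LEN:
        raise ValueError(
            f"pseudo_sequence should have {PSEUDO_SEQ_LEN} residues, "
            f"got {len(pseudo_sequence)}"
        )

    return f"[PMHC]{peptide}[SEP]{pseudo_sequence}[EOS]"


pmhc = format_pmhc("KLGGALQAK", "YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY")
print(pmhc)
```

The resulting string can be passed to the tokenizer exactly as in the snippet above.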

Limitations

As it stands, the model was jointly pre-trained on peptide-pseudosequence and CDR3β sequences. As such, sequences consisting of just the peptide, CDR3α, or other parts of the TCR would be out-of-distribution (OOD).

Training data

TCRT5 was pre-trained on masked span reconstruction using a dataset built around ~14M CDR3β sequences from TCRdb as well as ~780k peptide-pseudosequence pairs taken from IEDB. To correct for the data imbalance, upsampling was used to bring the TCR:pMHC sequence ratio to 70:30.
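
As a rough worked example of what the 70:30 ratio implies, the sketch below uses the approximate counts quoted above. It assumes that only the pMHC set is upsampled and that the TCR set is left unchanged; the card does not spell out how the upsampling was carried out, so the numbers are indicative only.

```python
# Approximate counts quoted in the text above.
n_tcr = 14_000_000   # CDR3 beta sequences from TCRdb
n_pmhc = 780_000     # peptide-pseudosequence pairs from IEDB

# Assumption: only the pMHC set is upsampled; the TCR set is kept as is.
target_tcr_share, target_pmhc_share = 70, 30
n_pmhc_target = n_tcr * target_pmhc_share // target_tcr_share
upsampling_factor = n_pmhc_target / n_pmhc

print(f"raw TCR share of the data: {n_tcr / (n_tcr + n_pmhc):.1%}")
print(f"pMHC examples needed for 70:30: {n_pmhc_target:,}")
print(f"implied pMHC upsampling factor: {upsampling_factor:.1f}x")
```

Under these assumptions the pMHC examples would need to be repeated roughly 7.7 times to reach about 6M.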

Training procedure

Preprocessing

All amino acid sequences and V/J gene names were standardized using the tidytcells package. MHC allele information was standardized using mhcgnomes before being mapped to the MHC pseudo-sequence as defined in NetMHCpan.

Pre-training

TCRT5 was pre-trained with masked language modeling (MLM), specifically span reconstruction similar to the original training objective in the T5 paper. For a given sequence, 15% of the tokens are masked using contiguous spans of random length between 1 and 3. Each span is replaced by one of the sentinel tokens introduced in the T5 paper. The entire masked sequence is then passed into the model, which is trained to reconstruct a concatenated sequence consisting of the sentinel tokens, each followed by the tokens it masks. This forces the model to learn richer k-mer dependencies within the masked sequences.

The masking step masks a fraction 'mlm_probability' of the tokens, grouped into spans of at most 'max_span_length' tokens, according to the following algorithm:
        * Randomly generate span lengths that add up to round(mlm_probability*seq_len) (ignoring pad tokens) for each sequence.
        * Ensure that the spans are not directly adjacent, so that max_span_length is respected.
        * Once the span masks are generated according to T5 standards, mask the inputs and generate the targets.
    
    
    Example Input:
    
    CASSLGQGYEQYF
    
    Masked Input:
    
    CASSLG[X]GY[Y]F
    
    Target:
    
    [X]Q[Y]EQY[Z].
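
The algorithm above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the training code: the function names sample_spans and apply_spans are hypothetical, each residue is assumed to be one token, and the way spans are placed (random gaps with at least one unmasked token between spans) is one possible reading of the non-adjacency rule.

```python
import random


def sample_spans(seq_len, mlm_probability=0.15, max_span_length=3, rng=None):
    """Sample non-adjacent (start, end) spans covering round(p * seq_len) tokens."""
    rng = rng or random.Random()
    num_to_mask = round(mlm_probability * seq_len)

    # Draw span lengths in [1, max_span_length] that sum to num_to_mask.
    lengths, remaining = [], num_to_mask
    while remaining > 0:
        length = rng.randint(1, min(max_span_length, remaining))
        lengths.append(length)
        remaining -= length

    # Inner gaps start at 1 so that no two spans touch; outer gaps may be 0.
    k = len(lengths)
    free = seq_len - num_to_mask - (k - 1)
    if free < 0:
        raise ValueError("sequence too short for the requested masking")
    gaps = [0] + [1] * (k - 1) + [0]
    for _ in range(free):
        gaps[rng.randrange(k + 1)] += 1

    spans, pos = [], 0
    for gap, length in zip(gaps, lengths):
        pos += gap
        spans.append((pos, pos + length))
        pos += length
    return spans


def apply_spans(sequence, spans, sentinels=None):
    """Build the masked input and the target from sorted, non-overlapping spans."""
    if sentinels is None:
        # T5-style sentinel names; one extra sentinel closes the target.
        sentinels = [f"<extra_id_{i}>" for i in range(len(spans) + 1)]
    masked, target, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        masked += [sequence[cursor:start], sentinels[i]]
        target += [sentinels[i], sequence[start:end]]
        cursor = end
    masked.append(sequence[cursor:])
    target.append(sentinels[len(spans)])
    return "".join(masked), "".join(target)


# Reproduce the example above with its spans fixed by hand.
masked, target = apply_spans(
    "CASSLGQGYEQYF", [(6, 7), (9, 12)], sentinels=["[X]", "[Y]", "[Z]"]
)
print(masked)  # CASSLG[X]GY[Y]F
print(target)  # [X]Q[Y]EQY[Z]
```

In practice the spans would come from sample_spans(len(sequence)) rather than being fixed by hand, and the masking would operate on token ids instead of characters.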

Hyperparameters:

Hparam   #Enc.   #Dec.   Vocab. Size   D_model   Num Attn. Heads   Dropout   D_ff
         10      10      128           256       16                0.1       1024

TA       Bsz.    LR      Steps               Weight Decay   Warmup
         512     3e-4    168k (~4 epochs)    0.1            500
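
As a sanity check on the architecture hyperparameters, the parameter count can be estimated by hand. The sketch below takes the values listed above and fills in the unstated ones with HuggingFace T5Config defaults (d_kv = 64, 32 relative attention buckets, a non-gated feed-forward block, tied input/output embeddings). These defaults are assumptions, not values confirmed by the card. Under them the estimate lands close to the roughly 42M parameters reported for this checkpoint.

```python
# Values from the hyperparameter listing above.
n_enc, n_dec = 10, 10
vocab_size, d_model, n_heads, d_ff = 128, 256, 16, 1024

# Assumptions (HuggingFace T5Config defaults, not stated in the card).
d_kv = 64        # per-head key/value dimension
n_buckets = 32   # relative attention buckets
inner = n_heads * d_kv

attention = 4 * d_model * inner     # q, k, v and o projections, no biases
feed_forward = 2 * d_model * d_ff   # wi and wo of a non-gated feed-forward block
layer_norm = d_model                # T5 layer norm has a scale and no bias

enc_layer = attention + feed_forward + 2 * layer_norm
dec_layer = 2 * attention + feed_forward + 3 * layer_norm  # self- and cross-attention

rel_bias = n_buckets * n_heads      # held by the first layer of each stack
encoder = n_enc * enc_layer + layer_norm + rel_bias
decoder = n_dec * dec_layer + layer_norm + rel_bias
embeddings = vocab_size * d_model   # shared table, assumed tied to the output head

total = encoder + decoder + embeddings
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```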

Hardware

  • Hardware Type: NVIDIA A100 80GB PCIe
  • Hours used: 60
  • Carbon Emitted: 6.48 kg CO2 eq.

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

BibTeX entry and citation info

@article{dkarthikeyan2024tcrtranslate,
  title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
  author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
  journal={bioRxiv},
  year={2024},
}

Model size: 42M params (Safetensors, tensor type F32)