Model for feature generation requires very high memory.
Feature generation for a protein sequence of about 1000 residues uses a lot of memory: on Google Colab's 12 GB GPU I hit an 'out of memory' error after embedding only 6 such sequences.
Maybe try casting the model to half-precision before running feature extraction.
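Something along these lines, for example (a minimal sketch using the standard transformers API; the checkpoint name is the public ProtT5-XL UniRef50 model):

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model = model.half().to(device)  # cast weights to fp16, roughly halving GPU memory
model.eval()
```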
Also, I would recommend using our ProtT5-XL model, because it proved to be better in all of our benchmarks:
https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc
Also, if you only hit OOM after embedding 6 sequences of identical length, you have a memory leak somewhere.
Once you have managed to embed a single protein of e.g. 1k residues, it should not make any difference how many times you repeat the process.
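A common cause is keeping results (and their autograd graphs) on the GPU across iterations. A sketch of a leak-free loop, reusing the tokenizer and model from the snippet above; the sequence list is a hypothetical placeholder:

```python
import re
import torch

sequences = ["MKTAYIAKQRQISFVK", "SHMDSSTSAA"]  # hypothetical example inputs

embeddings = []
with torch.no_grad():  # don't build autograd graphs during feature extraction
    for seq in sequences:
        # ProtT5 expects space-separated residues; map rare amino acids to X
        seq = " ".join(re.sub(r"[UZOB]", "X", seq))
        inputs = tokenizer(seq, return_tensors="pt").to(device)
        out = model(**inputs)
        # move each embedding to CPU right away so GPU memory is freed per sequence
        embeddings.append(out.last_hidden_state.squeeze(0).float().cpu())
```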
When using the https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc model, the tokenizer raises the error "Exception: You're trying to run a Unigram model but you're file was trained with a different algorithm".
Yeah, I guess you are running into this issue: https://github.com/huggingface/transformers/issues/9871
I think your problem should be solved by loading BertTokenizer or T5Tokenizer explicitly instead of AutoTokenizer, e.g. as in the sketch below.
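For the T5-based encoder checkpoint linked above, that would look roughly like this (the torch_dtype argument keeps the weights in fp16 as stored):

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

# load the tokenizer class explicitly rather than via AutoTokenizer
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False
)
model = T5EncoderModel.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc", torch_dtype=torch.float16
)
```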