---
license: mit
---
This is a refined version of a dataset obtained from UniProt ([see here](https://www.uniprot.org/uniprotkb?facets=reviewed%3Atrue%2Cproteins_with%3A9&fields=accession%2Cprotein_families%2Cft_binding%2Cft_act_site%2Csequence&query=%28family%3A*%29+AND+%28ft_binding%3A*%29&view=table)).
The data was first sorted by family, then random families were selected until approximately 20% of the data was separated out as test data.
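
The family-wise split described above can be sketched roughly as follows. This is only an illustration, not the script used to build the dataset: the `family` key, the `split_by_family` name, the seed, and the exact stopping rule (stop once the test set reaches the target size) are all assumptions.

```python
import random


def split_by_family(records, test_fraction=0.2, seed=0):
    """Split records into (train, test) so that no family appears in both.

    `records` is a list of dicts with a hypothetical "family" key.
    """
    # Group records by protein family.
    by_family = {}
    for rec in records:
        by_family.setdefault(rec["family"], []).append(rec)

    # Visit families in a random but reproducible order.
    families = sorted(by_family)
    random.Random(seed).shuffle(families)

    # Move whole families into the test set until the target size is reached.
    target = test_fraction * len(records)
    test_families = set()
    n_test = 0
    for fam in families:
        if n_test >= target:
            break
        test_families.add(fam)
        n_test += len(by_family[fam])

    train = [r for r in records if r["family"] not in test_families]
    test = [r for r in records if r["family"] in test_families]
    return train, test
```

Because whole families are moved at once, the test set only approximates the requested fraction when family sizes are uneven.
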
Next, each sequence longer than 1000 residues was segmented into non-overlapping sections of 1000 amino acids or fewer. Any sequences
with only partial binding site annotations (those containing `<`, `>`, or `?`) were discarded.
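
As a minimal sketch of the segmentation and filtering steps, assuming the binding site annotation is available as a raw text string (for example the UniProt `ft_binding` column), the two operations might look like this. The function names are hypothetical and the original preprocessing code may differ.

```python
def has_partial_annotation(annotation):
    """Return True if the annotation text contains '<', '>', or '?'."""
    return any(ch in annotation for ch in "<>?")


def segment(sequence, max_len=1000):
    """Cut a sequence into consecutive, non-overlapping chunks of at most max_len."""
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]
```

For example, a 2500-residue sequence yields chunks of 1000, 1000, and 500 residues. If per-residue labels are stored alongside the sequence, the same slicing would need to be applied to them so that sequence and labels stay aligned.
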
Note: Copied from https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family