wangjin2000 committed on
Commit 5c47379 · verified · 1 Parent(s): eccecb6

Upload 5 files

datasets/README.md ADDED
@@ -0,0 +1,10 @@
+ ---
+ license: mit
+ ---
+
+ This is a refined version of a dataset obtained from UniProt ([see here](https://www.uniprot.org/uniprotkb?facets=reviewed%3Atrue%2Cproteins_with%3A9&fields=accession%2Cprotein_families%2Cft_binding%2Cft_act_site%2Csequence&query=%28family%3A*%29+AND+%28ft_binding%3A*%29&view=table)).
+ The data was first sorted by family, then random families were selected until approximately 20% of the data had been set aside as test data.
+ Next, each sequence longer than 1000 residues was segmented into non-overlapping sections of 1000 amino acids or fewer. Any sequences
+ with only partial binding site annotations were discarded (any sequence whose annotation contains `<`, `>`, or `?`).
+
+ Note: Copied from https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family
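
The preprocessing described in the README can be sketched roughly as follows. This is a minimal illustration and not the original author's script: the record layout `(family, sequence, labels)`, the function names, and the fixed seed are all assumptions, and the `<`, `>`, `?` check assumes those characters appear in the UniProt binding-site annotation text.

```python
import random


def has_partial_annotation(annotation):
    """True if the annotation text contains '<', '>' or '?' (assumed markers of partial sites)."""
    return any(ch in annotation for ch in "<>?")


def split_by_family(records, test_fraction=0.2, seed=0):
    """Move whole families into the test set until it holds about test_fraction of the records.

    records: list of (family, sequence, labels) tuples (hypothetical layout).
    """
    by_family = {}
    for record in records:
        by_family.setdefault(record[0], []).append(record)

    names = sorted(by_family)            # sort by family first
    random.Random(seed).shuffle(names)   # then visit families in random order

    target = test_fraction * len(records)
    train, test = [], []
    for name in names:
        # A family is never divided, so the test share is only approximately 20%.
        (test if len(test) < target else train).extend(by_family[name])
    return train, test


def chunk_sequence(sequence, labels, max_len=1000):
    """Cut a sequence and its per-residue labels into non-overlapping pieces of at most max_len."""
    return [
        (sequence[i:i + max_len], labels[i:i + max_len])
        for i in range(0, len(sequence), max_len)
    ]
```

Because families are assigned as a unit, no family contributes to both splits, which is the point of splitting by family rather than by individual sequence.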
datasets/test_labels_chunked_by_family.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:855a8c80c45bb0b8266b530761427571285aafb9cfbd407df10c71a374b91c16
+ size 35254503
datasets/test_sequences_chunked_by_family.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0bad16c8293d917e1dcc03e6205ac45fdaf62efe8d2b321d826a1c1cef256d50
+ size 15556955
datasets/train_labels_chunked_by_family.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8c46866bd8e3127f0394bc5187e075b146cb30c90dec5a67be000c9e6cb6ea5f
+ size 143591612
datasets/train_sequences_chunked_by_family.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8acb6a3111d8105b372b4d811e3f4912b071ab7bd55d8a501967f5269a6e30ac
+ size 63908077
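
The four `.pkl` entries above are Git LFS pointers: each records the SHA-256 digest and byte size of the real file. A hedged sketch of checking a downloaded file against its pointer before unpickling it is shown below. The function names are hypothetical, and the commit does not say what Python objects the pickles contain, so the return value of `load_verified_pickle` is left unspecified.

```python
import hashlib
import pickle
from pathlib import Path


def parse_lfs_pointer(text):
    """Parse Git LFS pointer text into a dict with 'version', 'oid' (hex digest) and 'size'."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, _, digest = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError("unsupported oid algorithm: %s" % algorithm)
    return {"version": fields["version"], "oid": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, block_size=1 << 20):
    """Return True if the file at path has the size and SHA-256 digest named in the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in blocks so large files are not read into memory at once.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest() == pointer["oid"]


def load_verified_pickle(path, pointer_text):
    """Unpickle the file only if it matches its LFS pointer."""
    pointer = parse_lfs_pointer(pointer_text)
    if not matches_pointer(path, pointer):
        raise ValueError("file does not match its LFS pointer: %s" % path)
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

A matching digest only confirms that the file is the one referenced by the commit. Unpickling can execute arbitrary code, so pickle files should still be loaded only from sources you trust.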