wangjin2000
committed on
Upload 5 files
datasets/README.md
ADDED
@@ -0,0 +1,10 @@
+---
+license: mit
+---
+
+This is a refined version of a dataset obtained from UniProt ([see here](https://www.uniprot.org/uniprotkb?facets=reviewed%3Atrue%2Cproteins_with%3A9&fields=accession%2Cprotein_families%2Cft_binding%2Cft_act_site%2Csequence&query=%28family%3A*%29+AND+%28ft_binding%3A*%29&view=table)).
+The data was first sorted by family, then random families were selected until approximately 20% of the data was separated out as test data.
+Next, each sequence longer than 1000 residues was segmented into non-overlapping sections of 1000 amino acids or fewer. Any sequences
+with only partial binding site annotations (those containing `<`, `>`, or `?`) were thrown out.
+
+Note: Copied from https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family
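The preprocessing described in the README above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's code: the record layout (a dict with a `family` key), the per-residue label list, and the function names `split_by_family`, `chunk_sequence`, and `has_partial_annotation` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_family(records, test_fraction=0.2, seed=0):
    """Assign whole families to the test set until it holds ~test_fraction of records.

    `records` is assumed to be a list of dicts with a "family" key (hypothetical layout).
    """
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec["family"]].append(rec)

    families = sorted(by_family)           # sort by family, as the README describes
    random.Random(seed).shuffle(families)  # then pick families in random order

    target = test_fraction * len(records)
    train, test = [], []
    for fam in families:
        # Keep filling the test set until it reaches the target size;
        # every remaining family goes to the training set.
        bucket = test if len(test) < target else train
        bucket.extend(by_family[fam])
    return train, test


def chunk_sequence(sequence, labels, max_len=1000):
    """Cut a sequence and its per-residue labels into non-overlapping pieces of at most max_len."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    return [
        (sequence[i:i + max_len], labels[i:i + max_len])
        for i in range(0, len(sequence), max_len)
    ]


def has_partial_annotation(annotation):
    """True if a binding-site annotation string contains '<', '>' or '?'."""
    return any(ch in annotation for ch in "<>?")
```

Because families are never divided between the two sets, no family appears in both train and test. The test set can overshoot the 20% target by up to one family, which matches the README's "approximately".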
datasets/test_labels_chunked_by_family.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:855a8c80c45bb0b8266b530761427571285aafb9cfbd407df10c71a374b91c16
+size 35254503
datasets/test_sequences_chunked_by_family.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0bad16c8293d917e1dcc03e6205ac45fdaf62efe8d2b321d826a1c1cef256d50
+size 15556955
datasets/train_labels_chunked_by_family.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8c46866bd8e3127f0394bc5187e075b146cb30c90dec5a67be000c9e6cb6ea5f
+size 143591612
datasets/train_sequences_chunked_by_family.pkl
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8acb6a3111d8105b372b4d811e3f4912b071ab7bd55d8a501967f5269a6e30ac
+size 63908077
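The four `.pkl` entries in this commit are Git LFS pointer files, not the pickles themselves: each one records a spec version, a SHA-256 object ID, and a size in bytes. A small sketch of how a downloaded file could be checked against its pointer is shown below; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and only the Python standard library is used.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, block_size=1 << 20):
    """Return True if the file at `path` has the size and hash recorded in the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Read in blocks so large files are not loaded into memory at once.
        for block in iter(lambda: handle.read(block_size), b""):
            hasher.update(block)
            size += len(block)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:8acb6a3111d8105b372b4d811e3f4912b071ab7bd55d8a501967f5269a6e30ac\n"
    "size 63908077\n"
)
```

Calling `matches_pointer("datasets/train_sequences_chunked_by_family.pkl", pointer)` on a local copy would then confirm that the download is complete. The internal layout of the pickles is not documented in this commit, and pickle files can execute arbitrary code when loaded, so it is worth unpickling only files from a source you trust.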