as-cle-bert commited on
Commit
0548fd2
·
verified ·
1 Parent(s): 985369e

Update README.md

  1. README.md +208 -14
README.md CHANGED
@@ -7,34 +7,228 @@ tags:
7
  - antiobiotic-resistance
8
  widget:
9
  - text: I love AutoTrain
 
10
  datasets:
11
  - as-cle-bert/AMR-Gene-Families
12
  pipeline_tag: text-classification
13
  ---
14
 
15
- # Model Trained Using AutoTrain
16
 
17
- - Problem type: Text Classification
18
 
19
- ## Validation Metrics
20
- loss: 0.08235077559947968
21
 
22
- f1_macro: 0.986759581881533
23
 
24
- f1_micro: 0.99
25
 
26
- f1_weighted: 0.9899790940766551
27
 
28
- precision_macro: 0.9871615312791784
29
 
30
- precision_micro: 0.99
31
 
32
- precision_weighted: 0.9901213818860879
33
 
34
- recall_macro: 0.986574074074074
35
 
36
- recall_micro: 0.99
37
 
38
- recall_weighted: 0.99
39
 
40
- accuracy: 0.99
7
  - antiobiotic-resistance
8
  widget:
9
  - text: I love AutoTrain
10
+ - text: M T L A L V G E K I D R N R F T G E K V E N S T F F N C D F S G A D L S G T E F I G C Q F Y D R E S Q K G C N F S R A N L K D A I F K S C D L S M A D F R N I N A L G I E I R H C R A Q G S D F R G A S F M N M I T T R T W F C S A Y I T N T N L S Y A N F S K V V L E K C E L W E N R W M G T Q V L G A T F S G S D L S G G E F S S F D W R A A N V T H C D L T N S E L G D L D I R G V D L Q G V K L D S Y Q A S L L L E R L G I A V M G
11
  datasets:
12
  - as-cle-bert/AMR-Gene-Families
13
  pipeline_tag: text-classification
14
  ---
15
 
16
+ <table>
17
+ <tr>
18
+ <td>
19
+ <img src="https://img.shields.io/github/languages/top/AstraBert/resistML" alt="GitHub top language">
20
+ </td>
21
+ <td>
22
+ <img src="https://img.shields.io/github/commit-activity/t/AstraBert/resistML" alt="GitHub commit activity">
23
+ </td>
24
+ <td>
25
+ <img src="https://img.shields.io/badge/resistML-stable-green" alt="Static Badge">
26
+ </td>
27
+ <td>
28
+ <img src="https://img.shields.io/badge/resistBERT-unstable-orange" alt="Static Badge">
29
+ </td>
30
+ <td>
31
+ <img src="https://img.shields.io/badge/Release-v0.0.0-blue" alt="Static Badge">
32
+ </td>
33
+ </tr>
34
+ </table>
35
 
 
36
 
37
+ # resistML
 
38
 
39
+ A tool for AMR gene family prediction, simple and ML-based. Please refer to [this GitHub repository](https://github.com/AstraBert/resistML).
40
 
41
+ ## Training
42
 
 
43
 
44
+ ### Data collection for training
45
 
46
+ The latest release of reference sequences (Feb 2024) was downloaded from **CARD** (*The Comprehensive Antibiotic Resistance Database*). To download them automatically, use [this link](https://card.mcmaster.ca/latest/data).
47
 
48
+ Protein sequences were mapped through their ARO indices to the corresponding AMR gene families (see [this file](https://github.com/AstraBert/resistML/tree/main/data/aro_categories_index.tsv) for reference), and the 12 most common families were chosen to train resistML and resistBERT.
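The selection of the most common families can be sketched as below. This is a minimal illustration, not the repository's script: the function names and the toy `(sequence, family)` records are hypothetical, and only the idea of keeping the `k` most frequent families comes from the text above.

```python
from collections import Counter


def most_common_families(labels, k=12):
    """Return the k most frequent family names, most frequent first."""
    return [family for family, _ in Counter(labels).most_common(k)]


def keep_top_families(records, k=12):
    """Keep only (sequence, family) records whose family is among the top k."""
    top = set(most_common_families([family for _, family in records], k))
    return [(seq, family) for seq, family in records if family in top]


# Toy records with made-up sequences (hypothetical data, for illustration only).
records = [
    ("MKT", "CMY beta-lactamase"),
    ("MSL", "CMY beta-lactamase"),
    ("MRF", "SHV beta-lactamase"),
    ("MNK", "SHV beta-lactamase"),
    ("MQQ", "KPC beta-lactamase"),
]
filtered = keep_top_families(records, k=2)
```

With `k=2`, the single KPC record is dropped and the four CMY and SHV records are kept.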
49
 
50
+ ### Training procedures
51
 
52
+ #### resistML (stable)
53
 
54
+ resistML was trained on all the protein sequences retrieved beforehand, with their features extracted into a [csv file](https://github.com/AstraBert/resistML/tree/main/data/proteinstats.tsv).
55
 
56
+ Features were extracted with Biopython's `ProteinAnalysis` class (from `Bio.SeqUtils.ProtParam`); they are listed below (the uppercase part is the column header you can find in the csv):
57
+
58
+ - HIDROPHOBICITY score
59
+ - ISOELECTRIC point
60
+ - AROMATICity
61
+ - INSTABility
62
+ - MW (molecular weight)
63
+ - HELIX,TURN,SHEET (percentage of these three secondary structures)
64
+ - MOL_EXT_RED,MOL_EXT_OX (molar extinction coefficient, reduced and oxidized)
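To give a feel for what two of these features measure, here is a hand-written sketch in plain Python. It is not the repository's code, which relies on Biopython's `ProteinAnalysis`. It assumes the usual definitions: aromaticity as the relative frequency of Phe, Trp and Tyr, and the molar extinction coefficient at 280 nm approximated from Trp, Tyr and cystine counts. The function names are hypothetical.

```python
def aromaticity(sequence):
    """Relative frequency of Phe (F), Trp (W) and Tyr (Y) in the sequence."""
    return sum(sequence.count(aa) for aa in "FWY") / len(sequence)


def molar_extinction(sequence):
    """Return (reduced, oxidized) molar extinction coefficients at 280 nm.

    Approximation: 5500 per Trp and 1490 per Tyr; the oxidized value adds
    125 per cystine, i.e. per pair of Cys residues assumed to be bridged.
    """
    reduced = sequence.count("W") * 5500 + sequence.count("Y") * 1490
    oxidized = reduced + (sequence.count("C") // 2) * 125
    return reduced, oxidized


# Toy peptides (made up for illustration).
toy_aromaticity = aromaticity("FWYA")
toy_extinction = molar_extinction("WYCC")
```

In Biopython, the corresponding values come from the `aromaticity()` and `molar_extinction_coefficient()` methods of a `ProteinAnalysis` object.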
65
+
66
+ Dataset building occurred [here](https://github.com/AstraBert/resistML/tree/main/scripts/build_base_dataset.py).
67
+
68
+ The base model itself is a simple Voting Classifier based on a DecisionTreeClassifier, an ExtraTreesClassifier and a HistGradientBoostingClassifier, all provided by the scikit-learn library.
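The way a voting ensemble combines its members can be illustrated with a plain-Python majority vote. This sketch assumes hard voting, where each estimator casts one vote per sample. The text above does not say whether hard or soft voting was used, although hard voting is the default of scikit-learn's `VotingClassifier`. The `hard_vote` helper and the example votes are hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the largest number of estimators."""
    return Counter(predictions).most_common(1)[0][0]


# One sample, three hypothetical estimator outputs
# (e.g. decision tree, extra trees, gradient boosting).
votes = ["CMY beta-lactamase", "SHV beta-lactamase", "CMY beta-lactamase"]
winner = hard_vote(votes)
```

Here two of the three estimators agree, so the ensemble predicts `CMY beta-lactamase`.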
69
+
70
+ During validation, it reached 100% accuracy when predicting the training data.
71
+
72
+ #### resistBERT (unstable)
73
+
74
+
75
+ resistBERT is a BERT model for text classification, finetuned from [prot_bert](https://huggingface.co/Rostlab/prot_bert) by Rostlab.
76
+
77
+ The data used for finetuning were a selection of 1496 sequences out of the 1836 total. 80% were used for training and 20% for validation.
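As a rough check of the numbers, 80% of 1496 is 1196.8, so the split is about 1197 training and 299 validation sequences. The sketch below shows one way to produce such a split. It is an assumption for illustration only: the actual split was handled during finetuning and may round or shuffle differently, and the function name and seed are hypothetical.

```python
import random


def train_validation_split(items, train_fraction=0.8, seed=0):
    """Shuffle items reproducibly and split them into train/validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Integers stand in for the 1496 selected sequences.
train, validation = train_validation_split(range(1496))
```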
78
+
79
+ Sequences were preprocessed and labelled [here](https://github.com/AstraBert/resistML/tree/main/scripts/build_base_dataset.py), then the complete jsonl file was reduced [here](https://github.com/AstraBert/resistML/tree/main/scripts/reduce_dataset.py) and uploaded to Hugging Face under the identifier `as-cle-bert/AMR-Gene-Families` through [this script](https://github.com/AstraBert/resistML/tree/main/scripts/jsonl2hfdataset.py).
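A preprocessing step of this kind might look like the sketch below. It follows the input convention described on the prot_bert model card (uppercase residues separated by spaces, with the rare residues U, Z, O and B mapped to X), which also matches the space-separated widget example in this README. The function names and the `text`/`target` field names are assumptions, not necessarily what the repository's scripts use.

```python
import json
import re


def format_for_prot_bert(sequence):
    """Uppercase the sequence, map rare residues to X and space-separate it."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)


def to_jsonl_line(sequence, family):
    """Serialize one labelled sequence as a single JSON line."""
    record = {"text": format_for_prot_bert(sequence), "target": family}
    return json.dumps(record)


# Made-up short sequence, for illustration only.
line = to_jsonl_line("mtlaU", "quinolone resistance protein (qnr)")
```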
80
+
81
+ Finetuning occurred from the HF dataset thanks to AutoTrain: during validation, the model yielded the following stats:
82
+
83
+ - loss: 0.08235077559947968
84
+
85
+ - f1_macro: 0.986759581881533
86
+
87
+ - f1_micro: 0.99
88
+
89
+ - f1_weighted: 0.9899790940766551
90
+
91
+ - precision_macro: 0.9871615312791784
92
+
93
+ - precision_micro: 0.99
94
+
95
+ - precision_weighted: 0.9901213818860879
96
+
97
+ - recall_macro: 0.986574074074074
98
+
99
+ - recall_micro: 0.99
100
+
101
+ - recall_weighted: 0.99
102
+
103
+ - accuracy: 0.99
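The difference between the macro and micro averages listed above can be sketched in plain Python. This is an illustrative implementation under the standard definitions, not the code AutoTrain runs; the function name and the toy labels are hypothetical. For single-label classification the micro-averaged F1 equals accuracy, which is consistent with `f1_micro` and `accuracy` both being 0.99.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Macro: compute F1 per class first, then average with equal weight.
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        total_tp += tp
        total_fp += fp
        total_fn += fn
    macro = sum(per_class_f1) / len(per_class_f1)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return macro, micro


# Toy case: a classifier that always predicts class "A" on a balanced test.
macro_f1, micro_f1 = f1_scores(["A", "A", "B", "B"], ["A", "A", "A", "A"])
```

In this toy case the macro F1 is pulled down by the class that is never predicted, while the micro F1 stays equal to the accuracy.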
104
+
105
+ The model is now available on Hugging Face under the identifier `as-cle-bert/resistBERT`. There is also a widget through which you can run inference via the HF Inference API. Keep in mind that the Inference API *can* be unstable, so downloading the model and using it from a local machine or cloud service is preferable.
106
+
107
+ ## Testing
108
+
109
+
110
+ ### Data retrieval for tests
111
+
112
+ Data were downloaded from **CARD** (*The Comprehensive Antibiotic Resistance Database*), as the annotations for the family names used to label training sequences were the same.
113
+
114
+ For families "PDC beta-lactamase", "CTX-M beta-lactamase", "SHV beta-lactamase", "CMY beta-lactamase", sequences were downloaded after having searched the exact AMR gene family as in the labels used for training, through `Download sequences` method. In the downloading customization page, filters were set to `is_a` and `Protein`.
115
+
116
+ For all the other families, the procedure was the same, but customization filters were set to `is_a`, `structurally_homologous_to`, `evolutionary_variant_of` and `Protein` to increase the number of retrieved sequences.
117
+
118
+ ### Test building
119
+
120
+
121
+ Tests were built thanks to [this script](https://github.com/AstraBert/resistML/tree/main/scripts/build_tests.py).
122
+
123
+ These are the test metadata:
124
+
125
+ **Metadata for test 0:**
126
+
127
+ - Protein statistics for resistML were saved in test/testfiles/test_0.csv
128
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_0.jsonl
129
+ - 12 protein sequences were taken into account for 2 families
130
+ - Families taken into account were: quinolone resistance protein (qnr), CMY beta-lactamase
131
+
132
+ **Metadata for test 1:**
133
+
134
+ - Protein statistics for resistML were saved in test/testfiles/test_1.csv
135
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_1.jsonl
136
+ - 11 protein sequences were taken into account for 2 families
137
+ - Families taken into account were: VIM beta-lactamase, IMP beta-lactamase
138
+
139
+ **Metadata for test 2:**
140
+
141
+ - Protein statistics for resistML were saved in test/testfiles/test_2.csv
142
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_2.jsonl
143
+ - 13 protein sequences were taken into account for 2 families
144
+ - Families taken into account were: quinolone resistance protein (qnr), SHV beta-lactamase
145
+
146
+ **Metadata for test 3:**
147
+
148
+ - Protein statistics for resistML were saved in test/testfiles/test_3.csv
149
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_3.jsonl
150
+ - 10 protein sequences were taken into account for 3 families
151
+ - Families taken into account were: quinolone resistance protein (qnr), VIM beta-lactamase, CMY beta-lactamase
152
+
153
+ **Metadata for test 4:**
154
+
155
+ - Protein statistics for resistML were saved in test/testfiles/test_4.csv
156
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_4.jsonl
157
+ - 12 protein sequences were taken into account for 2 families
158
+ - Families taken into account were: CMY beta-lactamase, IMP beta-lactamase
159
+
160
+ **Metadata for test 5:**
161
+
162
+ - Protein statistics for resistML were saved in test/testfiles/test_5.csv
163
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_5.jsonl
164
+ - 12 protein sequences were taken into account for 2 families
165
+ - Families taken into account were: VIM beta-lactamase, SHV beta-lactamase
166
+
167
+ **Metadata for test 6:**
168
+
169
+ - Protein statistics for resistML were saved in test/testfiles/test_6.csv
170
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_6.jsonl
171
+ - 11 protein sequences were taken into account for 3 families
172
+ - Families taken into account were: PDC beta-lactamase, MCR phosphoethanolamine transferase, ACT beta-lactamase
173
+
174
+ **Metadata for test 7:**
175
+
176
+ - Protein statistics for resistML were saved in test/testfiles/test_7.csv
177
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_7.jsonl
178
+ - 10 protein sequences were taken into account for 3 families
179
+ - Families taken into account were: MCR phosphoethanolamine transferase, CTX-M beta-lactamase, PDC beta-lactamase
180
+
181
+ **Metadata for test 8:**
182
+
183
+ - Protein statistics for resistML were saved in test/testfiles/test_8.csv
184
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_8.jsonl
185
+ - 12 protein sequences were taken into account for 2 families
186
+ - Families taken into account were: ACT beta-lactamase, CMY beta-lactamase
187
+
188
+ **Metadata for test 9:**
189
+ - Protein statistics for resistML were saved in test/testfiles/test_9.csv
190
+ - Sequences and labels for resistBERT were saved in test/testfiles/test_9.jsonl
191
+ - 15 protein sequences were taken into account for 3 families
192
+ - Families taken into account were: quinolone resistance protein (qnr), SHV beta-lactamase, KPC beta-lactamase
193
+
194
+ All data can be found [here](http://github.com/AstraBert/resistML/tree/main/test), along with the sequences used to generate them.
195
+
196
+ ### Test results
197
+
198
+ **resistML** yielded 100% accuracy, f1 score, recall score and precision score in all 10 tests.
199
+
200
+ **resistBERT** was less stable:
201
+
202
+ - On test_0, test_2, test_4, test_6, test_7, test_8 and test_9 it yielded 100% accuracy, f1 score, recall score and precision score
203
+ - On test_1 it yielded:
204
+   1. Accuracy: 50%
205
+   2. f1 score: 33%
206
+   3. Precision: 25%
207
+   4. Recall: 50%
208
+ - On test_3 it yielded 66.7% accuracy, f1 score, recall score and precision score
209
+ - On test_5 it yielded 50% accuracy, f1 score, recall score and precision score
210
+
211
+
212
+ All results for resistBERT can be found [in the dedicated notebook](http://github.com/AstraBert/resistML/scripts/test_resistBERT.ipynb).
213
+
214
+ ## License and rights of usage
215
+
216
+
217
+ The [GitHub repository](http://github.com/AstraBert/resistML) is provided under the MIT license (more at [LICENSE](https://github.com/AstraBert/resistML/tree/main/LICENSE)).
218
+
219
+ If you use this work for your projects, please consider citing the author [Astra Bertelli](http://astrabert.vercel.app).
220
+
221
+ ## References
222
+
223
+
224
+ 1. **CARD - The Comprehensive Antibiotic Resistance Database**
225
+
226
+ 2. **Biopython**
227
+
228
+ 3. **Scikit-learn**
229
+
230
+ 4. **Hugging Face's prot_bert Model**
231
+
232
+ 5. **Hugging Face's AutoTrain**
233
+
234
+ If you feel that your work was relevant in building resistML and you weren't referenced in this section, feel free to flag an issue on GitHub or to contact the author.