as-cle-bert
committed on
Update README.md
README.md
CHANGED
@@ -7,34 +7,228 @@ tags:
Lines removed from the previous README include `- Problem type: Text Classification`, `loss: 0.08235077559947968` and `f1_weighted: 0.9899790940766551`.
- antiobiotic-resistance
widget:
- text: I love AutoTrain
- text: M T L A L V G E K I D R N R F T G E K V E N S T F F N C D F S G A D L S G T E F I G C Q F Y D R E S Q K G C N F S R A N L K D A I F K S C D L S M A D F R N I N A L G I E I R H C R A Q G S D F R G A S F M N M I T T R T W F C S A Y I T N T N L S Y A N F S K V V L E K C E L W E N R W M G T Q V L G A T F S G S D L S G G E F S S F D W R A A N V T H C D L T N S E L G D L D I R G V D L Q G V K L D S Y Q A S L L L E R L G I A V M G
datasets:
- as-cle-bert/AMR-Gene-Families
pipeline_tag: text-classification
---

<table>
  <tr>
    <td>
      <img src="https://img.shields.io/github/languages/top/AstraBert/resistML" alt="GitHub top language">
    </td>
    <td>
      <img src="https://img.shields.io/github/commit-activity/t/AstraBert/resistML" alt="GitHub commit activity">
    </td>
    <td>
      <img src="https://img.shields.io/badge/resistML-stable-green" alt="Static Badge">
    </td>
    <td>
      <img src="https://img.shields.io/badge/resistBERT-unstable-orange" alt="Static Badge">
    </td>
    <td>
      <img src="https://img.shields.io/badge/Release-v0.0.0-blue" alt="Static Badge">
    </td>
  </tr>
</table>

# resistML

A simple, ML-based tool for AMR gene family prediction. Please refer to [this GitHub repository](https://github.com/AstraBert/resistML).

## Training

### Data collection for training

The latest release of reference sequences (Feb 2024) was downloaded from **CARD** (*The Comprehensive Antibiotic Resistance Database*). If you want to download them automatically too, use [this link](https://card.mcmaster.ca/latest/data).

Protein sequences were mapped through their ARO indices to the corresponding AMR gene families (see [this file](https://github.com/AstraBert/resistML/tree/main/data/aro_categories_index.tsv) for reference), and the 12 most common families were chosen to train resistML and resistBERT.
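
As a rough illustration of the selection step, the sketch below counts how many sequences map to each family and keeps the most frequent ones. The `(aro_id, family)` pairs and the function name `top_families` are made up for the example and do not reproduce the layout of `aro_categories_index.tsv`.

```python
from collections import Counter


def top_families(records, n=12):
    """Return the n most frequent family names among (aro_id, family) pairs."""
    counts = Counter(family for _aro_id, family in records)
    return [family for family, _count in counts.most_common(n)]


# Toy records: hypothetical ARO identifiers paired with family names.
records = [
    ("ARO:0000001", "CMY beta-lactamase"),
    ("ARO:0000002", "CMY beta-lactamase"),
    ("ARO:0000003", "SHV beta-lactamase"),
    ("ARO:0000004", "CMY beta-lactamase"),
    ("ARO:0000005", "SHV beta-lactamase"),
    ("ARO:0000006", "KPC beta-lactamase"),
]

selected = top_families(records, n=2)
selected_set = set(selected)
# Keep only the sequences whose family made the cut.
kept = [(aro, fam) for aro, fam in records if fam in selected_set]
print(selected)
print(len(kept))
```

With `n=12` on the real mapping, the same idea would yield the twelve families used for training.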

### Training procedures

#### resistML (stable)

resistML was trained on all the protein sequences retrieved beforehand, with their features extracted into a [csv file](https://github.com/AstraBert/resistML/tree/main/data/proteinstats.tsv).

Features were extracted with the `ProteinAnalysis` class from Biopython's `Bio.SeqUtils.ProtParam` module. They are listed below (the uppercase part is the column header you will find in the csv):

- HIDROPHOBICITY score
- ISOELECTRIC point
- AROMATICity
- INSTABility
- MW (molecular weight)
- HELIX,TURN,SHEET (percentage of these three secondary structures)
- MOL_EXT_RED,MOL_EXT_OX (molar extinction coefficient, reduced and oxidized)
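
A minimal sketch of how such features could be computed with Biopython is shown below. Mapping HIDROPHOBICITY to `gravy()` is an assumption (the text does not say which score was used), the column names simply follow the uppercase parts listed above, and the function name `protein_features` is invented for the example. The dataset-building script in the repository remains the reference.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis


def protein_features(sequence):
    """Compute ten per-protein statistics for one amino acid sequence."""
    analysis = ProteinAnalysis(sequence)
    helix, turn, sheet = analysis.secondary_structure_fraction()
    ext_reduced, ext_oxidized = analysis.molar_extinction_coefficient()
    return {
        # Key spelling follows the csv header; GRAVY is an assumed choice.
        "HIDROPHOBICITY": analysis.gravy(),
        "ISOELECTRIC": analysis.isoelectric_point(),
        "AROMATIC": analysis.aromaticity(),
        "INSTAB": analysis.instability_index(),
        "MW": analysis.molecular_weight(),
        "HELIX": helix,
        "TURN": turn,
        "SHEET": sheet,
        "MOL_EXT_RED": ext_reduced,
        "MOL_EXT_OX": ext_oxidized,
    }


# First 60 residues of the widget example sequence from the model card.
example = "MTLALVGEKIDRNRFTGEKVENSTFFNCDFSGADLSGTEFIGCQFYDRESQKGCNFSRAN"
features = protein_features(example)
for name, value in features.items():
    print(f"{name}: {value}")
```

Note that `secondary_structure_fraction()` returns fractions between 0 and 1, so a conversion would be needed if the csv stores percentages.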

Dataset building occurred [here](https://github.com/AstraBert/resistML/tree/main/scripts/build_base_dataset.py).

The base model itself is a simple Voting Classifier built from a DecisionTreeClassifier, an ExtraTreesClassifier and a HistGradientBoostingClassifier, all provided by the scikit-learn library.

During validation, it yielded 100% accuracy when predicting the training data.

#### resistBERT (unstable)

resistBERT is a BERT model for text classification, finetuned from [prot_bert](https://huggingface.co/Rostlab/prot_bert) by Rostlab.

The data used for finetuning were a selection of 1496 sequences out of the 1836 total: 80% were used for training and 20% for validation.
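
For a sense of the sizes involved, the sketch below performs an 80/20 shuffle split over 1496 placeholder items. The seed, the shuffling strategy and the rounding are assumptions; AutoTrain or the author's scripts may split differently.

```python
import random


def split_train_validation(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items and cut it into train and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for the 1496 selected sequences.
sequence_ids = [f"seq_{i}" for i in range(1496)]
train, validation = split_train_validation(sequence_ids)
print(len(train), len(validation))  # 1197 299
```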

Sequences were preprocessed and labelled [here](https://github.com/AstraBert/resistML/tree/main/scripts/build_base_dataset.py), then the complete jsonl file was reduced [here](https://github.com/AstraBert/resistML/tree/main/scripts/reduce_dataset.py) and uploaded to Hugging Face under the identifier `as-cle-bert/AMR-Gene-Families` through [this script](https://github.com/AstraBert/resistML/tree/main/scripts/jsonl2hfdataset.py).

Finetuning was run on the HF dataset with AutoTrain. During validation, the model yielded the following stats:

- loss: 0.08235077559947968
- f1_macro: 0.986759581881533
- f1_micro: 0.99
- f1_weighted: 0.9899790940766551
- precision_macro: 0.9871615312791784
- precision_micro: 0.99
- precision_weighted: 0.9901213818860879
- recall_macro: 0.986574074074074
- recall_micro: 0.99
- recall_weighted: 0.99
- accuracy: 0.99

The model is now available on Hugging Face under the identifier `as-cle-bert/resistBERT`. There is also a widget through which you can run inference via the HF `Inference API`. Keep in mind that the Inference API *can* be unstable, so downloading the model and running it on a local machine or cloud service is preferable.
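
A local inference call might look like the sketch below. The space-separated input format is inferred from the widget example in the model card metadata and from prot_bert's documented preprocessing (including the replacement of the rare residues U, Z, O and B with X); whether resistBERT's training data was prepared in exactly this way is an assumption. The helper names are invented for the example, and `classify` needs the `transformers` package and downloads the model on first use.

```python
import re


def format_for_prot_bert(sequence):
    """Uppercase a protein sequence, map rare residues to X, space-separate it."""
    cleaned = re.sub(r"\s+", "", sequence).upper()
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    return " ".join(cleaned)


def classify(sequence, model_id="as-cle-bert/resistBERT"):
    """Run the text-classification pipeline on one sequence."""
    # Imported lazily because transformers is a heavy, optional dependency.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(format_for_prot_bert(sequence))


print(format_for_prot_bert("mtlalvgek"))
```

Calling `classify("MTLALVGEK...")` would then return the pipeline's list of label and score dictionaries for that sequence.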

## Testing

### Data retrieval for tests

Test data were also downloaded from **CARD** (*The Comprehensive Antibiotic Resistance Database*), so that the family-name annotations match those used to label the training sequences.

For the families "PDC beta-lactamase", "CTX-M beta-lactamase", "SHV beta-lactamase" and "CMY beta-lactamase", sequences were downloaded by searching for the exact AMR gene family name used in the training labels and then using the `Download sequences` method. On the download customization page, filters were set to `is_a` and `Protein`.

For all the other families, the procedure was the same, but the customization filters were set to `is_a`, `structurally_homologous_to`, `evolutionary_variant_of` and `Protein` to increase the number of retrieved sequences.

### Test building

Tests were built with [this script](https://github.com/AstraBert/resistML/tree/main/scripts/build_tests.py).

These are the test metadata:

**Metadata for test 0:**

- Protein statistics for resistML were saved in `test/testfiles/test_0.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_0.jsonl`
- 12 protein sequences were taken into account for 2 families
- Families taken into account were: quinolone resistance protein (qnr), CMY beta-lactamase

**Metadata for test 1:**

- Protein statistics for resistML were saved in `test/testfiles/test_1.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_1.jsonl`
- 11 protein sequences were taken into account for 2 families
- Families taken into account were: VIM beta-lactamase, IMP beta-lactamase

**Metadata for test 2:**

- Protein statistics for resistML were saved in `test/testfiles/test_2.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_2.jsonl`
- 13 protein sequences were taken into account for 2 families
- Families taken into account were: quinolone resistance protein (qnr), SHV beta-lactamase

**Metadata for test 3:**

- Protein statistics for resistML were saved in `test/testfiles/test_3.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_3.jsonl`
- 10 protein sequences were taken into account for 3 families
- Families taken into account were: quinolone resistance protein (qnr), VIM beta-lactamase, CMY beta-lactamase

**Metadata for test 4:**

- Protein statistics for resistML were saved in `test/testfiles/test_4.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_4.jsonl`
- 12 protein sequences were taken into account for 2 families
- Families taken into account were: CMY beta-lactamase, IMP beta-lactamase

**Metadata for test 5:**

- Protein statistics for resistML were saved in `test/testfiles/test_5.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_5.jsonl`
- 12 protein sequences were taken into account for 2 families
- Families taken into account were: VIM beta-lactamase, SHV beta-lactamase

**Metadata for test 6:**

- Protein statistics for resistML were saved in `test/testfiles/test_6.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_6.jsonl`
- 11 protein sequences were taken into account for 3 families
- Families taken into account were: PDC beta-lactamase, MCR phosphoethanolamine transferase, ACT beta-lactamase

**Metadata for test 7:**

- Protein statistics for resistML were saved in `test/testfiles/test_7.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_7.jsonl`
- 10 protein sequences were taken into account for 3 families
- Families taken into account were: MCR phosphoethanolamine transferase, CTX-M beta-lactamase, PDC beta-lactamase

**Metadata for test 8:**

- Protein statistics for resistML were saved in `test/testfiles/test_8.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_8.jsonl`
- 12 protein sequences were taken into account for 2 families
- Families taken into account were: ACT beta-lactamase, CMY beta-lactamase

**Metadata for test 9:**

- Protein statistics for resistML were saved in `test/testfiles/test_9.csv`
- Sequences and labels for resistBERT were saved in `test/testfiles/test_9.jsonl`
- 15 protein sequences were taken into account for 3 families
- Families taken into account were: quinolone resistance protein (qnr), SHV beta-lactamase, KPC beta-lactamase

All data can be found [here](http://github.com/AstraBert/resistML/tree/main/test), along with the sequences used to generate them.
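
To make the shape of such test files concrete, the sketch below samples a couple of families from a toy pool and writes one JSONL record per sequence. The pool, the `text` and `label` field names and the sampling rule are all hypothetical; `build_tests.py` in the repository is the reference for what was actually done.

```python
import json
import random
import tempfile
from pathlib import Path


def build_test(pool, n_families, per_family, seed):
    """Sample n_families from pool, then per_family sequences from each one."""
    rng = random.Random(seed)
    families = rng.sample(sorted(pool), n_families)
    return [
        {"text": " ".join(seq), "label": family}
        for family in families
        for seq in rng.sample(pool[family], per_family)
    ]


# Toy pool: made-up short peptides, not real CARD sequences.
pool = {
    "CMY beta-lactamase": ["MKKSL", "MMKKS", "MKRSL"],
    "SHV beta-lactamase": ["MRYIR", "MRYVR", "MRFIR"],
    "KPC beta-lactamase": ["MSLYR", "MSLYK", "MSIYR"],
}
records = build_test(pool, n_families=2, per_family=2, seed=0)

# Write one JSON object per line into a temporary directory.
out_path = Path(tempfile.mkdtemp()) / "test_0.jsonl"
with out_path.open("w", encoding="utf-8") as handle:
    for record in records:
        handle.write(json.dumps(record) + "\n")

print(out_path.read_text(encoding="utf-8"))
```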

### Test results

**resistML** yielded 100% accuracy, f1 score, recall score and precision score in all 10 tests.

**resistBERT** was more unstable:

- On test_0, test_2, test_4, test_6, test_7, test_8 and test_9 it yielded 100% accuracy, f1 score, recall score and precision score
- On test_1 it yielded:
    1. Accuracy: 50%
    2. f1 score: 33%
    3. Precision: 25%
    4. Recall: 50%
- On test_3 it yielded 66.7% accuracy, f1 score, recall score and precision score
- On test_5 it yielded 50% accuracy, f1 score, recall score and precision score
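
When reading figures like these, it helps to see how macro-averaged scores behave when a classifier collapses onto a single class. The sketch below uses a hypothetical, perfectly balanced two-family test in which every sequence is predicted as the same family. It is only an arithmetic illustration, not a reconstruction of any of the tests above, and the averaging mode used in the notebook is not stated here.

```python
def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (zero denominator) are counted as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical balanced test: six sequences per family, all predicted as "VIM".
y_true = ["VIM"] * 6 + ["IMP"] * 6
y_pred = ["VIM"] * 12

accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the never-predicted family contributes zero precision, recall and F1, which pulls the macro averages well below the accuracy.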

All results for resistBERT can be found [in the dedicated notebook](http://github.com/AstraBert/resistML/scripts/test_resistBERT.ipynb).

## License and rights of usage

The [GitHub repository](http://github.com/AstraBert/resistML) is provided under the MIT license (more at [LICENSE](https://github.com/AstraBert/resistML/tree/main/LICENSE)).

If you use this work for your projects, please consider citing the author [Astra Bertelli](http://astrabert.vercel.app).

## References

1. **CARD - The Comprehensive Antibiotic Resistance Database**
2. **Biopython**
3. **Scikit-learn**
4. **Hugging Face's prot_bert Model**
5. **Hugging Face's AutoTrain**

If you feel that your work was relevant in building resistML and you weren't referenced in this section, feel free to open an issue on GitHub or to contact the author.