---
tags:
- dnabert
- bacteria
- kmer
- classification
- sequence-modeling
- DNA
library_name: transformers
---

# BacteriaCDS-DNABERT-K6-89M

`BacteriaCDS-DNABERT-K6-89M` is a **DNA sequence classifier** based on **DNABERT** and trained for **coding sequence (CDS) classification** in bacterial genomes. It operates on **6-mer tokenized sequences** and has **89M trainable parameters**.

## Model Details
- **Base Model:** DNABERT
- **Task:** Bacterial CDS Classification
- **K-mer Size:** 6
- **Input Sequence:** Open Reading Frame (last 510 nucleotides of the sequence)
- **Number of Trainable Parameters:** 89M
- **Max Sequence Length:** 512
- **Precision Used:** AMP (Automatic Mixed Precision)
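
As a rough illustration of the input window listed above, the sketch below keeps only the last 510 nucleotides of an open reading frame. The helper name `last_n_nucleotides` and the uppercase normalization are assumptions made for this example, not the authors' preprocessing code.

```python
def last_n_nucleotides(sequence: str, n: int = 510) -> str:
    """Return the last n nucleotides of a sequence (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slicing from the end leaves sequences shorter than n unchanged.
    return sequence.upper()[-n:]


# Synthetic ORF used only for illustration: start codon, 200 codons, stop codon.
orf = "ATG" + "GCT" * 200 + "TAA"
window = last_n_nucleotides(orf)
print(len(orf), len(window))
```

Sequences shorter than 510 nucleotides pass through unchanged, so the same call can be applied to every ORF before k-mer conversion.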

---

### **Install Dependencies**
Ensure you have `transformers` and `torch` installed:
```bash
pip install torch transformers
```

### **Load Model & Tokenizer**
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load Model
model_checkpoint = "Genereux-akotenou/BacteriaCDS-DNABERT-K6-89M"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```

### **Inference Example**
This model expects its input as space-separated 6-mers, so raw DNA sequences must be converted to k-mer format first:
```python
def generate_kmer(sequence: str, k: int, overlap: int = 1) -> str:
    """Split a DNA sequence into space-separated k-mers.

    Note: `overlap` is used as the step between consecutive k-mer start positions.
    """
    return " ".join(sequence[j:j + k] for j in range(0, len(sequence) - k + 1, overlap))

sequence = "ATGAGAACCAGCCGGAGACCTCCTGCTCGTACATGAAAGGCTCGAGCAGCCGGGCGAGGGCGGTAG"
seq_kmer = generate_kmer(sequence, k=6, overlap=3)

# Run inference
inputs = tokenizer(
    seq_kmer,
    return_tensors="pt",
    max_length=tokenizer.model_max_length,
    padding="max_length",
    truncation=True,
)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Index of the highest-scoring class; model.config.id2label maps it to a label name
    predicted_class = torch.argmax(logits, dim=-1).item()
```
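
To make the k-mer step concrete, the pure-Python sketch below repeats the `generate_kmer` helper from the example above on a short synthetic sequence and counts how many k-mers a 510-nucleotide input would produce. The `count_kmers` helper and the toy sequence are hypothetical, and the count excludes any special tokens the tokenizer may add.

```python
def generate_kmer(sequence: str, k: int, overlap: int = 1) -> str:
    # Same logic as above: `overlap` is the step between k-mer start positions.
    return " ".join(sequence[j:j + k] for j in range(0, len(sequence) - k + 1, overlap))


def count_kmers(length: int, k: int, step: int) -> int:
    """Number of k-mers produced for a sequence of `length` (hypothetical helper)."""
    return 0 if length < k else (length - k) // step + 1


toy = "ATGAAACCCGGGTTT"  # synthetic 15-nucleotide sequence
kmers = generate_kmer(toy, k=6, overlap=3)
print(kmers)                          # ATGAAA AAACCC CCCGGG GGGTTT
print(count_kmers(510, k=6, step=3))  # 169
```

With a step of 3, consecutive 6-mers share three nucleotides, so a 510-nucleotide window yields 169 k-mers, well under the 512-token maximum sequence length.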

<!-- ### **Citation**
If you use this model in your research, please cite:
```tex
@article{paper title,
  title={DNABERT for Bacterial CDS Classification},
  author={Genereux Akotenou, et al.},
  journal={Hugging Face Model Hub},
  year={2024}
}
``` -->