Commit 3173a3e
Parent: c13dc5f
Update README.md
README.md CHANGED
```diff
@@ -10,11 +10,11 @@ datasets:
 - InstaDeepAI/multi_species_genome
 - InstaDeepAI/nucleotide_transformer_downstream_tasks
 ---
-# nucleotide-transformer-v2-
+# nucleotide-transformer-v2-50m-multi-species
 
 The Nucleotide Transformers are a collection of foundational language models that were pre-trained on DNA sequences from whole genomes. Compared to other approaches, our models not only integrate information from single reference genomes, but also leverage DNA sequences from over 3,200 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Through robust and extensive evaluation, we show that these large models provide extremely accurate molecular phenotype prediction compared to existing methods.
 
-Part of this collection is the **nucleotide-transformer-v2-
+Part of this collection is the **nucleotide-transformer-v2-50m-multi-species**, a 50M-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms.
 
 **Developed by:** InstaDeep, NVIDIA and TUM
 
@@ -39,8 +39,8 @@ from transformers import AutoTokenizer, AutoModelForMaskedLM
 import torch
 
 # Import the tokenizer and the model
-tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-
-model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-
+tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species")
+model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species")
 
 # Create a dummy dna sequence and tokenize it
 sequences = ['ATTCTG' * 9]
@@ -68,7 +68,7 @@ print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
 
 ## Training data
 
-The **nucleotide-transformer-v2-
+The **nucleotide-transformer-v2-50m-multi-species** model was pretrained on a total of 850 genomes downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/). Plants and viruses are not included in these genomes, as their regulatory elements differ from those of interest in the paper's tasks. Some heavily studied model organisms were picked to be included in the collection of genomes, which represents a total of 174B nucleotides, i.e. roughly 29B tokens. The data has been released as a HuggingFace dataset [here](https://huggingface.co/datasets/InstaDeepAI/multi_species_genomes).
 
 ## Training procedure
```
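The usage snippet this commit touches appears only in fragments in the hunks above. For context, here is a minimal runnable sketch of that usage, reassembled from the visible lines; the `trust_remote_code` flag, the padding/masking details, and the mean-pooling step are assumptions filled in from the standard `transformers` masked-LM API, not part of this diff:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model (assumption: the v2 checkpoints ship
# custom modeling code on the Hub, hence trust_remote_code=True)
model_name = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# Create a dummy dna sequence and tokenize it
sequences = ['ATTCTG' * 9]
tokens_ids = tokenizer.batch_encode_plus(
    sequences, return_tensors="pt", padding=True
)["input_ids"]

# Forward pass, keeping all hidden states so the last layer can be pooled
attention_mask = tokens_ids != tokenizer.pad_token_id
with torch.no_grad():
    outs = model(
        tokens_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
    )

# Mean-pool the last hidden layer over non-padding tokens
embeddings = outs.hidden_states[-1]   # (batch, seq_len, hidden)
mask = attention_mask.unsqueeze(-1)   # (batch, seq_len, 1)
mean_sequence_embeddings = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
```

As a side note, the figures in the training-data paragraph are self-consistent: 174B nucleotides over roughly 29B tokens works out to about six nucleotides per token, consistent with 6-mer tokenization.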