Commit 3173a3e
Parent: c13dc5f
Update README.md
README.md CHANGED
```diff
@@ -10,11 +10,11 @@ datasets:
 - InstaDeepAI/multi_species_genome
 - InstaDeepAI/nucleotide_transformer_downstream_tasks
 ---
-# nucleotide-transformer-v2-
+# nucleotide-transformer-v2-50m-multi-species
 
 The Nucleotide Transformers are a collection of foundational language models that were pre-trained on DNA sequences from whole genomes. Compared to other approaches, our models not only integrate information from single reference genomes, but also leverage DNA sequences from over 3,200 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Through robust and extensive evaluation, we show that these large models provide extremely accurate molecular phenotype prediction compared to existing methods.
 
-Part of this collection is the **nucleotide-transformer-v2-
+Part of this collection is the **nucleotide-transformer-v2-50m-multi-species**, a 50M-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms.
 
 **Developed by:** InstaDeep, NVIDIA and TUM
 
@@ -39,8 +39,8 @@ from transformers import AutoTokenizer, AutoModelForMaskedLM
 import torch
 
 # Import the tokenizer and the model
-tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-
-model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-
+tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species")
+model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species")
 
 # Create a dummy dna sequence and tokenize it
 sequences = ['ATTCTG' * 9]
@@ -68,7 +68,7 @@ print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
 
 ## Training data
 
-The **nucleotide-transformer-v2-
+The **nucleotide-transformer-v2-50m-multi-species** model was pretrained on a total of 850 genomes downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/). Plants and viruses are not included in these genomes, as their regulatory elements differ from those of interest in the paper's tasks. Some heavily studied model organisms were picked to be included in the collection of genomes, which represents a total of 174B nucleotides, i.e. roughly 29B tokens. The data has been released as a HuggingFace dataset [here](https://huggingface.co/datasets/InstaDeepAI/multi_species_genomes).
 
 ## Training procedure
```
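The usage snippet this commit touches appears only in fragments in the hunks above. For context, here is a minimal runnable sketch of that usage, reassembled from the visible lines; the `trust_remote_code` flag, the padding/masking details, and the mean-pooling step are assumptions filled in from the standard `transformers` masked-LM API, not part of this diff:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model (assumption: the v2 checkpoints ship
# custom modeling code on the Hub, hence trust_remote_code=True)
model_name = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# Create a dummy dna sequence and tokenize it
sequences = ['ATTCTG' * 9]
tokens_ids = tokenizer.batch_encode_plus(
    sequences, return_tensors="pt", padding=True
)["input_ids"]

# Forward pass, keeping all hidden states so the last layer can be pooled
attention_mask = tokens_ids != tokenizer.pad_token_id
with torch.no_grad():
    outs = model(
        tokens_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
    )

# Mean-pool the last hidden layer over non-padding tokens
embeddings = outs.hidden_states[-1]   # (batch, seq_len, hidden)
mask = attention_mask.unsqueeze(-1)   # (batch, seq_len, 1)
mean_sequence_embeddings = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
```

As a side note, the figures in the training-data paragraph are self-consistent: 174B nucleotides over roughly 29B tokens works out to about six nucleotides per token, consistent with 6-mer tokenization.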