probablybots committed
Commit 0b20da3 · 1 Parent(s): 27f7f52
Update README.md

README.md CHANGED
@@ -2,14 +2,14 @@
tags:
- biology
---
# AIDO.DNA-7B
AIDO.DNA-7B is a DNA foundation model trained on 10.6 billion nucleotides from 796 species, enabling genome mining, in silico mutagenesis studies, gene expression prediction, and directed sequence generation.

By scaling model depth while maintaining a short context length of 4000 nucleotides, AIDO.DNA shows substantial improvements across a breadth of tasks in functional genomics using transfer learning, sequence generation, and unsupervised annotation of functional elements. Notably, AIDO.DNA outperforms prior encoder-only architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models.
<center><img src="DNA_RNA FM model architecture.png" alt="An Overview of AIDO.DNA-7B" style="width:70%; height:auto;" /></center>

## Model Architectural Details
AIDO.DNA-7B is based on the bidirectional transformer encoder (BERT) architecture with single-nucleotide tokenization, and is optimized using a masked language modeling (MLM) training objective.

To learn semantically meaningful representations, we employed a BERT-style encoder-only dense transformer architecture. We make minor updates to this architecture to align with current best practices, including the use of SwiGLU and LayerNorms. Additionally, we use Rotary Positional Embeddings (RoPE), since DNA syntax does not depend on absolute nucleotide positions; instead, nucleotides interact in highly local and context-specific ways. More details about the model architecture are given below (a minimal encoder-block sketch follows the table):
| Model Arch Component | Value |
@@ -21,14 +21,14 @@
| Vocab Size | 16 |
| Context Length | 4000 |
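
To make the architecture description above concrete, here is a minimal PyTorch sketch of one pre-LayerNorm encoder block combining bidirectional self-attention with rotary position embeddings (RoPE) and a SwiGLU feed-forward layer. The dimensions, helper names (`apply_rope`, `EncoderBlock`), and the specific RoPE variant are illustrative assumptions for this sketch, not the actual AIDO.DNA-7B implementation or configuration.

```python
# Illustrative sketch only: a single pre-LayerNorm encoder block with RoPE and SwiGLU,
# mirroring the components described above. Dimensions are placeholders, not the
# actual AIDO.DNA-7B configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embeddings to a (batch, heads, seq, head_dim) tensor."""
    b, h, s, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, device=x.device) / half))
    angles = torch.outer(torch.arange(s, device=x.device), freqs)  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class EncoderBlock(nn.Module):
    def __init__(self, hidden_size: int = 512, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.proj = nn.Linear(hidden_size, hidden_size)
        # SwiGLU feed-forward: gate and up projections, then down projection.
        ffn_dim = 4 * hidden_size
        self.gate = nn.Linear(hidden_size, ffn_dim)
        self.up = nn.Linear(hidden_size, ffn_dim)
        self.down = nn.Linear(ffn_dim, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        # Bidirectional (non-causal) self-attention with RoPE applied to queries and keys.
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        q, k = apply_rope(q), apply_rope(k)
        attn = F.scaled_dot_product_attention(q, k, v)  # no causal mask: encoder-only
        x = x + self.proj(attn.transpose(1, 2).reshape(b, s, -1))
        # SwiGLU MLP.
        h = self.norm2(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))
```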

## Pre-training of AIDO.DNA-7B
Here we briefly introduce the pre-training details of AIDO.DNA-7B. For more detailed information, please refer to [our paper](https://openreview.net/forum?id=Kis8tVUeNi&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DNeurIPS.cc%2F2024%2FWorkshop%2FAIDrugX%2FAuthors%23your-submissions)).

### Data
To test whether representation capacity has limited the development of DNA language models in previous studies, we utilize the dataset and splits from the Nucleotide Transformer. Starting from a total of `812` genomes with `712` for training, `50` for validation, and `50` for testing, we removed `17` entries which had been deleted from NCBI since the original dataset’s publication on Hugging Face. One of these was the important model organism Rattus norvegicus, which we replaced with the current reference genome. This resulted in `696` genomes for training, `50` for validation, and `50` for testing. We pre-trained AIDO.DNA-7B with a total of 10.6 billion training tokens.

### Training Details
The weights of our seven-billion-parameter model occupy over 200GB of memory in 32-bit precision. To train a model of this size, we used model parallelism to split training across 256 H100 GPUs using the Megatron-LM framework. We also employed bfloat16 mixed precision training and FlashAttention-2 to allow for training with a large context length at scale. With this configuration, AIDO.DNA-7B took 8 days to train. A minimal bfloat16 masked-language-modeling step is sketched after the hyperparameter table below.
| Hyper-params | Value |
| ------------- |:-------------:|
| Global Batch Size | 1024 |
@@ -37,13 +37,13 @@
| Total Iters | 100000 |
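
As a rough illustration of the bfloat16 masked-language-modeling setup referenced above, the following single-GPU PyTorch sketch performs one MLM step under bf16 autocast. It deliberately omits the Megatron-LM model parallelism and FlashAttention-2 used for the real run; the mask rate, token ids, and the `model`/`optimizer` objects are assumptions for the example, not the published training code.

```python
# Illustrative single-GPU sketch of one bfloat16 MLM step; the real run used
# Megatron-LM model parallelism across 256 H100 GPUs. The 15% mask rate, token ids,
# and model/optimizer objects are assumptions for this example only.
import torch
import torch.nn.functional as F

MASK_ID, PAD_ID, MASK_PROB = 5, 0, 0.15  # hypothetical ids and masking rate


def mlm_step(model, optimizer, tokens: torch.Tensor) -> float:
    """tokens: (batch, seq_len) integer nucleotide token ids."""
    # Randomly select positions to mask and keep the originals as labels.
    mask = (torch.rand_like(tokens, dtype=torch.float) < MASK_PROB) & (tokens != PAD_ID)
    labels = torch.where(mask, tokens, torch.full_like(tokens, -100))  # -100 is ignored
    inputs = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    optimizer.zero_grad(set_to_none=True)
    # Autocast runs matmuls in bfloat16 while parameters stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(inputs)  # (batch, seq_len, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```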

### Tokenization
To minimize bias and learn high-resolution single-nucleotide dependencies, we opted to align closely with the real data and use character-level tokenization with a 5-letter vocabulary: `A, T, C, G, N`, where `N` is commonly used in gene sequencing to denote uncertain elements. Sequences were also prefixed with a `[CLS]` token and suffixed with an `[EOS]` token as hooks for downstream tasks. We chose a context length of 4,000 nucleotides as the longest context that would fit within AIDO.DNA-7B during pretraining, and chunked our dataset of 796 genomes into non-overlapping segments.
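
The sketch below illustrates the tokenization scheme just described: chunk a genome into non-overlapping 4,000-nucleotide windows, then map each nucleotide to a character-level id and wrap the sequence in `[CLS]`/`[EOS]`. The id assignments are hypothetical and cover only part of the 16-entry vocabulary listed in the architecture table.

```python
# Illustrative character-level tokenizer matching the scheme described above.
# The id assignments are hypothetical and show only a subset of the model's
# 16-entry vocabulary.
from typing import List

CONTEXT_LEN = 4000
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[EOS]": 2, "[MASK]": 3,
         "A": 4, "T": 5, "C": 6, "G": 7, "N": 8}


def chunk_genome(sequence: str, chunk_len: int = CONTEXT_LEN) -> List[str]:
    """Split a genome into non-overlapping windows of at most chunk_len nucleotides."""
    return [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]


def tokenize(chunk: str) -> List[int]:
    """Character-level tokenization: [CLS] + one id per nucleotide + [EOS]."""
    body = [VOCAB.get(base.upper(), VOCAB["N"]) for base in chunk]  # unknown bases -> N
    return [VOCAB["[CLS]"]] + body + [VOCAB["[EOS]"]]


ids = tokenize(chunk_genome("ACGTN" * 1000)[0])
print(len(ids))  # 4002: 4000 nucleotides plus [CLS] and [EOS]
```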

## Evaluation of AIDO.DNA-7B
We evaluate the benefits of pretraining AIDO.DNA-7B by conducting a comprehensive series of experiments related to functional genomics, genome mining, metabolic engineering, synthetic biology, and therapeutics design, covering supervised, unsupervised, and generative objectives. Unless otherwise stated, hyperparameters were determined by optimizing model performance on a 10% validation split of the training data, and models were tested using the checkpoint with the lowest validation loss. For more detailed information, please refer to [our paper](https://doi.org/10.1101/2024.12.01.625444).

## Results
<center><img src="circle_benchmarks.png" alt="Downstream results of AIDO.DNA-7B" style="width:70%; height:auto;" /></center>

## How to Use
### Build any downstream models from this backbone with ModelGenerator
@@ -94,7 +94,7 @@

## Citation
Please cite AIDO.DNA using the following BibTeX code:
```
@inproceedings{ellington2024accurate,
title={Accurate and General {DNA} Representations Emerge from Genome Foundation Models at Scale},