Omni-DNA
pip install datasets ai2-olmo
Omni-DNA is a cross-modal, multi-task genomic foundation model designed to generalize across diverse genomic tasks. Unlike previous Genomic Foundation Models (GFMs), which require separate fine-tuning for each task, Omni-DNA leverages auto-regressive transformer-based training and multi-task fine-tuning, enabling a single model to perform a wide range of genomic tasks with state-of-the-art performance.
Omni-DNA models range from 20M to 1B parameters and support tasks such as sequence annotation, regulatory element classification, acetylation/methylation prediction, and DNA2Function/DNA2Image mapping.
Size | Training Tokens | Layers | Hidden Size | Attention Heads | Context Length |
---|---|---|---|---|---|
Omni-DNA 20M | 300B | 8 | 256 | 8 | 250 |
Omni-DNA 60M | 300B | 8 | 512 | 8 | 250 |
Omni-DNA 116M | 300B | 12 | 768 | 16 | 250 |
Omni-DNA 300M | 300B | 16 | 1024 | 16 | 250 |
Omni-DNA 700M | 300B | 16 | 1536 | 16 | 250 |
Omni-DNA 1B | 300B | 16 | 2048 | 16 | 250 |
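
To make the table easier to compare across sizes, the sketch below derives the per-head dimension for each configuration. It assumes the hidden size is split evenly across attention heads, as in standard multi-head attention; the source does not state this. The names `CONFIGS` and `head_dim` are illustrative and not part of any Omni-DNA API.

```python
# Rows copied from the table above: (name, layers, hidden size, attention heads).
CONFIGS = [
    ("Omni-DNA 20M", 8, 256, 8),
    ("Omni-DNA 60M", 8, 512, 8),
    ("Omni-DNA 116M", 12, 768, 16),
    ("Omni-DNA 300M", 16, 1024, 16),
    ("Omni-DNA 700M", 16, 1536, 16),
    ("Omni-DNA 1B", 16, 2048, 16),
]


def head_dim(hidden_size, num_heads):
    """Per-head dimension, assuming an even split of the hidden size (an assumption)."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden size is not divisible by the number of heads")
    return hidden_size // num_heads


for name, layers, hidden, heads in CONFIGS:
    print(f"{name}: {layers} layers, {heads} heads, head dim {head_dim(hidden, heads)}")
```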
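Omni-DNA is trained to perform multiple genomic tasks. The following script loads the zehui127/Omni-DNA-DNA2Function checkpoint and generates a text response for an input DNA sequence: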
import argparse
import json
import os
import re
import torch
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
def preprocess_response(response, mask_token="[MASK]"):
    """
    Preprocess the response to extract text after the [MASK] token.

    Args:
        response (str): The raw model output.
        mask_token (str): The token after which the response is extracted.

    Returns:
        str: Processed response text.
    """
    if mask_token in response:
        response = response.split(mask_token, 1)[1]
    response = re.sub(r'^[\sATGC]+', '', response)
    return response
def generate(message, model, tokenizer):
    message = message + "[MASK]"
    tokenized_message = tokenizer(
        [message], return_tensors='pt', return_token_type_ids=False, add_special_tokens=True
    ).to('cuda')
    response = model.generate(**tokenized_message, max_new_tokens=110, do_sample=False)
    reply = tokenizer.batch_decode(response, skip_special_tokens=True)[0]
    return preprocess_response(reply)
model_tokenizer_path = "zehui127/Omni-DNA-DNA2Function"
tokenizer = AutoTokenizer.from_pretrained(model_tokenizer_path)
model = AutoModelForCausalLM.from_pretrained(model_tokenizer_path).to('cuda')
# Define the input DNA sequence
dna = "TGCTGGCTTCAGGGGCACAGATGCTAACATTGGAGCGATACAGAGAAGATTAACGTGGCCACTGCGCAAGCATGACATGCAAACTCGTAAAGCATTCTTTTAATTT"
print(generate(dna, model, tokenizer))
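
The post-processing step can be checked without loading the model. The sketch below repeats `preprocess_response` from the script and feeds it made-up strings; the decoded texts are hypothetical, not real model output. It assumes the regex is applied whether or not the mask token survives decoding.

```python
import re


def preprocess_response(response, mask_token="[MASK]"):
    # Same logic as the script: keep the text after the mask token,
    # then strip any leading whitespace and DNA bases (A, T, G, C).
    if mask_token in response:
        response = response.split(mask_token, 1)[1]
    response = re.sub(r'^[\sATGC]+', '', response)
    return response


# Hypothetical decoded strings, for illustration only.
with_mask = "ACGTTGCA[MASK] example description"
without_mask = "ACGT TGCA example description"
capitalized = "ACGT[MASK] Catalyzes a reaction"

print(preprocess_response(with_mask))     # example description
print(preprocess_response(without_mask))  # example description
print(preprocess_response(capitalized))   # atalyzes a reaction
```

The last case shows a side effect of the pattern: because it removes every leading uppercase A, T, G, or C, a response that begins with one of those letters loses its first character. If that matters for a downstream use, the pattern may need to be tightened.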