Biosaic Tokenizer
Overview
Biosaic (Bio-Mosaic) is a tokenizer library built for Enigma2. It contains a tokenizer and an embedder for DNA and amino-acid (protein) sequences, together with VQ-VAE- and Evoformer-based encoders that can convert sequences into embeddings and back again for model-training use cases.
Features
- Tokenization: converts sequences into k-mers (DNA only); a short sketch follows this list.
- Encoding: converts sequences into embeddings for classification and training purposes.
- Easy to use: the library keeps the API basic and straightforward.
- SoTA encoders: the Evoformer & VQ-VAE models are inspired by AlphaFold2.
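For reference, k-mer tokenization just slides a window of width k along the sequence. Here is a minimal, library-independent sketch (the `kmerize` helper and the choice k = 3 are illustrative, not Biosaic's actual API):

```python
def kmerize(seq: str, k: int = 3) -> list[str]:
    """Slide a window of width k over a DNA string and collect the k-mers."""
    return [seq[i : i + k] for i in range(len(seq) - k + 1)]

print(kmerize("ATGCGT"))  # ['ATG', 'TGC', 'GCG', 'CGT']
```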
Models
The library ships two different models:
- for DNA tokenization & encoding: VQ-VAE
- for protein encoding: Evoformer
The VQ-VAE is designed at around 160M parameters (for now it is only around 40M, for test runs). The Evoformer is around 136M parameters (still in training).
Config:
For VQ-VAE:

```python
class ModelConfig:
    d_model: int = 768    # embedding width
    in_dim: int = 4       # input channels: the 4-letter DNA alphabet
    beta: float = 0.15    # commitment-loss weight for vector quantization
    dropout: float = 0.25
    n_heads: int = 16
    n_layers: int = 12
```
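The `beta` field is the commitment weight from the standard VQ-VAE objective. As a reference, here is a minimal sketch of that quantization step (the `vector_quantize` helper and the codebook handling are illustrative assumptions, not Biosaic's internals):

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.15):
    """Map encoder outputs to their nearest codebook vectors.

    z_e:      (N, d_model) encoder outputs
    codebook: (K, d_model) learned code vectors
    beta:     commitment weight (the ModelConfig.beta field above)
    """
    dists = torch.cdist(z_e, codebook)   # (N, K) pairwise distances
    idx = dists.argmin(dim=-1)           # nearest code per encoding: the discrete tokens
    z_q = codebook[idx]                  # (N, d_model) quantized vectors

    # codebook term pulls codes toward encodings; the beta-scaled
    # commitment term keeps encodings near their chosen codes
    loss = F.mse_loss(z_q, z_e.detach()) + beta * F.mse_loss(z_e, z_q.detach())

    # straight-through estimator so gradients still reach the encoder
    z_q = z_e + (z_q - z_e).detach()
    return z_q, idx, loss
```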
For EvoFormer:

```python
import torch

class ModelConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = 4           # DNA alphabet size
    C = 21          # 21 letters for amino acids & 4 for DNA
    d_msa = 768     # MSA-track width
    d_pair = 256    # pair-track width
    n_heads = 32
    n_blocks = 28
```
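The `d_msa` and `d_pair` widths follow AlphaFold2's two-track Evoformer layout: one representation per MSA position and one per residue pair. The sketch below shows those shapes and an outer-product-mean update, the standard AlphaFold2 op that feeds MSA information into the pair track (the projection width `c` and layer names are illustrative; Biosaic's block internals may differ):

```python
import torch

cfg = ModelConfig()
B, S, L = 1, 4, 64  # toy batch size / MSA depth / sequence length for the demo

# AlphaFold2-style two-track state with the widths from the config above
msa = torch.randn(B, S, L, cfg.d_msa)    # (batch, n_seqs, seq_len, d_msa)
pair = torch.randn(B, L, L, cfg.d_pair)  # (batch, seq_len, seq_len, d_pair)

# outer-product-mean: project the MSA down, take an outer product over
# residue pairs, average over MSA rows, and project into the pair width
c = 8  # illustrative projection width
proj_l = torch.nn.Linear(cfg.d_msa, c)
proj_r = torch.nn.Linear(cfg.d_msa, c)
to_pair = torch.nn.Linear(c * c, cfg.d_pair)

left = proj_l(msa)                                          # (B, S, L, c)
right = proj_r(msa)                                         # (B, S, L, c)
outer = torch.einsum("bsic,bsjd->bijcd", left, right) / S   # mean over MSA rows
pair = pair + to_pair(outer.flatten(-2))                    # (B, L, L, d_pair)
```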
Training:
Both the VQ-VAE and the Evoformer are trained in batches, each with its own separate Dataset class. That class takes raw sequence strings as input, one-hot encodes them (the DNA sequences first), and then fills them into batches according to the train & val splits, where validation takes around 20% of the full dataset.
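A minimal sketch of such a Dataset is below (the class name `DNADataset` and the truncation detail are assumptions; only the one-hot step and the ~20% validation split come from the description above):

```python
import torch
from torch.utils.data import Dataset, random_split

DNA = {c: i for i, c in enumerate("ACGT")}

class DNADataset(Dataset):
    """One-hot encodes raw DNA strings and serves fixed-size blocks."""

    def __init__(self, sequences, block_size=256):
        self.block_size = block_size
        self.data = [
            torch.eye(4)[torch.tensor([DNA[c] for c in seq])]  # (len, 4) one-hot
            for seq in sequences
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        # truncate to block_size; a real loader would also pad short sequences
        return self.data[i][: self.block_size]

def split_train_val(ds, val_frac=0.2):
    # validation takes ~20% of the full dataset, as noted above
    n_val = int(len(ds) * val_frac)
    return random_split(ds, [len(ds) - n_val, n_val])
```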
For VQ-VAE:

```python
import torch

class TrainConfig:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    learning_rate = 1e-4   # bumped from 1e-5
    weight_decay = 1e-4
    amsgrad = True
    warmup_epochs = 50     # linear warm-up
    epochs = 2000
    eval_interval = 100
    eval_iters = 30
    batch_size = 6
    block_size = 256
```
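A sketch of how these fields could drive a training loop, assuming the `TrainConfig` above is in scope. The toy autoencoder and random data are stand-ins for the real VQ-VAE and dataset, and the warm-up schedule is one plausible reading of the "linear warm-up" comment:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

cfg = TrainConfig()

# stand-ins for the real VQ-VAE and dataset: a toy autoencoder over
# one-hot DNA blocks of shape (block_size, 4)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4)
).to(cfg.device)
data = torch.eye(4)[torch.randint(0, 4, (96, cfg.block_size))]
loader = DataLoader(TensorDataset(data), batch_size=cfg.batch_size, shuffle=True)

# AdamW wired to the config fields; amsgrad=True matches the flag above
opt = torch.optim.AdamW(
    model.parameters(),
    lr=cfg.learning_rate,
    weight_decay=cfg.weight_decay,
    amsgrad=cfg.amsgrad,
)

# linear warm-up over the first warmup_epochs, then a constant rate
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda e: min(1.0, (e + 1) / cfg.warmup_epochs)
)

for epoch in range(cfg.epochs):
    for (x,) in loader:
        x = x.to(cfg.device)
        loss = F.mse_loss(model(x), x)  # the real model adds the VQ losses here
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()
    if epoch % cfg.eval_interval == 0:
        print(epoch, loss.item())  # real run: average over eval_iters val batches
```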
For EvoFormer:

```python
import torch

class TrainConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    LR = 1e-4
    WD = 1e-4
    AMS = True
    WARMUP = 50
    EPOCHS = 500
    BATCH = 8
    MSA_SEQ = 32   # number of sequences in each MSA
    L_SEQ = 256    # length of each sequence
    EVAL_ITERS = 5
    EVAL_INTV = 50
```
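Putting the batch fields together, one Evoformer training batch holds `BATCH` MSAs, each with `MSA_SEQ` aligned sequences of length `L_SEQ` over the 21-letter amino-acid alphabet. A quick shape check (random tokens here are only a placeholder for real aligned sequences):

```python
import torch
import torch.nn.functional as F

cfg = TrainConfig()

tokens = torch.randint(0, 21, (cfg.BATCH, cfg.MSA_SEQ, cfg.L_SEQ))
msa = F.one_hot(tokens, num_classes=21).float().to(cfg.DEVICE)
print(msa.shape)  # torch.Size([8, 32, 256, 21])
```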