Language: English | Tags: biology, alphafold, bio-compute

Biosaic Tokenizer

Overview

Biosaic (Bio-Mosaic) is a tokenizer library built for Enigma2. It provides a tokenizer and an embedder for DNA and amino-acid (protein) sequences, along with VQ-VAE and Evoformer based encoders that convert sequences into embeddings and back for model-training use cases.

Features

  • Tokenization: converts DNA sequences into overlapping k-mers (DNA only); see the sketch after this list.
  • Encoding: converts sequences into embeddings for classification and training.
  • Easy to use: a small, straightforward API.
  • SoTA encoders: the Evoformer and VQ-VAE models are inspired by AlphaFold-2.
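
A minimal sketch of the k-mer tokenization idea (the helper name and default k below are illustrative assumptions, not Biosaic's actual API):

def kmer_tokenize(seq: str, k: int = 4) -> list[str]:
  # slide a window of size k over the DNA sequence and collect overlapping k-mers
  seq = seq.upper()
  return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGCGTA", k=4))   # ['ATGC', 'TGCG', 'GCGT', 'CGTA']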

Models

It has two different models:

  • DNA tokenization & encoding: VQ-VAE
  • Protein encoding: EvoFormer

The VQ-VAE is around 160M parameters (currently only around 40M, for test runs), and the EvoFormer is around 136M parameters (still in training).
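
A minimal sketch of the vector-quantization step at the heart of a VQ-VAE (illustrative only, not Biosaic's exact code; the beta argument corresponds to the commitment-loss weight in the config below):

import torch
import torch.nn.functional as F

def quantize(z, codebook, beta=0.15):
  # z: encoder output (batch, length, d_model); codebook: (num_codes, d_model)
  dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
  idx = dists.argmin(dim=-1)                      # nearest codebook entry per position
  z_q = codebook[idx]                             # quantized embeddings (batch, length, d_model)
  # codebook loss plus beta-weighted commitment loss
  loss = F.mse_loss(z_q, z.detach()) + beta * F.mse_loss(z, z_q.detach())
  z_q = z + (z_q - z).detach()                    # straight-through gradient estimator
  return z_q, idx, loss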

Config:

import torch

# VQ-VAE model config
class ModelConfig:
  d_model: int = 768
  in_dim: int = 4          # one-hot DNA input (A, C, G, T)
  beta: float = 0.15       # commitment-loss weight
  dropout: float = 0.25
  n_heads: int = 16
  n_layers: int = 12

# EvoFormer model config
class ModelConfig:
  DEVICE   = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  A        = 4         # DNA alphabet size
  C        = 21        # amino-acid alphabet size (21 letters for amino acids, 4 for DNA)
  d_msa    = 768       # width of the MSA representation
  d_pair   = 256       # width of the pair representation
  n_heads  = 32
  n_blocks = 28
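
As in AlphaFold-2, d_msa and d_pair are the widths of the MSA and pair representations that the Evoformer blocks operate on; the tensor names and shapes below are an illustrative assumption, not Biosaic's internal variables:

import torch

B, S, L = 2, 32, 256                    # batch, sequences per MSA, sequence length
msa_repr  = torch.zeros(B, S, L, 768)   # per-residue features for every MSA row (d_msa = 768)
pair_repr = torch.zeros(B, L, L, 256)   # features for every residue pair (d_pair = 256)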

Training:

For training the VQ-VAE and EvoFormer models, batch training is preferred. Each model has its own separate Dataset class that takes the input strings, one-hot encodes the DNA sequences, and fills them into batches according to the train/val split, with the validation split around 20% of the full dataset.
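
A minimal sketch of the one-hot encoding and train/val split described above (class and helper names are illustrative, not Biosaic's exact Dataset implementation):

import torch
from torch.utils.data import Dataset

DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

class DNADataset(Dataset):
  # one-hot encodes DNA strings into (block_size, 4) tensors
  def __init__(self, sequences, block_size=256):
    self.block_size = block_size
    self.sequences = [s.upper()[:block_size] for s in sequences]

  def __len__(self):
    return len(self.sequences)

  def __getitem__(self, i):
    x = torch.zeros(self.block_size, 4)
    for j, base in enumerate(self.sequences[i]):
      x[j, DNA[base]] = 1.0
    return x

def train_val_split(data, val_frac=0.2):
  # hold out ~20% of the dataset for validation, as described above
  n_val = int(len(data) * val_frac)
  return data[n_val:], data[:n_val]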

For VQ-VAE:

import torch

# VQ-VAE training config
class TrainConfig:
  device        = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  learning_rate = 1e-4         # bumped from 1e-5
  weight_decay  = 1e-4
  amsgrad       = True
  warmup_epochs = 50           # linear warm‑up
  epochs        = 2000
  eval_interval = 100
  eval_iters    = 30
  batch_size    = 6
  block_size    = 256
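
warmup_epochs implies a linear warm-up of the learning rate; a sketch of one plausible schedule (holding the rate constant after warm-up is an assumption):

def lr_at(epoch, base_lr=1e-4, warmup_epochs=50):
  # scale the learning rate linearly from ~0 to base_lr over the warm-up epochs
  if epoch < warmup_epochs:
    return base_lr * (epoch + 1) / warmup_epochs
  return base_lr

# applied per epoch, e.g.:
# for g in optimizer.param_groups:
#   g["lr"] = lr_at(epoch)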

For EvoFormer:

import torch

# EvoFormer training config
class TrainConfig:
  DEVICE       = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  LR           = 1e-4
  WD           = 1e-4
  AMS          = True
  WARMUP       = 50
  EPOCHS       = 500
  BATCH        = 8
  MSA_SEQ      = 32       # number of sequences in each MSA
  L_SEQ        = 256      # length of each sequence
  EVAL_ITERS   = 5
  EVAL_INTV    = 50
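
BATCH, MSA_SEQ, and L_SEQ together define the shape of one training batch; a purely illustrative example:

import torch

# one EvoFormer training batch under the config above: 8 MSAs, each with
# 32 aligned sequences of length 256, tokens drawn from the 21-letter amino-acid alphabet
batch = torch.randint(0, 21, (8, 32, 256))   # (BATCH, MSA_SEQ, L_SEQ)
print(batch.shape)                           # torch.Size([8, 32, 256])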