---
datasets:
- michaelm16/GuideRNA-3B
license: apache-2.0
tags:
- biology
---

# CRISPR-viva

CRISPR-viva is a universal, host cell context-aware guide RNA design framework, facilitated by a foundation model, for CRISPR-based amplification-free detection and inhibition of RNA viruses.

![CRISPR-viva Schema](images/fig1.png)

## Model Details

### Model Description

The foundation model module of CRISPR-viva is an LLM that uses the causal attention mechanism to mimic the DNA-RNA and RNA-RNA interactions that occur during CRISPR editing events, in which the guide RNA binds the host or viral DNA/RNA in a specific order determined by the type of CRISPR system. The foundation model is therefore designed and trained to learn representations of the sequence interactions between guide RNAs and cellular DNA/RNAs as well as viral RNAs, an approach expected to significantly improve performance on the downstream tasks of virus detection and inhibition.

The foundation model is pretrained on [GuideRNA-3B](https://huggingface.co/datasets/michaelm16/GuideRNA-3B), which contains candidate guide RNA sequences from the genomes and transcriptomes of 23 cell lines and from 26,464 viral segmented genomes of RNA viruses.

## Uses

### Downstream Use

* The fine-tuning strategy is applied in medium data volume scenarios: LwCas13a, Cas12a and Cas13d.
* The few-shot adaptation strategy is applied to newly developed CRISPR systems with little available data, including the Cas12b and LbuCas13a systems, to achieve robust generalizability.

### Preprocessing

#### Sequence encoding

We modified the attention block, which is widely used in research and industrial applications, to construct our sequence attention module, the building block of the CRISPR-viva system.
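
The padding, digit encoding and complement rule that this section goes on to describe can be sketched in Python. This is a minimal illustration, not the project's actual preprocessing code: the function names `encode_sequence` and `complement_code` are hypothetical, and the digit mapping (A=1, C=2, N=3, G=4, T=5) is an assumption inferred from the worked example in this section rather than taken from a stated table.

```python
# Assumed mapping, inferred from the worked example in this section.
ENCODING = {"A": "1", "C": "2", "N": "3", "G": "4", "T": "5"}
TARGET_LENGTH = 35  # input length in bp, as stated in the text


def encode_sequence(seq: str, length: int = TARGET_LENGTH) -> str:
    """Pad a nucleotide sequence with 'N' to `length` and encode it as digits."""
    seq = seq.upper()
    if len(seq) > length:
        raise ValueError(f"sequence is longer than {length} bp")
    padded = seq.ljust(length, "N")  # right-pad short sequences with 'N'
    try:
        return "".join(ENCODING[base] for base in padded)
    except KeyError as exc:
        raise ValueError(f"unexpected nucleotide: {exc.args[0]!r}") from None


def complement_code(encoded: str) -> str:
    """Encode the complementary strand by subtracting each digit from 6."""
    return "".join(str(6 - int(digit)) for digit in encoded)


encoded = encode_sequence("GGAAAGCAGCAGATGGCAGGACATGGGCTGGAG")
print(encoded)                   # 44111421421415442144121544425441433
print(complement_code(encoded))  # 22555245245251224522545122241225233
```

Under this mapping, subtracting from 6 swaps A with T and C with G while leaving the padding symbol N unchanged, which is why the complement needs no separate lookup table.
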
The input of the module is a sequence of nucleotide characters drawn from “A”, “C”, “G”, “T” and “N”. The sequence length must be 35 bp, and sequences shorter than 35 bp are padded with “N”. Each nucleotide is encoded as a single digit, so the encoded complementary strand can be calculated by subtracting each digit from 6. For example, “GGAAAGCAGCAGATGGCAGGACATGGGCTGGAGNN” is encoded as “44111421421415442144121544425441433” (here A, C, N, G and T correspond to 1, 2, 3, 4 and 5), and the encoded complementary strand is thus 6 − “44111421421415442144121544425441433”, taken digit by digit, = “22555245245251224522545122241225233”.

### Compute Infrastructure

#### Hardware

* Four NVIDIA GeForce RTX 4090 GPUs to train the foundation model
* One NVIDIA GeForce RTX 4090 GPU to train each task-specialized model

#### Software

* pytorch >= 2.3
* dask == 2024.5.0
* transformers == 4.29.2
* datasets == 2.19.1

## Citation