GeneMamba: Efficient and Effective Large Cell Model on Single Cell Data

Model Description

GeneMamba is a pretrained model for analyzing single-cell RNA sequencing (scRNA-seq) data. It is built on the Mamba state space architecture rather than a transformer, and it represents single-cell data by treating cells as sentences and genes as tokens. GeneMamba is optimized for long sequences and supports context lengths of up to 8192 tokens. This lets the model include low-expression genes that were previously ignored, giving a more complete view of each cell's gene expression profile.
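To make the "cells as sentences, genes as tokens" idea concrete, here is a minimal sketch of how one cell's expression profile could be turned into a token sequence. The ordering rule (descending expression, zero-expression genes dropped) and the helper name `cell_to_tokens` are assumptions for illustration; the actual GeneMamba tokenizer may work differently. Only the 8192-token limit comes from the description above.

```python
from typing import Dict, List

MAX_CONTEXT = 8192  # context length stated in the model description


def cell_to_tokens(expression: Dict[str, float], max_len: int = MAX_CONTEXT) -> List[str]:
    """Turn one cell's expression profile into an ordered list of gene tokens.

    Assumption: genes are ordered by descending expression and genes with
    zero expression are dropped. This is a hypothetical helper, not the
    GeneMamba tokenizer itself.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort from high to low expression; break ties by gene name for determinism.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    # Truncate to the context window, so the lowest-expressed genes are cut first.
    return [gene for gene, _ in expressed[:max_len]]


cell = {"CD3E": 4.2, "GAPDH": 9.1, "MS4A1": 0.0, "IL7R": 0.3}
tokens = cell_to_tokens(cell)
print(tokens)  # ['GAPDH', 'CD3E', 'IL7R']
```

With a short context window the truncation step removes the low-expression tail, which is why a longer window keeps more of it.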

Key Features

  • Scalability: Trained on over 50 million cells, offering robust generalization across datasets.
  • Versatility: Supports multiple tasks, including gene classification, pathway analysis, and gene-pair correlations.
  • Pretrained Efficiency: Leverages large-scale pretraining to encode gene relationships effectively.

Applications

GeneMamba is suited to tasks that depend on modeling gene-gene interactions across a cell's full expression profile, such as:

  • Cell Type Prediction: Facilitates accurate classification of cell types.
  • Gene Pathway Analysis: Uncovers complex relationships between genes and pathways.
  • Context-Aware Gene Correlation: Detects gene expression patterns influenced by broader biological contexts.
  • Gene Ranking Reconstruction: Evaluates gene importance by reconstructing ranking correlations.

Training Dataset

GeneMamba was trained on a diverse collection of scRNA-seq datasets, encompassing various tissue types, species, and experimental conditions. Preprocessing steps included normalization (sc.pp.normalize_total) and logarithmic transformation (sc.pp.log1p) to ensure robust handling of variability.
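The two preprocessing steps can be sketched in plain NumPy to show what they compute: each cell is scaled to a common total count, then log(1 + x) is applied. The `target_sum` of 1e4 is an assumption, since the card does not state the value used, and the function name `normalize_and_log` is hypothetical. With Scanpy, the corresponding calls would be `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`.

```python
import numpy as np


def normalize_and_log(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    target_sum=1e4 is an assumed value, not one taken from the model card.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # keep empty cells as all zeros instead of dividing by zero
    return np.log1p(counts / totals * target_sum)


raw = np.array([[1, 3, 0], [10, 0, 10]])  # toy matrix: 2 cells x 3 genes
processed = normalize_and_log(raw)
```

After this step, cells with different sequencing depths are on a comparable scale, and the log transform compresses the range of highly expressed genes.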


Usage

The model can be loaded with the Hugging Face Transformers library and then fine-tuned for specific scRNA-seq tasks:

import torch
from transformers import AutoModel, AutoTokenizer

# "your-hf-repo/GeneMamba" is a placeholder; substitute the actual repository ID.
model = AutoModel.from_pretrained("your-hf-repo/GeneMamba")
tokenizer = AutoTokenizer.from_pretrained("your-hf-repo/GeneMamba")
model.eval()  # inference mode; omit when fine-tuning

# Example: encode a cell as a space-separated sequence of gene tokens
inputs = tokenizer(["Gene1 Gene2 Gene3 ..."], return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)

Supported Tasks

  • Token classification
  • Sequence-to-sequence modeling
  • Embedding generation for downstream analysis
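
For embedding generation, per-token outputs usually have to be reduced to one vector per cell. The following is a minimal sketch of masked mean pooling in NumPy. The pooling strategy and the function name `mean_pool` are assumptions for illustration; the card does not say how GeneMamba derives cell embeddings.

```python
import numpy as np


def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token embeddings per cell, ignoring padded positions.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)    # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)           # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # (batch, 1); avoids division by zero
    return summed / counts


hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # last position is padding
mask = np.array([[1, 1, 0]])
cell_embedding = mean_pool(hidden, mask)
print(cell_embedding)  # [[2. 3.]]
```

The padded position is excluded from the average, so batches with cells of different lengths yield comparable embeddings.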

Performance Metrics

GeneMamba has been benchmarked against other state-of-the-art models like scGPT and Geneformer across 10 tasks, consistently achieving competitive rankings. Metrics include:

  • Clusterness Score (CTS)
  • Hopkins Score (HS)
  • Average Task Performance (bubble plot visualization available in supplementary material)
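
As an illustration of the kind of quantity a Hopkins-style score measures, here is a sketch of the Hopkins statistic in a common textbook form. The card does not define how HS is computed for GeneMamba, so the formulation (plain nearest-neighbor distances, uniform sampling inside the bounding box), the sample size, and the function name `hopkins_statistic` are all assumptions.

```python
import numpy as np


def hopkins_statistic(X: np.ndarray, n_samples: int = 50, seed: int = 0) -> float:
    """Hopkins statistic: around 0.5 for uniform data, close to 1 for clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = min(n_samples, n - 1)

    # u: distance from uniformly drawn points (inside the bounding box of X)
    # to their nearest data point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    # w: distance from sampled data points to their nearest *other* data point.
    idx = rng.choice(n, size=m, replace=False)
    dists = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dists[np.arange(m), idx] = np.inf  # exclude each point's distance to itself
    w = dists.min(axis=1)

    return float(u.sum() / (u.sum() + w.sum()))


# Toy data: two tight, well-separated clusters standing in for cell embeddings.
rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(10, 0.1, (100, 2))])
score = hopkins_statistic(clustered)
```

Applied to cell embeddings, a value well above 0.5 suggests the embedding space has cluster structure rather than a uniform spread.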

Limitations

  • The model may require fine-tuning for datasets with novel cell types or experimental conditions.
  • Performance might vary for genes or pathways underrepresented in the training dataset.

Ethical Considerations

GeneMamba is intended for research and educational purposes. Users should validate results before applying them in clinical or commercial settings. Model usage should comply with relevant ethical guidelines and regulations.


Citation

If you use GeneMamba in your research, please cite:

@article{GeneMamba2024,
  title={GeneMamba: Efficient and Effective Large Cell Model on Single Cell Data},
  author={Cong, et al.},
  journal={Under Review},
  year={2024}
}

Model size: 65.7M params (Safetensors, tensor type F32)