
scMulan Model

Model Details

Model Description

scMulan is a generative foundation model for single-cell gene expression data.

Intended Use

scMulan is designed for researchers working with single-cell RNA sequencing (scRNA-seq) data. It can be used for:

  • Zero-shot cell type annotation
  • Zero-shot batch integration
  • Conditional cell generation

Training Data

The model was trained on hECA-10M, a subset of the hECA dataset containing more than 10 million high-quality single-cell transcriptomes from vital human organs and tissues. The 2000 most highly variable genes across the dataset were selected as input features. Each transcriptome is accompanied by metadata attributes, including organ, donor age, donor gender, sequencing technology, and cell type.

Performance

  • In the paper, scMulan achieves higher cell type prediction accuracy than scGPT, Geneformer, and CellTypist.
  • On batch integration, it is competitive with a fine-tuned scGPT model and outperforms the other tested models.
  • Conditional generation quality is evaluated through Q-Q plots and UMAPs.
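
A Q-Q plot compares matching quantiles of two distributions; points close to the diagonal suggest that generated expression values are distributed like real ones. The sketch below is not the paper's evaluation code. It only illustrates the idea for a single gene using the standard library, and both input lists are made-up values.

```python
from statistics import quantiles

def qq_pairs(real, generated, n=4):
    """Return (real_quantile, generated_quantile) pairs at the n-1 cut points."""
    real_q = quantiles(real, n=n, method="inclusive")
    gen_q = quantiles(generated, n=n, method="inclusive")
    return list(zip(real_q, gen_q))

# Hypothetical expression values for one gene (illustration only)
real_expr = [0.0, 0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 9.0]
generated_expr = [0.0, 0.0, 0.0, 1.0, 3.0, 4.0, 8.0, 10.0]

pairs = qq_pairs(real_expr, generated_expr)
# Largest distance from the diagonal; 0 would mean identical quantiles
max_gap = max(abs(r - g) for r, g in pairs)
print(pairs, max_gap)
```

Plotting the pairs as a scatter against the line y = x gives the usual Q-Q view.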

Limitations

  • The pretrained model has only seen 2000 genes.
  • The generated data has greater cell sparsity than real data.
  • The authors' GitHub repository does not document how to run the model for generation.
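
Because the pretrained model has only seen 2000 genes, it can help to check how much of that gene set a new dataset actually covers before running inference. The following is a minimal sketch with hypothetical gene lists; `gene_overlap` is an illustrative helper, not part of any package mentioned here.

```python
def gene_overlap(dataset_genes, model_genes):
    """Return (shared genes, fraction of the model vocabulary covered)."""
    model_set = set(model_genes)
    shared = [g for g in dataset_genes if g in model_set]
    coverage = len(set(shared)) / len(model_set) if model_set else 0.0
    return shared, coverage

# Hypothetical gene lists; a real model vocabulary would hold 2000 entries
model_vocab = ["CD3D", "CD3E", "MS4A1", "LYZ"]
dataset_genes = ["CD3D", "LYZ", "GAPDH", "ACTB", "MS4A1"]

shared, coverage = gene_overlap(dataset_genes, model_vocab)
print(f"{len(shared)} shared genes, {coverage:.0%} of the model vocabulary covered")
```

With an AnnData object, `adata.var_names` would supply `dataset_genes`; where the model's gene list can be obtained depends on the package and is not covered here.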

Ethical Considerations

Users should be aware that while the data used to train scMulan is anonymized, it represents human tissue samples and should be treated with appropriate respect and consideration. Researchers using this model should adhere to ethical guidelines for human subjects research.

Usage

To use the scMulan model within the DeepLife ML Infra:

  1. Install the package:

    pip install deeplife-mlinfra
    
  2. Import and use the model:

    import anndata as ad
    from huggingface_hub import hf_hub_download
    from dl_models.models.scmulan.model import ScMulanModel
    from dl_models.models.scmulan.processor import ScMulanProcessor
    
    # Load the model and preprocessor
    model = ScMulanModel.from_pretrained("deeplife/scmulan_model")
    preprocessor = ScMulanProcessor.from_pretrained("deeplife/scmulan_model")
    model.eval()
    
    # Load your data (example using a sample dataset)
    filepath = hf_hub_download(
        repo_id="deeplife/h5ad_samples",
        filename="GSE136831small.h5ad",
        repo_type="dataset",
    )
    adata = ad.read_h5ad(filepath)
    
    # Preprocess and create a dataloader
    dataloader = preprocessor.transform_to_dataloader(adata, batch_size=256)
    
    # Get embeddings and cell type predictions for the first batch only
    for batch in dataloader:
        coarse_cell_types, fine_cell_types, hidden = model.get_cell_types_and_embeddings(batch)
        print(coarse_cell_types)
        print(fine_cell_types)
        break
    

For more detailed usage instructions, please refer to the documentation.
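
The loop in the example stops after the first batch. To annotate a whole dataset, the predictions from every batch need to be gathered. The sketch below shows one way to do that, assuming that `get_cell_types_and_embeddings` returns two iterable sequences of labels plus the hidden states (the example above only prints them, so their exact types are an assumption). `collect_cell_types` and `fake_predict` are hypothetical names; the stand-in predictor lets the snippet run without the model.

```python
from collections import Counter

def collect_cell_types(batches, predict_fn):
    """Run predict_fn on every batch and gather the predicted labels."""
    coarse_all, fine_all = [], []
    for batch in batches:
        coarse, fine, _hidden = predict_fn(batch)
        coarse_all.extend(coarse)
        fine_all.extend(fine)
    return coarse_all, fine_all

# Stand-in for model.get_cell_types_and_embeddings (illustration only)
def fake_predict(batch):
    coarse = ["T cell" if x % 2 == 0 else "B cell" for x in batch]
    fine = [f"{c} subtype" for c in coarse]
    return coarse, fine, None

coarse_all, fine_all = collect_cell_types([[0, 1], [2, 3, 4]], fake_predict)
# Tally how many cells received each coarse label
print(Counter(coarse_all).most_common())
```

With the real model, the call would presumably be `collect_cell_types(dataloader, model.get_cell_types_and_embeddings)`, provided the return types match the assumption above.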

Citation

If you use this model in your research, please cite both the original scMulan paper and the DeepLife ML Infra package:

@InProceedings{10.1007/978-1-0716-3989-4_57,
author="Bian, Haiyang and Chen, Yixin and Dong, Xiaomin and Li, Chen and Hao, Minsheng and Chen, Sijie and Hu, Jinyi and Sun, Maosong and Wei, Lei and Zhang, Xuegong",
editor="Ma, Jian",
title="scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis",
booktitle="Research in Computational Molecular Biology",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="479--482",
isbn="978-1-0716-3989-4"
}

@software{deeplife_mlinfra,
  title={DeepLife ML Infra: Infrastructure for Biological Deep Learning Models},
  author={DeepLife AI Team},
  year={2023},
  url={https://github.com/deeplifeai/deeplife-mlinfra},
  version={1.0.0}
}

Contact

For questions or issues related to this model implementation in DeepLife ML Infra, please open an issue in the repository.

For questions about the original scMulan model, please contact the authors of the paper.

Model size: 369M params (Safetensors, tensor type F32)