Universal Cell Embedding (UCE) Model

Model Details

Model Description

UCE is a foundation model for single-cell gene expression that learns a universal representation of cell biology. It can generate representations of new single-cell gene expression datasets without any fine-tuning or retraining, while remaining robust to dataset- and batch-specific artifacts.

Key Features

  • Zero-shot embedding capabilities for new datasets and species
  • No requirement for cell type annotation or input dataset preprocessing
  • Applicable to any set of genes from any species, even if they aren't homologs of genes seen during training
  • Learns a universal representation of cell biology that is intrinsically meaningful

Intended Use

UCE is designed for researchers working with single-cell RNA sequencing (scRNA-seq) data. It can be used for:

  • Analyzing and integrating diverse scRNA-seq datasets
  • Mapping new data into a universal embedding space
  • Identifying novel cell types and their functions
  • Cross-dataset discoveries and comparisons
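
One way to picture cross-dataset comparison in a shared embedding space is nearest-neighbor label transfer: each cell in a new dataset takes the majority label of its most similar annotated reference cells. The sketch below is illustrative only and is not part of UCE or DeepLife ML Infra. The function name `cosine_knn_label_transfer`, the choice of cosine similarity, and the tiny synthetic 3-dimensional "embeddings" are all assumptions made for the example.

```python
import numpy as np


def cosine_knn_label_transfer(ref_embed, ref_labels, query_embed, k=5):
    """Give each query cell the majority label of its k most cosine-similar reference cells."""
    ref = np.asarray(ref_embed, dtype=float)
    query = np.asarray(query_embed, dtype=float)
    labels = np.asarray(ref_labels)

    # L2-normalize rows so that a dot product equals cosine similarity.
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    query = query / np.linalg.norm(query, axis=1, keepdims=True)

    sims = query @ ref.T  # shape: (n_query, n_ref)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest reference cells

    predicted = []
    for neighbours in top_k:
        values, counts = np.unique(labels[neighbours], return_counts=True)
        predicted.append(values[np.argmax(counts)])
    return np.array(predicted)


# Synthetic stand-in for real embeddings: two well-separated groups of cells.
rng = np.random.default_rng(0)
centers = {
    "T cell": np.array([1.0, 0.0, 0.0]),
    "B cell": np.array([0.0, 1.0, 0.0]),
}
ref_embed = np.vstack([c + 0.05 * rng.normal(size=(20, 3)) for c in centers.values()])
ref_labels = np.repeat(list(centers.keys()), 20)
query_embed = np.vstack([
    centers["T cell"] + 0.05 * rng.normal(size=(3, 3)),
    centers["B cell"] + 0.05 * rng.normal(size=(3, 3)),
])

predicted = cosine_knn_label_transfer(ref_embed, ref_labels, query_embed, k=5)
print(predicted)
```

With real data, `ref_embed` and `query_embed` would be embeddings of an annotated and an unannotated dataset; whether cosine similarity or another metric works best is a choice to validate on your own data.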

Training Data

The model was trained on a corpus of cell atlas data from human and other species, including:

  • Over 36 million cells
  • More than 1,000 uniquely named cell types
  • Hundreds of experiments
  • Dozens of tissues
  • Eight species (human, mouse, mouse lemur, zebrafish, pig, rhesus macaque, crab-eating macaque, western clawed frog)

Performance

UCE has demonstrated superior performance in zero-shot embedding tasks compared to other self-supervised transformer-based methods. It has shown the ability to:

  • Accurately embed and cluster cell types from new, unseen datasets
  • Align datasets from novel species without additional training
  • Capture meaningful biological variation despite the presence of experimental noise

Limitations

  • The model's performance may vary for extremely rare or specialized cell types not well-represented in the training data
  • UCE can handle data from new species, but its performance may degrade for species that are only distantly related to those in the training set
  • The model does not account for information contained in raw RNA transcripts, such as genetic variation and RNA-splicing processes

Ethical Considerations

Users should be aware that while the data used to train UCE is anonymized, it represents human tissue samples and should be treated with appropriate respect and consideration. Researchers using this model should adhere to ethical guidelines for human subjects research.

Usage

To use the UCE model with the DeepLife ML Infra package:

  1. Install the package:

    pip install deeplife-mlinfra
    
  2. Import and use the model:

    import anndata as ad
    import torch
    from huggingface_hub import hf_hub_download
    from dl_models.models.uce.model import UCEmbedModel
    from dl_models.models.uce.processor import UCEProcessor
    
    # Load the model and preprocessor
    model = UCEmbedModel.from_pretrained("deeplife/uce_model")
    preprocessor = UCEProcessor.from_pretrained("deeplife/uce_model")
    model.eval()  # Inference mode (disables dropout)
    
    # Load your data (example using a sample dataset)
    filepath = hf_hub_download(
        repo_id="deeplife/h5ad_samples",
        filename="GSE136831small.h5ad",
        repo_type="dataset",
    )
    adata = ad.read_h5ad(filepath)
    
    # Preprocess and create a dataloader
    dataloader = preprocessor.transform_to_dataloader(adata, batch_size=256)
    
    # Get embeddings; gradients are not needed for inference
    with torch.no_grad():
        for batch in dataloader:
            embed = model.get_cell_embeddings(batch)
            break  # Only the first batch is embedded in this example
    
    # You can now use these embeddings for downstream tasks
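
The loop above stops after the first batch. To embed a whole dataset, the per-batch outputs can be collected and concatenated. The following is a minimal sketch under stated assumptions: `collect_embeddings` and `to_numpy` are hypothetical helpers (not part of DeepLife ML Infra), and the stand-in `fake_embed` function and `fake_batches` list only mimic a model and dataloader so the example runs on its own.

```python
import numpy as np


def to_numpy(output):
    """Convert a batch of embeddings to a NumPy array.

    Objects with a .detach() method (such as torch tensors) are moved to CPU
    and converted; anything else is passed to np.asarray.
    """
    if hasattr(output, "detach"):
        output = output.detach().cpu().numpy()
    return np.asarray(output)


def collect_embeddings(embed_fn, batches):
    """Apply embed_fn to every batch and stack the results along the cell axis."""
    chunks = [to_numpy(embed_fn(batch)) for batch in batches]
    return np.concatenate(chunks, axis=0)


# Stand-ins for a real model and dataloader (assumptions for illustration only).
def fake_embed(batch):
    return np.asarray(batch, dtype=float) * 2.0


fake_batches = [np.ones((4, 3)), np.ones((4, 3)), np.ones((2, 3))]

all_embed = collect_embeddings(fake_embed, fake_batches)
print(all_embed.shape)
```

With the objects from the usage example, the call would presumably be `collect_embeddings(model.get_cell_embeddings, dataloader)`, ideally inside `torch.no_grad()`. This assumes `get_cell_embeddings` returns one row per cell and that the dataloader yields cells in their original order; check both before attaching the result to `adata`.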
    

For visualization of the embeddings, you can use techniques like PCA or UMAP:

    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    import umap

    # Convert embed (the first batch of embeddings from above) to numpy
    embed_np = embed.detach().cpu().numpy()

    # Perform PCA
    pca = PCA(n_components=2)
    embed_pca = pca.fit_transform(embed_np)

    # Perform UMAP
    umap_reducer = umap.UMAP(n_components=2, random_state=42)
    embed_umap = umap_reducer.fit_transform(embed_np)

    # Plot the results side by side
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

    # PCA plot (points are not color-coded, so no colorbar is drawn)
    ax1.scatter(embed_pca[:, 0], embed_pca[:, 1], alpha=0.7)
    ax1.set_title('UCE Embeddings - PCA')
    ax1.set_xlabel('PC1')
    ax1.set_ylabel('PC2')

    # UMAP plot
    ax2.scatter(embed_umap[:, 0], embed_umap[:, 1], alpha=0.7)
    ax2.set_title('UCE Embeddings - UMAP')
    ax2.set_xlabel('UMAP1')
    ax2.set_ylabel('UMAP2')

    plt.tight_layout()
    plt.show()

For more detailed usage instructions, please refer to the documentation.

Citation

If you use this model in your research, please cite both the original UCE paper and the DeepLife ML Infra package:

@article{rosen2023universal,
  title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
  author={Rosen, Yanay and Roohani, Yusuf and Agrawal, Ayush and Samotorcan, Leon and {Tabula Sapiens Consortium} and Quake, Stephen R and Leskovec, Jure},
  journal={bioRxiv},
  pages={2023--11},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

@software{deeplife_mlinfra,
  title={DeepLife ML Infra: Infrastructure for Biological Deep Learning Models},
  author={DeepLife AI Team},
  year={2023},
  url={https://github.com/deeplifeai/deeplife-mlinfra},
  version={1.0.0}
}

Contact

For questions or issues related to this model implementation in DeepLife ML Infra, please open an issue in the repository.

For questions about the original UCE model, please contact the authors of the paper.

Model size: 851M params (Safetensors, tensor type F32)