Universal Cell Embedding (UCE) Model
Model Details
- Model Name: Universal Cell Embedding (UCE)
- Version: 1.0 [deeplife version]
- Type: Foundation model for single-cell biology
- Original Paper: Universal Cell Embeddings: A Foundation Model for Cell Biology
- Original Implementation: UCE GitHub Repository
Model Description
UCE is a foundation model for single-cell gene expression that learns a universal representation of cell biology. It can generate representations of new single-cell gene expression datasets with no model fine-tuning or retraining while remaining robust to dataset and batch-specific artifacts.
Key Features
- Zero-shot embedding capabilities for new datasets and species
- No requirement for cell type annotation or input dataset preprocessing
- Applicable to any set of genes from any species, even if they aren't homologs of genes seen during training
- Learns a universal representation of cell biology that is intrinsically meaningful
Intended Use
UCE is designed for researchers working with single-cell RNA sequencing (scRNA-seq) data. It can be used for:
- Analyzing and integrating diverse scRNA-seq datasets
- Mapping new data into a universal embedding space
- Identifying novel cell types and their functions
- Cross-dataset discoveries and comparisons
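Mapping new data into a shared embedding space makes simple nearest-neighbor comparisons across datasets possible. The following is a minimal sketch of k-nearest-neighbor label transfer from an annotated reference to an unannotated query, assuming both have already been embedded. The function name `knn_label_transfer` and the toy arrays are illustrative stand-ins, not part of UCE or the DeepLife package, and this is not necessarily the procedure used in the original paper.

```python
import numpy as np


def knn_label_transfer(ref_embed, ref_labels, query_embed, k=5):
    """Give each query cell the majority label of its k most similar reference cells."""
    ref = np.asarray(ref_embed, dtype=float)
    query = np.asarray(query_embed, dtype=float)
    labels = np.asarray(ref_labels)

    # L2-normalize rows so that a dot product equals cosine similarity
    ref = ref / np.clip(np.linalg.norm(ref, axis=1, keepdims=True), 1e-12, None)
    query = query / np.clip(np.linalg.norm(query, axis=1, keepdims=True), 1e-12, None)

    sims = query @ ref.T                         # shape: (n_query, n_ref)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # k most similar reference cells

    predicted = []
    for row in nearest:
        values, counts = np.unique(labels[row], return_counts=True)
        predicted.append(values[np.argmax(counts)])
    return np.array(predicted)


# Toy data standing in for real embeddings (two well-separated groups)
rng = np.random.default_rng(0)
ref_embed = np.vstack([
    rng.normal(loc=[5.0, 0.0, 0.0], scale=0.1, size=(20, 3)),
    rng.normal(loc=[0.0, 5.0, 0.0], scale=0.1, size=(20, 3)),
])
ref_labels = np.array(["T cell"] * 20 + ["B cell"] * 20)
query_embed = np.array([[4.8, 0.1, 0.0], [0.2, 5.1, 0.0]])

predicted_labels = knn_label_transfer(ref_embed, ref_labels, query_embed, k=5)
```

The dense similarity matrix keeps the sketch short. For large atlases an approximate nearest-neighbor index would likely be a better fit.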
Training Data
The model was trained on a corpus of cell atlas data from human and other species, including:
- Over 36 million cells
- More than 1,000 uniquely named cell types
- Hundreds of experiments
- Dozens of tissues
- Eight species (human, mouse, mouse lemur, zebrafish, pig, rhesus macaque, crab-eating macaque, western clawed frog)
Performance
UCE has demonstrated superior performance in zero-shot embedding tasks compared to other self-supervised transformer-based methods. It has shown the ability to:
- Accurately embed and cluster cell types from new, unseen datasets
- Align datasets from novel species without additional training
- Capture meaningful biological variation despite the presence of experimental noise
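One simple way to check how well known cell types cluster in an embedding of your own annotated data is a neighbor-purity score: the average fraction of each cell's nearest neighbors that carry the same label. The sketch below is an illustrative sanity check on toy data, not the evaluation metric reported for UCE. The function name `neighbor_purity` is hypothetical.

```python
import numpy as np


def neighbor_purity(embed, labels, k=10):
    """Mean fraction of each cell's k nearest neighbors (Euclidean) sharing its label."""
    x = np.asarray(embed, dtype=float)
    y = np.asarray(labels)

    # Pairwise squared Euclidean distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2ab
    sq = np.sum(x ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor

    nearest = np.argsort(dist, axis=1)[:, :k]
    same_label = y[nearest] == y[:, None]
    return float(same_label.mean())


# Toy embedding: two tight, well-separated groups of 30 cells each
rng = np.random.default_rng(0)
toy_embed = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 5)),
    rng.normal(loc=10.0, scale=0.5, size=(30, 5)),
])
toy_labels = np.array(["type A"] * 30 + ["type B"] * 30)

purity = neighbor_purity(toy_embed, toy_labels, k=10)
```

A score near 1 means neighbors almost always share a label. The all-pairs distance matrix grows quadratically with the number of cells, so this version is only suitable for modest subsets.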
Limitations
- The model's performance may vary for extremely rare or specialized cell types not well-represented in the training data
- While UCE can handle data from new species, its performance may degrade for species only distantly related to those in the training set
- The model does not account for information contained in raw RNA transcripts, such as genetic variation and RNA-splicing processes
Ethical Considerations
Users should be aware that while the data used to train UCE is anonymized, it represents human tissue samples and should be treated with appropriate respect and consideration. Researchers using this model should adhere to ethical guidelines for human subjects research.
Usage
To use the UCE model within the DeepLife ML Infra:
Install the package:
pip install deeplife-mlinfra
Import and use the model:
import anndata as ad
from huggingface_hub import hf_hub_download
from dl_models.models.uce.model import UCEmbedModel
from dl_models.models.uce.processor import UCEProcessor

# Load the model and preprocessor
model = UCEmbedModel.from_pretrained("deeplife/uce_model")
preprocessor = UCEProcessor.from_pretrained("deeplife/uce_model")
model.eval()

# Load your data (example using a sample dataset)
filepath = hf_hub_download(
    repo_id="deeplife/h5ad_samples",
    filename="GSE136831small.h5ad",
    repo_type="dataset",
)
adata = ad.read_h5ad(filepath)

# Preprocess and create a dataloader
dataloader = preprocessor.transform_to_dataloader(adata, batch_size=256)

# Get embeddings
for batch in dataloader:
    embed = model.get_cell_embeddings(batch)
    break  # This gets embeddings for the first batch

# You can now use these embeddings for downstream tasks
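The loop above stops after the first batch. To embed a whole dataset, the per-batch outputs can be collected and stacked. The following is a minimal sketch: `embed_all_batches` is a hypothetical helper, and `_DummyModel` is a stand-in used only so the example runs without downloading the model. It assumes `get_cell_embeddings` returns one row per cell and that the dataloader does not shuffle, so rows line up with `adata`.

```python
import numpy as np


def embed_all_batches(model, dataloader):
    """Call model.get_cell_embeddings on every batch and stack the results row-wise."""
    chunks = []
    for batch in dataloader:
        embed = model.get_cell_embeddings(batch)
        # Torch tensors expose .detach(); convert them to CPU NumPy arrays
        if hasattr(embed, "detach"):
            embed = embed.detach().cpu().numpy()
        chunks.append(np.asarray(embed))
    return np.concatenate(chunks, axis=0)


class _DummyModel:
    """Hypothetical stand-in for the real model: returns a 4-dim vector per cell."""

    def get_cell_embeddings(self, batch):
        return np.ones((len(batch), 4))


dummy_batches = [[0, 1, 2], [3, 4]]  # stand-in for a dataloader yielding two batches
all_embed = embed_all_batches(_DummyModel(), dummy_batches)
```

With the real model, assuming it is a PyTorch module, wrapping the call in `torch.no_grad()` avoids tracking gradients during inference and reduces memory use.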
For visualization of the embeddings, you can use techniques like PCA or UMAP:
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import umap
# Convert embed to numpy
embed_np = embed.detach().cpu().numpy()
# Perform PCA
pca = PCA(n_components=2)
embed_pca = pca.fit_transform(embed_np)
# Perform UMAP
umap_reducer = umap.UMAP(n_components=2, random_state=42)
embed_umap = umap_reducer.fit_transform(embed_np)
# Plot the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
# PCA plot
scatter1 = ax1.scatter(embed_pca[:, 0], embed_pca[:, 1], alpha=0.7)
ax1.set_title('UCE Embeddings - PCA')
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')
# No colorbar: the points are not colored by any value
# UMAP plot
scatter2 = ax2.scatter(embed_umap[:, 0], embed_umap[:, 1], alpha=0.7)
ax2.set_title('UCE Embeddings - UMAP')
ax2.set_xlabel('UMAP1')
ax2.set_ylabel('UMAP2')
# No colorbar: the points are not colored by any value
plt.tight_layout()
plt.show()
For more detailed usage instructions, please refer to the documentation.
Citation
If you use this model in your research, please cite both the original UCE paper and the DeepLife ML Infra package:
@article{rosen2023universal,
title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
author={Rosen, Yanay and Roohani, Yusuf and Agrawal, Ayush and Samotorcan, Leon and Consortium, Tabula Sapiens and Quake, Stephen R and Leskovec, Jure},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
@software{deeplife_mlinfra,
title={DeepLife ML Infra: Infrastructure for Biological Deep Learning Models},
author={DeepLife AI Team},
year={2023},
url={https://github.com/deeplifeai/deeplife-mlinfra},
version={1.0.0}
}
Contact
For questions or issues related to this model implementation in DeepLife ML Infra, please open an issue in the repository.
For questions about the original UCE model, please contact the authors of the paper.