Universal Cell Embedding (UCE) Model

Model Details

Model Name: Universal Cell Embedding (UCE)
Version: 1.0 (DeepLife version)
Type: Foundation model for single-cell biology
Original Paper: Universal Cell Embeddings: A Foundation Model for Cell Biology
Original Implementation: UCE GitHub Repository

Model Description

UCE is a foundation model for single-cell gene expression that learns a universal representation of cell biology. It can embed new single-cell gene expression datasets without fine-tuning or retraining, while remaining robust to dataset- and batch-specific artifacts.

Key Features

Zero-shot embedding capabilities for new datasets and species
No requirement for cell type annotation or input dataset preprocessing
Applicable to any set of genes from any species, even if they aren't homologs of genes seen during training
Learns a universal representation of cell biology that is intrinsically meaningful

Intended Use

UCE is designed for researchers working with single-cell RNA sequencing (scRNA-seq) data. It can be used for:

Analyzing and integrating diverse scRNA-seq datasets
Mapping new data into a universal embedding space
Identifying novel cell types and their functions
Cross-dataset discoveries and comparisons
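
As one illustration of mapping new data into a shared embedding space, the sketch below transfers cell type labels from an annotated reference to unannotated query cells by cosine similarity to per-type centroids. It is a minimal example, not part of UCE or DeepLife ML Infra: the function names `l2_normalize` and `transfer_labels` are hypothetical, and the embeddings are synthetic random vectors standing in for real model output.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def transfer_labels(ref_emb, ref_labels, query_emb):
    """Give each query cell the label of its most cosine-similar reference centroid."""
    ref_labels = np.asarray(ref_labels)
    classes = np.unique(ref_labels)
    # One centroid per annotated cell type in the reference dataset
    centroids = np.stack([ref_emb[ref_labels == c].mean(axis=0) for c in classes])
    sims = l2_normalize(query_emb) @ l2_normalize(centroids).T
    return classes[sims.argmax(axis=1)], sims.max(axis=1)

# Synthetic stand-in data (NOT real UCE output): two well-separated "cell types"
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]])
ref_emb = np.vstack([c + rng.normal(scale=0.1, size=(50, 4)) for c in centers])
ref_labels = np.array(["T cell"] * 50 + ["B cell"] * 50)
query_emb = centers + rng.normal(scale=0.1, size=(2, 4))

pred, score = transfer_labels(ref_emb, ref_labels, query_emb)
print(pred, score)
```

The returned similarity can serve as a rough confidence value: query cells with a low best-centroid similarity may belong to a cell type absent from the reference.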

Training Data

The model was trained on a corpus of cell atlas data from human and other species, including:

Over 36 million cells
More than 1,000 uniquely named cell types
Hundreds of experiments
Dozens of tissues
Eight species (human, mouse, mouse lemur, zebrafish, pig, rhesus macaque, crab-eating macaque, western clawed frog)

Performance

UCE is reported to outperform other self-supervised transformer-based methods on zero-shot embedding tasks. It has been shown to:

Accurately embed and cluster cell types from new, unseen datasets
Align datasets from novel species without additional training
Capture meaningful biological variation despite the presence of experimental noise
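
To check on your own data how well an embedding separates annotated cell types, one simple option is k-nearest-neighbour label purity: the average fraction of each cell's nearest neighbours that share its label. The sketch below is an assumption-laden illustration, not the evaluation protocol of the original paper. The function `knn_label_purity` is hypothetical and the data are synthetic.

```python
import numpy as np

def knn_label_purity(emb, labels, k=5):
    """Mean fraction of each cell's k nearest neighbours (Euclidean) sharing its label."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances; dense, so suited to a few thousand cells
    sq = (emb ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    same = labels[nn_idx] == labels[:, None]
    return float(same.mean())

# Synthetic stand-in embeddings (NOT real UCE output): two tight, distant clusters
rng = np.random.default_rng(0)
emb = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(30, 8)),
    rng.normal(loc=10.0, scale=0.1, size=(30, 8)),
])
labels = np.array(["A"] * 30 + ["B"] * 30)

purity_true = knn_label_purity(emb, labels, k=5)
purity_shuffled = knn_label_purity(emb, rng.permutation(labels), k=5)
print(purity_true, purity_shuffled)
```

Comparing the score against the same score on shuffled labels gives a baseline for what an uninformative embedding would achieve.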

Limitations

The model's performance may vary for extremely rare or specialized cell types not well-represented in the training data
While UCE can handle data from new species, its performance may degrade for species only distantly related to those in the training set
The model does not account for information contained in raw RNA transcripts, such as genetic variation and RNA-splicing processes

Ethical Considerations

Users should be aware that while the data used to train UCE is anonymized, it represents human tissue samples and should be treated with appropriate respect and consideration. Researchers using this model should adhere to ethical guidelines for human subjects research.

Usage

To use the UCE model within the DeepLife ML Infra:

Install the package:
```
pip install deeplife-mlinfra
```

Import and use the model:

```python
import anndata as ad
import torch
from huggingface_hub import hf_hub_download
from dl_models.models.uce.model import UCEmbedModel
from dl_models.models.uce.processor import UCEProcessor

# Load the pretrained model and its matching preprocessor
model = UCEmbedModel.from_pretrained("deeplife/uce_model")
preprocessor = UCEProcessor.from_pretrained("deeplife/uce_model")
model.eval()  # inference mode (disables dropout and similar layers)

# Load your data (example using a sample dataset)
filepath = hf_hub_download(
    repo_id="deeplife/h5ad_samples",
    filename="GSE136831small.h5ad",
    repo_type="dataset",
)
adata = ad.read_h5ad(filepath)

# Preprocess and create a dataloader
dataloader = preprocessor.transform_to_dataloader(adata, batch_size=256)

# Get embeddings for the first batch only
with torch.no_grad():  # gradients are not needed for inference
    for batch in dataloader:
        embed = model.get_cell_embeddings(batch)
        break

# You can now use these embeddings for downstream tasks
```
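
The loop above keeps only the first batch. To embed a whole dataset, the per-batch outputs need to be collected and concatenated. The sketch below shows that pattern with a hypothetical helper `embed_all` and a dummy embedding function, so it runs without the model. The batches and the projection are synthetic stand-ins, not DeepLife ML Infra objects.

```python
import numpy as np

def embed_all(batches, embed_fn):
    """Run embed_fn on every batch and stack the results row-wise."""
    chunks = [np.asarray(embed_fn(batch)) for batch in batches]
    return np.concatenate(chunks, axis=0)

# Stand-ins for illustration only: three "batches" of random features and a
# dummy embedding function (a fixed random projection to 8 dimensions).
rng = np.random.default_rng(0)
batches = [rng.normal(size=(n, 16)) for n in (256, 256, 100)]
projection = rng.normal(size=(16, 8))

def dummy_embed(batch):
    return batch @ projection

all_emb = embed_all(batches, dummy_embed)
print(all_emb.shape)  # (612, 8)
```

With the real model, `embed_fn` could wrap `model.get_cell_embeddings` and convert its output to NumPy, assuming it returns a torch tensor as the visualization code implies. Before storing the result alongside the original `adata`, check that the dataloader preserves cell order, which this card does not state.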

For visualization of the embeddings, you can use techniques like PCA or UMAP:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import umap

# Convert the embedding tensor to a NumPy array
embed_np = embed.detach().cpu().numpy()

# Reduce to two dimensions with PCA
pca = PCA(n_components=2)
embed_pca = pca.fit_transform(embed_np)

# Reduce to two dimensions with UMAP (fixed seed for reproducibility)
umap_reducer = umap.UMAP(n_components=2, random_state=42)
embed_umap = umap_reducer.fit_transform(embed_np)

# Plot the two projections side by side.
# No colorbar is drawn because the points are not colored by any value.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# PCA plot
ax1.scatter(embed_pca[:, 0], embed_pca[:, 1], alpha=0.7)
ax1.set_title('UCE Embeddings - PCA')
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')

# UMAP plot
ax2.scatter(embed_umap[:, 0], embed_umap[:, 1], alpha=0.7)
ax2.set_title('UCE Embeddings - UMAP')
ax2.set_xlabel('UMAP1')
ax2.set_ylabel('UMAP2')

plt.tight_layout()
plt.show()
```
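
A two-component PCA plot can mislead when the first two components capture little of the total variance, so it may be worth checking that fraction before interpreting the plot. The sketch below computes it with a plain SVD on synthetic data. The function `explained_variance_ratio` is a hypothetical helper, and the data are not UCE embeddings.

```python
import numpy as np

def explained_variance_ratio(emb, n_components=2):
    """Fraction of total variance captured by each of the leading principal components."""
    centered = emb - emb.mean(axis=0)
    # Singular values of the centered data; their squares are proportional to variances
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

# Synthetic stand-in data: most variance lies along the first two axes
rng = np.random.default_rng(0)
scales = np.array([10.0, 5.0] + [0.1] * 8)
emb = rng.normal(size=(200, 10)) * scales

ratio = explained_variance_ratio(emb)
print(ratio, ratio.sum())
```

When using scikit-learn's `PCA` as above, the same quantity is available from the fitted object as `pca.explained_variance_ratio_`.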

For more detailed usage instructions, please refer to the documentation.

Citation

If you use this model in your research, please cite both the original UCE paper and the DeepLife ML Infra package:

@article{rosen2023universal,
  title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
  author={Rosen, Yanay and Roohani, Yusuf and Agrawal, Ayush and Samotorcan, Leon and {Tabula Sapiens Consortium} and Quake, Stephen R and Leskovec, Jure},
  journal={bioRxiv},
  pages={2023--11},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

@software{deeplife_mlinfra,
  title={DeepLife ML Infra: Infrastructure for Biological Deep Learning Models},
  author={DeepLife AI Team},
  year={2023},
  url={https://github.com/deeplifeai/deeplife-mlinfra},
  version={1.0.0}
}

Contact

For questions or issues related to this model implementation in DeepLife ML Infra, please open an issue in the repository.

For questions about the original UCE model, please contact the authors of the paper.
