Getting cell and gene embeddings

#37
by abayegan - opened

Hi,
Given an AnnData object as input, how can I run it through your pre-trained model and get cell and gene embeddings?
I tried figuring it out from the notebooks, but I am getting errors when loading your example dataset and the model. I may post those separately.
Thanks!

Thank you for your interest in Geneformer. Please see the following two closed issues:

Regarding using AnnData: https://huggingface.co/ctheodoris/Geneformer/discussions/4
Regarding how to obtain embeddings: https://huggingface.co/ctheodoris/Geneformer/discussions/27

Update:
@abayegan
We have now added a function to extract and plot cell embeddings. Please see example here:
https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/extract_and_plot_cell_embeddings.ipynb
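For convenience, a minimal sketch of invoking that extractor for the pretrained model is below; the model path, tokenized dataset path, and output paths are placeholders, and the exact arguments are documented in the linked notebook, which should be treated as authoritative.

```python
from geneformer import EmbExtractor

# placeholder settings for extracting cell embeddings from the pretrained model;
# the input data is a tokenized .dataset produced by the Geneformer tokenizer
embex = EmbExtractor(model_type="Pretrained",
                     num_classes=0,
                     emb_mode="cell",
                     max_ncells=1000,
                     emb_layer=-1,          # second-to-last layer
                     forward_batch_size=100,
                     nproc=4)

embs = embex.extract_embs("/path/to/Geneformer",              # pretrained model directory
                          "/path/to/tokenized_data.dataset",  # tokenized input data
                          "/path/to/output_dir",
                          "output_prefix")
```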

ctheodoris changed discussion status to closed

Thanks very much for the links. I have two follow-up questions regarding the gene embeddings:

  1. In the paper, you mention that you obtained the gene embeddings from the second-to-last layer, which contains 2048 tokens. I assume this list of 2048 tokens is different for each single cell, and I would like to obtain the list of genes and their weights for each cell. Would it be right to use the geneformer.tokenizer object to map these tokens back to Ensembl IDs for each cell, or am I missing something?
  2. Have you compared the weights from model.bert.embeddings.word_embeddings.weight? This gives a weight to every single gene instead of only 2048. Have you tried using these to build the cell embeddings?

Thank you for your question. Please see the manuscript Methods sections "Gene Embeddings" and "Cell Embeddings" to understand these methods.

Briefly, gene embeddings are 256-dimensional for each gene, so a cell with 2048 genes has a gene embedding output of 2048 x 256 (and a cell with 2030 genes has an output of 2030 x 256). If a different cell has different genes, it will have a different set of embeddings: the embeddings are unique to each gene in each context (they are context-aware embeddings). Gene embeddings are analogous to word embeddings in NLP. Cell embeddings are a composite of the gene embeddings in that cell, so for one cell the initial 2048 x 256 is averaged across the gene dimension to give shape 1 x 256. Cell embeddings are analogous to sentence embeddings in NLP.
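To make the shapes concrete, here is a minimal sketch (not the notebook code) of pulling the second-to-last hidden layer with the Hugging Face transformers API and averaging it into cell embeddings; the model path and the toy inputs are placeholders, and real inputs would come from the Geneformer tokenizer.

```python
import torch
from transformers import BertModel

# placeholder path to the pretrained Geneformer checkpoint
model = BertModel.from_pretrained("/path/to/Geneformer", output_hidden_states=True)
model.eval()

# toy batch standing in for tokenized, padded input (2 cells x 2048 positions);
# real input_ids come from the Geneformer tokenizer
input_ids = torch.zeros((2, 2048), dtype=torch.long)
attention_mask = torch.ones((2, 2048), dtype=torch.long)

with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask=attention_mask)

gene_embs = out.hidden_states[-2]   # second-to-last layer: (batch, 2048, 256)
cell_embs = gene_embs.mean(dim=1)   # average over the gene dimension: (batch, 256)
```

Note that this naive mean divides by the full padded length; the padding-aware version discussed below excludes the padded positions.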

Thank you for your model!
I'm currently trying to analyze the cell embeddings, and I'm not sure I understand the extraction correctly.
The last layer I get from the pre-trained model should be a batch_size x 2048 x 256 matrix, right? So I need to average across the gene dimension to get a batch_size x 256 matrix. I'm wondering whether the gene dimension still represents the full input length, since the input is padded: a cell with fewer than 2048 genes will have padding tokens filled into its input.

In other words, do we sum over the gene dimension and divide by 2048, or should we sum over the gene dimension and divide by the number of genes the cell actually has (for example, if only 2000 genes are detected in a cell, should we divide by 2000)?

Thank you for your question. Yes, you should remove the padding prior to averaging the embeddings. Please see the update in the initial response above, as we have now added a function to extract and plot cell embeddings.
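For example, a masked average along these lines (a sketch reusing gene_embs and attention_mask from the snippet above) divides each cell's summed embeddings by the number of genes actually present rather than by 2048:

```python
import torch

# attention_mask is 1 for real gene tokens and 0 for padding
mask = attention_mask.unsqueeze(-1).float()   # (batch, 2048, 1)
summed = (gene_embs * mask).sum(dim=1)        # sum embeddings over real genes only
n_genes = mask.sum(dim=1).clamp(min=1)        # genes per cell, e.g. 2000 rather than 2048
cell_embs = summed / n_genes                  # (batch, 256)
```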
