About the values of forward_batch_size
Hello,
Thank you for the work!
When I tried running the in silico perturbation example with the code below, I found that I could only use forward_batch_size=80, instead of 400, on a GPU with 24 GB of memory. Is this batch size too small for inference? If so, could you recommend how much GPU memory is generally needed for this purpose?
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb="all",
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=3,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data={"cell_type": ["Cardiomyocyte3"]},
                        cell_states_to_model={"disease": (["dcm"], ["nf"], ["hcm"])},
                        max_ncells=2000,
                        emb_layer=0,
                        forward_batch_size=80,
                        nproc=16,
                        save_raw_data=True)
Thank you for your question. The batch size depends on many factors, including how many genes are detected per cell in your dataset, whether you are outputting gene-level embeddings or just cell-level embeddings, and your GPU resources. Because you are running inference rather than training, the results are deterministic and will be identical regardless of the batch size you use. Therefore, you can reduce the batch size until it fits within your resources, or increase your resources if you want to run a larger batch size for faster processing. Of note, the code sorts the cells by input size so that the cells with the most genes detected are presented first; that way, memory issues surface early rather than at an intermediate point in the analysis. It also means that larger batch sizes may be possible later in the dataset, so you can add a wrapper to optimize the batch size dynamically if you'd like, as sketched below.
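For example, a minimal sketch of such a wrapper (the helper name and retry policy are my own, not part of Geneformer; only InSilicoPerturber and perturb_data are from its API) could halve forward_batch_size whenever the GPU runs out of memory:

import torch
from geneformer import InSilicoPerturber

def perturb_with_dynamic_batch_size(isp_kwargs, model_dir, input_dir,
                                    output_dir, output_prefix,
                                    start_batch_size=400, min_batch_size=10):
    # Hypothetical helper: retry perturb_data with a halved
    # forward_batch_size whenever a CUDA out-of-memory error is raised.
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            isp = InSilicoPerturber(forward_batch_size=batch_size, **isp_kwargs)
            isp.perturb_data(model_dir, input_dir, output_dir, output_prefix)
            return batch_size  # succeeded at this batch size
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # unrelated error; do not mask it
            torch.cuda.empty_cache()
            print(f"OOM at forward_batch_size={batch_size}; halving and retrying")
            batch_size //= 2
    raise RuntimeError("Even the minimum batch size did not fit in GPU memory")

Note that this restarts the run on each retry; since the cells are sorted largest-first, an out-of-memory error tends to occur early, so little work is lost.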
Thank you for your response. However, the progress was stuck at 0% for a long time (at least overnight) without any error message (GTX 1650, 4 GB, compute capability 7.5). I tried the identical setup and dataset on Google Colab (T4, 16 GB, compute capability 7.5) and it finished without any problem. It does not appear to be a VRAM shortage (3.4 GB of the 4 GB used). Could you kindly give me some hints on how to troubleshoot or resolve this issue?
from geneformer import InSilicoPerturber
from geneformer import InSilicoPerturberStats
isp = InSilicoPerturber(perturb_type="delete",
... perturb_rank_shift=None,
... genes_to_perturb="all",
... combos=0,
... anchor_gene=None,
... model_type="CellClassifier",
... num_classes=3,
... emb_mode="cell",
... cell_emb_style="mean_pool",
... filter_data={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]},
... cell_states_to_model={"disease":(["dcm"],["nf"],["hcm"])},
... max_ncells=100,
... emb_layer=0,
... forward_batch_size=20,
... nproc=16,
... save_raw_data=True)  # changed all cuda or cuda:0 to cpu in in_silico_perturber.py
isp.perturb_data("ctheodoris/Geneformer",
... "/mnt/c/Users/pc/Downloads/GF_DCM",
... "/mnt/c/Users/pc/Downloads/GF",
... "output_prefix")
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GF_DCM/cache-1cc9cdd7e6000d3b_*_of_00016.arrow
Loading cached shuffled indices for dataset at /mnt/c/Users/pc/Downloads/GF_DCM/cache-2358227ccbc43242.arrow
Loading cached sorted indices for dataset at /mnt/c/Users/pc/Downloads/GF_DCM/cache-1a7a0c949a32027e.arrow
Some weights of the model checkpoint at ctheodoris/Geneformer were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ctheodoris/Geneformer and are newly initialized: ['bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GF_DCM/cache-364fded4f33ad57b_*_of_00016.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GF_DCM/cache-5e71ce5595a074e1_*_of_00016.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GF_DCM/cache-694066f5f18596b0_*_of_00016.arrow
0%| | 0/100 [00:00<?, ?it/s]
Thank you for your question. If it works on Colab with the larger GPU, it is probably just running very slowly due to the low resources on your local machine. This computation involves deleting each individual gene in the cell (so 2048 perturbations per cell if the cell has at least 2048 genes detected), with gene embedding outputs of 2047x256 for each of those perturbations, compared back to the original gene embeddings of 2047x256. You may consider running it on Colab since it completes there. As I noted in my previous response, the first cell will take the most time because the cells are sorted by size to surface memory constraints early rather than late, so if later cells have fewer genes detected, they will run faster. A rough sense of the scale of the computation is sketched below.
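To put back-of-envelope numbers on this for the run above (constants are taken from this thread, not measured):

# Rough cost of perturb_type="delete", genes_to_perturb="all".
max_input_genes = 2048       # Geneformer input size
emb_dim = 256                # embedding dimension quoted above
n_cells = 100                # max_ncells in the run above
forward_batch_size = 20

perturbations = n_cells * max_input_genes             # one deletion per gene: 204,800
forward_passes = perturbations // forward_batch_size  # 10,240 batched forward passes
floats_per_perturbation = (max_input_genes - 1) * emb_dim  # 2047 x 256 = 524,032
print(forward_passes, floats_per_perturbation)

On a 4 GB GPU each of those batches runs close to the memory limit, so throughput can be far lower than on the 16 GB T4.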
Also, I wanted to mention that it looks like you are loading the pretrained Geneformer model without fine-tuning, but you are incorrectly indicating to the in silico perturber that it is a fine-tuned CellClassifier model. As the warning states, this means the head layer is loaded untrained, with randomly initialized weights. Since you are asking it to quantify embeddings from the head layer (emb_layer=0), your results will be random. You should either first fine-tune your model, which is recommended in this case as discussed below, or indicate to the in silico perturber that you are loading the pretrained model.
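If I recall the perturber's options correctly, the latter would look something like the following (the model_type value is an assumption on my part; the other arguments mirror your run):

from geneformer import InSilicoPerturber

# Sketch: point the perturber at the pretrained checkpoint rather than a
# fine-tuned classifier; num_classes is dropped since there is no head.
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb="all",
                        combos=0,
                        anchor_gene=None,
                        model_type="Pretrained",  # assumed option name
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data={"cell_type": ["Cardiomyocyte1", "Cardiomyocyte2", "Cardiomyocyte3"]},
                        cell_states_to_model={"disease": (["dcm"], ["nf"], ["hcm"])},
                        max_ncells=100,
                        emb_layer=0,  # may also need revisiting without a fine-tuned head
                        forward_batch_size=20,
                        nproc=16)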
From closed issue #63:
Also, I wanted to note that when using in silico perturbation to test for perturbations that shift cells between two very similar states, it will likely be more effective if you first fine-tune the model to distinguish between the states so that they are better separated within the embedding space before testing what perturbation shifts between them. Specifically, the end-stage failing heart states are similar between the dilated and hypertrophic cardiomyopathy samples in Chaffin et al. Nature 2022, so it will likely be more effective to first fine-tune the model to distinguish them before running the in silico perturbation/treatment analysis. Fine-tuning the model is less necessary when testing the shift between two states that are already very separable by the pretrained model (e.g. fibroblasts vs. cardiomyocytes). However, fine-tuning with relevant data may still be helpful to orient the model's weights towards the specific downstream objective.
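For completeness, a rough sketch of that fine-tuning step, loosely following the cell-classification example in the Geneformer repo (paths, hyperparameters, and the label column are placeholders, and the collator signature may differ across Geneformer versions):

from datasets import load_from_disk
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from geneformer import DataCollatorForCellClassification

# Hypothetical path: a tokenized .dataset with an integer "label"
# column encoding the states to distinguish (dcm / nf / hcm).
train_ds = load_from_disk("/path/to/tokenized_labeled_train.dataset")

model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer", num_labels=3)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="geneformer_dcm_finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=12,
                           learning_rate=5e-5),
    data_collator=DataCollatorForCellClassification(),
    train_dataset=train_ds,
)
trainer.train()
trainer.save_model("geneformer_dcm_finetuned")  # then point InSilicoPerturber here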