How to run the model (InSilicoPerturber) on a different GPU than where PyTorch is allocated

#245
by junguyen - opened

Hello,

I am quite new to machine learning and would like some help on how to run the model on a different GPU than the one where PyTorch has allocated memory. I have 4 GPUs with 24 GB each. PyTorch is currently on GPU 0, reserving ~17 GB of memory:

β”‚+---------------------------------------------------------------------------------------+
β”‚| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
β”‚|-----------------------------------------+----------------------+----------------------+
β”‚| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
β”‚| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
β”‚|                                         |                      |               MIG M. |
β”‚|=========================================+======================+======================|
β”‚|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
β”‚|  0%   34C    P0              60W / 300W |  17106MiB / 23028MiB |      0%      Default |
β”‚|                                         |                      |                  N/A |
β”‚+-----------------------------------------+----------------------+----------------------+
β”‚|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
β”‚|  0%   27C    P8              15W / 300W |      5MiB / 23028MiB |      0%      Default |
β”‚|                                         |                      |                  N/A |
β”‚+-----------------------------------------+----------------------+----------------------+
β”‚|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
β”‚|  0%   28C    P8              18W / 300W |      5MiB / 23028MiB |      0%      Default |
β”‚|                                         |                      |                  N/A |
β”‚+-----------------------------------------+----------------------+----------------------+
β”‚|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
β”‚|  0%   27C    P8              15W / 300W |      5MiB / 23028MiB |      0%      Default |
β”‚|                                         |                      |                  N/A |
β”‚+-----------------------------------------+----------------------+----------------------+
β”‚
β”‚+---------------------------------------------------------------------------------------+
β”‚| Processes:                                                                            |
β”‚|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
β”‚|        ID   ID                                                             Usage      |
β”‚|=======================================================================================|
β”‚|    0   N/A  N/A      2024      C   python3                                   17098MiB |
β”‚+---------------------------------------------------------------------------------------+

I would like to run the model on GPU 1, 2, or 3. I have already tried editing in_silico_perturber.py, replacing every instance of 'cuda' with 'cuda:1'. However, the model still seems to run on GPU 0 (error posted at the bottom).
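The change, in effect, was along these lines (illustrative only, not the exact Geneformer source):

import torch

# What my edit amounts to: hard-coding the second GPU instead of the default device.
device = torch.device("cuda:1")    # was: torch.device("cuda")
x = torch.empty(8, device=device)  # tensors/models moved this way land on GPU 1
print(x.device)                    # cuda:1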

These are the parameters I am using for in silico perturbation:

from geneformer import InSilicoPerturber

isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        # HNF4A: ENSG00000101076
                        genes_to_perturb=["ENSG00000101076"],
                        combos=0,
                        anchor_gene=None,
                        model_type="Pretrained",
                        num_classes=0,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data=None,
                        cell_states_to_model=None,
                        max_ncells=None,
                        emb_layer=-1,
                        forward_batch_size=200,
                        nproc=16,
                        token_dictionary_file="/home/ubuntu/Geneformer/geneformer/token_dictionary.pkl")

# Perturb data: model directory, input dataset, output directory, output prefix
isp.perturb_data("/home/ubuntu/Geneformer/",
                 "/data/genecorpus_filtered_hep/",
                 "/data/genecorpus_filtered_hep/delete_cell/",
                 "delete_cell_HNF4A")

Changing forward_batch_size to smaller values still raises the same torch.cuda.OutOfMemoryError:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.49 GiB (GPU 0; 22.19 GiB total capacity; 3.22 GiB already allocated; 5.49 GiB free; 16.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any advice on what I should try next would be very helpful. Thank you!

Thank you for your interest in Geneformer! The memory "reserved by PyTorch" in the OOM error refers to entities PyTorch manages (e.g. the model), not to PyTorch itself; PyTorch is required on any GPU running the model. You should be able to run the code on a single GPU: reduce forward_batch_size until the job fits within your resources. Additionally, if you are using the 12L model, consider the 6L model, which is less resource-intensive.

If you'd like to distribute the job across multiple GPUs, there are multiple ways to do this, but I would recommend either running separate batches of cells on each GPU or using a method like DeepSpeed if you'd like to distribute the model itself.
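For example, a standard way to pin the whole job to one GPU (plain PyTorch/CUDA practice, not Geneformer-specific) is to set CUDA_VISIBLE_DEVICES before torch initializes CUDA; the chosen physical GPU then appears as cuda:0 inside the process, so no 'cuda' strings in the source need to change. A minimal sketch:

import os

# Must be set before the first CUDA call (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1

import torch

print(torch.cuda.device_count())      # 1 -- only the exposed GPU is visible
print(torch.cuda.get_device_name(0))  # physical GPU 1, now addressed as cuda:0

The same mechanism works from the shell for running separate batches of cells per GPU, e.g. CUDA_VISIBLE_DEVICES=2 python perturb_shard.py, where perturb_shard.py is a hypothetical script that runs InSilicoPerturber on one subset of the cells.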

ctheodoris changed discussion status to closed
