For more details, please refer to https://huggingface.co/genbio-ai/AIDO.Tissue-60M.
# Finetuning AIDO.Tissue for spatial single cell downstream tasks

We introduce how to finetune and evaluate our pre-trained AIDO.Tissue foundation models for downstream tasks. These tasks fall into the following categories:

- Cell-level classification tasks: niche label type prediction
- Cell-level regression tasks: cell density prediction
Note: all the following scripts should be run under `ModelGenerator/`.
## Download data
The related data is deposited at https://huggingface.co/datasets/genbio-ai/tissue-downstream-tasks. Please download it and put it under `ModelGenerator/downloads` as `cell_density` or `niche_type_classification`. Each sub-directory contains three files, one per split (`xx.train.h5ad`, `xx.val.h5ad`, `xx.test.h5ad`).

Each `.h5ad` file should include several `obs` attributes that represent the spatial (coordinate) information (like `x` and `y`) and the label information (like `niche_label`). All of these column fields are specified in the corresponding `config.yaml` file.
Note: the file `scRNA_genename_and_index.tsv` lists each gene name and its index in the `.h5ad` file.
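If you need the gene-name-to-index mapping programmatically, the TSV can be parsed with the standard library. A minimal sketch, assuming a two-column name/index layout (the exact column layout of `scRNA_genename_and_index.tsv` is an assumption based on the description above, so check the real file first):

```python
import csv
import io

# In-memory stand-in for scRNA_genename_and_index.tsv; a hypothetical
# two-column layout: gene name <TAB> numeric index in the .h5ad file.
sample = "GENE_A\t0\nGENE_B\t1\nGENE_C\t2\n"

gene_to_index = {}
for name, idx in csv.reader(io.StringIO(sample), delimiter="\t"):
    gene_to_index[name] = int(idx)

print(gene_to_index["GENE_B"])  # → 1
```

For the real file, replace the `StringIO` with `open("downloads/scRNA_genename_and_index.tsv")`.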
## Cell-level classification tasks

### Niche label type prediction
We fully finetune AIDO.Tissue for niche label type prediction.
#### Finetuning script
```shell
CUDA_VISIBLE_DEVICES=7 nohup mgen fit --config experiments/AIDO.Tissue/niche_type_classfification.yaml > logs/nohup/AIDO.Tissue.niche_type_classfification.yaml.log 2>&1 &
```
Note: `filter_columns` includes the label column and the spatial coordinate columns. `rename_columns` should be kept unchanged; it is used during the run.
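For orientation, the relevant part of the data config might look like the following sketch. Everything here other than the `filter_columns` and `rename_columns` keys is an assumption; consult the shipped `niche_type_classfification.yaml` for the authoritative structure and values:

```yaml
# Hypothetical fragment, for illustration only.
data:
  init_args:
    filter_columns:    # label column plus spatial coordinate columns in .obs
      - niche_label
      - x
      - y
    rename_columns:    # keep unchanged; used internally during the run
      - labels
      - x
      - y
```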
#### Evaluation script
Once finetuning finishes, several `ckpt` files will appear under the specified output directory `default_root_dir`. We can then use a `ckpt` to evaluate on the test dataset.
```shell
CUDA_VISIBLE_DEVICES=6 nohup mgen test --config experiments/AIDO.Tissue/niche_type_classfification.yaml \
    --ckpt_path ckpt_path \
    > ckpt_path.pred.log 2>&1 &
```
Note: `ckpt_path` is the path to the finetuned checkpoint.
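When several checkpoints accumulate, one simple way to pick the newest one to pass as `--ckpt_path` is by modification time. A stdlib sketch; the `checkpoints/` subdirectory and the file names below are assumptions standing in for whatever your `default_root_dir` actually contains:

```python
import os
import tempfile
from pathlib import Path

# Hypothetical output directory standing in for default_root_dir;
# the "checkpoints/" subdirectory name is an assumption.
root = Path(tempfile.mkdtemp())
ckpt_dir = root / "checkpoints"
ckpt_dir.mkdir()
for i, name in enumerate(["epoch=0-step=100.ckpt", "epoch=1-step=200.ckpt"]):
    p = ckpt_dir / name
    p.touch()
    os.utime(p, (1000 + i, 1000 + i))  # give each file a distinct mtime

# Pick the most recently modified checkpoint.
latest = max(ckpt_dir.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
print(latest.name)  # → epoch=1-step=200.ckpt
```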
## Cell-level regression tasks

### Cell density prediction
The config file is `experiments/AIDO.Tissue/cell_density_regression.yaml`; finetuning and evaluation work the same way as for the classification task.
## Dump embedding

We can dump embeddings for a `.h5ad` file with:
```shell
CUDA_VISIBLE_DEVICES=3 nohup mgen predict --config experiments/AIDO.Tissue/emb.xenium.yaml > logs/nohup/AIDO.Tissue.emb.xenium.log 2>&1 &
```
The output files will be written under the specified `output_dir`, e.g. `./logs/emb.xenium/lightning_logs/pred_output`. Each batch is saved individually, and a merged file named `predict_predictions.pt` is also generated. The `predict_predictions.pt` file stacks all batches:
```python
>>> import torch
>>> file_all = 'predict_predictions.pt'
>>> d_all = torch.load(file_all, map_location='cpu')
>>> d_all.keys()
dict_keys(['predictions', 'ids'])
>>> len(d_all['predictions'])  # equals the number of samples
586
>>> len(d_all['ids'])  # ids are numeric indices corresponding to the .h5ad file
586
>>> d_all['predictions'].shape  # (B, L, D), where L is the max sequence length over all samples
torch.Size([586, 90, 128])
```
We can retrieve all the gene embeddings and aggregate them into cell embeddings (e.g. via max pooling):
```python
>>> d_all_maxpooling = [d_all['predictions'][i, :, :] for i in range(d_all['predictions'].shape[0])]
>>> d_all_maxpooling = [i[~torch.any(i.isnan(), dim=1)] for i in d_all_maxpooling]  # drop NaN-padded rows
>>> d_all_maxpooling = torch.cat([i.max(dim=0)[0].view(1, -1) for i in d_all_maxpooling])  # max-pool over genes
>>> d_all_maxpooling.shape
torch.Size([586, 128])
```
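The same NaN-aware max pooling can be reproduced with plain NumPy on synthetic data, which makes the padding logic explicit. The shapes and values below are illustrative stand-ins, not taken from the real dump:

```python
import numpy as np

# Synthetic stand-in for d_all['predictions']: 2 cells, max length 4, dim 3.
# Padded positions are NaN, mirroring the dumped embedding tensor.
preds = np.full((2, 4, 3), np.nan)
preds[0, :2] = [[1.0, 5.0, 2.0], [3.0, 0.0, 4.0]]                    # cell 0: 2 genes
preds[1, :3] = [[2.0, 2.0, 2.0], [0.0, 9.0, 1.0], [4.0, 1.0, 0.0]]   # cell 1: 3 genes

cell_emb = []
for sample in preds:
    valid = sample[~np.isnan(sample).any(axis=1)]  # drop NaN-padded rows
    cell_emb.append(valid.max(axis=0))             # max-pool over genes
cell_emb = np.stack(cell_emb)

print(cell_emb.shape)  # → (2, 3)
print(cell_emb[0])     # → [3. 5. 4.]
```

The `ids` entry from the dump can then be used to map each pooled row back to its numeric row index in the original `.h5ad` file.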