---
title: Lisa On Gpu
emoji: π
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
---
# Run Jupyter on the remote server with port forwarding to localhost
1. check out the repo and install a venv with jupyter
2. forward the port to localhost with your private key: `ssh -i ~/.ssh/id_ecdsa_saturncloud trincuz@ssh.community.saturnenterprise.io -L 8889:localhost:8889 -N -f`
3. start the jupyter-lab server on the remote machine
4. open the forwarded page in a browser on localhost (local port 8889 in the command above)
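Before opening the browser it can help to confirm that the tunnel is actually listening. The following is a minimal sketch, not part of the repo: the helper name `is_port_open` is hypothetical, and port 8889 is taken from the `ssh -L` example above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8889 is the local end of the tunnel in the ssh example.
    print("tunnel reachable:", is_port_open("localhost", 8889))
```

A `False` result usually means either the `ssh` tunnel is not running or nothing is listening on the remote end of the forwarded port.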
## Commands to work on saturncloud after clone and git lfs install
```bash
cd ~/workspace/lisa-on-gpu/
rm -rf lisa_venv
python3 -m venv lisa_venv
ln -s lisa_venv/ venv
source venv/bin/activate
pip --version
which python
python -m pip install pip wheel --upgrade
python -m pip install pytest pytest-cov jupyterlab
python -m pip install -r requirements.txt
nohup jupyter-lab &
tail -F nohup.out
```
# Jupyterlab Howto
To run the `test.ipynb` notebook you should already have:
- cloned the project https://huggingface.co/spaces/aletrn/lisa-on-gpu with git lfs active
- created and activated a virtualenv
- installed jupyterlab dependencies from requirements_jupyter.txt
- installed dependencies from requirements.txt
## Hardware requirements
- an NVIDIA GPU with 10 or 12 GB of memory (a T4 should suffice)
- at least 16 GB of system RAM
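The GPU requirement can be checked from Python by asking `nvidia-smi` for the total memory of each card. This is a rough sketch under stated assumptions: the function names are hypothetical, and the 10 GiB default threshold is simply the lower bound quoted above.

```python
import shutil
import subprocess


def parse_gpu_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one total-memory value (MiB) per non-empty line."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_gpu_memory_mib() -> list[int]:
    """Return total memory in MiB for each visible GPU, or [] if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory_mib(result.stdout)


def has_enough_gpu_memory(memory_mib: list[int], min_gib: float = 10.0) -> bool:
    """True if at least one GPU reports min_gib GiB or more (threshold is an assumption)."""
    return any(mib >= min_gib * 1024 for mib in memory_mib)


if __name__ == "__main__":
    print("GPU memory OK:", has_enough_gpu_memory(query_gpu_memory_mib()))
```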
[![Gradio](https://img.shields.io/badge/Gradio-Online%20Demo-blue)](http://103.170.5.190:7860/)
[![Open in OpenXLab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/openxlab-app/LISA)
# LISA: Reasoning Segmentation via Large Language Model
| Input | Output | Input | Output |
| --- | --- | --- | --- |
| "Who was the president of the US in this image? Please output segmentation mask and explain the reason." | "Sure, the segmentation result is [SEG]. The President of the United States in the image is President Obama." | "Who was the president of the US in this image? Please output segmentation mask and explain why." | "Sure, [SEG]. In the image, the President of the United States is President Trump." |
| "What can make the woman stand higher? Please output segmentation mask and explain why." | "Sure, [SEG]. The woman is standing higher by using a ladder..." | "Can you segment the camera lens that is more suitable for photographing nearby objects in this image?" | "Sure, [SEG]." |
| "Can you segment the unusual part in this image and explain why." | "Sure, it is [SEG]. In the image, the unusual part is the dog wearing a reindeer antler headband..." | "Where to wash hands in this image? Please output segmentation mask." | "Sure, [SEG]." |
| "Can you segment the founder of Alibaba in this image and explain why?" | "Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is Jack Ma, the co-founder of Alibaba Group..." | "Please segment Lisa in this figure." | "Sure, [SEG]." |
## News
- [x] [2023.8.30] Release three new models [LISA-7B-v1](https://huggingface.co/xinlai/LISA-7B-v1), [LISA-7B-v1-explanatory](https://huggingface.co/xinlai/LISA-7B-v1-explanatory), and [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory). Welcome to check them out!
- [x] [2023.8.23] Refactor code, and release new model [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1). Welcome to check it out!
- [x] [2023.8.9] Training code is released!
- [x] [2023.8.4] [Online Demo](http://103.170.5.190:7860/) is released!
- [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory) model are released!
- [x] [2023.8.3] Inference code and the [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) model are released. Welcome to check them out!
- [x] [2023.8.2] [Paper](https://arxiv.org/pdf/2308.00692.pdf) is released and GitHub repo is created.

**LISA: Reasoning Segmentation via Large Language Model [[Paper](https://arxiv.org/abs/2308.00692)]**
## Installation
```
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
## Training
### Training Data Preparation
The training data consists of 4 types of data:

1. Semantic segmentation datasets: [ADE20K](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip), [COCO-Stuff](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip), [Mapillary](https://www.mapillary.com/dataset/vistas), [PACO-LVIS](https://github.com/facebookresearch/paco/tree/main#dataset-setup), [PASCAL-Part](https://github.com/facebookresearch/VLPart/tree/main/datasets#pascal-part), [COCO Images](http://images.cocodataset.org/zips/train2017.zip)

    Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. We only use the PACO-LVIS part in PACO. COCO Images should be put into the `dataset/coco/` directory.

2. Referring segmentation datasets: [refCOCO](https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip), [refCOCO+](https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip), [refCOCOg](https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip), [refCLEF](https://web.archive.org/web/20220413011817/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip) ([saiapr_tc-12](https://web.archive.org/web/20220515000000/http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip))

    Note: the original links to the refCOCO series data are down, so we have replaced them with new ones. If the download is very slow or unstable, we also provide a [OneDrive link](https://mycuhk-my.sharepoint.com/:f:/g/personal/1155154502_link_cuhk_edu_hk/Em5yELVBvfREodKC94nOFLoBLro_LPxsOxNV44PHRWgLcA?e=zQPjsc) to download. **You must also follow the rules that the original datasets require.**

3. Visual Question Answering dataset: [LLaVA-Instruct-150k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json)

4. Reasoning segmentation dataset: [ReasonSeg](https://github.com/dvlab-research/LISA#dataset)

Download them from the above links, and organize them as follows.

```
├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── mapillary
│   │   ├── config_v2.0.json
│   │   ├── testing
│   │   ├── training
│   │   └── validation
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
```
### Pre-trained weights
#### LLaVA
To train LISA-7B or 13B, you need to follow the [instructions](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) to merge the LLaVA delta weights. Typically, we use the final weights `LLaVA-Lightning-7B-v1-1` and `LLaVA-13B-v1-1` merged from `liuhaotian/LLaVA-Lightning-7B-delta-v1-1` and `liuhaotian/LLaVA-13b-delta-v1-1`, respectively. For Llama2, we can directly use the LLaVA full weights `liuhaotian/llava-llama-2-13b-chat-lightning-preview`.
#### SAM ViT-H weights
Download SAM ViT-H pre-trained weights from the [link](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth).
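Before launching training, it may be worth verifying that the dataset directory matches the layout listed under Training Data Preparation. The sketch below is not part of the repo: `EXPECTED_PATHS` and `missing_dataset_paths` are hypothetical names, and the path list is copied from the directory tree in this README rather than from the training code.

```python
from pathlib import Path

# Relative paths taken from the directory tree in "Training Data Preparation".
EXPECTED_PATHS = [
    "ade20k/annotations",
    "ade20k/images",
    "coco/train2017",
    "cocostuff/train2017",
    "llava_dataset/llava_instruct_150k.json",
    "mapillary/config_v2.0.json",
    "mapillary/testing",
    "mapillary/training",
    "mapillary/validation",
    "reason_seg/ReasonSeg/train",
    "reason_seg/ReasonSeg/val",
    "reason_seg/ReasonSeg/explanatory",
    "refer_seg/images/saiapr_tc-12",
    "refer_seg/images/mscoco/images/train2014",
    "refer_seg/refclef",
    "refer_seg/refcoco",
    "refer_seg/refcoco+",
    "refer_seg/refcocog",
    "vlpart/paco/annotations",
    "vlpart/pascal_part/train.json",
    "vlpart/pascal_part/VOCdevkit",
]


def missing_dataset_paths(dataset_dir: str) -> list[str]:
    """Return the expected relative paths that do not exist under dataset_dir."""
    root = Path(dataset_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    # './dataset' mirrors the --dataset_dir value used in the training command.
    for rel in missing_dataset_paths("./dataset"):
        print("missing:", rel)
```

Only existence is checked; the script does not validate file contents or image counts.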
### Training
```
deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LLaVA" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --dataset="sem_seg||refer_seg||vqa||reason_seg" \
  --sample_rates="9,3,3,1" \
  --exp_name="lisa-7b"
```
When training is finished, to get the full model weight:
```
cd ./runs/lisa-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
```
### Merge LoRA Weight
Merge the LoRA weights of `pytorch_model.bin` and save the resulting model to your desired path in the Hugging Face format:
```
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="PATH_TO_LLaVA" \
  --weight="PATH_TO_pytorch_model.bin" \
  --save_path="PATH_TO_SAVED_MODEL"
```
For example:
```
CUDA_VISIBLE_DEVICES="" python3 merge_lora_weights_and_save_hf_model.py \
  --version="./LLaVA/LLaVA-Lightning-7B-v1-1" \
  --weight="lisa-7b/pytorch_model.bin" \
  --save_path="./LISA-7B"
```
### Validation
```
deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LISA_HF_Model_Directory" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --exp_name="lisa-7b" \
  --eval_only
```
Note: the `v1` model is trained using both `train+val` sets, so please use the `v0` model to reproduce the validation results. (To use the `v0` models, please first check out the legacy version of the repo with `git checkout 0e26916`.)
## Inference
To chat with [LISA-13B-llama2-v1](https://huggingface.co/xinlai/LISA-13B-llama2-v1) or [LISA-13B-llama2-v1-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v1-explanatory):

(Note that `chat.py` currently does not support `v0` models, i.e., `LISA-13B-llama2-v0` and `LISA-13B-llama2-v0-explanatory`. If you want to use the `v0` models, please first check out the legacy version of the repo with `git checkout 0e26916`.)
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1'
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1-explanatory'
```
To use `bf16` or `fp16` data type for inference:
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='bf16'
```
To use `8bit` or `4bit` data type for inference (this enables running the 13B model on a single 24G or 12G GPU at some cost of generation quality):
```
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_8bit
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_4bit
```
Hint: for the 13B model, 16-bit inference consumes 30G VRAM with a single GPU, 8-bit inference consumes 16G, and 4-bit inference consumes 9G.

After that, input the text prompt and then the image path. For example:
```
- Please input your prompt: Where can the driver see the car speed in this image? Please output segmentation mask.
- Please input the image path: imgs/example1.jpg

- Please input your prompt: Can you segment the food that tastes spicy and hot?
- Please input the image path: imgs/example2.jpg
```
The results should be like:
## Deployment
```
CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1' --load_in_4bit
CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1-explanatory' --load_in_4bit
```
By default, we use 4-bit quantization. Feel free to delete the `--load_in_4bit` argument for 16-bit inference or replace it with the `--load_in_8bit` argument for 8-bit inference.
## Dataset
In ReasonSeg, we have collected 1218 images (239 train, 200 val, and 779 test). The training and validation sets can be downloaded from **this link**.

Each image is provided with an annotation JSON file:
```
image_1.jpg, image_1.json
image_2.jpg, image_2.json
...
image_n.jpg, image_n.json
```
Important keys contained in JSON files:
```
- "text": text instructions.
- "is_sentence": whether the text instructions are long sentences.
- "shapes": target polygons.
```
The elements of "shapes" fall into two categories, namely **"target"** and **"ignore"**. The former is indispensable for evaluation, while the latter denotes an ambiguous region and is hence disregarded during evaluation.

We provide a **script** that demonstrates how to process the annotations:
```
python3 utils/data_processing.py
```
Besides, we leveraged GPT-3.5 for rephrasing instructions, so images in the training set may have **more than one instruction (but fewer than six)** in the "text" field. During training, users may randomly select one as the text query to obtain a better model.
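The annotation handling described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the repo's `utils/data_processing.py`: the function names are hypothetical, and the `"label"` and `"points"` field names inside each shape are an assumption (labelme-style), since the section only names the top-level keys.

```python
import json
import random


def load_annotation(json_path):
    """Read one ReasonSeg annotation file into a dict."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


def split_shapes(anno):
    """Group polygons into 'target' and 'ignore'.

    Assumes each shape is a dict with 'label' and 'points' keys;
    shapes with any other label are skipped.
    """
    groups = {"target": [], "ignore": []}
    for shape in anno["shapes"]:
        label = str(shape.get("label", "")).lower()
        if label in groups:
            groups[label].append(shape.get("points", []))
    return groups


def pick_instruction(anno, rng=random):
    """Randomly select one text instruction; accepts a single string or a list."""
    text = anno["text"]
    candidates = [text] if isinstance(text, str) else list(text)
    return rng.choice(candidates)
```

Passing a seeded `random.Random` instance as `rng` makes the instruction choice reproducible across runs.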
## Citation
If you find this project useful in your research, please consider citing:
```
@article{lai2023lisa,
  title={LISA: Reasoning Segmentation via Large Language Model},
  author={Lai, Xin and Tian, Zhuotao and Chen, Yukang and Li, Yanwei and Yuan, Yuhui and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2308.00692},
  year={2023}
}
@article{yang2023improved,
  title={An Improved Baseline for Reasoning Segmentation with Large Language Model},
  author={Yang, Senqiao and Qu, Tianyuan and Lai, Xin and Tian, Zhuotao and Peng, Bohao and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2312.17240},
  year={2023}
}
```
## Acknowledgement
- This work is built upon [LLaVA](https://github.com/haotian-liu/LLaVA) and [SAM](https://github.com/facebookresearch/segment-anything).
- Placeholder images (error, 'no output segmentation') from Muhammad Khaleeq (https://www.vecteezy.com/members/iyikon)