## Triton Inference Server
To get optimal inference performance for h2oGPT models, we use the [FasterTransformer Backend for Triton](https://github.com/triton-inference-server/fastertransformer_backend/).
Make sure to [Set Up GPU Docker](README_DOCKER.md#setup-docker-for-gpus) first.
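Before building, you can quickly confirm that the host GPU is visible and that Docker has the NVIDIA runtime registered (a rough sanity check, not part of the original setup):
```bash
# Host should list at least one GPU
nvidia-smi
# Docker should report an "nvidia" runtime
docker info | grep -i runtimes
```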
### Build Docker image for Triton with FasterTransformer backend:
```bash
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
git clone https://github.com/NVIDIA/FasterTransformer.git
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
```
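Once the build completes, you can check that the image is present locally (image name as exported above):
```bash
docker images ${TRITON_DOCKER_IMAGE}
```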
### Create model definition files
We convert the h2oGPT model from Hugging Face (HF) to FasterTransformer (FT) format, following [this FasterTransformer PR](https://github.com/NVIDIA/FasterTransformer/pull/569):
#### Fetch model from Hugging Face
```bash
export MODEL=h2ogpt-oig-oasst1-512-6_9b
if [ ! -d ${MODEL} ]; then
  git lfs clone https://huggingface.co/h2oai/${MODEL}
fi
```
If `git lfs` fails, make sure to install it first. For Ubuntu:
```bash
sudo apt-get install git-lfs
```
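Either way, it is worth confirming that the LFS weights were actually downloaded rather than left as small pointer files (the exact file names vary by model, so treat this as a rough check):
```bash
# The checkpoint should be several GB in total; pointer files are only a few hundred bytes
du -sh ${MODEL}
ls -lh ${MODEL}/*.bin
```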
#### Convert to FasterTransformer format
```bash
export WORKSPACE=$(pwd)
# MODEL and CONTAINER_VERSION must still be set from the steps above
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
# Start an interactive shell inside the Triton/FasterTransformer container
docker run -it --rm --runtime=nvidia --shm-size=1g \
    --ulimit memlock=-1 -v ${WORKSPACE}:${WORKSPACE} \
    -e CUDA_VISIBLE_DEVICES=0 \
    -e MODEL=${MODEL} \
    -e WORKSPACE=${WORKSPACE} \
    -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
# Inside the container, run the HF-to-FT conversion for 1 GPU
export PYTHONPATH=${WORKSPACE}/FasterTransformer/:$PYTHONPATH
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py \
    -i_g 1 \
    -m_n gptneox \
    -i ${WORKSPACE}/${MODEL} \
    -o ${WORKSPACE}/FT-${MODEL}
```
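If the conversion succeeded, the output directory should contain a `1-gpu` subdirectory with a `config.ini` and the converted weight files (a quick check, still inside the container):
```bash
ls -lh ${WORKSPACE}/FT-${MODEL}/1-gpu/
```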
#### Test the FasterTransformer model
Run a quick sanity check by generating completions for a couple of sample prompts with the FasterTransformer GPT-NeoX example script (still inside the container):
```bash
echo "Hi, who are you?" > gptneox_input
echo "And you are?" >> gptneox_input
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/gptneox_example.py \
    --ckpt_path ${WORKSPACE}/FT-${MODEL}/1-gpu \
    --tokenizer_path ${WORKSPACE}/${MODEL} \
    --sample_input_file gptneox_input
```
#### Update Triton configuration files
Fix a typo in the example:
```bash
sed -i -e 's@postprocessing@preprocessing@' all_models/gptneox/preprocessing/config.pbtxt
```
Update the path to the PyTorch model, and set it to use 1 GPU:
```bash
sed -i -e "s@/workspace/ft/models/ft/gptneox/@${WORKSPACE}/FT-${MODEL}/1-gpu@" all_models/gptneox/fastertransformer/config.pbtxt
sed -i -e 's@string_value: "2"@string_value: "1"@' all_models/gptneox/fastertransformer/config.pbtxt
```
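You can verify that both substitutions landed by grepping the config (the surrounding lines may differ between fastertransformer_backend versions):
```bash
grep -n "FT-${MODEL}" all_models/gptneox/fastertransformer/config.pbtxt
grep -n 'string_value: "1"' all_models/gptneox/fastertransformer/config.pbtxt
```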
#### Launch Triton
```bash
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 \
    --allow-run-as-root /opt/tritonserver/bin/tritonserver \
    --model-repository=${WORKSPACE}/all_models/gptneox/ &
```
Now, you should see something like this:
```bash
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| fastertransformer | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
+-------------------+---------+--------+
```
which means the pipeline is ready to make predictions!
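Because the server was started in the background, you can also poll Triton's standard readiness endpoint (assuming the default HTTP port 8000) to confirm it is up:
```bash
# Should print 200 once all models are READY
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
```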
### Run client test
Let's test the endpoint:
```bash
python3 ${WORKSPACE}/tools/gpt/identity_test.py
```
And now the end-to-end test. We first have to fix a bug in the inputs for postprocessing:
```bash
sed -i -e 's@prepare_tensor("RESPONSE_INPUT_LENGTHS", output2, FLAGS.protocol)@prepare_tensor("sequence_length", output1, FLAGS.protocol)@' ${WORKSPACE}/tools/gpt/end_to_end_test.py
```
```bash
python3 ${WORKSPACE}/tools/gpt/end_to_end_test.py
```
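If you want to wire up your own client, Triton's model metadata endpoint shows the inputs and outputs the ensemble expects (again assuming the default HTTP port 8000):
```bash
curl -s localhost:8000/v2/models/ensemble | python3 -m json.tool
```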