---
license: apache-2.0
datasets:
- Flmc/DISC-Med-SFT
language:
- zh
pipeline_tag: text-generation
tags:
- baichuan
- medical
- ggml
---

This repository contains quantized versions of DISC-MedLLM, which uses Baichuan-13B-Base as its base model.

The weights are converted to GGML format using [baichuan13b.cpp](https://github.com/ouwei2013/baichuan13b.cpp) (based on [llama.cpp](https://github.com/ggerganov/llama.cpp)).

|Model               |GGML quantization method| Size on disk |
|--------------------|------------------------|--------------|
|ggml-model-q4_0.bin | q4_0                   | 7.55 GB      |
|ggml-model-q4_1.bin | q4_1                   | 8.36 GB      |
|ggml-model-q5_0.bin | q5_0                   | 9.17 GB      |
|ggml-model-q5_1.bin | q5_1                   | 9.97 GB      |
|ggml-model-q8_0.bin | q8_0                   | 14 GB        |

## How to run inference

1. [Compile baichuan13b](https://github.com/ouwei2013/baichuan13b.cpp#build). This produces a command-line executable `baichuan13b/build/bin/main` and a server `baichuan13b/build/bin/server`.
2. Download the weights from this repository into `baichuan13b/build/bin/`.
3. For the command line interface, use the following command. You can also read [the doc covering other command line parameters](https://github.com/ouwei2013/baichuan13b.cpp/tree/master/examples/main#quick-start).
> ```bash
> cd baichuan13b/build/bin/
> ./main -m ggml-model-q4_0.bin --prompt "I feel sick. Nausea and Vomiting."
> ```
4. For the API interface, use the following command. You can also read [the doc about server command line options](https://github.com/ouwei2013/baichuan13b.cpp/tree/master/examples/server#llamacppexampleserver).
> ```bash
> cd baichuan13b/build/bin/
> ./server -m ggml-model-q4_0.bin -c 2048
> ```
5. To test the API, you can use `curl`:
> ```bash
> curl --request POST \
>   --url http://localhost:8080/completion \
>   --data '{"prompt": "I feel sick. Nausea and Vomiting.", "n_predict": 512}'
> ```

### Use it in Python

To use it in a Python script such as [cli_demo.py](https://github.com/FudanDISC/DISC-MedLLM/blob/main/cli_demo.py), replace the `model.chat()` call with a `requests` POST to `localhost:8080` that sends the prompt as JSON, then decode the HTTP response.

```python
import requests

# POST the prompt as JSON to the server's /completion endpoint and decode the JSON response
llm_output = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "I feel sick. Nausea and Vomiting.", "n_predict": 512},
).json()
print(llm_output)
```
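
If you want something closer to a drop-in replacement for `model.chat()`, you can wrap the request in a small helper. This is a minimal sketch: the `chat` function name, its parameters, and the timeout are illustrative, and it assumes the server follows llama.cpp's response format, where the generated text is returned in a `content` field.

```python
import requests


def chat(prompt: str, n_predict: int = 512, host: str = "http://localhost:8080") -> str:
    """Send a prompt to the running baichuan13b.cpp server and return the generated text."""
    response = requests.post(
        f"{host}/completion",
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=600,  # generation can be slow on CPU
    )
    response.raise_for_status()
    # Assumes the llama.cpp-style response, where the generated text is in the "content" field
    return response.json().get("content", "")


if __name__ == "__main__":
    print(chat("I feel sick. Nausea and Vomiting."))
```

With this helper, the calling code keeps a `chat(prompt)`-shaped interface while all generation runs in the local server process.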