Update README.md
README.md CHANGED
@@ -43,36 +43,18 @@ This repo contains GPTQ model files for [Technology Innovation Institute's Falco
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

-## EXPERIMENTAL
-
-These are experimental first GPTQs for Falcon 180B. They have not yet been tested.
+## Requirements

Transformers version 4.33.0 is required.

-Once this change has been made, they should be usable just like any other GPTQ model. You can try the example Transformers Python code later in this README, or try loading them directly from AutoGPTQ.
-
-I believe you will need 2 x 80GB GPUs (or 4 x 48GB) to load the 4-bit models, and probably the 3-bit ones as well.
-
-Assuming the quants finish OK (and if you're reading this message, they did!) I will test them during the day on 7th September and update this notice with my findings.
-
-To join it:
-
-Linux and macOS:
-```
-cat model.safetensors-split-* > model.safetensors && rm model.safetensors-split-*
-```
-Windows command line:
-```
-COPY /B model.safetensors.split-a + model.safetensors.split-b model.safetensors
-del model.safetensors.split-a model.safetensors.split-b
-```
+Due to the huge size of the model, the GPTQ has been sharded. This will break compatibility with AutoGPTQ, and therefore any clients/libraries that use AutoGPTQ directly.
+
+But they work great direct from Transformers!
+
+Currently these GPTQs are tested to work with:
+- Transformers 4.33.0
+- Text Generation Inference (TGI) 1.0.2

<!-- description end -->

<!-- repositories-available start -->
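The removed notice above estimated 2 x 80GB GPUs (or 4 x 48GB) for the 4-bit files, and the new Requirements text says the sharded GPTQ works directly from Transformers. The sketch below combines the two by loading the model across multiple GPUs with `device_map="auto"`; the `max_memory` caps are illustrative assumptions for a 2 x 80GB setup, not values from the README.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Falcon-180B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" spreads the sharded GPTQ weights across all visible GPUs;
# max_memory caps per-device usage (illustrative values for 2 x 80GB cards).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "76GiB", 1: "76GiB", "cpu": "96GiB"},
    revision="main",
)
```

If the weights do not fit within the given budget, layers spill over to CPU, which is usually far too slow to be practical for a model this size.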
@@ -159,40 +141,22 @@ It is strongly recommended to use the text-generation-webui one-click-installers
### Install the necessary packages

-Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ
+Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ.

```shell
pip3 install transformers>=4.33.0 optimum>=1.12.0
-pip3
-git clone -b TB_Latest_Falcon https://github.com/TheBloke/AutoGPTQ
-cd AutoGPTQ
-pip3 install .
+pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```

-###
-
-I recommend using my fast download script
-
-```shell
-git clone https://github.com/TheBlokeAI/AIScripts
-python3 AIScripts/hub_download.py TheBloke/Falcon-180B-Chat-GPTQ Falcon-180B-Chat-GPTQ --branch main # change branch if you want to use the 3-bit model instead
-```
-
-### Now join the files
-
-```shell
-cd Falcon-180B-Chat-GPTQ
-# Windows users: see the command to use in the Description at the top of this README
-cat model.safetensors-split-* > model.safetensors && rm model.safetensors-split-*
-```
-
-### And then finally you can run the following code
+### Transformers sample code

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

-model_name_or_path = "/
+model_name_or_path = "TheBloke/Falcon-180B-Chat-GPTQ"
+
+# To use a different branch, change revision
+# For example: revision="gptq-3bit--1g-actorder_True"

model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             revision="main")
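The removed section relied on a custom download script with a `--branch` flag. If you still want to pre-download one branch to disk instead of streaming it from the Hub, `huggingface_hub` covers the same ground; this is a sketch rather than part of the diff, and the `local_dir` name is arbitrary.

```python
from huggingface_hub import snapshot_download

# Download one branch of the repo to a local folder.
# Use revision="gptq-3bit--1g-actorder_True" (for example) for a different branch.
local_path = snapshot_download(
    repo_id="TheBloke/Falcon-180B-Chat-GPTQ",
    revision="main",
    local_dir="Falcon-180B-Chat-GPTQ",
)
print(local_path)
```

The resulting folder can then be passed as `model_name_or_path` in the sample code.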
@@ -206,7 +170,7 @@ Assistant: '''
print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
-output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
+output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
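The only change in this hunk is adding `do_sample=True`. The reason is that `temperature` and `top_p` are sampling parameters, so with greedy decoding (the default, `do_sample=False`) they have no effect. The same settings can be grouped in a `GenerationConfig`, as a rough equivalent; the snippet below assumes the `model` and `input_ids` from the sample code above.

```python
from transformers import GenerationConfig

# temperature/top_p only take effect when do_sample=True; with greedy decoding
# (do_sample=False) they are ignored. Reuses `model` and `input_ids` from the
# sample code above.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
)
output = model.generate(inputs=input_ids, generation_config=gen_config)
```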
@@ -218,6 +182,7 @@ pipe = pipeline(
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
+   do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15
)
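For reference, the pipeline call this last hunk touches looks roughly like the following when assembled end to end. Only the parameters shown in the hunks above come from the README; the task name "text-generation", the `model=model` argument, and the prompt text are assumptions based on standard Transformers pipeline usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Falcon-180B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             revision="main")

# Placeholder prompt in the User/Assistant format referenced by the hunk header above.
prompt_template = "User: Tell me about AI\nAssistant:"

pipe = pipeline(
    "text-generation",   # assumed task name; not shown in this diff
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15,
)
print(pipe(prompt_template)[0]["generated_text"])
```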