---
license: apache-2.0
language:
- zh
- en
library_name: transformers
tags:
- qihoo360
- 奇虎360
- zhinao
- 360Zhinao
- pretrain
---



360Zhinao2 (360智脑)

🤗 HuggingFace   |    🤖 ModelScope   |    💬 WeChat (微信)  

Visit the official 360Zhinao website at https://ai.360.com to try the models.


# Introduction

🎉🎉🎉 We released the 360Zhinao2 model series:

- **360Zhinao2-7B-Base**
- **360Zhinao2-7B-Chat-4K**
- **360Zhinao2-7B-Chat-32K**
- **360Zhinao2-7B-Chat-360K**

Notable features of our 360Zhinao models are:

- **Base Model:** Trained with the popular two-stage method. In the first stage, we train on 10T tokens with a cosine learning rate schedule. In the second stage, we increase the proportion of high-quality data and train on a further 100B tokens, with the learning rate decaying directly to 0. In total, 360Zhinao2-7B is trained on 10.1T tokens.
- **Chat Models:** Powerful chat capabilities and three context lengths: 4K, 32K and 360K.
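
The two-stage schedule described above can be pictured with a small sketch. Only the token counts (10T and 100B) come from the text; the peak learning rate, the learning rate at the end of stage 1, the linear shape of the stage-2 decay, and the absence of warmup are all assumptions made for illustration, not the actual training configuration.

```python
import math

STAGE1_TOKENS = 10e12    # 10T tokens (from the text)
STAGE2_TOKENS = 100e9    # 100B tokens (from the text)
PEAK_LR = 3e-4           # hypothetical peak learning rate
STAGE1_FINAL_LR = 3e-5   # hypothetical learning rate at the end of stage 1


def learning_rate(tokens_seen: float) -> float:
    """Return the illustrative learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= STAGE1_TOKENS:
        # Stage 1: cosine decay from PEAK_LR down to STAGE1_FINAL_LR.
        progress = tokens_seen / STAGE1_TOKENS
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return STAGE1_FINAL_LR + (PEAK_LR - STAGE1_FINAL_LR) * cosine
    # Stage 2: decay to 0 over the remaining tokens (linear shape is assumed).
    progress = min((tokens_seen - STAGE1_TOKENS) / STAGE2_TOKENS, 1.0)
    return STAGE1_FINAL_LR * (1.0 - progress)


total_tokens = STAGE1_TOKENS + STAGE2_TOKENS  # 10.1T tokens in total
for tokens in (0.0, STAGE1_TOKENS / 2, STAGE1_TOKENS, total_tokens):
    print(f"{tokens:.3e} tokens -> lr {learning_rate(tokens):.2e}")
```

The point of the sketch is the shape: a long cosine phase followed by a short phase on higher-quality data in which the learning rate is driven to 0.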
# News and Updates

- [2024.11.18] 🔥🔥🔥 We released 360Zhinao2-7B, including the Base model and Chat models with context lengths of 4K, 32K and 360K.
- [2024.05.23] We released two models, 360Zhinao-search and 360Zhinao-1.8B-Reranking, which ranked first in the Retrieval and Reranking tasks of the [C-MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), respectively.
- [2024.05.20] We extended llama3 and released **llama3-8B-360Zhinao-360k-Instruct**🤗
- [2024.04.12] We released **360Zhinao-7B** v1.0, including the base model and three chat models with context lengths of 4K, 32K and 360K. The technical report is on [arXiv](https://arxiv.org/abs/2405.13386).
# Table of contents

- [Download URL](#Download-URL)
- [Model Evaluation](#Model-Evaluation)
- [Quickstart](#Quickstart)
- [Model Inference](#Model-Inference)
- [Model Finetune](#Model-Finetune)
- [License](#License)
# Download URL

| Size | Model | BF16 | Int4 |
|-|-|-|-|
| 7B | 360Zhinao2-7B-Base | 🤖 🤗 | |
| 7B | 360Zhinao2-7B-Chat-4K | 🤖 🤗 | 🤖 🤗 |
| 7B | 360Zhinao2-7B-Chat-32K | 🤖 🤗 | 🤖 🤗 |
| 7B | 360Zhinao2-7B-Chat-360K | 🤖 🤗 | 🤖 🤗 |
# Model Evaluation

## Base Model

We used the open-source tool OpenCompass to evaluate the model and compared it with open-source models under 10B parameters from the past six months. 360Zhinao2-7B is competitive: it performs well on Chinese benchmarks such as CEval, C3 and LCSTS, and its average score on the Chinese benchmarks ranks first. It also ranks first on Math, a challenging competition math dataset. **360Zhinao2-7B has advantages on Chinese benchmarks and challenging competition math.**
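
| Type | Dataset | Language | glm4-9b | Qwen2.5-7B | internlm2.5-7b | Yi1.5-9B | gemma2-9b | Llama3.1-8B | 360Zhinao2-7B |
|-|-|-|-|-|-|-|-|-|-|
| Exam | ceval | zh | 75.83 | 81.41 | 77.71 | 73.51 | 56.36 | 51.67 | 83.04 |
| | mmlu | en | 75.5 | 75.5 | 71.55 | 71.43 | 72.22 | 66.75 | 67.84 |
| | cmmlu | zh | 74.24 | 81.79 | 78.77 | 74.2 | 58.89 | 52.49 | 73.8 |
| | ARC-c | en | 94.92 | 80 | 85.08 | 87.46 | 77.63 | 80.68 | 87.12 |
| | ARC-e | en | 98.41 | 84.83 | 95.24 | 94.53 | 78.84 | 89.77 | 92.77 |
| Language | WiC | en | 51.57 | 52.82 | 50.78 | 50.63 | 50.47 | 50 | 49.84 |
| | WSC | en | 68.27 | 68.27 | 69.23 | 66.35 | 68.27 | 67.31 | 65.38 |
| Knowledge | BoolQ | en | 81.8 | 83.88 | 89.51 | 84.46 | 85.6 | 82.2 | 88.29 |
| | commonsense_qa | en | 71.17 | 73.22 | 68.55 | 71.58 | 68.47 | 71.25 | 69.78 |
| Understanding | C3 | zh | 91.51 | 92 | 93.04 | 85.86 | 81.64 | 83.51 | 93.26 |
| | race-middle | en | 91.99 | 91.02 | 92.06 | 91.16 | 88.09 | 81.69 | 90.46 |
| | race-high | en | 90.71 | 87.91 | 90.08 | 88.34 | 82.08 | 78.73 | 86.74 |
| | lcsts | zh | 18.29 | 15.82 | 15.96 | 16.49 | 10.62 | 17.29 | 18.61 |
| | eprstmt-dev | zh | 91.88 | 86.88 | 91.25 | 91.88 | 48.12 | 83.12 | 90 |
| | lambada | en | 71.67 | 71.14 | 69.98 | 70.64 | 75.43 | 74.23 | 72.56 |
| Reasoning | hellaswag | en | 70.25 | 72.76 | 70.38 | 71.55 | 66.83 | 74.65 | 71.49 |
| | siqa | en | 81.73 | 72.52 | 78.97 | 76.2 | 58.96 | 64.18 | 77.12 |
| | bbh | en | 73.68 | 54.63 | 59.43 | 67.86 | 68.45 | 59.9 | 46.54 |
| Code | humaneval | en | 69.51 | 75 | 60.37 | 26.22 | 5.49 | 27.44 | 60.98 |
| | mbpp | en | 60 | 60 | 43.6 | 56.8 | 51.2 | 42.6 | 54 |
| Math | math | en | 26.86 | 38 | 27.14 | 27.06 | 28.52 | 15.32 | 38.34 |
| | gsm8k | en | 78.54 | 79.76 | 52.54 | 71.11 | 73.09 | 56.25 | 75.51 |
| Overall | avg_zh | | 70.35 | 71.58 | 71.35 | 68.39 | 51.13 | 57.62 | 71.74 |
| | avg_all | | 73.11 | 71.78 | 69.60 | 68.88 | 61.60 | 62.32 | 70.61 |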
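
The Overall rows can be checked by hand. The sketch below assumes that `avg_zh` is the unweighted mean of the `zh` datasets and `avg_all` the unweighted mean of all 22 datasets; the source does not state this, but the assumption reproduces the reported 360Zhinao2-7B values. All scores are copied from the table.

```python
from statistics import mean

# dataset -> (language, 360Zhinao2-7B score), copied from the evaluation table
SCORES = {
    "ceval": ("zh", 83.04), "mmlu": ("en", 67.84), "cmmlu": ("zh", 73.8),
    "ARC-c": ("en", 87.12), "ARC-e": ("en", 92.77),
    "WiC": ("en", 49.84), "WSC": ("en", 65.38),
    "BoolQ": ("en", 88.29), "commonsense_qa": ("en", 69.78),
    "C3": ("zh", 93.26), "race-middle": ("en", 90.46), "race-high": ("en", 86.74),
    "lcsts": ("zh", 18.61), "eprstmt-dev": ("zh", 90.0), "lambada": ("en", 72.56),
    "hellaswag": ("en", 71.49), "siqa": ("en", 77.12), "bbh": ("en", 46.54),
    "humaneval": ("en", 60.98), "mbpp": ("en", 54.0),
    "math": ("en", 38.34), "gsm8k": ("en", 75.51),
}

# Unweighted mean over the Chinese datasets only (assumed definition of avg_zh).
avg_zh = mean(score for lang, score in SCORES.values() if lang == "zh")
# Unweighted mean over every dataset (assumed definition of avg_all).
avg_all = mean(score for _, score in SCORES.values())

print(f"avg_zh  = {avg_zh:.2f}")   # 71.74, matching the table
print(f"avg_all = {avg_all:.2f}")  # 70.61, matching the table
```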

# Quickstart

We provide simple examples illustrating the use of 360Zhinao2-7B-Base and 360Zhinao2-7B-Chat on 🤖ModelScope and 🤗Transformers.

## Dependency Installation

- python >= 3.8
- pytorch >= 2.0
- transformers >= 4.37.2
- CUDA >= 11.4

```shell
pip install -r requirements.txt
```

Optionally, we recommend installing Flash-Attention 2 to improve performance and reduce memory footprint.

> flash-attn >= 2.3.6

```shell
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn==2.3.6
```

## 🤗 Transformers

### Demonstration of Base Model Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

MODEL_NAME_OR_PATH = "qihoo360/360Zhinao2-7B-Base"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    device_map="auto",
    trust_remote_code=True)

generation_config = GenerationConfig.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

# Prompt: "The 24 Chinese solar terms", with the first five listed;
# the base model is expected to continue the list.
inputs = tokenizer('中国二十四节气\n1. 立春\n2. 雨水\n3. 惊蛰\n4. 春分\n5. 清明\n', return_tensors='pt')
inputs = inputs.to(model.device)

pred = model.generate(input_ids=inputs["input_ids"], generation_config=generation_config)
print("outputs:\n", tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```

### Demonstration of Chat Model Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

MODEL_NAME_OR_PATH = "qihoo360/360Zhinao2-7B-Chat-4K"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    device_map="auto",
    trust_remote_code=True)

generation_config = GenerationConfig.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

messages = []

# round-1: "Tell me about Andy Lau"
messages.append({"role": "user", "content": "介绍一下刘德华"})
response = model.chat(tokenizer=tokenizer, messages=messages, generation_config=generation_config)
messages.append({"role": "assistant", "content": response})
print(messages)

# round-2: "What are his best-known works?"
messages.append({"role": "user", "content": "他有什么代表作?"})
response = model.chat(tokenizer=tokenizer, messages=messages, generation_config=generation_config)
messages.append({"role": "assistant", "content": response})
print(messages)
```

## 🤖 ModelScope

### Demonstration of Base Model Inference

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

MODEL_NAME_OR_PATH = "qihoo360/360Zhinao2-7B-Base"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    device_map="auto",
    trust_remote_code=True)

generation_config = GenerationConfig.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

# Prompt: "The 24 Chinese solar terms", with the first five listed;
# the base model is expected to continue the list.
inputs = tokenizer('中国二十四节气\n1. 立春\n2. 雨水\n3. 惊蛰\n4. 春分\n5. 清明\n', return_tensors='pt')
inputs = inputs.to(model.device)

pred = model.generate(input_ids=inputs["input_ids"], generation_config=generation_config)
print("outputs:\n", tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```

### Demonstration of Chat Model Inference

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

MODEL_NAME_OR_PATH = "qihoo360/360Zhinao2-7B-Chat-4K"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    device_map="auto",
    trust_remote_code=True)

generation_config = GenerationConfig.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

messages = []

# round-1: "Tell me about Andy Lau"
messages.append({"role": "user", "content": "介绍一下刘德华"})
response = model.chat(tokenizer=tokenizer, messages=messages, generation_config=generation_config)
messages.append({"role": "assistant", "content": response})
print(messages)

# round-2: "What are his best-known works?"
messages.append({"role": "user", "content": "他有什么代表作?"})
response = model.chat(tokenizer=tokenizer, messages=messages, generation_config=generation_config)
messages.append({"role": "assistant", "content": response})
print(messages)
```

## CLI Demo

Run the command-line demo in a terminal:

```shell
python cli_demo.py
```

Note: for Mac users, `device = 'mps'` is not supported yet.

## Web Demo

```shell
streamlit run web_demo.py
```

## API Demo

Launch the API server:

```shell
python openai_api.py
```

Then send a request with generation parameters:

```shell
curl 'http://localhost:8360/v1/chat/completions' \
-H 'Content-Type: application/json' \
-d '{
    "max_new_tokens": 200,
    "do_sample": true,
    "top_k": 0,
    "top_p": 0.8,
    "temperature": 1.0,
    "repetition_penalty": 1.0,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好"}
    ]
}'
```
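
The same request can be built from Python with only the standard library. This is a minimal sketch: the URL and parameters are copied from the curl example, while the helper names `build_chat_request` and `send` are hypothetical, and the shape of the JSON reply is not specified here, so the reply is returned unparsed beyond JSON decoding.

```python
import json
import urllib.request

API_URL = "http://localhost:8360/v1/chat/completions"  # from the curl example


def build_chat_request(messages, max_new_tokens=200):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "top_k": 0,
        "top_p": 0.8,
        "temperature": 1.0,
        "repetition_penalty": 1.0,
        "messages": messages,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你好"},
])
# reply = send(request)  # requires `python openai_api.py` to be running
```

Building and sending are kept separate so the payload can be inspected or logged before any network call is made.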
# Model Inference

## Quantization

We provide quantization schemes based on AutoGPTQ and release the Int4 quantized models.

## Deployment

### vLLM Installation

We recommend using `vLLM==0.3.3`.

If you are using **CUDA 12.1 and PyTorch 2.1**, you can install vLLM directly with:

```shell
pip install vllm==0.3.3
```

Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).

After installation, perform the following steps:

1. Copy `vllm/zhinao.py` into `vllm/model_executor/models` in your vllm installation directory (in your python/conda env).
2. Copy `vllm/serving_chat.py` into `vllm/entrypoints/openai` in your vllm installation directory.
3. Add the following line to `vllm/model_executor/models/__init__.py`:

    ```python
    "ZhinaoForCausalLM": ("zhinao", "ZhinaoForCausalLM"),
    ```

### vLLM Service Start

Start the service:

```shell
python -m vllm.entrypoints.openai.api_server \
    --served-model-name 360Zhinao2-7B-Chat-4K \
    --model qihoo360/360Zhinao2-7B-Chat-4K \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --host 0.0.0.0 \
    --port 8360
```

Use curl to request the service:

```shell
curl http://localhost:8360/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "360Zhinao2-7B-Chat-4K",
    "max_tokens": 200,
    "top_k": -1,
    "top_p": 0.8,
    "temperature": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好"}
    ],
    "stop": [
        "",
        "<|im_end|>",
        "<|im_start|>"
    ]
}'
```

Use python to request the service:

```python
from openai import OpenAI

# The local vLLM server does not check the API key, but the client requires one.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8360/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="360Zhinao2-7B-Chat-4K",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好"},
    ],
    stop=[
        "",
        "<|im_end|>",
        "<|im_start|>"
    ],
    presence_penalty=0.0,
    frequency_penalty=0.0
)
print("Chat response:", chat_response)
```

> If you need to enable repetition penalty, we recommend setting `presence_penalty` and `frequency_penalty` instead of `repetition_penalty`.
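
As an illustration of the recommendation on penalties, the sketch below assembles the keyword arguments for `client.chat.completions.create` with non-zero `presence_penalty` and `frequency_penalty`. The helper name `build_chat_kwargs` and the value `0.5` are hypothetical choices for illustration, not recommended settings; the range check reflects the `[-2.0, 2.0]` interval accepted by the OpenAI-compatible API.

```python
MODEL_NAME = "360Zhinao2-7B-Chat-4K"  # the served model name used above


def build_chat_kwargs(user_text, presence_penalty=0.0, frequency_penalty=0.0):
    """Return keyword arguments for client.chat.completions.create."""
    penalties = {
        "presence_penalty": presence_penalty,
        "frequency_penalty": frequency_penalty,
    }
    for name, value in penalties.items():
        # Reject values outside the interval accepted by the API.
        if not -2.0 <= value <= 2.0:
            raise ValueError(f"{name} must be within [-2.0, 2.0], got {value}")
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        **penalties,
    }


# 0.5 is an illustrative value only.
kwargs = build_chat_kwargs("你好", presence_penalty=0.5, frequency_penalty=0.5)
# chat_response = client.chat.completions.create(**kwargs)
# print(chat_response.choices[0].message.content)  # reply text only
```

The stop list from the example request is left out of the sketch for brevity and would be passed alongside these arguments.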
# Model Finetune

## Training data

Training Data: `data/training_data_sample.json`. This example data has 10,000 rows sampled from [multiturn_chat_0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M) and converted to the format below.

Data Format:

```json
[
  {
    "id": 1,
    "conversations": [
        {
            "from": "system",
            "value": "You are a helpful assistant."
        },
        {
            "from": "user",
            "value": "您好啊"
        },
        {
            "from": "assistant",
            "value": "你好!我今天能为您做些什么?有什么问题或需要帮助吗? 我在这里为您提供服务。"
        }
    ]
  }
]
```

## Finetuning scripts

```shell
set -x

HOSTFILE=hostfile
DS_CONFIG=./finetune/ds_config_zero2.json

# PARAMS
LR=5e-6
EPOCHS=3
MAX_LEN=4096
BATCH_SIZE=4
NUM_NODES=1
NUM_GPUS=8
MASTER_PORT=29500

IS_CONCAT=False # Whether to concatenate to maximum length (MAX_LEN)

DATA_PATH="./data/training_data_sample.json"
MODEL_PATH="qihoo360/360Zhinao2-7B-Base"
OUTPUT_DIR="./outputs/"

deepspeed --hostfile ${HOSTFILE} \
        --master_port ${MASTER_PORT} \
        --num_nodes ${NUM_NODES} \
        --num_gpus ${NUM_GPUS} \
        finetune.py \
        --report_to "tensorboard" \
        --data_path ${DATA_PATH} \
        --model_name_or_path ${MODEL_PATH} \
        --output_dir ${OUTPUT_DIR} \
        --model_max_length ${MAX_LEN} \
        --num_train_epochs ${EPOCHS} \
        --per_device_train_batch_size ${BATCH_SIZE} \
        --gradient_accumulation_steps 1 \
        --save_strategy steps \
        --save_steps 200 \
        --learning_rate ${LR} \
        --lr_scheduler_type cosine \
        --adam_beta1 0.9 \
        --adam_beta2 0.95 \
        --adam_epsilon 1e-8 \
        --max_grad_norm 1.0 \
        --weight_decay 0.1 \
        --warmup_ratio 0.01 \
        --gradient_checkpointing True \
        --bf16 True \
        --tf32 True \
        --deepspeed ${DS_CONFIG} \
        --is_concat ${IS_CONCAT} \
        --logging_steps 1 \
        --log_on_each_node False
```

```shell
bash finetune/ds_finetune.sh
```

- Configuring `HOSTFILE` switches between single-machine and multi-machine training.
- Configuring `ds_config` (`DS_CONFIG` in the script) switches between zero1, zero2 and zero3.
- `fp16` and `bf16` configure mixed precision training. bf16 is recommended, to be consistent with the pretrained model.
- `is_concat` configures whether the training data is concatenated or not.
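
Before launching a run, it can help to check that custom data follows the format shown under Training data. The sketch below is not part of `finetune.py`; the helper names and the assumption that `system`, `user` and `assistant` are the only allowed roles are illustrative. It validates one record and converts it to the `role`/`content` message list used in the Quickstart chat examples.

```python
import json

# Roles that appear in the sample data; assumed here to be the full set.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Raise ValueError if a record does not match the documented format."""
    if "id" not in record:
        raise ValueError("record is missing 'id'")
    conversations = record.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        raise ValueError("'conversations' must be a non-empty list")
    for index, turn in enumerate(conversations):
        if turn.get("from") not in ALLOWED_ROLES:
            raise ValueError(f"turn {index}: unexpected role {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str):
            raise ValueError(f"turn {index}: 'value' must be a string")


def to_messages(record):
    """Convert a validated record to a list of role/content messages."""
    validate_record(record)
    return [{"role": turn["from"], "content": turn["value"]}
            for turn in record["conversations"]]


sample = json.loads("""
[{"id": 1, "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "user", "value": "您好啊"},
    {"from": "assistant", "value": "你好!我今天能为您做些什么?"}
]}]
""")
messages = to_messages(sample[0])
print(messages)
```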
# License

The source code of this repository is released under the open-source Apache 2.0 license.

The 360Zhinao open-source models support free commercial use. You do not need to submit a request for commercial usage.