
TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z)


Shisa 7B v1 - GPTQ

Description

This repo contains GPTQ model files for AUGMXNT's Shisa 7B v1.

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

These files were quantised using hardware kindly provided by Massed Compute.

Repositories available

Prompt template: Llama-2-Chat

[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]
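
As a minimal sketch, the template above can be filled in with a small helper. The function name build_prompt and the default system message are illustrative assumptions, not part of the model repo.

```python
def build_prompt(prompt: str, system_message: str = "You are a helpful assistant.") -> str:
    """Fill the single-turn Llama-2-Chat template shown above."""
    # The default system_message is a placeholder assumption; use your own.
    return (
        "[INST] <<SYS>>\n"
        f"{system_message}\n"
        "<</SYS>>\n"
        f"{prompt} [/INST]"
    )


print(build_prompt("Tell me about AI"))
```

The tokenizer normally adds the leading bos_token itself, so it is left out of the string here.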

Known compatible clients / servers

GPTQ models are currently supported on Linux (NVidia/AMD) and Windows (NVidia only). macOS users: please use GGUF models.

These GPTQ models are known to work in the following inference servers/webuis.

This may not be a complete list; if you know of others, please let me know!

Provided files, and GPTQ parameters

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

Each separate quant is in a different branch. See below for instructions on fetching from different branches.

Most GPTQ files are made with AutoGPTQ. Mistral models are currently made with Transformers.

Explanation of GPTQ parameters
  • Bits: The bit size of the quantised model.
  • GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" means no grouping, which gives the lowest VRAM use.
  • Act Order: True or False. Also known as desc_act. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
  • Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
  • GPTQ dataset: The calibration dataset used during quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
  • Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
  • ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama and Mistral models in 4-bit.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | Yes | 0.1 | Shisa English Japanese DPO | 4096 | 5.60 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | Shisa English Japanese DPO | 4096 | 6.01 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | Shisa English Japanese DPO | 4096 | 8.96 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | Shisa English Japanese DPO | 4096 | 9.12 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
| gptq-8bit-32g-actorder_True | 8 | 32 | Yes | 0.1 | Shisa English Japanese DPO | 4096 | 9.61 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
| gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | Shisa English Japanese DPO | 4096 | 5.74 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
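
To make the group size trade-off concrete, here is a rough back-of-the-envelope sketch of storage cost per weight. It assumes each group stores one 16-bit scale and one zero point of the same width as the weights, and it treats "None" as a single group spanning an assumed 4096 input features. These are simplifying assumptions, and the result will not reproduce the file sizes listed above, which also include unquantised tensors.

```python
from typing import Optional

SCALE_BITS = 16  # assumption: one fp16 scale stored per group


def bits_per_weight(bits: int, group_size: Optional[int], in_features: int = 4096) -> float:
    """Approximate storage per weight, including per-group scale and zero point."""
    # With no group size, assume one group covers the whole input dimension.
    group = in_features if group_size is None else group_size
    overhead = (SCALE_BITS + bits) / group
    return bits + overhead


for bits, gs in [(4, 32), (4, 64), (4, 128), (8, 32), (8, 128), (8, None)]:
    print(f"{bits}-bit, group size {gs}: ~{bits_per_weight(bits, gs):.3f} bits per weight")
```

Smaller groups carry more scale and zero-point metadata per weight, which is why 32g files are larger than 128g files at the same bit width.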

How to download, including from branches

In text-generation-webui

To download from the main branch, enter TheBloke/shisa-7B-v1-GPTQ in the "Download model" box.

To download from another branch, add :branchname to the end of the download name, eg TheBloke/shisa-7B-v1-GPTQ:gptq-4bit-32g-actorder_True

From the command line

I recommend using the huggingface-hub Python library:

pip3 install huggingface-hub

To download the main branch to a folder called shisa-7B-v1-GPTQ:

mkdir shisa-7B-v1-GPTQ
huggingface-cli download TheBloke/shisa-7B-v1-GPTQ --local-dir shisa-7B-v1-GPTQ --local-dir-use-symlinks False

To download from a different branch, add the --revision parameter:

mkdir shisa-7B-v1-GPTQ
huggingface-cli download TheBloke/shisa-7B-v1-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir shisa-7B-v1-GPTQ --local-dir-use-symlinks False
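
The same download can also be scripted from Python. This is a sketch that assumes the huggingface-hub package is installed; snapshot_download is its function for fetching a whole repo, while the helper names download_kwargs and download are made up for this example.

```python
def download_kwargs(branch: str = "main", repo_id: str = "TheBloke/shisa-7B-v1-GPTQ") -> dict:
    """Build the arguments for fetching one branch into a folder named after the repo."""
    local_dir = repo_id.split("/")[-1]
    return {"repo_id": repo_id, "revision": branch, "local_dir": local_dir}


def download(branch: str = "main") -> str:
    """Download the given branch and return the local path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(branch))


# Example (starts a multi-gigabyte download):
# path = download("gptq-4bit-32g-actorder_True")
```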
More advanced huggingface-cli download usage

If you remove the --local-dir-use-symlinks False parameter, the files will instead be stored in the central Hugging Face cache directory (default location on Linux is: ~/.cache/huggingface), and symlinks will be added to the specified --local-dir, pointing to their real location in the cache. This allows for interrupted downloads to be resumed, and allows you to quickly clone the repo to multiple places on disk without triggering a download again. The downside, and the reason why I don't list that as the default option, is that the files are then hidden away in a cache folder and it's harder to know where your disk space is being used, and to clear it up if/when you want to remove a downloaded model.

The cache location can be changed with the HF_HOME environment variable, and/or the --cache-dir parameter to huggingface-cli.

For more documentation on downloading with huggingface-cli, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.

To accelerate downloads on fast connections (1Gbit/s or higher), install hf_transfer:

pip3 install hf_transfer

And set environment variable HF_HUB_ENABLE_HF_TRANSFER to 1:

mkdir shisa-7B-v1-GPTQ
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/shisa-7B-v1-GPTQ --local-dir shisa-7B-v1-GPTQ --local-dir-use-symlinks False

Windows Command Line users: You can set the environment variable by running set HF_HUB_ENABLE_HF_TRANSFER=1 before the download command.

With git (not recommended)

To clone a specific branch with git, use a command like this:

git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/shisa-7B-v1-GPTQ

Note that using Git with HF repos is strongly discouraged. It will be much slower than using huggingface-hub, and will use twice as much disk space as it has to store the model files twice (it stores every byte both in the intended target folder, and again in the .git folder as a blob.)

How to easily download and use this model in text-generation-webui

Please make sure you're using the latest version of text-generation-webui.

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

  1. Click the Model tab.

  2. Under Download custom model or LoRA, enter TheBloke/shisa-7B-v1-GPTQ.

    • To download from a specific branch, enter for example TheBloke/shisa-7B-v1-GPTQ:gptq-4bit-32g-actorder_True
    • see Provided Files above for the list of branches for each option.
  3. Click Download.

  4. The model will start downloading. Once it's finished it will say "Done".

  5. In the top left, click the refresh icon next to Model.

  6. In the Model dropdown, choose the model you just downloaded: shisa-7B-v1-GPTQ

  7. The model will automatically load, and is now ready for use!

  8. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.

    • Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file quantize_config.json.
  9. Once you're ready, click the Text Generation tab and enter a prompt to get started!
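
For reference, here is a sketch of what reading those automatically applied settings might look like. The key names are assumed to follow AutoGPTQ's usual quantize_config.json layout, and the example values are hand-written from the main branch row of the Provided Files table rather than copied from the repo's file.

```python
import json

# Hand-written example; the real quantize_config.json may contain more keys.
example_config = """
{
  "bits": 4,
  "group_size": 128,
  "desc_act": true,
  "damp_percent": 0.1
}
"""


def describe_quant(config_text: str) -> str:
    """Summarise the main GPTQ settings found in a quantize_config.json string."""
    cfg = json.loads(config_text)
    group_size = cfg.get("group_size", -1)
    # Assumption: -1 means no grouping, as in the branch name gptq-8bit--1g-actorder_True.
    group = "None" if group_size == -1 else f"{group_size}g"
    return f"{cfg['bits']}-bit, group size {group}, Act Order {cfg.get('desc_act', False)}"


print(describe_quant(example_config))
```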

Serving this model from Text Generation Inference (TGI)

It's recommended to use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/shisa-7B-v1-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
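
These limits interact: the prompt may use at most --max-input-length tokens, and prompt plus generated tokens may not exceed --max-total-tokens. The sketch below works out how many new tokens a request can ask for under the example parameters; the function name is an illustrative assumption.

```python
def max_new_tokens_budget(prompt_tokens: int,
                          max_input_length: int = 3696,
                          max_total_tokens: int = 4096) -> int:
    """Return the largest max_new_tokens that fits for a prompt of the given length."""
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {max_input_length}"
        )
    return max_total_tokens - prompt_tokens


print(max_new_tokens_budget(3696))  # longest allowed prompt
print(max_new_tokens_budget(1000))  # shorter prompt leaves more room
```

With the longest allowed prompt, only 400 tokens remain for generation, so raise --max-total-tokens or lower --max-input-length if you need longer replies to long prompts.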

Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):

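pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

# Placeholder system message; replace with your own
system_message = "You are a helpful assistant."
prompt = "Tell me about AI"
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]
'''

client = InferenceClient(endpoint_url)
# Send the fully formatted prompt, not the bare user prompt
response = client.text_generation(prompt_template,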
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print(f"Model output: {response}")

Python code example: inference from this GPTQ model

Install the necessary packages

Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

pip3 install --upgrade transformers optimum
# If using PyTorch 2.1 + CUDA 12.x:
pip3 install --upgrade auto-gptq
# or, if using PyTorch 2.1 + CUDA 11.x:
pip3 install --upgrade auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

If you are using PyTorch 2.0, you will need to install AutoGPTQ from source. Likewise if you have problems with the pre-built wheels, you should try building from source:

pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.5.1
pip3 install .

Example Python code

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/shisa-7B-v1-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

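# Placeholder system message; replace with your own
system_message = "You are a helpful assistant."
prompt = "Tell me about AI"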
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])

Compatibility

The files provided are tested to work with Transformers. For non-Mistral models, AutoGPTQ can also be used directly.

ExLlama is compatible with Llama and Mistral models in 4-bit. Please see the Provided Files table above for per-file compatibility.

For a list of clients/servers, please see "Known compatible clients / servers", above.

Discord

For further support, and discussions on these models and AI in general, join us at:

TheBloke AI's Discord server

Thanks, and how to contribute

Thanks to the chirper.ai team!

Thanks to Clay from gpus.llm-utils.org!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Special thanks to: Aemon Algiz.

Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros

Thank you to all my generous patrons and donors!

And thank you again to a16z for their generous grant.

Original model card: AUGMXNT's Shisa 7B v1

Shisa 7B

Shi-chan and Sa-chan/シーちゃんとサーちゃん

Shisa 7B (shisa-7b-v1) is a bilingual Japanese and English (JA/EN) general-purpose chat model that aims to achieve strong Japanese language performance while retaining robust English capabilities, using a synthetic-data driven approach.

This model is based on Mistral 7B with a custom JA-optimized extended tokenizer that is >2X more efficient in Japanese than Mistral's original tokenizer. The base model was pre-trained for an additional 8B primarily Japanese tokens. It was then fine-tuned with an expanded, machine-translated version of airoboros-3.1, a set of the highest-scoring items from ultrafeedback_binarized, and additional airoboros data freshly generated directly in the target languages.

We also release our base model, datasets, and pipeline code under a permissive Apache 2.0 license which can be used for any purpose, commercial or otherwise:

  • shisa-base-7b-v1 - our base model w/ an extended tokenizer and additional JA pre-training
  • shisa-pretrain-en-ja-v1 - our pre-training data set
  • ultra-orca-boros-en-ja - a synthetically generated, machine-translated, programmatically validated JA/EN fine-tuning dataset
  • shisa-en-ja-dpo-v1 - Small subset of DPO pairs from ultrafeedback, along with JA DPO pairs using GPT-4 generated items as the chosen value, and outputs from our preliminary 7b model as the rejected values
  • Shisa repository - this includes our translation, dataset generation, training, and evaluation code

Moreover, we are in the process of publishing extended writeups and more details of our process, including ablation results, testing methodology, and key findings on our project wiki that may be of interest to fellow researchers.

Fine-Tuning

Our original intuition was to see if we could create a stronger Japanese model using the best existing public JA training sets and incorporating them. After initial review and testing, however, we decided that focusing solely on translation/generation of our own synthetic datasets could yield superior results with less training.

We compared multiple translation tools and, via manual review, judged that while gpt-4 almost always delivered the highest quality translations, Google's text-bison-32k was a good balance of quality, cost and throughput. Over various iterations, we refined our translation approach to include some additional algorithms for flagging and filtering invalid translations, re-translating and backfilling as necessary.

We also took this project as an opportunity to apply some newer techniques such as incorporating NEFTune and DPO training.

For our v1 release, we picked from our release candidates based on a significant amount of human preference testing (thousands of generations and multiple rounds of pairwise comparisons). We analyzed our results with both win/loss/draw and BTL modeling (iLSR, using choix).
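
To illustrate the idea behind BTL (Bradley-Terry-Luce) modeling of pairwise preferences, here is a small self-contained sketch. It is not the team's analysis code: they used choix's iLSR, whereas this uses the classic minorization-maximization update, ignores draws, and runs on made-up comparison counts.

```python
from collections import Counter


def bradley_terry(n_items, comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs."""
    wins = Counter(winner for winner, _ in comparisons)
    # Number of comparisons per unordered pair of items
    pair_counts = Counter(tuple(sorted(pair)) for pair in comparisons)
    strengths = [1.0] * n_items
    for _ in range(iterations):
        updated = []
        for i in range(n_items):
            denom = sum(n / (strengths[a] + strengths[b])
                        for (a, b), n in pair_counts.items() if i in (a, b))
            updated.append(wins[i] / denom if denom > 0 else strengths[i])
        # Rescale so the strengths average to 1
        total = sum(updated)
        strengths = [s * n_items / total for s in updated]
    return strengths


# Made-up results for three candidate models (0, 1, 2), ten comparisons per pair
comparisons = ([(0, 1)] * 7 + [(1, 0)] * 3 +
               [(1, 2)] * 6 + [(2, 1)] * 4 +
               [(0, 2)] * 8 + [(2, 0)] * 2)
strengths = bradley_terry(3, comparisons)
print([round(s, 3) for s in strengths])
```

Under this model, the probability that model i is preferred over model j is strengths[i] / (strengths[i] + strengths[j]), which gives a single ranking even when individual pairings disagree.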

The best candidate model was fine-tuned in a 3-step process:

  1. First, the model was fine-tuned on ultra-orca-boros-en-ja and SlimOrca (WandB Log)
  2. Next, we performed one additional epoch using only a subset of Japanese ultra-orca-boros-en-ja items to enhance JA performance (as SlimOrca from the first step is mostly EN) (WandB Log)
  3. Finally, the model was tuned using a DPOTrainer on a small subset of ultrafeedback (EN) and our own JA DPO dataset which uses gpt-4 outputs as the chosen values and outputs from stage 1's prelim model as rejected values. (WandB Log)
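
The JA DPO pairs described in step 3 can be pictured as records like the one built below. This is a sketch: the "prompt", "chosen" and "rejected" field names follow the convention commonly used with DPO trainers and are an assumption about the dataset layout, and the example strings are placeholders rather than real data.

```python
def make_dpo_record(prompt: str, gpt4_output: str, prelim_output: str) -> dict:
    """Pair a GPT-4 answer (chosen) with the preliminary model's answer (rejected)."""
    return {
        "prompt": prompt,
        "chosen": gpt4_output,      # preferred completion
        "rejected": prelim_output,  # completion the model should move away from
    }


# Placeholder strings for illustration only
record = make_dpo_record(
    prompt="<a Japanese instruction>",
    gpt4_output="<answer generated by gpt-4>",
    prelim_output="<answer generated by the stage 1 model>",
)
print(record)
```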

During our training process, we also gained some key insights on why some existing Japanese models seem to underperform even versus models that have no additional JA training, and we hope that sharing this analysis will be useful to other teams developing Japanese language models.

While we need to explore this further, as an experimental validation, we applied a version of our fine-tuning set onto an existing base model ("Gamma 7B") and the initial JA MT-Bench results suggest that we can drastically increase functional performance with our tuning approach:

| Model | Score |
| ----- | ----- |
| shisa-gamma-7b-allsources-v0.4 | 5.65 |
| ja-stablelm-instruct-gamma-7b* | 4.01 |

Performance

Throughout our training, we did extensive human evaluation for each model to cross-validate our model performance, and we are currently conducting larger-scale manual head-to-head testing between models. Our intention is to open up and scale this data collection as we further develop our tools. For more information and updates, please see our project wiki.

While we believe llm-jp-eval is a useful metric for our base model, and it was extremely useful during our tuning process for initial validations, as our fine-tune training includes a percentage of the benchmark train splits, we provide these llm-jp-eval results primarily as a point of interest:

| AVR | MC | NLI | QA | RC |
| --- | -- | --- | -- | -- |
| 0.7480 | 0.8900 | 0.8040 | 0.4153 | 0.8825 |

(We run a slightly modified llm-jp-eval to support testing of Qwen and to emit a bos_token if available)

For our final model, since it's customary to include benchmarks, we've used Stability AI Japan's Japanese MT-Bench as a more representative test of our model's capabilities. For our JA MT-Bench testing we use a Japanese prompt ("あなたは役立つアシスタントです。") as well as --num-choices 4 in an effort to reduce sampling variability. However, we've still observed regular swings of 0.5+ points (and sometimes even greater) between generations, as well as issues with default prompts and parameters when testing. So again, we'd urge caution against over-interpreting these scores, and suggest treating them as a probabilistic directional indicator rather than a definitive score or ranking:

| Benchmark | Score |
| --------- | ----- |
| JA MT-Bench | 5.02 |
| MT-Bench | 5.71 |

There is an MT-Bench Leaderboard, but as JA MT-Bench is still under development, for convenience, here is a comparison of the JA MT-Bench scores of some other models (our scores were rated by gpt-4-0613):

| Model | Score |
| ----- | ----- |
| gpt-4-0613 | 9.40 |
| gpt-4-1106-preview | 9.17 |
| gpt-3.5-turbo* | 8.41 |
| Qwen-14B-Chat | 7.47 |
| shisa-7b-v1 | 5.02 |
| ELYZA-japanese-Llama-2-7b-fast-instruct* | 4.86 |
| ja-stablelm-instruct-gamma-7b* | 4.01 |
| japanese-stablelm-instruct-alpha-7b* | 2.74 |
| Mistral-7B-OpenOrca-ja* | 2.23 |
| youri-7b-chat* | 2.00 |
| Mistral-7B-Instruct-v0.1* | 1.78 |
| llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0* | 1.31 |

(JA MT-Bench results marked with * in this section are sourced from shi3z)

Limitations

Although our model demonstrates a reasonably high level of Japanese fluency, as a 7B parameter model, it is prone to higher hallucination rates and less effective instruction following and reasoning than larger-class models. Also, it still does not have complete mastery of the Japanese language and a native speaker will spot occasional mistakes like some non-idiomatic/awkward phrasing, improper tenses/speech levels, etc.

We've also noticed a small amount of language leakage, likely largely attributable to our tokenizer expansion. This may be fixable with sampler settings like Min P or additional targeted training, and we plan on doing additional work on automated detection/sampler sweeps in the future. One interesting observation is that, based on our data collection, the DPO process significantly exacerbated this issue as we iterated, but our DPO models still had significantly higher human preference rates, so there was a bit of a trade-off in our choice of final tune.
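
As a rough illustration of how a Min P sampler setting could suppress stray tokens, the sketch below drops every token whose probability is below min_p times the top token's probability and renormalises the rest. The toy distribution is invented for the example, and the authors have not published which settings, if any, fix the leakage.

```python
def min_p_filter(probs: dict, min_p: float = 0.05) -> dict:
    """Keep tokens with probability >= min_p * (top probability), then renormalise."""
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Invented next-token distribution: mostly Japanese, with one unlikely English token
toy_probs = {"は": 0.60, "が": 0.30, "的": 0.08, "the": 0.02}
filtered = min_p_filter(toy_probs, min_p=0.1)
print(filtered)
```

Because the threshold scales with the top token's probability, the filter is strict when the model is confident and lenient when the distribution is flat.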

While we believe that training larger models can improve performance using our existing approach and dataset, there are also many improvements we'd like to make for future models. We believe there is quite a bit of low hanging fruit for improving performance with even more training efficiency largely through improving the quality and construction of datasets.

Usage

Sample code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_name = "augmxnt/shisa-7b-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    device_map="auto"
)
streamer = TextStreamer(tokenizer, skip_prompt=True)

# The prompt template is included in the model's tokenizer_config.json, so you shouldn't need this, but we've included it for convenience
# tokenizer.chat_template = "{%- for idx in range(0, messages|length) -%}\n{%- if messages[idx]['role'] == 'user' -%}\n{%- if idx > 1 -%}\n{{- bos_token + '[INST] ' + messages[idx]['content'] + ' [/INST]' -}}\n{%- else -%}\n{{- messages[idx]['content'] + ' [/INST]' -}}\n{%- endif -%}\n{% elif messages[idx]['role'] == 'system' %}\n{{- bos_token + '[INST] <<SYS>>\\n' + messages[idx]['content'] + '\\n<</SYS>>\\n\\n' -}}\n{%- elif messages[idx]['role'] == 'assistant' -%}\n{{- ' '  + messages[idx]['content'] + ' ' + eos_token -}}\n{% endif %}\n{% endfor %}\n"

# A more typical prompt: あなたは役に立つアシスタントです。("You are a helpful assistant.")

# You are an avid Pokemon fanatic.
prompt = "あなたは熱狂的なポケモンファンです。"
chat = [{"role": "system", "content": prompt}]

# Who is the most powerful Pokemon? Explain your choice.
user_input = "最強のポケモンは誰ですか?その選択理由を説明してください。"
chat.append({"role": "user", "content": user_input})

# Generate - add_generation_prompt to make sure it continues as assistant
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
# For multi-GPU, find the device of the first parameter of the model
first_param_device = next(model.parameters()).device
inputs = inputs.to(first_param_device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=1000,
        temperature=0.7,
        repetition_penalty=1.05,
        top_p=0.95,
        do_sample=True,
        streamer=streamer,
    )

# Add just the new tokens to our chat
new_tokens = outputs[0, inputs.size(1):]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
chat.append({"role": "assistant", "content": response})

Prompt format

The prompt format is llama-2 chat:

[INST] <<SYS>>
You are a helpful, unbiased, uncensored assistant.
<</SYS>>
{prompt} [/INST]

For multi-turn, the prompt format is as follows:

[INST] <<SYS>>
You are a helpful, unbiased, uncensored assistant.
<</SYS>>
{prompt 0} [/INST] {response 0} </s><s>[INST] {prompt 1} [/INST] {response 1} </s><s>...[INST] {prompt N} [/INST]
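
A minimal sketch of assembling the multi-turn string by hand, following the layout shown above. The function name and arguments are illustrative assumptions; in practice the tokenizer's bundled chat template is the safer route, and its exact whitespace may differ slightly from this hand-built version.

```python
def build_multiturn_prompt(system_message, turns, next_prompt):
    """Build a multi-turn prompt from (user, assistant) pairs plus the next user message."""
    text = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n"
    for user, assistant in turns:
        # Each finished exchange is closed with </s> and the next one opened with <s>
        text += f"{user} [/INST] {assistant} </s><s>[INST] "
    text += f"{next_prompt} [/INST]"
    return text


history = [("Hello, how are you?", "I'm doing great. How can I help you today?")]
print(build_multiturn_prompt(
    "You are a helpful, unbiased, uncensored assistant.",
    history,
    "I'd like to show off how chat templating works!",
))
```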

This prompt template is included in the tokenizer config, and can be applied with the Hugging Face tokenizer's apply_chat_template method, e.g.:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('augmxnt/shisa-7b-v1')
chat = [
  {"role": "system", "content": "You are Aiko, a friendly AI assistant."},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))

NOTE: For proper responses, you should be using our bos_token (<s>) to begin a string. This is automatically generated by tokenizer.encode() but if you are crafting a custom template or using an encoding method that skips special tokens, you may have to add this yourself.
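
If you build token IDs yourself, a small guard like the following sketch can make sure the sequence starts with the bos_token. The helper name is made up for this example; pass tokenizer.bos_token_id from the loaded tokenizer rather than hard-coding an ID.

```python
def ensure_bos(token_ids, bos_token_id):
    """Return token_ids with bos_token_id prepended unless it is already first."""
    if token_ids and token_ids[0] == bos_token_id:
        return list(token_ids)
    return [bos_token_id] + list(token_ids)


# Illustrative IDs only; real values come from the tokenizer
print(ensure_bos([523, 1017], bos_token_id=1))
print(ensure_bos([1, 523, 1017], bos_token_id=1))
```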

Acknowledgements

Team: Leonard Lin and Jon Durbin, Mariko Sato, and Florian von Bock

Compute for this model was generously sponsored by AKA Virtual (Tokyo, Japan).

Thanks to the LLM-jp, Stability AI Japan, and LMSYS teams for their work on llm-jp-eval, Japanese MT-Bench, MT-Bench.

Also, thanks to all the volunteers that provided invaluable human preference testing!

We are actively looking for additional compute as we train better and larger models for this project. Please drop us a line at: compute at augmxnt dot com


