metadata

datasets:
  - korean-jindo-dataset
language:
  - en
  - ko
tags:
  - dsdanielpark
  - llama2
  - instruct
  - instruction
  - jindo
  - korean
  - translation
  - 7b
pipeline_tag: text-generation

Sample repository

Development Status :: 2 - Pre-Alpha
Developed by MinWoo Park, 2023, Seoul, South Korea. Contact: [email protected].

danielpark/llama2-jindo-7b-instruct model card

`Jindo` is an sLLM for constructing datasets for the LLM `KOLANI`.

Warning: Training is still in progress.

This model is an LLM covering various language domains, including Korean translation and correction. Its main purpose is to create a dataset for training the Korean LLM "KOLANI" (which is still being trained). Because this model is developed by a single individual without any external support, releases and improvements may be relatively slow.

Using: QLoRA

Model Details

The weights you are currently viewing are preliminary checkpoints, and the official weights have not been released yet.

Developed by: Minwoo Park
Backbone Model: LLaMA2 [Paper]
Model Jindo Variations: jindo-instruct, jindo-chat
jindo-instruct Variations: 2b / 7b / 13b
- danielpark/ko-llama-2-jindo-2b-instruct (from LLaMA1)
- danielpark/ko-llama-2-jindo-7b-instruct (from LLaMA2)
- danielpark/ko-llama-2-jindo-13b-instruct (from LLaMA2)
- This model targets specific domains, so the 70b model will not be released.
Quantized Weights: 7b-gptq(4bit-128g) / 7b-ggml
- danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq
- danielpark/ko-llama-2-jindo-7b-instruct-ggml
Library: HuggingFace Transformers
License: This model is licensed under Meta's LLaMA2 license. The dataset's license will be checked alongside the official release; the primary goal is a release that permits commercial use by default.
Where to send comments: Feedback and comments on the model can be submitted by opening an issue in the model's Hugging Face community repository.
Contact: For questions and comments about the model, please email [email protected]

Web Demo

The web demos are implemented with several popular tools for rapidly creating web UIs.

model	web ui	quantized
danielpark/ko-llama-2-jindo-7b-instruct	using gradio on colab	-
danielpark/ko-llama-2-jindo-7b-instruct-4bit-128g-gptq	using text-generation-webui on colab	gptq
danielpark/ko-llama-2-jindo-7b-instruct-ggml	koboldcpp-v1.38	ggml

Model Revision

See REVISION.md.

Dataset Details

Orca style datasets

Used Datasets

korean-jindo-dataset
The dataset has not been released yet

No other data was used except for the dataset mentioned above

Prompt Template

We slightly modify the prompt templates of Orca, Guanaco, and Llama to accommodate considerations such as CoT (Chain of Thought).
Only the basic system message of Llama 2 is modified; the default system message is otherwise used. This is a precaution against conflicts with the pre-tuned weights of the LFM.

default_system_message = "You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense or is not factually coherent, explain why instead of answering something that is not correct. If you don't know the answer to a question, please don't share false information.\n\n"

### System:
{System}

### User:
{User}

### Assistant:
{Assistant}

### System:
{System}

### User:
{New_User_Input}

### Input:
{New User Input}

### Assistant:
{New_Assistant_Answer}

### System:
{System}

### Input:
User: {History_User_Input}
Assistant: {History_Assistant_Answer}

### User:
{New_User_Input}

### Assistant:
{New_Assistant_Answer}
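
As a minimal sketch of how the templates above might be filled in programmatically, the helper below assembles the `### System` / `### Input` / `### User` / `### Assistant` layout. The function name `build_prompt`, its parameters, and the shortened system message are hypothetical and not part of the model repository; the exact whitespace expected by the model is an assumption.

```python
from typing import List, Optional, Tuple

# Shortened stand-in for default_system_message (assumption, for illustration only).
DEFAULT_SYSTEM_MESSAGE = "You are a helpful, respectful, and honest assistant."


def build_prompt(
    user_input: str,
    system: str = DEFAULT_SYSTEM_MESSAGE,
    history: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Assemble a prompt in the '### System / ### Input / ### User / ### Assistant' layout.

    `history` is an optional list of (user, assistant) turns that is rendered
    under '### Input:' as in the multi-turn template.
    """
    parts = [f"### System:\n{system}\n"]
    if history:
        turns = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history)
        parts.append(f"### Input:\n{turns}\n")
    parts.append(f"### User:\n{user_input}\n")
    # The prompt ends with an open Assistant section for the model to complete.
    parts.append("### Assistant:\n")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("Translate 'hello' into Korean.",
                       history=[("Hi", "Hello! How can I help?")]))
```

The resulting string can be passed as the input text to any of the generation pipelines shown in the Usage section.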

Hardware and Software

Hardware
- Models under 10b: trained using the free T4 GPU resource.
- Models over 10b: trained on a single A100 on Google Colab.
Training Factors: HuggingFace trainer

Evaluation Results

Please refer to the following procedure for the evaluation of the backbone model. Other benchmarking and qualitative evaluations for Korean datasets are still pending.

Overview

We conducted a performance evaluation based on the tasks being evaluated on the Open LLM Leaderboard. We evaluated our model on four benchmark datasets, which include ARC-Challenge, HellaSwag, MMLU, and TruthfulQA. We used the lm-evaluation-harness repository, specifically commit b281b0921b636bc36ad05c0b0b0763bd6dd43463.

Usage

Please refer to the following information and install the versions compatible with your environment.

$ pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

from transformers import AutoTokenizer
import transformers
import torch

model = "danielpark/ko-llama-2-jindo-7b-instruct"
# model = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

To use the model with the transformers library on a machine with GPUs, first make sure you have the transformers and accelerate libraries installed.

$ pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"

The instruction-following pipeline can be loaded using the pipeline function as shown below. This loads a custom InstructionTextGenerationPipeline found in the model repo, which is why trust_remote_code=True is required. Including torch_dtype=torch.bfloat16 is generally recommended, where this type is supported, to reduce memory usage; it does not appear to impact output quality. It can also be removed if there is sufficient memory.

import torch
from transformers import pipeline

generate_text = pipeline(model="danielpark/ko-llama-2-jindo-7b-instruct", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

You can then use the pipeline to answer instructions:

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

Alternatively, if you prefer to not use trust_remote_code=True you can download instruct_pipeline.py, store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:

import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("danielpark/ko-llama-2-jindo-7b-instruct", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("danielpark/ko-llama-2-jindo-7b-instruct", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

LangChain Usage

To use the pipeline with LangChain, you must set return_full_text=True, as LangChain expects the full text to be returned and the default for the pipeline is to only return the new text.

import torch
from transformers import pipeline

generate_text = pipeline(model="danielpark/ko-llama-2-jindo-7b-instruct", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto", return_full_text=True)

You can create a prompt that either has only an instruction or has an instruction with context:

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# template for an instruction with no input
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}")

# template for an instruction with input
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}")

hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)

Example predicting using a simple instruction:

print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())

Example predicting using an instruction with context:

context = """George Washington (February 22, 1732 - December 14, 1799) was an American military officer, statesman,
and Founding Father who served as the first president of the United States from 1789 to 1797."""

print(llm_context_chain.predict(instruction="When was George Washington president?", context=context).lstrip())

Scripts

Prepare evaluation environments:

# clone the repository
git clone https://github.com/EleutherAI/lm-evaluation-harness.git

# change to the repository directory
cd lm-evaluation-harness

# check out the specific commit
git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463

Ethical Issues

The Jindo model has not been filtered for harmful, biased, or explicit content. As a result, outputs that do not adhere to ethical norms may be generated during use. Please exercise caution when using the model in research or practical applications.

Ethical Considerations

There were no ethical issues involved, as we did not include the benchmark test set or the training set in the model's training process. As always, we encourage responsible and ethical use of this model. Please note that while Jindo strives to provide accurate and helpful responses, it is still crucial to cross-verify the information from reliable sources for knowledge-based queries.

Contact Me

To contact me, please email [email protected].

Model Architecture

The default Llama 2 architecture is used as is.

See details

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(
            in_features=4096, out_features=4096, bias=False
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(
            in_features=4096, out_features=4096, bias=False
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
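
As a rough worked example, the number of LoRA adapter parameters can be read off the printout above: `lora_A` and `lora_B` appear only on `q_proj` and `v_proj`, with rank 64, in each of the 32 decoder layers. The sketch below does that arithmetic; the constant and function names are illustrative, and the count assumes the adapters carry no bias terms, as the printout shows `bias=False`.

```python
# Dimensions taken from the module printout above.
HIDDEN_SIZE = 4096                      # in_features / out_features of q_proj and v_proj
LORA_RANK = 64                          # out_features of lora_A, in_features of lora_B
NUM_LAYERS = 32                         # "(0-31): 32 x LlamaDecoderLayer"
ADAPTED_MODULES = ("q_proj", "v_proj")  # the only projections that carry lora_A / lora_B


def lora_params_per_module(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: lora_A (in x rank) plus lora_B (rank x out), no bias."""
    return in_features * rank + rank * out_features


per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)
total_lora_params = per_module * len(ADAPTED_MODULES) * NUM_LAYERS

print(f"LoRA parameters per adapted projection: {per_module:,}")
print(f"Total LoRA parameters: {total_lora_params:,}")
```

Under these assumptions the adapters add roughly 33.6 million trainable parameters on top of the frozen 4-bit base weights.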

Training procedure

The following bitsandbytes quantization config was used during training:

load_in_8bit: False
load_in_4bit: True
llm_int8_threshold: 6.0
llm_int8_skip_modules: None
llm_int8_enable_fp32_cpu_offload: False
llm_int8_has_fp16_weight: False
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: False
bnb_4bit_compute_dtype: float16
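
The settings above could be reproduced when loading the model roughly as follows. This is a sketch, not the author's training script: the names `BNB_KWARGS` and `make_bnb_config` are hypothetical, and it assumes a `transformers` version that provides `BitsAndBytesConfig` (such as the 4.31.0 release pinned in the Usage section) together with `bitsandbytes` and `torch`.

```python
# Quantization settings copied from the list above.
BNB_KWARGS = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": False,
    "bnb_4bit_compute_dtype": "float16",
}


def make_bnb_config():
    """Build a BitsAndBytesConfig from the settings above.

    The import is done lazily so this module can be inspected without
    transformers installed; calling the function requires it.
    """
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**BNB_KWARGS)
```

The returned object would typically be passed as `quantization_config=make_bnb_config()` to `AutoModelForCausalLM.from_pretrained`, alongside `device_map="auto"`.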

Framework versions

PEFT 0.4.0

License:

The licenses of the pretrained models, llama1, and llama2, along with the datasets used, are applicable. For other datasets related to this work, guidance will be provided in the official release. The responsibility for verifying all licenses lies with the user, and the developer assumes no liability, explicit or implied, including legal responsibilities.

Remark:

The "instruct" in the model name can be omitted, but it is used to differentiate between the backbones of llama2 for chat and general purposes. Additionally, this model is created for a specific purpose, so we plan to fine-tune it with a dataset focused on instructions.

Reference model cards

This model card references the model cards of the following communities and also draws from model cards from Meta AI and Upstage, The Bloke, Stability AI,