Experimental Repository :)

Contents will be updated without notice. If you plan to use this repository, pin a specific revision by its git commit hash.
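
Pinning can be done with the `revision` argument of `from_pretrained`, which accepts a branch name, a tag, or a commit id. The sketch below is illustrative only: `pinned_kwargs` and `load_pinned` are hypothetical helper names, and the commit hash has to be copied from the repository history because none is given here.

```python
import re

REPO_ID = "beomi/Mistral-Ko-Inst-dev"


def pinned_kwargs(commit_hash):
    """Return from_pretrained keyword arguments that pin a full commit hash."""
    # Requiring all 40 hex characters is a choice of this sketch, not a Hub rule.
    if not re.fullmatch(r"[0-9a-f]{40}", commit_hash):
        raise ValueError("expected a full 40-character lowercase hex commit hash")
    return {"revision": commit_hash}


def load_pinned(commit_hash, repo_id=REPO_ID):
    """Load the model and tokenizer at one fixed commit (downloads weights on first use)."""
    # Imported lazily so pinned_kwargs works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = pinned_kwargs(commit_hash)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, torch_dtype="auto", device_map="auto", **kwargs
    )
    tokenizer = AutoTokenizer.from_pretrained(repo_id, **kwargs)
    return model, tokenizer
```

Passing the same hash to both the model and the tokenizer keeps the two in sync if the vocabulary changes between commits.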

This experiment aims to:

  • Maintain the NLU capability of the Mistral-Instruct model (mistralai/Mistral-7B-Instruct-v0.1)
  • Adapt to new Korean vocabulary seamlessly
  • Use a minimal dataset (Korean Wikipedia only)
  • Use a computationally efficient method
  • Let the model answer using its English knowledge and NLU capability even when the question and answer are entirely in Korean
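
One rough way to inspect the vocabulary goal is to count how many tokens the same Korean sentence needs under the base tokenizer and under this repository's tokenizer. The sketch below is an assumption about how one might measure that, not the author's evaluation. `count_tokens`, `compare_token_counts`, and `load_tokenizers` are hypothetical helpers; the first two work with any object exposing `encode(text, add_special_tokens=False)`, as Hugging Face tokenizers do.

```python
def count_tokens(tokenizer, text):
    """Number of tokens produced for text, excluding special tokens such as <s>."""
    return len(tokenizer.encode(text, add_special_tokens=False))


def compare_token_counts(tokenizers, text):
    """Map each tokenizer label to its token count for the same text."""
    return {label: count_tokens(tok, text) for label, tok in tokenizers.items()}


def load_tokenizers():
    """Fetch both tokenizers from the Hub (needs network access, possibly a login)."""
    # Imported lazily so the counting helpers work without transformers installed.
    from transformers import AutoTokenizer

    return {
        "base": AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1"),
        "korean": AutoTokenizer.from_pretrained("beomi/Mistral-Ko-Inst-dev"),
    }
```

Calling `compare_token_counts(load_tokenizers(), text)` on a Korean sentence returns one count per label. No result is claimed here; the numbers depend on the revision that is loaded.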

Here is a quick test:

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    torch_dtype='auto',
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev')

pipe = pipeline(
    'text-generation', 
    model=model, 
    tokenizer=tokenizer, 
    do_sample=True,
    max_new_tokens=350, 
    return_full_text=False,
    no_repeat_ngram_size=6,
    eos_token_id=1,  # the model is not yet tuned to generate </s>, so stop on <s> (id 1) instead
)


def gen(x):
    """Wrap one user message in the model's chat template and print the generated reply."""
    chat = tokenizer.apply_chat_template([
        {"role": "user", "content": x},
        # Optional extra turns for a multi-turn prompt:
        # {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
        # {"role": "user", "content": "Do you have mayonnaise recipes? please say in Korean."}
    ], tokenize=False)  # tokenize=False returns the prompt as a string for the pipeline
    print(pipe(chat)[0]['generated_text'].strip())

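# Prompt in English: "What is the difference between Starbucks and Starbucks Korea?"
gen("스타벅스와 스타벅스 코리아의 차이는?")

# (생성 예시: sample generation)
# 스타벅스는 전 세계적으로 운영하고 있는 커피 전문사이다. 한국에는 스타벅스 코리아라는 이름으로 운영되고 있다.
# 스타벅스 코리아는 대한민국에 입점한 이후 2009년과 2010년에 두 차례의 브랜드과의 재검토 및 새로운 디자인을 통해 새로운 브랜드다. 커피 전문의 프리미엄 이미지를 유지하고 있고, 스타벅스 코리아는 한국을 대표하는 프리미엄 커피 전문 브랜드을 만들고 있다.
#
# Rough English translation of the sample:
# Starbucks is a coffee company that operates worldwide. In Korea it operates under the name Starbucks Korea.
# Since opening in South Korea, Starbucks Korea has become a new brand through two brand reviews and new designs in 2009 and 2010. It maintains a premium coffee-specialist image, and Starbucks Korea is building a premium coffee brand that represents Korea.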
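
The commented-out entries in `gen` hint at multi-turn prompts. A minimal sketch of building such a message list follows. `build_messages` is a hypothetical helper, and it assumes this repository's chat template expects strictly alternating user and assistant turns starting with the user, as the base Mistral-Instruct template does.

```python
def build_messages(turns):
    """Turn a flat list of utterances into chat messages with alternating roles.

    Even positions become user turns and odd positions become assistant turns,
    so the list must have odd length to end on a user message awaiting a reply.
    """
    if len(turns) % 2 == 0:
        raise ValueError("expected an odd number of turns, ending with a user message")
    roles = ("user", "assistant")
    return [{"role": roles[i % 2], "content": text} for i, text in enumerate(turns)]
```

The returned list can be passed to `tokenizer.apply_chat_template(messages, tokenize=False)` in place of the single-message list inside `gen`.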

Model size: 7.36B params (Safetensors). Tensor type: BF16.