---
base_model:
- meta-llama/Llama-3.2-3B
language:
- en
- ko
library_name: transformers
license: llama3.2
---


<a href="https://github.com/MLP-Lab/Bllossom">
  <img src="https://github.com/teddysum/bllossom/blob/main//bllossom_icon.png?raw=true" width="30%" height="30%">
</a>

# Update!
* [2024.10.08] First release of the Bllossom-3B model.



# Bllossom | [Demo]() | [Homepage](https://www.bllossom.ai/) | [Github](https://github.com/MLP-Lab/Bllossom) |

Our Bllossom team is releasing the Bllossom-3B model.
llama3.2-3B is out, but it doesn't cover Korean?? Bllossom-3B strengthens the base model, which does not support Korean, into a Korean-English bilingual model.
 - It was further pre-trained on 150GB of curated Korean text with 100% full fine-tuning. (We burned through a lot of GPUs.)
 - Instruction tuning was performed with a highly curated dataset.
 - It is a fully bilingual model whose English performance is not degraded at all (an English-prompt example follows the Korean demo below).
 - It records the highest LogicKor score among models under 5B, scoring in the low 6-point range.
 - Only instruction tuning was applied; try further tuning with methods such as DPO to push performance (a minimal DPO sketch follows this announcement).
 - We did not train on answer data or target benchmarks such as MT-Bench and LogicKor to inflate scores. (Training targeted at those benchmarks can reach 8 points...)

As always, this model is available for commercial use.

1. Bllossom was presented at AAAI 2024, NAACL 2024, and LREC-COLING 2024 (oral).
2. We will keep releasing improved language models! Anyone interested in collaborating on strengthening Korean (especially on papers) is always welcome!
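
The list above suggests further tuning with preference-optimization methods such as DPO. The snippet below is a minimal, untested sketch using TRL's `DPOTrainer`; the dataset file `korean_preferences.jsonl` (with `prompt`/`chosen`/`rejected` columns), the output directory, and all hyperparameters are illustrative assumptions, and argument names may differ slightly across TRL versions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Bllossom/llama-3.2-Korean-Bllossom-3B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical preference data: one JSON object per line with
# "prompt", "chosen", and "rejected" fields.
dataset = load_dataset("json", data_files="korean_preferences.jsonl", split="train")

args = DPOConfig(
    output_dir="bllossom-3b-dpo",
    beta=0.1,                        # strength of the implicit KL penalty toward the reference model
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,                     # a frozen reference copy is created automatically
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,             # newer TRL releases use `processing_class=` instead
)
trainer.train()
```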



```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'Bllossom/llama-3.2-Korean-Bllossom-3B'

# Load the tokenizer and the model in bfloat16, spreading weights across available devices
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Korean math word problem: "Cheolsu had 20 pencils. Yeonghui took half of them and
# Minsu took 5 of the remaining ones. How many pencils does Cheolsu have left?"
instruction = "철수가 20개의 연필을 가지고 있었는데 영희가 절반을 가져가고 민수가 남은 5개를 가져갔으면 철수에게 남은 연필의 갯수는 몇개인가요?"

messages = [
    {"role": "user", "content": f"{instruction}"}
]

# Build the prompt with the model's chat template and move it to the model's device
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop generation at either the end-of-text or the end-of-turn token
terminators = [
    tokenizer.convert_tokens_to_ids("<|end_of_text|>"),
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
```
철수가 20개의 연필을 가지고 있었고 영희가 절반을 가져가면, 영희가 가져간 연필의 갯수는 20 / 2 = 10개입니다.

이제 철수가 남은 연필의 갯수를 계산해보겠습니다. 영희가 10개를 가져간 후 철수가 남은 연필의 갯수는 20 - 10 = 10개입니다.

민수가 남은 5개를 가져갔으므로, 철수가 남은 연필의 갯수는 10 - 5 = 5개입니다.

따라서 철수가 남은 연필의 갯수는 5개입니다.
```
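
Since the model is described above as fully bilingual, the same chat-template flow works with English prompts. The following is a small illustrative variation that reuses the `model`, `tokenizer`, and `terminators` objects from the example above; the prompt text is made up.

```python
# English prompt, reusing the model, tokenizer, and terminators loaded above
messages = [
    {"role": "user", "content": "Summarize the difference between pretraining and instruction tuning in two sentences."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```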

## Supported by

 - AICA  <img src="https://aica-gj.kr/images/logo.png" width="20%" height="20%">

## Citation
**Language Model**
```text
@misc{bllossom,
  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}}
}
```

**Vision-Language Model**
```text
@misc{bllossom-V,
  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}}
}
```

## Contact
 - 임경태 (KyungTae Lim), Professor at Seoultech. `[email protected]`
 - 함영균 (Younggyun Hahm), CEO of Teddysum. `[email protected]`
 - 김한샘 (Hansaem Kim), Professor at Yonsei. `[email protected]`

## Contributor
 - **유한결 (Hangyeol Yoo)**, [email protected]
 - 신동재 (Dongjae Shin), [email protected]
 - 임현석 (Hyeonseok Lim), [email protected]
 - 원인호 (Inho Won), [email protected]
 - 김민준 (Minjun Kim), [email protected]
 - 송승우 (Seungwoo Song), [email protected]
 - 육정훈 (Jeonghun Yuk), [email protected]
 - 최창수 (Chansu Choi), [email protected]
 - 송서현 (Seohyun Song), [email protected]