---
license: mit
language:
- ko
- en
base_model: MLP-KTLim/llama-3-Korean-Bllossom-8B
---
# Model Card for AWS-NeuronCC-2-14-llama-3-Korean-Bllossom-8B


This model is an AWS Neuron compiled version (neuronx-cc 2.14) of the Korean fine-tuned model [MLP-KTLim/llama-3-Korean-Bllossom-8B](https://huggingface.co/MLP-KTLim/llama-3-Korean-Bllossom-8B). It is intended for deployment on Amazon EC2 Inferentia2 instances and Amazon SageMaker. For details about the model and its license, refer to the original MLP-KTLim/llama-3-Korean-Bllossom-8B model page.

## Model Details

This model was compiled with neuronx-cc version 2.14.
It can be deployed with the [v1.0-hf-tgi-0.0.24-pt-2.1.2-inf-neuronx-py310](https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+neuronx&expanded=true) container release.


## How to Get Started with the Model

After logging in to Amazon ECR with the required permissions, pull the Docker image `763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.24-neuronx-py310-ubuntu22.04-v1.0`, download this model (the example expects it under `./data/AWS-NeuronCC-2-14-llama-3-Korean-Bllossom-8B`), and run a command like the following:
```
docker run \
-p 8080:80 \
-v $(pwd)/data:/data \
--privileged \
763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.24-neuronx-py310-ubuntu22.04-v1.0  \
--model-id /data/AWS-NeuronCC-2-14-llama-3-Korean-Bllossom-8B
```
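The preparation steps that precede the `docker run` command above (ECR login, image pull, model download) might look like the following sketch. The image URI is the one given above. `HF_REPO_ID` is a placeholder for this model's Hugging Face repository ID, and the helper function names are illustrative. The sketch assumes that the AWS CLI, Docker, and `huggingface-cli` are installed and that your AWS credentials are allowed to pull from the registry.

```shell
#!/usr/bin/env bash
# Sketch only: this defines helper functions; nothing runs until you call them.

IMAGE="763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.24-neuronx-py310-ubuntu22.04-v1.0"

# Derive the registry host and the AWS region from the image URI.
REGISTRY="${IMAGE%%/*}"
REGION="$(printf '%s' "$REGISTRY" | cut -d. -f4)"

# Placeholder (assumption): replace <namespace> with the owner of this model repository.
HF_REPO_ID="${HF_REPO_ID:-<namespace>/AWS-NeuronCC-2-14-llama-3-Korean-Bllossom-8B}"

# The docker run example mounts $(pwd)/data at /data inside the container.
MODEL_DIR="$(pwd)/data/AWS-NeuronCC-2-14-llama-3-Korean-Bllossom-8B"

# Log in to Amazon ECR and pull the TGI Neuron image.
ecr_login_and_pull() {
  aws ecr get-login-password --region "$REGION" \
    | docker login --username AWS --password-stdin "$REGISTRY"
  docker pull "$IMAGE"
}

# Download the compiled model files into the directory that will be mounted.
download_model() {
  mkdir -p "$MODEL_DIR"
  huggingface-cli download "$HF_REPO_ID" --local-dir "$MODEL_DIR"
}
```

Calling `ecr_login_and_pull` and then `download_model` from the directory where you run `docker run` places the model under `./data`, which the `-v $(pwd)/data:/data` mount exposes at the path passed to `--model-id`.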
After deployment, you can run inference like this (the Korean prompt asks "What is deep learning?"):
```
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"딥러닝이 뭐야?","parameters":{"max_new_tokens":512}}' \
-H 'Content-Type: application/json'
```
Or, using the chat completions endpoint:
```
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
"model": "tgi",
"messages": [
    {
    "role": "system",
    "content": "당신은 인공지능 전문가 입니다."
    },
    {
    "role": "user",
    "content": "딥러닝이 무엇입니까?"
    }
],
"stream": false,
"max_tokens": 512
}' \
    -H 'Content-Type: application/json'  
```
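If you send many prompts, the `/generate` call shown earlier can be wrapped in a small helper so the JSON body is not hand-written each time. This is a minimal sketch, not part of the original instructions: the function names and the `TGI_URL` variable are illustrative, and it assumes the server is reachable on `localhost:8080` as in the `docker run` example.

```shell
#!/usr/bin/env bash
# Assumption: the container from the docker run example listens on localhost:8080.
TGI_URL="${TGI_URL:-http://localhost:8080}"

# Escape backslashes and double quotes so the prompt can sit inside a JSON string.
# Minimal on purpose: newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for /generate from a prompt and an optional token limit.
generate_payload() {
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}' \
    "$(json_escape "$1")" "${2:-512}"
}

# Send a prompt to the running server and print the raw JSON response.
tgi_generate() {
  curl -sS "$TGI_URL/generate" \
    -X POST \
    -H 'Content-Type: application/json' \
    -d "$(generate_payload "$1" "${2:-512}")"
}
```

For example, `tgi_generate "딥러닝이 뭐야?" 256` sends the same question as the first `curl` example with a smaller token limit.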

This model can be deployed to an Amazon SageMaker endpoint by following this guide: [Deploying a model stored in S3 to SageMaker INF2](https://github.com/aws-samples/aws-ai-ml-workshop-kr/blob/master/neuron/hf-optimum/04-Deploy-Llama3-8B-HF-TGI-Docker-On-INF2/notebook/03-deploy-llama-3-neuron-moel-inferentia2-from-S3.ipynb)

For details on Neuron compilation and deployment, refer to [Serving on Amazon EC2 Inferentia2 with a Docker image from Amazon ECR](https://github.com/aws-samples/aws-ai-ml-workshop-kr/blob/master/neuron/hf-optimum/04-Deploy-Llama3-8B-HF-TGI-Docker-On-INF2/README-NeuronCC-2-14.md)


## Hardware

The minimum instance type is Amazon EC2 inf2.xlarge; larger instances in the same family, such as inf2.8xlarge, inf2.24xlarge, and inf2.48xlarge, also work.
For details, see [Amazon EC2 Inf2 Instances](https://aws.amazon.com/ec2/instance-types/inf2/).
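To confirm that an instance actually exposes Inferentia2 accelerators before starting the container, one option is to count the Neuron device nodes. This sketch assumes the AWS Neuron driver is installed, in which case each accelerator appears as `/dev/neuron0`, `/dev/neuron1`, and so on. The function name and the optional directory argument are illustrative.

```shell
#!/usr/bin/env bash
# Count Neuron device nodes (neuron0, neuron1, ...) in a directory, /dev by default.
count_neuron_devices() {
  dev_dir="${1:-/dev}"
  count=0
  for dev in "$dev_dir"/neuron[0-9]*; do
    # When nothing matches, the glob stays literal and the -e test skips it.
    if [ -e "$dev" ]; then
      count=$((count + 1))
    fi
  done
  printf '%d\n' "$count"
}
```

Running `count_neuron_devices` on the instance prints the number of visible devices; a result of 0 suggests the Neuron driver is missing or the instance is not an Inf2 type.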


## Model Card Contact

Gonsoo Moon, [email protected]