|
--- |
|
library_name: transformers |
|
license: mit |
|
base_model: |
|
- meta-llama/Llama-3.2-11B-Vision-Instruct |
|
--- |
|
|
|
# Model Card for multi-modal-llama-tp1
|
|
|
A multimodal travel assistant based on Llama-3.2-11B-Vision-Instruct that accepts image and voice input and answers in Traditional Chinese, focused on tourist attractions in Taiwan.
|
|
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
This is the model card of a π€ transformers model that has been pushed on the Hub. This model card has been automatically generated. |
|
|
|
- **Developed by:** ASUS, NTHU, NTU |
|
- **Model type:** Multimodal (vision + speech) model based on Llama-3.2-11B-Vision-Instruct, extended with support for voice input.
|
- **Language(s) (NLP):** Multilingual; optimized for Traditional Chinese.
|
- **License:** MIT |
|
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct
|
|
|
## Uses |
|
|
|
This multimodal model is designed to enrich knowledge about tourist attractions in Taiwan and to engage travelers through interactive voice responses. Provide a picture of a Taiwanese landmark, together with a spoken question, to start a conversation.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
import torch
from transformers import pipeline
import librosa
from PIL import Image

model_path = "taipei-1-mllama-project-2024/multi-modal-llama-tp1"
pipe = pipeline(model=model_path, trust_remote_code=True, device_map='auto')

# Load the spoken question, resampled to the 16 kHz rate the model expects.
# Filename means "Please ask where the attraction in the picture is".
audio, sr = librosa.load("/path/to/θ«εεηδΈηζ―ι»ζ―εͺ裑.wav", sr=16000)
# Photo of a landmark to ask about (here, Taipei 101).
image = Image.open("/path/to/ε°εεε».jpg")

turns = [
    dict(
        role='system',
        content="You are a travel expert who can accurately analyze the attractions in the pictures. All conversations should be conducted in Traditional Chinese.",
    ),
    dict(
        role='user',
        # Placeholder tokens mark where the image and audio are injected.
        content='<|image|><|begin_of_audio|><|audio|><|end_of_audio|>',
    ),
]

y_pred = pipe({'audio': [audio], 'images': [image], 'turns': turns, 'sampling_rate': sr}, max_new_tokens=300)
print(y_pred)  # ιεΌ΅η
§ηδΈηζ―ι»ζ―ε°η£ηγε°εεε»γγ... ("The attraction in this photo is Taiwan's 'Taipei 101'. ...")
```
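
Because the pipeline receives the whole conversation as `turns`, you can keep the exchange going by appending the model's reply and a follow-up question. The sketch below is a minimal illustration, assuming the custom pipeline accepts prior `role='assistant'` entries and plain-text user turns; the follow-up question string is a hypothetical example, and you should check the repository's custom pipeline code for the exact conversation format.

```python
# A minimal multi-turn sketch (assumptions: the custom pipeline accepts
# prior assistant replies and plain-text user turns in `turns`).
turns.append(dict(role='assistant', content=y_pred))
turns.append(dict(
    role='user',
    content='ι£θ£ζδ»ιΊΌζ¨θ¦ηζ΄»εεΏοΌ',  # hypothetical follow-up: "What activities do you recommend there?"
))

# The image and audio from the first turn are passed again so the
# placeholder tokens earlier in the conversation still resolve.
y_pred2 = pipe({'audio': [audio], 'images': [image], 'turns': turns, 'sampling_rate': sr},
               max_new_tokens=300)
print(y_pred2)
```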
|
|
|
## Training Details |
|
|
|
### Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision --> |
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
[More Information Needed] |
|
|
|
### Results |
|
|
|
[More Information Needed] |
|
|
|
#### Summary |
|
|
|
## Technical Specifications
|
|
|
### Model Architecture and Objective |
|
|
|
[More Information Needed] |
|
|
|
### Compute Infrastructure |
|
|
|
[Taipei-1 supercomputer](https://en.wikipedia.org/wiki/Taipei-1_(supercomputer))
|
|
|
#### Hardware |
|
|
|
16 nodes × 8 NVIDIA H100 GPUs per node (128 GPUs in total)
|
|