|
--- |
|
library_name: transformers |
|
license: mit |
|
base_model: |
|
- meta-llama/Llama-3.2-11B-Vision-Instruct |
|
--- |
|
|
|
# Model Card for multi-modal-llama-tp1
|
|
|
A multimodal travel assistant based on Llama-3.2-11B-Vision-Instruct that accepts image and voice input and answers in Traditional Chinese, focused on tourist attractions in Taiwan.
|
|
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
This is the model card of a π€ transformers model that has been pushed on the Hub. This model card has been automatically generated. |
|
|
|
- **Developed by:** ASUS, NTHU, NTU |
|
- **Model type:** Multimodal (vision + speech) model based on Llama-3.2-11B-Vision-Instruct, extended with support for voice input.
|
- **Language(s) (NLP):** Multilingual; optimized for Traditional Chinese.
|
- **License:** MIT |
|
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct
|
|
|
## Uses |
|
|
|
This multimodal model is designed to enrich knowledge about tourist attractions in Taiwan and to engage travelers through interactive voice responses. Provide a picture of a Taiwanese landmark, together with a spoken question, to start a conversation.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
import torch
from transformers import pipeline
import librosa
from PIL import Image

model_path = "taipei-1-mllama-project-2024/multi-modal-llama-tp1"
pipe = pipeline(model=model_path, trust_remote_code=True, device_map='auto')

# Load the spoken question, resampled to the 16 kHz rate the model expects.
# Filename means "Please ask where the attraction in the picture is".
audio, sr = librosa.load("/path/to/θ«εεηδΈηζ―ι»ζ―εͺ裑.wav", sr=16000)
# Photo of a landmark to ask about (here, Taipei 101).
image = Image.open("/path/to/ε°εεε».jpg")

turns = [
    dict(
        role='system',
        content="You are a travel expert who can accurately analyze the attractions in the pictures. All conversations should be conducted in Traditional Chinese.",
    ),
    dict(
        role='user',
        # Placeholder tokens mark where the image and audio are injected.
        content='<|image|><|begin_of_audio|><|audio|><|end_of_audio|>',
    ),
]

y_pred = pipe({'audio': [audio], 'images': [image], 'turns': turns, 'sampling_rate': sr}, max_new_tokens=300)
print(y_pred)  # ιεΌ΅η
§ηδΈηζ―ι»ζ―ε°η£ηγε°εεε»γγ... ("The attraction in this photo is Taiwan's 'Taipei 101'. ...")
```
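
Because the pipeline receives the whole conversation as `turns`, you can keep the exchange going by appending the model's reply and a follow-up question. The sketch below is a minimal illustration, assuming the custom pipeline accepts prior `role='assistant'` entries and plain-text user turns; the follow-up question string is a hypothetical example, and you should check the repository's custom pipeline code for the exact conversation format.

```python
# A minimal multi-turn sketch (assumptions: the custom pipeline accepts
# prior assistant replies and plain-text user turns in `turns`).
turns.append(dict(role='assistant', content=y_pred))
turns.append(dict(
    role='user',
    content='ι£θ£ζδ»ιΊΌζ¨θ¦ηζ΄»εεΏοΌ',  # hypothetical follow-up: "What activities do you recommend there?"
))

# The image and audio from the first turn are passed again so the
# placeholder tokens earlier in the conversation still resolve.
y_pred2 = pipe({'audio': [audio], 'images': [image], 'turns': turns, 'sampling_rate': sr},
               max_new_tokens=300)
print(y_pred2)
```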
|
|
|
## Training Details |
|
|
|
### Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision --> |
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
[More Information Needed] |
|
|
|
### Results |
|
|
|
[More Information Needed] |
|
|
|
#### Summary |
|
|
|
## Technical Specifications
|
|
|
### Model Architecture and Objective |
|
|
|
[More Information Needed] |
|
|
|
### Compute Infrastructure |
|
|
|
[Taipei-1 supercomputer](https://en.wikipedia.org/wiki/Taipei-1_(supercomputer))
|
|
|
#### Hardware |
|
|
|
16 nodes × 8 NVIDIA H100 GPUs per node (128 GPUs in total)
|
|