XuyaoWang committed
Commit a49a84a · verified · 1 Parent(s): 5c1b049

Update README.md

Files changed (1):
  1. README.md +69 -3
README.md CHANGED
@@ -1,3 +1,69 @@
- ---
- license: llama3.1
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ base_model:
+ - meta-llama/Meta-Llama-3.1-8B
+ ---
+ # 🦙 Llama3.1-8b-vision-audio Model Card
+
+ ## Model Details
+
+ This repository contains a version of the [LLaVA](https://github.com/haotian-liu/LLaVA) model that supports image and audio input, built on the [Llama 3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) foundation model with the [PKU-Alignment/align-anything](https://github.com/PKU-Alignment/align-anything) library.
+
+ - **Developed by:** the [PKU-Alignment](https://github.com/PKU-Alignment) Team.
+ - **Model Type:** An auto-regressive language model based on the transformer architecture.
+ - **License:** Non-commercial license.
+ - **Fine-tuned from model:** [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B).
+
+ ## Model Sources
+
+ - **Repository:** <https://github.com/PKU-Alignment/align-anything>
+ - **Datasets:**
+   - <https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K>
+   - <https://huggingface.co/datasets/PKU-Alignment/Align-Anything-Instruction-100K>
+   - <https://huggingface.co/datasets/cvssp/WavCaps>
+
+ ## How to Use the Model (Reproduction)
+
+ - Using align-anything:
+
+ ```python
+ from align_anything.models.llama_vision_audio_model import (
+     LlamaVisionAudioForConditionalGeneration,
+     LlamaVisionAudioProcessor,
+ )
+ import torch
+ import torchaudio
+ from PIL import Image
+
+ path = "<path_to_model_dir>"  # replace with the local model checkpoint directory
+ processor = LlamaVisionAudioProcessor.from_pretrained(path)
+ model = LlamaVisionAudioForConditionalGeneration.from_pretrained(path)
+
+ # 1. Text-only generation
+ prompt = "<|start_header_id|>user<|end_header_id|>: Where is the capital of China?\n<|start_header_id|>assistant<|end_header_id|>: "
+
+ inputs = processor(text=prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=1024)
+ print(processor.decode(outputs[0], skip_special_tokens=True))
+
+ # 2. Audio understanding: load the waveform, downmix stereo to mono, and pass it as raw speech
+ prompt = "<|start_header_id|>user<|end_header_id|>: Summarize the audio's contents.<audio>\n<|start_header_id|>assistant<|end_header_id|>: "
+
+ audio_path = "align-anything/assets/test_audio.wav"
+ audio, _ = torchaudio.load(audio_path)
+ if audio.shape[0] == 2:
+     audio = audio.mean(dim=0, keepdim=True)
+ audio = audio.squeeze().tolist()
+
+ inputs = processor(text=prompt, raw_speech=audio, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=1024)
+ print(processor.decode(outputs[0], skip_special_tokens=False))
+
+ # 3. Image understanding
+ prompt = "<|start_header_id|>user<|end_header_id|>: <image> Give an overview of what's in the image.\n<|start_header_id|>assistant<|end_header_id|>: "
+ image_path = "align-anything/assets/test_image.webp"
+ image = Image.open(image_path)
+
+ inputs = processor(text=prompt, images=image, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=1024)
+ print(processor.decode(outputs[0], skip_special_tokens=True))
+ ```
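
The snippet in the card runs everything on CPU in full precision. Below is a minimal sketch, not taken from the commit or the align-anything repository, of running the same text-only prompt on a GPU in half precision. It assumes `model` and `processor` have already been loaded as shown above and that the processor returns a dict of PyTorch tensors; the device/dtype handling is an illustrative assumption, not documented behavior.

```python
# Hypothetical GPU/half-precision variant of the text-only example above.
# Assumes `model` and `processor` were created with from_pretrained() as in the card.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = model.to(device=device, dtype=dtype)
model.eval()

prompt = (
    "<|start_header_id|>user<|end_header_id|>: Where is the capital of China?\n"
    "<|start_header_id|>assistant<|end_header_id|>: "
)
inputs = processor(text=prompt, return_tensors="pt")

# Move every tensor to the target device; cast only floating-point tensors
# (e.g. image or audio features) to the model dtype to avoid dtype mismatches.
inputs = {
    k: v.to(device=device, dtype=dtype) if torch.is_floating_point(v) else v.to(device)
    for k, v in inputs.items()
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

Keeping integer tensors such as input IDs and attention masks in their original dtype while casting only floating-point features mirrors how mixed-modality inputs are typically handled in Transformers-style pipelines.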