Update README.md (#3)

f5588fe verified about 1 month ago

6.19 kB

	---
	datasets:
	- homebrewltd/instruction-speech-whispervq-v2
	language:
	- en
	license: apache-2.0
	pipeline_tag: audio-text-to-text
	tags:
	- sound language model
	---

	## Model Details

	We have developed and released the family [Ichigo-llama3s](https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405). This family is natively understanding audio and text input.

	This model focused on fine-tuning the model to improve user interaction from [homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2), particularly in handling inaudible inputs and multi-turn conversations.

	Model developers Homebrew Research.

	Input Text and sound.

	Output Text.

	Model Architecture Llama-3.

	Language(s): English.

	## Intended Use

	Intended Use Cases This family is primarily intended for research applications. This version aims to further improve the LLM on sound understanding capabilities.

	Out-of-scope The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

	## How to Get Started with the Model

	Try this model using [Google Colab Notebook](https://colab.research.google.com/drive/18IiwN0AzBZaox5o0iidXqWD1xKq11XbZ?usp=sharing).

	First, we need to convert the audio file to sound tokens

	```python
	device = "cuda" if torch.cuda.is_available() else "cpu"
	if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
	hf_hub_download(
	repo_id="jan-hq/WhisperVQ",
	filename="whisper-vq-stoks-medium-en+pl-fixed.model",
	local_dir=".",
	)
	vq_model = RQBottleneckTransformer.load_model(
	"whisper-vq-stoks-medium-en+pl-fixed.model"
	).to(device)
	vq_model.ensure_whisper(device)
	def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):

	wav, sr = torchaudio.load(audio_path)
	if sr != 16000:
	wav = torchaudio.functional.resample(wav, sr, 16000)
	with torch.no_grad():
	codes = vq_model.encode_audio(wav.to(device))
	codes = codes[0].cpu().tolist()

	result = ''.join(f'<\|sound_{num:04d}\|>' for num in codes)
	return f'<\|sound_start\|>{result}<\|sound_end\|>'
	```

	Then, we can inference the model the same as any other LLM.

	```python
	def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
	tokenizer = AutoTokenizer.from_pretrained(model_path)

	model_kwargs = {"device_map": "auto"}

	if use_4bit:
	model_kwargs["quantization_config"] = BitsAndBytesConfig(
	load_in_4bit=True,
	bnb_4bit_compute_dtype=torch.bfloat16,
	bnb_4bit_use_double_quant=True,
	bnb_4bit_quant_type="nf4",
	)
	elif use_8bit:
	model_kwargs["quantization_config"] = BitsAndBytesConfig(
	load_in_8bit=True,
	bnb_8bit_compute_dtype=torch.bfloat16,
	bnb_8bit_use_double_quant=True,
	)
	else:
	model_kwargs["torch_dtype"] = torch.bfloat16

	model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

	return pipeline("text-generation", model=model, tokenizer=tokenizer)

	def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
	generation_args = {
	"max_new_tokens": max_new_tokens,
	"return_full_text": False,
	"temperature": temperature,
	"do_sample": do_sample,
	}

	output = pipe(messages, **generation_args)
	return output[0]['generated_text']

	# Usage
	llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
	pipe = setup_pipeline(llm_path, use_8bit=True)
	```

	## Training process
	Training Metrics Image: Below is a snapshot of the training loss curve visualized.

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/7TWPqLdDLDlfzeRXP9m36.png)

	[MMLU](https://huggingface.co/datasets/cais/mmlu):

	\| Model \| MMLU Score \|
	\| --- \| --- \|
	\| llama3.5-instruct-8b \| 69.40 \|
	\| ichigo-llama3.1-s-v0.3: phase 3 \| 63.79 \|
	\| ichigo-llama3.1-s-v0.3: phase 2 \| 63.08 \|
	\| ichigo-llama3.1-s-base-v0.3 \| 42.11 \|
	\| llama3.5-instruct-v0.2 \| 50.27 \|

	[AudioBench](https://arxiv.org/abs/2406.16020) Eval:

	\| Model Bench \| [Open-hermes Instruction Audio](https://huggingface.co/datasets/AudioLLMs/openhermes_instruction_test) (GPT-4-O judge 0:5) \| [Alpaca Instruction Audio](https://huggingface.co/datasets/AudioLLMs/alpaca_audio_test) (GPT-4-O judge 0:5) \|
	\| --- \| --- \| --- \|
	\| [Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2) \| 3.45 \| 3.53 \|
	\| [Ichigo-llama3.1-s v0.3-phase2 -cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2) \| 3.42 \| 3.62 \|
	\| [Ichigo-llama3.1-s v0.3-phase2-cplast](https://huggingface.co/jan-hq/llama3-s-instruct-v0.3-checkpoint-last) \| 3.31 \| 3.6 \|
	\| [Ichigo-llama3.1-s v0.3-phase3](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-3) \| 3.64 \| 3.68 \|
	\| [Qwen2-audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B) \| 2.63 \| 2.24 \|

	### Hardware

	GPU Configuration: Cluster of 8x NVIDIA H100-SXM-80GB.

	GPU Usage:
	- Continual Training: 3 hours.

	### Training Arguments

	We utilize [torchtune](https://github.com/pytorch/torchtune) library for the latest FSDP2 training code implementation.

	\| Parameter \| Continual Training \|
	\| --- \| --- \|
	\| Epoch \| 1 \|
	\| Global batch size \| 256 \|
	\| Learning Rate \| 1.5e-5 \|
	\| Learning Scheduler \| LambdaLR with warmup \|
	\| Optimizer \| [AdamW Fused](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) \|
	\| Warmup Steps \| 8 \|
	\| Weight Decay \| 0.005 \|
	\| Max length \| 4096 \|
	\| Precision \| bf16 \|


	## More detail

	Paper: http://arxiv.org/abs/2410.15316


	## Citation Information

	BibTeX:

	```
	@article{Llama3-S: Sound Instruction Language Model 2024,
	title={Llama3-S},
	author={Homebrew Research},
	year=2024,
	month=August},
	url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
	```

	## Acknowledgement

	- [WhisperSpeech](https://github.com/collabora/WhisperSpeech)

	- [Meta-Llama-3.1-8B-Instruct ](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)