---
license: other
pipeline_tag: text-to-audio
library_name: transformers
---

# Spirit LM Inference Gradio Demo

Clone the GitHub repo, build the [spiritlm](https://github.com/facebookresearch/spiritlm) Python package, and put the models in the `checkpoints` folder before running the script. I suggest using a conda environment for this.
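
A minimal sketch of that setup, assuming a standard editable install works for the repo (the environment name and Python version are my own choices, not requirements):

```
conda create -n spiritlm python=3.10
conda activate spiritlm
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e .
mkdir -p checkpoints  # then place the model files here (see the structure section below)
```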

You need around 15.5 GB of VRAM to run the model with a 200-token output length, and around 19 GB to output 800 tokens.
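
If you're not sure how much VRAM your GPU has, a quick check with PyTorch (assuming a CUDA build of torch):

```python
import torch

# Report the name and total memory of GPU 0 in GiB
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")
```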

If you're concerned about pickles from an unknown uploader, grab them from the repo maintained by Hugging Face staff: [https://huggingface.co/spirit-lm/Meta-spirit-lm](https://huggingface.co/spirit-lm/Meta-spirit-lm)
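
For example, you can fetch them with `huggingface_hub` (a sketch; the repo may be gated, in which case you need to log in and accept the license first, and you may have to rearrange files to match the layout shown below):

```python
from huggingface_hub import snapshot_download

# Download the repo contents into the local checkpoints folder
snapshot_download(
    repo_id="spirit-lm/Meta-spirit-lm",
    local_dir="checkpoints",
)
```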

Audio-to-audio inference doesn't seem good at all. I may be tokenizing the audio incorrectly, or the model may just not work well with audio in, audio out.
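
One thing worth ruling out is a format mismatch: the speech tokenizer is HuBERT-based and works on 16 kHz mono audio (the demo also saves its output at 16 kHz), so it's worth inspecting and, if needed, converting your clip before uploading it. A torchaudio sketch, where `input.wav` is a placeholder:

```python
import torchaudio

# Inspect the clip you plan to upload
info = torchaudio.info("input.wav")
print(info.sample_rate, info.num_channels)

# Convert to 16 kHz mono if it isn't already
waveform, sample_rate = torchaudio.load("input.wav")
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
torchaudio.save("input_16k.wav", waveform, 16000)
```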

The script here works with just a single speaker; if you know how to get the other speakers working, let me know and I'll update it.
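
My assumption (unverified) is that the Speaker ID in the UI indexes into the `speakers.txt` file shipped with the HiFi-GAN vocoder checkpoints, so listing it may tell you what the four IDs map to:

```python
# List the vocoder's known speakers (path taken from the checkpoints layout below)
with open("checkpoints/speech_tokenizer/hifigan_spiritlm_base/speakers.txt") as f:
    for idx, name in enumerate(f):
        print(idx, name.strip())
```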

To install the requirements for the sample Gradio demo provided, run:

```
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
# tempfile is part of the Python standard library and does not need to be installed
pip install gradio transformers numpy
```

Remember that you also need to install the [spiritlm](https://github.com/facebookresearch/spiritlm) module (see the setup sketch above).

```python
import gradio as gr
import numpy as np
import tempfile
import torch
import torchaudio
from spiritlm.model.spiritlm_model import Spiritlm, OutputModality, GenerationInput, ContentType
from transformers import GenerationConfig

# The speech tokenizer and vocoder operate on 16 kHz mono audio
SAMPLE_RATE = 16000

# Initialize the Spirit LM model (base 7B variant)
spirit_lm = Spiritlm("spirit-lm-base-7b")


def generate_output(input_type, input_content_text, input_content_audio,
                    output_modality, temperature, top_p, max_new_tokens,
                    do_sample, speaker_id):
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
    )

    if input_type == "text":
        interleaved_inputs = [GenerationInput(content=input_content_text, content_type=ContentType.TEXT)]
    elif input_type == "audio":
        # Load the audio file, downmix to mono and resample to 16 kHz if needed
        waveform, sample_rate = torchaudio.load(input_content_audio)
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)
        if sample_rate != SAMPLE_RATE:
            waveform = torchaudio.functional.resample(waveform, sample_rate, SAMPLE_RATE)
        interleaved_inputs = [GenerationInput(content=waveform.squeeze(0), content_type=ContentType.SPEECH)]
    else:
        raise ValueError("Invalid input type")

    outputs = spirit_lm.generate(
        interleaved_inputs=interleaved_inputs,
        output_modality=OutputModality[output_modality.upper()],
        generation_config=generation_config,
        speaker_id=speaker_id,  # pass the selected speaker ID
    )

    text_output = ""
    audio_output = None

    for output in outputs:
        if output.content_type == ContentType.TEXT:
            text_output = output.content
        elif output.content_type == ContentType.SPEECH:
            if not isinstance(output.content, np.ndarray):
                raise TypeError("Expected output.content to be a NumPy array, but got {}".format(type(output.content)))

            # Debugging: print shape and dtype of the generated audio
            print("Audio data shape:", output.content.shape)
            print("Audio data dtype:", output.content.dtype)

            # torchaudio.save expects a (channels, samples) tensor
            if output.content.ndim == 1:
                audio_data = torch.from_numpy(output.content).unsqueeze(0)  # mono
            else:
                audio_data = torch.from_numpy(output.content)

            # Save the audio to a temporary file for Gradio to serve
            with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio_file:
                torchaudio.save(temp_audio_file.name, audio_data, SAMPLE_RATE)
                audio_output = temp_audio_file.name

    return text_output, audio_output


# Define the Gradio interface
iface = gr.Interface(
    fn=generate_output,
    inputs=[
        gr.Radio(["text", "audio"], label="Input Type", value="text"),
        gr.Textbox(label="Input Content (Text)"),
        gr.Audio(label="Input Content (Audio)", type="filepath"),
        gr.Radio(["TEXT", "SPEECH", "ARBITRARY"], label="Output Modality", value="SPEECH"),
        gr.Slider(0, 1, step=0.1, value=0.9, label="Temperature"),
        gr.Slider(0, 1, step=0.05, value=0.95, label="Top P"),
        gr.Slider(1, 800, step=1, value=500, label="Max New Tokens"),
        gr.Checkbox(value=True, label="Do Sample"),
        gr.Dropdown(choices=[0, 1, 2, 3], value=0, label="Speaker ID"),
    ],
    outputs=[gr.Textbox(label="Generated Text"), gr.Audio(label="Generated Audio")],
    title="Spirit LM WebUI Demo",
    description="Demo for generating text or audio using the Spirit LM model.",
    flagging_mode="never",
)

# Launch the interface
iface.launch()
```
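
Save the script as, say, `app.py`, run it with `python app.py`, and open the local URL that Gradio prints.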

# Spirit LM Checkpoints

## Download Checkpoints

The checkpoints are in this repo.

Please note that Spirit LM is made available under the **FAIR Noncommercial Research License**.

The license is available here: https://github.com/facebookresearch/spiritlm/blob/main/LICENSE

## Structure

The checkpoints directory should look like this:

```
checkpoints/
├── README.md
├── speech_tokenizer
│   ├── hifigan_spiritlm_base
│   │   ├── config.json
│   │   ├── generator.pt
│   │   ├── speakers.txt
│   │   └── styles.txt
│   ├── hifigan_spiritlm_expressive_w2v2
│   │   ├── config.json
│   │   ├── generator.pt
│   │   └── speakers.txt
│   ├── hubert_25hz
│   │   ├── L11_quantizer_500.pt
│   │   └── mhubert_base_25hz.pt
│   ├── style_encoder_w2v2
│   │   ├── config.json
│   │   └── pytorch_model.bin
│   └── vqvae_f0_quantizer
│       ├── config.yaml
│       └── model.pt
└── spiritlm_model
    ├── spirit-lm-base-7b
    │   ├── config.json
    │   ├── generation_config.json
    │   ├── pytorch_model.bin
    │   ├── special_tokens_map.json
    │   ├── tokenizer_config.json
    │   └── tokenizer.model
    └── spirit-lm-expressive-7b
        ├── config.json
        ├── generation_config.json
        ├── pytorch_model.bin
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── tokenizer.model
```
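
A quick, optional sanity check that the layout is in place before launching the demo (the two paths here are just spot checks picked from the tree above):

```python
from pathlib import Path

# Spot-check a couple of required files from the layout above
required = [
    "checkpoints/speech_tokenizer/hubert_25hz/mhubert_base_25hz.pt",
    "checkpoints/spiritlm_model/spirit-lm-base-7b/pytorch_model.bin",
]
for p in required:
    print(p, "OK" if Path(p).exists() else "MISSING")
```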