
"Working on the future!"

- Leroy Dyer (1972-Present)

"To grow as a professional, set goals just beyond your current abilities. Achieving these milestones will not only overcome obstacles but also strengthen your skillset. If your tasks are too easy, you’ll never challenge yourself or improve, and life will pass you by!"

SpydazWeb_AI_Text_AudioVision_Project: a multi-purpose model!

Creating a model takes a lot of time, and deciding which data to focus training on and which methodologies to deploy is a task in its own right. These trainings and fine-tunings can be stacked on top of each other, in some cases spoiling the output and in others enhancing it. But in truth the aim is not to be agentic at all! We need a single model which can perform ANY task, such that any type of task, even an unseen and untrained one, can be performed at will. We have attempted to handle this with agentic workflows such as graphs, agents, tools and functions, but in truth these are just shortcuts. They also remove capabilities from the model and put programming or external data in their place, essentially rewriting the output of the model or carving the model to give a specific censored answer. In fact censoring is not only about rudeness or lewd talk; it is about restricting the usage of the model, as some arenas are not ready to be gazumped by AI, especially those in well-paid positions.

We have found that agent gen, CrewAI and even Aider allow for some organization and role play, which enables models to act like humans (satisfying a part of the original AI goals). The idea of agentic workflows and agentic connectivity as intelligent agents is old news, and those old powers are also trying to push their agendas (much like the ancient racists who have attempted to dominate history and other world content).

Today we are in a new generation of thought and action, and the goals of the past no longer apply. But we are also in a Python world, an untrained and uneducated arena, so it seems as if backwards is the new forwards.

There has been a flurry of multimodal models. They have slowly caught up with us, but again they have basically only copied other models and not produced unique models! They used LLaVA, as well as CLIP models etc., to make these multimodal vision models!

In fact we should examine these models carefully, as they are all replicas! Are they flooding us with bad, unusable models? (Some of them are way oversized; they love to share what cannot be used!)

But in truth we need to realise that the 7B model has not been fully explored, and we still do not have a metric relating increased parameters to performance: it is only assumed, along with some FIXING of results by training your model on the multiple-choice datasets! But there is no innovation, despite the unstable diffusions. So we need to create a new method! Hence text vision!

Text Vision:

Text vision is a methodology developed for entering an image into your chat with a model:

In this method we first convert the image to a text representation. This is Base64, a common method in Python and easy to replicate:

import gradio as gr
import base64
from PIL import Image
import io

def _encode_image_to_base64(image_path):
    """Encodes an image to a Base64 string."""
    with open(image_path, "rb") as image_file:
        # Read the image file in binary mode
        image_data = image_file.read()
        # Encode the image data to Base64
        base64_encoded = base64.b64encode(image_data).decode('utf-8')
    return base64_encoded

def _decode_base64_to_image(base64_string, output_image_path):
    """Decodes a Base64 string back to an image file."""
    # Decode the Base64 string
    image_data = base64.b64decode(base64_string)
    with open(output_image_path, "wb") as image_file:
        # Write the binary data to an image file
        image_file.write(image_data)

def encode_image_to_base64(image):
    """Encodes an image to a Base64 string."""
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()
    return img_str

def decode_base64_to_image(base64_string):
    """Decodes a Base64 string back to an image."""
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    return image

# Gradio interface for encoding
def encode_interface(input_image):
    base64_string = encode_image_to_base64(input_image)
    return base64_string

# Gradio interface for decoding
def decode_interface(base64_string):
    try:
        decoded_image = decode_base64_to_image(base64_string)
        return decoded_image
    except Exception:
        # Invalid Base64 or not an image: show nothing instead of crashing
        return None

# Create Gradio Blocks
with gr.Blocks() as demo:
    gr.Markdown("# Image Encoder-Decoder")
    with gr.Tab("Encode Image to Base64"):
        with gr.Row():
            input_image = gr.Image(type="pil", label="Input Image")
            output_text = gr.Textbox(label="Base64 Output", lines=5)
        encode_button = gr.Button("Encode")
        encode_button.click(encode_interface, inputs=input_image, outputs=output_text)
    with gr.Tab("Decode Base64 to Image"):
        with gr.Row():
            input_text = gr.Textbox(label="Base64 Input", lines=5)
            output_image = gr.Image(type="pil", label="Decoded Image")
        decode_button = gr.Button("Decode")
        decode_button.click(decode_interface, inputs=input_text, outputs=output_image)

# Example usage, then launch the app
if __name__ == "__main__":
    # Encode image to Base64
    base64_string = _encode_image_to_base64("input_image.jpg")
    print("Encoded Base64 String:")
    print(base64_string)

    # Decode Base64 back to image
    _decode_base64_to_image(base64_string, "output_image.jpg")
    print("Image decoded and saved as output_image.jpg")

    # Launch the Gradio app (blocks until the server is stopped)
    demo.launch()
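Since the Base64 string ends up inside the prompt, it helps to know how long it will be. The sketch below uses only the standard library and made-up bytes as a stand-in for an image file (`fake_image_bytes` and `base64_length` are illustrative names, not part of the project). It checks that the round trip is lossless and that padded Base64 text is 4 characters for every 3 input bytes.

```python
import base64


def base64_length(num_bytes: int) -> int:
    """Length of the padded Base64 text produced for num_bytes of binary data."""
    return 4 * ((num_bytes + 2) // 3)


# Stand-in for the bytes of an image file (hypothetical data, not a real PNG)
fake_image_bytes = bytes(range(256)) * 4  # 1024 bytes

# Encode to text and decode back to bytes
encoded = base64.b64encode(fake_image_bytes).decode("utf-8")
decoded = base64.b64decode(encoded)

print("raw bytes:", len(fake_image_bytes))
print("base64 characters:", len(encoded))
print("round trip is lossless:", decoded == fake_image_bytes)
```

The text form is roughly a third larger than the file, so smaller images leave more room in the context window.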

To process the records of the datasets into Base64, I selected a few basic datasets from the hub which are just a simple description and an image (in this case PNG):

from datasets import load_dataset

# Define a function to process each batch of examples in the dataset
def process_images_func(examples):

    texts = examples["text"]
    images = examples["image"]  # Assuming the images are in PIL format

    # Convert each image to base64 (encode_image_to_base64 is defined above)
    base64_images = [encode_image_to_base64(image) for image in images]

    # Return the updated examples with base64-encoded images
    return {
        "text": texts,
        "image_base64": base64_images  # Adding the Base64 encoded image strings
    }

# Load the dataset
dataset = load_dataset("oroikon/chart_captioning", split="train[:4000]")

# Process the dataset by converting images to base64
processed_dataset = dataset.map(process_images_func, batched=True)

After pushing them all to the hub, they can now be loaded into the training script as usual:

A basic prompt :

- Generate an image based on this description 

- describe this image : ( base64 )

- generate a spectrographic image based on this description

- describe this sound in this spectrographic image : ( base64 )
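One way to turn a caption and its Base64 image into training records for both directions is sketched below. The instruction/input/output layout is borrowed from the Base64 subtask records further down; using it for the image prompts is an assumption, and `make_records` and the sample values are illustrative only.

```python
import json

# Prompt wording taken from the list above
DESCRIBE_PROMPT = "describe this image :"
GENERATE_PROMPT = "Generate an image based on this description"


def make_records(text: str, image_base64: str) -> list:
    """Build one record per direction from a caption and its Base64 image."""
    return [
        # image -> text: the model sees the Base64 string and writes the caption
        {"instruction": DESCRIBE_PROMPT, "input": image_base64, "output": text},
        # text -> image: the model sees the caption and writes the Base64 string
        {"instruction": GENERATE_PROMPT, "input": text, "output": image_base64},
    ]


# Placeholder values: the Base64 string is only the 8-byte PNG signature
records = make_records("A bar chart of monthly sales", "iVBORw0KGgo=")
for record in records:
    print(json.dumps(record))
```

The same function would serve the two spectrogram prompts by swapping the prompt strings.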

So perhaps my input formatting will be :

<Image> : (base64) </Image>
<Sound> : (base64) </Sound> 
<Text> : Prompt </Text> 
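A minimal sketch of a helper that assembles this layout is given below. The format is still only a proposal ("perhaps"), so `build_prompt` and its parameter names are hypothetical and the tag spelling simply mirrors the three lines above.

```python
from typing import Optional


def build_prompt(text: str,
                 image_base64: Optional[str] = None,
                 sound_base64: Optional[str] = None) -> str:
    """Wrap the optional Base64 payloads and the text prompt in their tags."""
    parts = []
    if image_base64 is not None:
        parts.append(f"<Image> : {image_base64} </Image>")
    if sound_base64 is not None:
        parts.append(f"<Sound> : {sound_base64} </Sound>")
    # The text prompt is always present and goes last
    parts.append(f"<Text> : {text} </Text>")
    return "\n".join(parts)


# Placeholder Base64 payload, for illustration only
print(build_prompt("describe this image :", image_base64="iVBORw0KGgo="))
```

Keeping the sound tag separate from the image tag tells the model that a spectrogram image stands for audio.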

Text Audio

Here we can also define a sound as an image, by converting it too into a text representation.

We first convert the audio into a spectrographic image,

i.e. an image of the sound (its frequency content over time), and then we can continue as with an image. But because it is a sound we should also let the model know we are giving it a sound, as we would like to generate these types of sounds later.

Hence these prompts:

- generate a spectrographic image based on this description

- describe this sound in this spectrographic image : ( base64 )


This gives us a way to create images from spectrograms as well as to encode and decode the audio. Whisper, for example, creates a mel spectrogram, which is the image used for its transformation, hence we can do the same, using those images as inputs and potential outputs:

import io
from typing import Sequence

import numpy as np
import torch
import torchaudio
import librosa
import librosa.display
import matplotlib.pyplot as plt
import pydub
import soundfile as sf
from PIL import Image
from scipy.io import wavfile

# Step 1: Encode Audio to Mel-Spectrogram
def encode_audio_to_mel_spectrogram(audio_file, n_mels=128):
    """
    Encode an audio file to a mel-spectrogram.

    Parameters:
    - audio_file: Path to the audio file.
    - n_mels: Number of mel bands (default: 128).

    Returns:
    - mel_spectrogram_db: Mel-spectrogram in dB scale.
    - sample_rate: Sample rate of the audio file.
    """
    y, sample_rate = librosa.load(audio_file, sr=None)  # Load audio at its native sample rate
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sample_rate, n_mels=n_mels)
    mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)  # Convert to dB
    return mel_spectrogram_db, sample_rate

# Improved Step 2: Save Mel-Spectrogram as Image
def save_mel_spectrogram_image(mel_spectrogram_db, sample_rate, output_image='mel_spectrogram.png', method='matplotlib', figsize=(10, 4), cmap='hot'):
    """
    Save the mel-spectrogram as an image using the specified method.

    Parameters:
    - mel_spectrogram_db: Mel-spectrogram in dB scale.
    - sample_rate: Sample rate of the audio file.
    - output_image: Path to save the image.
    - method: Method for saving ('matplotlib' or 'custom').
    - figsize: Size of the figure for matplotlib (default: (10, 4)).
    - cmap: Colormap for the spectrogram (default: 'hot').
    """
    if method == 'matplotlib':
        plt.figure(figsize=figsize)
        librosa.display.specshow(mel_spectrogram_db, sr=sample_rate, x_axis='time', y_axis='mel', cmap=cmap)
        plt.colorbar(format='%+2.0f dB')
        plt.tight_layout()
        plt.savefig(output_image)
        plt.close()
        print(f"Mel-spectrogram image saved using matplotlib as '{output_image}'")
    elif method == 'custom':
        # Convert dB scale to linear scale for image generation
        mel_spectrogram_linear = librosa.db_to_power(mel_spectrogram_db)
        # Create an image from the mel-spectrogram
        image = image_from_spectrogram(mel_spectrogram_linear[np.newaxis, ...])  # Add channel dimension
        # Save the image
        image.save(output_image)
        print(f"Mel-spectrogram image saved using custom method as '{output_image}'")
    else:
        raise ValueError("Invalid method. Choose 'matplotlib' or 'custom'.")

# Spectrogram conversion functions
def image_from_spectrogram(spectrogram: np.ndarray, power: float = 0.25) -> Image.Image:
    """
    Compute a spectrogram image from a spectrogram magnitude array.

    Args:
        spectrogram: (channels, frequency, time)
        power: A power curve to apply to the spectrogram to preserve contrast

    Returns:
        image: (frequency, time, channels)
    """
    # Rescale to 0-1
    max_value = np.max(spectrogram)
    data = spectrogram / max_value

    # Apply the power curve
    data = np.power(data, power)

    # Rescale to 0-255 and invert
    data = 255 - (data * 255).astype(np.uint8)

    # Convert to a PIL image
    if data.shape[0] == 1:
        image = Image.fromarray(data[0], mode="L").convert("RGB")
    elif data.shape[0] == 2:
        data = np.array([np.zeros_like(data[0]), data[0], data[1]]).transpose(1, 2, 0)
        image = Image.fromarray(data, mode="RGB")
    else:
        raise NotImplementedError(f"Unsupported number of channels: {data.shape[0]}")

    # Flip Y
    image = image.transpose(Image.FLIP_TOP_BOTTOM)
    return image

# Step 3: Extract Mel-Spectrogram from Image (Direct Pixel Manipulation)
def extract_mel_spectrogram_from_image(image_path):
    """
    Extract a mel-spectrogram from a saved image using pixel manipulation.

    Parameters:
    - image_path: Path to the spectrogram image file.

    Returns:
    - mel_spectrogram_db: The extracted mel-spectrogram in dB scale.
    """
    # NOTE: an image saved with the 'matplotlib' method also contains axes and a
    # colorbar, so this recovers only an approximation of the original array.
    img = Image.open(image_path).convert('L')  # Open image and convert to grayscale
    img_array = np.array(img)  # Convert to NumPy array
    mel_spectrogram_db = img_array / 255.0 * -80  # Scale to dB range
    return mel_spectrogram_db

# Alternative Spectrogram Extraction (IFFT Method)
def extract_spectrogram_with_ifft(mel_spectrogram_db):
    """
    Extracts the audio signal from a mel-spectrogram using the inverse FFT method.

    Parameters:
    - mel_spectrogram_db: The mel-spectrogram in dB scale.

    Returns:
    - audio: The reconstructed audio signal.
    """
    # Convert dB mel-spectrogram back to linear scale
    mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)

    # Inverse mel transformation to get the audio signal.
    # The phase is not stored in the spectrogram, so librosa estimates it;
    # sr is not passed here, so librosa's default of 22050 Hz is assumed.
    audio = librosa.feature.inverse.mel_to_audio(mel_spectrogram)
    return audio

# Step 4: Decode Mel-Spectrogram with Griffin-Lim
def decode_mel_spectrogram_to_audio(mel_spectrogram_db, sample_rate, output_audio='griffin_reconstructed_audio.wav'):
    """
    Decode a mel-spectrogram into audio using Griffin-Lim algorithm.

    Parameters:
    - mel_spectrogram_db: The mel-spectrogram in dB scale.
    - sample_rate: The sample rate for the audio file.
    - output_audio: Path to save the reconstructed audio file.
    """
    # Convert dB mel-spectrogram back to linear scale
    mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
    # librosa.griffinlim expects a linear-frequency STFT magnitude, so use
    # mel_to_audio, which inverts the mel filter bank and then runs Griffin-Lim
    audio = librosa.feature.inverse.mel_to_audio(mel_spectrogram, sr=sample_rate)
    # Save the generated audio
    sf.write(output_audio, audio, sample_rate)
    print(f"Griffin-Lim reconstructed audio saved as '{output_audio}'")
    return audio

# Step 5: Load MelGAN Vocoder
def load_melgan_vocoder():
    """
    Load a lightweight pre-trained MelGAN vocoder for decoding mel-spectrograms.
    Returns a torch MelGAN vocoder model.
    """
    # NOTE: torchaudio.models may not provide a MelGAN class in your installed
    # version; if so, substitute another pre-trained vocoder here.
    model = torchaudio.models.MelGAN()  # Load MelGAN model
    model.eval()  # Ensure the model is in evaluation mode
    return model

# Step 6: Decode Mel-Spectrogram with MelGAN
def decode_mel_spectrogram_with_melgan(mel_spectrogram_db, sample_rate, output_audio='melgan_reconstructed_audio.wav'):
    """
    Decode a mel-spectrogram into audio using MelGAN vocoder.

    Parameters:
    - mel_spectrogram_db: The mel-spectrogram in dB scale.
    - sample_rate: The sample rate for the audio file.
    - output_audio: Path to save the reconstructed audio file.

    Returns:
    - audio: The reconstructed audio signal.
    """
    # Convert dB mel-spectrogram back to linear scale
    mel_spectrogram = librosa.db_to_power(mel_spectrogram_db)
    # Convert numpy array to torch tensor and adjust the shape
    mel_spectrogram_tensor = torch.tensor(mel_spectrogram).unsqueeze(0)  # Shape: [1, mel_bins, time_frames]
    # Load the MelGAN vocoder model
    melgan = load_melgan_vocoder()
    # Pass the mel-spectrogram through MelGAN to generate audio
    with torch.no_grad():
        audio = melgan(mel_spectrogram_tensor).squeeze().numpy()  # Squeeze to remove batch dimension
    # Save the generated audio
    sf.write(output_audio, audio, sample_rate)
    print(f"MelGAN reconstructed audio saved as '{output_audio}'")
    return audio
# Audio segment helpers (pydub)
def audio_from_waveform(samples: np.ndarray, sample_rate: int, normalize: bool = False) -> pydub.AudioSegment:
    """
    Convert a numpy array of samples of a waveform to an audio segment.

    Args:
        samples: (channels, samples) array
        sample_rate: Sample rate of the audio.
        normalize: Flag to normalize volume.
    """
    # Normalize volume to fit in int16
    if normalize:
        samples *= np.iinfo(np.int16).max / np.max(np.abs(samples))

    # Transpose and convert to int16
    samples = samples.transpose(1, 0).astype(np.int16)

    # Write to the bytes of a WAV file
    wav_bytes = io.BytesIO()
    wavfile.write(wav_bytes, sample_rate, samples)
    wav_bytes.seek(0)  # Rewind so pydub reads from the start of the buffer

    # Read into pydub
    return pydub.AudioSegment.from_wav(wav_bytes)

def apply_filters(segment: pydub.AudioSegment, compression: bool = False) -> pydub.AudioSegment:
    """
    Apply post-processing filters to the audio segment to compress it and keep at a -10 dBFS level.

    Args:
        segment: The audio segment to filter.
        compression: Flag to apply dynamic range compression.
    """
    if compression:
        segment = pydub.effects.normalize(segment, headroom=0.1)
        segment = segment.apply_gain(-10 - segment.dBFS)
        # The original arguments were lost here; pydub's default settings are used
        segment = pydub.effects.compress_dynamic_range(segment)

    # Apply gain to desired dB level and normalize again
    desired_db = -12
    segment = segment.apply_gain(desired_db - segment.dBFS)
    return pydub.effects.normalize(segment, headroom=0.1)

def stitch_segments(segments: Sequence[pydub.AudioSegment], crossfade_s: float) -> pydub.AudioSegment:
    """
    Stitch together a sequence of audio segments with a crossfade between each segment.

    Args:
        segments: Sequence of audio segments to stitch.
        crossfade_s: Duration of crossfade in seconds.
    """
    crossfade_ms = int(crossfade_s * 1000)
    combined_segment = segments[0]
    for segment in segments[1:]:
        combined_segment = combined_segment.append(segment, crossfade=crossfade_ms)
    return combined_segment

def overlay_segments(segments: Sequence[pydub.AudioSegment]) -> pydub.AudioSegment:
    """
    Overlay a sequence of audio segments on top of each other.

    Args:
        segments: Sequence of audio segments to overlay.
    """
    assert len(segments) > 0
    output: pydub.AudioSegment = segments[0]
    for segment in segments[1:]:
        output = output.overlay(segment)
    return output

# Step 7: Full Pipeline for Audio Processing with Customization
def mel_spectrogram_pipeline(audio_file, output_image='mel_spectrogram.png',
                             output_audio_griffin='griffin_reconstructed_audio.wav',
                             output_audio_melgan='melgan_reconstructed_audio.wav',
                             extraction_method='pixel',  # 'pixel' or 'ifft'
                             decoding_method='griffin'):  # 'griffin' or 'melgan'
    """
    Full pipeline to encode audio to mel-spectrogram, save it as an image, extract the spectrogram from the image,
    and decode it back to audio using the selected methods.

    Parameters:
    - audio_file: Path to the audio file to be processed.
    - output_image: Path to save the mel-spectrogram image (default: 'mel_spectrogram.png').
    - output_audio_griffin: Path to save the Griffin-Lim reconstructed audio.
    - output_audio_melgan: Path to save the MelGAN reconstructed audio.
    - extraction_method: Method for extraction ('pixel' or 'ifft').
    - decoding_method: Method for decoding ('griffin' or 'melgan').
    """
    # Step 1: Encode (Audio -> Mel-Spectrogram)
    mel_spectrogram_db, sample_rate = encode_audio_to_mel_spectrogram(audio_file)
    # Step 2: Convert Mel-Spectrogram to Image and save it
    save_mel_spectrogram_image(mel_spectrogram_db, sample_rate, output_image)
    # Step 3: Extract Mel-Spectrogram from the image based on chosen method
    if extraction_method == 'pixel':
        extracted_mel_spectrogram_db = extract_mel_spectrogram_from_image(output_image)
    elif extraction_method == 'ifft':
        # extract_spectrogram_with_ifft returns a waveform, not a spectrogram,
        # so the audio is written directly and Step 4 is skipped
        audio = extract_spectrogram_with_ifft(mel_spectrogram_db)
        sf.write(output_audio_griffin, audio, sample_rate)
        return
    else:
        raise ValueError("Invalid extraction method. Choose 'pixel' or 'ifft'.")
    # Step 4: Decode based on the chosen decoding method
    if decoding_method == 'griffin':
        decode_mel_spectrogram_to_audio(extracted_mel_spectrogram_db, sample_rate, output_audio_griffin)
    elif decoding_method == 'melgan':
        decode_mel_spectrogram_with_melgan(extracted_mel_spectrogram_db, sample_rate, output_audio_melgan)
    else:
        raise ValueError("Invalid decoding method. Choose 'griffin' or 'melgan'.")

# Example usage
if __name__ == "__main__":
    audio_file_path = 'your_audio_file.wav'  # Specify the path to your audio file here
    mel_spectrogram_pipeline(
        audio_file_path,
        extraction_method='pixel',  # Choose 'pixel' or 'ifft'
        decoding_method='griffin'  # Choose 'griffin' or 'melgan'
    )
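The pixel extraction step above relies on a fixed mapping between dB values and 8-bit grey levels, and storing a spectrogram as an image rounds every value to one of 256 levels. The sketch below illustrates that round trip with plain Python. The -80 dB floor matches the scaling used above, but the direction chosen here (brighter means louder) and the names `db_to_pixel` and `pixel_to_db` are assumptions for illustration, not the project's code.

```python
MIN_DB = -80.0  # assumed floor of the dB range


def db_to_pixel(db: float, min_db: float = MIN_DB) -> int:
    """Map a dB value in [min_db, 0] to an 8-bit grey level (0 = quietest)."""
    clipped = max(min_db, min(0.0, db))
    return round((clipped - min_db) / -min_db * 255)


def pixel_to_db(pixel: int, min_db: float = MIN_DB) -> float:
    """Map an 8-bit grey level back to a dB value in [min_db, 0]."""
    return pixel / 255 * -min_db + min_db


# One grey level covers this many dB, so rounding costs at most half of it
step = -MIN_DB / 255

for db in (-80.0, -40.0, -12.3, 0.0):
    pixel = db_to_pixel(db)
    restored = pixel_to_db(pixel)
    print(f"{db:7.2f} dB -> pixel {pixel:3d} -> {restored:7.3f} dB")
```

Whatever mapping is used when the image is written, the same mapping has to be inverted when it is read back, otherwise loud and quiet regions swap places.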

Currently I did not convert any sounds to mel or spectrographic images myself.

I did locate some datasets which already have them, so I used them!

They seemed to work the same as the image files, and currently it is improving!


Here added functionality was trained into the model!

- Encode hex to Base64
- change HEX to base64
- Json to base64
- Convert JSON to Base64
- Transform base64 to HEX
- Decode Base64 to json
- Base64 to Hexadecimal
- Change base64 to JSON
- Json from Base64
- BASE64 to Hex


{"instruction": "Encode hex to Base64", "input": "ecfc2db9ba6049165b", "output": "7PwtubpgSRZb"}
{"instruction": "change HEX to base64", "input": "60926e782008", "output": "YJJueCAI"}
{"instruction": "Json to base64", "input": "[77,62,160,64,248,233,105,133,5,248,89,239]", "output": "TT6gQPjpaYUF+Fnv"}
{"instruction": "Change Json to BASE64", "input": "[10,59,42,251,112,1]", "output": "Cjsq+3AB"}
{"instruction": "Convert JSON to Base64", "input": "[236,201,129,100,238]", "output": "7MmBZO4="}
### Decode
{"instruction": "Transform base64 to HEX", "input": "464pNBlIObA=", "output": "e3ae2934194839b0"}
{"instruction": "Decode Base64 to json", "input": "NQ==", "output": "[53]"}
{"instruction": "Base64 to Hexadecimal", "input": "ax0WaQ==", "output": "6b1d1669"}
{"instruction": "convert base64 to Hexadecimal", "input": "8X43", "output": "f17e37"}
{"instruction": "Change base64 to JSON", "input": "7MmBZO4=", "output": "[236,201,129,100,238]"}
{"instruction": "Json from Base64", "input": "ytBBCmPRA6De+Ow=", "output": "[202,208,65,10,99,209,3,160,222,248,236]"}
{"instruction": "BASE64 to Hex", "input": "m/A=", "output": "9bf0"}
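Records like these can be generated with the standard library alone. The sketch below is one possible way to do it and is not the script used for the actual dataset; the function names and `make_record` are illustrative. The JSON form is taken to be an array of byte values without spaces, as in the samples above.

```python
import base64
import json


def hex_to_base64(hex_string: str) -> str:
    """Hex text -> raw bytes -> Base64 text."""
    return base64.b64encode(bytes.fromhex(hex_string)).decode("ascii")


def base64_to_hex(b64_string: str) -> str:
    """Base64 text -> raw bytes -> lowercase hex text."""
    return base64.b64decode(b64_string).hex()


def json_to_base64(json_string: str) -> str:
    """JSON array of byte values (0-255) -> Base64 text."""
    return base64.b64encode(bytes(json.loads(json_string))).decode("ascii")


def base64_to_json(b64_string: str) -> str:
    """Base64 text -> JSON array of byte values, written without spaces."""
    return json.dumps(list(base64.b64decode(b64_string)), separators=(",", ":"))


def make_record(instruction: str, input_value: str, convert) -> dict:
    """Build one instruction/input/output training record."""
    return {"instruction": instruction, "input": input_value, "output": convert(input_value)}


# Rebuild two of the sample records shown above
print(json.dumps(make_record("Encode hex to Base64", "ecfc2db9ba6049165b", hex_to_base64)))
print(json.dumps(make_record("Decode Base64 to json", "NQ==", base64_to_json)))
```

Feeding random byte strings through these functions would give as many records as needed, with the instruction wording varied as in the list above.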

These subtasks allow the model to hold Base64 as an object in the model.

The embeddings were also trained to embed these tasks,

enabling the model to learn how to manipulate the Base64 data for other comparative tasks!

There are many datasets which could be used to increase the usage of the Base64 code.

By training the embeddings for these subtasks, as well as pushing a large stack of the model parameters,

the model is forced to create custom tokens, as these tokens have not been seen before in training;

hence they need to find space in the embedding model.

So the captioning and generation can be trained at a lesser level, focusing on attention instead of deeper learning.

Model size
7.24B params

Datasets used to train LeroyDyer/SpydazWeb_AI_Text_AudioVision_Project