Algea-VE: A Tiny Multimodal Language Model with Only 0.8B Parameters

Algea-VE uses algea-550M-base as its base language model and CLIP ViT-L/14-336 as its visual encoder. It is trained on the LAION-CC-SBU dataset and then fine-tuned on llava_v1_5_mix665k. The model is small enough that fine-tuning requires only 32GB of VRAM and inference only 3GB.
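As a rough sanity check on these numbers, the sketch below estimates the memory taken by the weights alone and the number of image patches the visual encoder produces. It assumes the 862M parameter count and BF16 tensor type listed for the checkpoint, and reads "ViT-L/14-336" as a 336x336 input split into 14x14 patches. The function names are illustrative and not part of any library. Activations, the KV cache, and framework overhead are not counted, so the result is a lower bound, not a measurement.

```python
# Back-of-the-envelope estimates; illustrative helpers, not a library API.

BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each value in 16 bits


def weight_memory_gib(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / 1024**3


def vit_patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


params = 862_000_000  # assumption: parameter count listed for the checkpoint
print(f"weights in BF16: {weight_memory_gib(params):.2f} GiB")
print(f"patches per image: {vit_patch_count(336, 14)}")
```

The weights come to roughly 1.6 GiB, which leaves headroom within the stated 3GB inference budget for activations and the image patches fed to the language model.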

Due to insufficient training of the base model, the current model has some issues with hallucinations and repetition. To address this, I am training a new model that will maintain the same size but offer better performance.

This model is built on the llavaphi project. To use the model, please click here.

Model size: 862M params (Safetensors, tensor type BF16)