Update README.md
README.md CHANGED
@@ -5,11 +5,15 @@ tags: []
 
 # [E5-V: Universal Embeddings with Multimodal Large Language Models](https://arxiv.org/abs/2407.12580)
 
+E5-V is fine-tuned based on lmms-lab/llama3-llava-next-8b.
+
 ## Overview
 We propose a framework, called E5-V, to adapt MLLMs for achieving multimodal embeddings. E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We also propose a single-modality training approach for E5-V, where the model is trained exclusively on text pairs, demonstrating better performance than multimodal training.
 
 More details can be found at https://github.com/kongds/E5-V
 
+
+
 ## Example
 ``` python
 import torch