---
inference: false
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
---

# llava-v1.5-llama-3-8b-pretrain Model Card

This is a pretrained checkpoint containing the MLP connector weights from LLaVA stage 1; you can use it to instruction-tune your own multimodal models.

Please follow my reproduced implementation [LLaVA-Llama-3](https://github.com/Victorwz/LLaVA-Llama-3/) for more details on fine-tuning a LLaVA model with Llama-3 as the foundation LLM.
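
For instruction tuning, you typically load these connector weights into your model before stage 2. Below is a minimal sketch, assuming this checkpoint is hosted on the Hugging Face Hub under the repo id `weizhiwang/llava-v1.5-llama-3-8b-pretrain` and uses the conventional LLaVA stage-1 file name `mm_projector.bin` (both are assumptions, not confirmed by this card):

```python
import torch
from huggingface_hub import hf_hub_download

# Assumed repo id and file name; adjust them to the actual checkpoint layout.
ckpt_path = hf_hub_download(
    repo_id="weizhiwang/llava-v1.5-llama-3-8b-pretrain",
    filename="mm_projector.bin",
)
state_dict = torch.load(ckpt_path, map_location="cpu")

# The connector maps frozen CLIP-ViT features into the Llama-3 embedding
# space; printing the shapes is a quick sanity check before fine-tuning.
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```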

## Training dataset

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.

## Architecture

- LLM: llama-3-8b (Frozen)
- Vision-Language Adapter: MLP (see the sketch after this list)
- Vision Encoder: CLIP-ViT-L-336px (Frozen)
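
The card does not spell out the adapter dimensions. Here is a minimal sketch of the standard LLaVA-1.5 `mlp2x_gelu` connector, assuming CLIP-ViT-L's 1024-d patch features and Llama-3-8b's 4096-d hidden size (assumptions from the common LLaVA-1.5 recipe, not stated in this card):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer GELU MLP in the LLaVA-1.5 style; dims are assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the frozen
        # CLIP encoder; the output lives in the frozen LLM's embedding space.
        return self.proj(image_features)

# Example: project a dummy batch of 576 CLIP patch tokens (24x24 at 336px).
projector = MLPProjector()
dummy = torch.randn(1, 576, 1024)
print(projector(dummy).shape)  # torch.Size([1, 576, 4096])
```

In stage 1 only this connector is trained while the vision encoder and the LLM both stay frozen, which is why the checkpoint is small enough to drop directly into a later instruction-tuning run.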