Pretrain stage only, 4630 epochs

Introduction

We use TinyLLaVA Factory to train a very small image-text-to-text model.

The goal is to make it possible to run LLaVA models on edge devices (with only a few gigabytes of memory).

For the LLM and the vision tower, we choose OpenELM-270M-Instruct and facebook/dinov2-small, respectively.

Category	# Samples	TP	FP	TN	FN	Accuracy	Precision	Recall	F1 Score	Yes Ratio
Adversarial	3000	1312	1250	250	188	0.521	0.512	0.875	0.646	0.854
Popular	3000	1312	1236	264	188	0.525	0.515	0.875	0.648	0.849
Random	2910	1312	1185	225	188	0.528	0.525	0.875	0.656	0.858
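The derived columns in the table above can be reproduced from the four confusion-matrix counts. The following is a minimal sketch using the standard definitions of accuracy, precision, recall and F1, and assuming "Yes Ratio" means the fraction of all samples answered "yes" (that is, (TP + FP) / total). The function name `binary_metrics` is a hypothetical helper, not part of TinyLLaVA Factory or of any evaluation script.

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive summary metrics from confusion-matrix counts (hypothetical helper)."""
    total = tp + fp + tn + fn
    predicted_yes = tp + fp  # every "yes" answer, correct or not
    precision = tp / predicted_yes
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumption: "Yes Ratio" = share of samples answered "yes".
        "yes_ratio": predicted_yes / total,
    }


# (TP, FP, TN, FN) copied from the table rows above.
rows = {
    "Adversarial": (1312, 1250, 250, 188),
    "Popular": (1312, 1236, 264, 188),
    "Random": (1312, 1185, 225, 188),
}

for name, counts in rows.items():
    metrics = binary_metrics(*counts)
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Read this way, the numbers show a model that answers "yes" for roughly 85% of the samples while only about half of them are actual positives, which is consistent with the high recall and the near-chance accuracy.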

Samples: 5000, Accuracy: 0% (:-|)

Samples 4241, Correct: -, Accuracy: -%, IMG-Accuracy: -%

Category	# Samples	Accuracy
Overall	900	0.280
Overall-Art and Design	120	0.208
Art	30	0.167
Art Theory	30	0.200
Design	30	0.367
Music	30	0.100
Overall-Business	150	0.213
Accounting	30	0.100
Economics	30	0.367
Finance	30	0.200
Management	30	0.233
Marketing	30	0.167
Overall-Science	150	0.300
Biology	30	0.300
Chemistry	30	0.133
Geography	30	0.300
Math	30	0.333
Physics	30	0.433
Overall-Health and Medicine	150	0.340
Basic Medical Science	30	0.300
Clinical Medicine	30	0.133
Diagnostics and Laboratory Med.	30	0.333
Pharmacy	30	0.400
Public Health	30	0.533
Overall-Humanities and Soc. Sci.	120	0.342
History	30	0.300
Literature	30	0.567
Sociology	30	0.233
Psychology	30	0.267
Overall-Tech and Engineering	210	0.276
Agriculture	30	0.300
Architecture and Engineering	30	0.200
Computer Science	30	0.367
Electronics	30	0.200
Energy and Power	30	0.367
Materials	30	0.233
Mechanical Engineering	30	0.267
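As a quick consistency check, the "Overall" accuracy in the table above should equal the sample-weighted mean of the six category-level rows. The sketch below assumes exactly that relationship. The dictionary simply restates the "Overall-..." rows, and the variable names are illustrative.

```python
# (number of samples, accuracy) for each "Overall-..." row in the table above.
categories = {
    "Art and Design": (120, 0.208),
    "Business": (150, 0.213),
    "Science": (150, 0.300),
    "Health and Medicine": (150, 0.340),
    "Humanities and Soc. Sci.": (120, 0.342),
    "Tech and Engineering": (210, 0.276),
}

total_samples = sum(n for n, _ in categories.values())

# Weight each category accuracy by its sample count, then normalize.
overall = sum(n * acc for n, acc in categories.values()) / total_samples

print(total_samples, round(overall, 3))
```

The result agrees with the reported overall row (900 samples, accuracy 0.280), so the category rows and the overall row are mutually consistent.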

sbrzz/TinyLLaVA-Qwen2.5-0.5B-Instruct-dinov2-small