ytaek-oh's Collections
VLM Papers
- CompCap: Improving Multimodal Large Language Models with Composite Captions (arXiv:2412.05243)
- GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis (arXiv:2412.06089)
- SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation (arXiv:2412.05818)
- FLAIR: VLM with Fine-grained Language-informed Image Representations (arXiv:2412.03561)
- Active Data Curation Effectively Distills Large-Scale Multimodal Models (arXiv:2411.18674)
- COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training (arXiv:2412.01814)
- CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions (arXiv:2411.16828)
- FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training (arXiv:2411.11927)
- LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations (arXiv:2412.08580)
- V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding (arXiv:2412.09616)
- InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283)
- jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images (arXiv:2412.08802)
- ColPali: Efficient Document Retrieval with Vision Language Models (arXiv:2407.01449)
- HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models (arXiv:2412.20622)