Umitcan Sahin PRO

ucsahin

AI & ML interests

Visual Language Models, Large Language Models, Vision Transformers

Recent Activity

reacted to singhsidhukuldeep's post with 🔥 1 day ago

Exciting News in AI: JinaAI Releases JINA-CLIP-v2!

The team at Jina AI has just released a groundbreaking multilingual multimodal embedding model that's pushing the boundaries of text-image understanding. Here's why this is a big deal:

🚀 Technical Highlights:
- Dual encoder architecture combining a 561M parameter Jina XLM-RoBERTa text encoder and a 304M parameter EVA02-L14 vision encoder
- Supports 89 languages with 8,192 token context length
- Processes images up to 512×512 pixels with 14×14 patch size
- Implements FlashAttention2 for text and xFormers for vision processing
- Uses Matryoshka Representation Learning for efficient vector storage

⚡️ Under The Hood:
- Multi-stage training process with progressive resolution scaling (224→384→512)
- Contrastive learning using InfoNCE loss in both directions
- Trained on massive multilingual dataset including 400M English and 400M multilingual image-caption pairs
- Incorporates specialized datasets for document understanding, scientific graphs, and infographics
- Uses hard negative mining with 7 negatives per positive sample

📊 Performance:
- Outperforms previous models on visual document retrieval (52.65% nDCG@5)
- Achieves 89.73% image-to-text and 79.09% text-to-image retrieval on CLIP benchmark
- Strong multilingual performance across 30 languages
- Maintains performance even with 75% dimension reduction (256D vs 1024D)

🎯 Key Innovation: The model solves the long-standing challenge of unifying text-only and multi-modal retrieval systems while adding robust multilingual support. Perfect for building cross-lingual visual search systems!

Kudos to the research team at Jina AI for this impressive advancement in multimodal AI!
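The post's point about keeping performance at 256 dimensions instead of 1024 refers to truncating Matryoshka-style embeddings. The sketch below shows only the mechanics of that truncation, under loud assumptions: the vectors are random stand-ins rather than real JINA-CLIP-v2 outputs, the helper names `truncate_and_normalize` and `cosine_scores` are hypothetical, and nothing here calls the model's actual API. Retrieval quality survives truncation only for models trained with Matryoshka Representation Learning; random vectors illustrate the shapes and storage saving, not the accuracy.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row, then rescale rows to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return truncated / np.clip(norms, 1e-12, None)


def cosine_scores(queries: np.ndarray, documents: np.ndarray) -> np.ndarray:
    """Cosine similarity matrix for rows that are already unit-normalized."""
    return queries @ documents.T


# Random stand-ins for text and image embeddings (assumption: not real model outputs).
rng = np.random.default_rng(0)
full_dim, short_dim = 1024, 256
text_emb = rng.normal(size=(4, full_dim))
image_emb = rng.normal(size=(6, full_dim))

# Truncate both modalities to the same prefix length before comparing them.
text_short = truncate_and_normalize(text_emb, short_dim)
image_short = truncate_and_normalize(image_emb, short_dim)

scores = cosine_scores(text_short, image_short)  # one row per text, one column per image
reduction = 1 - short_dim / full_dim             # fraction of stored dimensions saved

print(scores.shape)
print(f"{reduction:.0%} fewer dimensions stored per vector")
```

Renormalizing after truncation matters because a prefix of a unit vector is generally no longer unit length, so dot products would otherwise stop being cosine similarities.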

upvoted a collection 16 days ago

DataGemma Release

new activity 24 days ago

ucsahin/TR-VLM-DPO-Dataset:[bot] Conversion to Parquet

Organizations

None yet

ucsahin's activity

upvoted a collection 16 days ago

DataGemma Release

A series of pioneering open models that help ground LLMs in real-world data through Data Commons. • 2 items • Updated 12 days ago • 81

upvoted a collection 24 days ago

Turkish Instruction Datasets

Collection of instruction datasets for Turkish. • 37 items • Updated 24 days ago • 2

upvoted 2 collections about 1 month ago

SigLIP

Contrastive (sigmoid) image-text models from https://arxiv.org/abs/2303.15343 • 10 items • Updated 12 days ago • 47

Nov 15 Releases 🍂

15 items • Updated Nov 15 • 6

upvoted a collection 3 months ago

Turkish Vision-Language Datasets

Collection of Turkish vision-language datasets. • 20 items • Updated 24 days ago • 4

upvoted 3 papers 4 months ago

LLaVA-OneVision: Easy Visual Task Transfer

Paper • 2408.03326 • Published Aug 6 • 59

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Paper • 2408.02718 • Published Aug 5 • 60

VITA: Towards Open-Source Interactive Omni Multimodal LLM

Paper • 2408.05211 • Published Aug 9 • 47

upvoted 2 papers 5 months ago

SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1 • 109

Gemma 2: Improving Open Language Models at a Practical Size

Paper • 2408.00118 • Published Jul 31 • 75

upvoted a collection 5 months ago

Vision Language Leaderboards

This collection has all the vision language leaderboards. • 7 items • Updated Aug 24 • 13

upvoted 2 articles 5 months ago

Google releases Gemma 2 2B, ShieldGemma and Gemma Scope

Article • Jul 31 • 59

The Rise of Agentic Data Generation

Article • Jul 15 • 78

upvoted 2 papers 5 months ago

EVLM: An Efficient Vision-Language Model for Visual Understanding

Paper • 2407.14177 • Published Jul 19 • 42

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Paper • 2407.07053 • Published Jul 9 • 42

upvoted a collection 5 months ago

🪐 SmolLM

A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos • 12 items • Updated 3 days ago • 204

upvoted 2 articles 5 months ago

TGI Multi-LoRA: Deploy Once, Serve 30 Models

Article • Jul 18 • 53

Docmatix - a huge dataset for Document Visual Question Answering

Article • Jul 18 • 71

upvoted a paper 5 months ago

Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing

Paper • 2407.08770 • Published Jul 11 • 19

upvoted a paper 6 months ago

AgentInstruct: Toward Generative Teaching with Agentic Flows

Paper • 2407.03502 • Published Jul 3 • 49