Jay Broughton

Kerassy
·

AI & ML interests

NLP, dataset creation, small language and multi-modal models.

Recent Activity

updated a dataset about 16 hours ago
Kerassy/misc-art-and-descriptons
updated a model 1 day ago
Kerassy/Qwen2.5-3b-instruct-FineTome
published a model 1 day ago
Kerassy/Qwen2.5-3b-instruct-FineTome
View all activity

Organizations

None yet

Kerassy's activity

reacted to hexgrad's post with ❤️ 2 days ago
view post
Post
2261
IMHO, being able & willing to defeat CAPTCHA, hCaptcha, or any other reasoning puzzle is a must-have for any Web-Browsing / Computer-Using Agent (WB/CUA).

I realize it subverts the purpose of CAPTCHA, but I do not think you can claim to be building AGI/agents without smoothly passing humanity checks. It would be like getting in a self-driving car that requires human intervention over speed bumps. Claiming AGI or even "somewhat powerful AI" seems hollow if you are halted by a mere CAPTCHA.

I imagine OpenAI's Operator is *able* but *not willing* to defeat CAPTCHA. Like their non-profit status, I expect that policy to evolve over time—and if not, rival agent-builders will attack that opening to offer a better product.
  • 1 reply
·
reacted to andito's post with 🚀 3 days ago
view post
Post
1386
𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 𝘁𝗵𝗲 𝘄𝗼𝗿𝗹𝗱'𝘀 𝘀𝗺𝗮𝗹𝗹𝗲𝘀𝘁 𝘃𝗶𝘀𝗶𝗼𝗻 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝗼𝗱𝗲𝗹!

We’re thrilled to share 𝗦𝗺𝗼𝗹𝗩𝗟𝗠 (256M & 500M)—the smallest Visual Language Models ever built. Think: running on <1GB of GPU memory—you can fine-tune it on your laptop and run it on your toaster!

Why It’s Game-Changing:
- 𝗢𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝘀 𝗟𝗮𝗿𝗴𝗲𝗿 𝗠𝗼𝗱𝗲𝗹𝘀: Even the 256M model surpasses our SOTA 80B-parameter model from just 17 months ago. Over 300x reduction!
𝗠𝗶𝗴𝗵𝘁𝘆 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: The 256M version delivers 80% of our 2.2B model’s performance, and the 500M version hits 90%
𝗟𝗶𝗴𝗵𝘁𝗻𝗶𝗻𝗴-𝗙𝗮𝘀𝘁 𝗦𝗲𝗮𝗿𝗰𝗵: SmolVLM integrates with ColiPali for state-of-the-art retrieval speeds—on par with models 10x bigger. That means cheaper, faster indexing and real-world impact.

What’s New Under the Hood:
- 𝗡𝗲𝘄 𝗩𝗶𝘀𝗶𝗼𝗻 𝗘𝗻𝗰𝗼𝗱𝗲𝗿: Smaller overall size (400M -> 93M), but with higher resolution.
- 𝗛𝗶𝗴𝗵𝗲𝗿 𝗣𝗶𝘅𝗲𝗹𝘀/𝗧𝗼𝗸𝗲𝗻: 4096 vs. 1820—more efficient image processing.
- 𝗦𝗺𝗮𝗿𝘁 𝗧𝗼𝗸𝗲𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻: Faster training and a performance boost.

Check our blog: https://huggingface.co/blog/smolervlm
The models: HuggingFaceTB/smolvlm-256m-and-500m-6791fafc5bb0ab8acc960fb0
The demo: HuggingFaceTB/SmolVLM-256M-Demo
  • 1 reply
·