Prithiv Sakthi's picture

Prithiv Sakthi

prithivMLmods

AI & ML interests

computer vision, multimodality, adapters @starngerzonehf @strangerguardhf

Recent Activity

Organizations

Stanford AI's profile picture DataScienceEngineering's profile picture AI FILMS's profile picture Samsung Electronics's profile picture MISATO-dataset's profile picture GEM benchmark's profile picture OpenGVLab's profile picture MusicAI's profile picture BigScience Biomedical Datasets's profile picture OpenVINO Toolkit's profile picture LLMs's profile picture ONNXConfig for all's profile picture Gradio-Themes-Party's profile picture scikit-learn's profile picture Open-Source AI Meetup's profile picture lora concepts library's profile picture Platzi Community's profile picture Kornia AI's profile picture Tune a video concepts library's profile picture Université Dauphine-PSL's profile picture Keras Dreambooth Event's profile picture Stable Diffusion Dreambooth Concepts Library's profile picture The Waifu Research Department's profile picture Musika's profile picture Blog-explorers's profile picture OpenSky's profile picture AI Tamil Nadu's profile picture OpenLLM France's profile picture huggingPartyParis's profile picture Team Tonic's profile picture That Time I got Reincarnated as a Hugging Face Organization's profile picture LocalLLaMA's profile picture Major TOM's profile picture MLX Community's profile picture C4AI Community's profile picture M4-ai's profile picture Chinese LLMs on Hugging Face's profile picture ONNX Community's profile picture Dataset Tools's profile picture Nerdy Face's profile picture Stranger Zone's profile picture open/ acc's profile picture Data Is Better Together Contributor's profile picture None yet's profile picture Doge Face's profile picture Stranger Guard's profile picture

prithivMLmods's activity

posted an update 1 day ago
view post
Post
1786
Gemma-3-4B : Image and Video Inference 🖼️🎥

🧤Space: prithivMLmods/Gemma-3-Multimodal

@gemma3-4b : {Tag + Space_+ 'prompt'}
@video-infer : {Tag + Space_+ 'prompt'}

+ Gemma3-4B : google/gemma-3-4b-it
+ By default, it runs : prithivMLmods/Qwen2-VL-OCR-2B-Instruct

Gemma 3 Technical Report : https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf

Additionally, I have also tested Aya-Vision 8B vs Custom Qwen2-VL-OCR for OCR with test case samples on messy handwriting for experimental purposes to optimize edge device VLMs for Optical Character Recognition.

📜Read the blog here: https://huggingface.co/blog/prithivMLmods/aya-vision-vs-qwen2vl-ocr-2b
  • 1 reply
·
posted an update 3 days ago
reacted to Smooke's post with 🧠 3 days ago
view post
Post
1794
Hallucinations Blog Research Reading List:

Hallucinations Are A Feature of AI, Humans Are The Bug https://hackernoon.com/hallucinations-are-a-feature-of-ai-humans-are-the-bug

Overcome LLM Hallucinations Using Knowledge Bases https://hackernoon.com/overcome-llm-hallucinations-using-knowledge-bases

How to Detect and Minimise Hallucinations in AI Models https://hackernoon.com/how-to-detect-and-minimise-hallucinations-in-ai-models

Predictive Coding, AI: Modeling Placebos in RCTs for Psychedelics and Antidepressants https://hackernoon.com/predictive-coding-ai-modeling-placebos-in-rcts-for-psychedelics-and-antidepressants

A Simple Method to Improving the Accuracy of Your RAG System https://hackernoon.com/say-goodbye-to-ai-hallucinations-a-simple-method-to-improving-the-accuracy-of-your-rag-system

Gen AI Hallucinations: The Good, the Bad, and the Costly https://hackernoon.com/gen-ai-hallucinations-the-good-the-bad-and-the-costly

Why Do LLMs Hallucinate? https://hackernoon.com/why-do-llms-hallucinate

Truth Serum For The AI Age: Factiverse To Fight Fake News And Hallucinations https://hackernoon.com/truth-serum-for-the-ai-age-factiverse-to-fight-fake-news-and-hallucinations

A Secret Technique To Sidestepping LLM Hallucinations https://hackernoon.com/a-secret-technique-to-sidestepping-llm-hallucinations

The Importance of Explainability in AI (XAI) https://hackernoon.com/tackling-ai-hallucinations-the-importance-of-explainability-in-ai-xai

What You Need to Know About Amazon Bedrock’s RAG Evaluation and LLM-as-a-Judge for Advancing AI https://hackernoon.com/what-you-need-to-know-about-amazon-bedrocks-rag-evaluation-and-llm-as-a-judge-for-advancing-ai

I Over Relied on AI and Those Shortcuts Cost Me https://hackernoon.com/i-over-relied-on-ai-and-those-shortcuts-cost-me

AI’s Non-Determinism, Hallucinations, And... Cats? https://hackernoon.com/ais-non-determinism-hallucinations-and-cats

More to read --> https://hackernoon.com/search?query=hallucinations

reacted to Kseniase's post with 🧠 4 days ago
view post
Post
3744
5 New implementations of Diffusion Models

Diffusion models are widely used for image and video generation but remain underexplored in text generation, where autoregressive models (ARMs) dominate. Unlike ARMs, which produce tokens sequentially, diffusion models iteratively refine noise through denoising steps, offering greater flexibility and speed.
Recent advancements show a shift toward using diffusion models in place of, or alongside, ARMs. Researchers also combine strengths from both methods and integrate autoregressive concepts into diffusion.

Here are 5 new implementations of diffusion models:

1. Mercury family of diffusion LLMs (dLLMs) by Inception Labs -> https://www.inceptionlabs.ai/news
It applies diffusion to text and code data, enabling sequence generation 10x faster than today's top LLMs. Now available Mercury Coder can run at over 1,000 tokens/sec on NVIDIA H100s.

2. Diffusion of Thoughts (DoT) -> Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models (2402.07754)
Integrates diffusion models with Chain-of-Thought. DoT allows reasoning steps to diffuse gradually over time. This flexibility enables balancing between reasoning quality and computational cost.

3. LLaDA -> Large Language Diffusion Models (2502.09992)
Shows diffusion models' potential in replacing ARMs. Trained with pre-training and SFT, LLaDA masks tokens, predicts them via a Transformer, and optimizes a likelihood bound. LLaDA matches key LLM skills, and surpasses GPT-4o in reversal poetry.

4. LanDiff -> The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation (2503.04606)
This hybrid text-to-video model combines autoregressive and diffusion paradigms, introducing a semantic tokenizer, an LM for token generation, and a streaming diffusion model. LanDiff outperforms models like Sora.

5. General Interpolating Discrete Diffusion (GIDD) -> Generalized Interpolating Discrete Diffusion (2503.04482)
A flexible noising process with a novel diffusion ELBO enables combining masking and uniform noise, allowing diffusion models to correct mistakes, where ARMs struggle.
  • 3 replies
·
reacted to clem's post with 🔥 5 days ago
view post
Post
6990
I was chatting with @peakji , one of the cofounders of Manu AI, who told me he was on Hugging Face (very cool!).

He shared an interesting insight which is that agentic capabilities might be more of an alignment problem rather than a foundational capability issue. Similar to the difference between GPT-3 and InstructGPT, some open-source foundation models are simply trained to 'answer everything in one response regardless of the complexity of the question' - after all, that's the user preference in chatbot use cases. Just a bit of post-training on agentic trajectories can make an immediate and dramatic difference.

As a thank you to the community, he shared 100 invite code first-come first serve, just use “HUGGINGFACE” to get access!
·
reacted to davidberenstein1957's post with ❤️ 8 days ago
reacted to albertvillanova's post with 🔥 8 days ago
view post
Post
3771
🚀 Big news for AI agents! With the latest release of smolagents, you can now securely execute Python code in sandboxed Docker or E2B environments. 🦾🔒

Here's why this is a game-changer for agent-based systems: 🧵👇

1️⃣ Security First 🔐
Running AI agents in unrestricted Python environments is risky! With sandboxing, your agents are isolated, preventing unintended file access, network abuse, or system modifications.

2️⃣ Deterministic & Reproducible Runs 📦
By running agents in containerized environments, you ensure that every execution happens in a controlled and predictable setting—no more environment mismatches or dependency issues!

3️⃣ Resource Control & Limits 🚦
Docker and E2B allow you to enforce CPU, memory, and execution time limits, so rogue or inefficient agents don’t spiral out of control.

4️⃣ Safer Code Execution in Production 🏭
Deploy AI agents confidently, knowing that any generated code runs in an ephemeral, isolated environment, protecting your host machine and infrastructure.

5️⃣ Easy to Integrate 🛠️
With smolagents, you can simply configure your agent to use Docker or E2B as its execution backend—no need for complex security setups!

6️⃣ Perfect for Autonomous AI Agents 🤖
If your AI agents generate and execute code dynamically, this is a must-have to avoid security pitfalls while enabling advanced automation.

⚡ Get started now: https://github.com/huggingface/smolagents

What will you build with smolagents? Let us know! 🚀💡
reacted to burtenshaw's post with ❤️ 9 days ago
view post
Post
3544
I’m super excited to work with @mlabonne to build the first practical example in the reasoning course.

🔗 https://huggingface.co/reasoning-course

Here's a quick walk through of the first drop of material that works toward the use case:

- a fundamental introduction to reinforcement learning. Answering questions like, ‘what is a reward?’ and ‘how do we create an environment for a language model?’

- Then it focuses on Deepseek R1 by walking through the paper and highlighting key aspects. This is an old school way to learn ML topics, but it always works.

- Next, it takes to you Transformers Reinforcement Learning and demonstrates potential reward functions you could use. This is cool because it uses Marimo notebooks to visualise the reward.

- Finally, Maxime walks us through a real training notebook that uses GRPO to reduce generation length. I’m really into this because it works and Maxime took the time to validate it share assets and logging from his own runs for you to compare with.

Maxime’s work and notebooks have been a major part of the open source community over the last few years. I, like everyone, have learnt so much from them.
posted an update 9 days ago
reacted to AdinaY's post with 😎 11 days ago
view post
Post
3996
Exciting releases from the Chinese community this February🔥
👉 zh-ai-community/2025-february-67a35aaa68e97812def5b6ef

MLLM:
✨ Ovis2 by Alibaba
AIDC-AI/ovis2-67ab36c7e497429034874464
✨ Step Audio Chat by StepFun AI
stepfun-ai/step-audio-67b33accf45735bb21131b0b

Audio:
✨ Step Audio TTS by StepFunAI
stepfun-ai/Step-Audio-TTS-3B
✨ InspireMusic by Alibaba
https://huggingface.co/FunAudioLLM
✨ Baichuan Audio by BaichuanAI
baichuan-inc/Baichuan-Audio-Instruct

Video:
✨ Wan2.1 by Alibaba_Wan
Wan-AI/Wan2.1-T2V-14B
✨ Stepvideo-T2V by StepFun AI
stepfun-ai/stepvideo-t2v
✨ SkyReels-V1 by Skywork
Skywork/skyreels-v1-67b34676ff65b4ec02d16307
✨ LLaDA-8B by RenminUniversity
GSAI-ML/LLaDA-8B-Instruct

MoE:
✨ Moonlight-16B by MoonshotAI (Kimi)
moonshotai/Moonlight-16B-A3B-Instruct

Reasoning:
✨ TinyR1-32B by Qihoo360
qihoo360/TinyR1-32B-Preview

Dataset:
✨ Chinese DeepSeek R1-Distill data -110k
Congliu/Chinese-DeepSeek-R1-Distill-data-110k
reacted to mkurman's post with ❤️ 13 days ago
view post
Post
3673
Introducing a new architecture, MedIT One – a single-token transformer with LSTM-like recurrence.

It is extremely fast in training and inference, but we lack funding for large-scale training. Enjoy 🍓

https://github.com/MedITSolutionsKurman/medit-one

reacted to davanstrien's post with 🧠 15 days ago
view post
Post
3615
Quick POC: Turn a Hugging Face dataset card into a short podcast introducing the dataset using all open models.

I think I'm the only weirdo who would enjoy listening to something like this though 😅

Here is an example for eth-nlped/stepverify
  • 2 replies
·
reacted to burtenshaw's post with 🔥 16 days ago
view post
Post
6202
Now the Hugging Face agent course is getting real! With frameworks like smolagents, LlamaIndex, and LangChain.

🔗 Follow the org for updates https://huggingface.co/agents-course

This week we are releasing the first framework unit in the course and it’s on smolagents. This is what the unit covers:

- why should you use smolagents vs another library?
- how to build agents that use code
- build multiagents systems
- use vision language models for browser use

The team has been working flat out on this for a few weeks. Led by @sergiopaniego and supported by smolagents author @m-ric .
reacted to freddyaboulton's post with 🚀 16 days ago
view post
Post
3155
Getting WebRTC and Websockets right in python is very tricky. If you've tried to wrap an LLM in a real-time audio layer then you know what I'm talking about.

That's where FastRTC comes in! It makes WebRTC and Websocket streams super easy with minimal code and overhead.

Check out our org: hf.co/fastrtc
posted an update 17 days ago
view post
Post
5849
Dropping some of the custom fine-tunes based on SigLIP2,
with a single/multi label classification problem type! 🌀🧤

- AI vs Deepfake vs Real : prithivMLmods/AI-vs-Deepfake-vs-Real-Siglip2
- Deepfake Detect : prithivMLmods/Deepfake-Detect-Siglip2
- Fire Detection : prithivMLmods/Fire-Detection-Siglip2
- Deepfake Quality Assess : prithivMLmods/Deepfake-Quality-Assess-Siglip2
- Guard Against Unsafe Content : prithivMLmods/Guard-Against-Unsafe-Content-Siglip2

🌠Collection : prithivMLmods/siglip2-custom-67bcdb2de8fe96b99fb4e19e
reacted to alvarobartt's post with 🔥 17 days ago
view post
Post
2842
🔥 Agents can do anything! @microsoft Research just announced the release of Magma 8B!

Magma is a new Visual Language Model (VLM) with 8B parameters for multi-modal agents designed to handle complex interactions across virtual and real environments; and it's MIT licensed!

Magma comes with exciting new features such as:
- Introduces the Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Leverages a large amount of unlabeled video data to learn the spatial-temporal grounding and planning
- A strong generalization and ability to be fine-tuned for other agentic tasks
- SOTA in different multi-modal benchmarks spanning across UI navigation, robotics manipulation, image / video understanding and spatial understanding and reasoning
- Generates goal-driven visual plans and actions for agentic use cases

Model: microsoft/Magma-8B
Technical Report: Magma: A Foundation Model for Multimodal AI Agents (2502.13130)
reacted to AdinaY's post with 🚀 17 days ago
view post
Post
3161
Try QwQ-Max-Preview, Qwen's reasoning model here👉 https://chat.qwen.ai
Can't wait for the model weights to drop on the Hugging Face Hub 🔥
·
replied to their post 19 days ago
view reply

@lunarflu
I read about it somewhere in a Microsoft article that, unlike the e− based, it doesn’t exist naturally but in specific equilibrium conditions with superconductors, and also with a backable magnetic field, it can proceed to obtain the particle to existence.

posted an update 20 days ago
view post
Post
5824
It's really interesting about the deployment of a new state of matter in Majorana 1: the world’s first quantum processor powered by topological qubits. If you missed this news this week, here are some links for you:

🅱️Topological qubit arrays: https://arxiv.org/pdf/2502.12252

⚛️ Quantum Blog: https://azure.microsoft.com/en-us/blog/quantum/2025/02/19/microsoft-unveils-majorana-1-the-worlds-first-quantum-processor-powered-by-topological-qubits/

📖 Read the story: https://news.microsoft.com/source/features/innovation/microsofts-majorana-1-chip-carves-new-path-for-quantum-computing/

📝 Majorana 1 Intro: https://youtu.be/Q4xCR20Dh1E?si=Z51DbEYnZFp_88Xp

🌀The Path to a Million Qubits: https://youtu.be/wSHmygPQukQ?si=TS80EhI62oWiMSHK
·
reacted to nicolay-r's post with 🚀 20 days ago
view post
Post
3740
📢 If you're looking for translating massive dataset of JSON-lines / CSV data with various set of source fields, then the following update would be relevant. So far and experimenting with adapting language specific Sentiment Analysis model, got a change to reforge and relaese bulk-translate 0.25.2.
⭐️ https://github.com/nicolay-r/bulk-translate/releases/tag/0.25.2

The update has the following major features
- Supporting schemas: all the columns to be translated are now could be declared within the same prompt-style format. using json this automatically allows to map them onto output fields
- The related updates for shell execution mode: schema parameter is now available alongside with just a prompt usage before.

Benefit is that your output is invariant. You can extend and stack various translators with separated shell laucnhes.

Screenshot below is the application of the google-translate engine in manual batching mode.
🚀 Performance: 2.5 it / sec (in the case of a single field translation)

🌟 about bulk-translate: https://github.com/nicolay-r/bulk-translate
🌌 nlp-thirdgate: https://github.com/nicolay-r/nlp-thirdgate?tab=readme-ov-file
  • 1 reply
·