Dataset Tools (community)

AI & ML interests: Tools for creating and exploring datasets

Recent Activity

Dataset-Tools's activity

fdaudens
posted an update about 13 hours ago
prithivMLmods
posted an update 4 days ago
Hey Guys! One Small Announcement 🤗
Stranger Zone now accepts LoRA requests!

✍️ Request: strangerzonehf/Request-LoRA (or strangerzonehf/Request-LoRA#1)

Page: https://huggingface.co/strangerzonehf

Describe the artistic properties you're looking for by posting sample images or links to similar images in the request discussion. If the adapters you're asking for are truly creative and safe for work, I'll train and upload the LoRA to the Stranger Zone repo!

Thank you!
fdaudens
posted an update 4 days ago
🤯 Gemma 3's image analysis blew me away!

Tested 2 ways to extract airplane registration numbers from photos with the 12B model:

1️⃣ Gradio app w/ API link (underrated feature IMO) + ZeroGPU infra on Hugging Face in Google Colab. Fast & free.

2️⃣ LM Studio + local processing (100% private). Running this powerhouse on a MacBook w/ 16GB RAM is wild! 🚀

Colab: https://colab.research.google.com/drive/1YmmaP0IDEu98CLDppAAK9kbQZ7lFnLZ1?usp=sharing
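If you'd rather script option 2️⃣ than use the LM Studio chat UI, here's a minimal sketch that assumes LM Studio's OpenAI-compatible local server is running with a Gemma 3 vision build loaded; the port, model id, and image filename are placeholders, not details from the original post:

```python
import base64

from openai import OpenAI

# LM Studio exposes an OpenAI-compatible server; the port, model id, and
# image filename below are placeholders for your local setup.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("airplane.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gemma-3-12b-it",  # whatever name your local Gemma 3 build uses
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the aircraft registration number in this photo?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```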
prithivMLmods
posted an update 6 days ago
Gemma-3-4B: Image and Video Inference 🖼️🎥

🧀 Space: prithivMLmods/Gemma-3-Multimodal
Git: https://github.com/PRITHIVSAKTHIUR/Gemma-3-Multimodal

@gemma3 : {Tag + Space + 'prompt'}
@video-infer : {Tag + Space + 'prompt'}

+ Gemma3-4B: google/gemma-3-4b-it
+ By default, it runs: prithivMLmods/Qwen2-VL-OCR-2B-Instruct

Gemma 3 Technical Report: https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
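For reference, a rough sketch of plain image inference with google/gemma-3-4b-it in transformers (not the Space's actual code; it assumes a recent transformers release with Gemma 3 support, and the image URL is just an example):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# One image plus a text instruction in chat format. Depending on your
# transformers version the image key may be "image" instead of "url".
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/sample.jpg"},  # example URL
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```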
fdaudens
posted an update 6 days ago
Ever wanted 45 min with one of AI's most fascinating minds? I was with @thomwolf at HumanX Vegas. Sharing my notes from his Q&A with the press; it completely changed how I think about AI's future:

1️⃣ The next wave of successful AI companies won't be defined by who has the best model but by who builds the most useful real-world solutions. "We all have engines in our cars, but that's rarely the only reason we buy one. We expect it to work well, and that's enough. LLMs will be the same."

2️⃣ Big players are pivoting: "Closed-source companies – OpenAI being the first – have largely shifted from LLM announcements to product announcements."

3️⃣ Open source is changing everything: "DeepSeek was open source AI's ChatGPT moment. Basically, everyone outside the bubble realized you can get a model for free – and it's just as good as the paid ones."

4️⃣ Product innovation is being democratized: Take Manus, for example – they built a product on top of Anthropic's models that's "actually better than Anthropic's own product for now, in terms of agents." This proves that anyone can build great products with existing models.

We're entering a "multi-LLM world," where models are becoming commoditized, and all the tools to build are readily available – just look at the flurry of daily new releases on Hugging Face.

Thom's comparison to the internet era is spot-on: "In the beginning you made a lot of money by making websites... but nowadays the huge internet companies are not the companies that built websites. Like Airbnb, Uber, Facebook, they just use the internet as a medium to make something for real life use cases."

Love to hear your thoughts on this shift!
fdaudens
posted an update 6 days ago
🔥 The Open R1 team just dropped OlympicCoder and it's wild:

- 7B model outperforms Claude 3.7 Sonnet on IOI benchmark (yes, 7B!!)
- 32B crushes all open-weight models tested, even those 100x larger 🤯

Open-sourcing the future of code reasoning! 🚀

Check it out: https://huggingface.co/blog/open-r1/update-3
prithivMLmods
posted an update 7 days ago
fdaudens
posted an update 9 days ago
Honored to be named among the 12 pioneers and power players in the news industry in the 2025 Tech Trends Report from Future Today Strategy Group.

Incredible group to be part of - each person is doing groundbreaking work at the intersection of AI and journalism. Worth following them all: they're consistently sharing practical insights on building the future of news.

Take the time to read this report, it's packed with insights as always. The news & information section's #1 insight hits hard: "The most substantive economic impact of AI to date has been licensing payouts for a handful of big publishers. The competition will start shifting in the year ahead to separate AI 'haves' that have positioned themselves to grow from the 'have-nots.'"

This AI-driven divide is something I've been really concerned about. Now is the time to build more than ever!

👉 Full report here: https://ftsg.com/wp-content/uploads/2025/03/FTSG_2025_TR_FINAL_LINKED.pdf
Tonic
posted an update 11 days ago
🙋🏻‍♂️ Hey there folks,

Did you know that you can use ModernBERT to detect model hallucinations?

Check out the demo: Tonic/hallucination-test

See here for a medical-context demo: MultiTransformer/tonic-discharge-guard

Check out the model from KRLabs: KRLabsOrg/lettucedect-large-modernbert-en-v1

And the library they kindly open-sourced for it: https://github.com/KRLabsOrg/LettuceDetect

👆🏻 If you like this topic, please contribute code upstream 🚀

fdaudens
posted an update 12 days ago
AI will bring us "a country of yes-men on servers" instead of one of "Einsteins sitting in a data center" if we continue on current trends.

Must-read by @thomwolf deflating overblown AI promises and explaining what real scientific breakthroughs require.

https://thomwolf.io/blog/scientific-ai.html
davidberenstein1957
posted an update 12 days ago
Tonic
posted an update 13 days ago
Powered by KRLabsOrg/lettucedect-large-modernbert-en-v1 from KRLabsOrg.

Detect hallucinations in answers based on context and questions using ModernBERT with 8192-token context support!

### Model Details
- **Model Name**: [lettucedect-large-modernbert-en-v1](https://huggingface.co/KRLabsOrg/lettucedect-large-modernbert-en-v1)
- **Organization**: [KRLabsOrg](https://huggingface.co/KRLabsOrg)
- **Github**: [https://github.com/KRLabsOrg/LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect)
- **Architecture**: ModernBERT (Large) with extended context support up to 8192 tokens
- **Task**: Token Classification / Hallucination Detection
- **Training Dataset**: [RAGTruth](https://huggingface.co/datasets/wandb/RAGTruth-processed)
- **Language**: English
- **Capabilities**: Detects hallucinated spans in answers, provides confidence scores, and calculates average confidence across detected spans.

LettuceDetect excels at processing long documents to determine if an answer aligns with the provided context, making it a powerful tool for ensuring factual accuracy.
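To make that concrete, here is a short sketch using the LettuceDetect library linked above; the class and method names follow the project's README at the time of writing, so treat them as assumptions if the API has since changed:

```python
# pip install lettucedetect
from lettucedetect.models.inference import HallucinationDetector

# Span-level hallucination detection with the ModernBERT-based model.
detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-large-modernbert-en-v1",
)

context = [
    "France is a country in Europe. The capital of France is Paris. "
    "The population of France is 67 million."
]
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."

# Returns hallucinated spans with character offsets and confidence scores.
spans = detector.predict(
    context=context, question=question, answer=answer, output_format="spans"
)
print(spans)
```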
prithivMLmods
posted an update 13 days ago
davidberenstein1957
posted an update 14 days ago
🥊 Epic Agent Framework Showdown! Available today!

🔵 In the blue corner, the versatile challenger with a proven track record of knowledge retrieval: LlamaIndex!

🛑 In the red corner, the defender, weighing in with lightweight efficiency: Hugging Face smolagents!

🔗 URL: https://huggingface.co/agents-course

We just published the LlamaIndex unit for the agents course, and it offers a great contrast with the smolagents unit by looking at:

- What makes llama-index stand out
- How the LlamaHub is used for integrations
- Creating QueryEngine components (see the sketch after this list)
- Using agents and tools
- Agentic and multi-agent workflows
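Not from the course material itself: a minimal QueryEngine sketch showing the shape of the API the unit covers. It assumes llama-index is installed, a local ./data folder with a few files (a placeholder), and, with default settings, an OpenAI API key for the embedding model and LLM.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local files, embed them into an in-memory vector index,
# and expose the index as a QueryEngine.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What does this corpus say about agents?")
print(response)
```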

The team has been working flat out on this for a few weeks, supported by Logan Markewich and Laurie Voss over at LlamaIndex.

Who won? You decide!
davidberenstein1957
posted an update 14 days ago
🫸 New release to push vector search to the Hub with vicinity and work with any serialisable objects.

🧑‍🏫 Supported backends: KNN, HNSW, USEARCH, ANNOY, PYNNDESCENT, FAISS, and VOYAGER.

🔗 Example repo: minishlab/my-vicinity-repo
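Here's a rough sketch of what that looks like in code, based on vicinity's documented interface; treat the backend choice and the push_to_hub call as assumptions to check against the current release:

```python
import numpy as np
from vicinity import Backend, Vicinity

# Items can be any serialisable objects; vectors are their embeddings.
items = ["dog", "cat", "airplane"]
vectors = np.random.rand(3, 256).astype(np.float32)  # stand-in embeddings

vicinity = Vicinity.from_vectors_and_items(
    vectors=vectors, items=items, backend_type=Backend.BASIC
)

# Nearest-neighbour query for a single query vector.
results = vicinity.query(np.random.rand(1, 256).astype(np.float32), k=2)
print(results)

# Push the index to the Hugging Face Hub (repo id is a placeholder).
vicinity.push_to_hub("your-username/my-vicinity-repo")
```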
zamal
posted an update 16 days ago
🚀 ftBoost is LIVE – Stop Struggling with Fine-Tuning Data!

Alright folks, if you're tired of manually crafting fine-tuning datasets, ftBoost is here to do the heavy lifting. One-click, LangChain-Groq-powered data augmentation that scales your training data in OpenAI, Gemini, Mistral, and LLaMA formats, automatically.

🔥 What's inside?
✅ Smart Augmentations – Paraphrasing, back translation, synonym swapping & synthetic noise.
✅ No more JSONL headaches – Auto-formats everything for OpenAI, Gemini, Mistral & LLaMA (see the format sketch at the end of this post).
✅ Custom tuning – Adjust similarity, diversity, and fluency in real time.
✅ Upload, generate, download – That's it.

⚡ If you're fine-tuning LLMs, this will save you hours.

🚀 Try it now: 👉 zamal/Finetune-Boost

🌟 Give us a star on GitHub!

Let me know what you think & how it boosts your workflow! 🔥
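For reference, the OpenAI-style chat fine-tuning JSONL mentioned in the feature list is just one JSON object per line with a messages array; a tiny sample written from Python might look like this (the example records are invented for illustration):

```python
import json

# Two made-up augmented examples in OpenAI chat fine-tuning format:
# one JSON object per line, each holding a list of chat messages.
examples = [
    {"messages": [
        {"role": "user", "content": "Summarise the ticket in one sentence."},
        {"role": "assistant", "content": "The customer cannot reset their password."},
    ]},
    {"messages": [
        {"role": "user", "content": "Give a one-sentence summary of the support ticket."},
        {"role": "assistant", "content": "The user is locked out after a failed password reset."},
    ]},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```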
fdaudens
posted an update 18 days ago
What if AI becomes as ubiquitous as the internet, but runs locally and transparently on our devices?

Fascinating TED talk by @thomwolf on open source AI and its future impact.

Imagine this for AI: instead of black box models running in distant data centers, we get transparent AI that runs locally on our phones and laptops, often without needing internet access. If the original team moves on? No problem - resilience is one of the beauties of open source. Anyone (companies, collectives, or individuals) can adapt and fix these models.

This is a compelling vision of AI's future that solves many of today's concerns around AI transparency and centralized control.

Watch the full talk here: https://www.ted.com/talks/thomas_wolf_what_if_ai_just_works
davanstrien
posted an update 18 days ago
📊 Introducing "Hugging Face Dataset Spotlight" 📊

I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.
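If you want to peek at any of the datasets above yourself, a quick sketch with the datasets library (the split name is an assumption; check each dataset card):

```python
from datasets import load_dataset

# Stream the first record of one of the featured datasets without
# downloading everything; the split name is an assumption.
ds = load_dataset("open-r1/OpenR1-Math-220k", split="train", streaming=True)
print(next(iter(ds)))
```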

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4
davanstrien
posted an update 19 days ago
Quick POC: Turn a Hugging Face dataset card into a short podcast introducing the dataset using all open models.

I think I'm the only weirdo who would enjoy listening to something like this though 😅

Here is an example for eth-nlped/stepverify
fdaudens
posted an update 19 days ago
Is this the best tool to extract clean info from PDFs, handwriting and complex documents yet?

Open source olmOCR just dropped and the results are impressive.

Tested the free demo with various documents, including a handwritten Claes Oldenburg letter. The speed is impressive: 3,000 tokens/second on your own GPU, at roughly 1/32 the cost of GPT-4o (about $190 per million pages). Game-changer for content extraction and digital archives.

To achieve this, Ai2 trained a 7B vision language model on 260K pages from 100K PDFs using "document anchoring" - combining PDF metadata with page images.

Best part: it actually understands document structure (columns, tables, equations) instead of just jumbling everything together like most OCR tools. Their human eval results back this up.

👉 Try the demo: https://olmocr.allenai.org

Going right into the AI toolkit: JournalistsonHF/ai-toolkit