Did we just drop personalized AI evaluation?! This tool auto-generates custom benchmarks from your docs to test which models perform best on them.
Most benchmarks test general capabilities, but what matters is how models handle your data and tasks. YourBench helps answer critical questions like:
- Do you really need a hundreds-of-billions-parameter sledgehammer to crack a nut?
- Could a smaller, fine-tuned model work better?
- How well do different models understand your domain?
Some cool features:
- Generates custom benchmarks from your own documents (PDF, Word, HTML)
- Tests models on real tasks, not just general capabilities
- Supports multiple models for different pipeline stages
- Generates both single-hop and multi-hop questions
- Evaluates top models and deploys leaderboards instantly
- Full cost analysis to optimize for your budget
- Fully configurable via a single YAML file
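To make that last point concrete, here is a minimal sketch of what such a config could look like, parsed with PyYAML. Every key (model_list, pipeline, the stage names) is an illustrative assumption based on the feature list above, not YourBench's confirmed schema, so check the project docs for the real one.

```python
import yaml  # pip install pyyaml

# Hypothetical YourBench-style config; all keys below are assumptions.
CONFIG = """
model_list:
  - model_name: Qwen/Qwen2.5-32B-Instruct  # e.g. used for question generation
pipeline:
  ingestion:
    source_documents_dir: ./my_docs       # your PDFs, Word files, HTML
  single_shot_question_generation:
    run: true
  multi_hop_question_generation:
    run: true
"""

config = yaml.safe_load(CONFIG)
print(list(config["pipeline"]))  # the stages the run would execute
```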
26 SOTA models tested for question generation. Interesting finding: Qwen2.5 32B leads in question diversity, while smaller Qwen models and Gemini 2.0 Flash offer great value for cost.
You can also run it locally with any model you want.
A collection of 197,718 aviation photographs featuring:
- High-quality aircraft images across multiple sizes and formats
- Comprehensive metadata including aircraft registrations, types, and photographer information
- View counts, ratings, and submission timestamps for each photo
- Rich classification data preserving original titles, descriptions, and photographer badges
This dataset offers a unique visual archive of aircraft spanning commercial, military, and private aviation, captured by FlightAware's community of photographers under the CC BY-NC-SA 3.0 license.
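If you want to poke at the metadata, a minimal sketch with the datasets library might look like this. The repo id and column names are placeholders inferred from the field list above, not the verified schema.

```python
from datasets import load_dataset

# Placeholder repo id -- substitute the dataset's actual name on the Hub.
ds = load_dataset("your-org/aviation-photos", split="train")

# Column names are assumptions mirroring the metadata described above.
widely_viewed = ds.filter(lambda row: row["views"] > 10_000)
print(widely_viewed[0]["registration"], widely_viewed[0]["aircraft_type"])
```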
Before 2020, most of the AI field was open and collaborative. For me, that was the key factor that accelerated scientific progress and made the impossible possible: just look at the "T" in ChatGPT, which comes from the Transformer architecture openly shared by Google.
Then came the myth that AI was too dangerous to share, and companies started optimizing for short-term revenue. That led many major AI labs and researchers to stop sharing and collaborating.
With OAI and sama now saying they're willing to share open weights again, we have a real chance to return to a golden age of AI progress and democratization, powered by openness and collaboration, in the US and around the world.
This is incredibly exciting. Let's go, open science and open-source AI!
It's called diRAGnosis and is a lightweight framework that helps you diagnose the performance of LLMs and retrieval models in RAG applications.
You can launch it as an application locally (it's Docker-ready!) or, if you want more flexibility, you can integrate it into your code as a Python package.
The workflow is simple:
- You choose your favorite LLM provider and model (supported, for now: Mistral AI, Groq, Anthropic, OpenAI, and Cohere)
- You pick the embedding model provider and the embedding model you prefer (supported, for now: Mistral AI, Hugging Face, Cohere, and OpenAI)
- You prepare and provide your documents
- Documents are ingested into a Qdrant vector database and transformed into a synthetic question dataset with the help of LlamaIndex
- The LLM is evaluated on the faithfulness and relevancy of its retrieval-augmented answers to the questions
- The embedding model is evaluated on hit rate and mean reciprocal rank (MRR) of the retrieved documents (see the sketch below)
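Because it sits on top of LlamaIndex (more on that below), the retrieval half of that evaluation can be sketched with LlamaIndex's stock evaluators. This is my illustration of the underlying metrics under assumed defaults (an in-memory vector store instead of Qdrant, OpenAI models as placeholder providers), not diRAGnosis's actual code.

```python
import asyncio

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import (
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.llms.openai import OpenAI  # placeholder provider


async def main() -> None:
    # Ingest documents and build a vector index (diRAGnosis uses Qdrant;
    # the default in-memory store keeps this sketch short).
    docs = SimpleDirectoryReader("./my_docs").load_data()
    index = VectorStoreIndex.from_documents(docs)
    nodes = list(index.docstore.docs.values())

    # Turn the ingested chunks into a synthetic question dataset.
    llm = OpenAI(model="gpt-4o-mini")
    qa_dataset = generate_question_context_pairs(
        nodes, llm=llm, num_questions_per_chunk=2
    )

    # Score the retriever on hit rate and mean reciprocal rank (MRR).
    evaluator = RetrieverEvaluator.from_metric_names(
        ["hit_rate", "mrr"], retriever=index.as_retriever(similarity_top_k=5)
    )
    results = await evaluator.aevaluate_dataset(qa_dataset)
    print(sum(r.metric_vals_dict["mrr"] for r in results) / len(results))


asyncio.run(main())
```

The LLM side has analogous stock evaluators in the same module (FaithfulnessEvaluator and RelevancyEvaluator), which judge whether the answer is grounded in the retrieved context and actually addresses the question.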
And the cool thing is that all of this is intuitive and completely automated: you plug it in, and it works!
Even cooler? This is all built on top of LlamaIndex and its integrations: no need for tons of dependencies or fancy workarounds. And if you're a UI lover, Gradio and FastAPI are there to provide you with a seamless backend-to-frontend experience.