🔍 DeepGit 2.0 — ColBERT‑Powered, Hardware‑Aware & Ready to Dig
Community Article
Published April 18, 2025

GitHub’s great… until you actually have to find something
Stars are a popularity contest, keywords are brittle, and half the repos you open can’t even run on your laptop. DeepGit 2.0 fixes that by treating GitHub like a **research corpus** instead of a social feed.
DeepGit is a LangGraph-based agentic workflow for deep research across GitHub repositories. It searches, analyzes, and ranks repositories by user intent, surfacing lesser-known but highly relevant tools along the way. It combines hybrid dense retrieval, cross-encoder re-ranking, and repository activity analysis in a single open-source platform for repository discovery.
🧩 What Makes DeepGit Different?
Pain on vanilla GitHub | DeepGit’s antidote |
---|---|
Infinite scrolling through star‑inflated, outdated projects | ColBERT v2 semantic retrieval – token‑level MaxSim pulls conceptually relevant repos, not just fuzzy keyword hits |
README looks good… until pip install dies | Hardware‑aware dependency filter – the agent reads requirements.txt / pyproject.toml and drops repos that need a GPU when you’re “GPU‑poor” |
One metric (stars) ≠ quality | Multi‑factor ranking – cross‑encoder similarity, code‑quality heuristics, commit cadence & community health blend into a single relevance score |
Time sink: Clicking, reading, guessing | Tabulated results with similarity %, hardware badge, and one‑line justification – decide in seconds |
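
To make the multi‑factor ranking row concrete, here is a minimal sketch of how four signals could blend into one relevance score. The article does not publish the actual formula, so the weights, the `RepoSignals` fields, and the [0, 1] scaling below are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights: the article does not state the real blend.
WEIGHTS = {
    "similarity": 0.5,
    "code_quality": 0.2,
    "commit_cadence": 0.15,
    "community_health": 0.15,
}


@dataclass
class RepoSignals:
    """Per-repository signals, each assumed to be pre-scaled to [0, 1]."""
    name: str
    similarity: float        # cross-encoder similarity
    code_quality: float      # code-quality heuristic
    commit_cadence: float    # how regularly the repo receives commits
    community_health: float  # e.g. issue and PR responsiveness


def relevance(repo: RepoSignals) -> float:
    """Weighted sum of the signals; weights sum to 1, so the score stays in [0, 1]."""
    return sum(weight * getattr(repo, field) for field, weight in WEIGHTS.items())


def rank(repos: list[RepoSignals]) -> list[RepoSignals]:
    """Sort repositories from most to least relevant."""
    return sorted(repos, key=relevance, reverse=True)


if __name__ == "__main__":
    repos = [
        RepoSignals("popular-but-stale", 0.55, 0.6, 0.1, 0.4),
        RepoSignals("hidden-gem", 0.9, 0.7, 0.8, 0.6),
    ]
    for repo in rank(repos):
        print(f"{repo.name}: {relevance(repo):.2f}")
```

Note that star count is deliberately absent from the score in this sketch: a repo with few stars can still rank first if it matches the query and is actively maintained.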

🚀 What’s New in 2.0?
Upgrade | Why it matters |
---|---|
⚛ ColBERT‑v2 embeddings | Late‑interaction vectors capture phrase‑level context; surfaces hidden gems that single‑vector models miss |
🔩 Hardware‑aware filter | Add “cpu‑only”, “low‑memory” or “mobile” to your query – the agent prunes heavyweight repos automatically |
⚡ Faster cross‑encoder | MiniLM‑L‑6‑v2 keeps passage‑level accuracy while chopping latency |
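
For readers new to late interaction, the MaxSim idea can be sketched in a few lines of plain Python. This is a toy version that assumes token embeddings are already computed and passed in as lists of floats; the real ColBERT‑v2 model produces those embeddings and runs the scoring on tensors.

```python
import math

Vector = list[float]


def normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: list[Vector], doc_tokens: list[Vector]) -> float:
    """Late-interaction score: for each query token, keep its best-matching
    document token (maximum cosine similarity), then sum over query tokens."""
    if not query_tokens or not doc_tokens:
        return 0.0
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


if __name__ == "__main__":
    # Toy 2-D "embeddings": two query tokens, two candidate documents.
    query = [[1.0, 0.0], [0.0, 1.0]]
    covers_both = [[1.0, 0.0], [0.0, 2.0]]
    covers_one = [[1.0, 0.0]]
    print(maxsim(query, covers_both), maxsim(query, covers_one))
```

Because every query token looks for its own best match, a document that covers all parts of the query scores higher than one that matches a single term very well. A single pooled vector per document cannot make that distinction.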
🛠 Inside the Agentic Pipeline
Query: “Fast Rust JSON parser that runs on cpu‑only”
Stage | Behind the curtain |
---|---|
1. Query Expansion | LLM rewrites to json-parser:rust:target-cpu |
2. Hardware Detection | “cpu‑only” recorded as a constraint |
3. ColBERT Retrieval | 280 repos scored via MaxSim over README & docs |
4. Cross‑Encoder Re‑rank | Top‑K rescored → 60 remain |
5. Dependency Filter | Model reads Cargo.toml & drops crates requiring CUDA |
6. Insight Merge | Adds stars, forks, issue velocity, code smells |
7. Output | Table with similarity %, CE‑score, and ✅ Runs on cpu‑only badge |
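
The funnel in the table above can be sketched as a chain of small functions that pass a shared state along. This is not DeepGit's code: the stage functions, the state keys, and the precomputed `colbert` / `cross_encoder` / `needs_gpu` fields are hypothetical stand‑ins for the real model calls, and a plain loop stands in for the LangGraph graph. Only the hardware keywords come from the article.

```python
from typing import Callable

# Keywords taken from the article's examples.
HARDWARE_KEYWORDS = ("cpu-only", "low-memory", "mobile")

State = dict
Stage = Callable[[State], State]


def detect_hardware(query: str) -> list[str]:
    """Stage 2: record any hardware keyword found in the query as a constraint."""
    # Typeset text often uses a non-breaking hyphen (U+2011); normalize it first.
    text = query.lower().replace("\u2011", "-")
    return [kw for kw in HARDWARE_KEYWORDS if kw in text]


def run_pipeline(state: State, stages: list[Stage]) -> State:
    """Pass the shared state through each stage in order."""
    for stage in stages:
        state = stage(state)
    return state


# Hypothetical stand-ins for stages 3-5; scores are assumed to be precomputed.
def retrieve(state: State) -> State:
    state["candidates"] = sorted(
        state["corpus"], key=lambda r: r["colbert"], reverse=True
    )
    return state


def rerank(state: State) -> State:
    top = state["candidates"][: state["top_k"]]
    state["candidates"] = sorted(top, key=lambda r: r["cross_encoder"], reverse=True)
    return state


def filter_hardware(state: State) -> State:
    if "cpu-only" in state["constraints"]:
        state["candidates"] = [r for r in state["candidates"] if not r["needs_gpu"]]
    return state


if __name__ == "__main__":
    query = "Fast Rust JSON parser that runs on cpu-only"
    state = {
        "constraints": detect_hardware(query),
        "top_k": 2,
        "corpus": [
            {"name": "json-a", "colbert": 0.9, "cross_encoder": 0.7, "needs_gpu": False},
            {"name": "json-b", "colbert": 0.8, "cross_encoder": 0.9, "needs_gpu": True},
            {"name": "json-c", "colbert": 0.2, "cross_encoder": 0.5, "needs_gpu": False},
        ],
    }
    result = run_pipeline(state, [retrieve, rerank, filter_hardware])
    print([r["name"] for r in result["candidates"]])
```

Each stage only narrows or reorders the candidate list, which is why the cheap retrieval step runs first and the more expensive checks see fewer repos.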

🔬 Technical Highlights
- LangGraph orchestration – each tool is a node; loops until convergence
- ColBERT‑v2 – pulled from colbert-ir/colbertv2.0, runs on CPU or GPU
- Cross‑Encoder – cross-encoder/ms-marco-MiniLM-L-6-v2 for re‑ranking
- Dependency reasoning – the agent asks “Can this dependency list run on ?” and acts on the answer
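
As a rough illustration of the dependency check, here is a keyword heuristic over the text of a requirements.txt. DeepGit itself has the agent reason about the dependency list, so this static lookup is a swapped‑in simplification, not the project's method. The package list is illustrative and far from exhaustive, and every function name is hypothetical.

```python
import re

# Illustrative, non-exhaustive list of packages that generally need an NVIDIA GPU.
GPU_ONLY_PACKAGES = {
    "pycuda",
    "cupy",
    "flash-attn",
    "tensorrt",
    "onnxruntime-gpu",
    "tensorflow-gpu",
}
GPU_ONLY_PREFIXES = ("nvidia-", "cupy-cuda")

_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def package_names(requirements_text: str) -> list[str]:
    """Extract normalized package names, skipping comments and pip option lines."""
    names = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = _NAME.match(line)
        if match:
            # Lowercase and collapse runs of -, _ and . into a single hyphen.
            names.append(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def gpu_blockers(requirements_text: str) -> list[str]:
    """Return the dependencies that would likely fail on a machine without a GPU."""
    return [
        name
        for name in package_names(requirements_text)
        if name in GPU_ONLY_PACKAGES or name.startswith(GPU_ONLY_PREFIXES)
    ]


def runs_on_cpu_only(requirements_text: str) -> bool:
    return not gpu_blockers(requirements_text)


if __name__ == "__main__":
    sample = "numpy>=1.24\nflash_attn==2.5.0  # needs CUDA\nrequests\n"
    print(gpu_blockers(sample), runs_on_cpu_only(sample))
```

A static list like this misses packages that are GPU‑optional or that need a GPU only for certain features, which is where asking a model to reason about the full dependency list earns its keep.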
🚀 Goals
- Uncover Hidden Gems: Surface powerful but under-the-radar open-source tools. It now comes with a hardware spec filter too.
- Empower Research: Build an intelligent discovery layer over GitHub tailored for research-focused developers.
- Promote Open Innovation: Open-source the entire workflow to foster transparency and collaboration in research.
🧪 Try It Yourself
Zero‑GPU demo
👉 Hugging Face Space – https://huggingface.co/spaces/zamal/DeepGit-lite
Full local run
git clone https://github.com/zamalali/DeepGit.git
cd DeepGit
python -m venv venv && source venv/bin/activate # Win → venv\Scripts\activate
pip install -r requirements.txt
export GITHUB_API_KEY=<your_token>
python app.py