🔍 DeepGit 2.0 — ColBERT‑Powered, Hardware‑Aware & Ready to Dig

Community Article Published April 18, 2025

GitHub’s great… until you actually have to find something

Stars are a popularity contest, keywords are brittle, and half the repos you open can’t even run on your laptop. DeepGit 2.0 fixes that by treating GitHub like a **research corpus** instead of a social feed.

DeepGit is a LangGraph-based agentic workflow for deep research across GitHub repositories. It searches, analyzes, and ranks repositories by user intent, surfacing lesser-known but highly relevant tools. It combines hybrid dense retrieval, cross-encoder re-ranking, and activity analysis in a single open-source platform for repository discovery.

🧩 What Makes DeepGit Different?

| Pain on vanilla GitHub | DeepGit’s antidote |
| --- | --- |
| Infinite scrolling through star‑inflated, outdated projects | ColBERT v2 semantic retrieval: token‑level MaxSim pulls conceptually relevant repos, not just fuzzy keyword hits |
| README looks good… until pip install dies | Hardware‑aware dependency filter: the agent reads requirements.txt / pyproject.toml and drops repos that need a GPU when you’re “GPU‑poor” |
| One metric (stars) ≠ quality | Multi‑factor ranking: cross‑encoder similarity, code‑quality heuristics, commit cadence & community health blend into a single relevance score |
| Time sink: clicking, reading, guessing | Tabulated results with similarity %, hardware badge, and one‑line justification, so you can decide in seconds |
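
One way to picture the multi‑factor ranking is as a weighted blend of normalized signals. The sketch below is only an illustration: the weights, the `RepoSignals` fields, and the sample repos are all made up, since the article does not state DeepGit's actual formula.

```python
from dataclasses import dataclass

# Hypothetical weights; DeepGit's real blend is not given in the article.
WEIGHTS = {
    "similarity": 0.50,
    "code_quality": 0.20,
    "commit_cadence": 0.15,
    "community_health": 0.15,
}


@dataclass
class RepoSignals:
    name: str
    similarity: float        # cross-encoder similarity, assumed scaled to [0, 1]
    code_quality: float      # heuristic score in [0, 1]
    commit_cadence: float    # recent activity in [0, 1]
    community_health: float  # stars/forks/issues signal in [0, 1]


def relevance(repo: RepoSignals) -> float:
    """Weighted sum of signals, each assumed already normalized to [0, 1]."""
    return sum(weight * getattr(repo, key) for key, weight in WEIGHTS.items())


def rank(repos):
    """Return repos ordered from most to least relevant."""
    return sorted(repos, key=relevance, reverse=True)


# Made-up repos: one popular but stale, one obscure but on-topic and active.
repos = [
    RepoSignals("popular-but-stale", 0.40, 0.60, 0.10, 0.90),
    RepoSignals("hidden-gem", 0.90, 0.80, 0.70, 0.40),
]
ranked = rank(repos)
```

With these illustrative weights, the on‑topic, actively maintained repo ranks above the popular but stale one, which is the behavior the row describes.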

🚀 What’s New in 2.0?

| Upgrade | Why it matters |
| --- | --- |
| ⚛ ColBERT‑v2 embeddings | Late‑interaction vectors capture phrase‑level context; surfaces hidden gems that single‑vector models miss |
| 🔩 Hardware‑aware filter | Add “cpu‑only”, “low‑memory” or “mobile” to your query and the agent prunes heavyweight repos automatically |
| ⚡ Faster cross‑encoder | MiniLM‑L‑6‑v2 keeps passage‑level accuracy while cutting latency |
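
To make "late interaction" concrete, here is a minimal pure‑Python sketch of MaxSim scoring: every query token vector is matched to its most similar document token vector, and the per‑token maxima are summed. The 2‑d vectors are toy values, not real ColBERT embeddings, and this is not DeepGit's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: best document match per query token, summed.

    Assumes doc_tokens is non-empty.
    """
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]                # two query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # matches both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]                # matches only the first

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

Because each query token is scored separately, a document that covers only part of the query (`doc_b`) scores lower than one that covers all of it (`doc_a`). Pooling everything into a single vector can blur that distinction.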

🛠 Inside the Agentic Pipeline

Query: “Fast Rust JSON parser that runs on cpu‑only”

| Stage | Behind the curtain |
| --- | --- |
| 1. Query Expansion | LLM rewrites to json-parser:rust:target-cpu |
| 2. Hardware Detection | “cpu‑only” recorded as a constraint |
| 3. ColBERT Retrieval | 280 repos scored via MaxSim over README & docs |
| 4. Cross‑Encoder Re‑rank | Top‑K rescored → 60 remain |
| 5. Dependency Filter | Model reads Cargo.toml & drops crates requiring CUDA |
| 6. Insight Merge | Adds stars, forks, issue velocity, code smells |
| 7. Output | Table with similarity %, CE‑score, and ✅ Runs on cpu‑only badge |
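
Stages 2 and 5 can be approximated with a simple keyword heuristic. This is a stand‑in for illustration only: in DeepGit the dependency check is described as a model reading the manifest, whereas the sketch below uses hypothetical keyword lists (`HARDWARE_KEYWORDS`, `GPU_MARKERS`) and made‑up candidate repos.

```python
# Hypothetical keyword lists, chosen for illustration only.
HARDWARE_KEYWORDS = ("cpu-only", "low-memory", "mobile")
GPU_MARKERS = ("cuda", "cudnn", "nvidia", "tensorrt")


def detect_constraints(query: str) -> set:
    """Stage 2 stand-in: record hardware keywords found in the query."""
    # Normalize non-breaking hyphens (U+2011) so "cpu‑only" matches "cpu-only".
    normalized = query.lower().replace("\u2011", "-")
    return {kw for kw in HARDWARE_KEYWORDS if kw in normalized}


def needs_gpu(dependency_names) -> bool:
    """True if any dependency name contains a GPU-related marker."""
    return any(
        marker in dep.lower()
        for dep in dependency_names
        for marker in GPU_MARKERS
    )


def filter_repos(repos, constraints):
    """Stage 5 stand-in: drop GPU-dependent repos under a cpu-only constraint."""
    if "cpu-only" not in constraints:
        return list(repos)
    return [repo for repo in repos if not needs_gpu(repo["dependencies"])]


# Made-up candidates; dependency names are illustrative, not real manifests.
candidates = [
    {"name": "json-fast", "dependencies": ["serde", "memchr"]},
    {"name": "json-gpu", "dependencies": ["serde", "example-cuda-bindings"]},
]
constraints = detect_constraints("Fast Rust JSON parser that runs on cpu-only")
kept = filter_repos(candidates, constraints)
```

A substring check like this will produce false positives and negatives, for example on optional GPU features. That is presumably why the pipeline delegates the judgment to a model rather than a fixed list.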

🔬 Technical Highlights

  • LangGraph orchestration – each tool is a node; loops until convergence
  • ColBERT‑v2 – pulled from colbert-ir/colbertv2.0, runs CPU or GPU
  • Cross‑Encoder – cross-encoder/ms-marco-MiniLM-L-6-v2 for re‑ranking
  • Dependency reasoning – the agent asks “Can this dependency list run on the user’s stated hardware?” and acts on the answer
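
The "each tool is a node; loops until convergence" idea can be sketched in plain Python without the LangGraph library. Everything below is an assumption made for illustration: the node functions, the state keys, and especially the stopping rule (stop once enough hardware‑compatible results exist or the corpus is exhausted), since the article does not define convergence.

```python
def retrieve(state):
    """Stand-in retrieval node: each round widens the candidate pool by one batch."""
    pool = state["corpus"][: state["round"] * state["batch"]]
    return {**state, "candidates": pool}


def filter_hardware(state):
    """Stand-in filter node: keep only candidates that do not need a GPU."""
    kept = [repo for repo in state["candidates"] if not repo["needs_gpu"]]
    return {**state, "results": kept}


def should_continue(state):
    """Hypothetical convergence rule: stop when enough results or nothing left."""
    exhausted = state["round"] * state["batch"] >= len(state["corpus"])
    enough = len(state["results"]) >= state["wanted"]
    return not (enough or exhausted)


def run(state):
    """Run the nodes in order, looping until should_continue says stop."""
    while True:
        state = {**state, "round": state["round"] + 1}
        for node in (retrieve, filter_hardware):
            state = node(state)
        if not should_continue(state):
            return state


# Made-up corpus for the demo.
corpus = [
    {"name": "a", "needs_gpu": True},
    {"name": "b", "needs_gpu": False},
    {"name": "c", "needs_gpu": True},
    {"name": "d", "needs_gpu": False},
]
final = run({
    "corpus": corpus, "round": 0, "batch": 2, "wanted": 2,
    "candidates": [], "results": [],
})
```

Each node takes the shared state and returns an updated copy, which keeps the tools independent of one another. In a real LangGraph graph the loop would be expressed with a conditional edge instead of a `while`.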

🚀 Goals

  • Uncover Hidden Gems:
    Surface powerful but under-the-radar open-source tools, now with a hardware-spec filter as well.

  • Empower Research:
    Build an intelligent discovery layer over GitHub tailored for research-focused developers.

  • Promote Open Innovation:
    Open-source the entire workflow to foster transparency and collaboration in research.

🧪 Try It Yourself

Zero‑GPU demo

👉 Hugging Face Space – https://huggingface.co/spaces/zamal/DeepGit-lite

Full local run

```bash
git clone https://github.com/zamalali/DeepGit.git
cd DeepGit
python -m venv venv && source venv/bin/activate   # Win → venv\Scripts\activate
pip install -r requirements.txt
export GITHUB_API_KEY=<your_token>
python app.py
```
