Spaces:
Running
Running
license: mit | |
title: CRAWLGPT | |
sdk: docker | |
emoji: π» | |
colorFrom: pink | |
colorTo: blue | |
pinned: true | |
short_description: A powerful web content crawler with LLM-powered RAG. | |
# CrawlGPT π€ | |
A powerful web content crawler with LLM-powered RAG (Retrieval Augmented Generation) capabilities. CrawlGPT extracts content from URLs, processes it through intelligent summarization, and enables natural language interactions using modern LLM technology. | |
## π Key Features | |
### Core Features | |
- **Intelligent Web Crawling** | |
- Async web content extraction using Playwright | |
- Smart rate limiting and validation | |
- Configurable crawling strategies | |
- **Advanced Content Processing** | |
- Automatic text chunking and summarization | |
- Vector embeddings via FAISS | |
- Context-aware response generation | |
- **Streamlit Chat Interface** | |
- Clean, responsive UI | |
- Real-time content processing | |
- Conversation history | |
- User authentication | |
### Technical Features | |
- **Vector Database** | |
- FAISS-powered similarity search | |
- Efficient content retrieval | |
- Persistent storage | |
- **User Management** | |
- SQLite database backend | |
- Secure password hashing | |
- Chat history tracking | |
- **Monitoring & Utils** | |
- Request metrics collection | |
- Progress tracking | |
- Data import/export | |
- Content validation | |
## π₯ Demo | |
### [Deployed APP ππ€](https://huggingface.co/spaces/jatinmehra/CRAWL-GPT-CHAT) | |
[streamlit-chat_app video.webm](https://github.com/user-attachments/assets/ae1ddca0-9e3e-4b00-bf21-e73bb8e6cfdf) | |
_Example of CRAWLGPT in action!_ | |
## π§ Requirements | |
- Python >= 3.8 | |
- Operating System: OS Independent | |
- Required packages are handled by the setup script. | |
## π Quick Start | |
1. Clone the Repository: | |
```git clone https://github.com/Jatin-Mehra119/CRAWLGPT.git | |
cd CRAWLGPT | |
``` | |
2. Run the Setup Script: | |
``` | |
python -m setup_env | |
``` | |
_This script installs dependencies, creates a virtual environment, and prepares the project._ | |
3. Update Your Environment Variables: | |
- Create or modify the `.env` file. | |
- Add your Groq API key and Ollama API key. Learn how to get API keys. | |
``` | |
GROQ_API_KEY=your_groq_api_key_here | |
OLLAMA_API_TOKEN=your_ollama_api_key_here | |
``` | |
4. Activate the Virtual Environment: | |
``` | |
source .venv/bin/activate # On Unix/macOS | |
.venv\Scripts\activate # On Windows | |
``` | |
5. Run the Application: | |
``` | |
python -m streamlit run src/crawlgpt/ui/chat_app.py | |
``` | |
## π¦ Dependencies | |
### Core Dependencies | |
- `streamlit==1.41.1` | |
- `groq==0.15.0` | |
- `sentence-transformers==3.3.1` | |
- `faiss-cpu==1.9.0.post1` | |
- `crawl4ai==0.4.247` | |
- `python-dotenv==1.0.1` | |
- `pydantic==2.10.5` | |
- `aiohttp==3.11.11` | |
- `beautifulsoup4==4.12.3` | |
- `numpy==2.2.0` | |
- `tqdm==4.67.1` | |
- `playwright>=1.41.0` | |
- `asyncio>=3.4.3` | |
### Development Dependencies | |
- `pytest==8.3.4` | |
- `pytest-mockito==0.0.4` | |
- `black==24.2.0` | |
- `isort==5.13.0` | |
- `flake8==7.0.0` | |
## ποΈ Project Structure | |
``` | |
crawlgpt/ | |
βββ src/ | |
β βββ crawlgpt/ | |
β βββ core/ # Core functionality | |
β β βββ database.py # SQL database handling | |
β β βββ LLMBasedCrawler.py # Main crawler implementation | |
β β βββ DatabaseHandler.py # Vector database (FAISS) | |
β β βββ SummaryGenerator.py # Text summarization | |
β βββ ui/ # User Interface | |
β β βββ chat_app.py # Main Streamlit app | |
β β βββ chat_ui.py # Development UI | |
β β βββ login.py # Authentication UI | |
β βββ utils/ # Utilities | |
β βββ content_validator.py # URL/content validation | |
β βββ data_manager.py # Import/export handling | |
β βββ helper_functions.py # General helpers | |
β βββ monitoring.py # Metrics collection | |
β βββ progress.py # Progress tracking | |
βββ tests/ # Test suite | |
β βββ test_core/ | |
β βββ test_database_handler.py # Vector DB tests | |
β βββ test_integration.py # Integration tests | |
β βββ test_llm_based_crawler.py # Crawler tests | |
β βββ test_summary_generator.py # Summarizer tests | |
βββ .github/ # CI/CD | |
β βββ workflows/ | |
β βββ Push_to_hf.yaml # HuggingFace sync | |
βββ Docs/ | |
β βββ MiniDoc.md # Documentation | |
βββ .dockerignore # Docker exclusions | |
βββ .gitignore # Git exclusions | |
βββ Dockerfile # Container config | |
βββ LICENSE # MIT License | |
βββ README.md # Project documentation | |
βββ README_hf.md # HuggingFace README | |
βββ pyproject.toml # Project metadata | |
βββ pytest.ini # Test configuration | |
βββ setup_env.py # Environment setup | |
``` | |
## π§ͺ Testing | |
Run all tests | |
``` | |
python -m pytest | |
``` | |
_The tests include unit tests for core functionality and integration tests for end-to-end workflows._ | |
## π License | |
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. | |
## π Links | |
- [Bug Tracker](https://github.com/Jatin-Mehra119/crawlgpt/issues) | |
- [Documentation](https://github.com/Jatin-Mehra119/crawlgpt/wiki) | |
- [Source Code](https://github.com/Jatin-Mehra119/crawlgpt) | |
## π§‘ Acknowledgments | |
- Inspired by the potential of GPT models for intelligent content processing. | |
- Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools. | |
## π¨βπ» Author | |
- Jatin Mehra ([email protected]) | |
## π€ Contributing | |
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal. | |
1. Fork the Project. | |
2. Create your Feature Branch: | |
``` | |
git checkout -b feature/AmazingFeature` | |
``` | |
3. Commit your Changes: | |
``` | |
git commit -m 'Add some AmazingFeature | |
``` | |
4. Push to the Branch: | |
``` | |
git push origin feature/AmazingFeature | |
``` | |
5. Open a Pull Request. |