---
license: mit
title: CRAWLGPT
sdk: docker
emoji: 💻
colorFrom: pink
colorTo: blue
pinned: true
short_description: A powerful web content crawler with LLM-powered RAG.
---
# CrawlGPT 🤖
A powerful web content crawler with LLM-powered RAG (Retrieval Augmented Generation) capabilities. CrawlGPT extracts content from URLs, processes it through intelligent summarization, and enables natural language interactions using modern LLM technology.
## 🌟 Key Features
### Core Features
- **Intelligent Web Crawling**
- Async web content extraction using Playwright
- Smart rate limiting and validation
- Configurable crawling strategies
- **Advanced Content Processing**
- Automatic text chunking and summarization
- Vector embeddings via FAISS
- Context-aware response generation
- **Streamlit Chat Interface**
- Clean, responsive UI
- Real-time content processing
- Conversation history
- User authentication
### Technical Features
- **Vector Database**
- FAISS-powered similarity search
- Efficient content retrieval
- Persistent storage
- **User Management**
- SQLite database backend
- Secure password hashing
- Chat history tracking
- **Monitoring & Utils**
- Request metrics collection
- Progress tracking
- Data import/export
- Content validation
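The retrieval flow behind these features is: embed each content chunk, index the embeddings, and answer queries by similarity search. The real implementation uses sentence-transformers embeddings and a FAISS index; the dependency-free sketch below only illustrates the idea, and all names and toy vectors in it are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k stored chunks most similar to the query vector.

    `index` is a list of (chunk_text, embedding) pairs -- FAISS replaces
    this linear scan with an optimized nearest-neighbour structure.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in index]
    return [text for _, text in sorted(scored, reverse=True)[:k]]

# Toy 3-dimensional "embeddings" standing in for sentence-transformer output.
index = [
    ("Playwright drives a headless browser.", [0.9, 0.1, 0.0]),
    ("FAISS performs similarity search.",     [0.1, 0.9, 0.1]),
    ("Streamlit renders the chat UI.",        [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.1], index, k=1))
# -> ['Playwright drives a headless browser.']
```

The retrieved chunks are then passed to the LLM as context for the response.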
## 🎥 Demo
### [Deployed APP 🚀🤖](https://huggingface.co/spaces/jatinmehra/CRAWL-GPT-CHAT)
[streamlit-chat_app video.webm](https://github.com/user-attachments/assets/ae1ddca0-9e3e-4b00-bf21-e73bb8e6cfdf)
_Example of CRAWLGPT in action!_
## 🔧 Requirements
- Python >= 3.8
- Operating System: OS Independent
- Required packages are handled by the setup script.
## 🚀 Quick Start
1. Clone the Repository:
```
git clone https://github.com/Jatin-Mehra119/CRAWLGPT.git
cd CRAWLGPT
```
2. Run the Setup Script:
```
python -m setup_env
```
_This script installs dependencies, creates a virtual environment, and prepares the project._
3. Update Your Environment Variables:
- Create or modify the `.env` file.
- Add your Groq API key and Ollama API key (see each provider's documentation for how to obtain one).
```
GROQ_API_KEY=your_groq_api_key_here
OLLAMA_API_TOKEN=your_ollama_api_key_here
```
4. Activate the Virtual Environment:
```
source .venv/bin/activate # On Unix/macOS
.venv\Scripts\activate # On Windows
```
5. Run the Application:
```
python -m streamlit run src/crawlgpt/ui/chat_app.py
```
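At startup the app reads these keys from the environment; `python-dotenv` populates `os.environ` from the `.env` file. The sketch below shows the pattern with a minimal hand-rolled parser for illustration only (it is not the library's implementation):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env parser: KEY=value lines; blank lines and '#' comments ignored."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                # setdefault: values already exported in the shell take priority
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # fall back to whatever is already in the environment

load_env_file()
if not os.environ.get("GROQ_API_KEY"):
    print("GROQ_API_KEY is not set -- add it to your .env file")
```

In the project itself this is simply `from dotenv import load_dotenv; load_dotenv()`.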
## 📦 Dependencies
### Core Dependencies
- `streamlit==1.41.1`
- `groq==0.15.0`
- `sentence-transformers==3.3.1`
- `faiss-cpu==1.9.0.post1`
- `crawl4ai==0.4.247`
- `python-dotenv==1.0.1`
- `pydantic==2.10.5`
- `aiohttp==3.11.11`
- `beautifulsoup4==4.12.3`
- `numpy==2.2.0`
- `tqdm==4.67.1`
- `playwright>=1.41.0`
- `asyncio>=3.4.3`
### Development Dependencies
- `pytest==8.3.4`
- `pytest-mockito==0.0.4`
- `black==24.2.0`
- `isort==5.13.0`
- `flake8==7.0.0`
## 🏗️ Project Structure
```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                      # Core functionality
│       │   ├── database.py            # SQL database handling
│       │   ├── LLMBasedCrawler.py     # Main crawler implementation
│       │   ├── DatabaseHandler.py     # Vector database (FAISS)
│       │   └── SummaryGenerator.py    # Text summarization
│       ├── ui/                        # User Interface
│       │   ├── chat_app.py            # Main Streamlit app
│       │   ├── chat_ui.py             # Development UI
│       │   └── login.py               # Authentication UI
│       └── utils/                     # Utilities
│           ├── content_validator.py   # URL/content validation
│           ├── data_manager.py        # Import/export handling
│           ├── helper_functions.py    # General helpers
│           ├── monitoring.py          # Metrics collection
│           └── progress.py            # Progress tracking
├── tests/                             # Test suite
│   └── test_core/
│       ├── test_database_handler.py   # Vector DB tests
│       ├── test_integration.py        # Integration tests
│       ├── test_llm_based_crawler.py  # Crawler tests
│       └── test_summary_generator.py  # Summarizer tests
├── .github/                           # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml            # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                     # Documentation
├── .dockerignore                      # Docker exclusions
├── .gitignore                         # Git exclusions
├── Dockerfile                         # Container config
├── LICENSE                            # MIT License
├── README.md                          # Project documentation
├── README_hf.md                       # HuggingFace README
├── pyproject.toml                     # Project metadata
├── pytest.ini                         # Test configuration
└── setup_env.py                       # Environment setup
```
## 🧪 Testing
Run all tests:
```
python -m pytest
```
_The tests include unit tests for core functionality and integration tests for end-to-end workflows._
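The suite follows plain `pytest` conventions: `test_*` functions with bare `assert` statements, collected automatically. The sketch below shows the style against a hypothetical chunking helper; both the helper and the test names are illustrative, not taken from the actual test files:

```python
# Hypothetical helper mirroring the kind of text chunking the core module performs.
def chunk_text(text, size=20):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def test_chunk_text_covers_input():
    text = "a" * 45
    chunks = chunk_text(text, size=20)
    assert len(chunks) == 3            # 20 + 20 + 5
    assert "".join(chunks) == text     # no characters lost

def test_chunk_text_empty_input():
    assert chunk_text("") == []

# pytest collects test_* functions automatically; call them directly here as a demo.
test_chunk_text_covers_input()
test_chunk_text_empty_input()
```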
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🔗 Links
- [Bug Tracker](https://github.com/Jatin-Mehra119/crawlgpt/issues)
- [Documentation](https://github.com/Jatin-Mehra119/crawlgpt/wiki)
- [Source Code](https://github.com/Jatin-Mehra119/crawlgpt)
## 🧡 Acknowledgments
- Inspired by the potential of GPT models for intelligent content processing.
- Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools.
## 👨‍💻 Author
- Jatin Mehra ([email protected])
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal.
1. Fork the Project.
2. Create your Feature Branch:
```
git checkout -b feature/AmazingFeature
```
3. Commit your Changes:
```
git commit -m 'Add some AmazingFeature'
```
4. Push to the Branch:
```
git push origin feature/AmazingFeature
```
5. Open a Pull Request.