---
license: mit
title: CRAWLGPT
sdk: docker
emoji: πŸ’»
colorFrom: pink
colorTo: blue
pinned: true
short_description: A powerful web content crawler with LLM-powered RAG.
---
# CrawlGPT πŸ€–
A powerful web content crawler with LLM-powered RAG (Retrieval Augmented Generation) capabilities. CrawlGPT extracts content from URLs, processes it through intelligent summarization, and enables natural language interactions using modern LLM technology.
## 🌟 Key Features
### Core Features
- **Intelligent Web Crawling**
  - Async web content extraction using Playwright (sketched after this list)
  - Smart rate limiting and validation
  - Configurable crawling strategies
- **Advanced Content Processing**
  - Automatic text chunking and summarization
  - Vector embeddings via FAISS
  - Context-aware response generation
- **Streamlit Chat Interface**
  - Clean, responsive UI
  - Real-time content processing
  - Conversation history
  - User authentication
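The async extraction noted above can be pictured with a short, self-contained Playwright example. This is only an illustration of the technique, not the project's crawler code (which lives in `LLMBasedCrawler.py`):
```
# Illustrative async page fetch with Playwright; not the project's crawler code.
import asyncio
from playwright.async_api import async_playwright

async def fetch_text(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")  # wait for the page to settle
        text = await page.inner_text("body")            # visible text only
        await browser.close()
        return text

if __name__ == "__main__":
    print(asyncio.run(fetch_text("https://example.com"))[:300])
```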
### Technical Features
- **Vector Database**
  - FAISS-powered similarity search (see the sketch after this list)
  - Efficient content retrieval
  - Persistent storage
- **User Management**
  - SQLite database backend
  - Secure password hashing
  - Chat history tracking
- **Monitoring & Utils**
  - Request metrics collection
  - Progress tracking
  - Data import/export
  - Content validation
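As an illustration of the FAISS-backed retrieval described above, the sketch below embeds a few text chunks and finds the closest match for a query. The model choice and variable names are assumptions for the example, not the project's internals:
```
# Illustrative FAISS similarity search over sentence embeddings; not project code.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any small embedding model
chunks = [
    "CrawlGPT extracts content from URLs.",
    "FAISS retrieves the chunks most similar to a query.",
]
vectors = np.asarray(model.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over the embeddings
index.add(vectors)

query = np.asarray(model.encode(["How is content retrieved?"]), dtype="float32")
_, ids = index.search(query, k=1)
print(chunks[ids[0][0]])  # prints the chunk about FAISS retrieval
```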
## πŸŽ₯ Demo
### [Deployed App πŸš€πŸ€–](https://huggingface.co/spaces/jatinmehra/CRAWL-GPT-CHAT)
[streamlit-chat_app video.webm](https://github.com/user-attachments/assets/ae1ddca0-9e3e-4b00-bf21-e73bb8e6cfdf)
_Example of CRAWLGPT in action!_
## πŸ”§ Requirements
- Python >= 3.8
- Operating System: OS Independent
- Required packages are handled by the setup script.
## πŸš€ Quick Start
1. Clone the Repository:
```
git clone https://github.com/Jatin-Mehra119/CRAWLGPT.git
cd CRAWLGPT
```
2. Run the Setup Script:
```
python -m setup_env
```
_This script creates a virtual environment, installs dependencies, and prepares the project._
3. Update Your Environment Variables:
- Create or modify the `.env` file.
- Add your Groq API key and Ollama API token (see each provider's documentation for how to obtain one); a sketch of how these variables are read at runtime follows the Quick Start steps.
```
GROQ_API_KEY=your_groq_api_key_here
OLLAMA_API_TOKEN=your_ollama_api_key_here
```
4. Activate the Virtual Environment:
```
source .venv/bin/activate # On Unix/macOS
.venv\Scripts\activate # On Windows
```
5. Run the Application:
```
python -m streamlit run src/crawlgpt/ui/chat_app.py
```
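For reference, the variables from step 3 are typically read at runtime with `python-dotenv`, one of the project's listed dependencies. This is a minimal sketch; the variable names match the `.env` example above:
```
# Minimal sketch: reading the .env values with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

groq_key = os.getenv("GROQ_API_KEY")
ollama_token = os.getenv("OLLAMA_API_TOKEN")
if not groq_key:
    raise RuntimeError("GROQ_API_KEY is missing; check your .env file")
```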
## πŸ“¦ Dependencies
### Core Dependencies
- `streamlit==1.41.1`
- `groq==0.15.0`
- `sentence-transformers==3.3.1`
- `faiss-cpu==1.9.0.post1`
- `crawl4ai==0.4.247`
- `python-dotenv==1.0.1`
- `pydantic==2.10.5`
- `aiohttp==3.11.11`
- `beautifulsoup4==4.12.3`
- `numpy==2.2.0`
- `tqdm==4.67.1`
- `playwright>=1.41.0`
- `asyncio>=3.4.3`
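Note that Playwright needs browser binaries in addition to the Python package. If the setup script does not install them (worth checking on a fresh machine), one extra command fetches Chromium:
```
python -m playwright install chromium
```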
### Development Dependencies
- `pytest==8.3.4`
- `pytest-mockito==0.0.4`
- `black==24.2.0`
- `isort==5.13.0`
- `flake8==7.0.0`
## πŸ—οΈ Project Structure
```
crawlgpt/
β”œβ”€β”€ src/
β”‚   └── crawlgpt/
β”‚       β”œβ”€β”€ core/                      # Core functionality
β”‚       β”‚   β”œβ”€β”€ database.py            # SQL database handling
β”‚       β”‚   β”œβ”€β”€ LLMBasedCrawler.py     # Main crawler implementation
β”‚       β”‚   β”œβ”€β”€ DatabaseHandler.py     # Vector database (FAISS)
β”‚       β”‚   └── SummaryGenerator.py    # Text summarization
β”‚       β”œβ”€β”€ ui/                        # User interface
β”‚       β”‚   β”œβ”€β”€ chat_app.py            # Main Streamlit app
β”‚       β”‚   β”œβ”€β”€ chat_ui.py             # Development UI
β”‚       β”‚   └── login.py               # Authentication UI
β”‚       └── utils/                     # Utilities
β”‚           β”œβ”€β”€ content_validator.py   # URL/content validation
β”‚           β”œβ”€β”€ data_manager.py        # Import/export handling
β”‚           β”œβ”€β”€ helper_functions.py    # General helpers
β”‚           β”œβ”€β”€ monitoring.py          # Metrics collection
β”‚           └── progress.py            # Progress tracking
β”œβ”€β”€ tests/                             # Test suite
β”‚   └── test_core/
β”‚       β”œβ”€β”€ test_database_handler.py   # Vector DB tests
β”‚       β”œβ”€β”€ test_integration.py        # Integration tests
β”‚       β”œβ”€β”€ test_llm_based_crawler.py  # Crawler tests
β”‚       └── test_summary_generator.py  # Summarizer tests
β”œβ”€β”€ .github/                           # CI/CD
β”‚   └── workflows/
β”‚       └── Push_to_hf.yaml            # Hugging Face sync
β”œβ”€β”€ Docs/
β”‚   └── MiniDoc.md                     # Documentation
β”œβ”€β”€ .dockerignore                      # Docker exclusions
β”œβ”€β”€ .gitignore                         # Git exclusions
β”œβ”€β”€ Dockerfile                         # Container config
β”œβ”€β”€ LICENSE                            # MIT License
β”œβ”€β”€ README.md                          # Project documentation
β”œβ”€β”€ README_hf.md                       # Hugging Face README
β”œβ”€β”€ pyproject.toml                     # Project metadata
β”œβ”€β”€ pytest.ini                         # Test configuration
└── setup_env.py                       # Environment setup
```
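To picture how these pieces fit together, here is a hedged sketch of a retrieval-augmented completion using the `groq` client. The real orchestration lives in `src/crawlgpt/core/LLMBasedCrawler.py`; the model name and prompt shape below are assumptions for illustration:
```
# Illustrative RAG-style call with the groq client; not the project's actual code.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

retrieved = "CrawlGPT extracts content from URLs and indexes it with FAISS."  # stand-in for FAISS results
question = "What does CrawlGPT do with a URL?"

response = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # assumption: any Groq-hosted chat model
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{retrieved}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```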
## πŸ§ͺ Testing
Run all tests:
```
python -m pytest
```
_The tests include unit tests for core functionality and integration tests for end-to-end workflows._
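While developing, pytest also accepts a single file from the project tree, for example:
```
python -m pytest tests/test_core/test_summary_generator.py -v
```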
## πŸ“ License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## πŸ”— Links
- [Bug Tracker](https://github.com/Jatin-Mehra119/crawlgpt/issues)
- [Documentation](https://github.com/Jatin-Mehra119/crawlgpt/wiki)
- [Source Code](https://github.com/Jatin-Mehra119/crawlgpt)
## 🧑 Acknowledgments
- Inspired by the potential of GPT models for intelligent content processing.
- Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools.
## πŸ‘¨β€πŸ’» Author
- Jatin Mehra ([email protected])
## 🀝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal.
1. Fork the Project.
2. Create your Feature Branch:
```
git checkout -b feature/AmazingFeature
```
3. Commit your Changes:
```
git commit -m 'Add some AmazingFeature'
```
4. Push to the Branch:
```
git push origin feature/AmazingFeature
```
5. Open a Pull Request.