Spaces:
Running
Running
# CrawlGPT Documentation | |
## Overview | |
CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content. | |
## Project Structure | |
``` | |
crawlgpt/ | |
βββ src/ | |
β βββ crawlgpt/ | |
β βββ core/ # Core functionality | |
β β βββ database.py # SQL database handling | |
β β βββ LLMBasedCrawler.py # Main crawler implementation | |
β β βββ DatabaseHandler.py # Vector database (FAISS) | |
β β βββ SummaryGenerator.py # Text summarization | |
β βββ ui/ # User Interface | |
β β βββ chat_app.py # Main Streamlit app | |
β β βββ chat_ui.py # Development UI | |
β β βββ login.py # Authentication UI | |
β βββ utils/ # Utilities | |
β βββ content_validator.py # URL/content validation | |
β βββ data_manager.py # Import/export handling | |
β βββ helper_functions.py # General helpers | |
β βββ monitoring.py # Metrics collection | |
β βββ progress.py # Progress tracking | |
βββ tests/ # Test suite | |
β βββ test_core/ | |
β βββ test_database_handler.py # Vector DB tests | |
β βββ test_integration.py # Integration tests | |
β βββ test_llm_based_crawler.py # Crawler tests | |
β βββ test_summary_generator.py # Summarizer tests | |
βββ .github/ # CI/CD | |
β βββ workflows/ | |
β βββ Push_to_hf.yaml # HuggingFace sync | |
βββ Docs/ | |
β βββ MiniDoc.md # Documentation | |
βββ .dockerignore # Docker exclusions | |
βββ .gitignore # Git exclusions | |
βββ Dockerfile # Container config | |
βββ LICENSE # MIT License | |
βββ README.md # Project documentation | |
βββ README_hf.md # HuggingFace README | |
βββ pyproject.toml # Project metadata | |
βββ pytest.ini # Test configuration | |
βββ setup_env.py # Environment setup | |
``` | |
## Core Components | |
### [LLMBasedCrawler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/LLMBasedCrawler.py) (src/crawlgpt/core/LLMBasedCrawler.py) | |
- Main crawler class handling web content extraction and processing | |
- Integrates with Groq API for language model operations | |
- Manages content chunking, summarization and response generation | |
- Includes rate limiting and metrics collection | |
### [DatabaseHandler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/DatabaseHandler.py) (src/crawlgpt/core/DatabaseHandler.py) | |
- Vector database implementation using FAISS | |
- Stores and retrieves text embeddings for efficient similarity search | |
- Handles data persistence and state management | |
### [SummaryGenerator](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/SummaryGenerator.py) (src/crawlgpt/core/SummaryGenerator.py) | |
- Generates concise summaries of text chunks using Groq API | |
- Configurable model selection and parameters | |
- Handles empty input validation | |
### [Database](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/database.py) (src/crawl/core/database.py) | |
- SQLAlchemy-based database handling for user management and chat history | |
- Provides secure user authentication with BCrypt password hashing | |
- Manages persistent storage of chat conversations and context | |
- Configuration | |
- Uses SQLite by default (`sqlite:///crawlgpt.db`) | |
- Configurable via DATABASE_URL environment variable | |
- Automatic schema creation on startup | |
- Session management with SQLAlchemy sessionmaker | |
- Security Features | |
- BCrypt password hashing with PassLib | |
- Unique username enforcement | |
- Secure session handling | |
- Role-based message tracking | |
## UI Components | |
### [chat_app.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_app.py) (src/crawlgpt/ui/chat_app.py) | |
- Main Streamlit application interface | |
- URL processing and content extraction | |
- Chat interface with message history | |
- System metrics and debug information | |
- Import/export functionality | |
### [chat_ui.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_ui.py) (src/crawlgpt/ui/chat_ui.py) | |
- Development/testing UI with additional debug features | |
- Extended metrics visualization | |
- Raw data inspection capabilities | |
## Utilities | |
### [content_validator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/content_validator.py) (src/crawlgpt/utils/content_validator.py) | |
- URL and content validation | |
- MIME type checking | |
- Size limit enforcement | |
- Security checks for malicious content | |
### [data_manager.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/data_manager.py) (src/crawlgpt/utils/data_manager.py) | |
- Data import/export operations | |
- File serialization (JSON/pickle) | |
- Timestamped backups | |
- State management | |
### [monitoring.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/monitoring.py) (src/crawlgpt/utils/monitoring.py) | |
- Request metrics collection | |
- Rate limiting implementation | |
- Performance monitoring | |
- Usage statistics | |
### [progress.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/progress.py) (src/crawlgpt/utils/progress.py) | |
- Operation progress tracking | |
- Status updates | |
- Step counting | |
- Time tracking | |
## Testing | |
### [test_database_handler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_database_handler.py) (tests/test_core/test_database_handler.py) | |
- Tests for vector database operations | |
- Integration tests for data storage/retrieval | |
- End-to-end flow validation | |
### [test_integration.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_integration.py) (tests/test_core/test_integration.py) | |
- Full system integration tests | |
- URL extraction to response generation flow | |
- State management validation | |
### [test_llm_based_crawler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_llm_based_crawler.py) (tests/test_core/test_llm_based_crawler.py) | |
- Crawler functionality tests | |
- Content extraction validation | |
- Response generation testing | |
### [test_summary_generator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_summary_generator.py) (tests/test_core/test_summary_generator.py) | |
- Summary generation tests | |
- Empty input handling | |
- Model output validation | |
## Configuration | |
### [pyproject.toml](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pyproject.toml) | |
- Project metadata | |
- Dependencies | |
- Optional dev dependencies | |
- Entry points | |
### [pytest.ini](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pytest.ini) | |
- Test configuration | |
- Path settings | |
- Test discovery patterns | |
- Reporting options | |
### [setup_env.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/setup_env.py) | |
- Environment setup script | |
- Virtual environment creation | |
- Dependency installation | |
- Playwright setup | |
## Features | |
1. **Web Crawling** | |
- Async web content extraction | |
- Playwright-based rendering | |
- Content validation | |
- Rate limiting | |
2. **Content Processing** | |
- Text chunking | |
- Vector embeddings | |
- Summarization | |
- Similarity search | |
3. **Chat Interface** | |
- Message history | |
- Context management | |
- Model parameter control | |
- Debug information | |
4. **Data Management** | |
- State import/export | |
- Progress tracking | |
- Metrics collection | |
- Error handling | |
5. **Testing** | |
- Unit tests | |
- Integration tests | |
- Mock implementations | |
- Async test support | |
## Dependencies | |
Core: | |
- streamlit | |
- groq | |
- sentence-transformers | |
- faiss-cpu | |
- crawl4ai | |
- pydantic | |
- aiohttp | |
- beautifulsoup4 | |
- playwright | |
Development: | |
- pytest | |
- pytest-mockito | |
- black | |
- isort | |
- flake8 | |
## License | |
MIT License | |