# CrawlGPT Documentation
## Overview
CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content.
## Project Structure
```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                          # Core functionality
│       │   ├── database.py                # SQL database handling
│       │   ├── LLMBasedCrawler.py         # Main crawler implementation
│       │   ├── DatabaseHandler.py         # Vector database (FAISS)
│       │   └── SummaryGenerator.py        # Text summarization
│       ├── ui/                            # User Interface
│       │   ├── chat_app.py                # Main Streamlit app
│       │   ├── chat_ui.py                 # Development UI
│       │   └── login.py                   # Authentication UI
│       └── utils/                         # Utilities
│           ├── content_validator.py       # URL/content validation
│           ├── data_manager.py            # Import/export handling
│           ├── helper_functions.py        # General helpers
│           ├── monitoring.py              # Metrics collection
│           └── progress.py                # Progress tracking
├── tests/                                 # Test suite
│   └── test_core/
│       ├── test_database_handler.py       # Vector DB tests
│       ├── test_integration.py            # Integration tests
│       ├── test_llm_based_crawler.py      # Crawler tests
│       └── test_summary_generator.py      # Summarizer tests
├── .github/                               # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml                # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                         # Documentation
├── .dockerignore                          # Docker exclusions
├── .gitignore                             # Git exclusions
├── Dockerfile                             # Container config
├── LICENSE                                # MIT License
├── README.md                              # Project documentation
├── README_hf.md                           # HuggingFace README
├── pyproject.toml                         # Project metadata
├── pytest.ini                             # Test configuration
└── setup_env.py                           # Environment setup
```
## Core Components
### [LLMBasedCrawler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/LLMBasedCrawler.py) (src/crawlgpt/core/LLMBasedCrawler.py)
- Main crawler class handling web content extraction and processing
- Integrates with Groq API for language model operations
- Manages content chunking, summarization and response generation
- Includes rate limiting and metrics collection
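The content chunking step listed above can be illustrated with a minimal sketch. The function name `chunk_text` and the `chunk_size` and `overlap` parameters are hypothetical and not taken from `LLMBasedCrawler.py`; the actual crawler may split on different boundaries or use different sizes.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks that share `overlap` words (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping chunks are a common choice because a sentence that straddles a boundary then appears whole in at least one chunk, which tends to help both summarization and retrieval.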
### [DatabaseHandler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/DatabaseHandler.py) (src/crawlgpt/core/DatabaseHandler.py)
- Vector database implementation using FAISS
- Stores and retrieves text embeddings for efficient similarity search
- Handles data persistence and state management
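To make the idea of similarity search over embeddings concrete, here is a brute-force sketch in plain Python. It is not FAISS and does not reflect the `DatabaseHandler` API; the names `cosine_similarity` and `search` are hypothetical, and the tiny two-dimensional vectors stand in for real sentence embeddings. FAISS performs this kind of nearest-neighbour lookup far more efficiently.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: list[float], vectors: list[list[float]],
           texts: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), text) for vec, text in zip(vectors, texts)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return scored[:k]
```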
### [SummaryGenerator](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/SummaryGenerator.py) (src/crawlgpt/core/SummaryGenerator.py)
- Generates concise summaries of text chunks using Groq API
- Configurable model selection and parameters
- Handles empty input validation
### [Database](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/database.py) (src/crawlgpt/core/database.py)
- SQLAlchemy-based database handling for user management and chat history
- Provides secure user authentication with BCrypt password hashing
- Manages persistent storage of chat conversations and context
- Configuration
  - Uses SQLite by default (`sqlite:///crawlgpt.db`)
  - Configurable via the `DATABASE_URL` environment variable
  - Automatic schema creation on startup
  - Session management with SQLAlchemy `sessionmaker`
- Security Features
  - BCrypt password hashing with PassLib
  - Unique username enforcement
  - Secure session handling
  - Role-based message tracking
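The configuration behaviour described above (SQLite by default, overridable through `DATABASE_URL`) could look roughly like the following. The helper name `resolve_database_url` is hypothetical and is not taken from `database.py`; only the default URL and the environment variable name come from this document.

```python
import os

# Default taken from this document; everything else here is an illustrative assumption.
DEFAULT_DATABASE_URL = "sqlite:///crawlgpt.db"


def resolve_database_url(env=None) -> str:
    """Return DATABASE_URL if it is set and non-blank, otherwise the SQLite default."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL", "").strip()
    return url or DEFAULT_DATABASE_URL
```

The resulting string is the kind of value one would pass to SQLAlchemy's `create_engine`. Accepting an optional mapping instead of reading `os.environ` directly keeps the function easy to test.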
## UI Components
### [chat_app.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_app.py) (src/crawlgpt/ui/chat_app.py)
- Main Streamlit application interface
- URL processing and content extraction
- Chat interface with message history
- System metrics and debug information
- Import/export functionality
### [chat_ui.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_ui.py) (src/crawlgpt/ui/chat_ui.py)
- Development/testing UI with additional debug features
- Extended metrics visualization
- Raw data inspection capabilities
## Utilities
### [content_validator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/content_validator.py) (src/crawlgpt/utils/content_validator.py)
- URL and content validation
- MIME type checking
- Size limit enforcement
- Security checks for malicious content
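The checks listed above can be sketched with the standard library alone. The allowed schemes, MIME types, and size limit below are assumptions chosen for illustration, and the function names are hypothetical; `content_validator.py` may apply different rules.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}               # assumed policy
ALLOWED_MIME_TYPES = {"text/html", "text/plain"}  # assumed policy
MAX_CONTENT_BYTES = 10 * 1024 * 1024              # assumed 10 MiB limit


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    try:
        parsed = urlparse(url)
    except ValueError:  # raised for some malformed inputs, e.g. a broken IPv6 host
        return False
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


def is_acceptable_content(content_type: str, size_bytes: int) -> bool:
    """Check a Content-Type header value and a body size against the limits."""
    mime = content_type.split(";", 1)[0].strip().lower()  # drop "; charset=..."
    return mime in ALLOWED_MIME_TYPES and 0 < size_bytes <= MAX_CONTENT_BYTES
```

Restricting schemes to `http` and `https` rejects inputs such as `javascript:` or `file:` URLs before any request is made.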
### [data_manager.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/data_manager.py) (src/crawlgpt/utils/data_manager.py)
- Data import/export operations
- File serialization (JSON/pickle)
- Timestamped backups
- State management
### [monitoring.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/monitoring.py) (src/crawlgpt/utils/monitoring.py)
- Request metrics collection
- Rate limiting implementation
- Performance monitoring
- Usage statistics
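Rate limiting can be implemented in several ways; the sketch below uses a sliding window and is only one possibility. The class name `RateLimiter` and its parameters are hypothetical and may not match what `monitoring.py` provides.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls within any `period`-second window."""

    def __init__(self, max_calls: int, period: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock    # injectable so the limiter can be tested without sleeping
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        """Record and permit the call if the window has room, otherwise refuse it."""
        now = self._clock()
        # Discard timestamps that have fallen out of the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False
```

A caller would check `limiter.allow()` before each outbound request and wait or skip when it returns `False`.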
### [progress.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/progress.py) (src/crawlgpt/utils/progress.py)
- Operation progress tracking
- Status updates
- Step counting
- Time tracking
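As a rough illustration of step counting, status updates, and time tracking combined, consider the following sketch. The class `ProgressTracker` and its attributes are hypothetical and are not the interface of `progress.py`.

```python
import time


class ProgressTracker:
    """Track completed steps, the latest status message, and elapsed time."""

    def __init__(self, total_steps: int, clock=time.monotonic):
        if total_steps <= 0:
            raise ValueError("total_steps must be positive")
        self.total_steps = total_steps
        self.current_step = 0
        self.status = "pending"
        self._clock = clock
        self._started = clock()  # reference point for elapsed time

    def advance(self, status: str) -> None:
        """Mark one more step as done (capped at total_steps) and update the status."""
        self.current_step = min(self.current_step + 1, self.total_steps)
        self.status = status

    @property
    def fraction(self) -> float:
        """Completed share of the work, between 0.0 and 1.0."""
        return self.current_step / self.total_steps

    @property
    def elapsed(self) -> float:
        """Seconds since the tracker was created."""
        return self._clock() - self._started
```

A value such as `fraction` maps directly onto a progress bar widget, which is one way a UI could surface it.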
## Testing
### [test_database_handler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_database_handler.py) (tests/test_core/test_database_handler.py)
- Tests for vector database operations
- Integration tests for data storage/retrieval
- End-to-end flow validation
### [test_integration.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_integration.py) (tests/test_core/test_integration.py)
- Full system integration tests
- URL extraction to response generation flow
- State management validation
### [test_llm_based_crawler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_llm_based_crawler.py) (tests/test_core/test_llm_based_crawler.py)
- Crawler functionality tests
- Content extraction validation
- Response generation testing
### [test_summary_generator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_summary_generator.py) (tests/test_core/test_summary_generator.py)
- Summary generation tests
- Empty input handling
- Model output validation
## Configuration
### [pyproject.toml](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pyproject.toml)
- Project metadata
- Dependencies
- Optional dev dependencies
- Entry points
### [pytest.ini](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pytest.ini)
- Test configuration
- Path settings
- Test discovery patterns
- Reporting options
### [setup_env.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/setup_env.py)
- Environment setup script
- Virtual environment creation
- Dependency installation
- Playwright setup
## Features
1. **Web Crawling**
   - Async web content extraction
   - Playwright-based rendering
   - Content validation
   - Rate limiting
2. **Content Processing**
   - Text chunking
   - Vector embeddings
   - Summarization
   - Similarity search
3. **Chat Interface**
   - Message history
   - Context management
   - Model parameter control
   - Debug information
4. **Data Management**
   - State import/export
   - Progress tracking
   - Metrics collection
   - Error handling
5. **Testing**
   - Unit tests
   - Integration tests
   - Mock implementations
   - Async test support
## Dependencies
Core:
- streamlit
- groq
- sentence-transformers
- faiss-cpu
- crawl4ai
- pydantic
- aiohttp
- beautifulsoup4
- playwright
Development:
- pytest
- pytest-mockito
- black
- isort
- flake8
## License
MIT License