# CrawlGPT Documentation
## Overview
CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content.
## Project Structure
```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                       # Core functionality
│       │   ├── database.py             # SQL database handling
│       │   ├── LLMBasedCrawler.py      # Main crawler implementation
│       │   ├── DatabaseHandler.py      # Vector database (FAISS)
│       │   └── SummaryGenerator.py     # Text summarization
│       ├── ui/                         # User Interface
│       │   ├── chat_app.py             # Main Streamlit app
│       │   ├── chat_ui.py              # Development UI
│       │   └── login.py                # Authentication UI
│       └── utils/                      # Utilities
│           ├── content_validator.py    # URL/content validation
│           ├── data_manager.py         # Import/export handling
│           ├── helper_functions.py     # General helpers
│           ├── monitoring.py           # Metrics collection
│           └── progress.py             # Progress tracking
├── tests/                              # Test suite
│   └── test_core/
│       ├── test_database_handler.py    # Vector DB tests
│       ├── test_integration.py         # Integration tests
│       ├── test_llm_based_crawler.py   # Crawler tests
│       └── test_summary_generator.py   # Summarizer tests
├── .github/                            # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml             # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                      # Documentation
├── .dockerignore                       # Docker exclusions
├── .gitignore                          # Git exclusions
├── Dockerfile                          # Container config
├── LICENSE                             # MIT License
├── README.md                           # Project documentation
├── README_hf.md                        # HuggingFace README
├── pyproject.toml                      # Project metadata
├── pytest.ini                          # Test configuration
└── setup_env.py                        # Environment setup
```
## Core Components
### [LLMBasedCrawler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/LLMBasedCrawler.py) (src/crawlgpt/core/LLMBasedCrawler.py)
- Main crawler class handling web content extraction and processing
- Integrates with Groq API for language model operations
- Manages content chunking, summarization and response generation
- Includes rate limiting and metrics collection
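The chunking step mentioned above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the function name `chunk_text`, the word-based splitting, and the default `chunk_size` and `overlap` values are all assumptions made for the example.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks (hypothetical helper).

    Consecutive chunks share `overlap` words so that context is not lost
    at chunk boundaries.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each iteration
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk could then be summarized and embedded separately. The overlap trades some duplicated text for better continuity between neighbouring chunks.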
### [DatabaseHandler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/DatabaseHandler.py) (src/crawlgpt/core/DatabaseHandler.py)
- Vector database implementation using FAISS
- Stores and retrieves text embeddings for efficient similarity search
- Handles data persistence and state management
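To illustrate what similarity search over stored embeddings means, here is a small pure-Python stand-in. It does not use FAISS and is not the project's `DatabaseHandler`; the class `TinyVectorStore` and its methods are hypothetical. It performs a brute-force cosine-similarity scan, which is the same idea an exact vector index implements far more efficiently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, list(vector)))

    def search(self, query, k=3):
        """Return the k stored texts most similar to `query`, best first."""
        scored = [(cosine(query, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(text, score) for score, text in scored[:k]]
```

In a real pipeline the vectors would come from an embedding model such as one provided by sentence-transformers; the two-dimensional vectors typically used to try this sketch are only for demonstration.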
### [SummaryGenerator](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/SummaryGenerator.py) (src/crawlgpt/core/SummaryGenerator.py)
- Generates concise summaries of text chunks using Groq API
- Configurable model selection and parameters
- Handles empty input validation
### [Database](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/database.py) (src/crawlgpt/core/database.py)
- SQLAlchemy-based database handling for user management and chat history
- Provides secure user authentication with BCrypt password hashing
- Manages persistent storage of chat conversations and context
- Configuration
  - Uses SQLite by default (`sqlite:///crawlgpt.db`)
  - Configurable via the `DATABASE_URL` environment variable
  - Automatic schema creation on startup
  - Session management with SQLAlchemy `sessionmaker`
- Security Features
  - BCrypt password hashing with PassLib
  - Unique username enforcement
  - Secure session handling
  - Role-based message tracking
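The configuration behaviour described above (SQLite by default, overridable through `DATABASE_URL`) could be resolved roughly like this. The helper name `resolve_database_url` and its `env` parameter are assumptions for illustration; only the default URL and the variable name come from this document.

```python
import os

# Default taken from the configuration notes in this document.
DEFAULT_DATABASE_URL = "sqlite:///crawlgpt.db"


def resolve_database_url(env=None) -> str:
    """Return DATABASE_URL if it is set and non-empty, else the SQLite default.

    `env` is a hypothetical injection point so the lookup can be tested
    without touching the real process environment.
    """
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL", "").strip()
    return url or DEFAULT_DATABASE_URL


# The resulting URL would typically be handed to SQLAlchemy's create_engine().
```

For example, exporting `DATABASE_URL=postgresql://user@host/dbname` before starting the app would switch the backend without code changes, assuming the matching database driver is installed.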
## UI Components
### [chat_app.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_app.py) (src/crawlgpt/ui/chat_app.py)
- Main Streamlit application interface
- URL processing and content extraction
- Chat interface with message history
- System metrics and debug information
- Import/export functionality
### [chat_ui.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_ui.py) (src/crawlgpt/ui/chat_ui.py)
- Development/testing UI with additional debug features
- Extended metrics visualization
- Raw data inspection capabilities
## Utilities
### [content_validator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/content_validator.py) (src/crawlgpt/utils/content_validator.py)
- URL and content validation
- MIME type checking
- Size limit enforcement
- Security checks for malicious content
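The checks listed above can be sketched with the standard library alone. This is not the contents of `content_validator.py`: the function names, the allowed MIME types, and the 5 MiB size limit are assumptions chosen for the example.

```python
from urllib.parse import urlparse

# Assumed policy values, for illustration only.
ALLOWED_MIME_TYPES = {"text/html", "text/plain"}
MAX_CONTENT_BYTES = 5 * 1024 * 1024


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host part."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def is_acceptable_content(content_type: str, size_bytes: int) -> bool:
    """Check a Content-Type header value and a body size against the policy."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ALLOWED_MIME_TYPES and 0 < size_bytes <= MAX_CONTENT_BYTES
```

Restricting the scheme to `http` and `https` already rejects inputs such as `javascript:` or `file:` URLs. A fuller validator would add further checks for malicious content.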
### [data_manager.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/data_manager.py) (src/crawlgpt/utils/data_manager.py)
- Data import/export operations
- File serialization (JSON/pickle)
- Timestamped backups
- State management
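A timestamped JSON export and the matching import might look like the sketch below. The function names, the file-name prefix, and the timestamp format are hypothetical. Only the JSON path is shown; pickle is omitted because unpickling untrusted files can execute arbitrary code.

```python
import json
import time
from pathlib import Path


def export_state(state: dict, directory, prefix: str = "crawlgpt_state") -> Path:
    """Write `state` to <directory>/<prefix>_<YYYYmmdd_HHMMSS>.json and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")  # local time, one-second resolution
    path = directory / f"{prefix}_{stamp}.json"
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def import_state(path) -> dict:
    """Load a state dictionary previously written by export_state()."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Because the timestamp has one-second resolution, two exports within the same second would overwrite each other. A real implementation might add a counter or a finer-grained timestamp.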
### [monitoring.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/monitoring.py) (src/crawlgpt/utils/monitoring.py)
- Request metrics collection
- Rate limiting implementation
- Performance monitoring
- Usage statistics
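One common way to implement the rate limiting listed above is a sliding-window limiter. The sketch below is an assumption about the general technique, not the code in `monitoring.py`. The class name `RateLimiter` and its parameters are hypothetical, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls within any `period`-second window."""

    def __init__(self, max_calls: int, period: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        """Record and accept the call if the window has room, else reject it."""
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False
```

A caller would check `limiter.allow()` before each outgoing API request and wait or skip when it returns `False`.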
### [progress.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/progress.py) (src/crawlgpt/utils/progress.py)
- Operation progress tracking
- Status updates
- Step counting
- Time tracking
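Step counting, status updates, and time tracking can be combined in a small data class. The following is a hypothetical sketch; `ProgressTracker` and its fields are illustrative names, not the API of `progress.py`.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ProgressTracker:
    """Track how many steps of an operation are done and how long it has run."""

    total_steps: int
    current_step: int = 0
    status: str = "pending"
    started_at: float = field(default_factory=time.monotonic)

    def advance(self, status: str) -> None:
        """Move one step forward (never past total_steps) and update the status."""
        self.current_step = min(self.current_step + 1, self.total_steps)
        self.status = status

    @property
    def fraction(self) -> float:
        """Completed share in [0, 1]; an operation with zero steps counts as done."""
        return self.current_step / self.total_steps if self.total_steps else 1.0

    @property
    def elapsed(self) -> float:
        """Seconds since the tracker was created."""
        return time.monotonic() - self.started_at
```

The `fraction` value maps directly onto a progress bar widget, which expects a number between 0 and 1 or a percentage.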
## Testing
### [test_database_handler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_database_handler.py) (tests/test_core/test_database_handler.py)
- Tests for vector database operations
- Integration tests for data storage/retrieval
- End-to-end flow validation
### [test_integration.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_integration.py) (tests/test_core/test_integration.py)
- Full system integration tests
- URL extraction to response generation flow
- State management validation
### [test_llm_based_crawler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_llm_based_crawler.py) (tests/test_core/test_llm_based_crawler.py)
- Crawler functionality tests
- Content extraction validation
- Response generation testing
### [test_summary_generator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_summary_generator.py) (tests/test_core/test_summary_generator.py)
- Summary generation tests
- Empty input handling
- Model output validation
## Configuration
### [pyproject.toml](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pyproject.toml)
- Project metadata
- Dependencies
- Optional dev dependencies
- Entry points
### [pytest.ini](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pytest.ini)
- Test configuration
- Path settings
- Test discovery patterns
- Reporting options
### [setup_env.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/setup_env.py)
- Environment setup script
- Virtual environment creation
- Dependency installation
- Playwright setup
## Features
1. **Web Crawling**
   - Async web content extraction
   - Playwright-based rendering
   - Content validation
   - Rate limiting
2. **Content Processing**
   - Text chunking
   - Vector embeddings
   - Summarization
   - Similarity search
3. **Chat Interface**
   - Message history
   - Context management
   - Model parameter control
   - Debug information
4. **Data Management**
   - State import/export
   - Progress tracking
   - Metrics collection
   - Error handling
5. **Testing**
   - Unit tests
   - Integration tests
   - Mock implementations
   - Async test support
## Dependencies
Core:
- streamlit
- groq
- sentence-transformers
- faiss-cpu
- crawl4ai
- pydantic
- aiohttp
- beautifulsoup4
- playwright
Development:
- pytest
- pytest-mockito
- black
- isort
- flake8
## License
MIT License