# CrawlGPT Documentation

## Overview

CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content.

## Project Structure

```
crawlgpt/
β”œβ”€β”€ src/
β”‚   └── crawlgpt/
β”‚       β”œβ”€β”€ core/                           # Core functionality
β”‚       β”‚   β”œβ”€β”€ database.py                 # SQL database handling
β”‚       β”‚   β”œβ”€β”€ LLMBasedCrawler.py         # Main crawler implementation
β”‚       β”‚   β”œβ”€β”€ DatabaseHandler.py          # Vector database (FAISS)
β”‚       β”‚   └── SummaryGenerator.py         # Text summarization
β”‚       β”œβ”€β”€ ui/                            # User Interface
β”‚       β”‚   β”œβ”€β”€ chat_app.py                # Main Streamlit app
β”‚       β”‚   β”œβ”€β”€ chat_ui.py                 # Development UI
β”‚       β”‚   └── login.py                   # Authentication UI
β”‚       └── utils/                         # Utilities
β”‚           β”œβ”€β”€ content_validator.py        # URL/content validation
β”‚           β”œβ”€β”€ data_manager.py            # Import/export handling
β”‚           β”œβ”€β”€ helper_functions.py         # General helpers
β”‚           β”œβ”€β”€ monitoring.py              # Metrics collection
β”‚           └── progress.py                # Progress tracking
β”œβ”€β”€ tests/                                # Test suite
β”‚   └── test_core/
β”‚       β”œβ”€β”€ test_database_handler.py       # Vector DB tests
β”‚       β”œβ”€β”€ test_integration.py           # Integration tests
β”‚       β”œβ”€β”€ test_llm_based_crawler.py     # Crawler tests
β”‚       └── test_summary_generator.py     # Summarizer tests
β”œβ”€β”€ .github/                             # CI/CD
β”‚   └── workflows/
β”‚       └── Push_to_hf.yaml              # HuggingFace sync
β”œβ”€β”€ Docs/
β”‚   └── MiniDoc.md                       # Documentation
β”œβ”€β”€ .dockerignore                        # Docker exclusions
β”œβ”€β”€ .gitignore                          # Git exclusions
β”œβ”€β”€ Dockerfile                          # Container config
β”œβ”€β”€ LICENSE                             # MIT License
β”œβ”€β”€ README.md                          # Project documentation
β”œβ”€β”€ README_hf.md                       # HuggingFace README
β”œβ”€β”€ pyproject.toml                     # Project metadata
β”œβ”€β”€ pytest.ini                         # Test configuration
└── setup_env.py                       # Environment setup
```

## Core Components

### [LLMBasedCrawler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/LLMBasedCrawler.py) (src/crawlgpt/core/LLMBasedCrawler.py)

-   Main crawler class handling web content extraction and processing
-   Integrates with Groq API for language model operations
-   Manages content chunking, summarization and response generation
-   Includes rate limiting and metrics collection
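
A minimal usage sketch of the flow described above follows. The class and method names (`Model`, `extract_content_from_url`, `generate_response`) are assumptions for illustration and may not match the project's actual API:

```python
# Hypothetical usage sketch -- class and method names are illustrative,
# not the project's confirmed API.
import asyncio
from crawlgpt.core.LLMBasedCrawler import Model  # assumed class name

async def main() -> None:
    crawler = Model()
    # Fetch the page, chunk it, embed it, and store it in the vector DB.
    await crawler.extract_content_from_url("https://example.com/article")
    # Answer a question against the stored context via the Groq-backed LLM.
    answer = crawler.generate_response("What is this article about?")
    print(answer)

asyncio.run(main())
```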

### [DatabaseHandler](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/DatabaseHandler.py) (src/crawlgpt/core/DatabaseHandler.py)

-   Vector database implementation using FAISS
-   Stores and retrieves text embeddings for efficient similarity search
-   Handles data persistence and state management
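
The sketch below shows the general FAISS pattern these bullets describe (embed text, add to an index, run a similarity search). The embedding model choice and index type are assumptions; the project's `DatabaseHandler` API may differ:

```python
# Illustrative FAISS-backed store; not the project's exact implementation.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # assumed embedding model
texts = ["CrawlGPT crawls pages.", "FAISS indexes embeddings."]

embeddings = model.encode(texts).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])      # exact L2 index
index.add(embeddings)                               # store text embeddings

query = model.encode(["What indexes embeddings?"]).astype("float32")
distances, ids = index.search(query, 1)             # top-1 similarity search
print(texts[ids[0][0]], distances[0][0])
```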

### [SummaryGenerator](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/SummaryGenerator.py) (src/crawlgpt/core/SummaryGenerator.py)

-   Generates concise summaries of text chunks using Groq API
-   Configurable model selection and parameters
-   Handles empty input validation
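
Conceptually, the summarizer wraps a Groq chat-completion call like the one below. The model name and prompt are example values, and the real `SummaryGenerator` adds its own configuration around this:

```python
# Sketch of chunk summarization via the Groq client (assumed defaults).
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def summarize(chunk: str, model: str = "llama-3.1-8b-instant") -> str:
    if not chunk.strip():          # empty-input validation, as noted above
        return ""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the text concisely."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content
```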

### [Database](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/core/database.py) (src/crawlgpt/core/database.py)

-   SQLAlchemy-based database handling for user management and chat history
-   Provides secure user authentication with BCrypt password hashing
-   Manages persistent storage of chat conversations and context
-   Configuration
    -   Uses SQLite by default (`sqlite:///crawlgpt.db`)
    -   Configurable via the `DATABASE_URL` environment variable
    -   Automatic schema creation on startup
    -   Session management with SQLAlchemy's `sessionmaker`
-   Security Features
    -   BCrypt password hashing with PassLib
    -   Unique username enforcement
    -   Secure session handling
    -   Role-based message tracking
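
A minimal sketch of this pattern (SQLAlchemy models plus PassLib bcrypt hashing) is shown below; the table and column names are illustrative, not the project's exact schema:

```python
# SQLAlchemy + PassLib sketch of the configuration described above.
import os
from passlib.context import CryptContext
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///crawlgpt.db")
engine = create_engine(DATABASE_URL)
Base = declarative_base()
pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    username = Column(String, unique=True, nullable=False)   # unique usernames
    password_hash = Column(String, nullable=False)

Base.metadata.create_all(engine)          # automatic schema creation on startup
Session = sessionmaker(bind=engine)

def create_user(username: str, password: str) -> None:
    with Session() as session:
        session.add(User(username=username,
                         password_hash=pwd_context.hash(password)))
        session.commit()
```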


## UI Components

### [chat_app.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_app.py) (src/crawlgpt/ui/chat_app.py)

-   Main Streamlit application interface
-   URL processing and content extraction
-   Chat interface with message history
-   System metrics and debug information
-   Import/export functionality
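
The rough shape of that Streamlit flow is sketched below; the widget labels, session-state layout, and placeholder responses are illustrative only:

```python
# Simplified Streamlit chat flow; the real chat_app.py wires in the crawler.
import streamlit as st

st.title("CrawlGPT")
if "history" not in st.session_state:
    st.session_state.history = []                 # chat message history

url = st.text_input("URL to crawl")
if st.button("Process URL") and url:
    st.info(f"Extraction and indexing of {url} would run here.")

for role, msg in st.session_state.history:
    with st.chat_message(role):
        st.write(msg)

if prompt := st.chat_input("Ask about the crawled content"):
    st.session_state.history.append(("user", prompt))
    st.session_state.history.append(("assistant", "(model response goes here)"))
    st.rerun()
```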

### [chat_ui.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/ui/chat_ui.py) (src/crawlgpt/ui/chat_ui.py)

-   Development/testing UI with additional debug features
-   Extended metrics visualization
-   Raw data inspection capabilities

## Utilities

### [content_validator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/content_validator.py) (src/crawlgpt/utils/content_validator.py)

-   URL and content validation
-   MIME type checking
-   Size limit enforcement
-   Security checks for malicious content
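
The checks listed above typically look something like the sketch below; the allowed MIME types and the size limit are example values, not the project's actual settings:

```python
# Illustrative URL/content validation using aiohttp (example limits).
from urllib.parse import urlparse

import aiohttp

ALLOWED_MIME = {"text/html", "text/plain"}
MAX_BYTES = 10 * 1024 * 1024               # 10 MB example size limit

def is_valid_url(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)

async def is_fetchable(url: str) -> bool:
    async with aiohttp.ClientSession() as session:
        async with session.head(url, allow_redirects=True) as resp:
            mime = resp.headers.get("Content-Type", "").split(";")[0].strip()
            size = int(resp.headers.get("Content-Length", 0))
            return resp.status == 200 and mime in ALLOWED_MIME and size <= MAX_BYTES

# asyncio.run(is_fetchable("https://example.com")) drives this from sync code.
```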

### [data_manager.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/data_manager.py) (src/crawlgpt/utils/data_manager.py)

-   Data import/export operations
-   File serialization (JSON/pickle)
-   Timestamped backups
-   State management
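
A minimal sketch of timestamped JSON export/import in the spirit described above; the directory layout and file naming are illustrative:

```python
# Timestamped state export/import sketch (standard library only).
import json
from datetime import datetime
from pathlib import Path

def export_state(state: dict, directory: str = "backups") -> Path:
    Path(directory).mkdir(exist_ok=True)
    path = Path(directory) / f"state_{datetime.now():%Y%m%d_%H%M%S}.json"
    path.write_text(json.dumps(state, indent=2))   # timestamped backup file
    return path

def import_state(path: str) -> dict:
    return json.loads(Path(path).read_text())
```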

### [monitoring.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/monitoring.py) (src/crawlgpt/utils/monitoring.py)

-   Request metrics collection
-   Rate limiting implementation
-   Performance monitoring
-   Usage statistics
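
One common shape for the rate limiting described above is a sliding-window limiter like the sketch below; the window size and request cap are example values, and the project's implementation may differ:

```python
# Sliding-window rate limiter sketch (example parameters).
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class RateLimiter:
    max_requests: int = 60              # requests allowed per window
    window_seconds: float = 60.0
    _calls: deque = field(default_factory=deque)

    def allow(self) -> bool:
        now = time.monotonic()
        while self._calls and now - self._calls[0] > self.window_seconds:
            self._calls.popleft()        # drop calls outside the window
        if len(self._calls) < self.max_requests:
            self._calls.append(now)
            return True
        return False
```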

### [progress.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/src/crawlgpt/utils/progress.py) (src/crawlgpt/utils/progress.py)

-   Operation progress tracking
-   Status updates
-   Step counting
-   Time tracking
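
A tracker with step counting and elapsed-time reporting, as listed above, can be as small as the sketch below; the field and method names are illustrative:

```python
# Minimal progress tracker sketch.
import time
from dataclasses import dataclass, field

@dataclass
class ProgressTracker:
    total_steps: int
    current_step: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def advance(self, message: str = "") -> None:
        self.current_step += 1
        elapsed = time.monotonic() - self.started_at
        print(f"[{self.current_step}/{self.total_steps}] {message} "
              f"({elapsed:.1f}s elapsed)")
```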

## Testing

### [test_database_handler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_database_handler.py) (tests/test_core/test_database_handler.py)

-   Tests for vector database operations
-   Integration tests for data storage/retrieval
-   End-to-end flow validation

### [test_integration.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_integration.py) (tests/test_core/test_integration.py)

-   Full system integration tests
-   URL extraction to response generation flow
-   State management validation

### [test_llm_based_crawler.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_llm_based_crawler.py) (tests/test_core/test_llm_based_crawler.py)

-   Crawler functionality tests
-   Content extraction validation
-   Response generation testing

### [test_summary_generator.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/tests/test_core/test_summary_generator.py) (tests/test_core/test_summary_generator.py)

-   Summary generation tests
-   Empty input handling
-   Model output validation
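
A test of the empty-input behavior mentioned above might look like the sketch below; the `SummaryGenerator` constructor and method name are assumptions about the project's API:

```python
# Sketch of an empty-input test (assumed method name).
from crawlgpt.core.SummaryGenerator import SummaryGenerator

def test_empty_input_returns_empty_summary():
    generator = SummaryGenerator()
    assert generator.generate_summary("") == ""
```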

## Configuration

### [pyproject.toml](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pyproject.toml)

-   Project metadata
-   Dependencies
-   Optional dev dependencies
-   Entry points

### [pytest.ini](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/pytest.ini)

-   Test configuration
-   Path settings
-   Test discovery patterns
-   Reporting options

### [setup_env.py](https://github.com/Jatin-Mehra119/CRAWLGPT/blob/main/setup_env.py)

-   Environment setup script
-   Virtual environment creation
-   Dependency installation
-   Playwright setup
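
The rough steps such a setup script performs are sketched below; the actual `setup_env.py` commands and paths may differ:

```python
# Sketch of environment setup: venv creation, dependency install, Playwright.
import subprocess
import sys
import venv

def main() -> None:
    venv.create(".venv", with_pip=True)           # create the virtual environment
    python = (".venv/bin/python" if sys.platform != "win32"
              else r".venv\Scripts\python.exe")
    # Install the project with dev extras, then the Playwright browser binaries.
    subprocess.run([python, "-m", "pip", "install", "-e", ".[dev]"], check=True)
    subprocess.run([python, "-m", "playwright", "install"], check=True)

if __name__ == "__main__":
    main()
```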

## Features

1.  **Web Crawling**
    
    -   Async web content extraction
    -   Playwright-based rendering
    -   Content validation
    -   Rate limiting
2.  **Content Processing**
    
    -   Text chunking
    -   Vector embeddings
    -   Summarization
    -   Similarity search
3.  **Chat Interface**
    
    -   Message history
    -   Context management
    -   Model parameter control
    -   Debug information
4.  **Data Management**
    
    -   State import/export
    -   Progress tracking
    -   Metrics collection
    -   Error handling
5.  **Testing**
    
    -   Unit tests
    -   Integration tests
    -   Mock implementations
    -   Async test support

## Dependencies

Core:

-   streamlit
-   groq
-   sentence-transformers
-   faiss-cpu
-   crawl4ai
-   pydantic
-   aiohttp
-   beautifulsoup4
-   playwright

Development:

-   pytest
-   pytest-mockito
-   black
-   isort
-   flake8

## License

MIT License