Jatin Mehra committed
Commit 65faf21 · 1 Parent(s): 5b02b7b

Add initial documentation for CrawlGPT, outlining project structure, core components, UI features, utilities, testing, configuration, and dependencies

Files changed (1)
  1. Docs/MiniDoc.md +214 -0
Docs/MiniDoc.md ADDED
@@ -0,0 +1,214 @@
+ # CrawlGPT Documentation
+
+ ## Overview
+
+ CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content.
+
+ ## Project Structure
+
+ ```
+ crawlgpt/
+ ├── src/
+ │   └── crawlgpt/
+ │       ├── core/
+ │       │   ├── DatabaseHandler.py
+ │       │   ├── LLMBasedCrawler.py
+ │       │   └── SummaryGenerator.py
+ │       ├── ui/
+ │       │   ├── chat_app.py
+ │       │   └── chat_ui.py
+ │       └── utils/
+ │           ├── content_validator.py
+ │           ├── data_manager.py
+ │           ├── helper_functions.py
+ │           ├── monitoring.py
+ │           └── progress.py
+ ├── tests/
+ │   └── test_core/
+ │       ├── test_database_handler.py
+ │       ├── test_integration.py
+ │       ├── test_llm_based_crawler.py
+ │       └── test_summary_generator.py
+ ├── .gitignore
+ ├── LICENSE
+ ├── README.md
+ ├── Docs
+ ├── pyproject.toml
+ ├── pytest.ini
+ └── setup_env.py
+ ```
+
+ ## Core Components
+
+ ### LLMBasedCrawler (src/crawlgpt/core/LLMBasedCrawler.py)
+
+ - Main crawler class handling web content extraction and processing
+ - Integrates with Groq API for language model operations
+ - Manages content chunking, summarization and response generation
+ - Includes rate limiting and metrics collection
+
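The content chunking mentioned above could look roughly like the following. This is a minimal sketch, not the crawler's actual code: the function name `chunk_text`, the character-based windows, and the default sizes are all assumptions made for illustration.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper: the real crawler may chunk by tokens or sentences.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlapping windows keep a sentence that straddles a boundary visible in two neighbouring chunks, which tends to help retrieval at the cost of some duplicated storage.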
+ ### DatabaseHandler (src/crawlgpt/core/DatabaseHandler.py)
+
+ - Vector database implementation using FAISS
+ - Stores and retrieves text embeddings for efficient similarity search
+ - Handles data persistence and state management
+
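To illustrate what "store embeddings, then retrieve by similarity" means, here is a small sketch that swaps FAISS for a brute-force cosine search in pure Python. The class `TinyVectorStore` and its methods are hypothetical and are not the DatabaseHandler API; the vectors are toy values rather than real sentence embeddings.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class TinyVectorStore:
    """Hypothetical stand-in for a FAISS-backed store (brute-force search)."""

    vectors: list[list[float]] = field(default_factory=list)
    texts: list[str] = field(default_factory=list)

    def add(self, vector: list[float], text: str) -> None:
        self.vectors.append(list(vector))
        self.texts.append(text)

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        """Return up to k (score, text) pairs, most similar first."""
        scored = [(_cosine(query, v), t) for v, t in zip(self.vectors, self.texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

A brute-force scan is linear in the number of stored vectors; an index library such as FAISS exists precisely to make this lookup fast at scale.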
+ ### SummaryGenerator (src/crawlgpt/core/SummaryGenerator.py)
+
+ - Generates concise summaries of text chunks using Groq API
+ - Configurable model selection and parameters
+ - Handles empty input validation
+
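The empty-input handling noted above might be sketched as below. Everything here is an assumption: the class name `SummarySketch`, the choice to raise `ValueError`, and the prompt wording are illustrative, and the language model call is replaced by an injected callable so that no Groq client API is guessed at.

```python
from typing import Callable


class SummarySketch:
    """Hypothetical summarizer wrapper; `complete` stands in for the LLM call."""

    def __init__(self, complete: Callable[[str], str], max_chars: int = 4000):
        self._complete = complete
        self._max_chars = max_chars

    def summarize(self, text: str) -> str:
        # Reject empty or whitespace-only input before spending an API call.
        if not text or not text.strip():
            raise ValueError("Input text cannot be empty")
        # Truncate long inputs so the prompt stays within a fixed budget.
        prompt = "Summarize concisely:\n\n" + text.strip()[: self._max_chars]
        return self._complete(prompt).strip()
```

Injecting the completion function also makes the class easy to unit test with a fake backend.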
+ ## UI Components
+
+ ### [chat_app.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT) (src/crawlgpt/ui/chat_app.py)
+
+ - Main Streamlit application interface
+ - URL processing and content extraction
+ - Chat interface with message history
+ - System metrics and debug information
+ - Import/export functionality
+
+ ### [chat_ui.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT) (src/crawlgpt/ui/chat_ui.py)
+
+ - Development/testing UI with additional debug features
+ - Extended metrics visualization
+ - Raw data inspection capabilities
+
+ ## Utilities
+
+ ### [content_validator.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - URL and content validation
+ - MIME type checking
+ - Size limit enforcement
+ - Security checks for malicious content
+
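The checks listed above can be sketched with the standard library alone. The function names, the allowed MIME types, and the 5 MiB limit are assumptions for illustration, not values taken from content_validator.py.

```python
from urllib.parse import urlparse

# Assumed policy values; the real module may use different ones.
ALLOWED_MIME_TYPES = {"text/html", "text/plain", "application/xhtml+xml"}
MAX_BYTES = 5 * 1024 * 1024


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def is_acceptable_content(content_type: str, size: int) -> bool:
    """Check a Content-Type header value and a body size against the policy."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ALLOWED_MIME_TYPES and 0 < size <= MAX_BYTES
```

Restricting the scheme to `http` and `https` is a cheap first guard against inputs such as `javascript:` or `file:` URLs.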
+ ### [data_manager.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Data import/export operations
+ - File serialization (JSON/pickle)
+ - Timestamped backups
+ - State management
+
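A timestamped export and import round trip might look like this. The sketch covers only the JSON case; the function names `export_state` and `import_state`, the file prefix, and the timestamp format are assumptions rather than the data_manager.py interface.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def export_state(state: dict, directory, prefix: str = "crawlgpt_state") -> Path:
    """Write `state` to a new timestamped JSON file and return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp in the file name keeps successive backups distinct and sortable.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"{prefix}_{stamp}.json"
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def import_state(path) -> dict:
    """Load a state dictionary previously written by export_state."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

JSON is readable and safe to load from untrusted files; pickle can serialize more object types but should only be loaded from trusted sources.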
+ ### [monitoring.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Request metrics collection
+ - Rate limiting implementation
+ - Performance monitoring
+ - Usage statistics
+
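One common way to implement rate limiting is a sliding window over recent call times, sketched below. The document does not say which algorithm monitoring.py uses, so the class name, its parameters, and the algorithm itself are assumptions.

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most `max_calls` within any `period` seconds (hypothetical sketch)."""

    def __init__(self, max_calls: int, period: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock  # injectable so tests can control time
        self._calls = deque()

    def allow(self) -> bool:
        """Record a call and return True if it fits in the current window."""
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False
```

A caller would typically check `limiter.allow()` before each outgoing request and back off or queue the request when it returns `False`.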
+ ### [progress.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Operation progress tracking
+ - Status updates
+ - Step counting
+ - Time tracking
+
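Step counting, status updates, and time tracking could be combined in a small data class like the one below. The name `ProgressTracker` and its fields are hypothetical and are not taken from progress.py.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ProgressTracker:
    """Hypothetical tracker for a multi-step operation."""

    total_steps: int
    current_step: int = 0
    status: str = "pending"
    started_at: float = field(default_factory=time.monotonic)

    def advance(self, status: str) -> None:
        """Move to the next step (never past the total) and record a status message."""
        self.current_step = min(self.current_step + 1, self.total_steps)
        self.status = status

    @property
    def fraction(self) -> float:
        """Completed share in [0, 1]; an operation with no steps counts as done."""
        if self.total_steps <= 0:
            return 1.0
        return self.current_step / self.total_steps

    @property
    def elapsed(self) -> float:
        """Seconds since the tracker was created."""
        return time.monotonic() - self.started_at
```

A UI layer could poll `fraction` and `status` to drive a progress bar while a URL is being crawled and indexed.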
+ ## Testing
+
+ ### [test_database_handler.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Tests for vector database operations
+ - Integration tests for data storage/retrieval
+ - End-to-end flow validation
+
+ ### [test_integration.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Full system integration tests
+ - URL extraction to response generation flow
+ - State management validation
+
+ ### [test_llm_based_crawler.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Crawler functionality tests
+ - Content extraction validation
+ - Response generation testing
+
+ ### [test_summary_generator.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Summary generation tests
+ - Empty input handling
+ - Model output validation
+
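The test modules above exercise an asynchronous extraction-to-response flow. The following sketch shows one way such a flow can be tested without network access, using `unittest.mock.AsyncMock` from the standard library. The helper `answer_from_url` and the `extract` and `answer` method names are hypothetical and do not come from the project's test suite.

```python
import asyncio
from unittest.mock import AsyncMock


async def answer_from_url(crawler, url: str, question: str) -> str:
    """Hypothetical flow: extract page content, then answer a question about it."""
    content = await crawler.extract(url)
    return await crawler.answer(content, question)


def test_answer_from_url_uses_extracted_content():
    # AsyncMock makes every attribute an awaitable mock, so no real crawl happens.
    crawler = AsyncMock()
    crawler.extract.return_value = "page text"
    crawler.answer.return_value = "an answer"

    result = asyncio.run(
        answer_from_url(crawler, "https://example.com", "What is this?")
    )

    assert result == "an answer"
    crawler.extract.assert_awaited_once_with("https://example.com")
    crawler.answer.assert_awaited_once_with("page text", "What is this?")
```

The test is written as a plain synchronous function that drives the coroutine with `asyncio.run`, so it needs no async test plugin.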
+ ## Configuration
+
+ ### [pyproject.toml](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Project metadata
+ - Dependencies
+ - Optional dev dependencies
+ - Entry points
+
+ ### [pytest.ini](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Test configuration
+ - Path settings
+ - Test discovery patterns
+ - Reporting options
+
+ ### [setup_env.py](https://orange-memory-g4xp5wqvqvr4hrvx.github.dev/?folder=%2Fworkspaces%2FCRAWLGPT)
+
+ - Environment setup script
+ - Virtual environment creation
+ - Dependency installation
+ - Playwright setup
+
+ ## Features
+
+ 1. **Web Crawling**
+
+    - Async web content extraction
+    - Playwright-based rendering
+    - Content validation
+    - Rate limiting
+ 2. **Content Processing**
+
+    - Text chunking
+    - Vector embeddings
+    - Summarization
+    - Similarity search
+ 3. **Chat Interface**
+
+    - Message history
+    - Context management
+    - Model parameter control
+    - Debug information
+ 4. **Data Management**
+
+    - State import/export
+    - Progress tracking
+    - Metrics collection
+    - Error handling
+ 5. **Testing**
+
+    - Unit tests
+    - Integration tests
+    - Mock implementations
+    - Async test support
+
+ ## Dependencies
+
+ Core:
+
+ - streamlit
+ - groq
+ - sentence-transformers
+ - faiss-cpu
+ - crawl4ai
+ - pydantic
+ - aiohttp
+ - beautifulsoup4
+ - playwright
+
+ Development:
+
+ - pytest
+ - pytest-mockito
+ - black
+ - isort
+ - flake8
+
+ ## License
+
+ MIT License