
CrawlGPT Documentation

Overview

CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content.

Project Structure

crawlgpt/
β”œβ”€β”€ src/
β”‚   └── crawlgpt/
β”‚       β”œβ”€β”€ core/                           # Core functionality
β”‚       β”‚   β”œβ”€β”€ database.py                 # SQL database handling
β”‚       β”‚   β”œβ”€β”€ LLMBasedCrawler.py         # Main crawler implementation
β”‚       β”‚   β”œβ”€β”€ DatabaseHandler.py          # Vector database (FAISS)
β”‚       β”‚   └── SummaryGenerator.py         # Text summarization
β”‚       β”œβ”€β”€ ui/                            # User Interface
β”‚       β”‚   β”œβ”€β”€ chat_app.py                # Main Streamlit app
β”‚       β”‚   β”œβ”€β”€ chat_ui.py                 # Development UI
β”‚       β”‚   └── login.py                   # Authentication UI
β”‚       └── utils/                         # Utilities
β”‚           β”œβ”€β”€ content_validator.py        # URL/content validation
β”‚           β”œβ”€β”€ data_manager.py            # Import/export handling
β”‚           β”œβ”€β”€ helper_functions.py         # General helpers
β”‚           β”œβ”€β”€ monitoring.py              # Metrics collection
β”‚           └── progress.py                # Progress tracking
β”œβ”€β”€ tests/                                # Test suite
β”‚   └── test_core/
β”‚       β”œβ”€β”€ test_database_handler.py       # Vector DB tests
β”‚       β”œβ”€β”€ test_integration.py           # Integration tests
β”‚       β”œβ”€β”€ test_llm_based_crawler.py     # Crawler tests
β”‚       └── test_summary_generator.py     # Summarizer tests
β”œβ”€β”€ .github/                             # CI/CD
β”‚   └── workflows/
β”‚       └── Push_to_hf.yaml              # HuggingFace sync
β”œβ”€β”€ Docs/
β”‚   └── MiniDoc.md                       # Documentation
β”œβ”€β”€ .dockerignore                        # Docker exclusions
β”œβ”€β”€ .gitignore                          # Git exclusions
β”œβ”€β”€ Dockerfile                          # Container config
β”œβ”€β”€ LICENSE                             # MIT License
β”œβ”€β”€ README.md                          # Project documentation
β”œβ”€β”€ README_hf.md                       # HuggingFace README
β”œβ”€β”€ pyproject.toml                     # Project metadata
β”œβ”€β”€ pytest.ini                         # Test configuration
└── setup_env.py                       # Environment setup

Core Components

LLMBasedCrawler (src/crawlgpt/core/LLMBasedCrawler.py)

  • Main crawler class handling web content extraction and processing
  • Integrates with Groq API for language model operations
  • Manages content chunking, summarization, and response generation
  • Includes rate limiting and metrics collection
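
The chunking step mentioned above can be illustrated with a minimal sketch. Note that `chunk_text` and its parameters are hypothetical, not the actual LLMBasedCrawler API:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks.

    Overlap preserves context across chunk boundaries so that
    sentences cut in half remain retrievable by similarity search.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk is later embedded and stored so that queries can retrieve only the relevant portions of a long page.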

DatabaseHandler (src/crawlgpt/core/DatabaseHandler.py)

  • Vector database implementation using FAISS
  • Stores and retrieves text embeddings for efficient similarity search
  • Handles data persistence and state management
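
FAISS performs fast nearest-neighbour search over embedding vectors. The core idea (not FAISS itself, and not the actual DatabaseHandler API) can be sketched in plain Python with cosine similarity:

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store illustrating similarity search.

    The real DatabaseHandler uses FAISS indexes over
    sentence-transformers embeddings; this sketch only shows the idea.
    """
    def __init__(self):
        self.vectors: list[list[float]] = []
        self.texts: list[str] = []

    def add(self, vector: list[float], text: str) -> None:
        self.vectors.append(vector)
        self.texts.append(text)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query: list[float], k: int = 1) -> list[str]:
        """Return the texts of the k most similar stored vectors."""
        ranked = sorted(
            range(len(self.vectors)),
            key=lambda i: self._cosine(query, self.vectors[i]),
            reverse=True,
        )
        return [self.texts[i] for i in ranked[:k]]
```

FAISS replaces the linear scan in `search` with optimized index structures, which matters once the store holds many thousands of chunks.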

SummaryGenerator (src/crawlgpt/core/SummaryGenerator.py)

  • Generates concise summaries of text chunks using Groq API
  • Configurable model selection and parameters
  • Handles empty input validation
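
The shape of a summarization request can be sketched as follows. The function name, model string, and temperature are illustrative, and the actual Groq network call is omitted; only the payload construction and the empty-input check are shown:

```python
def build_summary_request(text: str, model: str = "llama-3.1-8b-instant") -> dict:
    """Build a chat-completion payload for summarizing a text chunk.

    The model name and temperature are placeholders; the real
    SummaryGenerator sends such a payload via the groq client.
    """
    if not text.strip():
        raise ValueError("Cannot summarize empty input")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize the following text concisely."},
            {"role": "user", "content": text},
        ],
        "temperature": 0.3,
    }
```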

Database (src/crawlgpt/core/database.py)

  • SQLAlchemy-based database handling for user management and chat history
  • Provides secure user authentication with BCrypt password hashing
  • Manages persistent storage of chat conversations and context
  • Configuration

    • Uses SQLite by default (sqlite:///crawlgpt.db)
    • Configurable via DATABASE_URL environment variable
    • Automatic schema creation on startup
    • Session management with SQLAlchemy sessionmaker
  • Security Features

    • BCrypt password hashing with PassLib
    • Unique username enforcement
    • Secure session handling
    • Role-based message tracking
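
The configuration lookup described above amounts to an environment-variable override with a SQLite default. A minimal sketch (the helper name is hypothetical; the resulting URL would be passed to SQLAlchemy's `create_engine`):

```python
import os

def get_database_url() -> str:
    """Resolve the database URL, falling back to a local SQLite file
    when the DATABASE_URL environment variable is unset."""
    return os.environ.get("DATABASE_URL", "sqlite:///crawlgpt.db")
```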

UI Components

chat_app.py (src/crawlgpt/ui/chat_app.py)

  • Main Streamlit application interface
  • URL processing and content extraction
  • Chat interface with message history
  • System metrics and debug information
  • Import/export functionality
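
Message history in a chat UI is typically a list of role/content records that must be kept bounded so the model context does not grow without limit. An illustrative helper (the actual chat_app manages history in Streamlit session state):

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep only the most recent messages so the model context
    stays within a bounded size. The cap of 20 is illustrative."""
    return messages[-max_messages:]
```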

chat_ui.py (src/crawlgpt/ui/chat_ui.py)

  • Development/testing UI with additional debug features
  • Extended metrics visualization
  • Raw data inspection capabilities

Utilities

content_validator.py (src/crawlgpt/utils/content_validator.py)

  • URL and content validation
  • MIME type checking
  • Size limit enforcement
  • Security checks for malicious content
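
The URL and size checks can be sketched with the standard library. The scheme allow-list and the 10 MB cap below are illustrative values, not the validator's actual limits:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
MAX_CONTENT_BYTES = 10 * 1024 * 1024  # illustrative 10 MB cap

def is_valid_url(url: str) -> bool:
    """Structural URL validation: an allowed scheme and a host must be present."""
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)

def within_size_limit(content: bytes) -> bool:
    """Reject responses larger than the configured cap."""
    return len(content) <= MAX_CONTENT_BYTES
```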

data_manager.py (src/crawlgpt/utils/data_manager.py)

  • Data import/export operations
  • File serialization (JSON/pickle)
  • Timestamped backups
  • State management
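
A timestamped JSON export can be sketched as below. The filename pattern is illustrative; the real data_manager also supports pickle serialization for binary state:

```python
import json
import time
from pathlib import Path

def export_state(state: dict, directory: str = ".") -> Path:
    """Write state to a timestamped JSON file and return its path."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    path = Path(directory) / f"crawlgpt_backup_{stamp}.json"
    path.write_text(json.dumps(state, indent=2))
    return path
```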

monitoring.py (src/crawlgpt/utils/monitoring.py)

  • Request metrics collection
  • Rate limiting implementation
  • Performance monitoring
  • Usage statistics
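
Rate limiting is commonly implemented as a token bucket: tokens refill at a steady rate and each request spends one. A minimal sketch (illustrative, not the project's exact implementation):

```python
import time

class RateLimiter:
    """Token-bucket limiter: allows up to `rate` requests per `per` seconds."""

    def __init__(self, rate: int, per: float):
        self.rate = rate
        self.per = per
        self.tokens = float(rate)
        self.updated = time.monotonic()

    def acquire(self) -> bool:
        """Refill tokens based on elapsed time, then try to spend one."""
        now = time.monotonic()
        self.tokens = min(
            self.rate,
            self.tokens + (now - self.updated) * self.rate / self.per,
        )
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `acquire()` before issuing a request and back off (or sleep) when it returns False.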

progress.py (src/crawlgpt/utils/progress.py)

  • Operation progress tracking
  • Status updates
  • Step counting
  • Time tracking
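
The step-counting and status aspects can be sketched with a small class (names and statuses here are illustrative, not the module's actual API):

```python
class ProgressTracker:
    """Tracks progress of a multi-step operation."""

    def __init__(self, total_steps: int):
        self.total_steps = total_steps
        self.current_step = 0
        self.status = "pending"

    def advance(self, status: str = "running") -> None:
        """Mark one step done; flip to 'complete' when all steps finish."""
        self.current_step = min(self.current_step + 1, self.total_steps)
        self.status = "complete" if self.current_step == self.total_steps else status

    @property
    def percent(self) -> float:
        return 100.0 * self.current_step / self.total_steps
```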

Testing

test_database_handler.py (tests/test_core/test_database_handler.py)

  • Tests for vector database operations
  • Integration tests for data storage/retrieval
  • End-to-end flow validation

test_integration.py (tests/test_core/test_integration.py)

  • Full system integration tests
  • URL extraction to response generation flow
  • State management validation

test_llm_based_crawler.py (tests/test_core/test_llm_based_crawler.py)

  • Crawler functionality tests
  • Content extraction validation
  • Response generation testing

test_summary_generator.py (tests/test_core/test_summary_generator.py)

  • Summary generation tests
  • Empty input handling
  • Model output validation

Configuration

pyproject.toml

  • Project metadata
  • Dependencies
  • Optional dev dependencies
  • Entry points

pytest.ini

  • Test configuration
  • Path settings
  • Test discovery patterns
  • Reporting options

setup_env.py

  • Environment setup script
  • Virtual environment creation
  • Dependency installation
  • Playwright setup

Features

  1. Web Crawling

    • Async web content extraction
    • Playwright-based rendering
    • Content validation
    • Rate limiting
  2. Content Processing

    • Text chunking
    • Vector embeddings
    • Summarization
    • Similarity search
  3. Chat Interface

    • Message history
    • Context management
    • Model parameter control
    • Debug information
  4. Data Management

    • State import/export
    • Progress tracking
    • Metrics collection
    • Error handling
  5. Testing

    • Unit tests
    • Integration tests
    • Mock implementations
    • Async test support

Dependencies

Core:

  • streamlit
  • groq
  • sentence-transformers
  • faiss-cpu
  • crawl4ai
  • pydantic
  • aiohttp
  • beautifulsoup4
  • playwright

Development:

  • pytest
  • pytest-mockito
  • black
  • isort
  • flake8

License

MIT License