AmmarFahmy committed
Commit 3f71554 · 1 Parent(s): b4faff1

Update README with comprehensive project description and system architecture

.gitignore ADDED
@@ -0,0 +1,28 @@
+ # Streamlit config directory and files
+ .streamlit/
+ .streamlit/*.toml
+
+ # Python related
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # Flashrank cache
+ .flashrank_cache
+
+ # Environment
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+
+
+ # IDE
+ .vscode/
+ .idea/
+
+ # Distribution / packaging
+ dist/
+ build/
+ *.egg-info/
027-SLLR-SLLR-1990-V-1-DONA-CECILIANA-AND-OTHERS-v.-KAMALA-PIYASEELI-AND-ANOTHER.pdf ADDED
Binary file (212 kB).
 
README.md CHANGED
@@ -1,13 +1,215 @@
- ---
- title: Innodata Poc
- emoji: 🐨
- colorFrom: gray
- colorTo: purple
- sdk: streamlit
- sdk_version: 1.42.0
- app_file: app.py
- pinned: false
- short_description: This is a POC app for an interview purpose.
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Legal Document RAG with Taxonomy-Aware Hybrid Search
+
+ A Q&A application for legal documents that combines hybrid search and Retrieval-Augmented Generation (RAG) with built-in legal taxonomy awareness. Built with RAGLite for document processing and retrieval, and Streamlit for the chat interface, the system answers legal queries while keeping track of key legal domain concepts.
+
+ ## Features
+
+ - **Dual-Mode Taxonomy Extraction** (a minimal sketch of both modes appears at the end of this feature list):
+   - **Automatic Mode**:
+     - Fast regex-based keyword matching
+     - Identifies exact matches from the predefined legal taxonomy
+     - Efficient for quick document processing
+     - No API calls required
+   - **Intelligent Mode**:
+     - LLM-powered taxonomy analysis using GPT-4o-mini
+     - Identifies both exact matches and semantically related concepts
+     - Provides additional context through related keyword suggestions
+     - More nuanced understanding of legal concepts
+
+ - **Advanced Search and Retrieval**:
+   - **Hybrid Search System**:
+     - Combines semantic search with traditional keyword matching
+     - Uses OpenAI's text-embedding-3-large for semantic understanding
+     - Retrieves the top 10 most relevant document chunks before reranking
+     - Chunk size capped at 8,000 (`chunk_max_size=8000`) with a 2-sentence window (`embedder_sentence_window_size=2`) so adjacent chunks keep surrounding context
+
+   - **Intelligent Reranking**:
+     - Powered by Cohere's reranking technology
+     - Re-orders search results by relevance to the query
+     - Improves context selection for more accurate answers
+     - Language-aware reranking optimized for English
+
+   - **Fallback Mechanism**:
+     - Graceful degradation to general knowledge when no relevant documents are found
+     - Uses GPT-4o-mini for general legal knowledge
+     - Maintains conversation context
+
+ - **Document Processing and UI**:
+   - Page-level document chunking with metadata enrichment
+   - Visual PDF page display for source verification
+   - Progress tracking during document processing
+   - Interactive chat interface with conversation history
+
+ - **Template-Based Configuration**:
+   - The application uses Jinja2 templates to manage prompts and taxonomies, following software engineering best practices:
+
+   - **Separation of Concerns**:
+     - Prompts and taxonomies are maintained in separate template files
+     - `templates/prompts.j2`: Contains all system prompts (RAG, extraction, fallback)
+     - `templates/taxonomy.j2`: Contains the legal taxonomy keywords
+
+   - **Benefits**:
+     - **Maintainability**: Edit prompts and taxonomies without touching application code
+     - **Version Control**: Track changes to prompts and taxonomies separately
+     - **Environment Flexibility**: Support different prompts/taxonomies per environment
+     - **Reusability**: Templates can be shared across multiple applications
+     - **Readability**: Clean separation between logic and content
+
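+ A condensed sketch of the two extraction modes, adapted from `extract_taxonomy_keywords_automatic` and `extract_taxonomy_keywords_intelligent` in `app.py` (error handling and Streamlit session wiring omitted):
+
+ ```python
+ import json
+ import re
+ import openai
+
+ def extract_automatic(text: str, taxonomy: list[str]) -> list[str]:
+     """Automatic mode: return taxonomy keywords that literally appear in the page text."""
+     return [kw for kw in taxonomy
+             if re.search(r"\b" + re.escape(kw) + r"\b", text, flags=re.IGNORECASE)]
+
+ def extract_intelligent(text: str, taxonomy: list[str], api_key: str) -> tuple[list, list]:
+     """Intelligent mode: ask GPT-4o-mini for exact matches plus related taxonomy keywords."""
+     client = openai.OpenAI(api_key=api_key)
+     # INTELLIGENT_EXTRACTION_PROMPT is rendered from templates/prompts.j2 at startup (see app.py)
+     response = client.chat.completions.create(
+         model="gpt-4o-mini",
+         messages=[
+             {"role": "system", "content": INTELLIGENT_EXTRACTION_PROMPT},
+             {"role": "user", "content": f"Taxonomy keywords: {', '.join(taxonomy)}\n\nPage content:\n{text}"},
+         ],
+         max_tokens=1024,
+     )
+     data = json.loads(response.choices[0].message.content.strip())
+     return data.get("exact_matches", []), data.get("related_keywords", [])
+ ```
+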
+ ## System Architecture
+
+ The following flowchart illustrates the complete system pipeline from initial configuration to final answer generation:
+
+ ```mermaid
+ flowchart TD
+     subgraph Configuration["Initial Configuration"]
+         Config["Configure API Keys & DB"]
+         Mode["Select Taxonomy Mode:<br>Automatic or Intelligent"]
+     end
+
+     subgraph Upload_Duplicate_Check["Upload & Duplicate Check"]
+         A["User Uploads PDF"]
+         D["Store PDF Bytes in Session"]
+         Hash["Generate MD5 Hash"]
+         DupCheck{"Is Duplicate?"}
+     end
+
+     subgraph Document_Processing["Document Processing <br> Chunking & Taxonomy Extraction"]
+         E["Document Processing"]
+         F["Read PDF using PyPDF2"]
+         G["Split PDF into Pages"]
+         H["For Each Page: Extract Text"]
+         TaxMode{"Extraction Mode?"}
+         I1["Automatic: Extract Keywords<br>using Regex"]
+         I2["Intelligent: Use GPT-4o-mini<br>for Keyword Extraction"]
+         J["Generate Temporary File with Header:<br>Document, DocHash, Page, Taxonomy"]
+         K["Call insert_document<br>Chunk Ingestion"]
+         L["Store Chunk in Database"]
+         M["Update Processing Progress & Complete"]
+     end
+
+     subgraph Query_Search_Flow["Query & Search Flow"]
+         N["User Enters Query in Chat"]
+         O["Perform Hybrid Search<br>hybrid_search"]
+         P["Retrieve Chunks<br>retrieve_chunks"]
+         Q["Re-rank Chunks<br>rerank_chunks"]
+         R{"Relevant Chunk Found?"}
+         S["Select Top Matched Chunk"]
+         T["Fallback: General Knowledge<br>GPT-4o-mini"]
+     end
+
+     subgraph Answer_Generation_UI["Answer Generation & UI"]
+         U["Call rag for Answer Generation"]
+         V["Stream Generated Answer to User"]
+         W["Expander: Top Matched Source"]
+         X["Parse Chunk Header for:<br>Document, Page, Taxonomy"]
+         Y["Convert PDF Page to Image"]
+         Z["Display PDF Page Image &<br>Taxonomy Information"]
+     end
+
+     Config --> Mode
+     Mode --> A
+     A --> Hash
+     Hash --> DupCheck
+     DupCheck -- Yes --> A
+     DupCheck -- No --> D
+     D --> E
+     E --> F
+     F --> G
+     G --> H
+     H --> TaxMode
+     TaxMode -- Automatic --> I1
+     TaxMode -- Intelligent --> I2
+     I1 --> J
+     I2 --> J
+     J --> K
+     K --> L
+     L --> M
+     M --> N
+     N --> O
+     O --> P
+     P --> Q
+     Q --> R
+     R -- Yes --> S
+     R -- No --> T
+     T --> V
+     S --> U
+     U --> V
+     V --> W
+     W --> X
+     X --> Y
+     Y --> Z
+ ```
+
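+ The query-side boxes in the flowchart map directly onto RAGLite calls. Below is a condensed, hypothetical `answer()` helper that stitches together `perform_search` and the answer-generation step from `app.py`:
+
+ ```python
+ from raglite import hybrid_search, retrieve_chunks, rerank_chunks, rag
+
+ def answer(query: str, config, system_prompt: str, history: list[dict]):
+     # Hybrid search: semantic + keyword matching over the ingested chunks (top 10 chunk ids)
+     chunk_ids, _scores = hybrid_search(query, num_results=10, config=config)
+     if not chunk_ids:
+         # No relevant documents: fall back to general knowledge (handle_fallback in app.py)
+         return handle_fallback(query), None
+
+     # Fetch the chunk bodies and re-rank them with Cohere
+     chunks = retrieve_chunks(chunk_ids, config=config)
+     reranked = rerank_chunks(query, chunks, config=config)
+
+     # Stream the grounded answer; the top reranked chunk feeds the "Top Matched Source" expander
+     stream = rag(prompt=query, system_prompt=system_prompt, search=hybrid_search,
+                  messages=history, max_contexts=5, config=config)
+     return "".join(stream), reranked[0]
+ ```
+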
+ ## Prerequisites
+
+ You'll need the following:
+
+ 1. **API Keys**:
+    - [OpenAI API key](https://platform.openai.com/api-keys) for:
+      - GPT-4o model (chat completions)
+      - text-embedding-3-large (embeddings)
+      - GPT-4o-mini (intelligent taxonomy extraction)
+    - [Cohere API key](https://dashboard.cohere.com/api-keys) for reranking
+
+ 2. **Database Setup** (optional):
+    - Default: SQLite (no setup required)
+    - Alternatively, use any SQLAlchemy-compatible database (example URLs below)
+
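+ The database URL follows standard SQLAlchemy conventions. Two illustrative values (the PostgreSQL credentials are placeholders) that could go in the sidebar's "Database URL" field:
+
+ ```python
+ # Default: a local SQLite file, created automatically
+ db_url = "sqlite:///raglite.sqlite"
+
+ # Alternative: PostgreSQL (requires psycopg2-binary, already listed in requirements.txt)
+ db_url = "postgresql://user:password@localhost:5432/raglite"
+ ```
+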
+ ## Installation
+
+ 1. **Install Dependencies**:
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 2. **Required System Dependencies**:
+    - Install both pypandoc and Pandoc via conda:
+    ```bash
+    conda install -c conda-forge pypandoc pandoc
+    ```
+
+ ## Usage
+
+ 1. **Start the Application**:
+    ```bash
+    streamlit run app.py
+    ```
+
+ 2. **Configure the Application**:
+    - Enter your OpenAI API key
+    - Enter your Cohere API key
+    - Configure the database URL (optional, defaults to SQLite)
+    - Select the taxonomy extraction mode (Automatic or Intelligent)
+    - Click "Save Configuration"
+
+ 3. **Upload Documents**:
+    - Upload PDF legal documents
+    - The system will automatically:
+      - Process documents page by page
+      - Extract legal taxonomy keywords based on the selected mode
+      - Create searchable chunks with metadata
+      - Display processing progress
+
+ 4. **Ask Questions**:
+    - Ask questions about your legal documents
+    - View source information, including:
+      - Original document and page number
+      - Extracted taxonomy keywords (exact matches and related concepts in Intelligent mode)
+      - PDF page preview
+    - The system automatically falls back to general knowledge for non-document questions
+
+ ## Legal Taxonomy
+
+ The system includes built-in recognition for over 100 legal concepts across various categories:
+ - Core Legal Areas (e.g., contract law, tort law, criminal law)
+ - Legal Processes & Procedures (e.g., civil procedure, arbitration)
+ - Legal Concepts & Principles (e.g., due process, liability)
+ - Rights & Protections (e.g., civil rights, privacy rights)
+ - Business & Commercial (e.g., securities regulation, intellectual property)
+ - Property & Real Estate (e.g., zoning, land use)
+ - Criminal Justice (e.g., felony, probable cause)
+ - Specialized Areas (e.g., healthcare law, cyber law)
+ - Government & Public Law (e.g., administrative law, regulatory compliance)
+ - Alternative Dispute Resolution (e.g., mediation, arbitration)
+
+ See `templates/taxonomy.j2` for the complete list of supported taxonomy keywords.
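+
+ Each processed page is stored with a small metadata header that the chat UI later parses to display the source document, page number, and taxonomy keywords. The layout, as documented in `parse_chunk_text` in `app.py`, is:
+
+ ```
+ ===PAGE_INFO===
+ Document: <doc_name>
+ DocHash: <doc_hash>
+ Page: <page_number>
+ Taxonomy: <exact_matches> | Related: <related_keywords>
+ ===CONTENT===
+ <actual page content>
+ ```
+
+ In Automatic mode the `Taxonomy:` line contains only the exact regex matches (or `None`); the `| Related: ...` part is added in Intelligent mode when related concepts are suggested.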
TISSA-BANDARA-RANDENIYA-and-THE-BOARD-OF-DIRECTORS-OF-THE-CO-OPERATIVE-W.pdf.pdf ADDED
Binary file (112 kB).
 
Y.-B.-PUSSADENIYA-ASSISTANT-COMMISSIONER-OF-LOCAL-GOVERNMENT-Petitioner-and-O.pdf.pdf ADDED
Binary file (100 kB).
 
app.py ADDED
@@ -0,0 +1,466 @@
1
+ import os
2
+ import re
3
+ import json
4
+ import hashlib
5
+ import logging
6
+ import streamlit as st
7
+ import PyPDF2
8
+ from raglite import RAGLiteConfig, insert_document, hybrid_search, retrieve_chunks, rerank_chunks, rag
9
+ from rerankers import Reranker
10
+ from typing import List
11
+ from pathlib import Path
12
+ import openai
13
+ import time
14
+ import warnings
15
+ from jinja2 import Environment, FileSystemLoader
16
+ from pdf2image import convert_from_bytes
17
+
18
+ # Setup logging and ignore specific warnings.
19
+ logging.basicConfig(level=logging.INFO)
20
+ logger = logging.getLogger(__name__)
21
+ warnings.filterwarnings("ignore", message=".*torch.classes.*")
22
+
23
+ # Initialize Jinja2 environment
24
+ jinja_env = Environment(loader=FileSystemLoader('templates'))
25
+
26
+ # Load templates
27
+ prompts_template = jinja_env.get_template('prompts.j2')
28
+ taxonomy_template = jinja_env.get_template('taxonomy.j2')
29
+
30
+ # Render templates to get variables
31
+ template_vars = {}
32
+ exec(prompts_template.render(), template_vars)
33
+ exec(taxonomy_template.render(), template_vars)
34
+
35
+ # Extract variables from templates
36
+ RAG_SYSTEM_PROMPT = template_vars['RAG_SYSTEM_PROMPT'].strip()
37
+ INTELLIGENT_EXTRACTION_PROMPT = template_vars['INTELLIGENT_EXTRACTION_PROMPT'].strip(
38
+ )
39
+ FALLBACK_SYSTEM_PROMPT = template_vars['FALLBACK_SYSTEM_PROMPT'].strip()
40
+ LEGAL_TAXONOMY_KEYWORDS = template_vars['LEGAL_TAXONOMY_KEYWORDS']
41
+
42
+ # ------------------------------------------
43
+ # 1. Predefined Legal Taxonomy
44
+ # ------------------------------------------
45
+
46
+ # ------------------------------------------
47
+ # 2. Automatic Taxonomy Extraction (Regex-based)
48
+ # ------------------------------------------
49
+
50
+
51
+ def extract_taxonomy_keywords_automatic(text: str, taxonomy: list) -> list:
52
+ """
53
+ Return a list of taxonomy keywords that appear in the text using regex matching.
54
+ """
55
+ found_keywords = []
56
+ for keyword in taxonomy:
57
+ pattern = r'\b' + re.escape(keyword) + r'\b'
58
+ if re.search(pattern, text, flags=re.IGNORECASE):
59
+ found_keywords.append(keyword)
60
+ return found_keywords
61
+
62
+ # ------------------------------------------
63
+ # 3. Intelligent Taxonomy Extraction (LLM-based)
64
+ # ------------------------------------------
65
+
66
+
67
+ def extract_taxonomy_keywords_intelligent(text: str, taxonomy: list) -> tuple:
68
+ """
69
+ Uses GPT-4o-mini to extract taxonomy keywords from the page content.
70
+ The assistant is provided with both the page content and the list of legal taxonomy keywords.
71
+ It returns a tuple (exact_matches, related_keywords) where:
72
+ - exact_matches: a list of keywords that exactly appear in the content (if any)
73
+ - related_keywords: a list of 5 highly relevant taxonomy keywords.
74
+ If no exact matches are found, only related_keywords are provided.
75
+ """
76
+ try:
77
+ client = openai.OpenAI(
78
+ api_key=st.session_state.user_env["OPENAI_API_KEY"])
79
+ system_prompt = INTELLIGENT_EXTRACTION_PROMPT
80
+ user_prompt = f"Taxonomy keywords: {', '.join(taxonomy)}\n\nPage content:\n{text}"
81
+ response = client.chat.completions.create(
82
+ model="gpt-4o-mini",
83
+ messages=[
84
+ {"role": "system", "content": system_prompt},
85
+ {"role": "user", "content": user_prompt}
86
+ ],
87
+ max_tokens=1024,
88
+ temperature=0.7
89
+ )
90
+ result_text = response.choices[0].message.content.strip()
91
+ logger.info("LLM extraction result: " + result_text)
92
+ try:
93
+ data = json.loads(result_text)
94
+ except Exception as parse_error:
95
+ logger.error(
96
+ "JSON parsing error in intelligent extraction: " + str(parse_error))
97
+ logger.error("LLM result was: " + result_text)
98
+ return ([], [])
99
+ exact_matches = data.get("exact_matches", [])
100
+ related_keywords = data.get("related_keywords", [])
101
+ logger.info(f"Exact matches: {exact_matches}")
102
+ logger.info(f"Related keywords: {related_keywords}")
103
+ return (exact_matches, related_keywords)
104
+ except Exception as e:
105
+ logger.error("LLM extraction error: " + str(e))
106
+ return ([], [])
107
+
108
+ # ------------------------------------------
109
+ # 4. Helper Function: Parse Chunk Text for Metadata
110
+ # ------------------------------------------
111
+
112
+
113
+ def parse_chunk_text(chunk_text: str):
114
+ """
115
+ Expects the chunk text to be formatted as:
116
+
117
+ ===PAGE_INFO===
118
+ Document: <doc_name>
119
+ DocHash: <doc_hash>
120
+ Page: <page_number>
121
+ Taxonomy: <header_line>
122
+ ===CONTENT===
123
+ <actual page content>
124
+
125
+ For Intelligent mode, header_line may be formatted as:
126
+ <exact_matches> | Related: <related_keywords>
127
+
128
+ Returns a tuple: (doc_name, doc_hash, page_number, taxonomy_info, actual_content)
129
+ taxonomy_info is returned as a string.
130
+ """
131
+ doc_name = "Unknown"
132
+ doc_hash = "Unknown"
133
+ page_num = "Unknown"
134
+ taxonomy_info = ""
135
+ content = chunk_text
136
+ if chunk_text.startswith("===PAGE_INFO==="):
137
+ parts = chunk_text.split("===CONTENT===")
138
+ if len(parts) >= 2:
139
+ header = parts[0]
140
+ content = "===CONTENT===".join(parts[1:]).strip()
141
+ for line in header.splitlines():
142
+ if line.startswith("Document:"):
143
+ doc_name = line.split("Document:")[1].strip()
144
+ elif line.startswith("DocHash:"):
145
+ doc_hash = line.split("DocHash:")[1].strip()
146
+ elif line.startswith("Page:"):
147
+ page_num = line.split("Page:")[1].strip()
148
+ elif line.startswith("Taxonomy:"):
149
+ taxonomy_info = line.split("Taxonomy:")[1].strip()
150
+ return doc_name, doc_hash, page_num, taxonomy_info, content
151
+
152
+ # ------------------------------------------
153
+ # 5. Configuration Initialization
154
+ # ------------------------------------------
155
+
156
+
157
+ def initialize_config(openai_key: str, cohere_key: str, db_url: str) -> RAGLiteConfig:
158
+ try:
159
+ os.environ["OPENAI_API_KEY"] = openai_key
160
+ os.environ["COHERE_API_KEY"] = cohere_key
161
+ return RAGLiteConfig(
162
+ db_url=db_url,
163
+ llm="gpt-4o",
164
+ embedder="text-embedding-3-large",
165
+ embedder_normalize=True,
166
+ chunk_max_size=8000,
167
+ embedder_sentence_window_size=2,
168
+ reranker=Reranker("cohere", api_key=cohere_key, lang="en")
169
+ )
170
+ except Exception as e:
171
+ raise ValueError(f"Configuration error: {e}")
172
+
173
+ # ------------------------------------------
174
+ # 6. Document Processing: Page-Wise Chunking with Metadata Injection and Progress UI
175
+ # ------------------------------------------
176
+
177
+
178
+ def process_document(file_path: str, doc_hash: str, doc_name: str) -> bool:
179
+ try:
180
+ if not st.session_state.get('my_config'):
181
+ raise ValueError("Configuration not initialized")
182
+
183
+ # Sanitize document name to avoid encoding issues
184
+ doc_name = doc_name.encode('ascii', 'replace').decode('ascii')
185
+
186
+ with open(file_path, "rb") as f:
187
+ pdf_reader = PyPDF2.PdfReader(f)
188
+ num_pages = len(pdf_reader.pages)
189
+ logger.info(f"Processing PDF '{doc_name}' with {num_pages} pages.")
190
+ progress_bar = st.progress(0)
191
+ status_text = st.empty()
192
+
193
+ for page_index in range(num_pages):
194
+ status_text.text(
195
+ f"Processing page {page_index+1} of {num_pages}...")
196
+ with st.spinner(f"Processing page {page_index+1}..."):
197
+ try:
198
+ page = pdf_reader.pages[page_index]
199
+
200
+ # Extract text and handle encoding more robustly
201
+ raw_text = page.extract_text() or ""
202
+
203
+ # Convert text to plain ASCII, replacing non-ASCII characters
204
+ text = raw_text.encode(
205
+ 'ascii', 'replace').decode('ascii')
206
+
207
+ # Remove any remaining problematic characters
208
+ text = ''.join(
209
+ char for char in text if ord(char) < 128)
210
+
211
+ extraction_mode = st.session_state.get(
212
+ "extraction_mode", "Automatic")
213
+
214
+ if extraction_mode == "Intelligent":
215
+ exact_matches, related_keywords = extract_taxonomy_keywords_intelligent(
216
+ text, LEGAL_TAXONOMY_KEYWORDS)
217
+ logger.info(f"Exact matches: {exact_matches}")
218
+ logger.info(
219
+ f"Related keywords: {related_keywords}")
220
+ if exact_matches:
221
+ header_line = f"{', '.join(exact_matches)} | Related: {', '.join(related_keywords)}"
222
+ else:
223
+ header_line = f"{', '.join(related_keywords)}"
224
+ else:
225
+ tax_keywords = extract_taxonomy_keywords_automatic(
226
+ text, LEGAL_TAXONOMY_KEYWORDS)
227
+ header_line = f"{', '.join(tax_keywords) if tax_keywords else 'None'}"
228
+
229
+ # Create safe filename for temporary file
230
+ safe_doc_name = ''.join(
231
+ c for c in doc_name if c.isalnum() or c in ('-', '_'))
232
+ temp_page_file = f"temp_page_{safe_doc_name}_{page_index+1}.txt"
233
+
234
+ # Write the temporary file using ASCII encoding
235
+ with open(temp_page_file, "w", encoding='ascii', errors='replace') as tmp:
236
+ header = (
237
+ "===PAGE_INFO===\n"
238
+ f"Document: {doc_name}\n"
239
+ f"DocHash: {doc_hash}\n"
240
+ f"Page: {page_index+1}\n"
241
+ f"Taxonomy: {header_line}\n"
242
+ "===CONTENT===\n"
243
+ )
244
+ tmp.write(header)
245
+ tmp.write(text)
246
+
247
+ insert_document(Path(temp_page_file),
248
+ config=st.session_state.my_config)
249
+ os.remove(temp_page_file)
250
+ progress_bar.progress((page_index + 1) / num_pages)
251
+
252
+ except Exception as page_error:
253
+ logger.error(
254
+ f"Error processing page {page_index+1}: {str(page_error)}")
255
+ continue
256
+
257
+ status_text.text("Processing complete!")
258
+ return True
259
+
260
+ except Exception as e:
261
+ logger.error(f"Error processing document: {str(e)}")
262
+ return False
263
+
264
+ # ------------------------------------------
265
+ # 7. Search and Fallback Functions
266
+ # ------------------------------------------
267
+
268
+
269
+ def perform_search(query: str) -> List:
270
+ try:
271
+ chunk_ids, scores = hybrid_search(
272
+ query, num_results=10, config=st.session_state.my_config)
273
+ if not chunk_ids:
274
+ return []
275
+ chunks = retrieve_chunks(chunk_ids, config=st.session_state.my_config)
276
+ return rerank_chunks(query, chunks, config=st.session_state.my_config)
277
+ except Exception as e:
278
+ logger.error(f"Search error: {str(e)}")
279
+ return []
280
+
281
+
282
+ def handle_fallback(query: str) -> str:
283
+ try:
284
+ client = openai.OpenAI(
285
+ api_key=st.session_state.user_env["OPENAI_API_KEY"])
286
+ system_prompt = FALLBACK_SYSTEM_PROMPT
287
+ response = client.chat.completions.create(
288
+ model="gpt-4o-mini",
289
+ messages=[
290
+ {"role": "system", "content": system_prompt},
291
+ {"role": "user", "content": query}
292
+ ],
293
+ max_tokens=1024,
294
+ temperature=0.7
295
+ )
296
+ return response.choices[0].message.content
297
+ except Exception as e:
298
+ logger.error(f"Fallback error: {str(e)}")
299
+ st.error(f"Fallback error: {str(e)}")
300
+ return "I apologize, but I encountered an error while processing your request. Please try again."
301
+
302
+ # ------------------------------------------
303
+ # 8. Main Streamlit App
304
+ # ------------------------------------------
305
+
306
+
307
+ def main():
308
+ st.set_page_config(page_title="Innodata - Taxonomy RAG POC", layout="wide")
309
+ for state_var in ['chat_history', 'documents_loaded', 'my_config', 'user_env', 'processed_pdf_hashes', 'pdf_files']:
310
+ if state_var not in st.session_state:
311
+ if state_var == 'chat_history':
312
+ st.session_state[state_var] = []
313
+ elif state_var == 'documents_loaded':
314
+ st.session_state[state_var] = False
315
+ elif state_var == 'my_config':
316
+ st.session_state[state_var] = None
317
+ elif state_var == 'user_env':
318
+ st.session_state[state_var] = {}
319
+ elif state_var == 'processed_pdf_hashes':
320
+ st.session_state[state_var] = set()
321
+ elif state_var == 'pdf_files':
322
+ st.session_state[state_var] = {}
323
+ with st.sidebar:
324
+ st.title("Configuration")
325
+ openai_key = st.text_input("OpenAI API Key", value=st.session_state.get(
326
+ 'openai_key', ''), type="password", placeholder="sk-...")
327
+ cohere_key = st.text_input("Cohere API Key", value=st.session_state.get(
328
+ 'cohere_key', ''), type="password", placeholder="Enter Cohere key")
329
+ db_url = st.text_input("Database URL", value=st.session_state.get(
330
+ 'db_url', 'sqlite:///raglite.sqlite'), placeholder="sqlite:///raglite.sqlite")
331
+ if not st.session_state.documents_loaded:
332
+ extraction_mode = st.radio("Select Taxonomy Extraction Mode", options=[
333
+ "Automatic", "Intelligent"], index=0)
334
+ st.session_state["extraction_mode"] = extraction_mode
335
+ else:
336
+ st.write("Taxonomy Extraction Mode: " +
337
+ st.session_state.get("extraction_mode", "Automatic"))
338
+ if st.button("Save Configuration"):
339
+ try:
340
+ if not all([openai_key, cohere_key, db_url]):
341
+ st.error("All fields are required!")
342
+ return
343
+ st.session_state['openai_key'] = openai_key
344
+ st.session_state['cohere_key'] = cohere_key
345
+ st.session_state['db_url'] = db_url
346
+ st.session_state.my_config = initialize_config(
347
+ openai_key=openai_key, cohere_key=cohere_key, db_url=db_url)
348
+ st.session_state.user_env = {"OPENAI_API_KEY": openai_key}
349
+ st.success("Configuration saved successfully!")
350
+ except Exception as e:
351
+ st.error(f"Configuration error: {str(e)}")
352
+ st.title("Innodata - Taxonomy POC - RAG with Hybrid Search")
353
+ if not st.session_state.documents_loaded:
354
+ uploaded_files = st.file_uploader("Upload PDF legal documents", type=[
355
+ "pdf"], accept_multiple_files=True, key="pdf_uploader")
356
+ if uploaded_files:
357
+ for uploaded_file in uploaded_files:
358
+ file_bytes = uploaded_file.getvalue()
359
+ file_hash = hashlib.md5(file_bytes).hexdigest()
360
+ if file_hash in st.session_state.processed_pdf_hashes:
361
+ st.warning(
362
+ f"'{uploaded_file.name}' has already been uploaded. Skipping duplicate.")
363
+ continue
364
+ else:
365
+ st.session_state.processed_pdf_hashes.add(file_hash)
366
+ st.session_state.pdf_files[file_hash] = file_bytes
367
+ temp_path = f"temp_{uploaded_file.name}"
368
+ with open(temp_path, "wb") as f:
369
+ f.write(file_bytes)
370
+ with st.spinner(f"Processing {uploaded_file.name}..."):
371
+ if process_document(temp_path, file_hash, uploaded_file.name):
372
+ st.success(
373
+ f"Successfully processed: {uploaded_file.name}")
374
+ else:
375
+ st.error(
376
+ f"Failed to process: {uploaded_file.name}")
377
+ os.remove(temp_path)
378
+ st.session_state.documents_loaded = True
379
+ st.success(
380
+ "All documents are ready! You can now ask questions about them.")
381
+ else:
382
+ st.info("Documents already processed. You can ask your questions below.")
383
+ if st.session_state.documents_loaded:
384
+ for msg in st.session_state.chat_history:
385
+ with st.chat_message("user"):
386
+ st.write(msg[0])
387
+ with st.chat_message("assistant"):
388
+ st.write(msg[1])
389
+ user_input = st.chat_input("Ask a question about the documents...")
390
+ if user_input:
391
+ with st.chat_message("user"):
392
+ st.write(user_input)
393
+ with st.chat_message("assistant"):
394
+ message_placeholder = st.empty()
395
+ try:
396
+ reranked_chunks = perform_search(query=user_input)
397
+ if not reranked_chunks or len(reranked_chunks) == 0:
398
+ logger.info(
399
+ "No relevant documents found. Falling back to general LLM.")
400
+ st.info(
401
+ "No relevant documents found. Using general knowledge to answer.")
402
+ full_response = handle_fallback(user_input)
403
+ message_placeholder.markdown(full_response)
404
+ else:
405
+ best_chunk = reranked_chunks[0]
406
+ raw_text = best_chunk.body
407
+ doc_name, doc_hash, page_number, taxonomy_info, content_without_header = parse_chunk_text(
408
+ raw_text)
409
+ formatted_messages = [
410
+ {"role": "user" if i %
411
+ 2 == 0 else "assistant", "content": msg}
412
+ for i, msg in enumerate([m for pair in st.session_state.chat_history for m in pair])
413
+ if msg
414
+ ]
415
+ response_stream = rag(
416
+ prompt=user_input,
417
+ system_prompt=RAG_SYSTEM_PROMPT,
418
+ search=hybrid_search,
419
+ messages=formatted_messages,
420
+ max_contexts=5,
421
+ config=st.session_state.my_config
422
+ )
423
+ full_response = ""
424
+ for chunk in response_stream:
425
+ full_response += chunk
426
+ message_placeholder.markdown(full_response + "▌")
427
+ message_placeholder.markdown(full_response)
428
+ with st.expander("Top Matched Source Information:", expanded=False):
429
+ st.write(f"**Document:** {doc_name}")
430
+ st.write(f"**Page:** {page_number}")
431
+ if st.session_state.get("extraction_mode") == "Intelligent" and "|" in taxonomy_info:
432
+ parts = taxonomy_info.split("|")
433
+ exact_matches = parts[0].strip()
434
+ related_keywords = parts[1].replace(
435
+ "Related:", "").strip()
436
+ st.write(f"**Exact Matches:** {exact_matches}")
437
+ st.write(
438
+ f"**Related Keywords:** {related_keywords}")
439
+ else:
440
+ st.write(
441
+ f"**Taxonomy Keywords:** {taxonomy_info if taxonomy_info else 'None'}")
442
+ if doc_hash in st.session_state.pdf_files:
443
+ pdf_bytes = st.session_state.pdf_files[doc_hash]
444
+ try:
445
+ page_num_int = int(page_number)
446
+ pages = convert_from_bytes(
447
+ pdf_bytes, first_page=page_num_int, last_page=page_num_int)
448
+ if pages:
449
+ st.image(
450
+ pages[0], caption=f"{doc_name} - Page {page_number}")
451
+ except Exception as e:
452
+ st.error(
453
+ "Could not convert PDF page to image: " + str(e))
454
+ st.session_state.chat_history.append(
455
+ (user_input, full_response))
456
+ except Exception as e:
457
+ st.error(f"Error: {str(e)}")
458
+ else:
459
+ if not st.session_state.my_config:
460
+ st.info("Please configure your API keys to get started.")
461
+ else:
462
+ st.info("Please upload some documents to get started.")
463
+
464
+
465
+ if __name__ == "__main__":
466
+ main()
requirements.txt ADDED
@@ -0,0 +1,26 @@
+ # Core dependencies
+ streamlit>=1.31.0
+ raglite==0.2.1
+ pydantic==2.10.1
+ sqlalchemy>=2.0.0
+ openai>=1.0.0
+ cohere>=4.37
+ rerankers==0.6.0
+
+ # PDF processing
+ PyPDF2>=3.0.0
+ pdf2image>=1.16.3
+ poppler-utils
+
+ # Template engine
+ jinja2>=3.1.0
+
+ # Database
+ psycopg2-binary>=2.9.9  # Optional: for PostgreSQL support
+
+ # NLP and text processing
+ spacy>=3.7.0
+ python-dotenv>=1.0.0
+
+ # Download spacy model during deployment
+ en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl
source/flowchartTD.png ADDED
templates/prompts.j2 ADDED
@@ -0,0 +1,25 @@
+ {# RAG System Prompt #}
+ {% set RAG_SYSTEM_PROMPT %}
+ You are a friendly and knowledgeable legal assistant that provides complete and insightful answers.
+ Answer the user's question using only the context provided.
+ When responding, you MUST NOT reference the existence of the context, directly or indirectly.
+ Instead, treat the context as if it were entirely part of your working memory.
+ {% endset %}
+
+ {# Intelligent Extraction System Prompt #}
+ {% set INTELLIGENT_EXTRACTION_PROMPT %}
+ You are a legal taxonomy extraction assistant.
+ Given the following page content and a list of legal taxonomy keywords,
+ identify all keywords from the list that exactly appear in the page content.
+ Then, suggest 5 additional legal taxonomy keywords that are highly relevant to the content.
+ If no exact matches are found, just provide 5 related keywords.
+ Return your answer as a JSON object with two keys: exact_matches and related_keywords.
+ Do not include any extra text.
+ {% endset %}
+
+ {# Fallback System Prompt #}
+ {% set FALLBACK_SYSTEM_PROMPT %}
+ You are a helpful AI assistant. When you don't know something,
+ be honest about it. Provide clear, concise, and accurate responses.
+ If the question is not related to any specific document, use your general knowledge to answer.
+ {% endset %}
templates/taxonomy.j2 ADDED
@@ -0,0 +1,50 @@
+ {# Legal Taxonomy Keywords #}
+ {% set LEGAL_TAXONOMY_KEYWORDS = [
+     # Core Legal Areas
+     "contract law", "tort law", "criminal law", "civil law", "constitutional law",
+     "property law", "family law", "intellectual property", "corporate law", "tax law",
+     "administrative law", "environmental law", "labor law", "immigration law",
+     "bankruptcy law", "securities law", "antitrust law", "international law",
+
+     # Legal Processes & Procedures
+     "civil procedure", "criminal procedure", "evidence", "jurisdiction", "arbitration",
+     "mediation", "litigation", "appeal", "discovery", "pleadings", "injunction",
+     "class action", "settlement", "trial", "hearing", "deposition",
+
+     # Legal Concepts & Principles
+     "due process", "precedent", "statute", "regulation", "liability", "negligence",
+     "damages", "remedy", "standing", "jurisdiction", "venue", "immunity",
+     "consideration", "breach", "fraud", "defamation", "estoppel",
+
+     # Rights & Protections
+     "civil rights", "human rights", "privacy rights", "discrimination",
+     "equal protection", "freedom of speech", "freedom of religion",
+     "right to counsel", "miranda rights", "fourth amendment", "fifth amendment",
+
+     # Business & Commercial
+     "mergers and acquisitions", "securities regulation", "commercial law",
+     "partnership law", "llc law", "agency law", "employment law", "trade law",
+     "consumer protection", "unfair competition", "trademark", "patent", "copyright",
+
+     # Property & Real Estate
+     "real property", "personal property", "easement", "zoning", "land use",
+     "landlord tenant", "mortgage", "title", "deed", "conveyance",
+
+     # Criminal Justice
+     "felony", "misdemeanor", "mens rea", "actus reus", "probable cause",
+     "search and seizure", "self defense", "double jeopardy", "plea bargain",
+
+     # Specialized Areas
+     "healthcare law", "education law", "elder law", "military law", "maritime law",
+     "aviation law", "sports law", "entertainment law", "cyber law", "blockchain law",
+     "data privacy", "artificial intelligence law", "environmental compliance",
+
+     # Government & Public Law
+     "municipal law", "state law", "federal law", "legislative process",
+     "executive power", "judicial review", "administrative procedure",
+     "public policy", "regulatory compliance", "government contracts",
+
+     # Alternative Dispute Resolution
+     "negotiation", "conciliation", "dispute resolution", "binding arbitration",
+     "non-binding arbitration", "mediation agreement", "settlement conference"
+ ] %}