AmmarFahmy committed
Commit 3f71554 · 1 Parent(s): b4faff1

Update README with comprehensive project description and system architecture

.gitignore ADDED
@@ -0,0 +1,28 @@
+ # Streamlit config directory and files
+ .streamlit/
+ .streamlit/*.toml
+
+ # Python related
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # Flashrank cache
+ .flashrank_cache
+
+ # Environment
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+
+
+ # IDE
+ .vscode/
+ .idea/
+
+ # Distribution / packaging
+ dist/
+ build/
+ *.egg-info/
027-SLLR-SLLR-1990-V-1-DONA-CECILIANA-AND-OTHERS-v.-KAMALA-PIYASEELI-AND-ANOTHER.pdf ADDED
Binary file (212 kB).
 
README.md CHANGED
@@ -1,13 +1,215 @@
- ---
- title: Innodata Poc
- emoji: 🐨
- colorFrom: gray
- colorTo: purple
- sdk: streamlit
- sdk_version: 1.42.0
- app_file: app.py
- pinned: false
- short_description: This is a POC app for an interview purpose.
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Legal Document RAG with Taxonomy-Aware Hybrid Search
+
+ A Q&A application for legal documents that combines hybrid search and Retrieval-Augmented Generation (RAG) with built-in legal taxonomy awareness. Built with RAGLite for document processing and retrieval, and Streamlit for the chat interface, the system answers legal queries while keeping track of key legal domain concepts.
+
+ ## Features
+
+ - **Dual-Mode Taxonomy Extraction** (a minimal sketch of both modes appears at the end of this feature list):
+   - **Automatic Mode**:
+     - Fast regex-based keyword matching
+     - Identifies exact matches from the predefined legal taxonomy
+     - Efficient for quick document processing
+     - No API calls required
+   - **Intelligent Mode**:
+     - LLM-powered taxonomy analysis using GPT-4o-mini
+     - Identifies both exact matches and semantically related concepts
+     - Provides additional context through related keyword suggestions
+     - More nuanced understanding of legal concepts
+
+ - **Advanced Search and Retrieval**:
+   - **Hybrid Search System**:
+     - Combines semantic search with traditional keyword matching
+     - Uses OpenAI's text-embedding-3-large for semantic understanding
+     - Retrieves the top 10 most relevant document chunks before reranking
+     - Chunk size capped at 8,000 (`chunk_max_size=8000`) with a 2-sentence window (`embedder_sentence_window_size=2`) so adjacent chunks keep surrounding context
+
+   - **Intelligent Reranking**:
+     - Powered by Cohere's reranking technology
+     - Re-orders search results by relevance to the query
+     - Improves context selection for more accurate answers
+     - Language-aware reranking optimized for English
+
+   - **Fallback Mechanism**:
+     - Graceful degradation to general knowledge when no relevant documents are found
+     - Uses GPT-4o-mini for general legal knowledge
+     - Maintains conversation context
+
+ - **Document Processing and UI**:
+   - Page-level document chunking with metadata enrichment
+   - Visual PDF page display for source verification
+   - Progress tracking during document processing
+   - Interactive chat interface with conversation history
+
+ - **Template-Based Configuration**:
+   - The application uses Jinja2 templates to manage prompts and taxonomies, following software engineering best practices:
+
+   - **Separation of Concerns**:
+     - Prompts and taxonomies are maintained in separate template files
+     - `templates/prompts.j2`: Contains all system prompts (RAG, extraction, fallback)
+     - `templates/taxonomy.j2`: Contains the legal taxonomy keywords
+
+   - **Benefits**:
+     - **Maintainability**: Edit prompts and taxonomies without touching application code
+     - **Version Control**: Track changes to prompts and taxonomies separately
+     - **Environment Flexibility**: Support different prompts/taxonomies per environment
+     - **Reusability**: Templates can be shared across multiple applications
+     - **Readability**: Clean separation between logic and content
+
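+ A condensed sketch of the two extraction modes, adapted from `extract_taxonomy_keywords_automatic` and `extract_taxonomy_keywords_intelligent` in `app.py` (error handling and Streamlit session wiring omitted):
+
+ ```python
+ import json
+ import re
+ import openai
+
+ def extract_automatic(text: str, taxonomy: list[str]) -> list[str]:
+     """Automatic mode: return taxonomy keywords that literally appear in the page text."""
+     return [kw for kw in taxonomy
+             if re.search(r"\b" + re.escape(kw) + r"\b", text, flags=re.IGNORECASE)]
+
+ def extract_intelligent(text: str, taxonomy: list[str], api_key: str) -> tuple[list, list]:
+     """Intelligent mode: ask GPT-4o-mini for exact matches plus related taxonomy keywords."""
+     client = openai.OpenAI(api_key=api_key)
+     # INTELLIGENT_EXTRACTION_PROMPT is rendered from templates/prompts.j2 at startup (see app.py)
+     response = client.chat.completions.create(
+         model="gpt-4o-mini",
+         messages=[
+             {"role": "system", "content": INTELLIGENT_EXTRACTION_PROMPT},
+             {"role": "user", "content": f"Taxonomy keywords: {', '.join(taxonomy)}\n\nPage content:\n{text}"},
+         ],
+         max_tokens=1024,
+     )
+     data = json.loads(response.choices[0].message.content.strip())
+     return data.get("exact_matches", []), data.get("related_keywords", [])
+ ```
+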
+ ## System Architecture
+
+ The following flowchart illustrates the complete system pipeline from initial configuration to final answer generation:
+
+ ```mermaid
+ flowchart TD
+     subgraph Configuration["Initial Configuration"]
+         Config["Configure API Keys & DB"]
+         Mode["Select Taxonomy Mode:<br>Automatic or Intelligent"]
+     end
+
+     subgraph Upload_Duplicate_Check["Upload & Duplicate Check"]
+         A["User Uploads PDF"]
+         D["Store PDF Bytes in Session"]
+         Hash["Generate MD5 Hash"]
+         DupCheck{"Is Duplicate?"}
+     end
+
+     subgraph Document_Processing["Document Processing <br> Chunking & Taxonomy Extraction"]
+         E["Document Processing"]
+         F["Read PDF using PyPDF2"]
+         G["Split PDF into Pages"]
+         H["For Each Page: Extract Text"]
+         TaxMode{"Extraction Mode?"}
+         I1["Automatic: Extract Keywords<br>using Regex"]
+         I2["Intelligent: Use GPT-4o-mini<br>for Keyword Extraction"]
+         J["Generate Temporary File with Header:<br>Document, DocHash, Page, Taxonomy"]
+         K["Call insert_document<br>Chunk Ingestion"]
+         L["Store Chunk in Database"]
+         M["Update Processing Progress & Complete"]
+     end
+
+     subgraph Query_Search_Flow["Query & Search Flow"]
+         N["User Enters Query in Chat"]
+         O["Perform Hybrid Search<br>hybrid_search"]
+         P["Retrieve Chunks<br>retrieve_chunks"]
+         Q["Re-rank Chunks<br>rerank_chunks"]
+         R{"Relevant Chunk Found?"}
+         S["Select Top Matched Chunk"]
+         T["Fallback: General Knowledge<br>GPT-4o-mini"]
+     end
+
+     subgraph Answer_Generation_UI["Answer Generation & UI"]
+         U["Call rag for Answer Generation"]
+         V["Stream Generated Answer to User"]
+         W["Expander: Top Matched Source"]
+         X["Parse Chunk Header for:<br>Document, Page, Taxonomy"]
+         Y["Convert PDF Page to Image"]
+         Z["Display PDF Page Image &<br>Taxonomy Information"]
+     end
+
+     Config --> Mode
+     Mode --> A
+     A --> Hash
+     Hash --> DupCheck
+     DupCheck -- Yes --> A
+     DupCheck -- No --> D
+     D --> E
+     E --> F
+     F --> G
+     G --> H
+     H --> TaxMode
+     TaxMode -- Automatic --> I1
+     TaxMode -- Intelligent --> I2
+     I1 --> J
+     I2 --> J
+     J --> K
+     K --> L
+     L --> M
+     M --> N
+     N --> O
+     O --> P
+     P --> Q
+     Q --> R
+     R -- Yes --> S
+     R -- No --> T
+     T --> V
+     S --> U
+     U --> V
+     V --> W
+     W --> X
+     X --> Y
+     Y --> Z
+ ```
+
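+ The query-side boxes in the flowchart map directly onto RAGLite calls. Below is a condensed, hypothetical `answer()` helper that stitches together `perform_search` and the answer-generation step from `app.py`:
+
+ ```python
+ from raglite import hybrid_search, retrieve_chunks, rerank_chunks, rag
+
+ def answer(query: str, config, system_prompt: str, history: list[dict]):
+     # Hybrid search: semantic + keyword matching over the ingested chunks (top 10 chunk ids)
+     chunk_ids, _scores = hybrid_search(query, num_results=10, config=config)
+     if not chunk_ids:
+         # No relevant documents: fall back to general knowledge (handle_fallback in app.py)
+         return handle_fallback(query), None
+
+     # Fetch the chunk bodies and re-rank them with Cohere
+     chunks = retrieve_chunks(chunk_ids, config=config)
+     reranked = rerank_chunks(query, chunks, config=config)
+
+     # Stream the grounded answer; the top reranked chunk feeds the "Top Matched Source" expander
+     stream = rag(prompt=query, system_prompt=system_prompt, search=hybrid_search,
+                  messages=history, max_contexts=5, config=config)
+     return "".join(stream), reranked[0]
+ ```
+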
+ ## Prerequisites
+
+ You'll need the following:
+
+ 1. **API Keys**:
+    - [OpenAI API key](https://platform.openai.com/api-keys) for:
+      - GPT-4o model (chat completions)
+      - text-embedding-3-large (embeddings)
+      - GPT-4o-mini (intelligent taxonomy extraction)
+    - [Cohere API key](https://dashboard.cohere.com/api-keys) for reranking
+
+ 2. **Database Setup** (optional):
+    - Default: SQLite (no setup required)
+    - Alternatively, use any SQLAlchemy-compatible database (example URLs below)
+
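+ The database URL follows standard SQLAlchemy conventions. Two illustrative values (the PostgreSQL credentials are placeholders) that could go in the sidebar's "Database URL" field:
+
+ ```python
+ # Default: a local SQLite file, created automatically
+ db_url = "sqlite:///raglite.sqlite"
+
+ # Alternative: PostgreSQL (requires psycopg2-binary, already listed in requirements.txt)
+ db_url = "postgresql://user:password@localhost:5432/raglite"
+ ```
+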
+ ## Installation
+
+ 1. **Install Dependencies**:
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 2. **Required System Dependencies**:
+    - Install both pypandoc and Pandoc via conda:
+    ```bash
+    conda install -c conda-forge pypandoc pandoc
+    ```
+
+ ## Usage
+
+ 1. **Start the Application**:
+    ```bash
+    streamlit run app.py
+    ```
+
+ 2. **Configure the Application**:
+    - Enter your OpenAI API key
+    - Enter your Cohere API key
+    - Configure the database URL (optional, defaults to SQLite)
+    - Select the taxonomy extraction mode (Automatic or Intelligent)
+    - Click "Save Configuration"
+
+ 3. **Upload Documents**:
+    - Upload PDF legal documents
+    - The system will automatically:
+      - Process documents page by page
+      - Extract legal taxonomy keywords based on the selected mode
+      - Create searchable chunks with metadata
+      - Display processing progress
+
+ 4. **Ask Questions**:
+    - Ask questions about your legal documents
+    - View source information, including:
+      - Original document and page number
+      - Extracted taxonomy keywords (exact matches and related concepts in Intelligent mode)
+      - PDF page preview
+    - The system automatically falls back to general knowledge for non-document questions
+
+ ## Legal Taxonomy
+
+ The system includes built-in recognition for over 100 legal concepts across various categories:
+ - Core Legal Areas (e.g., contract law, tort law, criminal law)
+ - Legal Processes & Procedures (e.g., civil procedure, arbitration)
+ - Legal Concepts & Principles (e.g., due process, liability)
+ - Rights & Protections (e.g., civil rights, privacy rights)
+ - Business & Commercial (e.g., securities regulation, intellectual property)
+ - Property & Real Estate (e.g., zoning, land use)
+ - Criminal Justice (e.g., felony, probable cause)
+ - Specialized Areas (e.g., healthcare law, cyber law)
+ - Government & Public Law (e.g., administrative law, regulatory compliance)
+ - Alternative Dispute Resolution (e.g., mediation, arbitration)
+
+ See `templates/taxonomy.j2` for the complete list of supported taxonomy keywords.
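+
+ Each processed page is stored with a small metadata header that the chat UI later parses to display the source document, page number, and taxonomy keywords. The layout, as documented in `parse_chunk_text` in `app.py`, is:
+
+ ```
+ ===PAGE_INFO===
+ Document: <doc_name>
+ DocHash: <doc_hash>
+ Page: <page_number>
+ Taxonomy: <exact_matches> | Related: <related_keywords>
+ ===CONTENT===
+ <actual page content>
+ ```
+
+ In Automatic mode the `Taxonomy:` line contains only the exact regex matches (or `None`); the `| Related: ...` part is added in Intelligent mode when related concepts are suggested.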
TISSA-BANDARA-RANDENIYA-and-THE-BOARD-OF-DIRECTORS-OF-THE-CO-OPERATIVE-W.pdf.pdf ADDED
Binary file (112 kB).
 
Y.-B.-PUSSADENIYA-ASSISTANT-COMMISSIONER-OF-LOCAL-GOVERNMENT-Petitioner-and-O.pdf.pdf ADDED
Binary file (100 kB).
 
app.py ADDED
@@ -0,0 +1,466 @@
1
+ import os
2
+ import re
3
+ import json
4
+ import hashlib
5
+ import logging
6
+ import streamlit as st
7
+ import PyPDF2
8
+ from raglite import RAGLiteConfig, insert_document, hybrid_search, retrieve_chunks, rerank_chunks, rag
9
+ from rerankers import Reranker
10
+ from typing import List
11
+ from pathlib import Path
12
+ import openai
13
+ import time
14
+ import warnings
15
+ from jinja2 import Environment, FileSystemLoader
16
+ from pdf2image import convert_from_bytes
17
+
18
+ # Setup logging and ignore specific warnings.
19
+ logging.basicConfig(level=logging.INFO)
20
+ logger = logging.getLogger(__name__)
21
+ warnings.filterwarnings("ignore", message=".*torch.classes.*")
22
+
23
+ # Initialize Jinja2 environment
24
+ jinja_env = Environment(loader=FileSystemLoader('templates'))
25
+
26
+ # Load templates
27
+ prompts_template = jinja_env.get_template('prompts.j2')
28
+ taxonomy_template = jinja_env.get_template('taxonomy.j2')
29
+
30
+ # Render templates to get variables
31
+ template_vars = {}
32
+ exec(prompts_template.render(), template_vars)
33
+ exec(taxonomy_template.render(), template_vars)
34
+
35
+ # Extract variables from templates
36
+ RAG_SYSTEM_PROMPT = template_vars['RAG_SYSTEM_PROMPT'].strip()
37
+ INTELLIGENT_EXTRACTION_PROMPT = template_vars['INTELLIGENT_EXTRACTION_PROMPT'].strip(
38
+ )
39
+ FALLBACK_SYSTEM_PROMPT = template_vars['FALLBACK_SYSTEM_PROMPT'].strip()
40
+ LEGAL_TAXONOMY_KEYWORDS = template_vars['LEGAL_TAXONOMY_KEYWORDS']
41
+
42
+ # ------------------------------------------
43
+ # 1. Predefined Legal Taxonomy
44
+ # ------------------------------------------
45
+
46
+ # ------------------------------------------
47
+ # 2. Automatic Taxonomy Extraction (Regex-based)
48
+ # ------------------------------------------
49
+
50
+
51
+ def extract_taxonomy_keywords_automatic(text: str, taxonomy: list) -> list:
52
+ """
53
+ Return a list of taxonomy keywords that appear in the text using regex matching.
54
+ """
55
+ found_keywords = []
56
+ for keyword in taxonomy:
57
+ pattern = r'\b' + re.escape(keyword) + r'\b'
58
+ if re.search(pattern, text, flags=re.IGNORECASE):
59
+ found_keywords.append(keyword)
60
+ return found_keywords
61
+
62
+ # ------------------------------------------
63
+ # 3. Intelligent Taxonomy Extraction (LLM-based)
64
+ # ------------------------------------------
65
+
66
+
67
+ def extract_taxonomy_keywords_intelligent(text: str, taxonomy: list) -> tuple:
68
+ """
69
+ Uses GPT-4o-mini to extract taxonomy keywords from the page content.
70
+ The assistant is provided with both the page content and the list of legal taxonomy keywords.
71
+ It returns a tuple (exact_matches, related_keywords) where:
72
+ - exact_matches: a list of keywords that exactly appear in the content (if any)
73
+ - related_keywords: a list of 5 highly relevant taxonomy keywords.
74
+ If no exact matches are found, only related_keywords are provided.
75
+ """
76
+ try:
77
+ client = openai.OpenAI(
78
+ api_key=st.session_state.user_env["OPENAI_API_KEY"])
79
+ system_prompt = INTELLIGENT_EXTRACTION_PROMPT
80
+ user_prompt = f"Taxonomy keywords: {', '.join(taxonomy)}\n\nPage content:\n{text}"
81
+ response = client.chat.completions.create(
82
+ model="gpt-4o-mini",
83
+ messages=[
84
+ {"role": "system", "content": system_prompt},
85
+ {"role": "user", "content": user_prompt}
86
+ ],
87
+ max_tokens=1024,
88
+ temperature=0.7
89
+ )
90
+ result_text = response.choices[0].message.content.strip()
91
+ logger.info("LLM extraction result: " + result_text)
92
+ try:
93
+ data = json.loads(result_text)
94
+ except Exception as parse_error:
95
+ logger.error(
96
+ "JSON parsing error in intelligent extraction: " + str(parse_error))
97
+ logger.error("LLM result was: " + result_text)
98
+ return ([], [])
99
+ exact_matches = data.get("exact_matches", [])
100
+ related_keywords = data.get("related_keywords", [])
101
+ logger.info(f"Exact matches: {exact_matches}")
102
+ logger.info(f"Related keywords: {related_keywords}")
103
+ return (exact_matches, related_keywords)
104
+ except Exception as e:
105
+ logger.error("LLM extraction error: " + str(e))
106
+ return ([], [])
107
+
108
+ # ------------------------------------------
109
+ # 4. Helper Function: Parse Chunk Text for Metadata
110
+ # ------------------------------------------
111
+
112
+
113
+ def parse_chunk_text(chunk_text: str):
114
+ """
115
+ Expects the chunk text to be formatted as:
116
+
117
+ ===PAGE_INFO===
118
+ Document: <doc_name>
119
+ DocHash: <doc_hash>
120
+ Page: <page_number>
121
+ Taxonomy: <header_line>
122
+ ===CONTENT===
123
+ <actual page content>
124
+
125
+ For Intelligent mode, header_line may be formatted as:
126
+ <exact_matches> | Related: <related_keywords>
127
+
128
+ Returns a tuple: (doc_name, doc_hash, page_number, taxonomy_info, actual_content)
129
+ taxonomy_info is returned as a string.
130
+ """
131
+ doc_name = "Unknown"
132
+ doc_hash = "Unknown"
133
+ page_num = "Unknown"
134
+ taxonomy_info = ""
135
+ content = chunk_text
136
+ if chunk_text.startswith("===PAGE_INFO==="):
137
+ parts = chunk_text.split("===CONTENT===")
138
+ if len(parts) >= 2:
139
+ header = parts[0]
140
+ content = "===CONTENT===".join(parts[1:]).strip()
141
+ for line in header.splitlines():
142
+ if line.startswith("Document:"):
143
+ doc_name = line.split("Document:")[1].strip()
144
+ elif line.startswith("DocHash:"):
145
+ doc_hash = line.split("DocHash:")[1].strip()
146
+ elif line.startswith("Page:"):
147
+ page_num = line.split("Page:")[1].strip()
148
+ elif line.startswith("Taxonomy:"):
149
+ taxonomy_info = line.split("Taxonomy:")[1].strip()
150
+ return doc_name, doc_hash, page_num, taxonomy_info, content
151
+
152
+ # ------------------------------------------
153
+ # 5. Configuration Initialization
154
+ # ------------------------------------------
155
+
156
+
157
+ def initialize_config(openai_key: str, cohere_key: str, db_url: str) -> RAGLiteConfig:
158
+ try:
159
+ os.environ["OPENAI_API_KEY"] = openai_key
160
+ os.environ["COHERE_API_KEY"] = cohere_key
161
+ return RAGLiteConfig(
162
+ db_url=db_url,
163
+ llm="gpt-4o",
164
+ embedder="text-embedding-3-large",
165
+ embedder_normalize=True,
166
+ chunk_max_size=8000,
167
+ embedder_sentence_window_size=2,
168
+ reranker=Reranker("cohere", api_key=cohere_key, lang="en")
169
+ )
170
+ except Exception as e:
171
+ raise ValueError(f"Configuration error: {e}")
172
+
173
+ # ------------------------------------------
174
+ # 6. Document Processing: Page-Wise Chunking with Metadata Injection and Progress UI
175
+ # ------------------------------------------
176
+
177
+
178
+ def process_document(file_path: str, doc_hash: str, doc_name: str) -> bool:
179
+ try:
180
+ if not st.session_state.get('my_config'):
181
+ raise ValueError("Configuration not initialized")
182
+
183
+ # Sanitize document name to avoid encoding issues
184
+ doc_name = doc_name.encode('ascii', 'replace').decode('ascii')
185
+
186
+ with open(file_path, "rb") as f:
187
+ pdf_reader = PyPDF2.PdfReader(f)
188
+ num_pages = len(pdf_reader.pages)
189
+ logger.info(f"Processing PDF '{doc_name}' with {num_pages} pages.")
190
+ progress_bar = st.progress(0)
191
+ status_text = st.empty()
192
+
193
+ for page_index in range(num_pages):
194
+ status_text.text(
195
+ f"Processing page {page_index+1} of {num_pages}...")
196
+ with st.spinner(f"Processing page {page_index+1}..."):
197
+ try:
198
+ page = pdf_reader.pages[page_index]
199
+
200
+ # Extract text and handle encoding more robustly
201
+ raw_text = page.extract_text() or ""
202
+
203
+ # Convert text to plain ASCII, replacing non-ASCII characters
204
+ text = raw_text.encode(
205
+ 'ascii', 'replace').decode('ascii')
206
+
207
+ # Remove any remaining problematic characters
208
+ text = ''.join(
209
+ char for char in text if ord(char) < 128)
210
+
211
+ extraction_mode = st.session_state.get(
212
+ "extraction_mode", "Automatic")
213
+
214
+ if extraction_mode == "Intelligent":
215
+ exact_matches, related_keywords = extract_taxonomy_keywords_intelligent(
216
+ text, LEGAL_TAXONOMY_KEYWORDS)
217
+ logger.info(f"Exact matches: {exact_matches}")
218
+ logger.info(
219
+ f"Related keywords: {related_keywords}")
220
+ if exact_matches:
221
+ header_line = f"{', '.join(exact_matches)} | Related: {', '.join(related_keywords)}"
222
+ else:
223
+ header_line = f"{', '.join(related_keywords)}"
224
+ else:
225
+ tax_keywords = extract_taxonomy_keywords_automatic(
226
+ text, LEGAL_TAXONOMY_KEYWORDS)
227
+ header_line = f"{', '.join(tax_keywords) if tax_keywords else 'None'}"
228
+
229
+ # Create safe filename for temporary file
230
+ safe_doc_name = ''.join(
231
+ c for c in doc_name if c.isalnum() or c in ('-', '_'))
232
+ temp_page_file = f"temp_page_{safe_doc_name}_{page_index+1}.txt"
233
+
234
+ # Write the temporary file using ASCII encoding
235
+ with open(temp_page_file, "w", encoding='ascii', errors='replace') as tmp:
236
+ header = (
237
+ "===PAGE_INFO===\n"
238
+ f"Document: {doc_name}\n"
239
+ f"DocHash: {doc_hash}\n"
240
+ f"Page: {page_index+1}\n"
241
+ f"Taxonomy: {header_line}\n"
242
+ "===CONTENT===\n"
243
+ )
244
+ tmp.write(header)
245
+ tmp.write(text)
246
+
247
+ insert_document(Path(temp_page_file),
248
+ config=st.session_state.my_config)
249
+ os.remove(temp_page_file)
250
+ progress_bar.progress((page_index + 1) / num_pages)
251
+
252
+ except Exception as page_error:
253
+ logger.error(
254
+ f"Error processing page {page_index+1}: {str(page_error)}")
255
+ continue
256
+
257
+ status_text.text("Processing complete!")
258
+ return True
259
+
260
+ except Exception as e:
261
+ logger.error(f"Error processing document: {str(e)}")
262
+ return False
263
+
264
+ # ------------------------------------------
265
+ # 7. Search and Fallback Functions
266
+ # ------------------------------------------
267
+
268
+
269
+ def perform_search(query: str) -> List:
270
+ try:
271
+ chunk_ids, scores = hybrid_search(
272
+ query, num_results=10, config=st.session_state.my_config)
273
+ if not chunk_ids:
274
+ return []
275
+ chunks = retrieve_chunks(chunk_ids, config=st.session_state.my_config)
276
+ return rerank_chunks(query, chunks, config=st.session_state.my_config)
277
+ except Exception as e:
278
+ logger.error(f"Search error: {str(e)}")
279
+ return []
280
+
281
+
282
+ def handle_fallback(query: str) -> str:
283
+ try:
284
+ client = openai.OpenAI(
285
+ api_key=st.session_state.user_env["OPENAI_API_KEY"])
286
+ system_prompt = FALLBACK_SYSTEM_PROMPT
287
+ response = client.chat.completions.create(
288
+ model="gpt-4o-mini",
289
+ messages=[
290
+ {"role": "system", "content": system_prompt},
291
+ {"role": "user", "content": query}
292
+ ],
293
+ max_tokens=1024,
294
+ temperature=0.7
295
+ )
296
+ return response.choices[0].message.content
297
+ except Exception as e:
298
+ logger.error(f"Fallback error: {str(e)}")
299
+ st.error(f"Fallback error: {str(e)}")
300
+ return "I apologize, but I encountered an error while processing your request. Please try again."
301
+
302
+ # ------------------------------------------
303
+ # 8. Main Streamlit App
304
+ # ------------------------------------------
305
+
306
+
307
+ def main():
308
+ st.set_page_config(page_title="Innodata - Taxonomy RAG POC", layout="wide")
309
+ for state_var in ['chat_history', 'documents_loaded', 'my_config', 'user_env', 'processed_pdf_hashes', 'pdf_files']:
310
+ if state_var not in st.session_state:
311
+ if state_var == 'chat_history':
312
+ st.session_state[state_var] = []
313
+ elif state_var == 'documents_loaded':
314
+ st.session_state[state_var] = False
315
+ elif state_var == 'my_config':
316
+ st.session_state[state_var] = None
317
+ elif state_var == 'user_env':
318
+ st.session_state[state_var] = {}
319
+ elif state_var == 'processed_pdf_hashes':
320
+ st.session_state[state_var] = set()
321
+ elif state_var == 'pdf_files':
322
+ st.session_state[state_var] = {}
323
+ with st.sidebar:
324
+ st.title("Configuration")
325
+ openai_key = st.text_input("OpenAI API Key", value=st.session_state.get(
326
+ 'openai_key', ''), type="password", placeholder="sk-...")
327
+ cohere_key = st.text_input("Cohere API Key", value=st.session_state.get(
328
+ 'cohere_key', ''), type="password", placeholder="Enter Cohere key")
329
+ db_url = st.text_input("Database URL", value=st.session_state.get(
330
+ 'db_url', 'sqlite:///raglite.sqlite'), placeholder="sqlite:///raglite.sqlite")
331
+ if not st.session_state.documents_loaded:
332
+ extraction_mode = st.radio("Select Taxonomy Extraction Mode", options=[
333
+ "Automatic", "Intelligent"], index=0)
334
+ st.session_state["extraction_mode"] = extraction_mode
335
+ else:
336
+ st.write("Taxonomy Extraction Mode: " +
337
+ st.session_state.get("extraction_mode", "Automatic"))
338
+ if st.button("Save Configuration"):
339
+ try:
340
+ if not all([openai_key, cohere_key, db_url]):
341
+ st.error("All fields are required!")
342
+ return
343
+ st.session_state['openai_key'] = openai_key
344
+ st.session_state['cohere_key'] = cohere_key
345
+ st.session_state['db_url'] = db_url
346
+ st.session_state.my_config = initialize_config(
347
+ openai_key=openai_key, cohere_key=cohere_key, db_url=db_url)
348
+ st.session_state.user_env = {"OPENAI_API_KEY": openai_key}
349
+ st.success("Configuration saved successfully!")
350
+ except Exception as e:
351
+ st.error(f"Configuration error: {str(e)}")
352
+ st.title("Innodata - Taxonomy POC - RAG with Hybrid Search")
353
+ if not st.session_state.documents_loaded:
354
+ uploaded_files = st.file_uploader("Upload PDF legal documents", type=[
355
+ "pdf"], accept_multiple_files=True, key="pdf_uploader")
356
+ if uploaded_files:
357
+ for uploaded_file in uploaded_files:
358
+ file_bytes = uploaded_file.getvalue()
359
+ file_hash = hashlib.md5(file_bytes).hexdigest()
360
+ if file_hash in st.session_state.processed_pdf_hashes:
361
+ st.warning(
362
+ f"'{uploaded_file.name}' has already been uploaded. Skipping duplicate.")
363
+ continue
364
+ else:
365
+ st.session_state.processed_pdf_hashes.add(file_hash)
366
+ st.session_state.pdf_files[file_hash] = file_bytes
367
+ temp_path = f"temp_{uploaded_file.name}"
368
+ with open(temp_path, "wb") as f:
369
+ f.write(file_bytes)
370
+ with st.spinner(f"Processing {uploaded_file.name}..."):
371
+ if process_document(temp_path, file_hash, uploaded_file.name):
372
+ st.success(
373
+ f"Successfully processed: {uploaded_file.name}")
374
+ else:
375
+ st.error(
376
+ f"Failed to process: {uploaded_file.name}")
377
+ os.remove(temp_path)
378
+ st.session_state.documents_loaded = True
379
+ st.success(
380
+ "All documents are ready! You can now ask questions about them.")
381
+ else:
382
+ st.info("Documents already processed. You can ask your questions below.")
383
+ if st.session_state.documents_loaded:
384
+ for msg in st.session_state.chat_history:
385
+ with st.chat_message("user"):
386
+ st.write(msg[0])
387
+ with st.chat_message("assistant"):
388
+ st.write(msg[1])
389
+ user_input = st.chat_input("Ask a question about the documents...")
390
+ if user_input:
391
+ with st.chat_message("user"):
392
+ st.write(user_input)
393
+ with st.chat_message("assistant"):
394
+ message_placeholder = st.empty()
395
+ try:
396
+ reranked_chunks = perform_search(query=user_input)
397
+ if not reranked_chunks or len(reranked_chunks) == 0:
398
+ logger.info(
399
+ "No relevant documents found. Falling back to general LLM.")
400
+ st.info(
401
+ "No relevant documents found. Using general knowledge to answer.")
402
+ full_response = handle_fallback(user_input)
403
+ message_placeholder.markdown(full_response)
404
+ else:
405
+ best_chunk = reranked_chunks[0]
406
+ raw_text = best_chunk.body
407
+ doc_name, doc_hash, page_number, taxonomy_info, content_without_header = parse_chunk_text(
408
+ raw_text)
409
+ formatted_messages = [
410
+ {"role": "user" if i %
411
+ 2 == 0 else "assistant", "content": msg}
412
+ for i, msg in enumerate([m for pair in st.session_state.chat_history for m in pair])
413
+ if msg
414
+ ]
415
+ response_stream = rag(
416
+ prompt=user_input,
417
+ system_prompt=RAG_SYSTEM_PROMPT,
418
+ search=hybrid_search,
419
+ messages=formatted_messages,
420
+ max_contexts=5,
421
+ config=st.session_state.my_config
422
+ )
423
+ full_response = ""
424
+ for chunk in response_stream:
425
+ full_response += chunk
426
+ message_placeholder.markdown(full_response + "▌")
427
+ message_placeholder.markdown(full_response)
428
+ with st.expander("Top Matched Source Information:", expanded=False):
429
+ st.write(f"**Document:** {doc_name}")
430
+ st.write(f"**Page:** {page_number}")
431
+ if st.session_state.get("extraction_mode") == "Intelligent" and "|" in taxonomy_info:
432
+ parts = taxonomy_info.split("|")
433
+ exact_matches = parts[0].strip()
434
+ related_keywords = parts[1].replace(
435
+ "Related:", "").strip()
436
+ st.write(f"**Exact Matches:** {exact_matches}")
437
+ st.write(
438
+ f"**Related Keywords:** {related_keywords}")
439
+ else:
440
+ st.write(
441
+ f"**Taxonomy Keywords:** {taxonomy_info if taxonomy_info else 'None'}")
442
+ if doc_hash in st.session_state.pdf_files:
443
+ pdf_bytes = st.session_state.pdf_files[doc_hash]
444
+ try:
445
+ page_num_int = int(page_number)
446
+ pages = convert_from_bytes(
447
+ pdf_bytes, first_page=page_num_int, last_page=page_num_int)
448
+ if pages:
449
+ st.image(
450
+ pages[0], caption=f"{doc_name} - Page {page_number}")
451
+ except Exception as e:
452
+ st.error(
453
+ "Could not convert PDF page to image: " + str(e))
454
+ st.session_state.chat_history.append(
455
+ (user_input, full_response))
456
+ except Exception as e:
457
+ st.error(f"Error: {str(e)}")
458
+ else:
459
+ if not st.session_state.my_config:
460
+ st.info("Please configure your API keys to get started.")
461
+ else:
462
+ st.info("Please upload some documents to get started.")
463
+
464
+
465
+ if __name__ == "__main__":
466
+ main()
requirements.txt ADDED
@@ -0,0 +1,26 @@
+ # Core dependencies
+ streamlit>=1.31.0
+ raglite==0.2.1
+ pydantic==2.10.1
+ sqlalchemy>=2.0.0
+ openai>=1.0.0
+ cohere>=4.37
+ rerankers==0.6.0
+
+ # PDF processing
+ PyPDF2>=3.0.0
+ pdf2image>=1.16.3
+ poppler-utils
+
+ # Template engine
+ jinja2>=3.1.0
+
+ # Database
+ psycopg2-binary>=2.9.9  # Optional: for PostgreSQL support
+
+ # NLP and text processing
+ spacy>=3.7.0
+ python-dotenv>=1.0.0
+
+ # Download spacy model during deployment
+ en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl
source/flowchartTD.png ADDED
templates/prompts.j2 ADDED
@@ -0,0 +1,25 @@
+ {# RAG System Prompt #}
+ {% set RAG_SYSTEM_PROMPT %}
+ You are a friendly and knowledgeable legal assistant that provides complete and insightful answers.
+ Answer the user's question using only the context provided.
+ When responding, you MUST NOT reference the existence of the context, directly or indirectly.
+ Instead, treat the context as if it were entirely part of your working memory.
+ {% endset %}
+
+ {# Intelligent Extraction System Prompt #}
+ {% set INTELLIGENT_EXTRACTION_PROMPT %}
+ You are a legal taxonomy extraction assistant.
+ Given the following page content and a list of legal taxonomy keywords,
+ identify all keywords from the list that exactly appear in the page content.
+ Then, suggest 5 additional legal taxonomy keywords that are highly relevant to the content.
+ If no exact matches are found, just provide 5 related keywords.
+ Return your answer as a JSON object with two keys: exact_matches and related_keywords.
+ Do not include any extra text.
+ {% endset %}
+
+ {# Fallback System Prompt #}
+ {% set FALLBACK_SYSTEM_PROMPT %}
+ You are a helpful AI assistant. When you don't know something,
+ be honest about it. Provide clear, concise, and accurate responses.
+ If the question is not related to any specific document, use your general knowledge to answer.
+ {% endset %}
templates/taxonomy.j2 ADDED
@@ -0,0 +1,50 @@
+ {# Legal Taxonomy Keywords #}
+ {% set LEGAL_TAXONOMY_KEYWORDS = [
+     # Core Legal Areas
+     "contract law", "tort law", "criminal law", "civil law", "constitutional law",
+     "property law", "family law", "intellectual property", "corporate law", "tax law",
+     "administrative law", "environmental law", "labor law", "immigration law",
+     "bankruptcy law", "securities law", "antitrust law", "international law",
+
+     # Legal Processes & Procedures
+     "civil procedure", "criminal procedure", "evidence", "jurisdiction", "arbitration",
+     "mediation", "litigation", "appeal", "discovery", "pleadings", "injunction",
+     "class action", "settlement", "trial", "hearing", "deposition",
+
+     # Legal Concepts & Principles
+     "due process", "precedent", "statute", "regulation", "liability", "negligence",
+     "damages", "remedy", "standing", "jurisdiction", "venue", "immunity",
+     "consideration", "breach", "fraud", "defamation", "estoppel",
+
+     # Rights & Protections
+     "civil rights", "human rights", "privacy rights", "discrimination",
+     "equal protection", "freedom of speech", "freedom of religion",
+     "right to counsel", "miranda rights", "fourth amendment", "fifth amendment",
+
+     # Business & Commercial
+     "mergers and acquisitions", "securities regulation", "commercial law",
+     "partnership law", "llc law", "agency law", "employment law", "trade law",
+     "consumer protection", "unfair competition", "trademark", "patent", "copyright",
+
+     # Property & Real Estate
+     "real property", "personal property", "easement", "zoning", "land use",
+     "landlord tenant", "mortgage", "title", "deed", "conveyance",
+
+     # Criminal Justice
+     "felony", "misdemeanor", "mens rea", "actus reus", "probable cause",
+     "search and seizure", "self defense", "double jeopardy", "plea bargain",
+
+     # Specialized Areas
+     "healthcare law", "education law", "elder law", "military law", "maritime law",
+     "aviation law", "sports law", "entertainment law", "cyber law", "blockchain law",
+     "data privacy", "artificial intelligence law", "environmental compliance",
+
+     # Government & Public Law
+     "municipal law", "state law", "federal law", "legislative process",
+     "executive power", "judicial review", "administrative procedure",
+     "public policy", "regulatory compliance", "government contracts",
+
+     # Alternative Dispute Resolution
+     "negotiation", "conciliation", "dispute resolution", "binding arbitration",
+     "non-binding arbitration", "mediation agreement", "settlement conference"
+ ] %}