wang.lingxiao committed on
Commit
4f8205f
·
1 Parent(s): e37fd3c
Files changed (3)
  1. README.md +130 -175
  2. __pycache__/app.cpython-313.pyc +0 -0
  3. app.py +236 -1189
README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
- title: Advanced Document to Markdown Converter
3
- emoji: 🚀
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
@@ -14,205 +14,162 @@ tags:
14
  - document-processing
15
  - markdown
16
  - pdf-converter
17
- - ai-analysis
18
- - mcp-server-track
19
- - mcp-server
20
- - nlp
21
- - ocr
22
- short_description: Convert any document to Markdown with AI-powered analysis
23
  ---
24
 
25
- # 🚀 Advanced Document to Markdown Converter
26
 
27
- Convert documents to Markdown format with AI-powered analysis and advanced features.
28
 
29
  ## Features
30
 
31
  ### 📄 Supported Formats
32
- - **PDF** - With OCR support for image-based PDFs
33
- - **Word Documents** (.docx) - Full formatting preservation
34
- - **PowerPoint** (.pptx) - Slide-by-slide conversion
35
- - **Excel** (.xlsx) - Table extraction and formatting
36
- - **Plain Text** (.txt, .md) - Smart formatting detection
37
- - **Rich Text** (.rtf) - Complete formatting support
38
- - **E-books** (.epub) - Chapter and content extraction
39
-
40
- ### 🧠 AI-Powered Features
41
- - **Structure Analysis** - Intelligent document organization
42
- - **Topic Extraction** - Automatic keyword and topic identification
43
- - **Entity Recognition** - Named entity detection and classification
44
- - **Content Summarization** - AI-generated document summaries
45
- - **Smart Heading Detection** - Context-aware heading hierarchy
46
-
47
- ### ⚡ Advanced Capabilities
48
- - **Batch Processing** - Process multiple documents simultaneously
49
- - **OCR Integration** - Extract text from images and scanned documents
50
- - **Custom Templates** - Pre-configured output formats
51
- - **Caching System** - Improved performance for repeated processing
52
- - **Progress Tracking** - Real-time processing status
53
- - **Export Options** - Multiple output formats (MD, HTML, PDF)
54
-
55
- ### 🔧 Technical Features
56
- - **MCP Server** - Model Context Protocol integration
57
- - **Concurrent Processing** - Multi-threaded document handling
58
- - **Memory Optimization** - Efficient large file processing
59
- - **Error Recovery** - Robust error handling and reporting
60
 
61
  ## Usage
62
 
63
- ### Single Document Processing
64
- 1. Upload your document
65
- 2. Configure processing options
66
- 3. Click "Process Document"
67
- 4. View results in multiple tabs
68
-
69
- ### Batch Processing
70
- 1. Upload multiple documents
71
- 2. Enable combination option if needed
72
- 3. Process all documents simultaneously
73
- 4. Export results as needed
74
-
75
- ### MCP Integration
76
- This application can be used as an MCP server with Claude AI:
77
-
78
- ```json
79
- {
80
- "mcpServers": {
81
- "document_converter": {
82
- "command": "npx",
83
- "args": [
84
- "mcp-remote",
85
- "https://YOUR-SPACE-URL/gradio_api/mcp/sse",
86
- "--transport",
87
- "sse-only"
88
- ]
89
- }
90
- }
91
- }
92
- ```
93
 
94
  ## Installation
95
 
96
  ### Local Development
97
  ```bash
98
- git clone https://huggingface.co/spaces/YOUR-USERNAME/advanced-document-converter
99
- cd advanced-document-converter
100
- pip install -r requirements.txt
101
- python app.py
102
- ```
103
-
104
- ### Docker Deployment
105
- ```dockerfile
106
- FROM python:3.11-slim
107
-
108
- WORKDIR /app
109
- COPY requirements.txt .
110
- RUN pip install -r requirements.txt
111
 
112
- # Install system dependencies for OCR
113
- RUN apt-get update && apt-get install -y \
114
- tesseract-ocr \
115
- tesseract-ocr-eng \
116
- && rm -rf /var/lib/apt/lists/*
117
-
118
- COPY . .
119
- EXPOSE 7860
120
 
121
- CMD ["python", "app.py"]
 
122
  ```
123
 
124
- ## API Documentation
125
-
126
- ### Core Functions
127
-
128
- #### `process_document(file_path, options)`
129
- Process a single document and convert to Markdown.
130
-
131
- **Parameters:**
132
- - `file_path` (str): Path to the document file
133
- - `options` (dict): Processing configuration
134
- - `enable_ai_analysis` (bool): Enable AI-powered analysis
135
- - `include_frontmatter` (bool): Add YAML frontmatter
136
- - `generate_toc` (bool): Generate table of contents
137
- - `use_cache` (bool): Enable result caching
138
-
139
- **Returns:**
140
- - Dictionary with markdown content, structure analysis, and metadata
141
-
142
- #### `process_multiple_documents(file_paths, options)`
143
- Process multiple documents concurrently.
144
-
145
- **Parameters:**
146
- - `file_paths` (list): List of file paths
147
- - `options` (dict): Processing configuration
148
- - `combine_documents` (bool): Merge into single document
149
- - Additional options from single document processing
150
-
151
- **Returns:**
152
- - Dictionary with results for each document and optional combined output
153
-
154
- ### MCP Functions
155
-
156
- #### `extract_document_to_md_process_document`
157
- MCP-compatible function for document processing.
158
-
159
- **Parameters:**
160
- - `file_path` (str): HTTP/HTTPS URL to document
161
- - `show_prev` (bool): Return preview only
162
- - `show_struct` (bool): Include structure analysis
163
-
164
- ## Configuration
165
-
166
- ### Environment Variables
167
- - `MAX_FILE_SIZE_MB` - Maximum file size limit (default: 50)
168
- - `CACHE_DIR` - Directory for cached results
169
- - `WORKERS` - Number of concurrent workers
170
- - `ENABLE_OCR` - Enable OCR processing by default
171
-
172
- ### Processing Options
173
- - **AI Analysis**: Uses spaCy NLP models for advanced text analysis
174
- - **OCR**: Tesseract-based optical character recognition
175
- - **Caching**: Redis-compatible caching for improved performance
176
-
177
- ## Dependencies
178
-
179
- ### Core Requirements
180
  - `gradio>=4.0.0` - Web interface framework
181
  - `python-docx>=1.1.0` - Word document processing
182
- - `PyMuPDF>=1.23.0` - PDF processing
183
- - `python-pptx>=0.6.21` - PowerPoint processing
184
- - `openpyxl>=3.1.0` - Excel file processing
185
 
186
- ### AI/ML Requirements
187
- - `spacy>=3.7.0` - Natural language processing
188
- - `pytesseract>=0.3.10` - OCR capabilities
189
- - `transformers>=4.30.0` - Advanced AI models
190
 
191
- ### Optional Features
192
- - `matplotlib>=3.7.0` - Visualization capabilities
193
- - `pandas>=2.0.0` - Data processing
194
- - `scikit-learn>=1.3.0` - Machine learning features
 
195
 
196
- ## Performance
197
 
198
- ### Benchmarks
199
- - **Small files** (<1MB): ~2-5 seconds
200
- - **Medium files** (1-10MB): ~10-30 seconds
201
- - **Large files** (10-50MB): ~30-120 seconds
202
- - **Batch processing**: Linear scaling with concurrent workers
203
 
204
- ### Memory Usage
205
- - **Base memory**: ~200MB
206
- - **Per document**: ~50-100MB additional
207
- - **OCR processing**: +200-500MB peak usage
208
 
209
  ## Contributing
210
 
211
  1. Fork the repository
212
- 2. Create feature branch: `git checkout -b feature-name`
213
- 3. Commit changes: `git commit -am 'Add feature'`
214
- 4. Push to branch: `git push origin feature-name`
215
- 5. Submit pull request
216
 
217
  ## License
218
 
@@ -220,10 +177,8 @@ MIT License - see LICENSE file for details.
220
 
221
  ## Support
222
 
223
- - **Issues**: Report bugs and feature requests on GitHub
224
- - **Documentation**: Full API documentation available
225
- - **Community**: Join discussions in the Community tab
226
 
227
  ---
228
 
229
- *Built with ❤️ using Gradio, spaCy, and various document processing libraries*
 
1
  ---
2
+ title: Document to Markdown Converter
3
+ emoji: 📄
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
 
14
  - document-processing
15
  - markdown
16
  - pdf-converter
17
+ - text-extraction
18
+ short_description: Convert PDF and DOCX documents to Markdown format
19
  ---
20
 
21
+ # 📄 Document to Markdown Converter
22
 
23
+ Convert PDF and DOCX documents to Markdown format with intelligent structure analysis.
24
 
25
  ## Features
26
 
27
  ### 📄 Supported Formats
28
+ - **PDF** - Extract text with formatting preservation
29
+ - **Word Documents** (.docx) - Full formatting and structure conversion
30
+
31
+ ### 🧠 Smart Processing
32
+ - **Heading Detection** - Automatically detect headings based on styles and formatting
33
+ - **Table Extraction** - Convert tables to Markdown format
34
+ - **List Processing** - Preserve ordered and unordered lists
35
+ - **Inline Formatting** - Maintain bold, italic, and other text formatting
36
+ - **Structure Analysis** - Detailed document structure statistics
37
+
38
+ ### ⚡ Key Capabilities
39
+ - **Font-based Heading Detection** - Uses font size and styling to identify headings
40
+ - **Style Recognition** - Recognizes Word document styles (Title, Heading 1-6)
41
+ - **Table Conversion** - Converts complex tables to Markdown table format
42
+ - **List Recognition** - Identifies and converts various list formats
43
+ - **Text Formatting** - Preserves bold, italic formatting in Markdown syntax
44
 
45
  ## Usage
46
 
47
+ ### Basic Processing
48
+ 1. Upload a PDF or DOCX file
49
+ 2. Click "Convert to Markdown"
50
+ 3. View the converted Markdown in the output tab
51
+
52
+ ### Options
53
+ - **Structure Analysis**: Enable to see detailed document statistics
54
+ - **Preview Mode**: Show only the first 500 characters for quick preview
55
+
56
+ ### Output Tabs
57
+ - **Markdown Output**: The complete converted Markdown text
58
+ - **Structure Analysis**: Statistics about headings, lists, tables, etc.
59
+ - **File Information**: Basic file details (name, type, size)
60
+
61
+ ## Technical Details
62
+
63
+ ### PDF Processing
64
+ - Uses PyMuPDF (fitz) for text extraction
65
+ - Analyzes font sizes to determine heading hierarchy (see the sketch below)
66
+ - Preserves text formatting flags (bold, italic)
67
+ - Processes text blocks while maintaining structure
68
+
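For illustration, a minimal sketch of the font-size heuristic described above. The helper name and the size thresholds here are assumptions for this example, not necessarily the app's exact values:

```python
import fitz  # PyMuPDF

def sketch_pdf_headings(pdf_path: str) -> str:
    """Map PDF span font sizes to Markdown heading levels (illustrative thresholds)."""
    out = []
    doc = fitz.open(pdf_path)
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            if block.get("type") != 0:  # skip image/non-text blocks
                continue
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    text = span.get("text", "").strip()
                    if not text:
                        continue
                    size = span.get("size", 12)
                    # Larger fonts become higher-level headings
                    if size >= 20:
                        out.append(f"# {text}")
                    elif size >= 16:
                        out.append(f"## {text}")
                    else:
                        out.append(text)
    doc.close()
    return "\n\n".join(out)
```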
69
+ ### DOCX Processing
70
+ - Uses python-docx for document parsing
71
+ - Recognizes built-in Word styles (sketched below)
72
+ - Extracts tables with proper formatting
73
+ - Maintains paragraph-level formatting
74
+
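A comparable sketch for the style-based DOCX path, relying on python-docx's `paragraph.style.name`. The helper is hypothetical; the app's real conversion also handles tables and inline run formatting:

```python
import re
import docx  # python-docx

def sketch_docx_headings(docx_path: str) -> str:
    """Convert Word 'Title' and 'Heading N' styles to Markdown heading markers."""
    out = []
    for para in docx.Document(docx_path).paragraphs:
        text = para.text.strip()
        if not text:
            continue
        style = para.style.name if para.style else "Normal"
        match = re.match(r"Heading (\d)", style)
        if style == "Title":
            out.append(f"# {text}")
        elif match:
            out.append("#" * int(match.group(1)) + f" {text}")
        else:
            out.append(text)
    return "\n\n".join(out)
```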
75
+ ### Structure Analysis
76
+ The application analyzes:
77
+ - **Headings**: Count by level (H1-H6)
78
+ - **Lists**: Ordered vs unordered list items
79
+ - **Tables**: Number of tables detected
80
+ - **Paragraphs**: Regular text paragraphs
81
+ - **Formatting**: Bold and italic text occurrences
82
+ - **Statistics**: Word count, character count, total lines
83
 
84
  ## Installation
85
 
86
  ### Local Development
87
  ```bash
88
+ # Clone the repository
89
+ git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
90
+ cd document-to-markdown-converter
91
 
92
+ # Install dependencies
93
+ pip install -r requirements.txt
94
 
95
+ # Run the application
96
+ python app.py
97
  ```
98
 
99
+ ### Dependencies
100
  - `gradio>=4.0.0` - Web interface framework
101
  - `python-docx>=1.1.0` - Word document processing
102
+ - `PyMuPDF>=1.23.0` - PDF processing library
103
+
104
+ ## API
105
+
106
+ ### Core Function
107
+ ```python
108
+ def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
109
+ """
110
+ Extract document content and convert to Markdown format
111
+
112
+ Args:
113
+ file_path: Path to PDF or DOCX file
114
+
115
+ Returns:
116
+ Dictionary containing:
117
+ - success: Boolean indicating success
118
+ - markdown: Converted Markdown content
119
+ - structure: Document structure analysis
120
+ - file_info: File metadata (name, type, size)
121
+ - preview: Short preview of content
122
+ - error: Error message if processing failed
123
+ """
124
+ ```
125
+
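A hedged usage example of the function above ("report.docx" is a placeholder path):

```python
result = extract_document_to_markdown("report.docx")
if result["success"]:
    print(result["markdown"][:500])  # converted Markdown
    print(result["structure"]["word_count"], "words")
else:
    print("Conversion failed:", result["error"])
```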
126
+ ### Structure Analysis Output
127
+ ```json
128
+ {
129
+ "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
130
+ "lists": {"ordered": 3, "unordered": 7},
131
+ "tables": 2,
132
+ "paragraphs": 45,
133
+ "bold_text": 12,
134
+ "italic_text": 8,
135
+ "total_lines": 120,
136
+ "word_count": 2500,
137
+ "character_count": 15000
138
+ }
139
+ ```
140
 
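These counts can be approximated from the generated Markdown with simple line-based checks; a condensed sketch of the idea (not the app's full logic, which also tracks tables, bold/italic occurrences, and more):

```python
import re

def sketch_structure_stats(markdown: str) -> dict:
    """Count headings and list items line by line."""
    stats = {"headings": {f"h{i}": 0 for i in range(1, 7)},
             "lists": {"ordered": 0, "unordered": 0}}
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            if level <= 6:
                stats["headings"][f"h{level}"] += 1
        elif re.match(r"^\d+\.\s", line):
            stats["lists"]["ordered"] += 1
        elif re.match(r"^[-*+]\s", line):
            stats["lists"]["unordered"] += 1
    stats["word_count"] = len(markdown.split())
    stats["character_count"] = len(markdown)
    return stats
```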
141
+ ## Examples
142
 
143
+ ### Converting a PDF
144
+ 1. Upload a PDF file
145
+ 2. The application will:
146
+ - Extract text from each page
147
+ - Detect headings based on font size
148
+ - Preserve bold/italic formatting
149
+ - Convert to clean Markdown
150
 
151
+ ### Converting a DOCX
152
+ 1. Upload a Word document
153
+ 2. The application will:
154
+ - Parse document styles
155
+ - Convert headings based on style names
156
+ - Extract and format tables
157
+ - Maintain list structures
158
 
159
+ ## Limitations
160
 
161
+ - **OCR**: Does not perform OCR on image-based PDFs
162
+ - **Complex Layouts**: May not perfectly preserve complex document layouts
163
+ - **Images**: Does not extract or convert embedded images
164
+ - **Fonts**: Limited font analysis for PDFs
165
 
166
  ## Contributing
167
 
168
  1. Fork the repository
169
+ 2. Create a feature branch
170
+ 3. Make your changes
171
+ 4. Test thoroughly
172
+ 5. Submit a pull request
173
 
174
  ## License
175
 
 
177
 
178
  ## Support
179
 
180
+ For issues and feature requests, please use the Community tab or create an issue on GitHub.
181
 
182
  ---
183
 
184
+ *Built with ❤️ using Gradio, python-docx, and PyMuPDF*
__pycache__/app.cpython-313.pyc CHANGED
Binary files a/__pycache__/app.cpython-313.pyc and b/__pycache__/app.cpython-313.pyc differ
 
app.py CHANGED
@@ -1,452 +1,40 @@
1
  import gradio as gr
2
  import re
 
3
  import os
4
- import io
5
- import json
6
- import hashlib
7
- import zipfile
8
- import tempfile
9
- from datetime import datetime
10
- from typing import Dict, Any, Optional, List, Tuple
11
  from pathlib import Path
12
- from concurrent.futures import ThreadPoolExecutor, as_completed
13
- import threading
14
- import time
15
-
16
- # Import dependencies with fallbacks
17
- DEPENDENCIES = {
18
- "docx": {"available": False, "module": None},
19
- "pdf": {"available": False, "module": None},
20
- "pptx": {"available": False, "module": None},
21
- "xlsx": {"available": False, "module": None},
22
- "ocr": {"available": False, "module": None},
23
- "nlp": {"available": False, "module": None},
24
- "epub": {"available": False, "module": None},
25
- "rtf": {"available": False, "module": None},
26
- }
27
-
28
- # Try importing all dependencies
29
  try:
30
  import docx
31
 
32
- DEPENDENCIES["docx"] = {"available": True, "module": docx}
33
  except ImportError:
34
- pass
35
 
36
  try:
37
  import fitz # PyMuPDF
38
 
39
- DEPENDENCIES["pdf"] = {"available": True, "module": fitz}
40
- except ImportError:
41
- pass
42
-
43
- try:
44
- from pptx import Presentation
45
-
46
- DEPENDENCIES["pptx"] = {"available": True, "module": Presentation}
47
- except ImportError:
48
- pass
49
-
50
- try:
51
- import openpyxl
52
-
53
- DEPENDENCIES["xlsx"] = {"available": True, "module": openpyxl}
54
- except ImportError:
55
- pass
56
-
57
- try:
58
- import pytesseract
59
- from PIL import Image
60
-
61
- DEPENDENCIES["ocr"] = {"available": True, "module": (pytesseract, Image)}
62
- except ImportError:
63
- pass
64
-
65
- try:
66
- import spacy
67
-
68
- DEPENDENCIES["nlp"] = {"available": True, "module": spacy}
69
- except ImportError:
70
- pass
71
-
72
- try:
73
- import ebooklib
74
- from ebooklib import epub
75
-
76
- DEPENDENCIES["epub"] = {"available": True, "module": (ebooklib, epub)}
77
  except ImportError:
78
- pass
79
-
80
- try:
81
- from striprtf.striprtf import rtf_to_text
82
-
83
- DEPENDENCIES["rtf"] = {"available": True, "module": rtf_to_text}
84
- except ImportError:
85
- pass
86
-
87
-
88
- class ProgressTracker:
89
- """Thread-safe progress tracking"""
90
-
91
- def __init__(self):
92
- self.current = 0
93
- self.total = 100
94
- self.status = "Ready"
95
- self.lock = threading.Lock()
96
-
97
- def update(self, current: int, total: int, status: str):
98
- with self.lock:
99
- self.current = current
100
- self.total = total
101
- self.status = status
102
-
103
- def get_progress(self) -> Tuple[int, str]:
104
- with self.lock:
105
- progress = int((self.current / self.total) * 100) if self.total > 0 else 0
106
- return progress, self.status
107
-
108
-
109
- class DocumentCache:
110
- """Simple file-based cache for processed documents"""
111
-
112
- def __init__(self, cache_dir: str = "/tmp/doc_cache"):
113
- self.cache_dir = Path(cache_dir)
114
- self.cache_dir.mkdir(exist_ok=True)
115
-
116
- def _get_file_hash(self, file_path: str) -> str:
117
- """Generate hash for file content"""
118
- hasher = hashlib.md5()
119
- with open(file_path, "rb") as f:
120
- for chunk in iter(lambda: f.read(4096), b""):
121
- hasher.update(chunk)
122
- return hasher.hexdigest()
123
-
124
- def get(self, file_path: str) -> Optional[Dict]:
125
- """Get cached result if available"""
126
- try:
127
- file_hash = self._get_file_hash(file_path)
128
- cache_file = self.cache_dir / f"{file_hash}.json"
129
- if cache_file.exists():
130
- with open(cache_file, "r", encoding="utf-8") as f:
131
- return json.load(f)
132
- except Exception:
133
- pass
134
- return None
135
-
136
- def set(self, file_path: str, result: Dict):
137
- """Cache the result"""
138
- try:
139
- file_hash = self._get_file_hash(file_path)
140
- cache_file = self.cache_dir / f"{file_hash}.json"
141
- with open(cache_file, "w", encoding="utf-8") as f:
142
- json.dump(result, f, ensure_ascii=False, indent=2)
143
- except Exception:
144
- pass
145
-
146
-
147
- class AIContentAnalyzer:
148
- """AI-powered content analysis and structuring"""
149
-
150
- def __init__(self):
151
- self.nlp = None
152
- if DEPENDENCIES["nlp"]["available"]:
153
- try:
154
- self.nlp = spacy.load("en_core_web_sm")
155
- except OSError:
156
- pass
157
-
158
- def analyze_structure(self, text: str) -> Dict[str, Any]:
159
- """Analyze document structure using NLP"""
160
- if not self.nlp:
161
- return self._basic_structure_analysis(text)
162
-
163
- doc = self.nlp(text)
164
-
165
- # Extract entities, topics, and structure
166
- entities = [(ent.text, ent.label_) for ent in doc.ents]
167
- sentences = [sent.text.strip() for sent in doc.sents]
168
-
169
- # Identify potential headings based on sentence structure
170
- potential_headings = []
171
- for sent in sentences:
172
- if (
173
- len(sent.split()) <= 10
174
- and sent[0].isupper()
175
- and not sent.endswith(".")
176
- and len(sent) > 5
177
- ):
178
- potential_headings.append(sent)
179
-
180
- return {
181
- "entities": entities[:10], # Top 10 entities
182
- "potential_headings": potential_headings[:20],
183
- "sentence_count": len(sentences),
184
- "avg_sentence_length": sum(len(s.split()) for s in sentences)
185
- / len(sentences)
186
- if sentences
187
- else 0,
188
- "topics": self._extract_topics(doc),
189
- }
190
-
191
- def _basic_structure_analysis(self, text: str) -> Dict[str, Any]:
192
- """Basic structure analysis without NLP"""
193
- lines = text.split("\n")
194
- sentences = re.split(r"[.!?]+", text)
195
-
196
- return {
197
- "entities": [],
198
- "potential_headings": [
199
- line.strip()
200
- for line in lines
201
- if len(line.strip().split()) <= 10 and line.strip()
202
- ],
203
- "sentence_count": len([s for s in sentences if s.strip()]),
204
- "avg_sentence_length": sum(len(s.split()) for s in sentences if s.strip())
205
- / len(sentences)
206
- if sentences
207
- else 0,
208
- "topics": [],
209
- }
210
-
211
- def _extract_topics(self, doc) -> List[str]:
212
- """Extract main topics from document"""
213
- # Simple topic extraction based on noun phrases
214
- topics = []
215
- for chunk in doc.noun_chunks:
216
- if len(chunk.text.split()) <= 3 and chunk.text.lower() not in [
217
- "the",
218
- "a",
219
- "an",
220
- ]:
221
- topics.append(chunk.text)
222
- return list(set(topics))[:10]
223
-
224
- def generate_summary(self, text: str, max_length: int = 200) -> str:
225
- """Generate document summary"""
226
- sentences = re.split(r"[.!?]+", text)
227
- sentences = [s.strip() for s in sentences if s.strip() and len(s.split()) > 5]
228
-
229
- if not sentences:
230
- return "No content to summarize."
231
-
232
- # Simple extractive summarization - take first few and some middle sentences
233
- summary_sentences = []
234
- if len(sentences) <= 3:
235
- summary_sentences = sentences
236
- else:
237
- summary_sentences.append(sentences[0]) # First sentence
238
- if len(sentences) > 2:
239
- summary_sentences.append(
240
- sentences[len(sentences) // 2]
241
- ) # Middle sentence
242
- summary_sentences.append(sentences[-1]) # Last sentence
243
-
244
- summary = " ".join(summary_sentences)
245
- if len(summary) > max_length:
246
- summary = summary[:max_length] + "..."
247
-
248
- return summary
249
 
250
 
251
- class AdvancedDocumentConverter:
252
- """Advanced document converter with AI features"""
253
 
254
  def __init__(self):
255
- self.progress = ProgressTracker()
256
- self.cache = DocumentCache()
257
- self.ai_analyzer = AIContentAnalyzer()
258
- self.supported_formats = {
259
- ".pdf": self.extract_from_pdf,
260
- ".docx": self.extract_from_docx,
261
- ".pptx": self.extract_from_pptx,
262
- ".xlsx": self.extract_from_xlsx,
263
- ".txt": self.extract_from_txt,
264
- ".md": self.extract_from_txt,
265
- ".rtf": self.extract_from_rtf,
266
- ".epub": self.extract_from_epub,
267
- }
268
-
269
- def process_document(
270
- self, file_path: str, options: Dict[str, Any] = None
271
- ) -> Dict[str, Any]:
272
- """Main document processing function"""
273
- if not options:
274
- options = {}
275
-
276
- # Check cache first
277
- if options.get("use_cache", True):
278
- cached_result = self.cache.get(file_path)
279
- if cached_result:
280
- return cached_result
281
-
282
- self.progress.update(10, 100, "Starting processing...")
283
-
284
- if not os.path.exists(file_path):
285
- return {"error": "File not found", "markdown": "", "structure": {}}
286
-
287
- file_extension = Path(file_path).suffix.lower()
288
-
289
- if file_extension not in self.supported_formats:
290
- return {
291
- "error": f"Unsupported file type: {file_extension}",
292
- "markdown": "",
293
- "structure": {},
294
- }
295
-
296
- try:
297
- self.progress.update(
298
- 30, 100, f"Extracting content from {file_extension} file..."
299
- )
300
-
301
- # Extract content using appropriate method
302
- extractor = self.supported_formats[file_extension]
303
- markdown_content = extractor(file_path)
304
-
305
- self.progress.update(60, 100, "Analyzing document structure...")
306
-
307
- # Enhanced structure analysis
308
- structure = self._analyze_document_structure(markdown_content)
309
-
310
- self.progress.update(80, 100, "Performing AI analysis...")
311
-
312
- # AI-powered analysis
313
- if options.get("enable_ai_analysis", True):
314
- ai_analysis = self.ai_analyzer.analyze_structure(markdown_content)
315
- structure["ai_analysis"] = ai_analysis
316
- structure["summary"] = self.ai_analyzer.generate_summary(
317
- markdown_content
318
- )
319
-
320
- # Generate frontmatter
321
- frontmatter = self._generate_frontmatter(file_path, structure, options)
322
-
323
- # Final markdown with frontmatter
324
- if options.get("include_frontmatter", True):
325
- final_markdown = frontmatter + "\n\n" + markdown_content
326
- else:
327
- final_markdown = markdown_content
328
-
329
- # Create table of contents
330
- if options.get("generate_toc", False):
331
- toc = self._generate_table_of_contents(markdown_content)
332
- final_markdown = toc + "\n\n" + final_markdown
333
-
334
- self.progress.update(100, 100, "Processing complete!")
335
-
336
- result = {
337
- "success": True,
338
- "file_info": {
339
- "name": Path(file_path).name,
340
- "type": file_extension.upper()[1:],
341
- "size_kb": round(os.path.getsize(file_path) / 1024, 2),
342
- "processed_at": datetime.now().isoformat(),
343
- },
344
- "markdown": final_markdown,
345
- "structure": structure,
346
- "frontmatter": frontmatter,
347
- "preview": final_markdown[:800] + "..."
348
- if len(final_markdown) > 800
349
- else final_markdown,
350
- }
351
-
352
- # Cache the result
353
- if options.get("use_cache", True):
354
- self.cache.set(file_path, result)
355
-
356
- return result
357
-
358
- except Exception as e:
359
- return {
360
- "error": f"Error processing file: {str(e)}",
361
- "markdown": "",
362
- "structure": {},
363
- }
364
-
365
- def process_multiple_documents(
366
- self, file_paths: List[str], options: Dict[str, Any] = None
367
- ) -> Dict[str, Any]:
368
- """Process multiple documents concurrently"""
369
- if not file_paths:
370
- return {"error": "No files provided", "results": []}
371
-
372
- results = []
373
- total_files = len(file_paths)
374
-
375
- with ThreadPoolExecutor(max_workers=3) as executor:
376
- # Submit all tasks
377
- future_to_file = {
378
- executor.submit(self.process_document, file_path, options): file_path
379
- for file_path in file_paths
380
- }
381
-
382
- # Process completed tasks
383
- for i, future in enumerate(as_completed(future_to_file)):
384
- file_path = future_to_file[future]
385
- try:
386
- result = future.result()
387
- result["file_path"] = file_path
388
- results.append(result)
389
- except Exception as e:
390
- results.append(
391
- {
392
- "error": f"Failed to process {file_path}: {str(e)}",
393
- "file_path": file_path,
394
- }
395
- )
396
-
397
- # Update progress
398
- self.progress.update(
399
- i + 1, total_files, f"Processed {i + 1}/{total_files} files"
400
- )
401
-
402
- # Generate combined document if requested
403
- combined_markdown = ""
404
- if options and options.get("combine_documents", False):
405
- combined_markdown = self._combine_documents(results)
406
-
407
- return {
408
- "success": True,
409
- "total_files": total_files,
410
- "results": results,
411
- "combined_markdown": combined_markdown,
412
- }
413
-
414
- def extract_from_pdf(self, pdf_path: str) -> str:
415
- """Enhanced PDF extraction with OCR support"""
416
- if not DEPENDENCIES["pdf"]["available"]:
417
- raise ImportError("PyMuPDF not installed. Run: pip install PyMuPDF")
418
-
419
- fitz = DEPENDENCIES["pdf"]["module"]
420
- doc = fitz.open(pdf_path)
421
- markdown_content = []
422
-
423
- for page_num in range(len(doc)):
424
- page = doc.load_page(page_num)
425
-
426
- # Extract text blocks
427
- blocks = page.get_text("dict")
428
- page_markdown = self._convert_pdf_blocks_to_markdown(blocks)
429
-
430
- # OCR on images if text extraction failed
431
- if not page_markdown.strip() and DEPENDENCIES["ocr"]["available"]:
432
- page_markdown = self._ocr_pdf_page(page)
433
-
434
- if page_markdown.strip():
435
- markdown_content.append(f"## Page {page_num + 1}\n\n{page_markdown}")
436
-
437
- doc.close()
438
- return "\n\n---\n\n".join(markdown_content)
439
 
440
  def extract_from_docx(self, docx_path: str) -> str:
441
- """Enhanced DOCX extraction"""
442
- if not DEPENDENCIES["docx"]["available"]:
443
- raise ImportError("python-docx not installed. Run: pip install python-docx")
444
 
445
- docx = DEPENDENCIES["docx"]["module"]
446
  doc = docx.Document(docx_path)
447
  markdown_content = []
448
 
449
- # Process paragraphs with enhanced formatting
450
  for paragraph in doc.paragraphs:
451
  if paragraph.text.strip():
452
  md_text = self._convert_paragraph_to_markdown(paragraph)
@@ -461,223 +49,47 @@ class AdvancedDocumentConverter:
461
 
462
  return "\n\n".join(markdown_content)
463
 
464
- def extract_from_pptx(self, pptx_path: str) -> str:
465
- """Extract content from PowerPoint presentations"""
466
- if not DEPENDENCIES["pptx"]["available"]:
467
- raise ImportError("python-pptx not installed. Run: pip install python-pptx")
468
 
469
- Presentation = DEPENDENCIES["pptx"]["module"]
470
- prs = Presentation(pptx_path)
471
  markdown_content = []
472
 
473
- for i, slide in enumerate(prs.slides):
474
- slide_content = [f"## Slide {i + 1}\n"]
475
 
476
- for shape in slide.shapes:
477
- if hasattr(shape, "text") and shape.text.strip():
478
- # Determine if it's a title or content
479
- if shape == slide.shapes.title:
480
- slide_content.append(f"### {shape.text.strip()}\n")
481
- else:
482
- slide_content.append(f"{shape.text.strip()}\n")
483
 
484
- if len(slide_content) > 1: # More than just the slide header
485
- markdown_content.append("\n".join(slide_content))
 
486
 
 
487
  return "\n\n---\n\n".join(markdown_content)
488
 
489
- def extract_from_xlsx(self, xlsx_path: str) -> str:
490
- """Extract content from Excel files"""
491
- if not DEPENDENCIES["xlsx"]["available"]:
492
- raise ImportError("openpyxl not installed. Run: pip install openpyxl")
493
-
494
- openpyxl = DEPENDENCIES["xlsx"]["module"]
495
- workbook = openpyxl.load_workbook(xlsx_path, data_only=True)
496
- markdown_content = []
497
-
498
- for sheet_name in workbook.sheetnames:
499
- sheet = workbook[sheet_name]
500
- markdown_content.append(f"## {sheet_name}\n")
501
-
502
- # Find the data range
503
- max_row = sheet.max_row
504
- max_col = sheet.max_column
505
-
506
- if max_row > 0 and max_col > 0:
507
- # Create markdown table
508
- table_rows = []
509
- for row in range(1, min(max_row + 1, 101)): # Limit to 100 rows
510
- row_data = []
511
- for col in range(1, max_col + 1):
512
- cell_value = sheet.cell(row=row, column=col).value
513
- row_data.append(
514
- str(cell_value) if cell_value is not None else ""
515
- )
516
-
517
- if any(cell.strip() for cell in row_data): # Skip empty rows
518
- table_rows.append("| " + " | ".join(row_data) + " |")
519
-
520
- if table_rows:
521
- # Add header separator after first row
522
- if len(table_rows) > 1:
523
- separator = "| " + " | ".join(["---"] * max_col) + " |"
524
- table_rows.insert(1, separator)
525
-
526
- markdown_content.append("\n".join(table_rows))
527
-
528
- return "\n\n".join(markdown_content)
529
-
530
- def extract_from_txt(self, txt_path: str) -> str:
531
- """Extract content from text files"""
532
- try:
533
- with open(txt_path, "r", encoding="utf-8") as f:
534
- content = f.read()
535
- except UnicodeDecodeError:
536
- with open(txt_path, "r", encoding="latin-1") as f:
537
- content = f.read()
538
-
539
- # If it's already markdown, return as-is
540
- if txt_path.endswith(".md"):
541
- return content
542
-
543
- # Convert plain text to markdown with basic formatting
544
- lines = content.split("\n")
545
- markdown_lines = []
546
-
547
- for line in lines:
548
- line = line.strip()
549
- if not line:
550
- markdown_lines.append("")
551
- continue
552
-
553
- # Check if line looks like a heading
554
- if (
555
- len(line.split()) <= 8
556
- and (line.isupper() or line.istitle())
557
- and not line.endswith(".")
558
- ):
559
- markdown_lines.append(f"## {line}")
560
- else:
561
- markdown_lines.append(line)
562
-
563
- return "\n".join(markdown_lines)
564
-
565
- def extract_from_rtf(self, rtf_path: str) -> str:
566
- """Extract content from RTF files"""
567
- if not DEPENDENCIES["rtf"]["available"]:
568
- raise ImportError("striprtf not installed. Run: pip install striprtf")
569
-
570
- rtf_to_text = DEPENDENCIES["rtf"]["module"]
571
-
572
- with open(rtf_path, "r", encoding="utf-8") as f:
573
- rtf_content = f.read()
574
-
575
- plain_text = rtf_to_text(rtf_content)
576
- return self.extract_from_txt_content(plain_text)
577
-
578
- def extract_from_epub(self, epub_path: str) -> str:
579
- """Extract content from EPUB files"""
580
- if not DEPENDENCIES["epub"]["available"]:
581
- raise ImportError("ebooklib not installed. Run: pip install ebooklib")
582
-
583
- ebooklib, epub = DEPENDENCIES["epub"]["module"]
584
- book = epub.read_epub(epub_path)
585
-
586
- markdown_content = []
587
-
588
- for item in book.get_items():
589
- if item.get_type() == ebooklib.ITEM_DOCUMENT:
590
- content = item.get_content().decode("utf-8")
591
- # Basic HTML to markdown conversion
592
- text = re.sub(r"<[^>]+>", "", content) # Remove HTML tags
593
- text = re.sub(r"\s+", " ", text).strip() # Clean whitespace
594
-
595
- if text:
596
- markdown_content.append(text)
597
-
598
- return "\n\n".join(markdown_content)
599
-
600
- def _ocr_pdf_page(self, page) -> str:
601
- """Perform OCR on PDF page"""
602
- if not DEPENDENCIES["ocr"]["available"]:
603
- return ""
604
-
605
- pytesseract, Image = DEPENDENCIES["ocr"]["module"]
606
-
607
- try:
608
- # Convert page to image
609
- pix = page.get_pixmap()
610
- img_data = pix.tobytes("png")
611
- image = Image.open(io.BytesIO(img_data))
612
-
613
- # Perform OCR
614
- text = pytesseract.image_to_string(image, lang="eng")
615
- return text.strip()
616
- except Exception:
617
- return ""
618
-
619
- def _convert_pdf_blocks_to_markdown(self, blocks_dict: Dict) -> str:
620
- """Enhanced PDF blocks to markdown conversion"""
621
- markdown_lines = []
622
-
623
- for block in blocks_dict.get("blocks", []):
624
- if block.get("type") == 0: # Text block
625
- for line in block.get("lines", []):
626
- line_text = ""
627
- for span in line.get("spans", []):
628
- text = span.get("text", "").strip()
629
- if text:
630
- font_size = span.get("size", 12)
631
- flags = span.get("flags", 0)
632
-
633
- is_bold = bool(flags & 16)
634
- is_italic = bool(flags & 2)
635
-
636
- # Apply inline formatting
637
- if is_bold and is_italic:
638
- text = f"***{text}***"
639
- elif is_bold:
640
- text = f"**{text}**"
641
- elif is_italic:
642
- text = f"*{text}*"
643
-
644
- # Apply heading formatting based on font size
645
- if font_size >= 20:
646
- text = f"# {text}"
647
- elif font_size >= 18:
648
- text = f"## {text}"
649
- elif font_size >= 16:
650
- text = f"### {text}"
651
- elif font_size >= 14:
652
- text = f"#### {text}"
653
-
654
- line_text += text + " "
655
-
656
- if line_text.strip():
657
- markdown_lines.append(line_text.strip())
658
-
659
- return "\n\n".join(markdown_lines)
660
-
661
  def _convert_paragraph_to_markdown(self, paragraph) -> str:
662
- """Enhanced paragraph to markdown conversion"""
663
  text = paragraph.text.strip()
664
  if not text:
665
  return ""
666
 
667
  style_name = paragraph.style.name if paragraph.style else "Normal"
668
 
669
- # Enhanced formatting detection
670
  is_bold = any(run.bold for run in paragraph.runs if run.bold)
671
- is_italic = any(run.italic for run in paragraph.runs if run.italic)
672
 
673
- # Font size detection
674
  font_size = 12
675
  if paragraph.runs:
676
  first_run = paragraph.runs[0]
677
  if first_run.font.size:
678
  font_size = first_run.font.size.pt
679
 
680
- # Advanced heading detection
681
  if "Title" in style_name or (is_bold and font_size >= 18):
682
  return f"# {text}"
683
  elif "Heading 1" in style_name or (is_bold and font_size >= 16):
@@ -693,114 +105,130 @@ class AdvancedDocumentConverter:
693
  elif "Heading 6" in style_name:
694
  return f"###### {text}"
695
  elif re.match(r"^[\d\w]\.\s|^[β€’\-\*]\s|^\d+\)\s", text):
696
- # Enhanced list detection
697
- if re.match(r"^\d+\.", text):
698
- return f"1. {text[text.find('.') + 1 :].strip()}"
699
  else:
700
- return f"- {text[1:].strip() if text[0] in 'β€’-*' else text}"
 
 
 
 
701
  else:
702
- # Apply inline formatting
703
  formatted_text = self._apply_inline_formatting(paragraph)
704
  return formatted_text
705
 
706
  def _apply_inline_formatting(self, paragraph) -> str:
707
- """Enhanced inline formatting application"""
708
  result = ""
709
  for run in paragraph.runs:
710
  text = run.text
711
-
712
- # Apply multiple formatting
713
  if run.bold and run.italic:
714
  text = f"***{text}***"
715
  elif run.bold:
716
  text = f"**{text}**"
717
  elif run.italic:
718
  text = f"*{text}*"
719
- elif run.underline:
720
- text = f"<u>{text}</u>"
721
-
722
  result += text
723
  return result
724
 
725
  def _convert_table_to_markdown(self, table) -> str:
726
- """Enhanced table conversion with better formatting"""
727
  if not table.rows:
728
  return ""
729
 
730
  markdown_rows = []
731
 
732
  # Process header row
733
- header_cells = []
734
- for cell in table.rows[0].cells:
735
- cell_text = cell.text.strip().replace("\n", " ")
736
- header_cells.append(cell_text if cell_text else "Header")
737
-
738
  markdown_rows.append("| " + " | ".join(header_cells) + " |")
739
  markdown_rows.append("| " + " | ".join(["---"] * len(header_cells)) + " |")
740
 
741
  # Process data rows
742
  for row in table.rows[1:]:
743
- cells = []
744
- for cell in row.cells:
745
- cell_text = cell.text.strip().replace("\n", " ")
746
- cells.append(cell_text if cell_text else " ")
747
  markdown_rows.append("| " + " | ".join(cells) + " |")
748
 
749
  return "\n".join(markdown_rows)
750
 
751
- def _analyze_document_structure(self, markdown_text: str) -> Dict[str, Any]:
752
- """Enhanced document structure analysis"""
 
753
  lines = markdown_text.split("\n")
754
  structure = {
755
  "headings": {"h1": 0, "h2": 0, "h3": 0, "h4": 0, "h5": 0, "h6": 0},
756
  "lists": {"ordered": 0, "unordered": 0},
757
  "tables": 0,
758
  "paragraphs": 0,
759
- "code_blocks": 0,
760
- "links": 0,
761
- "images": 0,
762
  "bold_text": 0,
763
  "italic_text": 0,
764
  "total_lines": len(lines),
765
  "word_count": len(markdown_text.split()),
766
  "character_count": len(markdown_text),
767
- "reading_time_minutes": max(
768
- 1, len(markdown_text.split()) // 200
769
- ), # ~200 WPM
770
  }
771
 
772
  in_table = False
773
- in_code_block = False
774
 
775
  for line in lines:
776
- original_line = line
777
  line = line.strip()
778
  if not line:
779
  continue
780
 
781
- # Code blocks
782
- if line.startswith("```"):
783
- in_code_block = not in_code_block
784
- if in_code_block:
785
- structure["code_blocks"] += 1
786
- continue
787
-
788
- if in_code_block:
789
- continue
790
-
791
- # Headings
792
  if line.startswith("#"):
793
  level = len(line) - len(line.lstrip("#"))
794
  if level <= 6:
795
  structure["headings"][f"h{level}"] += 1
796
 
797
- # Lists
798
  elif re.match(r"^\d+\.\s", line):
799
  structure["lists"]["ordered"] += 1
800
  elif re.match(r"^[\-\*\+]\s", line):
801
  structure["lists"]["unordered"] += 1
802
 
803
- # Tables
804
  elif "|" in line and not in_table:
805
  structure["tables"] += 1
806
  in_table = True
@@ -813,579 +241,198 @@ class AdvancedDocumentConverter:
813
  ):
814
  structure["paragraphs"] += 1
815
 
816
- # Links and images
817
- structure["links"] += len(re.findall(r"\[([^\]]+)\]\([^)]+\)", line))
818
- structure["images"] += len(re.findall(r"!\[([^\]]*)\]\([^)]+\)", line))
819
-
820
- # Formatting
821
  structure["bold_text"] += len(re.findall(r"\*\*[^*]+\*\*", line))
822
  structure["italic_text"] += len(re.findall(r"\*[^*]+\*", line))
823
 
824
  return structure
825
 
826
- def _generate_frontmatter(
827
- self, file_path: str, structure: Dict, options: Dict
828
- ) -> str:
829
- """Generate YAML frontmatter for the document"""
830
- frontmatter_data = {
831
- "title": Path(file_path).stem.replace("_", " ").replace("-", " ").title(),
832
- "created": datetime.now().strftime("%Y-%m-%d"),
833
- "source_file": Path(file_path).name,
834
- "file_type": Path(file_path).suffix[1:].upper(),
835
- "word_count": structure.get("word_count", 0),
836
- "reading_time": f"{structure.get('reading_time_minutes', 1)} min",
837
- "headings": structure.get("headings", {}),
838
- "has_tables": structure.get("tables", 0) > 0,
839
- "has_images": structure.get("images", 0) > 0,
840
- }
841
 
842
- # Add AI analysis if available
843
- if "ai_analysis" in structure:
844
- ai_data = structure["ai_analysis"]
845
- if ai_data.get("entities"):
846
- frontmatter_data["entities"] = [
847
- entity[0] for entity in ai_data["entities"][:5]
848
- ]
849
- if ai_data.get("topics"):
850
- frontmatter_data["topics"] = ai_data["topics"][:5]
851
-
852
- # Add summary if available
853
- if "summary" in structure:
854
- frontmatter_data["summary"] = structure["summary"]
855
-
856
- # Convert to YAML
857
- yaml_lines = ["---"]
858
- for key, value in frontmatter_data.items():
859
- if isinstance(value, dict):
860
- yaml_lines.append(f"{key}:")
861
- for subkey, subvalue in value.items():
862
- yaml_lines.append(f" {subkey}: {subvalue}")
863
- elif isinstance(value, list):
864
- yaml_lines.append(f"{key}:")
865
- for item in value:
866
- yaml_lines.append(f" - {item}")
867
- else:
868
- yaml_lines.append(f"{key}: {value}")
869
- yaml_lines.append("---")
870
 
871
- return "\n".join(yaml_lines)
 
872
 
873
- def _generate_table_of_contents(self, markdown_text: str) -> str:
874
- """Generate table of contents from headings"""
875
- toc_lines = ["## Table of Contents\n"]
876
 
877
- lines = markdown_text.split("\n")
878
- for line in lines:
879
- line = line.strip()
880
- if line.startswith("#"):
881
- # Extract heading level and text
882
- level = len(line) - len(line.lstrip("#"))
883
- heading_text = line.lstrip("#").strip()
884
-
885
- if level <= 4 and heading_text: # Only include up to h4
886
- # Create anchor link
887
- anchor = (
888
- heading_text.lower().replace(" ", "-").replace("[^a-z0-9-]", "")
889
- )
890
- indent = " " * (level - 1)
891
- toc_lines.append(f"{indent}- [{heading_text}](#{anchor})")
892
 
893
- return "\n".join(toc_lines)
 
894
 
895
- def _combine_documents(self, results: List[Dict]) -> str:
896
- """Combine multiple documents into one"""
897
- combined_parts = []
898
-
899
- for i, result in enumerate(results):
900
- if result.get("success") and result.get("markdown"):
901
- file_name = result.get("file_info", {}).get("name", f"Document {i + 1}")
902
- combined_parts.append(f"# {file_name}\n\n{result['markdown']}")
903
 
904
- return "\n\n---\n\n".join(combined_parts)
905
 
906
 
907
- class EnhancedGradioInterface:
908
- """Enhanced Gradio interface with advanced features"""
909
 
910
- def __init__(self):
911
- self.converter = AdvancedDocumentConverter()
912
- self.processing_queue = []
913
-
914
- def create_interface(self):
915
- """Create the enhanced Gradio interface"""
916
-
917
- # Custom CSS for better styling
918
- custom_css = """
919
- .container { max-width: 1200px; margin: auto; }
920
- .upload-area { border: 2px dashed #ccc; border-radius: 10px; padding: 20px; text-align: center; }
921
- .progress-bar { background: linear-gradient(90deg, #4CAF50, #45a049); }
922
- .feature-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(250px, 1fr)); gap: 15px; }
923
- .dependency-status { padding: 10px; border-radius: 5px; margin: 5px 0; }
924
- .available { background-color: #d4edda; color: #155724; }
925
- .unavailable { background-color: #f8d7da; color: #721c24; }
926
- """
927
-
928
- with gr.Blocks(
929
- title="πŸš€ Advanced Document to Markdown Converter",
930
- css=custom_css,
931
- theme=gr.themes.Soft(),
932
- ) as demo:
933
- # Header
934
- gr.Markdown("""
935
- # 🚀 Advanced Document to Markdown Converter
936
-
937
- **Convert any document to Markdown with AI-powered analysis and advanced features**
938
-
939
- Supports: PDF, DOCX, PPTX, XLSX, TXT, MD, RTF, EPUB + OCR for images
940
- """)
941
-
942
- # Dependency status
943
- self._create_dependency_status()
944
-
945
- with gr.Tabs():
946
- # Single Document Tab
947
- with gr.TabItem("πŸ“„ Single Document"):
948
- self._create_single_document_tab()
949
-
950
- # Batch Processing Tab
951
- with gr.TabItem("πŸ“š Batch Processing"):
952
- self._create_batch_processing_tab()
953
-
954
- # Settings Tab
955
- with gr.TabItem("βš™οΈ Settings"):
956
- self._create_settings_tab()
957
-
958
- # Export Tab
959
- with gr.TabItem("πŸ’Ύ Export"):
960
- self._create_export_tab()
961
-
962
- return demo
963
-
964
- def _create_dependency_status(self):
965
- """Create dependency status display"""
966
- with gr.Accordion("πŸ“‹ System Status", open=False):
967
- status_html = "<div class='feature-grid'>"
968
-
969
- for dep_name, dep_info in DEPENDENCIES.items():
970
- status_class = "available" if dep_info["available"] else "unavailable"
971
- status_icon = "βœ…" if dep_info["available"] else "❌"
972
-
973
- feature_map = {
974
- "docx": "Word Documents (.docx)",
975
- "pdf": "PDF Documents (.pdf)",
976
- "pptx": "PowerPoint (.pptx)",
977
- "xlsx": "Excel Files (.xlsx)",
978
- "ocr": "OCR (Image Text Extraction)",
979
- "nlp": "AI Text Analysis",
980
- "epub": "E-books (.epub)",
981
- "rtf": "Rich Text Format (.rtf)",
982
- }
983
 
984
- feature_name = feature_map.get(dep_name, dep_name.upper())
985
- status_html += f"<div class='dependency-status {status_class}'>{status_icon} {feature_name}</div>"
986
 
987
- status_html += "</div>"
988
- gr.HTML(status_html)
 
989
 
990
- def _create_single_document_tab(self):
991
- """Create single document processing tab"""
992
  with gr.Row():
993
  with gr.Column(scale=1):
 
994
  file_input = gr.File(
995
  label="πŸ“Ž Upload Document",
996
- file_types=[
997
- ".pdf",
998
- ".docx",
999
- ".pptx",
1000
- ".xlsx",
1001
- ".txt",
1002
- ".md",
1003
- ".rtf",
1004
- ".epub",
1005
- ],
1006
  type="filepath",
1007
  )
1008
 
1009
- with gr.Accordion("πŸŽ›οΈ Processing Options", open=True):
1010
- enable_ai = gr.Checkbox(label="🧠 Enable AI Analysis", value=True)
1011
- include_frontmatter = gr.Checkbox(
1012
- label="πŸ“‹ Include Frontmatter", value=True
1013
- )
1014
- generate_toc = gr.Checkbox(
1015
- label="πŸ“‘ Generate Table of Contents", value=False
1016
- )
1017
- use_cache = gr.Checkbox(label="⚑ Use Cache", value=True)
1018
-
1019
- process_btn = gr.Button(
1020
- "πŸš€ Process Document", variant="primary", size="lg"
1021
  )
1022
 
1023
- # Progress display
1024
- progress_bar = gr.Progress()
1025
- status_text = gr.Textbox(label="πŸ“Š Status", interactive=False)
 
 
 
 
 
1026
 
1027
  with gr.Column(scale=2):
 
1028
  with gr.Tabs():
1029
  with gr.TabItem("πŸ“ Markdown Output"):
1030
  markdown_output = gr.Textbox(
1031
  label="Generated Markdown",
1032
- lines=25,
1033
- max_lines=50,
1034
  show_copy_button=True,
1035
- placeholder="Processed markdown will appear here...",
1036
  )
1037
 
1038
- with gr.TabItem("πŸ” Structure Analysis"):
1039
  structure_output = gr.JSON(label="Document Structure")
1040
 
1041
- with gr.TabItem("🧠 AI Analysis"):
1042
- ai_analysis_output = gr.JSON(label="AI-Powered Analysis")
1043
-
1044
- with gr.TabItem("ℹ️ File Info"):
1045
- file_info_output = gr.JSON(label="File Information")
1046
-
1047
- with gr.TabItem("πŸ“‹ Frontmatter"):
1048
- frontmatter_output = gr.Textbox(
1049
- label="Generated Frontmatter",
1050
- lines=15,
1051
- show_copy_button=True,
1052
- )
1053
 
1054
- # Event handlers
1055
- def process_single_document(file_path, ai_enabled, frontmatter, toc, cache):
 
1056
  if not file_path:
1057
- return "No file uploaded", {}, {}, {}, ""
1058
 
1059
- options = {
1060
- "enable_ai_analysis": ai_enabled,
1061
- "include_frontmatter": frontmatter,
1062
- "generate_toc": toc,
1063
- "use_cache": cache,
1064
- }
1065
-
1066
- result = self.converter.process_document(file_path, options)
1067
 
1068
  if "error" in result:
1069
- return f"❌ Error: {result['error']}", {}, {}, {}, ""
1070
-
1071
- ai_analysis = result["structure"].get("ai_analysis", {})
1072
-
1073
- return (
1074
- result["markdown"],
1075
- result["structure"],
1076
- ai_analysis,
1077
- result["file_info"],
1078
- result.get("frontmatter", ""),
1079
- )
1080
-
1081
- process_btn.click(
1082
- fn=process_single_document,
1083
- inputs=[
1084
- file_input,
1085
- enable_ai,
1086
- include_frontmatter,
1087
- generate_toc,
1088
- use_cache,
1089
- ],
1090
- outputs=[
1091
- markdown_output,
1092
- structure_output,
1093
- ai_analysis_output,
1094
- file_info_output,
1095
- frontmatter_output,
1096
- ],
1097
- )
1098
-
1099
- def _create_batch_processing_tab(self):
1100
- """Create batch processing tab"""
1101
- with gr.Row():
1102
- with gr.Column(scale=1):
1103
- batch_files = gr.File(
1104
- label="πŸ“š Upload Multiple Documents",
1105
- file_count="multiple",
1106
- file_types=[
1107
- ".pdf",
1108
- ".docx",
1109
- ".pptx",
1110
- ".xlsx",
1111
- ".txt",
1112
- ".md",
1113
- ".rtf",
1114
- ".epub",
1115
- ],
1116
- type="filepath",
1117
- )
1118
-
1119
- with gr.Accordion("πŸŽ›οΈ Batch Options", open=True):
1120
- combine_docs = gr.Checkbox(
1121
- label="πŸ”— Combine into Single Document", value=False
1122
- )
1123
- batch_ai = gr.Checkbox(label="🧠 Enable AI Analysis", value=True)
1124
- batch_frontmatter = gr.Checkbox(
1125
- label="πŸ“‹ Include Frontmatter", value=True
1126
- )
1127
- max_workers = gr.Slider(
1128
- label="⚑ Concurrent Workers",
1129
- minimum=1,
1130
- maximum=5,
1131
- value=3,
1132
- step=1,
1133
- )
1134
-
1135
- batch_process_btn = gr.Button(
1136
- "πŸš€ Process All Documents", variant="primary", size="lg"
1137
- )
1138
-
1139
- # Batch progress
1140
- batch_progress = gr.Progress()
1141
- batch_status = gr.Textbox(label="πŸ“Š Batch Status", interactive=False)
1142
-
1143
- with gr.Column(scale=2):
1144
- with gr.Tabs():
1145
- with gr.TabItem("πŸ“‹ Batch Results"):
1146
- batch_results = gr.JSON(label="Processing Results")
1147
-
1148
- with gr.TabItem("πŸ“„ Combined Document"):
1149
- combined_output = gr.Textbox(
1150
- label="Combined Markdown",
1151
- lines=25,
1152
- show_copy_button=True,
1153
- placeholder="Combined document will appear here if enabled...",
1154
- )
1155
-
1156
- with gr.TabItem("πŸ“Š Batch Statistics"):
1157
- batch_stats = gr.JSON(label="Batch Processing Statistics")
1158
-
1159
- def process_batch_documents(
1160
- file_paths, combine, ai_enabled, frontmatter, workers
1161
- ):
1162
- if not file_paths:
1163
- return "No files uploaded", "", {}
1164
-
1165
- options = {
1166
- "enable_ai_analysis": ai_enabled,
1167
- "include_frontmatter": frontmatter,
1168
- "combine_documents": combine,
1169
- }
1170
-
1171
- result = self.converter.process_multiple_documents(file_paths, options)
1172
-
1173
- # Generate statistics
1174
- stats = {
1175
- "total_files": result["total_files"],
1176
- "successful": len([r for r in result["results"] if r.get("success")]),
1177
- "failed": len([r for r in result["results"] if "error" in r]),
1178
- "total_words": sum(
1179
- r.get("structure", {}).get("word_count", 0)
1180
- for r in result["results"]
1181
- if r.get("success")
1182
- ),
1183
- "processing_time": "N/A", # Would need timing implementation
1184
- }
1185
-
1186
- return result["results"], result.get("combined_markdown", ""), stats
1187
-
1188
- batch_process_btn.click(
1189
- fn=process_batch_documents,
1190
- inputs=[
1191
- batch_files,
1192
- combine_docs,
1193
- batch_ai,
1194
- batch_frontmatter,
1195
- max_workers,
1196
- ],
1197
- outputs=[batch_results, combined_output, batch_stats],
1198
- )
1199
-
1200
- def _create_settings_tab(self):
1201
- """Create settings and configuration tab"""
1202
- with gr.Column():
1203
- gr.Markdown("## βš™οΈ Global Settings")
1204
-
1205
- with gr.Row():
1206
- with gr.Column():
1207
- gr.Markdown("### 🎨 Output Formatting")
1208
-
1209
- markdown_style = gr.Dropdown(
1210
- label="Markdown Style",
1211
- choices=["Standard", "GitHub Flavored", "CommonMark", "Pandoc"],
1212
- value="GitHub Flavored",
1213
- )
1214
-
1215
- heading_style = gr.Dropdown(
1216
- label="Heading Style",
1217
- choices=["ATX (# Header)", "Setext (Header\\n=====)"],
1218
- value="ATX (# Header)",
1219
- )
1220
-
1221
- line_break_style = gr.Dropdown(
1222
- label="Line Break Style",
1223
- choices=["Two Spaces", "Backslash"],
1224
- value="Two Spaces",
1225
- )
1226
 
1227
- with gr.Column():
1228
- gr.Markdown("### 🧠 AI Settings")
 
 
1229
 
1230
- ai_model = gr.Dropdown(
1231
- label="NLP Model",
1232
- choices=["en_core_web_sm", "en_core_web_md", "en_core_web_lg"],
1233
- value="en_core_web_sm",
1234
- )
1235
-
1236
- summary_length = gr.Slider(
1237
- label="Summary Max Length",
1238
- minimum=50,
1239
- maximum=500,
1240
- value=200,
1241
- step=50,
1242
- )
1243
-
1244
- max_topics = gr.Slider(
1245
- label="Max Topics to Extract",
1246
- minimum=5,
1247
- maximum=20,
1248
- value=10,
1249
- step=1,
1250
- )
1251
-
1252
- with gr.Row():
1253
- with gr.Column():
1254
- gr.Markdown("### πŸ”§ Processing Settings")
1255
-
1256
- cache_enabled = gr.Checkbox(label="Enable Global Cache", value=True)
1257
- ocr_enabled = gr.Checkbox(label="Enable OCR by Default", value=True)
1258
- preserve_formatting = gr.Checkbox(
1259
- label="Preserve Original Formatting", value=True
1260
- )
1261
-
1262
- max_file_size = gr.Slider(
1263
- label="Max File Size (MB)",
1264
- minimum=1,
1265
- maximum=100,
1266
- value=50,
1267
- step=1,
1268
- )
1269
 
1270
- with gr.Column():
1271
- gr.Markdown("### πŸ“Š Performance")
1272
-
1273
- clear_cache_btn = gr.Button("πŸ—‘οΈ Clear Cache", variant="secondary")
1274
-
1275
- cache_info = gr.JSON(label="Cache Information")
1276
-
1277
- system_info = gr.JSON(
1278
- label="System Information",
1279
- value={
1280
- "supported_formats": list(
1281
- self.converter.supported_formats.keys()
1282
- ),
1283
- "available_features": [
1284
- k for k, v in DEPENDENCIES.items() if v["available"]
1285
- ],
1286
- "missing_features": [
1287
- k for k, v in DEPENDENCIES.items() if not v["available"]
1288
- ],
1289
- },
1290
- )
1291
-
1292
- def clear_cache():
1293
- # Implementation would clear the cache directory
1294
- return {"status": "Cache cleared", "timestamp": datetime.now().isoformat()}
1295
-
1296
- clear_cache_btn.click(fn=clear_cache, outputs=[cache_info])
1297
-
1298
- def _create_export_tab(self):
1299
- """Create export and download tab"""
1300
- with gr.Column():
1301
- gr.Markdown("## πŸ’Ύ Export Options")
1302
-
1303
- with gr.Row():
1304
- with gr.Column():
1305
- gr.Markdown("### πŸ“€ Export Formats")
1306
-
1307
- export_format = gr.Dropdown(
1308
- label="Export Format",
1309
- choices=[
1310
- "Markdown (.md)",
1311
- "HTML (.html)",
1312
- "PDF (.pdf)",
1313
- "ZIP Archive",
1314
- ],
1315
- value="Markdown (.md)",
1316
- )
1317
-
1318
- include_metadata = gr.Checkbox(label="Include Metadata", value=True)
1319
- include_css = gr.Checkbox(
1320
- label="Include CSS (for HTML)", value=True
1321
- )
1322
-
1323
- custom_css = gr.Textbox(
1324
- label="Custom CSS",
1325
- lines=10,
1326
- placeholder="/* Custom CSS for HTML export */",
1327
- visible=False,
1328
- )
1329
-
1330
- with gr.Column():
1331
- gr.Markdown("### πŸ“‹ Export Templates")
1332
-
1333
- template_choice = gr.Dropdown(
1334
- label="Document Template",
1335
- choices=[
1336
- "Default",
1337
- "Academic Paper",
1338
- "Technical Documentation",
1339
- "Blog Post",
1340
- "README",
1341
- ],
1342
- value="Default",
1343
- )
1344
-
1345
- custom_header = gr.Textbox(
1346
- label="Custom Header",
1347
- lines=3,
1348
- placeholder="Custom header to prepend to document",
1349
- )
1350
-
1351
- custom_footer = gr.Textbox(
1352
- label="Custom Footer",
1353
- lines=3,
1354
- placeholder="Custom footer to append to document",
1355
- )
1356
-
1357
- with gr.Row():
1358
- export_btn = gr.Button(
1359
- "πŸ“¦ Generate Export", variant="primary", size="lg"
1360
- )
1361
- download_btn = gr.File(label="πŸ“₯ Download Export", interactive=False)
1362
-
1363
- export_status = gr.Textbox(label="Export Status", interactive=False)
1364
-
1365
- def update_css_visibility(format_choice):
1366
- return gr.update(visible="HTML" in format_choice)
1367
-
1368
- export_format.change(
1369
- fn=update_css_visibility, inputs=[export_format], outputs=[custom_css]
1370
  )
1371
 
1372
-
1373
- # Create and launch the application
1374
- def main():
1375
- """Main application entry point"""
1376
- interface = EnhancedGradioInterface()
1377
- demo = interface.create_interface()
1378
-
1379
- # Launch with MCP server enabled
1380
- demo.launch(
1381
- mcp_server=True,
1382
- server_name="0.0.0.0",
1383
- server_port=7860,
1384
- share=True,
1385
- show_api=True,
1386
- show_error=True,
1387
- )
1388
 
1389
 
1390
  if __name__ == "__main__":
1391
- main()
 
1
  import gradio as gr
2
  import re
3
+ from typing import Dict, Any
4
  import os
5
  from pathlib import Path
6
+
7
+ # Import dependencies for PDF and DOCX processing
8
  try:
9
  import docx
10
 
11
+ DOCX_AVAILABLE = True
12
  except ImportError:
13
+ DOCX_AVAILABLE = False
14
 
15
  try:
16
  import fitz # PyMuPDF
17
 
18
+ PDF_AVAILABLE = True
19
  except ImportError:
20
+ PDF_AVAILABLE = False

+class DocumentToMarkdownConverter:
+    """Simple document to markdown converter"""

    def __init__(self):
+        pass

    def extract_from_docx(self, docx_path: str) -> str:
+        """Extract content from DOCX and convert to Markdown"""
+        if not DOCX_AVAILABLE:
+            raise ImportError("python-docx not installed")

        doc = docx.Document(docx_path)
        markdown_content = []

+        # Process paragraphs
        for paragraph in doc.paragraphs:
            if paragraph.text.strip():
                md_text = self._convert_paragraph_to_markdown(paragraph)

        return "\n\n".join(markdown_content)

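A hedged usage sketch for the DOCX path (the file name is illustrative; assumes python-docx is installed and that this runs alongside the class above):

```python
import docx

# Build a tiny .docx, then convert it with the converter defined above.
doc = docx.Document()
doc.add_heading("Sample Title", level=1)
doc.add_paragraph("Body text for the sample.")
doc.save("sample.docx")

converter = DocumentToMarkdownConverter()
print(converter.extract_from_docx("sample.docx"))
# The heading should surface as a Markdown heading; its exact level depends
# on the style mapping in _convert_paragraph_to_markdown below.
```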
+    def extract_from_pdf(self, pdf_path: str) -> str:
+        """Extract content from PDF and convert to Markdown"""
+        if not PDF_AVAILABLE:
+            raise ImportError("PyMuPDF not installed")

+        doc = fitz.open(pdf_path)
        markdown_content = []

+        for page_num in range(len(doc)):
+            page = doc.load_page(page_num)

+            # Extract text blocks with formatting
+            blocks = page.get_text("dict")
+            page_markdown = self._convert_pdf_blocks_to_markdown(blocks)

+            if page_markdown.strip():
+                page_header = f"## Page {page_num + 1}"
+                markdown_content.append(page_header + "\n\n" + page_markdown)

+        doc.close()
        return "\n\n---\n\n".join(markdown_content)

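For reference, `page.get_text("dict")` returns PyMuPDF's nested layout structure; a trimmed sketch of the shape `_convert_pdf_blocks_to_markdown` consumes (values illustrative; real spans also carry font, color, and bbox):

```python
# Trimmed sketch of page.get_text("dict") output.
blocks = {
    "blocks": [
        {
            "type": 0,  # 0 = text block, 1 = image block
            "lines": [
                {"spans": [{"text": "Hello", "size": 11.0, "flags": 0}]},
            ],
        },
    ],
}
```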
    def _convert_paragraph_to_markdown(self, paragraph) -> str:
+        """Convert DOCX paragraph to Markdown"""
        text = paragraph.text.strip()
        if not text:
            return ""

        style_name = paragraph.style.name if paragraph.style else "Normal"

+        # Check if paragraph has bold formatting
        is_bold = any(run.bold for run in paragraph.runs if run.bold)

+        # Check font size for heading detection
        font_size = 12
        if paragraph.runs:
            first_run = paragraph.runs[0]
            if first_run.font.size:
                font_size = first_run.font.size.pt

+        # Convert based on style and formatting
        if "Title" in style_name or (is_bold and font_size >= 18):
            return f"# {text}"
        elif "Heading 1" in style_name or (is_bold and font_size >= 16):

        elif "Heading 6" in style_name:
            return f"###### {text}"
        elif re.match(r"^[\d\w]\.\s|^[β€’\-\*]\s|^\d+\)\s", text):
+            # List items
+            if text.startswith(("1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9.")):
+                return f"1. {text[2:].strip()}"
            else:
+                char_to_check = text[0] if text else ""
+                if char_to_check in "β€’-*":
+                    return f"- {text[1:].strip()}"
+                else:
+                    return f"- {text}"
        else:
+            # Regular paragraph
            formatted_text = self._apply_inline_formatting(paragraph)
            return formatted_text

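The list-detection regex above can be sanity-checked in isolation; a small sketch with sample strings:

```python
import re

# The same pattern used in _convert_paragraph_to_markdown.
pattern = r"^[\d\w]\.\s|^[β€’\-\*]\s|^\d+\)\s"
for sample in ["1. First item", "β€’ Bullet", "a. Lettered", "Plain text"]:
    print(repr(sample), "->", bool(re.match(pattern, sample)))
# Expected: True, True, True, False
```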
    def _apply_inline_formatting(self, paragraph) -> str:
+        """Apply inline formatting (bold, italic) to text"""
        result = ""
        for run in paragraph.runs:
            text = run.text

            if run.bold and run.italic:
                text = f"***{text}***"
            elif run.bold:
                text = f"**{text}**"
            elif run.italic:
                text = f"*{text}*"

            result += text
        return result

    def _convert_table_to_markdown(self, table) -> str:
+        """Convert DOCX table to Markdown table"""
        if not table.rows:
            return ""

        markdown_rows = []

        # Process header row
+        header_cells = [cell.text.strip() for cell in table.rows[0].cells]
        markdown_rows.append("| " + " | ".join(header_cells) + " |")
        markdown_rows.append("| " + " | ".join(["---"] * len(header_cells)) + " |")

        # Process data rows
        for row in table.rows[1:]:
+            cells = [cell.text.strip() for cell in row.cells]
            markdown_rows.append("| " + " | ".join(cells) + " |")

        return "\n".join(markdown_rows)

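A hedged sketch of the table path (assumes python-docx; values illustrative):

```python
import docx

# Build a 2x2 table and convert it with the method above.
doc = docx.Document()
table = doc.add_table(rows=2, cols=2)
table.cell(0, 0).text = "Name"
table.cell(0, 1).text = "Role"
table.cell(1, 0).text = "Ada"
table.cell(1, 1).text = "Engineer"

converter = DocumentToMarkdownConverter()
print(converter._convert_table_to_markdown(table))
# | Name | Role |
# | --- | --- |
# | Ada | Engineer |
```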
+    def _convert_pdf_blocks_to_markdown(self, blocks_dict) -> str:
+        """Convert PDF text blocks to Markdown"""
+        markdown_lines = []
+
+        for block in blocks_dict.get("blocks", []):
+            if block.get("type") == 0:  # Text block
+                for line in block.get("lines", []):
+                    line_text = ""
+                    for span in line.get("spans", []):
+                        text = span.get("text", "").strip()
+                        if text:
+                            # Check formatting
+                            font_size = span.get("size", 12)
+                            flags = span.get("flags", 0)
+
+                            # Bold = flags & 16, Italic = flags & 2
+                            is_bold = bool(flags & 16)
+                            is_italic = bool(flags & 2)
+
+                            # Apply formatting
+                            if is_bold and is_italic:
+                                text = f"***{text}***"
+                            elif is_bold:
+                                text = f"**{text}**"
+                            elif is_italic:
+                                text = f"*{text}*"
+
+                            # Check if it's a heading based on font size
+                            if font_size >= 18:
+                                text = f"# {text}"
+                            elif font_size >= 16:
+                                text = f"## {text}"
+                            elif font_size >= 14:
+                                text = f"### {text}"
+
+                            line_text += text + " "
+
+                    if line_text.strip():
+                        markdown_lines.append(line_text.strip())
+
+        return "\n\n".join(markdown_lines)
+
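The flag bits tested above follow PyMuPDF's span flags (bit value 2 is italic, 16 is bold); a quick standalone check:

```python
# Decode the two span-flag bits used by _convert_pdf_blocks_to_markdown.
for flags in (0, 2, 16, 18):
    print(flags, "bold" if flags & 16 else "-", "italic" if flags & 2 else "-")
# 0 - -  /  2 - italic  /  16 bold -  /  18 bold italic
```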
+    def analyze_markdown_structure(self, markdown_text: str) -> Dict[str, Any]:
+        """Analyze the structure of extracted Markdown"""
        lines = markdown_text.split("\n")
        structure = {
            "headings": {"h1": 0, "h2": 0, "h3": 0, "h4": 0, "h5": 0, "h6": 0},
            "lists": {"ordered": 0, "unordered": 0},
            "tables": 0,
            "paragraphs": 0,
            "bold_text": 0,
            "italic_text": 0,
            "total_lines": len(lines),
            "word_count": len(markdown_text.split()),
            "character_count": len(markdown_text),
        }

        in_table = False

        for line in lines:
            line = line.strip()
            if not line:
                continue

+            # Count headings
            if line.startswith("#"):
                level = len(line) - len(line.lstrip("#"))
                if level <= 6:
                    structure["headings"][f"h{level}"] += 1

+            # Count lists
            elif re.match(r"^\d+\.\s", line):
                structure["lists"]["ordered"] += 1
            elif re.match(r"^[\-\*\+]\s", line):
                structure["lists"]["unordered"] += 1

+            # Count tables
            elif "|" in line and not in_table:
                structure["tables"] += 1
                in_table = True

            ):
                structure["paragraphs"] += 1

+            # Count formatting
            structure["bold_text"] += len(re.findall(r"\*\*[^*]+\*\*", line))
            structure["italic_text"] += len(re.findall(r"\*[^*]+\*", line))

        return structure

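A hedged sketch of the analyzer on a small snippet (expected counts follow the visible logic):

```python
sample = "# Title\n\nSome **bold** text.\n\n- item one\n- item two"
converter = DocumentToMarkdownConverter()
stats = converter.analyze_markdown_structure(sample)
print(stats["headings"]["h1"], stats["lists"]["unordered"], stats["bold_text"])
# Expected: 1 2 1
```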
+def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
+    """
+    Extract document content and convert to Markdown format

+    Args:
+        file_path: Path to PDF or DOCX file

+    Returns:
+        Dictionary containing markdown content and structure analysis
+    """

+    if not file_path or not os.path.exists(file_path):
+        return {"error": "File not found", "markdown": "", "structure": {}}

+    converter = DocumentToMarkdownConverter()
+    file_extension = Path(file_path).suffix.lower()

+    try:
+        if file_extension == ".docx":
+            if not DOCX_AVAILABLE:
+                return {
+                    "error": "python-docx not installed. Run: pip install python-docx",
+                    "markdown": "",
+                    "structure": {},
+                }
+            markdown_content = converter.extract_from_docx(file_path)
+
+        elif file_extension == ".pdf":
+            if not PDF_AVAILABLE:
+                return {
+                    "error": "PyMuPDF not installed. Run: pip install PyMuPDF",
+                    "markdown": "",
+                    "structure": {},
+                }
+            markdown_content = converter.extract_from_pdf(file_path)

+        else:
+            return {
+                "error": f"Unsupported file type: {file_extension}. Only PDF and DOCX files are supported.",
+                "markdown": "",
+                "structure": {},
+            }

+        # Analyze markdown structure
+        structure = converter.analyze_markdown_structure(markdown_content)

+        return {
+            "success": True,
+            "file_info": {
+                "name": Path(file_path).name,
+                "type": file_extension.upper()[1:],
+                "size_kb": round(os.path.getsize(file_path) / 1024, 2),
+            },
+            "markdown": markdown_content,
+            "structure": structure,
+            "preview": markdown_content[:500] + "..."
+            if len(markdown_content) > 500
+            else markdown_content,
+        }

+    except Exception as e:
+        return {
+            "error": f"Error processing file: {str(e)}",
+            "markdown": "",
+            "structure": {},
+        }
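
A hedged usage sketch for the module-level entry point (path illustrative):

```python
result = extract_document_to_markdown("report.pdf")
if result.get("success"):
    print(result["file_info"])  # e.g. {"name": "report.pdf", "type": "PDF", "size_kb": ...}
    print(result["preview"])    # first 500 characters of the markdown
else:
    print(result["error"])
```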
+def create_interface():
+    """Create the main Gradio interface"""
+
+    with gr.Blocks(
+        title="Document to Markdown Converter", theme=gr.themes.Soft()
+    ) as demo:
+        gr.Markdown("""
+        # πŸ“„ Document to Markdown Converter
+
+        Convert PDF and DOCX files to Markdown format with structure analysis.
+
+        **Supported formats:** PDF (.pdf), Word Documents (.docx)
+        """)
+
+        # Show dependency status
+        missing_deps = []
+        if not DOCX_AVAILABLE:
+            missing_deps.append("python-docx")
+        if not PDF_AVAILABLE:
+            missing_deps.append("PyMuPDF")
+
+        if missing_deps:
+            gr.Markdown(
+                f"⚠️ **Missing dependencies**: Some features may be limited. Missing: {', '.join(missing_deps)}"
+            )
+        else:
+            gr.Markdown("βœ… **All dependencies available**: Full functionality enabled")

        with gr.Row():
            with gr.Column(scale=1):
+                # File upload
                file_input = gr.File(
                    label="πŸ“Ž Upload Document",
+                    file_types=[".pdf", ".docx"],
                    type="filepath",
                )

+                # Process button
+                extract_btn = gr.Button(
+                    "πŸ”„ Convert to Markdown", variant="primary", size="lg"
                )

+                # Options
+                with gr.Accordion("βš™οΈ Options", open=False):
+                    show_structure = gr.Checkbox(
+                        label="πŸ“Š Show Structure Analysis", value=True
+                    )
+                    show_preview = gr.Checkbox(
+                        label="πŸ‘οΈ Show Preview Only (first 500 chars)", value=False
+                    )

            with gr.Column(scale=2):
+                # Output tabs
                with gr.Tabs():
                    with gr.TabItem("πŸ“ Markdown Output"):
                        markdown_output = gr.Textbox(
                            label="Generated Markdown",
+                            lines=20,
+                            max_lines=40,
                            show_copy_button=True,
+                            placeholder="Converted markdown will appear here...",
                        )

+                    with gr.TabItem("πŸ“Š Structure Analysis"):
                        structure_output = gr.JSON(label="Document Structure")

+                    with gr.TabItem("ℹ️ File Information"):
+                        info_output = gr.JSON(label="File Details")

+        # Event handler
+        def process_document(file_path, show_struct, show_prev):
+            """Process uploaded document"""
            if not file_path:
+                return "No file uploaded", {}, {}

+            result = extract_document_to_markdown(file_path)

            if "error" in result:
+                return f"❌ Error: {result['error']}", {}, {}

+            # Determine what to show
+            markdown_text = result["preview"] if show_prev else result["markdown"]
+            structure = result["structure"] if show_struct else {}
+            file_info = result["file_info"]

+            return markdown_text, structure, file_info

+        # Connect the button
+        extract_btn.click(
+            fn=process_document,
+            inputs=[file_input, show_structure, show_preview],
+            outputs=[markdown_output, structure_output, info_output],
        )

+        # Examples section
+        gr.Markdown("""
+        ## πŸ“– Usage Examples
+
+        1. **Upload a PDF or DOCX file** using the file uploader above
+        2. **Click "Convert to Markdown"** to process the document
+        3. **View results** in the tabs:
+           - **Markdown Output**: The converted markdown text
+           - **Structure Analysis**: Document statistics and structure
+           - **File Information**: Basic file details
+
+        ### ✨ Features
+        - **Smart heading detection** based on font size and styles
+        - **Table extraction** and markdown formatting
+        - **List detection** and proper markdown conversion
+        - **Inline formatting** preservation (bold, italic)
+        - **Structure analysis** with statistics
+        """)
+
+    return demo


if __name__ == "__main__":
+    # Create and launch the interface
+    demo = create_interface()
+    demo.launch(server_name="0.0.0.0", mcp_server=True, server_port=7860, share=True)
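
With the app running, the conversion endpoint can also be driven programmatically; a hedged sketch with gradio_client (URL and file are illustrative, and the endpoint name is assumed from the handler's function name, so check the app's API page):

```python
from gradio_client import Client, handle_file

client = Client("http://localhost:7860/")
markdown, structure, info = client.predict(
    handle_file("sample.docx"),    # file input
    True,                          # show structure analysis
    False,                         # preview only?
    api_name="/process_document",  # assumed auto-generated name
)
print(markdown)
```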