seanpedrickcase committed on
Commit
af187f0
·
1 Parent(s): 6f96988

Updated documentation. Fix on ocr_output upload before pdf. Duplicate page fix

README.md CHANGED
@@ -12,7 +12,7 @@ license: agpl-3.0
12
 
13
  version: 1.0.0
14
 
15
- Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a walkthrough on how to use the app. Below is a very brief overview.
16
 
17
  To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works quite well for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
18
 
@@ -20,7 +20,191 @@ After redaction, review suggested redactions on the 'Review redactions' tab. The
20
 
21
  NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that all outputs are reviewed **by a human** before using the final outputs.
22
 
23
- # USER GUIDE
24
 
25
  ## Table of contents
26
 
@@ -35,7 +219,7 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.
35
  - [Redacting only specific pages](#redacting-only-specific-pages)
36
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
37
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
38
- - [Redacting tabular data files (CSV/XLSX) or copy and pasted text](#redacting-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
39
 
40
  See the [advanced user guide here](#advanced-user-guide):
41
  - [Merging redaction review files](#merging-redaction-review-files)
@@ -225,9 +409,11 @@ On the 'Review redactions' tab you have a visual interface that allows you to in
225
 
226
  ### Uploading documents for review
227
 
228
- The top area has a file upload area where you can upload original, unredacted PDFs, alongside the '..._review_file.csv' that is produced by the redaction process. Once you have uploaded these two files, click the '**Review redactions based on original PDF...**' button to load in the files for review. This will allow you to visualise and modify the suggested redactions using the interface below.
229
 
230
- Optionally, you can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the extracted text from the document.
231
 
232
  ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
233
 
@@ -315,6 +501,77 @@ Once you have filtered the table, or selected a row from the table, you have a f
315
 
316
  If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (can only undo the last action).
317
 
318
  ### Navigating through the document using the 'Search all extracted text'
319
 
320
  The 'search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
@@ -327,11 +584,11 @@ You can search through the extracted text by using the search bar just above the
327
 
328
  ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
329
 
330
- ## Redacting tabular data files (XLSX/CSV) or copy and pasted text
331
 
332
- ### Tabular data files (XLSX/CSV)
333
 
334
- The app can be used to redact tabular data files such as xlsx or csv files. For this to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1), and no other information in the sheet. Similarly for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
335
 
336
  To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.
337
 
 
12
 
13
  version: 1.0.0
14
 
15
+ Redact personally identifiable information (PII) from documents (pdf, images), Word files (.docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a walkthrough on how to use the app. Below is a very brief overview.
16
 
17
  To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works quite well for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
18
 
 
20
 
21
  NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that all outputs are reviewed **by a human** before using the final outputs.
22
 
23
+ ---
24
+
25
+ ## 🚀 Quick Start - Installation and first run
26
+
27
+ Follow these instructions to get the document redaction application running on your local machine.
28
+
29
+ ### 1. Prerequisites: System Dependencies
30
+
31
+ This application relies on two external tools for OCR (Tesseract) and PDF processing (Poppler). Please install them on your system before proceeding.
32
+
33
+ ---
34
+
35
+
36
+ #### **On Windows**
37
+
38
+ Installation on Windows requires downloading installers and adding the programs to your system's PATH.
39
+
40
+ 1. **Install Tesseract OCR:**
41
+ * Download the installer from the [Tesseract at UB Mannheim page](https://github.com/UB-Mannheim/tesseract/wiki) (e.g., `tesseract-ocr-w64-setup-v5.X.X...exe`).
42
+ * Run the installer.
43
+ * **IMPORTANT:** During installation, ensure you select the option to "Add Tesseract to system PATH for all users" or a similar option. This is crucial for the application to find the Tesseract executable.
44
+
45
+
46
+ 2. **Install Poppler:**
47
+ * Download the latest Poppler binary for Windows. A common source is the [Poppler for Windows](https://github.com/oschwartz10612/poppler-windows) GitHub releases page. Download the `.zip` file (e.g., `poppler-24.02.0-win.zip`).
48
+ * Extract the contents of the zip file to a permanent location on your computer, for example, `C:\Program Files\poppler\`.
49
+ * You must add the `bin` folder from your Poppler installation to your system's PATH environment variable.
50
+ * Search for "Edit the system environment variables" in the Windows Start Menu and open it.
51
+ * Click the "Environment Variables..." button.
52
+ * In the "System variables" section, find and select the `Path` variable, then click "Edit...".
53
+ * Click "New" and add the full path to the `bin` directory inside your Poppler folder (e.g., `C:\Program Files\poppler\poppler-24.02.0\bin`).
54
+ * Click OK on all windows to save the changes.
55
+
56
+ To verify, open a new Command Prompt and run `tesseract --version` and `pdftoppm -v`. If they both return version information, you have successfully installed the prerequisites.
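
The same check can be sketched from Python using only the standard library. This is a minimal illustration, not part of the app: the helper name `find_missing_tools` is hypothetical, and it only tests whether the two executables can be found on the PATH, not whether they run correctly.

```python
import shutil


def find_missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of tools that cannot be found on the system PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler were both found on PATH.")
```

If either tool is reported as missing, open a new terminal first (PATH changes only apply to newly opened sessions) before revisiting the installation steps.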
57
+
58
+ ---
59
+
60
+ #### **On Linux (Debian/Ubuntu)**
61
+
62
+ Open your terminal and run the following command to install Tesseract and Poppler:
63
+
64
+ ```bash
65
+ sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils
66
+ ```
67
+
68
+ #### **On Linux (Fedora/CentOS/RHEL)**
69
+
70
+ Open your terminal and use the `dnf` or `yum` package manager:
71
+
72
+ ```bash
73
+ sudo dnf install -y tesseract poppler-utils
74
+ ```
75
+ ---
76
+
77
+
78
+ ### 2. Installation: Code and Python Packages
79
+
80
+ Once the system prerequisites are installed, you can set up the Python environment.
81
+
82
+ #### Step 1: Clone the Repository
83
+
84
+ Open your terminal or Git Bash and clone this repository:
85
+ ```bash
86
+ git clone https://github.com/seanpedrick-case/doc_redaction.git
87
+ cd doc_redaction
88
+ ```
89
+
90
+ #### Step 2: Create and Activate a Virtual Environment (Recommended)
91
+
92
+ It is highly recommended to use a virtual environment to isolate project dependencies and avoid conflicts with other Python projects.
93
+
94
+ ```bash
95
+ # Create the virtual environment
96
+ python -m venv venv
97
+
98
+ # Activate it
99
+ # On Windows:
100
+ .\venv\Scripts\activate
101
+
102
+ # On macOS/Linux:
103
+ source venv/bin/activate
104
+ ```
105
+
106
+ #### Step 3: Install Python Dependencies
107
+
108
+ This project uses `pyproject.toml` to manage dependencies. You can install everything with a single pip command. This process will also download the required spaCy models and other packages directly from their URLs.
109
+
110
+ ```bash
111
+ pip install .
112
+ ```
113
+
114
+ Alternatively, you can use the `requirements.txt` file:
115
+ ```bash
116
+ pip install -r requirements.txt
117
+ ```
118
+
119
+ ### 3. Run the Application
120
+
121
+ With all dependencies installed, you can now start the Gradio application.
122
+
123
+ ```bash
124
+ python app.py
125
+ ```
126
+
127
+ After running the command, the application will start, and you will see a local URL in your terminal (usually `http://127.0.0.1:7860`).
128
+
129
+ Open this URL in your web browser to use the document redaction tool.
130
+
131
+ ---
132
+
133
+
134
+ ### 4. ⚙️ Configuration (Optional)
135
+
136
+ You can customise the application's behavior by creating a configuration file. This allows you to change settings without modifying the source code, such as enabling AWS features, changing logging behavior, or pointing to local Tesseract/Poppler installations. All of the settings that can be set in the `app_config.env` file are defined in `tools/config.py`, and each is explained on the documentation website for [the GitHub repo](https://seanpedrick-case.github.io/doc_redaction/).
137
+
138
+ To get started:
139
+ 1. Locate the `example_config.env` file in the root of the project.
140
+ 2. Create a new file named `app_config.env` inside the `config/` directory (i.e., `config/app_config.env`).
141
+ 3. Copy the contents from `example_config.env` into your new `config/app_config.env` file.
142
+ 4. Modify the values in `config/app_config.env` to suit your needs. The application will automatically load these settings on startup.
143
+
144
+ If you do not create this file, the application will run with default settings.
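
To illustrate the file format, here is a minimal sketch of how `KEY=VALUE` lines like those in `example_config.env` could be read into a dictionary. It is an assumption-laden illustration only: the function `parse_env_text` is hypothetical, the inline-comment rule is a simplification, and the app's own loading logic in `tools/config.py` may behave differently.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and full-line comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Simplification: treat " #" as the start of a trailing comment.
        value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings


# Sample lines copied from example_config.env in this commit.
example = """
TESSERACT_FOLDER=tesseract/
SHOW_LANGUAGE_SELECTION=True
RUN_AWS_FUNCTIONS=1 # Set to 0 if you don't want to run AWS functions
"""

settings = parse_env_text(example)
print(settings["RUN_AWS_FUNCTIONS"])  # prints 1
```

Note that every value is read as a string, so `True`, `False`, `1` and `0` are text until the application interprets them.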
145
+
146
+ #### Configuration Breakdown
147
+
148
+ Here is an overview of the most important settings, separated by whether they are for local use or require AWS.
149
+
150
+ ---
151
+
152
+ #### **Local & General Settings (No AWS Required)**
153
+
154
+ These settings are useful for all users, regardless of whether you are using AWS.
155
+
156
+ * `TESSERACT_FOLDER` / `POPPLER_FOLDER`
157
+ * Use these if you installed Tesseract or Poppler to a custom location on **Windows** and did not add them to the system PATH.
158
+ * Provide the path to the respective installation folders (for Poppler, point to the `bin` sub-directory).
159
+ * **Examples:** `POPPLER_FOLDER=C:/Program Files/poppler-24.02.0/bin/` `TESSERACT_FOLDER=tesseract/`
160
+
161
+ * `SHOW_LANGUAGE_SELECTION=True`
162
+ * Set to `True` to display a language selection dropdown in the UI for OCR processing.
163
+
164
+ * `CHOSEN_LOCAL_OCR_MODEL=tesseract`
165
+ * Choose the backend for local OCR. Options are `tesseract`, `paddle`, or `hybrid`. `tesseract` is the default and is recommended. `hybrid` combines the two: a first pass through the document is done with Tesseract, then a second pass is done with PaddleOCR on words with low confidence. `paddle` only returns whole-line text extraction, so it works for OCR only, not redaction.
166
+
167
+ * `SESSION_OUTPUT_FOLDER=False`
168
+ * If `True`, redacted files will be saved in unique subfolders within the `output/` directory for each session.
169
+
170
+ * `DISPLAY_FILE_NAMES_IN_LOGS=False`
171
+ * For privacy, file names are not recorded in usage logs by default. Set to `True` to include them.
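
The two-pass idea behind the `hybrid` OCR option described above can be sketched as follows. This is not the app's implementation: the function name `hybrid_pass`, the `(text, confidence)` word format, the threshold of 60, and the rule of always accepting the second engine's result are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Word = Tuple[str, float]  # (text, confidence from 0 to 100)


def hybrid_pass(
    first_pass: List[Word],
    second_engine: Callable[[int], Word],
    min_confidence: float = 60.0,  # hypothetical threshold
) -> List[Word]:
    """Keep confident first-pass words; re-read the rest with a second engine."""
    merged = []
    for index, (text, confidence) in enumerate(first_pass):
        if confidence < min_confidence:
            # Low confidence: ask the second engine to re-read this word.
            merged.append(second_engine(index))
        else:
            merged.append((text, confidence))
    return merged


# Stand-in data: the second word was read poorly on the first pass.
first = [("Jane", 95.0), ("D0e", 40.0)]
result = hybrid_pass(first, lambda index: ("Doe", 88.0))
print(result)  # [('Jane', 95.0), ('Doe', 88.0)]
```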
172
+
173
+ ---
174
+
175
+ #### **AWS-Specific Settings**
176
+
177
+ These settings are only relevant if you intend to use AWS services like Textract for OCR and Comprehend for PII detection.
178
+
179
+ * `RUN_AWS_FUNCTIONS=1`
180
+ * **This is the master switch.** You must set this to `1` to enable any AWS functionality. If it is `0`, all other AWS settings will be ignored.
181
+
182
+ * **UI Options:**
183
+ * `SHOW_AWS_TEXT_EXTRACTION_OPTIONS=True`: Adds "AWS Textract" as an option in the text extraction dropdown.
184
+ * `SHOW_AWS_PII_DETECTION_OPTIONS=True`: Adds "AWS Comprehend" as an option in the PII detection dropdown.
185
+
186
+ * **Core AWS Configuration:**
187
+ * `AWS_REGION=example-region`: Set your AWS region (e.g., `us-east-1`).
188
+ * `DOCUMENT_REDACTION_BUCKET=example-bucket`: The name of the S3 bucket the application will use for temporary file storage and processing.
189
+
190
+ * **AWS Logging:**
191
+ * `SAVE_LOGS_TO_DYNAMODB=True`: If enabled, usage and feedback logs will be saved to DynamoDB tables.
192
+ * `ACCESS_LOG_DYNAMODB_TABLE_NAME`, `USAGE_LOG_DYNAMODB_TABLE_NAME`, etc.: Specify the names of your DynamoDB tables for logging.
193
+
194
+ * **Advanced AWS Textract Features:**
195
+ * `SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS=True`: Enables UI components for large-scale, asynchronous document processing via Textract.
196
+ * `TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET=example-bucket-output`: A separate S3 bucket for the final output of asynchronous Textract jobs.
197
+ * `LOAD_PREVIOUS_TEXTRACT_JOBS_S3=True`: If enabled, the app will try to load the status of previously submitted asynchronous jobs from S3.
198
+
199
+ * **Cost Tracking (for internal accounting):**
200
+ * `SHOW_COSTS=True`: Displays an estimated cost for AWS operations. Can be enabled even if AWS functions are off.
201
+ * `GET_COST_CODES=True`: Enables a dropdown for users to select a cost code before running a job.
202
+ * `COST_CODES_PATH=config/cost_codes.csv`: The local path to a CSV file containing your cost codes.
203
+ * `ENFORCE_COST_CODES=True`: Makes selecting a cost code mandatory before starting a redaction.
204
+
205
+ Now you have the app installed, what follows is a guide on how to use it for basic and advanced redaction.
206
+
207
+ # User Guide
208
 
209
  ## Table of contents
210
 
 
219
  - [Redacting only specific pages](#redacting-only-specific-pages)
220
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
221
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
222
+ - [Redacting Word, tabular data files (XLSX/CSV) or copy and pasted text](#redacting-word-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
223
 
224
  See the [advanced user guide here](#advanced-user-guide):
225
  - [Merging redaction review files](#merging-redaction-review-files)
 
409
 
410
  ### Uploading documents for review
411
 
412
+ The top area has a file upload area where you can upload files for review. In the left box, upload the original PDF file. Click '1. Upload original PDF'. In the right box, you can upload the '..._review_file.csv' that is produced by the redaction process.
413
 
414
+ Optionally, you can upload a '..._ocr_result_with_words' file here, which will allow you to search through the text and easily [add new redactions based on word search](#searching-and-adding-custom-redactions). You can also upload one of the '..._ocr_output.csv' files that come out of a redaction task, so that you can navigate the extracted text from the document. Click the button '2. Upload Review or OCR csv files' to load in these files.
415
+
416
+ Now you can review and modify the suggested redactions using the interface described below.
417
 
418
  ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
419
 
 
501
 
502
  If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (can only undo the last action).
503
 
504
+ ### Searching and Adding Custom Redactions
505
+
506
+ After a document has been processed, you may need to redact specific terms, names, or phrases that the automatic PII (Personally Identifiable Information) detection might have missed. The **"Search text to make new redactions"** tab gives you the power to find and redact any text within your document manually.
507
+
508
+ #### How to Use the Search and Redact Feature
509
+
510
+ The workflow is designed to be simple: **Search → Select → Redact**.
511
+
512
+ ---
513
+
514
+ #### **Step 1: Search for Text**
515
+
516
+ 1. Navigate to the **"Search text to make new redactions"** tab.
517
+ 2. The main table will initially be populated with all the text extracted from the document, broken down by word.
518
+ 3. To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find.
519
+ 4. Click the **"Search"** button or press Enter.
520
+ 5. The table below will update to show only the rows containing text that matches your search query.
521
+
522
+ > **Tip:** You can also filter the results by page number using the **"Page"** dropdown. To clear all filters and see the full text again, click the **"Reset table to original state"** button.
523
+
524
+ ---
525
+
526
+ #### **Step 2: Select and Review a Match**
527
+
528
+ When you click on any row in the search results table:
529
+
530
+ * The document preview on the left will automatically jump to that page, allowing you to see the word in its original context.
531
+ * The details of your selection will appear in the smaller **"Selected row"** table for confirmation.
532
+
533
+ ---
534
+
535
+ #### **Step 3: Choose Your Redaction Method**
536
+
537
+ You have several powerful options for redacting the text you've found:
538
+
539
+ * **Redact a Single, Specific Instance:**
540
+ * Click on the exact row in the table you want to redact.
541
+ * Click the **`Redact specific text row`** button.
542
+ * Only that single instance will be redacted.
543
+
544
+ * **Redact All Instances of a Word/Phrase:**
545
+ * Let's say you want to redact the project name "Project Alpha" everywhere it appears.
546
+ * Find and select one instance of "Project Alpha" in the table.
547
+ * Click the **`Redact all words with same text as selected row`** button.
548
+ * The application will find and redact every single occurrence of "Project Alpha" throughout the entire document.
549
+
550
+ * **Redact All Current Search Results:**
551
+ * Perform a search (e.g., for a specific person's name).
552
+ * If you are confident that every result shown in the filtered table should be redacted, click the **`Redact all text in table`** button.
553
+ * This will apply a redaction to all currently visible items in the table in one go.
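
Because the table holds one row per word, finding a phrase such as "Project Alpha" means matching consecutive rows. The sketch below shows one way this could work. It is illustrative only: `find_phrase` is a hypothetical helper, and case-insensitive exact matching is an assumption, so the app's own search may treat case and punctuation differently.

```python
def find_phrase(words, phrase):
    """Return (start, end) index pairs where consecutive words match the phrase."""
    target = phrase.lower().split()
    if not target:
        return []
    lowered = [word.lower() for word in words]
    size = len(target)
    return [
        (start, start + size)
        for start in range(len(lowered) - size + 1)
        if lowered[start:start + size] == target
    ]


# Word-level rows, as they might appear in the extracted text table.
words = ["Kickoff", "for", "Project", "Alpha", "today.", "project", "alpha", "budget"]
matches = find_phrase(words, "Project Alpha")
print(matches)  # [(2, 4), (5, 7)]
```

Each returned pair marks the rows that would be covered by one new redaction.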
554
+
555
+ ---
556
+
557
+ #### **Customising Your New Redactions**
558
+
559
+ Before you click one of the redact buttons, you can customize the appearance and label of the new redactions under the **"Search options"** accordion:
560
+
561
+ * **Label for new redactions:** Change the text that appears on the redaction box (default is "Redaction"). You could change this to "CONFIDENTIAL" or "CUSTOM".
562
+ * **Colour for labels:** Set a custom color for the redaction box by providing an RGB value. The format must be three numbers (0-255) in parentheses, for example:
563
+ * ` (255, 0, 0) ` for Red
564
+ * ` (0, 0, 0) ` for Black
565
+ * ` (255, 255, 0) ` for Yellow
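
As a quick way to check a colour string before entering it, the format described above can be validated with a short sketch like this one. The function `parse_rgb` is hypothetical and is not taken from the app, which may validate input differently.

```python
import re

_RGB_PATTERN = re.compile(r"\s*\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)\s*")


def parse_rgb(text):
    """Parse a string such as '(255, 0, 0)' into a tuple of three ints (0-255)."""
    match = _RGB_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"Expected three numbers in parentheses, got: {text!r}")
    values = tuple(int(group) for group in match.groups())
    if any(value > 255 for value in values):
        raise ValueError(f"Each value must be between 0 and 255, got: {values}")
    return values


print(parse_rgb(" (255, 0, 0) "))  # (255, 0, 0)
```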
566
+
567
+ #### **Undoing a Mistake**
568
+
569
+ If you make a mistake, you can reverse the last redaction action you performed on this tab.
570
+
571
+ * Click the **`Undo latest redaction`** button. This will revert the last set of redactions you added (whether it was a single row, all of a certain text, or all search results).
572
+
573
+ > **Important:** This undo button only works for the *most recent* action. It maintains a single backup state, so it cannot undo actions that are two or more steps in the past.
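
The single-backup behaviour described above can be pictured with a small sketch. The class `SingleUndo` is hypothetical and only models the idea of keeping one previous state. It is not the app's code.

```python
class SingleUndo:
    """Keep a list of redactions plus one backup of the previous state."""

    def __init__(self):
        self.redactions = []
        self._backup = None

    def add(self, new_items):
        # Each new action overwrites the single backup slot.
        self._backup = list(self.redactions)
        self.redactions.extend(new_items)

    def undo(self):
        """Restore the backup if there is one; return whether anything changed."""
        if self._backup is None:
            return False
        self.redactions = self._backup
        self._backup = None
        return True


state = SingleUndo()
state.add(["Project Alpha"])       # first action
state.add(["Jane Doe", "J. Doe"])  # second action replaces the backup
first_undo = state.undo()          # reverts only the second action
second_undo = state.undo()         # nothing left to undo
print(state.redactions, first_undo, second_undo)  # ['Project Alpha'] True False
```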
574
+
575
  ### Navigating through the document using the 'Search all extracted text'
576
 
577
  The 'search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
 
584
 
585
  ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
586
 
587
+ ## Redacting Word, tabular data files (XLSX/CSV) or copy and pasted text
588
 
589
+ ### Word or tabular data files (XLSX/CSV)
590
 
591
+ The app can be used to redact Word (.docx) files, or tabular data files such as xlsx or csv. For tabular data to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1), and no other information in the sheet. Similarly, for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
592
 
593
  To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.
594
 
app.py CHANGED
@@ -257,7 +257,7 @@ with app:
257
  gr.Markdown(
258
  """# Document redaction
259
 
260
- Redact personally identifiable information (PII) from documents (PDF, images), open text, or tabular data (XLSX/CSV/Parquet). Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use the app. Below is a very brief overview.
261
 
262
  To identify text in documents, the 'Local' text/OCR image analysis uses spaCy/Tesseract, and works well only for documents with typed text. If available, choose 'AWS Textract' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
263
 
@@ -908,7 +908,6 @@ with app:
908
  ###
909
  # IDENTIFY DUPLICATE PAGES
910
  ###
911
- #in_duplicate_pages.upload(fn = prepare_image_or_pdf, inputs=[in_duplicate_pages, text_extract_method_radio, all_page_line_level_ocr_results_df_base, all_page_line_level_ocr_results_with_words_df_base, latest_file_completed_num, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false, page_sizes, pdf_doc_state], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_df, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_page_line_level_ocr_results_df_base, relevant_ocr_output_with_words_found_checkbox, all_page_line_level_ocr_results_with_words_df_base])
912
 
913
  find_duplicate_pages_btn.click(
914
  fn=run_duplicate_analysis,
@@ -977,9 +976,7 @@ with app:
977
  all_output_files_btn.click(fn=load_all_output_files, inputs=output_folder_textbox, outputs=all_output_files)
978
 
979
  # Language selection dropdown
980
- chosen_language_full_name_drop.select(update_language_dropdown, inputs=[chosen_language_full_name_drop], outputs=[chosen_language_drop])#.\
981
- #success(download_tesseract_lang_pack, inputs=[chosen_language_drop], outputs = [tesseract_lang_data_file_path]).\
982
- #success(load_spacy_model, inputs=[chosen_language_drop], outputs=[updated_nlp_analyser_state])
983
 
984
  ###
985
  # APP LOAD AND LOGGING
 
257
  gr.Markdown(
258
  """# Document redaction
259
 
260
+ Redact personally identifiable information (PII) from documents (PDF, images), Word files (.docx), or tabular data (XLSX/CSV/Parquet). Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use the app. Below is a very brief overview.
261
 
262
  To identify text in documents, the 'Local' text/OCR image analysis uses spaCy/Tesseract, and works well only for documents with typed text. If available, choose 'AWS Textract' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
263
 
 
908
  ###
909
  # IDENTIFY DUPLICATE PAGES
910
  ###
 
911
 
912
  find_duplicate_pages_btn.click(
913
  fn=run_duplicate_analysis,
 
976
  all_output_files_btn.click(fn=load_all_output_files, inputs=output_folder_textbox, outputs=all_output_files)
977
 
978
  # Language selection dropdown
979
+ chosen_language_full_name_drop.select(update_language_dropdown, inputs=[chosen_language_full_name_drop], outputs=[chosen_language_drop])
 
 
980
 
981
  ###
982
  # APP LOAD AND LOGGING
cdk/cdk_config.py CHANGED
@@ -213,9 +213,9 @@ SAVE_LOGS_TO_CSV = get_or_create_env_var('SAVE_LOGS_TO_CSV', 'True')
213
 
214
  ### DYNAMODB logs. Whether to save to DynamoDB, and the headers of the table
215
  SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'True')
216
- ACCESS_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('ACCESS_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-access-log".lower())
217
- FEEDBACK_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('FEEDBACK_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-feedback".lower())
218
- USAGE_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('USAGE_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-usage".lower())
219
 
220
  ###
221
  # REDACTION OPTIONS
 
213
 
214
  ### DYNAMODB logs. Whether to save to DynamoDB, and the headers of the table
215
  SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'True')
216
+ ACCESS_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('ACCESS_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-access-logs".lower())
217
+ FEEDBACK_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('FEEDBACK_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-feedback-logs".lower())
218
+ USAGE_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('USAGE_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-usage-logs".lower())
219
 
220
  ###
221
  # REDACTION OPTIONS
cdk/post_cdk_build_quickstart.py CHANGED
@@ -13,10 +13,10 @@ start_codebuild_build(PROJECT_NAME=CODEBUILD_PROJECT_NAME)
13
  # Upload config.env file to S3 bucket
14
  upload_file_to_s3(local_file_paths="config/config.env", s3_key="", s3_bucket=S3_LOG_CONFIG_BUCKET_NAME)
15
 
16
- total_seconds = 450 # 7.5 minutes
17
  update_interval = 1 # Update every second
18
 
19
- print("Waiting 7.5 minutes for the CodeBuild container to build.")
20
 
21
  # tqdm iterates over a range, and you perform a small sleep in each iteration
22
  for i in tqdm(range(total_seconds), desc="Building container"):
 
13
  # Upload config.env file to S3 bucket
14
  upload_file_to_s3(local_file_paths="config/config.env", s3_key="", s3_bucket=S3_LOG_CONFIG_BUCKET_NAME)
15
 
16
+ total_seconds = 660 # 11 minutes
17
  update_interval = 1 # Update every second
18
 
19
+ print("Waiting 11 minutes for the CodeBuild container to build.")
20
 
21
  # tqdm iterates over a range, and you perform a small sleep in each iteration
22
  for i in tqdm(range(total_seconds), desc="Building container"):
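
This change updates the wait time in two places: the `total_seconds` constant and the printed message. One possible way to keep them in step, shown here only as a sketch with a hypothetical helper `wait_message`, is to derive the message from the constant:

```python
def wait_message(total_seconds):
    """Build the wait message from the number of seconds to wait."""
    minutes = total_seconds / 60
    # The 'g' format drops a trailing '.0', so 11.0 prints as 11 and 7.5 stays 7.5.
    return f"Waiting {minutes:g} minutes for the CodeBuild container to build."


print(wait_message(660))  # Waiting 11 minutes for the CodeBuild container to build.
```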
cdk/requirements.txt CHANGED
@@ -1,5 +1,5 @@
1
- aws-cdk-lib==2.202.0
2
- boto3==1.38.41
3
- pandas==2.3.0
4
  nodejs==0.1.1
5
  python-dotenv==1.0.1
 
1
+ aws-cdk-lib==2.212.0
2
+ boto3==1.40.10
3
+ pandas==2.3.1
4
  nodejs==0.1.1
5
  python-dotenv==1.0.1
example_config.env ADDED
@@ -0,0 +1,26 @@
1
+ TESSERACT_FOLDER=tesseract/
2
+ POPPLER_FOLDER=poppler/poppler-24.02.0/Library/bin/
3
+ SHOW_LANGUAGE_SELECTION=True
4
+ CHOSEN_LOCAL_OCR_MODEL=tesseract
5
+
6
+ SESSION_OUTPUT_FOLDER=False
7
+ DISPLAY_FILE_NAMES_IN_LOGS=False
8
+
9
+ RUN_AWS_FUNCTIONS=1 # Set to 0 if you don't want to run AWS functions
10
+ SAVE_LOGS_TO_DYNAMODB=True
11
+ S3_COST_CODES_PATH=cost_codes.csv
12
+ SHOW_AWS_TEXT_EXTRACTION_OPTIONS=True
13
+ SHOW_AWS_PII_DETECTION_OPTIONS=True
14
+ AWS_REGION=example-region
15
+ DOCUMENT_REDACTION_BUCKET=example-bucket
16
+ SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS=True
17
+ TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET=example-bucket-output
18
+ LOAD_PREVIOUS_TEXTRACT_JOBS_S3=True
19
+ ACCESS_LOG_DYNAMODB_TABLE_NAME=example-dynamodb-access-log
20
+ USAGE_LOG_DYNAMODB_TABLE_NAME=example-dynamodb-usage
21
+ FEEDBACK_LOG_DYNAMODB_TABLE_NAME=example-dynamodb-feedback
22
+ SHOW_COSTS=True
23
+ GET_COST_CODES=True
24
+ COST_CODES_PATH=config/cost_codes.csv
25
+ ENFORCE_COST_CODES=True
26
+ DEFAULT_COST_CODE=example_cost_code
src/app_settings.qmd CHANGED
@@ -115,6 +115,16 @@ Configuration for input and output file handling.
115
  * **Default Value:** `'input/'`
116
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
117
 
118
  ## Logging Options
119
 
120
  Settings for configuring application logging, including log formats and storage locations.
@@ -161,7 +171,7 @@ Settings for configuring application logging, including log formats and storage
161
 
162
  * **`CSV_USAGE_LOG_HEADERS`**
163
  * **Description:** Defines custom headers for CSV usage logs.
164
- * **Default Value:** A predefined list of header names. Refer to `tools/config.py` for the complete list.
165
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
166
 
167
  * **`SAVE_LOGS_TO_DYNAMODB`**
@@ -214,12 +224,17 @@ Settings for configuring application logging, including log formats and storage
214
  Configurations related to the text redaction process, including PII detection models and external tool paths.
215
 
216
  * **`TESSERACT_FOLDER`**
217
- * **Description:** Path to the local Tesseract OCR installation folder. Only required if Tesseract is not in path, or you are running a version of the app as an .exe installed with Pyinstaller. Gives the path to the local Tesseract OCR model for text extraction.
218
  * **Default Value:** `""` (empty string)
219
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
220
 
 
 
 
 
 
221
  * **`POPPLER_FOLDER`**
222
- * **Description:** Path to the local Poppler installation's `bin` folder. Only required if Tesseract is not in path, or you are running a version of the app as an .exe installed with Pyinstaller. Poppler is used for PDF processing.
223
  * **Default Value:** `""` (empty string)
224
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
225
 
@@ -283,24 +298,34 @@ Configurations related to the text redaction process, including PII detection mo
283
  * **Default Value:** Value of `AWS_PII_OPTION` if `SHOW_AWS_PII_DETECTION_OPTIONS` is True, else value of `LOCAL_PII_OPTION`.
284
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. Provide one of the PII detection option display names.
285
 
286
  * **`CHOSEN_COMPREHEND_ENTITIES`**
287
  * **Description:** A list of AWS Comprehend PII entity types to be redacted when using AWS Comprehend.
288
- * **Default Value:** A predefined list of entity types. Refer to `tools/config.py` for the complete list.
289
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
290
 
291
  * **`FULL_COMPREHEND_ENTITY_LIST`**
292
  * **Description:** The complete list of PII entity types supported by AWS Comprehend that can be selected for redaction.
293
- * **Default Value:** A predefined list of entity types. Refer to `tools/config.py` for the complete list.
294
  * **Configuration:** This is typically an informational variable reflecting the capabilities of AWS Comprehend and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_COMPREHEND_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
295
 
296
  * **`CHOSEN_REDACT_ENTITIES`**
297
  * **Description:** A list of local PII entity types to be redacted when using the local PII detection model.
298
- * **Default Value:** A predefined list of entity types. Refer to `tools/config.py` for the complete list.
299
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
300
 
301
  * **`FULL_ENTITY_LIST`**
302
  * **Description:** The complete list of PII entity types supported by the local PII detection model that can be selected for redaction.
303
- * **Default Value:** A predefined list of entity types. Refer to `tools/config.py` for the complete list.
304
  * **Configuration:** This is typically an informational variable reflecting the capabilities of the local model and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_REDACT_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
305
 
306
  * **`PAGE_BREAK_VALUE`**
@@ -309,20 +334,15 @@ Configurations related to the text redaction process, including PII detection mo
309
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
310
 
311
  * **`MAX_TIME_VALUE`**
312
- * **Description:** Specifies the maximum time (in arbitrary units, likely seconds or milliseconds depending on implementation) for a process before it might be timed out.
313
  * **Default Value:** `'999999'`
314
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
315
 
316
  * **`CUSTOM_BOX_COLOUR`**
317
- * **Description:** Allows specifying a custom color for the redaction boxes drawn on documents (e.g., "grey", "red", "#FF0000"). If empty, a default color is used.
318
  * **Default Value:** `""` (empty string)
319
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
320
 
321
- * **`REDACTION_LANGUAGE`**
322
- * **Description:** Specifies the language for redaction processing. Currently, only "en" (English) is supported.
323
- * **Default Value:** `"en"`
324
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
325
-
326
  * **`RETURN_PDF_END_OF_REDACTION`**
327
  * **Description:** If set to `'True'`, the application will return a PDF document at the end of the redaction task.
328
  * **Default Value:** `"True"`
@@ -333,13 +353,42 @@ Configurations related to the text redaction process, including PII detection mo
333
  * **Default Value:** `"False"`
334
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
335
 
336
  ## App Run Options
337
 
338
  General runtime configurations for the application.
339
 
340
  * **`TLDEXTRACT_CACHE`**
341
- * **Description:** Path to the cache file used by the `tldextract` library, which helps in accurately extracting top-level domains (TLDs) from URLs.
342
- * **Default Value:** `'tld/.tld_set_snapshot'`
343
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
344
 
345
  * **`COGNITO_AUTH`**
@@ -436,7 +485,7 @@ Settings related to tracking and applying cost codes for application usage.
436
  Configurations for features related to processing whole documents via APIs, particularly AWS Textract for large documents.
437
 
438
  * **`SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS`**
439
- * **Description:** Controls whether UI options for whole document Textract calls are displayed. (Note: Mentioned as not currently implemented in the source).
440
  * **Default Value:** `'False'`
441
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
442
 
@@ -461,12 +510,12 @@ Configurations for features related to processing whole documents via APIs, part
461
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
462
 
463
  * **`TEXTRACT_JOBS_S3_LOC`**
464
- * **Description:** The S3 subfolder (within `TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET`) where Textract job data (output) is stored.
465
  * **Default Value:** `'output'`
466
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
467
 
468
  * **`TEXTRACT_JOBS_S3_INPUT_LOC`**
469
- * **Description:** The S3 subfolder (within `TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET`) where Textract job input is stored.
470
  * **Default Value:** `'input'`
471
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
472
 
@@ -478,4 +527,4 @@ Configurations for features related to processing whole documents via APIs, part
478
  * **`DAYS_TO_DISPLAY_WHOLE_DOCUMENT_JOBS`**
479
  * **Description:** Specifies the number of past days for which to display whole document Textract jobs in the UI.
480
  * **Default Value:** `'7'`
481
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 
115
  * **Default Value:** `'input/'`
116
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
117
 
118
+ * **`GRADIO_TEMP_DIR`**
119
+ * **Description:** Defines the path for Gradio's temporary file storage.
120
+ * **Default Value:** `'tmp/gradio_tmp/'`
121
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
122
+
123
+ * **`MPLCONFIGDIR`**
124
+ * **Description:** Specifies the cache directory for the Matplotlib library, which is used for plotting and image handling.
125
+ * **Default Value:** `'tmp/matplotlib_cache/'`
126
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
127
+
128
  ## Logging Options
129
 
130
  Settings for configuring application logging, including log formats and storage locations.
 
171
 
172
  * **`CSV_USAGE_LOG_HEADERS`**
173
  * **Description:** Defines custom headers for CSV usage logs.
174
+ * **Default Value:** A predefined list of header names. Refer to `config.py` for the complete list.
175
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
176
 
177
  * **`SAVE_LOGS_TO_DYNAMODB`**
 
224
  Configurations related to the text redaction process, including PII detection models and external tool paths.
225
 
226
  * **`TESSERACT_FOLDER`**
227
+ * **Description:** Path to the local Tesseract OCR installation folder. Only required if Tesseract is not in the system's PATH, or when running a packaged executable (e.g., via PyInstaller).
228
  * **Default Value:** `""` (empty string)
229
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
230
 
231
+ * **`TESSERACT_DATA_FOLDER`**
232
+ * **Description:** Path to the Tesseract trained data files (e.g., `tessdata`).
233
+ * **Default Value:** `"/usr/share/tessdata"`
234
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
235
+
236
  * **`POPPLER_FOLDER`**
237
+ * **Description:** Path to the local Poppler installation's `bin` folder. Poppler is used for PDF processing. Only required if Poppler is not in the system's PATH.
238
  * **Default Value:** `""` (empty string)
239
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
240
 
 
298
  * **Default Value:** Value of `AWS_PII_OPTION` if `SHOW_AWS_PII_DETECTION_OPTIONS` is True, else value of `LOCAL_PII_OPTION`.
299
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. Provide one of the PII detection option display names.
300
 
301
+ * **`CHOSEN_LOCAL_OCR_MODEL`**
302
+ * **Description:** Choose the engine for local OCR: `"tesseract"`, `"paddle"`, or `"hybrid"`. "paddle" is effective for line extraction but not word-level redaction. "hybrid" uses Tesseract first, then PaddleOCR for low-confidence words.
303
+ * **Default Value:** `"tesseract"`
304
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
305
+
306
+ * **`PREPROCESS_LOCAL_OCR_IMAGES`**
307
+ * **Description:** If set to `"True"`, images will be preprocessed (e.g., deskewed, contrast adjusted) before being sent to the local OCR engine. This can sometimes yield worse results on clean scans.
308
+ * **Default Value:** `"False"`
309
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
310
+
311
  * **`CHOSEN_COMPREHEND_ENTITIES`**
312
  * **Description:** A list of AWS Comprehend PII entity types to be redacted when using AWS Comprehend.
313
+ * **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
314
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
315
 
316
  * **`FULL_COMPREHEND_ENTITY_LIST`**
317
  * **Description:** The complete list of PII entity types supported by AWS Comprehend that can be selected for redaction.
318
+ * **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
319
  * **Configuration:** This is typically an informational variable reflecting the capabilities of AWS Comprehend and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_COMPREHEND_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
320
 
321
  * **`CHOSEN_REDACT_ENTITIES`**
322
  * **Description:** A list of local PII entity types to be redacted when using the local PII detection model.
323
+ * **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
324
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
325
 
326
  * **`FULL_ENTITY_LIST`**
327
  * **Description:** The complete list of PII entity types supported by the local PII detection model that can be selected for redaction.
328
+ * **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
329
  * **Configuration:** This is typically an informational variable reflecting the capabilities of the local model and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_REDACT_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
330
 
331
  * **`PAGE_BREAK_VALUE`**
 
334
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
335
 
336
  * **`MAX_TIME_VALUE`**
337
+ * **Description:** Specifies a maximum time value for long-running processes.
338
  * **Default Value:** `'999999'`
339
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
340
 
341
  * **`CUSTOM_BOX_COLOUR`**
342
+ * **Description:** Allows specifying a custom color for the redaction boxes drawn on documents. Only `"grey"` is currently supported as a custom value. If empty, a default color is used.
343
  * **Default Value:** `""` (empty string)
344
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
345
 
 
 
 
 
 
346
  * **`RETURN_PDF_END_OF_REDACTION`**
347
  * **Description:** If set to `'True'`, the application will return a PDF document at the end of the redaction task.
348
  * **Default Value:** `"True"`
 
353
  * **Default Value:** `"False"`
354
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
355
 
356
+ ## Language Options
357
+
358
+ Settings for multi-language support in OCR and PII detection.
359
+
360
+ * **`SHOW_LANGUAGE_SELECTION`**
361
+ * **Description:** If set to `"True"`, a dropdown menu for language selection will be visible in the user interface.
362
+ * **Default Value:** `"False"`
363
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
364
+
365
+ * **`DEFAULT_LANGUAGE_FULL_NAME`**
366
+ * **Description:** The default language's full name (e.g., "english") to be displayed in the UI.
367
+ * **Default Value:** `"english"`
368
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
369
+
370
+ * **`DEFAULT_LANGUAGE`**
371
+ * **Description:** The default language's short code (e.g., "en") used by the backend engines. Ensure the corresponding Tesseract/PaddleOCR language packs are installed.
372
+ * **Default Value:** `"en"`
373
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
374
+
375
+ * **`MAPPED_LANGUAGE_CHOICES`**
376
+ * **Description:** A string list of full language names (e.g., 'english', 'french') presented to the user in the language dropdown.
377
+ * **Default Value:** A predefined list. See `config.py`.
378
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
379
+
380
+ * **`LANGUAGE_CHOICES`**
381
+ * **Description:** A string list of short language codes (e.g., 'en', 'fr') that correspond to `MAPPED_LANGUAGE_CHOICES`. This is what the backend uses.
382
+ * **Default Value:** A predefined list. See `config.py`.
383
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
384
+
385
  ## App Run Options
386
 
387
  General runtime configurations for the application.
388
 
389
  * **`TLDEXTRACT_CACHE`**
390
+ * **Description:** Path to the cache directory used by the `tldextract` library, which helps in accurately extracting top-level domains (TLDs) from URLs.
391
+ * **Default Value:** `'tmp/tld/'`
392
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
393
 
394
  * **`COGNITO_AUTH`**
 
485
  Configurations for features related to processing whole documents via APIs, particularly AWS Textract for large documents.
486
 
487
  * **`SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS`**
488
+ * **Description:** Controls whether UI options for whole document Textract calls are displayed.
489
  * **Default Value:** `'False'`
490
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
491
 
 
510
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
511
 
512
  * **`TEXTRACT_JOBS_S3_LOC`**
513
+ * **Description:** The S3 subfolder (within the main redaction bucket) where Textract job data (output) is stored.
514
  * **Default Value:** `'output'`
515
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
516
 
517
  * **`TEXTRACT_JOBS_S3_INPUT_LOC`**
518
+ * **Description:** The S3 subfolder (within the main redaction bucket) where Textract job input is stored.
519
  * **Default Value:** `'input'`
520
  * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
521
 
 
527
  * **`DAYS_TO_DISPLAY_WHOLE_DOCUMENT_JOBS`**
528
  * **Description:** Specifies the number of past days for which to display whole document Textract jobs in the UI.
529
  * **Default Value:** `'7'`
530
+ * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
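
Several settings above (for example `CHOSEN_REDACT_ENTITIES`, `CHOSEN_COMPREHEND_ENTITIES`, `MAPPED_LANGUAGE_CHOICES` and `LANGUAGE_CHOICES`) are described as string representations of Python lists. The sketch below shows one way such a string could be turned back into a list. `parse_list_setting` is a hypothetical helper, the list values are illustrative rather than the real defaults in `config.py`, and pairing the two language lists by position is an assumption.

```python
import ast
from typing import Dict, List


def parse_list_setting(value: str) -> List[str]:
    """Turn a string such as "['en', 'fr']" into a Python list of strings."""
    # literal_eval only evaluates literals, so it never runs arbitrary code.
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, list):
        raise ValueError(f"Expected a list, got {type(parsed).__name__}")
    return [str(item) for item in parsed]


# Illustrative values only; the real defaults live in config.py.
mapped_language_choices = parse_list_setting("['english', 'french']")
language_choices = parse_list_setting("['en', 'fr']")

# Assumption: full names and short codes are paired by position.
name_to_code: Dict[str, str] = dict(zip(mapped_language_choices, language_choices))
print(name_to_code["french"])
```

If the two lists are paired this way, they need the same length and order, otherwise a language name would map to the wrong short code.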
src/user_guide.qmd CHANGED
@@ -20,7 +20,7 @@ format:
20
  - [Redacting only specific pages](#redacting-only-specific-pages)
21
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
22
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
23
- - [Redacting tabular data files (CSV/XLSX) or copy and pasted text](#redacting-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
24
 
25
  See the [advanced user guide here](#advanced-user-guide):
26
  - [Merging redaction review files](#merging-redaction-review-files)
@@ -210,9 +210,11 @@ On the 'Review redactions' tab you have a visual interface that allows you to in
210
 
211
  ### Uploading documents for review
212
 
213
- The top area has a file upload area where you can upload original, unredacted PDFs, alongside the '..._review_file.csv' that is produced by the redaction process. Once you have uploaded these two files, click the '**Review redactions based on original PDF...**' button to load in the files for review. This will allow you to visualise and modify the suggested redactions using the interface below.
214
 
215
- Optionally, you can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the extracted text from the document.
 
 
216
 
217
  ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
218
 
@@ -300,6 +302,77 @@ Once you have filtered the table, or selected a row from the table, you have a f
300
 
301
  If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (can only undo the last action).
302
 
303
  ### Navigating through the document using the 'Search all extracted text'
304
 
305
  The 'search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
@@ -312,11 +385,11 @@ You can search through the extracted text by using the search bar just above the
312
 
313
  ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
314
 
315
- ## Redacting tabular data files (XLSX/CSV) or copy and pasted text
316
 
317
- ### Tabular data files (XLSX/CSV)
318
 
319
- The app can be used to redact tabular data files such as xlsx or csv files. For this to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1), and no other information in the sheet. Similarly for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
320
 
321
  To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.
322
 
 
20
  - [Redacting only specific pages](#redacting-only-specific-pages)
21
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
22
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
23
+ - [Redacting Word, tabular data files (XLSX/CSV) or copy and pasted text](#redacting-word-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
24
 
25
  See the [advanced user guide here](#advanced-user-guide):
26
  - [Merging redaction review files](#merging-redaction-review-files)
 
210
 
211
  ### Uploading documents for review
212
 
213
+ The top area has a file upload area where you can upload files for review. In the left box, upload the original PDF file, then click '1. Upload original PDF'. In the right box, you can upload the '..._review_file.csv' that is produced by the redaction process.
214
 
215
+ Optionally, you can upload a '..._ocr_result_with_words' file here, which will allow you to search through the text and easily [add new redactions based on word search](#searching-and-adding-custom-redactions). You can also upload one of the '..._ocr_output.csv' files that come out of a redaction task, so that you can navigate the extracted text from the document. Click the button '2. Upload Review or OCR csv files' to load in these files.
216
+
217
+ Now you can review and modify the suggested redactions using the interface described below.
218
 
219
  ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
220
 
 
302
 
303
  If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (can only undo the last action).
304
 
305
+ ### Searching and Adding Custom Redactions
306
+
307
+ After a document has been processed, you may need to redact specific terms, names, or phrases that the automatic PII (Personally Identifiable Information) detection might have missed. The **"Search text to make new redactions"** tab gives you the power to find and redact any text within your document manually.
308
+
309
+ #### How to Use the Search and Redact Feature
310
+
311
+ The workflow is designed to be simple: **Search → Select → Redact**.
312
+
313
+ ---
314
+
315
+ #### **Step 1: Search for Text**
316
+
317
+ 1. Navigate to the **"Search text to make new redactions"** tab.
318
+ 2. The main table will initially be populated with all the text extracted from the document, broken down by word.
319
+ 3. To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find.
320
+ 4. Click the **"Search"** button or press Enter.
321
+ 5. The table below will update to show only the rows containing text that matches your search query.
322
+
323
+ > **Tip:** You can also filter the results by page number using the **"Page"** dropdown. To clear all filters and see the full text again, click the **"Reset table to original state"** button.
324
+
325
+ ---
326
+
327
+ #### **Step 2: Select and Review a Match**
328
+
329
+ When you click on any row in the search results table:
330
+
331
+ * The document preview on the left will automatically jump to that page, allowing you to see the word in its original context.
332
+ * The details of your selection will appear in the smaller **"Selected row"** table for confirmation.
333
+
334
+ ---
335
+
336
+ #### **Step 3: Choose Your Redaction Method**
337
+
338
+ You have several powerful options for redacting the text you've found:
339
+
340
+ * **Redact a Single, Specific Instance:**
341
+ * Click on the exact row in the table you want to redact.
342
+ * Click the **`Redact specific text row`** button.
343
+ * Only that single instance will be redacted.
344
+
345
+ * **Redact All Instances of a Word/Phrase:**
346
+ * Let's say you want to redact the project name "Project Alpha" everywhere it appears.
347
+ * Find and select one instance of "Project Alpha" in the table.
348
+ * Click the **`Redact all words with same text as selected row`** button.
349
+ * The application will find and redact every single occurrence of "Project Alpha" throughout the entire document.
350
+
351
+ * **Redact All Current Search Results:**
352
+ * Perform a search (e.g., for a specific person's name).
353
+ * If you are confident that every result shown in the filtered table should be redacted, click the **`Redact all text in table`** button.
354
+ * This will apply a redaction to all currently visible items in the table in one go.
355
+
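
The "redact all instances" option described above amounts to scanning the word-level table for every run of consecutive words that spells the chosen phrase. The following is a minimal sketch of that idea, not the app's implementation: `find_phrase_matches`, the `page`/`text` row layout, and case-insensitive matching are all assumptions made for illustration.

```python
from typing import Dict, List, Union

WordRow = Dict[str, Union[int, str]]


def find_phrase_matches(words: List[WordRow], phrase: str) -> List[List[int]]:
    """Return index runs where consecutive same-page words spell the phrase."""
    tokens = phrase.lower().split()
    matches: List[List[int]] = []
    if not tokens:
        return matches
    for start in range(len(words) - len(tokens) + 1):
        window = words[start:start + len(tokens)]
        # A match must not straddle a page boundary.
        same_page = len({row["page"] for row in window}) == 1
        same_text = [str(row["text"]).lower() for row in window] == tokens
        if same_page and same_text:
            matches.append(list(range(start, start + len(tokens))))
    return matches


# Hypothetical word-level OCR rows, one row per word.
ocr_words: List[WordRow] = [
    {"page": 1, "text": "Project"},
    {"page": 1, "text": "Alpha"},
    {"page": 1, "text": "budget"},
    {"page": 2, "text": "project"},
    {"page": 2, "text": "alpha"},
]

print(find_phrase_matches(ocr_words, "Project Alpha"))
```

Each returned run lists the row indices that one redaction would cover. The sketch ignores punctuation attached to words, which a real search would need to handle.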
356
+ ---
357
+
358
+ #### **Customising Your New Redactions**
359
+
360
+ Before you click one of the redact buttons, you can customize the appearance and label of the new redactions under the **"Search options"** accordion:
361
+
362
+ * **Label for new redactions:** Change the text that appears on the redaction box (default is "Redaction"). You could change this to "CONFIDENTIAL" or "CUSTOM".
363
+ * **Colour for labels:** Set a custom color for the redaction box by providing an RGB value. The format must be three numbers (0-255) in parentheses, for example:
364
+ * ` (255, 0, 0) ` for Red
365
+ * ` (0, 0, 0) ` for Black
366
+ * ` (255, 255, 0) ` for Yellow
367
+
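
The colour format above (three numbers from 0 to 255 in parentheses) can be checked before use. This is a small sketch of such a check; `parse_rgb` is a hypothetical helper and the app may validate the value differently.

```python
import re
from typing import Tuple

# Three groups of 1-3 digits, separated by commas, wrapped in parentheses.
_RGB_PATTERN = re.compile(r"^\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$")


def parse_rgb(value: str) -> Tuple[int, int, int]:
    """Parse a string like '(255, 0, 0)' into an (R, G, B) tuple."""
    match = _RGB_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"Not an RGB value in parentheses: {value!r}")
    red, green, blue = (int(part) for part in match.groups())
    if max(red, green, blue) > 255:
        raise ValueError(f"RGB components must be between 0 and 255: {value!r}")
    return red, green, blue


print(parse_rgb(" (255, 0, 0) "))
```

Surrounding whitespace is tolerated, but a value without parentheses or with a component above 255 raises a `ValueError`.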
368
+ #### **Undoing a Mistake**
369
+
370
+ If you make a mistake, you can reverse the last redaction action you performed on this tab.
371
+
372
+ * Click the **`Undo latest redaction`** button. This will revert the last set of redactions you added (whether it was a single row, all of a certain text, or all search results).
373
+
374
+ > **Important:** This undo button only works for the *most recent* action. It maintains a single backup state, so it cannot undo actions that are two or more steps in the past.
375
+
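
The single backup state described above explains why only the most recent action can be undone: each new redaction action overwrites the previous backup. A minimal sketch of that pattern follows, using a hypothetical `SingleUndoRedactions` class and plain strings in place of real redaction boxes.

```python
from typing import List, Optional


class SingleUndoRedactions:
    """Keeps a list of redactions plus one backup of the previous state."""

    def __init__(self) -> None:
        self.redactions: List[str] = []
        self._backup: Optional[List[str]] = None

    def add(self, new_items: List[str]) -> None:
        # Overwrite the backup: earlier states are no longer recoverable.
        self._backup = list(self.redactions)
        self.redactions.extend(new_items)

    def undo(self) -> bool:
        """Restore the backup if there is one; return whether anything changed."""
        if self._backup is None:
            return False
        self.redactions = self._backup
        self._backup = None
        return True


state = SingleUndoRedactions()
state.add(["Project Alpha (page 1)"])
state.add(["Project Alpha (page 2)", "Project Alpha (page 3)"])
first_undo = state.undo()   # removes the two items added last
second_undo = state.undo()  # nothing left to restore
print(state.redactions, first_undo, second_undo)
```

After the first undo the backup is cleared, so the second call has no effect and the redaction from the first action stays in place.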
376
  ### Navigating through the document using the 'Search all extracted text'
377
 
378
  The 'search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
 
385
 
386
  ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
387
 
388
+ ## Redacting Word, tabular data files (XLSX/CSV) or copy and pasted text
389
 
390
+ ### Word or tabular data files (XLSX/CSV)
391
 
392
+ The app can be used to redact Word (.docx) files, or tabular data files such as xlsx or csv files. For this to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1), and no other information in the sheet. Similarly for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
393
 
394
  To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.
395
 
tools/config.py CHANGED
@@ -267,7 +267,7 @@ if NO_REDACTION_PII_OPTION in TABULAR_PII_DETECTION_MODELS:
267
  TABULAR_PII_DETECTION_MODELS.remove(NO_REDACTION_PII_OPTION)
268
 
269
  ### Local OCR model - Tesseract vs PaddleOCR
270
- CHOSEN_LOCAL_OCR_MODEL = get_or_create_env_var('CHOSEN_LOCAL_OCR_MODEL', "tesseract") # Choose between "tesseract", "hybrid", and "paddle"
271
 
272
  PREPROCESS_LOCAL_OCR_IMAGES = get_or_create_env_var('PREPROCESS_LOCAL_OCR_IMAGES', "False") # Whether to try and preprocess images before extracting text. NOTE: I have found in testing that this often results in WORSE results for scanned pages, so it is default False
273
 
 
267
  TABULAR_PII_DETECTION_MODELS.remove(NO_REDACTION_PII_OPTION)
268
 
269
  ### Local OCR model - Tesseract vs PaddleOCR
270
+ CHOSEN_LOCAL_OCR_MODEL = get_or_create_env_var('CHOSEN_LOCAL_OCR_MODEL', "tesseract") # Choose between "tesseract", "hybrid", and "paddle". "paddle" will only return whole line text extraction, and so will only work for OCR, not redaction. "hybrid" is a combination of the two - first pass through the redactions will be done with Tesseract, and then a second pass will be done with PaddleOCR on words with low confidence.
271
 
272
  PREPROCESS_LOCAL_OCR_IMAGES = get_or_create_env_var('PREPROCESS_LOCAL_OCR_IMAGES', "False") # Whether to try and preprocess images before extracting text. NOTE: I have found in testing that this often results in WORSE results for scanned pages, so it is default False
273
 
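
The "hybrid" option in the comment above is a two-pass scheme: a first pass with Tesseract, then a second pass with PaddleOCR on low-confidence words. The sketch below illustrates the control flow only. `OcrWord`, `hybrid_ocr`, the threshold of 50, and the rule of keeping whichever reading is more confident are all assumptions, and the second engine is a stand-in callable rather than a real PaddleOCR call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class OcrWord:
    text: str
    confidence: float  # assumed scale: 0-100
    box: Tuple[int, int, int, int]  # (left, top, width, height)


def hybrid_ocr(
    first_pass: List[OcrWord],
    second_engine: Callable[[OcrWord], OcrWord],
    threshold: float = 50.0,  # hypothetical cut-off, not the app's value
) -> List[OcrWord]:
    """Re-read low-confidence words with a second engine and merge results."""
    merged: List[OcrWord] = []
    for word in first_pass:
        if word.confidence < threshold:
            retry = second_engine(word)
            # Assumption: keep whichever reading is more confident.
            merged.append(retry if retry.confidence > word.confidence else word)
        else:
            merged.append(word)
    return merged


def fake_second_engine(word: OcrWord) -> OcrWord:
    """Stand-in for a second OCR engine; returns a fixed, confident reading."""
    return OcrWord(text="Smith", confidence=90.0, box=word.box)


words = [
    OcrWord("John", 95.0, (10, 10, 40, 12)),
    OcrWord("Sm1th", 30.0, (55, 10, 45, 12)),
]
result = hybrid_ocr(words, fake_second_engine)
print([w.text for w in result])
```

Because the second engine is only called for words below the threshold, word-level bounding boxes from the first pass are preserved, which is what word-level redaction depends on.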
tools/data_anonymise.py CHANGED
@@ -758,9 +758,6 @@ def anonymise_script(df:pd.DataFrame,
758
  batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
759
  analyzer_results = list()
760
 
761
- # Use provided comprehend language or fall back to main language
762
- language = language
763
-
764
  if pii_identification_method == "Local":
765
 
766
  # Use custom analyzer to be able to track progress with Gradio
 
758
  batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
759
  analyzer_results = list()
760
 
 
 
 
761
  if pii_identification_method == "Local":
762
 
763
  # Use custom analyzer to be able to track progress with Gradio
tools/file_conversion.py CHANGED
@@ -834,7 +834,10 @@ def prepare_image_or_pdf(
  out_message.append(out_time)
  combined_out_message = '\n'.join(out_message)
 
- number_of_pages = len(page_sizes)
+ if not page_sizes:
+     number_of_pages = 1
+ else:
+     number_of_pages = len(page_sizes)
 
  print("Finished loading in files")
 
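Review note: the hunk above makes the page count default to 1 when `page_sizes` is empty, which is the case the commit message describes (an `ocr_output` file uploaded before the PDF). A minimal standalone sketch of that fallback, with `count_pages` as a hypothetical name that does not appear in the repo:

```python
from typing import Optional, Sequence


def count_pages(page_sizes: Optional[Sequence[dict]]) -> int:
    """Return the number of pages, defaulting to 1 when none are known yet."""
    # An empty list and None are both treated as "no page metadata loaded".
    if not page_sizes:
        return 1
    return len(page_sizes)
```

For example, `count_pages([])` gives 1, while a list holding two page-size records gives 2.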
tools/find_duplicate_pages.py CHANGED
@@ -1209,7 +1209,7 @@ def create_annotation_objects_from_duplicates(
  if duplicates_df.empty:
      raise Warning("No duplicates found")
  if ocr_results_df.empty:
-     raise Warning("No OCR results found for file under review. Please upload relevant OCR_output file for the PDF file on the review tab.")
+     raise Warning("No OCR results found for file under review. Please upload relevant OCR_output file and original PDF document on the review tab.")
 
  if combine_pages == False:
      page_to_image_map = {item['page']: item['image_path'] for item in page_sizes}
tools/load_spacy_model_custom_recognisers.py CHANGED
@@ -506,9 +506,9 @@ def create_nlp_analyser(language: str = DEFAULT_LANGUAGE, custom_list: List[str]
      return nlp_analyser
 
  # Create the default nlp_analyser using the new function
- nlp_analyser, nlp_model = create_nlp_analyser(DEFAULT_LANGUAGE, return_also_model=True)
+ nlp_analyser, nlp = create_nlp_analyser(DEFAULT_LANGUAGE, return_also_model=True)
 
- def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mistakes_max:int = 1, search_whole_phrase:bool=True, nlp=nlp_model, progress=gr.Progress(track_tqdm=True)):
+ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mistakes_max:int = 1, search_whole_phrase:bool=True, nlp=nlp, progress=gr.Progress(track_tqdm=True)):
      ''' Conduct fuzzy match on a list of text data.'''
 
      all_matches = []
@@ -546,7 +546,6 @@ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mista
 
  else:
      # If matching a whole phrase, use Spacy PhraseMatcher, then consider similarity after using Levenshtein distance.
-     #tokenised_query = [string_query.lower()]
      # If you want to match the whole phrase, use phrase matcher
      matcher = FuzzyMatcher(nlp.vocab)
      patterns = [nlp.make_doc(string_query)] # Convert query into a Doc object
@@ -567,9 +566,7 @@ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mista
  for match_id, start, end in matches:
      span = str(doc[start:end]).strip()
      query_search = str(query).strip()
-     #print("doc:", doc)
-     #print("span:", span)
-     #print("query_search:", query_search)
+
 
      # Convert word positions to character positions
      start_char = doc[start].idx # Start character position
@@ -584,9 +581,6 @@ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mista
  for match_id, start, end, ratio, pattern in matches:
      span = str(doc[start:end]).strip()
      query_search = str(query).strip()
-     #print("doc:", doc)
-     #print("span:", span)
-     #print("query_search:", query_search)
 
      # Calculate Levenshtein distance. Only keep matches with less than specified number of spelling mistakes
      distance = Levenshtein.distance(query_search.lower(), span.lower())
@@ -600,9 +594,6 @@ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mista
  start_char = doc[start].idx # Start character position
  end_char = doc[end - 1].idx + len(doc[end - 1]) # End character position
 
- #print("start_char:", start_char)
- #print("end_char:", end_char)
-
  all_matches.append(match_count)
  all_start_positions.append(start_char)
  all_end_positions.append(end_char)
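Review note: `spacy_fuzzy_search` filters candidate spans by their Levenshtein distance from the query. The sketch below illustrates that filtering step in plain Python, with no spaCy or `Levenshtein` dependency. Both function names are hypothetical. The `<=` comparison is an assumption: the source comment says "less than", and the actual comparison is not visible in this diff.

```python
from typing import List


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # Keep the shorter string in the inner loop to bound the row length.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def keep_close_matches(query: str, spans: List[str],
                       spelling_mistakes_max: int = 1) -> List[str]:
    """Keep spans within the allowed number of spelling mistakes of the query."""
    query_clean = query.strip().lower()
    # Assumption: a span is kept when distance <= spelling_mistakes_max.
    return [
        span for span in spans
        if levenshtein_distance(query_clean, span.strip().lower()) <= spelling_mistakes_max
    ]
```

With the default of one allowed mistake, `keep_close_matches("John Smith", ["Jon Smith", "Jane Smith"])` keeps only `"Jon Smith"`.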
tools/redaction_review.py CHANGED
@@ -797,6 +797,10 @@ def update_annotator_object_and_filter_df(
  page_num_reported_zero_indexed = page_num_reported - 1
  annotate_previous_page = page_num_reported # Store the determined page number
 
+ if not page_sizes:
+     page_num_reported = 0
+     annotate_previous_page = 0
+
  # --- Process page sizes DataFrame ---
  page_sizes_df = pd.DataFrame(page_sizes)
  if not page_sizes_df.empty:
@@ -916,7 +920,10 @@ def update_annotator_object_and_filter_df(
 
 
  # --- Final Output Components ---
- page_number_reported_gradio_comp = gr.Number(label = "Current page", value=page_num_reported, precision=0)
+ if page_sizes:
+     page_number_reported_gradio_comp = gr.Number(label = "Current page", value=page_num_reported, precision=0, maximum=len(page_sizes), minimum=1)
+ else:
+     page_number_reported_gradio_comp = gr.Number(label = "Current page", value=0, precision=0, maximum=9999, minimum=0)
 
  ### Present image_annotator outputs
  # Handle the case where current_page_image_annotator_object couldn't be prepared
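Review note: the second hunk above bounds the "Current page" number input by the number of loaded pages, and falls back to a 0-to-9999 range when no pages are loaded. The bounds logic can be sketched as a small pure function that does not need Gradio to run. `page_number_bounds` is a hypothetical name used only for illustration.

```python
from typing import Dict, Optional, Sequence


def page_number_bounds(page_sizes: Optional[Sequence[dict]],
                       page_num_reported: int) -> Dict[str, int]:
    """Return value/minimum/maximum settings for the current-page input."""
    if page_sizes:
        # Pages are loaded: restrict the input to 1..number of pages.
        return {
            "value": page_num_reported,
            "minimum": 1,
            "maximum": len(page_sizes),
        }
    # No pages loaded: show 0 and use the permissive range from the diff.
    return {"value": 0, "minimum": 0, "maximum": 9999}
```

The returned dict could then be unpacked into the component, for example `gr.Number(label="Current page", precision=0, **page_number_bounds(page_sizes, page_num_reported))`.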