Spaces: seanpedrickcase/document_redaction

seanpedrickcase committed 15 days ago

Commit af187f0 (1 parent: 6f96988)

Updated documentation. Fix on ocr_output upload before pdf. Duplicate page fix

Files changed (14)

README.md +265 -8
app.py +2 -5
cdk/cdk_config.py +3 -3
cdk/post_cdk_build_quickstart.py +2 -2
cdk/requirements.txt +3 -3
example_config.env +26 -0
src/app_settings.qmd +69 -20
src/user_guide.qmd +79 -6
tools/config.py +1 -1
tools/data_anonymise.py +0 -3
tools/file_conversion.py +4 -1
tools/find_duplicate_pages.py +1 -1
tools/load_spacy_model_custom_recognisers.py +3 -12
tools/redaction_review.py +8 -1

README.md CHANGED

@@ -12,7 +12,7 @@ license: agpl-3.0
 version: 1.0.0
-Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a walkthrough on how to use the app. Below is a very brief overview.
 To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works quite well for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
@@ -20,7 +20,191 @@ After redaction, review suggested redactions on the 'Review redactions' tab. The
 NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that all outputs are reviewed **by a human** before using the final outputs.
-# USER GUIDE
 ## Table of contents
@@ -35,7 +219,7 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.
     - [Redacting only specific pages](#redacting-only-specific-pages)
     - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
-- [Redacting tabular data files (CSV/XLSX) or copy and pasted text](#redacting-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
 See the [advanced user guide here](#advanced-user-guide):
 - [Merging redaction review files](#merging-redaction-review-files)
@@ -225,9 +409,11 @@ On the 'Review redactions' tab you have a visual interface that allows you to in
 ### Uploading documents for review
-The top area has a file upload area where you can upload original, unredacted PDFs, alongside the '..._review_file.csv' that is produced by the redaction process. Once you have uploaded these two files, click the '**Review redactions based on original PDF...**' button to load in the files for review. This will allow you to visualise and modify the suggested redactions using the interface below.
-Optionally, you can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the extracted text from the document.
 ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
@@ -315,6 +501,77 @@ Once you have filtered the table, or selected a row from the table, you have a f
 If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (can only undo the last action).
 ### Navigating through the document using the 'Search all extracted text'
 The 'search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
@@ -327,11 +584,11 @@ You can search through the extracted text by using the search bar just above the
 ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
-## Redacting tabular data files (XLSX/CSV) or copy and pasted text
-### Tabular data files (XLSX/CSV)
-The app can be used to redact tabular data files such as xlsx or csv files. For this to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1), and no other information in the sheet. Similarly for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
 To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.

 version: 1.0.0
+Redact personally identifiable information (PII) from documents (pdf, images), Word files (.docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a walkthrough on how to use the app. Below is a very brief overview.
 To identify text in documents, the 'local' text/OCR image analysis uses spacy/tesseract, and works quite well for documents with typed text. If available, choose 'AWS Textract service' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
 NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that all outputs are reviewed **by a human** before using the final outputs.
+---
+## 🚀 Quick Start - Installation and first run
+Follow these instructions to get the document redaction application running on your local machine.
+### 1. Prerequisites: System Dependencies
+This application relies on two external tools for OCR (Tesseract) and PDF processing (Poppler). Please install them on your system before proceeding.
+---
+#### **On Windows**
+Installation on Windows requires downloading installers and adding the programs to your system's PATH.
+1.  **Install Tesseract OCR:**
+    *   Download the installer from the [Tesseract at UB Mannheim page](https://github.com/UB-Mannheim/tesseract/wiki) (e.g., `tesseract-ocr-w64-setup-v5.X.X...exe`).
+    *   Run the installer.
+    *   **IMPORTANT:** During installation, ensure you select the option to "Add Tesseract to system PATH for all users" or a similar option. This is crucial for the application to find the Tesseract executable.
+2.  **Install Poppler:**
+    *   Download the latest Poppler binary for Windows. A common source is the [Poppler for Windows](https://github.com/oschwartz10612/poppler-windows) GitHub releases page. Download the `.zip` file (e.g., `poppler-24.02.0-win.zip`).
+    *   Extract the contents of the zip file to a permanent location on your computer, for example, `C:\Program Files\poppler\`.
+    *   You must add the `bin` folder from your Poppler installation to your system's PATH environment variable.
+        *   Search for "Edit the system environment variables" in the Windows Start Menu and open it.
+        *   Click the "Environment Variables..." button.
+        *   In the "System variables" section, find and select the `Path` variable, then click "Edit...".
+        *   Click "New" and add the full path to the `bin` directory inside your Poppler folder (e.g., `C:\Program Files\poppler\poppler-24.02.0\bin`).
+        *   Click OK on all windows to save the changes.
+    To verify, open a new Command Prompt and run `tesseract --version` and `pdftoppm -v`. If they both return version information, you have successfully installed the prerequisites.
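The same check can be scripted on any platform. The following is a minimal sketch and not part of the app: it assumes only that the executables are named `tesseract` and `pdftoppm`, as in the commands above, and the helper name `find_prerequisites` is hypothetical.

```python
import shutil


def find_prerequisites(names=("tesseract", "pdftoppm")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Print one line per tool so a missing prerequisite is easy to spot.
    for name, path in find_prerequisites().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, revisit the PATH steps above and open a new terminal before trying again.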
+---
+#### **On Linux (Debian/Ubuntu)**
+Open your terminal and run the following command to install Tesseract and Poppler:
+```bash
+sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils
+```
+#### **On Linux (Fedora/CentOS/RHEL)**
+Open your terminal and use the `dnf` or `yum` package manager:
+```bash
+sudo dnf install -y tesseract poppler-utils
+```
+---
+### 2. Installation: Code and Python Packages
+Once the system prerequisites are installed, you can set up the Python environment.
+#### Step 1: Clone the Repository
+Open your terminal or Git Bash and clone this repository:
+```bash
+git clone https://github.com/seanpedrick-case/doc_redaction.git
+cd doc_redaction
+```
+#### Step 2: Create and Activate a Virtual Environment (Recommended)
+It is highly recommended to use a virtual environment to isolate project dependencies and avoid conflicts with other Python projects.
+```bash
+# Create the virtual environment
+python -m venv venv
+# Activate it
+# On Windows:
+.\venv\Scripts\activate
+# On macOS/Linux:
+source venv/bin/activate
+```
+#### Step 3: Install Python Dependencies
+This project uses `pyproject.toml` to manage dependencies. You can install everything with a single pip command. This process will also download the required Spacy models and other packages directly from their URLs.
+```bash
+pip install .
+```
+Alternatively, you can use the `requirements.txt` file:
+```bash
+pip install -r requirements.txt
+```
+### 3. Run the Application
+With all dependencies installed, you can now start the Gradio application.
+```bash
+python app.py
+```
+After running the command, the application will start, and you will see a local URL in your terminal (usually `http://127.0.0.1:7860`).
+Open this URL in your web browser to use the document redaction tool.
+---
+### 4. ⚙️ Configuration (Optional)
+You can customise the application's behavior by creating a configuration file. This allows you to change settings without modifying the source code, such as enabling AWS features, changing logging behavior, or pointing to local Tesseract/Poppler installations. The full set of settings that you can modify in the `app_config.env` file is defined in `tools/config.py`, with explanations on the [documentation website for the GitHub repo](https://seanpedrick-case.github.io/doc_redaction/).
+To get started:
+1.  Locate the `example_config.env` file in the root of the project.
+2.  Create a new file named `app_config.env` inside the `config/` directory (i.e., `config/app_config.env`).
+3.  Copy the contents from `example_config.env` into your new `config/app_config.env` file.
+4.  Modify the values in `config/app_config.env` to suit your needs. The application will automatically load these settings on startup.
+If you do not create this file, the application will run with default settings.
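To illustrate the idea of "settings file if present, defaults otherwise", here is a minimal sketch of a `KEY=VALUE` reader. It is not the app's own loader (that logic lives in `tools/config.py`, which is not reproduced here); the function names `read_env_file` and `get_setting` are hypothetical.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines; return an empty dict if the file does not exist."""
    settings = {}
    file_path = Path(path)
    if not file_path.is_file():
        return settings
    for raw_line in file_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, full-line comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "1 # Set to 0 ...".
        value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings


def get_setting(settings, key, default):
    """Return the configured value, or the default when the key is absent."""
    return settings.get(key, default)
```

For example, `get_setting(read_env_file("config/app_config.env"), "SHOW_LANGUAGE_SELECTION", "False")` would return the default when the file is missing.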
+#### Configuration Breakdown
+Here is an overview of the most important settings, separated by whether they are for local use or require AWS.
+---
+#### **Local & General Settings (No AWS Required)**
+These settings are useful for all users, regardless of whether you are using AWS.
+*   `TESSERACT_FOLDER` / `POPPLER_FOLDER`
+    *   Use these if you installed Tesseract or Poppler to a custom location on **Windows** and did not add them to the system PATH.
+    *   Provide the path to the respective installation folders (for Poppler, point to the `bin` sub-directory).
+    *   **Examples:** `POPPLER_FOLDER=C:/Program Files/poppler-24.02.0/bin/` `TESSERACT_FOLDER=tesseract/`
+*   `SHOW_LANGUAGE_SELECTION=True`
+    *   Set to `True` to display a language selection dropdown in the UI for OCR processing.
+*   `CHOSEN_LOCAL_OCR_MODEL=tesseract`
+    *   Choose the backend for local OCR. Options are `tesseract`, `paddle`, or `hybrid`. `tesseract` is the default and is recommended. `hybrid` combines the two: a first pass is done with Tesseract, then a second pass is done with PaddleOCR on words with low confidence. `paddle` only returns whole-line text extraction, so it works for OCR only, not for redaction.
+*   `SESSION_OUTPUT_FOLDER=False`
+    *   If `True`, redacted files will be saved in unique subfolders within the `output/` directory for each session.
+*   `DISPLAY_FILE_NAMES_IN_LOGS=False`
+    *   For privacy, file names are not recorded in usage logs by default. Set to `True` to include them.
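The `hybrid` option described above follows a two-pass pattern: keep confident words from the first engine and re-read only the uncertain ones. The sketch below shows that pattern in isolation. It is not the app's implementation: the dictionary keys (`text`, `conf`), the threshold of 50, and the `second_pass` callable (standing in for a PaddleOCR call) are all assumptions made for illustration.

```python
def hybrid_ocr(words, second_pass, min_confidence=50.0):
    """Re-run low-confidence words through a second OCR pass.

    words: list of dicts with hypothetical keys 'text' and 'conf' (0-100).
    second_pass: callable taking one word dict and returning a replacement dict.
    """
    results = []
    for word in words:
        if word["conf"] < min_confidence:
            # Low confidence: defer to the second engine for this word only.
            results.append(second_pass(word))
        else:
            results.append(word)
    return results
```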
+---
+#### **AWS-Specific Settings**
+These settings are only relevant if you intend to use AWS services like Textract for OCR and Comprehend for PII detection.
+*   `RUN_AWS_FUNCTIONS=1`
+    *   **This is the master switch.** You must set this to `1` to enable any AWS functionality. If it is `0`, all other AWS settings will be ignored.
+*   **UI Options:**
+    *   `SHOW_AWS_TEXT_EXTRACTION_OPTIONS=True`: Adds "AWS Textract" as an option in the text extraction dropdown.
+    *   `SHOW_AWS_PII_DETECTION_OPTIONS=True`: Adds "AWS Comprehend" as an option in the PII detection dropdown.
+*   **Core AWS Configuration:**
+    *   `AWS_REGION=example-region`: Set your AWS region (e.g., `us-east-1`).
+    *   `DOCUMENT_REDACTION_BUCKET=example-bucket`: The name of the S3 bucket the application will use for temporary file storage and processing.
+*   **AWS Logging:**
+    *   `SAVE_LOGS_TO_DYNAMODB=True`: If enabled, usage and feedback logs will be saved to DynamoDB tables.
+    *   `ACCESS_LOG_DYNAMODB_TABLE_NAME`, `USAGE_LOG_DYNAMODB_TABLE_NAME`, etc.: Specify the names of your DynamoDB tables for logging.
+*   **Advanced AWS Textract Features:**
+    *   `SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS=True`: Enables UI components for large-scale, asynchronous document processing via Textract.
+    *   `TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET=example-bucket-output`: A separate S3 bucket for the final output of asynchronous Textract jobs.
+    *   `LOAD_PREVIOUS_TEXTRACT_JOBS_S3=True`: If enabled, the app will try to load the status of previously submitted asynchronous jobs from S3.
+*   **Cost Tracking (for internal accounting):**
+    *   `SHOW_COSTS=True`: Displays an estimated cost for AWS operations. Can be enabled even if AWS functions are off.
+    *   `GET_COST_CODES=True`: Enables a dropdown for users to select a cost code before running a job.
+    *   `COST_CODES_PATH=config/cost_codes.csv`: The local path to a CSV file containing your cost codes.
+    *   `ENFORCE_COST_CODES=True`: Makes selecting a cost code mandatory before starting a redaction.
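The "master switch" behaviour described above can be pictured as follows. This is a sketch under stated assumptions, not the app's code: the helper name `effective_aws_options`, the returned keys, and the fallback values (`"0"` and `"False"`) are hypothetical; only the setting names come from the list above.

```python
def effective_aws_options(settings):
    """Work out which AWS-related UI options are active for a dict of string settings."""
    aws_on = settings.get("RUN_AWS_FUNCTIONS", "0") == "1"

    def flag(name):
        return settings.get(name, "False") == "True"

    return {
        # These two only take effect when the master switch is on.
        "textract": aws_on and flag("SHOW_AWS_TEXT_EXTRACTION_OPTIONS"),
        "comprehend": aws_on and flag("SHOW_AWS_PII_DETECTION_OPTIONS"),
        # Cost display is described as independent of the master switch.
        "show_costs": flag("SHOW_COSTS"),
    }
```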
+Now you have the app installed, what follows is a guide on how to use it for basic and advanced redaction.
+# User Guide
 ## Table of contents
     - [Redacting only specific pages](#redacting-only-specific-pages)
     - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
+- [Redacting Word, tabular data files (CSV/XLSX) or copy and pasted text](#redacting-word-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
 See the [advanced user guide here](#advanced-user-guide):
 - [Merging redaction review files](#merging-redaction-review-files)
 ### Uploading documents for review
+The top area has a file upload area where you can upload files for review. In the left box, upload the original PDF file. Click '1. Upload original PDF'. In the right box, you can upload the '..._review_file.csv' that is produced by the redaction process.
+Optionally, you can upload a '..._ocr_result_with_words' file here, which will allow you to search through the text and easily [add new redactions based on word search](#searching-and-adding-custom-redactions). You can also upload one of the '..._ocr_output.csv' files here that come out of a redaction task, so that you can navigate the extracted text from the document. Click the button '2. Upload Review or OCR csv files' to load in these files.
+Now you can review and modify the suggested redactions using the interface described below.
 ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
 If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (can only undo the last action).
+### Searching and Adding Custom Redactions
+After a document has been processed, you may need to redact specific terms, names, or phrases that the automatic PII (Personally Identifiable Information) detection might have missed. The **"Search text to make new redactions"** tab gives you the power to find and redact any text within your document manually.
+#### How to Use the Search and Redact Feature
+The workflow is designed to be simple: **Search → Select → Redact**.
+---
+#### **Step 1: Search for Text**
+1.  Navigate to the **"Search text to make new redactions"** tab.
+2.  The main table will initially be populated with all the text extracted from the document, broken down by word.
+3.  To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find.
+4.  Click the **"Search"** button or press Enter.
+5.  The table below will update to show only the rows containing text that matches your search query.
+> **Tip:** You can also filter the results by page number using the **"Page"** dropdown. To clear all filters and see the full text again, click the **"Reset table to original state"** button.
+---
+#### **Step 2: Select and Review a Match**
+When you click on any row in the search results table:
+*   The document preview on the left will automatically jump to that page, allowing you to see the word in its original context.
+*   The details of your selection will appear in the smaller **"Selected row"** table for confirmation.
+---
+#### **Step 3: Choose Your Redaction Method**
+You have several powerful options for redacting the text you've found:
+*   **Redact a Single, Specific Instance:**
+    *   Click on the exact row in the table you want to redact.
+    *   Click the **`Redact specific text row`** button.
+    *   Only that single instance will be redacted.
+*   **Redact All Instances of a Word/Phrase:**
+    *   Let's say you want to redact the project name "Project Alpha" everywhere it appears.
+    *   Find and select one instance of "Project Alpha" in the table.
+    *   Click the **`Redact all words with same text as selected row`** button.
+    *   The application will find and redact every single occurrence of "Project Alpha" throughout the entire document.
+*   **Redact All Current Search Results:**
+    *   Perform a search (e.g., for a specific person's name).
+    *   If you are confident that every result shown in the filtered table should be redacted, click the **`Redact all text in table`** button.
+    *   This will apply a redaction to all currently visible items in the table in one go.
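Conceptually, "redact all instances" means scanning the word-level table for every run of rows that spells out the selected phrase. The sketch below shows one way to do that. It is an illustration only: the row keys (`page`, `text`), the case-insensitive comparison, and the function name `find_phrase_matches` are assumptions, not the app's actual matching logic.

```python
def find_phrase_matches(rows, phrase):
    """Find every occurrence of a multi-word phrase in word-level rows.

    rows: list of dicts with hypothetical keys 'page' and 'text', in reading order.
    Returns a list of index lists, one per match.
    """
    target = phrase.lower().split()
    if not target:
        return []
    matches = []
    for start in range(len(rows) - len(target) + 1):
        window = rows[start:start + len(target)]
        # A match must not straddle a page boundary.
        same_page = len({row["page"] for row in window}) == 1
        if same_page and [row["text"].lower() for row in window] == target:
            matches.append(list(range(start, start + len(target))))
    return matches
```

Each returned index list identifies the rows that would receive a redaction box for that occurrence.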
+---
+#### **Customising Your New Redactions**
+Before you click one of the redact buttons, you can customize the appearance and label of the new redactions under the **"Search options"** accordion:
+*   **Label for new redactions:** Change the text that appears on the redaction box (default is "Redaction"). You could change this to "CONFIDENTIAL" or "CUSTOM".
+*   **Colour for labels:** Set a custom color for the redaction box by providing an RGB value. The format must be three numbers (0-255) in parentheses, for example:
+    *   ` (255, 0, 0) ` for Red
+    *   ` (0, 0, 0) ` for Black
+    *   ` (255, 255, 0) ` for Yellow
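The colour format above (three numbers from 0 to 255 in parentheses) can be validated with a short helper. This is a minimal sketch; `parse_rgb` is a hypothetical name and the app may validate the value differently.

```python
import re

_RGB_PATTERN = re.compile(r"^\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$")


def parse_rgb(text):
    """Parse a string such as '(255, 0, 0)' into a tuple of three ints in 0-255."""
    match = _RGB_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Expected a value like (255, 0, 0), got {text!r}")
    values = tuple(int(group) for group in match.groups())
    if any(value > 255 for value in values):
        raise ValueError("Each colour component must be between 0 and 255")
    return values
```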
+#### **Undoing a Mistake**
+If you make a mistake, you can reverse the last redaction action you performed on this tab.
+*   Click the **`Undo latest redaction`** button. This will revert the last set of redactions you added (whether it was a single row, all of a certain text, or all search results).
+> **Important:** This undo button only works for the *most recent* action. It maintains a single backup state, so it cannot undo actions that are two or more steps in the past.
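The single-backup behaviour described above can be modelled in a few lines. This sketch is illustrative only: the class name `SingleLevelUndo` and its methods are hypothetical and do not mirror the app's internal state handling.

```python
import copy


class SingleLevelUndo:
    """Keep one backup of the redaction list so only the latest action can be undone."""

    def __init__(self, redactions=None):
        self.redactions = list(redactions or [])
        self._backup = None

    def apply(self, new_redactions):
        # Overwrite the single backup slot before every change.
        self._backup = copy.deepcopy(self.redactions)
        self.redactions.extend(new_redactions)

    def undo(self):
        """Restore the backup; return False if there is nothing to undo."""
        if self._backup is None:
            return False
        self.redactions = self._backup
        self._backup = None
        return True
```

Because each `apply` overwrites the only backup, a second `undo` in a row has nothing left to restore.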
 ### Navigating through the document using the 'Search all extracted text'
 The 'search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
 ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
+## Redacting Word, tabular data files (XLSX/CSV) or copy and pasted text
+### Word or tabular data files (XLSX/CSV)
+The app can be used to redact Word (.docx) files, or tabular data files such as xlsx or csv files. For this to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1), and no other information in the sheet. Similarly for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
 To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.

app.py CHANGED

@@ -257,7 +257,7 @@ with app:
     gr.Markdown(
     """# Document redaction
-    Redact personally identifiable information (PII) from documents (PDF, images), open text, or tabular data (XLSX/CSV/Parquet). Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use the app. Below is a very brief overview.
     To identify text in documents, the 'Local' text/OCR image analysis uses spaCy/Tesseract, and works well only for documents with typed text. If available, choose 'AWS Textract' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
@@ -908,7 +908,6 @@ with app:
     ###
     # IDENTIFY DUPLICATE PAGES
     ###
-    #in_duplicate_pages.upload(fn = prepare_image_or_pdf, inputs=[in_duplicate_pages, text_extract_method_radio, all_page_line_level_ocr_results_df_base, all_page_line_level_ocr_results_with_words_df_base, latest_file_completed_num, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false, page_sizes, pdf_doc_state], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_df, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_page_line_level_ocr_results_df_base, relevant_ocr_output_with_words_found_checkbox, all_page_line_level_ocr_results_with_words_df_base])
     find_duplicate_pages_btn.click(
         fn=run_duplicate_analysis,
@@ -977,9 +976,7 @@ with app:
     all_output_files_btn.click(fn=load_all_output_files, inputs=output_folder_textbox, outputs=all_output_files)
     # Language selection dropdown
-    chosen_language_full_name_drop.select(update_language_dropdown, inputs=[chosen_language_full_name_drop], outputs=[chosen_language_drop])#.\
-    #success(download_tesseract_lang_pack, inputs=[chosen_language_drop], outputs = [tesseract_lang_data_file_path]).\
-    #success(load_spacy_model, inputs=[chosen_language_drop], outputs=[updated_nlp_analyser_state])
     ###
     # APP LOAD AND LOGGING

     gr.Markdown(
     """# Document redaction
+    Redact personally identifiable information (PII) from documents (PDF, images), Word files (.docx), or tabular data (XLSX/CSV/Parquet). Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use the app. Below is a very brief overview.
     To identify text in documents, the 'Local' text/OCR image analysis uses spaCy/Tesseract, and works well only for documents with typed text. If available, choose 'AWS Textract' to redact more complex elements e.g. signatures or handwriting. Then, choose a method for PII identification. 'Local' is quick and gives good results if you are primarily looking for a custom list of terms to redact (see Redaction settings). If available, AWS Comprehend gives better results at a small cost.
     ###
     # IDENTIFY DUPLICATE PAGES
     ###
     find_duplicate_pages_btn.click(
         fn=run_duplicate_analysis,
     all_output_files_btn.click(fn=load_all_output_files, inputs=output_folder_textbox, outputs=all_output_files)
     # Language selection dropdown
+    chosen_language_full_name_drop.select(update_language_dropdown, inputs=[chosen_language_full_name_drop], outputs=[chosen_language_drop])
     ###
     # APP LOAD AND LOGGING

cdk/cdk_config.py CHANGED

@@ -213,9 +213,9 @@ SAVE_LOGS_TO_CSV = get_or_create_env_var('SAVE_LOGS_TO_CSV', 'True')
 ### DYNAMODB logs. Whether to save to DynamoDB, and the headers of the table
 SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'True')
-ACCESS_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('ACCESS_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-access-log".lower())
-FEEDBACK_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('FEEDBACK_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-feedback".lower())
-USAGE_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('USAGE_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-usage".lower())
 ###
 # REDACTION OPTIONS

 ### DYNAMODB logs. Whether to save to DynamoDB, and the headers of the table
 SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'True')
+ACCESS_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('ACCESS_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-access-logs".lower())
+FEEDBACK_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('FEEDBACK_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-feedback-logs".lower())
+USAGE_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('USAGE_LOG_DYNAMODB_TABLE_NAME', f"{CDK_PREFIX}dynamodb-usage-logs".lower())
 ###
 # REDACTION OPTIONS
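
The hunk above changes only the default table names passed to `get_or_create_env_var`. That helper's body is not shown in this diff, so the following is a sketch of how such a helper typically behaves (read the variable if set, otherwise store and return the default), not the repository's implementation. The `CDK_PREFIX` value is a made-up example.

```python
import os


def get_or_create_env_var(var_name, default_value):
    """Assumed behaviour: return the existing value, or set and return the default."""
    value = os.environ.get(var_name)
    if value is None:
        os.environ[var_name] = default_value
        value = default_value
    return value


CDK_PREFIX = "Example-"  # hypothetical prefix, for illustration only
ACCESS_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var(
    "ACCESS_LOG_DYNAMODB_TABLE_NAME", f"{CDK_PREFIX}dynamodb-access-logs".lower()
)
```

Under this assumption, an environment that already defines `ACCESS_LOG_DYNAMODB_TABLE_NAME` keeps its value, so the renamed default only affects deployments that rely on the default.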

cdk/post_cdk_build_quickstart.py CHANGED

@@ -13,10 +13,10 @@ start_codebuild_build(PROJECT_NAME=CODEBUILD_PROJECT_NAME)
 # Upload config.env file to S3 bucket
 upload_file_to_s3(local_file_paths="config/config.env", s3_key="", s3_bucket=S3_LOG_CONFIG_BUCKET_NAME)
-total_seconds = 450 # 7.5 minutes
 update_interval = 1 # Update every second
-print("Waiting 7.5 minutes for the CodeBuild container to build.")
 # tqdm iterates over a range, and you perform a small sleep in each iteration
 for i in tqdm(range(total_seconds), desc="Building container"):

 # Upload config.env file to S3 bucket
 upload_file_to_s3(local_file_paths="config/config.env", s3_key="", s3_bucket=S3_LOG_CONFIG_BUCKET_NAME)
+total_seconds = 660 # 11 minutes
 update_interval = 1 # Update every second
+print("Waiting 11 minutes for the CodeBuild container to build.")
 # tqdm iterates over a range, and you perform a small sleep in each iteration
 for i in tqdm(range(total_seconds), desc="Building container"):
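
The loop body is cut off in this hunk. Based on the comment above it, a plausible shape is a short sleep per iteration. The sketch below assumes that, and leaves out `tqdm` to stay dependency-free; `wait_for_build` and its injectable `sleep` parameter are hypothetical.

```python
import time


def wait_for_build(total_seconds, update_interval=1, sleep=time.sleep):
    """Sleep in small steps for roughly total_seconds; return the number of steps taken."""
    steps = 0
    for _ in range(0, total_seconds, update_interval):
        sleep(update_interval)
        steps += 1
    return steps
```

With the new value of `total_seconds = 660` and a one-second interval, this performs 660 sleeps, matching the 11-minute wait announced in the print statement.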

cdk/requirements.txt CHANGED

@@ -1,5 +1,5 @@
-aws-cdk-lib==2.202.0
-boto3==1.38.41
-pandas==2.3.0
 nodejs==0.1.1
 python-dotenv==1.0.1

+aws-cdk-lib==2.212.0
+boto3==1.40.10
+pandas==2.3.1
 nodejs==0.1.1
 python-dotenv==1.0.1

example_config.env ADDED

	@@ -0,0 +1,26 @@

+TESSERACT_FOLDER=tesseract/
+POPPLER_FOLDER=poppler/poppler-24.02.0/Library/bin/
+SHOW_LANGUAGE_SELECTION=True
+CHOSEN_LOCAL_OCR_MODEL=tesseract
+SESSION_OUTPUT_FOLDER=False
+DISPLAY_FILE_NAMES_IN_LOGS=False
+RUN_AWS_FUNCTIONS=1 # Set to 0 if you don't want to run AWS functions
+SAVE_LOGS_TO_DYNAMODB=True
+S3_COST_CODES_PATH=cost_codes.csv
+SHOW_AWS_TEXT_EXTRACTION_OPTIONS=True
+SHOW_AWS_PII_DETECTION_OPTIONS=True
+AWS_REGION=example-region
+DOCUMENT_REDACTION_BUCKET=example-bucket
+SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS=True
+TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET=example-bucket-output
+LOAD_PREVIOUS_TEXTRACT_JOBS_S3=True
+ACCESS_LOG_DYNAMODB_TABLE_NAME=example-dynamodb-access-log
+USAGE_LOG_DYNAMODB_TABLE_NAME=example-dynamodb-usage
+FEEDBACK_LOG_DYNAMODB_TABLE_NAME=example-dynamodb-feedback
+SHOW_COSTS=True
+GET_COST_CODES=True
+COST_CODES_PATH=config/cost_codes.csv
+ENFORCE_COST_CODES=True
+DEFAULT_COST_CODE=example_cost_code

src/app_settings.qmd CHANGED

@@ -115,6 +115,16 @@ Configuration for input and output file handling.
     *   **Default Value:** `'input/'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 ## Logging Options
 Settings for configuring application logging, including log formats and storage locations.
@@ -161,7 +171,7 @@ Settings for configuring application logging, including log formats and storage
 *   **`CSV_USAGE_LOG_HEADERS`**
     *   **Description:** Defines custom headers for CSV usage logs.
-    *   **Default Value:** A predefined list of header names. Refer to `tools/config.py` for the complete list.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`SAVE_LOGS_TO_DYNAMODB`**
@@ -214,12 +224,17 @@ Settings for configuring application logging, including log formats and storage
 Configurations related to the text redaction process, including PII detection models and external tool paths.
 *   **`TESSERACT_FOLDER`**
-    *   **Description:** Path to the local Tesseract OCR installation folder. Only required if Tesseract is not in path, or you are running a version of the app as an .exe installed with Pyinstaller. Gives the path to the local Tesseract OCR model for text extraction.
     *   **Default Value:** `""` (empty string)
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`POPPLER_FOLDER`**
-    *   **Description:** Path to the local Poppler installation's `bin` folder. Only required if Tesseract is not in path, or you are running a version of the app as an .exe installed with Pyinstaller. Poppler is used for PDF processing.
     *   **Default Value:** `""` (empty string)
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
@@ -283,24 +298,34 @@ Configurations related to the text redaction process, including PII detection mo
     *   **Default Value:** Value of `AWS_PII_OPTION` if `SHOW_AWS_PII_DETECTION_OPTIONS` is True, else value of `LOCAL_PII_OPTION`.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. Provide one of the PII detection option display names.
 *   **`CHOSEN_COMPREHEND_ENTITIES`**
     *   **Description:** A list of AWS Comprehend PII entity types to be redacted when using AWS Comprehend.
-    *   **Default Value:** A predefined list of entity types. Refer to `tools/config.py` for the complete list.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
 *   **`FULL_COMPREHEND_ENTITY_LIST`**
     *   **Description:** The complete list of PII entity types supported by AWS Comprehend that can be selected for redaction.
-    *   **Default Value:** A predefined list of entity types. Refer to `tools/config.py` for the complete list.
     *   **Configuration:** This is typically an informational variable reflecting the capabilities of AWS Comprehend and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_COMPREHEND_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`CHOSEN_REDACT_ENTITIES`**
     *   **Description:** A list of local PII entity types to be redacted when using the local PII detection model.
-    *   **Default Value:** A predefined list of entity types. Refer to `tools/config.py` for the complete list.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
 *   **`FULL_ENTITY_LIST`**
     *   **Description:** The complete list of PII entity types supported by the local PII detection model that can be selected for redaction.
-    *   **Default Value:** A predefined list of entity types. Refer to `tools/config.py` for the complete list.
     *   **Configuration:** This is typically an informational variable reflecting the capabilities of the local model and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_REDACT_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`PAGE_BREAK_VALUE`**
@@ -309,20 +334,15 @@ Configurations related to the text redaction process, including PII detection mo
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`MAX_TIME_VALUE`**
-    *   **Description:** Specifies the maximum time (in arbitrary units, likely seconds or milliseconds depending on implementation) for a process before it might be timed out.
     *   **Default Value:** `'999999'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`CUSTOM_BOX_COLOUR`**
-    *   **Description:** Allows specifying a custom color for the redaction boxes drawn on documents (e.g., "grey", "red", "#FF0000"). If empty, a default color is used.
     *   **Default Value:** `""` (empty string)
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
-*   **`REDACTION_LANGUAGE`**
-    *   **Description:** Specifies the language for redaction processing. Currently, only "en" (English) is supported.
-    *   **Default Value:** `"en"`
-    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`RETURN_PDF_END_OF_REDACTION`**
     *   **Description:** If set to `'True'`, the application will return a PDF document at the end of the redaction task.
     *   **Default Value:** `"True"`
@@ -333,13 +353,42 @@ Configurations related to the text redaction process, including PII detection mo
     *   **Default Value:** `"False"`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 ## App Run Options
 General runtime configurations for the application.
 *   **`TLDEXTRACT_CACHE`**
-    *   **Description:** Path to the cache file used by the `tldextract` library, which helps in accurately extracting top-level domains (TLDs) from URLs.
-    *   **Default Value:** `'tld/.tld_set_snapshot'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`COGNITO_AUTH`**
@@ -436,7 +485,7 @@ Settings related to tracking and applying cost codes for application usage.
 Configurations for features related to processing whole documents via APIs, particularly AWS Textract for large documents.
 *   **`SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS`**
-    *   **Description:** Controls whether UI options for whole document Textract calls are displayed. (Note: Mentioned as not currently implemented in the source).
     *   **Default Value:** `'False'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
@@ -461,12 +510,12 @@ Configurations for features related to processing whole documents via APIs, part
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
 *   **`TEXTRACT_JOBS_S3_LOC`**
-    *   **Description:** The S3 subfolder (within `TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET`) where Textract job data (output) is stored.
     *   **Default Value:** `'output'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
 *   **`TEXTRACT_JOBS_S3_INPUT_LOC`**
-    *   **Description:** The S3 subfolder (within `TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET`) where Textract job input is stored.
     *   **Default Value:** `'input'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
@@ -478,4 +527,4 @@ Configurations for features related to processing whole documents via APIs, part
 *   **`DAYS_TO_DISPLAY_WHOLE_DOCUMENT_JOBS`**
     *   **Description:** Specifies the number of past days for which to display whole document Textract jobs in the UI.
     *   **Default Value:** `'7'`
-    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.

     *   **Default Value:** `'input/'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
+*   **`GRADIO_TEMP_DIR`**
+    *   **Description:** Defines the path for Gradio's temporary file storage.
+    *   **Default Value:** `'tmp/gradio_tmp/'`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
+*   **`MPLCONFIGDIR`**
+    *   **Description:** Specifies the cache directory for the Matplotlib library, which is used for plotting and image handling.
+    *   **Default Value:** `'tmp/matplotlib_cache/'`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 ## Logging Options
 Settings for configuring application logging, including log formats and storage locations.
 *   **`CSV_USAGE_LOG_HEADERS`**
     *   **Description:** Defines custom headers for CSV usage logs.
+    *   **Default Value:** A predefined list of header names. Refer to `config.py` for the complete list.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`SAVE_LOGS_TO_DYNAMODB`**
 Configurations related to the text redaction process, including PII detection models and external tool paths.
 *   **`TESSERACT_FOLDER`**
+    *   **Description:** Path to the local Tesseract OCR installation folder. Only required if Tesseract is not in the system's PATH, or when running a packaged executable (e.g., via PyInstaller).
     *   **Default Value:** `""` (empty string)
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
+*   **`TESSERACT_DATA_FOLDER`**
+    *   **Description:** Path to the Tesseract trained data files (e.g., `tessdata`).
+    *   **Default Value:** `"/usr/share/tessdata"`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`POPPLER_FOLDER`**
+    *   **Description:** Path to the local Poppler installation's `bin` folder. Poppler is used for PDF processing. Only required if Poppler is not in the system's PATH.
     *   **Default Value:** `""` (empty string)
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
     *   **Default Value:** Value of `AWS_PII_OPTION` if `SHOW_AWS_PII_DETECTION_OPTIONS` is True, else value of `LOCAL_PII_OPTION`.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. Provide one of the PII detection option display names.
+*   **`CHOSEN_LOCAL_OCR_MODEL`**
+    *   **Description:** Choose the engine for local OCR: `"tesseract"`, `"paddle"`, or `"hybrid"`. "paddle" is effective for line extraction but not word-level redaction. "hybrid" uses Tesseract first, then PaddleOCR for low-confidence words.
+    *   **Default Value:** `"tesseract"`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
+*   **`PREPROCESS_LOCAL_OCR_IMAGES`**
+    *   **Description:** If set to `"True"`, images will be preprocessed (e.g., deskewed, contrast adjusted) before being sent to the local OCR engine. This can sometimes yield worse results on clean scans.
+    *   **Default Value:** `"False"`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`CHOSEN_COMPREHEND_ENTITIES`**
     *   **Description:** A list of AWS Comprehend PII entity types to be redacted when using AWS Comprehend.
+    *   **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
 *   **`FULL_COMPREHEND_ENTITY_LIST`**
     *   **Description:** The complete list of PII entity types supported by AWS Comprehend that can be selected for redaction.
+    *   **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
     *   **Configuration:** This is typically an informational variable reflecting the capabilities of AWS Comprehend and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_COMPREHEND_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`CHOSEN_REDACT_ENTITIES`**
     *   **Description:** A list of local PII entity types to be redacted when using the local PII detection model.
+    *   **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
 *   **`FULL_ENTITY_LIST`**
     *   **Description:** The complete list of PII entity types supported by the local PII detection model that can be selected for redaction.
+    *   **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
     *   **Configuration:** This is typically an informational variable reflecting the capabilities of the local model and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_REDACT_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`PAGE_BREAK_VALUE`**
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`MAX_TIME_VALUE`**
+    *   **Description:** Specifies a maximum time value for long-running processes.
     *   **Default Value:** `'999999'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`CUSTOM_BOX_COLOUR`**
+    *   **Description:** Allows specifying a custom color for the redaction boxes drawn on documents. Only `"grey"` is currently supported as a custom value. If empty, a default color is used.
     *   **Default Value:** `""` (empty string)
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`RETURN_PDF_END_OF_REDACTION`**
     *   **Description:** If set to `'True'`, the application will return a PDF document at the end of the redaction task.
     *   **Default Value:** `"True"`
     *   **Default Value:** `"False"`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
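Several settings above (for example `CHOSEN_REDACT_ENTITIES` and `CHOSEN_COMPREHEND_ENTITIES`) are supplied as a string representation of a Python list. The sketch below shows one way such a value could be turned into a real list. The helper name `parse_list_setting` and the entity names are illustrative assumptions; the parsing code in `config.py` may differ.

```python
import ast
import os


def parse_list_setting(name: str, default: str) -> list[str]:
    """Read an environment variable holding a Python-list string and return a list.

    Hypothetical helper for illustration; not taken from the app's source.
    """
    raw = os.environ.get(name, default)
    # literal_eval only accepts Python literals, so arbitrary code is not executed
    value = ast.literal_eval(raw)
    if not isinstance(value, list):
        raise ValueError(f"{name} must be a list, got {type(value).__name__}")
    return [str(item) for item in value]


# Example value as it might appear in config/app_config.env (entity names are illustrative)
os.environ["CHOSEN_REDACT_ENTITIES"] = "['PERSON', 'EMAIL_ADDRESS']"
entities = parse_list_setting("CHOSEN_REDACT_ENTITIES", "[]")
print(entities)
```

Using `ast.literal_eval` rather than `eval` keeps a malformed or malicious value from running code; it raises an error instead.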
+## Language Options
+Settings for multi-language support in OCR and PII detection.
+*   **`SHOW_LANGUAGE_SELECTION`**
+    *   **Description:** If set to `"True"`, a dropdown menu for language selection will be visible in the user interface.
+    *   **Default Value:** `"False"`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
+*   **`DEFAULT_LANGUAGE_FULL_NAME`**
+    *   **Description:** The default language's full name (e.g., "english") to be displayed in the UI.
+    *   **Default Value:** `"english"`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
+*   **`DEFAULT_LANGUAGE`**
+    *   **Description:** The default language's short code (e.g., "en") used by the backend engines. Ensure the corresponding Tesseract/PaddleOCR language packs are installed.
+    *   **Default Value:** `"en"`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
+*   **`MAPPED_LANGUAGE_CHOICES`**
+    *   **Description:** A string list of full language names (e.g., 'english', 'french') presented to the user in the language dropdown.
+    *   **Default Value:** A predefined list. See `config.py`.
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
+*   **`LANGUAGE_CHOICES`**
+    *   **Description:** A string list of short language codes (e.g., 'en', 'fr') that correspond to `MAPPED_LANGUAGE_CHOICES`. This is what the backend uses.
+    *   **Default Value:** A predefined list. See `config.py`.
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
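Because `MAPPED_LANGUAGE_CHOICES` and `LANGUAGE_CHOICES` are described as corresponding lists, they need to have the same length and order. A minimal sketch of how the two could be paired into a lookup follows. The function name `build_language_map` and the example values are assumptions for illustration, not the app's actual implementation.

```python
import ast


def build_language_map(full_names: str, short_codes: str) -> dict[str, str]:
    """Pair full language names with short codes, position by position.

    Hypothetical helper; both arguments are string representations of Python lists.
    """
    names = ast.literal_eval(full_names)
    codes = ast.literal_eval(short_codes)
    if len(names) != len(codes):
        raise ValueError("MAPPED_LANGUAGE_CHOICES and LANGUAGE_CHOICES must have the same length")
    return dict(zip(names, codes))


# Illustrative values only; see config.py for the real defaults
language_map = build_language_map("['english', 'french']", "['en', 'fr']")
print(language_map["french"])
```

Checking the lengths up front surfaces a mismatched configuration early, rather than silently mapping a display name to the wrong backend code.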
 ## App Run Options
 General runtime configurations for the application.
 *   **`TLDEXTRACT_CACHE`**
+    *   **Description:** Path to the cache directory used by the `tldextract` library, which helps in accurately extracting top-level domains (TLDs) from URLs.
+    *   **Default Value:** `'tmp/tld/'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 *   **`COGNITO_AUTH`**
 Configurations for features related to processing whole documents via APIs, particularly AWS Textract for large documents.
 *   **`SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS`**
+    *   **Description:** Controls whether UI options for whole document Textract calls are displayed.
     *   **Default Value:** `'False'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
 *   **`TEXTRACT_JOBS_S3_LOC`**
+    *   **Description:** The S3 subfolder (within the main redaction bucket) where Textract job data (output) is stored.
     *   **Default Value:** `'output'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
 *   **`TEXTRACT_JOBS_S3_INPUT_LOC`**
+    *   **Description:** The S3 subfolder (within the main redaction bucket) where Textract job input is stored.
     *   **Default Value:** `'input'`
     *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
 *   **`DAYS_TO_DISPLAY_WHOLE_DOCUMENT_JOBS`**
     *   **Description:** Specifies the number of past days for which to display whole document Textract jobs in the UI.
     *   **Default Value:** `'7'`
+    *   **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.

src/user_guide.qmd CHANGED Viewed

@@ -20,7 +20,7 @@ format:
     - [Redacting only specific pages](#redacting-only-specific-pages)
     - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
-- [Redacting tabular data files (CSV/XLSX) or copy and pasted text](#redacting-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
 See the [advanced user guide here](#advanced-user-guide):
 - [Merging redaction review files](#merging-redaction-review-files)
@@ -210,9 +210,11 @@ On the 'Review redactions' tab you have a visual interface that allows you to in
 ### Uploading documents for review
-The top area has a file upload area where you can upload original, unredacted PDFs, alongside the '..._review_file.csv' that is produced by the redaction process. Once you have uploaded these two files, click the '**Review redactions based on original PDF...**' button to load in the files for review. This will allow you to visualise and modify the suggested redactions using the interface below.
-Optionally, you can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the extracted text from the document.
 ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
@@ -300,6 +302,77 @@ Once you have filtered the table, or selected a row from the table, you have a f
 If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (can only undo the last action).
 ### Navigating through the document using the 'Search all extracted text'
 The 'search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
@@ -312,11 +385,11 @@ You can search through the extracted text by using the search bar just above the
 ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
-## Redacting tabular data files (XLSX/CSV) or copy and pasted text
-### Tabular data files (XLSX/CSV)
-The app can be used to redact tabular data files such as xlsx or csv files. For this to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1), and no other information in the sheet. Similarly for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
 To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.

     - [Redacting only specific pages](#redacting-only-specific-pages)
     - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
+- [Redacting Word, tabular data files (CSV/XLSX) or copy and pasted text](#redacting-word-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
 See the [advanced user guide here](#advanced-user-guide):
 - [Merging redaction review files](#merging-redaction-review-files)
 ### Uploading documents for review
+The top area has a file upload area where you can upload files for review. In the left box, upload the original PDF file. Click '1. Upload original PDF'. In the right box, you can upload the '..._review_file.csv' that is produced by the redaction process.
+Optionally, you can upload a '..._ocr_result_with_words' file here, which will allow you to search through the text and easily [add new redactions based on word search](#searching-and-adding-custom-redactions). You can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the extracted text from the document. Click the button '2. Upload Review or OCR csv files' to load in these files.
+Now you can review and modify the suggested redactions using the interface described below.
 ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
 If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (can only undo the last action).
+### Searching and Adding Custom Redactions
+After a document has been processed, you may need to redact specific terms, names, or phrases that the automatic PII (Personally Identifiable Information) detection might have missed. The **"Search text to make new redactions"** tab gives you the power to find and redact any text within your document manually.
+#### How to Use the Search and Redact Feature
+The workflow is designed to be simple: **Search → Select → Redact**.
+---
+#### **Step 1: Search for Text**
+1.  Navigate to the **"Search text to make new redactions"** tab.
+2.  The main table will initially be populated with all the text extracted from the document, broken down by word.
+3.  To narrow this down, use the **"Multi-word text search"** box to type the word or phrase you want to find.
+4.  Click the **"Search"** button or press Enter.
+5.  The table below will update to show only the rows containing text that matches your search query.
+> **Tip:** You can also filter the results by page number using the **"Page"** dropdown. To clear all filters and see the full text again, click the **"Reset table to original state"** button.
+---
+#### **Step 2: Select and Review a Match**
+When you click on any row in the search results table:
+*   The document preview on the left will automatically jump to that page, allowing you to see the word in its original context.
+*   The details of your selection will appear in the smaller **"Selected row"** table for confirmation.
+---
+#### **Step 3: Choose Your Redaction Method**
+You have several powerful options for redacting the text you've found:
+*   **Redact a Single, Specific Instance:**
+    *   Click on the exact row in the table you want to redact.
+    *   Click the **`Redact specific text row`** button.
+    *   Only that single instance will be redacted.
+*   **Redact All Instances of a Word/Phrase:**
+    *   Let's say you want to redact the project name "Project Alpha" everywhere it appears.
+    *   Find and select one instance of "Project Alpha" in the table.
+    *   Click the **`Redact all words with same text as selected row`** button.
+    *   The application will find and redact every single occurrence of "Project Alpha" throughout the entire document.
+*   **Redact All Current Search Results:**
+    *   Perform a search (e.g., for a specific person's name).
+    *   If you are confident that every result shown in the filtered table should be redacted, click the **`Redact all text in table`** button.
+    *   This will apply a redaction to all currently visible items in the table in one go.
+---
+#### **Customising Your New Redactions**
+Before you click one of the redact buttons, you can customize the appearance and label of the new redactions under the **"Search options"** accordion:
+*   **Label for new redactions:** Change the text that appears on the redaction box (default is "Redaction"). You could change this to "CONFIDENTIAL" or "CUSTOM".
+*   **Colour for labels:** Set a custom color for the redaction box by providing an RGB value. The format must be three numbers (0-255) in parentheses, for example:
+    *   ` (255, 0, 0) ` for Red
+    *   ` (0, 0, 0) ` for Black
+    *   ` (255, 255, 0) ` for Yellow
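The colour format described above (three numbers from 0 to 255 in parentheses) can be checked before use. The following is a sketch of such a check; `parse_rgb` is a hypothetical helper and is not taken from the app's code.

```python
import re


def parse_rgb(text: str) -> tuple[int, int, int]:
    """Parse a string such as '(255, 0, 0)' into an (r, g, b) tuple.

    Hypothetical validator for illustration; surrounding whitespace is tolerated.
    """
    match = re.fullmatch(r"\s*\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)\s*", text)
    if match is None:
        raise ValueError(f"Expected a value like (255, 0, 0), got {text!r}")
    r, g, b = (int(part) for part in match.groups())
    if max(r, g, b) > 255:
        raise ValueError("Each colour component must be between 0 and 255")
    return (r, g, b)


print(parse_rgb(" (255, 0, 0) "))
```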
+#### **Undoing a Mistake**
+If you make a mistake, you can reverse the last redaction action you performed on this tab.
+*   Click the **`Undo latest redaction`** button. This will revert the last set of redactions you added (whether it was a single row, all of a certain text, or all search results).
+> **Important:** This undo button only works for the *most recent* action. It maintains a single backup state, so it cannot undo actions that are two or more steps in the past.
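The single-backup behaviour described above can be pictured with a small sketch. The class `SingleUndoState` is an assumption made for illustration and is not the app's actual data structure; it only shows why one stored backup allows exactly one step of undo.

```python
import copy


class SingleUndoState:
    """Keeps the current redactions plus one backup of the previous state."""

    def __init__(self, redactions: list[str]):
        self.redactions = list(redactions)
        self._backup = None  # only one earlier state is ever stored

    def add(self, new_redactions: list[str]) -> None:
        # Overwrite the backup with the state just before this action
        self._backup = copy.deepcopy(self.redactions)
        self.redactions = self.redactions + list(new_redactions)

    def undo(self) -> bool:
        """Restore the backup if there is one; return whether anything changed."""
        if self._backup is None:
            return False
        self.redactions = self._backup
        self._backup = None
        return True


state = SingleUndoState([])
state.add(["Project Alpha"])
state.add(["Jane Doe"])
first_undo = state.undo()   # reverts only the most recent action
second_undo = state.undo()  # no older backup exists, so nothing happens
print(state.redactions, first_undo, second_undo)
```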
 ### Navigating through the document using the 'Search all extracted text'
 The 'search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
 ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
+## Redacting Word, tabular data files (XLSX/CSV) or copy and pasted text
+### Word or tabular data files (XLSX/CSV)
+The app can be used to redact Word (.docx) files, or tabular data files such as xlsx or csv files. For tabular files to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1), and no other information in the sheet. Similarly for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
 To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.

tools/config.py CHANGED Viewed

@@ -267,7 +267,7 @@ if NO_REDACTION_PII_OPTION in TABULAR_PII_DETECTION_MODELS:
     TABULAR_PII_DETECTION_MODELS.remove(NO_REDACTION_PII_OPTION)
 ### Local OCR model - Tesseract vs PaddleOCR
-CHOSEN_LOCAL_OCR_MODEL = get_or_create_env_var('CHOSEN_LOCAL_OCR_MODEL', "tesseract") # Choose between "tesseract", "hybrid", and "paddle"
 PREPROCESS_LOCAL_OCR_IMAGES = get_or_create_env_var('PREPROCESS_LOCAL_OCR_IMAGES', "False") # Whether to try and preprocess images before extracting text. NOTE: I have found in testing that this often results in WORSE results for scanned pages, so it is default False

     TABULAR_PII_DETECTION_MODELS.remove(NO_REDACTION_PII_OPTION)
 ### Local OCR model - Tesseract vs PaddleOCR
+CHOSEN_LOCAL_OCR_MODEL = get_or_create_env_var('CHOSEN_LOCAL_OCR_MODEL', "tesseract") # Choose between "tesseract", "hybrid", and "paddle". "paddle" will only return whole line text extraction, and so will only work for OCR, not redaction. "hybrid" is a combination of the two - first pass through the redactions will be done with Tesseract, and then a second pass will be done with PaddleOCR on words with low confidence.
 PREPROCESS_LOCAL_OCR_IMAGES = get_or_create_env_var('PREPROCESS_LOCAL_OCR_IMAGES', "False") # Whether to try and preprocess images before extracting text. NOTE: I have found in testing that this often results in WORSE results for scanned pages, so it is default False

tools/data_anonymise.py CHANGED Viewed

@@ -758,9 +758,6 @@ def anonymise_script(df:pd.DataFrame,
     batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
     analyzer_results = list()
-    # Use provided comprehend language or fall back to main language
-    language = language
     if pii_identification_method == "Local":
         # Use custom analyzer to be able to track progress with Gradio

     batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
     analyzer_results = list()
     if pii_identification_method == "Local":
         # Use custom analyzer to be able to track progress with Gradio

tools/file_conversion.py CHANGED Viewed

@@ -834,7 +834,10 @@ def prepare_image_or_pdf(
         out_message.append(out_time)
         combined_out_message = '\n'.join(out_message)
-    number_of_pages = len(page_sizes)
     print("Finished loading in files")

         out_message.append(out_time)
         combined_out_message = '\n'.join(out_message)
+    if not page_sizes:
+        number_of_pages = 1
+    else:
+        number_of_pages = len(page_sizes)
     print("Finished loading in files")

tools/find_duplicate_pages.py CHANGED Viewed

@@ -1209,7 +1209,7 @@ def create_annotation_objects_from_duplicates(
     if duplicates_df.empty:
         raise Warning("No duplicates found")
     if ocr_results_df.empty:
-        raise Warning("No OCR results found for file under review. Please upload relevant OCR_output file for the PDF file on the review tab.")
     if combine_pages == False:
         page_to_image_map = {item['page']: item['image_path'] for item in page_sizes}

     if duplicates_df.empty:
         raise Warning("No duplicates found")
     if ocr_results_df.empty:
+        raise Warning("No OCR results found for file under review. Please upload relevant OCR_output file and original PDF document on the review tab.")
     if combine_pages == False:
         page_to_image_map = {item['page']: item['image_path'] for item in page_sizes}

tools/load_spacy_model_custom_recognisers.py CHANGED Viewed

@@ -506,9 +506,9 @@ def create_nlp_analyser(language: str = DEFAULT_LANGUAGE, custom_list: List[str]
     return nlp_analyser
 # Create the default nlp_analyser using the new function
-nlp_analyser, nlp_model = create_nlp_analyser(DEFAULT_LANGUAGE, return_also_model=True)
-def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mistakes_max:int = 1, search_whole_phrase:bool=True, nlp=nlp_model, progress=gr.Progress(track_tqdm=True)):
     ''' Conduct fuzzy match on a list of text data.'''
     all_matches = []
@@ -546,7 +546,6 @@ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mista
         else:
             # If matching a whole phrase, use Spacy PhraseMatcher, then consider similarity after using Levenshtein distance.
-            #tokenised_query = [string_query.lower()]
             # If you want to match the whole phrase, use phrase matcher
             matcher = FuzzyMatcher(nlp.vocab)
             patterns = [nlp.make_doc(string_query)]  # Convert query into a Doc object
@@ -567,9 +566,7 @@ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mista
                 for match_id, start, end in matches:
                     span = str(doc[start:end]).strip()
                     query_search = str(query).strip()
-                    #print("doc:", doc)
-                    #print("span:", span)
-                    #print("query_search:", query_search)
                     # Convert word positions to character positions
                     start_char = doc[start].idx  # Start character position
@@ -584,9 +581,6 @@ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mista
                 for match_id, start, end, ratio, pattern in matches:
                     span = str(doc[start:end]).strip()
                     query_search = str(query).strip()
-                    #print("doc:", doc)
-                    #print("span:", span)
-                    #print("query_search:", query_search)
                     # Calculate Levenshtein distance. Only keep matches with less than specified number of spelling mistakes
                     distance = Levenshtein.distance(query_search.lower(), span.lower())
@@ -600,9 +594,6 @@ def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mista
                         start_char = doc[start].idx  # Start character position
                         end_char = doc[end - 1].idx + len(doc[end - 1])  # End character position
-                        #print("start_char:", start_char)
-                        #print("end_char:", end_char)
                         all_matches.append(match_count)
                         all_start_positions.append(start_char)
                         all_end_positions.append(end_char)

     return nlp_analyser
 # Create the default nlp_analyser using the new function
+nlp_analyser, nlp = create_nlp_analyser(DEFAULT_LANGUAGE, return_also_model=True)
+def spacy_fuzzy_search(text: str, custom_query_list:List[str]=[], spelling_mistakes_max:int = 1, search_whole_phrase:bool=True, nlp=nlp, progress=gr.Progress(track_tqdm=True)):
     ''' Conduct fuzzy match on a list of text data.'''
     all_matches = []
         else:
             # If matching a whole phrase, use Spacy PhraseMatcher, then consider similarity after using Levenshtein distance.
             # If you want to match the whole phrase, use phrase matcher
             matcher = FuzzyMatcher(nlp.vocab)
             patterns = [nlp.make_doc(string_query)]  # Convert query into a Doc object
                 for match_id, start, end in matches:
                     span = str(doc[start:end]).strip()
                     query_search = str(query).strip()
                     # Convert word positions to character positions
                     start_char = doc[start].idx  # Start character position
                 for match_id, start, end, ratio, pattern in matches:
                     span = str(doc[start:end]).strip()
                     query_search = str(query).strip()
                     # Calculate Levenshtein distance. Only keep matches with less than specified number of spelling mistakes
                     distance = Levenshtein.distance(query_search.lower(), span.lower())
                         start_char = doc[start].idx  # Start character position
                         end_char = doc[end - 1].idx + len(doc[end - 1])  # End character position
                         all_matches.append(match_count)
                         all_start_positions.append(start_char)
                         all_end_positions.append(end_char)
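
Reviewer note: the whole-phrase branch above keeps a fuzzy match only if its edit distance to the query is within `spelling_mistakes_max`, then converts the token span to character offsets. The sketch below illustrates those two steps without spaCy or the `Levenshtein` package. It is a minimal illustration under stated assumptions, not the code in this commit: `levenshtein`, `token_offsets` and `fuzzy_phrase_spans` are hypothetical helpers, tokens are split on whitespace rather than by spaCy, and the cut-off is assumed to be `<=` because the comparison itself is not shown in the hunk.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def token_offsets(text: str) -> List[Tuple[int, str]]:
    """Whitespace tokens with their start index (a stand-in for spaCy's token.idx)."""
    offsets = []
    position = 0
    for token in text.split():
        position = text.index(token, position)
        offsets.append((position, token))
        position += len(token)
    return offsets


def fuzzy_phrase_spans(text: str, query: str,
                       spelling_mistakes_max: int = 1) -> List[Tuple[int, int]]:
    """Return (start_char, end_char) for token windows close enough to the query."""
    tokens = token_offsets(text)
    width = len(query.split())
    if width == 0:
        return []
    query_search = query.strip().lower()
    spans = []
    for start in range(len(tokens) - width + 1):
        end = start + width  # exclusive token index, as in doc[start:end]
        start_char = tokens[start][0]
        # Same idea as doc[end - 1].idx + len(doc[end - 1]) in the diff
        end_char = tokens[end - 1][0] + len(tokens[end - 1][1])
        span = text[start_char:end_char].strip().lower()
        # Assumption: matches with distance equal to the maximum are kept
        if levenshtein(query_search, span) <= spelling_mistakes_max:
            spans.append((start_char, end_char))
    return spans
```

For example, `fuzzy_phrase_spans("Please contact Jon Smith or John Smyth today", "John Smith")` picks up both "Jon Smith" and "John Smyth", each one edit away from the query.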

tools/redaction_review.py CHANGED

@@ -797,6 +797,10 @@ def update_annotator_object_and_filter_df(
     page_num_reported_zero_indexed = page_num_reported - 1
     annotate_previous_page = page_num_reported # Store the determined page number
     # --- Process page sizes DataFrame ---
     page_sizes_df = pd.DataFrame(page_sizes)
     if not page_sizes_df.empty:
@@ -916,7 +920,10 @@ def update_annotator_object_and_filter_df(
     # --- Final Output Components ---
-    page_number_reported_gradio_comp = gr.Number(label = "Current page", value=page_num_reported, precision=0)
     ### Present image_annotator outputs
     # Handle the case where current_page_image_annotator_object couldn't be prepared

     page_num_reported_zero_indexed = page_num_reported - 1
     annotate_previous_page = page_num_reported # Store the determined page number
+    if not page_sizes:
+        page_num_reported = 0
+        annotate_previous_page = 0
     # --- Process page sizes DataFrame ---
     page_sizes_df = pd.DataFrame(page_sizes)
     if not page_sizes_df.empty:
     # --- Final Output Components ---
+    if page_sizes:
+        page_number_reported_gradio_comp = gr.Number(label = "Current page", value=page_num_reported, precision=0, maximum=len(page_sizes), minimum=1)
+    else:
+        page_number_reported_gradio_comp = gr.Number(label = "Current page", value=0, precision=0, maximum=9999, minimum=0)
     ### Present image_annotator outputs
     # Handle the case where current_page_image_annotator_object couldn't be prepared
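
Reviewer note: the change above gives the "Current page" input a fallback state when no document is loaded, and otherwise bounds it by the number of pages. The same decision can be sketched as a small pure function, which is easier to test than the Gradio component itself. `page_number_bounds` is a hypothetical helper, not part of this commit. Like the diff, it only sets the limits and does not clamp `page_num_reported` into range.

```python
from typing import Dict, List


def page_number_bounds(page_num_reported: int, page_sizes: List[dict]) -> Dict[str, int]:
    """Pick value/minimum/maximum for the page number input.

    With no pages loaded, fall back to 0 with a wide range (mirrors the
    else-branch in the diff). Otherwise pages run from 1 to len(page_sizes).
    """
    if not page_sizes:
        return {"value": 0, "minimum": 0, "maximum": 9999}
    return {"value": page_num_reported, "minimum": 1, "maximum": len(page_sizes)}
```

The returned dictionary carries the same `value`, `minimum` and `maximum` arguments that the diff passes to `gr.Number`.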