Raymond Weitekamp committed
Commit d4167d9 · 0 Parent(s)

Initial commit without binary files
Files changed (9)
  1. .gitattributes +1 -0
  2. .gitignore +28 -0
  3. README.md +34 -0
  4. app.py +118 -0
  5. requirements.txt +5 -0
  6. run_local.sh +35 -0
  7. test_app.py +56 -0
  8. test_e2e.py +73 -0
  9. test_local.sh +69 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+ *.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,28 @@
+ # Virtual Environment
+ venv/
+ env/
+ .env/
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ .Python
+ *.so
+ .pytest_cache/
+
+ # IDE
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+
+ # Gradio
+ flagged/
+ gradio_cached_examples/
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ test-image.png
README.md ADDED
@@ -0,0 +1,34 @@
+ ---
+ title: Handwriting OCR Dataset Collection
+ emoji: ✍️
+ colorFrom: blue
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 5.15.0
+ app_file: app.py
+ pinned: false
+ short_description: Collect handwritten text samples for OCR training
+ tags:
+ - ocr
+ - handwriting
+ - dataset
+ - computer-vision
+ ---
+
+ # Handwriting OCR Dataset Collection
+
+ This Space provides an interface for collecting handwritten samples of text to build a dataset for OCR (Optical Character Recognition) training. Users are shown text snippets, which they handwrite and upload as images.
+
+ ## How it Works
+
+ 1. You will be shown 1-5 consecutive sentences about OCR and handwriting recognition
+ 2. Write these sentences by hand on paper
+ 3. Take a photo or scan of your handwriting
+ 4. Upload the image through the interface
+ 5. Submit, or skip to get a new text block
+
+ The collected pairs (text and the corresponding handwritten image) will be used to train and improve handwriting recognition models.
+
+ ## Usage
+
+ Simply visit the Space and follow the on-screen instructions to contribute your handwriting samples to the dataset.
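The README describes collecting (text, image) pairs but this commit never persists them beyond process memory. A minimal sketch of backend storage, assuming a PNG file per sample plus an append-only `metadata.jsonl` index (the `save_pair` helper and the on-disk layout are illustrative assumptions, not part of this commit):

```python
import json
import os
import tempfile
from datetime import datetime

def save_pair(text, image_bytes, out_dir):
    """Write one handwriting sample: a PNG file plus an append-only metadata line."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    image_name = f"{stamp}.png"
    with open(os.path.join(out_dir, image_name), "wb") as f:
        f.write(image_bytes)
    record = {"text": text, "image": image_name, "timestamp": stamp}
    with open(os.path.join(out_dir, "metadata.jsonl"), "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Example: store one fake submission in a throwaway directory.
out_dir = tempfile.mkdtemp()
record = save_pair("Optical character recognition (OCR) is ...", b"not-really-a-png", out_dir)
```

The JSONL-plus-images layout mirrors the `imagefolder` convention many dataset loaders accept, so collected samples could later be published without restructuring.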
app.py ADDED
@@ -0,0 +1,117 @@
+ import gradio as gr
+ import random
+ from datetime import datetime
+
+ # The sentences contributors are asked to handwrite.
+ sentences = [
+     "Optical character recognition (OCR) is the process of converting images of text into machine-readable data.",
+     "When applied to handwriting, OCR faces additional challenges because of the natural variability in individual penmanship.",
+     "Over the last century, advances in computer vision and machine learning have transformed handwriting OCR from bulky, specialized hardware into highly accurate, software-driven systems.",
+     "The origins of OCR date back to the early 20th century.",
+     "Early pioneers explored how machines might read text.",
+     "In the 1920s, inventors such as Emanuel Goldberg developed early devices that could capture printed characters by converting them into telegraph codes.",
+     "Around the same time, Gustav Tauschek created the Reading Machine using template-matching methods to detect letters in images.",
+     "These devices were designed for printed text and depended on fixed, machine-friendly fonts rather than natural handwriting.",
+     "In the 1950s, systems like David Shepard's GISMO emerged to begin automating the conversion of paper records into digital form.",
+     "Although these early OCR systems were limited in scope and accuracy, they laid the groundwork for later innovations.",
+     "The 1960s saw OCR technology being applied to real-world tasks.",
+     "In 1965, American inventor Jacob Rabinow developed an OCR machine specifically aimed at sorting mail by reading addresses.",
+     "This was a critical step for the U.S. Postal Service.",
+     "Soon after, research groups, including those at IBM, began developing machines such as the IBM 1287, which was capable of reading handprinted numbers on envelopes to facilitate automated mail processing.",
+     "These systems marked the first attempts to apply computer vision to handwritten data on a large scale.",
+     "By the late 1980s and early 1990s, researchers such as Yann LeCun and his colleagues developed neural network architectures to recognize handwritten digits.",
+     "Their work, initially applied to reading ZIP codes on mail, demonstrated that carefully designed, constrained neural networks could achieve error rates as low as about 1% on USPS data.",
+     "Sargur Srihari and his team at the Center of Excellence for Document Analysis and Recognition extended these ideas to develop complete handwritten address interpretation systems.",
+     "These systems, deployed by the USPS and postal agencies worldwide, helped automate the routing of mail and revolutionized the sorting process.",
+     "The development and evaluation of handwriting OCR have been driven in part by standard benchmark datasets.",
+     "The MNIST dataset, introduced in the 1990s, consists of 70,000 images of handwritten digits and became the de facto benchmark for handwritten digit recognition.",
+     "Complementing MNIST is the USPS dataset, which provides images of handwritten digits derived from actual envelopes and captures real-world variability.",
+     "Handwriting OCR entered a new era with the introduction of neural network models.",
+     "In 1989, LeCun et al. applied backpropagation to a convolutional neural network tailored for handwritten digit recognition, an innovation that evolved into the LeNet series.",
+     "By automatically learning features rather than relying on hand-designed templates, these networks drastically improved recognition performance.",
+     "As computational power increased and large labeled datasets became available, deep learning models, particularly convolutional neural networks and recurrent neural networks, pushed the accuracy of handwriting OCR to near-human levels.",
+     "Modern systems can handle both printed and cursive text, automatically segmenting and recognizing characters in complex handwritten documents.",
+     "Cursive handwriting presents a classic challenge known as Sayre's paradox, where word recognition requires letter segmentation and letter segmentation requires word recognition.",
+     "Contemporary approaches use implicit segmentation methods, often combined with hidden Markov models or end-to-end neural networks, to circumvent this paradox.",
+     "Today's handwriting OCR systems are highly accurate and widely deployed.",
+     "Modern systems combine OCR with artificial intelligence to not only recognize text but also extract meaning, verify data, and integrate into larger enterprise workflows.",
+     "Projects such as In Codice Ratio use deep convolutional networks to transcribe historical handwritten documents, further expanding OCR applications.",
+     "Despite impressive advances, handwriting OCR continues to face challenges with highly variable or degraded handwriting.",
+     "Ongoing research aims to improve recognition accuracy, particularly for cursive and unconstrained handwriting, and to extend support across languages and historical scripts.",
+     "With improvements in deep learning architectures, increased computing power, and large annotated datasets, future OCR systems are expected to become even more robust, handling real-world handwriting in diverse applications from postal services to archival digitization.",
+     "Today's research in handwriting OCR benefits from a wide array of well-established datasets and ongoing evaluation challenges.",
+     "These resources help drive the development of increasingly robust systems for both digit and full-text recognition.",
+     "For handwritten digit recognition, the MNIST dataset remains the most widely used benchmark thanks to its simplicity and broad adoption.",
+     "Complementing MNIST is the USPS dataset, which is derived from actual mail envelopes and provides additional challenges with real-world variability.",
+     "The IAM Handwriting Database is one of the most popular datasets for unconstrained offline handwriting recognition and includes scanned pages of handwritten English text with corresponding transcriptions.",
+     "It is frequently used to train and evaluate models that work on full-line or full-page recognition tasks.",
+     "For systems designed to capture the dynamic aspects of handwriting, such as pen stroke trajectories, the IAM On-Line Handwriting Database offers valuable data.",
+     "The CVL dataset provides multi-writer handwritten texts with a range of writing styles, making it useful for assessing the generalization capabilities of OCR systems across diverse handwriting samples.",
+     "The RIMES dataset, developed for French handwriting recognition, contains scanned documents and is a key resource for evaluating systems in multilingual settings.",
+     "Various ICDAR competitions, such as ICDAR 2013 and ICDAR 2017, have released datasets that reflect the complexities of real-world handwriting, including historical documents and unconstrained writing.",
+     "For Arabic handwriting recognition, the KHATT dataset offers a collection of handwritten texts that capture the unique challenges of cursive and context-dependent scripts.",
+     "These datasets, along with continual evaluation efforts through competitions hosted at ICDAR and ICFHR, ensure that the field keeps pushing toward higher accuracy, better robustness, and broader language coverage.",
+     "Emerging benchmarks, often tailored to specific scripts, historical documents, or noisy real-world data, will further refine the state-of-the-art in handwriting OCR.",
+     "This array of resources continues to shape the development of handwriting OCR systems today.",
+     "This additional section outlines today's most influential datasets and benchmarks, highlighting how they continue to shape the development of handwriting OCR systems."
+ ]
+
+ class OCRDataCollector:
+     def __init__(self):
+         self.collected_pairs = []
+         self.current_text_block = self.get_random_text_block()
+
+     def get_random_text_block(self):
+         block_length = random.randint(1, 5)
+         start_index = random.randint(0, len(sentences) - block_length)
+         block = " ".join(sentences[start_index:start_index + block_length])
+         return block
+
+     def submit_image(self, image, text_block):
+         if image is None:
+             message = "No image uploaded. Please try again or use 'Skip' to move on."
+         else:
+             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+             self.collected_pairs.append({"text": text_block, "image": image, "timestamp": timestamp})
+             message = "Thank you! Your submission has been saved."
+         new_text = self.get_random_text_block()
+         return new_text, message
+
+     def skip_text(self, text_block):
+         new_text = self.get_random_text_block()
+         message = "Skipped. Here is the next text."
+         return new_text, message
+
+ def create_gradio_interface():
+     collector = OCRDataCollector()
+
+     with gr.Blocks() as demo:
+         gr.Markdown("## Crowdsourcing Handwriting OCR Dataset")
+         gr.Markdown("You will be shown between 1 and 5 consecutive sentences. Please handwrite them on paper and upload an image of your handwriting. If you wish to skip the current text, click 'Skip'.")
+
+         text_box = gr.Textbox(value=collector.current_text_block, label="Text to Handwrite", interactive=False)
+         image_input = gr.Image(type="pil", label="Upload Handwritten Image", sources=["upload"])
+         status = gr.Markdown()
+
+         with gr.Row():
+             submit_btn = gr.Button("Submit")
+             skip_btn = gr.Button("Skip")
+
+         # Both handlers return (new_text, message), so both components must be wired as outputs.
+         submit_btn.click(
+             fn=collector.submit_image,
+             inputs=[image_input, text_box],
+             outputs=[text_box, status]
+         )
+
+         skip_btn.click(
+             fn=collector.skip_text,
+             inputs=text_box,
+             outputs=[text_box, status]
+         )
+
+     return demo
+
+ if __name__ == "__main__":
+     demo = create_gradio_interface()
+     demo.launch()
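`get_random_text_block` drives the whole interface: it picks a block length of 1-5, then a start index that keeps the slice in range, and joins the consecutive sentences into one prompt. The same logic can be exercised in isolation; the ten-sentence corpus below is a stand-in for the module's list, and the `rng` parameter is an addition for testability:

```python
import random

# Stand-in corpus; the app uses its 50-sentence `sentences` list instead.
sentences = [f"Sentence {i}." for i in range(10)]

def get_random_text_block(rng=random):
    # Pick a block length of 1-5, then a start index that keeps
    # the slice in range, and join the consecutive sentences.
    block_length = rng.randint(1, 5)
    start_index = rng.randint(0, len(sentences) - block_length)
    return " ".join(sentences[start_index:start_index + block_length])

print(get_random_text_block())
```

Because `randint` is inclusive on both ends, every block contains between one and five whole sentences and never runs off the end of the list.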
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ gradio>=5.15.0
+ Pillow>=10.0.0
+ pytest>=7.0.0
+ pytest-playwright>=0.4.0
+ playwright>=1.40.0
run_local.sh ADDED
@@ -0,0 +1,35 @@
+ #!/bin/bash
+
+ # Exit on error
+ set -e
+
+ # Kill any existing processes using port 7862
+ echo "Cleaning up port 7862..."
+ lsof -ti:7862 | xargs kill -9 2>/dev/null || true
+
+ # Check if uv is installed, if not install it
+ if ! command -v uv &> /dev/null; then
+     echo "Installing uv package installer..."
+     curl -LsSf https://astral.sh/uv/install.sh | sh
+ fi
+
+ # Create virtual environment if it doesn't exist
+ if [ ! -d "venv" ]; then
+     echo "Creating virtual environment..."
+     python -m venv venv
+ fi
+
+ # Activate virtual environment
+ echo "Activating virtual environment..."
+ source venv/bin/activate
+
+ # Install dependencies using uv
+ echo "Installing dependencies with uv..."
+ uv pip install -r requirements.txt
+
+ # Start the Gradio app
+ echo "Starting Gradio app..."
+ python app.py
+
+ # Deactivate virtual environment when done
+ deactivate
test_app.py ADDED
@@ -0,0 +1,56 @@
+ import pytest
+ from PIL import Image
+ import numpy as np
+ from app import OCRDataCollector, sentences
+
+ @pytest.fixture
+ def collector():
+     return OCRDataCollector()
+
+ def test_get_random_text_block(collector):
+     # Test that we get a non-empty string
+     text_block = collector.get_random_text_block()
+     assert isinstance(text_block, str)
+     assert len(text_block) > 0
+
+     # Test that the text block contains content from our sentences
+     assert any(sentence in text_block for sentence in sentences)
+
+     # Test that we get different blocks (probabilistic, but very likely)
+     blocks = [collector.get_random_text_block() for _ in range(5)]
+     assert len(set(blocks)) > 1, "Random blocks should be different"
+
+ def test_skip_text(collector):
+     # Skipping should return a fresh text block and a status message
+     current_text = collector.get_random_text_block()
+     new_text, message = collector.skip_text(current_text)
+
+     assert isinstance(new_text, str)
+     assert len(new_text) > 0
+     assert isinstance(message, str)
+     assert new_text != current_text  # This is probabilistic but very likely
+
+ def test_submit_image(collector):
+     # Create a dummy test image using numpy array
+     img_array = np.zeros((100, 100, 3), dtype=np.uint8)
+     img_array.fill(255)  # White image
+
+     # Convert numpy array to PIL Image
+     test_image = Image.fromarray(img_array)
+
+     # Get the current text block
+     current_text = collector.get_random_text_block()
+
+     # Test submission with valid image (submit_image returns (new_text, message))
+     new_text, message = collector.submit_image(test_image, current_text)
+     assert isinstance(new_text, str)
+     assert len(new_text) > 0
+     assert len(collector.collected_pairs) == 1
+     assert collector.collected_pairs[0]["text"] == current_text
+
+     # Test submission with no image
+     new_text, message = collector.submit_image(None, current_text)
+     assert isinstance(new_text, str)
+     assert len(new_text) > 0
+     # Should not have added to collected_pairs
+     assert len(collector.collected_pairs) == 1
test_e2e.py ADDED
@@ -0,0 +1,73 @@
+ import pytest
+ import os
+ from playwright.sync_api import expect
+ from PIL import Image
+ import numpy as np
+ import tempfile
+
+ # Constants
+ GRADIO_PORT = 7862
+ GRADIO_URL = f"http://localhost:{GRADIO_PORT}"
+
+ @pytest.fixture(scope="module")
+ def test_image():
+     # Create a temporary test image
+     test_img = Image.fromarray(np.zeros((100, 100, 3), dtype=np.uint8))
+     temp_dir = tempfile.mkdtemp()
+     test_img_path = os.path.join(temp_dir, "test_image.png")
+     test_img.save(test_img_path)
+
+     yield test_img_path
+
+     # Cleanup
+     os.remove(test_img_path)
+     os.rmdir(temp_dir)
+
+ def test_page_loads(page):
+     page.goto(GRADIO_URL)
+     page.wait_for_load_state("networkidle")
+
+     # Check if title is present with exact text
+     expect(page.locator("h2", has_text="Crowdsourcing Handwriting OCR Dataset")).to_be_visible()
+
+     # Check if main interface elements are present
+     expect(page.get_by_label("Text to Handwrite")).to_be_visible()
+     expect(page.locator('input[type="file"]')).to_be_attached()
+     expect(page.get_by_role("button", name="Submit")).to_be_visible()
+     expect(page.get_by_role("button", name="Skip")).to_be_visible()
+
+ def test_skip_functionality(page):
+     page.goto(GRADIO_URL)
+     page.wait_for_load_state("networkidle")
+
+     # Get initial text
+     text_box = page.get_by_label("Text to Handwrite")
+     initial_text = text_box.input_value()
+
+     # Click skip button
+     page.get_by_role("button", name="Skip").click()
+     page.wait_for_timeout(2000)  # Wait for response
+
+     # Get new text and verify it changed
+     new_text = text_box.input_value()
+     assert initial_text != new_text
+
+ def test_upload_image(page, test_image):
+     page.goto(GRADIO_URL)
+     page.wait_for_load_state("networkidle")
+
+     # Get initial text
+     text_box = page.get_by_label("Text to Handwrite")
+     initial_text = text_box.input_value()
+
+     # Upload image - file input is hidden, but we can still set its value
+     page.locator('input[type="file"]').set_input_files(test_image)
+     page.wait_for_timeout(2000)  # Wait for upload
+
+     # Click submit to complete the upload
+     page.get_by_role("button", name="Submit").click()
+     page.wait_for_timeout(2000)  # Wait for response
+
+     # Verify text changed after submission
+     new_text = text_box.input_value()
+     assert initial_text != new_text
test_local.sh ADDED
@@ -0,0 +1,68 @@
+ #!/bin/bash
+
+ # Exit on error
+ set -e
+
+ # Kill any existing processes using port 7862
+ echo "Cleaning up port 7862..."
+ lsof -ti:7862 | xargs kill -9 2>/dev/null || true
+
+ # Check if uv is installed, if not install it
+ if ! command -v uv &> /dev/null; then
+     echo "Installing uv package installer..."
+     curl -LsSf https://astral.sh/uv/install.sh | sh
+ fi
+
+ # Create virtual environment if it doesn't exist
+ if [ ! -d "venv" ]; then
+     echo "Creating virtual environment..."
+     python -m venv venv
+ fi
+
+ # Activate virtual environment
+ echo "Activating virtual environment..."
+ source venv/bin/activate
+
+ # Install dependencies using uv
+ echo "Installing dependencies with uv..."
+ uv pip install -r requirements.txt
+
+ # Install Playwright browsers
+ echo "Installing Playwright browsers..."
+ playwright install chromium
+
+ # Run unit tests. Note: with `set -e`, a failing command exits the script before
+ # any `$?` check can run, so test the pytest invocations directly.
+ echo "Running unit tests..."
+ if python -m pytest test_app.py -v; then
+     echo "Unit tests passed! Starting Gradio app..."
+     # Start Gradio app in background
+     python app.py &
+     GRADIO_PID=$!
+
+     # Wait for server to start
+     echo "Waiting for Gradio server to start..."
+     sleep 3
+
+     # Run e2e tests (capture status without tripping `set -e`)
+     echo "Running e2e tests..."
+     E2E_STATUS=0
+     python -m pytest test_e2e.py -v || E2E_STATUS=$?
+
+     # Kill Gradio server
+     kill $GRADIO_PID
+
+     if [ $E2E_STATUS -eq 0 ]; then
+         echo "All tests passed! Starting Gradio app for development..."
+         python app.py
+     else
+         echo "E2E tests failed! Please fix the issues before running the app."
+         exit 1
+     fi
+ else
+     echo "Unit tests failed! Please fix the issues before running e2e tests."
+     exit 1
+ fi
+
+ # Deactivate virtual environment
+ deactivate
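One caveat with the script above: the fixed `sleep 3` can race a slow Gradio startup, making the e2e run flaky. A readiness poll on the port is more robust; the sketch below is an assumption about how one might do it (`wait_for_port` is not part of this repo):

```python
import socket
import time

def wait_for_port(port, host="localhost", timeout=30.0):
    """Poll until a TCP server accepts connections, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Connection succeeds as soon as the server is listening.
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.2)  # Not up yet; back off briefly and retry.
    raise TimeoutError(f"Server on port {port} did not start within {timeout}s")
```

Called from a small helper script (or via `python -c`) in place of `sleep 3`, this waits exactly as long as startup actually takes and fails loudly if the app never binds the port.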