Raymond Weitekamp committed
Commit d4167d9 · 0 Parent(s)

Initial commit without binary files
Files changed (9)
  1. .gitattributes +1 -0
  2. .gitignore +28 -0
  3. README.md +34 -0
  4. app.py +118 -0
  5. requirements.txt +5 -0
  6. run_local.sh +35 -0
  7. test_app.py +56 -0
  8. test_e2e.py +73 -0
  9. test_local.sh +69 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+ *.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,28 @@
+ # Virtual Environment
+ venv/
+ env/
+ .env/
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ .Python
+ *.so
+ .pytest_cache/
+
+ # IDE
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+
+ # Gradio
+ flagged/
+ gradio_cached_examples/
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ test-image.png
README.md ADDED
@@ -0,0 +1,34 @@
+ ---
+ title: Handwriting OCR Dataset Collection
+ emoji: ✍️
+ colorFrom: blue
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 5.15.0
+ app_file: app.py
+ pinned: false
+ short_description: Collect handwritten text samples for OCR training
+ tags:
+ - ocr
+ - handwriting
+ - dataset
+ - computer-vision
+ ---
+
+ # Handwriting OCR Dataset Collection
+
+ This Space provides an interface for collecting handwritten samples of text to build a dataset for OCR (Optical Character Recognition) training. Users are shown text snippets, which they handwrite and upload as images.
+
+ ## How it Works
+
+ 1. You will be shown 1-5 consecutive sentences about OCR and handwriting recognition
+ 2. Write these sentences by hand on paper
+ 3. Take a photo or scan of your handwriting
+ 4. Upload the image through the interface
+ 5. Submit, or skip to get a new text block
+
+ The collected pairs (text and the corresponding handwritten image) will be used to train and improve handwriting recognition models.
+
+ ## Usage
+
+ Simply visit the Space and follow the on-screen instructions to contribute your handwriting samples to the dataset.
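The README describes collecting (text, image) pairs but this commit never persists them beyond process memory. A minimal sketch of backend storage, assuming a PNG file per sample plus an append-only `metadata.jsonl` index (the `save_pair` helper and the on-disk layout are illustrative assumptions, not part of this commit):

```python
import json
import os
import tempfile
from datetime import datetime

def save_pair(text, image_bytes, out_dir):
    """Write one handwriting sample: a PNG file plus an append-only metadata line."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    image_name = f"{stamp}.png"
    with open(os.path.join(out_dir, image_name), "wb") as f:
        f.write(image_bytes)
    record = {"text": text, "image": image_name, "timestamp": stamp}
    with open(os.path.join(out_dir, "metadata.jsonl"), "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Example: store one fake submission in a throwaway directory.
out_dir = tempfile.mkdtemp()
record = save_pair("Optical character recognition (OCR) is ...", b"not-really-a-png", out_dir)
```

The JSONL-plus-images layout mirrors the `imagefolder` convention many dataset loaders accept, so collected samples could later be published without restructuring.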
app.py ADDED
@@ -0,0 +1,117 @@
+ import gradio as gr
+ import random
+ from datetime import datetime
+
+ # The sentences contributors are asked to handwrite.
+ sentences = [
+     "Optical character recognition (OCR) is the process of converting images of text into machine-readable data.",
+     "When applied to handwriting, OCR faces additional challenges because of the natural variability in individual penmanship.",
+     "Over the last century, advances in computer vision and machine learning have transformed handwriting OCR from bulky, specialized hardware into highly accurate, software-driven systems.",
+     "The origins of OCR date back to the early 20th century.",
+     "Early pioneers explored how machines might read text.",
+     "In the 1920s, inventors such as Emanuel Goldberg developed early devices that could capture printed characters by converting them into telegraph codes.",
+     "Around the same time, Gustav Tauschek created the Reading Machine using template-matching methods to detect letters in images.",
+     "These devices were designed for printed text and depended on fixed, machine-friendly fonts rather than natural handwriting.",
+     "In the 1950s, systems like David Shepard's GISMO emerged to begin automating the conversion of paper records into digital form.",
+     "Although these early OCR systems were limited in scope and accuracy, they laid the groundwork for later innovations.",
+     "The 1960s saw OCR technology being applied to real-world tasks.",
+     "In 1965, American inventor Jacob Rabinow developed an OCR machine specifically aimed at sorting mail by reading addresses.",
+     "This was a critical step for the U.S. Postal Service.",
+     "Soon after, research groups, including those at IBM, began developing machines such as the IBM 1287, which was capable of reading handprinted numbers on envelopes to facilitate automated mail processing.",
+     "These systems marked the first attempts to apply computer vision to handwritten data on a large scale.",
+     "By the late 1980s and early 1990s, researchers such as Yann LeCun and his colleagues developed neural network architectures to recognize handwritten digits.",
+     "Their work, initially applied to reading ZIP codes on mail, demonstrated that carefully designed, constrained neural networks could achieve error rates as low as about 1% on USPS data.",
+     "Sargur Srihari and his team at the Center of Excellence for Document Analysis and Recognition extended these ideas to develop complete handwritten address interpretation systems.",
+     "These systems, deployed by the USPS and postal agencies worldwide, helped automate the routing of mail and revolutionized the sorting process.",
+     "The development and evaluation of handwriting OCR have been driven in part by standard benchmark datasets.",
+     "The MNIST dataset, introduced in the 1990s, consists of 70,000 images of handwritten digits and became the de facto benchmark for handwritten digit recognition.",
+     "Complementing MNIST is the USPS dataset, which provides images of handwritten digits derived from actual envelopes and captures real-world variability.",
+     "Handwriting OCR entered a new era with the introduction of neural network models.",
+     "In 1989, LeCun et al. applied backpropagation to a convolutional neural network tailored for handwritten digit recognition, an innovation that evolved into the LeNet series.",
+     "By automatically learning features rather than relying on hand-designed templates, these networks drastically improved recognition performance.",
+     "As computational power increased and large labeled datasets became available, deep learning models, particularly convolutional neural networks and recurrent neural networks, pushed the accuracy of handwriting OCR to near-human levels.",
+     "Modern systems can handle both printed and cursive text, automatically segmenting and recognizing characters in complex handwritten documents.",
+     "Cursive handwriting presents a classic challenge known as Sayre's paradox, where word recognition requires letter segmentation and letter segmentation requires word recognition.",
+     "Contemporary approaches use implicit segmentation methods, often combined with hidden Markov models or end-to-end neural networks, to circumvent this paradox.",
+     "Today's handwriting OCR systems are highly accurate and widely deployed.",
+     "Modern systems combine OCR with artificial intelligence to not only recognize text but also extract meaning, verify data, and integrate into larger enterprise workflows.",
+     "Projects such as In Codice Ratio use deep convolutional networks to transcribe historical handwritten documents, further expanding OCR applications.",
+     "Despite impressive advances, handwriting OCR continues to face challenges with highly variable or degraded handwriting.",
+     "Ongoing research aims to improve recognition accuracy, particularly for cursive and unconstrained handwriting, and to extend support across languages and historical scripts.",
+     "With improvements in deep learning architectures, increased computing power, and large annotated datasets, future OCR systems are expected to become even more robust, handling real-world handwriting in diverse applications from postal services to archival digitization.",
+     "Today's research in handwriting OCR benefits from a wide array of well-established datasets and ongoing evaluation challenges.",
+     "These resources help drive the development of increasingly robust systems for both digit and full-text recognition.",
+     "For handwritten digit recognition, the MNIST dataset remains the most widely used benchmark thanks to its simplicity and broad adoption.",
+     "Complementing MNIST is the USPS dataset, which is derived from actual mail envelopes and provides additional challenges with real-world variability.",
+     "The IAM Handwriting Database is one of the most popular datasets for unconstrained offline handwriting recognition and includes scanned pages of handwritten English text with corresponding transcriptions.",
+     "It is frequently used to train and evaluate models that work on full-line or full-page recognition tasks.",
+     "For systems designed to capture the dynamic aspects of handwriting, such as pen stroke trajectories, the IAM On-Line Handwriting Database offers valuable data.",
+     "The CVL dataset provides multi-writer handwritten texts with a range of writing styles, making it useful for assessing the generalization capabilities of OCR systems across diverse handwriting samples.",
+     "The RIMES dataset, developed for French handwriting recognition, contains scanned documents and is a key resource for evaluating systems in multilingual settings.",
+     "Various ICDAR competitions, such as ICDAR 2013 and ICDAR 2017, have released datasets that reflect the complexities of real-world handwriting, including historical documents and unconstrained writing.",
+     "For Arabic handwriting recognition, the KHATT dataset offers a collection of handwritten texts that capture the unique challenges of cursive and context-dependent scripts.",
+     "These datasets, along with continual evaluation efforts through competitions hosted at ICDAR and ICFHR, ensure that the field keeps pushing toward higher accuracy, better robustness, and broader language coverage.",
+     "Emerging benchmarks, often tailored to specific scripts, historical documents, or noisy real-world data, will further refine the state-of-the-art in handwriting OCR.",
+     "This array of resources continues to shape the development of handwriting OCR systems today.",
+     "This additional section outlines today's most influential datasets and benchmarks, highlighting how they continue to shape the development of handwriting OCR systems."
+ ]
+
+ class OCRDataCollector:
+     def __init__(self):
+         self.collected_pairs = []
+         self.current_text_block = self.get_random_text_block()
+
+     def get_random_text_block(self):
+         block_length = random.randint(1, 5)
+         start_index = random.randint(0, len(sentences) - block_length)
+         block = " ".join(sentences[start_index:start_index + block_length])
+         return block
+
+     def submit_image(self, image, text_block):
+         if image is None:
+             message = "No image uploaded. Please try again or use 'Skip' to move on."
+         else:
+             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+             self.collected_pairs.append({"text": text_block, "image": image, "timestamp": timestamp})
+             message = "Thank you! Your submission has been saved."
+         new_text = self.get_random_text_block()
+         return new_text, message
+
+     def skip_text(self, text_block):
+         new_text = self.get_random_text_block()
+         message = "Skipped. Here is the next text."
+         return new_text, message
+
+ def create_gradio_interface():
+     collector = OCRDataCollector()
+
+     with gr.Blocks() as demo:
+         gr.Markdown("## Crowdsourcing Handwriting OCR Dataset")
+         gr.Markdown("You will be shown between 1 and 5 consecutive sentences. Please handwrite them on paper and upload an image of your handwriting. If you wish to skip the current text, click 'Skip'.")
+
+         text_box = gr.Textbox(value=collector.current_text_block, label="Text to Handwrite", interactive=False)
+         image_input = gr.Image(type="pil", label="Upload Handwritten Image", sources=["upload"])
+         status = gr.Markdown()
+
+         with gr.Row():
+             submit_btn = gr.Button("Submit")
+             skip_btn = gr.Button("Skip")
+
+         # Both handlers return (new_text, message), so both components must be wired as outputs.
+         submit_btn.click(
+             fn=collector.submit_image,
+             inputs=[image_input, text_box],
+             outputs=[text_box, status]
+         )
+
+         skip_btn.click(
+             fn=collector.skip_text,
+             inputs=text_box,
+             outputs=[text_box, status]
+         )
+
+     return demo
+
+ if __name__ == "__main__":
+     demo = create_gradio_interface()
+     demo.launch()
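`get_random_text_block` drives the whole interface: it picks a block length of 1-5, then a start index that keeps the slice in range, and joins the consecutive sentences into one prompt. The same logic can be exercised in isolation; the ten-sentence corpus below is a stand-in for the module's list, and the `rng` parameter is an addition for testability:

```python
import random

# Stand-in corpus; the app uses its 50-sentence `sentences` list instead.
sentences = [f"Sentence {i}." for i in range(10)]

def get_random_text_block(rng=random):
    # Pick a block length of 1-5, then a start index that keeps
    # the slice in range, and join the consecutive sentences.
    block_length = rng.randint(1, 5)
    start_index = rng.randint(0, len(sentences) - block_length)
    return " ".join(sentences[start_index:start_index + block_length])

print(get_random_text_block())
```

Because `randint` is inclusive on both ends, every block contains between one and five whole sentences and never runs off the end of the list.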
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ gradio>=5.15.0
+ Pillow>=10.0.0
+ pytest>=7.0.0
+ pytest-playwright>=0.4.0
+ playwright>=1.40.0
run_local.sh ADDED
@@ -0,0 +1,35 @@
+ #!/bin/bash
+
+ # Exit on error
+ set -e
+
+ # Kill any existing processes using port 7862
+ echo "Cleaning up port 7862..."
+ lsof -ti:7862 | xargs kill -9 2>/dev/null || true
+
+ # Check if uv is installed, if not install it
+ if ! command -v uv &> /dev/null; then
+     echo "Installing uv package installer..."
+     curl -LsSf https://astral.sh/uv/install.sh | sh
+ fi
+
+ # Create virtual environment if it doesn't exist
+ if [ ! -d "venv" ]; then
+     echo "Creating virtual environment..."
+     python -m venv venv
+ fi
+
+ # Activate virtual environment
+ echo "Activating virtual environment..."
+ source venv/bin/activate
+
+ # Install dependencies using uv
+ echo "Installing dependencies with uv..."
+ uv pip install -r requirements.txt
+
+ # Start the Gradio app
+ echo "Starting Gradio app..."
+ python app.py
+
+ # Deactivate virtual environment when done
+ deactivate
test_app.py ADDED
@@ -0,0 +1,56 @@
+ import pytest
+ from PIL import Image
+ import numpy as np
+ from app import OCRDataCollector, sentences
+
+ @pytest.fixture
+ def collector():
+     return OCRDataCollector()
+
+ def test_get_random_text_block(collector):
+     # Test that we get a non-empty string
+     text_block = collector.get_random_text_block()
+     assert isinstance(text_block, str)
+     assert len(text_block) > 0
+
+     # Test that the text block contains content from our sentences
+     assert any(sentence in text_block for sentence in sentences)
+
+     # Test that we get different blocks (probabilistic, but very likely)
+     blocks = [collector.get_random_text_block() for _ in range(5)]
+     assert len(set(blocks)) > 1, "Random blocks should be different"
+
+ def test_skip_text(collector):
+     # Skipping should return a fresh text block and a status message
+     current_text = collector.get_random_text_block()
+     new_text, message = collector.skip_text(current_text)
+
+     assert isinstance(new_text, str)
+     assert len(new_text) > 0
+     assert isinstance(message, str)
+     assert new_text != current_text  # This is probabilistic but very likely
+
+ def test_submit_image(collector):
+     # Create a dummy test image using numpy array
+     img_array = np.zeros((100, 100, 3), dtype=np.uint8)
+     img_array.fill(255)  # White image
+
+     # Convert numpy array to PIL Image
+     test_image = Image.fromarray(img_array)
+
+     # Get the current text block
+     current_text = collector.get_random_text_block()
+
+     # Test submission with valid image (submit_image returns (new_text, message))
+     new_text, message = collector.submit_image(test_image, current_text)
+     assert isinstance(new_text, str)
+     assert len(new_text) > 0
+     assert len(collector.collected_pairs) == 1
+     assert collector.collected_pairs[0]["text"] == current_text
+
+     # Test submission with no image
+     new_text, message = collector.submit_image(None, current_text)
+     assert isinstance(new_text, str)
+     assert len(new_text) > 0
+     # Should not have added to collected_pairs
+     assert len(collector.collected_pairs) == 1
test_e2e.py ADDED
@@ -0,0 +1,73 @@
+ import pytest
+ import os
+ from playwright.sync_api import expect
+ from PIL import Image
+ import numpy as np
+ import tempfile
+
+ # Constants
+ GRADIO_PORT = 7862
+ GRADIO_URL = f"http://localhost:{GRADIO_PORT}"
+
+ @pytest.fixture(scope="module")
+ def test_image():
+     # Create a temporary test image
+     test_img = Image.fromarray(np.zeros((100, 100, 3), dtype=np.uint8))
+     temp_dir = tempfile.mkdtemp()
+     test_img_path = os.path.join(temp_dir, "test_image.png")
+     test_img.save(test_img_path)
+
+     yield test_img_path
+
+     # Cleanup
+     os.remove(test_img_path)
+     os.rmdir(temp_dir)
+
+ def test_page_loads(page):
+     page.goto(GRADIO_URL)
+     page.wait_for_load_state("networkidle")
+
+     # Check if title is present with exact text
+     expect(page.locator("h2", has_text="Crowdsourcing Handwriting OCR Dataset")).to_be_visible()
+
+     # Check if main interface elements are present
+     expect(page.get_by_label("Text to Handwrite")).to_be_visible()
+     expect(page.locator('input[type="file"]')).to_be_attached()
+     expect(page.get_by_role("button", name="Submit")).to_be_visible()
+     expect(page.get_by_role("button", name="Skip")).to_be_visible()
+
+ def test_skip_functionality(page):
+     page.goto(GRADIO_URL)
+     page.wait_for_load_state("networkidle")
+
+     # Get initial text
+     text_box = page.get_by_label("Text to Handwrite")
+     initial_text = text_box.input_value()
+
+     # Click skip button
+     page.get_by_role("button", name="Skip").click()
+     page.wait_for_timeout(2000)  # Wait for response
+
+     # Get new text and verify it changed
+     new_text = text_box.input_value()
+     assert initial_text != new_text
+
+ def test_upload_image(page, test_image):
+     page.goto(GRADIO_URL)
+     page.wait_for_load_state("networkidle")
+
+     # Get initial text
+     text_box = page.get_by_label("Text to Handwrite")
+     initial_text = text_box.input_value()
+
+     # Upload image - file input is hidden, but we can still set its value
+     page.locator('input[type="file"]').set_input_files(test_image)
+     page.wait_for_timeout(2000)  # Wait for upload
+
+     # Click submit to complete the upload
+     page.get_by_role("button", name="Submit").click()
+     page.wait_for_timeout(2000)  # Wait for response
+
+     # Verify text changed after submission
+     new_text = text_box.input_value()
+     assert initial_text != new_text
test_local.sh ADDED
@@ -0,0 +1,68 @@
+ #!/bin/bash
+
+ # Exit on error
+ set -e
+
+ # Kill any existing processes using port 7862
+ echo "Cleaning up port 7862..."
+ lsof -ti:7862 | xargs kill -9 2>/dev/null || true
+
+ # Check if uv is installed, if not install it
+ if ! command -v uv &> /dev/null; then
+     echo "Installing uv package installer..."
+     curl -LsSf https://astral.sh/uv/install.sh | sh
+ fi
+
+ # Create virtual environment if it doesn't exist
+ if [ ! -d "venv" ]; then
+     echo "Creating virtual environment..."
+     python -m venv venv
+ fi
+
+ # Activate virtual environment
+ echo "Activating virtual environment..."
+ source venv/bin/activate
+
+ # Install dependencies using uv
+ echo "Installing dependencies with uv..."
+ uv pip install -r requirements.txt
+
+ # Install Playwright browsers
+ echo "Installing Playwright browsers..."
+ playwright install chromium
+
+ # Run unit tests. Note: with `set -e`, a failing command exits the script before
+ # any `$?` check can run, so test the pytest invocations directly.
+ echo "Running unit tests..."
+ if python -m pytest test_app.py -v; then
+     echo "Unit tests passed! Starting Gradio app..."
+     # Start Gradio app in background
+     python app.py &
+     GRADIO_PID=$!
+
+     # Wait for server to start
+     echo "Waiting for Gradio server to start..."
+     sleep 3
+
+     # Run e2e tests (capture status without tripping `set -e`)
+     echo "Running e2e tests..."
+     E2E_STATUS=0
+     python -m pytest test_e2e.py -v || E2E_STATUS=$?
+
+     # Kill Gradio server
+     kill $GRADIO_PID
+
+     if [ $E2E_STATUS -eq 0 ]; then
+         echo "All tests passed! Starting Gradio app for development..."
+         python app.py
+     else
+         echo "E2E tests failed! Please fix the issues before running the app."
+         exit 1
+     fi
+ else
+     echo "Unit tests failed! Please fix the issues before running e2e tests."
+     exit 1
+ fi
+
+ # Deactivate virtual environment
+ deactivate
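One caveat with the script above: the fixed `sleep 3` can race a slow Gradio startup, making the e2e run flaky. A readiness poll on the port is more robust; the sketch below is an assumption about how one might do it (`wait_for_port` is not part of this repo):

```python
import socket
import time

def wait_for_port(port, host="localhost", timeout=30.0):
    """Poll until a TCP server accepts connections, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Connection succeeds as soon as the server is listening.
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(0.2)  # Not up yet; back off briefly and retry.
    raise TimeoutError(f"Server on port {port} did not start within {timeout}s")
```

Called from a small helper script (or via `python -c`) in place of `sleep 3`, this waits exactly as long as startup actually takes and fails loudly if the app never binds the port.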