Text Detection

This page is a manual preparation guide for datasets not yet supported by [Dataset Preparer](./dataset_preparer.md). All of these scripts will eventually be migrated into it.

Overview

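| Dataset | Images | Annotation Files (training) | Annotation Files (validation) | Annotation Files (testing) |
| --- | --- | --- | --- | --- |
| ICDAR2011 | homepage | - | - | |
| ICDAR2017 | homepage | instances_training.json | instances_val.json | - |
| CurvedSynText150k | homepage, Part1, Part2 | instances_training.json | - | - |
| DeText | homepage | - | - | - |
| Lecture Video DB | homepage | - | - | - |
| LSVT | homepage | - | - | - |
| IMGUR | homepage | - | - | - |
| KAIST | homepage | - | - | - |
| MTWI | homepage | - | - | - |
| ReCTS | homepage | - | - | - |
| IIIT-ILST | homepage | - | - | - |
| VinText | homepage | - | - | - |
| BID | homepage | - | - | - |
| RCTW | homepage | - | - | - |
| HierText | homepage | - | - | - |
| ArT | homepage | - | - | - |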

Install AWS CLI (optional)

  • Some datasets require the AWS CLI to be installed in advance, so a quick installation guide is provided here:

      curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      unzip awscliv2.zip
      sudo ./aws/install
      # Or, to choose the install and bin directories explicitly, run this instead:
      # sudo ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
      aws configure
      # This command prompts for keys; you can leave every prompt empty except
      # for the Default region name
      # AWS Access Key ID [None]:
      # AWS Secret Access Key [None]:
      # Default region name [None]: us-east-1
      # Default output format [None]
    

For users in China, these datasets can also be downloaded at high speed from OpenDataLab.

Important Note

**For users who want to train models on the CTW1500, ICDAR 2015/2017, and Totaltext datasets:** some images may contain orientation info in their EXIF data. The default OpenCV backend used in MMCV reads this info and rotates the images accordingly. However, the gold annotations are made on the raw pixels, and this inconsistency produces incorrect examples in the training set. Users should therefore use `dict(type='LoadImageFromFile', color_type='color_ignore_orientation')` in pipelines to change MMCV's default loading behaviour (see [DBNet's pipeline config](https://github.com/open-mmlab/mmocr/blob/main/configs/_base_/det_pipelines/dbnet_pipeline.py) for an example).
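
As a minimal sketch of where that entry goes, the fragment below places it at the head of a pipeline list. The variable name `train_pipeline` is only illustrative, and a real config would add further steps (annotation loading, augmentation, packing) after it.

```python
# Minimal sketch: a pipeline whose loading step ignores EXIF orientation.
# Only the LoadImageFromFile entry comes from this guide; `train_pipeline`
# is an illustrative name, and real pipelines continue with more steps.
train_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    # ... annotation loading, augmentation, and packing steps would follow
]
```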

ICDAR 2011 (Born-Digital Images)

  • Step1: Download Challenge1_Training_Task12_Images.zip, Challenge1_Training_Task1_GT.zip, Challenge1_Test_Task12_Images.zip, and Challenge1_Test_Task1_GT.zip from the homepage, under Task 1.1: Text Localization (2013 edition).

    mkdir icdar2011 && cd icdar2011
    mkdir imgs && mkdir annotations
    
    # Download ICDAR 2011
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate
    
    # For images
    unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
    unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
    # For annotations
    unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
    unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test
    
    rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
    
  • Step2: Generate instances_training.json and instances_test.json with the following command:

    python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── icdar2011
    │   ├── imgs
    │   ├── instances_test.json
    │   └── instances_training.json
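
A quick way to confirm that a prepared dataset matches its expected tree is to check each entry for existence. The sketch below does that for the ICDAR 2011 layout; `missing_entries` and `ICDAR2011_LAYOUT` are hypothetical names introduced here, not part of the converter tools.

```python
from pathlib import Path

# Entries taken from the ICDAR 2011 directory tree in this guide.
ICDAR2011_LAYOUT = ["imgs", "instances_test.json", "instances_training.json"]


def missing_entries(root, expected):
    """Return the expected relative paths that do not exist under root.

    Hypothetical helper for illustration; it only checks existence and
    does not validate the contents of any file.
    """
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_entries("PATH/TO/icdar2011", ICDAR2011_LAYOUT)` returns an empty list when every entry is present. The same helper can be reused for the other datasets by passing the entries from their trees.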
    

ICDAR 2017

  • Follow steps similar to those for ICDAR 2015.

  • The resulting directory structure looks like the following:

    ├── icdar2017
    │   ├── imgs
    │   ├── annotations
    │   ├── instances_training.json
    │   └── instances_val.json
    

CurvedSynText150k

  • Step1: Download syntext1.zip and syntext2.zip to CurvedSynText150k/.

  • Step2:

    unzip -q syntext1.zip
    mv train.json train1.json
    unzip images.zip
    rm images.zip
    
    unzip -q syntext2.zip
    mv train.json train2.json
    unzip images.zip
    rm images.zip
    
  • Step3: Download instances_training.json to CurvedSynText150k/

  • Or, generate instances_training.json with the following command:

    python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
    
  • The resulting directory structure looks like the following:

    ├── CurvedSynText150k
    │   ├── syntext_word_eng
    │   ├── emcs_imgs
    │   └── instances_training.json
    

DeText

  • Step1: Download ch9_training_images.zip, ch9_training_localization_transcription_gt.zip, ch9_validation_images.zip, and ch9_validation_localization_transcription_gt.zip from Task 3: End to End on the homepage.

    mkdir detext && cd detext
    mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
    
    # Download DeText
    wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
    
    # Extract images and annotations
    unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
    
    # Remove zips
    rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
    
  • Step2: Generate instances_training.json and instances_val.json with the following command:

    python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── detext
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_val.json
    │   └── instances_training.json
    

Lecture Video DB

  • Step1: Download IIIT-CVid.zip to lv/.

    mkdir lv && cd lv
    
    # Download LV dataset
    wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
    unzip -q IIIT-CVid.zip
    
    mv IIIT-CVid/Frames imgs
    
    rm IIIT-CVid.zip
    
  • Step2: Generate instances_training.json, instances_val.json, and instances_test.json with the following command:

    python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
    
  • The resulting directory structure looks like the following:

    │── lv
    │   ├── imgs
    │   ├── instances_test.json
    │   ├── instances_training.json
    │   └── instances_val.json
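
After a converter finishes, it can help to glance at what the generated JSON contains before training on it. This guide does not describe the schema of the instances_*.json files, so the sketch below assumes nothing about it and only reports the size or type of each top-level entry. `summarize_json` is a hypothetical helper name.

```python
import json


def summarize_json(path):
    """Map each top-level key to its length (lists/dicts) or its type name.

    Hypothetical helper for illustration; it makes no assumption about the
    annotation schema and only inspects the top level of the file.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        size = len(data) if isinstance(data, list) else type(data).__name__
        return {"<root>": size}
    return {
        key: len(value) if isinstance(value, (list, dict)) else type(value).__name__
        for key, value in data.items()
    }
```

For example, `summarize_json("PATH/TO/lv/instances_training.json")` shows at a glance whether the file is empty or unexpectedly small.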
    

LSVT

  • Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to lsvt/.

    mkdir lsvt && cd lsvt
    
    # Download LSVT dataset
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
    
    mkdir annotations
    tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
    mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
    mv train_full_images_0 imgs
    
    rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
    
  • Step2: Generate instances_training.json and instances_val.json (optional) with the following command:

    # Annotations of the LSVT test split are not publicly available; split off a
    # validation set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
    
  • After running the commands above, the directory structure should be as follows:

    │── lsvt
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

IMGUR

  • Step1: Run download_imgur5k.py to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.

    mkdir imgur && cd imgur
    
    git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
    
    # Download images from imgur.com. This may take SEVERAL HOURS!
    python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
    
    # For annotations
    mkdir annotations
    mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
    
    rm -rf IMGUR5K-Handwriting-Dataset
    
  • Step2: Generate instances_training.json, instances_val.json, and instances_test.json with the following command:

    python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
    
  • After running the commands above, the directory structure should be as follows:

    │── imgur
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_test.json
    │   ├── instances_training.json
    │   └── instances_val.json
    

KAIST

  • Step1: Download KAIST_all.zip to kaist/.

    mkdir kaist && cd kaist
    mkdir imgs && mkdir annotations
    
    # Download KAIST dataset
    wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
    unzip -q KAIST_all.zip
    
    rm KAIST_all.zip
    
  • Step2: Extract zips:

    python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
    
  • Step3: Generate instances_training.json and instances_val.json (optional) with the following command:

    # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── kaist
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

MTWI

  • Step1: Download mtwi_2018_train.zip from the homepage and place it in mtwi/.

    mkdir mtwi && cd mtwi
    
    unzip -q mtwi_2018_train.zip
    mv image_train imgs && mv txt_train annotations
    
    rm mtwi_2018_train.zip
    
  • Step2: Generate instances_training.json and instances_val.json (optional) with the following command:

    # Annotations of the MTWI test split are not publicly available; split off a
    # validation set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── mtwi
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

ReCTS

  • Step1: Download ReCTS.zip to rects/ from the homepage.

    mkdir rects && cd rects
    
    # Download ReCTS dataset
    # You can also find Google Drive link on the dataset homepage
    wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
    unzip -q ReCTS.zip
    
    mv img imgs && mv gt_unicode annotations
    
    rm ReCTS.zip && rm -rf gt
    
  • Step2: Generate instances_training.json and instances_val.json (optional) with the following command:

    # Annotations of the ReCTS test split are not publicly available; split off a
    # validation set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
    
  • After running the commands above, the directory structure should be as follows:

    │── rects
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_val.json (optional)
    │   └── instances_training.json
    

ILST

  • Step1: Download IIIT-ILST from OneDrive

  • Step2: Run the following commands

    unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
    cd IIIT-ILST
    
    # rename files
    cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
    cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
    cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..
    
    # transfer image path
    mkdir imgs && mkdir annotations
    mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
    mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
    mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
    
    # remove unnecessary files
    rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
    
  • Step3: Generate instances_training.json and instances_val.json (optional). Since the original dataset doesn't have a validation set, you may specify --val-ratio to split the dataset. For example, with --val-ratio 0.2, 20% of the data is held out as the validation set.

    python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── IIIT-ILST
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_val.json (optional)
    │   └── instances_training.json
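
The shell loops in Step2 prefix every file name with its script name so that files from the three folders do not collide once merged. The sketch below shows the same renaming in Python for readers who prefer to avoid parsing `ls` output, which breaks on names containing spaces. `add_prefix` is a hypothetical helper, not part of the MMOCR tools.

```python
from pathlib import Path


def add_prefix(directory, prefix):
    """Rename every regular file in directory to prefix + original name.

    Hypothetical helper for illustration. The listing is materialised with
    sorted() before renaming so that renamed files are not visited again.
    """
    directory = Path(directory)
    for path in sorted(directory.iterdir()):
        if path.is_file():
            path.rename(path.with_name(prefix + path.name))
```

For instance, `add_prefix("Telugu", "telugu_")` corresponds to the third loop in Step2.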
    

VinText

  • Step1: Download vintext.zip to vintext

    mkdir vintext && cd vintext
    
    # Download dataset from google drive
    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
    
    # Extract images and annotations
    unzip -q vintext.zip && rm vintext.zip
    mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
    rm -rf vietnamese
    
    # Rename files
    mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
    mkdir imgs
    mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
    
  • Step2: Generate instances_training.json, instances_test.json and instances_unseen_test.json

    python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── vintext
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_test.json
    │   ├── instances_unseen_test.json
    │   └── instances_training.json
    

BID

  • Step1: Download BID Dataset.zip

  • Step2: Run the following commands to preprocess the dataset

    # Rename
    mv BID\ Dataset.zip BID_Dataset.zip
    
    # Unzip and Rename
    unzip -q BID_Dataset.zip && rm BID_Dataset.zip
    mv BID\ Dataset BID
    
    # The extracted BID files may have restrictive permissions, so
    # grant access to them before continuing
    chmod -R 777 BID
    cd BID
    mkdir imgs && mkdir annotations
    
    # For images and annotations
    mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
    mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
    mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
    mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
    mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
    mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
    mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
    mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso
    
    # Remove unnecessary files
    rm -rf desktop.ini
    
  • Step3: Generate instances_training.json and instances_val.json (optional). Since the original dataset doesn't have a validation set, you may specify --val-ratio to split the dataset. For example, with --val-ratio 0.2, 20% of the data is held out as the validation set.

    python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── BID
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
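
The eight near-identical `mv` lines in Step2 apply one rule per subfolder: files matching `*in.jpg` go to imgs, files matching `*txt` go to annotations, and the subfolder is then removed. As a sketch of that rule written once, the loop below does the same in Python. `flatten_bid` and `BID_SUBDIRS` are hypothetical names; the folder names are copied from the shell commands.

```python
import shutil
from pathlib import Path

# Subfolder names as they appear in the shell commands of Step2.
BID_SUBDIRS = [
    "CNH_Aberta", "CNH_Frente", "CNH_Verso", "CPF_Frente",
    "CPF_Verso", "RG_Aberto", "RG_Frente", "RG_Verso",
]


def flatten_bid(root):
    """Move *in.jpg into imgs/ and *txt into annotations/, then delete each subfolder.

    Hypothetical helper for illustration; like the shell version, any other
    file left in a subfolder is discarded together with it.
    """
    root = Path(root)
    imgs, anns = root / "imgs", root / "annotations"
    imgs.mkdir(exist_ok=True)
    anns.mkdir(exist_ok=True)
    for name in BID_SUBDIRS:
        sub = root / name
        if not sub.is_dir():
            continue  # tolerate subfolders that were already processed
        # Materialise each listing before moving files out of the folder.
        for img in list(sub.glob("*in.jpg")):
            shutil.move(str(img), str(imgs / img.name))
        for txt in list(sub.glob("*txt")):
            shutil.move(str(txt), str(anns / txt.name))
        shutil.rmtree(sub)
```

Run it as `flatten_bid("BID")` from the directory that contains the extracted dataset.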
    

RCTW

  • Step1: Download train_images.zip.001, train_images.zip.002, and train_gts.zip from the homepage, extract the zips to rctw/imgs and rctw/annotations, respectively.

  • Step2: Generate instances_training.json and instances_val.json (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. For example, with --val-ratio 0.2, 20% of the data is held out as the validation set.

    # Annotations of the RCTW test split are not publicly available; split off a validation set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── rctw
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
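
To make the effect of --val-ratio concrete, the sketch below holds out a fraction of a list of samples as a validation set. It only illustrates the idea: this guide does not specify how the converters shuffle or seed their split, so the seeded shuffle and the helper name `split_train_val` are assumptions.

```python
import random


def split_train_val(items, val_ratio=0.0, seed=0):
    """Shuffle a copy of items and hold out int(len * val_ratio) for validation.

    Illustrative only; the shuffle and the seed are assumptions of this
    sketch and may differ from what the MMOCR converters actually do.
    """
    if not 0.0 <= val_ratio < 1.0:
        raise ValueError("val_ratio must be in [0, 1)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    num_val = int(len(shuffled) * val_ratio)
    return shuffled[num_val:], shuffled[:num_val]
```

With ten samples and `val_ratio=0.2`, two samples end up in the validation list and eight in the training list.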
    

HierText

  • Step1 (optional): Install AWS CLI.

  • Step2: Clone HierText repo to get annotations

    mkdir HierText && cd HierText
    git clone https://github.com/google-research-datasets/hiertext.git
    
  • Step3: Download train.tgz and validation.tgz from AWS

    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
    
  • Step4: Process raw data

    # process annotations
    mv hiertext/gt ./
    rm -rf hiertext
    mv gt annotations
    gzip -d annotations/train.jsonl.gz
    gzip -d annotations/validation.jsonl.gz
    # process images
    mkdir imgs
    mv train.tgz imgs/
    mv validation.tgz imgs/
    tar -xzvf imgs/train.tgz -C imgs
    tar -xzvf imgs/validation.tgz -C imgs
    
  • Step5: Generate instances_training.json and instances_val.json. HierText includes several levels of annotation: paragraph, line, and word. Check the original paper for details. Set --level paragraph, --level line, or --level word to get annotations at the corresponding level.

    # Collect word-level annotations from HierText with --level word
    python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── HierText
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json
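
The --level option selects how deep into the annotation hierarchy the converter descends. The sketch below illustrates that idea on a single image annotation. It assumes a nested layout in which each annotation holds `paragraphs`, each paragraph holds `lines`, each line holds `words`, and every node carries a `vertices` polygon. That layout is an assumption of this sketch and should be checked against the HierText repository. `collect_polygons` is a hypothetical helper, not the converter's own code.

```python
def collect_polygons(annotation, level="word"):
    """Return the vertices of every paragraph, line, or word in one annotation.

    Assumes the nesting paragraphs -> lines -> words, each node with a
    "vertices" entry. This schema is an assumption made for illustration.
    """
    if level not in ("paragraph", "line", "word"):
        raise ValueError(f"unknown level: {level}")
    polygons = []
    for paragraph in annotation.get("paragraphs", []):
        if level == "paragraph":
            polygons.append(paragraph["vertices"])
            continue
        for line in paragraph.get("lines", []):
            if level == "line":
                polygons.append(line["vertices"])
                continue
            for word in line.get("words", []):
                polygons.append(word["vertices"])
    return polygons
```

Choosing a coarser level yields fewer, larger polygons per image, while `word` yields the most instances.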
    

ArT

  • Step1: Download train_images.tar.gz and train_labels.json from the homepage to art/

    mkdir art && cd art
    mkdir annotations
    
    # Download ArT dataset
    wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate
    wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate
    
    # Extract
    tar -xf train_images.tar.gz
    mv train_images imgs
    mv train_labels.json annotations/
    
    # Remove unnecessary files
    rm train_images.tar.gz
    
  • Step2: Generate instances_training.json and instances_val.json (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. For example, with --val-ratio 0.2, 20% of the data is held out as the validation set.

    # Annotations of the ArT test split are not publicly available; split off a validation set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/art_converter.py PATH/TO/art --nproc 4
    
  • After running the commands above, the directory structure should be as follows:

    │── art
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)