
Text Recognition

This page is a manual preparation guide for datasets not yet supported by Dataset Preparer, into which all these scripts will eventually be migrated.


Dataset            images         annotation file (training)   annotation file (test)
coco_text          homepage       train_labels.json            -
ICDAR2011          homepage       -                            -
SynthAdd           (code:627x)    train_labels.json            -
OpenVINO           Open Images    annotations                  annotations
DeText             homepage       -                            -
Lecture Video DB   homepage       -                            -
LSVT               homepage       -                            -
IMGUR              homepage       -                            -
KAIST              homepage       -                            -
MTWI               homepage       -                            -
ReCTS              homepage       -                            -
IIIT-ILST          homepage       -                            -
VinText            homepage       -                            -
BID                homepage       -                            -
RCTW               homepage       -                            -
HierText           homepage       -                            -
ArT                homepage       -                            -

(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.

Install AWS CLI (optional)

  • Some datasets require the AWS CLI to be installed in advance, so a quick installation guide is provided here:

      curl "" -o ""
      sudo ./aws/install
      ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
      aws configure
      # This command prompts for keys; you can leave every field empty except
      # the default region name:
      # AWS Access Key ID [None]:
      # AWS Secret Access Key [None]:
      # Default region name [None]: us-east-1
      # Default output format [None]:

For users in China, these datasets can also be downloaded at high speed from OpenDataLab.

ICDAR 2011 (Born-Digital Images)

  • Step1: Download,, and Challenge1_Test_Task3_GT.txt from homepage Task 1.3: Word Recognition (2013 edition).

    mkdir icdar2011 && cd icdar2011
    mkdir annotations
    # Download ICDAR 2011
    wget --no-check-certificate
    wget --no-check-certificate
    wget --no-check-certificate
    # For images
    mkdir crops
    unzip -q -d crops/train
    unzip -q -d crops/test
    # For annotations
    mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
  • Step2: Convert original annotations to train_labels.json and test_labels.json with the following command:

    python tools/dataset_converters/textrecog/ PATH/TO/icdar2011
  • After running the above commands, the directory structure should be as follows:

    ├── icdar2011
    │   ├── crops
    │   ├── train_labels.json
    │   └── test_labels.json
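
A quick way to confirm that a prepared dataset matches the tree above is to check for the expected entries. The following is a minimal sketch, not part of the converter tooling: the helper name `missing_entries` and the root path are hypothetical, and the entry names are copied from the ICDAR 2011 tree.

```python
from pathlib import Path


def missing_entries(root, expected):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


# Entries copied from the ICDAR 2011 tree; adjust the list for other datasets.
ICDAR2011_LAYOUT = ["crops", "train_labels.json", "test_labels.json"]

if __name__ == "__main__":
    # "PATH/TO/icdar2011" is a placeholder for your local dataset root.
    missing = missing_entries("PATH/TO/icdar2011", ICDAR2011_LAYOUT)
    print("missing:", missing or "nothing")
```

The same helper can be reused for any of the datasets on this page by swapping in the entries from the corresponding tree.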


coco_text

  • Step1: Download from homepage

  • Step2: Download train_labels.json

  • After running the above commands, the directory structure should be as follows:

    ├── coco_text
    │   ├── train_labels.json
    │   └── train_words


SynthAdd

  • Step1: Download from SynthAdd (code:627x)

  • Step2: Download train_labels.json

  • Step3:

    mkdir SynthAdd && cd SynthAdd
    mv /path/to/ .
    mv /path/to/train_labels.json .
    # create soft link
    cd /path/to/mmocr/data/recog
    ln -s /path/to/SynthAdd SynthAdd
  • After running the above commands, the directory structure should be as follows:

    ├── SynthAdd
    │   ├── train_labels.json
    │   └── SynthText_Add


OpenVINO

  • Step1 (optional): Install AWS CLI.

  • Step2: Download Open Images subsets train_1, train_2, train_5, train_f, and validation to openvino/.

    mkdir openvino && cd openvino
    # Download Open Images subsets
    for s in 1 2 5 f; do
      aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
    done
    aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .
    # Download annotations for the train_1, train_2, train_5, train_f, and validation subsets
    # Extract images
    mkdir -p openimages_v5/val
    for s in 1 2 5 f; do
      tar zxf train_${s}.tar.gz -C openimages_v5
    done
    tar zxf validation.tar.gz -C openimages_v5/val
  • Step3: Generate train_{1,2,5,f}_labels.json, val_labels.json and crop images using 4 processes with the following command:

    python tools/dataset_converters/textrecog/ /path/to/openvino 4
  • After running the above commands, the directory structure should be as follows:

    ├── OpenVINO
    │   ├── image_1
    │   ├── image_2
    │   ├── image_5
    │   ├── image_f
    │   ├── image_val
    │   ├── train_1_labels.json
    │   ├── train_2_labels.json
    │   ├── train_5_labels.json
    │   ├── train_f_labels.json
    │   └── val_labels.json
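
The Open Images download in Step2 repeats one anonymous `aws s3 cp` call per archive. As a rough sketch of the same loop in Python, the snippet below only builds and prints the commands; the bucket path and archive names come from the commands above, while the function name `download_commands` is hypothetical.

```python
import shlex

# Bucket path as used in the shell commands above.
BUCKET = "s3://open-images-dataset/tar"


def download_commands(subsets=("1", "2", "5", "f")):
    """Build anonymous `aws s3 cp` commands for the train subsets and validation."""
    archives = [f"train_{s}.tar.gz" for s in subsets] + ["validation.tar.gz"]
    return [
        ["aws", "s3", "--no-sign-request", "cp", f"{BUCKET}/{name}", "."]
        for name in archives
    ]


for cmd in download_commands():
    # Printing only; pass each list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Building the commands as argument lists avoids shell quoting issues if the snippet is later wired up to `subprocess.run`.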


DeText

  • Step1: Download,,, and from Task 3: End to End on the homepage.

    mkdir detext && cd detext
    mkdir -p imgs/training imgs/val annotations/training annotations/val
    # Download DeText
    wget --no-check-certificate
    wget --no-check-certificate
    wget --no-check-certificate
    wget --no-check-certificate
    # Extract images and annotations
    unzip -q -d imgs/training && unzip -q -d annotations/training && unzip -q -d imgs/val && unzip -q -d annotations/val
    # Remove zips
    rm && rm && rm && rm
  • Step2: Generate train_labels.json and test_labels.json with the following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/detext/ignores
    python tools/dataset_converters/textrecog/ PATH/TO/detext --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── detext
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── test_labels.json
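
The split between crops and ignores in the tree above depends on whether a cropped image counts as vertical. The converter's exact rule is not stated on this page, so the sketch below is only an illustration: it assumes a crop is vertical when its height exceeds twice its width (the threshold `ratio=2.0` and both function names are hypothetical).

```python
def is_vertical(width, height, ratio=2.0):
    """Treat a crop as vertical when it is more than `ratio` times taller than wide.

    The default ratio is an assumption for illustration, not the converter's value.
    """
    return height > ratio * width


def destination(width, height, preserve_vertical=False):
    """Pick the output folder for a crop: 'crops' or 'ignores'."""
    if is_vertical(width, height) and not preserve_vertical:
        return "ignores"
    return "crops"


print(destination(100, 30))                          # crops
print(destination(20, 100))                          # ignores
print(destination(20, 100, preserve_vertical=True))  # crops
```

With `preserve_vertical=True`, which mirrors passing --preserve-vertical, every crop lands in crops regardless of its aspect ratio.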


NAF

  • Step1: Download labeled_images.tar.gz to naf/.

    mkdir naf && cd naf
    # Download NAF dataset
    tar -zxf labeled_images.tar.gz
    # For images
    mkdir annotations && mv labeled_images imgs
    # For annotations
    git clone
    mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/
    rm -rf NAF_dataset && rm labeled_images.tar.gz
  • Step2: Generate train_labels.json, val_labels.json, and test_labels.json with the following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/naf/ignores
    python tools/dataset_converters/textrecog/ PATH/TO/naf --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── naf
    │   ├── crops
    │   ├── train_labels.json
    │   ├── val_labels.json
    │   └── test_labels.json

Lecture Video DB

This section is not fully tested yet.
The LV dataset already provides cropped images and the corresponding annotations.
  • Step1: Download to lv/.

    mkdir lv && cd lv
    # Download LV dataset
    unzip -q
    # For image
    mv IIIT-CVid/Crops ./
    # For annotation
    mv IIIT-CVid/train.txt train_labels.json && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_labels.json
  • Step2: Generate train_labels.json, val.json, and test.json with the following command:

    python tools/dataset_converters/textrecog/ PATH/TO/lv
  • After running the above commands, the directory structure should be as follows:

    ├── lv
    │   ├── Crops
    │   ├── train_labels.json
    │   └── test_labels.json


LSVT

This section is not fully tested yet.
  • Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to lsvt/.

    mkdir lsvt && cd lsvt
    # Download LSVT dataset
    mkdir annotations
    tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
    mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
    mv train_full_images_0 imgs
    rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
  • Step2: Generate train_labels.json and val_label.json (optional) with the following command:

    # Annotations of the LSVT test split are not publicly available; split off a
    # validation set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/lsvt/ignores
    python tools/dataset_converters/textrecog/ PATH/TO/lsvt --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── lsvt
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)


IMGUR

This section is not fully tested yet.
  • Step1: Run to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.

    mkdir imgur && cd imgur
    git clone
    # Download images. This may take SEVERAL HOURS!
    python ./IMGUR5K-Handwriting-Dataset/ --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
    # For annotations
    mkdir annotations
    mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
    rm -rf IMGUR5K-Handwriting-Dataset
  • Step2: Generate train_labels.json, val_label.json, and test_labels.json and crop images with the following command:

    python tools/dataset_converters/textrecog/ PATH/TO/imgur
  • After running the above commands, the directory structure should be as follows:

    ├── imgur
    │   ├── crops
    │   ├── train_labels.json
    │   ├── test_labels.json
    │   └── val_label.json


KAIST

This section is not fully tested yet.
  • Step1: Download to kaist/.

    mkdir kaist && cd kaist
    mkdir imgs && mkdir annotations
    # Download KAIST dataset
    unzip -q && rm
  • Step2: Extract zips:

    python tools/dataset_converters/common/ PATH/TO/kaist
  • Step3: Generate train_labels.json and val_label.json (optional) with the following command:

    # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/kaist/ignores
    python tools/dataset_converters/textrecog/ PATH/TO/kaist --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── kaist
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)


MTWI

This section is not fully tested yet.
  • Step1: Download from homepage.

    mkdir mtwi && cd mtwi
    unzip -q
    mv image_train imgs && mv txt_train annotations
  • Step2: Generate train_labels.json and val_label.json (optional) with the following command:

    # Annotations of the MTWI test split are not publicly available; split off a
    # validation set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/mtwi/ignores
    python tools/dataset_converters/textrecog/ PATH/TO/mtwi --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── mtwi
    │   ├── crops
    │   ├── train_labels.json
    │   └── val_label.json (optional)


ReCTS

This section is not fully tested yet.
  • Step1: Download to rects/ from the homepage.

    mkdir rects && cd rects
    # Download ReCTS dataset
    # You can also find Google Drive link on the dataset homepage
    wget --no-check-certificate
    unzip -q
    mv img imgs && mv gt_unicode annotations
    rm -f && rm -rf gt
  • Step2: Generate train_labels.json and val_label.json (optional) with the following command:

    # Annotations of the ReCTS test split are not publicly available; split off a
    # validation set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/rects/ignores
    python tools/dataset_converters/textrecog/ PATH/TO/rects --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── rects
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)


IIIT-ILST

This section is not fully tested yet.
  • Step1: Download from onedrive link

  • Step2: Run the following commands

    unzip -q && rm
    cd IIIT-ILST
    # rename files
    cd Devanagari && for i in *; do mv -f "$i" "devanagari_$i"; done && cd ..
    cd Malayalam && for i in *; do mv -f "$i" "malayalam_$i"; done && cd ..
    cd Telugu && for i in *; do mv -f "$i" "telugu_$i"; done && cd ..
    # transfer image path
    mkdir imgs && mkdir annotations
    mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
    mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
    mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
    # remove unnecessary files
    rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
  • Step3: Generate train_labels.json and val_label.json (optional) and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    python tools/dataset_converters/textrecog/ PATH/TO/IIIT-ILST --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── IIIT-ILST
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)
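
The renaming in Step2 prefixes every file with its script name so that files from the three folders do not collide once they are moved into imgs/ and annotations/. A minimal Python equivalent of that step is sketched below; the helper name `add_prefix` is hypothetical, and it only touches regular files directly inside the folder.

```python
from pathlib import Path


def add_prefix(folder, prefix):
    """Rename every file directly inside `folder` to `<prefix><original name>`."""
    folder = Path(folder)
    renamed = []
    # sorted() snapshots the listing first, so renamed files are not visited again.
    for path in sorted(p for p in folder.iterdir() if p.is_file()):
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Example (folder names from Step2):
# add_prefix("IIIT-ILST/Devanagari", "devanagari_")
# add_prefix("IIIT-ILST/Malayalam", "malayalam_")
# add_prefix("IIIT-ILST/Telugu", "telugu_")
```

Run it once per folder; a second run would add the prefix again, just as the shell loop would.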


VinText

This section is not fully tested yet.
  • Step1: Download to vintext

    mkdir vintext && cd vintext
    # Download dataset from google drive
    wget --load-cookies /tmp/cookies.txt "$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate '' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O && rm -rf /tmp/cookies.txt
    # Extract images and annotations
    unzip -q && rm
    mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
    rm -rf vietnamese
    # Rename files
    mv labels annotations && mv test_image test && mv train_images  training && mv unseen_test_images  unseen_test
    mkdir imgs
    mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
  • Step2: Generate train_labels.json, test_labels.json, unseen_test_labels.json, and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts).

    python tools/dataset_converters/textrecog/ PATH/TO/vintext --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── vintext
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   ├── test_labels.json
    │   └── unseen_test_labels.json


BID

This section is not fully tested yet.
  • Step1: Download BID

  • Step2: Run the following commands to preprocess the dataset

    # Rename
    mv BID\
    # Unzip and Rename
    unzip -q && rm
    mv BID\ Dataset BID
    # The extracted BID files have restrictive permissions, so you may
    # need to grant access to them
    chmod -R 777 BID
    cd BID
    mkdir imgs && mkdir annotations
    # For images and annotations
    for d in CNH_Aberta CNH_Frente CNH_Verso CPF_Frente CPF_Verso RG_Aberto RG_Frente RG_Verso; do
      mv "$d"/*in.jpg imgs && mv "$d"/*txt annotations && rm -rf "$d"
    done
    # Remove unnecessary files
    rm -rf desktop.ini
  • Step3: Generate train_labels.json and val_label.json (optional) and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    python tools/dataset_converters/textrecog/ PATH/TO/BID --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── BID
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)
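
To make the effect of --val-ratio concrete, here is a minimal sketch of a ratio-based hold-out split. It illustrates the idea only: the shuffling, the fixed seed, and rounding down are assumptions, and the converter's own split logic may differ. The function name `split_by_ratio` is hypothetical.

```python
import random


def split_by_ratio(samples, val_ratio=0.2, seed=0):
    """Shuffle samples and hold out a `val_ratio` fraction for validation."""
    if not 0.0 <= val_ratio < 1.0:
        raise ValueError("val_ratio must be in [0, 1)")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    num_val = int(len(shuffled) * val_ratio)  # rounds down
    return shuffled[num_val:], shuffled[:num_val]


train, val = split_by_ratio(range(10), val_ratio=0.2)
print(len(train), len(val))  # 8 2
```

With a ratio of 0.2, two of every ten samples end up in the validation set and the remaining eight stay in training.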


RCTW

This section is not fully tested yet.
  • Step1: Download,, and from the homepage, extract the zips to rctw/imgs and rctw/annotations, respectively.

  • Step2: Generate train_labels.json and val_label.json (optional). Since the original dataset doesn't have a validation set, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    # Annotations of the RCTW test split are not publicly available; split off a validation set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
    python tools/dataset_converters/textrecog/ PATH/TO/rctw --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── rctw
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)


HierText

This section is not fully tested yet.
  • Step1 (optional): Install AWS CLI.

  • Step2: Clone HierText repo to get annotations

    mkdir HierText
    git clone
  • Step3: Download train.tgz and validation.tgz from AWS

    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
  • Step4: Process raw data

    # process annotations
    mv hiertext/gt ./
    rm -rf hiertext
    mv gt annotations
    gzip -d annotations/train.json.gz
    gzip -d annotations/validation.json.gz
    # process images
    mkdir imgs
    mv train.tgz imgs/
    mv validation.tgz imgs/
    tar -xzvf imgs/train.tgz
    tar -xzvf imgs/validation.tgz
  • Step5: Generate train_labels.json and val_label.json. HierText provides annotations at three levels: paragraph, line, and word (see the original paper for details). Set --level paragraph, --level line, or --level word to get annotations at the corresponding level.

    # Collect word-level annotations from HierText with --level word
    # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
    python tools/dataset_converters/textrecog/ PATH/TO/HierText --level word --nproc 4
  • After running the above commands, the directory structure should be as follows:

    ├── HierText
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json
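
The three annotation levels nest inside each other: paragraphs contain lines, and lines contain words. The sketch below shows how text could be collected at a chosen level from one image's annotation. The field names ("paragraphs", "lines", "words", "text") and the joining of lines into a paragraph string are assumptions about the JSON layout, and `collect_texts` is a hypothetical helper, not the converter's code.

```python
def collect_texts(annotation, level="word"):
    """Collect text strings from one image annotation at the requested level."""
    if level not in ("paragraph", "line", "word"):
        raise ValueError(f"unknown level: {level}")
    texts = []
    for paragraph in annotation.get("paragraphs", []):
        lines = paragraph.get("lines", [])
        if level == "paragraph":
            # Assumption: a paragraph's text is its lines joined by newlines.
            texts.append("\n".join(line.get("text", "") for line in lines))
        elif level == "line":
            texts.extend(line.get("text", "") for line in lines)
        else:
            for line in lines:
                texts.extend(word.get("text", "") for word in line.get("words", []))
    return texts


# Hand-written sample in the assumed layout, not real HierText data.
sample = {
    "paragraphs": [
        {"lines": [
            {"text": "OPEN DAILY", "words": [{"text": "OPEN"}, {"text": "DAILY"}]},
            {"text": "9 TO 5", "words": [{"text": "9"}, {"text": "TO"}, {"text": "5"}]},
        ]}
    ]
}
print(collect_texts(sample, "line"))  # ['OPEN DAILY', '9 TO 5']
```

Choosing a coarser level yields fewer, longer samples, while the word level yields the most samples.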


ArT

This section is not fully tested yet.
  • Step1: Download train_task2_images.tar.gz and train_task2_labels.json from the homepage to art/

    mkdir art && cd art
    mkdir annotations
    # Download ArT dataset
    # Extract
    tar -xf train_task2_images.tar.gz
    mv train_task2_images crops
    mv train_task2_labels.json annotations/
    # Remove unnecessary files
    rm train_task2_images.tar.gz
  • Step2: Generate train_labels.json and val_label.json (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    # Annotations of the ArT test split are not publicly available; split off a validation set by adding --val-ratio 0.2
    python tools/dataset_converters/textrecog/ PATH/TO/art
  • After running the above commands, the directory structure should be as follows:

    ├── art
    │   ├── crops
    │   ├── train_labels.json
    │   └── val_label.json (optional)