Data Transforms and Pipeline

In MMOCR, dataset construction and data preparation are decoupled. Dataset classes such as OCRDataset are responsible for loading and parsing annotation files, while data transforms apply preprocessing, augmentation, formatting, and other related operations. Currently, MMOCR implements five types of data transforms, as shown in the following table.

| Transforms Type | File | Description |
| --- | --- | --- |
| Data Loading | loading.py | Implements the data loading functions. |
| Data Formatting | formatting.py | Formats the data required by different tasks. |
| Cross Project Data Adapter | adapters.py | Converts the data format between other OpenMMLab projects and MMOCR. |
| Data Augmentation Functions | ocr_transforms.py, textdet_transforms.py, textrecog_transforms.py | Various built-in data augmentation methods designed for different tasks. |
| Wrappers of Third Party Packages | wrappers.py | Wraps the transforms implemented in popular third-party packages such as ImgAug, and adapts them to the MMOCR format. |

Since data transform classes are independent of one another, we can freely combine them into a data pipeline once the data fields have been defined. As shown in the following figure, a typical training data pipeline in MMOCR consists of three stages: data loading, data augmentation, and data formatting. Users only need to define the pipeline as a list in the configuration file, specifying each data transform class and its parameters:

Flowchart

train_pipeline_r18 = [
    # Loading images
    dict(
        type='LoadImageFromFile',
        color_type='color_ignore_orientation'),
    # Loading annotations
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    # Data augmentation
    dict(
        type='ImgAugWrapper',
        args=[['Fliplr', 0.5],
              dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]]),
    dict(type='RandomCrop', min_side_ratio=0.1),
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='Pad', size=(640, 640)),
    # Data formatting
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
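
Conceptually, a pipeline like the one above is an ordered list of callables that each receive a dictionary and return it, possibly updated. The following is a minimal sketch of that idea in plain Python, not MMOCR's actual implementation: `load_image`, `pad_to_square`, `pack_inputs` and `run_pipeline` are hypothetical stand-ins, and the image size is hard-coded rather than read from a file.

```python
from typing import Callable, Dict, List

Transform = Callable[[Dict], Dict]


def load_image(results: Dict) -> Dict:
    """Stand-in loader: pretends the image at `img_path` is 480x640."""
    results["img_shape"] = (480, 640)
    results["ori_shape"] = (480, 640)
    return results


def pad_to_square(results: Dict) -> Dict:
    """Stand-in augmentation: pads the shorter side up to the longer one."""
    side = max(results["img_shape"])
    results["img_shape"] = (side, side)
    return results


def pack_inputs(results: Dict) -> Dict:
    """Stand-in formatter: keeps only the meta keys a model would need."""
    meta_keys = ("img_path", "ori_shape", "img_shape")
    return {key: results[key] for key in meta_keys}


def run_pipeline(transforms: List[Transform], results: Dict) -> Dict:
    """Apply each transform in order, feeding its output to the next."""
    for transform in transforms:
        results = transform(results)
    return results


pipeline = [load_image, pad_to_square, pack_inputs]
sample = run_pipeline(pipeline, {"img_path": "demo.jpg"})
print(sample)
```

The three stand-ins mirror the loading, augmentation and formatting stages; reordering or swapping entries in `pipeline` changes the result without touching any transform.
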
More tutorials about data pipeline configuration can be found in the [Config Doc](../user_guides/config.md#data-pipeline-configuration). Next, we will briefly introduce the data transforms supported in MMOCR according to their categories.

MMOCR provides a detailed docstring for each data transform. In the header of each data transform class, we annotate Required Keys, Modified Keys and Added Keys. The Required Keys are the fields that the input must contain for the transform to run, while the Modified Keys and Added Keys are the fields that the transform may modify in, or add to, the original data. For example, LoadImageFromFile implements image loading: its Required Keys contain the image path img_path, and its Modified Keys include the loaded image img, the current image size img_shape, the original image size ori_shape, and other image attributes.

@TRANSFORMS.register_module()
class LoadImageFromFile(MMCV_LoadImageFromFile):
    # We provide detailed docstring for each data transform.
    """Load an image from file.

    Required Keys:

    - img_path

    Modified Keys:

    - img
    - img_shape
    - ori_shape
    """
In the data pipeline of MMOCR, the image and label information are saved in a dictionary. By using the unified fields, the data can be freely transferred between different data transforms. Therefore, it is very important to understand the conventional fields used in MMOCR.
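
To make the key convention concrete, here is a small sketch of a toy transform that documents its keys in the same docstring style and checks its required keys before running. `ToyLoadImage` and its `required_keys` attribute are hypothetical and written in plain Python; real MMOCR transforms are registered classes and do not necessarily validate keys this way.

```python
from typing import Dict, Tuple


class ToyLoadImage:
    """Toy transform documented in the Required/Added Keys style.

    Required Keys:

    - img_path

    Added Keys:

    - img_shape
    - ori_shape
    """

    required_keys: Tuple[str, ...] = ("img_path",)

    def __call__(self, results: Dict) -> Dict:
        missing = [k for k in self.required_keys if k not in results]
        if missing:
            raise KeyError(f"missing required keys: {missing}")
        # A real loader would decode the file; a fixed size is used here.
        results["img_shape"] = (32, 100)
        results["ori_shape"] = (32, 100)
        return results


before = {"img_path": "word.png"}
after = ToyLoadImage()(dict(before))
added = sorted(set(after) - set(before))
print(added)
```

Comparing the dictionary before and after the call is a quick way to confirm which fields a transform actually adds.
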

For your convenience, the following table lists the conventional keys used in MMOCR data transforms.

| Key | Type | Description |
| --- | --- | --- |
| img | np.array(dtype=np.uint8) | Image array, shape of (h, w, c). |
| img_shape | tuple(int, int) | Current image size (h, w). |
| ori_shape | tuple(int, int) | Original image size (h, w). |
| scale | tuple(int, int) | Stores the target image size (h, w) specified by the user in the Resize data transform series. Note: This value may not correspond to the actual image size after the transformation. |
| scale_factor | tuple(float, float) | Stores the target image scale factor (w_scale, h_scale) specified by the user in the Resize data transform series. Note: This value may not correspond to the actual image size after the transformation. |
| keep_ratio | bool | Boolean flag that determines whether to keep the aspect ratio while scaling images. |
| flip | bool | Boolean flag indicating whether the image has been flipped. |
| flip_direction | str | Flipping direction, options are horizontal, vertical, diagonal. |
| gt_bboxes | np.array(dtype=np.float32) | Ground-truth bounding boxes. |
| gt_polygons | list[np.array(dtype=np.float32)] | Ground-truth polygons. |
| gt_bboxes_labels | np.array(dtype=np.int64) | Category label of bounding boxes. By default, MMOCR uses 0 to represent "text" instances. |
| gt_texts | list[str] | Ground-truth text content of the instance. |
| gt_ignored | np.array(dtype=np.bool_) | Boolean flag indicating whether to ignore the instance (used in text detection). |

Data Loading

Data loading transforms mainly implement the functions of loading data from different formats and backends. Currently, the following data loading transforms are implemented in MMOCR:

| Transforms Name | Required Keys | Modified/Added Keys | Description |
| --- | --- | --- | --- |
| LoadImageFromFile | img_path | img, img_shape, ori_shape | Load image from the specified path, supporting different file storage backends (e.g. disk, http, petrel) and decoding backends (e.g. cv2, turbojpeg, pillow, tifffile). |
| LoadOCRAnnotations | bbox, bbox_label, polygon, ignore, text | gt_bboxes, gt_bboxes_labels, gt_polygons, gt_ignored, gt_texts | Parse the annotation required by OCR task. |
| LoadKIEAnnotations | bboxes, bbox_labels, edge_labels, texts | gt_bboxes, gt_bboxes_labels, gt_edge_labels, gt_texts, ori_shape | Parse the annotation required by KIE task. |
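
The mapping from per-instance annotation fields to the gt_* keys can be sketched as follows. This is an illustration only: it assumes the raw annotations arrive as a list of per-instance dictionaries, `collect_ocr_annotations` is a hypothetical helper, and plain Python lists are used where MMOCR stores NumPy arrays.

```python
from typing import Dict, List


def collect_ocr_annotations(instances: List[Dict]) -> Dict:
    """Gather per-instance fields into the gt_* lists used downstream."""
    return {
        "gt_bboxes": [inst["bbox"] for inst in instances],
        "gt_bboxes_labels": [inst["bbox_label"] for inst in instances],
        "gt_polygons": [inst["polygon"] for inst in instances],
        "gt_ignored": [inst["ignore"] for inst in instances],
        "gt_texts": [inst["text"] for inst in instances],
    }


instances = [
    dict(bbox=[0, 0, 10, 20], bbox_label=0,
         polygon=[0, 0, 10, 0, 10, 20, 0, 20], ignore=False, text="open"),
    dict(bbox=[30, 5, 60, 25], bbox_label=0,
         polygon=[30, 5, 60, 5, 60, 25, 30, 25], ignore=True, text="###"),
]
gt = collect_ocr_annotations(instances)
print(gt["gt_texts"], gt["gt_ignored"])
```

Each gt_* entry keeps the instance order, so index i in every list refers to the same text instance.
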

Data Augmentation

Data augmentation is an indispensable process in text detection and recognition tasks. Currently, MMOCR implements dozens of data augmentation modules commonly used in OCR, organized into ocr_transforms.py, textdet_transforms.py, and textrecog_transforms.py.

Specifically, ocr_transforms.py implements generic OCR data augmentation modules such as RandomCrop and RandomRotate:

| Transforms Name | Required Keys | Modified/Added Keys | Description |
| --- | --- | --- | --- |
| RandomCrop | img, gt_bboxes, gt_bboxes_labels, gt_polygons, gt_ignored, gt_texts (optional) | img, img_shape, gt_bboxes, gt_bboxes_labels, gt_polygons, gt_ignored, gt_texts (optional) | Randomly crop the image and make sure the cropped image contains at least one text instance. The optional parameter min_side_ratio controls the ratio of the short side of the cropped image to the original image; its default value is 0.4. |
| RandomRotate | img, img_shape, gt_bboxes (optional), gt_polygons (optional) | img, img_shape, gt_bboxes (optional), gt_polygons (optional), rotated_angle | Randomly rotate the image and optionally fill the blank areas of the rotated image. |

textdet_transforms.py implements text detection related data augmentation modules:

| Transforms Name | Required Keys | Modified/Added Keys | Description |
| --- | --- | --- | --- |
| RandomFlip | img, gt_bboxes, gt_polygons | img, gt_bboxes, gt_polygons, flip, flip_direction | Randomly flip the image. Supports horizontal, vertical and diagonal modes. Defaults to horizontal. |
| FixInvalidPolygon | gt_polygons, gt_ignored | gt_polygons, gt_ignored | Automatically fix the invalid polygons included in the annotations. |
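
A flip has to update the geometry annotations together with the image. The sketch below shows the horizontal case only and rests on assumptions: boxes are taken as [x1, y1, x2, y2], polygons as flat [x1, y1, x2, y2, ...] lists, and x is mirrored as `img_w - x`. The helper names are hypothetical, and MMOCR's own coordinate handling may differ in detail.

```python
from typing import List


def hflip_bbox(bbox: List[float], img_w: float) -> List[float]:
    """Mirror an [x1, y1, x2, y2] box across the vertical image axis."""
    x1, y1, x2, y2 = bbox
    # The old right edge becomes the new left edge, and vice versa.
    return [img_w - x2, y1, img_w - x1, y2]


def hflip_polygon(polygon: List[float], img_w: float) -> List[float]:
    """Mirror a flat [x1, y1, x2, y2, ...] polygon; y values are unchanged."""
    return [img_w - v if i % 2 == 0 else v for i, v in enumerate(polygon)]


flipped_box = hflip_bbox([10, 20, 50, 40], img_w=100)
flipped_poly = hflip_polygon([10, 20, 50, 20, 50, 40, 10, 40], img_w=100)
print(flipped_box, flipped_poly)
```

Applying either helper twice returns the original coordinates, which is a convenient sanity check for any flip implementation.
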

textrecog_transforms.py implements text recognition related data augmentation modules:

| Transforms Name | Required Keys | Modified/Added Keys | Description |
| --- | --- | --- | --- |
| RescaleToHeight | img | img, img_shape, scale, scale_factor, keep_ratio | Scales the image to the specified height while keeping the aspect ratio. When min_width and max_width are specified, the aspect ratio may be changed. |

The tables above introduce only a selection of the data augmentation methods. For more information, please refer to the [API documentation](../api.rst) or the code docstrings.
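
The note on RescaleToHeight about min_width and max_width can be illustrated with a little arithmetic. The function below is a sketch under assumptions, not MMOCR's code: `rescale_to_height` is a hypothetical helper, it only computes the output size, and the rounding (`math.ceil`) is chosen for illustration.

```python
import math
from typing import Optional, Tuple


def rescale_to_height(
    img_h: int,
    img_w: int,
    height: int,
    min_width: Optional[int] = None,
    max_width: Optional[int] = None,
) -> Tuple[int, int]:
    """Return the (h, w) after scaling to `height`, then clamping the width."""
    # Width that would preserve the aspect ratio exactly.
    new_w = math.ceil(img_w * height / img_h)
    if min_width is not None:
        new_w = max(new_w, min_width)
    if max_width is not None:
        new_w = min(new_w, max_width)
    return height, new_w


free = rescale_to_height(64, 200, height=32)
capped = rescale_to_height(64, 200, height=32, max_width=80)
widened = rescale_to_height(64, 40, height=32, min_width=32)
print(free, capped, widened)
```

Without bounds, a 64x200 image keeps its 1:3.125 ratio at 32x100; with max_width=80 or min_width=32 the width is clamped, so the ratio changes.
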

Data Formatting

Data formatting transforms are responsible for packaging images, ground truth labels, and other information into a dictionary. Different tasks usually rely on different formatting transforms. For example:

| Transforms Name | Required Keys | Modified/Added Keys | Description |
| --- | --- | --- | --- |
| PackTextDetInputs | - | - | Pack the inputs required by text detection. |
| PackTextRecogInputs | - | - | Pack the inputs required by text recognition. |
| PackKIEInputs | - | - | Pack the inputs required by KIE. |

Cross Project Data Adapters

The cross-project data adapters bridge the data formats between MMOCR and other OpenMMLab libraries such as MMDetection, making it possible to call models implemented in other OpenMMLab projects. Currently, MMOCR has implemented MMDet2MMOCR and MMOCR2MMDet, allowing data to be converted between MMDetection and MMOCR formats; with these adapters, users can easily train any detectors supported by MMDetection in MMOCR. For example, we provide a tutorial to show how to train Mask R-CNN as a text detector in MMOCR.

| Transforms Name | Required Keys | Modified/Added Keys | Description |
| --- | --- | --- | --- |
| MMDet2MMOCR | gt_masks, gt_ignore_flags | gt_polygons, gt_ignored | Convert the fields used in MMDet to MMOCR. |
| MMOCR2MMDet | img_shape, gt_polygons, gt_ignored | gt_masks, gt_ignore_flags | Convert the fields used in MMOCR to MMDet. |

Wrappers

To facilitate the use of popular third-party CV libraries in MMOCR, we provide wrappers in wrappers.py to unify the data format between MMOCR and other third-party libraries. Users can directly configure the data transforms provided by these libraries in the configuration file of MMOCR. The supported wrappers are as follows:

| Transforms Name | Required Keys | Modified/Added Keys | Description |
| --- | --- | --- | --- |
| ImgAugWrapper | img, gt_polygons (optional for text recognition), gt_bboxes (optional for text recognition), gt_bboxes_labels (optional for text recognition), gt_ignored (optional for text recognition), gt_texts (optional) | img, gt_polygons (optional for text recognition), gt_bboxes (optional for text recognition), gt_bboxes_labels (optional for text recognition), gt_ignored (optional for text recognition), img_shape (optional), gt_texts (optional) | ImgAug wrapper, which bridges the data format and configuration between ImgAug and MMOCR, allowing users to configure the data augmentation methods supported by ImgAug in MMOCR. |
| TorchVisionWrapper | img | img, img_shape | TorchVision wrapper, which bridges the data format and configuration between TorchVision and MMOCR, allowing users to configure the data transforms supported by torchvision.transforms in MMOCR. |

ImgAugWrapper Example

For example, in the original ImgAug, we can define a Sequential type data augmentation pipeline as follows to perform random flipping, random rotation and random scaling on the image:

import imgaug.augmenters as iaa

# Sequential expects its child augmenters as a single list.
aug = iaa.Sequential([
  iaa.Fliplr(0.5),                # horizontally flip 50% of all images
  iaa.Affine(rotate=(-10, 10)),   # rotate by -10 to +10 degrees
  iaa.Resize((0.5, 3.0))          # scale images to 50-300% of their size
])

In MMOCR, we can directly configure the above data augmentation pipeline in train_pipeline as follows:

dict(
  type='ImgAugWrapper',
  args=[
    ['Fliplr', 0.5],
    dict(cls='Affine', rotate=[-10, 10]),
    ['Resize', [0.5, 3.0]],
  ]
)

Specifically, the args parameter accepts a list, and each element in the list can be a list or a dictionary. If it is a list, the first element of the list is the class name in imgaug.augmenters, and the following elements are the initialization parameters of the class; if it is a dictionary, the cls key corresponds to the class name in imgaug.augmenters, and the other key-value pairs correspond to the initialization parameters of the class.
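
The two accepted forms of an args entry can be sketched as a small parsing step. This is an illustration of the rule described above rather than the wrapper's real code: `parse_aug_spec` is a hypothetical helper that only splits each entry into a class name and its arguments, without importing imgaug.

```python
from typing import Any, Dict, List, Tuple, Union

AugSpec = Union[List[Any], Dict[str, Any]]


def parse_aug_spec(spec: AugSpec) -> Tuple[str, tuple, dict]:
    """Split one `args` entry into (class name, positional args, keyword args)."""
    if isinstance(spec, list):
        # List form: [class_name, init_arg_1, init_arg_2, ...]
        return spec[0], tuple(spec[1:]), {}
    if isinstance(spec, dict):
        # Dict form: `cls` names the class, the rest are keyword arguments.
        kwargs = dict(spec)  # copy so the config is not mutated
        return kwargs.pop("cls"), (), kwargs
    raise TypeError(f"unsupported spec type: {type(spec).__name__}")


args = [
    ["Fliplr", 0.5],
    dict(cls="Affine", rotate=[-10, 10]),
    ["Resize", [0.5, 3.0]],
]
parsed = [parse_aug_spec(spec) for spec in args]
print(parsed)
```

With the class name and arguments separated, a wrapper could look the class up in imgaug.augmenters and instantiate it; that lookup is left out here to keep the sketch dependency-free.
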

TorchVisionWrapper Example

For example, in the original TorchVision, we can define a Compose type data transformation pipeline as follows to perform color jittering on the image:

import torchvision.transforms as transforms

aug = transforms.Compose([
  transforms.ColorJitter(
    brightness=32.0 / 255,  # brightness jittering range
    saturation=0.5)         # saturation jittering range
])

In MMOCR, we can directly configure the above data transformation pipeline in train_pipeline as follows:

dict(
  type='TorchVisionWrapper',
  op='ColorJitter',
  brightness=32.0 / 255,
  saturation=0.5
)

Specifically, the op parameter is the class name in torchvision.transforms, and the following parameters correspond to the initialization parameters of the class.
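
One way to picture this is a name lookup followed by a call with the remaining keyword arguments. The sketch below assumes nothing about the wrapper's internals: `build_op` is a hypothetical helper, and `FakeColorJitter` with `fake_transforms` stand in for torchvision.transforms so that the example runs without TorchVision installed.

```python
from types import SimpleNamespace
from typing import Any, Dict


class FakeColorJitter:
    """Stand-in for a transform class; it only records its arguments."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


# Hypothetical namespace playing the role of `torchvision.transforms`.
fake_transforms = SimpleNamespace(ColorJitter=FakeColorJitter)


def build_op(cfg: Dict[str, Any], namespace: Any) -> Any:
    """Look up `op` by name in `namespace` and pass the rest as kwargs."""
    kwargs = {k: v for k, v in cfg.items() if k not in ("type", "op")}
    return getattr(namespace, cfg["op"])(**kwargs)


cfg = dict(
    type="TorchVisionWrapper",
    op="ColorJitter",
    brightness=32.0 / 255,
    saturation=0.5,
)
op = build_op(cfg, fake_transforms)
print(type(op).__name__, op.kwargs)
```

Passing the real torchvision.transforms module as `namespace` would follow the same pattern, with `op` resolving to the actual transform class.
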