Data Transforms and Pipeline

In the design of MMOCR, dataset construction and preparation are decoupled. That is, dataset construction classes such as OCRDataset are responsible for loading and parsing annotation files, while data transforms handle data preprocessing, augmentation, formatting, and other related functions. Currently, there are five types of data transforms implemented in MMOCR, as shown in the following table.


Transforms Type	File	Description
Data Loading	loading.py	Implements the data loading functions.
Data Formatting	formatting.py	Formats the data required by different tasks.
Cross Project Data Adapter	adapters.py	Converts the data format between other OpenMMLab projects and MMOCR.
Data Augmentation Functions	ocr_transforms.py textdet_transforms.py textrecog_transforms.py	Various built-in data augmentation methods designed for different tasks.
Wrappers of Third Party Packages	wrappers.py	Wraps the transforms implemented in popular third party packages such as ImgAug, and adapts them to MMOCR format.

Since data transform classes are independent of one another, we can easily combine any of them to build a data pipeline once the data fields have been defined. As shown in the following figure, in MMOCR, a typical training data pipeline consists of three stages: data loading, data augmentation, and data formatting. Users only need to define the data pipeline list in the configuration file and specify each data transform class and its parameters:

train_pipeline_r18 = [
    # Loading images
    dict(
        type='LoadImageFromFile',
        color_type='color_ignore_orientation'),
    # Loading annotations
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    # Data augmentation
    dict(
        type='ImgAugWrapper',
        args=[['Fliplr', 0.5],
              dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]]),
    dict(type='RandomCrop', min_side_ratio=0.1),
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='Pad', size=(640, 640)),
    # Data formatting
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
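To make the mechanism concrete, the following is a minimal, framework-free sketch of how a list of config dicts can be turned into a chain of callables. The `TOY_TRANSFORMS` registry, `register`, `build_pipeline`, and the two toy transforms are hypothetical names used for illustration only; MMOCR itself relies on the registry and composition utilities of the underlying OpenMMLab libraries, which do considerably more than this.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a transform registry: maps a type name to a class.
TOY_TRANSFORMS: Dict[str, type] = {}


def register(cls: type) -> type:
    """Register a transform class under its class name."""
    TOY_TRANSFORMS[cls.__name__] = cls
    return cls


@register
class SetShape:
    """Toy loading step: records a fixed image size instead of reading a file."""

    def __init__(self, height: int, width: int) -> None:
        self.shape = (height, width)

    def __call__(self, results: dict) -> dict:
        results['img_shape'] = self.shape
        results['ori_shape'] = self.shape
        return results


@register
class PackKeys:
    """Toy formatting step: keeps only the requested meta keys."""

    def __init__(self, meta_keys: tuple) -> None:
        self.meta_keys = meta_keys

    def __call__(self, results: dict) -> dict:
        return {k: results[k] for k in self.meta_keys if k in results}


def build_pipeline(cfgs: List[dict]) -> Callable[[dict], dict]:
    """Instantiate each config dict and chain the resulting transforms."""
    transforms = []
    for cfg in cfgs:
        cfg = dict(cfg)  # copy so the original config is not mutated
        cls = TOY_TRANSFORMS[cfg.pop('type')]
        transforms.append(cls(**cfg))

    def run(results: dict) -> dict:
        for transform in transforms:
            results = transform(results)
        return results

    return run


toy_pipeline = build_pipeline([
    dict(type='SetShape', height=480, width=640),
    dict(type='PackKeys', meta_keys=('img_path', 'ori_shape', 'img_shape')),
])
packed = toy_pipeline({'img_path': 'demo.jpg', 'unused': 1})
print(packed)
```

Each dict names a transform through `type` and passes the remaining entries as constructor arguments, which is why transforms can be reordered or swapped by editing the list alone.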

More tutorials about data pipeline configuration can be found in the [Config Doc](../user_guides/config.md#data-pipeline-configuration). Next, we will briefly introduce the data transforms supported in MMOCR according to their categories.

For each data transform, MMOCR provides a detailed docstring. In the header of each data transform class, we annotate Required Keys, Modified Keys and Added Keys. The Required Keys are the mandatory fields that the input to the data transform must contain, while the Modified Keys and Added Keys are the fields that the transform may modify or add to the original data. For example, LoadImageFromFile implements image loading: its Required Keys contain the image path img_path, and its Modified Keys include the loaded image img, the current size of the image img_shape, the original size of the image ori_shape, and other image attributes.

@TRANSFORMS.register_module()
class LoadImageFromFile(MMCV_LoadImageFromFile):
    # We provide detailed docstring for each data transform.
    """Load an image from file.

    Required Keys:

    - img_path

    Modified Keys:

    - img
    - img_shape
    - ori_shape
    """
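The same docstring convention can be followed when writing a custom transform. The sketch below is a hypothetical, framework-free example: the class `LowercaseTexts`, its `required_keys` attribute, and the added key `lowercased` are illustrative assumptions and are not part of MMOCR. A real transform would additionally be registered and would typically inherit from the base transform class of the underlying library.

```python
class LowercaseTexts:
    """Convert ground-truth transcriptions to lowercase.

    Required Keys:

    - gt_texts

    Modified Keys:

    - gt_texts

    Added Keys:

    - lowercased
    """

    # Hypothetical helper attribute mirroring the "Required Keys" section.
    required_keys = ('gt_texts',)

    def __call__(self, results: dict) -> dict:
        # Fail early if the input dictionary lacks a required field.
        missing = [k for k in self.required_keys if k not in results]
        if missing:
            raise KeyError(f'Missing required keys: {missing}')
        results['gt_texts'] = [text.lower() for text in results['gt_texts']]
        results['lowercased'] = True  # illustrative added key
        return results


out = LowercaseTexts()({'gt_texts': ['Hello', 'OCR']})
print(out)
```

Keeping the docstring in sync with what the transform actually reads and writes lets users check whether two transforms can be chained without reading their implementations.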

In the data pipeline of MMOCR, the image and label information are saved in a dictionary. By using the unified fields, the data can be freely transferred between different data transforms. Therefore, it is very important to understand the conventional fields used in MMOCR.

For your convenience, the following table lists the conventional keys used in MMOCR data transforms.


Key	Type	Description
img	`np.array(dtype=np.uint8)`	Image array, shape of `(h, w, c)`.
img_shape	`tuple(int, int)`	Current image size `(h, w)`.
ori_shape	`tuple(int, int)`	Original image size `(h, w)`.
scale	`tuple(int, int)`	Stores the target image size `(h, w)` specified by the user in the `Resize` data transform series. Note: This value may not correspond to the actual image size after the transformation.
scale_factor	`tuple(float, float)`	Stores the target image scale factor `(w_scale, h_scale)` specified by the user in the `Resize` data transform series. Note: This value may not correspond to the actual image size after the transformation.
keep_ratio	`bool`	Boolean flag that determines whether to keep the aspect ratio while scaling images.
flip	`bool`	Boolean flag indicating whether the image has been flipped.
flip_direction	`str`	Flipping direction, options are `horizontal`, `vertical`, `diagonal`.
gt_bboxes	`np.array(dtype=np.float32)`	Ground-truth bounding boxes.
gt_polygons	`list[np.array(dtype=np.float32)]`	Ground-truth polygons.
gt_bboxes_labels	`np.array(dtype=np.int64)`	Category label of bounding boxes. By default, MMOCR uses `0` to represent "text" instances.
gt_texts	`list[str]`	Ground-truth text content of the instance.
gt_ignored	`np.array(dtype=np.bool_)`	Boolean flags indicating whether to ignore each instance (used in text detection).
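The note on `scale` is easiest to see with numbers. The sketch below assumes an aspect-preserving rule in which the longer image edge must not exceed the larger target value and the shorter edge must not exceed the smaller one; the helper `rescale_keep_ratio` is a hypothetical name, and the exact rounding used by the real `Resize` transforms may differ.

```python
from typing import Tuple


def rescale_keep_ratio(
        img_shape: Tuple[int, int],
        scale: Tuple[int, int]) -> Tuple[Tuple[int, int], Tuple[float, float]]:
    """Return the actual (h, w) and (w_scale, h_scale) of an aspect-preserving resize."""
    h, w = img_shape
    max_long_edge, max_short_edge = max(scale), min(scale)
    # Pick the largest factor that keeps both edges within the target bounds.
    factor = min(max_long_edge / max(h, w), max_short_edge / min(h, w))
    new_h, new_w = int(h * factor + 0.5), int(w * factor + 0.5)
    return (new_h, new_w), (new_w / w, new_h / h)


results = {'img_shape': (480, 960), 'scale': (640, 640), 'keep_ratio': True}
new_shape, scale_factor = rescale_keep_ratio(results['img_shape'],
                                             results['scale'])
results.update(img_shape=new_shape, scale_factor=scale_factor)
print(results['scale'], results['img_shape'])
```

Here the requested `scale` stays `(640, 640)` while `img_shape` becomes `(320, 640)`, so downstream transforms that need the real size should read `img_shape` rather than `scale`.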

Data Loading

Data loading transforms mainly implement the functions of loading data from different formats and backends. Currently, the following data loading transforms are implemented in MMOCR:


Transforms Name	Required Keys	Modified/Added Keys	Description
LoadImageFromFile	`img_path`	`img` `img_shape` `ori_shape`	Load image from the specified path, supporting different file storage backends (e.g. `disk`, `http`, `petrel`) and decoding backends (e.g. `cv2`, `turbojpeg`, `pillow`, `tifffile`).
LoadOCRAnnotations	`bbox` `bbox_label` `polygon` `ignore` `text`	`gt_bboxes` `gt_bboxes_labels` `gt_polygons` `gt_ignored` `gt_texts`	Parse the annotation required by OCR task.
LoadKIEAnnotations	`bboxes` `bbox_labels` `edge_labels` `texts`	`gt_bboxes` `gt_bboxes_labels` `gt_edge_labels` `gt_texts` `ori_shape`	Parse the annotation required by KIE task.
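As a rough illustration of what annotation parsing amounts to, the sketch below collects the per-instance fields listed in the `LoadOCRAnnotations` row into the corresponding `gt_*` keys. It assumes the raw annotations arrive as a list of per-instance dicts, and it keeps plain Python lists where the real transform produces NumPy arrays; `parse_ocr_instances` is a hypothetical helper, not an MMOCR function.

```python
def parse_ocr_instances(instances: list) -> dict:
    """Collect per-instance annotation fields into the `gt_*` keys."""
    return {
        'gt_bboxes': [inst['bbox'] for inst in instances],
        'gt_bboxes_labels': [inst['bbox_label'] for inst in instances],
        'gt_polygons': [inst['polygon'] for inst in instances],
        'gt_ignored': [inst['ignore'] for inst in instances],
        'gt_texts': [inst['text'] for inst in instances],
    }


# Two made-up instances: one regular word and one ignored region.
instances = [
    dict(bbox=[0, 0, 10, 20], bbox_label=0,
         polygon=[0, 0, 10, 0, 10, 20, 0, 20], ignore=False, text='mmocr'),
    dict(bbox=[5, 5, 9, 9], bbox_label=0,
         polygon=[5, 5, 9, 5, 9, 9, 5, 9], ignore=True, text='###'),
]
parsed = parse_ocr_instances(instances)
print(parsed['gt_texts'], parsed['gt_ignored'])
```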

Data Augmentation

Data augmentation is an indispensable process in text detection and recognition tasks. Currently, MMOCR has implemented dozens of data augmentation modules commonly used in OCR, which are organized into ocr_transforms.py, textdet_transforms.py, and textrecog_transforms.py.

Specifically, ocr_transforms.py implements generic OCR data augmentation modules such as RandomCrop and RandomRotate:


Transforms Name	Required Keys	Modified/Added Keys	Description
RandomCrop	`img` `gt_bboxes` `gt_bboxes_labels` `gt_polygons` `gt_ignored` `gt_texts` (optional)	`img` `img_shape` `gt_bboxes` `gt_bboxes_labels` `gt_polygons` `gt_ignored` `gt_texts` (optional)	Randomly crop the image and make sure the cropped image contains at least one text instance. The optional parameter `min_side_ratio` controls the ratio of the short side of the cropped image to the original image and defaults to `0.4`.
RandomRotate	`img` `img_shape` `gt_bboxes` (optional) `gt_polygons` (optional)	`img` `img_shape` `gt_bboxes` (optional) `gt_polygons` (optional) `rotated_angle`	Randomly rotate the image and optionally fill the blank areas of the rotated image.

textdet_transforms.py implements text detection related data augmentation modules:


Transforms Name	Required Keys	Modified/Added Keys	Description
RandomFlip	`img` `gt_bboxes` `gt_polygons`	`img` `gt_bboxes` `gt_polygons` `flip` `flip_direction`	Randomly flip the image. Supports `horizontal`, `vertical` and `diagonal` modes and defaults to `horizontal`.
FixInvalidPolygon	`gt_polygons` `gt_ignored`	`gt_polygons` `gt_ignored`	Automatically fix the invalid polygons included in the annotations.
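Geometric augmentations must update the annotations together with the image. The sketch below shows the coordinate arithmetic for the three flip modes on a single polygon, assuming polygons are stored as flat `[x1, y1, x2, y2, ...]` sequences and that a flipped coordinate is `w - x` or `h - y`. The function `flip_polygon` is a hypothetical helper, and the real `RandomFlip` may handle details such as array types differently.

```python
from typing import List, Sequence, Tuple


def flip_polygon(polygon: Sequence[float],
                 img_shape: Tuple[int, int],
                 direction: str = 'horizontal') -> List[float]:
    """Flip a flat [x1, y1, x2, y2, ...] polygon inside an (h, w) image."""
    if direction not in ('horizontal', 'vertical', 'diagonal'):
        raise ValueError(f'Unsupported flip direction: {direction}')
    h, w = img_shape
    flipped = list(polygon)
    if direction in ('horizontal', 'diagonal'):
        # x coordinates sit at even indices.
        flipped[0::2] = [w - x for x in flipped[0::2]]
    if direction in ('vertical', 'diagonal'):
        # y coordinates sit at odd indices.
        flipped[1::2] = [h - y for y in flipped[1::2]]
    return flipped


box = [0, 0, 10, 0, 10, 20, 0, 20]
print(flip_polygon(box, (100, 200)))
print(flip_polygon(box, (100, 200), 'diagonal'))
```

A diagonal flip is simply a horizontal flip followed by a vertical one, which is why the two branches can be applied independently.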

textrecog_transforms.py implements text recognition related data augmentation modules:


Transforms Name	Required Keys	Modified/Added Keys	Description
RescaleToHeight	`img`	`img` `img_shape` `scale` `scale_factor` `keep_ratio`	Scales the image to the specified height while keeping the aspect ratio. When `min_width` and `max_width` are specified, the aspect ratio may be changed.
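The interaction between the target height and the width limits can be sketched as follows. This is an assumption-laden illustration rather than the actual implementation: the helper `rescale_to_height` is hypothetical, and it only models a fixed `height` with optional `min_width` and `max_width` clamping.

```python
import math
from typing import Optional, Tuple


def rescale_to_height(img_shape: Tuple[int, int],
                      height: int,
                      min_width: Optional[int] = None,
                      max_width: Optional[int] = None) -> Tuple[int, int]:
    """Return the (h, w) obtained by scaling to `height`, then clamping the width."""
    ori_h, ori_w = img_shape
    # Width that preserves the aspect ratio at the target height.
    new_w = math.ceil(height / ori_h * ori_w)
    if min_width is not None:
        new_w = max(min_width, new_w)
    if max_width is not None:
        new_w = min(max_width, new_w)
    return height, new_w


print(rescale_to_height((64, 256), 32))                 # aspect ratio kept
print(rescale_to_height((64, 256), 32, max_width=100))  # width clamped
```

Whenever the clamped width differs from the aspect-preserving width, the image is stretched or squeezed horizontally, which is the case the table describes as a changed aspect ratio.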

The above tables only briefly introduce some selected data augmentation methods. For more information, please refer to the [API documentation](../api.rst) or the code docstrings.

Data Formatting

Data formatting transforms are responsible for packaging images, ground truth labels, and other information into a dictionary. Different tasks usually rely on different formatting transforms. For example:


Transforms Name	Required Keys	Modified/Added Keys	Description
PackTextDetInputs	-	-	Pack the inputs required by text detection.
PackTextRecogInputs	-	-	Pack the inputs required by text recognition.
PackKIEInputs	-	-	Pack the inputs required by KIE.
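By analogy with the detection pipeline shown earlier, a text recognition pipeline ends with `PackTextRecogInputs` instead of `PackTextDetInputs`. The fragment below is an illustrative sketch: the variable name `recog_pipeline` and the specific values (`scale=(100, 32)`, the chosen `meta_keys`) are assumptions for demonstration and should be checked against the configs shipped with the installed MMOCR version.

```python
recog_pipeline = [
    # Loading images
    dict(type='LoadImageFromFile'),
    # Loading annotations: recognition only needs the transcription
    dict(type='LoadOCRAnnotations', with_text=True),
    # Data augmentation / preprocessing (illustrative values)
    dict(type='Resize', scale=(100, 32), keep_ratio=False),
    # Data formatting for text recognition
    dict(
        type='PackTextRecogInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape')),
]
```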

Cross Project Data Adapters

The cross-project data adapters bridge the data formats between MMOCR and other OpenMMLab libraries such as MMDetection, making it possible to call models implemented in other OpenMMLab projects. Currently, MMOCR has implemented MMDet2MMOCR and MMOCR2MMDet, allowing data to be converted between MMDetection and MMOCR formats; with these adapters, users can easily train any detector supported by MMDetection in MMOCR. For example, we provide a tutorial to show how to train Mask R-CNN as a text detector in MMOCR.


Transforms Name	Required Keys	Modified/Added Keys	Description
MMDet2MMOCR	`gt_masks` `gt_ignore_flags`	`gt_polygons` `gt_ignored`	Convert the fields used in MMDet to MMOCR.
MMOCR2MMDet	`img_shape` `gt_polygons` `gt_ignored`	`gt_masks` `gt_ignore_flags`	Convert the fields used in MMOCR to MMDet.
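In a pipeline, the adapter is typically placed right before the formatting transform of the other project. The fragment below is a hedged sketch of the tail of such a training pipeline: the variable name `mmdet_style_tail` is hypothetical, and the `poly2mask` argument and the `mmdet.PackDetInputs` transform are written from memory of MMOCR's Mask R-CNN configs, so they should be verified against the installed versions of MMOCR and MMDetection.

```python
mmdet_style_tail = [
    # Convert MMOCR fields (gt_polygons, gt_ignored) to MMDetection fields
    # (gt_masks, gt_ignore_flags). `poly2mask` is assumed, not confirmed here.
    dict(type='MMOCR2MMDet', poly2mask=True),
    # Pack the data with MMDetection's formatting transform (assumed name).
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape')),
]
```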

Wrappers

To facilitate the use of popular third-party CV libraries in MMOCR, we provide wrappers in wrappers.py to unify the data format between MMOCR and other third-party libraries. Users can directly configure the data transforms provided by these libraries in the configuration file of MMOCR. The supported wrappers are as follows:


Transforms Name	Required Keys	Modified/Added Keys	Description
ImgAugWrapper	`img` `gt_polygons` (optional for text recognition) `gt_bboxes` (optional for text recognition) `gt_bboxes_labels` (optional for text recognition) `gt_ignored` (optional for text recognition) `gt_texts` (optional)	`img` `gt_polygons` (optional for text recognition) `gt_bboxes` (optional for text recognition) `gt_bboxes_labels` (optional for text recognition) `gt_ignored` (optional for text recognition) `img_shape` (optional) `gt_texts` (optional)	ImgAug wrapper, which bridges the data format and configuration between ImgAug and MMOCR, allowing users to configure the data augmentation methods supported by ImgAug in MMOCR.
TorchVisionWrapper	`img`	`img` `img_shape`	TorchVision wrapper, which bridges the data format and configuration between TorchVision and MMOCR, allowing users to configure the data transforms supported by `torchvision.transforms` in MMOCR.

`ImgAugWrapper` Example

For example, in the original ImgAug, we can define a Sequential type data augmentation pipeline as follows to perform random flipping, random rotation and random scaling on the image:

import imgaug.augmenters as iaa

# The child augmenters must be passed as a single list.
aug = iaa.Sequential([
  iaa.Fliplr(0.5),                # horizontally flip 50% of all images
  iaa.Affine(rotate=(-10, 10)),   # rotate by -10 to +10 degrees
  iaa.Resize((0.5, 3.0))          # scale images to 50-300% of their size
])

In MMOCR, we can directly configure the above data augmentation pipeline in train_pipeline as follows:

dict(
  type='ImgAugWrapper',
  args=[
    ['Fliplr', 0.5],
    dict(cls='Affine', rotate=[-10, 10]),
    ['Resize', [0.5, 3.0]],
  ]
)

Specifically, the args parameter accepts a list, and each element in the list can be a list or a dictionary. If it is a list, the first element of the list is the class name in imgaug.augmenters, and the following elements are the initialization parameters of the class; if it is a dictionary, the cls key corresponds to the class name in imgaug.augmenters, and the other key-value pairs correspond to the initialization parameters of the class.
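The two accepted forms can be normalized into a class name plus positional and keyword arguments. The sketch below only demonstrates that normalization and does not import ImgAug; `parse_aug_spec` is a hypothetical helper, and the real wrapper may perform additional conversions before instantiating the augmenters.

```python
from typing import Any, Dict, List, Tuple, Union

AugSpec = Union[List[Any], Dict[str, Any]]


def parse_aug_spec(spec: AugSpec) -> Tuple[str, Tuple[Any, ...], Dict[str, Any]]:
    """Split one `args` element into (class name, positional args, keyword args)."""
    if isinstance(spec, dict):
        kwargs = dict(spec)       # copy so the config is not mutated
        name = kwargs.pop('cls')  # `cls` holds the augmenter class name
        return name, (), kwargs
    if isinstance(spec, (list, tuple)) and len(spec) > 0:
        # First element is the class name, the rest are init parameters.
        return spec[0], tuple(spec[1:]), {}
    raise TypeError(f'Unsupported augmenter spec: {spec!r}')


args = [
    ['Fliplr', 0.5],
    dict(cls='Affine', rotate=[-10, 10]),
    ['Resize', [0.5, 3.0]],
]
for spec in args:
    print(parse_aug_spec(spec))
```

The dictionary form is the more convenient choice when an augmenter takes keyword-only or many optional parameters, as with `rotate` in `Affine`.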

`TorchVisionWrapper` Example

For example, in the original TorchVision, we can define a Compose type data transformation pipeline as follows to perform color jittering on the image:

import torchvision.transforms as transforms

aug = transforms.Compose([
  transforms.ColorJitter(
    brightness=32.0 / 255,  # brightness jittering range
    saturation=0.5)         # saturation jittering range
])

In MMOCR, we can directly configure the above data transformation pipeline in train_pipeline as follows:

dict(
  type='TorchVisionWrapper',
  op='ColorJitter',
  brightness=32.0 / 255,
  saturation=0.5
)

Specifically, the op parameter is the class name in torchvision.transforms, and the following parameters correspond to the initialization parameters of the class.
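The lookup described above can be sketched without installing TorchVision by using a stand-in namespace. Everything in the block is hypothetical scaffolding: `toy_transforms` plays the role of `torchvision.transforms`, the `ColorJitter` dataclass is a stub rather than the real transform, and `build_op` only illustrates how `op` selects a class while the remaining entries become its initialization parameters.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class ColorJitter:
    """Stub standing in for a TorchVision transform class."""
    brightness: float = 0.0
    saturation: float = 0.0


# Stand-in for the `torchvision.transforms` module.
toy_transforms = SimpleNamespace(ColorJitter=ColorJitter)


def build_op(cfg: dict, namespace: object) -> object:
    """Look up `cfg['op']` in `namespace` and instantiate it with the rest."""
    kwargs = dict(cfg)       # copy so the config is not mutated
    kwargs.pop('type', None)  # the wrapper's own type is not an init parameter
    op_cls = getattr(namespace, kwargs.pop('op'))
    return op_cls(**kwargs)


jitter = build_op(
    dict(
        type='TorchVisionWrapper',
        op='ColorJitter',
        brightness=32.0 / 255,
        saturation=0.5),
    toy_transforms)
print(jitter)
```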