# Data Structures and Elements
MMOCR uses {external+mmengine:doc}`MMEngine: Abstract Data Element <advanced_tutorials/data_element>` to encapsulate the data required for each task into `data_sample`. The base class implements basic add/delete/update/check functions, supports data migration between different devices, and provides dictionary-like and tensor-like operations, which also unifies the interfaces of different algorithms.

Thanks to the unified data structures, the data flow between modules in the algorithm libraries, such as the `visualizer`, `evaluator`, and `dataset`, is greatly simplified. In MMOCR, we have the following conventions for different data types.
- **xxxData**: Single granularity data annotation or model output. Currently MMEngine has three built-in granularities of {external+mmengine:doc}`data elements <advanced_tutorials/data_element>`, including instance-level data (`InstanceData`), pixel-level data (`PixelData`), and image-level label data (`LabelData`). Among the tasks currently supported by MMOCR, the text detection and key information extraction tasks use `InstanceData` to encapsulate the bounding boxes and the corresponding box labels, while the text recognition task uses `LabelData` to encapsulate the text content.
- **xxxDataSample**: Inherited from {external+mmengine:doc}`MMEngine: Base Data Element <advanced_tutorials/data_element>`, used to hold all annotation and prediction information required by a single task. For example, `TextDetDataSample` for text detection, `TextRecogDataSample` for text recognition, and `KIEDataSample` for the key information extraction task.
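To make the add/delete/update/check and dictionary-like behaviour described above concrete, here is a much-simplified, hypothetical stand-in for a data element; the real implementation lives in `mmengine.structures` and does far more (device migration, tensor-like ops, metainfo handling):

```python
# A toy stand-in illustrating the add/delete/update/check and
# dictionary-like operations of an MMEngine-style data element.
# This is NOT the real BaseDataElement implementation.
class MiniDataElement:
    def __init__(self, **fields):
        self._fields = dict(fields)       # add fields on construction

    def set_field(self, name, value):     # add / update
        self._fields[name] = value

    def pop(self, name):                  # delete
        return self._fields.pop(name)

    def __contains__(self, name):         # check
        return name in self._fields

    def __getattr__(self, name):          # attribute-style access
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)

    def keys(self):                       # dictionary-like view
        return self._fields.keys()


elem = MiniDataElement(bboxes=[[0, 0, 10, 10]])
elem.set_field('labels', [0])
assert 'labels' in elem
assert list(elem.keys()) == ['bboxes', 'labels']
```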
In the following, we will introduce the practical application of data elements `xxxData` and data samples `xxxDataSample` in MMOCR, respectively.
## Data Elements - xxxData
`InstanceData` and `LabelData` are `BaseDataElement` subclasses defined in `MMEngine` to encapsulate different granularities of annotation data or model output. In MMOCR, we use `InstanceData` and `LabelData` to encapsulate the data types actually used in OCR-related tasks.
### InstanceData
In the text detection task, the detector concentrates on instance-level text samples, so we use `InstanceData` to encapsulate the data needed for this task. Typically, its training annotations and prediction outputs contain rectangular or polygonal bounding boxes, as well as bounding box labels. Since the text detection task has only one positive sample class, "text", MMOCR uses `0` to number this class by default. The following code example shows how to use `InstanceData` to encapsulate the data used in the text detection task.
```python
import torch
from mmengine.structures import InstanceData

# Define gt_instance for encapsulating the ground truth data
gt_instance = InstanceData()
gt_instance.bboxes = torch.Tensor([[0, 0, 10, 10], [10, 10, 20, 20]])
gt_instance.polygons = torch.Tensor([[[0, 0], [10, 0], [10, 10], [0, 10]],
                                     [[10, 10], [20, 10], [20, 20], [10, 20]]])
gt_instance.labels = torch.Tensor([0, 0])

# Define pred_instances for encapsulating the prediction data
pred_instances = InstanceData()
pred_polygons, scores = model(input)  # assume a detection model returning polygons and scores
pred_instances.polygons = pred_polygons
pred_instances.scores = scores
```
The conventions for the fields in `InstanceData` in MMOCR are shown in the table below. It is important to note that the length of each field in `InstanceData` must be equal to the number of instances `N` in the sample.
| Field | Type | Description |
| ----------- | ----------------------------------- | ----------- |
| bboxes | `torch.FloatTensor` | Bounding boxes `[x1, y1, x2, y2]` with the shape `(N, 4)`. |
| labels | `torch.LongTensor` | Instance labels with the shape `(N, )`. By default, MMOCR uses `0` to represent the "text" class. |
| polygons | `list[np.array(dtype=np.float32)]` | Polygonal bounding boxes with the shape `(N, )`. |
| scores | `torch.Tensor` | Confidence scores of the bounding box predictions, with the shape `(N, )`. |
| ignored | `torch.BoolTensor` | Whether to ignore the current sample, with the shape `(N, )`. |
| texts | `list[str]` | The text content of each instance with the shape `(N, )`, used for end-to-end text spotting or the KIE task. |
| text_scores | `torch.FloatTensor` | Confidence scores of the text content predictions, with the shape `(N, )`, used for the end-to-end text spotting task. |
| edge_labels | `torch.IntTensor` | The node adjacency matrix with the shape `(N, N)`. In KIE, the optional values for the state between nodes are `-1` (ignored, not involved in loss calculation), `0` (disconnected) and `1` (connected). |
| edge_scores | `torch.FloatTensor` | The prediction confidence of each edge in the KIE task, with the shape `(N, N)`. |
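The length-consistency convention above (every per-instance field has length `N`) can be checked with a small helper like the following; this is a hypothetical utility written for illustration, not part of MMOCR:

```python
def check_instance_lengths(fields):
    """Check that every per-instance field shares the same length N.

    `fields` maps field names to sequences (tensors, arrays or lists).
    Note that edge_labels/edge_scores are (N, N) matrices, so their
    first dimension is still N.
    """
    lengths = {name: len(value) for name, value in fields.items()}
    n_values = set(lengths.values())
    if len(n_values) > 1:
        raise ValueError(f'Inconsistent instance counts: {lengths}')
    return n_values.pop() if n_values else 0


n = check_instance_lengths({
    'bboxes': [[0, 0, 10, 10], [10, 10, 20, 20]],
    'labels': [0, 0],
    'texts': ['hello', 'world'],
})
assert n == 2
```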
### LabelData
For text recognition tasks, both labeled content and predicted content are wrapped using `LabelData`.
```python
import torch
from mmengine.structures import LabelData

# Define gt_text for encapsulating the ground truth data
gt_text = LabelData()
gt_text.item = 'MMOCR'

# Define pred_text for encapsulating the prediction data
pred_text = LabelData()
index, score = model(input)  # assume a recognition model returning indexes and scores
text = dictionary.idx2str(index)
pred_text.score = score
pred_text.item = text
```
The conventions for the `LabelData` fields in MMOCR are shown in the following table.
| Field | Type | Description |
| -------------- | ------------------ | ----------- |
| item | `str` | Text content. |
| score | `list[float]` | Confidence score of the predicted text. |
| indexes | `torch.LongTensor` | A sequence of text characters encoded by the dictionary, containing all special characters except `<UNK>`. |
| padded_indexes | `torch.LongTensor` | If the length of `indexes` is less than the maximum sequence length and `pad_idx` exists, this field holds the encoded text sequence padded to the maximum sequence length `max_seq_len`. |
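The relation between `indexes` and `padded_indexes` can be sketched with a toy character dictionary; the names `char2idx`, `pad_idx`, `max_seq_len` and `encode` here are illustrative assumptions, not MMOCR's actual `Dictionary` API:

```python
# Toy encoding following the indexes / padded_indexes convention:
# characters map to dictionary indexes, then the sequence is padded
# with pad_idx up to max_seq_len.
chars = list('abcdefghijklmnopqrstuvwxyz')
char2idx = {c: i for i, c in enumerate(chars)}
pad_idx = len(chars)   # assume the index after the alphabet is the padding index
max_seq_len = 8


def encode(text):
    indexes = [char2idx[c] for c in text]
    padded_indexes = indexes + [pad_idx] * (max_seq_len - len(indexes))
    return indexes, padded_indexes


indexes, padded_indexes = encode('mmocr')
assert indexes == [12, 12, 14, 2, 17]
assert len(padded_indexes) == max_seq_len
```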
## DataSample xxxDataSample
By defining a uniform data structure, we can easily encapsulate the annotation data and prediction results in a unified way, making data transfer between different modules of the code base easier. In MMOCR, we have designed three data structures based on the data needed in three tasks: `TextDetDataSample`, `TextRecogDataSample`, and `KIEDataSample`. These data structures all inherit from {external+mmengine:doc}`MMEngine: Base Data Element <advanced_tutorials/data_element>` and are used to hold all annotation and prediction information required by each task.
### Text Detection - TextDetDataSample
`TextDetDataSample` is used to encapsulate the data needed for the text detection task. It contains two main fields, `gt_instances` and `pred_instances`, which are used to store the annotation information and prediction results respectively.
| Field | Type | Description |
| -------------- | -------------- | ---------------------- |
| gt_instances | `InstanceData` | Annotation information. |
| pred_instances | `InstanceData` | Prediction results. |
The fields of `InstanceData` that will be used are:
| Field | Type | Description |
| -------- | ----------------------------------- | ----------- |
| bboxes | `torch.FloatTensor` | Bounding boxes `[x1, y1, x2, y2]` with the shape `(N, 4)`. |
| labels | `torch.LongTensor` | Instance labels with the shape `(N, )`. By default, MMOCR uses `0` to represent the "text" class. |
| polygons | `list[np.array(dtype=np.float32)]` | Polygonal bounding boxes with the shape `(N, )`. |
| scores | `torch.Tensor` | Confidence scores of the bounding box predictions, with the shape `(N, )`. |
| ignored | `torch.BoolTensor` | Boolean flags with the shape `(N, )`, indicating whether to ignore the current sample. |
Since text detection models usually output only one of `bboxes`/`polygons`, we only need to make sure that one of these two is assigned a value.
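When only `bboxes` are produced, a polygon representation can be derived from each axis-aligned box; the conversion below is a simple illustrative sketch (MMOCR ships its own box/polygon utilities):

```python
def bbox2poly(bbox):
    """Convert an axis-aligned [x1, y1, x2, y2] box to a clockwise
    4-point polygon [x1, y1, x2, y1, x2, y2, x1, y2]."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2, y1, x2, y2, x1, y2]


assert bbox2poly([0, 0, 10, 10]) == [0, 0, 10, 0, 10, 10, 0, 10]
```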
The following sample code demonstrates the use of `TextDetDataSample`.
```python
import torch
from mmengine.structures import InstanceData
from mmocr.structures import TextDetDataSample

data_sample = TextDetDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_instances = InstanceData(metainfo=img_meta)
gt_instances.bboxes = torch.rand((5, 4))
gt_instances.labels = torch.zeros((5,), dtype=torch.long)
data_sample.gt_instances = gt_instances

# Define the prediction data
pred_instances = InstanceData()
pred_instances.bboxes = torch.rand((5, 4))
pred_instances.labels = torch.zeros((5,), dtype=torch.long)
data_sample.pred_instances = pred_instances
```
### Text Recognition - TextRecogDataSample
`TextRecogDataSample` is used to encapsulate the data for the text recognition task. It has two fields, `gt_text` and `pred_text`, which are used to store annotation information and prediction results, respectively.

The following sample code demonstrates the use of `TextRecogDataSample`.
```python
import torch
from mmengine.structures import LabelData
from mmocr.structures import TextRecogDataSample

data_sample = TextRecogDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_text = LabelData(metainfo=img_meta)
gt_text.item = 'mmocr'
data_sample.gt_text = gt_text

# Define the prediction data
pred_text = LabelData(metainfo=img_meta)
pred_text.item = 'mmocr'
data_sample.pred_text = pred_text
```
The fields of `LabelData` that will be used are:
| Field | Type | Description |
| -------------- | ------------------- | ----------- |
| item | `list[str]` | The text corresponding to the instance, of length `(N, )`, used for end-to-end OCR tasks and KIE. |
| score | `torch.FloatTensor` | Confidence of the text prediction, of length `(N, )`, used for the end-to-end OCR task. |
| indexes | `torch.LongTensor` | A sequence of text characters encoded by the dictionary, containing all special characters except `<UNK>`. |
| padded_indexes | `torch.LongTensor` | If the length of `indexes` is less than the maximum sequence length and `pad_idx` exists, this field holds the encoded text sequence padded to the maximum sequence length `max_seq_len`. |
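Because `gt_text.item` and `pred_text.item` are plain strings, a simple recognition accuracy check reduces to string comparison. The helper below is a toy metric for illustration only; MMOCR's actual recognition evaluators offer more modes (e.g. case-sensitive and symbol-filtered matching):

```python
def word_accuracy(gt_items, pred_items, ignore_case=True):
    """Fraction of samples whose predicted text exactly matches the
    ground truth, optionally ignoring case."""
    assert len(gt_items) == len(pred_items) and gt_items
    correct = 0
    for gt, pred in zip(gt_items, pred_items):
        if ignore_case:
            gt, pred = gt.lower(), pred.lower()
        correct += gt == pred
    return correct / len(gt_items)


acc = word_accuracy(['mmocr', 'MMOCR'], ['mmocr', 'mmocr'])
assert acc == 1.0
```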
### Key Information Extraction - KIEDataSample
`KIEDataSample` is used to encapsulate the data needed for the KIE task. It also contains two fields, `gt_instances` and `pred_instances`, which are used to store annotation information and prediction results respectively.
| Field | Type | Description |
| -------------- | -------------- | ---------------------- |
| gt_instances | `InstanceData` | Annotation information. |
| pred_instances | `InstanceData` | Prediction results. |
The `InstanceData` fields that will be used by this task are shown in the following table.
| Field | Type | Description |
| ----------- | ------------------- | ----------- |
| bboxes | `torch.FloatTensor` | Bounding boxes `[x1, y1, x2, y2]` with the shape `(N, 4)`. |
| labels | `torch.LongTensor` | Instance labels with the shape `(N, )`. |
| texts | `list[str]` | The text content of each instance with the shape `(N, )`, used for end-to-end text spotting or the KIE task. |
| edge_labels | `torch.IntTensor` | The node adjacency matrix with the shape `(N, N)`. In the KIE task, the optional values for the state between nodes are `-1` (ignored, not involved in loss calculation), `0` (disconnected) and `1` (connected). |
| edge_scores | `torch.FloatTensor` | The prediction confidence of each edge in the KIE task, with the shape `(N, N)`. |
| scores | `torch.FloatTensor` | The confidence scores for node label predictions, with the shape `(N, )`. |
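Reading the `edge_labels` convention off the table: entry `(i, j)` describes the state between nodes `i` and `j`. A small illustrative decoder (not an MMOCR API) that collects the connected pairs might look like:

```python
def connected_pairs(edge_labels):
    """Return node pairs (i, j), i < j, whose edge state is 1 (connected).
    Entries equal to -1 (ignored) or 0 (disconnected) are skipped."""
    n = len(edge_labels)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if edge_labels[i][j] == 1]


edges = [
    [0, 1, -1],
    [1, 0, 0],
    [-1, 0, 0],
]
assert connected_pairs(edges) == [(0, 1)]
```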
Since there is no unified standard for model implementations of the KIE task, the current design considers only the [SDMGR](../../../configs/kie/sdmgr/README.md) model's usage scenarios. Therefore, the design is subject to change as we support more KIE models.
The following sample code shows the use of `KIEDataSample`.
```python
import torch
from mmengine.structures import InstanceData
from mmocr.structures import KIEDataSample

data_sample = KIEDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_instances = InstanceData(metainfo=img_meta)
gt_instances.bboxes = torch.rand((5, 4))
gt_instances.labels = torch.zeros((5,), dtype=torch.long)
gt_instances.texts = ['text1', 'text2', 'text3', 'text4', 'text5']
gt_instances.edge_labels = torch.randint(-1, 2, (5, 5))
data_sample.gt_instances = gt_instances

# Define the prediction data
pred_instances = InstanceData()
pred_instances.bboxes = torch.rand((5, 4))
pred_instances.labels = torch.rand((5,))
pred_instances.edge_labels = torch.randint(-1, 2, (5, 5))
pred_instances.edge_scores = torch.rand((5, 5))
data_sample.pred_instances = pred_instances
```