Finetuned RT-DETR model to extract tables from IRS 1040 2023 forms

For parsing IRS Form 1040 documents, I previously uploaded a trained Donut model, which is based on vision transformers. The Donut model can parse a 1040 form in a single shot and return the parsed form values in JSON format. Although vision transformers are cutting-edge AI models, they still have limitations on OCR-related tasks, where they sometimes hallucinate. They also do not provide a confidence level for the extracted field data, which makes it extremely challenging to decide downstream whether to accept a particular field value or drop it.

Especially when dealing with financial data, like Form 1040, accuracy and confidence values are of utmost importance.
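
To make the downstream decision concrete, here is a minimal sketch of confidence-based gating. The field names, the example values, and the 0.90 threshold are illustrative assumptions, not outputs or settings of the models described here.

```python
from typing import Dict, Tuple

# Hypothetical parsed output: field name -> (text value, confidence in [0, 1]).
ParsedFields = Dict[str, Tuple[str, float]]


def split_by_confidence(fields: ParsedFields, threshold: float = 0.90):
    """Separate fields that can be accepted from fields that need review."""
    accepted, needs_review = {}, {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= threshold else needs_review
        target[name] = value
    return accepted, needs_review


# Illustrative values only; the second value mimics an OCR misread.
example_fields = {
    "wages": ("85,000", 0.98),
    "taxable_interest": ("1,2O0", 0.61),
}
accepted, needs_review = split_by_confidence(example_fields)
```

A suitable threshold depends on the cost of a wrong value versus the cost of a manual review, so it would need to be tuned on real data.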

This article provides a working example of using multiple AI models to perform OCR on Form 1040 and extract text values in JSON format, with a confidence level for each field.

 -----------------------
| Classification Model  |    (Model is used to classify IRS Form 1040 by page)
 ----------------------- 
         |
         |
         |
 -----------------------
|       RT-DETR         | 
| Object Detection Model|    (Model trained to extract header and tables from Form 1040)
 -----------------------
         |
         |
         |
 -----------------------
|   Table Transformer   |  (Table transformer model along with OCR models 
|   Text OCR            |  i.e. PaddleOCR or Tesseract to parse field data)
 -----------------------
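
The three stages in the diagram above could be wired together roughly as follows. This is only a sketch: `run_pipeline`, `Region`, the three callables, and the page label `1040_pg1` are hypothetical names introduced for illustration, and the stubs in the usage example stand in for the real classification, RT-DETR, and table/OCR models.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Region:
    class_name: str    # e.g. "1040_pg1_tax_tbl"
    confidence: float  # detector confidence for this region
    image: Any         # cropped region image (placeholder type in this sketch)


def run_pipeline(
    page_image: Any,
    classify_page: Callable[[Any], str],
    detect_regions: Callable[[Any], List[Region]],
    parse_region: Callable[[Region], Dict],
) -> Dict:
    """Run classification, region detection, and per-region parsing in order."""
    page_label = classify_page(page_image)   # stage 1: which page is this?
    regions = detect_regions(page_image)     # stage 2: header and table regions
    parsed = {r.class_name: parse_region(r) for r in regions}  # stage 3: OCR
    return {"page": page_label, "regions": parsed}


# Usage with stub stages; real models would replace the lambdas.
result = run_pipeline(
    "fake-image",
    classify_page=lambda img: "1040_pg1",
    detect_regions=lambda img: [Region("1040_pg1_header", 0.97, img)],
    parse_region=lambda region: {"detector_confidence": region.confidence},
)
```

Passing the stages in as callables keeps each model replaceable, for example swapping PaddleOCR for Tesseract without touching the orchestration.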

Classes for form 1040

The RT-DETR model is finetuned with 6 classes for the 2023 Form 1040.

Page 1 classes

1040_pg1_header - represents the header of page 1

1040_pg1_tax_tbl - represents a table with financial values

1040_pg1_sch_b - represents a table with Schedule B financial values

Page 2 classes

1040_pg2_tax_tbl

1040_pg2_pay_tbl

1040_pg2_signature_frm
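
The class list above can be captured in a small lookup table, which makes it easy to check whether a page produced every expected region. The helper `missing_classes` is a hypothetical name, and the idea that each page should yield all three of its classes is an assumption made for this sketch.

```python
# Class names copied from the lists above, grouped by page number.
CLASSES_BY_PAGE = {
    1: ["1040_pg1_header", "1040_pg1_tax_tbl", "1040_pg1_sch_b"],
    2: ["1040_pg2_tax_tbl", "1040_pg2_pay_tbl", "1040_pg2_signature_frm"],
}


def missing_classes(page_number, detected_class_names):
    """Return the expected classes for a page that were not detected."""
    found = set(detected_class_names)
    return [name for name in CLASSES_BY_PAGE[page_number] if name not in found]


# Example: page 1 where the tax table was not detected.
missing = missing_classes(1, ["1040_pg1_header", "1040_pg1_sch_b"])
```

A non-empty result could be used to flag a page for manual review before any OCR is attempted.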

Fake Synthetic Data for IRS 1040 2023 Form Page 1

Cropped - Class: 1040_pg1_header

Cropped - Class: 1040_pg1_tax_tbl

Cropped - Class: 1040_pg1_sch_b

Fake Synthetic Data for IRS 1040 2023 Form Page 2

Cropped - Class: 1040_pg2_tax_tbl

Cropped - Class: 1040_pg2_pay_tbl

Cropped - Class: 1040_pg2_signature_frm

from ultralytics import RTDETR
import cv2
import supervision as sv 

# --------------------------
model_file = 'replace with path to model file /1040_2023_v1.pt'

# Load a trained model from local path
model = RTDETR(model_file)

# Display model information (optional)
model.info()

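image_path = 'path to source image'

# Read the source image (cv2.imread returns None if the file cannot be read)
img = cv2.imread(image_path)
if img is None:
    raise FileNotFoundError('Could not read image: ' + image_path)

# Run inference; imgsz is set to 1024 because the model was finetuned at this size
results = model.predict(img, imgsz=1024)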

# use the supervision library for parsing results and generating redline boxes
detections = sv.Detections.from_ultralytics(results[0])

# Create the bounding box and label annotators
# (if your installed supervision version no longer provides BoundingBoxAnnotator,
# sv.BoxAnnotator is its replacement)
bounding_box_annotator = sv.BoundingBoxAnnotator()
label_annotator = sv.LabelAnnotator()

# Build one "class_name confidence" label per detection
labels = [
    f"{class_name} {confidence:.2f}"
    for class_name, confidence
    in zip(detections['class_name'], detections.confidence)
]

# Annotate a copy of the image with labeled bounding boxes
annotated_image = bounding_box_annotator.annotate(
    scene=img.copy(),
    detections=detections
)
annotated_image = label_annotator.annotate(annotated_image, detections=detections, labels=labels)

# Dummy counter for generated image names
count = 0
# Write the annotated image
cv2.imwrite('redlined_' + str(count) + '.png', annotated_image)

# Crop each detected region and save it as its own image (bboxes_1.png, bboxes_2.png, ...)
for count, xyxy in enumerate(detections.xyxy, start=1):
    cropped_image = sv.crop_image(image=img, xyxy=xyxy)
    cv2.imwrite('bboxes_' + str(count) + '.png', cropped_image)
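
Since each class is expected to appear at most once per page, a caller may want to keep only the highest-confidence box per class before cropping. The helper below is a hypothetical sketch, not part of the supervision or ultralytics APIs; it works on plain sequences, so it can be tried without loading the model.

```python
def best_detection_per_class(class_names, confidences, boxes):
    """Keep the highest-confidence (confidence, box) pair for each class name."""
    best = {}
    for name, conf, box in zip(class_names, confidences, boxes):
        if name not in best or conf > best[name][0]:
            best[name] = (float(conf), tuple(box))
    return best


# Illustrative values: two competing boxes for the same class.
best = best_detection_per_class(
    ["1040_pg1_tax_tbl", "1040_pg1_tax_tbl", "1040_pg1_header"],
    [0.55, 0.93, 0.88],
    [(0, 0, 10, 10), (5, 5, 50, 50), (0, 0, 100, 20)],
)
```

With the script above, the same function could be called as `best_detection_per_class(detections['class_name'], detections.confidence, detections.xyxy)`.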

hsarfraz/rt-detr-irs-form-1040-2023-table-detection
