Skip to main content

Dataset

In this chapter, we briefly introduce the datasets used for training and testing our models. These datasets include a variety of document images.

SmartDoc 2015

SmartDoc 2015

The SmartDoc 2015 - Challenge 1 dataset was originally created for the SmartDoc 2015 competition, focusing on evaluating document image capture methods using smartphones. Challenge 1 is particularly about detecting and segmenting document regions from video frames captured from smartphone preview streams.

MIDV-500/2019

MIDV-500

MIDV-500 consists of 500 video clips of 50 different types of identity documents, including 17 IDs, 14 passports, 13 driver's licenses, and 6 other identity documents from different countries. It features real-world scenarios to extensively study various document analysis issues. MIDV-2019 includes images with distortions and low light.

MIDV-2020

MIDV-2020

MIDV-2020 features 10 types of documents, including 1000 annotated video clips, 1000 scanned images, and 1000 photos of 1000 unique simulated identity documents, each with unique textual field values and uniquely generated artificial faces.

CORD v0

CORD

This dataset comprises thousands of Indonesian receipts, which include text annotations for OCR and multi-layer semantic tags for parsing. The provided dataset can be used to address various OCR and parsing tasks.

Synthetic Dataset

Docpool

Considering the limitations of available datasets, we use dynamic image synthesis techniques. In essence, we first collected a dataset including images of various documents and certificates found online. We then used an Indoor dataset as a background, onto which we superimposed the documents data. Moreover, for the MIDV-500/MIDV-2019/CORD datasets, which all have corresponding polygon groundtruth, we also synthesized various documents onto these datasets in the spirit of maximizing resource use and increasing dataset diversity.