He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding
Department of Computer Vision Technology (VIS), Baidu Inc.
Abstract
Extracting entities from images is a crucial part of many OCR applications, such as entity recognition for cards, invoices, and receipts. Most existing works employ the classical detection and recognition paradigm. This paper proposes an Entity-aware Attention Text Extraction Network called EATEN, an end-to-end trainable system that extracts entities without any post-processing. In the proposed framework, each entity is parsed by its corresponding entity-aware decoder. Moreover, we introduce a state transition mechanism that further improves the robustness of entity extraction. In consideration of the absence of public benchmarks, we construct a dataset of almost 0.6 million images covering three real-world scenarios (train ticket, passport, and business card), which is publicly available at https://github.com/beacandler/EATEN. To the best of our knowledge, EATEN is the first single-shot method to extract entities from images. Extensive experiments on these benchmarks demonstrate the state-of-the-art performance of EATEN.
Recently, scene text detection and recognition, two fundamental tasks in the field of computer vision, have become increasingly popular due to their wide applications, such as scene text understanding and image and video retrieval. Among these applications, extracting Entities of Interest (EoIs) is one of the most challenging and practical problems, as it requires identifying the texts that belong to certain predefined entities. Taking a passport as an example, there are many entities in the image, such as Country, Name, Birthday, and so forth. In practical applications, we only need to output the texts for some predefined entities, e.g., “China” or “USA” for the entity “Country”, “Jack” or “Rose” for the entity “Name”. Previous approaches mainly adopt two steps, in which text information is first extracted via OCR (Optical Character Recognition), and EoIs are then extracted by handcrafted rules or layout analysis. Nevertheless, in this detection and recognition paradigm, engineers have to develop post-processing steps, i.e., handcrafted rules that determine which part of the recognized text belongs to the predefined EoIs.
It is usually the post-processing steps, rather than the detection and recognition capability, that restrain the performance of EoI extraction. For example, if the positions of entities are slightly offset from their standard positions, inaccurate entities will be extracted because the template representation is sensitive to such drift. In this paper, a single-shot Entity-aware Attention Text Extraction Network (EATEN) is proposed to extract EoIs from images within a single neural network. We use a CNN-based feature extractor to extract feature maps from the original image. We then design an entity-aware attention network, composed of multiple entity-aware decoders, initial-state warm-up, and state transition between decoders, to capture all entities in the image. Compared with traditional methods, EATEN is an end-to-end trainable framework rather than a multi-stage procedure. Owing to the spatial attention mechanism, EATEN is able to cover most corner cases, such as arbitrary shapes, projective/affine transformations, and position drift, without any correction.
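To make the decoding scheme more concrete, the following is a minimal PyTorch-style sketch of entity-aware decoding with state transition, in which the final hidden state of one entity's decoder warms up the next decoder. The module names (EATENSketch, EntityDecoder), the GRU cell, the attention form, the greedy feedback, and all sizes are illustrative assumptions rather than the actual EATEN implementation.

```python
# Minimal sketch of entity-aware decoding with state transition.
# All module/variable names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityDecoder(nn.Module):
    """One attention decoder responsible for a single entity."""
    def __init__(self, feat_dim, hidden_dim, vocab_size, max_len=16):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim + vocab_size, hidden_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)      # simple additive-style scoring
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.vocab_size, self.max_len = vocab_size, max_len

    def forward(self, feats, h):                  # feats: (B, N, feat_dim), h: (B, hidden_dim)
        y = feats.new_zeros(feats.size(0), self.vocab_size)  # previous-token one-hot feedback
        logits = []
        for _ in range(self.max_len):
            # Spatial attention over the flattened feature map.
            score = self.attn(torch.cat(
                [feats, h.unsqueeze(1).expand(-1, feats.size(1), -1)], dim=-1))
            ctx = (F.softmax(score, dim=1) * feats).sum(dim=1)        # (B, feat_dim)
            h = self.cell(torch.cat([ctx, y], dim=-1), h)
            step = self.out(h)
            logits.append(step)
            y = F.one_hot(step.argmax(-1), self.vocab_size).float()   # greedy feedback
        return torch.stack(logits, dim=1), h      # h is handed to the next decoder (state transition)

class EATENSketch(nn.Module):
    def __init__(self, num_entities, feat_dim=256, hidden_dim=256, vocab_size=100):
        super().__init__()
        self.backbone = nn.Sequential(             # stand-in CNN feature extractor
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.decoders = nn.ModuleList(
            EntityDecoder(feat_dim, hidden_dim, vocab_size) for _ in range(num_entities))
        self.init_h = nn.Parameter(torch.zeros(hidden_dim))   # learned initial ("warm-up") state

    def forward(self, image):                      # image: (B, 3, H, W)
        f = self.backbone(image)                   # (B, C, H', W')
        feats = f.flatten(2).transpose(1, 2)       # (B, N, C), N = H' * W'
        h = self.init_h.expand(image.size(0), -1)
        outputs = []
        for dec in self.decoders:                  # state flows from one entity decoder to the next
            logits, h = dec(feats, h)
            outputs.append(logits)
        return outputs                             # one character-logit sequence per entity
```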
- Text preparation. To make the synthetic images more general, we collect a large corpus, including Chinese names, addresses, etc., by crawling the Internet.
- Font rendering. We select one font for each specific scenario, and the EoIs are rendered on the background template images using an open-source image library. In particular, in the business card scenario, we prepare more than one hundred template images, consisting of 85 simple-background images and pure-color images with random colors, on which the text is rendered.
- Transformation. We rotate the image randomly within a range of [-5, +5] degrees and then resize the image according to its longer side. Elastic transformation is also employed.
- Noise. Gaussian noise, blur, average blur, sharpening, and brightness, hue, and saturation changes are applied; a sketch of these rendering and augmentation steps is given after this list.
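To illustrate the synthesis pipeline, here is a minimal sketch of the rendering, transformation, and noise steps using Pillow and NumPy. The font path, text position, parameter ranges, and the helper name synthesize are illustrative assumptions rather than the exact settings used to build the dataset; elastic transformation and some of the listed noise types are omitted for brevity.

```python
# Minimal sketch of the synthesis pipeline: render text on a template,
# apply a small random rotation and resize, then add blur/noise/color jitter.
# Paths, fonts, and parameter ranges are illustrative assumptions.
import numpy as np
from PIL import Image, ImageDraw, ImageEnhance, ImageFilter, ImageFont

def synthesize(template_path, text, xy, font_path, long_side=512, rng=None):
    rng = rng or np.random.default_rng()
    img = Image.open(template_path).convert("RGB")

    # Font rendering: draw the EoI text at a template-specific position.
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, size=32)
    draw.text(xy, text, fill=(20, 20, 20), font=font)

    # Transformation: random rotation in [-5, +5] degrees, resize by the longer side.
    img = img.rotate(rng.uniform(-5, 5), resample=Image.BILINEAR, expand=True)
    scale = long_side / max(img.size)
    img = img.resize((int(img.width * scale), int(img.height * scale)))

    # Noise: blur, sharpen, brightness/saturation jitter, additive Gaussian noise.
    if rng.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.5, 1.5)))
    if rng.random() < 0.3:
        img = img.filter(ImageFilter.SHARPEN)
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.8, 1.2))
    img = ImageEnhance.Color(img).enhance(rng.uniform(0.8, 1.2))
    arr = np.asarray(img, dtype=np.float32)
    arr += rng.normal(0.0, 5.0, size=arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```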
Compared methods. We compare several baseline methods with our approach:
(1) General OCR. A typical OCR-and-matching paradigm that first detects and reads all the text with an OCR engine, and then extracts an EoI if the text content matches predefined regular expressions or the text position fits the designed templates (a sketch of this matching step follows the list).
(2) Attention OCR. It reads multiple lines of scene text with an attention mechanism and has achieved state-of-the-art performance on several datasets. We adapt it to transcribe the EoIs sequentially.
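For the General OCR baseline, the rule-based matching step can be illustrated with a short sketch. The entity names, regular expressions, and the assumed (text, box) OCR output format below are hypothetical examples, not the actual rules or templates used in the experiments.

```python
# Hypothetical sketch of rule-based EoI extraction on top of OCR output.
# ocr_lines: list of (text, box) pairs produced by a detector + recognizer.
import re

ENTITY_PATTERNS = {                      # illustrative regular expressions only
    "ticket_number": re.compile(r"^[A-Z]\d{6,8}$"),
    "date":          re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}"),
    "seat":          re.compile(r"\d{1,2}车\d{1,3}[A-F]座"),
}

def extract_eois(ocr_lines):
    """Assign each recognized text line to the first entity whose pattern matches."""
    eois = {}
    for text, box in ocr_lines:
        for entity, pattern in ENTITY_PATTERNS.items():
            if entity not in eois and pattern.search(text):
                eois[entity] = (text, box)
                break
    return eois

# A slight layout drift or a single recognition error easily breaks such rules,
# which is the brittleness that motivates an end-to-end approach like EATEN.
print(extract_eois([("2019-06-01", (10, 40, 120, 60)), ("G123456", (10, 10, 90, 30))]))
```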
Conclusion
In this paper, we proposed an end-to-end framework called EATEN for extracting EoIs from images. A dataset covering three real-world scenarios was established to verify the effectiveness of the proposed method and to complement research on EoI extraction. In contrast to traditional approaches based on text detection and text recognition, EATEN is trained efficiently without bounding-box or full-text annotations, and directly predicts the target entities of an input image in one shot without any bells and whistles. It shows superior performance in all scenarios and demonstrates the full capacity to extract EoIs from images with or without a fixed layout. This study provides a new perspective on text recognition, EoI extraction, and structural information extraction.