
EATEN: Entity-aware Attention for Single Shot Visual Text Extraction

He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding, Department of Computer Vision Technology (VIS), Baidu Inc.

Paper link

Abstract

Extracting entities from images is a crucial part of many OCR applications, such as entity recognition for cards, invoices, and receipts. Most existing works employ the classical detection-and-recognition paradigm. This paper proposes an Entity-aware Attention Text Extraction Network called EATEN, an end-to-end trainable system that extracts entities without any post-processing. In the proposed framework, each entity is parsed by its own entity-aware decoder. Moreover, we introduce a state transition mechanism which further improves the robustness of entity extraction. In consideration of the absence of public benchmarks, we construct a dataset of almost 0.6 million images in three real-world scenarios (train ticket, passport, and business card), which is publicly available at https://github.com/beacandler/EATEN. To the best of our knowledge, EATEN is the first single-shot method to extract entities from images. Extensive experiments on these benchmarks demonstrate the state-of-the-art performance of EATEN.

Recently, scene text detection and recognition, two fundamental tasks in computer vision, have become increasingly popular due to their wide applications, such as scene text understanding and image and video retrieval. Among these applications, extracting Entities of Interest (EoIs) is one of the most challenging and practical problems, since it requires identifying the texts that belong to particular entities. Taking a passport as an example, the image contains many entities, such as Country, Name, Birthday, and so forth. In practical applications, we only need to output the texts for certain predefined entities, e.g., "China" or "USA" for the entity Country, "Jack" or "Rose" for the entity Name. Previous approaches mainly adopt two steps: text information is first extracted via OCR (Optical Character Recognition), and then EoIs are extracted by handcrafted rules or layout analysis. Nevertheless, in the detection-and-recognition paradigm, engineers have to develop post-processing steps, i.e., handcrafted rules that determine which part of the recognized text belongs to the predefined EoIs.


It is usually the post-processing steps, rather than the detection and recognition capability, that restrain the performance of EoI extraction. For example, if the positions of entities drift slightly from their standard positions, inaccurate entities will be extracted because the template representation is sensitive to position. In this paper, a single-shot Entity-aware Attention Text Extraction Network (EATEN) is proposed to extract EoIs from images within a single neural network. A CNN-based feature extractor first extracts feature maps from the original image. Then an entity-aware attention network, composed of multiple entity-aware decoders, an initial-state warm-up, and state transitions between decoders, captures all entities in the image. Compared with traditional methods, EATEN is an end-to-end trainable framework rather than a multi-stage procedure. Thanks to its spatial attention mechanism, EATEN covers most corner cases, such as arbitrary shapes, projective/affine transformations, and position drift, without any correction.
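To make the architecture concrete, here is a minimal PyTorch sketch of the pipeline just described: a CNN backbone, one attention decoder per entity, an initial-state warm-up, and a state transition that hands each decoder's final hidden state to the next. All class names, layer sizes, and token conventions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the EATEN idea: CNN backbone, per-entity attention
# decoders, initial-state warm-up, and state transition between decoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityDecoder(nn.Module):
    """One entity-aware decoder: GRU cell + spatial attention over CNN features."""
    def __init__(self, vocab_size, hidden=256, feat_dim=256, max_len=24):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.GRUCell(hidden + feat_dim, hidden)
        self.attn = nn.Linear(hidden + feat_dim, 1)   # additive attention score
        self.out = nn.Linear(hidden, vocab_size)
        self.max_len = max_len

    def forward(self, feats, h):                      # feats: (B, N, feat_dim)
        tok = feats.new_zeros(feats.size(0), dtype=torch.long)  # <GO>, id 0 assumed
        logits = []
        for _ in range(self.max_len):
            # spatial attention: score every feature-map location against h
            q = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            score = self.attn(torch.cat([q, feats], -1))
            ctx = (F.softmax(score, dim=1) * feats).sum(1)
            h = self.cell(torch.cat([self.embed(tok), ctx], -1), h)
            step = self.out(h)
            logits.append(step)
            tok = step.argmax(-1)                     # greedy decoding at inference
        return torch.stack(logits, 1), h              # final h feeds the next decoder

class EATENSketch(nn.Module):
    def __init__(self, vocab_size, num_entities, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(                # stand-in CNN feature extractor
            nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, 2, 1), nn.ReLU())
        self.warmup = nn.Linear(256, hidden)          # initial-state warm-up
        self.decoders = nn.ModuleList(
            EntityDecoder(vocab_size, hidden) for _ in range(num_entities))

    def forward(self, img):
        f = self.backbone(img)                        # (B, C, H, W)
        feats = f.flatten(2).transpose(1, 2)          # (B, H*W, C)
        h = torch.tanh(self.warmup(feats.mean(1)))    # warm up from global feature
        outputs = []
        for dec in self.decoders:                     # state transition between decoders
            logits, h = dec(feats, h)
            outputs.append(logits)
        return outputs                                # one decoded sequence per entity
```

Since every decoder is responsible for exactly one entity, no post-processing is needed to assign text to entities; the sequence emitted by decoder *i* is the value of entity *i* by construction.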



The generation process consists of four steps (a code sketch follows the list):
  • Text preparation. To make the synthetic images more general, we collected a large corpus, including Chinese names, addresses, etc., by crawling the Internet.
  • Font rendering. We select one font for each scenario, and the EoIs are rendered on the background template images using an open-source imaging library. In particular, for the business card scenario, we prepared more than one hundred template images, including 85 simple-background images and pure images with random colors, on which to render text.
  • Transformation. We rotate the image randomly within a range of [-5, +5] degrees, then resize it according to the longer side. Elastic transformation is also employed.
  • Noise. Gaussian noise, blur, average blur, sharpening, and brightness, hue, and saturation adjustments are applied.
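As a rough illustration of the four steps above, the following sketch renders EoI text onto a template and applies the stated transformations and noise using PIL and NumPy. File paths, parameter ranges, and the omission of elastic transformation are my own simplifying assumptions, not the paper's exact pipeline.

```python
# Sketch of the synthesis pipeline: render text, transform, add noise.
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageFilter, ImageEnhance

def synthesize(template_path, font_path, eois, long_side=512):
    # 1) Text preparation: `eois` maps entity names to (text, (x, y)) pairs
    #    drawn from a crawled corpus (names, addresses, etc.).
    img = Image.open(template_path).convert("RGB")
    # 2) Font rendering: draw each EoI onto the background template.
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, size=28)
    for text, xy in eois.values():
        draw.text(xy, text, font=font, fill=(20, 20, 20))
    # 3) Transformation: random rotation in [-5, +5] degrees, then resize
    #    by the longer side (elastic transformation omitted here).
    img = img.rotate(random.uniform(-5, 5), expand=True, fillcolor=(255, 255, 255))
    scale = long_side / max(img.size)
    img = img.resize((int(img.width * scale), int(img.height * scale)))
    # 4) Noise: blur, brightness/saturation jitter, additive Gaussian noise.
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 1.5)))
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))
    img = ImageEnhance.Color(img).enhance(random.uniform(0.8, 1.2))  # saturation
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0, 5, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```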


Compared methods. We compare several baseline methods with our approach:

(1) General OCR. A typical paradigm, OCR plus matching: it first detects and reads all the text with an OCR engine, and then extracts EoIs when the text content fits predefined regular expressions or the text position fits designed templates.
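For intuition, a toy version of this OCR-plus-matching baseline might look like the following, where each entity is matched by a hand-written regular expression over the OCR output. The patterns and entity names are invented for illustration of a train-ticket-like layout, not the engine or rules the paper used.

```python
# Toy "OCR + matching" baseline: regex rules map OCR lines to entities.
import re

ENTITY_PATTERNS = {
    "ticket_number": re.compile(r"^[A-Z]\d{6}$"),          # e.g. G123456
    "date":          re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),
    "price":         re.compile(r"[¥￥]\s?\d+(\.\d{1,2})?"),
}

def extract_eois(ocr_lines):
    """ocr_lines: list of text strings returned by an OCR engine."""
    found = {}
    for line in ocr_lines:
        for entity, pat in ENTITY_PATTERNS.items():
            m = pat.search(line)
            if m and entity not in found:
                found[entity] = m.group(0)
    return found

print(extract_eois(["G123456", "2019年5月1日", "￥ 56.5"]))
# {'ticket_number': 'G123456', 'date': '2019年5月1日', 'price': '￥ 56.5'}
```

Such rules are exactly the brittle post-processing criticized earlier: a small layout drift or OCR error breaks the match even when detection and recognition succeed.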

(2) Attention OCR. It reads multiple lines of scene text with an attention mechanism and has achieved state-of-the-art performance on several datasets. We adapt it to transcribe the EoIs sequentially, using separator tokens to distinguish different EoIs.
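One plausible way to realize this adaptation is to serialize all EoIs into a single character sequence with separator tokens, so a single decoder can transcribe them in a fixed order. The token names and ordering convention below are assumptions, not the paper's exact scheme.

```python
# Serializing EoIs into one target sequence for a single attention decoder.
SEP, EOS = "<SEP>", "<EOS>"

def serialize(eois):                 # eois: ordered {entity_name: text}
    tokens = []
    for i, text in enumerate(eois.values()):
        if i > 0:
            tokens.append(SEP)       # separator between consecutive entities
        tokens.extend(list(text))    # character-level tokens
    tokens.append(EOS)
    return tokens

def deserialize(tokens, entity_names):
    """Split the decoded token stream back into per-entity strings."""
    joined = "".join(t if t != SEP else "\n" for t in tokens if t != EOS)
    return dict(zip(entity_names, joined.split("\n")))

seq = serialize({"name": "Jack", "country": "USA"})
# ['J', 'a', 'c', 'k', '<SEP>', 'U', 'S', 'A', '<EOS>']
print(deserialize(seq, ["name", "country"]))  # {'name': 'Jack', 'country': 'USA'}
```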

(3) EATEN without state transition. This method is used for the ablation study, to verify the effectiveness of the proposed state transition.



Conclusion

In this paper, we proposed an end-to-end framework called EATEN for extracting EoIs from images. A dataset with three real-world scenarios was established to verify the efficiency of the proposed method and to complement research on EoI extraction. In contrast to traditional approaches based on text detection and text recognition, EATEN is trained efficiently without bounding-box or full-text annotations, and directly predicts the target entities of an input image in one shot, without any bells and whistles. It shows superior performance in all scenarios and demonstrates full capacity for extracting EoIs from images with or without a fixed layout. This study provides a new perspective on text recognition, EoI extraction, and structured information extraction.

