
TableSense: Spreadsheet Table Detection with Convolutional Neural Networks

 - By Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang
Microsoft Research, Beijing 100080, China.
Beihang University, Beijing 100191, China

Paper Link

Abstract


Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. However, the detection task is challenged by the diversity of table structures and table layouts on the spreadsheet. Considering the analogy between a cell matrix as spreadsheet and a pixel matrix as image, and encouraged by the successful application of Convolutional Neural Networks (CNN) in computer vision, we have developed TableSense, a novel end-to-end framework for spreadsheet table detection. First, we devise an effective cell featurization scheme to better leverage the rich information in each cell; second, we develop an enhanced convolutional neural network model for table detection to meet the domain-specific requirement on precise table boundary detection; third, we propose an effective uncertainty metric to guide an active learning based smart sampling algorithm, which enables the efficient build-up of a training dataset with 22,176 tables on 10,220 sheets with broad coverage of diverse table structures and layouts. Our evaluation shows that TableSense is highly effective, with 91.3% recall and 86.5% precision under the EoB-2 metric, a significant improvement over both the detection algorithms currently used in commodity spreadsheet tools and state-of-the-art convolutional neural networks in computer vision.

Introduction

Spreadsheets are a critical end-user development tool for data management and analysis. In spreadsheet data, the table is a key structure for data processing and information presentation. Automatic table detection is an important initial step for one-click intelligence features such as Ideas in Excel or Explore in Google Sheets, where insights can be recommended from the detected tables with an automated end-to-end experience. Despite the importance of automatic table detection for spreadsheets, this problem has largely been overlooked for decades in both the research community and industry. Previous research on table detection has mainly targeted other media, e.g. HTML, images and PDFs, where the aim is to retrieve (most likely single) table regions from the ambient text.

The major challenge for these techniques is understanding binary files through metadata analysis and image processing, whereas the table boundaries themselves are clear. The scenario with spreadsheet table detection is fundamentally different. A single sheet can have multiple tables cluttered together, with potentially different structures for each table. The diversity in multi-table layout and structure significantly confounds the problem by obscuring table boundaries. To the best of our knowledge, there is no prior research effort on this problem in academia, while region-growth techniques are commonly used in commodity spreadsheet tools. However, region-growth is quite fragile in the presence of complicated table structures and layouts on the sheet.


A sample spreadsheet with three tables showing various artefacts. Dotted red bounding boxes and dashed green bounding boxes show the tables detected by TableSense and Mask R-CNN, respectively.


Problem Statement


Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. In such a task, an input sheet is represented by a matrix of cells. The output is a list of detected tables, where the range of each detected table is represented by a 4-tuple (col_left, row_top, col_right, row_bottom), which specifies the x and y coordinates of the top-left and bottom-right corners of the bounding box (bbox).
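
To make the output format concrete, here is a minimal Python sketch of the 4-tuple as a small data structure. The class name `TableRange`, the helper functions, and the choice of 1-based inclusive coordinates are assumptions made for illustration; the paper only specifies the four fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRange:
    """Bounding box of one detected table (hypothetical helper type).

    Assumption: coordinates are 1-based and inclusive on both ends,
    matching the way spreadsheet rows and columns are usually counted.
    """
    col_left: int
    row_top: int
    col_right: int
    row_bottom: int

    def __post_init__(self):
        # Reject boxes whose corners are in the wrong order.
        if self.col_left > self.col_right or self.row_top > self.row_bottom:
            raise ValueError("left/top must not exceed right/bottom")

    @property
    def n_cells(self) -> int:
        """Number of cells covered by the box."""
        width = self.col_right - self.col_left + 1
        height = self.row_bottom - self.row_top + 1
        return width * height

    def contains(self, col: int, row: int) -> bool:
        """True if the cell (col, row) lies inside the box."""
        return (self.col_left <= col <= self.col_right
                and self.row_top <= row <= self.row_bottom)


def column_letters(col: int) -> str:
    """Convert a 1-based column index to letters (1 -> A, 27 -> AA)."""
    letters = ""
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def to_a1(table: TableRange) -> str:
    """Render the box in the familiar A1-style range notation."""
    return (f"{column_letters(table.col_left)}{table.row_top}:"
            f"{column_letters(table.col_right)}{table.row_bottom}")


if __name__ == "__main__":
    t = TableRange(col_left=2, row_top=3, col_right=5, row_bottom=10)
    print(to_a1(t), t.n_cells)
```

A detector's output for one sheet would then simply be a list of such ranges, one per table.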

Datasets

All experimental data used in the development of TableSense comes from our WebSheet dataset, a web-crawled spreadsheet corpus of 4,290,022 sheets. WebSheet10k is a sampled subset of WebSheet for human labeling. It contains 10,220 sheets in English, where all table regions on each sheet have been labeled with a corresponding bounding box. To control labeling quality, each sheet has been labeled by one human labeler and then verified by another. To ensure high coverage of various table structures and multi-table layouts on sheets, we adopt an active learning framework to build WebSheet10k in iterations; details of the sampling procedure are given in the original paper. WebSheet400 is our labeled test set, which contains 400 randomly sampled sheets with 795 tables from WebSheet, without any overlap with WebSheet10k.


TableSense Framework

TableSense is a framework tailored for table detection. It is an end-to-end model consisting of the following modules.
• Cell featurization: Since cells do not have a canonical representation in the spreadsheet, we need to extract cell features before feeding them to the pipeline. Details of the featurization scheme are given in the original paper.
• CNN backbone: A CNN is the backbone of our framework; it captures spatial correlations and learns high-level representations from the input cell matrix. A fully convolutional network is adopted so that the model can process spreadsheets of various sizes without rescaling them.
• Table detection head: We adopt the two-stage detection mechanism that achieves state-of-the-art results in computer vision. In this module, the feature maps generated by the CNN backbone are fed to a Region Proposal Network (RPN), which produces a list of Regions of Interest (RoIs). RoIAlign then extracts feature maps from each RoI for bounding box regression. Next, a CNN-based bounding box regression branch refines the boundaries of these RoIs, a CNN-based table classifier simultaneously scores them, and a segmentation branch generates the cell-level table mask. These branches are applied to each RoI separately. Finally, Non-Maximum Suppression (NMS) is used to rank the bounding boxes and filter out redundant ones. For our task, RoIAlign, which is based on bilinear interpolation, preserves more precise per-cell correspondence than RoIPool, which uses simple hard quantization.
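
The final NMS step of the detection head can be illustrated with a short Python sketch. This is the standard greedy NMS procedure, not code from the paper; the function names, the 0.5 IoU threshold, and the use of inclusive cell coordinates in the order (col_left, row_top, col_right, row_bottom) are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over Union of two boxes.

    Boxes are (col_left, row_top, col_right, row_bottom) with inclusive
    cell coordinates, so a box spanning columns 1..5 is 5 cells wide.
    """
    inter_w = min(a[2], b[2]) - max(a[0], b[0]) + 1
    inter_h = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first.

    A box is dropped when it overlaps an already kept, higher-scoring box
    by more than iou_threshold (an illustrative default).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(1, 1, 5, 10), (1, 1, 5, 11), (8, 1, 12, 6)]
    scores = [0.90, 0.80, 0.95]
    # The second box nearly duplicates the first and has a lower score,
    # so it is filtered out.
    print(non_max_suppression(boxes, scores))
```

In a real pipeline this step would normally come from the detection library in use; the pure-Python version is only meant to show the ranking-and-filtering logic.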



The framework of TableSense for spreadsheet table detection






To guide the active learning based sampling of sheets for human labeling, we propose the six measures below for evaluating sheet uncertainty.
• Classification uncertainty score: One minus the average classification probability returned by the softmax values for all detected table regions on the sheet.
• Mismatch score of segmentation and detection masks: One minus the IoU between the segmentation mask and detection mask. The detection mask is produced by setting the values of all cells inside the detected table region to 1 and 0 otherwise.
• Table/Sheet-level Sparsity factor: Table-level sparsity factor is given by the ratio of blank cells in the detected table region, while sheet-level sparsity is given by the lowest sparsity factor for all tables detected on the sheet.
• Overlapping region indicator: 1 if there is an overlap between any two detected table regions and 0 otherwise.
• Boundary mismatch indicator: 1 if there is a boundary mismatch and 0 otherwise. A mismatch is identified if any detected boundary lies on a blank column or row.
• Out-of-region coverage ratio: The ratio of the number of non-blank cells outside the detected table regions to the total number of non-blank cells.
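
The six measures above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the authors' implementation. Assumptions: a sheet is a list of rows of 0/1 flags (`filled`, 1 for a non-blank cell), boxes are 0-based inclusive tuples (col_left, row_top, col_right, row_bottom), at least one table is detected, and a boundary counts as "blank" when every cell on that edge of the box is blank. All function names are hypothetical.

```python
from itertools import combinations


def cells(box):
    """Yield (row, col) for every cell inside an inclusive 0-based box."""
    c0, r0, c1, r1 = box
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            yield r, c


def classification_uncertainty(probs):
    """One minus the mean classification probability of detected tables."""
    return 1.0 - sum(probs) / len(probs)


def detection_mask(boxes, n_rows, n_cols):
    """Mask with 1 inside any detected table region and 0 elsewhere."""
    mask = [[0] * n_cols for _ in range(n_rows)]
    for box in boxes:
        for r, c in cells(box):
            mask[r][c] = 1
    return mask


def mask_mismatch(seg_mask, det_mask):
    """One minus the IoU between segmentation and detection masks."""
    pairs = [(s, d) for s_row, d_row in zip(seg_mask, det_mask)
             for s, d in zip(s_row, d_row)]
    inter = sum(1 for s, d in pairs if s and d)
    union = sum(1 for s, d in pairs if s or d)
    return (1.0 - inter / union) if union else 0.0


def table_sparsity(filled, box):
    """Ratio of blank cells inside one detected table region."""
    region = list(cells(box))
    blank = sum(1 for r, c in region if not filled[r][c])
    return blank / len(region)


def sheet_sparsity(filled, boxes):
    """Sheet-level factor: lowest table-level sparsity, as stated above."""
    return min(table_sparsity(filled, b) for b in boxes)


def overlap_indicator(boxes):
    """1 if any two detected regions share at least one cell, else 0."""
    def overlaps(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
    return int(any(overlaps(a, b) for a, b in combinations(boxes, 2)))


def boundary_mismatch_indicator(filled, boxes):
    """1 if any edge of any box consists only of blank cells, else 0."""
    for c0, r0, c1, r1 in boxes:
        edges = [
            [(r0, c) for c in range(c0, c1 + 1)],  # top row
            [(r1, c) for c in range(c0, c1 + 1)],  # bottom row
            [(r, c0) for r in range(r0, r1 + 1)],  # left column
            [(r, c1) for r in range(r0, r1 + 1)],  # right column
        ]
        if any(all(not filled[r][c] for r, c in edge) for edge in edges):
            return 1
    return 0


def out_of_region_ratio(filled, boxes):
    """Share of non-blank cells that fall outside all detected regions."""
    n_rows, n_cols = len(filled), len(filled[0])
    det = detection_mask(boxes, n_rows, n_cols)
    total = sum(1 for row in filled for v in row if v)
    outside = sum(1 for r in range(n_rows) for c in range(n_cols)
                  if filled[r][c] and not det[r][c])
    return outside / total if total else 0.0
```

How the six values are combined into a single ranking score for sampling is not covered in this summary, so the sketch leaves them as separate functions.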


Conclusion and Future Work

In this paper, we propose TableSense to address the challenges in spreadsheet table detection. TableSense is a unified, end-to-end framework built on CNNs with several key enhancements. First, we propose a featurization scheme to encode cell features. Second, we devise a precise bounding box regression (PBR) module to predict precise table boundaries and incorporate it into the framework. Third, we use active learning to select low-confidence sheets for human labeling when building up the training dataset. In the future, we will leverage TableSense for automated table structure analysis and take a further step in spreadsheet intelligence.


