TableSense: Spreadsheet Table Detection with Convolutional Neural Networks

- By Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang

Microsoft Research, Beijing 100080, China.

Beihang University, Beijing 100191, China

Abstract

Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. However, the detection task is challenged by the diversity of table structures and table layouts on the spreadsheet. Considering the analogy between a spreadsheet as a matrix of cells and an image as a matrix of pixels, and encouraged by the successful application of Convolutional Neural Networks (CNNs) in computer vision, we have developed TableSense, a novel end-to-end framework for spreadsheet table detection. First, we devise an effective cell featurization scheme to better leverage the rich information in each cell; second, we develop an enhanced convolutional neural network model for table detection to meet the domain-specific requirement of precise table boundary detection; third, we propose an effective uncertainty metric to guide an active-learning-based smart sampling algorithm, which enables the efficient build-up of a training dataset with 22,176 tables on 10,220 sheets with broad coverage of diverse table structures and layouts. Our evaluation shows that TableSense is highly effective, with 91.3% recall and 86.5% precision in the EoB-2 metric, a significant improvement over both the detection algorithms currently used in commodity spreadsheet tools and state-of-the-art convolutional neural networks in computer vision.

Spreadsheets are a critical end-user development tool for data management and analysis. In spreadsheet data, the table is a key structure for data processing and information presentation. Automatic table detection is an important initial step for one-click intelligence features such as Ideas in Excel or Explore in Google Sheets, where insights can be recommended from the detected tables in an automated end-to-end experience. Despite the importance of automatic table detection for spreadsheets, this problem has largely been overlooked for decades in both the research community and industry. Previous research on table detection has mainly targeted other media, e.g. HTML, images and PDFs, where the aim is to retrieve (most likely single) table regions from the ambient text.

The major challenge for these techniques is understanding binary files through metadata analysis and image processing; the table boundaries themselves are clear. The scenario with spreadsheet table detection is fundamentally different. A single sheet can have multiple tables cluttered together, with a potentially different structure for each table. The diversity in multi-table layout and structure significantly confounds the problem by obscuring table boundaries. To the best of our knowledge, there is no prior research effort on this problem in academia, while region-growth techniques are commonly used in commodity spreadsheet tools. However, region-growth is quite fragile in the presence of complicated table structures and layouts on the sheet.

A sample spreadsheet with three tables showing various artefacts. Dotted red bounding boxes and dashed green bounding boxes show the tables detected by TableSense and Mask R-CNN, respectively.

Problem Statement

Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. In such a task, an input sheet is represented by a matrix of cells. The output is a list of tables detected, where the range of each detected table is represented by a 4-tuple (colleft, rowtop, colright, rowbottom), which specifies the x and y coordinates for the top-left and bottom-right corners of the bounding box (bbox).
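The input and output described above can be sketched in a few lines of Python. This is only an illustration: the `TableRange` name, the 0-based inclusive indexing, the use of empty strings for blank cells, and the `max_boundary_offset` helper are assumptions made for this sketch, not definitions taken from the paper. The helper shows one simple way to compare a detected range against a labeled one, in the spirit of a boundary-tolerance metric; the exact definition of EoB-2 is not given in this summary.

```python
from typing import List, NamedTuple


class TableRange(NamedTuple):
    """Inclusive cell range of one table (0-based indices are an assumption)."""
    col_left: int
    row_top: int
    col_right: int
    row_bottom: int

    def num_cells(self) -> int:
        # Both corners are inclusive, hence the +1 on each axis.
        width = self.col_right - self.col_left + 1
        height = self.row_bottom - self.row_top + 1
        return width * height


def max_boundary_offset(pred: TableRange, truth: TableRange) -> int:
    """Largest absolute difference over the four boundaries (hypothetical helper)."""
    return max(abs(p - t) for p, t in zip(pred, truth))


# The input sheet is a matrix of cells; blank cells are "" in this sketch.
sheet = [
    ["Name", "Qty", "",     ""],
    ["Pen",  "3",   "",     ""],
    ["",     "",    "",     ""],
    ["",     "",    "City", "Pop"],
    ["",     "",    "Oslo", "0.7M"],
]

# The output is a list of detected table ranges.
detected: List[TableRange] = [
    TableRange(col_left=0, row_top=0, col_right=1, row_bottom=1),
    TableRange(col_left=2, row_top=3, col_right=3, row_bottom=4),
]
```

A named tuple keeps the 4-tuple ordering from the text while giving each boundary a readable name.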

Datasets

All experimental data in the development of TableSense is from our WebSheet dataset, which is a web-crawled spreadsheet corpus including 4,290,022 sheets. WebSheet10k is a sampled subset of WebSheet for human labeling. It contains 10,220 sheets in English, where all table regions on each sheet have been labeled with a corresponding bounding box. To control labeling quality, each sheet has been labeled by a human labeler and then verified by another human labeler. To ensure high coverage of various table structures and multi-table layouts on sheets, we adopt an active learning framework to build WebSheet10k in iterations. Details are provided in Section . WebSheet400 is our test set with labels, which contains 400 randomly sampled sheets with 795 tables from WebSheet without any overlap with WebSheet10k.

TableSense Framework

TableSense is a framework tailored for table detection. It is an end-to-end model consisting of the following modules.

• Cell featurization: Since cells do not have a canonical representation in the spreadsheet, we need to extract cell features before feeding them to the pipeline. Details of cell featurization will be provided in Section .

• CNN backbone: A CNN is the backbone of our framework, capturing spatial correlations and learning high-level representations from the input cell matrix. A fully convolutional network is adopted so that the model can process spreadsheets of various sizes without rescaling them.

• Table detection head: We adopt the two-stage detection mechanism that achieves state-of-the-art results in computer vision. In this module, the feature maps generated by the CNN backbone are fed to a Region Proposal Network (RPN), which produces a list of Regions of Interest (RoIs). RoIAlign then extracts feature maps from each RoI. Next, a CNN-based bounding box regression branch refines the boundaries of these RoIs, a CNN-based table classifier simultaneously scores them, and a segmentation branch generates the cell-level table mask. These branches are applied to each RoI separately. Finally, Non-Maximum Suppression (NMS) is used to rank the bounding boxes and filter out redundant ones. For our task, RoIAlign, which is based on bilinear interpolation, preserves more precise per-cell correspondence than RoIPool, which uses simple hard quantization.

The framework of TableSense for spreadsheet table detection
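The final Non-Maximum Suppression step of the detection head can be illustrated with a minimal greedy sketch over cell-range boxes. This is the generic NMS procedure, not the paper's implementation; the `iou` and `nms` names, the inclusive cell-index convention, and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

# (col_left, row_top, col_right, row_bottom), inclusive cell indices (assumed).
Box = Tuple[int, int, int, int]


def area(b: Box) -> int:
    """Number of cells covered by a box."""
    return (b[2] - b[0] + 1) * (b[3] - b[1] + 1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, counted in cells."""
    w = min(a[2], b[2]) - max(a[0], b[0]) + 1
    h = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    return inter / (area(a) + area(b) - inter)


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate candidates for one table, plus one separate table.
candidates = [(0, 0, 3, 4), (0, 0, 3, 5), (6, 0, 8, 4)]
confidences = [0.9, 0.8, 0.7]
kept = nms(candidates, confidences)  # indices of the surviving boxes
```

In this example the second candidate overlaps the first almost entirely and has a lower score, so it is filtered out, while the third box survives because it does not overlap anything.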

We propose six measures below for the evaluation of sheet uncertainty.

• Classification uncertainty score: One minus the average classification probability returned by the softmax values for all detected table regions on the sheet.

• Mismatch score of segmentation and detection masks: One minus the IoU between the segmentation mask and detection mask. The detection mask is produced by setting the values of all cells inside the detected table region to 1 and 0 otherwise.

• Table/Sheet-level Sparsity factor: Table-level sparsity factor is given by the ratio of blank cells in the detected table region, while sheet-level sparsity is given by the lowest sparsity factor for all tables detected on the sheet.

• Overlapping region indicator: 1 if there is an overlap between any two detected table regions and 0 otherwise.

• Boundary mismatch indicator: 1 if there is a boundary mismatch and 0 otherwise. A mismatch is identified if any detected boundary lies on a blank column or row.

• Out-of-region coverage ratio: The ratio of the number of non-blank cells outside the detected table regions to the total number of non-blank cells.
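The six measures above can be sketched as small Python functions. This is a rough illustration under stated assumptions, not the authors' code: the function names are hypothetical, blank cells are represented as empty strings, boxes use inclusive 0-based cell indices, and a boundary is treated as blank when it is blank within the span of its own box (the text does not say whether the whole sheet row or column is meant). How the six values are combined into a single sampling score is not described in this summary, so the sketch leaves them separate.

```python
from typing import List, Set, Tuple

Box = Tuple[int, int, int, int]  # (col_left, row_top, col_right, row_bottom), inclusive
Grid = List[List[str]]           # blank cells are "" (assumption of this sketch)


def cells(b: Box) -> Set[Tuple[int, int]]:
    """All (row, col) positions inside a box."""
    return {(r, c) for r in range(b[1], b[3] + 1) for c in range(b[0], b[2] + 1)}


def classification_uncertainty(scores: List[float]) -> float:
    """One minus the average classification probability of the detections."""
    return 1.0 - sum(scores) / len(scores)


def mask_mismatch(seg_mask: List[List[int]], boxes: List[Box]) -> float:
    """One minus the IoU between the segmentation mask and the detection mask."""
    seg = {(r, c) for r, row in enumerate(seg_mask) for c, v in enumerate(row) if v}
    det = set().union(*(cells(b) for b in boxes))
    union = seg | det
    return 1.0 - len(seg & det) / len(union) if union else 0.0


def table_sparsity(sheet: Grid, b: Box) -> float:
    """Ratio of blank cells inside one detected table region."""
    region = cells(b)
    return sum(sheet[r][c] == "" for r, c in region) / len(region)


def sheet_sparsity(sheet: Grid, boxes: List[Box]) -> float:
    """Lowest table-level sparsity on the sheet, following the text above."""
    return min(table_sparsity(sheet, b) for b in boxes)


def overlap_indicator(boxes: List[Box]) -> int:
    """1 if any two detected regions share a cell, else 0."""
    return int(any(cells(a) & cells(b)
                   for i, a in enumerate(boxes) for b in boxes[i + 1:]))


def boundary_mismatch_indicator(sheet: Grid, boxes: List[Box]) -> int:
    """1 if any box edge is blank across the span of that box, else 0."""
    def blank_row(r: int, b: Box) -> bool:
        return all(sheet[r][c] == "" for c in range(b[0], b[2] + 1))

    def blank_col(c: int, b: Box) -> bool:
        return all(sheet[r][c] == "" for r in range(b[1], b[3] + 1))

    return int(any(blank_row(b[1], b) or blank_row(b[3], b)
                   or blank_col(b[0], b) or blank_col(b[2], b) for b in boxes))


def out_of_region_ratio(sheet: Grid, boxes: List[Box]) -> float:
    """Share of non-blank cells that fall outside every detected region."""
    covered = set().union(*(cells(b) for b in boxes))
    non_blank = {(r, c) for r, row in enumerate(sheet)
                 for c, v in enumerate(row) if v != ""}
    return len(non_blank - covered) / len(non_blank) if non_blank else 0.0


# Toy sheet with two tables; the second detection starts one row too early.
example_sheet = [
    ["Name", "Qty", "",     ""],
    ["Pen",  "3",   "",     ""],
    ["",     "",    "",     ""],
    ["",     "",    "City", "Pop"],
    ["",     "",    "Oslo", "0.7M"],
]
example_boxes = [(0, 0, 1, 1), (2, 2, 3, 4)]
example_scores = [0.9, 0.7]
example_seg_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
```

On the toy sheet, the second detection includes a blank top row, so the boundary mismatch indicator fires and the two masks disagree on that row, while the overlap indicator and out-of-region ratio stay at zero.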

Conclusion and Future Work

In this paper, we propose the TableSense suite to address the challenges in spreadsheet table detection. TableSense is a unified, end-to-end framework built on CNNs with several key enhancements. First, we propose a featurization scheme to encode cell features. Second, we devise a Precise Bounding box Regression (PBR) module to predict precise bounding boxes and incorporate it into the framework. Third, we use active learning to effectively select low-confidence sheets for human labeling when building up the training dataset. In the future, we will leverage the TableSense technique for automated table structure analysis and take a further step in spreadsheet intelligence.

