
TableSense: Spreadsheet Table Detection with Convolutional Neural Networks

 - By Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang
Microsoft Research, Beijing 100080, China.
Beihang University, Beijing 100191, China

Paper Link

Abstract


Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. However, the detection task is challenged by the diversity of table structures and table layouts on the spreadsheet. Considering the analogy between a cell matrix as a spreadsheet and a pixel matrix as an image, and encouraged by the successful application of Convolutional Neural Networks (CNN) in computer vision, we have developed TableSense, a novel end-to-end framework for spreadsheet table detection. First, we devise an effective cell featurization scheme to better leverage the rich information in each cell; second, we develop an enhanced convolutional neural network model for table detection to meet the domain-specific requirement on precise table boundary detection; third, we propose an effective uncertainty metric to guide an active-learning-based smart sampling algorithm, which enables the efficient build-up of a training dataset with 22,176 tables on 10,220 sheets with broad coverage of diverse table structures and layouts. Our evaluation shows that TableSense is highly effective with 91.3% recall and 86.5% precision in the EoB-2 metric, a significant improvement over both the current detection algorithms used in commodity spreadsheet tools and state-of-the-art convolutional neural networks in computer vision.

Spreadsheets are a critical end-user development tool for data management and analysis. In spreadsheet data, the table is a key structure for data processing and information presentation. Automatic table detection is an important initial step for one-click intelligence features such as Ideas in Excel or Explore in Google Sheets, where insights can be recommended from the detected tables with automated end-to-end experience. Despite the importance of automatic table detection for spreadsheets, this problem has largely been overlooked for decades in both the research community and industry. Previous research on table detection has mainly targeted other media, e.g., HTML, images and PDFs. The aim is to retrieve (most likely single) table regions from the ambient text.

The major challenge for these techniques is the understanding of binary files based on metadata analysis and image processing, but the table boundaries are clear. The scenario with spreadsheet table detection is fundamentally different: a single sheet can have multiple tables cluttered around, with potentially different structures for each table. The diversity in multi-table layout and structure significantly confounds the problem with obfuscated table boundaries. To the best of our knowledge, there is no prior research effort on this problem in academia, while region-growth techniques are commonly used in commodity spreadsheet tools. However, region-growth is quite fragile in the presence of complicated table structures and layouts on the sheet.


A sample spreadsheet with three tables showing various artefacts. Dotted red bounding boxes and dashed green bounding boxes show the tables detected by TableSense and Mask R-CNN, respectively.


Problem Statement


Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. In such a task, an input sheet is represented by a matrix of cells. The output is a list of tables detected, where the range of each detected table is represented by a 4-tuple (col_left, row_top, col_right, row_bottom), which specifies the x and y coordinates for the top-left and bottom-right corners of the bounding box (bbox).
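As a minimal sketch of this output format, the 4-tuple can be modeled as a small Python type. The class name `TableRange` and its helper methods are illustrative assumptions, not part of TableSense, and the coordinates are assumed here to be zero-based and inclusive.

```python
from typing import List, NamedTuple


class TableRange(NamedTuple):
    """Hypothetical container for one detected table (inclusive cell indices)."""
    col_left: int
    row_top: int
    col_right: int
    row_bottom: int

    def n_cols(self) -> int:
        # Number of columns spanned by the bounding box.
        return self.col_right - self.col_left + 1

    def n_rows(self) -> int:
        # Number of rows spanned by the bounding box.
        return self.row_bottom - self.row_top + 1

    def contains(self, col: int, row: int) -> bool:
        # True if the cell at (col, row) lies inside the bounding box.
        return (self.col_left <= col <= self.col_right
                and self.row_top <= row <= self.row_bottom)


# The detector output for one sheet is then a list of such ranges.
detected: List[TableRange] = [TableRange(0, 0, 3, 9), TableRange(5, 2, 8, 6)]
```

Because `TableRange` is a tuple, it can still be unpacked as `col_left, row_top, col_right, row_bottom = detected[0]`.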

Datasets

All experimental data in the development of TableSense is from our WebSheet dataset, which is a web-crawled spreadsheet corpus including 4,290,022 sheets. WebSheet10k is a sampled subset of WebSheet for human labeling. It contains 10,220 sheets in English, where all table regions on each sheet have been labeled with a corresponding bounding box. To control labeling quality, each sheet has been labeled by a human labeler and then verified by another human labeler. To ensure high coverage of various table structures and multi-table layouts on sheets, we adopt an active learning framework to build WebSheet10k in iterations. Details are provided in the paper. WebSheet400 is our test set with labels, which contains 400 randomly sampled sheets with 795 tables from WebSheet without any overlap with WebSheet10k.


TableSense Framework

TableSense is a framework tailored for table detection. It is an end-to-end model containing a series of modules as follows.
• Cell featurization: Since cells do not have a canonical representation in the spreadsheet, we need to extract cell features before feeding them to the pipeline. Details of cell featurization are provided in the paper.
• CNN backbone: A CNN is the backbone of our framework to capture spatial correlations and learn high-level representations from the input cell matrix. A fully convolutional network is adopted here so that the model can process spreadsheets of various sizes without rescaling them.
• Table detection head: The two-stage detection mechanism, which achieves state-of-the-art results in computer vision, is adopted. In this module, the feature maps generated by the CNN backbone are fed to a Region Proposal Network (RPN), which further produces a list of Regions of Interest (RoIs). Then RoIAlign extracts feature maps from each RoI for bounding box regression. Then a CNN-based bounding box regression branch refines the boundaries of these RoIs, a CNN-based table classifier simultaneously scores these RoIs, and a segmentation branch generates the cell-level table mask. These branches are applied to each RoI separately. Finally, Non-Maximum Suppression (NMS) is used to rank the bounding boxes and filter redundant ones. For our task, RoIAlign, which is based on bilinear interpolation, can preserve more precise per-cell correspondence than RoIPool, which uses simple hard quantization.
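The final NMS step can be illustrated with a short, generic sketch. This is the standard greedy NMS procedure applied to cell-coordinate boxes, not the TableSense implementation; the function names, the inclusive `(col_left, row_top, col_right, row_bottom)` box convention, and the IoU threshold of 0.5 are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed convention: (col_left, row_top, col_right, row_bottom), inclusive.
Box = Tuple[int, int, int, int]


def area(b: Box) -> int:
    # Number of cells in the box; 0 if the box is empty (e.g. no overlap).
    return max(0, b[2] - b[0] + 1) * max(0, b[3] - b[1] + 1)


def iou(a: Box, b: Box) -> float:
    # Intersection over union of two boxes, counted in cells.
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter_area = area(inter)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 3, 9), (0, 0, 3, 8), (5, 2, 8, 6)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box is a near-duplicate of the first
```

In this toy input the second box overlaps the first with an IoU of 0.9, so it is suppressed, while the third box does not overlap either and is kept.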



The framework of TableSense for spreadsheet table detection






To guide active learning, we propose six measures below for the evaluation of sheet uncertainty.
• Classification uncertainty score: One minus the average classification probability returned by the softmax values for all detected table regions on the sheet.
• Mismatch score of segmentation and detection masks: One minus the IoU between the segmentation mask and detection mask. The detection mask is produced by setting the values of all cells inside the detected table region to 1 and 0 otherwise.
• Table/Sheet-level Sparsity factor: Table-level sparsity factor is given by the ratio of blank cells in the detected table region, while sheet-level sparsity is given by the lowest sparsity factor for all tables detected on the sheet.
• Overlapping region indicator: 1 if there is an overlap between any two detected table regions and 0 otherwise.
• Boundary mismatch indicator: 1 if there is a boundary mismatch and 0 otherwise. A mismatch is identified if any detected boundary is on a blank column or row.
• Out-of-region coverage ratio: The ratio of the number of non-blank cells outside the detected table regions to the total number of non-blank cells.
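The six measures above can be sketched in plain Python as follows. This is an illustrative reading of the definitions, not the authors' code. The sheet representation (`sheet[row][col]`, with `None` or `""` for a blank cell), the inclusive box convention, all function names, the handling of sheets with no detections, and the reading of "boundary on a blank row or column" as "every cell along that edge of the box is blank" are assumptions.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]    # (col_left, row_top, col_right, row_bottom), inclusive
Sheet = List[List[Optional[str]]]  # sheet[row][col]; None or "" marks a blank cell


def is_blank(v: Optional[str]) -> bool:
    return v is None or v == ""


def cells(box: Box) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) for every cell inside the box."""
    cl, rt, cr, rb = box
    for r in range(rt, rb + 1):
        for c in range(cl, cr + 1):
            yield r, c


def classification_uncertainty(probs: Sequence[float]) -> float:
    if not probs:
        return 1.0  # assumption: no detections counts as maximally uncertain
    return 1.0 - sum(probs) / len(probs)


def mask_mismatch(seg_mask: List[List[int]], boxes: Sequence[Box]) -> float:
    # Detection mask = all cells inside any detected box.
    det = {rc for b in boxes for rc in cells(b)}
    seg = {(r, c) for r, row in enumerate(seg_mask) for c, v in enumerate(row) if v}
    union = det | seg
    return 1.0 - len(det & seg) / len(union) if union else 0.0


def table_sparsity(sheet: Sheet, box: Box) -> float:
    inside = list(cells(box))
    return sum(is_blank(sheet[r][c]) for r, c in inside) / len(inside)


def sheet_sparsity(sheet: Sheet, boxes: Sequence[Box]) -> float:
    # Lowest table-level sparsity on the sheet, as stated in the text.
    return min(table_sparsity(sheet, b) for b in boxes)


def overlap_indicator(boxes: Sequence[Box]) -> int:
    seen = set()
    for b in boxes:
        current = set(cells(b))
        if current & seen:
            return 1
        seen |= current
    return 0


def boundary_mismatch(sheet: Sheet, boxes: Sequence[Box]) -> int:
    for cl, rt, cr, rb in boxes:
        edges = [
            [(rt, c) for c in range(cl, cr + 1)],  # top edge
            [(rb, c) for c in range(cl, cr + 1)],  # bottom edge
            [(r, cl) for r in range(rt, rb + 1)],  # left edge
            [(r, cr) for r in range(rt, rb + 1)],  # right edge
        ]
        # Assumed reading: an edge mismatches if all of its cells are blank.
        if any(all(is_blank(sheet[r][c]) for r, c in edge) for edge in edges):
            return 1
    return 0


def out_of_region_ratio(sheet: Sheet, boxes: Sequence[Box]) -> float:
    covered = {rc for b in boxes for rc in cells(b)}
    non_blank = [(r, c) for r, row in enumerate(sheet)
                 for c, v in enumerate(row) if not is_blank(v)]
    if not non_blank:
        return 0.0
    return sum(rc not in covered for rc in non_blank) / len(non_blank)


# Toy 3x4 sheet with one detected table covering the top-left 2x2 block.
sheet: Sheet = [
    ["a", "b", "", ""],
    ["1", "", "", "x"],
    ["", "", "", ""],
]
boxes: List[Box] = [(0, 0, 1, 1)]
```

How these scores are combined into a single sampling criterion is not covered here; the sketch only shows each measure on its own.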


Conclusion and Future Work

In this paper, we propose the TableSense suite to address the challenges in spreadsheet table detection. TableSense is a unified, end-to-end framework customized from CNN with several key enhancements. First, we propose a featurization scheme to encode cell features. Second, we devise a precise bounding box regression (PBR) module to predict precise bounding boxes and incorporate it into the framework. Third, we use active learning to effectively select low-confidence sheets for human labeling in building up the training dataset. In the future, we will leverage the TableSense technique for automated table structure analysis and make a further step in spreadsheet intelligence.

