A Baseline for 3D Multi-Object Tracking

- By Xinshuo Weng, Kris Kitani, Carnegie Mellon University

Abstract

3D multi-object tracking (MOT) is an essential component technology for many real-time applications such as autonomous driving or assistive robotics. However, recent works on 3D MOT tend to focus on developing accurate systems, giving less regard to computational cost and system complexity. In contrast, this work proposes a simple yet accurate real-time baseline 3D MOT system. We use an off-the-shelf 3D object detector to obtain oriented 3D bounding boxes from the LiDAR point cloud. Then, a combination of a 3D Kalman filter and the Hungarian algorithm is used for state estimation and data association. Although our baseline system is a straightforward combination of standard methods, we obtain state-of-the-art results. To evaluate our baseline system, we propose a new 3D MOT extension to the official KITTI 2D MOT evaluation along with two new metrics. Our proposed baseline method establishes new state-of-the-art performance on 3D MOT for KITTI, improving the 3D MOTA from 72.23 for prior art to 76.47. Surprisingly, when we project our 3D tracking results to the 2D image plane and compare against published 2D MOT methods, our system places 2nd on the official KITTI leaderboard. Also, our proposed 3D MOT method runs at a rate of 214.7 FPS, 65× faster than the state-of-the-art 2D MOT system.

Multi-object tracking (MOT) is an essential component technology for many vision applications such as autonomous driving, robot collision prediction, and video face alignment. Due to significant advances in object detection, there has been much progress on MOT. For example, for the car class on the KITTI MOT benchmark, the MOTA (multi-object tracking accuracy) has improved from 57.03 to 84.24 in two years. Although accuracy has improved significantly, it has come at the cost of increasing system complexity and computational cost. Complex systems make modular analysis challenging, and it is not always clear which part of the system contributes the most to performance. For example, leading works have substantially different system pipelines but only minor differences in performance. The adverse effect of increased computational cost is also obvious: despite their excellent accuracy, these systems put real-time tracking out of reach.

In contrast to prior work, which tends to prioritize accuracy over system complexity and computational cost, this work aims to develop an accurate, simple, and real-time 3D MOT system. We show that our proposed system, which combines the minimal components for 3D MOT, works extremely well. On the KITTI dataset, our system establishes new state-of-the-art performance on 3D MOT. Surprisingly, if we project our 3D tracking results to the 2D image plane and compare against all published 2D MOT methods, our system places 2nd on the official KITTI leaderboard. In addition, due to the simplicity of our system, it can run at a rate of 214.7 FPS on the KITTI test set, 65 times faster than the state-of-the-art MOT system BeyondPixels. Compared with other real-time MOT systems such as Complexer-YOLO, LP-SSVM, 3D-CNN/PMBM, and MCMOT-CPD, our system is at least twice as fast and achieves much higher accuracy.

Related Work

2D Multi-Object Tracking.

Recent 2D MOT systems can mostly be split into two categories based on how they perform data association: batch methods and online methods. Batch methods attempt to find the globally optimal solution over the entire sequence. They often build a network flow graph, which can be solved by min-cost flow algorithms. Online methods, on the other hand, consider only the detections at the current frame and are usually efficient enough for real-time applications. These methods often formulate data association as a bipartite graph matching problem and solve it using the Hungarian algorithm. Beyond using the Hungarian algorithm in a post-processing step, modern online methods design deep association networks that construct the association using neural networks. Our MOT system also belongs to the online methods. For simplicity and real-time efficiency, we adopt the original Hungarian algorithm without using neural networks. Independent of the data association, designing a proper cost function for the affinity measure is also crucial to an MOT system. Early works employ hand-crafted features such as spatial distance and colour histograms as the cost function. Modern methods instead apply a motion model and learn appearance features. In contrast to prior works, which combine appearance and motion models in a complicated way, we choose to employ only the simplest motion model, i.e., constant velocity, without using an appearance model.
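To make the bipartite matching step concrete, here is a minimal sketch of online data association solved with the Hungarian algorithm, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the cost is the Euclidean distance between object centres, and the function names (hungarian, associate) and the max_dist gating threshold are hypothetical choices made for this example.

```python
import math


def hungarian(cost):
    """Minimum-cost assignment for a matrix with rows <= columns.

    Returns a list of (row, col) pairs, one pair per row.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: 1-based row assigned to column j (0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the assignments along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]


def associate(detections, tracks, max_dist):
    """Match detection centres to predicted track centres.

    Assumption: cost is Euclidean distance; matches farther apart than
    max_dist (a hypothetical gate) are rejected.
    Returns (matches, unmatched_detections, unmatched_tracks).
    """
    if not detections or not tracks:
        return [], list(range(len(detections))), list(range(len(tracks)))
    cost = [[math.dist(d, t) for t in tracks] for d in detections]
    if len(detections) <= len(tracks):
        pairs = hungarian(cost)
    else:
        # The solver needs rows <= columns, so solve the transposed problem.
        transposed = [list(col) for col in zip(*cost)]
        pairs = [(d, t) for t, d in hungarian(transposed)]
    matches = sorted((d, t) for d, t in pairs if cost[d][t] <= max_dist)
    matched_d = {d for d, _ in matches}
    matched_t = {t for _, t in matches}
    unmatched_d = [d for d in range(len(detections)) if d not in matched_d]
    unmatched_t = [t for t in range(len(tracks)) if t not in matched_t]
    return matches, unmatched_d, unmatched_t
```

In a tracker built this way, unmatched detections would typically start new tracks and unmatched tracks would age out after a few frames; those policies are outside the scope of this sketch.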

3D Multi-Object Tracking.

Most 3D MOT systems share the same components as 2D MOT systems. The only distinction is that the detection boxes live in 3D space instead of the image plane. This opens up the possibility of designing motion and appearance models in 3D space without perspective distortion. One line of work proposes an image-based method which estimates the location of objects in image space and also their distance to the camera in 3D; a Poisson multi-Bernoulli mixture filter is then used to estimate the 3D velocity of the objects. [30] applies an unscented Kalman filter in the bird's eye view to estimate not only the 3D velocity but also the angular velocity. Another work proposes a 2D-3D Kalman filter to jointly utilize observations from the image and the 3D world. Instead of using hand-crafted filters, other works design Siamese networks to learn the filters from data. Unlike previous works which use complicated filters, our proposed system employs only the original Kalman filter for simplicity, but extends its state space to the full 3D domain, including not only 3D velocity but also the 3D size, 3D location and heading angle of the objects.
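As a rough illustration of the constant-velocity Kalman filter idea, the sketch below tracks a single coordinate with state [position, velocity]. It is a simplified, one-axis example, not the paper's filter: a full 3D state would stack location, size, heading angle and velocity into one vector. The class name and the noise parameters (p_var, v_var, q, r) are assumptions chosen for this example.

```python
class ConstantVelocityKF1D:
    """Kalman filter for one coordinate with state [position, velocity]."""

    def __init__(self, z0, p_var=1.0, v_var=10.0, q=0.01, r=0.1):
        # Start at the first measurement with unknown (zero) velocity.
        self.p, self.v = float(z0), 0.0
        # Covariance entries of P = [[P00, P01], [P10, P11]].
        self.P00, self.P01, self.P10, self.P11 = p_var, 0.0, 0.0, v_var
        self.q, self.r = q, r  # assumed process and measurement noise

    def predict(self, dt=1.0):
        """Propagate with x' = F x, F = [[1, dt], [0, 1]]."""
        self.p += dt * self.v
        # P' = F P F^T + q * I
        P00 = self.P00 + dt * (self.P01 + self.P10) + dt * dt * self.P11 + self.q
        P01 = self.P01 + dt * self.P11
        P10 = self.P10 + dt * self.P11
        P11 = self.P11 + self.q
        self.P00, self.P01, self.P10, self.P11 = P00, P01, P10, P11
        return self.p

    def update(self, z):
        """Correct with a position measurement z (H = [1, 0])."""
        y = z - self.p            # innovation
        S = self.P00 + self.r     # innovation variance (a scalar here)
        K0, K1 = self.P00 / S, self.P10 / S  # Kalman gain
        self.p += K0 * y
        self.v += K1 * y
        # P = (I - K H) P
        P00 = (1 - K0) * self.P00
        P01 = (1 - K0) * self.P01
        P10 = self.P10 - K1 * self.P00
        P11 = self.P11 - K1 * self.P01
        self.P00, self.P01, self.P10, self.P11 = P00, P01, P10, P11


# Usage: an object moving one unit per frame along this axis.
kf = ConstantVelocityKF1D(z0=0.0)
for frame in range(1, 21):
    kf.predict()
    kf.update(float(frame))
```

Because only the position is observed, the innovation variance is a scalar and no matrix inversion is needed; the velocity estimate is inferred entirely from how the position measurements change between frames.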

3D Object Detection.

Because 3D object detection is an indispensable component of 3D MOT, the quality of the detected 3D bounding boxes matters. Prior works mainly focus on processing LiDAR point cloud inputs. Some methods divide the point cloud into equally-spaced 3D voxels and apply 3D CNNs for 3D bounding box prediction. Others convert the point cloud to a bird's eye view representation for efficiency and exploit 2D convolutions. Other works directly process the point cloud inputs using PointNet++ for 3D detection. In addition, instead of using the point cloud, some methods perform 3D detection from only a single image.
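The two grid-based representations mentioned above can be sketched in a few lines of Python. This is only an illustration of the preprocessing idea, not any specific detector's pipeline; the function names, the voxel_size parameter, and the assumption that x and y span the ground plane are choices made for this example.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group points into equally-spaced 3D voxels.

    Each point is a tuple starting with (x, y, z); extra channels such as
    intensity are kept with the point but ignored for indexing.
    Returns {(ix, iy, iz): [points in that voxel]}.
    """
    voxels = defaultdict(list)
    for point in points:
        index = tuple(
            math.floor((c - o) / voxel_size) for c, o in zip(point, origin)
        )
        voxels[index].append(point)
    return dict(voxels)


def to_bev_counts(voxels):
    """Collapse the height axis to get per-cell point counts in bird's eye view.

    Assumption: x and y span the ground plane and z is height.
    """
    counts = defaultdict(int)
    for (ix, iy, _iz), pts in voxels.items():
        counts[(ix, iy)] += len(pts)
    return dict(counts)


# Usage with three hypothetical LiDAR points and 0.5 m voxels.
cloud = [(0.1, 0.2, 0.3), (0.4, 0.1, 0.2), (1.2, 0.0, -0.1)]
grid = voxelize(cloud, voxel_size=0.5)
bev = to_bev_counts(grid)
```

A voxel grid like this would feed a 3D CNN, while the collapsed bird's eye view grid can be treated as a 2D image and processed with ordinary 2D convolutions.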

Please refer to the paper for implementation details and evaluation metrics.

TableSense: Spreadsheet Table Detection with Convolutional Neural Networks

- By Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang. Microsoft Research, Beijing 100080, China. Beihang University, Beijing 100191, China

Paper Link

Abstract

Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. However, the detection task is challenged by the diversity of table structures and table layouts on the spreadsheet. Considering the analogy between a cell matrix as spreadsheet and a pixel matrix as image, and encouraged by the successful application of Convolutional Neural Networks (CNN) in computer vision, we have developed TableSense, a novel end-to-end framework for spreadsheet table detection. First, we devise an effective cell featurization scheme to better leverage the rich information in each cell; second, we develop an enhanced convolutional neural network model for...

SRI Blog
