Crowd Count Estimation with Point Supervision

-By Zhiheng Ma, Xing Wei, Xiaopeng Hong, Yihong Gong

Research Center for Artificial Intelligence, Peng Cheng Laboratory

Abstract

In crowd counting datasets, each person is annotated by a point, which is usually the center of the head. The task is to estimate the total count in a crowd scene. Most state-of-the-art methods are based on density map estimation: they convert the sparse point annotations into a “ground-truth” density map through a Gaussian kernel, and then use it as the learning target to train a density map estimator. However, such a “ground-truth” density map is imperfect due to occlusions, perspective effects, variations in object shapes, etc. In contrast, we propose Bayesian loss, a novel loss function which constructs a density contribution probability model from the point annotations. Instead of constraining the value at every pixel in the density map, the proposed training loss adopts a more reliable supervision on the count expectation at each annotated point. Without bells and whistles, the loss function makes substantial improvements over the baseline loss on all tested datasets. Moreover, our proposed loss function equipped with a standard backbone network, without using any external detectors or multi-scale architectures, performs favourably against the state of the art. Our method outperforms previous best approaches by a large margin on the latest and largest UCF-QNRF dataset.

Counting dense crowds is useful for estimating the number of participants in political rallies, civil unrest, social and sport events, etc. In addition, methods for crowd counting have great potential to handle similar tasks in other domains, including estimating the number of vehicles in traffic congestion, counting cells and bacteria in microscopic images, and estimating animal crowds for ecological surveys.

Crowd counting is a very challenging task because:
1) People in dense crowds often heavily overlap and occlude each other.
2) Perspective effects may cause large variations in human size, shape, and appearance in the image.

Related Work

Detection-then-counting.

Most of the early works estimate crowd count by detecting or segmenting individual objects in the scene. Methods of this kind face great challenges in two respects. Firstly, they produce more detailed results (e.g. bounding boxes or masks of instances) than the overall count requires, which is computationally expensive and mostly suitable for lower-density crowds. In overcrowded scenes, clutter and severe occlusions make it infeasible to detect every single person, despite the progress in related fields. Secondly, training object detectors requires bounding-box or instance mask annotations, which are much more labour-intensive to produce in dense crowds.

Direct count regression.

To avoid the more complex detection problem, some researchers proposed to directly learn a mapping from image features to their counts. Earlier methods in this category rely on hand-crafted features, such as SIFT and LBP, and then learn a regression model. Chan proposed to extract edge, texture and other low-level features of the crowds, and learn a Gaussian Process regression model for crowd counting.

Density map estimation.

Methods of this kind take advantage of the location information to learn a map of density values for each training sample, and the final count estimate is obtained by summing over the predicted density map. Lempitsky and Zisserman proposed to transform the point annotations into a density map with a Gaussian kernel and use it as “ground-truth”. They then train their models using a least-squares objective. This training framework has been widely used in recent methods. Furthermore, thanks to the excellent feature learning ability of deep CNNs, CNN-based density map estimation methods have achieved state-of-the-art performance for crowd counting. One major problem of this framework is how to determine the optimal size of the Gaussian kernel, which is influenced by many factors. To make matters worse, the models are trained by a loss function which applies supervision in a pixel-to-pixel manner. As a result, the performance of such methods depends heavily on the quality of the generated “ground-truth” density maps.
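The density map pipeline described above can be illustrated with a minimal sketch. It is not the code of any cited method: the function names, the fixed `sigma`, and the choice to normalize each Gaussian over the image grid (so that every person contributes exactly 1 to the sum) are assumptions made for illustration, and only the Python standard library is used.

```python
import math

def gaussian_density_map(points, height, width, sigma=2.0):
    """Build a "ground-truth" density map from (row, col) point annotations.

    Assumption: each Gaussian is normalized over the image grid, so the
    whole map sums to the number of annotated points.
    """
    density = [[0.0] * width for _ in range(height)]
    for (pr, pc) in points:
        # Unnormalized Gaussian weight of every pixel for this point.
        weights = [
            [math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(map(sum, weights))
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / total
    return density

def count_from_density(density):
    """The count estimate is the sum over the density map."""
    return sum(map(sum, density))

# Hypothetical annotations: three heads in a 16 x 20 image.
points = [(4, 5), (10, 12), (11, 13)]
density = gaussian_density_map(points, height=16, width=20, sigma=2.0)
estimated_count = count_from_density(density)
```

Changing `sigma` changes the value at almost every pixel while the sum stays the same, which is one way to see why pixel-to-pixel supervision is sensitive to the kernel size.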

Hybrid training.

Several works observed that crowd counting benefits from mixed training strategies such as multi-task and multi-loss learning. DecideNet was proposed to adaptively decide whether to use a detection model or a density map estimation model. This approach takes advantage of a mixture of experts, where a detection-based model can estimate crowds accurately in low-density scenes while the density map estimation model is good at handling crowded scenes. However, this method requires external pre-trained human detection models and is less efficient. Some researchers proposed to combine multiple losses that assist each other. One approach trains a deep CNN by alternately optimizing a pixel-wise loss function and a global count regression loss. A similar training approach was adopted by Zhang, in which the model is first trained with the density map loss and a relative count loss is then added in the last few epochs.

Figure: Visualization of the posterior label probability. We construct an entropy map using Eq. (15), which measures the uncertainty about which label a pixel in the density map belongs to. The warmer the color, the larger the value. (a) Input image. (b)-(c): Entropy maps with different σ, without background pixel modelling. (e)-(f): Entropy maps with different d, with background pixel modelling. (d): Blend of the input image and the entropy map in (e).
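A rough sketch of how such an entropy map might be computed is given below. Eq. (15) is not reproduced in this summary, so the sketch makes two assumptions: the posterior label probability of a pixel is obtained by normalizing isotropic Gaussian likelihoods centred on the annotated points, and the uncertainty is the Shannon entropy of that posterior. Background pixel modelling (the parameter d in the figure) is left out, and all function names are hypothetical.

```python
import math

def posterior_label_probability(pixel, points, sigma):
    """Assumed posterior: Gaussian likelihoods around each point, normalized."""
    logits = [
        -((pixel[0] - r) ** 2 + (pixel[1] - c) ** 2) / (2 * sigma ** 2)
        for (r, c) in points
    ]
    # Subtract the maximum before exponentiating to avoid underflow to 0/0.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_map(points, height, width, sigma):
    """Shannon entropy of the posterior at every pixel (assumed uncertainty measure)."""
    result = []
    for r in range(height):
        row = []
        for c in range(width):
            probs = posterior_label_probability((r, c), points, sigma)
            row.append(-sum(p * math.log(p) for p in probs if p > 0.0))
        result.append(row)
    return result

# Two hypothetical head annotations on the same row of a 5 x 11 image.
points = [(2, 2), (2, 8)]
emap = entropy_map(points, height=5, width=11, sigma=1.0)
```

In this toy case the entropy is highest at pixels that are equally far from both points, such as `emap[2][5]`, and close to zero at the annotated points themselves.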

The experimental results in Table 1 and the highlights can be summarized as follows:
• BAYESIAN+ achieves state-of-the-art accuracy on all four benchmark datasets. On the latest and toughest dataset, UCF-QNRF, it reduces both the MAE and the MSE relative to the previous best method (CL-CNN). It is worth mentioning that our method does not use any external detection models or multi-scale structures.
• BAYESIAN+ consistently improves on the performance of BAYESIAN by around 3% on all four datasets.
• Both BAYESIAN and BAYESIAN+ outperform BASELINE significantly on all four datasets. BAYESIAN+ improves on BASELINE by 15% on UCF-QNRF, 9% on ShanghaiTechA, 8% on ShanghaiTechB, and 8% on UCF_CC_50.
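The metrics behind these comparisons are not defined in this summary. The sketch below assumes the convention common in crowd counting, where MAE is the mean absolute count error over test images and "MSE" denotes the root of the mean squared count error, and it assumes the percentages are relative error reductions. The counts used are made up for illustration and are not results from the paper.

```python
import math

def mae(true_counts, predicted_counts):
    """Mean absolute error between per-image counts."""
    n = len(true_counts)
    return sum(abs(t - p) for t, p in zip(true_counts, predicted_counts)) / n

def mse(true_counts, predicted_counts):
    """Root of the mean squared count error (assumed meaning of "MSE" here)."""
    n = len(true_counts)
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(true_counts, predicted_counts)) / n
    )

def relative_improvement(baseline_error, new_error):
    """Error reduction in percent relative to the baseline error."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Made-up per-image counts, for illustration only.
true_counts = [100, 250, 40]
predicted_counts = [110, 240, 45]
example_mae = mae(true_counts, predicted_counts)
example_mse = mse(true_counts, predicted_counts)
```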

Conclusions

The authors propose a novel loss function for crowd count estimation with point supervision. Unlike previous methods, which transform point annotations into “ground-truth” density maps using a Gaussian kernel and apply pixel-wise supervision, the proposed loss function adopts more reliable supervision on the count expectation at each annotated point. Extensive experiments have demonstrated the advantages of the proposed method in terms of accuracy, robustness, and generalization. The current form of the formulation is fairly general and can easily incorporate other knowledge, e.g., specific foreground or background priors, scale and temporal likelihoods, and other facts, to further improve the proposed method.
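To make "supervision on the count expectation at each annotated point" more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that the posterior label probability of a pixel is a normalized Gaussian likelihood around each point, that the expected count of a point is the posterior-weighted sum of the predicted density, and that the loss is the sum of absolute differences between each expected count and 1. The background term used by BAYESIAN+ is omitted, and all names are hypothetical.

```python
import math

def posterior(pixel, points, sigma):
    """Assumed posterior label probability of one pixel over all points."""
    logits = [
        -((pixel[0] - r) ** 2 + (pixel[1] - c) ** 2) / (2 * sigma ** 2)
        for (r, c) in points
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bayesian_loss(density, points, sigma):
    """Return (loss, expected_counts) for a predicted density map.

    Each annotated point should receive an expected count of 1.
    """
    expected = [0.0] * len(points)
    for r, row in enumerate(density):
        for c, value in enumerate(row):
            for n, prob in enumerate(posterior((r, c), points, sigma)):
                expected[n] += prob * value
    loss = sum(abs(1.0 - e) for e in expected)
    return loss, expected

# Hypothetical 3 x 7 prediction placing one unit of density on each head.
points = [(1, 1), (1, 5)]
good_density = [[0.0] * 7 for _ in range(3)]
good_density[1][1] = 1.0
good_density[1][5] = 1.0
good_loss, good_expected = bayesian_loss(good_density, points, sigma=0.5)

# An all-zero prediction misses both people.
empty_density = [[0.0] * 7 for _ in range(3)]
empty_loss, _ = bayesian_loss(empty_density, points, sigma=0.5)
```

Because the posterior sums to 1 at every pixel, the expected counts add up to the total predicted density. The loss therefore constrains per-person counts without fixing the value of any individual pixel.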

TableSense: Spreadsheet Table Detection with Convolutional Neural Networks

- By Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang

Microsoft Research, Beijing 100080, China. Beihang University, Beijing 100191, China

Abstract

Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. However, the detection task is challenged by the diversity of table structures and table layouts on the spreadsheet. Considering the analogy between a cell matrix as spreadsheet and a pixel matrix as image, and encouraged by the successful application of Convolutional Neural Networks (CNN) in computer vision, we have developed TableSense, a novel end-to-end framework for spreadsheet table detection. First, we devise an effective cell featurization scheme to better leverage the rich information in each cell; second, we develop an enhanced convolutional neural network model for...
