By Zhiheng Ma, Xing Wei, Xiaopeng Hong, Yihong Gong
Research Center for Artificial Intelligence, Peng Cheng Laboratory
Abstract
In crowd counting datasets, each person is annotated by a point, which is usually the center of the head, and the task is to estimate the total count in a crowd scene. Most state-of-the-art methods are based on density map estimation: they convert the sparse point annotations into a “ground-truth” density map through a Gaussian kernel and then use it as the learning target to train a density map estimator. However, such a “ground-truth” density map is imperfect due to occlusions, perspective effects, variations in object shapes, etc. In contrast, we propose Bayesian loss, a novel loss function which constructs a density contribution probability model from the point annotations. Instead of constraining the value at every pixel in the density map, the proposed training loss adopts a more reliable supervision on the count expectation at each annotated point. Without bells and whistles, the loss function makes substantial improvements over the baseline loss on all tested datasets. Moreover, our proposed loss function equipped with a standard backbone network, without using any external detectors or multi-scale architectures, performs favourably against the state of the art. Our method outperforms previous best approaches by a large margin on the latest and largest UCF-QNRF dataset.
Counting dense crowds can be used to estimate the number of participants in political rallies, civil unrest, social and sport events, etc. In addition, methods for crowd counting have great potential for similar tasks in other domains, including estimating the number of vehicles in traffic congestion, counting cells and bacteria in microscopic images, and estimating animal crowds for ecological surveys.
Crowd counting is a very challenging task because:
1) People in dense crowds often heavily overlap and occlude each other.
2) Perspective effects may cause large variations in human size, shape, and appearance in the image.
Related Work
Detection-then-counting.
Most of the early works estimate the crowd count by detecting or segmenting individual objects in the scene. Methods of this kind face great challenges in two respects. First, they produce finer-grained results (e.g. bounding boxes or instance masks) than the overall count requires, which is computationally expensive and mostly suitable for lower-density crowds. In overcrowded scenes, clutter and severe occlusions make it infeasible to detect every single person, despite the progress in related fields. Second, training object detectors requires bounding-box or instance mask annotations, which are much more labour-intensive to produce in dense crowds.
Direct count regression.
To avoid the more complex detection problem, some researchers proposed to directly learn a mapping from image features to their counts. Earlier methods in this category rely on hand-crafted features, such as SIFT and LBP, and then learn a regression model. Chan et al. proposed to extract edge, texture and other low-level features of the crowds, and learn a Gaussian Process regression model for crowd counting.
Density map estimation.
Methods of this kind take advantage of the location information to learn a map of density values for each training sample, and the final count estimate is obtained by summing over the predicted density map. Lempitsky and Zisserman proposed to transform the point annotations into a density map by a Gaussian kernel and use it as “ground-truth”. They then train their models using a least-squares objective. This training framework has been widely used in recent methods. Furthermore, thanks to the excellent feature learning ability of deep CNNs, CNN-based density map estimation methods have achieved state-of-the-art performance for crowd counting. One major problem of this framework is how to determine the optimal size of the Gaussian kernel, which is influenced by many factors. To make matters worse, the models are trained with a loss function that applies supervision in a pixel-to-pixel manner. As a result, the performance of such methods depends heavily on the quality of the generated “ground-truth” density maps.
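The density map pipeline described above can be illustrated with a minimal sketch. It is not the code of any cited method: the function names, the fixed kernel width `sigma`, and the choice to normalize each kernel over the image (so that every annotated person contributes exactly 1 to the sum) are assumptions made for illustration.

```python
import math

def gaussian_density_map(points, height, width, sigma):
    """Build a "ground-truth" density map from (row, col) head annotations."""
    density = [[0.0] * width for _ in range(height)]
    for py, px in points:
        # Unnormalized Gaussian kernel centred on the annotated point.
        kernel = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2.0 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        total = sum(map(sum, kernel))
        # Assumption: normalize over the image so each person sums to 1.
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density

def count_from_density(density):
    """The count estimate is the sum over all pixels of the density map."""
    return sum(map(sum, density))

def pixelwise_l2_loss(pred, target):
    """Least-squares objective applied pixel by pixel."""
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row))

points = [(4, 4), (10, 12)]
narrow = gaussian_density_map(points, 16, 16, sigma=1.0)
wide = gaussian_density_map(points, 16, 16, sigma=2.0)
```

Both `narrow` and `wide` sum to the same count of 2, yet the pixel-wise loss between them is non-zero. This is the kernel-size sensitivity noted above: two targets that agree on the count can still disagree at every pixel.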
Hybrid training.
Several works observed that crowd counting benefits from mixed training strategies, such as multi-task and multi-loss training. DecideNet was proposed to adaptively decide whether to use a detection model or a density map estimation model. This approach takes advantage of a mixture of experts, where a detection-based model can estimate crowds accurately in low-density scenes while the density map estimation model is good at handling crowded scenes. However, this method requires external pre-trained human detection models and is less efficient. Some researchers proposed to combine multiple losses to assist each other. One approach trains a deep CNN by alternately optimizing a pixel-wise loss function and a global count regression loss. A similar training approach was adopted by Zhang et al., who first train their model via the density map loss and then add a relative count loss in the last few epochs.
Visualization of the posterior label probability. We construct an entropy map using Eq. (15), which measures the uncertainty about which label a pixel in the density map belongs to. The warmer the color, the larger the value. (a) Input image. (b)-(c): Entropy maps with different σ, without background pixel modelling. (e)-(f): Entropy maps with different d, with background pixel modelling. (d): Blend of the input image and the entropy map in (e).
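The entropy map in this caption can be sketched as follows. Eq. (15) is not reproduced in this text, so the sketch rests on assumptions: an isotropic Gaussian likelihood of width `sigma` around each annotated head, equal priors over labels, Shannon entropy of the resulting posterior, and a hypothetical background model in which a dummy point sits at distance `bg_dist` (the d of the caption) from the nearest head. All function and parameter names are illustrative.

```python
import math

def posterior(pixel, heads, sigma, bg_dist=None):
    """Posterior over labels for one pixel; with bg_dist set, index 0 is background."""
    sq = [(pixel[0] - hy) ** 2 + (pixel[1] - hx) ** 2 for hy, hx in heads]
    if bg_dist is not None:
        # Assumed background model: a dummy point at distance bg_dist from
        # the nearest head, on the ray from that head through the pixel.
        nearest = math.sqrt(min(sq))
        sq = [(bg_dist - nearest) ** 2] + sq
    # Subtract the smallest squared distance before exponentiating so that
    # at least one weight equals 1 (avoids underflow far from every head).
    smallest = min(sq)
    weights = [math.exp(-(s - smallest) / (2.0 * sigma ** 2)) for s in sq]
    norm = sum(weights)
    return [w / norm for w in weights]

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_map(heads, height, width, sigma, bg_dist=None):
    """Entropy of the label posterior at every pixel of the grid."""
    return [[entropy(posterior((y, x), heads, sigma, bg_dist))
             for x in range(width)] for y in range(height)]

heads = [(2, 2), (2, 10)]
emap = entropy_map(heads, 5, 13, sigma=2.0)
```

Under these assumptions the entropy peaks midway between two heads, where the label is most ambiguous, and falls towards zero at an annotated point. With `bg_dist` set, pixels far from every head are assigned mostly to the background label.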
The experimental results are shown in Table 1, and the highlights can be summarized as follows:
• BAYESIAN+ achieves state-of-the-art accuracy on all four benchmark datasets. On the latest and toughest UCF-QNRF dataset, it reduces the MAE and MSE values relative to the previous best method (CL-CNN). It is worth mentioning that our method does not use any external detection models or multi-scale structures.
• BAYESIAN+ consistently improves the performance of BAYESIAN by around 3% on all four datasets.
• Both BAYESIAN and BAYESIAN+ outperform BASELINE significantly on all four datasets. BAYESIAN+ makes improvements of 15% on UCF-QNRF, 9% on ShanghaiTechA, 8% on ShanghaiTechB, and 8% on UCF_CC_50.
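For readers unfamiliar with the metrics, the following sketch shows how MAE, MSE, and a relative improvement can be computed. It assumes the convention common in the crowd counting literature, where "MSE" is the square root of the mean squared count error, and it assumes the percentages above are relative error reductions. The counts in the example are made-up toy values, not results from Table 1.

```python
import math

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)

def mse(pred_counts, true_counts):
    """Assumed convention: square root of the mean squared count error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts))
                     / len(true_counts))

def relative_improvement(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error

# Toy values for illustration only.
toy_pred = [10.0, 20.0]
toy_true = [12.0, 17.0]
toy_mae = mae(toy_pred, toy_true)
toy_mse = mse(toy_pred, toy_true)
```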
Conclusions
We propose a novel loss function for crowd count estimation with point supervision. Unlike previous methods, which transform point annotations into “ground-truth” density maps using a Gaussian kernel and apply pixel-wise supervision, our loss function adopts more reliable supervision on the count expectation at each annotated point. Extensive experiments have demonstrated the advantages of our proposed method in terms of accuracy, robustness, and generalization. The current form of our formulation is fairly general and can easily incorporate other knowledge, e.g., specific foreground or background priors, scale and temporal likelihoods, and other facts, to further improve the proposed method.
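The idea of supervising the count expectation at each annotated point can be sketched as below. The exact formulation is not given in this text, so the sketch makes assumptions: the posterior of each pixel over annotated points comes from isotropic Gaussians of width `sigma` with equal priors, the expected count of a point is the posterior-weighted sum of the predicted density, and the loss is the sum of absolute deviations from 1. Background modelling is omitted, and all names are illustrative.

```python
import math

def expected_counts(pred_density, heads, sigma):
    """Expected count per annotated head under an assumed Gaussian posterior."""
    expected = [0.0] * len(heads)
    for y, row in enumerate(pred_density):
        for x, value in enumerate(row):
            sq = [(y - hy) ** 2 + (x - hx) ** 2 for hy, hx in heads]
            smallest = min(sq)  # subtracted for numerical stability
            weights = [math.exp(-(s - smallest) / (2.0 * sigma ** 2)) for s in sq]
            norm = sum(weights)
            for n, w in enumerate(weights):
                # Each pixel's density is shared among heads by posterior.
                expected[n] += (w / norm) * value
    return expected

def bayesian_loss(pred_density, heads, sigma):
    """Sum over annotated points of |1 - expected count| (assumed form)."""
    return sum(abs(1.0 - e) for e in expected_counts(pred_density, heads, sigma))

heads = [(1, 1), (1, 6)]
empty = [[0.0] * 8 for _ in range(3)]
peaked = [[0.0] * 8 for _ in range(3)]
peaked[1][1] = 1.0
peaked[1][6] = 1.0
```

In this sketch an all-zero prediction such as `empty` costs one unit per annotated person, while `peaked`, which puts unit mass at each head, costs almost nothing. No individual pixel value is constrained, only the aggregated expectation per point.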