
Crowd Count Estimation with Point Supervision


- By Zhiheng Ma, Xing Wei, Xiaopeng Hong, Yihong Gong

Research Center for Artificial Intelligence, Peng Cheng Laboratory


Abstract

In crowd counting datasets, each person is annotated by a point, usually the center of the head, and the task is to estimate the total count in a crowd scene. Most state-of-the-art methods are based on density map estimation: they convert the sparse point annotations into a "ground-truth" density map through a Gaussian kernel, and then use it as the learning target to train a density map estimator. However, such a "ground-truth" density map is imperfect due to occlusions, perspective effects, variations in object shapes, etc. In contrast, we propose the Bayesian loss, a novel loss function which constructs a density contribution probability model from the point annotations. Instead of constraining the value at every pixel in the density map, the proposed training loss adopts a more reliable supervision on the count expectation at each annotated point. Without bells and whistles, the loss function makes substantial improvements over the baseline loss on all tested datasets. Moreover, our proposed loss function, equipped with a standard backbone network and without using any external detectors or multi-scale architectures, performs favourably against the state of the art. Our method outperforms previous best approaches by a large margin on the latest and largest UCF-QNRF dataset.
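The count-expectation supervision described above can be sketched numerically: each pixel's predicted density is soft-assigned to the annotated points via a posterior probability, and each person's expected count is pushed toward 1. Below is a minimal numpy sketch under simplifying assumptions (a flattened density map, a fixed Gaussian bandwidth σ, no background modelling); the function name and shapes are illustrative, not the authors' code.

```python
import numpy as np

def bayesian_loss(pred_density, pixel_coords, ann_points, sigma=8.0):
    """Sketch of the Bayesian loss: supervise the expected count each
    annotated person receives, rather than per-pixel density values.

    pred_density : (M,)   predicted density at each pixel
    pixel_coords : (M, 2) pixel coordinates
    ann_points   : (N, 2) annotated head points
    """
    # Likelihood of each pixel under a Gaussian centred at each point.
    d2 = ((pixel_coords[:, None, :] - ann_points[None, :, :]) ** 2).sum(-1)  # (M, N)
    lik = np.exp(-d2 / (2.0 * sigma ** 2))
    # Posterior label probability: normalise over the annotated points.
    post = lik / lik.sum(axis=1, keepdims=True)                              # (M, N)
    # Expected count contributed to each annotated point.
    expected = (post * pred_density[:, None]).sum(axis=0)                    # (N,)
    # Each person should contribute a total count of exactly 1.
    return np.abs(1.0 - expected).sum()
```

With two well-separated points and a prediction that puts unit density on each, the loss is close to zero; with an all-zero prediction it equals the number of people, since each person's expected count falls short by exactly 1.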


Counting dense crowds has many applications, such as estimating the number of participants in political rallies, civil unrest, and social and sport events. In addition, methods for crowd counting have great potential to handle similar tasks in other domains, including estimating the number of vehicles in traffic congestion, counting cells and bacteria in microscopic images, and estimating animal crowds for ecological surveys.

Crowd counting is a very challenging task because:
1) Dense crowds often have heavy overlaps and occlusions between individuals.
2) Perspective effects may cause large variations in human size, shape, and appearance in the image.

Related Work

Detection-then-counting

Most of the early works estimate the crowd count by detecting or segmenting individual objects in the scene. This kind of method has to tackle two major challenges. Firstly, detectors produce richer results (e.g., bounding boxes or instance masks) than the overall count requires, which is computationally expensive and most suitable for low-density crowds. In overcrowded scenes, clutter and severe occlusions make it infeasible to detect every single person, despite the progress in related fields. Secondly, training object detectors requires bounding-box or instance-mask annotations, which are much more labour-intensive to obtain in dense crowds.


Direct count regression.


To avoid the more complex detection problem, some researchers proposed to directly learn a mapping from image features to counts. Earlier methods in this category rely on hand-crafted features, such as SIFT and LBP, and then learn a regression model. Chan et al. proposed to extract edge, texture, and other low-level features of the crowds, and learn a Gaussian Process regression model for crowd counting.
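The regression-based pipeline above can be illustrated in a few lines: hand-crafted features in, a scalar count out. The sketch below uses synthetic features and closed-form ridge regression in place of SIFT/LBP features and a Gaussian Process regressor; all data and names here are illustrative, not from the cited works.

```python
import numpy as np

# Minimal sketch of direct count regression: map per-image hand-crafted
# features (edge density, texture statistics, etc.) straight to a count.
# The features here are synthetic stand-ins for SIFT/LBP-style descriptors.
rng = np.random.default_rng(0)
X = rng.random((50, 4))                    # 50 images, 4 features each
true_w = np.array([10.0, 5.0, 0.0, 2.0])
y = X @ true_w                             # ground-truth crowd counts

# Closed-form ridge regression as a simple substitute for a GP regressor.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

pred = X @ w                               # predicted counts per image
```

The appeal of this family of methods is exactly this simplicity; the drawback, noted next, is that it discards the per-person location information the annotations provide.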

Density map estimation. 


This kind of method takes advantage of the location information to learn a map of density values for each training sample, and the final count estimate is obtained by summing over the predicted density map. Lempitsky and Zisserman proposed to transform the point annotations into a density map via a Gaussian kernel and use it as the "ground truth". They then train their models using a least-squares objective. This training framework has been widely used in recent methods. Furthermore, thanks to the excellent feature learning ability of deep CNNs, CNN-based density map estimation methods have achieved state-of-the-art performance for crowd counting. One major problem of this framework is how to determine the optimal size of the Gaussian kernel, which is influenced by many factors. To make matters worse, the models are trained by a loss function which applies supervision in a pixel-to-pixel manner. Consequently, the performance of such methods depends heavily on the quality of the generated "ground-truth" density maps.
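The conventional "ground-truth" generation criticized here can be sketched as follows: place a normalised Gaussian at every annotated point and sum the kernels, so the map sums to the person count. This is a minimal sketch with a fixed, hand-picked sigma, which is precisely the hyperparameter the paper argues is hard to choose.

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Generate a conventional "ground-truth" density map from point
    annotations: one normalised Gaussian kernel per annotated head.
    Summing the map recovers the person count."""
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    dmap = np.zeros(shape, dtype=np.float64)
    for (px, py) in points:
        g = np.exp(-((xx - px) ** 2 + (yy - py) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()   # normalise so each person contributes exactly 1
    return dmap
```

A model trained with a pixel-wise loss is then penalised for every deviation from this map, even though the map itself is only a heuristic rendering of the point annotations.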


Hybrid training. 


Several works observed that crowd counting benefits from mixture training strategies, e.g., multi-task and multi-loss training. DecideNet adaptively decides whether to use a detection model or a density map estimation model. This approach takes advantage of a mixture of experts, where a detection-based model can estimate crowds accurately in low-density scenes while the density map estimation model is good at handling crowded scenes. However, this method requires external pre-trained human detection models and is less efficient. Some researchers proposed to combine multiple losses to assist each other, e.g., training a deep CNN by alternately optimizing a pixel-wise loss function and a global count regression loss. A similar training approach was adopted by Zhang et al., in which they first train their model via the density map loss and then add a relative count loss in the last few epochs.





Visualization of the posterior label probability. We construct an entropy map using Eq. (15), which measures the uncertainty about which label a pixel in the density map belongs to. The warmer the color, the larger the value. (a) Input image. (b)-(c): Entropy maps with different σ, without background pixel modelling. (e)-(f): Entropy maps with different d, with background pixel modelling. (d): Blend of the input image and the entropy map in (e).
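The entropy map in this figure can be sketched from the same posterior used by the loss: for each pixel, compute the posterior probability over the annotated points, then its Shannon entropy. The sketch below omits the background model of panels (d)-(f) and uses an illustrative function name; it is not the authors' Eq. (15) code.

```python
import numpy as np

def entropy_map(pixel_coords, ann_points, sigma=8.0):
    """Per-pixel Shannon entropy of the posterior label probability.
    High entropy marks pixels whose assignment to an annotated person
    is ambiguous (e.g., midway between two heads)."""
    d2 = ((pixel_coords[:, None, :] - ann_points[None, :, :]) ** 2).sum(-1)
    lik = np.exp(-d2 / (2.0 * sigma ** 2))
    post = lik / lik.sum(axis=1, keepdims=True)
    # Small epsilon keeps log finite at (near-)zero probabilities.
    return -(post * np.log(post + 1e-12)).sum(axis=1)
```

A pixel sitting on an annotated point gets near-zero entropy (its label is certain), while a pixel equidistant from two points reaches the maximum ln 2 for that pair, matching the warm regions between heads in the figure.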




The experimental results in Table 1 can be summarized as follows:
• BAYESIAN+ achieves state-of-the-art accuracy on all four benchmark datasets. On the latest and toughest UCF-QNRF dataset, it reduces the MAE and MSE values relative to the previous best method (CL-CNN). It is worth mentioning that the method does not use any external detection models or multi-scale structures.
• BAYESIAN+ consistently improves on BAYESIAN by around 3% on all four datasets.
• Both BAYESIAN and BAYESIAN+ outperform BASELINE significantly on all four datasets. BAYESIAN+ achieves improvements of 15% on UCF-QNRF, 9% on ShanghaiTech A, 8% on ShanghaiTech B, and 8% on UCF_CC_50, respectively.

Conclusions

The authors propose a novel loss function for crowd count estimation with point supervision. Different from previous methods that transform point annotations into "ground-truth" density maps using a Gaussian kernel with pixel-wise supervision, the proposed loss function adopts more reliable supervision on the count expectation at each annotated point. Extensive experiments have demonstrated the advantages of the proposed method in terms of accuracy, robustness, and generalization. The current formulation is fairly general and can easily incorporate other knowledge, e.g., specific foreground or background priors, scale and temporal likelihoods, and other facts, to further improve the proposed method.

