
Crowd Count Estimation with Point Supervision


By Zhiheng Ma, Xing Wei, Xiaopeng Hong, Yihong Gong

Research Center for Artificial Intelligence, Peng Cheng Laboratory


Abstract

In crowd counting datasets, each person is annotated with a point, usually at the center of the head, and the task is to estimate the total count in a crowd scene. Most state-of-the-art methods are based on density map estimation: they convert the sparse point annotations into a “ground-truth” density map through a Gaussian kernel and then use it as the learning target to train a density map estimator. However, such a “ground-truth” density map is imperfect due to occlusions, perspective effects, variations in object shapes, etc. In contrast, we propose Bayesian loss, a novel loss function that constructs a density contribution probability model from the point annotations. Instead of constraining the value at every pixel in the density map, the proposed training loss adopts a more reliable supervision on the count expectation at each annotated point. Without bells and whistles, the loss function makes substantial improvements over the baseline loss on all tested datasets. Moreover, our proposed loss function, equipped with a standard backbone network and without any external detectors or multi-scale architectures, performs favourably against the state of the art. Our method outperforms previous best approaches by a large margin on the latest and largest UCF-QNRF dataset.


Counting dense crowds is useful for estimating the number of participants in political rallies, civil unrest, social and sport events, etc. In addition, crowd counting methods have great potential for similar tasks in other domains, including estimating the number of vehicles in traffic congestion, counting cells and bacteria in microscopic images, and estimating animal populations for ecological surveys.

Crowd counting is a very challenging task because:
1) People in dense crowds often heavily overlap and occlude each other.
2) Perspective effects may cause large variations in human size, shape, and appearance in the image.

Related Work

Detection-then-counting.

Most early works estimate the crowd count by detecting or segmenting individual objects in the scene. Methods of this kind face two major challenges. First, they produce more detailed results (e.g. bounding boxes or instance masks) than the overall count requires, which is computationally expensive and mostly suitable for lower-density crowds. In overcrowded scenes, clutter and severe occlusions make it infeasible to detect every single person, despite the progress in related fields. Second, training object detectors requires bounding-box or instance-mask annotations, which are much more labour-intensive to produce in dense crowds.


Direct count regression.


To avoid the more complex detection problem, some researchers proposed to directly learn a mapping from image features to counts. Earlier methods in this category rely on hand-crafted features, such as SIFT and LBP, and then learn a regression model. Chan proposed to extract edge, texture and other low-level features of the crowds, and learn a Gaussian Process regression model for crowd counting.

Density map estimation. 


Methods of this kind take advantage of the location information to learn a map of density values for each training sample; the final count estimate is obtained by summing over the predicted density map. Lempitsky and Zisserman proposed to transform the point annotations into a “ground-truth” density map using a Gaussian kernel and then train their models with a least-squares objective. This training framework has been widely used in recent methods. Furthermore, thanks to the excellent feature learning ability of deep CNNs, CNN-based density map estimation methods have achieved state-of-the-art performance for crowd counting. One major problem of this framework is how to determine the optimal size of the Gaussian kernel, which is influenced by many factors. To make matters worse, the models are trained with a loss function that applies supervision in a pixel-to-pixel manner, so the performance of such methods depends heavily on the quality of the generated “ground-truth” density maps.
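The pipeline described above can be illustrated with a minimal sketch. It assumes a fixed kernel width `sigma`, normalizes each Gaussian over the image so that every annotated person contributes exactly one unit of density, and uses plain Python lists to stay dependency-free. The function names and the toy annotations are hypothetical and are not taken from the paper.

```python
import math

def gaussian_density_map(points, height, width, sigma):
    """Build a "ground-truth" density map by placing one Gaussian,
    normalized to sum to 1 over the image, at each (row, col) annotation."""
    density = [[0.0] * width for _ in range(height)]
    for (py, px) in points:
        # Unnormalized kernel weights over the whole grid.
        weights = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
                    for x in range(width)] for y in range(height)]
        total = sum(map(sum, weights))
        for y in range(height):
            for x in range(width):
                density[y][x] += weights[y][x] / total
    return density

def pixelwise_l2_loss(pred, target):
    """Least-squares objective applied at every pixel."""
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row))

# Hypothetical toy annotations on a 10x10 image.
points = [(2, 3), (5, 5), (7, 1)]
gt = gaussian_density_map(points, 10, 10, sigma=1.5)
count = sum(map(sum, gt))  # summing the density map recovers the count
```

Changing `sigma` changes every pixel of the target while leaving the total count unchanged, which is why a pixel-to-pixel loss is sensitive to the kernel choice.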


Hybrid training. 


Several works observed that crowd counting benefits from mixed training strategies, such as multi-task and multi-loss training. DecideNet adaptively decides whether to use a detection model or a density map estimation model. This approach takes advantage of a mixture of experts, where a detection-based model estimates crowds accurately in low-density scenes while the density map estimation model is good at handling crowded scenes. However, this method requires external pre-trained human detection models and is less efficient. Some researchers proposed to combine multiple losses that assist each other. One approach trains a deep CNN by alternately optimizing a pixel-wise loss function and a global count regression loss. A similar training approach was adopted by Zhang, in which the model is first trained via the density map loss and a relative count loss is then added in the last few epochs.





Visualization of the posterior label probability. We construct an entropy map using Eq. (15), which measures the uncertainty about which label a pixel in the density map belongs to. Warmer colors indicate larger values. (a) Input image. (b)-(c): Entropy maps with different σ, without background pixel modelling. (e)-(f): Entropy maps with different d, with background pixel modelling. (d): Blend of the input image and the entropy map in (e).
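Eq. (15) is not reproduced in this summary, so the following is only a sketch of how such an entropy map could be computed. It assumes that the posterior label probability of a pixel comes from isotropic Gaussian likelihoods centred on the annotated points with equal priors, and that the uncertainty is the Shannon entropy of that posterior. Background pixel modelling (the parameter d in the caption) is left out, and all names are hypothetical.

```python
import math

def label_posterior(pixel, points, sigma):
    """Posterior over annotated points for one pixel, assuming equal priors
    and isotropic Gaussian likelihoods centred on each point."""
    logits = [-((pixel[0] - py) ** 2 + (pixel[1] - px) ** 2) / (2 * sigma ** 2)
              for (py, px) in points]
    top = max(logits)  # subtract the max to avoid underflow far from all points
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_map(points, height, width, sigma):
    """Shannon entropy of the label posterior at every pixel."""
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            post = label_posterior((y, x), points, sigma)
            row.append(-sum(p * math.log(p) for p in post if p > 0.0))
        rows.append(row)
    return rows

# Two hypothetical annotations on a single-row image.
points = [(0, 0), (0, 4)]
ent = entropy_map(points, 1, 5, sigma=1.0)
```

In this toy case the pixel midway between the two annotations has the highest entropy, while pixels on top of an annotation have entropy close to zero; a larger `sigma` spreads the uncertain region further.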




The experimental results in Table 1 can be summarized as follows:
• BAYESIAN+ achieves state-of-the-art accuracy on all four benchmark datasets. On the latest and toughest dataset, UCF-QNRF, it achieves lower MAE and MSE than the previous best method (CL-CNN). It is worth mentioning that our method does not use any external detection models or multi-scale structures.
• BAYESIAN+ consistently improves the performance of BAYESIAN by around 3% on all four datasets.
• Both BAYESIAN and BAYESIAN+ outperform BASELINE significantly on all four datasets. BAYESIAN+ achieves improvements of 15% on UCF-QNRF, 9% on ShanghaiTechA, 8% on ShanghaiTechB, and 8% on UCF_CC_50.
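For readers unfamiliar with the metrics, here is a small sketch of how MAE, MSE, and a relative improvement could be computed from per-image counts. It assumes the convention common in the crowd counting literature, where "MSE" denotes the square root of the mean squared count error; this summary does not state the definition, so treat that as an assumption. The counts below are made up for illustration and are not results from the paper.

```python
import math

def mae(predicted, actual):
    """Mean absolute error between predicted and true per-image counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mse(predicted, actual):
    """Root of the mean squared count error (assumed convention for "MSE")."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def relative_improvement(baseline_error, new_error):
    """Fraction by which the error drops relative to the baseline."""
    return (baseline_error - new_error) / baseline_error

# Hypothetical counts for two test images.
actual = [10, 20]
predicted = [12, 17]
example_mae = mae(predicted, actual)
example_mse = mse(predicted, actual)
example_gain = relative_improvement(100.0, 85.0)  # a 15% reduction in error
```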

Conclusions

The authors propose a novel loss function for crowd count estimation with point supervision. Unlike previous methods, which transform point annotations into “ground-truth” density maps using a Gaussian kernel and apply pixel-wise supervision, the proposed loss function adopts more reliable supervision on the count expectation at each annotated point. Extensive experiments demonstrate the advantages of the proposed method in terms of accuracy, robustness, and generalization. The current form of the formulation is fairly general and can easily incorporate other knowledge, e.g., specific foreground or background priors, scale and temporal likelihoods, and other facts, to further improve the method.
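To make "supervision on the count expectation at each annotated point" concrete, here is a minimal sketch. It assumes that each pixel's predicted density is shared among the annotated points according to a posterior built from isotropic Gaussian likelihoods with equal priors, and that the loss is the sum over points of the absolute difference between each point's expected count and 1. The summary above does not give the formula, so these details, the function names, and the toy data are assumptions; background modelling (the BAYESIAN+ variant) is not included.

```python
import math

def label_posterior(pixel, points, sigma):
    """Posterior over annotated points for one pixel (equal priors,
    isotropic Gaussian likelihood around each point)."""
    logits = [-((pixel[0] - py) ** 2 + (pixel[1] - px) ** 2) / (2 * sigma ** 2)
              for (py, px) in points]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_counts(pred_density, points, sigma):
    """Expected count attributed to each annotated point."""
    expected = [0.0] * len(points)
    for y, row in enumerate(pred_density):
        for x, value in enumerate(row):
            for n, p in enumerate(label_posterior((y, x), points, sigma)):
                expected[n] += p * value
    return expected

def bayesian_loss(pred_density, points, sigma):
    """Sum over points of |1 - expected count| (assumed form of the loss)."""
    return sum(abs(1.0 - e) for e in expected_counts(pred_density, points, sigma))

# Hypothetical annotations and two toy predictions on a 10x10 image.
points = [(1, 1), (8, 8), (1, 8)]
empty = [[0.0] * 10 for _ in range(10)]
peaked = [[0.0] * 10 for _ in range(10)]
for (py, px) in points:
    peaked[py][px] = 1.0  # one unit of density on each annotated head

loss_empty = bayesian_loss(empty, points, sigma=1.0)
loss_peaked = bayesian_loss(peaked, points, sigma=1.0)
```

An all-zero prediction is penalized once per annotated person, while any prediction that delivers one unit of density near each head gets a loss close to zero, regardless of the exact shape of the density around that head.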

