
Efficient Annotation of Objects for Video Analysis

By Swetha Sirnam, Anand Mishra, Guru Prasad Hegde, C V Jawahar

Post Link

Abstract

Accurately annotated large video data is critical for the development of reliable surveillance and automotive vision solutions. The paper proposes an efficient yet accurate annotation scheme for objects in videos (pedestrians in this case) with minimal supervision. Objects are annotated with tight bounding boxes, and the annotations are propagated across frames with a self-training based approach. An energy minimization scheme for segmentation is the central component of the method. Unlike popular GrabCut-like segmentation schemes, it demands minimal user intervention. Since the annotation is built on an accurate segmentation, the resulting bounding boxes are tight. The approach is validated on multiple publicly available datasets.

The paper focuses on efficient object bounding box annotation in videos and accurate human pose annotation in videos.

Object Annotation in Videos

Object annotation is one of the most fundamental problems in the vision community because of its application in a wide variety of tasks; it is also a part of many high-level problems. The solution proposed is an efficient yet accurate annotation scheme with tight bounding boxes for objects in videos, requiring minimal supervision. The annotations are propagated across frames using a self-learning based approach. An energy minimization scheme for segmentation is the core component of the method. Figure 1.1 shows an example of annotating a person walking on a road. Applications of object annotation include motion analysis, event detection, surveillance systems, transport, and sports analytics.


Figure 1.1: Example showing an object (a human) annotated with a tight bounding box in pedestrian videos taken from the TUD-Stadtmitte dataset.


The paper reviews existing video annotation tools:
  • VideoAnnEx - video annotation tool developed by IBM
  • ViPER - open-source tool developed by the Language and Media Processing Lab
  • LabelMe video (LMV) - open-source, Web-accessible video annotation system
  • VATIC - Web-based video annotation tool hosted on Amazon's Mechanical Turk to investigate crowdsourcing platforms
For a detailed description of these tools, please refer to the paper.

Approach

The paper adapts a GrabCut-based segmentation method to videos to efficiently obtain highly accurate annotations for objects in large-scale videos.




Proposed framework: given a set of videos, the user annotates selected key frames. The annotations are propagated across the entire sequence and used as initialization for the approach, which then produces a tight bounding box.

Energy Minimization Framework


The segmentation of an image can be expressed as a vector of binary random variables X = {X1, X2, ..., Xn}, where each random variable Xi takes a label xi ∈ {0, 1} depending on whether pixel i belongs to the object or the background.
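
The post does not reproduce the paper's exact potentials. As a hedged reconstruction consistent with the definitions above, a standard GrabCut-style energy over these variables has a unary (GMM likelihood) term and a pairwise smoothness term:

```latex
E(\mathbf{x}) \;=\; \sum_{i} \theta_i(x_i)
\;+\; \lambda \sum_{(i,j) \in \mathcal{N}} [\,x_i \neq x_j\,]\,
e^{-\beta \lVert z_i - z_j \rVert^2}
```

Here θi(xi) = -log p(zi | GMM_xi) scores the color zi of pixel i under the foreground (xi = 1) or background (xi = 0) Gaussian mixture, and the pairwise term discourages neighboring pixels with similar colors from taking different labels. Minimizing E over x is the graph min-cut step described next.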




The framework iteratively estimates the foreground and background GMMs and performs a graph cut (graph min-cut). Given an image and a bounding box representing the object of interest, the output is the object segmented from the background.
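
As a minimal sketch of this step, OpenCV's built-in GrabCut performs the same iterate-GMMs-then-min-cut loop; the paper's variant differs in how the models are initialized and propagated, so treat this only as an illustration:

```python
import cv2
import numpy as np

def segment_in_box(image, rect, iters=5):
    """Segment the object inside rect = (x, y, w, h) from the background.

    Runs OpenCV's GrabCut, which alternates between re-estimating the
    foreground/background GMMs and solving a graph min-cut.
    """
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    bgd_model = np.zeros((1, 65), dtype=np.float64)  # background GMM parameters
    fgd_model = np.zeros((1, 65), dtype=np.float64)  # foreground GMM parameters
    cv2.grabCut(image, mask, rect, bgd_model, fgd_model,
                iters, cv2.GC_INIT_WITH_RECT)
    # Pixels labeled (probably) foreground form the object segment.
    fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
    return fg, bgd_model, fgd_model
```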



The paper proposes the following simple but effective modifications for efficient segmentation and propagation to accurately annotate objects in videos:

  • Usage of relaxed interpolation to get a relaxed bounding box for the object of interest. It is calculated once for the entire sequence, using the user annotations on the key frames.
  • Foreground and background GMMs are computed at the key frames using the user annotations, and these models are propagated to the in-between frames.
  • The pre-computed models are used to segment the object in nearby frames, since the object changes very little between neighboring frames.
  • The object neighborhood is sufficient for segmenting the object; there is no need to process the entire image. This reduces the computation drastically without affecting the results. A sketch of how these pieces fit together follows this list.
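
The sketch below shows one way these modifications could fit together. The helper names, the relaxation factor of 1.2 and the fall-back behavior are illustrative assumptions, not taken from the paper; `segment_in_box` is the GrabCut-style step sketched earlier:

```python
import numpy as np

def interpolate(box_a, box_b, t):
    """Linear interpolation of two (x, y, w, h) boxes at fraction t in [0, 1]."""
    return tuple(int(round((1 - t) * a + t * b)) for a, b in zip(box_a, box_b))

def relax(box, factor=1.2):
    """Enlarge a box so the true object surely lies inside the enlarged one."""
    x, y, w, h = box
    dw, dh = int((factor - 1) * w / 2), int((factor - 1) * h / 2)
    return x - dw, y - dh, w + 2 * dw, h + 2 * dh

def propagate(frames, key_boxes, interval=10):
    """Produce a tight box on every frame from user boxes on key frames.

    key_boxes[k] is the user annotation on frame k * interval. For brevity
    this sketch re-fits the GMMs on every crop; the paper instead computes
    them once at the key frames and propagates them to in-between frames.
    """
    annotations = []
    for i, frame in enumerate(frames):
        k, t = divmod(i, interval)
        a = key_boxes[min(k, len(key_boxes) - 1)]
        b = key_boxes[min(k + 1, len(key_boxes) - 1)]
        box = interpolate(a, b, t / interval)   # expected object box
        rx, ry, rw, rh = relax(box)             # relaxed neighborhood
        rx, ry = max(rx, 0), max(ry, 0)
        crop = frame[ry:ry + rh, rx:rx + rw]

        # Segment only the neighborhood: the expected box seeds the
        # foreground; the relaxed margin supplies background samples.
        rect = (box[0] - rx, box[1] - ry, box[2], box[3])
        mask, _, _ = segment_in_box(crop, rect)

        # The tight annotation is the bounding rectangle of the segment.
        ys, xs = np.nonzero(mask)
        if len(xs) > 0:
            annotations.append((rx + int(xs.min()), ry + int(ys.min()),
                                int(xs.max() - xs.min()),
                                int(ys.max() - ys.min())))
        else:
            annotations.append(box)  # fall back to the interpolated box
    return annotations
```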

Sample Result


Performance Metrics

Average area overlap


It is the area of the intersection divided by the area of the union of the ground truth (GT) box and the bounding box (BB) generated by an annotation approach. The mean of this measure is obtained by averaging over all frames in the dataset.
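
Written as a formula over the N frames of a dataset, with GT_t and BB_t the ground-truth and generated boxes in frame t:

```latex
\text{Average area overlap} \;=\;
\frac{1}{N} \sum_{t=1}^{N}
\frac{\lvert GT_t \cap BB_t \rvert}{\lvert GT_t \cup BB_t \rvert}
```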



Recall

Recall is computed as the fraction of true positives over the sum of true positives and false negatives.
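
The post drops the paper's exact definitions of TP and FN; as a formula, with a frame conventionally counted as a true positive when the generated box overlaps the ground truth above a threshold (0.5 is the usual choice in the detection literature):

```latex
\text{Recall} \;=\; \frac{TP}{TP + FN}
```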

Result


Consider an example where a user has a video of 10,000 frames to be annotated. With a key-frame interval of 10, the user has to annotate only 1,000 frames in this approach. This reduces the human effort by 90% without compromising the accuracy of the annotations.
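
The arithmetic behind this claim, as a quick check:

```python
frames = 10_000                    # frames to annotate
interval = 10                      # key-frame interval
key_frames = frames // interval    # 1,000 frames annotated by hand
savings = 1 - key_frames / frames  # 0.9, i.e. 90% less human effort
```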

Conclusion


The paper presents a framework for semi-automatic object annotation to generate accurate ground truth (GT) data from videos at large scale. The approach is especially suitable for generating GT for mission-critical applications like surveillance and autonomous driving. The object annotation method is based on segmentation and its propagation, which results in accurate bounding boxes around the objects. The proposed framework outperforms interpolation-based approaches and almost mimics human annotation ability with only minimal user interaction (predominantly at key frames), which makes it scalable for generating large sets of GT. The claims are verified through comprehensive experiments on multiple challenging video datasets. The approach can prove useful for generating ground truth and annotations for large-scale surveillance and automotive videos with a substantial reduction in human effort.





Additional content for reference 

Types of Annotations


Class Labels 

Class label annotation is one of the most basic annotations: it assigns a class label to each image or video. It is used for classification problems, where the task is to identify objects in images or actions in videos.


Bounding Box 

Class labels give information about the objects in the scene, but they do not describe their position. To describe the object's position in the image, one can draw a rectangle around the object. The object is thus annotated with a bounding box such that it lies completely inside the box.



Semantic Labeling 

Although the bounding box provides information about the position of the object in the image, it is not precise: often it is a loose bounding box that also includes background pixels.



Semantic labeling annotates objects at the pixel level: each pixel is labeled as object (i.e., 1) or background (i.e., 0) in the case of foreground-background segmentation. In the case of multiple objects, each pixel is assigned to a class.
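
A minimal sketch of how such pixel-level labels are commonly stored; the image size, pixel ranges and class names here are illustrative:

```python
import numpy as np

height, width = 480, 640  # illustrative image size

# Foreground-background segmentation: one binary label per pixel.
mask = np.zeros((height, width), dtype=np.uint8)  # 0 = background
mask[100:300, 200:350] = 1                        # 1 = object pixels

# With multiple objects, each pixel holds a class index instead.
CLASSES = {0: "background", 1: "person", 2: "car"}
label_map = np.zeros((height, width), dtype=np.uint8)
label_map[100:300, 200:350] = 1  # pixels belonging to a person
```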

Human Pose 

Semantic labeling provides pixel-level detail about the scene but fails to provide structural information about the object. For example, a human can be annotated at the pixel level using semantic labeling, but information about the human (for example posture and limb locations) is not known. Knowing the posture of a human helps in semantically reasoning about the scene and also helps in solving other problems like cloth parsing, action recognition etc. Human posture is simply the human layout, that is, the skeleton, and it can be represented by joint positions. For a full-body human pose, a person is annotated with 14 key-points that together form the skeleton (head, neck, left-right shoulders, left-right elbows, left-right wrists, left-right hips, left-right knees and left-right ankles).
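
As a sketch, the 14 key-points listed above can be stored as a plain mapping from joint name to image coordinates; the coordinate values here are placeholders, not real data:

```python
# The 14 full-body pose key-points named above.
KEYPOINTS = ["head", "neck",
             "left_shoulder", "right_shoulder",
             "left_elbow", "right_elbow",
             "left_wrist", "right_wrist",
             "left_hip", "right_hip",
             "left_knee", "right_knee",
             "left_ankle", "right_ankle"]

pose = {name: (0.0, 0.0) for name in KEYPOINTS}  # (x, y) per joint
pose["head"] = (123.0, 45.0)                     # example pixel position
```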


Hand Pose 

To capture the hand pose, 21 hand-joint positions are annotated. Table 2.5 shows a few datasets which provide hand pose annotations. By capturing hand pose, we know the position of every finger joint, which has many applications: sign-language recognition, augmented-reality games, hand gesture recognition etc.




Figure 2.5: Image showing the 21 key-points of hand pose annotation, where 'T', 'I', 'M', 'R', 'P' denote the Thumb, Index, Middle, Ring and Pinky fingers.

The image shows the 21 hand-joint positions which are captured for annotation. The joints are ordered as follows: Wrist, TMCP, IMCP, MMCP, RMCP, PMCP, TPIP, TDIP, TTIP, IPIP, IDIP, ITIP, MPIP, MDIP, MTIP, RPIP, RDIP, RTIP, PPIP, PDIP, PTIP, where 'T', 'I', 'M', 'R', 'P' denote the Thumb, Index, Middle, Ring and Pinky fingers, and 'MCP', 'PIP', 'DIP', 'TIP' denote the metacarpophalangeal, proximal interphalangeal and distal interphalangeal joints and the fingertip, respectively.
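
The ordering above can be generated programmatically, which also confirms the count of 21:

```python
# Wrist, then the five MCP joints, then PIP/DIP/TIP per finger,
# matching the joint order listed above.
FINGERS = ["T", "I", "M", "R", "P"]
HAND_JOINTS = (["Wrist"]
               + [f + "MCP" for f in FINGERS]
               + [f + j for f in FINGERS for j in ("PIP", "DIP", "TIP")])
assert len(HAND_JOINTS) == 21
```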

Head Pose 

To capture the head pose of a person, yaw, pitch and roll need to be annotated; the figure shows yaw, pitch and roll about the Y, X and Z axes respectively. Capturing head pose is important in several applications and is a component of other problems like capturing eye gaze. For example, to score a driver's attention in videos, the attention score depends on the head pose. Table 2.6 shows a few datasets which capture head pose annotations.
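
One common way to turn the three annotated angles into a head-orientation rotation matrix uses SciPy. The axis convention follows the text; the intrinsic composition order and the angle values are illustrative assumptions:

```python
from scipy.spatial.transform import Rotation

yaw, pitch, roll = 30.0, -10.0, 5.0  # degrees; placeholder values

# Yaw about Y, pitch about X, roll about Z, per the axes described above
# (uppercase axis letters select intrinsic rotations in SciPy).
head_rotation = Rotation.from_euler("YXZ", [yaw, pitch, roll], degrees=True)
print(head_rotation.as_matrix())  # 3x3 rotation matrix of the head pose
```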



Facial Points 

Head pose captures the head orientation but does not provide detailed information about the face, such as the eyes, nose, lips, cheeks etc. To capture facial expressions, the facial points have to be annotated. Figure 2.7 shows an example of 68 facial landmarks annotated on a sample face image. These facial landmarks can be used for identification, gesture recognition, mood estimation etc.



Other Annotations

Apart from the annotations discussed above, other annotations include 3D object representations (like 3D shape, 3D human pose etc.), eye gaze, lip reading (associating speaker utterances with words), scene summaries etc. For example, given an image, the scene summary annotation would be a sentence describing the image (example: "A cat is under the table"). Applications include visual-dialog systems and image captioning. Extended to videos, the annotation would be a video summary (retaining only the important segments of the video).


