By Swetha Sirnam, Anand Mishra, Guru Prasad Hegde, C V Jawahar
Abstract
Accurately annotated large video data is critical for developing reliable surveillance and automotive vision solutions. The paper proposes an efficient yet accurate annotation scheme for objects in videos (pedestrians in this case) with minimal supervision. Objects are annotated with tight bounding boxes, and the annotations are propagated across frames with a self-training based approach. An energy minimization scheme for segmentation is the central component of the method. Unlike popular GrabCut-like segmentation schemes, it demands minimal user intervention. Since the annotation is built on an accurate segmentation, the bounding boxes are tight. The approach is validated on multiple publicly available datasets.
The paper focuses on efficient object bounding-box annotation and accurate human pose annotation in videos.
Object Annotation in Videos
Object annotation is one of the most fundamental problems in the vision community because of its application in a wide variety of tasks, and it is also a component of many high-level problems. The solution proposes an efficient yet accurate annotation scheme with tight bounding boxes for objects in videos with minimal supervision. The annotations are propagated across frames using a self-learning based approach, and an energy minimization scheme for segmentation is the core component of the method. Figure 1.1 shows an example of annotating a person walking on a road. Applications of object annotation include motion analysis, event detection, surveillance systems, transport, and sports analytics.
Figure 1.1: Example showing an object (a human) annotated with a tight bounding box in a pedestrian video taken from the TUD-Stadtmitte dataset
The paper reviews existing video annotation tools:
- VideoAnnEX - a video annotation tool developed by IBM
- ViPER - an open-source tool developed by the Language and Media Processing Lab
- LabelMe video (LMV) - an open-source, Web-accessible video annotation system
- VATIC - a Web-based video annotation tool hosted on Amazon's Mechanical Turk to investigate crowdsourcing platforms
For a detailed description of these tools, please refer to the paper.
Approach
The paper adopts a GrabCut-based segmentation method for videos to efficiently obtain highly accurate object annotations in large-scale videos.
Proposed framework: given a set of videos, the user annotates selected key frames. These annotations are propagated over the entire sequence and used as initialization for the segmentation, which yields a tight bounding box in every frame.
Energy Minimization Framework
Segmentation of an image can be expressed as a vector of binary random variables X = {X1, X2, ..., Xn}, where each random variable Xi takes a label xi ∈ {0, 1} depending on whether pixel i belongs to the object or the background.
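A standard form of such an energy over the labeling x is sketched below; this follows the usual GrabCut-style formulation with GMM-based unary terms and a contrast-sensitive pairwise term, and the paper's exact terms may differ:

```latex
% Illustrative GrabCut-style energy; z_i is the colour of pixel i,
% \theta_0, \theta_1 are the background/foreground GMMs, \mathcal{N} is the
% pixel neighbourhood, and \lambda, \beta are weighting constants.
E(\mathbf{x}) = \sum_{i} -\log p\bigl(z_i \mid \theta_{x_i}\bigr)
              + \lambda \sum_{(i,j)\in\mathcal{N}} [\,x_i \neq x_j\,]\,
                e^{-\beta \lVert z_i - z_j \rVert^2}
```

Minimizing this energy over the binary labels is exactly the graph min-cut step described next.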
The framework iteratively estimates the foreground and background GMMs and performs a graph cut (graph min-cut). Given an image and a bounding box that marks the object of interest, the output is the object segmented from the background.
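As a rough illustration of this loop (not the authors' implementation), OpenCV's GrabCut alternates GMM estimation and min-cut internally; the file name and rectangle below are made-up example values:

```python
import cv2
import numpy as np

# Load a frame and define a rough bounding box (x, y, w, h) around the object.
# The path and rectangle are arbitrary examples, not taken from the paper.
frame = cv2.imread("frame_0001.jpg")
rect = (50, 30, 120, 260)

mask = np.zeros(frame.shape[:2], np.uint8)      # per-pixel labels
bgd_model = np.zeros((1, 65), np.float64)       # background GMM parameters
fgd_model = np.zeros((1, 65), np.float64)       # foreground GMM parameters

# Iteratively estimate the GMMs and solve the min-cut, initialized from the box.
cv2.grabCut(frame, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels labeled (possibly) foreground form the segmentation.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)

# A tight bounding box follows from the extent of the foreground mask.
ys, xs = np.nonzero(fg)
tight_box = (xs.min(), ys.min(), xs.max() - xs.min(), ys.max() - ys.min())
```

In the paper's pipeline the initialization rectangle comes from the propagated key-frame annotations described next, and segmentation is restricted to the object neighborhood rather than the whole frame.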
The paper proposes the following simple but effective modifications for efficient segmentation and propagation, so that objects are annotated accurately across the videos:
- Relaxed interpolation is used to obtain a relaxed bounding box for the object of interest. It is computed up front for the entire sequence from the user annotations on key frames (see the sketch after this list).
- Foreground and background GMM models are computed at the key frames from the user annotations, and these models are propagated to the in-between frames.
- The pre-computed models are used to segment the object in nearby frames, since the object appearance changes little between neighboring frames.
- The object neighborhood is sufficient for segmenting the object; there is no need to process the entire image. This reduces computation drastically without affecting the results.
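To make the first two points concrete, here is an illustrative sketch of how key-frame boxes could be linearly interpolated and then relaxed (enlarged by a margin) before segmentation; the helper name and the margin value are assumptions for this example, not values from the paper:

```python
def relaxed_boxes(key_boxes, margin=0.2):
    """Linearly interpolate key-frame boxes to all in-between frames, then
    enlarge each box by `margin` so the true object is safely contained.
    `key_boxes` maps frame index -> (x, y, w, h) from the user annotation.
    The 0.2 margin is an illustrative choice, not the paper's value."""
    frames = sorted(key_boxes)
    boxes = {}
    for a, b in zip(frames, frames[1:]):
        for t in range(a, b + 1):
            alpha = (t - a) / (b - a)
            x, y, w, h = ((1 - alpha) * ka + alpha * kb
                          for ka, kb in zip(key_boxes[a], key_boxes[b]))
            boxes[t] = (x - margin * w / 2, y - margin * h / 2,
                        (1 + margin) * w, (1 + margin) * h)
    return boxes
```

Each relaxed box then initializes the segmentation in its frame, while the color models come from the nearest key frames rather than being re-estimated from scratch.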
Sample Result
Performance Metrics
Average area overlap
It is the area of intersection divided by the area of union of the ground truth (GT) box and the bounding box (BB) generated by an annotation approach. The mean of this measure is obtained by averaging over the total number of frames in the dataset.
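In symbols, if GT_t and BB_t are the ground-truth and generated boxes for frame t and N is the number of frames:

```latex
\text{Average area overlap} = \frac{1}{N}\sum_{t=1}^{N}
  \frac{\lvert GT_t \cap BB_t \rvert}{\lvert GT_t \cup BB_t \rvert}
```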
Recall
Recall is computed as the ratio of true positives (TP) to the sum of true positives and false negatives (FN).
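That is, with TP and FN the counts of true positives and false negatives:

```latex
\text{Recall} = \frac{TP}{TP + FN}
```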
Result
Consider an example where a user has a video of 10,000 frames to be annotated. For a key-frame interval of 10, the user has to annotate only 1,000 frames with this approach. This reduces the human effort by 90% without compromising the accuracy of the annotations.
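The arithmetic behind that figure:

```latex
\frac{10{,}000 \text{ frames}}{\text{key-frame interval of } 10} = 1{,}000 \text{ frames annotated manually},
\qquad 1 - \frac{1{,}000}{10{,}000} = 90\% \text{ of the effort saved}
```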
Conclusion
The paper presents a framework for semi-automatic object annotation to generate accurate ground-truth (GT) data at large scale from videos. The approach is especially suitable for generating GT for mission-critical applications such as surveillance and autonomous driving. The object annotation method is based on segmentation and its propagation, which results in accurate bounding boxes around the objects. The proposed framework outperforms interpolation-based approaches and almost mimics human annotation ability with only minimal user interaction (predominantly at key frames), which makes it scalable for generating large GT sets. The claims are verified through comprehensive experiments on multiple challenging video datasets. The approach can prove useful for generating ground truth and annotations for large-scale surveillance and automotive videos with a substantial reduction in human effort.
Additional content for reference
Types of Annotations
Class Labels
Class label annotation is one of the most basic forms of annotation: it assigns a label to each image or video. It is used for classification problems, where the task is to identify objects in images or actions in videos.
Bounding Box
The class labels give information about the objects in the scene, but they do not describe where those objects are. To describe the position of an object in the image, one can draw a rectangle around it. The object is therefore annotated with a bounding box such that it lies completely inside the box.
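As a small hypothetical illustration of how such an annotation might be stored (the field names and values are made up for this example, not a particular dataset's schema):

```python
# One bounding-box annotation: class label plus box position and size in pixels.
annotation = {
    "label": "pedestrian",
    "frame": 42,
    "bbox": {"x": 153, "y": 88, "width": 47, "height": 126},  # top-left corner + size
}
```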
Semantic Labeling
Although the bounding box provides information about the position of the object in the image, it is not precise; often it is a loose bounding box. Semantic labeling annotates objects at the pixel level: each pixel is labeled as object (i.e., 1) or background (i.e., 0) in the case of foreground-background segmentation. In the case of multiple objects, each pixel is assigned to a class.
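In the foreground-background case the annotation is simply a per-pixel mask of the same size as the image; a tiny illustrative example with made-up values:

```python
import numpy as np

# A 4x6 image annotated at pixel level: 1 = object, 0 = background.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
], dtype=np.uint8)
```

For multiple object classes, the same array simply holds one class index per pixel instead of 0/1.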
Human Pose
Semantic labeling provides pixel-level detail about the scene but fails to provide structural information about the object. For example, a human can be annotated at the pixel level using semantic labeling, but information about the human (for example posture and limb locations) is not captured. Knowing the posture of a human helps in semantically reasoning about the scene and also helps in solving other problems like cloth parsing and action recognition. Human posture is essentially the human layout, i.e., the skeleton, and it can be represented by joint positions. For a full-body human pose, a person is annotated with 14 key-points that constitute the skeleton (head, neck, left/right shoulder, left/right elbow, left/right wrist, left/right hip, left/right knee and left/right ankle).
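These 14 key-points are typically kept in a fixed order so joint indices are consistent across annotations; the layout below just follows the sentence above and is only illustrative, since real datasets use their own conventions:

```python
# 14 full-body key-points as listed in the text; the index order is illustrative.
BODY_KEYPOINTS = [
    "head", "neck",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]
# A pose annotation is then 14 (x, y) image coordinates, one per key-point.
```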
Hand Pose
To capture the hand pose, 21 hand-joint positions are annotated. Table 2.5 shows a few datasets which provide hand pose annotations. By capturing hand pose, we know the position of every finger joint, which has many applications, such as sign-language recognition, augmented-reality games, and hand gesture recognition.
Figure 2.5: Image showing the 21 key-points of hand pose annotation, where ‘T’, ‘I’, ‘M’, ‘R’, ‘P’ denote the ‘Thumb’, ‘Index’, ‘Middle’, ‘Ring’ and ‘Pinky’ fingers.
The image shows the 21 hand-joint positions which are captured for annotation. The joints are ordered as: Wrist, TMCP, IMCP, MMCP, RMCP, PMCP, TPIP, TDIP, TTIP, IPIP, IDIP, ITIP, MPIP, MDIP, MTIP, RPIP, RDIP, RTIP, PPIP, PDIP, PTIP, where ‘T’, ‘I’, ‘M’, ‘R’, ‘P’ denote the ‘Thumb’, ‘Index’, ‘Middle’, ‘Ring’ and ‘Pinky’ fingers, and ‘MCP’, ‘PIP’, ‘DIP’, ‘TIP’ denote the metacarpophalangeal, proximal interphalangeal and distal interphalangeal joints and the fingertip, respectively.
Head Pose
To capture the head pose of a person, yaw, pitch and roll need to be annotated. The image shows the yaw, pitch and roll along the Y, X and Z axes respectively. Capturing head pose is important in some applications and is also part of other problems like capturing eye gaze. For example, to score a driver's attention in videos, the driver attention score depends on the head pose. Table 2.6 shows a few datasets which capture head pose annotations.
Facial Points
Head pose captures the position of the head but does not provide detailed information about the face, such as the eyes, nose, lips and cheeks. To capture facial expressions, it is necessary to annotate facial points. Figure 2.7 shows an example of 68 facial landmarks annotated on a sample face image. These facial landmarks can be used for identification, gesture recognition, mood estimation, etc.
Other Annotations
Apart from the annotations discussed above, other annotations include 3D object representations (such as 3D shape and 3D human pose), eye gaze, lip reading (associating speaker utterances with words), scene summaries, etc. For example, given an image, the scene-summary annotation would be a sentence describing the image (e.g., "A cat is under the table"). Applications include visual dialog systems and image captioning. Extended to videos, the annotation would be a video summary (retaining only the important segments of the video).