By Swetha Sirnam, Anand Mishra, Guru Prasad Hegde, C V Jawahar
Abstract
Accurately annotated large video data is critical for developing reliable surveillance and automotive vision solutions. The paper proposes an efficient yet accurate annotation scheme for objects in videos (pedestrians in this case) with minimal supervision. Objects are annotated with tight bounding boxes, and the annotations are propagated across frames with a self-training based approach. An energy minimization scheme for segmentation is the central component of the method. Unlike popular GrabCut-like segmentation schemes, it demands minimal user intervention. Since the annotation is built on an accurate segmentation, the bounding boxes are tight. The approach is validated on multiple publicly available datasets.
The paper focuses on efficient object bounding-box annotation and accurate human pose annotation in videos.
Object Annotation in Videos
Object annotation is one of the most fundamental problems in the vision community because of its application in a wide variety of tasks, and it is also a component of many high-level problems. The solution proposes an efficient yet accurate annotation scheme with tight bounding boxes for objects in videos with minimal supervision. The annotations are propagated across frames using a self-learning based approach, and an energy minimization scheme for segmentation is the core component of the method. Figure 1.1 shows an example of annotating a person walking on a road. Applications of object annotation include motion analysis, event detection, surveillance systems, transport, and sports analytics.
Figure 1.1: Example showing an object (a human) annotated with a tight bounding box in a pedestrian video taken from the TUD-Stadtmitte dataset
The paper reviews existing video annotation tools:
- VideoAnnEX - a video annotation tool developed by IBM
- ViPER - an open-source tool developed by the Language and Media Processing Lab
- LabelMe video (LMV) - an open-source, Web-accessible video annotation system
- VATIC - a Web-based video annotation tool hosted on Amazon's Mechanical Turk to investigate crowdsourcing platforms
For a detailed description of these tools, please refer to the paper.
Approach
The paper adopts a GrabCut-based segmentation method for videos to efficiently obtain highly accurate object annotations in large-scale videos.
Proposed framework: given a set of videos, the user annotates selected key frames. These annotations are propagated over the entire sequence and used as initialization for the segmentation, which yields a tight bounding box in every frame.
Energy Minimization Framework
Segmentation of an image can be expressed as a vector of binary random variables X = {X1, X2, ..., Xn}, where each random variable Xi takes a label xi ∈ {0, 1} depending on whether pixel i belongs to the object or the background.
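A standard form of such an energy over the labeling x is sketched below; this follows the usual GrabCut-style formulation with GMM-based unary terms and a contrast-sensitive pairwise term, and the paper's exact terms may differ:

```latex
% Illustrative GrabCut-style energy; z_i is the colour of pixel i,
% \theta_0, \theta_1 are the background/foreground GMMs, \mathcal{N} is the
% pixel neighbourhood, and \lambda, \beta are weighting constants.
E(\mathbf{x}) = \sum_{i} -\log p\bigl(z_i \mid \theta_{x_i}\bigr)
              + \lambda \sum_{(i,j)\in\mathcal{N}} [\,x_i \neq x_j\,]\,
                e^{-\beta \lVert z_i - z_j \rVert^2}
```

Minimizing this energy over the binary labels is exactly the graph min-cut step described next.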
The framework iteratively estimates the foreground and background GMMs and performs a graph cut (graph min-cut). Given an image and a bounding box that marks the object of interest, the output is the object segmented from the background.
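As a rough illustration of this loop (not the authors' implementation), OpenCV's GrabCut alternates GMM estimation and min-cut internally; the file name and rectangle below are made-up example values:

```python
import cv2
import numpy as np

# Load a frame and define a rough bounding box (x, y, w, h) around the object.
# The path and rectangle are arbitrary examples, not taken from the paper.
frame = cv2.imread("frame_0001.jpg")
rect = (50, 30, 120, 260)

mask = np.zeros(frame.shape[:2], np.uint8)      # per-pixel labels
bgd_model = np.zeros((1, 65), np.float64)       # background GMM parameters
fgd_model = np.zeros((1, 65), np.float64)       # foreground GMM parameters

# Iteratively estimate the GMMs and solve the min-cut, initialized from the box.
cv2.grabCut(frame, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels labeled (possibly) foreground form the segmentation.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)

# A tight bounding box follows from the extent of the foreground mask.
ys, xs = np.nonzero(fg)
tight_box = (xs.min(), ys.min(), xs.max() - xs.min(), ys.max() - ys.min())
```

In the paper's pipeline the initialization rectangle comes from the propagated key-frame annotations described next, and segmentation is restricted to the object neighborhood rather than the whole frame.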
The paper proposes the following simple but effective modifications for efficient segmentation and propagation, so that objects are annotated accurately across the videos:
- Relaxed interpolation is used to obtain a relaxed bounding box for the object of interest. It is computed up front for the entire sequence from the user annotations on key frames (see the sketch after this list).
- Foreground and background GMM models are computed at the key frames from the user annotations, and these models are propagated to the in-between frames.
- The pre-computed models are used to segment the object in nearby frames, since the object appearance changes little between neighboring frames.
- The object neighborhood is sufficient for segmenting the object; there is no need to process the entire image. This reduces computation drastically without affecting the results.
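To make the first two points concrete, here is an illustrative sketch of how key-frame boxes could be linearly interpolated and then relaxed (enlarged by a margin) before segmentation; the helper name and the margin value are assumptions for this example, not values from the paper:

```python
def relaxed_boxes(key_boxes, margin=0.2):
    """Linearly interpolate key-frame boxes to all in-between frames, then
    enlarge each box by `margin` so the true object is safely contained.
    `key_boxes` maps frame index -> (x, y, w, h) from the user annotation.
    The 0.2 margin is an illustrative choice, not the paper's value."""
    frames = sorted(key_boxes)
    boxes = {}
    for a, b in zip(frames, frames[1:]):
        for t in range(a, b + 1):
            alpha = (t - a) / (b - a)
            x, y, w, h = ((1 - alpha) * ka + alpha * kb
                          for ka, kb in zip(key_boxes[a], key_boxes[b]))
            boxes[t] = (x - margin * w / 2, y - margin * h / 2,
                        (1 + margin) * w, (1 + margin) * h)
    return boxes
```

Each relaxed box then initializes the segmentation in its frame, while the color models come from the nearest key frames rather than being re-estimated from scratch.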
Sample Result
Performance Metrics
Average area overlap
It is the area of intersection divided by the area of union of the ground truth (GT) box and the bounding box (BB) generated by an annotation approach. The mean of this measure is obtained by averaging over the total number of frames in the dataset.
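In symbols, if GT_t and BB_t are the ground-truth and generated boxes for frame t and N is the number of frames:

```latex
\text{Average area overlap} = \frac{1}{N}\sum_{t=1}^{N}
  \frac{\lvert GT_t \cap BB_t \rvert}{\lvert GT_t \cup BB_t \rvert}
```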
Recall
Recall is computed as the ratio of true positives (TP) to the sum of true positives and false negatives (FN).
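That is, with TP and FN the counts of true positives and false negatives:

```latex
\text{Recall} = \frac{TP}{TP + FN}
```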
Result
Consider an example where a user has a video of 10,000 frames to be annotated. For a key-frame interval of 10, the user has to annotate only 1,000 frames with this approach. This reduces the human effort by 90% without compromising the accuracy of the annotations.
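The arithmetic behind that figure:

```latex
\frac{10{,}000 \text{ frames}}{\text{key-frame interval of } 10} = 1{,}000 \text{ frames annotated manually},
\qquad 1 - \frac{1{,}000}{10{,}000} = 90\% \text{ of the effort saved}
```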
Conclusion
The paper presents a framework for semi-automatic object annotation to generate accurate ground-truth (GT) data at large scale from videos. The approach is especially suitable for generating GT for mission-critical applications such as surveillance and autonomous driving. The object annotation method is based on segmentation and its propagation, which results in accurate bounding boxes around the objects. The proposed framework outperforms interpolation-based approaches and almost mimics human annotation ability with only minimal user interaction (predominantly at key frames), which makes it scalable for generating large GT sets. The claims are verified through comprehensive experiments on multiple challenging video datasets. The approach can prove useful for generating ground truth and annotations for large-scale surveillance and automotive videos with a substantial reduction in human effort.
Additional content for reference
Types of Annotations
Class Labels
Class label annotation is one of the most basic forms of annotation: it assigns a label to each image or video. It is used for classification problems, where the task is to identify objects in images or actions in videos.
Bounding Box
The class labels give information about the objects in the scene, but they do not describe where those objects are. To describe the position of an object in the image, one can draw a rectangle around it. The object is therefore annotated with a bounding box such that it lies completely inside the box.
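As a small hypothetical illustration of how such an annotation might be stored (the field names and values are made up for this example, not a particular dataset's schema):

```python
# One bounding-box annotation: class label plus box position and size in pixels.
annotation = {
    "label": "pedestrian",
    "frame": 42,
    "bbox": {"x": 153, "y": 88, "width": 47, "height": 126},  # top-left corner + size
}
```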
Semantic Labeling
Although the bounding box provides information about the position of the object in the image, it is not precise; often it is a loose bounding box. Semantic labeling annotates objects at the pixel level: each pixel is labeled as object (i.e., 1) or background (i.e., 0) in the case of foreground-background segmentation. In the case of multiple objects, each pixel is assigned to a class.
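In the foreground-background case the annotation is simply a per-pixel mask of the same size as the image; a tiny illustrative example with made-up values:

```python
import numpy as np

# A 4x6 image annotated at pixel level: 1 = object, 0 = background.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
], dtype=np.uint8)
```

For multiple object classes, the same array simply holds one class index per pixel instead of 0/1.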
Human Pose
Semantic labeling provides pixel-level detail about the scene but fails to provide structural information about the object. For example, a human can be annotated at the pixel level using semantic labeling, but information about the human (for example posture and limb locations) is not captured. Knowing the posture of a human helps in semantically reasoning about the scene and also helps in solving other problems like cloth parsing and action recognition. Human posture is essentially the human layout, i.e., the skeleton, and it can be represented by joint positions. For a full-body human pose, a person is annotated with 14 key-points that constitute the skeleton (head, neck, left/right shoulder, left/right elbow, left/right wrist, left/right hip, left/right knee and left/right ankle).
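These 14 key-points are typically kept in a fixed order so joint indices are consistent across annotations; the layout below just follows the sentence above and is only illustrative, since real datasets use their own conventions:

```python
# 14 full-body key-points as listed in the text; the index order is illustrative.
BODY_KEYPOINTS = [
    "head", "neck",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]
# A pose annotation is then 14 (x, y) image coordinates, one per key-point.
```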
Hand Pose
To capture the hand pose, 21 hand-joint positions are annotated. Table 2.5 shows a few datasets which provide hand pose annotations. By capturing hand pose, we know the position of every finger joint, which has many applications, such as sign-language recognition, augmented-reality games, and hand gesture recognition.
Figure 2.5: Image showing the 21 key-points of hand pose annotation, where ‘T’, ‘I’, ‘M’, ‘R’, ‘P’ denote the ‘Thumb’, ‘Index’, ‘Middle’, ‘Ring’ and ‘Pinky’ fingers.
The image shows the 21 hand-joint positions which are captured for annotation. The joints are ordered as: Wrist, TMCP, IMCP, MMCP, RMCP, PMCP, TPIP, TDIP, TTIP, IPIP, IDIP, ITIP, MPIP, MDIP, MTIP, RPIP, RDIP, RTIP, PPIP, PDIP, PTIP, where ‘T’, ‘I’, ‘M’, ‘R’, ‘P’ denote the ‘Thumb’, ‘Index’, ‘Middle’, ‘Ring’ and ‘Pinky’ fingers, and ‘MCP’, ‘PIP’, ‘DIP’, ‘TIP’ denote the metacarpophalangeal, proximal interphalangeal and distal interphalangeal joints and the fingertip, respectively.
Head Pose
To capture the head pose of a person, yaw, pitch and roll need to be annotated. The image shows the yaw, pitch and roll along the Y, X and Z axes respectively. Capturing head pose is important in some applications and is also part of other problems like capturing eye gaze. For example, to score a driver's attention in videos, the driver attention score depends on the head pose. Table 2.6 shows a few datasets which capture head pose annotations.
Facial Points
Head pose captures the position of the head but does not provide detailed information about the face, such as the eyes, nose, lips and cheeks. To capture facial expressions, it is necessary to annotate facial points. Figure 2.7 shows an example of 68 facial landmarks annotated on a sample face image. These facial landmarks can be used for identification, gesture recognition, mood estimation, etc.
Other Annotations
Apart from the annotations discussed above, other annotations include 3D object representations (such as 3D shape and 3D human pose), eye gaze, lip reading (associating speaker utterances with words), scene summaries, etc. For example, given an image, the scene-summary annotation would be a sentence describing the image (e.g., "A cat is under the table"). Applications include visual dialog systems and image captioning. Extended to videos, the annotation would be a video summary (retaining only the important segments of the video).