
Video Retargeting using Gaze

- By Kranthi Kumar Rachavarapu, Moneish Kumar, Vineet Gandhi, Ramanathan Subramanian

CVIT, IIIT Hyderabad; University of Glasgow, Singapore.





Gaze tracking has long been a topic of curiosity for me, which led me to pick today's summary. In an earlier blog post on gaze, I explained how a reader's gaze reflects intent and how meaning is absorbed during reading. Here is a visualization of it.


An example of fixations and saccades over text. This is the typical pattern of eye movement during reading. The eyes never move smoothly over still text.

Skilled readers move their eyes during reading on average every quarter of a second. During the time that the eye is fixated, new information is brought into the processing system. Although the average fixation duration is 200–250 ms, the range is from 100 ms to over 500 ms. The distance the eye moves in each saccade (a short rapid movement) is between 1 and 20 characters, with the average being 7–9 characters. The saccade lasts 20–40 ms, and during this time vision is suppressed so that no new information is acquired. There is considerable variability in fixations (the points at which saccades land) and saccades between readers, and even for the same person reading a single passage of text. Skilled readers make regressions back to material already read about 15 percent of the time. The main difference between faster and slower readers is that the slower group consistently shows longer average fixation durations, shorter saccades and more regressions. These basic facts about eye movement have been known for almost a hundred years, but only recently have researchers begun to look at eye movement behavior as a reflection of cognitive processing during reading.




Some interesting information on gaze tracking

Eye tracking is the process of electronically locating the point of a person's gaze, or following and recording the movement of the point of gaze. Various technologies exist for accomplishing this task; some methods involve attachments to the eye, while others rely on images of the eye taken without any physical contact. A special lens or film can be affixed to the cornea, incorporating precise position sensors to follow physical movements of the eye. A tiny mirror or electro-mechanical transducer, embedded in the lens or film, can use light beams or electromagnetic fields to quantify the eye's orientation and follow changes in gaze position. Such devices have proven sensitive and accurate, as long as they don't slip on the surface of the eye.

Eye position and movement can be detected remotely, without involving any attachments to the cornea. For most people, this method is preferred over mechanical methods because it is non-invasive and portable. A device called a micro-projector transmits an infrared beam at the eye, and the reflection patterns are picked up by a set of sensors. The reflections may occur from the cornea or from the retina as the infrared beam passes through the lens, into the eye, and back out.

The above information is taken from Wikipedia.

Abstract
We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content-agnostic, as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length, and can in principle re-edit an entire movie in one shot.

Our algorithm consists of two steps. The first step employs gaze transition cues to detect time stamps where new cuts are to be introduced in the original video, via dynamic programming. A subsequent step optimizes the cropping window path (to create pan and zoom effects) while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information, and is composed of piecewise constant, linear and parabolic segments. It is obtained via L1-regularized convex optimization, which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state of the art, both in terms of computational complexity and qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience as compared to gaze-driven re-editing and letterboxing methods, especially for wide-angle static camera recordings.


Video re-targeting in Action



The phenomenal increase in multimedia consumption has led to the ubiquitous display devices of today such as LED TVs, smartphones, PDAs and in-flight entertainment screens. While the viewing experience on these varied display devices is strongly correlated with the display size, resolution and aspect ratio, digital content is usually created with a target display in mind and needs to be manually re-edited (using techniques like pan-and-scan) for effective rendering on other devices. Therefore, automated algorithms which can retarget the original content to render effectively on novel displays are of critical importance. Retargeting algorithms can also enable content creation for non-expert and resource-limited users. For instance, small/mid-level theatre houses typically perform recordings with a wide-angle camera covering the entire stage, as the costs incurred for professional video recordings are prohibitive (requiring a multi-camera crew, editors, etc.). Smart retargeting and compositing can convert static camera recordings with low-resolution faces into professional-looking videos with editing operations such as pan, cut and zoom.


Commonly employed video retargeting methods are non-uniform scaling (squeezing), cropping and letterboxing. However, squeezing can lead to annoying distortions; letterboxing results in large portions of the display being unused; and cropping can lead to the loss of scene details. Several efforts have been made to automate the retargeting process: early work by Liu and Gleicher posed retargeting as an optimization problem to select a cropping window inside the original recording, and other advanced methods like content-aware warping and seam carving followed. However, most of these methods rely on bottom-up saliency derived from computational methods, which does not consider high-level scene semantics such as emotions, to which humans are sensitive. Recently, Jain et al. proposed Gaze Driven Re-editing (GDR), which preserves human preferences in scene content without distortion via user gaze cues and re-edits the original video, introducing novel cut, pan and zoom operations. However, their method has limited applicability due to (a) extreme computational complexity and (b) the hard assumptions made by the authors regarding the video content.


Problem Statement 

The proposed algorithm takes as input (a) a sequence of frames t = [1 : N], where N is the total number of frames; (b) the raw gaze points g_t^i for each frame t and subject i, over multiple users; and (c) the desired output aspect ratio. The output of the algorithm is the sequence edited to the desired aspect ratio, characterized by a cropping window parametrized by its x-position (x*_t) and zoom (z_t) at each frame. The edited sequence introduces new panning movements and cuts within the original sequence, aiming to preserve the cinematic and contextual intent of the video. Before delving into the technical details, we put forward a discussion of the desired characteristics of such an editing algorithm from a cinematographic perspective. We also discuss how the cinematography literature inspires our algorithmic choices. Millerson and Owens stress that shot composition is strongly coupled with what viewers will look at. If viewers do not have any idea of what they are supposed to be looking at, they will look at whatever draws their attention (random pictures produce random thoughts). This motivates the choice of using gaze data in the re-editing process. Although notable progress has been made in computationally predicting human gaze from images, a cognitive gap remains. For this reason, we explicitly use the collected gaze data as the measure of saliency in the proposed algorithm. The algorithm then aims to align the cropping window with the collected gaze data.
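To make the interface concrete, the inputs and outputs can be sketched as simple data structures. This is a minimal illustration with hypothetical names (RetargetingInput, RetargetingOutput); the paper does not prescribe any particular representation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RetargetingInput:          # hypothetical container, for illustration only
    frames: np.ndarray           # (N, H, Wo, 3) original frames, t = 1..N
    gaze: np.ndarray             # (N, num_users, 2) raw gaze points g_t^i
    out_aspect: float            # desired output aspect ratio, e.g. 4/3 or 1.0

@dataclass
class RetargetingOutput:
    x: np.ndarray                # (N,) cropping-window x-position x*_t per frame
    zoom: np.ndarray             # (N,) zoom factor z_t per frame
    cuts: list                   # frame indices where new cuts are introduced
```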

 Data collection 

We selected a variety of clips from movies and live theatre. A total of 12 sequences were selected. Eight of these come from four different feature films and cover diverse scenarios like dyadic conversations, conversations in crowds, action scenes, etc. The clips include a variety of shots such as close-ups, distant wide-angle shots, and stationary and moving cameras. The native aspect ratio of all these sequences is either 2.76:1 or 2.35:1. The pace of the movie sequences varies from a cut every 1.6 seconds to no cuts at all in a 3-minute sequence. The live theatre sequences were recorded during dress rehearsals of Arthur Miller's play 'Death of a Salesman' and Tennessee Williams' play 'Cat on a Hot Tin Roof'. All 4 selected theatre sequences were recorded from a static wide-angle camera covering the entire stage and have an aspect ratio of 4:1. These are continuous recordings without any cuts. The combined set of movie and live theatre sequences amounts to a duration of about 52 minutes (minimum length about 45 seconds, maximum length about 6 minutes). Five naive participants with normal vision (with or without lenses) were recruited from the student community for collecting the gaze data. The participants were asked to watch the sequences resized to a frame size of 1366×768 on a 16-inch screen. The original aspect ratio was preserved during the resizing operation using letterboxing. The participants sat at approximately 60 cm from the screen. Ergonomic settings were adjusted prior to the experiment and the system was calibrated. The Psychtoolbox extensions for MATLAB were used to display the sequences. The sequences were presented in a fixed order for all participants. The gaze data was recorded using the 60 Hz Tobii EyeX, an easy-to-operate, low-cost eye tracker.




Gaze as an indicator of importance

The basic idea of video retargeting is to preserve what is important in a video by removing what is not. We explicitly use gaze as the measure of importance and propose a dynamic programming optimization, which takes as input the gaze-tracking data from multiple users and outputs a cropping window path that encompasses maximal gaze information. The algorithm also outputs the time stamps at which to introduce new cuts (if required) for more efficient storytelling. Whenever there is an abrupt shift in the gaze location, introducing a new cut in the cropping window path is preferable to a panning movement (fast panning would appear jarring to the viewer). However, the algorithm penalizes jump cuts (ensuring that the cropping window locations before and after the cut are distinct enough) as well as many cuts in short succession (it is important to give the viewer sufficient time to absorb the details before making the next cut). More formally, the algorithm takes as input the raw gaze data g_t^i of the i-th user for all frames t = [1 : N] and outputs a state r_t for each frame, where r_t ∈ [1 : Wo] (Wo being the width of the original video frames) selects one among all possible cropping window positions.
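As a rough illustration of this dynamic program, here is a Viterbi-style sketch in Python/numpy. It is a simplified stand-in, not the paper's exact formulation: the names optimal_window_path, pan_step and cut_cost are assumptions, the reward is simply the smoothed gaze mass inside each candidate window, and the paper's additional penalties on jump cuts and on cuts in quick succession are omitted for brevity.

```python
import numpy as np

def optimal_window_path(G, crop_w, pan_step=4, cut_cost=50.0):
    """Viterbi-style sketch (hypothetical names/values, not the paper's costs).
    G: (N, Wo) Gaussian-filtered gaze matrix; crop_w: cropping-window width.
    Moves of at most pan_step px/frame are treated as free pans; larger
    jumps are charged cut_cost and interpreted as cuts."""
    N, Wo = G.shape
    P = Wo - crop_w + 1                              # candidate window origins
    # gaze mass inside each candidate window, via prefix sums
    C = np.hstack([np.zeros((N, 1)), np.cumsum(G, axis=1)])
    reward = C[:, crop_w:crop_w + P] - C[:, :P]      # reward[t, p]
    # transition costs between positions (jump-cut and rapid-cut penalties
    # from the paper are omitted here for brevity)
    d = np.abs(np.subtract.outer(np.arange(P), np.arange(P)))
    trans = np.where(d <= pan_step, 0.0, cut_cost)
    dp = reward[0].copy()                            # best score per position
    back = np.zeros((N, P), dtype=int)
    for t in range(1, N):
        scores = dp[None, :] - trans                 # scores[p, q]: arrive at p from q
        back[t] = np.argmax(scores, axis=1)
        dp = scores[np.arange(P), back[t]] + reward[t]
    r = np.empty(N, dtype=int)                       # backtrack optimal states
    r[-1] = int(np.argmax(dp))
    for t in range(N - 1, 0, -1):
        r[t - 1] = back[t, r[t]]
    cuts = [t for t in range(1, N) if abs(r[t] - r[t - 1]) > pan_step]
    return r, cuts
```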




The above image shows the x-coordinates of the recorded gaze data of 5 users for the sequence "Death of a Salesman" (top row); the Gaussian-filtered gaze matrix over all users, used as input to the optimization algorithm (middle row); and the output of the optimization: the optimal state sequence (black line) along with the detected cuts (black circles) for an output aspect ratio of 1:1 (bottom row).
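The Gaussian-filtered gaze matrix in the middle row can be approximated along these lines; a minimal sketch assuming per-frame gaze x-coordinates, with gaze_matrix and sigma as made-up names/values (the authors' exact filter parameters are not stated here):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def gaze_matrix(gaze_x, Wo, sigma=30.0):
    """Accumulate each user's gaze x-coordinate per frame into a histogram,
    then smooth along x. gaze_x: (N, num_users) raw x-coordinates, NaN where
    tracking was lost; sigma is an assumed value, not from the paper."""
    N = gaze_x.shape[0]
    G = np.zeros((N, Wo))
    for t in range(N):
        for x in gaze_x[t]:
            if not np.isnan(x) and 0 <= x < Wo:
                G[t, int(x)] += 1.0                   # one vote per user
    return gaussian_filter1d(G, sigma=sigma, axis=1)  # smooth across columns
```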



Results

The results were computed on all 12 clips (8 movie sequences and 4 theatre sequences) mentioned earlier. All the sequences were retargeted from their native aspect ratio to 4:3 and 1:1 using our algorithm. We also computed results using the Gaze Driven Re-editing (GDR) algorithm by Jain et al. for the 4:3 aspect ratio, over all the sequences. The results with GDR were computed by first detecting the original cuts in the sequences and then applying the algorithm shot by shot. Some example results and comparisons are shown below; an explicit comparison of the output with and without zoom is shown in the image. All the original and retargeted sequences are provided in the supplementary material.




We used the CVX toolbox with MOSEK for convex optimization. The parameters used for the algorithm are given in Table 1. The same set of parameters is used for all theatre and movie sequences.
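The paper's optimization is done in CVX with MOSEK (MATLAB); to give a feel for what an L1-regularized path objective looks like, here is a hedged sketch in Python with cvxpy. The data term and the weights lam1..lam3 are placeholders, not the paper's objective or its Table 1 values, and the handling of cuts (solving per shot) is omitted:

```python
import cvxpy as cp

def smooth_path(r, lam1=1.0, lam2=10.0, lam3=100.0):
    """L1-regularized path smoothing sketch. r: the DP's per-frame window
    positions; lam1..lam3 are illustrative weights, not Table 1's values."""
    N = len(r)
    x = cp.Variable(N)
    objective = cp.Minimize(
        cp.norm1(x - r)                   # stay close to gaze-optimal positions
        + lam1 * cp.norm1(cp.diff(x, 1))  # sparse 1st differences: static segments
        + lam2 * cp.norm1(cp.diff(x, 2))  # sparse 2nd differences: linear pans
        + lam3 * cp.norm1(cp.diff(x, 3))  # sparse 3rd differences: parabolic ease-in/out
    )
    cp.Problem(objective).solve()         # MOSEK if installed, else default solver
    return x.value
```

The design intuition: penalizing the L1 norm of the first, second and third differences drives them to be exactly zero over long stretches, which is what yields the piecewise constant (static), linear (pan) and parabolic (smooth ease-in/out) segments mentioned in the abstract.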


Conclusion





The figure shows original frames and overlaid outputs from our method and GDR (coloured, within white rectangles) on a long sequence. The plot shows the x-position of the center of the cropping windows for our method (black curve) and GDR (red curve) over time. Gaze data of 5 users for the sequence are overlaid on the plot. Unlike GDR, our method does not involve hard assumptions and is able to better include the gaze data (best viewed under zoom).


