
Video Retargeting using Gaze

- By Kranthi Kumar Rachavarapu, Moneish Kumar, Vineet Gandhi, Ramanathan Subramanian

CVIT, IIIT Hyderabad; University of Glasgow, Singapore.





Gaze tracking has long been a topic of curiosity for me, which led me to pick today's summary. In an earlier blog post on gaze, I explained how a reader's gaze reflects intent and how meaning is absorbed during reading. Here is a visualization of it.


An example of fixations and saccades over text. This is the typical pattern of eye movement during reading. The eyes never move smoothly over still text.

Skilled readers move their eyes during reading on average every quarter of a second. During the time that the eye is fixated, new information is brought into the processing system. Although the average fixation duration is 200–250 ms, the range is from 100 ms to over 500 ms. The distance the eye moves in each saccade (a short rapid movement) is between 1 and 20 characters, with the average being 7–9 characters. The saccade lasts 20–40 ms, and during this time vision is suppressed so that no new information is acquired. There is considerable variability in fixations (the points at which saccades land) and saccades between readers, and even for the same person reading a single passage of text. Skilled readers make regressions back to material already read about 15 percent of the time. The main difference between faster and slower readers is that the slower group consistently shows longer average fixation durations, shorter saccades and more regressions. These basic facts about eye movement have been known for almost a hundred years, but only recently have researchers begun to look at eye movement behavior as a reflection of cognitive processing during reading.




Some interesting information on gaze tracking

Eye tracking is the process of electronically locating the point of a person's gaze, or following and recording the movement of the point of gaze. Various technologies exist for accomplishing this task; some methods involve attachments to the eye, while others rely on images of the eye taken without any physical contact. A special lens or film can be affixed to the cornea, incorporating precise position sensors to follow physical movements of the eye. A tiny mirror or electro-mechanical transducer, embedded in the lens or film, can use light beams or electromagnetic fields to quantify the eye's orientation and follow changes in gaze position. Such devices have proven sensitive and accurate, as long as they don't slip on the surface of the eye.

Eye position and movement can be detected remotely, without involving any attachments to the cornea. For most people, this method is preferred over mechanical methods because it is non-invasive and portable. A device called a micro-projector transmits an infrared beam at the eye, and the reflection patterns are picked up by a set of sensors. The reflections may occur from the cornea or from the retina as the infrared beam passes through the lens, into the eye, and back out.

The above information is taken from Wikipedia.

Abstract
We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content-agnostic, as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length, and can in principle re-edit an entire movie in one shot.

Our algorithm consists of two steps. The first step employs gaze transition cues to detect time stamps where new cuts are to be introduced in the original video, via dynamic programming. A subsequent step optimizes the cropping window path (to create pan and zoom effects) while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information, and is composed of piecewise constant, linear and parabolic segments. It is obtained via L1-regularized convex optimization, which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state of the art, both in terms of computational complexity and qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience as compared to gaze-driven re-editing and letterboxing methods, especially for wide-angle static camera recordings.


Video re-targeting in Action



The phenomenal increase in multimedia consumption has led to the ubiquitous display devices of today such as LED TVs, smartphones, PDAs and in-flight entertainment screens. While the viewing experience on these varied display devices is strongly correlated with the display size, resolution and aspect ratio, digital content is usually created with a target display in mind and needs to be manually re-edited (using techniques like pan-and-scan) for effective rendering on other devices. Therefore, automated algorithms which can retarget the original content to render effectively on novel displays are of critical importance. Retargeting algorithms can also enable content creation for non-expert and resource-limited users. For instance, small/mid-level theatre houses typically perform recordings with a wide-angle camera covering the entire stage, as the costs incurred for professional video recordings are prohibitive (requiring a multi-camera crew, editors, etc.). Smart retargeting and compositing can convert static camera recordings with low-resolution faces into professional-looking videos with editing operations such as pan, cut and zoom.


Commonly employed video retargeting methods are non-uniform scaling (squeezing), cropping and letterboxing. However, squeezing can lead to annoying distortions; letterboxing results in large portions of the display being unused; and cropping can lead to the loss of scene details. Several efforts have been made to automate the retargeting process: early work by Liu and Gleicher posed retargeting as an optimization problem to select a cropping window inside the original recording, and other advanced methods like content-aware warping and seam carving followed. However, most of these methods rely on bottom-up saliency derived from computational methods, which does not consider high-level scene semantics such as emotions, to which humans are sensitive. Recently, Jain et al. proposed Gaze Driven Re-editing (GDR), which preserves human preferences in scene content without distortion via user gaze cues and re-edits the original video, introducing novel cut, pan and zoom operations. However, their method has limited applicability due to (a) extreme computational complexity and (b) the hard assumptions made by the authors regarding the video content.


Problem Statement 

The proposed algorithm takes as input (a) a sequence of frames t = [1 : N], where N is the total number of frames; (b) the raw gaze points g_t^i for each frame t and subject i, over multiple users; and (c) the desired output aspect ratio. The output of the algorithm is the sequence edited to the desired aspect ratio, characterized by a cropping window parametrized by its x-position (x*_t) and zoom (z_t) at each frame. The edited sequence introduces new panning movements and cuts within the original sequence, aiming to preserve the cinematic and contextual intent of the video. Before delving into the technical details, we put forward a discussion of the desired characteristics of such an editing algorithm from a cinematographic perspective. We also discuss how the cinematography literature inspires our algorithmic choices. Millerson and Owens stress that shot composition is strongly coupled with what viewers will look at. If viewers do not have any idea of what they are supposed to be looking at, they will look at whatever draws their attention (random pictures produce random thoughts). This motivates the choice of using gaze data in the re-editing process. Although notable progress has been made in computationally predicting human gaze from images, a cognitive gap remains. For this reason, we explicitly use the collected gaze data as the measure of saliency in the proposed algorithm. The algorithm then aims to align the cropping window with the collected gaze data.
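To make the interface concrete, the inputs and outputs can be sketched as simple data structures. This is a minimal illustration with hypothetical names (RetargetingInput, RetargetingOutput); the paper does not prescribe any particular representation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RetargetingInput:          # hypothetical container, for illustration only
    frames: np.ndarray           # (N, H, Wo, 3) original frames, t = 1..N
    gaze: np.ndarray             # (N, num_users, 2) raw gaze points g_t^i
    out_aspect: float            # desired output aspect ratio, e.g. 4/3 or 1.0

@dataclass
class RetargetingOutput:
    x: np.ndarray                # (N,) cropping-window x-position x*_t per frame
    zoom: np.ndarray             # (N,) zoom factor z_t per frame
    cuts: list                   # frame indices where new cuts are introduced
```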

 Data collection 

We selected a variety of clips from movies and live theatre. A total of 12 sequences were selected. Eight of these come from four different feature films and cover diverse scenarios like dyadic conversations, conversations in crowds, action scenes, etc. The clips include a variety of shots such as close-ups, distant wide-angle shots, and stationary and moving cameras. The native aspect ratio of all these sequences is either 2.76:1 or 2.35:1. The pace of the movie sequences varies from a cut every 1.6 seconds to no cuts at all in a 3-minute sequence. The live theatre sequences were recorded during dress rehearsals of Arthur Miller's play 'Death of a Salesman' and Tennessee Williams' play 'Cat on a Hot Tin Roof'. All 4 selected theatre sequences were recorded from a static wide-angle camera covering the entire stage and have an aspect ratio of 4:1. These are continuous recordings without any cuts. The combined set of movie and live theatre sequences amounts to a duration of about 52 minutes (minimum length about 45 seconds, maximum length about 6 minutes). Five naive participants with normal vision (with or without lenses) were recruited from the student community for collecting the gaze data. The participants were asked to watch the sequences resized to a frame size of 1366×768 on a 16-inch screen. The original aspect ratio was preserved during the resizing operation using letterboxing. The participants sat at approximately 60 cm from the screen. Ergonomic settings were adjusted prior to the experiment and the system was calibrated. The Psychtoolbox extensions for MATLAB were used to display the sequences. The sequences were presented in a fixed order for all participants. The gaze data was recorded using the 60 Hz Tobii EyeX, an easy-to-operate, low-cost eye tracker.




Gaze as an indicator of importance

The basic idea of video retargeting is to preserve what is important in a video by removing what is not. We explicitly use gaze as the measure of importance and propose a dynamic programming optimization, which takes as input the gaze-tracking data from multiple users and outputs a cropping window path that encompasses maximal gaze information. The algorithm also outputs the time stamps at which to introduce new cuts (if required) for more efficient storytelling. Whenever there is an abrupt shift in the gaze location, introducing a new cut in the cropping window path is preferable to a panning movement (fast panning would appear jarring to the viewer). However, the algorithm penalizes jump cuts (ensuring that the cropping window locations before and after the cut are distinct enough) as well as many cuts in short succession (it is important to give the viewer sufficient time to absorb the details before making the next cut). More formally, the algorithm takes as input the raw gaze data g_t^i of the i-th user for all frames t = [1 : N] and outputs a state r_t for each frame, where r_t ∈ [1 : Wo] (Wo being the width of the original video frames) selects one among all possible cropping window positions.
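As a rough illustration of this dynamic program, here is a Viterbi-style sketch in Python/numpy. It is a simplified stand-in, not the paper's exact formulation: the names optimal_window_path, pan_step and cut_cost are assumptions, the reward is simply the smoothed gaze mass inside each candidate window, and the paper's additional penalties on jump cuts and on cuts in quick succession are omitted for brevity.

```python
import numpy as np

def optimal_window_path(G, crop_w, pan_step=4, cut_cost=50.0):
    """Viterbi-style sketch (hypothetical names/values, not the paper's costs).
    G: (N, Wo) Gaussian-filtered gaze matrix; crop_w: cropping-window width.
    Moves of at most pan_step px/frame are treated as free pans; larger
    jumps are charged cut_cost and interpreted as cuts."""
    N, Wo = G.shape
    P = Wo - crop_w + 1                              # candidate window origins
    # gaze mass inside each candidate window, via prefix sums
    C = np.hstack([np.zeros((N, 1)), np.cumsum(G, axis=1)])
    reward = C[:, crop_w:crop_w + P] - C[:, :P]      # reward[t, p]
    # transition costs between positions (jump-cut and rapid-cut penalties
    # from the paper are omitted here for brevity)
    d = np.abs(np.subtract.outer(np.arange(P), np.arange(P)))
    trans = np.where(d <= pan_step, 0.0, cut_cost)
    dp = reward[0].copy()                            # best score per position
    back = np.zeros((N, P), dtype=int)
    for t in range(1, N):
        scores = dp[None, :] - trans                 # scores[p, q]: arrive at p from q
        back[t] = np.argmax(scores, axis=1)
        dp = scores[np.arange(P), back[t]] + reward[t]
    r = np.empty(N, dtype=int)                       # backtrack optimal states
    r[-1] = int(np.argmax(dp))
    for t in range(N - 1, 0, -1):
        r[t - 1] = back[t, r[t]]
    cuts = [t for t in range(1, N) if abs(r[t] - r[t - 1]) > pan_step]
    return r, cuts
```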




The above image shows the x-coordinates of the recorded gaze data of 5 users for the sequence "Death of a Salesman" (top row); the Gaussian-filtered gaze matrix over all users, used as input to the optimization algorithm (middle row); and the output of the optimization: the optimal state sequence (black line) along with the detected cuts (black circles) for an output aspect ratio of 1:1 (bottom row).
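The Gaussian-filtered gaze matrix in the middle row can be approximated along these lines; a minimal sketch assuming per-frame gaze x-coordinates, with gaze_matrix and sigma as made-up names/values (the authors' exact filter parameters are not stated here):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def gaze_matrix(gaze_x, Wo, sigma=30.0):
    """Accumulate each user's gaze x-coordinate per frame into a histogram,
    then smooth along x. gaze_x: (N, num_users) raw x-coordinates, NaN where
    tracking was lost; sigma is an assumed value, not from the paper."""
    N = gaze_x.shape[0]
    G = np.zeros((N, Wo))
    for t in range(N):
        for x in gaze_x[t]:
            if not np.isnan(x) and 0 <= x < Wo:
                G[t, int(x)] += 1.0                   # one vote per user
    return gaussian_filter1d(G, sigma=sigma, axis=1)  # smooth across columns
```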



Results

The results were computed on all 12 clips (8 movie sequences and 4 theatre sequences) mentioned earlier. All the sequences were retargeted from their native aspect ratio to 4:3 and 1:1 using our algorithm. We also computed results using the Gaze Driven Re-editing (GDR) algorithm by Jain et al. for the 4:3 aspect ratio, over all the sequences. The results with GDR were computed by first detecting the original cuts in the sequences and then applying the algorithm shot by shot. Some example results and comparisons are shown below; an explicit comparison of the output with and without zoom is shown in the image. All the original and retargeted sequences are provided in the supplementary material.




We used the CVX toolbox with MOSEK for convex optimization. The parameters used for the algorithm are given in Table 1. The same set of parameters is used for all theatre and movie sequences.
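The paper's optimization is done in CVX with MOSEK (MATLAB); to give a feel for what an L1-regularized path objective looks like, here is a hedged sketch in Python with cvxpy. The data term and the weights lam1..lam3 are placeholders, not the paper's objective or its Table 1 values, and the handling of cuts (solving per shot) is omitted:

```python
import cvxpy as cp

def smooth_path(r, lam1=1.0, lam2=10.0, lam3=100.0):
    """L1-regularized path smoothing sketch. r: the DP's per-frame window
    positions; lam1..lam3 are illustrative weights, not Table 1's values."""
    N = len(r)
    x = cp.Variable(N)
    objective = cp.Minimize(
        cp.norm1(x - r)                   # stay close to gaze-optimal positions
        + lam1 * cp.norm1(cp.diff(x, 1))  # sparse 1st differences: static segments
        + lam2 * cp.norm1(cp.diff(x, 2))  # sparse 2nd differences: linear pans
        + lam3 * cp.norm1(cp.diff(x, 3))  # sparse 3rd differences: parabolic ease-in/out
    )
    cp.Problem(objective).solve()         # MOSEK if installed, else default solver
    return x.value
```

The design intuition: penalizing the L1 norm of the first, second and third differences drives them to be exactly zero over long stretches, which is what yields the piecewise constant (static), linear (pan) and parabolic (smooth ease-in/out) segments mentioned in the abstract.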


Conclusion





The figure shows original frames and overlaid outputs from our method and GDR (coloured, within white rectangles) on a long sequence. The plot shows the x-position of the center of the cropping windows for our method (black curve) and GDR (red curve) over time. Gaze data of 5 users for the sequence are overlaid on the plot. Unlike GDR, our method does not involve hard assumptions and is able to better include the gaze data (best viewed under zoom).


