Abstract
Visual repetition is ubiquitous in our world. It appears in human activity (sports, cooking), animal behavior (a bee's waggle dance), natural phenomena (leaves in the wind) and in urban environments (flashing lights). Estimating visual repetition from realistic video is challenging, as periodic motion is rarely perfectly static and stationary. To better deal with realistic video, we relax the static and stationary assumptions often made by existing work. Our spatiotemporal filtering approach, grounded in the theory of periodic motion, effectively handles a wide variety of appearances and requires no learning. Starting from motion in 3D, we derive three periodic motion types by decomposing the motion field into its fundamental components. In addition, three temporal motion continuities emerge from the field's temporal dynamics. For the 2D perception of 3D motion we consider the viewpoint relative to the motion; what follows are 18 cases of recurrent motion perception. To estimate repetition under all circumstances, our theory implies constructing a mixture of differential motion maps: F, ∇F, ∇·F and ∇×F. We temporally convolve the motion maps with wavelet filters to estimate repetitive dynamics. Our method spatially segments repetitive motion directly from the temporal filter responses densely computed over the motion maps. For experimental verification of our claims, we use our novel dataset for repetition estimation, which better reflects reality with its non-static and non-stationary repetitive motion. On the task of repetition counting, we obtain favorable results compared to a deep learning alternative.
Introduction
Visual repetitive motion is common in our everyday experience, as it appears in sports, music-making, cooking and other daily activities. In natural scenes it appears as leaves in the wind, waves in the sea or the drumming of a woodpecker, whereas in urban environments we encounter visual repetition in blinking lights, the spinning of wind turbines or a waving pedestrian. In this work we reconsider the theory of periodic motion and propose a method for estimating repetition in real-world video.
Improving our ability to estimate repetition in realistic video is important in numerous respects. In computer vision, periodic motion has proven to be useful for action classification, action localization, human motion analysis, structure from motion, animal behavior study and camera calibration. From a biological perspective, repetition is fascinating as the human visual system relies on rhythm and periodicity to approximate velocity, estimate progress and trigger attention.
To understand the origin and appearance of visual repetition, we rethink the theory of periodic motion inspired by existing work. We follow a differential geometric approach, starting from the divergence, gradient and curl components of the 3D flow field. From the decomposition of the motion field and its temporal dynamics, we derive three motion types and three motion continuities, yielding 3×3 fundamental cases of intrinsic periodicity in 3D. For the 2D perception of 3D intrinsic periodicity, the observer's viewpoint can lie anywhere in the continuous range between two viewpoint extremes. This brings us to 18 fundamental cases for the 2D perception of 3D intrinsic periodic motion.
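To make the decomposition concrete, below is a minimal sketch that extracts the first-order differential quantities of a dense 2D flow field with finite differences. The function name and the NumPy formulation are our own illustration, assuming a dense optical flow estimate is available; this is not the paper's implementation.

```python
import numpy as np

def decompose_flow(flow):
    """First-order differential decomposition of a dense 2D flow
    field of shape (H, W, 2), e.g. from an optical flow estimator."""
    u, v = flow[..., 0], flow[..., 1]      # horizontal and vertical flow
    du_dy, du_dx = np.gradient(u)          # finite-difference derivatives
    dv_dy, dv_dx = np.gradient(v)
    divergence = du_dx + dv_dy             # expansion or contraction
    curl = dv_dx - du_dy                   # rotation
    shear = np.hypot(du_dx - dv_dy, du_dy + dv_dx)  # deformation magnitude
    return divergence, curl, shear
```

Regions dominated by divergence, curl or plain translation of the field then hint at expansion, rotation or translation as the underlying periodic motion type.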
Estimating repetition in practice remains challenging. First and foremost, repetition appears in many forms due to its diversity in motion types and motion continuities. Sources of variation in motion appearance include the action class, the origin of motion and the observer's viewpoint. Moreover, the motion appearance is often non-static due to a moving camera or because the observed phenomenon develops over time. In practice, repetitions are rarely perfectly periodic but rather non-stationary. Existing literature generally assumes static and stationary repetitive motion. As reality is more complex, we here address the challenges involved with non-static and non-stationary repetition by proposing a novel method for estimating repetition in real-world video.
To deal with the diverse and possibly non-static motion appearance in realistic video, our theory implies representing the video with a mixture of first-order differential motion maps. For non-stationary temporal dynamics, the Fourier transform with its fixed-period basis is not suitable. Instead, we handle complex temporal dynamics by decomposing the motion into a time-frequency distribution using the continuous wavelet transform. To increase robustness and to handle camera motion, we combine the wavelet power of all motion representations. Finally, we alleviate the need for explicit tracking or motion segmentation by segmenting repetitive motion directly from the wavelet power. On the task of repetition counting, our method performs well on an existing video dataset and on our novel QUVA Repetition dataset, which emphasizes more realistic video.
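As an illustration of this time-frequency view, the sketch below applies the continuous wavelet transform to a single non-stationary repetitive signal, such as a motion map pooled over the foreground, and counts repetitions by integrating the per-frame dominant frequency. The synthetic signal, frame rate, Morlet wavelet and the argmax-then-integrate counting rule are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
import pywt

fs = 30.0                                  # assumed video frame rate (fps)
t = np.arange(0, 10, 1 / fs)               # ten seconds of "motion signal"

# Non-stationary repetition: the rate drifts from 1 Hz to 2 Hz, so a
# fixed-period Fourier basis fits poorly while wavelets track the drift.
signal = np.sin(2 * np.pi * (1.0 * t + 0.05 * t ** 2))

scales = np.arange(2, 64)
coeffs, freqs = pywt.cwt(signal, scales, 'morl', sampling_period=1 / fs)
power = np.abs(coeffs) ** 2                # time-frequency power, (scales, frames)

# Dominant repetition frequency per frame from the maximum-power scale.
dominant = freqs[power.argmax(axis=0)]

# Counting: integrate the instantaneous frequency over time.
count = dominant.sum() / fs                # close to the true 15 cycles here
print(f'estimated repetition count: {count:.1f}')
```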
A preliminary version of this work appeared as Runia et al. (2018). The current manuscript largely maintains the original theory while making significant improvements to the method for repetition estimation. Specifically, we simplify our approach by removing the need for explicit motion segmentation prior to repetition estimation. Instead, we obtain a foreground motion segmentation directly from the wavelet filter responses densely computed over the motion maps. As the most discriminative motion representation is not known a priori, our previous work employed a self-quality assessment to select the best measurable representation. However, selecting a single most discriminative representation is inherently unsuitable for handling significant variations due to camera motion or motion evolution over the course of the video. We improve this by combining the wavelet power of all representations for robustness and viewpoint invariance. Together, the two improvements simplify our method while yielding improved or comparable results on the task of repetition counting. More precisely, the contributions of our work are as follows:
We rethink the theory of periodic motion to arrive at a classification of periodic motion. Starting from the 3D motion field induced by an object periodically moving through space, we decompose the motion into three elementary components: divergence, curl and shear. From the motion field decomposition and the field’s temporal dynamics, we identify 9 fundamental cases of periodic motion in 3D. For the 2D perception of 3D periodic motion we consider the observer’s viewpoint relative to the motion. Two viewpoint extremes are identified, from which 18 cases of 2D repetitive appearance emerge.
Our spatiotemporal filtering method addresses the wide variety of repetitive appearances and effectively handles non-stationary motion. Specifically, diversity in motion appearance is handled by representing the video as six differential motion maps that emerge from the theory. To identify the repetitive dynamics in the possibly non-stationary video, we use the continuous wavelet transform to produce a time-frequency distribution densely over the video. Directly from the wavelet responses we localize the repetitive motion and determine the repetitive contents (see the sketch following these contributions).
Extending beyond the video dataset of Levy and Wolf (2015), we propose a new dataset for repetition estimation that is more realistic and challenging in terms of non-static and non-stationary videos. To encourage further research on video repetition, we will make the dataset and source code available for download.
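As referenced above, here is a minimal sketch of segmenting repetitive motion directly from wavelet power. We assume the per-pixel wavelet power has already been aggregated over scales and time for each motion representation; max-pooling across representations and a relative threshold are our assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def repetition_mask(power_maps, rel_thresh=0.5):
    """Segment repetitive motion from temporal wavelet power.

    power_maps: array of shape (R, H, W) holding the aggregated
    wavelet power of R motion representations (flow, div, curl, ...).
    """
    combined = power_maps.max(axis=0)   # per pixel, keep the best-measurable representation
    mask = combined > rel_thresh * combined.max()
    return mask, combined
```

The combined map serves two purposes: its support yields the foreground segmentation, and the signal pooled over that support is what the wavelet-based counter analyzes.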
Repetition Estimation
3×3 Cartesian table of the motion types versus the motion continuities. These are the basic cases of periodicity in 3D emerging from the motion field decomposition and the temporal dynamics. The examples are: escalator, leaping frog, bouncing ball, pirouette, tightening a bolt, laundry machine, inflating a tire with repetitive texture, inflating a balloon and a breathing anemone
Categorization of Motion Types
Observed flow: the 18 fundamental cases for the 2D perception of 3D recurrence. The perception follows from the motion pattern, the motion continuity and the viewpoint on the continuous interval between the two extremes: side and front view. Symbols in the figure denote the flow direction, vanishing points, rotation points and expansion points. Dashed grey lines for constant motion indicate the need for texture to perceive recurrence. Pairs 4–16, 5–17 and 6–18 appear similar at first sight but differ in their signal profile
Non-static Repetition
Example video displaying a girl on a swing captured from three distinct viewpoints. Moving from one end of the continuous viewpoint spectrum (frontal) to the other (side) results in a dramatic change of motion appearance. The in-between viewpoint leaves the motion measurements either skewed or asymmetrical. In practice, we combine the motion representations to emphasize the one best measurable
Non-stationary Repetition
Dataset
Dataset statistics of YTSegments and QUVA Repetition
Conclusion
This paper categorized 3D intrinsic periodic motion as translation, rotation or expansion, depending on the first-order differential decomposition of the motion field. Additionally, we distinguished three periodic motion continuities: constant, intermittent and oscillatory motion. For the 2D perception of 3D periodicity, the camera will be somewhere in the continuous range between two viewpoint extremes. What follows are 18 fundamentally different cases of repetitive motion appearance in 2D. The practical challenges associated with repetition estimation are the wide variety in motion appearance, non-stationary temporal dynamics and camera motion. Our method addresses all these challenges by computing a diversified motion representation, employing the continuous wavelet transform and combining the power spectra of all representations to support viewpoint invariance. Whereas related work explicitly localizes the foreground motion, our method segments repetitive motion directly from the wavelet power maps, resulting in a simplified approach. We verify our claims by improving the state of the art on the task of repetition counting on our challenging new video dataset. The method requires no training and only a minimal number of hyperparameters, which are fixed throughout the paper. We envision applications beyond repetition estimation, as the wavelet power and scale maps can support localization of low- and high-frequency regions suitable for region pruning or action classification.