
Visual Discourse Parsing

By

Arjun R Akula ,  Song-Chun Zhu

University of California, Los Angeles


Abstract


Text-level discourse parsing aims to unmask how two segments (or sentences) in a text are related to each other. We propose the task of Visual Discourse Parsing, which requires understanding discourse relations among scenes in a video. Here we use the term scene to refer to a subset of video frames that can better summarize the video. In order to collect a dataset for learning discourse cues from videos, one needs to manually identify the scenes from a large pool of video frames and then annotate the discourse relations between them. This is clearly a time-consuming, expensive and tedious task. In this work, we propose an approach to identify discourse cues from videos without the need to explicitly identify and annotate the scenes. We also present a novel dataset containing 310 videos and the corresponding discourse cues to evaluate our approach. We believe that many multidisciplinary Artificial Intelligence problems such as Visual Dialog and Visual Storytelling would greatly benefit from the use of visual discourse cues.

Discourse structure aids in understanding a piece of text by linking it with other text units (such as surrounding clauses and sentences) from its context. A text span may be linked to another span through semantic relationships such as contrast, cause, etc. Text-level discourse parsing algorithms aim to unmask such relationships in text, which is central to many downstream natural language processing (NLP) applications such as information retrieval, text summarization, sentiment analysis and question answering. Recently, there has been a lot of focus on multidisciplinary Artificial Intelligence (AI) research problems such as visual storytelling and visual dialog. Solving these problems requires multi-modal knowledge that combines computer vision (CV), NLP, and knowledge representation & reasoning (KR), making commonsense knowledge and complex reasoning essential. In an effort to fill this need, we introduce the task of Visual Discourse Parsing.


Task Definition. 


The concrete task in Visual Discourse Parsing is the following: given a video, understand the discourse relationships among its scenes. Specifically, given a video, the task is to identify each scene's relation with its context. Here we use the term scene to refer to a subset of video frames that can better summarize the video. We use Rhetorical Structure Theory (RST) to capture discourse relations among the scenes. Consider, for example, the nine frames of a video shown in Figure 1. We can represent the discourse structure of this video using only 3 of the 9 frames, i.e. there are only 3 scenes for this video. The discourse structure in Figure 1 interprets the video as follows: the event "person going to the bathroom and cleaning his stains" is caused by the event "the person spilling coffee over his shirt"; the event "the person used his handkerchief to dry the water on his shirt" is simply an elaboration of the event "person going to the bathroom and cleaning his stains".
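To make this concrete, the Figure 1 structure can be encoded as labeled edges between scenes. This is only an illustrative sketch: the scene descriptions are paraphrased from the text above, and the tuple format is an assumption, not the paper's actual output format.

```python
# Illustrative encoding of the Figure 1 discourse structure.
# Scene ids and the (satellite, relation, nucleus) tuple format are assumptions.
scenes = {
    1: "person spills coffee over his shirt",
    2: "person goes to the bathroom and cleans his stains",
    3: "person uses his handkerchief to dry the water on his shirt",
}

# Each edge links a satellite scene to the nucleus scene it supports.
edges = [
    (1, "Cause", 2),        # the spill causes the trip to the bathroom
    (3, "Elaboration", 2),  # the handkerchief detail elaborates the cleaning
]

def relations_of(structure):
    """Return just the relation labels of a discourse structure."""
    return [rel for _, rel, _ in structure]

print(relations_of(edges))  # → ['Cause', 'Elaboration']
```

Note that only 3 scenes (out of 9 frames) are needed to carry the whole structure, which is what makes manual scene annotation both necessary and expensive.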


Background: Rhetorical Structure Theory (from Wikipedia)


Rhetorical relations, also called coherence relations or discourse relations, are paratactic (coordinate) or hypotactic (subordinate) relations that hold across two or more text spans. It is widely accepted that the coherence of a text arises through relations like these. RST, through its rhetorical relations, provides a systematic way for an analyst to analyse a text. An analysis is usually built by reading the text and constructing a tree using the relations. The following example is a title and summary appearing at the top of an article in Scientific American magazine (Ramachandran and Anstis, 1986). The original text, broken into numbered units, is

  1. [Title:] The Perception of Apparent Motion
  2. [Abstract:] When the motion of an intermittently seen object is ambiguous
  3. the visual system resolves confusion
  4. by applying some tricks that reflect a built-in knowledge of properties of the physical world


In the figure, the numbers 1, 2, 3, 4 mark the corresponding units listed above. The fourth and third units form a "Means" relation. The fourth unit is the essential part of this relation, so it is called the nucleus of the relation, and the third unit is called the satellite. Similarly, the second unit forms a "Condition" relation with the span covering the third and fourth units. All units are also spans, and spans may be composed of more than one unit.
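The nested structure described above can be sketched as a small tree. This is a minimal illustration of the nucleus/satellite idea, assuming simple dataclass names; real RST tools use richer schemas.

```python
# A minimal sketch of the RST analysis above as a nested tree.
# Class and field names are illustrative, not from any RST toolkit.
from dataclasses import dataclass
from typing import Union

@dataclass
class Unit:
    id: int
    text: str

@dataclass
class Span:
    relation: str
    nucleus: Union["Unit", "Span"]    # the essential part of the relation
    satellite: Union["Unit", "Span"]  # the supporting part

u2 = Unit(2, "When the motion of an intermittently seen object is ambiguous")
u3 = Unit(3, "the visual system resolves confusion")
u4 = Unit(4, "by applying some tricks that reflect a built-in knowledge "
             "of properties of the physical world")

# Units 3 and 4 form "Means", with unit 4 as nucleus (per the text above).
means = Span("Means", nucleus=u4, satellite=u3)
# Unit 2 is the "Condition" satellite of the whole 3-4 span.
condition = Span("Condition", nucleus=means, satellite=u2)

print(condition.relation, "over", condition.nucleus.relation)
```

Building the tree bottom-up like this mirrors how an analyst reads the text: smaller spans are combined by relations into larger spans until the whole text is covered.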

Approach







In step 3 of our algorithm, we use the standard machine-translation encoder-decoder RNN model. Because plain RNNs suffer from vanishing and exploding gradients, we use LSTM units, whose forget-style gates make them good at memorizing long-range dependencies. The sequence of video frames is passed to the encoder. The last hidden state of the encoder is then passed to the decoder, which generates the discourse structure as a sequence of words.
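The encoder-decoder step can be sketched as follows. This is a toy illustration with random weights and made-up dimensions, assuming precomputed per-frame feature vectors; the paper's actual model would be trained end-to-end in a deep-learning framework.

```python
# A minimal sketch of the encoder-decoder with LSTM units.
# Sizes, weights, and the greedy decoder are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID, VOCAB = 8, 16, 12  # toy feature, hidden, and vocabulary sizes

def lstm_step(x, h, c, W):
    """One LSTM step; W maps [x; h] to the 4 stacked gate pre-activations."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    i, f, o = 1 / (1 + np.exp(-i)), 1 / (1 + np.exp(-f)), 1 / (1 + np.exp(-o))
    c = f * c + i * np.tanh(g)         # forget gate keeps long-range context
    return o * np.tanh(c), c

W_enc = rng.normal(0, 0.1, (4 * HID, FEAT + HID))
W_dec = rng.normal(0, 0.1, (4 * HID, HID + HID))  # decoder input: prev embedding
W_out = rng.normal(0, 0.1, (VOCAB, HID))
embed = rng.normal(0, 0.1, (VOCAB, HID))

def encode(frames):
    h = c = np.zeros(HID)
    for x in frames:                    # frames: sequence of feature vectors
        h, c = lstm_step(x, h, c, W_enc)
    return h, c                         # last hidden state summarizes the video

def decode(h, c, max_len=5):
    tokens, prev = [], np.zeros(HID)
    for _ in range(max_len):            # greedy decoding, token by token
        h, c = lstm_step(prev, h, c, W_dec)
        t = int(np.argmax(W_out @ h))
        tokens.append(t)
        prev = embed[t]
    return tokens

frames = rng.normal(size=(9, FEAT))     # e.g. nine frames, as in Figure 1
h, c = encode(frames)
print(decode(h, c))                     # token ids of the predicted structure
```

The key design point is visible in `encode`: the entire frame sequence is compressed into the final `(h, c)` pair, which is all the decoder sees when emitting the discourse structure as a word sequence.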


Evaluation


We evaluate our approach using the following four metrics:

(a) BLEU score: We used the BLEU score (Papineni et al., 2002) to evaluate the translation quality of the discourse structure generated from the videos. We computed the BLEU score on the tokenized predictions and ground truth.

(b) Relations Accuracy: Each video in our dataset contains two discourse relations. Relations Accuracy is the proportion of relations correctly predicted by the model.

(c) Edges Accuracy: Each video in our dataset contains two edges. Edges Accuracy is the proportion of edges (i.e. RST node nuclearity directions) correctly predicted by the model.

(d) Relations+Edges Accuracy: Here we check the correctness of the complete discourse structure, i.e. a predicted discourse structure is considered correct only if all of its relations and edges are correctly predicted by the model.
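The four metrics above can be sketched as follows. The BLEU here is a simplified version (unigrams and bigrams only, with brevity penalty); function names and the prediction format are illustrative assumptions, not the paper's evaluation code.

```python
# Illustrative implementations of the four metrics (simplified BLEU).
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(pred, ref, max_n=2):
    """Geometric mean of clipped n-gram precisions, with brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        p, r = ngrams(pred, n), ngrams(ref, n)
        overlap = sum((p & r).values())           # clipped n-gram matches
        precisions.append(overlap / max(sum(p.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / max(len(pred), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def relations_accuracy(pred_rels, gold_rels):
    return sum(p == g for p, g in zip(pred_rels, gold_rels)) / len(gold_rels)

def edges_accuracy(pred_edges, gold_edges):
    return sum(p == g for p, g in zip(pred_edges, gold_edges)) / len(gold_edges)

def relations_plus_edges(pred, gold):
    """The whole structure counts only if every relation and edge matches."""
    return float(pred == gold)

# Toy example: each video has two relations and two nuclearity directions.
gold = {"relations": ["Cause", "Elaboration"], "edges": ["<-", "->"]}
pred = {"relations": ["Cause", "Elaboration"], "edges": ["<-", "<-"]}
print(relations_accuracy(pred["relations"], gold["relations"]))  # → 1.0
print(edges_accuracy(pred["edges"], gold["edges"]))              # → 0.5
print(relations_plus_edges(pred, gold))                          # → 0.0
```

The toy example shows why metric (d) is the strictest: one wrong nuclearity direction zeroes out the whole structure even though every relation label is correct.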


Conclusions


This paper presented an end-to-end learning approach to identify the discourse structure of videos. Central to the approach is the use of text descriptions of videos to identify discourse relations. In the future, the authors plan to extend the dataset to include longer videos that need more than three sentences to describe, and to experiment with multi-task learning approaches. Their results indicate that there is significant scope for improvement.
