Visual Discourse Parsing
By Arjun R Akula, Song-Chun Zhu
University of California, Los Angeles
Abstract
Text-level discourse parsing aims to unmask how two segments (or sentences) in a text are related to each other. We propose the task of Visual Discourse Parsing, which requires understanding discourse relations among scenes in a video. Here we use the term scene to refer to a subset of video frames that can better summarize the video. In order to collect a dataset for learning discourse cues from videos, one needs to manually identify the scenes from a large pool of video frames and then annotate the discourse relations between them. This is clearly a time-consuming, expensive and tedious task. In this work, we propose an approach to identify discourse cues from videos without the need to explicitly identify and annotate the scenes. We also present a novel dataset containing 310 videos and the corresponding discourse cues to evaluate our approach. We believe that many multi-disciplinary Artificial Intelligence problems such as Visual Dialog and Visual Storytelling would greatly benefit from the use of visual discourse cues.
Introduction
Discourse structure aids in understanding a piece of text by linking it with other text units (such as surrounding clauses, sentences, etc.) from its context. A text span may be linked to another span through semantic relationships such as contrast, causation, etc. Text-level discourse parsing algorithms aim to unmask such relationships in text, which is central to many downstream natural language processing (NLP) applications such as information retrieval, text summarization, sentiment analysis and question answering. Recently, there has been a lot of focus on multi-disciplinary Artificial Intelligence (AI) research problems such as visual storytelling and visual dialog. Solving these problems requires multi-modal knowledge that combines computer vision (CV), NLP, and knowledge representation & reasoning (KR), making the need for commonsense knowledge and complex reasoning more essential. In an effort to fill this need, we introduce the task of Visual Discourse Parsing.
Task Definition
The concrete task in Visual Discourse Parsing is the following: given a video, understand the discourse relationships among its scenes, i.e., identify each scene’s relation to its context. Here we use the term scene to refer to a subset of video frames that can better summarize the video. We use Rhetorical Structure Theory (RST) to capture discourse relations among the scenes. Consider, for example, the nine frames of a video shown in Figure 1. We can represent the discourse structure of this video using only 3 of the 9 frames, i.e. there are only 3 scenes in this video. The discourse structure in Figure 1 interprets the video as follows: the event “person going to the bathroom and cleaning his stains” is caused by the event “the person spilling coffee over his shirt”; the event “the person used his handkerchief to dry the water on his shirt” is simply an elaboration of the event “person going to the bathroom and cleaning his stains”.
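To make the target representation concrete, here is a minimal Python sketch of one way the discourse structure of the Figure 1 example could be encoded; the scene indices, relation labels and the DiscourseRelation container are illustrative assumptions, not the paper's actual format.

```python
# A hypothetical encoding of a discourse structure: each relation links a
# satellite scene to a nucleus scene with an RST relation label.
from dataclasses import dataclass

@dataclass
class DiscourseRelation:
    relation: str   # RST relation label, e.g. "Cause" or "Elaboration"
    nucleus: int    # index of the nucleus scene
    satellite: int  # index of the satellite scene

# Assumed scene indices for the Figure 1 example:
#   0: person spills coffee over his shirt
#   1: person goes to the bathroom and cleans his stains
#   2: person uses his handkerchief to dry the water on his shirt
figure1_structure = [
    DiscourseRelation(relation="Cause", nucleus=1, satellite=0),
    DiscourseRelation(relation="Elaboration", nucleus=1, satellite=2),
]
```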
Background: Rhetorical Structure Theory (from Wikipedia)
Rhetorical relations (also called coherence relations or discourse relations) are paratactic (coordinate) or hypotactic (subordinate) relations that hold across two or more text spans. It is widely accepted that textual coherence arises through relations of this kind. RST uses rhetorical relations to give an analyst a systematic way to analyse a text. An analysis is usually built by reading the text and constructing a tree using the relations. The following example is a title and summary appearing at the top of an article in Scientific American magazine (Ramachandran and Anstis, 1986). The original text, broken into numbered units, is:
- [Title:] The Perception of Apparent Motion
- [Abstract:] When the motion of an intermittently seen object is ambiguous
- the visual system resolves confusion
- by applying some tricks that reflect a built-in knowledge of properties of the physical world
In the RST diagram of this example, the numbers 1, 2, 3 and 4 mark the corresponding units listed above. The fourth unit and the third unit form a “Means” relation. The fourth unit is the essential part of this relation, so it is called the nucleus of the relation, and the third unit is called the satellite. Similarly, the second unit forms a “Condition” relation with the span made up of the third and fourth units. All units are also spans, and spans may be composed of more than one unit.
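As a toy illustration of units, spans, nuclei and satellites, the following sketch encodes the analysis described above as plain Python data; the tuple-of-unit-ranges representation is an assumption made for illustration, not the output of any RST tool.

```python
# Units of the Scientific American example, keyed by their number.
units = {
    1: "The Perception of Apparent Motion",
    2: "When the motion of an intermittently seen object is ambiguous",
    3: "the visual system resolves confusion",
    4: "by applying some tricks that reflect a built-in knowledge of "
       "properties of the physical world",
}

# Relations as (label, nucleus_span, satellite_span), where a span is an
# inclusive (first_unit, last_unit) range, following the analysis above.
relations = [
    ("Means", (4, 4), (3, 3)),      # unit 4 is the nucleus, unit 3 the satellite
    ("Condition", (3, 4), (2, 2)),  # unit 2 conditions the span of units 3-4
]

for label, nucleus, satellite in relations:
    print(f"{label}: satellite {satellite} -> nucleus {nucleus}")
```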
Approach
In step 3 of our algorithm, we use a standard machine-translation encoder-decoder RNN model. Since vanilla RNNs suffer from the vanishing and exploding gradient problems, we use LSTM units, which are good at memorizing long-range dependencies thanks to their forget-style gates. The sequence of video frames is passed to the encoder, and the last hidden state of the encoder is then passed to the decoder, which generates the discourse structure as a sequence of words.
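For concreteness, below is a minimal PyTorch sketch of such an encoder-decoder with LSTM units; the use of precomputed per-frame features, the dimensions, the vocabulary size and teacher forcing are all assumptions, and the architecture in the paper may differ.

```python
import torch
import torch.nn as nn

class VideoDiscourseSeq2Seq(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=5000, embed_dim=256):
        super().__init__()
        # Encoder LSTM consumes the sequence of per-frame feature vectors.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decoder LSTM emits the linearized discourse structure word by word.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, target_tokens):
        # frame_feats: (batch, num_frames, feat_dim)
        # target_tokens: (batch, target_len) word ids of the discourse structure
        _, (h, c) = self.encoder(frame_feats)       # keep only the final state
        dec_in = self.embed(target_tokens)          # teacher forcing during training
        dec_out, _ = self.decoder(dec_in, (h, c))   # decoder starts from encoder state
        return self.out(dec_out)                    # (batch, target_len, vocab_size)

# Usage sketch with random inputs:
# model = VideoDiscourseSeq2Seq()
# logits = model(torch.randn(2, 9, 2048), torch.randint(0, 5000, (2, 12)))
```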
Evaluation
We evaluate our approach using the following four metrics; a sketch of how they could be computed appears after the list:
(a) BLEU score. We used the BLEU score (Papineni et al., 2002) to evaluate the translation quality of the discourse structures generated from the videos. We computed the BLEU score on the tokenized predictions and ground truth.
(b) Relations Accuracy. Each video in our dataset contains two discourse relations. The Relations Accuracy metric is defined as the percentage of relations correctly predicted by the model.
(c) Edges Accuracy. Each video in our dataset contains two edges. The Edges Accuracy metric is defined as the percentage of edges (i.e. RST nuclearity directions) correctly predicted by the model.
(d) Relations+Edges Accuracy. Here, we measure the correctness of the complete discourse structure: a predicted discourse structure is considered correct only if all of its relations and edges are correctly predicted by the model.
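For concreteness, here is a sketch of how these metrics could be computed, assuming NLTK's corpus_bleu for (a) and whitespace-tokenized predictions and ground truth; how relations and edges are extracted from a predicted structure is left abstract, since the exact linearization is not specified here.

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu(predictions, references):
    # (a) corpus_bleu expects a list of reference token lists per hypothesis.
    return corpus_bleu([[r.split()] for r in references],
                       [p.split() for p in predictions])

def item_accuracy(pred_items, gold_items):
    # (b)/(c) Percentage of individual relations or edges predicted correctly;
    # each video contributes two relations and two edges.
    correct = sum(p == g for p, g in zip(pred_items, gold_items))
    return 100.0 * correct / max(len(gold_items), 1)

def structure_accuracy(pred_structs, gold_structs):
    # (d) A predicted discourse structure counts as correct only if all of its
    # relations and edges match the ground truth exactly.
    correct = sum(p == g for p, g in zip(pred_structs, gold_structs))
    return 100.0 * correct / max(len(gold_structs), 1)
```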
Conclusions
This paper presented an end-to-end learning approach to identify the discourse structure of videos. Central to our approach is the use of text descriptions of videos to identify discourse relations. In the future, we plan to extend this dataset to include longer videos that need more than three sentences to describe. We also intend to experiment with multi-task learning approaches. Our results indicate that there is significant scope for improvement.