
Visual Discourse Parsing

By Arjun R Akula and Song-Chun Zhu

University of California, Los Angeles


Abstract


Text-level discourse parsing aims to unmask how two segments (or sentences) in a text are related to each other. We propose the task of Visual Discourse Parsing, which requires understanding discourse relations among scenes in a video. Here we use the term scene to refer to a subset of video frames that can better summarize the video. In order to collect a dataset for learning discourse cues from videos, one needs to manually identify the scenes from a large pool of video frames and then annotate the discourse relations between them. This is clearly a time-consuming, expensive and tedious task. In this work, we propose an approach to identify discourse cues from videos without the need to explicitly identify and annotate the scenes. We also present a novel dataset containing 310 videos and the corresponding discourse cues to evaluate our approach. We believe that many multidisciplinary Artificial Intelligence problems, such as Visual Dialog and Visual Storytelling, would greatly benefit from the use of visual discourse cues.

Discourse structure aids in understanding a piece of text by linking it with other text units (such as surrounding clauses, sentences, etc.) from its context. A text span may be linked to another span through semantic relationships such as a contrast relation, a causal relation, etc. Text-level discourse parsing algorithms aim to unmask such relationships in text, which is central to many downstream natural language processing (NLP) applications such as information retrieval, text summarization, sentiment analysis and question answering. Recently, there has been a lot of focus on multidisciplinary Artificial Intelligence (AI) research problems such as visual storytelling and visual dialog. Solving these problems requires multi-modal knowledge that combines computer vision (CV), NLP, and knowledge representation & reasoning (KR), which makes commonsense knowledge and complex reasoning essential. In an effort to fill this need, we introduce the task of Visual Discourse Parsing.


Task Definition. 


The concrete task in Visual Discourse Parsing is the following: given a video, understand the discourse relationships among its scenes, i.e. identify each scene's relation with its context. Here we use the term scene to refer to a subset of video frames that can better summarize the video. We use Rhetorical Structure Theory (RST) to capture discourse relations among the scenes. Consider, for example, the nine frames of a video shown in Figure 1. We can represent the discourse structure of this video using only 3 of the 9 frames, i.e. this video has only 3 scenes. The discourse structure in Figure 1 interprets the video as follows: the event "person going to the bathroom and cleaning his stains" is caused by the event "the person spilling coffee over his shirt"; the event "the person used his handkerchief to dry the water on his shirt" is simply an elaboration of the event "person going to the bathroom and cleaning his stains".


RST background (from Wikipedia)


Rhetorical relations (also called coherence relations or discourse relations) are paratactic (coordinate) or hypotactic (subordinate) relations that hold between two or more text spans. It is widely accepted that coherence in text arises through relations of this kind. RST uses rhetorical relations to give an analyst a systematic way to analyse a text; an analysis is usually built by reading the text and constructing a tree over it using the relations. The following example is a title and summary appearing at the top of an article in Scientific American magazine (Ramachandran and Anstis, 1986). The original text, broken into numbered units, is:

  1. [Title:] The Perception of Apparent Motion
  2. [Abstract:] When the motion of an intermittently seen object is ambiguous
  3. the visual system resolves confusion
  4. by applying some tricks that reflect a built-in knowledge of properties of the physical world


In the figure, the numbers 1, 2, 3 and 4 mark the corresponding units listed above. The fourth and third units form a "Means" relation. The fourth unit is the essential part of this relation, so it is called the nucleus of the relation, and the third unit is called its satellite. Similarly, the second unit forms a "Condition" relation with the span consisting of the third and fourth units. Every unit is also a span, and spans may be composed of more than one unit.
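The nucleus/satellite tree described above can be encoded in a few lines. This is only an illustrative sketch: the class and field names are ours, and RST itself prescribes no particular data format.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative encoding of the Scientific American analysis above.
# A span is either a numbered unit (int) or a nested Relation.
@dataclass
class Relation:
    name: str
    nucleus: Union[int, "Relation"]    # the essential span of the relation
    satellite: Union[int, "Relation"]  # the supporting span

# Units 3 and 4 form a "Means" relation, with unit 4 as the nucleus.
means = Relation("Means", nucleus=4, satellite=3)

# Unit 2 is the satellite of a "Condition" relation whose nucleus is the
# span formed by units 3 and 4, showing how spans compose into a tree.
condition = Relation("Condition", nucleus=means, satellite=2)
```

Nesting the `Means` relation inside `Condition` is what makes the analysis a tree rather than a flat list of relations.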

Approach

In step 3 of our algorithm, we use a standard machine-translation encoder-decoder RNN model. Since vanilla RNNs suffer from vanishing and exploding gradients, we use LSTM units, which memorize long-range dependencies well thanks to their forget-style gates. The sequence of video frames is passed to the encoder, the last hidden state of the encoder is then passed to the decoder, and the decoder generates the discourse structure as a sequence of words.
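The encode-then-decode flow can be sketched as follows. This is a dependency-free toy, not the authors' implementation: a single tanh update stands in for an LSTM cell (real code would use gated units, e.g. `torch.nn.LSTM`), and the vocabulary, sizes and weights are all illustrative.

```python
import math

HIDDEN = 4
VOCAB = ["<eos>", "cause", "elaboration", "scene1", "scene2", "scene3"]

# Toy output weights mapping the hidden state to vocabulary scores.
W_OUT = [
    [0.0, 0.0, 0.0, 0.0],  # <eos>
    [0.5, 0.5, 0.5, 0.5],  # cause
    [0.2, 0.2, 0.2, 0.2],  # elaboration
    [0.1, 0.1, 0.1, 0.1],  # scene1
    [0.1, 0.1, 0.1, 0.1],  # scene2
    [0.1, 0.1, 0.1, 0.1],  # scene3
]

def step(state, x):
    """Stand-in for one LSTM step: mix the previous state with the input."""
    return [math.tanh(0.5 * s + 0.5 * v) for s, v in zip(state, x)]

def encode(frames):
    """Run the encoder over the frame features; return the last hidden state."""
    state = [0.0] * HIDDEN
    for frame in frames:
        state = step(state, frame)
    return state

def decode(state, max_len=6):
    """Greedy decoder: emit the highest-scoring word until <eos> or max_len."""
    words = []
    for _ in range(max_len):
        scores = [sum(w * s for w, s in zip(row, state)) for row in W_OUT]
        best = max(range(len(VOCAB)), key=lambda i: scores[i])
        if VOCAB[best] == "<eos>":
            break
        words.append(VOCAB[best])
        state = step(state, [1.0] * HIDDEN)  # stand-in for the token embedding
    return words

frames = [[0.1 * i] * HIDDEN for i in range(9)]  # 9 frames of pooled features
structure = decode(encode(frames))
# With these untrained toy weights the decoder just repeats one token;
# a trained model would emit a meaningful discourse-structure sequence.
```

The key interface point is that only the encoder's final state crosses to the decoder, which is exactly the bottleneck the paragraph above describes.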


Evaluation


We evaluate our approach using the following four metrics:

(a) BLEU score. We used the BLEU score (Papineni et al., 2002) to evaluate the translation quality of the discourse structure generated from the videos. We computed the BLEU score on the tokenized predictions and ground truth.
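For intuition, here is a minimal BLEU-style sketch over tokenized strings. It computes only the unigram modified precision with a brevity penalty; full BLEU (Papineni et al., 2002) combines modified n-gram precisions up to n = 4.

```python
import math
from collections import Counter

def bleu1(pred_tokens, ref_tokens):
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty."""
    if not pred_tokens:
        return 0.0
    pred_counts = Counter(pred_tokens)
    ref_counts = Counter(ref_tokens)
    # Clip each predicted token's count by its count in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in pred_counts.items())
    precision = overlap / len(pred_tokens)
    # Brevity penalty: punish predictions shorter than the reference.
    if len(pred_tokens) >= len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(pred_tokens))
    return bp * precision

pred = "cause of scene two is scene one".split()
ref = "scene two is caused by scene one".split()
score = bleu1(pred, ref)  # 5 of 7 unigrams match, equal lengths -> 5/7
```

The clipping step is what stops a prediction from gaming the metric by repeating a reference word.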

(b) Relations Accuracy. Each video in our dataset contains two discourse relations. The Relations Accuracy metric is defined as the fraction of relations correctly predicted by the model.

(c) Edges Accuracy. Each video in our dataset contains two edges. The Edges Accuracy metric is defined as the fraction of edges (i.e. RST node nuclearity directions) correctly predicted by the model.

(d) Relations+Edges Accuracy. Here we measure the correctness of the complete discourse structure: a predicted discourse structure is considered correct only if all of its relations and edges are correctly predicted by the model.


Conclusions


This paper presented an end-to-end learning approach for identifying the discourse structure of videos. Central to the approach is the use of text descriptions of videos to identify discourse relations. In the future, the authors plan to extend the dataset to include longer videos that need more than three sentences to describe, and to experiment with multi-task learning approaches. Their results indicate that there is significant scope for improvement.
