By Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler
Karlsruhe Institute of Technology, Massachusetts Institute of Technology, University of Toronto
Abstract
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from the simpler “Who” did “What” to “Whom”, to “Why” and “How” certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information – video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this dataset public along with an evaluation benchmark to encourage inspiring work in this challenging domain.
MovieQA dataset
The goal of our paper is to create a challenging benchmark that evaluates semantic understanding over long temporal data. We collect a dataset with very diverse sources of information that can be exploited in this challenging domain. Our data consists of quizzes about movies that automatic systems will have to answer. For each movie, a quiz comprises a set of questions, each with 5 multiple-choice answers, only one of which is correct. The system has access to various sources of textual and visual information, which we describe in detail below. We collected 408 subtitled movies and obtained their extended summaries in the form of plot synopses from Wikipedia. We crawled imsdb for scripts, which were available for 49% (199) of our movies. A fraction of our movies (60) come with DVS transcriptions.
Plot synopses
are movie summaries that fans write after watching the movie. Synopses widely vary in detail and range from one to 20 paragraphs but focus on describing content that is directly relevant to the story. They rarely contain detailed visual information (e.g. character appearance) and focus more on describing the movie events and character interactions. We exploit plots to gather our quizzes.
Videos and subtitles.
An average movie is about 2 hours in length and has over 198K frames and almost 2000 shots. Note that video alone contains information about, e.g., “Who” did “What” to “Whom”, but may lack the information to explain why something happened. Dialogues play an important role, and only both modalities together allow us to fully understand the story. Note that subtitles do not contain speaker information. In our dataset, we provide video clips rather than full movies.
DVS
is a service that narrates movie scenes to the visually impaired by inserting relevant descriptions in between dialogues. These descriptions contain sufficient “visual” information about the scene to allow a visually impaired audience to follow the movie. DVS thus acts as a proxy for a perfect vision system and is another source for answers.
Scripts.
The scripts that we collected are written by screenwriters and serve as a guideline for movie making. They typically contain detailed descriptions of scenes and, unlike subtitles, contain both dialogues and speaker information. Scripts are thus similar in content to, if not richer than, DVS+subtitles; however, they are not always entirely faithful to the movie, as the director may exercise artistic freedom.
Representations for Text and Video
TF-IDF
is a popular and successful feature in information retrieval. In our case, we treat plots (or other forms of text) from different movies as documents and compute a weight for each word. We lowercase all words, apply stemming, and compute the vocabulary V, which consists of words w that appear more than θ times in the documents. We represent each sentence (or question or answer) in a bag-of-words style with a TF-IDF score for each word.
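As a concrete illustration, the following is a minimal Python sketch of such a TF-IDF sentence representation, assuming scikit-learn and NLTK are available. The threshold θ is mapped onto scikit-learn's min_df document-frequency cutoff, which is only a proxy for the word-count threshold described above; the toy corpus and parameter values are placeholders, not our actual setup.

```python
# Minimal sketch of a TF-IDF sentence representation (assumed libraries:
# scikit-learn, NLTK). Values and corpus snippets are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_tokens(text):
    # The vectorizer lowercases the text before this tokenizer runs;
    # here we only split on whitespace and stem each token.
    return [stemmer.stem(tok) for tok in text.split()]

theta = 1  # toy cutoff; a larger threshold would be used on the full plot corpus
vectorizer = TfidfVectorizer(lowercase=True, tokenizer=stem_tokens, min_df=theta)

# Treat each plot synopsis as one document when learning the IDF weights.
plots = [
    "the hero saves the city from the villain",
    "a detective investigates a murder in the city",
]
vectorizer.fit(plots)

# Represent a question (or a candidate answer) as a bag of words
# weighted by TF-IDF scores.
question_vec = vectorizer.transform(["who saves the city"])
```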
Word2Vec.
A disadvantage of TF-IDF is that it is unable to capture the similarities between words. We use the skip-gram model and train it on roughly 1200 movie plots to obtain domain-specific, 300-dimensional word embeddings. A sentence is then represented by mean-pooling its word embeddings. We normalize the resulting vector to have unit norm.
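A minimal sketch of this mean-pooled Word2Vec representation follows, assuming gensim is used for the skip-gram training; the toy corpus and hyperparameters are placeholders rather than our actual settings.

```python
# Sketch of a mean-pooled, unit-normalized Word2Vec sentence representation
# (assumed library: gensim). Corpus and hyperparameters are placeholders.
import numpy as np
from gensim.models import Word2Vec

# Each plot is tokenized into a list of lowercase words (placeholder corpus).
plot_corpus = [
    ["the", "hero", "saves", "the", "city"],
    ["a", "detective", "investigates", "a", "murder"],
]

# Skip-gram model (sg=1) with 300-dimensional embeddings, as described above.
w2v = Word2Vec(sentences=plot_corpus, vector_size=300, sg=1, min_count=1)

def sentence_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    v = np.mean(vecs, axis=0)        # mean-pool the word embeddings
    return v / np.linalg.norm(v)     # normalize to unit norm

q = sentence_vector(["who", "saves", "the", "city"], w2v)
```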
SkipThoughts.
While the sentence representation using mean-pooled Word2Vec discards word order, SkipThoughts use a Recurrent Neural Network to capture the underlying sentence semantics. We use the pre-trained model to compute a 4800-dimensional sentence representation.
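For illustration, a sketch of computing such sentence vectors, assuming the interface of the publicly released skip-thoughts code (load_model / Encoder.encode); if the installed version exposes a different interface, treat this as pseudocode.

```python
# Sketch of encoding sentences with a pre-trained SkipThoughts model.
# The load_model/Encoder interface is assumed from the public release;
# verify it against the version you install.
import skipthoughts

model = skipthoughts.load_model()        # loads the pre-trained uni/bi-skip models
encoder = skipthoughts.Encoder(model)

sentences = ["Who saves the city?", "The detective investigates a murder."]
vectors = encoder.encode(sentences)      # (num_sentences, 4800) representations
```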
Video.
To answer questions from the video, we learn an embedding between a shot and a sentence, which maps the two modalities into a common space. In this joint space, one can score the similarity between the two modalities via a simple dot product. This allows us to apply all of our proposed question-answering techniques in their original form. To learn the joint embedding, we follow prior work that extends visual-semantic embeddings to video. Specifically, we use the GoogLeNet architecture as well as a hybrid-CNN for scene recognition (Places) to extract frame-wise features, and mean-pool the representations over all frames in a shot. The embedding is a linear mapping of the shot representation on the video side and an LSTM over word embeddings on the sentence side, trained using a ranking loss on the Movie Description dataset.
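The following PyTorch sketch illustrates the shape of such a joint shot-sentence embedding trained with a ranking loss; the feature dimensions, margin, and exact loss formulation are assumptions for illustration, not our precise training setup.

```python
# Minimal sketch of a shot-sentence joint embedding with a pairwise
# ranking loss. Dimensions and the margin are assumed placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, shot_dim=2048, word_dim=300, embed_dim=300):
        super().__init__()
        self.shot_fc = nn.Linear(shot_dim, embed_dim)                # linear map for mean-pooled shot features
        self.lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)   # sentence encoder over word embeddings

    def forward(self, shot_feats, word_embs):
        # shot_feats: (B, shot_dim) mean-pooled frame features per shot
        # word_embs:  (B, T, word_dim) word embeddings of each sentence
        v = F.normalize(self.shot_fc(shot_feats), dim=-1)
        _, (h, _) = self.lstm(word_embs)
        s = F.normalize(h[-1], dim=-1)
        return v, s

def ranking_loss(v, s, margin=0.1):
    # Matching shot/sentence pairs sit on the diagonal of the similarity
    # matrix; mismatched pairs should score lower by at least the margin.
    scores = v @ s.t()                     # (B, B) dot-product similarities
    pos = scores.diag().unsqueeze(1)       # (B, 1) scores of the true pairs
    cost = (margin + scores - pos).clamp(min=0)
    cost.fill_diagonal_(0)
    return cost.mean()

model = JointEmbedding()
v, s = model(torch.randn(4, 2048), torch.randn(4, 12, 300))
loss = ranking_loss(v, s)
```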
[Figure: neural similarity architecture. Table: question-answering accuracy.]
Conclusion
We introduced the MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text. Our dataset is unique in that it contains several sources of information – video clips, subtitles, scripts, plots and DVS. We provided several intelligent baselines and extended existing QA techniques to analyze the difficulty of our task. Our benchmark with an evaluation server is online at http://movieqa.cs.toronto.edu.