
MovieQA: Understanding Stories in Movies through Question-Answering

- By Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler
Karlsruhe Institute of Technology,
Massachusetts Institute of Technology,
University of Toronto





Abstract
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from the simpler “Who” did “What” to “Whom”, to “Why” and “How” certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information – video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this data set public along with an evaluation benchmark to encourage inspiring work in this challenging domain.









MovieQA dataset

The goal of our paper is to create a challenging benchmark that evaluates semantic understanding over long temporal data. We collect a dataset with very diverse sources of information that can be exploited in this challenging domain. Our data consists of quizzes about movies that automatic systems have to answer. For each movie, a quiz comprises a set of questions, each with 5 multiple-choice answers, only one of which is correct. The system has access to various sources of textual and visual information, which we describe in detail below. We collected 408 subtitled movies and obtained their extended summaries in the form of plot synopses from Wikipedia. We crawled IMSDb for scripts, which were available for 49% (199) of our movies. A fraction of our movies (60) come with DVS transcriptions.

Plot synopses 

are movie summaries that fans write after watching the movie. Synopses vary widely in detail, ranging from 1 to 20 paragraphs, but focus on describing content that is directly relevant to the story. They rarely contain detailed visual information (e.g., character appearance) and focus more on describing the movie events and character interactions. We exploit plots to gather our quizzes.

Videos and subtitles. 
An average movie is about 2 hours in length and has over 198K frames and almost 2000 shots. Note that video alone contains information about, e.g., “Who” did “What” to “Whom”, but may lack the information needed to explain why something happened. Dialogues play an important role, and only the two modalities together allow us to fully understand the story. Note that subtitles do not contain speaker information. In our dataset, we provide video clips rather than full movies.

DVS 
is a service that narrates movie scenes to the visually impaired by inserting relevant descriptions in between dialogues. These descriptions contain sufficient “visual” information about the scene to allow a visually impaired audience to follow the movie. DVS thus acts as a proxy for a perfect vision system and is another source for answers.

Scripts. 
The scripts that we collected are written by screenwriters and serve as a guideline for movie making. They typically contain detailed scene descriptions and, unlike subtitles, include both dialogues and speaker information. Scripts are thus similar in content to, if not richer than, DVS plus subtitles; however, they are not always entirely faithful to the movie, as the director may take artistic liberties.


Representations for Text and Video

TF-IDF 
is a popular and successful feature in information retrieval. In our case, we treat plots (or other forms of text) from different movies as documents and compute a weight for each word. We lower-case all words, apply stemming, and compute the vocabulary V, which consists of words w that appear more than θ times in the documents. We represent each sentence (or question or answer) in a bag-of-words style with a TF-IDF score for each word.
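As a rough illustration of this representation, the sketch below lower-cases and stems tokens, fits a TF-IDF model over plot documents, and returns a bag-of-words vector per sentence. The tooling (scikit-learn, NLTK), the toy plots, and the use of min_df as the threshold θ are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal TF-IDF sketch (scikit-learn + NLTK assumed; not the authors' exact code).
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")

def preprocess(text):
    # Lower-case and stem every token, as described above.
    return " ".join(stemmer.stem(w) for w in text.lower().split())

# Toy stand-ins for plot-synopsis "documents"; one document per movie.
movie_plots = [
    "A detective follows the suspect through the city.",
    "The detective discovers a letter hidden in the attic.",
    "A storm forces the crew to abandon the ship.",
]

# min_df plays the role of the vocabulary threshold theta on word counts.
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(preprocess(p) for p in movie_plots)

def represent(sentence):
    # Bag-of-words vector with a TF-IDF score for each vocabulary word.
    return vectorizer.transform([preprocess(sentence)])
```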

Word2Vec. 

A disadvantage of TF-IDF is that it is unable to capture the similarities between words. We use the skip-gram model and train it on roughly 1200 movie plots to obtain domain-specific, 300-dimensional word embeddings. A sentence is then represented by mean-pooling its word embeddings. We normalize the resulting vector to have unit norm. 
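A minimal sketch of this step, assuming gensim for the skip-gram model: train 300-dimensional embeddings on a (toy) plot corpus, mean-pool the word vectors of a sentence, and normalize to unit norm. The corpus and hyper-parameters below are placeholders, not the authors' setup.

```python
# Mean-pooled Word2Vec sentence representation (sketch; gensim assumed).
import numpy as np
from gensim.models import Word2Vec

# Toy corpus standing in for roughly 1200 tokenized movie plots.
plot_sentences = [
    ["the", "detective", "follows", "the", "suspect"],
    ["she", "discovers", "the", "letter", "in", "the", "attic"],
]

# Skip-gram (sg=1), 300-dimensional embeddings as described above.
w2v = Word2Vec(plot_sentences, vector_size=300, sg=1, min_count=1)

def sentence_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    if not vecs:
        return np.zeros(300)
    v = np.mean(vecs, axis=0)                # mean-pool word embeddings
    return v / (np.linalg.norm(v) + 1e-8)    # normalize to unit norm
```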

SkipThoughts. 

While the mean-pooled Word2Vec sentence representation discards word order, SkipThoughts uses a Recurrent Neural Network to capture the underlying sentence semantics. We use the pre-trained model to compute a 4800-dimensional sentence representation.
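With any of these fixed sentence embeddings (TF-IDF, mean-pooled Word2Vec, or SkipThoughts), a simple similarity baseline can pick the answer whose vector best matches the question and the supporting story sentences. The sketch below is an illustrative scorer along those lines, not the paper's exact baseline; the `embed` callable and the scoring rule are assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def answer_question(question, answers, story_sentences, embed):
    """Pick the answer best supported by the story.

    `embed` maps a sentence (string) to a fixed-size vector, e.g. the
    mean-pooled Word2Vec or SkipThought encoder described above.
    """
    q = embed(question)
    scores = []
    for a in answers:
        av = embed(a)
        # Combine question-story and answer-story similarity over the best story sentence.
        support = max(cosine(q, embed(s)) + cosine(av, embed(s)) for s in story_sentences)
        scores.append(support)
    return int(np.argmax(scores))  # index of the predicted answer
```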

Video. 

To answer questions from the video, we learn an embedding between a shot and a sentence, which maps the two modalities into a common space. In this joint space, one can score the similarity between the two modalities via a simple dot product. This allows us to apply all of our proposed question-answering techniques in their original form. To learn the joint embedding we follow prior work on visual-semantic embeddings, extended to video. Specifically, we use the GoogLeNet architecture as well as a hybrid-CNN trained for scene recognition on the Places dataset to extract frame-wise features, and mean-pool the representations over all frames in a shot. The embedding is a linear mapping of the shot representation on the video side and an LSTM over word embeddings on the sentence side, trained with a ranking loss on the MovieDescription dataset.
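The following PyTorch sketch illustrates the shape of such a joint embedding: a linear projection for mean-pooled shot features, an LSTM over word embeddings for the sentence, and a margin-based ranking loss with dot-product similarity. Layer sizes, the margin, and the feature dimensions are assumed for illustration and are not the authors' configuration.

```python
# Sketch of a joint shot-sentence embedding (PyTorch; dimensions and margin are assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, shot_dim=2048, word_dim=300, joint_dim=512, vocab_size=10000):
        super().__init__()
        self.shot_proj = nn.Linear(shot_dim, joint_dim)   # linear map for mean-pooled CNN features
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, joint_dim, batch_first=True)  # sentence encoder

    def embed_shot(self, shot_feats):
        # shot_feats: (batch, shot_dim), frame features mean-pooled over the shot.
        return F.normalize(self.shot_proj(shot_feats), dim=-1)

    def embed_sentence(self, token_ids):
        # token_ids: (batch, seq_len); use the final LSTM hidden state as the sentence vector.
        _, (h, _) = self.lstm(self.word_emb(token_ids))
        return F.normalize(h[-1], dim=-1)

def ranking_loss(shot_vec, sent_vec, margin=0.2):
    # Similarity is a dot product in the joint space; push matching pairs above mismatched ones.
    sim = shot_vec @ sent_vec.t()        # (batch, batch) similarity matrix
    pos = sim.diag().unsqueeze(1)        # matching shot-sentence pairs
    loss = F.relu(margin + sim - pos)    # hinge over all mismatched pairs
    loss.fill_diagonal_(0.0)
    return loss.mean()
```

A matching shot-sentence pair and the other pairs in the batch serve as positives and negatives, respectively; once trained, the dot product in this space is the similarity used by the question-answering methods above.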




Neural Similarity Architecture


Accuracy 


Conclusion


We introduced the MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text. Our dataset is unique in that it contains several sources of information – video clips, subtitles, scripts, plots and DVS. We provided several intelligent baselines and extended existing QA techniques to analyze the difficulty of our task. Our benchmark with an evaluation server is online at http://movieqa.cs.toronto.edu.
 

GitHub Code


