
MovieQA: Understanding Stories in Movies through Question-Answering

By Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler
Karlsruhe Institute of Technology,
Massachusetts Institute of Technology,
University of Toronto





Abstract
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from the simpler “Who” did “What” to “Whom”, to “Why” and “How” certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information – video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this data set public along with an evaluation benchmark to encourage inspiring work in this challenging domain.









MovieQA dataset

The goal of our paper is to create a challenging benchmark that evaluates semantic understanding over long temporal data. We collect a dataset with very diverse sources of information that can be exploited in this challenging domain. Our data consists of quizzes about movies that automatic systems will have to answer. For each movie, a quiz comprises a set of questions, each with 5 multiple-choice answers, only one of which is correct. The system has access to various sources of textual and visual information, which we describe in detail below. We collected 408 subtitled movies and obtained their extended summaries in the form of plot synopses from Wikipedia. We crawled IMSDb for scripts, which were available for 49% (199) of our movies. A fraction of our movies (60) come with DVS transcriptions.

Plot synopses
are movie summaries that fans write after watching the movie. Synopses vary widely in detail and range from one to 20 paragraphs, but focus on describing content that is directly relevant to the story. They rarely contain detailed visual information (e.g., character appearance) and focus more on describing the movie events and character interactions. We exploit plots to gather our quizzes.

Videos and subtitles. 
An average movie is about 2 hours in length and has over 198K frames and almost 2,000 shots. Note that video alone contains information about, e.g., "Who" did "What" to "Whom", but may lack the information to explain why something happened. Dialogues play an important role, and only both modalities together allow us to fully understand the story. Note that subtitles do not contain speaker information. In our dataset, we provide video clips rather than full movies.

DVS 
is a service that narrates movie scenes to the visually impaired by inserting relevant descriptions between dialogues. These descriptions contain sufficient "visual" information about the scene to allow a visually impaired audience to follow the movie. DVS thus acts as a proxy for a perfect vision system and is another source for answers.

Scripts. 
The scripts that we collected are written by screenwriters and serve as a guideline for movie making. They typically contain detailed descriptions of scenes and, unlike subtitles, contain both dialogues and speaker information. Scripts are thus similar, if not richer, in content to DVS plus subtitles; however, they are not always entirely faithful to the movie, as the director may take artistic freedom.


Representations for Text and Video

TF-IDF 
is a popular and successful feature in information retrieval. In our case, we treat plots (or other forms of text) from different movies as documents and compute a weight for each word. We lower-case all words, apply stemming, and compute the vocabulary V, which consists of words w that appear more than θ times in the documents. We represent each sentence (or question or answer) in a bag-of-words style with a TF-IDF score for each word.
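As a concrete illustration, below is a minimal sketch of a TF-IDF answering baseline assuming scikit-learn. The movie, the toy sentences, the min_df threshold standing in for θ, and the question-plus-answer scoring are illustrative assumptions; the paper's exact stemming and scoring scheme may differ.

# Minimal sketch of a TF-IDF multiple-choice baseline (assumes scikit-learn).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy plot sentences for one movie (illustrative only).
plot_sentences = [
    "Andy Dufresne is convicted of murdering his wife and her lover.",
    "He is sentenced to two consecutive life terms at Shawshank prison.",
    "Andy befriends Red, a fellow inmate who smuggles goods.",
]
question = "Why is Andy sent to prison?"
answers = [
    "He is convicted of murdering his wife and her lover.",
    "He robs a bank.",
    "He assaults a guard.",
    "He escapes from another prison.",
    "He forges documents.",
]

# Words are lower-cased; min_df plays the role of the frequency threshold theta.
vectorizer = TfidfVectorizer(lowercase=True, min_df=1)
vectorizer.fit(plot_sentences)

P = vectorizer.transform(plot_sentences)   # |plot| x |V| TF-IDF matrix
q = vectorizer.transform([question])       # 1 x |V|
A = vectorizer.transform(answers)          # 5 x |V|

# Score each answer by how well question+answer matches the best plot sentence.
scores = []
for i in range(A.shape[0]):
    qa = q + A.getrow(i)                   # combined question+answer vector
    sims = P @ qa.T                        # dot product against every plot sentence
    scores.append(sims.max())
predicted = int(np.argmax(scores))         # index of the highest-scoring answer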

Word2Vec. 

A disadvantage of TF-IDF is that it is unable to capture the similarities between words. We use the skip-gram model and train it on roughly 1200 movie plots to obtain domain-specific, 300-dimensional word embeddings. A sentence is then represented by mean-pooling its word embeddings. We normalize the resulting vector to have unit norm. 
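The sketch below shows this representation, assuming gensim for the skip-gram training; the toy corpus and the sentence_vector helper are illustrative placeholders rather than the paper's exact pipeline.

# Minimal sketch of the mean-pooled Word2Vec sentence representation (assumes gensim).
import numpy as np
from gensim.models import Word2Vec

# Tokenized plot sentences; the paper trains on roughly 1200 movie plots.
corpus = [
    ["andy", "is", "convicted", "of", "murdering", "his", "wife"],
    ["he", "is", "sentenced", "to", "life", "at", "shawshank", "prison"],
]

# Skip-gram (sg=1), 300-dimensional embeddings as described above.
model = Word2Vec(corpus, vector_size=300, sg=1, min_count=1, window=5)

def sentence_vector(tokens, model):
    """Mean-pool word embeddings and normalize the result to unit norm."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    v = np.mean(vecs, axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

q_vec = sentence_vector(["why", "is", "andy", "sent", "to", "prison"], model)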

SkipThoughts. 

While the sentence representation using mean-pooled Word2Vec discards word order, SkipThoughts uses a Recurrent Neural Network to capture the underlying sentence semantics. We use the pre-trained model to compute a 4800-dimensional sentence representation.

Video. 

To answer questions from the video, we learn an embedding between a shot and a sentence that maps the two modalities into a common space. In this joint space, one can score the similarity between the two modalities via a simple dot product, which allows us to apply all of our proposed question-answering techniques in their original form. To learn the joint embedding we follow prior work on visual-semantic embeddings, extended to video. Specifically, we extract frame-wise features using the GoogLeNet architecture as well as hybrid-CNN scene features trained on Places, and mean-pool the representations over all frames in a shot. The embedding is a linear mapping of the shot representation on the video side and an LSTM over word embeddings on the sentence side, trained with a ranking loss on the Movie Description dataset.
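The following is a minimal PyTorch sketch of such a shot-sentence joint embedding and its ranking loss. The module name, feature dimensions, and margin are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of the shot-sentence joint embedding (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShotSentenceEmbedding(nn.Module):
    def __init__(self, frame_dim=2048, word_dim=300, embed_dim=300):
        super().__init__()
        self.video_proj = nn.Linear(frame_dim, embed_dim)     # linear map on the shot side
        self.sentence_rnn = nn.LSTM(word_dim, embed_dim, batch_first=True)

    def embed_shot(self, frame_feats):
        # frame_feats: (num_frames, frame_dim) CNN features; mean-pool over the shot.
        shot = frame_feats.mean(dim=0)
        return F.normalize(self.video_proj(shot), dim=-1)

    def embed_sentence(self, word_embs):
        # word_embs: (num_words, word_dim); use the final LSTM hidden state.
        _, (h, _) = self.sentence_rnn(word_embs.unsqueeze(0))
        return F.normalize(h[-1, 0], dim=-1)

def ranking_loss(v, s_pos, s_neg, margin=0.2):
    """Hinge ranking loss: the matched sentence should score above a mismatched one."""
    return F.relu(margin - v @ s_pos + v @ s_neg)

model = ShotSentenceEmbedding()
v = model.embed_shot(torch.randn(40, 2048))          # a 40-frame shot
s_pos = model.embed_sentence(torch.randn(12, 300))   # matching description
s_neg = model.embed_sentence(torch.randn(9, 300))    # random non-matching sentence
loss = ranking_loss(v, s_pos, s_neg)
similarity = v @ s_pos                               # dot product used at QA time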




[Figure: Neural similarity architecture]

[Figure: Accuracy results]


Conclusion


We introduced the MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text. Our dataset is unique in that it contains several sources of information – video clips, subtitles, scripts, plots and DVS. We provided several intelligent baselines and extended existing QA techniques to analyze the difficulty of our task. Our benchmark with an evaluation server is online at http://movieqa.cs.toronto.edu.
 

GitHub Code


