By Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler
Karlsruhe Institute of Technology, Massachusetts Institute of Technology, University of Toronto
Abstract
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from the simpler “Who” did “What” to “Whom”, to “Why” and “How” certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information – video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this dataset public along with an evaluation benchmark to encourage inspiring work in this challenging domain.
MovieQA dataset
The goal of our paper is to create a challenging benchmark that evaluates semantic understanding over long temporal data. We collect a dataset with very diverse sources of information that can be exploited in this challenging domain. Our data consists of quizzes about movies that automatic systems will have to answer. For each movie, a quiz comprises a set of questions, each with 5 multiple-choice answers, only one of which is correct. The system has access to various sources of textual and visual information, which we describe in detail below. We collected 408 subtitled movies and obtained their extended summaries in the form of plot synopses from Wikipedia. We crawled imsdb for scripts, which were available for 49% (199) of our movies. A fraction of our movies (60) come with DVS transcriptions.
Plot synopses
are movie summaries that fans write after watching the movie. Synopses widely vary in detail and range from one to 20 paragraphs but focus on describing content that is directly relevant to the story. They rarely contain detailed visual information (e.g. character appearance) and focus more on describing the movie events and character interactions. We exploit plots to gather our quizzes.
Videos and subtitles.
An average movie is about 2 hours in length and has over 198K frames and almost 2000 shots. Note that video alone contains information about, e.g., “Who” did “What” to “Whom”, but may lack the information to explain why something happened. Dialogues play an important role, and only both modalities together allow us to fully understand the story. Note that subtitles do not contain speaker information. In our dataset, we provide video clips rather than full movies.
DVS
is a service that narrates movie scenes to the visually impaired by inserting relevant descriptions in between dialogues. These descriptions contain sufficient “visual” information about the scene to allow a visually impaired audience to follow the movie. DVS thus acts as a proxy for a perfect vision system and is another source for answers.
Scripts.
The scripts that we collected are written by screenwriters and serve as a guideline for movie making. They typically contain detailed descriptions of scenes and, unlike subtitles, contain both dialogues and speaker information. Scripts are thus similar in content to, if not richer than, DVS+subtitles; however, they are not always entirely faithful to the movie, as the director may exercise artistic freedom.
Representations for Text and Video
TF-IDF
is a popular and successful feature in information retrieval. In our case, we treat plots (or other forms of text) from different movies as documents and compute a weight for each word. We lowercase all words, apply stemming, and compute the vocabulary V, which consists of words w that appear more than θ times in the documents. We represent each sentence (or question or answer) in a bag-of-words style with a TF-IDF score for each word.
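As a concrete illustration, the following is a minimal Python sketch of such a TF-IDF sentence representation, assuming scikit-learn and NLTK are available. The threshold θ is mapped onto scikit-learn's min_df document-frequency cutoff, which is only a proxy for the word-count threshold described above; the toy corpus and parameter values are placeholders, not our actual setup.

```python
# Minimal sketch of a TF-IDF sentence representation (assumed libraries:
# scikit-learn, NLTK). Values and corpus snippets are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_tokens(text):
    # The vectorizer lowercases the text before this tokenizer runs;
    # here we only split on whitespace and stem each token.
    return [stemmer.stem(tok) for tok in text.split()]

theta = 1  # toy cutoff; a larger threshold would be used on the full plot corpus
vectorizer = TfidfVectorizer(lowercase=True, tokenizer=stem_tokens, min_df=theta)

# Treat each plot synopsis as one document when learning the IDF weights.
plots = [
    "the hero saves the city from the villain",
    "a detective investigates a murder in the city",
]
vectorizer.fit(plots)

# Represent a question (or a candidate answer) as a bag of words
# weighted by TF-IDF scores.
question_vec = vectorizer.transform(["who saves the city"])
```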
Word2Vec.
A disadvantage of TF-IDF is that it is unable to capture the similarities between words. We use the skip-gram model and train it on roughly 1200 movie plots to obtain domain-specific, 300-dimensional word embeddings. A sentence is then represented by mean-pooling its word embeddings. We normalize the resulting vector to have unit norm.
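A minimal sketch of this mean-pooled Word2Vec representation follows, assuming gensim is used for the skip-gram training; the toy corpus and hyperparameters are placeholders rather than our actual settings.

```python
# Sketch of a mean-pooled, unit-normalized Word2Vec sentence representation
# (assumed library: gensim). Corpus and hyperparameters are placeholders.
import numpy as np
from gensim.models import Word2Vec

# Each plot is tokenized into a list of lowercase words (placeholder corpus).
plot_corpus = [
    ["the", "hero", "saves", "the", "city"],
    ["a", "detective", "investigates", "a", "murder"],
]

# Skip-gram model (sg=1) with 300-dimensional embeddings, as described above.
w2v = Word2Vec(sentences=plot_corpus, vector_size=300, sg=1, min_count=1)

def sentence_vector(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    v = np.mean(vecs, axis=0)        # mean-pool the word embeddings
    return v / np.linalg.norm(v)     # normalize to unit norm

q = sentence_vector(["who", "saves", "the", "city"], w2v)
```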
SkipThoughts.
While the sentence representation using mean-pooled Word2Vec discards word order, SkipThoughts use a Recurrent Neural Network to capture the underlying sentence semantics. We use the pre-trained model to compute a 4800-dimensional sentence representation.
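For illustration, a sketch of computing such sentence vectors, assuming the interface of the publicly released skip-thoughts code (load_model / Encoder.encode); if the installed version exposes a different interface, treat this as pseudocode.

```python
# Sketch of encoding sentences with a pre-trained SkipThoughts model.
# The load_model/Encoder interface is assumed from the public release;
# verify it against the version you install.
import skipthoughts

model = skipthoughts.load_model()        # loads the pre-trained uni/bi-skip models
encoder = skipthoughts.Encoder(model)

sentences = ["Who saves the city?", "The detective investigates a murder."]
vectors = encoder.encode(sentences)      # (num_sentences, 4800) representations
```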
Video.
To answer questions from the video, we learn an embedding between a shot and a sentence, which maps the two modalities into a common space. In this joint space, one can score the similarity between the two modalities via a simple dot product. This allows us to apply all of our proposed question-answering techniques in their original form. To learn the joint embedding, we follow prior work that extends visual-semantic embeddings to video. Specifically, we use the GoogLeNet architecture as well as a hybrid-CNN for scene recognition (Places) to extract frame-wise features, and mean-pool the representations over all frames in a shot. The embedding is a linear mapping of the shot representation on the video side and an LSTM over word embeddings on the sentence side, trained using a ranking loss on the Movie Description dataset.
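The following PyTorch sketch illustrates the shape of such a joint shot-sentence embedding trained with a ranking loss; the feature dimensions, margin, and exact loss formulation are assumptions for illustration, not our precise training setup.

```python
# Minimal sketch of a shot-sentence joint embedding with a pairwise
# ranking loss. Dimensions and the margin are assumed placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, shot_dim=2048, word_dim=300, embed_dim=300):
        super().__init__()
        self.shot_fc = nn.Linear(shot_dim, embed_dim)                # linear map for mean-pooled shot features
        self.lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)   # sentence encoder over word embeddings

    def forward(self, shot_feats, word_embs):
        # shot_feats: (B, shot_dim) mean-pooled frame features per shot
        # word_embs:  (B, T, word_dim) word embeddings of each sentence
        v = F.normalize(self.shot_fc(shot_feats), dim=-1)
        _, (h, _) = self.lstm(word_embs)
        s = F.normalize(h[-1], dim=-1)
        return v, s

def ranking_loss(v, s, margin=0.1):
    # Matching shot/sentence pairs sit on the diagonal of the similarity
    # matrix; mismatched pairs should score lower by at least the margin.
    scores = v @ s.t()                     # (B, B) dot-product similarities
    pos = scores.diag().unsqueeze(1)       # (B, 1) scores of the true pairs
    cost = (margin + scores - pos).clamp(min=0)
    cost.fill_diagonal_(0)
    return cost.mean()

model = JointEmbedding()
v, s = model(torch.randn(4, 2048), torch.randn(4, 12, 300))
loss = ranking_loss(v, s)
```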
[Figure: neural similarity architecture. Table: question-answering accuracy.]
Conclusion
We introduced the MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text. Our dataset is unique in that it contains several sources of information – video clips, subtitles, scripts, plots and DVS. We provided several intelligent baselines and extended existing QA techniques to analyze the difficulty of our task. Our benchmark with an evaluation server is online at http://movieqa.cs.toronto.edu.