The objective of this paper is to understand the long-range narrative structure of movies. Rather than processing an entire film, the paper proposes to learn from its key scenes, which provide a condensed view of the full storyline at a fraction of the cost.

Here are the four contributions the paper makes toward this objective.

(i) The research team creates the Condensed Movie Dataset (CMD), consisting of key scenes from over 3K movies: each key scene is accompanied by a high-level semantic description of the scene, character face tracks, and metadata about the movie. The dataset is scalable, obtained automatically from YouTube, and freely available to download and use. It is also an order of magnitude larger than existing movie datasets in the number of movies.
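For illustration, here is a hypothetical sketch of what a single CMD entry might look like. The field names are assumptions chosen for clarity, not the dataset's actual schema.

```python
# Hypothetical layout of one CMD entry; field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class CondensedMovieClip:
    movie_id: str                 # identifier for the source movie
    youtube_id: str               # YouTube video the key scene was obtained from
    description: str              # high-level semantic description of the scene
    face_tracks: list = field(default_factory=list)  # per-character face tracks
    metadata: dict = field(default_factory=dict)     # e.g. genre, cast, year
```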

(ii) The research team introduces a new story-based text-to-video retrieval task on this dataset that requires a high-level understanding of the plotline.
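Retrieval benchmarks of this kind are typically scored with recall@k over text-video similarity. Below is a minimal sketch of that standard metric (not code from the paper), assuming query i's ground-truth match is video i:

```python
import torch

def recall_at_k(text_emb: torch.Tensor, video_emb: torch.Tensor, k: int = 5) -> float:
    """Fraction of text queries whose matching video (same row index)
    is ranked within the top-k by cosine similarity."""
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    video_emb = torch.nn.functional.normalize(video_emb, dim=-1)
    sims = text_emb @ video_emb.T             # (num_queries, num_videos)
    topk = sims.topk(k, dim=-1).indices       # top-k video indices per query
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```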

(iii) The research team provides a deep network baseline for this task on the dataset, combining character, speech, and visual cues into a single video embedding.
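One way such a fusion can work is in the spirit of a mixture of embedding experts: each modality ("expert") is projected into a joint space, and its similarity to the text is weighted by a text-conditioned gate. The sketch below illustrates that idea; the dimensions, expert names, and layer choices are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    """Combine per-modality video features (e.g. character, speech, visual)
    into one text-video similarity score via text-conditioned gating.
    A sketch of mixture-of-embedding-experts-style fusion, not the paper's
    exact model."""
    def __init__(self, text_dim: int, expert_dims: dict, joint_dim: int = 512):
        super().__init__()
        self.experts = nn.ModuleDict(
            {name: nn.Linear(dim, joint_dim) for name, dim in expert_dims.items()}
        )
        self.text_proj = nn.ModuleDict(
            {name: nn.Linear(text_dim, joint_dim) for name in expert_dims}
        )
        self.gates = nn.Linear(text_dim, len(expert_dims))

    def forward(self, text_feat: torch.Tensor, expert_feats: dict) -> torch.Tensor:
        weights = torch.softmax(self.gates(text_feat), dim=-1)  # (B, num_experts)
        score = torch.zeros(text_feat.size(0))
        for i, name in enumerate(self.experts):
            v = nn.functional.normalize(self.experts[name](expert_feats[name]), dim=-1)
            t = nn.functional.normalize(self.text_proj[name](text_feat), dim=-1)
            score = score + weights[:, i] * (t * v).sum(dim=-1)  # gated cosine sim
        return score  # (B,) similarity for each text-video pair
```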

(iv) Finally, the research team demonstrates how the addition of context (both past and future) improves retrieval performance.
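A simple way to inject past and future context is to mix each clip embedding with a pooled embedding of its neighbouring clips from the same movie. This is a hypothetical sketch of that idea; the paper's actual context model may differ.

```python
import torch

def add_context(clip_embs: torch.Tensor, window: int = 1, alpha: float = 0.5) -> torch.Tensor:
    """Blend each clip embedding with the mean of its past and future
    neighbours within `window` clips. `clip_embs` is (num_clips, dim),
    ordered by position in the movie. `alpha` and `window` are illustrative."""
    n = clip_embs.size(0)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = torch.cat([clip_embs[lo:i], clip_embs[i + 1:hi]], dim=0)
        ctx = neighbours.mean(dim=0) if neighbours.numel() else torch.zeros_like(clip_embs[i])
        out.append((1 - alpha) * clip_embs[i] + alpha * ctx)
    return torch.stack(out)
```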

Paper: https://arxiv.org/pdf/2005.04208.pdf