Detecting Scene Changes in Audiovisual Content | by Netflix Technology Blog | Jun, 2023


Avneesh Saluja, Andy Yao, Hossein Taghavi

When watching a movie or an episode of a TV show, we experience a cohesive narrative that unfolds before us, often without giving much thought to the underlying structure that makes it all possible. However, movies and episodes are not atomic units, but are rather composed of smaller elements such as frames, shots, scenes, sequences, and acts. Understanding these elements and how they relate to one another is crucial for tasks such as video summarization and highlights detection, content-based video retrieval, dubbing quality assessment, and video editing. At Netflix, such workflows are performed hundreds of times a day by many teams around the world, so investing in algorithmically-assisted tooling around content understanding can reap outsized rewards.

While segmentation of more granular units like frames and shot boundaries is either trivial or can rely primarily on pixel-based information, higher order segmentation¹ requires a more nuanced understanding of the content, such as the narrative or emotional arcs. Furthermore, some cues can be better inferred from modalities other than the video, e.g. the screenplay or the audio and dialogue track. Scene boundary detection, in particular, is the task of identifying the transitions between scenes, where a scene is defined as a continuous sequence of shots that take place in the same time and location (often with a relatively static set of characters) and share a common action or theme.

In this blog post, we present two complementary approaches to scene boundary detection in audiovisual content. The first method, which can be seen as a form of weak supervision, leverages auxiliary data in the form of a screenplay by aligning screenplay text with timed text (closed captions, audio descriptions) and assigning timestamps to the screenplay's scene headers (a.k.a. sluglines). In the second approach, we show that a relatively simple, supervised sequential model (bidirectional LSTM or GRU) that uses rich, pretrained shot-level embeddings can outperform the current state-of-the-art baselines on our internal benchmarks.

Figure 1: a scene consists of a sequence of shots.

Screenplays are the blueprints of a movie or show. They are formatted in a specific way, with each scene beginning with a scene header, indicating attributes such as the location and time of day. This consistent formatting makes it possible to parse screenplays into a structured format. At the same time, a) changes made on the fly (directorial or actor discretion) or b) in post production and editing are rarely reflected in the screenplay, i.e. it is not rewritten to reflect the changes.
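Because sluglines follow a stable INT./EXT. convention, a first pass at this parsing can be sketched in a few lines. This is a hypothetical, minimal illustration (all names are ours; real screenplay parsing must also handle transitions, parentheticals, dual dialogue, and formatting noise):

```python
import re

# Sluglines look like "INT. LOCATION - TIME OF DAY"; longest prefix first
# so "INT./EXT." is not mis-read as "INT.".
SLUGLINE_RE = re.compile(r"^(INT\./EXT\.|INT\.|EXT\.)\s+(.*?)(?:\s+-\s+(.*))?$")

def parse_screenplay(lines):
    """Group screenplay lines into scenes keyed by their scene headers."""
    scenes = []
    for line in lines:
        m = SLUGLINE_RE.match(line.strip())
        if m:
            scenes.append({
                "int_ext": m.group(1),
                "location": m.group(2),
                "time_of_day": m.group(3),  # None if the header omits it
                "body": [],                 # dialogue and action lines
            })
        elif scenes:
            scenes[-1]["body"].append(line.strip())
    return scenes

script = [
    "INT. KAER MORHEN - NIGHT",
    "Geralt sharpens his sword.",
    "EXT. BLAVIKEN MARKET - DAY",
    "A crowd gathers.",
]
scenes = parse_screenplay(script)
```

Each parsed scene then carries exactly the attributes (location, time of day) that a scene header encodes, plus the dialogue and action lines that follow it.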

Figure 2: screenplay elements, from The Witcher S1E1.

In order to leverage this noisily aligned data source, we need to align time-stamped text (e.g. closed captions and audio descriptions) with screenplay text (dialogue and action² lines), keeping in mind a) the on-the-fly changes that can result in semantically similar but not identical line pairs and b) the potential post-shoot changes that are more significant (reordering, removing, or inserting entire scenes). To address the first challenge, we use pretrained sentence-level embeddings, e.g. from an embedding model optimized for paraphrase identification, to represent text in both sources. For the second challenge, we use dynamic time warping (DTW), a method for measuring the similarity between two sequences that may vary in time or speed. While DTW assumes a monotonicity condition on the alignments³ which is frequently violated in practice, it is robust enough to recover from local misalignments, and the vast majority of salient events (like scene boundaries) end up well-aligned.
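The DTW step can be sketched as a textbook dynamic program. This is a minimal illustration assuming both sources have already been embedded as L2-normalized vectors (the production system uses pretrained paraphrase embeddings and runs at much larger scale):

```python
import numpy as np

def dtw_align(screenplay_emb, captions_emb):
    """Align two sequences of L2-normalized sentence embeddings with
    classic DTW; the local cost is cosine distance."""
    n, m = len(screenplay_emb), len(captions_emb)
    # Pairwise cosine distance: 1 - cosine similarity.
    cost = 1.0 - screenplay_emb @ captions_emb.T
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack the monotone warping path from the corner.
    path, (i, j) = [], (n, m)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        # Diagonal listed first so it wins ties.
        moves = {(i - 1, j - 1): D[i - 1, j - 1],
                 (i - 1, j): D[i - 1, j],
                 (i, j - 1): D[i, j - 1]}
        i, j = min(moves, key=moves.get)
    path.append((0, 0))
    return path[::-1]

E = np.eye(3)            # toy "embeddings": three orthogonal unit vectors
path = dtw_align(E, E)   # identical sequences align along the diagonal
```

Once the path is computed, the caption timestamp aligned to each slugline becomes that scene header's candidate boundary time.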

As a result of DTW, the scene headers have timestamps that can indicate potential scene boundaries in the video. The alignments can also be used to, e.g., augment audiovisual ML models with screenplay information like scene-level embeddings, or to transfer labels assigned to audiovisual content in order to train screenplay prediction models.

Figure 3: alignments between screenplay and video via time-stamped text for The Witcher S1E1.

The alignment method above is a great way to get up and running with the scene change task, since it combines easy-to-use pretrained embeddings with a well-known dynamic programming technique. However, it presupposes the availability of high-quality screenplays. A complementary approach (which in fact can use the above alignments as a feature) that we present next is to train a sequence model on annotated scene change data. Certain workflows at Netflix capture this information, and that is our primary data source; publicly-released datasets are also available.

From an architectural perspective, the model is relatively simple: a bidirectional GRU (biGRU) that ingests shot representations at each step and predicts whether a shot marks the end of a scene.⁴ The richness of the model comes from these pretrained, multimodal shot embeddings, a preferable design choice in our setting given the difficulty of obtaining labeled scene change data and the relatively larger scale at which we can pretrain various embedding models for shots.
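A minimal PyTorch sketch of such a model follows; the dimensions are illustrative (the actual embedding sizes and hyperparameters of the in-house model are not public):

```python
import torch
import torch.nn as nn

class SceneBoundaryGRU(nn.Module):
    """Bidirectional GRU over a sequence of shot embeddings; emits one
    scene-end logit per shot. Dimensions here are illustrative."""
    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)  # forward + backward states

    def forward(self, shots):                 # (batch, num_shots, embed_dim)
        states, _ = self.encoder(shots)       # (batch, num_shots, 2*hidden_dim)
        return self.head(states).squeeze(-1)  # per-shot scene-end logits

model = SceneBoundaryGRU()
logits = model(torch.randn(2, 40, 512))  # e.g. 2 episodes, 40 shots each
```

Trained with a per-shot binary cross-entropy loss (e.g. `nn.BCEWithLogitsLoss`), the positive label marks shots that end a scene.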

For video embeddings, we leverage an in-house model pretrained on aligned video clips paired with text (the aforementioned "timestamped text"). For audio embeddings, we first perform source separation to try to separate foreground (speech) from background (music, sound effects, noise), embed each separated waveform individually using wav2vec2, and then concatenate the results. Both early and late-stage fusion approaches are explored; in the former (Figure 4a), the audio and video embeddings are concatenated and fed into a single biGRU, and in the latter (Figure 4b) each input modality is encoded with its own biGRU, after which the hidden states are concatenated prior to the output layer.

Figure 4a: Early Fusion (concatenate embeddings at the input).
Figure 4b: Late Fusion (concatenate prior to prediction output).
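A sketch of the late-fusion variant, again with illustrative dimensions; early fusion would instead concatenate the per-shot embeddings and feed a single biGRU of input size `video_dim + audio_dim`:

```python
import torch
import torch.nn as nn

class LateFusionGRU(nn.Module):
    """Late fusion: one biGRU per modality, hidden states concatenated
    just before the output layer. Dimensions are illustrative."""
    def __init__(self, video_dim=512, audio_dim=768, hidden_dim=256):
        super().__init__()
        self.video_gru = nn.GRU(video_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.audio_gru = nn.GRU(audio_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.head = nn.Linear(4 * hidden_dim, 1)  # two biGRUs' states

    def forward(self, video, audio):  # each: (batch, num_shots, dim)
        v, _ = self.video_gru(video)
        a, _ = self.audio_gru(audio)
        # Concatenate modality-specific states prior to the output layer.
        return self.head(torch.cat([v, a], dim=-1)).squeeze(-1)

model = LateFusionGRU()
logits = model(torch.randn(2, 40, 512), torch.randn(2, 40, 768))
```

Keeping a separate recurrent encoder per modality is what lets each biGRU model the modality-specific temporal dependencies discussed below.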

We find:

  • Our results match and sometimes even outperform the state-of-the-art (benchmarked using the video modality only and on our evaluation data). We evaluate the outputs using the F-1 score for the positive label, and also relax this evaluation to consider an "off-by-n" F-1, i.e., whether the model predicts scene changes within n shots of the ground truth. This is a more realistic measure for our use cases given the human-in-the-loop setting in which these models are deployed.
  • As with previous work, adding audio features improves results by 10–15%. A primary driver of variation in performance is late vs. early fusion.
  • Late fusion is consistently 3–7% better than early fusion. Intuitively, this result makes sense: the temporal dependencies between shots are likely modality-specific and should be encoded separately.
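The relaxed "off-by-n" metric can be implemented in a few lines. The greedy one-to-one matching below is one reasonable interpretation; the exact matching scheme used internally is an assumption on our part:

```python
def off_by_n_f1(pred, gold, n=0):
    """Relaxed F-1 for boundary detection: a predicted shot index counts
    as a true positive if it lies within n shots of a still-unmatched
    ground-truth boundary (greedy one-to-one matching)."""
    unmatched = sorted(gold)
    tp = 0
    for p in sorted(pred):
        hit = next((g for g in unmatched if abs(p - g) <= n), None)
        if hit is not None:
            unmatched.remove(hit)  # each gold boundary matches at most once
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

strict = off_by_n_f1([10, 25, 40], [11, 25, 60], n=0)   # only shot 25 matches
relaxed = off_by_n_f1([10, 25, 40], [11, 25, 60], n=1)  # shot 10 now matches 11
```

With n = 0 this reduces to the standard positive-label F-1; increasing n credits predictions that a human reviewer would accept with a small nudge.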

We have presented two complementary approaches to scene boundary detection that leverage a variety of available modalities: screenplay, audio, and video. Logically, the next steps are to a) combine these approaches and use screenplay features in a unified model, and b) generalize the outputs across multiple shot-level inference tasks, e.g. shot type classification and memorable moments identification, as we hypothesize that this path would be useful for training general purpose video understanding models for longer-form content. Longer-form content also contains more complex narrative structure, and we envision this work as the first in a series of projects that aim to better integrate narrative understanding into our multimodal machine learning models.

Special thanks to Amir Ziai, Anna Pulido, and Angie Pollema.

