March 21, 2024

Making Little Simz's "Gorilla" Interactive Music Video

Last summer I had the chance to intern at Pixel Lab, working on a project with Microsoft Azure and Little Simz! Check out the website here!

I had the chance to solve a really fun problem and wanted to write about it!

The project

Microsoft Azure and Little Simz wanted to create an interactive music video using Microsoft AI, where the user can guide the video effects by clicking on song lyrics that resonate with them.

As the video plays, the user will be able to click on whatever lyrics they like. The AI will analyze the mood of these selected lyrics, and use that to guide the video effects.

I was tasked with designing, from the ground up, a system that takes the user's lyric selections and uses data from AI analysis to select the best-matching effect.

Little Simz’s team gave us an assortment of backgrounds, along with a green screen foreground video of Simz performing. We decided that effects should be applied to the background, foreground, or both.

Here’s how we did it

Scoring based on mood/theme

Identify some overarching themes

I used ChatGPT to identify around ten to fifteen themes and moods from the song. I input Simz’s lyrics and verbal descriptions of her body language and the backing track, and ChatGPT returned the most prominent themes/moods explored by the song.

Match Themes to Lyrics

I used ChatGPT to split the song into chunks of three lengths: roughly 1/2-1 line, 1-3 lines, and 3-5 lines. For each of these chunks, ChatGPT came up with a score for each of the themes. Think of this as giving each chunk a "mood vector" that represents how strongly the chunk relates to each of the 10 themes.

The problem is that when a user clicks on a lyric chunk, they probably don’t mean that they like those specific ~3 words — what they probably mean is that they relate to this section of the song, taking into account the lyrics around them. So, you can’t just consider the “mood vector” of the specific few words they click on. Somehow, you need to consider the general area of the song as well!

So, for every word index, you can combine the overlapping small, medium, and large chunks to get a good sense of the dominant themes of that particular word! This gives broader context to the selected lyric, while still weighting the lyric's nearest neighbors.
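The combination step above could be sketched roughly like this (the data layout and simple averaging are my illustrative assumptions, not the production code):

```python
def word_mood_vector(word_index, chunks):
    """Blend the mood vectors of every chunk (small, medium, or large)
    that covers a given word index.

    `chunks` is a list of (start_word, end_word, mood_vector) tuples.
    Averaging all covering chunks gives each word both local and
    broader-section context.
    """
    covering = [vec for start, end, vec in chunks if start <= word_index <= end]
    n_themes = len(covering[0])
    # Element-wise average across all chunks that contain this word.
    return [sum(vec[i] for vec in covering) / len(covering) for i in range(n_themes)]
```

A word covered only by one large chunk just inherits that chunk's vector; a word inside overlapping small, medium, and large chunks gets a blend of all three.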

Video effect selection pool

Using ShaderToy, I came up with a large bank of potential video effects. However, multiple effects can be applied simultaneously, and some work well together while others don't. Considering all of the possible video effect combinations, I removed the pairs that clashed.
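Pruning the combination pool boils down to enumerating all pairs and dropping the blocklisted ones. A minimal sketch (the effect names and incompatibility list here are made up for illustration):

```python
from itertools import combinations

# Hypothetical effect names, not the real ShaderToy bank.
effects = ["glitch", "bloom", "pixelate", "chromatic_shift"]

# Pairs known to clash, stored order-independently as frozensets.
incompatible = {frozenset({"glitch", "pixelate"})}

# Keep every pair that isn't on the blocklist.
effect_pairs = [
    pair for pair in combinations(effects, 2)
    if frozenset(pair) not in incompatible
]
```

With four effects there are six possible pairs; removing the one clashing pair leaves five usable combinations.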

Match Themes to Video Options

Once all of these effect combos, plates, and color tints were identified, we generated verbal descriptions for them. Then, following similar methods to lyric-theme matching, we scored these video options against the previously identified themes. As a result, each video effect pair also has a "mood vector," just like the lyric chunks!

Combining preferences with the default mood

When a user selects a lyric, they’re suggesting that they like a particular theme, not that they dislike any other theme. When a user suggests that they like a particular theme, the video should be pushed in that direction, not completely changed. When the user doesn’t click on anything at all, the video should reflect the themes present in the video with no bias towards any one theme.

As the user selects lyric chunks, we aggregate all of these lyric scores to generate a "user preference card." Then we combine it with the baseline lyric card for the current word index.

To combine the user preference card with the baseline, we only average the two arrays at a given theme when the resulting value is greater than the original baseline value for that theme. This makes it so that the user's lack of interest in a theme won't lower the presence of that theme in the final card.
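The "only move up, never down" rule could look something like this (function and variable names are my own sketch):

```python
def combine_with_baseline(baseline, user_pref):
    """Blend a user preference card into the baseline card, per theme.

    Averaging is only kept when it raises the score above the baseline;
    otherwise the baseline wins. A user's lack of interest in a theme
    therefore never suppresses that theme.
    """
    combined = []
    for base, pref in zip(baseline, user_pref):
        avg = (base + pref) / 2
        combined.append(avg if avg > base else base)
    return combined
```

For example, a strong user signal on theme 0 pulls that score up, while a zero user score on theme 1 leaves the baseline untouched.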

Matching to an effect

I like to imagine that there's this 10-dimensional "mood space," where each dimension represents a particular mood. We've plotted every possible video effect combination in this 10-d mood space, and we also have a point representing the user's score card! All you have to do is find the video effect combination that's closest to the user's score card! We just take the Euclidean distance between the user score card and each effect pair, and return whichever effect combination is closest!
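That nearest-neighbor lookup is a one-liner over the effect bank. A small sketch, with made-up effect names and a 2-d mood space for brevity:

```python
import math

def closest_effect(user_card, effect_cards):
    """Return the name of the effect whose mood vector is nearest to the
    user's score card by Euclidean distance.

    `effect_cards` maps effect name -> mood vector of the same length
    as `user_card`.
    """
    def distance(vec):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(user_card, vec)))

    # min() over the dict's keys, ranked by distance to the user card.
    return min(effect_cards, key=lambda name: distance(effect_cards[name]))
```

Since we only need the argmin, comparing squared distances would work just as well, but the full Euclidean distance keeps the "closest point in mood space" intuition literal.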

This effect combination (background plate, color tint, and video effect) is then applied to the video, and voila! We’re done!
