Google DeepMind's AI Innovates Video Soundtracks
June 19th, 2024

Google DeepMind's New AI Tool: Crafting Soundtracks from Video Pixels and Text Prompts

Google DeepMind has unveiled an AI tool designed to generate soundtracks for videos. The tool not only uses text prompts to create audio but also analyzes the video’s visual content to produce sounds that match the scene.

By combining these two elements, DeepMind’s tool allows users to create scenes enriched with dramatic scores, realistic sound effects, or dialogue that aligns seamlessly with the characters and tone of the video. On DeepMind’s website, you can find examples showcasing the tool's capabilities, and they’re quite impressive.

For instance, a video of a car navigating through a cyberpunk-style city uses the prompt “cars skidding, car engine throttling, angelic electronic music” to generate audio that synchronizes perfectly with the car’s movements. Another example uses the prompt “jellyfish pulsating under water, marine life, ocean” to create an immersive underwater soundscape.

Although text prompts enhance the experience, they are optional. Users don’t need to painstakingly match the generated audio with the video scenes manually. DeepMind’s tool can autonomously generate an “unlimited” number of soundtracks, providing users with endless audio possibilities.

This feature sets DeepMind’s tool apart from other AI audio generators like ElevenLabs, which rely solely on text prompts. It also makes it easier to pair audio with AI-generated videos from tools like DeepMind’s Veo and OpenAI’s Sora, the latter of which is expected to incorporate audio in the future.

DeepMind trained this tool using video, audio, and annotations containing detailed sound descriptions and dialogue transcripts. This training enables the AI to associate audio events with visual scenes and synchronize them. However, there are still some challenges. For example, DeepMind is still working to improve lip synchronization for dialogue, a limitation visible in a demo video featuring a claymation family. Additionally, the quality of the generated audio depends on the quality of the input video, with grainy or distorted videos leading to noticeable drops in audio quality.

Currently, DeepMind’s tool is not widely available as it undergoes rigorous safety assessments and testing. Once released, all audio outputs will include Google’s SynthID watermark to indicate that they are AI-generated.
