Google takes on Meta, launches Imagen Video, its own text-to-video AI
Days after Meta launched Make-A-Video, an AI system that allows users to turn text prompts into high-quality video clips, tech giant Google said on Wednesday that it has introduced its own text-to-video AI system, Imagen Video. It builds upon Google’s previous text-to-image system, Imagen, which was launched in May. Instead of producing a single still picture, however, Imagen Video assembles a video out of multiple output frames.
Given a text prompt, the video-generating system can produce high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models, according to a Google paper. Based on a cascade of video diffusion models, it is capable of producing videos up to a resolution of 1280×768 at 24 frames per second.
High-fidelity Videos
In its recently published paper “Imagen Video: High definition video generation with diffusion models,” Google claims that Imagen Video can generate videos with high fidelity and has a high degree of controllability and world knowledge. The generative model’s capabilities include creating diverse videos and text animations in different artistic styles, as well as 3D understanding and text rendering.
The model, which is currently in a research phase, arrives five months after Imagen, underscoring the rapid development of synthesis-based models. With its launch, the text-to-video trend seems set to explode, much like text-to-image did over the past year with DALL-E, Midjourney and Stable Diffusion.
The Architecture of Imagen Video
Imagen Video consists of a text encoder (frozen T5-XXL), a base video diffusion model, and interleaved spatial and temporal super-resolution diffusion models. To create this architecture, Google says it transferred findings from previous work on diffusion-based image generation to the video generation setting. The research team also applied progressive distillation with classifier-free guidance to the video models for fast, high-quality sampling.
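For readers unfamiliar with classifier-free guidance, the idea can be illustrated in a few lines. The sketch below uses the commonly cited formulation, in which a text-conditioned noise prediction is pushed away from an unconditioned one by a guidance weight. It is not taken from Google's code: the function name, the parameter names, and the use of plain Python lists in place of model tensors are all assumptions made for illustration.

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_weight):
    """Blend conditional and unconditional noise predictions.

    Commonly cited form: guided = (1 + w) * eps_cond - w * eps_uncond.
    All names here are hypothetical; lists stand in for model tensors.
    """
    w = guidance_weight
    return [(1 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


# Toy values only; a real model would produce these predictions.
conditional = [1.0, 2.0]
unconditional = [0.5, 1.0]

print(classifier_free_guidance(conditional, unconditional, 0.0))  # [1.0, 2.0]
print(classifier_free_guidance(conditional, unconditional, 2.0))  # [2.0, 4.0]
```

A weight of zero leaves the text-conditioned prediction unchanged, while larger weights push the sample harder toward the prompt.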
According to Google, Imagen Video is a step toward a system with a high degree of controllability and awareness of the world. It consists of seven sub-models which perform text-conditional video generation, spatial super-resolution, and temporal super-resolution.
With the entire cascade, Imagen Video generates high-definition 1280×768 videos at 24 frames per second, for 128 frames, or approximately 126 million pixels per clip, according to the company.
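The 126 million figure follows directly from the reported resolution and frame count, and the same numbers imply a clip a little over five seconds long. A quick check, using only the values quoted above (the variable names are illustrative):

```python
# Figures reported for Imagen Video's final output.
WIDTH, HEIGHT = 1280, 768
FRAMES = 128
FPS = 24

pixels_per_frame = WIDTH * HEIGHT
total_pixels = pixels_per_frame * FRAMES
duration_seconds = FRAMES / FPS

print(f"{pixels_per_frame:,} pixels per frame")    # 983,040
print(f"{total_pixels:,} pixels per clip")         # 125,829,120
print(f"{duration_seconds:.2f} seconds per clip")  # 5.33
```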
Medical Imaging Suite
Meanwhile, Google’s new Medical Imaging Suite can help make radiology and other imaging data more accessible and interoperable. “Google pioneered the use of AI and computer vision in Google Photos, Google Image Search, and Google Lens, and now we’re making our imaging expertise, tools, and technologies available for healthcare and life sciences enterprises,” said Alissa Hsu Lynch, Global Lead of Google Cloud’s MedTech Strategy and Solutions, in a statement. “Our Medical Imaging Suite shows what’s possible when tech and healthcare companies come together.”
According to Google, imaging data accounts for as much as 90% of all healthcare data. As its volume increases, so does the workload for radiologists and other healthcare professionals tasked with manually interpreting these images for clinicians and patients. AI can help support faster, more accurate diagnosis from images, boost productivity for providers and improve outcomes for patients, said Google, adding that two customers are already using Medical Imaging Suite.