stability ai Archives - AI News

Stability AI unveils ‘Stable Audio’ model for controllable audio generation

Ryan Daws — Thu, 14 Sep 2023 15:57:28 +0000

Stability AI has introduced “Stable Audio,” a latent diffusion model designed to revolutionise audio generation.

The model promises to be another leap forward for generative AI. It combines text metadata with audio duration and start-time conditioning to give users far greater control over the content and length of generated audio, even enabling the creation of complete songs.

Audio diffusion models have traditionally been limited to generating output of a fixed duration, which often produced abrupt and incomplete musical phrases. This was primarily because the models were trained on random audio chunks cropped from longer files and then forced into predetermined lengths.

Stable Audio effectively tackles this historic challenge, enabling the generation of audio with specified lengths, up to the training window size.

One of the standout features of Stable Audio is its use of a heavily downsampled latent representation of audio, which makes inference far faster than working on raw audio. Using recent diffusion sampling techniques, the flagship Stable Audio model can generate 95 seconds of stereo audio at a 44.1 kHz sample rate in under a second on an NVIDIA A100 GPU.
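To give a sense of scale, the short sketch below works out how many raw values 95 seconds of 44.1 kHz stereo audio contains, and how short the sequence becomes after downsampling. The downsampling factor of 1024 is an assumption used purely for illustration, since the article does not state the figure, and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 44_100   # from the article
DURATION_S = 95           # from the article
CHANNELS = 2              # stereo

# Assumption: the article does not give the downsampling factor.
# 1024 is an illustrative value only.
ASSUMED_DOWNSAMPLE_FACTOR = 1024


def raw_sample_count(duration_s: int, sample_rate_hz: int, channels: int) -> int:
    """Total number of raw audio values across all channels."""
    return duration_s * sample_rate_hz * channels


def latent_length(duration_s: int, sample_rate_hz: int, downsample_factor: int) -> int:
    """Number of latent time steps after downsampling, rounded up."""
    frames = duration_s * sample_rate_hz
    return math.ceil(frames / downsample_factor)


if __name__ == "__main__":
    print(raw_sample_count(DURATION_S, SAMPLE_RATE_HZ, CHANNELS))
    print(latent_length(DURATION_S, SAMPLE_RATE_HZ, ASSUMED_DOWNSAMPLE_FACTOR))
```

Under that assumed factor, the diffusion model would operate on a few thousand time steps rather than millions of raw samples, which is the intuition behind the speed-up described above.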

A sound foundation

The core architecture of Stable Audio comprises a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model.

The VAE plays a pivotal role by compressing stereo audio into a noise-resistant, lossy latent encoding that significantly expedites both generation and training processes. This approach, based on the Descript Audio Codec encoder and decoder architectures, facilitates encoding and decoding of arbitrary-length audio while ensuring high-fidelity output.

To harness the influence of text prompts, Stability AI utilises a text encoder derived from a CLAP model specially trained on their dataset. This enables the model to imbue text features with information about the relationships between words and sounds. These text features, extracted from the penultimate layer of the CLAP text encoder, are integrated into the diffusion U-Net through cross-attention layers.

During training, the model learns to incorporate two key properties from audio chunks: the starting second (“seconds_start”) and the total duration of the original audio file (“seconds_total”). These properties are transformed into discrete learned embeddings per second, which are then concatenated with the text prompt tokens. This unique conditioning allows users to specify the desired length of the generated audio during inference.
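The timing conditioning described above can be sketched roughly as follows. This is a minimal illustration and not Stability AI's implementation: the embedding tables are filled with random numbers instead of learned weights, the embedding width is arbitrary, and concatenation along the token sequence is an assumption. Only the names seconds_start and seconds_total come from the article; every other name is hypothetical.

```python
import random

EMBED_DIM = 8      # assumption: arbitrary small width for illustration
MAX_SECONDS = 95   # assumption: matches the 95-second window mentioned earlier


def make_table(num_entries: int, dim: int, seed: int) -> list[list[float]]:
    """Stand-in for a learned embedding table: one vector per whole second."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_entries)]


START_TABLE = make_table(MAX_SECONDS + 1, EMBED_DIM, seed=0)
TOTAL_TABLE = make_table(MAX_SECONDS + 1, EMBED_DIM, seed=1)


def clamp_second(value: float) -> int:
    """Map a time in seconds to a valid table index."""
    return min(max(int(value), 0), MAX_SECONDS)


def build_conditioning(text_tokens, seconds_start, seconds_total):
    """Append the two timing embeddings to the sequence of text features."""
    start_vec = START_TABLE[clamp_second(seconds_start)]
    total_vec = TOTAL_TABLE[clamp_second(seconds_total)]
    return list(text_tokens) + [start_vec, total_vec]


if __name__ == "__main__":
    # Three dummy text feature vectors standing in for text encoder output.
    text = [[0.0] * EMBED_DIM for _ in range(3)]
    cond = build_conditioning(text, seconds_start=0, seconds_total=30)
    print(len(cond))
```

At inference time, a user asking for a 30-second clip would pass seconds_start=0 and seconds_total=30, and the diffusion model would attend to the combined sequence.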

The diffusion model at the heart of Stable Audio has 907 million parameters and uses a combination of residual layers, self-attention layers, and cross-attention layers to denoise the input while taking the text and timing embeddings into account. To scale to longer sequence lengths, the model uses memory-efficient implementations of attention.

To train the flagship Stable Audio model, Stability AI curated an extensive dataset comprising over 800,000 audio files encompassing music, sound effects, and single-instrument stems. This rich dataset, furnished in partnership with AudioSparx – a prominent stock music provider – amounts to a staggering 19,500 hours of audio.
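As a quick sanity check on those figures, the snippet below estimates the average length of a file in the dataset. Because the article says "over 800,000" files, the result is best read as an upper bound on the true average. The function name is hypothetical.

```python
TOTAL_HOURS = 19_500     # from the article
MIN_FILE_COUNT = 800_000  # from the article ("over 800,000")


def average_seconds_per_file(total_hours: float, file_count: int) -> float:
    """Mean file duration in seconds."""
    return total_hours * 3600 / file_count


if __name__ == "__main__":
    print(average_seconds_per_file(TOTAL_HOURS, MIN_FILE_COUNT))
```

The average comes out at just under a minute and a half per file, which sits below the 95-second output length quoted for the flagship model.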

Stable Audio represents the vanguard of audio generation research, emerging from Stability AI’s generative audio research lab, Harmonai. The team remains dedicated to advancing model architectures, refining datasets, and enhancing training procedures. Their pursuit encompasses elevating output quality, fine-tuning controllability, optimising inference speed, and expanding the range of achievable output lengths.

Stability AI has hinted at forthcoming releases from Harmonai, teasing the possibility of open-source models based on Stable Audio and accessible training code.

This latest announcement follows a string of noteworthy stories about Stability. Earlier this week, Stability joined seven other prominent AI companies in signing the White House’s voluntary AI safety pledge as part of the pledge’s second round.

You can try Stable Audio for yourself here.

(Photo by Eric Nopanen on Unsplash)

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with Digital Transformation Week.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

The post Stability AI unveils ‘Stable Audio’ model for controllable audio generation appeared first on AI News.

Getty is suing Stable Diffusion’s creator for copyright infringement

Ryan Daws — Wed, 18 Jan 2023 09:05:33 +0000

Stock image service Getty Images is suing Stable Diffusion creator Stability AI over alleged copyright infringement.

Stable Diffusion is one of the most popular text-to-image tools. Unlike many of its rivals, the generative AI model can run on a local computer.

Apple is a supporter of the Stable Diffusion project and recently optimised its performance on Macs with M-series chips. Last month, AI News reported that M2 Macs can now generate images using Stable Diffusion in under 18 seconds.

Text-to-image generators like Stable Diffusion have come under the spotlight for potential copyright infringement. Human artists have complained their creations have been used to train the models without permission or compensation.

Getty Images has now accused Stability AI of using its content and has commenced legal proceedings.

In a statement, Getty Images wrote:

“This week Getty Images commenced legal proceedings in the High Court of Justice in London against Stability AI claiming Stability AI infringed intellectual property rights including copyright in content owned or represented by Getty Images. It is Getty Images’ position that Stability AI unlawfully copied and processed millions of images protected by copyright and the associated metadata owned or represented by Getty Images absent a license to benefit Stability AI’s commercial interests and to the detriment of the content creators.

Getty Images believes artificial intelligence has the potential to stimulate creative endeavors. Accordingly, Getty Images provided licenses to leading technology innovators for purposes related to training artificial intelligence systems in a manner that respects personal and intellectual property rights. Stability AI did not seek any such license from Getty Images and instead, we believe, chose to ignore viable licensing options and long-standing legal protections in pursuit of their stand-alone commercial interests.”

While the images used for training alternatives like DALL-E 2 haven’t been disclosed, Stability AI has been transparent about how its model is trained. However, that may now have put the company in hot water.

An independent analysis of 12 million of the 2.3 billion images used to train Stable Diffusion, conducted by Andy Baio and Simon Willison, found that the model was trained using images drawn from the nonprofit Common Crawl, which scrapes billions of webpages monthly.

“Unsurprisingly, a large number came from stock image sites. 123RF was the biggest with 497k, 171k images came from Adobe Stock’s CDN at ftcdn.net, 117k from PhotoShelter, 35k images from Dreamstime, 23k from iStockPhoto, 22k from Depositphotos, 22k from Unsplash, 15k from Getty Images, 10k from VectorStock, and 10k from Shutterstock, among many others,” wrote the researchers.
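To put the quoted counts in proportion, the sketch below adds them up and expresses them as a share of the 12 million images analysed. The counts are the rounded figures from the quote, and the list ends with "among many others", so the total is a lower bound for stock sites overall. The variable and function names are hypothetical.

```python
SAMPLE_SIZE = 12_000_000  # images analysed, from the article

# Rounded counts as quoted by the researchers.
STOCK_SITE_COUNTS = {
    "123RF": 497_000,
    "Adobe Stock": 171_000,
    "PhotoShelter": 117_000,
    "Dreamstime": 35_000,
    "iStockPhoto": 23_000,
    "Depositphotos": 22_000,
    "Unsplash": 22_000,
    "Getty Images": 15_000,
    "VectorStock": 10_000,
    "Shutterstock": 10_000,
}


def share_of_sample(count: int, sample_size: int = SAMPLE_SIZE) -> float:
    """Fraction of the analysed sample that a given count represents."""
    return count / sample_size


if __name__ == "__main__":
    stock_total = sum(STOCK_SITE_COUNTS.values())
    print(f"Listed stock sites: {stock_total:,} images "
          f"({share_of_sample(stock_total):.2%} of the sample)")
    print(f"Getty Images alone: "
          f"{share_of_sample(STOCK_SITE_COUNTS['Getty Images']):.3%} of the sample")
```

These shares describe the 12 million image sample only. The analysis covered a small slice of the 2.3 billion training images, so the same proportions may not hold for the full set.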

Platforms with high amounts of user-generated content such as Pinterest, WordPress, Blogspot, Flickr, DeviantArt, and Tumblr were also found to be large sources of images that were scraped for training purposes.

The concerns around the use of copyrighted content for training AI models appear to be warranted. It’s likely we’ll see a growing number of related lawsuits over the coming months and years unless a balance is found between enabling AI training and respecting the work of human creators.

In October, Shutterstock announced that it was expanding its partnership with DALL-E creator OpenAI. As part of the expanded partnership, Shutterstock will offer DALL-E images to customers.

The partnership between Shutterstock and OpenAI will see the former create frameworks that will compensate artists when their intellectual property is used and when their works have contributed to the development of AI models.

(Photo by Tingey Injury Law Firm on Unsplash)

Relevant: Adobe to begin selling AI-generated stock images


Stable Diffusion text-to-image generator is now publicly available

Ryan Daws — Wed, 24 Aug 2022 10:54:05 +0000

Text-to-image generator Stable Diffusion is now available for anyone to put to the test.

Stable Diffusion is developed by Stability AI and was initially released to researchers earlier this month. Stability AI claims the image generator delivers a breakthrough in speed and quality and can run on consumer GPUs.

The model is based on the latent diffusion model created by CompVis and Runway, enhanced with insights from conditional diffusion models by Stability AI’s lead generative AI developer Katherine Crowson, OpenAI, Google Brain, and others.

“This model builds on the work of many excellent researchers and we look forward to the positive effect of this and similar models on society and science in the coming years as they are used by billions worldwide,” said Emad Mostaque, CEO of Stability AI.

The core model was trained on LAION-Aesthetics, a dataset that filters the 5.85 billion images in the LAION-5B dataset by how “beautiful” each image is, building on ratings from the alpha testers of Stable Diffusion.

Stable Diffusion runs on computers with under 10GB of VRAM and generates 512×512 pixel resolution images in just a few seconds.

“We’re excited that state-of-the-art text-to-image models are being built openly and we are happy to collaborate with CompVis and Stability.ai towards safely and ethically releasing the models to the public and help democratise ML capabilities with the whole community,” commented Apolinário, ML Art Engineer at AI community Hugging Face.

Stable Diffusion goes head-to-head against other text-to-image models including Midjourney, DALL-E 2, and Imagen.

DALL-E 2 vs Midjourney vs StableDiffusion mega thread: photography, illustration, painters, abstract

these image synths are like instruments – it's amazing we'll get so many of them, each with a unique "sound" 🤯

rules: same prompt, 1:1 aspect ratio, no living artists pic.twitter.com/47syy7uPJJ
— fabians.eth (@fabianstelzer) August 20, 2022

An interactive space to test Stable Diffusion has been created here.

(Image Credit: Fabian Stelzer)
