
MusicGen Reimagined: Meta’s Under-the-Radar Advances in AI Music
Exploring the overlooked but remarkable progress of MusicGen

Max Hilsdorf · Published in Towards Data Science · 7 min read
An image symbolizing how Music AI products can elevate music-making for everyone. Image generated through a conversation with ChatGPT and DALL-E-3.

How it started…
In February 2023, Google made waves with their generative music
AI MusicLM. At that point, two things became clear:

1. 2023 would be the breakthrough year for AI-based music generation
2. A new model would overshadow MusicLM in no time

Many anticipated that the next breakthrough model would be ten times
the size of MusicLM in terms of model parameters and training data. It
would also raise the same ethical issues, including restricted access to
the source code and the use of copyrighted training material.

Today, we know that only half of this was true.

Released in June 2023, Meta’s MusicGen model brought some massive improvements, including…

1. Higher quality music output (24kHz → 32kHz)
2. More natural-sounding instruments
3. The option to condition the generation on any melody (I wrote a blog post about this)

…all while using less training data, open-sourcing the code and model weights, and using only commercially licensed training material.

Six months later, the hype has slowly subsided. However, Meta’s
research team FAIR has continued publishing papers and updating the
code to incrementally improve MusicGen.

… how it’s going


Since its release, Meta has upgraded MusicGen in two key ways:

1. Higher quality generation using multi-band diffusion
2. More lively outputs thanks to stereo generation

While these may sound like two small improvements, they make a big difference. Listen for yourself! Here is a 10-second piece generated with the original MusicGen model (3.3B parameters):
Generated track taken from the official MusicGen demo page.

The prompt used was:

earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves

Now, here is an example of the output MusicGen can produce six months later based on the same prompt:
Generated track created with MusicGen 3.3B stereo by the author.

If you are listening through smartphone speakers, the difference might not be very noticeable. On other devices, you should be able to hear that the overall sound is much clearer and more natural, and that the stereo sound makes the composition more lively and exciting.

In this blog post, I want to showcase these improvements, explain why they matter and how they work, and provide some example generations.

Multi-Band Diffusion — What Does That Do?

To understand what multi-band diffusion is and why it makes a difference, let us look at how the original MusicGen model [1] produced its outputs.

30 seconds of audio at a sample rate of 32kHz are represented in a computer by almost 1 million numbers. Generating something like that sample-by-sample is comparable to generating 10 full novels with ChatGPT.
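
To put rough numbers on this, here is a quick back-of-the-envelope calculation in Python (a sketch: the ~50Hz token rate is approximate, and each compressed time step actually holds several codebook entries):

```python
# 30 seconds of raw audio at MusicGen's 32 kHz sample rate
raw_samples = 30 * 32_000
print(raw_samples)  # 960,000 numbers -- almost a million

# The same 30 seconds after EnCodec compression at roughly 50 Hz (~0.05 kHz)
compressed_steps = 30 * 50
print(compressed_steps)                 # 1,500 time steps
print(raw_samples // compressed_steps)  # ~640x fewer steps to generate
```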

Instead, Meta relies on neural audio compression techniques. Their compression model, EnCodec [2], can compress music from 32kHz to roughly 0.05kHz, all while maintaining the relevant information needed to reconstruct it at the original sample rate. EnCodec consists of an encoder, which compresses the audio, and a decoder, which recreates the original sounds (Figure 1).
Figure 1 — EnCodec: Meta’s neural audio compression model. Image by author.
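
For the curious, here is a minimal sketch of that round trip using Meta’s audiocraft library. I am assuming the publicly released 32kHz EnCodec checkpoint that MusicGen builds on ('facebook/encodec_32khz'); the exact API may differ slightly between audiocraft releases:

```python
import torch
from audiocraft.models import CompressionModel

# Load the 32 kHz EnCodec model (checkpoint name assumed, see above)
encodec = CompressionModel.get_pretrained('facebook/encodec_32khz')
encodec.eval()

# Stand-in for 5 seconds of mono audio at 32 kHz: (batch, channels, samples)
wav = torch.randn(1, 1, 5 * 32_000)

with torch.no_grad():
    codes, scale = encodec.encode(wav)             # compress to discrete tokens
    reconstruction = encodec.decode(codes, scale)  # lossy reconstruction

print(codes.shape)  # (batch, n_codebooks, ~50 frames per second of audio)
```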

Now back to MusicGen. Instead of generating music at full sample rate, it generates it at 0.05kHz and lets EnCodec “reconstruct” it, resulting in high-fidelity outputs at minimal computing time & cost (Figure 2).

Figure 2 — MusicGen: A user prompt (text) is converted to an encoded audio signal, which is then decoded to produce the final result. Image by author.
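
In code, the whole pipeline from Figure 2 takes only a few lines with Meta’s audiocraft library. This sketch follows the official usage examples, using the 3.3B 'facebook/musicgen-large' checkpoint and the prompt from above:

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the pre-trained 3.3B MusicGen model
model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(duration=10)  # generate 10-second clips

prompt = ('earthy tones, environmentally conscious, ukulele-infused, '
          'harmonic, breezy, easygoing, organic instrumentation, gentle grooves')

# MusicGen generates compressed tokens internally; EnCodec decodes them to audio
wav = model.generate([prompt])  # shape: (batch, channels, samples) at 32 kHz

# Write the result to disk with loudness normalization
audio_write('musicgen_demo', wav[0].cpu(), model.sample_rate, strategy='loudness')
```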

While EnCodec is an impressive technology, its compression is not lossless. There are noticeable artifacts in the reconstructed audio compared to the original. Listen for yourselves!

Original Audio
EnCodec music example taken from the official EnCodec demo page.

Reconstructed Audio
EnCodec music example taken from the official EnCodec demo page.

As MusicGen fully relies on EnCodec, the codec is a major bottleneck for the quality of the generated music. That is why Meta decided to work on improving EnCodec’s decoder. In August 2023, they introduced an updated decoder for EnCodec leveraging multi-band diffusion [3].

One problem Meta saw with EnCodec’s original decoder was that it tended to generate the low frequencies first and the higher frequencies afterwards. Unfortunately, this meant that any errors/artifacts in the low frequencies would also distort the high frequencies, drastically decreasing the output quality.

Multi-band diffusion addresses this problem by generating different sections of the frequency spectrum independently before combining them. The researchers found that this procedure significantly improved the generated outputs. The differences are clearly noticeable from my perspective. Listen to the same track with the original EnCodec decoder and the multi-band diffusion decoder:

Original Decoder
Generated track taken from the Multi-Band Diffusion demo page.

Multi-Band Diffusion Decoder
Generated track taken from the Multi-Band Diffusion demo page.
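
In audiocraft, swapping in the new decoder means asking MusicGen for its compressed tokens and decoding them with MultiBandDiffusion instead. This sketch follows the library’s documented usage; the exact API may vary between releases:

```python
from audiocraft.models import MusicGen, MultiBandDiffusion
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-large')
model.set_generation_params(duration=10)
mbd = MultiBandDiffusion.get_mbd_musicgen()  # diffusion-based EnCodec decoder

# Ask MusicGen to also return the compressed tokens it generated
wav_encodec, tokens = model.generate(
    ['warm acoustic guitar ballad'], return_tokens=True
)

# Re-decode the very same tokens with multi-band diffusion
wav_mbd = mbd.tokens_to_wav(tokens)

audio_write('encodec_decoder', wav_encodec[0].cpu(), model.sample_rate)
audio_write('mbd_decoder', wav_mbd[0].cpu(), model.sample_rate)
```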

One of the core issues of current text-to-music systems is that there is always an unnatural quality to the sounds they produce, especially for acoustic instruments. Multi-band diffusion makes the output sound much cleaner and more natural, taking MusicGen to a new level.

Why is Stereo Sound so Significant?

Up until now, most generative music models have been producing mono sound. This means MusicGen does not place any sounds or instruments on the left or right side, resulting in a less lively and exciting mix. The reason why stereo sound has been mostly overlooked so far is that generating stereo is not a trivial task.

As musicians, when we produce stereo signals, we have access to the individual instrument tracks in our mix and we can place them wherever we want. MusicGen does not generate all instruments separately but instead produces one combined audio signal. Without access to these instrument sources, creating stereo sound is hard. Unfortunately, splitting an audio signal into its individual sources is a tough problem (I’ve published a blog post about that) and the tech is still not 100% ready.

Therefore, Meta decided to incorporate stereo generation directly into the MusicGen model. Using a new dataset consisting of stereo music, they trained MusicGen to produce stereo outputs. The researchers claim that generating stereo has no additional computing costs compared to mono.

Although I feel that the stereo procedure is not very clearly described in the paper, my understanding is that it works like this (Figure 3): MusicGen has learned to generate two compressed audio signals (left and right channel) instead of one mono signal. These compressed signals must then be decoded separately before they are combined to build the final stereo output. The reason this process does not take twice as long is that MusicGen can now produce the two compressed signals in approximately the same time it previously took for one.

Figure 3 — MusicGen stereo update. Note that the process was not sufficiently documented in
the paper for me to be 100% sure about this. Take it as an educated guess. Image by author.
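
If you want to try this yourself, Meta released stereo checkpoints alongside the update. A minimal sketch, assuming the 'facebook/musicgen-stereo-large' model name from the Hugging Face hub:

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a stereo checkpoint (model name assumed, see above)
model = MusicGen.get_pretrained('facebook/musicgen-stereo-large')
model.set_generation_params(duration=10)

wav = model.generate(['breezy ukulele groove with organic percussion'])
print(wav.shape)  # (1, 2, samples) -- two channels instead of one

audio_write('stereo_demo', wav[0].cpu(), model.sample_rate, strategy='loudness')
```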

Being able to produce convincing stereo sound really sets MusicGen apart from other state-of-the-art models like MusicLM or Stable Audio. From my perspective, this “little” addition makes a huge difference in the liveliness of the generated music. Listen for yourselves (it might be hard to hear on smartphone speakers):

Mono
Stereo

Conclusion
MusicGen was impressive from the day it was released. However, since
then, Meta’s FAIR team has been continually improving their product,
enabling higher quality results that sound more authentic. When it
comes to text-to-music models generating audio signals (not MIDI
etc.), MusicGen is ahead of its competitors from my perspective (as of
November 2023).

Further, since MusicGen and all its related products (EnCodec, AudioGen) are open-source, they constitute an incredible source of inspiration and a go-to framework for aspiring AI audio engineers. If we look at the improvements MusicGen has made in only 6 months, I can only imagine that 2024 will be an exciting year.

Another important point is that with their transparent approach, Meta is also doing foundational work for developers who want to integrate this technology into software for musicians. Generating samples, brainstorming musical ideas, or changing the genre of your existing work — these are some of the exciting applications we are already starting to see. With a sufficient level of transparency, we can make sure we are building a future where AI makes creating music more exciting instead of being only a threat to human musicianship.

Note: While MusicGen is open-source, the pre-trained models may not be used commercially! Visit the audiocraft GitHub repository for more detailed information on the intended use of all its components.

References

[1] Copet et al. (2023). Simple and Controllable Music Generation. https://arxiv.org/pdf/2306.05284.pdf

[2] Défossez et al. (2022). High Fidelity Neural Audio Compression. https://arxiv.org/pdf/2210.13438.pdf

[3] Roman et al. (2023). From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion. https://arxiv.org/abs/2308.02560
