Separate vocals from a track using python

17/03/2024 03:06 Separate vocals from a track using python - DEV Community
Vicente G. Reyes
Posted on 13 Feb. 2023

#python #music #programming
For a while now, I had planned to learn how to separate vocals from a track
programmatically instead of depending on software-as-a-service tools to do it.
This article shows how to separate the vocals of a song from the instruments
using my new favorite library, librosa. You can check out the Google Colab
Notebook here.
The idea sparked when I wanted to separate the individual tracks of a song, so I went
to Product Hunt and discovered melody ml. That discovery gave me the urge to learn ML
for music, which led me to the Python library librosa.
By the way, I ran out of RAM, which made my notebook explode.
https://dev.to/highcenburg/separate-vocals-from-a-track-using-python-4lb5
Icen Reyes (@icenreyes), 10:34 AM · Jan 31, 2023: "Something crashed! waaaaaaaaa"
Install and import dependencies

pip install librosa matplotlib IPython
import librosa
from librosa import display
import numpy as np
import IPython.display as ipd
import matplotlib as plt
import matplotlib.pyplot  # load the pyplot submodule so that plt.pyplot resolves below
Load and display the song.

I used My Last Serenade by KSE because I wondered how the growled and shouted parts
of the song would come out.
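# Load the track; by default librosa resamples to 22050 Hz and mixes down to mono
y, sr = librosa.load('My Last Serenade.wav')

# Play the 20-second window from second 90 to second 110
ipd.Audio(data=y[90*sr:110*sr], rate=sr)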
We slice a 20-second snippet from the chorus of the song (seconds 90 to 110) and play
it with ipd.Audio (tbh, this is a bit exhausting). A photo is shown below because I
couldn't find a way to upload audio here on DEV.
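The slice y[90*sr:110*sr] works because the loaded audio is a one-dimensional array of samples, so a time in seconds maps to an index by multiplying it by the sample rate. Here is a minimal sketch of that arithmetic on a synthetic signal. The helper name clip_seconds and the generated sine tone are only stand-ins for a real track, and 22050 is assumed as the sample rate because it is librosa's default.

```python
import numpy as np

def clip_seconds(y, sr, start, end):
    """Return the samples of y between start and end (both in seconds)."""
    return y[int(start * sr):int(end * sr)]

sr = 22050                                  # assumed sample rate (librosa's default)
t = np.arange(120 * sr) / sr                # 120 seconds of timestamps
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)     # synthetic 440 Hz tone, stand-in for a song

snippet = clip_seconds(y, sr, 90, 110)
print(len(snippet) / sr)                    # duration of the snippet in seconds
```

Working in seconds and converting once keeps the window consistent between the playback call and any later plotting.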
We separate a complex-valued spectrogram D into its magnitude (S) and phase (P) components, convert the time stamps into frames, plot the data, then display the full spectrogram of the data
S_full, phase = librosa.magphase(librosa.stft(y))
idx = slice(*librosa.time_to_frames([90, 110], sr=sr))
fig, ax = plt.pyplot.subplots()
img = display.specshow(librosa.amplitude_to_db(S_full[:, idx], ref=np.max), y_axis='log', x_axis='time', sr=sr, ax=ax)
fig.colorbar(img, ax=ax)
Line by line explanation

S_full, phase = librosa.magphase(librosa.stft(y)) - we compute the short-time
Fourier transform (STFT) of the track, which represents the signal in the
time-frequency domain by taking discrete Fourier transforms (DFTs) over short,
overlapping windows of y, then split the complex result into its magnitude and phase
idx = slice(*librosa.time_to_frames([90, 110], sr=sr)) - convert the start and end
times of the snippet (90 and 110 seconds) into STFT frame indices using the
time_to_frames function of librosa, then build a slice from them
img = display.specshow(librosa.amplitude_to_db(S_full[:, idx], ref=np.max),
y_axis='log', x_axis='time', sr=sr, ax=ax) - display the spectrogram of the
20-second snippet by converting the magnitude spectrogram to a dB-scaled
spectrogram. ref=np.max measures every value relative to the largest magnitude in the
snippet, so the peak sits at 0 dB. The plot uses a logarithmic frequency axis (y) and
a time axis (x)
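To see what the frame conversion does, here is a rough sketch of the arithmetic in plain NumPy. It assumes a 22050 Hz sample rate and a hop length of 512 samples (librosa's defaults); the helper times_to_frames is a hypothetical stand-in written for this illustration, not librosa's own function. It also shows why the list of times needs two entries: a one-element list unpacks into a slice that has only a stop index.

```python
import numpy as np

def times_to_frames(times, sr=22050, hop_length=512):
    """Map times in seconds to frame indices (assumed default parameters)."""
    samples = (np.asarray(times, dtype=float) * sr).astype(int)
    return samples // hop_length

start_stop = slice(*times_to_frames([90, 110]))    # two values: start and stop
only_stop = slice(*times_to_frames([90 * 110]))    # one value: 9900 seconds

print(start_stop)   # frames covering seconds 90 to 110
print(only_stop)    # no start index: everything up to the frame at 9900 s
```

With a single value, indexing a spectrogram with the slice selects everything from the first frame onward, which is usually the whole track rather than the intended 20-second window.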
Below is the image of the spectrum:
Decomposing the spectrogram

S_filter = librosa.decompose.nn_filter(S_full, aggregate=np.median, metric='cosine', width=int(librosa.time_to_frames(2, sr=sr)))
S_filter = np.minimum(S_full, S_filter)

Line by line explanation

S_filter = librosa.decompose.nn_filter(S_full, aggregate=np.median,
metric='cosine', width=int(librosa.time_to_frames(2, sr=sr))) - we replace each
frame with the median of its nearest neighbors. Frames are compared using cosine
similarity, and neighbors must be at least 2 seconds apart. Because the instrumental
parts of a song tend to repeat while the vocals vary, this keeps the repeating
background and suppresses the vocals
S_filter = np.minimum(S_full, S_filter) - we take the element-wise minimum of
S_full and S_filter so that the filtered spectrogram is never larger than the original.
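The nearest-neighbor idea can be sketched on a toy spectrogram. The code below is a simplified illustration written for this article, not librosa's implementation: the function nn_median_filter, its parameters k and width, and the 4 x 12 toy matrix are all assumptions. It compares frames by cosine similarity, ignores frames that are too close in time, takes the median of the most similar ones, and then applies the same np.minimum step.

```python
import numpy as np

def nn_median_filter(S, k=3, width=2):
    """Replace each frame (column) by the median of its k most similar frames.

    Frames closer than `width` steps in time are not allowed as neighbors.
    """
    n_frames = S.shape[1]
    norms = np.linalg.norm(S, axis=0, keepdims=True)
    unit = S / np.maximum(norms, 1e-10)
    sim = unit.T @ unit                      # cosine similarity of every frame pair
    out = np.empty_like(S)
    for t in range(n_frames):
        candidates = sim[t].copy()
        lo, hi = max(0, t - width + 1), min(n_frames, t + width)
        candidates[lo:hi] = -np.inf          # exclude the frame and its close neighbors
        nearest = np.argsort(candidates)[-k:]
        out[:, t] = np.median(S[:, nearest], axis=1)
    return out

# Toy "spectrogram": 4 frequency bins x 12 frames of a repeating two-frame pattern.
a = np.array([4.0, 1.0, 0.0, 0.0])
b = np.array([1.0, 4.0, 0.0, 0.0])
background = np.stack([a if t % 2 == 0 else b for t in range(12)], axis=1)

S_full = background.copy()
S_full[3, 5] += 5.0                          # a one-off event, standing in for a vocal

S_filter = np.minimum(S_full, nn_median_filter(S_full))
residual = S_full - S_filter                 # what the repeating pattern cannot explain
print(residual[3, 5])
```

In this toy case the filter recovers the repeating pattern, and the one-off event is the only thing left in the residual, which is the part the masks later treat as the foreground.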
Display the background and foreground spectrum of the audio

margin_i, margin_v = 3, 11
power = 3
mask_i = librosa.util.softmask(S_filter, margin_i * (S_full - S_filter), power=power)
mask_v = librosa.util.softmask(S_full - S_filter, margin_v * S_filter, power=power)
S_foreground = mask_v * S_full
S_background = mask_i * S_full

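Line by line explanation

margin_i, margin_v = 3, 11 - we use margins to reduce bleed between the vocal and
instrumental masks; a larger margin makes the corresponding mask stricter
power = 3 - the exponent of the soft mask; higher values push the mask closer to a
hard 0-or-1 decision
mask_i and mask_v - librosa.util.softmask returns the soft masks for the instruments
(background) and the vocals (foreground), computed in a numerically stable way
S_foreground = mask_v * S_full and S_background = mask_i * S_full - multiply the
masks with the input spectrum to separate the components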
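To make the roles of the margin and the power more concrete, here is a small sketch of a soft mask of the form X**power / (X**power + X_ref**power), which is the formula described in librosa's documentation for softmask. The function soft_mask below is a simplified stand-in that expects two arrays of the same shape, and the input values are made up for illustration.

```python
import numpy as np

def soft_mask(X, X_ref, power=1.0):
    """Sketch of a soft mask: X**power / (X**power + X_ref**power)."""
    X = np.asarray(X, dtype=float)
    X_ref = np.asarray(X_ref, dtype=float)
    Z = np.maximum(X, X_ref)        # rescale by the larger value to avoid overflow
    mask = np.zeros_like(Z)
    ok = Z > 0                      # leave the mask at 0 where both inputs are 0
    x = (X[ok] / Z[ok]) ** power
    r = (X_ref[ok] / Z[ok]) ** power
    mask[ok] = x / (x + r)
    return mask

residual = np.array([0.0, 1.0, 5.0, 20.0])   # hypothetical S_full - S_filter values
repeating = np.array([1.0, 1.0, 1.0, 1.0])   # hypothetical S_filter values

loose = soft_mask(residual, 1 * repeating, power=3)     # margin of 1
strict = soft_mask(residual, 11 * repeating, power=3)   # margin of 11
print(np.round(loose, 3))
print(np.round(strict, 3))
```

With the larger margin, a residual has to be much louder than the repeating part before the mask lets it through, which is why a high vocal margin cuts down on instruments leaking into the vocal track.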
Plotting the full spectrum, background and foreground spectrum

fig, ax = plt.pyplot.subplots(nrows=3, sharex=True, sharey=True)
img = display.specshow(librosa.amplitude_to_db(S_full[:, idx], ref=np.max), y_axis='log', x_axis='time', sr=sr, ax=ax[0])
ax[0].set(title='Full Spectrum')
ax[0].label_outer()
display.specshow(librosa.amplitude_to_db(S_background[:, idx], ref=np.max), y_axis='log', x_axis='time', sr=sr, ax=ax[1])
ax[1].set(title='Background Spectrum')
ax[1].label_outer()
display.specshow(librosa.amplitude_to_db(S_foreground[:, idx], ref=np.max), y_axis='log', x_axis='time', sr=sr, ax=ax[2])
ax[2].set(title='Foreground Spectrum')
ax[2].label_outer()
fig.colorbar(img, ax=ax)
Recover the foreground audio from the masked spectrogram and play back the audio
y_foreground = librosa.istft(S_foreground * phase)
ipd.Audio(data=y_foreground[90*sr:110*sr], rate=sr)

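Line by line explanation

y_foreground = librosa.istft(S_foreground * phase) - recombines the masked
magnitude with the original phase and applies the inverse short-time Fourier
transform to get back a time-domain signal
ipd.Audio(data=y_foreground[90*sr:110*sr], rate=sr) - plays back the vocals from
the same 20-second snippet of the track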
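The reason the phase is kept around from the start is that magnitude alone is not enough to rebuild audio. The sketch below illustrates this with a single FFT frame in plain NumPy, standing in for the full STFT; it is an illustration of the idea, not how librosa implements magphase or istft, and the random frame is synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(2048)          # one synthetic audio frame

D = np.fft.rfft(frame)                     # complex spectrum of the frame
S = np.abs(D)                              # magnitude
P = np.exp(1j * np.angle(D))               # unit-magnitude phase factor

# Recombining magnitude and phase restores the complex spectrum and the signal.
restored = np.fft.irfft(S * P, n=frame.size)
print(np.allclose(restored, frame))

# Scaling the magnitude (as a mask does) changes loudness but keeps the phase.
quieter = np.fft.irfft((0.5 * S) * P, n=frame.size)
print(np.allclose(quieter, 0.5 * frame))
```

Multiplying S_foreground by phase follows the same pattern: the mask reshapes the magnitude, and the original phase tells the inverse transform where each frequency component sits in time.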
Conclusion
This seemed easy at first, while I was reading the documentation, but digging into
the code made me realize that the idea is a little more complex. What kept me going
was reading about nearest neighbors in one part of the documentation, which made me
realize that I will be getting my hands on machine learning with this library in the
future.
