OpenAI Jukebox and Google Colab “Tutorial”

9 min read · Jun 18, 2020




If you do not want to pay Google, click the Colab link below, then hit “Help” on top, then “Send Feedback”, and then “Continue Anyways”. Let them know that K80, T4, and P4 GPUs are not powerful enough for what you need to do and that they should provide P100 GPUs free as before!

First off, Google Colab is a site where you can use Google's GPUs for whatever you want, including making computer music, which is what we'll be doing.

To start, open this link to Google Colab


First you need a GPU, specifically a P100 or V100; Google assigns one automatically.

If you’re some wild power user and own one of those GPUs (they cost about $6k to $15k), go to the top right where it says “Connect”, click the arrow to the right of it, and select “Connect to local runtime”. You’ll be prompted to install Jupyter notebook on your machine and give Colab a URL so both your browser and the machine can connect.

You can use Google Drive to store the output. This is useful in case your session disconnects or the machine stops: it lets you grab whatever the machine saved, even if you didn't download the files from the notebook. It’s recommended you do this.

Before going too deep we need to run two boxes. That’s done using the play button to the left of the first line of each code box.

First, run the box with “!pip install git” with the play button on the left.

Button on the left!

The first box, which needs to be run whenever you start a new machine/GPU, has a GitHub link (above). It should say “Successfully installed” in the output box below if everything works right (it usually does).

Run the box with “import jukebox” in it.

If you’re not using Google Drive, edit the next box (the one with “model = “5b_lyrics” # or “1b_lyrics””) so the assignment that sets the save location reads “ = ‘/content/sample_data’”.

If you are using Google Drive, set it to “ = ‘/content/gdrive/My Drive/NeuralNetwork/’” instead. This will make a new folder in your Google Drive called NeuralNetwork where everything from this will go.
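The two edits amount to something like the sketch below. The variable name `save_dir` is a stand-in I made up; use whatever name the notebook’s box actually assigns.

```python
# Illustrative sketch of the save-location edit; "save_dir" is a hypothetical
# name standing in for the notebook's own variable.
USING_GOOGLE_DRIVE = False

if USING_GOOGLE_DRIVE:
    # Everything Jukebox writes will land in this Drive folder,
    # so it survives a runtime disconnect.
    save_dir = "/content/gdrive/My Drive/NeuralNetwork/"
else:
    # Machine-local storage only; lost if the session dies.
    save_dir = "/content/sample_data"

print(save_dir)
```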

Run that box. That will start downloading the base generator models for the music! It can take a few minutes at the very least, so keep following this in the meantime, since you can queue up code to run as soon as it finishes (any changes you make after a box is queued up will NOT be applied; you will need to run it again). The box you just ran is finished when the output box (below the code) shows “0: Loading prior in eval mode”.


Now, in the next 3 boxes, decide whether you’re

  • Using no sample (This can be extremely random and sometimes not generate any music)
  • Going from a sample (The most reliable choice)
  • Upscaling a previous creation to a higher level (Discussed at the end of this article)

If you’re using no sample, run the box with “mode = ‘ancestral’” in it.

If you’re going from a sample, look at the second box and edit the “audio_file” parameter, which starts as “/content/”. (If you can, trim the sample down in something like Audacity or Adobe Audition to just the however-many seconds you want, in mono, 16-bit, 44.1 kHz.)

In the screenshot above I’m using Google Drive to manage my samples. If you’re using Google Drive, go to your drive, upload the sample to the NeuralNetwork folder like any other file, and edit the path accordingly. In the example above I’ve made a folder called samples where all of them are stored.

To upload a file if you’re not using Google Drive, go to the left hand side of the window, and click on the folder.


That will bring you to a file viewer for the machine. It starts in the “content” folder automatically; drop any WAV file in the empty space under the “sample_data” folder, then edit “audio_file=/content/” to have the file name at the end, such as “/content/song.wav”.

Any file/sample you upload WILL BE CASE SENSITIVE. So if I type song.wav but the file is Song.wav, that will cause an error! Always double-check that both match.

Make sure to edit “prompt_length_in_seconds=”, as this tells the model where in the sample to pick up from when it starts generating. Run this box.
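Since a wrong-case filename might only error out after you’ve queued up an hour-long render, it’s worth sanity-checking the path first. A minimal sketch, with hypothetical file names:

```python
import os

# Hypothetical values; match these to your own sample file and notebook box.
audio_file = "/content/song.wav"   # exact, case-sensitive path to your sample
prompt_length_in_seconds = 12      # where generation picks up from

# Colab runs Linux, which treats Song.wav and song.wav as different files.
def sample_exists(path):
    return os.path.isfile(path)

# Demo: create a dummy file, then show an exact-case lookup succeeding.
open("/tmp/Song.wav", "wb").close()
print(sample_exists("/tmp/Song.wav"))  # True
print(sample_exists("/tmp/song.wav"))  # False on Colab's case-sensitive disk
```

Running a check like this before the generation box starts saves you from discovering the typo mid-render.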


Run the box with “sample_hps = Hyperparams” no matter what mode you pick.


The next box decides how long you want the output to be, via “sample_length_in_seconds”. 60 seconds seems to be a sweet spot that keeps renders from taking forever; 80 is usually better for longer songs if you don’t plan to upscale to Level 0 (discussed later). If you are using a sample, it can cut down on render time, since the model already has information for however many seconds at the start. Even with a 10- or 18-second sample at the start, 60 seconds takes about an hour at the very least.

Remember to run this box!
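The arithmetic behind those lengths can be sketched roughly. Jukebox works on 44.1 kHz audio, so each second is 44,100 raw samples; the real notebook also converts lengths into model tokens, which this sketch ignores:

```python
# Rough arithmetic behind sample_length_in_seconds (token rounding omitted).
sr = 44100                      # Jukebox's sample rate, in Hz
sample_length_in_seconds = 60
prompt_length_in_seconds = 12   # hypothetical sample prompt

total_samples = sample_length_in_seconds * sr
# The prompt's opening seconds are already "known" audio, so the model
# only has to generate the remainder.
to_generate = (sample_length_in_seconds - prompt_length_in_seconds) * sr

print(total_samples)  # 2646000 raw samples for a 60-second clip
print(to_generate)    # 2116800 samples left to actually generate
```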


So first off, decide what artist and genre(s) you want. The supported artists are on this “V2” list. If a simple page find doesn’t work on V2, remember it uses underscores for spaces and special characters!

Genres supported can be found here; you can combine them as well if it’s something like Jazz Rock, or if it’s two words, like New Wave. The genre can heavily impact what voice or sound the AI uses; for example, “Alternative Rock” will sometimes make Dave Matthews female for whatever reason, while “Rock” will sound like the actual Dave Matthews.

If you’re not sure what genre your artist fits under, search for the artist, listen to some samples, and use whichever genre listed there that you like, since that’s what the AI is told for that sample & artist. If you have no idea, check the dataset OpenAI used to grab genres from; album page genres are usually good. (AS OF NOVEMBER 2020 THIS DATASET IS NOW OFFLINE; HOWEVER, SOME ALBUM PAGES ARE STILL SAVED ON ARCHIVE.ORG.)

You can also put lyrics in. If it’s a continuation and the original sample fed to the AI has lyrics, it’ll recognize the words being said and pick up from there; this also gives the lyric structure more consistency and increases the likelihood of the lyrics being sung in the order they were input. If you don’t want any lyrics, leave it blank.
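Artist, genre, and lyrics all end up in one metadata box. A hedged sketch of roughly what that looks like (the values here are examples, and the exact field names may vary between notebook versions):

```python
# Illustrative metadata entry; values are examples, not requirements.
metas = [dict(
    artist="Dave Matthews Band",  # must match an entry on the V2 artist list
    genre="Rock",                 # combined genres like "Jazz Rock" also work
    total_length=60 * 44100,      # track length in raw 44.1 kHz samples
    offset=0,
    lyrics="""Your lyrics go here, or leave this empty for none.""",
)]

print(metas[0]["artist"], "/", metas[0]["genre"])
```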

Next, run the box with “sampling_temperature = .98”. You can raise this to .99 or 1.0 to make the output more random, or keep it at .98 if you want to limit how wild the AI can get.

Try to keep lyrics within the same general structure or genre as the artist input; if you input lyrics from a reggae or rap song, it’ll more than likely deliver them in that cadence rather than singing them.


Now, run the code starting with “if sample_hps.mode == ‘ancestral’”. This will generate the first audio! It’s gonna be fuzzy, as this is “Level 2”: the initial render it makes before you can run the upsamplers, which figure out what the high end should sound like from that fuzz.

One important note: once this starts running, check on it every so often, as free Google Colab cloud GPUs time out every 30 to 60 minutes if the tab is not being focused on. You can get around this by leaving it in another window as a single tab, though.

Once Level 2 finishes, you should see a folder named Level 2 on the left-hand side or in your Google Drive. Inside it are 3 folders, along with 3 different versions of the audio. The folders hold the HTML, JSON, and PNG files that put together the lyric-focus visualization (the HTML lyric visualizers only work in non-Chromium Microsoft Edge), along with the lyrics, artist, title, and genre. The TAR file is the raw data used to upscale the audio later on, so you should probably save it, but you don’t have to.


If you like what you hear from Level 2, you can now go on to Level 1, which is halfway to the final quality of Level 0. Level 1 takes around twice the time of Level 2. Since free Google Colab only lets you run 12 hours in the cloud maximum, making anything at Level 0 is near impossible, as that can take 12 to 24 hours (usually less when starting from a sample, around 6). Each segment can take 4 to 7 minutes depending on the time of day you start your notebook, or how much you’ve used Google Colab in the past.

Run the boxes above in order. The box starting with “#Set this False if” will download the upsamplers that take Level 2 audio to Levels 1 and 0. The box starting with “zs = upsample” will attempt to run all the way to Level 0 from Level 2. Once it finishes Level 1, it’ll make a folder where the Level 2 one was, on the left-hand side or in your Google Drive. After that, if you feel the quality is good enough, you can stop the process by clicking the square where you ran the code.

If you plan on making more in the same instance, go to the “Runtime” tab in the top left and click “Restart Runtime”, then start all over. Make sure to download your audio before restarting, as it will automatically overwrite any files.


OK, let’s say you’ve made a song but only got to Level 2 or 1 before the machine disconnected. With the latest version of the notebook, you can upsample a creation without having to run everything at once on the same machine/instance where you generated the song.

First, go to whatever directory you set the save-location variable to earlier; for this I’ll use Google Drive, but you can do this without Drive active by following the directions for uploading a sample above.

When you’re at the directory in Google Drive that the save location points to, upload your folder titled as whatever level you got to, using only underscores. For example, if you stopped at Level 1, upload the folder with the audio as level_1. At bare minimum, the folder you upload only needs the data.pth.tar file, but uploading the entire folder with the WAV files is fine.
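A small sketch of the minimum layout the notebook needs to find. The `/tmp/NeuralNetwork` directory here is a stand-in; yours is wherever your save location points:

```python
import os

# Stand-in for your save location; the level folder name must use an
# underscore (level_1, level_2), per the notebook's convention.
save_dir = "/tmp/NeuralNetwork"
os.makedirs(os.path.join(save_dir, "level_1"), exist_ok=True)

# The bare minimum needed to resume an upsample is the data.pth.tar file.
open(os.path.join(save_dir, "level_1", "data.pth.tar"), "wb").close()

ready = os.path.isfile(os.path.join(save_dir, "level_1", "data.pth.tar"))
print(ready)  # True
```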

While I like to make sure all my settings are identical to when I set up the notebook to generate the audio, just to be safe, you don’t have to do this, as it’ll pull the information from the .tar file. At the very least, make sure the timecode you continued your sample from (if you used one), the sample file itself, and the duration of the track all match.

Also, if you’re using a sample, run the box with “mode = ‘primed’” in it, filled with the info of the sample you used in the original generation, before running the upsampling box.

Run the box above to have the notebook recognize what level you’re starting from. Here I’m upsampling a Level 2 to Level 1, and the output box has notified me that it recognized the level. From there, run everything as normal.