Jazz solo
Objective: Train a Long Short-Term Memory (LSTM) network to generate music.
Download auxiliary files and data
%%bash
gdown -q 1yZ5vKsZiyZZaGfBP-ixfN1PeFljvrt3e
unzip -q jazz_generator.zip
rm jazz_generator.zip
Import libraries
import IPython
import music21
from jazz_generator.data_utils import *
from keras import layers, Input, Model, optimizers
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
music21.__version__
'6.7.1'
Load the data
The musical data is preprocessed into sequences of musical "values". In music theory, a "value" captures the information needed to play multiple notes at the same time (for example, which pitches sound together at a given moment).
Load the raw music data and preprocess it into "values".
X, Y, n_values, indices_values, chords = load_music_utils('./jazz_generator/original_metheny.mid')
print('Number of training examples:', X.shape[0])
print('Tx (length of sequence):', X.shape[1])
print('Total number of unique values:', n_values)
print('Shape of X:', X.shape)
print('Shape of Y:', Y.shape)
print('Number of chords:', len(chords))
Number of training examples: 60
Tx (length of sequence): 30
Total number of unique values: 90
Shape of X: (60, 30, 90)
Shape of Y: (30, 60, 90)
Number of chords: 19
X: A ($m$, $T_x$, 90) dimensional array.
- It has $m$ training examples, each of which is a snippet of $T_x$ musical values.
- At each time step, the input is one of 90 different possible values, represented as a one-hot vector.
- For example, X[$i$, $t$, :] is a one-hot vector representing the value of the $i$-th example at time $t$.
Y: A ($T_y$, $m$, 90) dimensional array.
- It is essentially the same as X, but shifted one step to the left: the label at time $t$ is the value at time $t+1$.
- The data in Y is reordered to shape ($T_y$, $m$, 90), where $T_y = T_x$. This format makes it more convenient to feed into the LSTM.
- The model will use the previous values to predict the next value.
- So the sequence model will try to predict $y^{<t>}$ given $x^{<1>}$, ..., $x^{<t>}$.
n_values: The number of unique values in this dataset.
indices_values: Python dictionary mapping integers 0 through 89 to musical values.
chords: Chords used in the input MIDI file.
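To make the relationship between X and Y concrete, here is a toy one-hot encoding (the indices and vocabulary size below are made up for illustration, not taken from the real dataset):

```python
import numpy as np

# Toy sequence of value indices (hypothetical, not from the real corpus)
indices = [3, 1, 4, 1, 5]
n_values = 6                     # toy vocabulary size

# One-hot encode: X[t] is the one-hot vector for the value at time t
X = np.eye(n_values)[indices]    # shape (Tx, n_values) = (5, 6)

# Y is X shifted one step to the left: y<t> = x<t+1>
# (np.roll wraps the last row around; in the real data the last label
# is handled by the preprocessing instead)
Y = np.roll(X, shift=-1, axis=0)

print(np.argmax(X[0]))   # 3 -> the first value's index
print(np.argmax(Y[0]))   # 1 -> Y at time 0 holds the *next* value
```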
A snippet of the audio from the training set:
IPython.display.Audio('./jazz_generator/30s_seq.mp3')
Create the model
Implement the model composed of $T_x$ LSTM cells, where each cell is responsible for predicting the next value based on the previous values and context. Each cell has the following schema:
- [$x^{<t>}$, $a^{<t-1>}$, $c^{<t-1>}$] -> Reshape() -> LSTM() -> Dense()
The model will call the LSTM layer $T_x$ times using a for-loop. It is important that all $T_x$ copies have the same weights: each step must reuse the same layer objects rather than re-initializing their weights.
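The weight-sharing idea can be sketched outside Keras (a conceptual analogue, not the actual layer mechanics): creating the weights once, before the loop, and reusing them at every step is exactly what calling the same layer object inside the loop accomplishes.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "shared layer": one weight matrix created once, outside the loop
W = rng.standard_normal((3, 3))

def step(x):
    # One time step: applies the same shared weights W to the input
    return np.tanh(x @ W)

x = np.ones(3)
outputs = []
for t in range(4):       # analogous to the for-loop over Tx
    x = step(x)          # every iteration reuses the same W
    outputs.append(x)

# Re-creating W inside the loop would give each step its own independent
# weights; sharing W is what lets the model learn one rule applied at
# every position in the sequence.
print(len(outputs))      # 4
```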
# Number of dimensions for the hidden state of each LSTM cell
n_a = 64
# Length of the sequences in the corpus
Tx = X.shape[1]
# Referencing a globally defined shared layer will utilize the same layer-object instance at each time step
reshaper = layers.Reshape((1, n_values))
LSTM_cell = layers.LSTM(n_a, return_state=True)
densor = layers.Dense(n_values, activation='softmax')
# Define inputs, the initial hidden state 'a0' and initial cell state 'c0'
inputs = Input(shape=(Tx, n_values))
a0 = Input(shape=(n_a,))
c0 = Input(shape=(n_a,))
a, c = a0, c0
# Create an empty list to append the outputs to while iterating
outputs = []
# Loop over Tx
for t in range(Tx):
# Select the t-th time step vector from inputs
x = inputs[:, t, :]
# Use Reshape() to reshape x to be (1, n_values)
x = reshaper(x)
# Perform one step of the LSTM_cell
a, _, c = LSTM_cell(x, initial_state=[a, c])
# Apply Dense() to the hidden state output of LSTM_cell
output = densor(a)
# Add the output to "outputs"
outputs.append(output)
# Create model instance
model = Model(inputs=[inputs, a0, c0], outputs=outputs)
Compile the model
model.compile(optimizer=optimizers.Adam(learning_rate=0.01), loss='categorical_crossentropy')
Fit the model
a0 = np.zeros((X.shape[0], n_a))
c0 = np.zeros((X.shape[0], n_a))
history = model.fit([X, a0, c0], list(Y), epochs=100, verbose=0)
plt.figure()
plt.plot(history.history['loss'])
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.show()
Generate music
Implement a music inference model to sample a sequence of musical values. It uses the trained "LSTM_cell" and "densor" from the previous model to generate a sequence of values.
# Length of the sequences in the corpus
Ty = Tx
# Define the input of the model
x0 = layers.Input(shape=(1, n_values))
# Define initial hidden state and cell state for the decoder LSTM
a0 = layers.Input(shape=(n_a,))
c0 = layers.Input(shape=(n_a,))
x, a, c = x0, a0, c0
# Create an empty list of "outputs" to later store the predicted values
outputs = []
# A KerasTensor cannot be used as input to a TensorFlow function. The function should be wrapped in a layer.
class MyLayer(layers.Layer):
def call(self, output):
# Select the next value according to "output" and set "x" to be the one-hot representation of the selected value
x = tf.math.argmax(output, axis=-1)
x = tf.one_hot(x, depth=n_values)
return x
# Loop over Ty and generate a value at every time step
for t in range(Ty):
# Perform one step of LSTM_cell
a, _, c = LSTM_cell(x, initial_state=[a, c])
# Apply Dense layer to the hidden state output of the LSTM_cell
output = densor(a)
# Append the prediction "output" to "outputs"
outputs.append(output)
x = MyLayer()(output)
# Use RepeatVector(1) to convert x into a tensor with shape=(None, 1, 90)
x = layers.RepeatVector(1)(x)
# Create model instance with the correct "inputs" and "outputs"
inference_model = Model(inputs=[x0, a0, c0], outputs=outputs)
The inference model generates a sequence of values. The values are then post-processed into musical chords (meaning that multiple values or notes can be played at the same time).
Most computational music algorithms use some post-processing because it's difficult to generate music that sounds good without it. The post-processing does things like clean up the generated audio by making sure the same sound is not repeated too many times, or that two successive notes are not too far from each other in pitch, and so on.
One could argue that many of these post-processing steps are hacks; much of the music generation literature has focused on hand-crafting post-processors, and the output quality often depends as much on the post-processing as on the model itself. Still, this post-processing makes a huge difference, so it is used in this implementation as well.
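One such pruning step can be sketched as follows (a simplified illustration of the idea, not the actual logic inside generate_music): collapsing immediate repeats so the same value is never emitted twice in a row.

```python
def prune_repeats(indices):
    """Collapse consecutive duplicates in a sequence of predicted value indices."""
    pruned = [indices[0]]
    for idx in indices[1:]:
        if idx != pruned[-1]:       # keep a value only if it differs from the last
            pruned.append(idx)
    return pruned

print(prune_repeats([7, 7, 3, 3, 3, 7, 2]))   # [7, 3, 7, 2]
```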
_ = generate_music(inference_model, indices_values, chords, output_file='./my_music.midi')
mid2wav('./my_music.midi', output_file='./my_music.wav')
IPython.display.Audio('./my_music.wav')
Predicting new values for different set of chords.
Generated 22 sounds using the predicted values for the set of chords ("1") and after pruning
Generated 22 sounds using the predicted values for the set of chords ("2") and after pruning
Generated 22 sounds using the predicted values for the set of chords ("3") and after pruning
Generated 22 sounds using the predicted values for the set of chords ("4") and after pruning
Generated 22 sounds using the predicted values for the set of chords ("5") and after pruning
Your generated music is saved in ./my_music.midi
!rm -rf ./jazz_generator ./my_music*