MLFan

What if you could get feedback from virtual fans?

 

This project examines whether a machine learning model can simulate human perception of certain song qualities.

By training a basic regression model on a set of transformed audio statistics called MFCCs, this web app scores audio samples along two dimensions: valence, a measure of a song’s relative buoyancy, and energy, a measure of a song’s intensity.

In effect, MLFan lets musicians train a virtual fan on their existing repertoire, then have the model predict energy and valence scores for their newer music.

Overview

MLFan Workflow →
- Training Samples: 20 songs
- Feature Extractor: meyda.js, p5.js
- MFCC Training Data: 20 songs × 20 samples × 10 frames × 39 statistics
- Regression Model: ml5.js, p5.js
- Testing Samples: 10 songs
- MFCC Testing Data: 10 songs × 20 samples × 10 frames × 39 statistics

MLFan uses two separate components. The first, a feature extractor, converts clips of high-fidelity audio streams into manageable arrays of numbers. These numbers, called MFCCs, can best be thought of as the relative power of a small frequency band at each moment during the sample. Because they describe intensity across a range of human-friendly frequencies, they’re a helpful way to distill the complex waveforms of recorded music into something our model can work with.

While these numbers are being extracted, the user also inputs valence and energy scores for each sample. This gives the model labeled examples, so it can work through each sample and learn how a given set of MFCCs corresponds to a specific energy or valence score.

The second component, a regression model, takes the MFCCs and valence/energy ratings from the training set of songs and builds a predictive model. This web application allows for a variety of hyperparameters to be specified, including training epochs, learning rate, and model complexity.
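As a rough sketch of the setup (the input size, score values, and variable names here are assumptions based on ml5.js’s neuralNetwork API, not the app’s actual code):

```javascript
// Hypothetical ml5.js configuration for a two-output regression on MFCCs.
// Input size assumes one flattened window of 10 frames x 39 statistics.
const options = {
  task: 'regression',
  inputs: 10 * 39,
  outputs: 2,           // [valence, energy]
  learningRate: 0.002,  // one of the hyperparameters swept later
  debug: true,
};
const fan = ml5.neuralNetwork(options);

// Each training sample pairs an MFCC array with its human-rated scores.
fan.addData(mfccArray, { valence: 0.7, energy: 0.4 });
fan.normalizeData();
fan.train({ epochs: 30 }, () => fan.predict(testMfccArray, handleResults));
```

The epoch count, learning rate, and number of statistics are the hyperparameters the app exposes for experimentation.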

 
The feature extractor program generates a live, scrolling visualization of 39 cepstral statistics: the 13 first-order MFCCs, plus their first- and second-order rates of change, also called Δ-MFCCs and ΔΔ-MFCCs.
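The rate-of-change statistics can be sketched as simple frame-to-frame differences (a common approximation; the actual extractor may use a smoothed estimator):

```javascript
// Given a sequence of MFCC frames (each an array of 13 coefficients),
// append first differences (Δ) and second differences (ΔΔ) to each frame,
// yielding 39 statistics per frame. The first frame has no predecessor,
// so its differences are defined as zero.
function withDeltas(frames) {
  const diff = (seq) =>
    seq.map((frame, i) =>
      i === 0
        ? frame.map(() => 0) // no previous frame: Δ defined as 0
        : frame.map((c, j) => c - seq[i - 1][j])
    );
  const deltas = diff(frames);       // Δ-MFCCs
  const deltaDeltas = diff(deltas);  // ΔΔ-MFCCs
  return frames.map((frame, i) => [...frame, ...deltas[i], ...deltaDeltas[i]]);
}
```

With 13 coefficients per input frame, each output frame carries the 39 statistics described above.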

The test songs were classified by averaging five sets of human ratings (from the songs’ writers), shown in red. The model’s predictions, in blue, do a good job of preserving relative relationships between songs (i.e. which of two songs has higher valence or energy?), and the trend lines have similar slopes, indicating that the model understood the difference between a high-energy song and a low-energy song. Adding a bias or linear transformation term could widen the ‘spread’ of the model’s predictions.

Optimization

Once we have a functioning workflow, we must refine the regression model and find the combination of hyperparameters that minimizes its overall loss.

For a regression in two dimensions, we’d like to minimize the mean Euclidean distance between each human-provided (valence, energy) score and the corresponding model prediction. In the illustration above, this means finding the model that yields the shortest average distance between each red point and its blue counterpart.
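This loss can be computed directly from the two sets of scores (a minimal sketch; the function name is illustrative):

```javascript
// Mean Euclidean distance between human ratings and model predictions,
// each given as an array of [valence, energy] pairs. Lower is better.
function meanEuclideanDistance(humans, preds) {
  const total = humans.reduce((sum, [hv, he], i) => {
    const [pv, pe] = preds[i];
    return sum + Math.hypot(hv - pv, he - pe); // 2-D Euclidean distance
  }, 0);
  return total / humans.length;
}
```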

 
The regression model web app computes predictions for all testing samples across all 48 combinations of parameters. Here’s a sample output for model 29: 26 statistics (13 MFCCs plus their first differences), a learning rate of 0.002, and 30 training epochs.
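The sweep over parameter combinations can be sketched as a cartesian product. The values below are illustrative stand-ins only (the app’s actual grid of 48 combinations is not fully specified here):

```javascript
// Illustrative hyperparameter grid; the real grid differs in its values.
const grid = {
  featureCount: [13, 26, 39],   // MFCCs only, +Δ, +ΔΔ
  learningRate: [0.002, 0.02],
  epochs: [30, 50, 100],
};

// Enumerate every combination of the grid's values.
function combinations(grid) {
  return Object.entries(grid).reduce(
    (acc, [key, values]) =>
      acc.flatMap((partial) => values.map((v) => ({ ...partial, [key]: v }))),
    [{}]
  );
}
```

Each resulting combination configures one candidate model, which is then scored by its mean Euclidean distance on the test set.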

Training the model for longer does not correlate with stronger outcomes; in fact, training any candidate model for 100 epochs provided worse outcomes than any similar model trained for 50 epochs.

 

The second web application, the regression model, lets the user try out many combinations of parameters in a single runtime. The model that performed best on our samples was one of the simplest: taking only the MFCCs (no rate-of-change statistics), training for 50 epochs, and using a higher learning rate.