Transfer learning from speech to music: towards language-sensitive emotion recognition models


In this study, we address emotion recognition using unsupervised feature learning from speech data, and test its transferability to music. Our approach is to pre-train models using speech in English and Mandarin, and then fine-tune them with excerpts of music labeled with categories of emotion. Our initial hypothesis is that features automatically learned from speech should be transferable to music. Namely, we expect the intra-linguistic setting (e.g., pre-training on speech in English and fine-tuning on music in English) should result in improved performance over the cross-linguistic setting (e.g., pre-training on speech in English and fine-tuning on music in Mandarin). Our results confirm previous research on cross-domain transferability, and encourage research towards language-sensitive Music Emotion Recognition (MER) models.


This script loads an audio file and predicts its perceived emotion as one of four classes, the quadrants of Russell's circumplex model of emotion. Quadrant 1 corresponds to positive arousal and positive valence (e.g., happy), Quadrant 2 to positive arousal and negative valence (e.g., angry), Quadrant 3 to negative arousal and negative valence (e.g., sad), and Quadrant 4 to negative arousal and positive valence (e.g., relaxed). The models were pre-trained on speech in English (LibriSpeech) and Mandarin (AISHELL), then fine-tuned via transfer learning on music in English (4Q-Emotion) and Mandarin (CH-818). The audio directory includes some examples to test on.
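The quadrant numbering can be summarised as a lookup on the signs of arousal and valence. The following is a minimal sketch, not code from this repository; the function name `quadrant` and the choice to treat zero as positive are assumptions made for illustration.

```python
def quadrant(arousal: float, valence: float) -> int:
    """Map signed arousal/valence values to a quadrant number (1-4).

    Hypothetical helper: zero is treated as positive, which is an
    arbitrary choice for this sketch.
    """
    if arousal >= 0:
        return 1 if valence >= 0 else 2
    return 4 if valence >= 0 else 3


# Example emotions taken from the quadrant description.
EXAMPLES = {1: "happy", 2: "angry", 3: "sad", 4: "relaxed"}

if __name__ == "__main__":
    q = quadrant(arousal=0.8, valence=-0.6)
    print(q, EXAMPLES[q])  # prints "2 angry"
```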


Install Docker (for Ubuntu, see the official Docker installation guide).


Clone this repository and build the container with all required dependencies. This might take a while, since TensorFlow is installed from scratch:

git clone
cd quad-pred
docker build -t quadpred .
docker run -it --rm -v /abspath/quad-pred/audio/anger_1.mp3:/audio.mp3 -v /abspath/quad-pred/audio/:/outdir quadpred -s e -m e -i /audio.mp3 -o /outdir/anger_1.npy

Alternatively, install the dependencies and run the script directly:

pip3 install -r requirements.txt
python3 -s e -m e -i audio/anger_1.mp3 -o audio/anger_1.npy

You can also change the flags in the constructor method to output a taggram or print the mean probability of the classifier over the whole clip.
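The relationship between the two outputs can be sketched as follows, assuming a taggram is a sequence of per-frame probability vectors over the four quadrants and the clip-level prediction is their average. The function name `mean_over_clip` and the toy values are hypothetical; the actual layout of the script's output may differ.

```python
from typing import List, Sequence


def mean_over_clip(taggram: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame class probabilities into one vector per clip."""
    if not taggram:
        raise ValueError("taggram must contain at least one frame")
    n_frames = len(taggram)
    n_classes = len(taggram[0])
    return [
        sum(frame[c] for frame in taggram) / n_frames
        for c in range(n_classes)
    ]


# Toy taggram: 3 frames x 4 quadrants (values invented for illustration).
toy_taggram = [
    [0.1, 0.7, 0.1, 0.1],
    [0.3, 0.5, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
]

if __name__ == "__main__":
    print(mean_over_clip(toy_taggram))
```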




Calculating output for file: audio/anger_1.mp3
Using model: ./models/speech_eng_2_music_eng/model_over_8.spec.it_1.feat_ext.json 
Mean predictions for file: audio/anger_1.npy
Quadrant 1 (positive arousal, positive valence): 0.20719571
Quadrant 2 (positive arousal, negative valence): 0.7511331
Quadrant 3 (negative arousal, negative valence): 0.021125803
Quadrant 4 (negative arousal, positive valence): 0.020545341
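To turn the mean predictions into a single label, one option is to pick the quadrant with the highest probability. The sketch below reuses the values printed above; the helper `top_quadrant` is a hypothetical name and is not part of the script.

```python
from typing import Sequence, Tuple

QUADRANTS = [
    "Quadrant 1 (positive arousal, positive valence)",
    "Quadrant 2 (positive arousal, negative valence)",
    "Quadrant 3 (negative arousal, negative valence)",
    "Quadrant 4 (negative arousal, positive valence)",
]


def top_quadrant(probs: Sequence[float]) -> Tuple[str, float]:
    """Return the label and probability of the most likely quadrant."""
    if len(probs) != len(QUADRANTS):
        raise ValueError("expected one probability per quadrant")
    index = max(range(len(probs)), key=lambda i: probs[i])
    return QUADRANTS[index], probs[index]


# Mean predictions copied from the example output for audio/anger_1.mp3.
mean_predictions = [0.20719571, 0.7511331, 0.021125803, 0.020545341]

if __name__ == "__main__":
    label, prob = top_quadrant(mean_predictions)
    print(f"{label}: {prob:.3f}")
    # prints "Quadrant 2 (positive arousal, negative valence): 0.751"
```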


Use the --help flag to see the complete list of options, with python3 --help or docker run --rm quadpred -h.

usage: [-h] -s SPEECH -m MUSIC -i INPUT

optional arguments:
  -h, --help            show this help message and exit
  -s SPEECH, --speech SPEECH
                        Select from pretrained models on speech in english (e)
                        or mandarin (m)
  -m MUSIC, --music MUSIC
                        Select music of data for transfer learning: english
                        (e) or mandarin (m)
  -i INPUT, --input INPUT
                        Select filename to make predictions
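The help text above is consistent with a parser declared roughly as follows. This is a sketch using Python's standard `argparse` module, not the repository's actual code: the function name `build_parser` is hypothetical, and only the options listed in the help output are reproduced (the `-o` option used in the example commands is not listed there, so it is left out).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in the help text."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-s", "--speech", required=True,
        help="Select from pretrained models on speech in english (e) "
             "or mandarin (m)",
    )
    parser.add_argument(
        "-m", "--music", required=True,
        help="Select music of data for transfer learning: english (e) "
             "or mandarin (m)",
    )
    parser.add_argument(
        "-i", "--input", required=True,
        help="Select filename to make predictions",
    )
    return parser


if __name__ == "__main__":
    # Parse the same flags as the example command, minus the output path.
    args = build_parser().parse_args(
        ["-s", "e", "-m", "e", "-i", "audio/anger_1.mp3"]
    )
    print(args.speech, args.music, args.input)
```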


Link to paper

    author = {Juan Sebasti{\'a}n G{\'o}mez-Ca{\~n}{\'o}n and Estefan{\'i}a Cano and Perfecto Herrera and Emilia G{\'o}mez},
    title = {Transfer learning from speech to music: towards language-sensitive emotion recognition models},
    year = {2020},
    booktitle = {Proceedings of the 28th European Signal Processing Conference (EUSIPCO)},
    address = {Amsterdam, The Netherlands},
    pages = {136--140}

