Music Emotion Volatility Prediction

A Feature-Based Analysis

Status: Research Completed

This research investigates whether the emotional volatility of music—how much listeners' emotional responses vary—can be predicted from acoustic features, and whether this prediction differs from predicting average emotional responses.

While most music emotion recognition focuses on mean responses, this study explores the often-overlooked dimension of emotional variance, revealing which songs evoke consistent emotions versus those producing highly varied listener experiences.

Presentation Video

Watch the complete research presentation explaining methodology, findings, and implications.

Music Emotion Volatility Research Presentation

Research Hypothesis

H1 — Emotional Volatility is Predictable

Acoustic structure predicts the variance of listeners' emotional responses at least as well as it predicts the mean response.

Dataset & Approach

The study used the PMEmo dataset (Personalized Music Emotion Dataset), which contains:

Perceptual acoustic features
Listener ratings across multiple contexts
Emotion labels (Arousal and Valence dimensions)

This dataset enabled cross-listener context sensitivity analysis, making it ideal for examining emotional variance across diverse listening experiences.

6,373 Initial Acoustic Features

767 Songs Analyzed

4 Target Variables

227 Features for MEAN Targets

322 Features for STD Targets
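The four targets are referred to as mean_A, mean_V, std_A and std_V in the sections that follow. As a minimal sketch of how such per-song targets could be derived from individual listener ratings, the example below groups ratings by song and computes the mean and standard deviation of each dimension. The input layout (song_id, arousal, valence), the function name song_targets, and the use of the sample standard deviation are assumptions made for illustration; the study does not state how the targets were computed.

```python
from collections import defaultdict
from statistics import mean, stdev


def song_targets(ratings):
    """Aggregate per-listener ratings into per-song mean and std targets.

    ratings: iterable of (song_id, arousal, valence), one row per listener.
    This row layout is hypothetical, chosen only for the example.
    """
    by_song = defaultdict(lambda: ([], []))
    for song_id, arousal, valence in ratings:
        by_song[song_id][0].append(arousal)
        by_song[song_id][1].append(valence)

    targets = {}
    for song_id, (arousal_vals, valence_vals) in by_song.items():
        if len(arousal_vals) < 2:
            continue  # the sample std needs at least two listeners
        targets[song_id] = {
            "mean_A": mean(arousal_vals),
            "mean_V": mean(valence_vals),
            "std_A": stdev(arousal_vals),  # assumption: sample std (n - 1)
            "std_V": stdev(valence_vals),
        }
    return targets


# Toy ratings, made up for illustration (not PMEmo data).
ratings = [
    ("song_1", 0.8, 0.6), ("song_1", 0.7, 0.2), ("song_1", 0.9, 0.9),
    ("song_2", 0.3, 0.5), ("song_2", 0.3, 0.5),
]
targets = song_targets(ratings)
```

In this toy input, song_2 has identical ratings from both listeners, so its volatility targets are zero, while song_1 shows disagreement on valence in particular.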

Methodology Pipeline

Stage 1: Unsupervised Feature Pruning

Variance Filtering: Removed 279 near-constant features with variance below 1e-6

Collinearity Reduction: Eliminated 2,254 redundant features with correlation exceeding 0.95

Result: Feature space reduced from 6,373 to 3,840 features
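The two pruning steps above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function name prune_features is hypothetical, the correlation is taken as the absolute Pearson correlation, and redundant features are dropped greedily by keeping the first column of each highly correlated group. The study does not say which member of a correlated pair was kept.

```python
import numpy as np


def prune_features(X, var_threshold=1e-6, corr_threshold=0.95):
    """Return the column indices of X that survive both filters."""
    X = np.asarray(X, dtype=float)

    # Step 1: drop near-constant columns (variance below the threshold).
    keep = np.flatnonzero(X.var(axis=0) >= var_threshold)
    if keep.size == 0:
        return keep

    # Step 2: greedy collinearity pass. A column is dropped when its absolute
    # correlation with any already-kept column exceeds the threshold.
    corr = np.atleast_2d(np.abs(np.corrcoef(X[:, keep], rowvar=False)))
    selected = []
    for j in range(keep.size):
        if all(corr[j, k] <= corr_threshold for k in selected):
            selected.append(j)
    return keep[selected]


# Toy matrix: column 1 is constant, column 2 is almost a copy of column 0.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, np.full(200, 3.0), 2 * a + 1e-3 * b, b])
kept = prune_features(X)
```

On the toy matrix, the constant column fails the variance filter and the near-duplicate column fails the collinearity filter, leaving the two independent columns.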

Stage 2: Mutual Information Ranking

Applied Pareto selection separately for each target, retaining the features that account for the top 20% of total mutual information:

mean_A: 135 features selected
mean_V: 172 features selected
std_A: 184 features selected
std_V: 144 features selected

⚠️ Critical Finding: No feature was selected for all four targets (zero four-way intersection), which suggests that predicting the mean and predicting the variance draw on different acoustic information.
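One way to read the Stage 2 selection rule is sketched below: rank features by mutual information for a given target, then keep the highest-ranked features until their cumulative score reaches 20% of the total. This reading, the function name pareto_select, and the toy scores are all assumptions for illustration. In practice the scores could come from an estimator such as scikit-learn's mutual_info_regression, though the study does not say which estimator was used.

```python
def pareto_select(mi_scores, share=0.20):
    """Keep top-ranked features until they cover `share` of the total MI.

    mi_scores: dict mapping feature name -> non-negative MI score
    for a single target.
    """
    total = sum(mi_scores.values())
    if total <= 0:
        return []
    ranked = sorted(mi_scores, key=mi_scores.get, reverse=True)
    selected, cumulative = [], 0.0
    for name in ranked:
        if cumulative >= share * total:
            break
        selected.append(name)
        cumulative += mi_scores[name]
    return selected


# Toy MI scores, made up for illustration (not values from the study).
toy_scores = {
    "mean_A": {"f1": 0.50, "f2": 0.30, "f3": 0.10, "f4": 0.10},
    "mean_V": {"f1": 0.45, "f2": 0.35, "f3": 0.10, "f4": 0.10},
    "std_A": {"f1": 0.10, "f2": 0.10, "f3": 0.60, "f4": 0.20},
    "std_V": {"f1": 0.10, "f2": 0.20, "f3": 0.10, "f4": 0.60},
}
selected = {target: set(pareto_select(scores))
            for target, scores in toy_scores.items()}

# Features chosen for every one of the four targets.
common_to_all = set.intersection(*selected.values())
# Features shared by the two mean targets only.
shared_by_means = selected["mean_A"] & selected["mean_V"]
```

In the toy data the four-way intersection is empty even though the two mean targets share a feature, which illustrates that an empty four-way intersection does not by itself rule out pairwise overlap.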

Model Training

Algorithm: ElasticNet Regression
Data Split: 70% Train / 20% Validation / 10% Test
Preprocessing: Standardization applied after the split to prevent data leakage
Hyperparameter Optimization: Alpha and L1 ratio tuned on validation-set performance
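The training setup above can be sketched with scikit-learn as follows. This is a minimal illustration, not the study's code: the function name fit_elasticnet, the candidate grids for alpha and L1 ratio, the random seed, and the synthetic data are all assumptions. The scaler is fit on the training portion only, which is one common way to standardize after the split without leaking validation or test statistics.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_elasticnet(X, y, alphas=(0.001, 0.01, 0.1, 1.0),
                   l1_ratios=(0.1, 0.5, 0.7, 0.9), seed=0):
    # 70% train; the remaining 30% is split into 20% validation, 10% test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=1 / 3, random_state=seed)

    # Fit the scaler on the training data only, then apply it everywhere.
    scaler = StandardScaler().fit(X_train)
    X_train, X_val, X_test = (
        scaler.transform(part) for part in (X_train, X_val, X_test))

    # Pick the (alpha, l1_ratio) pair with the best validation R².
    best = None
    for alpha in alphas:
        for l1_ratio in l1_ratios:
            model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                               max_iter=10_000).fit(X_train, y_train)
            val_r2 = r2_score(y_val, model.predict(X_val))
            if best is None or val_r2 > best["val_r2"]:
                best = {"alpha": alpha, "l1_ratio": l1_ratio,
                        "val_r2": val_r2, "model": model}

    # The test set is touched once, after hyperparameters are fixed.
    best["test_r2"] = r2_score(y_test, best["model"].predict(X_test))
    best["sizes"] = (len(y_train), len(y_val), len(y_test))
    return best


# Synthetic data for illustration only (not PMEmo features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)
result = fit_elasticnet(X, y)
```

In the study this procedure would be repeated once per target (mean_A, mean_V, std_A, std_V), each with its own selected feature set.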

Results: Static Feature Dataset

Target	Train R²	Validation R²	Test R²	Interpretation
mean_A	0.797	0.780	0.707	Excellent — mean arousal is highly predictable
mean_V	0.551	0.562	0.493	Moderate — mean valence is predictable but weaker
std_A	0.234	0.039	0.220	Low — arousal variance is difficult to predict
std_V	0.107	0.029	-0.028	Unpredictable — valence variance not captured by features
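The negative test R² for std_V may look surprising, so a short worked example helps. Assuming the standard coefficient of determination, R² = 1 - SS_res / SS_tot, a value of zero corresponds to always predicting the mean of the evaluated targets, and a negative value means the model's squared error is larger than that baseline. The helper name r2 and the toy numbers below are illustrative only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)                 # every prediction exact
baseline = r2(y_true, [2.5] * 4)             # always predict the mean
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])     # predictions anti-correlated
```

Here perfect is 1.0, baseline is 0.0, and worse is -3.0, so a slightly negative score such as the one reported for std_V indicates predictions that are marginally worse than a constant guess.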

Results: Dynamic Feature Dataset

Given the poor volatility prediction from static features, I tested whether dynamic temporal statistics (computed from time-varying acoustic properties) would better capture emotional variance.

Target	Test R²	Best Alpha	Best L1 Ratio
mean_A	0.657	0.001	0.1
mean_V	0.506	0.001	0.1
std_A	0.211	0.001	0.9
std_V	-0.009	0.01	0.7

Hypothesis Outcome: Not Supported

Dynamic features produced results similar to static features (test R² of 0.211 vs. 0.220 for std_A and -0.009 vs. -0.028 for std_V), indicating that acoustic structure alone cannot reliably predict emotional volatility, particularly for valence.
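To make the static versus dynamic comparison concrete, the short script below computes the change in test R² per target, using only the values reported in the two results tables above. The variable names are illustrative.

```python
# Test R² values copied from the static and dynamic results tables.
static_test_r2 = {"mean_A": 0.707, "mean_V": 0.493,
                  "std_A": 0.220, "std_V": -0.028}
dynamic_test_r2 = {"mean_A": 0.657, "mean_V": 0.506,
                   "std_A": 0.211, "std_V": -0.009}

# Positive delta: dynamic features did better; negative: static did better.
deltas = {target: round(dynamic_test_r2[target] - static_test_r2[target], 3)
          for target in static_test_r2}

for target, delta in deltas.items():
    print(f"{target}: {delta:+.3f}")
```

The largest shift is a drop of 0.050 for mean_A, and neither volatility target moves by more than 0.02, which is the sense in which the two feature sets perform similarly.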

Visualization: Static Feature Results

Predicted vs. actual emotion values for static acoustic features across all four targets.

Figure panels (static features): Mean Arousal Prediction, Mean Valence Prediction, Arousal Volatility Prediction, Valence Volatility Prediction.

Visualization: Dynamic Feature Results

Predicted vs. actual emotion values for dynamic temporal features across all four targets.

Figure panels (dynamic features): Mean Arousal Prediction, Mean Valence Prediction, Arousal Volatility Prediction, Valence Volatility Prediction.

Key Findings & Implications

1. Acoustic Determinism vs. Listener Factors

Acoustic features predict mean arousal well (test R² = 0.707) and mean valence moderately (test R² = 0.493), but they explain little of the variance in emotional responses (test R² = 0.220 for std_A and -0.028 for std_V). This suggests that emotional volatility arises largely from listener-specific factors such as personal history, context, mood, and cultural background, rather than from acoustic structure.

2. Distinct Feature Sets for Mean vs. Variance

No single feature was selected for all four prediction tasks. This zero-intersection result suggests that predicting the average emotional response draws on different acoustic information than predicting its variability.

3. Implications for Music Recommendation Systems

Truly personalized music recommendation systems must account for both acoustic properties and listener-specific contextual factors to understand the full spectrum of emotional response. Systems that only consider acoustic features will miss the interpersonal variance that makes music experiences uniquely personal.

This research demonstrates that some aspects of musical emotion are acoustically determined (mean responses), while others emerge from the interaction between music and individual listeners (emotional variance). Understanding this distinction is crucial for advancing both music emotion recognition and personalized listening experiences.

🎵 Future work will explore incorporating listener metadata, contextual information, and temporal dynamics to better model emotional volatility.