Predicting whether a song will break into the Billboard Top 10 with logistic regression in R

Summary

Revenue in the music industry is highly concentrated: a record company earns nearly all of its money from the few songs that “blow up” and chart on the Billboard Top 10. The vast majority of songs are flops, which makes identifying the rare hits that much harder.

An analytical model that uses musical characteristics (such as pitch, tempo and energy) to predict whether a song will make the Billboard Top 10 could therefore help record companies decide which songs to allocate marketing resources toward.

Data Source

The data is sourced from Kaggle. It covers 7,000 songs, only about 15% of which charted on the Billboard Top 10. More than 30 variables describe the musical characteristics of each song, including pitch, tempo and energy.

First problem-solving approach (and the problems that arose)

Logistic regression is well suited to this kind of problem, which is a binary classification: each song is either a hit (1) or a flop (0). However, because the dataset is imbalanced (only about 15% of songs are hits), the conventional cut-off probability of 0.5, above which a song is classified as a hit, performs poorly on the class that matters. The model classifies 98% of the flops correctly but only 19% of the hits, because few songs receive a predicted probability above 0.5 when hits are so rare. That needed fixing.
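To make the problem concrete, here is a minimal sketch of a logistic regression scored at the 0.5 cut-off. It is not our project code: the data is simulated, the column names (`tempo`, `energy`, `pitch`, `top10`) and the coefficients are assumptions chosen only to produce a hit rate of roughly 15%, and the model is evaluated on the same data it was fitted to for brevity.

```r
# Synthetic stand-in for the song data; column names and effects are hypothetical.
set.seed(42)
n <- 7000
songs <- data.frame(
  tempo  = rnorm(n),
  energy = rnorm(n),
  pitch  = rnorm(n)
)

# Simulate a hit rate of roughly 15% that depends weakly on the features.
linpred <- -2 + 0.6 * songs$energy + 0.4 * songs$tempo - 0.3 * songs$pitch
songs$top10 <- rbinom(n, size = 1, prob = plogis(linpred))

# Fit the logistic regression and get a predicted probability for every song.
fit <- glm(top10 ~ tempo + energy + pitch, data = songs, family = binomial)
p_hat <- predict(fit, type = "response")

# Classify with the default 0.5 cut-off and tabulate actual vs predicted.
pred <- as.integer(p_hat > 0.5)
conf_mat <- table(actual = songs$top10,
                  predicted = factor(pred, levels = c(0, 1)))

sensitivity <- conf_mat["1", "1"] / sum(conf_mat["1", ])  # share of hits caught
specificity <- conf_mat["0", "0"] / sum(conf_mat["0", ])  # share of flops caught

print(conf_mat)
cat("sensitivity:", round(sensitivity, 3),
    " specificity:", round(specificity, 3), "\n")
```

On imbalanced data like this, the sketch should show the same pattern as our real model: nearly every flop is classified correctly while most hits are missed.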

Arriving at a better model

The task then becomes clear: identify the right cut-off probability to use as the decision threshold, above which a song is predicted to be a hit. My team and I tried several approaches, including a cost-based penalty approach and one that finds the optimal cut-off from the receiver operating characteristic (ROC) curve.

Software Packages / tools used

For the cost-based penalty approach, my team and I used the R programming environment and associated packages.
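A cost-based cut-off search can be sketched in base R roughly as follows. This is an illustration, not our actual code: the data is simulated, and the penalties (`cost_fn`, `cost_fp`) are hypothetical values that assume a missed hit is five times as costly as backing a flop.

```r
# Synthetic stand-in data; variable names and effects are hypothetical.
set.seed(42)
n <- 7000
energy <- rnorm(n)
tempo  <- rnorm(n)
top10  <- rbinom(n, size = 1, prob = plogis(-2 + 0.8 * energy + 0.5 * tempo))

fit   <- glm(top10 ~ energy + tempo, family = binomial)
p_hat <- fitted(fit)  # predicted probability of being a hit

# Hypothetical penalties: a missed hit costs 5, a wrongly backed flop costs 1.
cost_fn <- 5
cost_fp <- 1

# Total misclassification cost at a given cut-off.
total_cost <- function(cutoff) {
  pred <- p_hat > cutoff
  fn <- sum(!pred & top10 == 1)  # hits predicted as flops
  fp <- sum(pred & top10 == 0)   # flops predicted as hits
  cost_fn * fn + cost_fp * fp
}

# Evaluate a grid of cut-offs and keep the cheapest one.
cutoffs <- seq(0.01, 0.99, by = 0.01)
costs <- vapply(cutoffs, total_cost, numeric(1))
best_cutoff <- cutoffs[which.min(costs)]

cat("lowest-cost cut-off:", best_cutoff, "\n")
```

Raising the penalty on missed hits pushes the chosen cut-off lower, which trades some flop accuracy for more correctly identified hits.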

For the ROC curve approach, my team and I used the Solver optimization add-in in Microsoft Excel.
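We did this step in Excel, but the same idea can be sketched in base R. The version below is an assumption-laden illustration: the data is simulated, and it picks the cut-off that maximizes Youden's J (true positive rate minus false positive rate), which is one common ROC-based criterion and not necessarily the objective we set up in Solver.

```r
# Synthetic stand-in data; variable names and effects are hypothetical.
set.seed(42)
n <- 7000
energy <- rnorm(n)
tempo  <- rnorm(n)
top10  <- rbinom(n, size = 1, prob = plogis(-2 + 0.8 * energy + 0.5 * tempo))

fit   <- glm(top10 ~ energy + tempo, family = binomial)
p_hat <- fitted(fit)  # predicted probability of being a hit

# Trace the ROC curve: one (FPR, TPR) point per candidate cut-off.
cutoffs <- seq(0.01, 0.99, by = 0.01)
tpr <- vapply(cutoffs, function(k) mean(p_hat[top10 == 1] > k), numeric(1))
fpr <- vapply(cutoffs, function(k) mean(p_hat[top10 == 0] > k), numeric(1))

# Youden's J: the vertical distance of the ROC curve above the diagonal.
youden_j <- tpr - fpr
best <- which.max(youden_j)
best_cutoff <- cutoffs[best]

cat("cut-off:", best_cutoff,
    " hits correct:", round(tpr[best], 3),
    " flops correct:", round(1 - fpr[best], 3), "\n")
```

Unlike the cost-based search, this criterion weights hits and flops equally, so it tends to land on a cut-off that balances the two accuracy rates.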

Results

Using the cost-based penalty approach: 76% of the hits and 70% of the flops were classified correctly.

Using the ROC curve: 67% of the hits and 78% of the flops were classified correctly.
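For a rough side-by-side comparison, the reported rates can be combined into an overall accuracy and a balanced accuracy (the simple average of the two rates). The short calculation below assumes the approximate 15% hit share applies to the songs these rates were measured on, so the overall figures are indicative only.

```r
# Approximate share of hits in the data (about 15%).
hit_share <- 0.15

# Rates reported for the 0.5 cut-off and the two tuned cut-offs.
rates <- data.frame(
  approach    = c("0.5 cut-off", "cost-based", "ROC-based"),
  hits_right  = c(0.19, 0.76, 0.67),
  flops_right = c(0.98, 0.70, 0.78)
)

# Overall accuracy weights each rate by its class share;
# balanced accuracy treats hits and flops as equally important.
rates$overall  <- hit_share * rates$hits_right +
  (1 - hit_share) * rates$flops_right
rates$balanced <- (rates$hits_right + rates$flops_right) / 2

print(rates)
```

Under that assumption, overall accuracy is about 86% for the 0.5 cut-off, 71% for the cost-based cut-off and 76% for the ROC-based one, while balanced accuracy is 58.5%, 73% and 72.5% respectively. Overall accuracy flatters the 0.5 cut-off because flops dominate the data; balanced accuracy shows why the tuned cut-offs are more useful for finding hits.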

Managerial implications / decision making

We think these results will be helpful to record companies. Compared with the 0.5 cut-off, flop accuracy drops from 98% to between 70% and 78%, but we think that is more than offset by the rise in correctly identified hits from 19% to between 67% and 76%, which lets record companies invest in the songs where the revenue and profits lie.