Thoughts on the Paper: (A) Data in the Life: Authorship Attribution in Lennon-McCartney Songs

BACKGROUND: In 2019, an article published in the Harvard Data Science Review outlined a machine learning model trained to predict the authorship of Lennon-McCartney songs with an accuracy of 76%. Notably, when the model was given the song In My Life, whose composer is famously disputed, it predicted John Lennon to be the composer of the verse with 81% certainty. For some, this paper settled the decades-long debate.

So, the TL;DR of my thoughts is: no, this paper does not prove John is the (sole) composer of In My Life. However, the paper itself doesn’t really claim this and is quite transparent about its shortcomings.

More detailed analysis below the cut. I did my best to put this all in layman's terms, both with regard to the math and the music theory, but if anything isn't clear please do feel free to ask.

For one, the data set used is quite small, only 70 songs. The authors of the paper thought that using songs composed post-1966 would skew results, since the Beatles quit touring and subsequently became more “studio-oriented”; ergo their songwriting process probably changed after that. (Personally, I find this kind of an odd take, seeing as the recording process for Revolver had far more in common with the process for Sgt. Pepper’s than, say, that for Help! – kind of a minor detail in this context though.) The authors do acknowledge the small sample and utilized a statistical method – which I won’t go into here – to try and counterbalance it, but it is basically impossible to check their model for accuracy beyond the existing songs, as the available Lennon-McCartney songs are a finite group. Continuous evaluation of the model on fresh examples, as one might be able to do with an image-recognizing AI, is not an option here.

The other thing is the criteria. Basically, the way the model works is that it’s shown the 70 songs of known authorship, with each song represented by a list of properties the song’s melody or harmony either possesses or doesn’t. An example of such a property is the presence of a given note within the key of the song, or the presence of a sequence of two specific chords. Going through all the songs and cross-comparing them, the model tries to assess whether any of these properties is particularly characteristic of John’s or Paul’s writing. My issue here is that the properties looked at were quite subjectively chosen and were also more or less hand-picked to create a model that predicts with an acceptable accuracy. But, seeing as we have a finite data set (as mentioned before), we can’t really be sure the model isn’t overfitting.
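To make the "list of properties" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the property names, the two anonymous writers and their toy "songs" are assumptions of mine, not data or code from the paper, and the paper's actual model is more sophisticated than counting how often a property shows up.

```python
# Hypothetical property vocabulary; these names are invented for illustration.
FEATURES = ["note:flat7", "chords:I->IV", "chords:IV->V", "note:sharp4"]

def encode(song_properties):
    """Turn the set of properties a song has into a 0/1 list over FEATURES."""
    return [1 if f in song_properties else 0 for f in FEATURES]

def feature_rates(songs):
    """Fraction of a writer's songs in which each property is present."""
    vectors = [encode(s) for s in songs]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy, invented "songs" for two anonymous writers (not real Beatles data).
writer_a = [{"note:flat7", "chords:I->IV"}, {"note:flat7"},
            {"note:flat7", "chords:IV->V"}]
writer_b = [{"chords:IV->V"}, {"chords:IV->V", "note:sharp4"},
            {"chords:I->IV", "chords:IV->V"}]

rates_a = feature_rates(writer_a)
rates_b = feature_rates(writer_b)
for name, ra, rb in zip(FEATURES, rates_a, rates_b):
    # A big gap between the two rates marks a property as "typical" of one writer.
    print(f"{name:14s} A={ra:.2f}  B={rb:.2f}")
```

The point of the sketch is only the representation: once every song is a row of zeros and ones, "style" becomes a question of which columns differ between the two writers, and which columns you decide to include matters a lot.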

Overfitting is when your model works very well with the data set you trained it with but does not generalize well to new data. A good visual example of overfitting is the graph below, where the goal is to draw a line, above which all the dots are blue and below which all the dots are red. The green line is perfectly accurate for the dots we have but will likely be inaccurate if more dots that follow the more general trend, represented by the black line, are added to the set.

My point is, given the limited data, there is virtually no way to know if the trained model is a green line or a black line, but I have reason to believe it’s much more likely to be a green line, despite the authors doing their best to counteract this.
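If you'd rather see the green-line-versus-black-line idea in code than in a picture, here is a small toy sketch in Python. It is my own illustration under made-up assumptions (random dots, a 20% chance that a dot has the "wrong" color, and 70 training dots only to echo the size of the song set); it has nothing to do with how the paper's model was actually built. The "memorizer" plays the role of the green line by copying the color of the closest training dot, and the simple rule plays the role of the black line.

```python
import random

def make_points(n, rng, noise=0.2):
    """Dots in the unit square: blue above the line y = x, red below,
    with a fraction of labels flipped to imitate messy real-world data."""
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        label = "blue" if y > x else "red"
        if rng.random() < noise:
            label = "red" if label == "blue" else "blue"
        data.append(((x, y), label))
    return data

def simple_rule(point):
    """The 'black line': only the general trend."""
    x, y = point
    return "blue" if y > x else "red"

def make_memorizer(train):
    """The 'green line': copy the label of the nearest training dot."""
    def predict(point):
        px, py = point
        nearest = min(train, key=lambda item: (item[0][0] - px) ** 2
                                              + (item[0][1] - py) ** 2)
        return nearest[1]
    return predict

def accuracy(predict, data):
    return sum(predict(p) == label for p, label in data) / len(data)

rng = random.Random(0)
train = make_points(70, rng)     # the dots we have
fresh = make_points(1000, rng)   # new dots following the same trend
memorizer = make_memorizer(train)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(f"{name:12s} train={accuracy(model, train):.2f} "
          f"fresh={accuracy(model, fresh):.2f}")
```

The memorizer is perfect on the dots it was trained on by construction, since every training dot is its own nearest neighbor, but that perfection tells you nothing about how it does on the fresh dots, where it tends to do worse than the simple rule. With the Beatles there is no way to generate the "fresh" set, which is exactly the problem.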

The biggest issue, in my opinion, is that the model was specifically not trained with songs that are known to have been Lennon-McCartney collaborations. It is therefore not capable of predicting whether a song is in fact a collaboration and, notably, when given songs we know to be collaborations, it assigned them quite randomly. See the table below:

Probability is always on a 0 to 1 scale, and the authors set 1 as the case "written by Paul with 100% certainty" and 0 as the case "written by John with 100% certainty". (If you’re wondering about the numbers in parentheses, those are the probabilities when the measurement uncertainty is taken into account; since the uncertainty can be either positive or negative, the result is an interval.)
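For anyone who finds the 0-to-1 scale confusing, here is a tiny Python helper that spells out how I read such a number. The function name and the example interval are my own inventions for illustration; only the 81% figure for the verse of In My Life comes from the paper as quoted above.

```python
def describe(p_paul, interval=None):
    """Read a value on the scale 0 = certainly John, 1 = certainly Paul."""
    p_john = 1.0 - p_paul
    if p_paul > 0.5:
        leaning = "Paul"
    elif p_paul < 0.5:
        leaning = "John"
    else:
        leaning = "neither"
    # If the uncertainty interval contains 0.5, the call could go either way.
    ambiguous = interval is not None and interval[0] <= 0.5 <= interval[1]
    return {"p_paul": p_paul, "p_john": p_john,
            "leaning": leaning, "ambiguous": ambiguous}

# "John with 81% certainty" is the same statement as 0.19 on the Paul scale.
verse = describe(0.19)
print(verse["leaning"], round(verse["p_john"], 2))

# A made-up value with a made-up interval that straddles the midpoint.
print(describe(0.60, interval=(0.35, 0.70)))
```

Note that nowhere on this scale is there a value meaning "both of them", which is why a true collaboration has to land somewhere between the two names.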

The thing about In My Life, though, is that unless one thinks Paul is straight up lying, his quite detailed account strongly suggests to me that he was at least present when the song was composed. That leaves the door open to the possibility that the music of this song was the result of some sort of collaboration, which John even corroborated in the case of the song's middle eight.

Now, the authors do acknowledge this disregarding of collaboration as an obvious shortcoming, but what might it mean for the authorship debate, if I Want To Hold Your Hand, a song both John and Paul professed to have written “eyeball-to-eyeball”, is predicted as less likely to have been written by Paul than In My Life is?

This also does not take into account that any given song fragment can in fact be composed by two people. Say, someone plays the chords and the other sings the melody over it; or someone sings a melody and the other takes it and varies it slightly. Of course, we have virtually no data of minute details like this, so it’s perfectly understandable that the model doesn’t take this into account, but as a songwriter, this is the type of thing that, in my estimation, renders this research not very useful in the discussion of who wrote which song.

With that being said, the paper does not purport to have definitively solved the question of authorship of disputed Beatles songs. Rather, it is presented as the first step in a possible new field of musical analysis, utilizing statistics to track certain songwriting “signatures”. In the final paragraph, the authors say that further research could lead to the establishing of “influence networks”, where musical ideas travelling from artist to artist throughout music history might be observable, which to me sounds like a far more interesting use of this methodology.

If any conclusion can be drawn with regard to In My Life, it is that the song tends to be more in the style of John’s writing – with the caveat that this is specifically when considering the criteria picked for this particular model. Paul is a known imitator and adventurer as a songwriter, though, and the model, being imperfect, doesn’t always accurately place him as the likely composer of songs we know he wrote – and the same goes for John, though his songs tend to come with less uncertainty. Take a look at this distribution, which I find interesting:

Notice how the model is generally less good at predicting Paul’s songs accurately? I think this is consistent with Paul being more of a musical chameleon than John was. Again, I want to reiterate that the authors took this possibility into consideration and do not claim their model’s prediction is gospel. As they write themselves:

While it is tempting to interpret the results of our model as revelations of a song’s true author, other interpretations are just as compelling. For example, a disputed song such as “In My Life” which according to our model has a high probability of the verse and bridge each being written by Lennon, may in fact have been written by McCartney who stated he composed the song in the style of Smokey Robinson and the Miracles (Turner, 1999), but actually wrote in the style of Lennon,* whether consciously or subconsciously. Songs with high probabilities of being written by Lennon or McCartney are mainly indications that the songs have musical features that are consistent with the Lennon or McCartney songs used in the development of our model.

*Minor comment here from me: the authors did no research into Smokey Robinson's style of writing. For all they know, Smokey and John have a similar style or at least could happen to share the criteria the model assessed to be "John-like".

Also, I would recommend taking a look at this cool interactive feature (works best on desktop) the authors developed, which visualizes the properties chosen for the authorship assessment on a scale from "John" to "Paul" for all Beatles songs from 1962 to 1966. While the model is imperfect, it's a fun tool to play around with and understand how and why specific songs were assessed as written by one or the other. Also, as someone who's done some coding, I admire the work that went into this alone.

#stem talk
