Top 50 YouTube Channels follow Zipf's Law


Most popular customers' first names at a small pizza restaurant compared to Zipf's Law
YouTube views vs. expectations following Zipf's Law
Zipf's Law holds for Canadian metropolitan areas
Zipf's Law, and frequency-v-length of words, in the Universal Declaration of Human Rights
just some simple comparative plots
Today, while recording the pod, I wanted to compare the frequency distribution of the 20 most used words in an English corpus with that of the 20 most used emojis, both with each other and with a Zipfian distribution. I got these frequencies from my co-host Daniel, so interrogate him if you wanna know more about the underlying data.
Basically, there's this idea that the frequencies of words in natural language follow a law: the second most common word will have half as many occurrences as the most frequent one, the third most common a third as many, and so on. This is known as Zipf's Law, and it sort of holds for other stuff as well. Anyway, how to go about testing this?
Well, I wrote a script in R that takes the top 20 frequencies of emojis and words that I got from Daniel and wrangles them a bit, computes what a perfect Zipfian score would be and makes some plots. It's nothing fancy, but it does some practical things so I thought I'd share.
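To make the "half, a third, and so on" idea concrete, here's a tiny sketch in R. The function name `zipf_expected` and the top count of 1200 are made up for illustration, they're not from the script or from Daniel's data.

```r
# Hypothetical helper: expected counts under a perfect Zipf distribution,
# given the count of the most frequent item and how many ranks we want.
zipf_expected <- function(top_freq, n) {
  top_freq / seq_len(n)  # rank 1 gets top_freq, rank 2 half of it, rank 3 a third...
}

# Made-up top count, just to show the shape of the series
zipf_expected(1200, 4)
```

So if the most common word showed up 1200 times, a perfectly Zipfian corpus would have the next three at 600, 400 and 300.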
I've uploaded the script to GitHub in case that's of use to anyone.
It does a couple of things that might be useful to others:
1) relativises the frequencies (divides each series by its max number) so that the number series are comparable
2) computes a perfect Zipfian score in a very simple way (once you know that the max value is 1, which it is because of step (1), the expected value at each rank is just 1 divided by the rank)
3) shows how you can use reshape2::melt() in conjunction with ggplot to make plotting several series at the same time easier
4) makes a scatterplot matrix, which is always nice :)
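Roughly, those four steps could look like the sketch below. This is not the script from the repo, just a minimal version of the same ideas. The counts are invented placeholders (and only 5 ranks instead of 20), the column names are my own, and base R's `pairs()` is used as a stand-in for whatever the script uses for the scatterplot matrix.

```r
# Placeholder counts, NOT the real data from Daniel
freqs <- data.frame(
  rank  = 1:5,
  words = c(1000, 520, 340, 260, 190),
  emoji = c(800, 390, 280, 190, 170)
)

# (1) relativise: divide each series by its own maximum so both top out at 1
freqs$words <- freqs$words / max(freqs$words)
freqs$emoji <- freqs$emoji / max(freqs$emoji)

# (2) perfect Zipf series when the top value is 1: simply 1 / rank
freqs$zipf <- 1 / freqs$rank

# (3) melt to long format so ggplot can draw all series in one call
#     (skipped if the packages aren't installed)
if (requireNamespace("reshape2", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  long <- reshape2::melt(freqs, id.vars = "rank",
                         variable.name = "series", value.name = "rel_freq")
  p <- ggplot2::ggplot(long,
                       ggplot2::aes(x = rank, y = rel_freq, colour = series)) +
    ggplot2::geom_line() +
    ggplot2::geom_point()
  print(p)
}

# (4) scatterplot matrix of the three series, here with base R
pairs(freqs[, c("words", "emoji", "zipf")])
```

The point of melting is that you get one row per rank and series, so colour can be mapped to the series name instead of adding one geom per column.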
I also faffed about a bit with some fun things like menu() to ask the user if they really want to install packages and fiddled with the package pacman to install and/or load packages at the same time. That stuff isn't necessary really, but hey. Why not?
For now I just got the script to run some very simple Pearson correlation tests. There are more complex things one can do, but I thought that might do for now at least. To cut to the chase: these distributions look very similar indeed. Both to each other and to Zipf.
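If you're curious what that kind of faffing can look like, here's a sketch under a few assumptions: the helper `missing_packages` is hypothetical, the package list is just the two mentioned above, and the prompts are guarded with `interactive()` because `menu()` can't be used in a non-interactive session. It's not lifted from the actual script.

```r
# Hypothetical helper: which of these packages are not installed yet?
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

pkgs <- c("reshape2", "ggplot2")
to_install <- missing_packages(pkgs)

# Ask before installing; menu() returns the number of the chosen option
if (interactive() && length(to_install) > 0) {
  choice <- menu(c("Yes", "No"),
                 title = paste("Install", paste(to_install, collapse = ", "), "?"))
  if (choice == 1) install.packages(to_install)
}

# Alternative: pacman::p_load() installs what is missing and loads everything,
# all in one call (only run here if pacman itself is already installed)
if (interactive() && requireNamespace("pacman", quietly = TRUE)) {
  pacman::p_load(char = pkgs)
}
```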
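A Pearson test like that is a one-liner with `cor.test()` from base R. The sketch below uses a stand-in series (`placeholder`, generated from a formula) instead of the real word or emoji frequencies, so the numbers it gives say nothing about the actual data.

```r
ranks <- 1:20

# Perfect Zipf series on the relative scale (top value is 1)
zipf <- 1 / ranks

# Stand-in for an observed, relativised series; NOT real data
placeholder <- 1 / ranks^0.9

# Pearson correlation test between the stand-in series and perfect Zipf
res <- cor.test(placeholder, zipf, method = "pearson")
res$estimate  # correlation coefficient
res$p.value   # p-value for the null hypothesis of zero correlation
```

Keep in mind that Pearson only measures how linearly related two series are, so a high value means "similar shape", not proof that the data are Zipfian.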
Anyway, not a complicated script but hey - just for fun :)
random scripts I couldn't fit in elsewhere. Contribute to HedvigS/just_for_fun development by creating an account on GitHub.
Vsauce’s The Zipf Mystery..