Discover Top Posts Tagged with #zipf

Top 50 YouTube Channels follow Zipf's Law

#data #youtube channels #follow #zipf

Most popular customers' first names at a small pizza restaurant compared to Zipf's Law

#data #most #zipf

YouTube views vs. expectations following Zipf's Law

#data #youtube #zipf

Zipf's Law holds for Canadian metropolitan areas

#data #zipf #holds #canadian #metropolitan #areas

Zipf's Law, and frequency-v-length of words, in the Universal Declaration of Human Rights

#data #zipf #universal declaration #rights

days of yesterday by Al Q

#days #yesterday #music #publisher #zipf #wagon #train #flickr

just some simple comparative plots

Today while recording the pod I had reason to want to compare the distribution of frequencies of the top 20 most used words in an English corpus and the top 20 most used emojis, both with each other and with a Zipfian distribution. I got these frequencies from my co-host Daniel, so interrogate him if you wanna know more about the underlying data.

Basically, there's this idea that the frequencies of words in natural language follow a law that says that the second most common word will have half as many occurrences as the most frequent word, the third most common a third as many, and so on. This is known as Zipf's Law, and it sort of holds for other stuff as well. Anyway, how to go about testing this?
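To make the rule concrete, here is a minimal sketch in R. The helper name `zipf_expected()` and the top count of 1000 are made up for illustration and are not taken from the script discussed below.

```r
# Expected counts under a strict Zipfian series: the item at rank r
# gets 1/r of the count of the top-ranked item.
zipf_expected <- function(top_count, n_ranks) {
  top_count / seq_len(n_ranks)
}

# Made-up top count of 1000, first five ranks
expected <- zipf_expected(1000, 5)
print(round(expected, 1))
```

With a top count of 1000, the second rank is expected at 500 and the third at roughly 333.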

Well, I wrote a script in R that takes the top 20 frequencies of emojis and words that I got from Daniel and wrangles them a bit, computes what a perfect Zipfian score would be and makes some plots. It's not something fancy, but it does some practical things so I thought I'd share.

I've uploaded the script to GitHub in case that's of use to anyone.

It does a couple of things that might be useful to others:

1) relativises the frequencies (divide by max number) so that the number series are comparable

2) computes a perfect Zipfian score in a very simple way (as long as you know that the max value is 1, which it is because of step (1), the perfect score at each rank is just 1 divided by the rank)

3) shows how you can use reshape2::melt() in conjunction with ggplot to make plotting several series at the same time easier

4) makes a scatterplot matrix, which is always nice :)
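The four steps above could look roughly like the sketch below. This is not the script from GitHub: the counts are placeholder numbers invented for illustration, only six ranks are used instead of 20, and the names `relativise()`, `df` and `long` are assumptions. The `reshape2` and `ggplot2` part is skipped if those packages aren't installed.

```r
# Placeholder counts, invented for illustration (NOT the data from the post)
word_counts  <- c(980, 510, 320, 260, 190, 170)
emoji_counts <- c(450, 210, 160, 110, 95, 70)

# Step 1: relativise, so each series has a maximum of 1
relativise <- function(x) x / max(x)

df <- data.frame(
  rank   = seq_along(word_counts),
  words  = relativise(word_counts),
  emojis = relativise(emoji_counts)
)

# Step 2: with the maximum fixed at 1, the ideal Zipfian value at rank r is 1/r
df$zipf <- 1 / df$rank

# Step 3: long format (one row per rank and series) so ggplot can draw
# all series in one call; skipped if the packages are not installed
if (requireNamespace("reshape2", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  long <- reshape2::melt(df, id.vars = "rank",
                         variable.name = "series", value.name = "rel_freq")
  p <- ggplot2::ggplot(long,
                       ggplot2::aes(x = rank, y = rel_freq, colour = series)) +
    ggplot2::geom_line() +
    ggplot2::geom_point()
  print(p)
}

# Step 4: scatterplot matrix of the three series with base R
pairs(df[, c("words", "emojis", "zipf")])
```

Melting to long format means the series name becomes an ordinary column, so one `colour = series` mapping replaces a separate plotting call per series.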

I also faffed about a bit with some fun things like menu() to ask the user if they really want to install packages, and fiddled with the package pacman to install and/or load packages at the same time. That stuff isn't necessary really, but hey. Why not?
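For anyone curious, the menu() and pacman combination could be sketched like this. The helper name `load_packages()`, the prompt wording and the package list are all assumptions for this sketch, not what the script actually does.

```r
# Hypothetical helper: make sure the packages are available, asking first.
load_packages <- function(pkgs = c("reshape2", "ggplot2")) {
  if (!requireNamespace("pacman", quietly = TRUE)) {
    # menu() needs an interactive session, so bail out otherwise
    if (!interactive()) stop("pacman is not installed; run this interactively")
    # menu() returns the index of the chosen option (0 if the user cancels)
    choice <- menu(c("Yes", "No"), title = "Install the pacman package?")
    if (choice != 1) stop("pacman is needed to continue")
    install.packages("pacman")
  }
  # p_load() installs any missing packages and then attaches them all
  pacman::p_load(char = pkgs)
}
```

Calling `load_packages()` at the top of a script then covers both the install and the library() step in one go.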

For now I just got the script to run some very simple Pearson correlation tests. There are more complex things one can do, but I thought that might do for now at least. To cut to the chase: these distributions look very similar indeed, both to each other and to Zipf.
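A Pearson test of this kind can be run with base R's cor.test(). The sketch below uses placeholder relative frequencies invented for illustration, so the numbers it prints say nothing about the real data.

```r
# Placeholder relative frequencies, invented for illustration
words  <- c(1.00, 0.52, 0.33, 0.27, 0.19, 0.17)
emojis <- c(1.00, 0.47, 0.36, 0.24, 0.21, 0.16)
zipf   <- 1 / seq_along(words)

# cor.test() reports the Pearson coefficient r together with a p-value
# and a confidence interval
words_vs_zipf   <- cor.test(words, zipf, method = "pearson")
emojis_vs_zipf  <- cor.test(emojis, zipf, method = "pearson")
words_vs_emojis <- cor.test(words, emojis, method = "pearson")

print(words_vs_zipf$estimate)
print(emojis_vs_zipf$estimate)
print(words_vs_emojis$estimate)
```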

Anyway, not a complicated script but hey - just for fun :)

HedvigS/just_for_fun on GitHub: random scripts I couldn't fit in elsewhere.

#rstats #zipf