Connected by language
I’m working on a natural language processing machine learning project. Zipf’s law seems to come up all the time. @vsauceblog has a nice video about it. Let’s test the law with the word “because” and try to show how we are all connected by language.
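Zipf’s law states that, given a large sample of words, the frequency of any word is inversely proportional to its rank in the frequency table. So the second most common word should appear about half as often as the most common one, the third about a third as often, and so on.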
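To make "rank in the frequency table" concrete, here is a minimal Python sketch that builds such a table from a toy sentence and prints the count Zipf’s law would predict for each rank. The sample text and the `frequency_table` helper are made up for illustration, and a sentence this short is far too small to really follow the law.

```python
from collections import Counter

def frequency_table(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]

# Toy sample (an assumption for illustration, not real corpus data).
sample = "the cat and the dog and the bird"
table = frequency_table(sample)
top_count = table[0][2]  # count of the rank-1 word

for rank, word, count in table:
    predicted = top_count / rank  # Zipf: frequency is proportional to 1 / rank
    print(rank, word, count, round(predicted, 2))
```

With a real corpus you would feed in millions of words instead of one sentence, and the predicted and actual columns would start to line up.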
A quick lookup at http://www.wordcount.org/ shows that “because” is ranked 113, so it looks like a common word.
Let’s put Zipf’s law to the test. For a quick check I used data from https://www.wordfrequency.info and Numbers on the Mac to graph it. The data contains the top 5000 words from an American English dataset.
Here is how the data looks.
There are a lot of data points here and the scale is not the best. Hopefully, you can still see the pattern in the data and that it follows a power law. You can see the R squared value in the top left corner. It shows how closely the dataset matches the power law, on a scale from 0 to 1, where 1 is a perfect fit. In data science you always want to look at the same data from different vantage points. Since the data follows a power law, we can switch the axes to a logarithmic scale.
Now we see something interesting. The data appears to follow a line.
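The straight line is what you would expect: if frequency = c / rank^s, then log(frequency) = log(c) - s * log(rank), which is a line with slope -s. Here is a rough sketch of how you could fit that line yourself and get an R squared value. It uses synthetic data that follows the law exactly, not the real dataset, and `fit_power_law` is a hypothetical helper. Numbers may compute its trendline differently.

```python
import math

def fit_power_law(ranks, freqs):
    """Least-squares fit of log(freq) = log(c) - s * log(rank).

    Returns (c, s, r_squared), with R squared measured in log-log space.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return math.exp(intercept), -slope, r_squared

# Synthetic data following freq = 1000 / rank (assumed values, for illustration only).
ranks = list(range(1, 51))
freqs = [1000 / r for r in ranks]

c, s, r2 = fit_power_law(ranks, freqs)
print(f"c={c:.1f}, exponent={s:.3f}, R^2={r2:.4f}")
```

On real word counts the exponent will not be exactly 1 and R squared will be a little below 1, but the closer both are to 1, the better the data matches the simple form of Zipf’s law.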
Here is the same dataset but looking at the first 50 words. It shows the power law a bit better.
And here are the same 50 words using a logarithmic scale.
So let’s test “because”! In the dataset, “because” comes in at rank 89, and the first word, “the”, has a frequency of about 22 million. An estimate using the above formula puts the “because” count at 247,624, while the actual count in the dataset is 438,539. The size of the whole dataset is 520 million words. The difference between the estimate and the actual value is 190,915 words, which is about 0.04% of the whole dataset. On that scale the two are almost the same, even though the estimate is only a bit more than half of the actual count.
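The arithmetic can be checked with a few lines of Python. This is only a sketch using the numbers quoted in this post. The 22 million figure is rounded, so the plain frequency / rank estimate computed here comes out slightly different from the 247,624 quoted above.

```python
# Numbers quoted in the post; top_count is the rounded "22 million".
top_count = 22_000_000        # frequency of "the" (rank 1), rounded
rank = 89                     # rank of "because" in the dataset
estimate_from_post = 247_624  # estimate quoted in the post
actual = 438_539              # actual count of "because"
corpus_size = 520_000_000     # total words in the dataset

# Plain Zipf estimate from the rounded top frequency.
simple_estimate = top_count / rank

# Gap between the quoted estimate and the actual count.
difference = actual - estimate_from_post

# The same gap seen from two vantage points.
share_of_corpus = difference / corpus_size   # relative to the whole dataset
ratio_to_actual = estimate_from_post / actual  # relative to the word itself

print(f"simple estimate:   {simple_estimate:,.0f}")
print(f"difference:        {difference:,}")
print(f"share of corpus:   {share_of_corpus:.4%}")
print(f"estimate / actual: {ratio_to_actual:.2f}")
```

The two ratios tell different stories, so it is worth printing both: the gap is tiny next to the size of the corpus, but much larger next to the count of the word itself.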
The best part is that every language we know of follows this pattern.
Here is an image from “Zipf’s word frequency law in natural language: A critical review and future directions”: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4176592/










