So you know those dumb little wordcloud things?
You know, where like, they go through your blog and find the words you use most often, and then spit out stylized text with the most often used words as the biggest ones so you can embed or screenshot them or whatever?
Like, the idea is really cool in theory. A standardized analysis generating an artifact characteristic of you, easily digestible at a glance.
Except in practice everyone's word cloud ends up being "like, people, think, want, make, get..." -- i.e. basically just a bag of the most common words in the English language (presuming they speak mostly English).
But what I actually want is a collection of words I use more than the average person does. And while we're at it, also a collection of words I use less than the average person does.
It's on Siikr now. New blogs don't get it yet, only blogs that were indexed as of a few days ago (still working on optimizations to allow for real-time generation).
The words in green are the words you use weirdly often.
The words in red are the words you suspiciously seem to avoid.
In both cases, the bigger the word, the more weird your usage of it is relative to all of the other blogs in Siikr's index. This is limited to the most extreme 100 words in both directions.
Hovering over a word gives you some statistics about how much it should appear in your blog vs how much it actually appears in your blog.
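If you want to picture what's going on under the hood, here's a rough sketch. To be clear, this is not Siikr's actual code: the function name, the dict shapes, and the choice to score a word by (your frequency - average frequency) are all assumptions made for illustration. It also only looks at words that already have an index-wide average.

```python
def weird_words(blog_freqs, avg_freqs, total_words, limit=100):
    """Split words into 'weirdly often' and 'suspiciously avoided' lists.

    blog_freqs:  {word: fraction of this blog's words that are `word`}
    avg_freqs:   {word: average of that fraction across all indexed blogs}
    total_words: how many words this blog has in total
    All three are hypothetical inputs, not Siikr's real schema.
    """
    rows = []
    for word, avg in avg_freqs.items():
        mine = blog_freqs.get(word, 0.0)  # never used it -> frequency 0
        rows.append({
            "word": word,
            "score": mine - avg,              # assumed weirdness measure
            "expected": avg * total_words,    # how often it "should" appear
            "actual": mine * total_words,     # how often it really appears
        })
    # Green words: most overused first. Red words: most avoided first.
    overused = sorted((r for r in rows if r["score"] > 0),
                      key=lambda r: r["score"], reverse=True)[:limit]
    avoided = sorted((r for r in rows if r["score"] < 0),
                     key=lambda r: r["score"])[:limit]
    return overused, avoided


# Toy numbers, made up for the example.
avg_freqs = {"like": 0.05, "people": 0.03, "eigenvector": 0.0001}
blog_freqs = {"like": 0.01, "eigenvector": 0.02}
overused, avoided = weird_words(blog_freqs, avg_freqs, total_words=1000)
```

In the toy data, "eigenvector" lands in the green list, while "like" and "people" land in the red one, and the `expected` / `actual` fields are the sort of thing a hover tooltip could show.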
So that's fun and everything -- but it can and very well might get even more fun.
Because generating this meant creating a list of all of the words used by every blog, and storing a bunch of numbers per word per blog. Currently, that's ~9 million associations over ~57k words.
Every blog->word relation stores frequency statistics, and every word itself keeps a running average of its frequency across all blogs.
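Here's one way that bookkeeping could look, as a minimal in-memory sketch. The class and method names are invented, and the real thing presumably lives in a database rather than Python dicts. The trick assumed here is to keep a per-word sum of frequencies so the average stays current without rescanning every blog; a blog that never uses a word just counts as a 0 toward that word's average.

```python
class WordIndex:
    """Toy stand-in for the blog->word table plus per-word averages."""

    def __init__(self):
        self.blog_word_freq = {}  # blog name -> {word: frequency}
        self.freq_sum = {}        # word -> sum of its frequency over all blogs

    def add_blog(self, blog, word_counts):
        """Index (or re-index) a blog from its raw {word: count} tallies."""
        # If the blog was already indexed, back out its old contribution.
        for word, freq in self.blog_word_freq.get(blog, {}).items():
            self.freq_sum[word] -= freq
        total = sum(word_counts.values())
        freqs = {w: c / total for w, c in word_counts.items()} if total else {}
        for word, freq in freqs.items():
            self.freq_sum[word] = self.freq_sum.get(word, 0.0) + freq
        self.blog_word_freq[blog] = freqs

    def avg_freq(self, word):
        """Average frequency of `word` across every indexed blog."""
        n_blogs = len(self.blog_word_freq)
        return self.freq_sum.get(word, 0.0) / n_blogs if n_blogs else 0.0


index = WordIndex()
index.add_blog("blog-a", {"cat": 1, "dog": 1})
index.add_blog("blog-b", {"cat": 1, "fish": 3})
```

With those two toy blogs, "cat" is half of blog-a's words and a quarter of blog-b's, so `index.avg_freq("cat")` comes out to 0.375.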
Which means we could in theory (and almost certainly will in practice) treat each word as a dimension in a 57 thousand dimensional space.
Then treat each user as a point in that 57 thousand dimensional space, where their coordinates in the space are (user_word_freq - avg_word_freq).
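In code, turning a blog into a point is about one line. This is just a sketch with a three-word vocabulary standing in for the 57k-word one, and `blog_vector` plus its arguments are made-up names:

```python
def blog_vector(blog_freqs, avg_freqs, vocab):
    """Coordinates of one blog: (user_word_freq - avg_word_freq) per word.

    vocab fixes the order of the dimensions so every blog's vector lines up.
    A word the blog never uses still gets a coordinate, namely -avg_word_freq.
    """
    return [blog_freqs.get(word, 0.0) - avg_freqs[word] for word in vocab]


# Toy data: a 3-word "vocabulary" instead of ~57k.
vocab = ["cat", "dog", "fish"]
avg_freqs = {"cat": 0.375, "dog": 0.25, "fish": 0.375}
point = blog_vector({"cat": 0.5, "dog": 0.5}, avg_freqs, vocab)
```

Positive coordinates are words the blog overuses, negative ones are words it avoids, and a perfectly average blog sits at the origin.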
From there, we can measure the distance (as cosine similarity, or Euclidean distance, or even just raw inner product) between users, and return, for your blog, ordered lists of:
Doppelgangers - blogs most like yours (closest to your blog in 57k dimensional word frequency space).
Foils - blogs least like yours (furthest from yours in 57k dimensional word frequency space).
Manic Pixie Dream Friends - blogs that overuse the same words you overuse (closest to your blog in 57k freq-space with respect to only positive vector components).
Least Like Un-You - blogs that avoid the same words you avoid (closest to your blog in 57k freq space with respect to just the negative vector components).
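For the curious, here's a sketch of how those four lists could fall out of the vectors. None of this exists yet, so everything here is hypothetical: the function names, the choice of cosine similarity over the other two options, and the reading of "only positive / negative components" as zeroing out the other half of each vector before comparing.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def positive_part(v):
    """Keep only the overused-word components."""
    return [max(x, 0.0) for x in v]


def negative_part(v):
    """Keep only the avoided-word components."""
    return [min(x, 0.0) for x in v]


def rank(me, others, transform=lambda v: v, most_similar_first=True):
    """Names of the other blogs, ordered by cosine similarity to `me`."""
    mine = transform(me)
    scored = [(name, cosine(mine, transform(vec))) for name, vec in others.items()]
    scored.sort(key=lambda pair: pair[1], reverse=most_similar_first)
    return [name for name, _ in scored]


def neighbours(me, others):
    return {
        "doppelgangers": rank(me, others),
        "foils": rank(me, others, most_similar_first=False),
        "manic_pixie_dream_friends": rank(me, others, positive_part),
        "least_like_un_you": rank(me, others, negative_part),
    }


# Toy 3-dimensional vectors standing in for the 57k-dimensional ones.
me = [0.2, -0.1, 0.0]
others = {
    "twin": [0.4, -0.2, 0.0],       # same direction as me
    "opposite": [-0.2, 0.1, 0.0],   # exactly the other way
    "fan": [0.3, 0.3, 0.0],         # shares my overused word, not my avoided one
}
lists = neighbours(me, others)
```

In the toy data, "twin" tops the doppelganger list and "opposite" tops the foils. Cosine similarity only cares about direction, so a blog that is weird in the same ways as you but less intensely still counts as close; Euclidean distance would penalize that difference in intensity.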