Topic modelling is an interesting approach to clustering documents. It is used for scientific papers, news articles, documents, web pages, and more.
Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. Say you’re a software company and want to know what customers are saying about particular features of your product.
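As a rough illustration of the word-counting step only, here is a minimal sketch using Python's standard library. The feedback strings and the `word_counts` helper are made up for this example, and the sketch stops at counting; it does not group the word patterns into topics.

```python
import re
from collections import Counter

# Made-up customer feedback, purely for illustration.
feedback = [
    "The export feature is slow and the export button is hard to find",
    "Love the new dashboard, the dashboard charts are clear",
]


def word_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# One Counter per document: this is the raw material a topic model works from.
counts = [word_counts(doc) for doc in feedback]

for doc_counts in counts:
    print(doc_counts.most_common(3))
```

Words that keep turning up together across many documents are what a topic model later groups into a topic.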
Searching a company database manually for a crucial piece of information is highly time-consuming and, at scale, practically impossible. With the amount of data growing in recent years, it is difficult to find the relevant information quickly, especially when the matter is urgent. In such cases, we can use topic modeling to mine the data and quickly fetch the information we are looking for.
On-Topic or Off-Topic?
Ever wondered how you would categorize things you read or write about?
This is called topic modelling, and it’s simple with Latent Dirichlet Allocation (LDA), so let’s take it for a spin! If you want to do it yourself, check out this article and the accompanying Jupyter notebook.
While it’s easy for humans to put things into categories -- in terms of my fanfiction reading habits, I can quickly tell you that I read humor, tragedy, and unusual AUs -- it’s much harder for a computer to determine genre from examining a text. Topic modelling has been shown to be highly successful when looking at messages or emails, though, so let’s have a look at some comments!
I scraped all the comments on my own AO3 works and stacked them together -- above is an excerpt. By looking at all these comments, will certain topic groups stand out?
With a few short steps, we can build a model using this data! It depends on the number of categories you want, so you’ve got to play around a little, but it’s fast so you can do that. I finally settled on four categories, shown below:
Looks like we’ve got some decent distinctions!
Topic 0 falls nicely into the category of ‘BnHA fanfic’
Topic 1 likewise is about ‘ATLA fanfic’
Topic 2 is hard to distinguish unless you’re me, and know exactly what kinds of comments I tend to get at the end of my ATLA tragedy, your name upon my gravestone (although I’m a bit surprised the word ‘laugh’ is in there instead of ‘cry’).
Topic 3 I’d call general comments.
By the way, the reason some words appear to be missing their endings is that all the text was lemmatized and stemmed (basically, each word was reduced to its root form).
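To see why the endings go missing, here's a small sketch of stemming with NLTK's `PorterStemmer`. That particular stemmer is my assumption, since I'm not naming the exact one used for the model, and the lemmatizing step is left out because NLTK's WordNet lemmatizer needs an extra data download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words of the kind that turn up in comments.
words = ["laughing", "comments", "stories"]

# Stemming chops each word down to a crude root, which is not always a real word.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

The upside is that "laugh", "laughing", and "laughed" all end up counted as the same word, which gives the model more to work with.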
So what’s the math doing there? Well, what this has actually done is train a model! With these formulae, any comment not in the dataset can be categorized, using the words present in the comment. Let’s try it out!
Here’s the latest comment in my inbox and what the model predicts:
It scores high in the ‘general comments’ section, which is fair.
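For anyone who wants to try the prediction step themselves, here's a minimal, self-contained sketch with scikit-learn. The training comments and the new comment are made up, and this is an assumed setup rather than the exact model behind the scores above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training comments standing in for the real dataset.
training_comments = [
    "I love how you write Zuko and Iroh, the tea shop scene was perfect",
    "Deku and Bakugou were so in character, the quirk idea is great",
    "This chapter made me cry, thank you for writing it",
    "Aang and Katara at the air temple, such a good chapter",
    "All Might and Deku training together, I laughed so hard",
    "Great story, can't wait for the next update",
]

vectorizer = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=4, random_state=0)
lda.fit(vectorizer.fit_transform(training_comments))

# A comment the model has never seen (also made up).
new_comment = "Zuko and Iroh sharing tea made my day"

# Reuse the fitted vectorizer so the word columns line up, then ask the
# model how much of each topic it sees in the new comment.
topic_probs = lda.transform(vectorizer.transform([new_comment]))[0]
best_topic = int(topic_probs.argmax())

for topic_idx, prob in enumerate(topic_probs):
    print(f"Topic {topic_idx}: {prob:.2f}")
print(f"Most likely topic: {best_topic}")
```

The scores for one comment add up to 1, so a comment can sit mostly in one topic with a runner-up close behind.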
I’ll try a few more...
Well this was definitely a comment on an ATLA fic! :D
Curious about the special category for your name upon my gravestone ... well, this comment just gets classified in the ‘general’ section, although the next highest candidate is in fact the correct category. Still, this is pretty fun! (Topic modelling worked very well on a different dataset of news articles -- I wasn’t expecting much of a result here, but I’m happy to find that it manages to find some groupings!)
Finally, let’s finish off with a kind of cluster map for the different topics!
What’s next? I’m thinking of looking at popular one-shots in a big fandom... my default is ATLA, but if there’s something different you want to see, let me know!
Topicgraph is a beta tool built by JSTOR Labs as part of its Reimagining the Monograph project. It helps researchers explore scholarly books by letting them understand at a glance all the topics covered within a book and then navigate directly to those pages about topics they’re researching.
“I have been working for some time on a National Science Foundation funded project with the American Institutes for Research to examine how machine classification working with taxonomies can be used to give greater insight into science and engineering activities, outputs and impact. An update on our project has just been published in the UK research periodical Research Fortnight, written with Evgeny Klochikin of the American Institutes for Research.”