Computational content analysis for selecting threads
Working with a massive online forum presents a problem for the ethnographer. As I have written about before, a community with hundreds of thousands of posts, all of which can be seen as interactions, requires some sort of computational selection criteria.
The criteria I have developed revolve around the notion of thread intensity. I want to know which are the most active threads in terms of number of posts and time between posts, but also in terms of contributors. To slice the data in these ways I have created a spreadsheet with all the posts for a particular subforum, threaded into discussions. I first count the number of posts per thread using COUNTIF statements, then calculate the delta between the posting times of consecutive posts, and then the average delta for each thread. I know which usernames correspond to moderators, scientific staff, and amateur participants, so I categorise each post by type of poster and count the number of unique contributors per thread.
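For readers who would rather script this than build it in a spreadsheet, the same per-thread measures could be sketched in Python roughly as follows. This is only an illustration, not the spreadsheet I actually use: the function name `thread_metrics`, the tuple layout of a post, the usernames, and the `ROLES` lookup are all made up for the example, and I assume any username not listed is an amateur participant.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical role lookup; unlisted usernames are assumed to be amateurs.
ROLES = {"mod_anna": "moderator", "dr_lee": "scientist"}


def thread_metrics(posts, roles=ROLES):
    """Compute intensity measures per thread.

    posts: iterable of (thread_id, username, timestamp) tuples,
    where timestamp is a datetime object.
    """
    threads = defaultdict(list)
    for thread_id, username, timestamp in posts:
        threads[thread_id].append((timestamp, username))

    metrics = {}
    for thread_id, items in threads.items():
        # Order posts chronologically before measuring gaps.
        items.sort(key=lambda item: item[0])
        times = [timestamp for timestamp, _ in items]
        # Delta in seconds between each post and the one before it.
        deltas = [(b - a).total_seconds() for a, b in zip(times, times[1:])]

        posts_by_role = defaultdict(int)
        for _, username in items:
            posts_by_role[roles.get(username, "amateur")] += 1

        metrics[thread_id] = {
            "posts": len(items),
            # A single-post thread has no gaps, so no average delta.
            "mean_delta_s": sum(deltas) / len(deltas) if deltas else None,
            "contributors": len({username for _, username in items}),
            "posts_by_role": dict(posts_by_role),
        }
    return metrics


# Made-up sample data, for illustration only.
sample_posts = [
    ("t1", "mod_anna", datetime(2014, 1, 1, 10, 0)),
    ("t1", "user_bob", datetime(2014, 1, 1, 10, 10)),
    ("t1", "dr_lee", datetime(2014, 1, 1, 10, 30)),
    ("t2", "user_bob", datetime(2014, 1, 2, 9, 0)),
]
sample_metrics = thread_metrics(sample_posts)
```

Here the post count plays the part of the COUNTIF step, and the mean delta is taken over gaps between consecutive posts within a thread, which is one reasonable reading of "average delta" among several.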
So far, this approach has worked relatively well. It does not replace looking for specific keywords through KWIC searches, doing concordance analysis to look for themes, or SNA, but it does show me the threads that the community has assembled around. By setting different criteria I can go from, say, 2500 threads to 30 with selection criteria that I can account for and, most importantly, argue for. Thirty threads is then a manageable number to perform manual content analysis on and to decide which should be analysed in detail at the interactional level.
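To make the narrowing step concrete, here is a rough Python sketch of filtering threads by explicit thresholds. It assumes the per-thread measures are already available as a dictionary; the function name `select_threads`, the key names, the threshold values, and the sample numbers are all hypothetical and chosen only to show the idea, not the cut-offs I actually use.

```python
def select_threads(metrics, min_posts=1, min_contributors=1,
                   max_mean_delta_s=None):
    """Return ids of threads meeting every threshold, busiest first.

    metrics: dict mapping thread id to a dict with the keys
    "posts", "contributors" and "mean_delta_s" (seconds, or None).
    """
    selected = []
    for thread_id, m in metrics.items():
        if m["posts"] < min_posts:
            continue
        if m["contributors"] < min_contributors:
            continue
        if max_mean_delta_s is not None:
            # Threads without a measurable gap cannot pass a gap threshold.
            if m["mean_delta_s"] is None or m["mean_delta_s"] > max_mean_delta_s:
                continue
        selected.append(thread_id)
    # Sort by number of posts, largest first.
    return sorted(selected, key=lambda t: metrics[t]["posts"], reverse=True)


# Made-up numbers, for illustration only.
example_metrics = {
    "t1": {"posts": 40, "contributors": 12, "mean_delta_s": 900.0},
    "t2": {"posts": 3, "contributors": 2, "mean_delta_s": 86400.0},
    "t3": {"posts": 55, "contributors": 4, "mean_delta_s": 300.0},
}

strict = select_threads(example_metrics, min_posts=20,
                        min_contributors=5, max_mean_delta_s=3600)
loose = select_threads(example_metrics, min_posts=20)
```

Keeping the thresholds as named parameters means the selection can be reported alongside the analysis: the same values that produced the shortlist are the ones that have to be argued for.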











