Computational content analysis for selecting threads

Working with a massive online forum presents a problem for the ethnographer. As I have written about before, a community with hundreds of thousands of posts, all of which can be seen as interactions, requires some sort of computational selection criteria.

The criteria I have developed revolve around the notion of thread intensity. I want to know which are the most active threads in terms of number of posts and time between posts, but also in terms of contributors. To slice the data in these ways I have created a spreadsheet with all the posts for a particular subforum threaded into discussions. I first count the number of posts per thread using COUNTIF statements, then calculate the delta between the posting times of consecutive posts, and then the average delta for each thread. I know which usernames correspond to moderators, scientific staff, and amateur participants, so I categorise each post by type of poster and count the number of unique contributors.
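The same three measures can also be computed outside a spreadsheet. What follows is only a minimal Python sketch of the idea, not the spreadsheet I actually use: the sample posts, the usernames, the role mapping and the timestamp format are all invented for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample rows: (thread title, username, timestamp).
POSTS = [
    ("Odd blue galaxy", "alice", "2008-03-01 10:00"),
    ("Odd blue galaxy", "bob", "2008-03-01 10:30"),
    ("Odd blue galaxy", "alice", "2008-03-01 12:30"),
    ("Hello", "carol", "2008-03-02 09:00"),
]

# Hypothetical mapping from username to type of poster.
ROLES = {"alice": "moderator", "bob": "amateur", "carol": "scientist"}


def thread_metrics(posts, roles):
    """Return post count, mean gap between posts and contributors per thread."""
    threads = defaultdict(list)
    for title, user, stamp in posts:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        threads[title].append((when, user))

    metrics = {}
    for title, entries in threads.items():
        entries.sort()  # chronological order within the thread
        times = [when for when, _ in entries]
        # Delta in minutes between each post and the one before it.
        gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
        users = {user for _, user in entries}
        metrics[title] = {
            "posts": len(entries),
            "mean_gap_minutes": mean(gaps) if gaps else None,
            "contributors": len(users),
            "poster_types": sorted({roles.get(u, "unknown") for u in users}),
        }
    return metrics


METRICS = thread_metrics(POSTS, ROLES)
for title, values in METRICS.items():
    print(title, values)
```

A thread with a single post has no gaps, so its mean gap is left as `None` rather than zero, which keeps one-post threads from looking maximally intense.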

So far, this approach has worked relatively nicely. It does not replace looking for specific keywords through KWIC (keyword-in-context) searches, doing concordance analysis to look for themes, or social network analysis (SNA), but it does show me the threads that the community has assembled around. By setting different criteria I can go from, say, 2500 threads to 30 with selection criteria that I can account for and, most importantly, argue for. Thirty threads is then a manageable number to perform manual content analysis on and decide which should be analysed in detail at the interactional level.
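Narrowing the list down amounts to applying explicit thresholds. Here is a small Python sketch of that step; the thread records and the threshold values are made up, and the right cut-offs will depend on the subforum.

```python
# Hypothetical per-thread metrics, one dictionary per thread.
THREADS = [
    {"title": "A", "posts": 40, "mean_gap_minutes": 90.0, "contributors": 12},
    {"title": "B", "posts": 3, "mean_gap_minutes": 5.0, "contributors": 2},
    {"title": "C", "posts": 55, "mean_gap_minutes": 6000.0, "contributors": 20},
    {"title": "D", "posts": 25, "mean_gap_minutes": 30.0, "contributors": 9},
]


def select_threads(threads, min_posts, max_mean_gap, min_contributors):
    """Keep threads that pass every threshold, busiest first."""
    chosen = [
        t for t in threads
        if t["posts"] >= min_posts
        and t["mean_gap_minutes"] <= max_mean_gap
        and t["contributors"] >= min_contributors
    ]
    return sorted(chosen, key=lambda t: t["posts"], reverse=True)


# Illustrative thresholds: at least 20 posts, at most a day between
# posts on average, and at least 5 different contributors.
SELECTED = select_threads(THREADS, min_posts=20, max_mean_gap=1440.0,
                          min_contributors=5)
print([t["title"] for t in SELECTED])
```

Because the thresholds are named arguments, the criteria behind any given selection can be reported exactly as they were applied.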

#digital methods #mobcitsci

Scraping talk.galaxyzoo.org

In the middle of 2014 Zooniverse froze galaxyzooforum.org and started a new forum at talk.galaxyzoo.org. This new forum has an entirely new structure and is custom built to provide features specific to Zooniverse's citizen science platform, such as allowing users to share collections of galaxy images that they have classified. From a technical perspective it is much more complicated than the old forum and involves a lot of AJAX. This means that I have had to create an entirely new scraping tool, and since the forum structure is not expressed through static URLs, the tool must interact with the dynamic AJAX elements of the site.

Webscraper.io provides the ability to interact with dynamic elements such as rollovers and JavaScript buttons, but the tricky part is getting the timing right. On talk.galaxyzoo.org, subforums have unique URLs, but all pagination in the list of threads and within the threads themselves is achieved through JavaScript elements that take different amounts of time to load depending on, for example, the complexity of the images posted. The site is designed so that pages do not refresh until all the content is loaded, which means that if the timing of the scraper is wrong it will collect the previous page's information despite having triggered a new page to load. On top of this, users can choose to display a collection of galaxy images beside their post, which drastically increases the number of images to be loaded. While it would be possible to set a very long pause before the scraper collects information after triggering a page change, that isn't a very efficient way to work with a large number of threads and posts. In the end, the sweet spot for this site on my connection seems to be a delay of about 2000ms.
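The stale-page problem can also be framed as polling for a change instead of waiting a fixed time. The Python sketch below illustrates that alternative in the abstract only: it is not how Webscraper.io works (there I simply set a fixed delay), and `read_page`, `make_fake_page` and the page strings are hypothetical stand-ins for whatever would read the visible content of a real page.

```python
import time


def wait_for_change(read_page, previous, timeout=10.0, interval=0.25,
                    sleep=time.sleep):
    """Poll read_page() until it returns something other than `previous`.

    Raises TimeoutError if the content is still unchanged after `timeout`
    seconds of waiting.
    """
    waited = 0.0
    while waited <= timeout:
        current = read_page()
        if current != previous:
            return current  # the new page has finished loading
        sleep(interval)
        waited += interval
    raise TimeoutError("page content did not change in time")


def make_fake_page(stale_reads):
    """Hypothetical page that keeps showing 'page 1' for a few reads."""
    state = {"reads": 0}

    def read_page():
        state["reads"] += 1
        return "page 1" if state["reads"] <= stale_reads else "page 2"

    return read_page


# The sleep function is replaced with a no-op so the demo runs instantly.
RESULT = wait_for_change(make_fake_page(3), "page 1", sleep=lambda s: None)
print(RESULT)
```

The trade-off is that a slow page costs only as much waiting as it needs, while a fixed delay has to be long enough for the slowest page in the whole scrape.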

The JSON script for the scraper looks like this:

{"startUrl":"http://talk.galaxyzoo.org/#/boards","selectors":[{"parentSelectors":["board_page","board"],"type":"SelectorLink","multiple":true,"id":"thread","selector":"div.discussion-summaries div.list div.title a","delay":""}, {"parentSelectors":["thread","thread_page"],"type":"SelectorElement","multiple":true,"id":"post","selector":"li div.post","delay":""}, {"parentSelectors":["post"],"type":"SelectorText","multiple":false,"id":"user","selector":"a.user:nth-of-type(2)","regex":"","delay":""}, {"parentSelectors":["post"],"type":"SelectorText","multiple":false,"id":"content","selector":"div.content","regex":"","delay":""}, {"parentSelectors":["post"],"type":"SelectorGroup","id":"links","selector":"p a","delay":"","extractAttribute":"href"}, {"parentSelectors":["post"],"type":"SelectorText","multiple":false,"id":"date","selector":"span.on-hover","regex":"","delay":""}, {"parentSelectors":["thread"],"type":"SelectorElementClick","multiple":true,"id":"thread_page","selector":"div.discussion-topic div.pages","clickElementSelector":"div.discussion-topic a.page-link","clickElementUniquenessType":"uniqueText","clickType":"clickMore","discardInitialElements":false,"delay":"4000"}, {"parentSelectors":["board"],"type":"SelectorElementClick","multiple":true,"id":"board_page","selector":"div.discussion-summaries div.pages","clickElementSelector":"div.discussion-summaries a.page-link","clickElementUniquenessType":"uniqueText","clickType":"clickMore","discardInitialElements":false,"delay":"2000"}, {"parentSelectors":["thread"],"type":"SelectorElement","multiple":false,"id":"collection","selector":"div.stack.discussions div.one-third","delay":""}, {"parentSelectors":["collection"],"type":"SelectorLink","multiple":false,"id":"collection_title","selector":"h3 > a","delay":""}, {"parentSelectors":["collection"],"type":"SelectorText","multiple":false,"id":"collection_user","selector":"a.user","regex":"","delay":""}, 
{"parentSelectors":["_root"],"type":"SelectorLink","multiple":true,"id":"board","selector":"div.board-summary div.title a","delay":""}],"_id":"talk_galaxyzoo_org"}

#Digital methods #MobCitSci

Re-threading the once threaded

One of the problems with scraping is that the order in which the scraper moves through a forum site may not follow the logical reading order for posts. This means that while I managed to collect all 65,000 posts into a series of spreadsheet files, one for each sub-forum, the order in which the posts appeared made the threads totally unreadable. If I only wanted to perform descriptive statistical analysis on the posts this wouldn't be a problem, but I want to analyse the discussions as interactions, so they clearly needed re-sorting. I tried to import the CSV files into the qualitative data analysis program MaxQDA and use it to sort out the posts, but quickly realised that what I needed was to see each thread as a separate document organised by sub-forum in the QDA database, and not each post organised by thread. This meant that I needed to reconstitute the threads within the spreadsheet files and split each thread off as a separate workbook file before importing them into MaxQDA. Fortunately, after a great deal of trying I managed to get Excel to perform the task:

I performed a custom sort on each spreadsheet. Level 1 of the sort was the thread title (this groups all the posts for a thread), and level 2 was the timestamp (this sorted the posts within each thread).

I then used pivot tables to filter the data by thread and make each thread its own sheet.

Then I used this VBA script to automatically save each sheet as a new workbook (VBA scripts only work easily on Windows versions of Excel, so I was forced to fire up the old virtual machine on my Mac to make this work).

And voila! I imported the spreadsheets for each thread into MaxQDA and ended up with a threaded database of posts to search through and code.
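For anyone without Excel and VBA to hand, the same sort-and-split steps can be sketched in Python. This is an assumption-laden illustration rather than the script I used: the column names (`thread`, `user`, `time`, `text`), the sample rows and the output file naming are all invented.

```python
import csv
import io
import re
import tempfile
from itertools import groupby
from pathlib import Path

# Hypothetical scraper output, deliberately out of reading order.
RAW = """thread,user,time,text
Hello,carol,2008-03-02 09:00,Hi all
Odd blue galaxy,bob,2008-03-01 10:30,Could be a merger
Odd blue galaxy,alice,2008-03-01 10:00,What is this?
"""

FIELDS = ["thread", "user", "time", "text"]


def split_threads(csv_text, out_dir):
    """Sort posts by thread and time, then write one CSV file per thread."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Level 1: thread title. Level 2: timestamp. Sorting timestamps as
    # text only works because this sample uses a year-first format.
    rows.sort(key=lambda r: (r["thread"], r["time"]))

    written = []
    for title, group in groupby(rows, key=lambda r: r["thread"]):
        # Turn the title into a safe file name. Two titles that differ
        # only in punctuation would collide and need extra handling.
        safe = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
        path = Path(out_dir) / f"{safe}.csv"
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(group)
        written.append(path)
    return written


OUT_DIR = tempfile.mkdtemp()
FILES = split_threads(RAW, OUT_DIR)
print([path.name for path in FILES])
```

Real forum timestamps would usually need parsing into dates before sorting, since text such as "March 01, 2008" does not sort chronologically.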

#Digital methods #MobCitSci

Scraping galaxyzooforum.org

Between 2007 and 2014 Galaxy Zoo participants maintained a forum at galaxyzooforum.org that was very active for at least the first 3 or 4 years. It amassed over 65,000 posts in nearly 20,000 threads and had just over 9,000 members.

To scrape the posts from galaxyzooforum.org I used Martins Balodis’ webscraper.io Chrome extension to write my own tool. Since galaxyzooforum.org is based on Simple Machines’ relatively uncommon SMF2 forum platform, there were no pre-made scrapers or hacks to poll the forum backend for information as there are for platforms like phpBB. Since SMF2 as implemented on galaxyzooforum.org has no dynamic elements, it was relatively straightforward to create a tool that loads each page in succession and collects the posts.

The tool I created breaks down each post into the poster, time posted, text, URLs and image URLs. In addition, the post order within each thread is collected to make reassembling the threads after scraping easier.

The JSON script for the scraper looks like this:

{"startUrl":"http://www.galaxyzooforum.org/index.php?board=18.0","selectors":[{"parentSelectors":["_root","pagination_forum"],"type":"SelectorLink","multiple":true,"id":"pagination_forum","selector":"div.pagelinks.floatleft a.navPages","delay":""},{"parentSelectors":["_root","pagination_forum"],"type":"SelectorLink","multiple":true,"id":"thread","selector":"td.subject span a","delay":""},{"parentSelectors":["thread","pagination_thread"],"type":"SelectorElement","multiple":true,"id":"post","selector":"div.post_wrapper","delay":""},{"parentSelectors":["post"],"type":"SelectorText","multiple":false,"id":"user","selector":"h4 a","regex":"","delay":""},{"parentSelectors":["post"],"type":"SelectorText","multiple":false,"id":"time","selector":"div.keyinfo div.smalltext","regex":"","delay":""},{"parentSelectors":["post"],"type":"SelectorText","multiple":false,"id":"text","selector":"div.inner","regex":"","delay":""},{"parentSelectors":["thread","pagination_thread"],"type":"SelectorLink","multiple":true,"id":"pagination_thread","selector":"a.navPages","delay":""},{"parentSelectors":["post"],"type":"SelectorGroup","id":"links","selector":"div.inner > a.bbc_link","delay":"","extractAttribute":"href"},{"parentSelectors":["post"],"type":"SelectorGroup","id":"images","selector":"div.inner > img.bbc_img","delay":"","extractAttribute":"src"}],"_id":"galaxyzoo"}

Essentially, the scraper loads each thread of a subforum and then recursively each page of posts within a thread. It uses CSS selectors tied to specific elements of the forum structure to identify specific information and produce a CSV file that contains all the data.
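The recursion is easier to see when the sitemap is drawn as a tree. The Python sketch below does that for a trimmed copy of the JSON above, keeping only each selector's `id` and `parentSelectors`; the helper names `children_of` and `outline` are my own and not part of webscraper.io.

```python
import json

# Trimmed copy of the sitemap: only ids and parent relationships.
SITEMAP = json.loads("""
{"selectors": [
  {"id": "pagination_forum", "parentSelectors": ["_root", "pagination_forum"]},
  {"id": "thread", "parentSelectors": ["_root", "pagination_forum"]},
  {"id": "post", "parentSelectors": ["thread", "pagination_thread"]},
  {"id": "user", "parentSelectors": ["post"]},
  {"id": "time", "parentSelectors": ["post"]},
  {"id": "text", "parentSelectors": ["post"]},
  {"id": "pagination_thread", "parentSelectors": ["thread", "pagination_thread"]},
  {"id": "links", "parentSelectors": ["post"]},
  {"id": "images", "parentSelectors": ["post"]}
]}
""")


def children_of(sitemap):
    """Map each parent selector id to the ids that run beneath it."""
    tree = {}
    for selector in sitemap["selectors"]:
        for parent in selector["parentSelectors"]:
            tree.setdefault(parent, []).append(selector["id"])
    return tree


def outline(tree, node="_root", path=()):
    """Return indented lines, marking selectors that point back at themselves."""
    lines = []
    indent = "  " * len(path)
    for child in tree.get(node, []):
        if child == node or child in path:
            lines.append(f"{indent}{child} (recursive)")
        else:
            lines.append(f"{indent}{child}")
            lines.extend(outline(tree, child, path + (node,)))
    return lines


TREE = children_of(SITEMAP)
OUTLINE = outline(TREE)
print("\n".join(OUTLINE))
```

The two pagination selectors list themselves as parents, which is what lets the scraper keep following "next page" links until none are left.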

Since there are 65,000 posts on galaxyzooforum.org, I decided to scrape each subforum independently to limit the damage if the scraper broke. This procedure worked well and I was able to scrape each subforum to a separate CSV file. I then took each CSV file and used TextWrangler’s brilliant ‘Zap Gremlins’ function to clean up some strange artefacts that appeared because of changes in text encoding.
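A rough Python equivalent of this kind of clean-up is sketched below. It is an approximation of the idea, not a reimplementation of TextWrangler's function, and the replacement table is an assumption about which characters tend to cause trouble.

```python
import unicodedata

# Assumed replacements for characters that often survive encoding changes
# badly: curly quotes become straight quotes, non-breaking spaces become
# ordinary spaces.
REPLACEMENTS = {
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
    "\u00a0": " ",
}


def zap_gremlins(text):
    """Replace known troublemakers and drop invisible control characters."""
    cleaned = []
    for ch in text:
        if ch in REPLACEMENTS:
            cleaned.append(REPLACEMENTS[ch])
        elif ch in "\n\t":
            cleaned.append(ch)  # keep line breaks and tabs
        elif unicodedata.category(ch).startswith("C"):
            continue  # control, format and unassigned characters
        else:
            cleaned.append(ch)
    return "".join(cleaned)


SAMPLE = "Galaxy\x00 Zoo\u00a0isn\u2019t\x0b broken\n"
CLEANED = zap_gremlins(SAMPLE)
print(repr(CLEANED))
```

Note that this version also drops carriage returns, so files with Windows line endings come out with plain line feeds.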

#Digital methods #MobCitSci

Drowning in forum posts

My training as a social scientist has focused on ethnographic methods and the use of detailed video ethnography has been the cornerstone of my research practice. But, like many social scientists in the last few years, my attention has turned more and more to online interactions between people and this has posed significant challenges. As many have lamented, the sheer volume of data that must be managed is in many ways incompatible with the detailed, situated approaches I am trained in. Finding ways to sort through enormous flows of interactions and make rigorous choices about which interactions to focus my detailed analysis on has proven to be both fascinating and frustrating. A case in point is my current work to understand informal learning in communities of amateur 'citizen scientists' who contribute to large science projects through their efforts online. One of the communities I am studying has had an active discussion forum for the past eight years and finding a way to manage 65,000 posts, many of which form detailed scientific discussions that take place over several months, has forced me to find new ways of working.

#CAQDAS #Digital methods #MobCitSci
