Using Mechanical Turk (Eventually) to Follow Researchers Part 2
In a previous post, I compiled a list of authors from CHI'13 so that I could ask Mechanical Turk workers to find the researchers' Twitter usernames. I tested with a batch of 20 researchers and decided that it would be better to start with an algorithmically generated list of potential candidates (I will refer to this list as Researchers Twitter Results, or RTR) that turkers vet, rather than having them search on their own.
In this post, I share the code I used to compile a list of search results for "<author's name> twitter" (RTR) and discuss the problems with detecting whether a Twitter account belongs to a CHI author.
I built two scripts for compiling RTR: (1) a PhantomJS script that runs a server and returns Google search result URLs, and (2) a Python script that asks the server for search results and puts them in a big Python dict.
I've been using PhantomJS for side projects. PhantomJS is a headless, scriptable WebKit browser that is controlled with JavaScript. The script runs a server that accepts a Google search URL (e.g. http://www.google.com/search?q=banana) in a GET param and returns a JSON list of the resulting URLs. This is pretty useful, as the free tier of Google's Custom Search API is pretty stingy at 100 queries per day.
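To make the request format concrete, here is a minimal Python sketch of how a client might build the URL it sends to the server. The server address (`http://localhost:8080/`) and the GET param name (`url`) are assumptions for illustration; the actual script may use different ones.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "http://www.google.com/search"
# Hypothetical values: the server's real address and param name may differ.
SERVER = "http://localhost:8080/"
PARAM = "url"


def google_search_url(author_name):
    """Build the Google search URL for '<author's name> twitter'."""
    return GOOGLE_SEARCH + "?" + urlencode({"q": author_name + " twitter"})


def server_request_url(author_name, server=SERVER, param=PARAM):
    """Wrap the search URL in a GET param, percent-encoded so it arrives intact."""
    return server + "?" + urlencode({param: google_search_url(author_name)})


if __name__ == "__main__":
    print(google_search_url("Alan Gruner"))
    print(server_request_url("Alan Gruner"))
```

The search URL has to be percent-encoded when it is nested inside another URL; otherwise its own `?` and `=` characters would be read as part of the outer query string.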
The Python script queries the PhantomJS server for every researcher name and compiles the results into a big JSON dict. I didn't get banned from Google; I wait an average of 3 seconds between queries. You can see the full results here in this JSON file. Here are a few examples:
In this example, the first result is the right Twitter page:
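The query loop can be sketched roughly as follows. This is not the original script: the server address, the `url` param name, and the choice of a uniform 1 to 5 second delay (which averages 3 seconds) are all assumptions. The `fetch` and `sleep` arguments are injectable so the loop can be tried without touching the network.

```python
import json
import random
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_results(name, server="http://localhost:8080/"):
    """Ask the (hypothetical) local PhantomJS server for result URLs for one name."""
    search = "http://www.google.com/search?" + urlencode({"q": name + " twitter"})
    with urlopen(server + "?" + urlencode({"url": search})) as resp:
        return json.loads(resp.read().decode("utf-8"))


def compile_results(names, fetch=fetch_results, sleep=time.sleep, low=1.0, high=5.0):
    """Return {name: [result urls]}, pausing a random interval between queries."""
    results = {}
    for i, name in enumerate(names):
        if i > 0:
            # uniform(1, 5) has a mean of 3 seconds; the spread is an assumption
            sleep(random.uniform(low, high))
        results[name] = fetch(name)
    return results
```

Calling `json.dumps(compile_results(names), indent=2)` would then give a JSON dict keyed by researcher name.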
"https://twitter.com/AlGruner",
"https://www.facebook.com/public/Peter-Alan-Gruner-Jr",
"http://www.discogs.com/artist/Alan%2BGruner",
"http://www.imdb.com/name/nm0344412/",
"http://www.intelius.com/people/Peter-Gruner/08pxz2t9kvv",
"http://en.wikipedia.org/wiki/Billy_Kidman",
"http://www.manta.com/c/mm3tplc/allen-k-gruner-attorney",
"http://www.yasni.fr/alan%2Bgruner/recherche%2Bpersonne",
"https://myspace.com/billykidman12345",
"http://www.lawyers.com/louisville/kentucky/Allen-K-Gruner-4567896-f/",
"https://twitter.com/AlGruner"
If all results were like this, it would be easy to find all of the usernames I'm looking for.
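Picking the Twitter candidates out of a result list like the one above can be sketched as follows. This is an illustration, not part of the original scripts; the `RESERVED` set of non-profile paths is a hypothetical, incomplete list.

```python
from urllib.parse import urlparse

# Twitter paths that are not user profiles (illustrative and incomplete).
RESERVED = {"search", "hashtag", "intent", "share", "home", "i"}


def twitter_usernames(urls):
    """Return candidate Twitter usernames from result URLs, in order, without repeats."""
    seen = set()
    names = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc.lower() not in ("twitter.com", "www.twitter.com"):
            continue
        segments = [s for s in parsed.path.split("/") if s]
        if not segments or segments[0].lower() in RESERVED:
            continue
        key = segments[0].lower()  # Twitter usernames are case-insensitive
        if key not in seen:
            seen.add(key)
            names.append(segments[0])
    return names
```

A result list with exactly one candidate is the easy case; lists with many candidates, or none, are where human vetting is needed.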
This example has 10 Twitter usernames, and it is not obvious which one, if any, belongs to the researcher:
"https://twitter.com/TallChineesGuy",
"https://twitter.com/AlexLovesChachi",
"https://twitter.com/Alexjensen90",
"https://twitter.com/AlexGaareJansen",
"https://twitter.com/awpjansen",
"https://twitter.com/AlexJansenBirch",
"https://twitter.com/phaoust",
"https://twitter.com/TheeAlexJansen",
"https://twitter.com/LanginStrife",
"https://twitter.com/alexculinair",
"https://twitter.com/TallChineesGuy"
This example has none, so the researcher probably doesn't have a Twitter account:
"http://researcher.ibm.com/view.php%3Fperson%3Dus-vlh",
"http://hanson.massachusettsbox.com/c-82090.htm",
"http://womeninplanetaryscience.wordpress.com/2010/12/30/vicki-hansen-celebrating-research-and-inquiry-and-respect-for-different-ideas/",
"http://www.intelius.com/people/Vicki-Redoutey/061zkb02n8y",
"http://jslhr.highwire.org/cgi/content/abstract/32/1/2",
"http://mail.free-knowledge.org/references/authors/vicki_l__hanson.html",
"http://www.slideshare.net/mikecrabb/aiding-data-gathering-in-web-usability-studies",
"http://www.youinweb.com/profiles/02341/vicki-l-lyall_344727063.htm",
"http://www.peekyou.com/vicki_hanson",
"http://www.directority.com/MA/Hanson/Mental-Health-Services/Mental-Health/384379-Lyall-Vicki-L.htm",
"http://researcher.ibm.com/view.php%3Fperson%3Dus-vlh"
The distribution of the number of results per researcher is as follows (number of results: number of researchers):
0:218
1:94
2:312
3:153
4:105
5:116
6:40
7:40
8:48
9:41
10:48
11:62
12:1
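A tally like this can be produced with a `Counter`. The sketch below is an assumption about the method, not the original code: what counts as a "result" is left to the `count` argument, which defaults to the length of each list. It also sums the figures above, assuming each line is a number of researchers.

```python
from collections import Counter


def distribution(rtr, count=len):
    """Map 'number of results' to 'number of researchers with that many'."""
    return Counter(count(urls) for urls in rtr.values())


def format_distribution(dist):
    """Render the tally in the same 'count:researchers' form used in this post."""
    return "\n".join(f"{k}:{dist[k]}" for k in sorted(dist))


# The figures listed in this post.
PUBLISHED = {0: 218, 1: 94, 2: 312, 3: 153, 4: 105, 5: 116, 6: 40,
             7: 40, 8: 48, 9: 41, 10: 48, 11: 62, 12: 1}

total = sum(PUBLISHED.values())        # 1278 names in all
to_vet = total - PUBLISHED[0]          # 1060 names with at least one result
```

So, under that reading, roughly five out of six names have at least one result that someone would need to look at.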
So, there may be more to vet than I originally anticipated. I also ran into two problems when I tried vetting some of the results myself:
Researchers may not talk about research on Twitter.
Tweets may be private, so you can't read them to judge whether the account belongs to a researcher.