EdX and breadcrumbs of data
The quest for modulestore.json and other tales of EdX data analytics is a multi-chapter blog series about my experiences with educational data mining, based on a short course I took while at UC Berkeley.
EdX is one of the big MOOC platforms, originating from MIT and Harvard. They host the platform in the cloud, but you can also download the full code and run it on your own servers. Courses are organized in the traditional MOOC way: the course is split into weeks, and each week has (video) lectures, exercises and assignments, discussions and other materials. The exercises can take several forms, such as multiple choice, entering equations, numbers or text, or pointing at a spot in a picture, and if you really need something else you can develop your own modules too.
The cool part for us researchers is that users’ interactions with the site are stored in a huge log file. I mean huge, enormous! The logged data is also rather detailed: for videos, it even records when transcripts are shown and when they are hidden. Everything is stored in a nice format, one JSON object per line, that is easy to access:
{"username": "miller", "host": "class.stanford.edu", "session": "fa715506e8eccc99fddffc6280328c8b", "event_source": "browser", "event_type": "hide_transcript", "time": "2013-07-31T06:27:10.877992+00:00", "ip": "27.7.56.215", "event": "{\"id\":\"i4x-Medicine-HRP258-videoalpha-09839728fc9c48b5b580f17b5b348edd\", \"code\":\"fQ3-TeuyTOY\", \"currentTime\":0}", "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36", "page": "https://class.stanford.edu/courses/Medicine/HRP258/Statistics_in_Medicine/courseware/495757ee7b25401599b1ef0495b068e4/6fd116e15ab9436fa70b8c22474b3c17/" }
{"username": "jane", "host": "class.stanford.edu", "event_source": "server", "event_type": "/courses/Education/EDUC115N/How_to_Learn_Math/modx/i4x://Education/EDUC115N/combinedopenended/c415227048464571a99c2c430843a4d6/get_results", "time": "2013-07-31T06:27:06.222843+00:00", "ip": "67.166.146.73", "event": "{\"POST\": { \"task_number\": [\"0\"]}, \"GET\": {}}", "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36", "page": null }
(Examples are documented as part of a parser)
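To give an idea of how accessible the format is, here is a minimal Python sketch of reading one such line. The function name `parse_log_line` is my own hypothetical helper, not part of the EdX or Stanford tooling. I am also assuming that every line is a single JSON object and that the `event` field is a JSON-encoded string, as in the two examples above. The sample line is a shortened copy of the first example.

```python
import json


def parse_log_line(line):
    """Decode one log line into a dict and unpack the nested 'event' payload.

    Hypothetical helper for illustration; assumes one JSON object per line.
    """
    record = json.loads(line)
    payload = record.get("event")
    # In the examples the 'event' value is a JSON document stored as a string,
    # so it needs a second decoding pass.
    if isinstance(payload, str):
        try:
            record["event"] = json.loads(payload)
        except ValueError:
            pass  # not JSON after all: keep the raw string
    return record


# Shortened version of the first example line (raw string keeps the \" escapes).
sample = r'{"username": "miller", "event_source": "browser", "event_type": "hide_transcript", "time": "2013-07-31T06:27:10.877992+00:00", "event": "{\"id\":\"i4x-Medicine-HRP258-videoalpha-09839728fc9c48b5b580f17b5b348edd\", \"code\":\"fQ3-TeuyTOY\", \"currentTime\":0}"}'

record = parse_log_line(sample)
print(record["event_type"], record["event"]["currentTime"])
```

The second decoding step is the part that is easy to miss: without it, the video id and the playback position stay buried inside a plain string.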
However, to conduct any research on this, we first need to get the right data out of these files. That means extracting whatever is there and understanding what it means: the first example relates to the use of transcripts, while the second is someone checking the results of a certain exercise. So if we are examining interactions with videos, the second one is just waste! Luckily we are not alone here. Both Stanford and Harvard have written scripts to extract the data. During the course, I took a closer look at the Stanford scripts and first simply tried to run them. That attempt was not very successful:
The authenticity of host 'deploy.prod.class.stanford.edu (54.215.215.123)' can't be established.
So it seemed their scripts had some interesting dependencies on Stanford servers, and we indeed needed to get our hands dirty. I did what I do best: wrote some additional code and solved the problems.
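To make the "just waste" point from earlier concrete, here is a rough sketch of keeping only the video-related events from a log. This is neither the Stanford script nor the code written during the course. The name `video_events` and the set `VIDEO_EVENT_TYPES` are hypothetical, and apart from `hide_transcript`, which appears in the example above, the listed event types are my assumption about what counts as a video interaction.

```python
import json

# Assumed set of video-related event types; only "hide_transcript" is taken
# from the example log line, the rest are illustrative guesses.
VIDEO_EVENT_TYPES = {
    "hide_transcript",
    "show_transcript",
    "play_video",
    "pause_video",
    "seek_video",
}


def video_events(lines):
    """Yield decoded records for browser events whose type is video-related."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        if (record.get("event_source") == "browser"
                and record.get("event_type") in VIDEO_EVENT_TYPES):
            yield record


# Two shortened lines modelled on the examples above.
sample_lines = [
    '{"username": "miller", "event_source": "browser", "event_type": "hide_transcript"}',
    '{"username": "jane", "event_source": "server", "event_type": "/courses/Education/EDUC115N/How_to_Learn_Math/modx/i4x://Education/EDUC115N/combinedopenended/c415227048464571a99c2c430843a4d6/get_results"}',
]

kept = list(video_events(sample_lines))
print([r["username"] for r in kept])
```

Because the function is a generator, it can be fed an open file object directly, which matters when the log is too big to load into memory at once.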









