After several weeks of effort, we now have fan transcripts saved for all of the Witch Films that we plan to compare. As mentioned in our "Troubleshooting" post, there are several affordances of using fan transcripts (versus scripts or shooting scripts):
Transcripts typically do not record information about action or instructions for the camera, so when we compare transcripts we are comparing only final dialogue. Thus, the cosine similarity scores between transcripts ought to reflect the similarity between the spoken words that viewers hear. For the sake of our project, this is a more interesting comparison than, say, one between shooting scripts, which include a lot of extraneous information.
Fan transcripts proved much more reliable than other potential captures of dialogue, particularly closed caption data. Many CC files were garbled, especially in the case of older films. (In many cases, such as with The Witches of Eastwick, the fan transcripts appeared to have been rendered from closed caption data and then edited by a fan. In these situations, we removed time stamps from the file but left the rest of it intact.)
By looking for fan transcripts, we are once again allowing the audience and the internet to lead us toward popular films (as we've done with the creation of our initial list of films and with the selection of images). This means that while we believe that, say, Little Witches (Jane Simpson, 1996) should have been included in the comparison, it won't make it into our digital analysis because no fan transcript is available for that film. The availability of a fan transcript acts as a kind of filter, limiting us to only those films that have a popular following.
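To make the cosine similarity comparison mentioned in the list above more concrete, here is a minimal sketch in Python. It is not SameDiff's actual code, and we have not confirmed how SameDiff tokenizes or weights words; the function names `word_counts` and `cosine_similarity` are our own, and the sketch assumes a plain bag-of-words count of lowercase letter runs.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each run of letters a-z as one word.

    Assumption: a simple bag-of-words model, not necessarily what SameDiff uses.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Return the cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    a, b = word_counts(text_a), word_counts(text_b)
    # Dot product over the words the two transcripts share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty transcript shares nothing with anything
    return dot / (norm_a * norm_b)
```

Two transcripts that use the same words in the same proportions score close to 1, and two transcripts with no words in common score 0. Because only word counts matter, word order and speaker attribution play no part in the score.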
When looking for transcripts, we tended to have a lot of luck with Drew's Script-O-Rama, Springfield! Springfield!, and Fandom.* If multiple scripts were available, we attempted to discern which appeared to most accurately represent the dialogue. After locating a script, we copied it from the web and pasted it into a TextEdit file.
As far as text cleanup goes, we removed all apostrophes from each file. (As mentioned in our "Troubleshooting" post, SameDiff does not read contractions.) Unfortunately, this turned "he'll" into "hell" and "I'll" into "Ill," so we'll need to go back in after we run word occurrence data and account for that. After removing all apostrophes, we scrolled quickly through each file to make sure that nothing had gone obviously wrong when the text was pasted in. This is when we realized that in a cluster of transcripts circa 2005, the lowercase "L" and the uppercase "i" had been transposed. (So it appeared as if the word I'll was in order, but it was actually L'ii.) We did global searches for i's and L's in each transcript and corrected this error. Then we saved each file as a plain text document and uploaded it to a Google Drive folder.
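We did this cleanup by hand, but the apostrophe step could also be scripted. The sketch below is only an illustration under our own assumptions: the function names are hypothetical, and it treats both straight and curly apostrophes as removable. Counting contractions before stripping them is one possible way to later account for words like "hell" and "Ill" in the word occurrence data.

```python
import re
from collections import Counter

# Assumption: pasted transcripts may contain straight (') or curly (‘ ’) apostrophes.
APOSTROPHES = "'\u2018\u2019"

# A contraction here means letters, one apostrophe, then more letters (e.g. he'll).
CONTRACTION = re.compile(r"[A-Za-z]+[" + APOSTROPHES + r"][A-Za-z]+")


def count_contractions(text):
    """Count each contraction (lowercased, with a straight apostrophe) before cleanup."""
    counts = Counter()
    for match in CONTRACTION.finditer(text):
        word = match.group(0).lower()
        for mark in "\u2018\u2019":
            word = word.replace(mark, "'")
        counts[word] += 1
    return counts


def strip_apostrophes(text):
    """Delete every straight or curly apostrophe, leaving all other characters alone."""
    return text.translate({ord(mark): None for mark in APOSTROPHES})
```

For example, if `count_contractions` reports that "he'll" occurs twelve times in a transcript, then twelve of the occurrences of "hell" in the stripped file can be attributed to the contraction rather than to the word itself.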
Links to the sources of all the fan transcripts can be found in our Witch Things spreadsheet.
* For scripts and shooting scripts, there are several less robust options, including IMSDB and the American Film Scripts Online database. We are indebted to Melissa Jones at the Georgetown University Library for her help throughout the text collection process. The library's Film and Media > Scripts & Archives page was also quite useful.