Messing Around with Perl and MongoDB
I created a script that parses the actor list from IMDB (just for fun). It was a good chance to finally mess around with Perl (the last time I used it was back in 2007) and, especially, MongoDB.
Perl is pretty fun, but it’s much crazier than I remembered. There are a ton of shortcuts and default values. All this seems powerful, but it also lets you write incredibly terse, difficult-to-maintain code.
Typically, I would just use AWK for something like this, but it seemed nice to be able to parse the file and talk to the database in the same script…
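The first time I ran the script, it took about 6 hours to insert/update 84,000 records. I stopped the script and checked my MongoDB book, then added one index using ensure_index. Without an index on name, every find_one lookup had to scan the whole collection, and the script does one lookup per input line. I ran the script again, and it finished all 1.2M records very quickly (yay).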
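To make the index effect concrete, here is a small pure-Perl analogy. It is only a sketch: the records are made up, and a Perl hash is not how MongoDB stores its indexes (those are B-tree based). The point is just the difference between examining every record and going straight to the one you want.

```perl
use strict;
use warnings;

# Made-up sample records standing in for actor documents.
my @records = map { +{ name => "actor$_", titles => [] } } 1 .. 1000;

# Without an index: examine records one by one until the name matches.
sub find_by_scan {
    my ($records, $name) = @_;
    my $examined = 0;
    for my $rec (@$records) {
        $examined++;
        return ($rec, $examined) if $rec->{name} eq $name;
    }
    return (undef, $examined);
}

# With an "index": a lookup structure keyed by name, built once up front.
my %by_name = map { ($_->{name} => $_) } @records;

my ($hit, $examined) = find_by_scan(\@records, 'actor1000');
print "scan examined $examined records\n";
print "indexed lookup found: $by_name{'actor1000'}{name}\n";
```

The scan cost grows with the size of the collection, and the script pays it once per input line, which is why the unindexed run got slower and slower.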
#!/opt/local/bin/perl
use strict;
use warnings;
use utf8;
use MongoDB;
use JSON;

my $connection = MongoDB::Connection->new(host => 'localhost', port => 27017);
my $imdb       = $connection->imdb;
my $actors     = $imdb->actors;

# Index on "name" so the find_one/update lookups below do not scan the collection
$actors->ensure_index({ "name" => 1 });

my $actorName = "";
my ($title, $year, $episode, $characterName, $castingName);

while (<>) {
    chomp;
    my $curLine   = $_;
    my $actorLine = 0;

    # A line that starts with a name followed by a tab begins a new actor
    if ($curLine =~ /^(?<name>[^\t]+)\t(?<remainder>.*)$/) {
        $actorName = $+{'name'};
        $curLine   = $+{'remainder'};
        $actorLine = 1;
    }

    # Lines that start with a tab are further credits for the current actor
    if ($curLine =~ /^\t/ || $actorLine) {
        # match title and year
        $curLine =~ /(?<title>.*?)\((?<year>.*?)\)/;
        $title = $+{'title'};
        $year  = $+{'year'};

        # match episode
        $curLine =~ /\{(?<episode>.*?)\}/;
        $episode = $+{'episode'};

        # match charname
        $curLine =~ /\[(?<charname>.*?)\]/;
        $characterName = $+{'charname'};

        # match casting name
        $curLine =~ /\[(?<casting>as .*?)\]/;
        $castingName = $+{'casting'};

        $actorName     = trim($actorName);
        $title         = trim($title);
        $year          = trim($year);
        $episode       = trim($episode);
        $characterName = trim($characterName);
        $castingName   = trim($castingName);

        # We only want to deal with movies for the time being
        if (!defined($episode)) {
            my $credit = {
                title         => $title,
                year          => $year,
                characterName => $characterName,
                castingName   => $castingName,
            };

            my $data = $actors->find_one({ name => $actorName });
            if (defined($data)) {
                # if the record for this actor exists
                my @titles = @{ $data->{'titles'} };
                push(@titles, $credit);
                $actors->update({ name => $actorName }, { name => $actorName, titles => \@titles });
            } else {
                # if the record for this actor does not exist
                $actors->insert({ name => $actorName, titles => [$credit] });
            }
        }
    }
}

# Strip leading and trailing whitespace; undef is passed through unchanged
sub trim {
    my $string = shift;
    return $string unless defined $string;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}
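The find_one followed by update or insert in the script could probably be collapsed into a single round trip with an upsert and MongoDB's $push operator. This is a sketch, not what I actually ran: add_credit is a hypothetical helper, and it assumes $actors is a collection handle from the same legacy driver as the script, whose update method accepts an options hash with upsert.

```perl
use strict;
use warnings;

# Hypothetical helper: append one credit to an actor's document, creating
# the document if it does not exist yet. $actors is assumed to be a
# MongoDB::Collection handle from the legacy driver used in the script.
sub add_credit {
    my ($actors, $actorName, $credit) = @_;
    $actors->update(
        { name => $actorName },                  # match on the indexed field
        { '$push' => { titles => $credit } },    # append to the titles array
        { upsert => 1 },                         # create the document if missing
    );
}

# Usage inside the parsing loop would look roughly like:
#   add_credit($actors, $actorName, { title => $title, year => $year });
```

Note the single quotes around '$push', so Perl does not try to interpolate it as a variable. Newer versions of the driver expose update_one in place of update, so the call would need adjusting there.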