Speeding up CSV imports with rails
I was recently working on a project that required frequent importing of S3-hosted CSV files containing hundreds of thousands of users. My first pass at the import was fairly standard:
require 'rubygems' require 'fog' require 'csv' start_time = Time.now counter = 0 tenant = Tenant.first connection = Fog::Storage.new({ :provider => 'AWS', :aws_access_key_id => 'xxx', :aws_secret_access_key => 'xxx' }) directory = connection.directories.get("xxx") file = directory.files.get('imports/test-import-medium.txt') body = file.body CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.new( first_name: row_hash["FirstName"], last_name: row_hash["LastName"], address: row_hash["Address1"], address2: row_hash["Address2"], city: row_hash["City"], state: row_hash["State"], zip: row_hash["ZipCode"], email: row_hash["Email"], gender: row_hash["Gender"], ) user.set_random_password user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created]) user.save! counter += 1 end end_time = Time.now puts "#{counter} users imported #{((end_time - start_time) / 60).round(2)} minutes (#{( counter / (end_time - start_time)).round(2)} users/second)"
I ran that version against a test file containing 25,000 records. The result was -- well, I'm not sure what the result was because I stopped it over an hour in when it had only imported 5,000 records.
After scratching my head for a minute and doing a little debugging, I discovered that the slowest part of the script was the "user.save!." Sure, connecting to s3 and streaming down the file wasn't lightning-fast, but the real culprit was saving a user. It was extraordinarily slow for some reason.
Then it hit me, the reason I was setting a password via my custom "set_random_password" method was because the User model used has_secure_password. My workflow didn't even require the user to have a password at this point, but has_secure_password requires a password on creation, so I was passing in a dummy password to pass the validation. The generating the password was fast, but encrypting the passwords was where things were slowing down. I was taking a significant hit for something that I didn't even need.
So, my first step was to rip out has_secure_password and write my own encryption/authentication methods. That way, I had control over the validation which meant I no longer needed to pass in a password during this import process. That alone was a huge gain. After that change, I re-ran the import and the result was:
25000 users imported 12.25 minutes (34.01 users/second)
12 minutes was a significant improvement from the first pass, but at that rate, it would still take 3-5 hours to process a file containing 200k rows (assuming the rate slowed with larger files). So, I began looking for big optimization gains.
After some quick googling, I found https://github.com/zdennis/activerecord-import. This looked extremely promising, so I cranked out some code to test it out
start_time = Time.now counter = 0 tenant = Tenant.first users = [] ... CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.new( first_name: row_hash["FirstName"], last_name: row_hash["LastName"], address: row_hash["Address1"], address2: row_hash["Address2"], city: row_hash["City"], state: row_hash["State"], zip: row_hash["ZipCode"], email: row_hash["Email"], gender: row_hash["Gender"] ) user.memberships.build(tenant: tenant, status: Membership::STATUSES[:created]) users << user counter += 1 end User.import users end_time = Time.now puts "#{counter} users imported #{((end_time - start_time) / 60).round(2)} minutes (#{( counter / (end_time - start_time)).round(2)} users/second)"
The good news: it was FAST
25000 users imported 1.57 minutes (265.99 users/second)
The bad news, it completely ignored the build() association and didn't save the membership records. This was because activerecord-import can't handle associations. When it inserts records, nothing is returned, so building an associated model just wouldn't work. As frustrating as that was though, I wasn't willing to abandon the activerecord-import path -- 265 records/second was a huge impremovent over 34.
My next pass is where things got a little less attractive, but I was willing to sacrafice some elegance for the sake of speed. I decided to import the users, then loop through the CSV again, get the User.id for each freshly inserted record, manually build a Membership model, then import an array of Memberships. My theory was that doing two loops and a lookup of every user to get their id would still be faster than a non activerecord-import solution.
require 'rubygems' require 'fog' require 'csv' start_time = Time.now counter = 0 tenant = Tenant.first users = [] memberships = [] ... CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.new( first_name: row_hash["FirstName"], last_name: row_hash["LastName"], address: row_hash["Address1"], address2: row_hash["Address2"], city: row_hash["City"], state: row_hash["State"], zip: row_hash["ZipCode"], email: row_hash["Email"], gender: row_hash["Gender"] ) users << user counter += 1 end User.import users counter = 0 CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.where("email = ?",row_hash["Email"]).first membership = Membership.new( tenant: tenant, status: Membership::STATUSES[:created], user: user ) memberships << membership counter += 1 end Membership.import memberships
I ran it again and the results were:
25000 users imported 2.72 minutes (153.3 users/second)
OK, so not as fast as before, but still plenty fast. Even with 200k records, this would certainly finish within 30-45 minutes. I felt good about the path I was on, but began looking for other improvements. First, I was getting an entire user record in the second loop, when I really only needed the User.id:
user = User.where("email = ?",row_hash["Email"]).first
So, I replaced that line with the following:
user_id = User.where("email = ?",email).select(:id).first.id
I was also making unecessary calls within the loop to get the value of Membership::STATUSES[:created], so I moved that out of the loop and set a "status" variable once. I was also passing in an entire Tenant object, when again, all I needed was the Tenant.id, so I fixed that too.
My final version looked like this
... start_time = Time.now users = [] memberships = [] counter = 0 tenant_id = Tenant.first.id status = Membership::STATUSES[:created] ... CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user = User.new( first_name: row_hash["FirstName"], last_name: row_hash["LastName"], address: row_hash["Address1"], address2: row_hash["Address2"], city: row_hash["City"], state: row_hash["State"], zip: row_hash["ZipCode"], email: row_hash["Email"], gender: row_hash["Gender"] ) users << user counter += 1 end User.import users counter = 0 CSV.parse(body, col_sep: "|", headers: true) do |row| row_hash = row.to_hash user_id = User.where("email = ?",row_hash["Email"]).select(:id).first.id membership = Membership.new( tenant_id: tenant_id, status: status, user_id: user_id ) memberships << membership counter += 1 end Membership.import memberships end_time = Time.now puts "#{counter} users imported in #{((end_time - start_time) / 60).round(2)} minutes."
And when I ran it, I got the following
25000 users imported 2.4 minutes (173.71 users/second)
Those last little optimizations had a 20 records per second improvement. From where I started, the import process went from 5-10 hours to under thirty minutes for 200k records. That's a huge improvement and the time savings will yield huge rewards for my client. Not a bad day's work.