Some details on SaneBox's downtime on Wednesday
In the old days, I would hold a software release for 7 days. That 7 day rule was unbreakable. No additional non-bug fix changes would be allowed and much of the company time would be spent identifying and fixing bugs and getting the release ready.
In the world of Ruby on Rails with tests written first and code second, and the power of git to be able to pick and choose changes for release we are able to lower that release time to hours.
It is likely that most customers of SaneBox don't even realize how often we are releasing new software. Someone making a bug report might realize it when their fix gets pushed thru our development cycle in a matter of hours.
Usually the release causes a 20 second down time of the website and up to 10 minutes of down time for filtering for some accounts as our system is restarted.
On Wednesday August 21 at 8:45AM ET, we were making one of those releases. This change included adding Rails 3.2 asset pipeline capability to our website. It had been developed on a branch over a period of weeks and tested and code reviewed as we do with any other change.
But, little did we know that this line of code deep in the Ruby gem that gets executed during deploy:assets:precompile:
was about to cause us endless problems. It deletes everything in the Rails cache directory. No, it doesn't just delete things related to the pipelining or related to assets, it deletes all your cache files. Sigh...
OK. Big deal you think. Shouldn't cache files be temporary and easily replaced? Yes. But they are there to allow the system to avoid doing expensive operations all the time.
For instance, it is necessary for SaneBox to keep a temporary cache of the meta information on the emails we are tracking. This cache gets created when you sign up. New items get added to it as new email arrives in the Inbox. And this cache gets refreshed periodically. If it should disappear or become invalid for some reason, our processors simply go back to email server and recreate it.
But it is one thing to recreate the cache for one account or one machine or even a couple of machines than it is to recreate all the cache for everyone. By a factor of 100.
We always have spare bandwidth and cycles and RAM but not to allow a spike of 100. And no database on earth can withstand a spike of x100.
So here's what happened on Wednesday:
1. Our 8:45AM release wiped out the cache on every machine
2. We shortly noted an odd usage uptick off the database but no errors. Specifically no exceptions. The system was running just fine. Except it suddenly was running "a little" hotter than usual with a queue of queries we don't normally see on startup.
3. We identified a sql query that was too big.
4. We fielded another release to fix the sql query. Which we now know wiped out our cache again. And big surprise fixing the symptom didn't fix the problem.
5. Upon further diagnostic we realize that our cache has been wiped out. We now know the problem causing all the symptoms but not the cause.
6. We include the sql query fix in the previous tag and release that and start doing 2 things. 1) Ease the machines and database so that the cache can be recreated for all accounts. 2) Figure out what it was in the release that deleted the cache.
7. After some mysql configuration adjustments and various software changes to stop processes from being reaped prematurely, allow mysql backlog algorithms to operate under this new load, we were making some progress when our main mysql instance hung (we have filed a bug report for this).
8. We had to bring the database down and up. A process that takes over an hour.
9. At this point, not wanting to risk another database hang, I decided we would bring up one machine at a time. Let the processing clear and then bring the next one up. We have a lot of machines so this was not an easy choice to make. But again we started to make progress.
10. While all this is going on the other half of the team is scrambling to find that evil line of code and design a fix.
11. Oh yes, and some of the apache processes on some of the web serving machines got into an odd state because of the database issues so they needed to be restarted from scratch. But since it wasn't all of them that were having problems we needed to debug that as well.
12. We had many customers fully filtered by 3PM and finally had all customers fully filtered by 7:15PM.
Not a great day. Because of that one line of code we inconvenienced many SaneBox-ers from all over the world. And I think my engineering team and support team has aged an additional decade.
I do personally apologize to the entire SaneBox community. This is exactly the kind of thing that should never, ever happen. Each production release requires my sign off. So, yes this is my fault. And it is now my job to make sure it never happens again.
For those needing a silver lining: we did release the new software the next morning without any new surprises and the website is now 2-3x faster with a zillion additional little and big improvements. Oh yes, did I mention that Rails 4 asset pipelining doesn't wipe out your entire cache!
So, to the two people that made it down to this paragraph, you are asking yourself two questions:
1) What was it in our development process that allowed this to happen
2) What can we change about it to avoid this ever happening again
I am working on that and will report back.