Building Wanelo is Wanelo's engineering blog, featuring technical tales of triumph, daring and woe. Sometimes cats. We are definitely hiring. Email [email protected] if you're curious.
When Wanelo gets a brand new workstation the first thing we install on it is Sprout. Sprout is a collection of OS X-specific recipes that allow you to install common utilities and applications that every Ruby developer has and will appreciate.
Anyone who has worked at Pivotal Labs would feel comfortable with the workstations that spawn from sprout-wrap, because they’ve probably worked on one before. Sprout is based on chef-soloist that allows a developer to run a set of chef recipes from their local machine. Recipes have been built for applications like Chrome, RubyMine and iTerm. Other common OS X settings — ones that are changed on every new workstation — can be switched, including turning SSH on, changing the default keyboard repeat rate, installing sane git aliases, rbenv and bash completion.
You can automatically clone git repositories into your ~/workspace directory. You can install PostgreSQL, ImageMagick, node, Dropbox, PhantomJS, GitX, Caffeine and Heroku Toolbelt. If you’ve found a utility useful in a development setting, or flipped a switch somewhere in System Preferences, a recipe probably exists in one of Sprout’s cookbooks.
So who is the useful for?
We at Wanelo have grown considerably in the last few months I’ve been there. We’ve hired four new people. That’s two pairing stations that two pairs would normally have to set up. Instead, we took the time to set up asoloistrc file — the file where you specify which recipes to run.
Running bundle exec soloist inside the proper directory will ensure all applications are installed. After unboxing new workstations (or formatting the out-of-date ones) a full run of sprout-wrap takes about two and a half hours. Subsequently, each run takes about 3 minutes.
Customize!
At Wanelo we wrote a recipe that runs bundle install and sets up our development database. It gets our Ruby web application to a state where we can run foreman start on a brand new machine. We also wrote a recipe that installs our VPN on our machines. The most recent recipe we wrote installs our favorite vim plugins, configurations and theme.
We have reached convergence
Sprout-wrap has the benefit of following chef’s principles and requires recipes to be idempotent. So when a developer includes the recipes that, for instance, installs cowsay, going forward each workstation will now have cowsay. Which means every developer is happy to work on any workstation. Because, hey, now you have cowsay on every workstation.
Sprout enables us as pairing developers to build a stable cluster of similar machines where there are minimal development bottlenecks. Having someone who is familiar and can set up sprout-wrap on your machines is hugely valuable for a growing organization, and can save hours, likely days, of developers' time.
You can learn and find instructions on how to set up and install sprout on GitHub.
Deploying at Wanelo tends to be high-frequency and low-stress, since we have most aspects of our systems performance graphed in real time. We can roll out new code to a percentage of app servers, monitor app server and db performance, check error rates, and then finish up the deploy.
However, there’s one area where I’ve always wanted better metrics: on the client side. In particular, I want better visibility into uncaught JavaScript exceptions. Client-side error tracking is a notoriously difficult problem -- browser extensions can throw errors, adding noise to your reports; issues may manifest only in certain browsers or with certain network conditions; exception messages tend to be generic, and line-numbers are unhelpful, since scripts are usually minified; data has to be captured and collected from users’ browsers and reported via http before a user navigates to a new page. And on and on.
On the other hand, many sites are moving more and more functionality client-side these days, so it’s becoming increasingly important to know when there are problems in the browser.
I have yet to see a great solution to this problem, so I try to ask about other companies’ client-side error tracking whenever I can. I usually hear one of two answers: A.) We don’t track them (but we’d like to), or B.) We built our own in-house tracking system; sometimes it helps us catch issues, but usually it’s a firehose of random errors that we can’t trace back to a particular issue.
There’s a middle path between these two answers that I think will end up being the “just right” solution for us: client-side error rate tracking. Essentially, ignore all error messages and calculate the total count of client side errors per minute relative to “page views." The goal of this sort of tracking isn’t to pinpoint each new client-side issue, but just to answer the question: did we break something during this deploy that’s going to prevent our users from having a good experience on the site?
tracking should not require any new services or infrastructure, or add much load to our current infrastructure
data needs to be graphable alongside our backend metrics
data needs to be available fast, ideally in around 5 minutes or less
Most of our backend metrics are graphed in Circonus, and we display the most relevant ones on a big dashboard in the office. The graphs and meters on the dashboard are calibrated well enough that they go red infrequently, and when they do, we notice (and usually take action). My aspiration for our client-side error tracking is for it to be on this dashboard and work the same way — updating in real-time and going red if (and only if) there’s an actionable issue. Luckily, Circonus can consume data from a lot of different sources, including metrics from hosted third-party tools like New Relic — in fact, it can do custom checks to any JSON endpoint that is public, or that allows authentication with a URL param or request header.
The custom-check capability is a win, because it lets us use Fastly (our CDN) to serve our client-side error beacon and then set up a Circonus check to the Fastly stats API to get a count of errors. Fastly isn't exactly intended for this use case, so it requires a bit of configuring to set up, but there are a few advantages: first, a spike in error beacon traffic won’t increase load on any of our infrastructure; second, we make extensive use of Fastly for serving assets and pages, so we already have a reason to track Fastly day to day (and, of course, error pixels load fast : )
Setup
1. Add an on-error handler that loads the error beacon
The first step was to add an on-error handler near the top of our main application JavaScript file. We kept it simple. If you already have on-error handlers (or your third-party scripts use them) you may want to have the handler call the previous handler when it's done. Otherwise, the on-error function just needs to insert an img tag into the page when an error is caught.
2. Deploy the beacon to Fastly
Second, we deployed our error pixel on Fastly. For Fastly, we set the pixel up on its own domain so it would be easy to track all of our client-side error traffic as a separate service. We also had to update the default VCL slightly — we wanted the backend TTL to be long, so Fastly never needs to fetch a new version of the asset from our servers, but set the response cache control headers on the asset to be short (max-age=1), so new errors will always trigger a request to Fastly (i.e., the asset shouldn’t be loaded from the browser cache).
3. Set up a Circonus JSON check to the Fastly API
Third, we set up a Circonus JSON check to poll the Fastly stats API every minute. Since Fastly has some rate limits on pulling very recent data from the stats API, we have the check get a count of error pixel requests from 30 minutes ago. This gets us a count of errors per minute — to turn it into a rate (i.e., percentage of page views that have a JavaScript error), we graph the ratio of errors per that minute to pageviews per that minute. (We’re using stats on our CSS and JS asset requests to calculate number of pageviews, assuming some number of asset requests per minute, but if you wanted to be more precise, you could add a “pageviews” tracking beacon to your web layouts and use that rate for your denominator.)
If we want to see this data immediately, without the 30-minute delay, we can -- we just need to log into Fastly and look at the real-time graphs on the Fastly dashboard.
And that's it!
Results
We’re successfully collecting this data now, and we’re graphing it in Circonus. This is the "errors last week" view:
So far, this experiment looks like a qualified success. We were able to roll out tracking with less than a day of work, and, if something major gets broken, we’ll see a spike in the graph in 30 minutes. However, since this metric is a bit noisier than others we collect, it takes a fairly widespread issue before it’s obvious on the graph. So, there are some ways we want to improve this tracking in the future...
TODO:
1. Segmented rates by browser
The top priority for future work here is adding error rates by browser/operating system (and possibly browser version). Right now our graphs will tell us if there’s an issue on a popular page in all (or most) browsers, but if we had an issue in just one browser, like Internet Explorer 9, it might get lost in the noise. It would be great to have a composite graph that’s composed of stacked rates for each major browser, so we could see if one was growing out of the proportion to the others.
2. Closer-to-real-time data
The data we pull from the Fastly API is a bit older than the data in most of our Circonus graphs, so there’s some mental overhead to remembering to offset the one graph from the others by 30 minutes to think about overall systems health. It would be great to be able to pull fresher data from the Fastly API without the rate limits, or be able to offset the graphs in Circonus by a set amount of time.
Addendum
Other collection options: Google Analytics
If your tools don’t allow you to beacon data some place and graph it, but you still want this sort of tracking, you may be able to do similar tracking with your web analytics solution. For example, instead of an image beacon, you could send a Google Analytics custom variable. On the plus side, a lot of web analytics tools track browser and OS version information by default; however, if this tracking relies on JavaScript, it may not run in some cases, depending on where the original uncaught exception happened.
On the other hand, a site as large as Wanelo no longer qualifies for full data collection on the free Google Analytics service, so we use 5% sampling. Clearly, sampling client-side errors is not the best idea :)
Where this won’t work
For us, I think graphing a client-side error rate is the happy medium between having no visibility into client-side exceptions and having a high-maintenance error-tracking service. However, there are use cases where this technique wouldn’t be a great fit: for example, if your company does large, infrequent releases, error rates may not be enough information. Our releases tend to be relatively small, so usually, if we see error rates climbing, we know which code changes are the likely culprits, and where to start looking for the issue.
Multi-process or multi-threaded design for Ruby daemons? GIL to the rescue :)
MRI Ruby has a global interpreter lock (GIL), meaning that even when writing multi-threaded Ruby only a single thread is on-CPU at a point in time. Other distributions of Ruby have done away with the GIL, but even in MRI threads can be useful. The Sidekiq background worker gem takes advantage of this, running multiple workers in separate threads within a single process.
If the workload of a job blocks on I/O, Ruby can context-switch to other threads and do other work until the I/O finishes. This could happen when the workload reaches out to an external API, shells out to another command, or is accessing the file system.
If the workload of a process does not block on I/O, it will not benefit from thread switching under a GIL, as it will be, instead, CPU-bound. In this case, multiple processes will be more efficient, and will be able to take better advantage of multi-core systems.
So… why not skip threads and just deal with processes? A number of reasons.
Threads are more memory efficient. Fewer processes means less virtual memory allocation, allowing for more workers on fewer hosts. This can result in considerably less cost over the course of a year. Garbage collection fixes in Ruby 2 promise better shared memory management between forked processes, but I have yet to see material benefit from this in production.
Context switching between processes is more expensive than context switching between threads. This is because process context-switching involves switching out the memory address space. Thread switching happens within the same address space.
Even when pooling database connections through a connection manager like PGBouncer, more processes means more idle connections, and if you're not careful (i.e., you don't monitor connection count) it's easy to accidentally max out your database connection limit. We are particularly aggressive on this front (having burned ourselves a few times), sometimes going so far as to force ActiveRecord to release connections back into its connection pool before starting long, blocking requests. For instance:
class PushNotificationWorker < WaneloWorker def perform!(user_id, message) User.find(user_id) ActiveRecord::Base.clear_active_connections! PushNotification.new(user, message).deliver! end end
This way other threads are able to get the connection out of the pool before the first thread finishes.
Sometimes a workload will purposefully block (or sleep), for instance, in a daemon process that only wakes up every N seconds to do some work. The spanx (https://github.com/wanelo/spanx) gem works this way, with multiple actors running in separate threads.
It's much easier to manage a single process daemon, from an operational point of view, than a set of daemons. In SmartOS this means that the SMF definition for the service does not have to manage multiple processes. Additionally this prevents a situation where one actor may not start, which might happen with multi-process design. It's much less confusing to type "svcadm disable spanx-watcher" when there's a problem, than to track down four separate services in order to stop them all (having said that, SMF supports "noop" service that can be declared as a single dependency for several others, thus stopping noop service also stops the dependents).
In our Chef cookbook for automating running Sidekiq background jobs as a service in SmartOS, we define a pool of Sidekiq workers attached to a set of queues, with a configurable concurrency. This allows us to run CPU-bound background jobs as a pool of single-threaded multi-process workers. Conversely, we can configure IO-bound jobs, such as workers that need to "talk" to external APIs, as a pool with high concurrency (often as high as 10 or 20). If workers have to wait 2-3 seconds for API calls to return, that's a lot of time left for other jobs to be processing in parallel.
Related: In his presentation "Accelerating Wanelo to 200K RPM", Konstantin Gredeskoul shows how to use NewRelic to calculate ratio of CPU to IO in our ruby stack, to determine how many single-threaded unicorn processes to run on a multi-core system for optimal utilization.
See also: "Threads (in Ruby): Enough Already" by Yehuda Katz.
Quick heads-up on our upcoming webinar with Joyent on Manta
A few months back, one of our engineers Atasay Gokkaya published a fantastic overview of how we at Wanelo use Joyent's new innovative object store Manta for a massively parallelized user retention analysis, using just a few lines of basic UNIX commands in combination with map/reduce paradigm.
I also recently went onstage with Joyent's VP of Engineering Bryan Cantrill for a fireside chat at VentureBeat’s CloudBeat, discussing Wanelo's use of Manta, as well as our excitement about Joyent's cloud. If you missed it, or are interested to learn more about the subject, we're continuing the discussion with a live webinar on Tuesday, October 29th.
Atasay and I will dive deep into our team's experience using Joyent Manta storage and big data analytics service.
It's an hour-long webinar, and we'll cover the following:
How we solved the problem of user event data collection on a massive scale, and very cheaply
How Joyent Manta storage and big data analytics service allowed us to use the collected data to analyze user behavior and retention over many months, and run our queries in mere minutes
We'll discuss the unique benefits of using Joyent Manta Storage Service, including ease of use, flexibility, performance, and cost-savings
We'll answer any questions from the audience as much as time permits.
Detangling Business Logic in Rails Apps with PORO Events and Observers
With any Rails app that evolves along with substantial user growth and active feature development, pretty soon a moment comes when there appears to be a decent amount of tangled logic, AKA "technical debt."
A typical example would be a user registration controller's "register" action, which upon a successful registration might coordinate a bunch of actions related to the registration but unrelated to one another, such as:
Sending the user a welcome email
Logging an analytics event for future reporting
Queueing up a job to notify user's Facebook friends
Running a check against a spam database of IP addresses to validate the new account
Running recommendation engine logic to suggest topics to follow
These are all concerns that are independent of one another, but happen when a user registers. Some of these actions happen immediately, some even within a single transaction, and some asynchronously (in another thread, or in a background job).
This topic has been given a lot of discussion on this famous thread, where even DHH chimed in. We'll use the example discussed in that thread, and the version that DHH presented (slightly compacted) below. Basically, a controller that's creating a comment and then performing a bunch of related actions, such as posting to Twitter and Facebook, or running it through a spam check.
class PostsController def create @entry = current_user.entries.find(params[:id]) return head(:bad_request) if SpamChecker.spammy?(params[:post][:body]) @comment = @entry.comments. create!(params[:post]. permit(:title, :body). merge(author: current_user)) Notifications.new_comment(@comment).deliver if @comment.share_on_twitter? TwitterPoster.new(current_user, @comment.body).post end if @comment.share_on_facebook? FacebookPoster.new(current_user, @comment.body). action(:comment) end end end
In this blog post we'll examine an event-based approach to decoupling this business logic, a method that's been pretty successful within the Wanelo codebase thus far.
Simple World
In a small Rails application it might be tempting to put this type of logic directly on the controller as DHH did above, but this approach, while simple and easy to understand, might make things a bit difficult as the application matures. For one, testing this action becomes a challenge: many unrelated services need to be stubbed out, and many permutations tested. What if a Twitter post succeeds, but a Facebook post doesn't? In other words, these concerns become "tightly coupled" inside the controller. And what if we want to also create comments from another place, perhaps an API controller? This certainly applied to our case at Wanelo.
A possible solution could be to split the logic into various methods on the ActiveRecord Comment model, and implement them as callbacks, such as after_create. Unfortunately this approach suffers from similar problems: many of the above actions do not belong inside the model, and will only pollute it with tangled code and external dependencies. Why does the Comment model need to know about posting to Facebook or a spam checker? This certainly is a matter of taste, but experience shows that attaching this behavior directly onto the model does not work out well in the long run, as the model classes become bloated and full of external dependencies.
Another solution could be to create a layer of "services" -- standalone plain Ruby classes, which then encapsulate this logic in one place, and then multiple controllers can just call into it. This is the approach that the author of the thread proposed. While I do think that this solution is better than stuffing this logic into the controller, and assuming it's valuable to have this logic be reusable, it still crams these concerns together, making that class challenging to test.
Ultimate Goal?
It's really important to decide what our goals are. At Wanelo, we want to be able to build software that's easy to change, and to adapt to the constantly moving product requirements and experiments, as well as to the high-scalability demands of the popular site that we've become.
We want code that's easily testable, easy to understand, and easy to maintain. Tangled code of unrelated concerns inside controllers, models or services does not meet this standard: it's hard to test, and could be hard to change (assuming many more layers are added on top of the simple example, which tends to happen in larger apps).
Events to the Rescue
Events and event-driven architecture offer a nice design pattern for splitting this logic into self-contained units (observers), which can declare interest in a certain business event independently of one another. When an event happens, observers get notified. It's pretty straightforward, but it moves the dependency association into subscribing each observer to the event.
Ruby's Observable module provides a very simple way to create observers and tie them to an "observable" event (which is a class including the module).
But we found ourselves wanting a bit more. For starters, it's nice to encapsulate events into plain Ruby classes that might wrap some useful event data (for example, a comment object, for the CommentPublishedEvent). Another feature we wanted was the ability to notify a group of observers within a transaction block, and others outside of the transaction.
Ventable to the Rescue
Enter Ventable: a very simple plain Ruby gem that implements the Observer pattern in a slightly more flexible way, and provides convenient configuration DSL that makes connecting events to interested observers (or listeners) declarative. This "connecting" logic goes into your Rails initializer folder, and provides a nice "map" of what happens when important business events happen.
Here is an example of how things might be connected in your application based on the previous example:
require 'ventable' class CommentCreatedEvent include Ventable::Event attr_accessor :user, :comment def initialize(user, comment) @user = user @comment = comment end end # config/event_initializer.rb CommentCreatedEvent.configure do notifies Notifications, TwitterService, FacebookService, SpamChecker end # lib somewhere class TwitterService # implementation skipped... def self.handle_comment_created event if event.comment.share_on_twitter? self.new(event.user).post_comment(event.comment.body) end end end class FacebookService # implementation skipped... def self.handle_comment_created event if event.comment.share_on_facebook? new(event.user).post_comment(event.comment.body) end end # SpamChecker skipped for brevity # app/controllers/post_controller.rb class PostsController def create @entry = current_user.entries.find(params[:id]) @comment = @entry.comments. create!(params[:post]. permit(:title, :body). merge(author: current_user)) CommentCreatedEvent.new(current_user, @comment).fire! end end
There is a lot going on above, but it's also pretty obvious what's happening -- another power of this eventing approach. First we are defining a CommentCreatedEvent class, to wrap user and comment instances, and then we configure this event using the DSL to notify several observers (which in this case are all ruby classes). We can now use generic FacebookService and TwitterService (which could encapsulate multiple Twitter and Facebook operations; a plus in my book), which all have a callback method, called by the eventing gem upon firing the event.
Diving Deeper
In Wanelo code base, we currently have 30 distinct events, which are all fired at various points throughout the lifecycle of our application. Some events are fired in web request, some during background jobs. Currently Ventable dispatch mechanism only supports in-process ruby observers, but it would not be difficult to extend it to support a queueing mechanism, such as RabbitMQ.
We defined a hierarchy of events in lib/wanelo/events directory of our rails app, and they all subclass a Base class. This class defines a couple of additional features.
It automatically includes Ventable::Event in each subclass of the Base class
It defines a transaction block that is then used (when defining individual events) to notify some observers inside the transaction, and some outside of it. We do this so that database transaction is not left open unnecessarily for too long -- operations such as as posting to Facebook can take seconds to complete. Keeping database transactions as short as possible is pretty much required for any high-traffic web application.
It automatically subscribes each concrete event (ie, a subclass of base) to metrics module, which transmits a UDP packet to our statsd metrics aggregator, for each of the 30 events in our app
Let's take a look at what this looks like:
module Wanelo module Event class Base class << self def transaction @transaction ||= ->(b) { ActiveRecord::Base.transaction do b.call end } end def metrics @metrics ||= Wanelo::Metrics.instance \ if defined?(Wanelo::Metrics) end end def self.inherited(klass) klass.instance_eval do include Ventable::Event group :transaction, &Wanelo::Event::Base.transaction # Always notify statsd notifies ->(event) { self.metrics.handle_event(event) } end end end end end
By defining the transaction group and binding it to the proc, we are able to use inside: option when configuring events, such as so:
Wanelo::Event::ProductSave.configure do notifies Product, SaveNotification, SaveAction, inside: :transaction notifies ProductSaveWorker, ResaveEmailWorker end
Ventable calls all observers in the order defined in the configuration. The first four observers are called inside the transaction block, while the last two are called after the transaction had already committed.
Pros and Cons
Advantages
Modeling business events as actual Ruby classes has many advantages, that might not be obvious from the get-go. For example, as we can see from the example above, it is trivial to subscribe a global concern such as metrics listener to ALL interesting events at once. With just a few lines of code we can suddenly enable tracking and graphing every interesting business metric that is modeled in code as a Ventable event. This is very powerful.
Another obvious advantage of this approach is that the code pieces relating to a particular piece of functionality can be placed inside classes implementing this functionality. In the above example, the code to post the comment to Twitter could live inside the TwitterService class, instead of inside some arbitrary controller.
Finally, it becomes very easy to see how the events are dispatched and glued together by reviewing the event_initializer.rb file, which we tend to keep in our config/initializers folder. Whenever you see a "handle_event" method anywhere, quickly open up event initializer and you can see what the other interested parties of this event are, what order they are being called, and whether they are executing inside a transaction.
Disadvantages
There are a few downsides to this approach, however. As software developers we should always look for the trade offs between solutions, and try to understand what we gain or lose with each implementation.
In this case, there are a couple I can think of:
It may not be trivial to figure out ALL of the actions that happen when a certain event fires. One must inspect event_initializer.rb in order to figure this out.
If firing some events causes other events to also fire (not recommended!), it may further complicate debugging the exact sequence of actions that happened.
If "around" blocks are used, such as in a transaction, nested events may further obscure what happens inside the outer or inner transaction boundary.
Having said that, our experience shows that a healthy mix of the Service design pattern and the events, provides the best-of-breed solution and achieving very modular approach to business logic modeling. It allows us to easily create new event types, and even more easily to configure any part of our app to be notified when the event fires.
Really Really Really Deleting SMF Service Instances on Illumos
We recently ran into a tricky situation with a custom SMF service we maintain on our Joyent SmartOS hosts. The namespace for the service instance (defined in upstream code) had changed, which meant that as our Chef automation upgraded the service instances to the latest code, we ended up with a lot of duplicate service instances that each had a unique namespace.
After wrestling with the best way to batch delete/reinstall the service (using Chef's knife cli), we found a way to improve our old process.
Normally, we would delete services with something like svccfg delete <service_name>, but this doesn't work well if you need to delete a number of services, especially if they have similar namespaces. Further, we found that running this in a loop against the output of svcs -a -H | grep <service_name> wasn't effective because service configurations could linger even after the service instance had been deleted.
Digging into man svccfg, we came up with a way to enumerate services and service configurations more cleanly with svccfg:
for service in $(svccfg list | grep nad); do sudo svcadm disable -s $service done for instance in $(svccfg list | grep nad); do sudo svccfg delete $instance done
A Cost-effective Approach to Scaling Event-based Data Collection and Analysis
With millions of people now using Wanelo across various platforms, collecting and analyzing user actions and events becomes a pretty fun problem to solve. While in most services user actions generate some aggregated records in database systems and keeping those actions non-aggregated is not explicitly required for the product itself, it is critical for other reasons such as user history, behavioral analytics, spam detection and ad hoc querying.
If we were to split this problem into two sub-problems, they would probably be “data collection" and “data aggregation and analysis."
UPDATE: please checkout the following presentation from Surge2013 Conference for another view into this project:
First question: Why don’t we use our relational database backend for this?
A popular approach to solving these problems is with the use of a relational database. As most people use this successfully up to a point, we can call this a proven approach. Web applications and APIs already have clients to connect to the database backends and can insert new records (data collection solved). The SQL already provides the perfect interface to the relational algebra that could be executed on the data for any purpose (data aggregation and analysis solved).
"Up to a point"?
Using an (open source) relational database is a very cost-effective solution, and can be scaled by optimizing configurations, isolating database servers, and queuing and batching writes. You can certainly buy yourself time on the data collection front with this approach. RDBMSes are highly adaptable for different use cases via configuration, and are multi-purpose.
But the data aggregation and analysis problem becomes more difficult as the traffic scales up. It grows so large that a single database server is unable to perform a simple aggregation in a reasonable amount of time. Doing any sort of optimization for the querying (e.g., indexing) means presorting and writing extra data, which increases disk space used as well as the write throughput necessary to commit these writes to the disk.
At this point, a better solution is required.
Let’s focus on collecting this data first!
Sometimes easy solutions can become complicated. Sometimes it’s good to take a step back and look for simpler answers. "Simplicity is complexity resolved," after all :)
So one day we said (as no doubt others who have faced similar challenges have said): Wait, event data never changes. One user comes and does action X on object Y at time T on platform Z, and no part of this information ever changes. We can look at user action logs as system logs. Appending writes to files is super cheap and most of the building blocks to manipulate files as needed are out there in the OS environment. These problems were solved many years ago, just for different purposes.
There are many specialized software packages out there (old and new) for log collection and many of these benefit from these properties above. Think Flume, syslog implementations, Fluentd and Scribe (last updated two years ago). All of these can address different aspects of the data collection problem at different levels of granularity and may be chosen and used per your requirements.
We ended up using Rsyslog, a syslog implementation, to collect logs from our application/API servers. The protocol is very easy and syslog clients can be configured to buffer data on the client side for intermittent network outages. We wrote a really simple (literally a few lines of code) client wrapper that serializes a user event with field-delimited format. A simplified version looks like this:
Our syslog server aggregates these event logs into huge field-delimited files, and logadm rotates these files into our NFS for storing them temporarily.
We just solved the data collection problem, hopefully for quite some time, by deploying Rsyslog along with a few lines of code. Note that syslog implementations have been around for years and are very stable, so the maintenance tasks involved are limited. Any point in the system (e.g., buffers) can be monitored, and as your data reliability and consistency requirements change over time, other solutions can replace the log collection.
OK. That was the easy part. Now how exactly do we analyze all this big data we hear so much about?
Science: It works, friends.
As you know, there is this area of computer science known as distributed computing, which Google has applied to a programming model for processing large data sets with a parallel distributed algorithm on a cluster of similar nodes, and called it Map/Reduce in 2004.
As this is a model, there are different frameworks implementing these concepts in different ways. One very popular open-source toolset is Hadoop. Another one we’ve been watching closely is Spark.
While the details of how Map/Reduce works are beyond the scope of this post, it might be good to mention that this programming model requires a different way of thinking, and can be cumbersome at first for developers who are not familiar with it. In addition, depending on the framework, Map/Reduce solutions for problems might require extra implementation around how the data will be manipulated and will be flowing between different phases of the model.
One important practical aspect of deploying a typical Map/Reduce system is deciding whether it should be “on demand" or permanently deployed. Each option has a trade-off: running a semi-permanent cluster of decent size provides immediacy of the querying, but is expensive. Creating an “on-demand" cluster is much cheaper, but requires extra time to move the data from storage to the cluster before any queries can begin.
If extreme efficiency and cost are important, the “on-demand" approach is the way to go. This probably applies to many startups, and certainly applied to us.
Manta to the rescue!
While we were thinking about what would be the most cost-effective way of storing, aggregating and running our analytics queries on these huge log files, we heard about Joyent’s Manta, and had the chance to be involved in the Beta program.
Manta is Joyent’s new cloud-based object storage system, which enables the storing and processing of data simultaneously, without the need to move data between storage and compute. It offers strongly consistent data semantics, closer to UNIX file system properties, and, most interestingly, a native compute facility that allows processing of the objects in a fully featured SmartMachine environment, in parallel across many nodes.
The fundamental technology behind a SmartMachine is the concept of zones, which was inspired by FreeBSD jails. In Manta, every phase of a Map/Reduce job gets its own zone and the only communication between one phase and the next is through its output and input. Different zones are never aware of other zones on the system.
Most Map/Reduce frameworks aim to schedule the map phase near the data, so as not to move data around as much (data locality). In Manta terms, this means the zone will get instantiated “on top" of the data.
Ah, good old UNIX commands
Having a compute environment that makes it possible to use good old UNIX commands like grep, awk, cut, sed, uniq and sort on the data that will be streamed in between the phases seamlessly makes writing Map/Reduce jobs a really simple task, even for people with no prior experience. It’s also worth remembering that these commands have been optimized over the last few decades.
In Manta, the objects stored in the system are identified by keys (or paths). When a Map/Reduce job is being scheduled, it is provided with some keys referring to the objects whose content will be streamed as inputs to the first phase of the job defined. When the last phase of the job completes, a number of output objects are created in the system.
The compute zones for each phase start in a few seconds and start streaming and processing data. There is no configuration necessary (unless something specific is required) and the compute zones are totally managed by the framework.
Our integration
As we have been rotating our field-delimited user action logs daily, a script uploads the latest file to Joyent object storage from our NFS with the same schedule. We also did a one-time upload of all the previous user action logs into the system.
Considering this setup, all of the user action data is in the storage, partitioned by date, waiting to be processed for any sort of analysis. The data will get updated daily, but obviously this frequency can be changed per the time requirements for the analysis.
Querying user actions
Before jumping into a real life scenario, it might be good to explain what is happening with a few simple examples.
Most developers have had some experience with log parsing using a handful of UNIX commands while debugging things. All of these apply for our user action logs as well. Considering the simplified user action log pattern,
What we are doing here is implementing basic relational algebra primitives with a couple commands. Whatever your power tool of choice, everything is possible :)
Manta provides a Map/Reduce framework that streams the contents of objects in the storage into the initial phase of a Map/Reduce job and then connects outputs of phases to inputs of next phases until the last phase. This allows these types of commands to execute in parallel across many objects with a single API call. The data collection and filtering concepts we are using are not new, but we think Manta provides a very simple and familiar interface to taking this type of data analysis to the next level in terms of simplicity and efficiency.
In Manta, there are two phase types: mapping and reducing, referring to the first two steps of the five-step Map/Reduce model.
A map phase will basically run for each of the objects passed, providing multiple outputs. A reduce phase would run either for the output of a preceding map phase or for all of the input objects providing a single output.
Given this setup, we would be able to come up with a simple job that has one map phase and one reduce phase:
If we provide last week’s user action logs (last 7 objects) to the Map/Reduce job above, we would come up with the total number of follow actions that happened on all platforms within the last week. Since the map phases will run in parallel, the performance of this query would be similar to running it on a single day of user action logs. In theory, the reduce phase grows linearly with the number of objects involved, but the time spent in the map phase will not increase as we add new objects.
Most analytical queries will require a single final reduce step, which may become a bottleneck, but most of the processing/filtering can be moved to the mapping phase (if possible), resulting in high parallelization.
In order to make things more clear, let’s take a look at a concrete example: how we calculate our retention metrics.
Example: cohort retention analysis
Cohort retention can be defined as the question of the following: Given a specific group of users registered, how many of these users came back and did something after some amount of time?
We will do this in two Map/Reduce jobs. The first one will count the size of the cohort and, as a side effect, will store the user ids of this cohort into an object. The second job will count how many of these users came back and did something within the given time window.
Cohort calculation:
# Map phase: awk -F'|' '{ if( $3 == “register_action” ) { print $1 } }' # Reduce phase: sort | \ uniq | \ mtee /wanelo/stor/tmp/cohort_user_ids | \ wc -l # Objects passed: a set of action logs in the date range # of the cohort we are considering
Retention calculation:
# Map phase: awk -F'|' '{ print $1 }' # Reduce phase: sort | \ uniq > period_uniq_ids && \ comm -12 period_uniq_ids /assets/wanelo/stor/tmp/cohort_user_ids | \ wc -l # Objects passed: a set of action logs in the date range # of the time window we would like to see the retention for
For us, generating cohort retention reports that span a large portion of our historical data requires going through billions (order of magnitude) of rows. Using the above described approach allows us to extract the results in under a couple of minutes.
Long story short, we got
No sampling, perfect significance, fully parallelized, 4 lines of UNIX commands that are passed as a configuration to our underlying Map/Reduce wrapper using a Manta client.
Neat.
We also came up with our own domain-specific helper methods that generate these types of queries and poll Manta for results. Our internal metrics use this API heavily and report aggregated results into our time series DB. Then our dashboards use this time series DB for graphing aggregated metrics.
Conclusions!
In our case for storing and analyzing user activity data, taking a file-based approach to data collection was more appropriate than a database-centric approach, knowing that we would be able to express many of our analytics queries using UNIX commands on the user action logs. Taking advantage of proven technologies such as syslog provided a very cost-effective data collection solution.
Running these queries on Joyent’s Manta naturally extended this method to work across many objects in parallel, dramatically shortening the time required for them.
Manta was instrumental, as it offloaded the management of the data flow for Map/Reduce, simplifying our job for streaming data in and out of the computation.
Manta also provided a compute environment that is as friendly as a local environment, greatly simplifying operations and making everything intuitive. This removed some of the barriers to quickly start working with parallelized computations, but also did not limit us from uploading custom Map/Reduce scripts or binaries to do more complex or efficient computations.
At the end of the day, we ended up with a powerful internal analytics framework that’s easy to learn, maintain and extend.
As Manta was just announced a few days ago, we wanted to share some of our experience and insights on how we were able to utilize it in production. We hope this was useful for anyone currently solving similar problems or evaluating their options.
Please feel free to reach out to us with questions or comments!
We recently gave a talk at the SFRoR Meetup here in San Francisco about how we scaled this rails app to 200K RPM in six months. There were a lot of excellent questions at the meetup, and so we decided to put the slides up on SlideShare.
Without further ado, here it is. Feedback and comments are always welcome.
Scaling Wanelo.com 100x in Six Months
by Konstantin Gredeskoul and Eric Saxby
High Read/Write Performance PostgreSQL 9.2 and Joyent Cloud
At Wanelo we are pretty ardent fans of PostgreSQL database server, but try not to be dogmatic about it.
I have personally used PostgreSQL since version 7.4, dating back to some time in 2003 or 4. I was always impressed with how easy it was to get PostgreSQL installed on a UNIX system, how quick it was to configure (only two config files to edit), and how simple it was to create and authenticate users.
Of course I also played with MySQL back then, and always found it a bit wanting. Its user authentication and password configuration made little sense to me, and I found myself constantly having to look it up just to start using the database. MySQL won the battle during those early days by providing a native Windows one-click installer and good performance out of the box, but it seems to have lost ground in recent years as a growing number of high-profile companies (notably Heroku and Instagram) adopt PostgreSQL, and as fear and loathing over the fuzziness of MySQL's open source nature and Oracle's involvement increases.
Traditionally, PostgreSQL has had a richer feature set, better stability and higher scalability, but slightly poorer performance. I don't believe the performance difference is anywhere near what it used to be, if at all, but doing yet another benchmark is beyond the scope of this post. We the developers get to choose our tools from time to time, and my open source database of choice has always been PostgreSQL.
Wanelo is currently hosted on a high-performance cloud -- Joyent Cloud, something we've mentioned in previous posts. Joyent Cloud comes with close to native hardware RAID I/O performance, and as Wanelo has been scaling up rapidly, we've been taking advantage of that high I/O throughput. Even with that advantage though we recently vertically sharded our database in order to spread the write load across more than one RAID server.
In this post, I'll go over some of our settings in postgresql.conf, which have been adjusted for high-performance/throughput and large RAM sizes. I would like to credit Josh Berkus and his PGExperts consultancy for providing us with timely and necessary assistance in tuning PostgreSQL these last few months.
postgresql.conf
You can grab a gist of our postgresql.conf file here.
Additionally, you can review our open source Chef cookbook for installing PostgreSQL on Joyent Cloud by compiling it from sources, which comes with some of those settings by default.
Finally, this is an excellent resource for tuning PostgreSQL performance.
RAM
As we are running most of our databases on 80GB instances, we set shared_buffers to 12GB. We did this because PostgreSQL takes great advantage of file system caches. The related parameter effective_cache_size tells the PostgreSQL query planner exactly how much RAM is available for caching. On Joyent this cache is provided by the extremely efficient ZFS ARC cache.
We've set work_mem to a somewhat large value of 65MB (per process) so that our SQL sorts don't go to disk.
IO
We've set asynchronous commit to off, so that we can buffer several writes together. Commit delay is set to 100 microseconds, so that any commits arriving within that time are buffered and synced together. This provides benefit mostly during peak times, when write volume is very high. During other times it simply delays each commit by 100 microseconds, which is acceptable for us.
Please let us know if you have any questions about any of the values in this config file. Discussion welcome :)
This past weekend a number of us were focused on a really important annual prime time television event (the Puppy Bowl, of course). Turns out other people out there were watching some other sporting event, which leads to the rest of this story.
At Wanelo we care a lot about the positive experience of our users, and central to a great user experience is a speedy site to use. While we track a plethora of metrics and use a number of alerting tools to ensure that we know when anything untoward is happening, lately we've grown enamored of Circonus and its ability to alert on derivatives -- the sort of thing that is extremely difficult if not impossible to do in Nagios.
Super Bowl Sunday afternoon, we were unpleasantly surprised to receive an alert that the rate of product saves had suddenly dropped. Usually this foretells some disaster in the infrastructure, as the ability to post, save and resave products is one of the most critical aspects of Wanelo. Here's what we were seeing:
That big dip in the middle was the cause of our alerts.
Checking the site and our other monitors, everything seemed to be working correctly. Yesterday morning we realized that this timing aligns perfectly with events at the Super Bowl. Our alerts went off the when the halftime show began, followed by a steep incline when the power went out at the stadium and product saving ramped back up.
Wanelo's recent surge in popularity rewarded our engineers with a healthy stream of scaling problems to solve.
Among the many performance initiatives launched over the last few weeks, vertical sharding has been the most impactful and interesting so far.
By vertical sharding we mean a process of increasing application scalability by separating out some number of tables from the main database and into a dedicated database instance to spread both read and write load.
Vertical sharding is often contrasted with "horizontal" sharding, where higher scalability is achieved by adding servers with identical schema to host a slice of the available data.
Horizontal sharding is generally a great long-term solution if the architecture supports it, but vertical sharding can often be done quicker and can buy you some time to implement a longer-term redesign.
To the Limit
Under high application load, there is a physical limit on how many writes a single database server can take per second. Of course it depends on the type of RAID, file system and so on, but regardless, there is a hard limit.
Reads are somewhat easier to scale -- multiple levels of caching and spreading reads to database replicas form a familiar scaling strategy for read-heavy sites. Writes, however, are a whole different story.
When a database is falling behind because an application makes too many transaction commits per second, it typically starts queuing up incoming requests, and subsequently slows down the average latency of web requests and the application. If the write load is sufficiently high, then read-only replicas may have trouble catching up while also serving a high volume of read requests.
We noticed that the replication lag between our read replicas and the master was often hovering at large numbers (in hundreds of MBs or even several GBs of data). This manifested to users as strange errors, when rows created on the master could not be found during subsequent requests. 404 pages would be returned for records just created. Even worse, related records created in after_create callbacks would be missing even when the parent record was found, causing application errors.
This graph below shows a clear correlation between the number of errors on the site (red line) and replication lag (blue and purple areas). Ack.
Splitting reads and writes made our database layer a distributed system, putting us in CAP theorem territory. While this improved our availability, we realized that we now had to worry about consistency and partition tolerance.
In more practical terms, we needed to reduce the write load on the master to allow our replicas to catch up, especially during peak traffic.
Finding It
We looked at where all the writes were coming from using the amazing pg_stat_statements PostgreSQL library, and it was one of two tables in our schema receiving upwards of 150 inserts per second (our database was doing about 4K commits per second at the time, which can be deduced by comparing xact_commit values in the pg_stat_databases view).
The graph below shows the day-over-day growth of reads on one of the largest tables in our database, and the one we moved out.
This read- and write-heavy table was rather large in size, and also had four indexes on it. For every insert, all four indexes needed to be updated. This meant that PostgreSQL was actually doing more like 500 inserts per second for this ActiveRecord model, if you count each index as a separate table.
Our day-over-day growth projected over the rate of inserts was not sustainable for a database also handling every other type of read and write operation for our application.
So once we identified this table as the one we wanted to split, we put together the following plan.
Doin' It
Go through our application code (Rails 3.2) and replace any joins involving this table with a helper method on that model.
Each helper method would assume that the table "lived" in its own database, and so queries would be broken up into two or sometimes three separate fetches based on the ActiveRecord model's database connection.
Add an "establish_connection" call at the top of this ActiveRecord model, to connect it to the dedicated database defined in database.yml (even in development, and tests).
Go through the app and fix all the broken tests :)
Team members Atasay and Kaan pair-programmed over the weekend and knocked out most of the required changes. Once all of the tests were passing with a 2-database configuration, we felt validated that this approach was working, and started thinking about the deployment.
Deployinating It
Here the "a-ha" moment came when we realized that one of the replicas for our main schema could be promoted to be the master database of the new schema.
There were five steps.
Configure a live streaming replica of the current master, to be used for the new table exclusively.
Take the site down for about ten minutes of planned downtime. Sean ensured our down page had a working kitten cam.
Promote the replica into a master role, so that it can receive writes. It was now master for the sharded table.
Deploy the new code.
Bring the site live.
With some followup:
Configure a streaming replica for the sharded database.
Delete unused tables in the sharded database.
Delete the sharded table from the main database.
Measuring It
After the deploy we discovered that the new master database needed to have analyze run on the table for it to perform adequately, although it was also just warming up the filesystem ARC cache. After that initial warming period, the site hummed along as usual and the next day we were greeted by dramatically dropped I/O on all databases involved, a much faster website latency, and more blissfully obsessed users than ever.
In this graph below of virtual filesystem reads and writes on our master database, you can clearly see where the the sharding happened. There is a dramatic drop in both reads and writes.
While this provides us with some room to grow, we know that sharding this large table horizontally is just around the corner.
Understanding It
Some of the reasons why vertical sharding works may be obvious, but some may be less so:
Writes are now balanced between two servers.
Fewer writes to each database means that there is less data streaming to read-replicas. These now have no issues catching up to the master.
A smaller database means that a larger percentage of the database fits in RAM, reducing filesystem I/O. This makes reads less expensive.
Filesystem I/O can be cached in the ARC more efficiently, reducing physical disk I/O. This also makes reads less expensive.
Database query caching is now tuned to the load of each database. Radically different access patterns on a single database causes cache eviction.
Thinking About It
As we keep growing, this table is destined to become a standalone web service, behind a clean JSON API which will provide an abstraction above its (future) horizontally sharded implementation. Who knows what data store it will use then. We're big fans of PostgreSQL, but that's the beauty of using APIs -- whether it's PostgreSQL, Redis, Cassandra or even a filesystem datastore, the API can stay the same. Today we made a small step toward this architecture.
Feel free to leave a comment with questions or suggestions.
Endnotes
We use PostgreSQL 9.2.2 and are happily hosted on the Joyent Public Cloud. We run on Rails 3.2 and Ruby 1.9.3.
For splitting database reads and writes to read-replicas, we are using Makara (TaskRabbit's open-sourced Ruby gem), which we forked for use with PostgreSQL.
The Big Switch: How We Rebuilt Wanelo from Scratch and Lived to Tell About It
Originally published here on 14 Sep 2012.
The Wanelo you see today is a completely different website than the one that existed a few months ago. It’s been rewritten and rebuilt from the ground up, as part of a process that took about two months. We thought we’d share the details of what we did and what we learned, in case someone out there ever finds themselves in a similar situation, weighing the risks of either working with a legacy stack or going full steam ahead with a rewrite.
In the Beginning..
Back in February, Wanelo was a fast-growing service with some ecstatic members and a lot of promise. The technology stack that powered Wanelo 1.0 was a pretty typical Java-based one: Sun's Java 1.5 under Hibernate, Spring and Struts 2 (formerly WebWork), all running atop Tomcat and MySQL 5. At the time, this choice was a big step forward, especially compared to the bloated alternative of J2EE.
The original stack served its purpose, enabling the prototyping of various features and allowing Wanelo to gain traction among users. As Java is pretty well optimized on the server, it also offered many reasonable scaling options.
New Team
As soon as the new engineering team had assembled, the question of whether or not we wanted to keep the Java stack was a heated topic of discussion. Most of us were familiar with Java and Java-based web frameworks. So much advice around the web leaned toward not rewriting the existing software, but rather extending it incrementally or simply embracing the old platform and "fixing it."
We were well aware though that since Wanelo was (and still is) a young and unproven startup, the most important objective of the technology powering it was the ability to move forward as rapidly as possible and learn, without compromising the future integrity of the underlying platform.
So in choosing our new stack, we were hoping to harness the immense productivity gains promised by the more modern options, in particular those offered by the dynamic languages.
New Stack
Thus, against the tide, and with many reservations, we made the decision to do a complete rewrite.
We chose Ruby as the language, and Ruby on Rails as the underlying framework. The vibrant ecosystem of open-source projects around Ruby was a strong motivating factor, as was our team's prior experience building successful Ruby-based web apps. We were also keen to switch from MySQL to PostgreSQL, for many well-publicized reasons.
From Zero to Launch
Today's Wanelo is the result of a two-month-long rewrite, and the migration of its data from MySQL to PostgreSQL.
Since that initial launch, we've been adding and extending features daily and deploying continuously. The growth of the Wanelo community has continued unabated, and we managed to launch a highly successful iOS app on top of our API layer -- all with a team of about 11 people (not all engineers).
Now that it seems to safe to say that our unorthodox approach was validated, we wanted to highlight some of the key data points of this switch.
First, Some Stats
The original Wanelo Java app had about 70k lines of code (100k lines total if you include JSP and XML files). Not all of the code was in use, and there were no automated tests or deployment scripts.
The new Wanelo Ruby app had about 7k lines of code at launch (Ruby, CoffeeScript, Haml, SCSS), and another 5k lines of tests. We’ve since added many new features, and are now at just 11k lines of code and about 8k lines of test code, for a ratio of 1:0.7. So still under 20k lines of code after four months of development.
Our new application is built atop Ruby 1.9.3, Rails 3.2, Devise, CoffeeScript, Compass, Haml and SCSS, and it uses PostgreSQL 9.1, Redis and Solr for search. We use minitest, Jasmine and Capybara for testing, CarrierWave for image handling, RABL for API generation, SideKiq for background jobs, Chef for provisioning new boxes on the cloud and Capistrano for deploying our code. We run our environment on Joyent Cloud, store images on Amazon S3, and use Fastly and CloudFront as our image CDN.
Database Layer
The Java app relied on 53 database tables, some of which were still MyISAM, and some of which were InnoDB. Only about 10 of them had a lot of data. The database server performed significantly better after an upgrade to Percona Server, but was still showing a load average of about 4-5 during busy times.
The new PostgreSQL schema contained only 13 database tables upon launch, as we determined most of the MySQL tables weren't in active use. After launch, we had to add some missing indices on tables, and after that the load on the server settled to about 1.5-2 during high-traffic times.
We spent a fair amount of time deciding how to migrate the data, and writing data migrations. Here’s how that went down.
The Great Migration
Early on we chose the following approach for our data migration strategy:
We would use an existing tool to migrate the schema mostly “as is” from MySQL to PostgreSQL. We'll call this PostgreSQL schema containing the old MySQL tables the "legacy schema.”
Configure the legacy schema in database.yml to belong to the "legacy" Rails environment. Use host: localhost to prevent rake from dropping/recreating this schema automatically (only non-remote schemas are dropped automatically).
Create a project directory called db/legacy_migrations and add appropriate rake tasks to migrate the database schema using migrations inside that directory. This rake task was called "db:legacy:migrate" instead of the regular "db:migrate."
Start developing the new Ruby application, and while adding the new schema migrations for the new models, also add the legacy migrations to move the data from the old tables to the new. As we were pair programming, one pair could work on the new feature and a new model, while the other pair worked on migrating the data.
One early finding we bumped into was that the old tables and new tables would have to live in the same PostgreSQL schema so that we could easily copy/transform data using SQL between tables. If we put the legacy tables in another database or schema, it was not as easy to move data between them. This might be a limitation of PostgreSQL. Luckily, our Java application used singular names of entities (such as “user”) while the Rails app uses plural. So we could keep the same names and go from the "user" table to "users,” within the same database schema.
The second challenge was the actual transfer of data from MySQL to PostgreSQL as is. This proved to be no small feat, as we ended up using a custom-built project with a few rake tasks, on top of the mysql2psql gem. In that mini-project, we used the gem to export from MySQL to *.sql files ready for import into PostgreSQL, but we had to munge some of them first to make them work with Pg. For example, the MySQL ENUM column type was exported as VARCHAR(0), and we had to fix that because Pg did not accept that data type. Similarly, bit(1) columns were not properly exported and had to be converted prior to export into regular integer columns. Luckily, there weren't that many special cases we had to deal with, and mysql2psql provided most of the functionality out of the box.
We also found that the gem appears to have an exponentially increasing performance penalty based on the number of rows exported. So we wrote a rake task that split all rows in the largest table into chunks of 100k, and then started one export per chunk in parallel, using multiple processes. This allowed us to eventually export data from MySQL into a Pg-acceptable SQL file very quickly (under 30 minutes for a 5GB InnoDB file). Our initial attempts to do so took over 20 hours.
The exported files used PostgreSQL's "COPY" command, which is very fast. So the import of the files took a significantly shorter amount of time -- about 15 minutes tops. Now we just had to run the legacy migrations to populate our schema.
Our switch from from the old stack to the new one took place on June 27, with four hours of downtime (we could have done it faster, but we decided that four hours was acceptable for such a large migration). There were six steps:
Shut down the old site, and fire up a live video chat on the placeholder page to keep visitors entertained :)
Export from MySQL to files, then import those files to PostgreSQL (30 minutes), using several parallel processes and threads working at the same time on different tables.
Run the schema migrations (10 minutes) on the new schema.
Run the legacy data migrations (1-2 hours), moving data from old tables to the new.
Bring the new app up internally and do a quick smoke test.
Bring the new app live to the world.
Despite the enormous risk with this type of change, our launch was relatively uneventful. We fixed a few bugs throughout the day, and added a few redirects we had forgotten. But now we were running on a new streamlined platform, highly optimized for rapid development, and built atop the latest version of Ruby on Rails. We were ready for the next phase.
Conclusion
There are several reasons why a complete rewrite worked out in our case. I've been thinking about it a lot, and I think it comes down to the following:
The old Java codebase was large and difficult to navigate, yet the MVP feature set for Wanelo was relatively small and had an early estimate of 2 months of work with 3 developer pairs.
The lack of automated tests in the Java codebase gave us no confidence that by incrementally changing the existing software we would not break anything.
Conversely, using test-driven development in the new codebase gave us a lot of confidence about delivered and accepted features.
There were only ~10 database tables needing data migration to the new schema. As with the source code, the old schema contained many tables and/or columns that were no longer used, and so migrating the data allowed us to clean things up.
It was critical to augment the early engineering team with several experienced Rubyists, who were instrumental in getting the initial set of tools and processes to work together.
As the entire team was new, it seemed right to get everyone involved with the new codebase early on, instead of spending cycles learning the old one.
We were and still are very lucky to have an absurdly brilliant team.
Hope this post helps you make the right decision in your situation, if you ever find yourself faced with the question of to rewrite or not to rewrite. I hope we've proven that in some cases a rewrite is the right choice, but it really depends on many factors.
Stay tuned for more juicy tidbits about Wanelo’s technology and team on this blog.