Building AdServer in AWS Cloud
Few weeks ago a friend of mine asked me if I could help him with building an ad serving server. Of course I asked why - there are so many options available? He answered that with their volume (couple of billions impressions a month) he believes they can find a cheaper solutions with few extra features they might need. And they were right. In this post I would like to talk about my learnings from this really interesting project. After few weekends and evenings the system is up and running and serving couple of millions ads a day.
Initially there were couple of hard requirements - I didn't initially question them as I had zero knowledge about the business, but as I learned along the way, they really weren't that critical. But they had significant impact on the architecture.
My lesson #1: Make sure you understand the business in deep so that you can architect the system appropriately.
Low latency: The request has to be served in less than 100ms
Geo look up: The ads have to be customized based on the geo (reverse IP lookup)
High availability: System has to be up 99.99% of time
Written in NodeJS - for the high throughput
Understanding what is available and choosing AWS
There are many technologies available, open sourced, paid, in the cloud, on premise, etc.. I chose to go with AWS (http://aws.amazon.com/) - they have many sweet features, scaleable infrastructure and you can use them as building blocks. (And I understand it a bit).
Route53 - best (and most reliable) DNS manager available with great feature "latency based routing". If selected, they always route the request to your closest deployment (AWS region) which reduces network latency.
Elastic Load Balancers, Autoscaling Groups, Reactive Autoscaling - it's fairly easy to configure your system in AWS to scale up and down based on your metrics. Elastic Load Balancers can distribute traffic across all running instances; autoscaling groups will add / remove instances based on current needs.
Simple Storage Service (S3) - I personally call it high latency key/value data store and in my system it completely removed need to a database (with an exception of reporting database).
ElastiCache - Memcahed as a service - it just works and you just need to worry about the host name.
Simple Queuing Service - excellent service for storing data before you have a chance (or resource) to read and process them.
ElasticBeanstalk - deployment service. Very simple one, but if it works for you, I wouldn't go with anything more complex.
And the last feature - it all works together - you'll save A LOT OF time.
I would recommend checking each service out - they have many useful features and I have been using only fraction of them.
I was nicely surprised that the architecture I originally designed is still the same that the production service is running on. No major adjustments had to be done.
The AdServer deployment consists of Elastic Load Balancer, multiple WebServers managed by autoscaling group, backed by ElastiCache (Memcached) servers for storing the temporary state.
The Ads are programmed via Admin interface, that is deployed in one region only and Ads configurations (JSON files) are stored in S3 that each AdServer synchs with every 5 minutes. The banners are uploaded via Admin, stored to S3 and distributed via CloudFront (CDN by AWS).
All traffic events (impressions, clicks, etc.) are stored in Simple Queue Service (SQS) and these are regularly processed by Analytics engine that is deployed in one region only and is scaling up and down based on the queue size. Processed (reduced) data is stored in MySql where they are available for marketers via Admin interface to view the stats, CTRs, Impressions, by country, device, etc. (Sky is the limit:)).
One central service - Monitor - is used for registering and deregistering the WebServers (registration assigns unique ID, based on the WebServer regions provides ElastiCache Url, SQS endpoint etc.). Also, all the webservers are publishing their "hearbeat" to the monitor with basic stats (number of requests, etc.). Monitor has a simple UI where I can see all running instances per region etc.
The AdServers are deployed in multiple AWS regions, are completely stateless and are scaling up and down based on the requests latency. After some testing I figured out this is best metric to represent "how easily the system can "breathe" and serve requests". This can be easily configured in ElasticBeanstalk - when latency is higher than 200ms, the system will add a new instance, if it's lower than 50ms, it will remove one instance. I optimized the server startup time to 3mins, so the whole system is really "elastic".
My learning from architecting system in the cloud:
Make sure your servers are stateless and can startup automatically (ideally via autoscaling group). I regularly terminate EC2 instance just to confirm it works.
For stateful services (Databases etc.) use something that comes in a box - S3, Hosted MySQL, DynamoDB. You don't have to manage these services and typically many other customers use them too - the vendor (AWS) have to take care of us and guarantees the reliability.
Be distributed by default - make sure your system is ready for multiregional deployment, understand where your bottlenecks and resiliency gaps are and be ready for their failures. Understand your failure scenarios and automate those (i.e. AWS region is down - make sure Route53 knows where to send failover traffic).
When a request comes to the domain, it's routed by Route53 to the closest deployment (i.e. request from New York is routed to Virginia AWS datacenter, from San Francisco is routed to Oregon AWS datacenter). Then Elastic Load Balancer will pick one instance of WebServers in autoscaling group and send there the request. A request is processed and returned to customer.
Now comes the beauty of NodeJS and Javascript - all events tracking and reporting is done asynchronously so it's not blocking the request.
Once the request is processed and returned to the user, data is logged in SQS, Analytics engine will pick it up, aggregate the data and store it into MySQL DB.
If there is a need for storing any request data (i.e. which banners the user with this IP has seen), Memcached will be used. If data expires in Memcached, best recommended banner will be shown.
My learning from processing requests - if anything can timeout, go wrong, etc., make sure you fail fast and return default response. For example I wrote a wrapper around Memcached client - if the request is not returned in less than 10ms, best banner will be calculated.
It's hard to run a service if you can't tell if it's running. I am using three different groups of metrics:
External metrics - Banners will show up, click through works, requests latency are within XYZ ms. Best tools - manual tests, Pingdom, scripts
Business metrics - number of requests processed, number of impressions, clicks, CTRs, daily volume of ads, etc. We chose to build our own stuff and implement couple of neat features to support marketers in doing their job.
System metrics - CPU utilization, number of running instances, number of requests, number of errors, type of errors, etc. We are using our own Monitor dashboard that tracks all running instances, requests per "zones (Ads placement on a specific page)" that can alert us if there is a drop. For the server and deployment metrics we are using CloudWatch for basic stuff and New Relic for detailed level metrics - also here you can nicely set up alerts.
My learning from running the system:
If you can document it, automate it - Are you checking the dashboard every day? Define an alert for that. Are you worried about traffic drops? Start capturing the metric and have an alert on that?
Remove no-longer relevant alerts and metrics - prevent metrics overload, make sure you can be laser focused on the metrics that matters.
Start with business metrics - there is no reason to run a system for the sake of running a system. Understand the business metrics (CTRs, sales, etc.) and align your technical metrics (latency, availability, etc.) with them.
Identify and build workarounds for failing dependencies
There are probably many dependencies in every system (DB, Cache, DNS, ...). Continuously identify those dependencies and design your system in a way that it can still function without them - even in degraded mode (if that is acceptable).
For example in my system, we had a hard dependency on Geo Lookup - it was slowing the startup, sometimes failing startup of an instances. I rewrote the Geo Lookup, moved it into a separate system (that's not on the schema yet) and if the system is not available, we are just not Geo targeting Ads, but still fully serving them (and of course we get an alert that the service is down).
"Aim for success, not for perfection"
But always have your eyes open to learn how to make your system better. So if you have any feedback, thoughts, ideas, please leave me a comment.