David is coding @davidiscoding-blog - Tumblr Blog

Elasticsearch on AWS - quick and easy

I fell in love with Elasticsearch! In this blog post I would like to share my experience with building an Elasticsearch cluster on AWS and give you step-by-step instructions on how you can do the same. You can learn everything about Elasticsearch at https://www.elastic.co/.

Getting started - prepare AWS

Create a security group - es_cluster - in the region where you want to deploy Elasticsearch.

To be able to reach the instances in this security group, make sure it allows SSH and port 9200 from your IP address. To keep it simple I usually add a rule for all TCP traffic.

Create a key pair es_keys.pem that you will use to access the instances.

Launch instances

You will need to repeat these steps once for every instance you want to launch in your cluster.

Launch instance - go to AWS Console -> EC2 -> Launch Instances

Choose Ubuntu Server 14.04 LTS (HVM), SSD Volume Type - ami-d05e75b8

For the instance type I would recommend either the r3.x or the i2.x family

Make sure you have enough SSD capacity - how much depends on your needs. I would go for at least tens of GBs.

Note: It's better to use an instance with its default (local) SSD storage rather than EBS - EBS is accessed over the network, which impacts performance

Make sure you choose es_cluster security group

Make sure you select es_keys.pem

Click Launch

Install and start Elasticsearch - wait a minute so that the instance can start properly, and then:

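SSH into your instance with: ssh -i es_keys.pem ubuntu@ec2-xx-xxx-xx-xx.compute-1.amazonaws.com (ubuntu is the default user on Ubuntu AMIs; replace the host with the public DNS name of your instance)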

Create an init.sh script - e.g. using vim init.sh - and fill it with the following. Make sure you update the values specific to you: your AWS keys, the region where you are running this, and the max heap size (set it to half of the size of your RAM).

sudo apt-get update
sudo apt-get install openjdk-7-jre-headless --yes
sudo apt-get install unzip --yes
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.2.zip
unzip elasticsearch-1.4.2.zip
cd elasticsearch-1.4.2/
./bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.1
bin/plugin -install lukas-vlcek/bigdesk
echo '
cloud:
  aws:
    access_key: YOUR_ACCESS_KEY
    secret_key: YOUR_PRIVATE_KEY
    region: us-east-1
discovery:
  type: ec2
discovery.ec2.groups: es_cluster
' > config/elasticsearch.yml
export ES_HEAP_SIZE=6000m
./bin/elasticsearch -d
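If you don't want to work out the heap size by hand, a small Node script can do the "half of your RAM" arithmetic. This is only a sketch: heapSizeFor is a name I made up for illustration, and you would still paste the printed line into init.sh yourself.

```javascript
// Sketch: derive ES_HEAP_SIZE as half of the machine's RAM.
// heapSizeFor is a hypothetical helper, not part of Elasticsearch.
const os = require('os');

// Returns a value such as "6000m" for a given amount of RAM in bytes.
function heapSizeFor(totalBytes) {
  const totalMb = Math.floor(totalBytes / (1024 * 1024));
  return Math.floor(totalMb / 2) + 'm';
}

// Prints the export line for the machine this script runs on.
console.log('export ES_HEAP_SIZE=' + heapSizeFor(os.totalmem()));
```

Run it on the instance itself so that os.totalmem() reports the RAM of the machine that will host Elasticsearch.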

Save the file

Run it with: sh init.sh

Congratulations - you are running Elasticsearch

Make sure you add more instances to your cluster for resiliency and better performance. You can now navigate to http://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:9200/_plugin/bigdesk and see the status of your cluster.
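If you prefer a script over the bigdesk UI, the cluster health endpoint (/_cluster/health) gives you the same basic picture. Below is a minimal sketch using only Node's built-in http module; describeHealth and checkCluster are names I made up, and the host is whatever public DNS name your instance has.

```javascript
const http = require('http');

// Turns a parsed /_cluster/health response into a one-line summary.
function describeHealth(health) {
  return health.cluster_name + ': ' + health.status +
    ' (' + health.number_of_nodes + ' nodes)';
}

// Fetches cluster health from one node. host is a placeholder you supply.
function checkCluster(host, callback) {
  http.get({ host: host, port: 9200, path: '/_cluster/health' }, function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      var summary;
      try {
        summary = describeHealth(JSON.parse(body));
      } catch (err) {
        return callback(err);
      }
      callback(null, summary);
    });
  }).on('error', callback);
}
```

Calling checkCluster('ec2-xx-xxx-xx-xx.compute-1.amazonaws.com', console.log) after each new instance joins is a quick way to confirm that the node count went up.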

Now you need to make sure you can access your cluster from other instances - either add them to the es_cluster security group or allow their security group in the inbound rules of es_cluster.

What to do next?

Of course, now you want to upload some data or run some queries. In the next post, I will give you NodeJS examples of how to do exactly that!

Also, we will consider various cluster setups to make sure you optimize its performance - we will discuss search nodes, data nodes and master nodes.

Realtime Big Data with AWS Kinesis

As mentioned in my previous post about building an AdServer in the AWS cloud - when you are running a larger distributed system it's very important not to be blind when it comes to understanding the performance of your service - both business and technical.

Where are these extra requests coming from?

While we had a solid understanding of the daily business performance - CTRs, top performing banners, etc. - after a while I found that the server load was higher than expected and that it didn't correlate with the number of impressions / clicks. I was seeing about 5 times more requests to the servers than impressions. If you are getting 5-15 million impressions a day, these extra requests make up a significant portion of your AWS bill.

Getting realtime visibility

There are many ways to get to the root cause of this problem - analyze your logs or introduce better logging to gain more visibility. I decided to experiment with AWS Kinesis - a "fully managed service for real-time processing of streaming data at massive scale". There were two primary reasons for this: a) I would see a detailed breakdown of the traffic, and b) I could build interesting real-time reporting capabilities for the marketing team.

What I decided to do:

Send every request with its context (type, IP address, country, banner URL, ...) to AWS Kinesis.

Build an in-memory aggregation engine that provides a one-hour sliding window view of the realtime data.

Expose this data via a nice web UI.

Connecting the firehose

As the first step you need to instrument your code to send every request to Kinesis. As my NodeJS code already had a module for tracking every request into SQS for further processing, it was very simple to add one line to publish into Kinesis as well.

You can write any arbitrary data into Kinesis - the only limit is the record size of 50 KB. In my case I just stringified my object and base64-encoded it. Then I modified my setTrackingValue function - added writeToKinesis(logItem) (logItem is the object that holds all tracking data).
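To give an idea of what such a function can look like, here is a rough sketch rather than my production code. It assumes an aws-sdk (v2 style) Kinesis client with its putRecord(params, callback) call. The stream name, the choice of the IP address as partition key and the extra kinesis argument are assumptions made for the illustration.

```javascript
// Hypothetical values: pick your own stream name and partition key.
var STREAM_NAME = 'adserver-requests';
var MAX_RECORD_BYTES = 50 * 1024; // the per-record limit mentioned above

// Stringify the tracking object and base64-encode it.
function encodeLogItem(logItem) {
  return Buffer.from(JSON.stringify(logItem), 'utf8').toString('base64');
}

// kinesis is an AWS.Kinesis client (aws-sdk v2 style), passed in by the caller.
function writeToKinesis(kinesis, logItem, callback) {
  var data = encodeLogItem(logItem);
  if (Buffer.byteLength(data) > MAX_RECORD_BYTES) {
    return callback(new Error('record too large for Kinesis'));
  }
  kinesis.putRecord({
    StreamName: STREAM_NAME,
    PartitionKey: String(logItem.ip || 'unknown'),
    Data: data
  }, callback);
}
```

Passing the client in as an argument keeps the function easy to try out with a fake client before pointing it at a real stream.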

Once I redeployed the AdServer via ElasticBeanstalk and switched the DNS routing, it started to work and I saw my data flowing into Kinesis.

Kinesis configuration: I am using one stream with one shard - the 1 MB/s write limit per shard is currently more than enough.

Aggregating data

Now that the data is in Kinesis, it will stay there for 24 hours. So let's make sure we read it and use it.

In my code I am continuously reading the data from the stream and recording it into an in-memory structure where it is aggregated per minute.

Here is a simple code example of data aggregation per minute:

Realtime.prototype.readStream = function (streamId, shardIteratorId) {
  var self = this;
  var params = { ShardIterator: shardIteratorId, Limit: 1000 };

  self.kinesis.getRecords(params, function (err, data) {
    if (err) {
      // an error occurred; note that reading stops here
      console.log(err, err.stack);
      return;
    }
    var records = data['Records'];
    for (var i = 0; i < records.length; i++) {
      try {
        // records were written as base64-encoded JSON
        // (Buffer.from replaces the deprecated new Buffer)
        var js = JSON.parse(Buffer.from(records[i]['Data'] || '', 'base64').toString('utf8'));
        self.recordData(js);
      } catch (e) {
        console.log('skipping unreadable record', e);
      }
    }
    // keep reading with the iterator returned by Kinesis
    self.readStream(streamId, data['NextShardIterator']);
  });
};

Realtime.prototype.recordData = function (js) {
  var self = this;
  try {
    // getting data from the json
    var minute = (new Date(js['time'])).getMinutes();
    var action = js['action'];
    var zone_id = js['ad']['zone_id'];
    var imageUrl = '';
    var country = '';
    var device = '';

    // for action 'banner' or 'click' get extra details
    // (only the zone is used by the aggregation in this example)
    if (action == 'banner' || action == 'click') {
      imageUrl = js['ad']['imageUrl'];
      country = js['ad']['trackingdata']['country'];
      device = js['ad']['trackingdata']['device'];
      zone_id = js['ad']['trackingdata']['zone'];
    }

    // create the minute and zone buckets on first use, then count the action
    if (!self.hourly[minute]) {
      self.hourly[minute] = {};
    }
    if (!self.hourly[minute][zone_id]) {
      self.hourly[minute][zone_id] = {};
    }
    self.hourly[minute][zone_id][action] = (self.hourly[minute][zone_id][action] || 0) + 1;
  } catch (e) {
    // malformed records are skipped, but leave a trace in the log
    console.log('could not record data', e);
  }
};

Once you start reading the stream, it will keep aggregating the data (per minute/zone/action).
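The example above does not show how old minutes leave the one-hour window. One possible way to handle that, sketched here with made-up function names, is to key the buckets by minutes since the Unix epoch instead of the minute of the hour and to drop everything older than 60 minutes.

```javascript
var WINDOW_MINUTES = 60;

// Buckets are keyed by minutes since the Unix epoch rather than minute-of-hour,
// so data from the previous hour is never mixed with the current one.
function epochMinute(timeMs) {
  return Math.floor(timeMs / 60000);
}

// Counts one action for a zone in the bucket that timeMs falls into.
function record(buckets, timeMs, zoneId, action) {
  var key = epochMinute(timeMs);
  var zones = buckets[key] || (buckets[key] = {});
  var actions = zones[zoneId] || (zones[zoneId] = {});
  actions[action] = (actions[action] || 0) + 1;
}

// Drops every bucket that has fallen out of the window ending at nowMs.
function prune(buckets, nowMs) {
  var oldest = epochMinute(nowMs) - WINDOW_MINUTES + 1;
  Object.keys(buckets).forEach(function (key) {
    if (Number(key) < oldest) delete buckets[key];
  });
}
```

Calling prune(buckets, Date.now()) once a minute (for example from setInterval) would keep the structure at no more than 60 buckets.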

Now just expose the data via a REST API and we are almost done.

app.get('/api/v1/stream/data', ensureAuthenticated, function (req, res) {
  res.send(real.getHourlyData());
});

Show me the data

Now that the data is available, let's use AngularJS and Highcharts to show it. I will skip code examples here. The two charts below show the sliding window of the last 60 minutes for two zones (a zone is an implementation of an ad on a page).

"Bad" zone: this zone clearly shows that only 20% of requests turn into an impression. We need to figure out with the publisher what's happening and why we have so many "wasted" calls to "getAd", which returns JSON (with the banner and click URL).

"Good" zone: here you can clearly see that almost every "getAdHTML" turns into a "banner" impression.

Conclusion

Within one day of coding I was able to instrument my code to use AWS Kinesis so that I can have a real-time view into what's happening with my AdServer. This way I know exactly which content publishers I should work with to fix their code.
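The ratio that separates a "bad" zone from a "good" one can also be computed directly from the aggregated structure instead of being read off a chart. A small sketch, assuming the { minute: { zone: { action: count } } } shape used earlier; totalFor and impressionRate are names invented for this example.

```javascript
// hourly has the shape { minute: { zoneId: { action: count } } }.
function totalFor(hourly, zoneId, action) {
  return Object.keys(hourly).reduce(function (sum, minute) {
    var zone = hourly[minute][zoneId] || {};
    return sum + (zone[action] || 0);
  }, 0);
}

// Share of ad requests that became impressions ("banner" events), from 0 to 1.
// requestAction is the request type of the zone, e.g. 'getAd' or 'getAdHTML'.
function impressionRate(hourly, zoneId, requestAction) {
  var requests = totalFor(hourly, zoneId, requestAction);
  return requests === 0 ? 0 : totalFor(hourly, zoneId, 'banner') / requests;
}
```

A zone where this value stays far below 1 for the whole window is a candidate for a conversation with the publisher.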

Now that we have Kinesis in place, there is an almost unlimited number of applications:

Realtime Ads performance: impressions, clicks, CTRs

Trending banners, countries, IPs

Set up alarms if anything goes south - drop in impressions, etc.

Use Kinesis connectors (EMR, DynamoDB) to store aggregated data for further processing

And many others

#aws #kinesis #realtime #bigdata

Building AdServer in AWS Cloud

A few weeks ago a friend of mine asked me if I could help him build an ad server. Of course I asked why - there are so many options available. He answered that with their volume (a couple of billion impressions a month) he believed they could find a cheaper solution with a few extra features they might need. And they were right. In this post I would like to talk about what I learned from this really interesting project. After a few weekends and evenings the system is up and running and serving a couple of million ads a day.

Requirements

Initially there were a couple of hard requirements - I didn't question them at first as I had zero knowledge about the business, but as I learned along the way, they really weren't that critical. They did, however, have a significant impact on the architecture.

My lesson #1: Make sure you understand the business in depth so that you can architect the system appropriately.

Now to the requirements

Low latency: The request has to be served in less than 100ms

Geo look up: The ads have to be customized based on the geo (reverse IP lookup)

High availability: The system has to be up 99.99% of the time

Written in NodeJS - for the high throughput

Understanding what is available and choosing AWS

There are many technologies available: open source, paid, in the cloud, on premise, etc. I chose to go with AWS (http://aws.amazon.com/) - they have many sweet features and a scalable infrastructure, and you can use their services as building blocks. (And I understand it a bit.)

Particularly I liked:

Route53 - the best (and most reliable) DNS manager available, with a great feature: latency-based routing. If selected, it routes each request to the deployment (AWS region) with the lowest latency for that user, which reduces network latency.

Elastic Load Balancers, Autoscaling Groups, Reactive Autoscaling - it's fairly easy to configure your system in AWS to scale up and down based on your metrics. Elastic Load Balancers can distribute traffic across all running instances; autoscaling groups will add / remove instances based on current needs.

Simple Storage Service (S3) - I personally call it a high-latency key/value data store, and in my system it completely removed the need for a database (with the exception of the reporting database).

ElastiCache - Memcached as a service - it just works and you only need to worry about the host name.

Simple Queue Service (SQS) - excellent service for storing data before you have a chance (or the resources) to read and process it.

ElasticBeanstalk - deployment service. Very simple one, but if it works for you, I wouldn't go with anything more complex.

And the last feature - it all works together - you'll save A LOT OF time.

I would recommend checking each service out - they have many useful features and I have been using only a fraction of them.

Architecture

I was pleasantly surprised that the architecture I originally designed is still the one the production service is running on. No major adjustments had to be made.

The AdServer deployment consists of an Elastic Load Balancer and multiple WebServers managed by an autoscaling group, backed by ElastiCache (Memcached) servers for storing temporary state.

The Ads are programmed via an Admin interface that is deployed in one region only. The Ad configurations (JSON files) are stored in S3, and each AdServer syncs with it every 5 minutes. The banners are uploaded via Admin, stored in S3 and distributed via CloudFront (the CDN by AWS).

All traffic events (impressions, clicks, etc.) are stored in Simple Queue Service (SQS) and are regularly processed by an Analytics engine that is deployed in one region only and scales up and down based on the queue size. Processed (reduced) data is stored in MySQL, where it is available to marketers via the Admin interface to view the stats, CTRs, impressions, by country, device, etc. (Sky is the limit :)).

One central service - Monitor - is used for registering and deregistering the WebServers (registration assigns a unique ID and, based on the WebServer's region, provides the ElastiCache URL, SQS endpoint, etc.). Also, all the WebServers publish their "heartbeat" to the Monitor with basic stats (number of requests, etc.). Monitor has a simple UI where I can see all running instances per region etc.
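To illustrate the heartbeat idea, here is a stripped-down sketch of the bookkeeping such a Monitor needs. This is not the actual Monitor code: the function names, the shape of the registry and the instance IDs are all made up for the example.

```javascript
// registry maps instance ID -> { region, lastSeenMs, requests }.
function recordHeartbeat(registry, id, region, requests, nowMs) {
  registry[id] = { region: region, lastSeenMs: nowMs, requests: requests };
}

// Instances that have been silent for longer than maxAgeMs.
function staleInstances(registry, nowMs, maxAgeMs) {
  return Object.keys(registry).filter(function (id) {
    return nowMs - registry[id].lastSeenMs > maxAgeMs;
  });
}
```

Anything returned by staleInstances is a candidate for deregistration, or at least for an alert.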

The AdServers are deployed in multiple AWS regions, are completely stateless and scale up and down based on request latency. After some testing I figured out this is the best metric to represent how easily the system can "breathe" and serve requests. This can be easily configured in ElasticBeanstalk - when latency is higher than 200ms, the system adds a new instance; if it's lower than 50ms, it removes one instance. I optimized the server startup time to 3 minutes, so the whole system is really "elastic".
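ElasticBeanstalk evaluates this rule for you, so there is nothing to code. Purely to make the thresholds explicit, the decision boils down to something like the following sketch (scalingDecision is a made-up name):

```javascript
var SCALE_UP_LATENCY_MS = 200;
var SCALE_DOWN_LATENCY_MS = 50;

// Returns +1 to add an instance, -1 to remove one, 0 to leave the group alone.
function scalingDecision(avgLatencyMs) {
  if (avgLatencyMs > SCALE_UP_LATENCY_MS) return 1;
  if (avgLatencyMs < SCALE_DOWN_LATENCY_MS) return -1;
  return 0;
}
```

The gap between the two thresholds matters: anywhere between 50ms and 200ms the group is left alone, which keeps it from adding and removing instances in quick succession.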

What I learned from architecting a system in the cloud:

Make sure your servers are stateless and can start up automatically (ideally via an autoscaling group). I regularly terminate an EC2 instance just to confirm it works.

For stateful services (databases etc.) use something that comes in a box - S3, hosted MySQL, DynamoDB. You don't have to manage these services and typically many other customers use them too - the vendor (AWS) has to take care of all of us and guarantee the reliability.

Be distributed by default - make sure your system is ready for multiregional deployment, understand where your bottlenecks and resiliency gaps are and be ready for their failures. Understand your failure scenarios and automate the response to them (e.g. an AWS region is down - make sure Route53 knows where to send failover traffic).

Life of one request

When a request comes to the domain, it's routed by Route53 to the closest deployment (e.g. a request from New York is routed to the Virginia AWS datacenter, one from San Francisco to the Oregon AWS datacenter). Then the Elastic Load Balancer picks one WebServer instance in the autoscaling group and sends the request there. The request is processed and the response is returned to the customer.

Now comes the beauty of NodeJS and JavaScript - all event tracking and reporting is done asynchronously, so it's not blocking the request.

Once the request is processed and returned to the user, the data is logged in SQS; the Analytics engine will pick it up, aggregate the data and store it in the MySQL DB.

If there is a need for storing any request data (e.g. which banners the user with this IP has seen), Memcached is used. If the data has expired in Memcached, the best recommended banner is shown.

What I learned from processing requests - if anything can time out, go wrong, etc., make sure you fail fast and return a default response. For example I wrote a wrapper around the Memcached client - if the response is not returned in less than 10ms, the best banner is calculated instead.
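The wrapper idea fits in a few lines. This is a simplified sketch of the technique, not the exact wrapper from the AdServer: failFast and its arguments are names chosen for the example, and fn stands for any callback-style call such as a cache lookup.

```javascript
// Calls fn(callback). If no answer arrives within timeoutMs, responds with
// fallback() instead. The final callback is invoked exactly once.
function failFast(fn, timeoutMs, fallback, callback) {
  var done = false;
  var timer = setTimeout(function () {
    if (done) return;
    done = true;
    callback(null, fallback());
  }, timeoutMs);

  fn(function (err, value) {
    if (done) return; // the timeout already answered
    done = true;
    clearTimeout(timer);
    // errors and cache misses both degrade to the default response
    callback(null, (err || value === undefined) ? fallback() : value);
  });
}
```

With a Memcached client that has a get(key, callback) method, usage could look like failFast(function (cb) { memcached.get(key, cb); }, 10, calculateBestBanner, respond), where calculateBestBanner and respond are your own functions.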

Don't be blind

It's hard to run a service if you can't tell if it's running. I am using three different groups of metrics:

External metrics - banners show up, click-through works, request latency is within XYZ ms. Best tools - manual tests, Pingdom, scripts

Business metrics - number of requests processed, number of impressions, clicks, CTRs, daily volume of ads, etc. We chose to build our own tooling and implement a couple of neat features to support marketers in doing their job.

System metrics - CPU utilization, number of running instances, number of requests, number of errors, type of errors, etc. We are using our own Monitor dashboard that tracks all running instances and requests per zone (an Ad placement on a specific page) and can alert us if there is a drop. For the server and deployment metrics we are using CloudWatch for basic stuff and New Relic for detailed metrics - here, too, you can nicely set up alerts.

What I learned from running the system:

If you can document it, automate it - Are you checking the dashboard every day? Define an alert for that. Are you worried about traffic drops? Start capturing the metric and set an alert on it.

Remove alerts and metrics that are no longer relevant - prevent metrics overload and make sure you can stay laser focused on the metrics that matter.

Start with business metrics - there is no reason to run a system for the sake of running a system. Understand the business metrics (CTRs, sales, etc.) and align your technical metrics (latency, availability, etc.) with them.

Identify and build workarounds for failing dependencies

There are probably many dependencies in every system (DB, Cache, DNS, ...). Continuously identify those dependencies and design your system in a way that it can still function without them - even in degraded mode (if that is acceptable).

For example in my system, we had a hard dependency on Geo Lookup - it was slowing down the startup and sometimes even failing the startup of an instance. I rewrote the Geo Lookup and moved it into a separate system (that's not on the architecture diagram yet). If that system is not available, we are just not geo-targeting Ads, but still fully serving them (and of course we get an alert that the service is down).

"Aim for success, not for perfection"

But always have your eyes open to learn how to make your system better. So if you have any feedback, thoughts, ideas, please leave me a comment.

#AWS #Cloud #distributed systems #nodejs
