sleepcoding @agaoglu - Tumblr Blog

Optimizing nginx cache parameters with scalding

When we started thinking about different types of machines for content caching, one of the questions was capacity: mainly RAM, and disk capacity of course, because our content is generally composed of large files. "It should be big" is the easiest answer, but when the active content changes over time, that's not a solution. Fresh content has to be cached continuously while stale content sits idly in that big cache, causing massive waste.

The other side of this story is that there is no definitive answer. There is no global rule that says this amount of RAM will be enough. It depends on the content and, more importantly, on how it changes. If it's a news site publishing a dozen or so articles with a few photos every day, it's highly probable that those images are simply never accessed after a few days, which means a few hundred megabytes of cache is plenty. If it's an archive with millions of items mainly receiving long-tail traffic, you'll probably need much more than that. And if it's a moderately sized website with heavy content receiving organic and social traffic, it's probably a mix of those two scenarios and no one knows what your effective cache size can be.

By the way, these problems are very well known in the industry. Content caching products generally offer configuration options to tune according to your needs. The well-known and widely used varnish, for example, has a complete language to help administrators define their content behavior in any way they like. We use nginx for our content delivery system and it mainly has 3 knobs to turn:

proxy_cache_valid

proxy_cache_min_uses

proxy_cache_path inactive

The valid directive sets how long a response may be used. For example, if you configure valid as 10m, the response from upstream will be used for 10 minutes, invalidated after that, and the next request will be passed to upstream. It is also possible, and more common in practice, for upstreams to determine the validity of their responses (using Expires and Cache-Control), so this directive is used as a last resort only. In addition, we are currently focused on caching static content like images that never gets invalidated, so we'll set this to something like 1 year and not mess with it for the sake of simplicity.

min_uses sets the number of requests after which the response will be cached (a direct quote from the nginx docs). In practice this means that if you set it to 2, then for each key the first response will bypass the cache but nginx will remember the number of requests. When the second request comes in, nginx will see that this is the second request for this key, so it will store the response, and subsequent requests will be answered from there.

The inactive parameter of the proxy_cache_path directive is the missing element in that min_uses scenario. Cached data that are not accessed during the time specified by the inactive parameter get removed from the cache regardless of their freshness (another quote). This parameter also sets how long nginx will remember request counts. If inactive is set to 30 minutes, the 2 requests for the same key in the previous example must arrive within 30 minutes of each other in order for the response to be stored. Otherwise the second request will be seen as the first, because nginx won't remember the earlier one.
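To make the interplay between min_uses and inactive concrete, here is a minimal Python sketch of the behavior described above for a single cache key. It is a simplified model, not nginx's actual implementation: the function name classify_requests, the use of plain seconds as timestamps, and the assumption that a key is forgotten as soon as the gap between two requests exceeds inactive are all mine.

```python
def classify_requests(times, inactive, min_uses):
    """Label each request for ONE cache key as 'HIT' or 'MISS'.

    times     -- request timestamps in seconds (hypothetical input format)
    inactive  -- seconds after which an untouched key is forgotten
    min_uses  -- number of requests after which the response is stored
    """
    labels = []
    last_seen = None   # time of the previous request for this key
    uses = 0           # request count nginx still remembers
    cached = False     # whether the response is currently stored

    for t in sorted(times):
        # Assumption: too long a gap drops both the count and the cached data.
        if last_seen is not None and t - last_seen > inactive:
            uses = 0
            cached = False
        uses += 1
        if cached:
            labels.append("HIT")
        else:
            labels.append("MISS")          # passed to upstream
            if uses >= min_uses:
                cached = True              # response is stored from now on
        last_seen = t
    return labels


requests = [0, 600, 900, 4000, 4100]
print(classify_requests(requests, inactive=1800, min_uses=2))
print(classify_requests(requests, inactive=1800, min_uses=1))
```

With min_uses=2 the first two requests go upstream and the third is served from cache; the 3100-second gap then exceeds inactive, so the key is forgotten and counting starts over.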

These two parameters offer a surprisingly wide range of possibilities for cache efficiency. To restate the problem at hand, we want to minimize both the cache size and our upstream traffic, two goals that pull in opposite directions: you can minimize upstream traffic if you simply cache and hold on to everything, and minimize cache size if you don't cache anything and serve everything from upstream. We have our boundary conditions and our function looks like (cachesize, hitratio) = f(inactive, min_uses), but the problem is that f depends on how the content is being accessed, which is not deterministic.

What we can do here is experiment and hope that the law of large numbers applies. Run the content servers with different configurations for some specified number of days and monitor the upstream traffic and maximum cache size. But this is very costly and prone to error. If different configurations are run on different days, the theoretically independent error of each experiment is, well... not independent. Or we can send the traffic to different configurations and run all of them together, but that means we'll need a huge setup depending on the number of configurations we want to test.

Or we can approximate the scenarios. We have the access logs of our content servers; if we can replicate the caching logic with a simpler function, we can run it over the logs with different parameters and simply calculate our variables. It will take minutes instead of days, and we don't have to set anything up. Our logs are stored in HDFS, so it's only a matter of writing some MapReduce-like thingy to convert them into a virtual cache. As far as simple functions go, i don't think anything can beat scalding, so here goes:

We want to find out the variables for all the combinations of these inactive and min_uses values.

Source is an example here. Ours is actually composed of parquet files with a whole lot of different fields, but we only need those 4 for this analysis.

Here we can do some basic filtering/cleaning. These are examples only.

This assumes the cache keys are the request URIs. If that is not true, for example if some rewrite rules change the cache keys, this part needs to change accordingly.

This first grouping will generate the histories of all the keys in the given data. Taking the maximum bytessent as the filesize is the main approximation here: in theory the cached content might be larger than what was sent to any client (e.g. the client interrupted the transmission). The euses field of each content will be a map containing request counts at precise milliseconds.
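A rough Python equivalent of this grouping, as a sketch only (the original job is written in scalding). The record layout (time_ms, key, bytes_sent) and the function name build_histories are hypothetical; only the idea of taking the maximum bytes sent as the file size and keeping per-millisecond request counts comes from the description above.

```python
from collections import defaultdict


def build_histories(records):
    """Group access-log records by cache key.

    records -- iterable of (time_ms, key, bytes_sent) tuples (assumed layout)
    Returns {key: {"filesize": int, "uses": {time_ms: request_count}}}.
    """
    histories = {}
    for time_ms, key, bytes_sent in records:
        entry = histories.setdefault(
            key, {"filesize": 0, "uses": defaultdict(int)}
        )
        # Approximation: the largest response seen is taken as the file size.
        entry["filesize"] = max(entry["filesize"], bytes_sent)
        # Several requests may land on the same millisecond, so count them.
        entry["uses"][time_ms] += 1
    return histories


log = [
    (1000, "/a.jpg", 500),   # interrupted transfer
    (1000, "/a.jpg", 800),
    (2000, "/a.jpg", 300),
    (1500, "/b.jpg", 100),
]
for key, history in build_histories(log).items():
    print(key, history["filesize"], dict(history["uses"]))
```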

Since we want to find out more than one inactive X min_uses configuration, we inject them here.

This is where we convert the access history into changes to the cache state. It's basically a modified sessionization.

Since the access history was stored in a map, we need to sort it by access time first. Then we start folding over it into a list which will be this key's contribution history to the store. At the end, the list will hold tuples of positive numbers with access times, meaning the file is stored in the cache, and negative numbers with access times, meaning the file is removed from the cache. We also carry the number of requests (uses) in order to account for min_uses.
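The fold described above could look roughly like this in Python. This is a sketch under assumptions, not the original scalding code: the function name contributions and the event layout (time, cache_delta, upstream_bytes) are hypothetical, and removal is modeled as happening exactly inactive after the last access.

```python
def contributions(uses, filesize, inactive, min_uses):
    """Turn one key's access history into cache-state change events.

    uses -- {access_time: request_count}
    Returns a list of (time, cache_delta, upstream_bytes) tuples:
      cache_delta > 0  file fetched from upstream and stored
      cache_delta == 0 file fetched from upstream but not stored yet
      cache_delta < 0  file removed after being inactive
    """
    events = []
    seen = 0          # requests counted within the current "session"
    cached = False
    last = None       # previous access time

    for t in sorted(uses):
        if last is not None and t - last > inactive:
            # Session expired: emit the removal and forget the count.
            if cached:
                events.append((last + inactive, -filesize, 0))
            seen, cached = 0, False
        for _ in range(uses[t]):
            seen += 1
            if cached:
                continue                      # cache hit, nothing changes
            if seen >= min_uses:
                events.append((t, filesize, filesize))
                cached = True
            else:
                events.append((t, 0, filesize))
        last = t

    if cached:
        events.append((last + inactive, -filesize, 0))
    return events


history = {0: 1, 600: 1, 900: 1, 4000: 1, 4100: 1}
for event in contributions(history, filesize=100, inactive=1800, min_uses=2):
    print(event)
```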

We group by inactive X min_uses and run all of the contribution histories sorted by their access times, adding and subtracting to cachesize as necessary and keeping track of the maximum size reached. We also aggregate upstream requests/transfer for non-negative history entries as they mean requests are passed to the upstream.
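For a single inactive X min_uses combination, this last aggregation can be sketched in Python as follows. The event layout (time, cache_delta, upstream_bytes) and the function name summarize are assumptions made for illustration; when a removal and a store share the same timestamp, this sketch applies the removal first.

```python
def summarize(events):
    """Replay cache-change events in time order.

    events -- iterable of (time, cache_delta, upstream_bytes) tuples,
              gathered from the contribution histories of all keys
    Returns the peak cache size and the upstream request/transfer totals.
    """
    size = 0
    peak = 0
    upstream_requests = 0
    upstream_bytes = 0
    # Tuples sort by time first, then by delta, so removals at the same
    # instant are applied before additions.
    for _, delta, fetched in sorted(events):
        size += delta
        peak = max(peak, size)
        if delta >= 0:
            # Non-negative entries are requests passed to the upstream.
            upstream_requests += 1
            upstream_bytes += fetched
    return {
        "max_cache": peak,
        "upstream_requests": upstream_requests,
        "upstream_bytes": upstream_bytes,
    }


events = [
    (0, 0, 100), (600, 100, 100), (2700, -100, 0),   # key A
    (700, 50, 50), (2500, -50, 0),                   # key B
]
print(summarize(events))
```

Running one such replay per (inactive, min_uses) pair gives one cell of each result table.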

As you can see, this analysis does not aggregate total # of requests or transfer. I simply omitted them as they're a simple matter of group count and sum.

Nothing explains an analysis as well as its results, so here are the numbers for one of the larger domains we run.

Maximum cache size (GiB)

inactive    min_uses=1    min_uses=2    min_uses=3
10m               3.42          1.18          0.78
30m               7.66          2.97          2.03
1h               12.46          5.17          3.63
2h               19.82          8.73          6.23
6h               39.04         18.77         13.51
1d               73.62         31.22         20.00

The extreme case of caching everything at the first request requires 73 GiB of cache capacity, which also means that this is the total capacity our active content requires. While that's not an unreasonable number, it's not enough to make a decision. We need to look at how our upstream transfer changes in order to evaluate whether that much capacity really pays off.

Upstream transfer (GiB)

inactive    min_uses=1    min_uses=2    min_uses=3
10m             233.09        279.84        301.58
30m             175.32        215.24        234.81
1h              144.12        179.90        198.23
2h              118.28        150.68        168.21
6h               89.20        118.70        136.38
1d               73.62        104.85        124.85

Minimum amount of upstream transfer is 73 GiB, the same as maximum cache capacity of course. Accepting this as our baseline, we can start compromising.

If we set our min_uses to 2, it shows 42 GiB of content is requested only once so caching them is a waste. But 31 GiB of content is requested more than once so when we don't cache it the first time, we'll need to request it again, increasing our upstream transfer to 73 + 31 = 104 GiB.

If we decrease our inactive duration to 6h from the maximum 24h, this will mean we'll start removing items earlier so our cache capacity will reach a maximum 39 GiB and we'll need 16 GiB of early-removed content to be requested again, increasing our upstream transfer to 89 GiB.

An interesting finding here is the comparison between a) min_uses=3/inactive=24h and b) min_uses=2/inactive=6h. If you look at the tables, a requires 20 GiB of cache and 124 GiB of upstream transfer, while b requires 18 GiB of cache and 118 GiB of upstream transfer. Scenario b is better than a in both aspects. A look at the scatter plot of cachesizes and upstream transfer should make things clear:

What we want is the optimal point on that curve. It is the configuration min_uses=2/inactive=6h, resulting in a maximum cache size of 18 GiB and a total upstream transfer of 118 GiB, which has the minimum normalized distance among all scenarios tested. What we can do next is run a regression and find the optimal point in untested scenarios too, but i don't think it'd be too far off.
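The post does not spell out how the normalized distance is computed, so the following Python sketch is one plausible reading rather than the exact method used: both metrics are min-max normalized over the tested scenarios, and each configuration is scored by its Euclidean distance to the ideal point (smallest cache, smallest upstream transfer). The numbers are copied from the two result tables.

```python
import math

INACTIVE = ["10m", "30m", "1h", "2h", "6h", "1d"]
# Rows follow INACTIVE; columns are min_uses = 1, 2, 3 (GiB, from the tables).
CACHE = [
    (3.42, 1.18, 0.78),
    (7.66, 2.97, 2.03),
    (12.46, 5.17, 3.63),
    (19.82, 8.73, 6.23),
    (39.04, 18.77, 13.51),
    (73.62, 31.22, 20.00),
]
UPSTREAM = [
    (233.09, 279.84, 301.58),
    (175.32, 215.24, 234.81),
    (144.12, 179.90, 198.23),
    (118.28, 150.68, 168.21),
    (89.20, 118.70, 136.38),
    (73.62, 104.85, 124.85),
]


def min_max(value, values):
    """Scale value into [0, 1] relative to the observed range (assumption)."""
    lo, hi = min(values), max(values)
    return (value - lo) / (hi - lo)


def rank_configs():
    """Return [((min_uses, inactive), distance), ...] sorted by distance."""
    all_cache = [v for row in CACHE for v in row]
    all_upstream = [v for row in UPSTREAM for v in row]
    distances = {}
    for i, inactive in enumerate(INACTIVE):
        for j in range(3):
            c = min_max(CACHE[i][j], all_cache)
            u = min_max(UPSTREAM[i][j], all_upstream)
            distances[(j + 1, inactive)] = math.hypot(c, u)
    return sorted(distances.items(), key=lambda item: item[1])


best_config, best_distance = rank_configs()[0]
print(best_config, round(best_distance, 3))
```

Under these assumptions the sketch also picks min_uses=2/inactive=6h, with min_uses=3/inactive=6h and min_uses=1/inactive=2h close behind.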

Forcing kafka partition leaders

We now have controlled.shutdown.enable with a default value of true, which means a broker will give up its partition leaderships before shutting down. But if you're upgrading from an earlier version, the broker will still hold some leaderships, and since it won't move them automatically, you might think there should be a manual way to force the move. There is, but it takes some effort. And if your producers don't work with request.required.acks set to -1, you will lose data, so you'd probably want to make that effort.

There is a tool called Reassign Partitions and another called Preferred Replica Leader Election in the kafka replication tools. In summary, we tell kafka that the partition replicas have been changed to exclude the broker we want to shut down. If you do not have enough spare brokers, you'll end up including that broker in the replicas anyway, but only in the last position. The second step is signalling kafka to elect the preferred leaders. An example will clear things up.

We (and some users we chat with) have 3 kafka brokers, and our topics work with a replication factor of 3. This way we think we have a cheap setup with a good level of fault tolerance. All brokers have all the data, and the leadership and the load are distributed amongst them. If one of them goes down (unexpectedly or, more importantly, for some upgrade) the other two keep the data pipeline online. The normal state with broker ids 1,2,3 looks like:

$ ./bin/kafka-topics.sh --describe --topic stream-log
Topic:stream-log  PartitionCount:9  ReplicationFactor:3  Configs:
    Topic: stream-log  Partition: 0  Leader: 3  Replicas: 3,2,1  Isr: 1,3,2
    Topic: stream-log  Partition: 1  Leader: 1  Replicas: 1,3,2  Isr: 1,3,2
    Topic: stream-log  Partition: 2  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 3  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2
    Topic: stream-log  Partition: 4  Leader: 1  Replicas: 1,2,3  Isr: 1,3,2
    Topic: stream-log  Partition: 5  Leader: 2  Replicas: 2,3,1  Isr: 1,3,2
    Topic: stream-log  Partition: 6  Leader: 3  Replicas: 3,2,1  Isr: 3,1,2
    Topic: stream-log  Partition: 7  Leader: 1  Replicas: 1,3,2  Isr: 1,3,2
    Topic: stream-log  Partition: 8  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2

This example topic has 9 partitions and they are evenly distributed/replicated around 3 brokers. Say that we want to take broker 3 offline for some reason. What we want is for partitions 0, 3 and 6 (currently with leaders at broker 3) to set leaders to brokers either 1 or 2. But AFAIK there is no direct way to do that. Instead we prepare replica configs with broker 3 set as last with a json file like:

{"partitions": [
    {"topic": "stream-log", "partition": 0, "replicas": [2,1,3]},
    {"topic": "stream-log", "partition": 1, "replicas": [1,2,3]},
    {"topic": "stream-log", "partition": 2, "replicas": [2,1,3]},
    {"topic": "stream-log", "partition": 3, "replicas": [1,2,3]},
    {"topic": "stream-log", "partition": 4, "replicas": [1,2,3]},
    {"topic": "stream-log", "partition": 5, "replicas": [2,1,3]},
    {"topic": "stream-log", "partition": 6, "replicas": [2,1,3]},
    {"topic": "stream-log", "partition": 7, "replicas": [1,2,3]},
    {"topic": "stream-log", "partition": 8, "replicas": [2,1,3]}
 ],
 "version":1
}
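Writing this file by hand gets tedious with more partitions. As a sketch, the same JSON can be generated with a few lines of Python by moving the broker to the end of every replica list while keeping the order of the others. The function name demote_broker and the way the current assignment is passed in (a dict copied from the describe output) are illustrative assumptions, not part of any kafka tool.

```python
import json


def demote_broker(topic, assignment, broker):
    """Build a reassignment document with `broker` last in every replica list.

    assignment -- {partition: [replica broker ids]} as currently configured
    """
    partitions = []
    for partition in sorted(assignment):
        replicas = assignment[partition]
        # Keep the relative order of the other brokers untouched.
        reordered = [b for b in replicas if b != broker]
        if broker in replicas:
            reordered.append(broker)
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": reordered}
        )
    return {"partitions": partitions, "version": 1}


# Replica lists taken from the "normal state" output above.
current = {
    0: [3, 2, 1], 1: [1, 3, 2], 2: [2, 1, 3],
    3: [3, 1, 2], 4: [1, 2, 3], 5: [2, 3, 1],
    6: [3, 2, 1], 7: [1, 3, 2], 8: [2, 1, 3],
}
plan = demote_broker("stream-log", current, broker=3)
print(json.dumps(plan, indent=1))
```

For the example topic this produces the same replica lists as the hand-written file; saving the printed output as manual_assignment.json gives the input for the next step.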

We use this file to tell kafka to change replica configs using the first tool:

$ ./bin/kafka-reassign-partitions.sh --reassignment-json-file manual_assignment.json --execute

Now after a few seconds state will be:

$ ./bin/kafka-topics.sh --describe --topic stream-log
Topic:stream-log  PartitionCount:9  ReplicationFactor:3  Configs:
    Topic: stream-log  Partition: 0  Leader: 3  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 1  Leader: 1  Replicas: 1,2,3  Isr: 3,1,2
    Topic: stream-log  Partition: 2  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 3  Leader: 3  Replicas: 1,2,3  Isr: 3,1,2
    Topic: stream-log  Partition: 4  Leader: 1  Replicas: 1,2,3  Isr: 3,1,2
    Topic: stream-log  Partition: 5  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 6  Leader: 3  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 7  Leader: 1  Replicas: 1,2,3  Isr: 3,1,2
    Topic: stream-log  Partition: 8  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2

We have changed the config but leaders are not changed. To force it:

$ ./bin/kafka-preferred-replica-election.sh

Kafka will prefer the first broker in each replica config to be the leader, so the final state will be:

$ ./bin/kafka-topics.sh --describe --topic stream-log
Topic:stream-log  PartitionCount:9  ReplicationFactor:3  Configs:
    Topic: stream-log  Partition: 0  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 1  Leader: 1  Replicas: 1,2,3  Isr: 3,1,2
    Topic: stream-log  Partition: 2  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 3  Leader: 1  Replicas: 1,2,3  Isr: 3,1,2
    Topic: stream-log  Partition: 4  Leader: 1  Replicas: 1,2,3  Isr: 3,1,2
    Topic: stream-log  Partition: 5  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 6  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2
    Topic: stream-log  Partition: 7  Leader: 1  Replicas: 1,2,3  Isr: 3,1,2
    Topic: stream-log  Partition: 8  Leader: 2  Replicas: 2,1,3  Isr: 3,1,2

Now you can shut down broker 3 to upgrade it or do whatever you want to do with it. ISRs will shrink for that time, but after the broker starts again it'll catch up.

As a last note, you'll probably want to go back to the original state after the operation. To do that, reverse the operation: run reassign-partitions with the original config and start preferred-replica-election again. The first reassign-partitions run will output the original state for reference, so you can save it and use it as the parameter for the second run.

Systems monitoring dissected

The more monitoring you have, the more trouble you find. If you don’t have monitoring, trouble finds you.

— John Arundel (@bitfield)

January 29, 2015

This tweet arrived just after we deployed our new monitoring system and I couldn't agree more. In a few words it explains why we, and everyone trying to push their limits in monitoring, keep doing it.

About 3 years ago, when we started working on a number of production servers, one of the first missing things was a way to see which machine does what, and to be able to do so without actually sshing into the machine and running some command-line tools. After some rationalization (which i'll not be going into) and research (none of which i conducted), we deployed zabbix. At first it was a server storing data on mysql and daemons on individual machines pumping metrics to the server. Then we added several advanced tools into the mix, moved storage to postgresql, developed lots of customizations, made and unmade lots of screens. It is still on our TVs and in our browser tabs, still sending mails when something that requires attention happens.

We tried several other things in those past years, but their benefits never outweighed the burden of another infrastructure-sized deployment adventure. Until we decided that we need many more metrics per machine/service (see the tweet) and that we should be able to correlate wildly different metrics much more easily. Neither of those is zabbix's strong suit. That's when we started looking for alternatives, more specifically time-series databases. I tried to evaluate some of them but i won't be going into that discussion. We chose opentsdb because we already had HBase installed and grafana provides a fantastic UI.

But a datastore and a means to graph over the data stored is only half a monitoring solution. Next, we need a way to get the data from the machines to the store. With Zabbix, as a complete monitoring solution, this is done with zabbix-agent. This process collects system metric values from various places and communicates those to the zabbix-server. While it may be possible to tap into this communication and tee the data elsewhere, it probably requires a lot of zabbix protocol knowledge which i don't have. So i started looking for other options.

The first one was tcollector, a side-project of opentsdb: a series of python scripts reading values from the kernel/services and sending them to a TSD. With some installation effort, it works great. It's easily extendable with extra scripts and doesn't have much overhead. But it falls short of a hidden requirement: while it's a metric value collection daemon, the only place you can send those values is opentsdb. I didn't want another collector daemon running on the machines exclusively for one datastore. The core requirement is to collect values; sending those values somewhere should be pluggable.

Which gets me to my second option: collectd. It's an independent daemon with tons of plugins, which are for either collecting values or sending them somewhere. It's written in C, performant, widely supported, etc. etc. But... I really don't like installing ttf-dejavu-extra or fontconfig just to be able to collect some kernel-level metrics from a headless server. What i mean is, it took me some time to find a good way to install it without all of the RRD stuff. That's definitely not collectd's fault by the way. It's probably just me being unnecessarily uneasy. Anyway, what i couldn't find was a good way to install it with a properly working TSDB plugin. Long story short, C is not my strong suit so i couldn't get much further.

The last option was a complete coincidence. I started googling from scratch and found some other monitoring tool recommending diamond as a collector daemon (i don't remember which). On the other hand, the ceph monitoring-management toolchain calamari uses diamond too, so we'd have a way to collect ceph metrics in a known way. After some reading through the docs and the code, it was clear that it was the collector of choice. Diamond offers the best of the two previous worlds i ventured into: it's composed of easily extendable/modifiable python scripts while being completely datastore agnostic. There are lots of collectors as well as handlers (datastore bindings) with sensible dependencies and a clean way to package & install. I deployed it on a select few servers for starters.

Diamond is so datastore agnostic that its metric names are not opentsdb-y. The OpenTSDB naming schema recommends keeping metric names abstract and setting anything machine/part specific as tags. So when diamond says metric=servers.server-03.cpu.cpu2.user, opentsdb should hear metric=cpu.user host=server-03,core=cpu2. The TSDB handler for diamond handles the hostname-as-tag conversion, but the rest is defined in the collectors, so it does not go into that detail. Another problem was the handler not reading TSD server replies off the telnet connection, causing messages to pile up in socket queues to the point where the TSD server crashes.

Adding a few more requirements to the mix, it was a better idea to write a brand new TSDB handler. After a few hours of work, our diamonds now speak the OpenTSDB 2 HTTP API, can use more than one TSD server and can be configured to parse tags from metrics. And they log opentsdb error messages so that we can see when we missed a metric or something. It has been running without problems for about 3 weeks now and we have already started to find more troubles, which was the point. The only thing i'm missing is the ability to do simple mathematical operations directly on opentsdb.

One last piece in this new monitoring puzzle is alarms. I'm thinking riemann but haven't had enough time to actually start working on it.
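The metric-name conversion mentioned above (servers.server-03.cpu.cpu2.user becoming cpu.user with host and core tags) can be sketched in a few lines of Python. This is not the code of our handler: the function name to_opentsdb and the hard-coded rule that the element after "cpu" is the core are hypothetical, and a real handler would make such rules configurable per collector.

```python
def to_opentsdb(path):
    """Split a diamond-style metric path into an OpenTSDB metric and tags.

    Assumed path layout: servers.<host>.<collector>.<rest...>
    """
    parts = path.split(".")
    if len(parts) < 4 or parts[0] != "servers":
        raise ValueError("unexpected metric path: %r" % path)

    host, collector, rest = parts[1], parts[2], parts[3:]
    tags = {"host": host}

    # Illustrative rule: the cpu collector reports one sub-tree per core,
    # so the core name moves from the metric name into a tag.
    if collector == "cpu" and len(rest) >= 2:
        tags["core"] = rest[0]
        rest = rest[1:]

    metric = ".".join([collector] + rest)
    return metric, tags


print(to_opentsdb("servers.server-03.cpu.cpu2.user"))
print(to_opentsdb("servers.server-03.memory.MemFree"))
```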

To wrap it all up, zabbix is a great way to start monitoring. It's an opinionated tool which gets results fast when you have nothing. Because, to be honest, you don't know what to monitor unless you start monitoring. Kernels and services throw zillions of numbers at you, and unless you have years of expertise it's not easy to find which numbers matter most. Zabbix templates tell you which numbers to collect, how you should graph them and even what the normal values are, so that it can immediately start sending alarms when some problem happens. But not everything can be expressed in templates, and even if it were possible, zabbix-server and/or the relational datastore it sits on top of has limits. That's when you need to leave the comfort zone of complete solutions and step into the flexibility of collecting, storing, graphing and alerting with specialized systems. You have to find which values you need eyes on, and which low-level metric affects which part of the running service. Fear not, it's fun when you have such great tools.

#monitoring #opentsdb #diamond #grafana

Scaling Overnight

We, the nokta technology team, are responsible for the infrastructure of nokta media projects, the most prominent being izlesene.com, which is the largest Turkish video-sharing website. We believe we are in a unique position, being stationed in a developing country with a reputation for internet censorship and trying to make a living through an internet business. And, dare i say, one of the hardest kinds: video as the model and a revenue stream based on advertising.

If this post somehow gets attention from the tech folks living in the Bay Area, where funding can be found lying on the street and every business owns a datacenter with 10k machines, let me start with a few friendly reminders: in this part of the world, no datacenter is that big, any seven-figure funding makes headlines, and advertising budgets are abysmal. While it may be the norm for businesses out there to outsource some operations to CDNs and cloud providers to cut costs and focus on their business, it's generally the other way around here. This means the list of things that our team is responsible for includes: networking and cabling, physical servers and their components, virtual machines, storage systems, databases, data warehouses, video transcoding, video streaming, static-content delivery, load balancers, firewalls, real-time analytics and so on. We spend our average day monitoring, troubleshooting, improving and designing the new versions of these things. March 27th was not an average day.

On March 27th, around 1500 UTC, YouTube was banned by the Turkish government. ISPs applied that ban by DNS spoofing, so users trying to watch a cat diving into a box or listen to Turkish pop music were reading a legal notice instead. While there is a lot to say about censorship, this post is about something else. Such a ban does not stop the average user from reaching her goal. She simply goes back to the search results and clicks on the second link. That would be us.

We are used to a similar situation when YouTube is having some kind of outage: the same user behavior applies and our traffic doubles, but those don't last beyond an hour. This was a total ban and it was here to stay. We had no idea where the traffic would reach or when the ban would be lifted, for that matter. Now, after two months, the ban is still in effect and we know where we are: we are now riding traffic 6 times bigger than what we had just a few months ago.

It wouldn't be a story if we were serving 10 requests a day before and 60 requests now. So here are some numbers comparing first weeks of March and May:

And this is a direct screenshot of one of our zabbix graphs showing between Feb 16th and May 18th.

We know none of these numbers are enter-your-popular-webgiant-here grade, but they are not enter-your-average-website-here grade either. The interesting part for us here is actually the change in numbers. We think this kind of leap is not possible in a healthy internet ecosystem. It's the result of being in the unique position mentioned earlier. The same position that leaves a team of five to play with just over a hundred machines in order to keep things running. And it's sad.

It wouldn't be honest to say that we were totally unprepared though. A good part of the last year was spent deploying and running a private openstack cloud, a storage cluster running ceph, a data processing cluster running hadoop and storm, several haproxy, nginx and varnish installations, zabbix to monitor and puppet to keep us sane, all on ubuntu linux. When the day came, our endeavors paid off and all of the systems kept their promises. Such a solution would simply be impossible with any proprietary system, where standard support doesn't come through in less than a month and the advanced support we require is non-existent, not here anyway.

While each of them deserves a post on their own, here is a short overview of what we have done after March 27th and how these systems helped us.

Just when we knew we needed more machines, our previously purchased hardware got delivered the day after March 27th. BTW the usual hardware purchase-to-delivery time is about a month for us, so it was lucky.

We deployed new machines for video streaming. With cobbler and puppet in place, it takes about an hour to get a machine from bare metal to ready state.

We launched additional web servers. Openstack creates an instance in seconds, required environment is set with puppet in a few minutes.

We exhausted available compute resources for openstack and started adding new compute nodes. No problem.

Adding new storage nodes to ceph is business as usual for us. It didn't require further attention.

Our search index solr was a single instance and after about 12x load, it failed. We had to redo it into a solr cloud. Didn't take more than a few hours. Then kept adding nodes.

Log collector flume needed a few configuration changes in order to pump more data into hadoop and kafka. Storm topologies are running flawlessly after some minor tuning.

Some user facing servers required kernel tuning done with sysctl.

We'd like to complete our words by thanking the open source communities of all kind, for this kind of story would not have been possible without them. We'll be posting more technical details soon.

Note: Since this post ended up on my personal blog, i have to write about the team:

Hakan Kocakulak (@hakankocakulak): Team Leader

Caglar Bilir (@caglarbilir): Systems

Ahmet Kandemir (@ahbikan): Systems

Selcuk Tunc (@ttselcuk): Software

Erdem Agaoglu (@agaoglu): Software

Note: At the time of writing this post, there are rumors about YouTube ban being lifted. Our metrics still say otherwise.

Nginx and weird 416 responses

We have been experiencing lots of 416 responses on our nginx based video servers recently. Well... not that recently in fact. We were seeing those since we saw our first access log, but it's like one client getting tens of those every second for a few seconds so we thought it may be a problematic flashplayer or something like that. Deeper inspection proved otherwise. But first, WTF is 416?

416 is a client-error type HTTP response code which is defined as "Requested Range Not Satisfiable". It means the client requested some (byte) range of a resource but that range is not applicable to the resource. A simple example would be a 1 MB file where the client requested the portion between the 2nd and 3rd MB. The server cannot reach the 2nd MB of a 1 MB file, so it will respond with 416. All the details about this are given in RFC 2616 section 14.35.

In our case, it didn't actually make much sense, since most of our video is played on flash-based players and, as far as we know, flashplayer is unable to dispatch an HTTP request with a Range header (i don't know much about flash so i am not actually sure about this). But it happens and generates a lot of wasted traffic, and probably some unhappy clients.

So, after some clever but dirty tcpdumping, we saw that those requests were not like the simple example i just mentioned but actually had invalid Range headers. Out of that dirty tcpdump output we saw things like:

Range: bytes=7259-7258
Range: bytes=10513-10512
Range: bytes=0--1

There is a pattern here: some client, some browser, some code with an off-by-one error causes the last-byte-pos to be one less than the first-byte-pos, which makes for a syntactically invalid Range header, as the RFC puts it. This is the client's problem. But the next sentence in the RFC says the recipient (nginx) MUST ignore such headers. And AFAICT, if you ignore a header you should process the request as if that header was never there, and you should respond 200 with all the content. It seems nginx has a bug there.

Our first course of action in this kind of situation is to google the hell out of the problem, since we generally believe we are not alone. But this time we found nothing. Next, we tried apache and lighttpd with the same request and they responded 200 as the RFC suggests. So we replaced our nginxes with apache... just kidding.

We thought it'd be possible to workaround this by using some lua in our config so here is what we came up with:

header_filter_by_lua '
    if ngx.var.http_range then
        local brange = string.sub(ngx.var.http_range, 7)
        local start = tonumber(brange:sub(1, brange:find("-")-1))
        local stop = tonumber(brange:sub(brange:find("-")+1))
        if stop and start and stop < start then
            ngx.req.set_header("Range", nil)
        end
    end
';

Without any prior knowledge of lua and searching for every basic operation, this is as good as it gets. We strip the bytes= part first, then cast the parts before and after the first dash into numbers. If those numbers form a syntactically invalid range (stop is smaller than start), we remove the header. Sometimes all that string manipulation and casting will fail and the start or stop vars will be nil. If that happens, we do not touch anything and let the nginx core do its work.
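The same check is easy to try outside nginx. Here is a small Python sketch mirroring the logic of the lua snippet, handy for testing it against header values pulled out of a tcpdump. The function name should_ignore_range is hypothetical, and like the lua version it only looks at a single byte-range-spec and leaves every other form alone.

```python
def should_ignore_range(header):
    """Return True when a Range header value is of the broken
    'bytes=N-M with M < N' kind and should be dropped.

    Anything that cannot be parsed as two integers is left alone
    (returns False), so the server can handle it as usual.
    """
    if not header.startswith("bytes="):
        return False
    spec = header[len("bytes="):]
    # Split on the FIRST dash, like the lua code does.
    first, dash, last = spec.partition("-")
    if not dash:
        return False
    try:
        start, stop = int(first), int(last)
    except ValueError:
        # Open-ended ranges (500-), suffix ranges (-500), multi-range
        # sets and garbage end up here and are not touched.
        return False
    return stop < start


for value in ["bytes=7259-7258", "bytes=10513-10512", "bytes=0--1",
              "bytes=0-499", "bytes=500-", "bytes=-500"]:
    print(value, should_ignore_range(value))
```

The three header values seen in the tcpdump output are flagged, while ordinary ranges pass through untouched.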

At the time of writing this post, we haven't moved this piece into production yet but some local tests showed it's OK. Just paste it into the server section of your nginx conf and you should be good to go.

Connecting to Mongo with PHP

I don't get along with PHP, nor with Mongo. It's also not my habit to rewrite something that has already been written hundreds of times all over the internet. But apparently a few very basic points are missing on this subject (probably in terms of Turkish sources). I'll keep it very short.

If your rationale is "it's supposedly faster than MySQL, we need to switch to it from our master-slave MySQL", then i assume that by now you have set up a MongoDB replica set, which is "something like master-slave". Now it's time to change your read-mysql and write-mysql connections:

Stop. You will not create separate connections for reads and writes. You now need to do both your reads and your writes over a single database connection. Thanks to a simple setting, this connection will send your reads to the slaves (secondaries) and your writes to the master (primary). What you need to do is add your replica-set name (you defined it during setup) and a setting called readPreference to your connection line. That is, instead of

$writelar = new MongoClient("mongodb://192.168.1.11");
$readler = new MongoClient("mongodb://192.168.1.12");

use

$mongo = new MongoClient("mongodb://192.168.1.11,192.168.1.12?readPreference=secondaryPreferred", array('replicaSet' => 'rs'));

Böylece "Cannot run command XXX: not master" hatasından kurtulmuş oldunuz.

Geldik "MongoClient::__construct(): php_network_getaddresses: getaddrinfo failed: Name or service not known" gibi saçma bir hataya. "Ben bağlantı tarafına IP yazıyorum ne alaka" demeyin maalesef artık uydurma master-slave yapısı kullanmıyorsunuz. MongoDB replika-set tüm dağıtık sistemler gibi düzgün çalışan bir DNS altyapısı ister. Ama malum bugünlerde düzgün çalışan DNS bulmak zor. Ya da "eskiden DNS mi varmış", ya da "hosts dosyası ne işe yarıyor o zaman", ya da "isim olursa sistem yavaşlar IP kullanın", ya da "çok lazımsa tamam hosts'a yazarız o o kadar da yavaşlatmaz" cılardansanız yapılacak şeyi de biliyorsunuz. Kurulumu yaparken kullandığınız isimleri tüm makinalarınızda hosts dosyalarında tanımlayacaksınız. Makinalarınızın adı mongomaster ve mongoslave gibiyse:

192.168.1.11 mongomaster 192.168.1.12 mongoslave

Geçmiş olsun. Artık güvenle gidip sağda solda "bu mongo da çok kötüymüş, hiç bir şey yapamıyon olm, memcache daha iyiydi" diye atıp tutabilirsiniz.

GPUs, Hadoop and Testing Scalability

As i have said numerous times before, i am currently trying to get a GPU powered image processing application to run on Hadoop. In the development phase we were using a cluster of 12 machines with one Nvidia GTX 480 each, but since we are launching in a few months, we had to do some tests on our production cluster of 25 machines with two Nvidia Tesla M2050s each. In this post, i'll try to sum up the process of testing; technical details will come later.

First some reminders about our architecture. Image processing application (IPA) receives an array of images and returns an array of results of doubles. A reduceless MapReduce application divides the images in HBase into chunks, and passes those chunks to IPA. Simply put, while it's improbable for a single IPA to process thousands of images at once, whole system is able to process millions of images in parallel.

What matters (on our end) is the number of images an IPA received and how much time it took to return a result set. Using those, we calculate a basic metric: speed in number of images processed per second (ipps). We also calculate the same speed for the whole cluster, to see if we can reach a speed like nx ipps when our IPA runs at x ipps and the cluster runs n IPAs in parallel (spoilers ... we can!).

To show this in numbers, we measured the base IPA speed on a GTX 480. While the CPU on the system also affects it a bit, it runs at 19.46 ipps on average. On the other hand, our cluster with 12 GTX 480s runs at a total speed of 231 ipps, which is extremely close to 12 x 19.46 = 233.52 ipps! Looking at these numbers we assumed our system scales linearly, so when we increase the number of GPUs to, say, 24 we'll have 231 x 2 = 462 ipps.

With this assumption in mind, we measured the base IPA speed on a Tesla M2050, which is 14.80 ipps (yes, the Tesla M2050 is about 24% slower than the GTX 480) and expected to have a speed like 14.80 x 50 = 740 ipps on our production cluster with 50 Teslas. Our first result of 518 ipps was nowhere near that. We started investigating...

After some lousy ideas putting the blame on the IPA folks and node configurations, we took a step back and started questioning our ways of testing. We knew there were IO and Hadoop task management overheads but they were negligible ... for jobs containing large amounts of images. We missed that the definition of large would differ between clusters such as one with 12 GPUs and another with 50 (!). We were testing both using 100,000 images and that could have been a small number for the latter. We slowly increased the number of images to one million and...

we got close enough to the expected speed of 740 ipps with 709 ipps. MapReduce jobs in our system will process millions of images in production, which means the cluster will be fully utilized. If there were only a hundred thousand images, a large portion of the investment would have been wasted.
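The arithmetic behind these comparisons can be written down in a few lines. This is only a sketch using the numbers quoted in this post; the function names expected_ipps and efficiency are made up for illustration.

```python
def expected_ipps(per_gpu_ipps, gpu_count):
    """Speed we would see if the cluster scaled perfectly linearly."""
    return per_gpu_ipps * gpu_count


def efficiency(measured_ipps, per_gpu_ipps, gpu_count):
    """Fraction of the linear-scaling speed that was actually reached."""
    return measured_ipps / expected_ipps(per_gpu_ipps, gpu_count)


# Numbers from the post: (label, measured ipps, per-GPU ipps, GPU count)
runs = [
    ("12 x GTX 480", 231, 19.46, 12),
    ("50 x M2050, 100,000 images", 518, 14.80, 50),
    ("50 x M2050, 1,000,000 images", 709, 14.80, 50),
]
for label, measured, per_gpu, count in runs:
    print(f"{label}: {efficiency(measured, per_gpu, count):.1%} of linear")
```

The three runs come out at roughly 99%, 70% and 96% of linear scaling, which is the whole story of this post in one column.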

Lesson learned in scalability: You have to cut your coat according to your cloth. Or you shouldn't buy more cloth than what would be necessary to cut your coat. Or ... Whatevs, you got the point.

Lesson learned in testing: Always test your systems, then test the hell out of them and when those don't satisfy you change your tests and test again. It might cost some time but it will save money.

Java web services without (explicit) code generation - with exception handling

Finally... it's been some busy weeks: i sat constantly in front of the computer in the first one and constantly moved around in the others. Finally i found some space to finish what i started. But there isn't much of it, so i'll keep this short.

Previously i talked about some funny web services stuff, and finished with a problem concerning exception handling. In short form: SOAP cannot transparently handle Java exceptions, so you cannot throw something in the server and expect the client to catch the same. You need some transformation.

In a longer form: SOAP has a thing called soapfault which is the closest thing you have to a java exception, but in order to use it you have to accept some rules. First of all your exception should be a checked one. Second, soapfault is basically XML so it can only transform things that can be parsed/rendered into XML. Which means you have to wrap your exception information in a JavaBean enabling it to be easily transformed into XML and back. Looks good with one little problem: what if you can't get your exception information into a bean. Maybe you have errors in an enum or even worse, you are using an interface to describe your error codes. Well, JAX-WS has nothing to offer.

Another thing JAX-WS does not offer is a decent developer documentation. You have to waste some hours debugging to see which method gives you what. Except for typesafe exception handling, ... because here goes:

MyException is the custom checked exception. For simplicity, it has two String parameters but it is possible for you to create your interface implementing custom object or enum or whatever you need. Just change the object instantiation in line 78 to suit you. Usage of this snippet will be simply like

MyService port = getPort(new QName(MY_NS, "MYServicePort"), MyService.class);
port = JaxWsExceptionCatcher.catchOn(port);

And you can catch and process your exceptions just as if the service were in your classpath, still without code generation.

Java web services without (explicit) code generation

I don't know about you but i hate code generation. Bytecode generation may sometimes be useful, but it kills debugging capabilities so it should be avoided most of the time. Source-code generation, on the other hand, i simply fail to see the necessity of. If some 3rd party library will write the code i will run, why can't i simply let the library do whatever it needs over some sort of an API?

Anyways, we all know the story. If you are making use of an external SOAP web service, you are kinda forced to generate (source) code. But most of us expand this approach and generate code for SOAP web services between modules of the same project. Which is extremely unnecessary, after JAX-WS 2.0 (i guess, not sure about the version). Instead, we can give plain-old-java-interface of our service and WSDL url to JAX-WS and make it work for us.

class MyService extends Service {
  public MyService() throws Exception {
    super(new URL("http://path/to/service?wsdl"),
        new QName("http://service.my.org/", "MyService"));
  }
  public My getMyPort() {
    return getPort(new QName("http://service.my.org/", "MyPort"), My.class);
  }
}

The code above shows what's necessary on the client side. The Service we extend from is a class in the JAX-WS framework. My is the interface of the service we are trying to use. This is the simplest example, the one you will find when you google JAX-WS without code generation. But as always, no one's trying to make a living with hello-world applications.

Every module uses custom beans (complex-types) in communication so a single interface will not be enough to work (it will be if there are no complex-types). JAX-WS will auto-generate transport classes but will not touch business specific beans. So what i came up with is to make the service providing module publish a jar with the necessary beans and the web service interface. The service consuming module defines a dependency on that artifact and goes along with its life. The jar actually contains half of the stuff JAX-WS would generate, but now it's not ugly as in generated by some magic library, it's ugly as in some module developer wrote it ugly, so you can push him/her around. Another upside is that now that you have written the instantiator code (above) yourself, you can write it any way you like and, say, dependency inject it using guice.

Story does not end here though. Now that you have (almost) isolated yourself from SOAP-mechanics (using guice and all) you may want your service provider's exceptions untouched. Hold tight for second post.

Using ivy and maven together

It's not logical, highly unnecessary and probably expensive. But anyhow we found ourselves in that environment no matter what. The problem stemmed from the fact that the eclipse/RCP dependency system is incompatible with virtually everything out there. We were using ant/ivy and were pretty happy with it, but our UI side found no easy way of headless-building their application using it. Eclipse is trying to make use of maven 3 with a thing called tycho, but that's another story. Point is, they were practically forced to maven, and so was i (us).

The problem is, the eclipse project(M1), which is built using maven, depends on a project(I1) which is built using ivy. Since these projects are constantly evolving, dependency is for SNAPSHOT version. Add another oversight of choosing nexus as artifact repository manager, we ended up being unable to publish SNAPSHOTs with ivy and depending on them with maven.

We set M1's updatePolicy value to always and expected it to re-download the snapshot artifact of I1 on every change but there is more than one way to do this. Ivy relies on the timestamps of those while maven uses external metadata to identify if a SNAPSHOT artifact has been changed or not. But, ivy has no idea about an external metadata file during publish/deploy so nothing to use for maven. (1)

Nexus can actually repair missing metadata files but i think (not sure) it requires the artifacts to be deployed with uniqueVersions (those funny timestamp-like things replacing SNAPSHOT). Of course, ivy has no idea about those either. (2)

OK, we can disable uniqueVersions and get "SNAPSHOT" without funny timestamps. But no, because maven 3 got rid of the functionality and uniqueVersion is always on. (3)

Adding (1), (2) and (3), we had a huge incompatibility problem on our hands. Some research came back negative (maven blaming nexus, nexus blaming ivy, ivy asking questions why maven?...), so we fell back to disabling ivy-publish'es and using mvn deploy:deploy-file's instead. We reconfigured our jenkins accordingly and finally evaded the problems.

Bottom line: don't use ivy and maven together; it's not logical, highly unnecessary and probably expensive.

Apache ODE and CLOB issue

I took over some responsibilities from a recently departed colleague, and with it, i was kinda forced to turn back to the JEE world. Not exactly the same technologies and frameworks i am used to, but once you hate some part of something it is likely that you won't enjoy the other parts.

Anyways, the first assignment was to move some WS-BPEL processes from glassfish to Apache ODE. It sounds like it should be easy since WS-BPEL is a standardized and well-acknowledged specification, but only an inexperienced and/or naive developer believes that. Standards are never that standard. Only the simplest hello-world can be deployed to more than one (two at most) container without a problem, your JPA application will never port from hibernate to toplink and your standards-compliant webpage will never look the same in IE. Without some unknown hours/days of hard work, that is.

But in this instance, i got lucky. The hard part was already done and documented (1|2|3) by Hilal Tarakci (still not tweeting!), with whom i've been working more closely now. The last problem was the easiest one but helped me steal all the credit. ODE, by default, works using a derby database which doesn't like CLOBs larger than some size and barfs like this when it encounters one:

java.sql.SQLException: An unexpected exception was thrown
...
Caused by: java.sql.SQLException: An unexpected exception was thrown
...
Caused by: java.sql.SQLException: Java exception: 'A truncation error was encountered trying to shrink CLOB '' to length 1048576.: org.apache.derby.iapi.services.io.DerbyIOException'.
...
Caused by: org.apache.derby.iapi.services.io.DerbyIOException: A truncation error was encountered trying to shrink CLOB '' to length 1048576.

I guess this was somewhat expected because there is a small tutorial in the installation docs of ode, showing how to configure it to work on a mysql db. The distribution package also contains DDLs for Oracle, but if you're already running a postgresql server and don't want another link in the chain, you're (not) alone. Without further ado, here are the things you should do.

Create the database you wish to use on the server (you wish to use).

Get this SQL piece and execute on it.

Take this context snippet and place it into your $TOMCAT_HOME/conf/server.xml in <Host> part after modifying as necessary.

Get a jdbc jar from postgre and place it into $TOMCAT_HOME/lib

Get this properties file and place it into $TOMCAT_HOME/webapps/ode/WEB-INF/conf

Start tomcat.

And yes, i mostly got this from the original tutorial. Only thing i did was to edit SQL into the form that postgre would understand. For those of you running something bigger than tomcat, it should be easier to define a JDBC connection on JNDI.

Hadoop MapReduce job statistics (a fraction of them)

Well, this has been on my backlog for a while. The problem is extremely simple actually: when did a MapReduce job start processing? I need this info to report to my clients using my API, meaning redirecting them to the JobTracker's web interface is not an option.

Everyone using hadoop for some time knows 0.20 is the version to use, and everyone who has developed something other than a WordCount knows it's a PITA. The API is hard to use at best, misleading and incomplete most of the time. You might wonder how hard it can get to extract a basic (and easily accessible over the web interface) piece of information such as the start time of a job. All i can say is: very.

Without further ado, while i expect something like Job.instance("JOB_ID").getStartTime() here is the piece of crappy code i found to be working:

import java.lang.reflect.Field;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.*;

// The parameter is renamed so it no longer clashes with the local JobID variable.
long startTime(String jobIdStr) throws Exception {
  Configuration conf = new Configuration();
  JobClient jobClient = new JobClient(new JobConf(conf)); // deprecation WARN
  JobID jobID = JobID.forName(jobIdStr); // deprecation WARN
  RunningJob runningJob = jobClient.getJob(jobID);
  Field field = runningJob.getClass()
      .getDeclaredField("status"); // reflection !!!
  field.setAccessible(true);
  JobStatus jobStatus = (JobStatus) field.get(runningJob);
  return jobStatus.getStartTime(); // finally
}

As noted above, JobConf and JobID are deprecated. But since there is no way of working with anything non-deprecated, we reluctantly accept that. What we may not accept is working with reflection, but well... I couldn't find any other way (please point me to one if you know). It is actually funny to have that information in the status field of runningJob but not be able to access it because of the lack of a getStartTime() method which reads from it. (BTW v0.21 is closer to what i expect but it is largely unusable for various reasons.)

On the other hand, that wasn't exactly my requirement. There may be a delay between the time i submit a job and the time it starts processing, most likely because the cluster is busy. What i needed was when the job actually started processing, meaning the time the first task is fired on a task tracker. Now i expect something like Job.instance("JOB_ID").getTasksOrderedByStartDate().get(0).getStartTime() but i know i won't get what i expect, instead:

// Same imports as the previous snippet.
long actualStartTime(String jobIdStr) throws Exception {
  Configuration conf = new Configuration();
  JobClient jobClient = new JobClient(new JobConf(conf)); // deprecation WARN
  JobID jobID = JobID.forName(jobIdStr); // deprecation WARN
  RunningJob runningJob = jobClient.getJob(jobID);
  TaskID firstCompletedTaskID = // deprecation WARN
      runningJob.getTaskCompletionEvents(0)[1].getTaskAttemptId().getTaskID();
  for (TaskReport tr : jobClient.getMapTaskReports(jobID)) {
    if (tr.getTaskID().equals(firstCompletedTaskID)) {
      return tr.getStartTime(); // search !!!
    }
  }
  // Without this the method would not compile: no matching report was found.
  throw new IllegalStateException("No map task report for " + firstCompletedTaskID);
}

The first task completion event belongs to the SETUP task, which runs at the time of job submission no matter what the cluster is busy with. That's why i'm getting the second one in the array using [1].

One small problem is that i'm using task completion events, not task starting events, so i am assuming the first task to get finished is also the first task to get started. This is usually correct in my case but i know it will not apply to others.
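To make that caveat concrete, here is a small Python sketch with made-up task timings. The Task tuple is a hypothetical stand-in for Hadoop's TaskReport and the list is assumed to hold map tasks only (no SETUP task). It is not the Hadoop API, just the two ways of picking a start time side by side.

```python
from collections import namedtuple

# Hypothetical stand-in for Hadoop's TaskReport; times are epoch milliseconds.
Task = namedtuple("Task", ["task_id", "start_time", "finish_time"])


def start_of_first_finished(tasks):
    """What the code above approximates: start time of the first task to finish."""
    return min(tasks, key=lambda t: t.finish_time).start_time


def earliest_start(tasks):
    """The value actually wanted: the earliest start time over all tasks."""
    return min(t.start_time for t in tasks)


# Made-up example where the two answers differ.
tasks = [
    Task("m_000000", start_time=1000, finish_time=9000),  # starts first, slow
    Task("m_000001", start_time=1500, finish_time=4000),  # starts later, quick
]
```

With these timings the first approach reports 1500 while the job really started processing at 1000. When all tasks take about the same time, as in my case, the two agree.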

I haven't been able to find a way to get a job's finish date yet, i'm using job.end.notification.url for that. Hadoop sends a GET to a servlet on finished jobs so i simply get the time the service was called. It may not be accurate but again works for me.

In the light of these difficulties, i am thinking about a simple application that serves easily parseable job information. It would probably be rendered obsolete when 0.22 is out but it might still be useful to be able to consume such info with other languages than Java.

Scalatra result announcer w/ various datastores

In previous post, i talked about an examination result announcer application in scalatra/scalate/scalaquery. I mentioned i would try the app with different datastores in another day and post the results. That day has arrived at last.

Since i've talked about the system before I'll keep things short this time and jump straight to results.

                     RPS   p90 (ms)
In memory:           591   238
Voldemort (0.90):    552   256
Mongo (1.8.2):       548   255
Redis (2.2.12):      523   265
Cassandra (0.8.2):   504   273
HBase (0.90.3):      471   285
MySQL (5.5.15):      453   346

In memory storage means i used a scala.collection.mutable.Map object in my Controller to collect the results. The result above was measured with scala 2.9 parallel collections. Without them, the numbers were slightly smaller.

All 3rd party storage solutions were working on localhost with their default configurations. All of them have been accessed with preferred or well-known drivers.

Voldemort - Java API

Mongo - Casbah

Redis - scalaredis

Cassandra - Hector

HBase - Java API

MySQL - scalaquery/JDBC

I did not try to optimize my code as it is another one of my goals to measure how easy it is to get best performance with little effort.

As seen in the chart, voldemort and mongo are virtually the same in terms of these simple performance measurements. I guess this is because both storage systems work directly off memory. Since everything is configured to defaults, voldemort was using in-process BDB-Java, and mongo was using memory-mapped files (i guess). On the other hand, while being an in-memory system too, redis missed their performance by a small amount. I feel some sort of configuration requirement there.

Cassandra and HBase with their BigTable like storage mechanics lag behind others with small margins. I don't know much about Cassandra but running a pseudo-distributed HDFS and an HBase (and on the same computer as the application and jmeter) is highly discouraged. And, i guess i know enough HBase to say that it is not the perfect fit for this example application. Results simply reflected that.

Since my linux has updated itself several times in the last two months, i reran the mysql tests in order to keep things fair. You might have noticed its performance has also increased, but not to the point where it may compete with the others.

All the code is on github. Switch branches for different stores.

NOTE: Those system updates affected my couchdb too, upgrading it to 1.1.0. Long story short, my application running on it outperformed everything on the list with 649 requests-per-second and 90% of requests were under 201ms.

NOTE: Another thing i discovered was if i were to use rewrites in couchdb, it damages performance to 550 RPS and 230ms p90. Interesting...

Announcing results with scalatra

As other posts in the series mentioned, i am trying out some web frameworks and data stores with a small web application which would be used to announce the results of a hypothetical exam. You can find the details in the first post. Today, the example app will be of scalatra.

Scalatra is a sinatra-like lightweight web framework written in scala. For developers living under a rock for a few years, sinatra-like means just enough tools to map a URL to a method. No ORM, no templating, no authentication over LDAP. Just URLs and methods. Scala is a programming language which tries to leverage both object-oriented and functional concepts. Google is full of sites telling reasons why scala is a great language.

It is clear that a programming language and a simple web framework is far from being enough to develop a web application nowadays. In order to query results, they have to be stored in some database first. For this specific example, i'll use mysql. A simple web-application like this one would not actually require another layer on top of the database but since i am evaluating the ecosystem rather than developing a real-life application, i'll add scala-query into the mix. I hope it will ease the pain JDBC will induce. On UI side, mixing HTML with application logic is pretty much accepted as a bad practice, so i'll use scalate as the templating engine. As all of the tools are leaning towards sbt, it will be the bowl holding the soup together.

Since scalatra is the core piece in the environment, it seemed logical to start with it and add other ingredients as i go. There are two ways to create a scalatra project skeleton. The first way is simply cloning this repo. The other one requires installing giter8 and sbt separately. Both come with scalate included. The difference is that the former uses the sbt 0.7 series while the latter uses sbt 0.10. If you're like me and have a tendency to walk on the edge, you'll need to "g8 scalatra/scalatra-sbt" after installing these shiny tools. It will ask some questions about the project and create it.

$ g8 scalatra/scalatra-sbt
organization [com.example]:
name [scalatra-sbt-prototype]: resultannounce
servlet_name [MyScalatraFilter]: ResultAnnouncer
scala_version [2.9.0-1]: 2.9.0
version [1.0]:

After that is finished cd into your project and run "sbt". It will download some files and give you the sbt shell. Run an "update" to get your dependencies followed by a "jetty-run" to see some Hello application running on localhost:8080... Now get coding!

First thing i did was changing the default Hello screen to the login page, written in template main using jade.

get("/") { templateEngine.layout(root+"main.jade") }

After setting the form to simply POST to a url like /r/idnumber, it was easy to handle values in scalatra side.

post("/r/:idn") {
  val result = Result of (params("idn"),params("pass"))
  result match {
    case Some(_) => templateEngine.layout(
      root+"result.jade", Map("result"->result))
    case _ => templateEngine.layout(
      root+"main.jade", Map("formErr"->"Wrong Details"))
  }
}

The Result here is a DAO i have thrown with my limited scala and virtually non-existent scala-query knowledge.

object Result {
  val db = Database.forURL("jdbc:mysql:///ss?user=root",
    driver="com.mysql.jdbc.Driver")
  def of(id:String, passwd:String) = {
    db withSession {
      val q = for (e <- Results if e.id === id) yield e
      q.first
    }
  }
}
object Results extends Table[(String, String, String, String)]("results") {
  def id = column[String]("id", O PrimaryKey)
  def passwd = column[String]("passwd", O NotNull)
  def name = column[String]("name", O NotNull)
  def result = column[String]("result", O NotNull)
  def * = id ~ passwd ~ name ~ result
}

There is a password-hash control step too but i omitted it for this post. The Results object defines the database table and enables us to query without writing any SQL.

After some manual testing to check everything is in working order, i loaded 3M results into that table and started hammering the application with jmeter. Simple stress tests on the login screen, which does not touch the database or anything, yielded 658 requests-per-second for 100 concurrent users and 90% of the requests were served within 180ms. Considering the test for couchdb on the same machine gave 500 requests-per-second with a 290ms 90% line, scalatra and scalate seem like an improvement. But as i have mentioned before this is hardly a real-life scenario: users generally won't bounce off the login screen, they will login using their details and try to see their exam results. Using a jmeter workbench for this scenario, requests-per-second dropped to 396 and the 90% line increased to 413ms. Now that seems like a little step back from what i was able to achieve with couchdb. I'll try to identify the pieces (scalate, scala-query, mysql) causing the slowdown but that's for another day.

Scalability side is the same as the couch: application is totally stateless so running mirrors behind a load-balancer should be enough to increase that RPS to the point required. Since there are no writes there shouldn't be any problem for scaling them.

The whole application took my whole day including the time needed to get to the default Hello screen. Besides mysql, all the tools here were new to me, yet i was able to make something that solves a problem. Although i am a newbie, scala is fun to work with and considering what i was able to accomplish in a day, it is nowhere near complex. But couchdb still seems a better fit for this kind of problem. Results are documents and a document store with HTML rendering capabilities is all one can ask for.

UPDATE: Code is now on github

Resource synchronization on Hadoop clusters with ZooKeeper - Part II

Straight from where i left off. GPUs are massively parallel in contrast to CPUs, hence for some parallel processes, they are damn fast. The benchmarks you see around showing performance increases over 100x are theoretically true. By theoretical i mean pure CPU vs GPU computing power. In other words, for an infinitely running computation, it is possible to get 100x more results with a GPU than you would with a CPU core given a constant amount of time. But as experienced GPGPU developers would undoubtedly know, in practice, things rarely happen that way.

First of all, only a small part of commercially meaningful computations run infinitely. The first infinitely running computation that comes to my mind is calculating the digits of pi. That surely makes some money if you are into cryptography or something, but i guess it is safe for me to say that it is both a niche and a dominated market. Another such computation may be fractal generation, and i have yet to meet anyone making money out of generating colorful images. Businesses sell results, and to get results, processes must end in some way.

Two of the well known facts of finite processes are that they need some data beforehand, and they output some data afterwards. That means the duration of any process will roughly be of IO time and computation time. A GPU may decrease computation time but since you cannot change IO time, it will eat into your 100x expectations. Bare computing time may decrease but when IO time stays the same, (actually it increases in GPGPU processes but that's another story) depending on the type of the problem, you may settle for 3x performance or less.
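A back-of-the-envelope version of this argument, as a Python sketch: the function name overall_speedup and the timings in the comments are made up for illustration, and the model simply assumes IO time is untouched while only the compute part gets faster.

```python
def overall_speedup(io_time, compute_time, compute_speedup):
    """Speedup of the whole process when only the compute part gets faster."""
    before = io_time + compute_time
    after = io_time + compute_time / compute_speedup
    return before / after


# Hypothetical process: 1 second of IO and 2 seconds of computation.
# A GPU that computes 100x faster still leaves the IO second untouched.
print(round(overall_speedup(1.0, 2.0, 100), 2))
```

With these made-up timings the whole process gets a bit less than 3x faster, and no matter how fast the GPU is, it can never go past (1 + 2) / 1 = 3x.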

So, as shown in Part I, if you configure your systems to have your resources (GPUs) occupied by only one process at any given time, you do not use them optimally. Meaning, if you configure your MapReduce TaskTrackers' maximum simultaneous map tasks count to the number of your resources on the system, JobTracker will wait for each task to finish before starting another one, and your resources will sit idle in IO part of these tasks.

One solution is using more than one process. It is possible to start two processes for each resource and let one use the resource while the other one does its IO operations. After first one is done with the resource, it can signal the other one to start operation. So resources will never have to wait for IO to be done beforehand.

This process signaling mechanism fits perfectly with zookeeper's watches. You can set a watch on a znode and zookeeper will notify you when there is a change on it. In this particular problem, second process may set a watch on a common znode. When the first process is done with the resource, all it has to do is modify the znode to let the second process know it has finished. This is the exact explanation of what ResourceSynchronizer does. When you call .request() it will return the next free resource or if there aren't any free resources, it will wait for another process to call .release() to return anything. So the process will be blocked before using the resource.

You can set the same pool/resource list and spawn some processes to see the effect. Say you set your resources as ["res0", "res1"], and your resource intensive procedure takes 10 seconds. If you run 3 processes in 10 seconds, first 2 will get "res0" and "res1" respectively for their rs.request() calls, while the last process will wait till any of the first two processes call rs.release().

ResourceSynchronizer rs = new ResourceSynchronizer(
    new ZooKeeper("zkensemble", 20000, null),
    "/pool", new String[]{"res0", "res1"});
log.warn("Requesting resource...");
log.warn("Got resource : " + rs.request() + ". Working...");
Thread.sleep(10000);
log.warn("Process Done! Releasing Resource");
rs.close();

Output of process #1

19:01:16,799 WARN Run - Requesting resource...
19:01:16,824 INFO ResourceSynchronizer - ZK ensemble connected
19:01:16,846 INFO ResourceSynchronizer - Resource '/pool/res0' allocated
19:01:16,846 WARN Run - Got resource : res0. Working...
19:01:26,846 WARN Run - Process Done! Releasing Resource
19:01:26,853 INFO ResourceSynchronizer - Resource released
19:01:26,861 INFO ResourceSynchronizer - ZK connection closed

Output of process #2

19:01:22,504 WARN Run - Requesting resource...
19:01:22,518 INFO ResourceSynchronizer - ZK ensemble connected
19:01:22,541 INFO ResourceSynchronizer - Resource '/pool/res1' allocated
19:01:22,541 WARN Run - Got resource : res1. Working...
19:01:32,541 WARN Run - Process Done! Releasing Resource
19:01:32,556 INFO ResourceSynchronizer - Resource released
19:01:32,564 INFO ResourceSynchronizer - ZK connection closed

Output of process #3

19:01:23,967 WARN Run - Requesting resource...
19:01:23,992 INFO ResourceSynchronizer - ZK ensemble connected
19:01:24,003 INFO ResourceSynchronizer - No available resource, waiting...
19:01:26,854 INFO ResourceSynchronizer - Retrying to get another resource
19:01:26,871 INFO ResourceSynchronizer - Resource '/pool/res0' allocated
19:01:26,871 WARN Run - Got resource : res0. Working...
19:01:36,871 WARN Run - Process Done! Releasing Resource
19:01:36,886 INFO ResourceSynchronizer - Resource released
19:01:36,894 INFO ResourceSynchronizer - ZK connection closed

Output of process #4

19:01:26,802 WARN tool.Run - Requesting resource...
19:01:26,823 INFO ResourceSynchronizer - ZK ensemble connected
19:01:26,835 INFO ResourceSynchronizer - No available resource, waiting...
19:01:26,854 INFO ResourceSynchronizer - Retrying to get another resource
19:01:26,879 INFO ResourceSynchronizer - No available resource, waiting...
19:01:32,556 INFO ResourceSynchronizer - Retrying to get another resource
19:01:32,573 INFO ResourceSynchronizer - Resource '/pool/res1' allocated
19:01:32,573 WARN Run - Got resource : res1. Working...
19:01:42,573 WARN Run - Process Done! Releasing Resource
19:01:42,581 INFO ResourceSynchronizer - Resource released
19:01:42,590 INFO ResourceSynchronizer - ZK connection closed

These 4 outputs come from the same piece of code, run 4 times a few seconds apart. The timestamps show that the 1st and 2nd processes got res0 and res1 right after requesting them. The 3rd process took res0 just after the 1st one released it. The 4th process also tried for res0 when the 1st one released it but lost that race, so it waited for the 2nd one to release res1. In both hand-overs the resource sat idle for under 20 milliseconds.

I have also added a REUSE setting to the code, which lets each resource be handed out a given number of times before it counts as taken. In the previous example, if REUSE is set to 2, the first 4 processes will get "res0", "res1", "res0", "res1", and the fifth one will wait for a resource to be freed up. With small modifications I am sure it can also be a solution to the IO-increasing properties of GPGPU processes, but that would probably require changing the way you access your application, so I am not there yet.
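The REUSE behaviour can be modelled in memory without ZooKeeper. The sketch below is only an illustration under stated assumptions: the class name ReusePoolSketch and its methods are made up for this example, the "least-loaded resource first" rule is a guess that happens to reproduce the order given above, and none of this is the actual ResourceSynchronizer code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of a named-resource pool with a reuse limit.
// It is NOT the ZooKeeper-backed ResourceSynchronizer from the post.
class ReusePoolSketch {
    // Resource name -> number of processes currently holding it (insertion order kept).
    private final Map<String, Integer> holders = new LinkedHashMap<>();
    private final int reuse;

    ReusePoolSketch(int reuse, String... names) {
        this.reuse = reuse;
        for (String name : names) {
            holders.put(name, 0);
        }
    }

    // Returns the least-loaded resource that is still below the reuse limit,
    // or null when every resource is fully taken (the caller would then wait).
    synchronized String request() {
        String best = null;
        for (Map.Entry<String, Integer> e : holders.entrySet()) {
            boolean hasRoom = e.getValue() < reuse;
            if (hasRoom && (best == null || e.getValue() < holders.get(best))) {
                best = e.getKey();
            }
        }
        if (best != null) {
            holders.merge(best, 1, Integer::sum);
        }
        return best;
    }

    // Gives one slot of the named resource back to the pool.
    synchronized void release(String name) {
        holders.computeIfPresent(name, (k, v) -> v > 0 ? v - 1 : 0);
    }
}
```

With `new ReusePoolSketch(2, "res0", "res1")`, four consecutive calls to `request()` return res0, res1, res0, res1, and a fifth returns null until something is released.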

UPDATE: A colleague advised me to use "A semaphore implementation using ZooKeeper" as the title, which would be appropriate but not entirely correct. As a careful reader might notice, the mechanism is not binary nor does it use a counter. Instead, it holds the names of the resources it is supposed to allocate.

Resource synchronization on Hadoop clusters with ZooKeeper - Part I

"We need ZooKeeper to run HBase." Until last week, that was basically my view of ZooKeeper. It is a distributed configuration and coordination service, HBase requires it, so we have to put it in the cluster. For a cluster of our size 1 instance seems to be enough, but we are running 3 instances. These four sentences pretty much summed up what I knew about it. Fortunately I had looked at its main page before and remembered its somewhat abstract purpose: "Distributed coordination".

First, some clarification: this post is about coordinating processes that require access to some limited system resource. To keep things simple, I use one example resource throughout the post, which is the GPU. Another example might be distributed CD publishing using Hadoop on machines with a number of CD writers, or controlling an array of Arduino devices to simulate the LHC. In short, this post has no direct relationship with GPGPU programming or GPU kernel thread synchronization. If you arrived here googling that, I'm sorry.

So, the problem: if you put a number of CPU cores in a single computer and start running processes, the operating system will place them on the cores accordingly. Say you have a 4-core machine and try to multiply 4 matrices simultaneously; each multiplication will be done on a different core. I don't know if there have been any developments on this, but that's not the case for multi-GPU systems. Say you have 4 GPUs on a machine and you spawn 4 processes, each meant to multiply its matrices on one of them; you need to explicitly tell those processes not to overlap with one another. If you leave it to the OS, one or more of your GPUs may sit idle while processes pile up on the others and starve for resources. I heard Mac OS can manage this, but Macs are not suitable for our environment.

In theory: nothing other than yourself can decide which process (read: map task in a MapReduce job) should occupy which resource. The simplest solution is to hand each process the identifier of the resource it should operate on, for example by executing it with the appropriate parameters. But this would mean manual control of all the processes, which is not possible with MapReduce. Another solution is to have processes ask a daemon which resource to use. The daemon keeps track of which process uses which resource and answers other client processes accordingly. This is actually how we had been working for a while.

In practice: this daemon brings maintenance issues, like any other software component. It is just another service one needs to deploy to the machines in the cluster and keep working properly. Because of this, we were hesitant to go to production with this setup and kept looking for another solution. I am not sure how it happened, but ZooKeeper seemed like it could do such a thing. Let me rephrase that: we thought we could do GPGPU process synchronization with ZooKeeper without actually knowing what ZooKeeper does. It is a distributed coordination service, right? How hard could it be?

After reading the "Getting Started" part of the ZooKeeper documentation, I saw that I had been dealt a blackjack. I can create some data points (called znodes), load some small data into them, and access those from any other process. A complete solution to what had kept us scratching our heads. All I needed to do was modify my mappers a little so they talk to ZooKeeper before and after the process. Since GPUs should be coordinated on a per-computer basis, mappers should know which computer they run on and which GPUs that node has. I applied Occam's razor and got this:

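import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.net.InetAddress;

// Name of the machine this mapper runs on.
private String hostname() throws IOException {
    return InetAddress.getLocalHost().getHostName();
}

// Lists GPU device names under /dev (nvidia0, nvidia1, ...), skipping nvidiactl.
private static String[] discoverGpus() {
    File[] gpus = new File("/dev").listFiles(new FilenameFilter() {
        public boolean accept(File dir, String name) {
            return name.startsWith("nvidia") && !name.endsWith("ctl");
        }
    });
    String[] ret = new String[gpus.length];
    for (int i = 0; i < gpus.length; i++) {
        ret[i] = gpus[i].getName();
    }
    return ret;
}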
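The filter above accepts every name that starts with nvidia and does not end in ctl. On machines where the driver exposes further device nodes (newer installations may have entries such as nvidia-uvm), a stricter match on "nvidia followed by digits" might be safer. The following is only a sketch of that idea; the class GpuNameFilterSketch and its methods are hypothetical and not part of the original mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stricter variant of the device-name filter.
// Assumption: GPU device nodes are named exactly "nvidia" + a number.
class GpuNameFilterSketch {
    private static final Pattern GPU_DEVICE = Pattern.compile("nvidia\\d+");

    // True only for names like nvidia0, nvidia1, nvidia12.
    static boolean isGpuDevice(String name) {
        return GPU_DEVICE.matcher(name).matches();
    }

    // Keeps the GPU device names from a directory listing, preserving order.
    static List<String> filterGpus(List<String> deviceNames) {
        List<String> gpus = new ArrayList<>();
        for (String name : deviceNames) {
            if (isGpuDevice(name)) {
                gpus.add(name);
            }
        }
        return gpus;
    }
}
```

Given the names nvidia0, nvidia1, nvidiactl and sda, `filterGpus` keeps only nvidia0 and nvidia1.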

On our nodes, GPUs show up as devices under /dev with sequential names: nvidia0, nvidia1, ... There is also one other device named nvidiactl, which is not a GPU. The other additions to my mappers are these:

// in setup()
rs = new ResourceSynchronizer(
        new ZooKeeper(QUORUM_ADDRESS, TIMEOUT, null),
        "/gpusync/" + hostname(),
        discoverGpus());

// in map()
String gpu = rs.request();
process(); // whatever
rs.release();

// in cleanup()
rs.close();

Now, I cheated a little bit here and didn't include the core piece, ResourceSynchronizer. That's because, in addition to holding GPU names and supplying them to the processes that request them, it does one additional and somewhat more sophisticated task concerning GPGPU operations. Seasoned GPU developers may guess what it is, but I have left it for Part II of this post.

LZO vs Snappy vs LZF vs ZLIB, A comparison of compression algorithms for fat cells in HBase

Now and then, I talk about our usage of HBase and MapReduce. Although I am not able to discuss details beyond what is written on my LinkedIn profile, I try to share general findings that may help others trying to achieve similar goals. This post is about recent research into increasing IO performance for our MapReduce jobs.

Why Compression?

The HBase documentation and several posts on the hbase-user mailing list say that storing data with some form of compression may increase IO performance. Considering that Hadoop clusters almost always run on commodity machines, the reason is simple to explain: disks are slow. The Hadoop workloads I know about are generally data-intensive, which makes data reads a bottleneck in overall application performance. By compressing the data we reduce its size and achieve faster reads. On the other hand, we now need to uncompress that data, so we spend some CPU cycles. It is simply trading IO load for CPU load.

If the infrastructure is short on disk capacity but has no performance problems, it may be logical to use an algorithm that gives huge compression ratios at the cost of some CPU time, but that's usually not the case. Large-capacity disks are far cheaper than fast storage solutions (think SSDs), so it is better for a compression algorithm to be fast than to give higher compression ratios. Because of that, Hadoop applications prefer LZO, a fast real-time compression library, to ZLIB variants. Of course this is general talk; to see real performance changes and compression ratios, one has to try those algorithms on one's own data.

Which algorithm?

Our data is about 700kB per row, and for testing purposes we have 100k rows. Each cell contains an image, more specifically a subset of an image, so it is binary and supposedly not as compressible as, say, a log file. Using no compression, our test data of 1000 items takes up 670MB and our MapReduce tasks are able to read a cell in 8.41ms.

The first algorithm we tried was ZLIB, via java.util.zip.Deflater/Inflater, following this post by @jdcryans. It simply involves using Deflater just before "Put"ting data into HBase and Inflater just after reading data from "Result"s. The total size of our 1000 items decreased to 346MB, meaning a compression ratio of 48%. But our reading performance dropped by 16%, with the time per row increasing to 9.73ms.
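As a rough illustration of this per-cell approach, the sketch below compresses and restores a byte array with java.util.zip. The class and method names (ZlibCellCodecSketch, compress, decompress) are made up for this example and the HBase Put/Result calls are left out, so treat it as a minimal sketch and not as the code from the referenced post.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: compress a cell value before writing, restore it after reading.
class ZlibCellCodecSketch {

    // Deflates the raw bytes with the default compression level.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflates bytes produced by compress(); rejects truncated or corrupt input.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("incomplete compressed data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

In a mapper, `compress` would be applied to the value handed to the Put and `decompress` to the value taken from the Result.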

The second one was the famous LZO. Although we are unable to redistribute it because of licensing issues, we still felt the urge to test it and see what we were missing. It is somewhat harder to use in Hadoop (at least the recommended way), but I managed. You can check here and here for instructions on how to set it up. This complexity does come with a benefit, though. All the other methods I talk about here compress data on a per-item basis. LZO, on the other hand, compresses the whole file in HDFS, so in a regular setup it is expected to give better compression ratios, since it can exploit similarities among the rows. Anyway, our 1000-item set came to 398MB, meaning a 41% compression ratio, and we saw a 5% increase in reading performance too: we read one item in 8.1ms, compared to 8.41ms uncompressed. So it is starting to become a win-win.

The third test was an LZF implementation, ning-compress, following Ferdy Galema's response to the previous Deflater tip. It works the same way as the Deflater approach: call LZFEncoder.encode just before writing to HBase and LZFDecoder.decode just after reading. In this test our data size was 400MB, meaning a compression ratio of 40%. Reading performance increased by 21%, with 6.63ms spent per item.

The last one was Google's recently announced Snappy. The same compress-each-item-separately mechanism applies here, with Snappy.compress and Snappy.uncompress. Data size was 403MB, which means around a 40% compression ratio, and we read our data at 6.37ms per item, which indicates a 25% increase in IO performance.

Conclusion

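| Algorithm | Compression ratio | IO performance increase |
|-----------|-------------------|-------------------------|
| Snappy    | 40%               | 25%                     |
| LZF       | 40%               | 21%                     |
| LZO       | 41%               | 5%                      |
| ZLIB      | 48%               | -16%                    |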
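The compression ratio column can be rechecked from the reported sizes. The sketch below assumes that "compression ratio" means the percentage of space saved relative to the 670MB uncompressed set, which is an interpretation of the figures and not something the measurements state explicitly; the class and method names are hypothetical.

```java
// Hypothetical helper that recomputes the "compression ratio" column.
// Assumption: ratio = share of space saved versus the uncompressed size.
class CompressionRatioSketch {

    // Percentage of space saved, rounded to the nearest whole percent.
    static long savedPercent(double originalMb, double compressedMb) {
        return Math.round((1.0 - compressedMb / originalMb) * 100.0);
    }
}
```

With the sizes above, `savedPercent(670, 346)` gives 48 for ZLIB, and 398, 400 and 403 give 41, 40 and 40 for LZO, LZF and Snappy.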

I suspect something is off in the LZO scores, since I was expecting much better performance. But it doesn't matter, because we cannot redistribute it anyway. Snappy-java, with its Apache license, is a clear winner. It is very easy to use, too.

I have to remind you again: YMMV. These are the scores for data that consists of 700kB rows, each containing binary image data. They probably won't apply to things like numeric or text data.
