Optimizing nginx cache parameters with scalding
When we started thinking about different types of machines for content caching, one of the questions was the capacity. Mainly RAM, and disk capacity of course, because our content is composed of generally large files. It should be big, is the easiest answer, but when the active content changes overtime, it's not a solution. The fresh content has to be cached continuously and stale content will idly sit in that big cache, causing massive waste.
Other side of this story is there is no definitive answer. There is no global rule that says this amount of RAM will be enough. It depends on the content and more importantly how it changes. If it's a news site publishing some dozen articles with a few photos everyday, it's highly probable that those images are simply never accessed after a few days, and it means a few hundred megabytes of cache is plenty. If it's an archive with millions of content mainly receiving long-tail traffic, you'll probably need much more than that. And if it's a moderately sized website with heavy content receiving organic and social traffic, it's probably a mix of those two scenarios and noone knows what your effective cache-size can be.
By the way, these problems are very well known in the industry. Content caching products generally offer configuration options to tune according to the needs. Very well known and widely used varnish for example, has a complete language to help administrators define their content behavior in any way they like. We use nginx for our content delivery system and it mainly has 3 knobs to turn:
proxy_cache_valid
proxy_cache_min_uses
proxy_cache_path inactive
valid directive sets the duration for responses to be used. For example if you configure valid as 10m that means the response from upstream will be used for 10 minutes and it will be invalidated after that and the next request will be passed to upstream. It is also possible and a more general practice for upstreams to determine the validity of their responses (using Expires and Cache-Control) so this directive is used as a last resort only. In addition to that, we are currently focused on caching static content like images that never gets invalidated, so we'll set this like 1 year and not mess with it for the sake of simplicity.
min_uses sets the number of requests after which the response will be cached (A direct quote from nginx docs). In practice this means if you set this as 2, for each key, the first response will bypass the cache but nginx will remember the number of requests. When the second request comes in, nginx will see that this is the second request for this key so it will store the response, and the subsequent requests will be answered from there.
inactive parameter of cache_path directive is the missing element in that min_uses scenario. Cached data that are not accessed during the time specified by the inactive parameter get removed from the cache regardless of their freshness (Another quote). This parameter also sets for how long nginx will remember request counts. If inactive is set at 30 minutes, 2 subsequent requests for the same key in the previous example must arrive within 30 minutes in order for the response to be stored. Otherwise the second request will be seen as the first because nginx wouldn't remember the other request.
These two parameters offer a suprisingly wide range of possibilities for cache efficiency. Reminding the problem at hand, we want to minimize both the cachesize and our upstream traffic which are inversely correlated. Meaning, you can minimize upstream traffic if you simply cache and hold on to everything; and minimize cachesize if you don't cache anything and serve everything from upstream. We have our boundary conditions and our function is like (cachesize, hitratio) = f(inactive, min_uses) but the problem is that f depends on how the content is being accessed which is not deterministic.
What we can do here is experiment and hope that the law of large numbers apply. Run the content servers with different configurations for some specified number of days and monitor the upstream traffic and maximum cache size. But the problem is that this is very costly and prone to error. If different configurations are run on different days this means theoretical independent error on each experiment is not well... independent. Or we can send the traffic to different configurations and run all of them together but that means we'll need a huge setup depending on the number of configurations we want to test.
Or we can approximate the scenarios. We have our accesslogs for content servers, if we can replicate the caching logic with a simpler function, we can run this over logs with different parameters and simply calculate our variables. It will take minutes instead of days, and we don't have to setup anything. Our log is stored in HDFS so it's only a problem of writing some MapReduce-like thingy to convert it into a virtual cache. As far as simple functions go, i don't think anything can beat scalding, so here goes:
We want to find out the variables for all the combinations of these inactive and min_uses values.
Source is an example here. Ours is actually composed of parquet files with whole lot of different fields but we only need those 4 for this analysis.
Here we can do some basic filtering/cleaning. These are examples only.
This assumes the cache keys are the request URI's. If this is not true, like we have some rewrite rules changing cache keys, we need to change here accordingly.
This first grouping will generate the histories of all the keys in the given data. By max'ing and setting bytessent as filesize is the main approximation part. In theory the content cached might be larger than what sent to client (e.g. client interrupted the transmission). The euses field of each content will be a map containing request counts at precise milliseconds.
Since we want to find out more than one inactive X min_uses configuration, we inject them here.
This is where we convert access history as the changes to cache state. It's basically a modified sessionization.
Since the access history was stored in a map, we need to sort it by accesstimes first. Then we start folding over a list which will be this key's contribution history to the store. At finish, the list will hold tuples of positive numbers with accesstimes, meaning the file is stored to cache; and negative numbers with accesstimes meaning the file is removed from the cache. We also carry the number of requests (uses) in order to account for min_uses.
We group by inactive X min_uses and run all of the contribution histories sorted by their access times, adding and subtracting to cachesize as necessary and keeping track of the maximum size reached. We also aggregate upstream requests/transfer for non-negative history entries as they mean requests are passed to the upstream.
As you can see, this analysis does not aggregate total # of requests or transfer. I simply omitted them as they're a simple matter of group count and sum.
Nothing explains an analysis as good as its results so here are the numbers for one of the larger domains we run.
minuses 1 2 3 inactive 10m 3.42 1.18 0.78 30m 7.66 2.97 2.03 1h 12.46 5.17 3.63 2h 19.82 8.73 6.23 6h 39.04 18.77 13.51 1d 73.62 31.22 20.00
The extreme case of caching everything at the first request requires 73 GiB of cache capacity. Which also means that is is the total capacity our active content requires. While not an unreasonable number, not enough to make a decision. We need to look at how our upstream transfer changes in order to evaluate if that much capacity really pays off.
minuses 1 2 3 inactive 10m 233.09 279.84 301.58 30m 175.32 215.24 234.81 1h 144.12 179.90 198.23 2h 118.28 150.68 168.21 6h 89.20 118.70 136.38 1d 73.62 104.85 124.85
Minimum amount of upstream transfer is 73 GiB, the same as maximum cache capacity of course. Accepting this as our baseline, we can start compromising.
If we set our min_uses to 2, it shows 42 GiB of content is requested only once so caching them is a waste. But 31 GiB of content is requested more than once so when we don't cache it the first time, we'll need to request it again, increasing our upstream transfer to 73 + 31 = 104 GiB.
If we decrease our inactive duration to 6h from the maximum 24h, this will mean we'll start removing items earlier so our cache capacity will reach a maximum 39 GiB and we'll need 16 GiB of early-removed content to be requested again, increasing our upstream transfer to 89 GiB.
An interesting finding here is the comparison between a) min_uses=3/inactive=24h and b) min_uses=2/inactive=6h. If you look at the tables, a requires 20 GiB of cache and 124 GiB of upstream transfer, while b requires 18 GiB of cache and 118 GiB of upstream transfer. Scenario b is better than a in both aspects. A look at the scatter plot of cachesizes and upstream transfer should make things clear:
What we want is the optimal point in that curve. It is the configuration min_uses=2/inactive=6h resulting in a maximum cachesize of 18 GiB and a total upstream transfer of 118 GiB, which has the minimum normalized distance in all scenarios tested. What we can next is run a regression and find the optimal point in untested scenarios too but i don't think it'd be too far.










