Optimizing Your CDN Cache With Cloudflare and Nginx

Recently I have been working on a personal project to help support Manjaro, the Linux distro I use as a daily driver. I decided to set up a package mirror so that users from the community could connect to my server and pull updates or new packages. While researching, I found that pushed updates can cause large spikes in traffic, and that I would need to be prepared to serve around 1TB of traffic a week. Because I am doing this on a budget and don’t run a home lab, I needed to work out the most cost-efficient way to support all this traffic. So I was looking for a cloud server and a way to reduce, or at least manage, that traffic. Due to the way the list of mirror servers is published, it’s not something I can opt out of quickly if I hit a data limit or run into some other cost barrier.

I did some research and settled on a pretty basic tech stack. The big player here is Cloudflare: I would use it to cache as much of the traffic as I could, limiting the load on the server and hopefully cutting down on the amount of data I had to serve from the server directly. The tech stack ended up as:

  • Linode Server running Linux
  • Nginx to serve the HTTP/HTTPS traffic
  • Cloudflare to run as a CDN

After a bit of setup and security tuning, I was ready to start testing and see what kind of cache hit ratio I could get out of Cloudflare. Initial tests showed I was only getting a cache hit ratio of about 30%, which was far too low for what I was trying to do. So into the docs I dived, to find out more about caching setups with Cloudflare and how the mirror interacts with the package manager, Pacman.

The basic breakdown of how Pacman works is that it first fetches a particular DB file stored in a predefined path. This DB file contains everything it needs to know about which packages are available and which versions are on offer. It then does a local comparison to check whether there are updates to any of its installed packages, and pulls the new versions based on the info in this DB file.

The path to an individual package doesn’t change. The package manager finds out about newer packages from the DB file, not by pulling from a /latest path or some other symlinked path where the path stays the same and the package version behind it changes. The only paths whose content changes in place are the DB files.
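To make that flow concrete, here is a minimal Python sketch. The package names, versions, and dictionary layout are made up for illustration, and real Pacman orders versions with its own comparison rules rather than a plain inequality. The only detail taken from Pacman itself is the name-version-arch.pkg.tar.zst filename pattern.

```python
# Hypothetical snapshot of what a repo DB advertises: package name -> version.
repo_db = {"linux510": "5.10.2-1", "nginx": "1.18.0-2", "zstd": "1.4.8-1"}

# Hypothetical versions currently installed on the client.
installed = {"linux510": "5.10.1-1", "nginx": "1.18.0-2"}


def pending_upgrades(repo_db, installed):
    """Return {name: repo_version} for installed packages whose version differs.

    Simplification: any difference counts as an upgrade; real Pacman
    compares versions properly.
    """
    return {
        name: repo_db[name]
        for name, version in installed.items()
        if name in repo_db and repo_db[name] != version
    }


def package_filename(name, version, arch="x86_64"):
    """Build the versioned filename the client would request from the mirror."""
    return f"{name}-{version}-{arch}.pkg.tar.zst"


for name, version in pending_upgrades(repo_db, installed).items():
    print(package_filename(name, version))
```

Because the version is part of the filename, a given package URL always refers to the same bytes, which is what makes the packages safe to cache for a long time while the DB file is not.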

So this means I can just cache everything, right? Well, no. That pesky DB file needs to stay up to date, as it’s the only way for the package manager to find new versions of packages.

Caching Headers

OK, so let’s play around with the Cloudflare caching rules and see how we go. I pushed the cache level to the longest duration and created an exception for the DB file using Cloudflare’s page rules. This worked a little better: I was up to about a 35% cache hit ratio.

By doing some debugging in a browser I was able to see which resources were getting cached and which were still being served by my origin server. It looked like a lot of the packages were still not getting cached. In fact, none of the packages with the extension .tar.zst were, an extension used mainly for Arch packages with the Pacman package manager. Digging further into the Cloudflare docs shows what they cache by default: only some very specific extensions that they find are most common on websites and, no surprise here, my .tar.zst extension is not listed. That same page also says that the largest file we can cache is 512MB. So our packages will fit in the cache; we just need a way to tell Cloudflare to cache them. Cloudflare has a cache setting called “Cache Everything”, but as some of the other docs point out, this will ignore the exceptions I created for the .db files.
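The same check can be scripted instead of done by hand in a browser. Cloudflare reports its decision in the CF-Cache-Status response header (HIT, MISS, and so on). The sketch below is one possible way to read it with Python’s standard library; the helper names and the sample headers are my own and only illustrative.

```python
import urllib.request


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def cache_status(headers):
    """Return the CF-Cache-Status value in upper case, or "NONE" if absent."""
    for key, value in headers.items():
        if key.lower() == "cf-cache-status":
            return value.strip().upper()
    return "NONE"


def hit_ratio(statuses):
    """Fraction of responses that Cloudflare answered from its cache."""
    if not statuses:
        return 0.0
    return sum(1 for status in statuses if status == "HIT") / len(statuses)


# Canned headers standing in for real responses (illustrative only).
sample_responses = [
    {"CF-Cache-Status": "HIT"},
    {"cf-cache-status": "MISS"},
    {"Server": "nginx"},
    {"CF-Cache-Status": "HIT"},
]
statuses = [cache_status(headers) for headers in sample_responses]
print(statuses, hit_ratio(statuses))
```

Against a live mirror you would build the list with cache_status(fetch_headers(url)) for a handful of DB and package URLs, requesting each one twice so the second request has a chance to be a HIT.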

So I was left with one other option. Cloudflare has a setting called “origin cache-control”, which lets you use headers set by your origin server to control Cloudflare’s cache. Cloudflare’s documentation on origin cache-control covers this in more detail.

Nginx’s documentation explains how to set headers on particular files and in what order Nginx evaluates its location settings, so with that we can match the requests we care about and set the headers we want.

By combining the values available for the “Cache-Control” header we can tell Cloudflare exactly how we want it to cache things. Selecting the correct collection of values is simpler than most people think. The first and most important question is, “Should this be cached?” Once you have answered it you are 90% of the way there. You should look at caching everything that the browser considers static: CSS, JS, images, icons, GIFs, and so on. HTML is not generally considered static, because content gets injected into an HTML page while its URL remains the same.

Next, look at the lifetime of things: how frequently will they change? A JS or CSS file is more likely to change than an image or icon, so you can tweak how long you want things cached based on this. That is what the max-age value controls, in seconds. If it’s something you update semi-frequently, or you want the cache to be very strict about what it does once the max-age expires, you can add one of two extra values: must-revalidate or proxy-revalidate. The proxy-revalidate value only applies to shared caches such as Cloudflare, not the user’s browser, while must-revalidate applies to both browsers and Cloudflare. Both control what the cache does once the max-age expires.
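Since the header is just a comma-separated list of values, it can help to see it taken apart. The following is a minimal Python sketch of a parser; the function name is my own, and it deliberately ignores edge cases such as quoted values that contain commas.

```python
def parse_cache_control(header):
    """Split a Cache-Control value into {directive: value-or-True}."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, separator, value = part.partition("=")
        name = name.strip().lower()
        if separator:
            value = value.strip().strip('"')
            # Numeric values such as max-age become integers (seconds).
            directives[name] = int(value) if value.isdigit() else value
        else:
            # Flag-style directives such as must-revalidate carry no value.
            directives[name] = True
    return directives


policy = parse_cache_control("public, max-age=3600, must-revalidate")
print(policy)
```

Here max-age is the only value that carries a number; public and must-revalidate are simple on/off flags.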

Revalidation

If you set one of the values mentioned above, it tells the cache not to use stale data: once the max-age has expired, it must check with the origin again before serving that item. This is great for ensuring you never serve stale data, but what if it’s an image or some other bit of data that you know will never change? In that case it’s fine to serve it even if it is stale and just update the cache in the background. You can get this type of behaviour with the stale-while-revalidate value. Use it to allow Cloudflare to keep serving the data while it updates the cache, rather than forcing a hard stop and making clients wait while the data is refreshed.
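The difference between the two behaviours can be written down as a small decision function. This is a simplified Python model, not how Cloudflare is implemented: it assumes must-revalidate takes precedence over stale-while-revalidate and ignores other values such as s-maxage or stale-if-error.

```python
def cache_decision(age, max_age, stale_while_revalidate=0, must_revalidate=False):
    """Decide what a cache does with a stored response that is `age` seconds old."""
    if age < max_age:
        # Still fresh: no contact with the origin is needed.
        return "serve from cache"
    if not must_revalidate and age < max_age + stale_while_revalidate:
        # Stale but inside the grace window: answer now, refresh afterwards.
        return "serve stale, revalidate in background"
    # Stale and no grace allowed: the client waits for the origin check.
    return "revalidate with origin first"


print(cache_decision(100, 3600))
print(cache_decision(4000, 3600, must_revalidate=True))
print(cache_decision(4000, 3600, stale_while_revalidate=600))
```

The last two calls use the same age and max-age; only the extra value changes whether the client has to wait for the origin.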

Cacheability

The values public, private, and no-cache have far more impact when serving content to a web browser than in my use case, but they are interesting to consider. They control which caches out on the internet may store a response and how they may reuse it. The public value is great for all of your static content. The private value is great for things like account pages or any other members-only area, since only the user’s own browser may store the response. Lastly, the no-cache value suits frequently changing data: the response may still be stored, but it must be revalidated with the origin before every reuse. For genuinely sensitive data that should never be stored, no-store is the value to reach for.
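As a rough illustration of who gets to keep a copy, the Python sketch below maps a header to the kinds of cache allowed to store the response. It is a simplification under my own assumptions: "browser" and "shared" are labels I made up for the user’s own cache and for intermediaries like Cloudflare, and real caches apply more rules than this.

```python
def caches_allowed_to_store(cache_control):
    """Return which kinds of cache may store a response with this header."""
    directives = {
        part.strip().split("=")[0].lower() for part in cache_control.split(",")
    }
    if "no-store" in directives:
        return set()  # nobody may keep a copy
    if "private" in directives:
        return {"browser"}  # only the end user's own cache
    return {"browser", "shared"}  # browsers and CDNs or proxies


for header in ("public, max-age=3600", "private, max-age=600", "no-store"):
    print(header, "->", sorted(caches_allowed_to_store(header)))
```

Note that no-cache does not appear in the function: it restricts how a stored copy is reused, not who may store it.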

The last value of note that I like to use is no-transform. It is great for ensuring specific items are not messed with on the way to the user. If you have things that are precompressed and then hashed, or things you don’t want compressed further, like an image, this setting is for you. It tells the CDN and anything else in between not to modify the contents of the file, keeping all of its bits just as they are.

Nginx Config

I set a sitewide default header of:

location / {
    # Default policy for anything not matched by a more specific location.
    add_header Cache-Control "public, max-age=3600, must-revalidate";
}

This means that, at the least, everything is cached for 3600 seconds (one hour). After that it must be checked with the origin again, but it doesn’t have to be evicted from the cache: if the origin confirms it is unchanged, the cached copy can keep being used.

Next, our DB files. I used the same settings as the sitewide default but added “no-transform” to ensure compression didn’t interfere with anything like the SHA-1 hash.

location ~* \.(db|db\.sig|gz|files|sig) {
    # Repo databases and signatures: short lifetime, never altered in transit.
    add_header Cache-Control "public, max-age=3600, must-revalidate, no-transform";
}

Next, the packages themselves. Their direct path will never change, so we want to cache them for as long as we reasonably can to maximise our hits. I set the max-age to 604800 seconds, which is one week.

location ~* \.(tar|zst|xz|xz\.sig|zst\.sig) {
    # Package archives: the content behind a given path never changes.
    add_header Cache-Control "public, max-age=604800, immutable";
}

The “immutable” value set above is aimed at browsers rather than at Cloudflare. The documentation describes it as follows:

This indicates to clients that the response body does not change over time. The resource, if unexpired, is unchanged on the server and therefore the client should not send a conditional revalidation for it (e.g. If-None-Match or If-Modified-Since) to check for updates, even when the user explicitly refreshes the page. This directive has no effect on public caches like Cloudflare but does change browser behaviour.

You will note I also use the “public” value in all of the headers set out above. I used this because I am serving content with unusual extensions that Cloudflare would not cache by default. You can customise the cache behaviour further by playing with the other values described in Cloudflare’s origin cache-control documentation.
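To sanity-check which header a given path ends up with, the three location blocks can be modelled in a few lines of Python. This is only an approximation of Nginx’s location selection, assuming regex locations are tried in the order they appear in the file, the first match wins, and “location /” is the fallback. The sample paths are hypothetical.

```python
import re

# The three location blocks from the config above, in file order.
DEFAULT_HEADER = "public, max-age=3600, must-revalidate"
REGEX_LOCATIONS = [
    (re.compile(r"\.(db|db\.sig|gz|files|sig)", re.IGNORECASE),
     "public, max-age=3600, must-revalidate, no-transform"),
    (re.compile(r"\.(tar|zst|xz|xz\.sig|zst\.sig)", re.IGNORECASE),
     "public, max-age=604800, immutable"),
]


def cache_control_for(path):
    """Return the Cache-Control value this model assigns to a request path."""
    for pattern, header in REGEX_LOCATIONS:
        # The patterns are unanchored, so they may match anywhere in the path.
        if pattern.search(path):
            return header
    return DEFAULT_HEADER  # nothing matched: fall back to "location /"


for path in (
    "/stable/core/x86_64/core.db",
    "/stable/core/x86_64/example-1.0-1-x86_64.pkg.tar.zst",
    "/stable/core/x86_64/example-1.0-1-x86_64.pkg.tar.zst.sig",
    "/index.html",
):
    print(path, "->", cache_control_for(path))
```

In this model the package signature (.pkg.tar.zst.sig) picks up the one-hour policy, because the \.sig alternative in the first block matches before the second block is considered. If you would rather give signatures the week-long policy, anchoring the patterns with $ or reordering the blocks are two options worth testing.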

Once I pushed these changes and started testing, my cache hit ratio went through the roof. It is now consistently above 90%. I serve terabytes of traffic and only a small percentage comes from my origin.
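To put those ratios in perspective, here is a back-of-the-envelope calculation in Python. It assumes the roughly 1TB (taken as 1000GB) of weekly traffic mentioned at the start and treats the hit ratio as a share of bytes, which is a simplification since a ratio counted in requests will not map exactly onto bandwidth.

```python
def origin_traffic_gb(total_gb, hit_percent):
    """Traffic the origin still has to serve for a given cache hit ratio."""
    return total_gb * (100 - hit_percent) / 100


# The three hit ratios seen along the way: first test, page rules, final setup.
for hit_percent in (30, 35, 90):
    print(f"{hit_percent}% hits -> {origin_traffic_gb(1000, hit_percent)} GB from origin")
```

Under those assumptions the origin goes from serving about 700GB a week at a 30% hit ratio to about 100GB at 90%.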

