UPDATE 2023-03-27: This page is obsolete, as it refers to a prior version of this blog. However, it may be of historical interest.

One of the things I enjoy about setting up my own blog with the Blosxom software is learning about the deep details of web protocols and formats that I’ve never worried about before. (This might have been the case if I’d used another blogging system, but the hackable nature of Blosxom inspires, nay, almost demands it.) Lately I’ve been educating myself about HTTP conditional GET requests and validation and caching of dynamically-generated content.

In this post I discuss the subtleties of validating and caching dynamic content in general, and then in a separate post I tell how I created the lastmodified2 plugin for Blosxom, a rewrite of the lastmodified plugin.

I’m writing this really for my own education more than anything else (under the theory that you don’t really understand something until you can explain it), but others may find it useful as well. My goal is to explain how the HTTP protocol actually works in this context (as opposed to just saying “do this” without explaining why) while at the same time avoiding “protocol geekery” that’s irrelevant to the problem at hand.

The problem

Suppose that you have a blog (or other web site) whose content is generated dynamically in response to incoming requests. (In other words, you are not using Blosxom in static mode, or another blogging system like MovableType that normally generates static pages.) In practice there are several types of web clients that might access such a site, of which the following are the most common:

  • web browsers used by humans viewing web pages
  • web proxy servers supporting browser users by accessing and caching web pages on their behalf
  • search engines “spidering” a site: downloading pages, following links to find more pages, and indexing the results
  • news aggregators downloading RSS and/or Atom feeds, typically on a periodic basis

If we generate a full dynamic response for each and every request (as is the case with standard Blosxom, for example) then this produces a lot of network traffic and server load, some of which we could avoid. In the blogosphere most of the attention has been on traffic generated by feed aggregators, but (as pointed out by Charles Miller and others) there’s really no need to treat RSS feeds as a special case, at least initially. (Some people have proposed more advanced techniques specifically tailored to RSS/Atom feeds, but these techniques presuppose use of the techniques I describe here.)

The tools at hand

We have three goals in dealing with web clients: to minimize network traffic to and from our site, to minimize server load for our site, and to deliver up-to-date content to the various users of the site; as discussed below, these three goals are often in conflict with each other, but we can usually implement a reasonable trade-off.

We have at least three approaches we can take to help achieve these goals: We can reduce the need for clients to make multiple network connections to the site, we can provide ways for clients to validate whether they need to re-download a page that they’ve previously downloaded, and we can provide clients “freshness” information telling them how long they can keep copies of pages without having to revalidate them. (There is also a fourth possible approach, namely to use compression techniques to reduce the size of page data returned to the client; I hope to discuss this in a future article.)

Persistent connections

Reducing the number of needed connections can be done through support of so-called “persistent connections.” The original HTTP 1.0 protocol required the client to open up a new network connection for each and every request; for an HTML page with lots of included images this might amount to a dozen or more connections, each of which the server has to accept and then close after responding. Support for persistent connections allows a client to open up one connection and make several requests over that connection, either one after the other or nearly simultaneously (“pipelining”).

A form of persistent connections was introduced as an extension to HTTP 1.0 and described in section 19.7.1 of RFC 2068, an earlier version of the HTTP 1.1 specification. Using this scheme a client sends a Connection: Keep-Alive header in the request, and the server then keeps the connection open after sending its response. However, the client needs some indication from the server as to when the response is actually complete; this is supplied by a Content-length header giving the size in bytes of the response (more precisely, of the “entity-body” part of the response that follows the headers).

Providing a Content-length value requires determining the length of the output prior to sending it. This is easy to do for static files but more difficult for dynamic content (e.g. CGI output), and hence many content generation tools (including Blosxom and other blogging systems) do not produce Content-length headers by default.
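
To make the idea concrete, here is a minimal Python sketch (illustrative only, not Blosxom’s actual Perl code) of a CGI-style script that buffers its dynamically generated output so that a Content-length value can be computed before anything is sent; render_page is a stand-in for whatever actually builds the page.

    import sys

    def render_page():
        # Stand-in for whatever actually generates the page (entries, templates, etc.).
        return "<html><body><p>Hello, world.</p></body></html>\n"

    # Buffer the whole body first so its size is known, then emit the headers
    # (including Content-Length) followed by the body itself.
    body = render_page().encode("utf-8")

    sys.stdout.write("Content-Type: text/html; charset=utf-8\r\n")
    sys.stdout.write("Content-Length: %d\r\n" % len(body))
    sys.stdout.write("\r\n")
    sys.stdout.flush()
    sys.stdout.buffer.write(body)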

In HTTP 1.1 persistent connections became the default behavior, and a new “chunked” transfer coding was introduced in which the response can be broken up into multiple chunks, each preceded by an indication of its size, so that the total length need not be known in advance. When Apache is configured to support persistent connections (using the KeepAlive On directive), it can automatically handle persistent connections for dynamic content without the need for a Content-length header.
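
For illustration only (the web server applies this framing itself; a CGI script never has to), here is roughly what the chunked framing looks like in a Python sketch: each chunk is preceded by its length in hexadecimal, and a zero-length chunk marks the end of the response.

    def chunked(pieces):
        # Frame each piece as <size in hex>\r\n<data>\r\n, ending with a zero-size chunk.
        out = b""
        for piece in pieces:
            out += b"%x\r\n" % len(piece) + piece + b"\r\n"
        return out + b"0\r\n\r\n"

    print(chunked([b"Hello, ", b"world!"]))
    # b'7\r\nHello, \r\n6\r\nworld!\r\n0\r\n\r\n'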

Validation

Validation can be done in the HTTP protocol by using a so-called “conditional” GET request, which is in turn implemented by using one or both of two special HTTP headers sent with the request: If-modified-since and/or If-none-match. (The If-modified-since header is supported in both HTTP 1.0 and HTTP 1.1, while the If-none-match header is only in HTTP 1.1. In practice servers can and should recognize and properly handle either of them.)

For example, using the If-modified-since HTTP header a client can tell the server, “Give me this page, but only if it’s been modified since 9:08 am on December 17, 2004.” This date could represent the last time the client downloaded the page; alternatively it could represent the last time the page was actually modified, as identified in a Last-modified HTTP header returned by the site as part of the response to a previous page request from that client.

Similarly, using the If-none-match HTTP header a client can tell the server, “Give me this page, but only if it’s different from the version ’foo’ I already have.” The version is identified using an “entity tag” (or “etag”) assigned by the server to each new version of the page; the entity tag value is contained in an optional ETag HTTP header previously returned by the site as part of the response for that page.
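
Putting the two headers together, here is a rough Python sketch (again illustrative, not Blosxom’s actual code) of how a CGI script might answer a conditional GET. The names current_etag, last_modified (assumed to be a timezone-aware datetime), and render_body are placeholders for however the site determines the current version of the page.

    import os, sys
    from email.utils import parsedate_to_datetime, formatdate

    def respond(current_etag, last_modified, render_body):
        inm = os.environ.get("HTTP_IF_NONE_MATCH")
        ims = os.environ.get("HTTP_IF_MODIFIED_SINCE")

        unchanged = False
        if inm is not None:
            # Compare our current entity tag against the tag(s) the client already has.
            unchanged = current_etag in [tag.strip() for tag in inm.split(",")]
        elif ims is not None:
            try:
                unchanged = last_modified <= parsedate_to_datetime(ims)
            except (TypeError, ValueError):
                unchanged = False   # unparseable date: treat the request as unconditional

        if unchanged:
            # The client's copy is still valid: send headers only, no body.
            sys.stdout.write("Status: 304 Not Modified\r\n")
            sys.stdout.write("ETag: %s\r\n\r\n" % current_etag)
            return

        body = render_body().encode("utf-8")
        sys.stdout.write("Status: 200 OK\r\n")
        sys.stdout.write("Content-Type: text/html; charset=utf-8\r\n")
        sys.stdout.write("ETag: %s\r\n" % current_etag)
        sys.stdout.write("Last-Modified: %s\r\n" % formatdate(last_modified.timestamp(), usegmt=True))
        sys.stdout.write("Content-Length: %d\r\n\r\n" % len(body))
        sys.stdout.flush()
        sys.stdout.buffer.write(body)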

Note that some people have suggested using HTTP HEAD requests for validating pages: Send a HEAD request, check the Last-modified or ETag value in the response, and then send a GET request (unconditional) if the page appears to have changed. However this approach is inferior to using a conditional GET, for at least two reasons:

  • Using HEAD for page validation requires two HTTP requests and responses to accomplish the same purpose as a single conditional GET request and response. This leads to increased network traffic and longer network latency relative to using conditional GETs.
  • A response to a HEAD request is supposed to contain the exact same HTTP headers as would the response to a GET request for the same URI; the only difference is that a response to a HEAD request doesn’t contain any actual content (i.e., an entity-body). For dynamic content this often means that to properly satisfy a HEAD request you have to generate exactly the same content you would for a GET request, only to discard the content after creating the headers; for example, this is true if you’re generating the Content-length header, and is also true for certain approaches to generating ETag and Last-modified values, as discussed below. This leads to increased server load relative to using conditional GET requests.

A site based on dynamic content should be able to properly respond to HEAD requests (as required by the HTTP specifications), but should support conditional GET requests as the primary mechanism for page validation.

Freshness and Caching

Validation can reduce the network bandwidth used by your site, since the site does not always need to send back full copies of the pages; however the clients are still hitting the site, if only to validate pages, and this still puts a load on the server. To reduce this load the site can also indicate how long a response should be considered “fresh”—in other words, how long clients can wait before having to return to the site to check for a new version of the page.

Freshness tests can be done in the HTTP protocol using one or both of two special HTTP headers sent with the site’s response to a request: Expires and/or Cache-control. The Expires header is supported in HTTP 1.0, while the Cache-control header was introduced with HTTP 1.1; sites can and should support both headers.

The Expires header is like the “use by” date on a perishable item in a grocery store: For example, a site can tell a client, “If you keep a copy of this page, throw it away after 7:00 pm on December 20, 2004; if you need a copy after that please check to see if there’s a new version available.” The Cache-control header can be used similarly, except expressing the “use by” date in terms of a time relative to the time of the request: “Don’t keep a copy of this page longer than 12 hours from now.”

Regardless of whether the Expires or Cache-control header is used, the net effect is the same: Clients downloading the page are instructed to keep a copy of the page for a specified period of time and reuse it as necessary during that time, and to avoid contacting the site again to request that page (even with a conditional GET) until that time period is over.
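
In an actual response this amounts to just two extra header lines. An illustrative Python fragment (the 12-hour lifetime is only an example value):

    import sys, time
    from email.utils import formatdate

    MAX_AGE = 12 * 60 * 60   # example freshness lifetime: 12 hours

    # Cache-control (HTTP 1.1) expresses freshness relative to the request;
    # Expires (HTTP 1.0) expresses the same thing as an absolute date.
    sys.stdout.write("Cache-Control: max-age=%d\r\n" % MAX_AGE)
    sys.stdout.write("Expires: %s\r\n" % formatdate(time.time() + MAX_AGE, usegmt=True))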

The strategy

Based on the techniques available, a suitable strategy for a dynamically-generated site is then as follows:

  • If possible, use a web server that supports HTTP 1.1 persistent connections for CGI output. In addition, when sending responses to requests add a Content-length header to identify the total number of bytes in the response, in order to support HTTP 1.0 persistent connections.
  • When sending responses to requests, add an ETag header to identify the “version number” (entity tag) for this particular version of the page, and/or a Last-Modified header to identify the date/time the page was last modified.
  • When sending responses to requests, also add Cache-control and Expires headers to the response to provide a “use by” date/time to clients doing caching.
  • When processing requests, look for the If-none-match and If-modified-since headers. If one or both are present, return the full page in the response only if necessary: if the version of the page currently available is different from the version identified in the If-none-match header, or if the page has been modified since the date in the If-modified-since header. (The sketch following this list shows this exchange from the client’s side.)
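
To see how the pieces fit together from the other side, here is a sketch of how a well-behaved client (a news aggregator, say) might revalidate a page it downloaded earlier, assuming it saved the ETag and Last-modified values from that earlier response; the host, path, and header values are placeholders.

    import http.client

    # Values the client saved from an earlier 200 response (placeholders).
    saved_etag = '"abc123"'
    saved_last_modified = "Fri, 17 Dec 2004 09:08:00 GMT"

    conn = http.client.HTTPConnection("example.org")
    conn.request("GET", "/blog/index.rss", headers={
        "If-None-Match": saved_etag,
        "If-Modified-Since": saved_last_modified,
    })
    resp = conn.getresponse()

    if resp.status == 304:
        resp.read()          # nothing new: keep using the cached copy
    else:
        page = resp.read()   # new version: replace the cached copy...
        saved_etag = resp.getheader("ETag", saved_etag)   # ...and the saved validators
        saved_last_modified = resp.getheader("Last-Modified", saved_last_modified)
    conn.close()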

What’s new?

The strategy outlined above seems simple enough, but we’ve glossed over a crucial and surprisingly difficult question: How do we determine if and when a page has changed and a new version has been created?

The creators of the HTTP protocol specification suggested two different approaches to this question, with two correspondingly different ways to use the ETag header. (For various reasons too geeky to go into, ETag is a better example here than Last-modified.)

The strict approach is to consider a page to be changed if even one bit on the page changes. For example, in the context of Blosxom if you made even a single-character correction to a flavour template then any page using that template would be considered to have changed. When sending an ETag header for such a page you would then be duty-bound to update the entity tag value identifying the version for that page. Under this approach the ETag header is considered to be a “strong validator” in HTTP jargon.

A looser approach is to consider the page to be changed only if the essential “meaning” of the page changes, where you as the site author get to decide what that meaning actually is in this context. For example, if you are primarily concerned with RSS feeds then you might decide that the response sent to news aggregators and other clients should be considered changed only if there were content changes to any of the entries included in the response. You would then be free to keep the entity tag value sent in the ETag header the same as long as the underlying entries didn’t change; here you’re using the ETag header as a “weak validator.”
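
As a rough illustration of the difference, here are two ways one might compute an entity tag in Python; the function names and inputs are invented for the example, not taken from any particular plugin.

    import hashlib

    def strong_etag(rendered_page: bytes) -> str:
        # Strong validator: any change to the bytes of the page changes the tag.
        return '"%s"' % hashlib.sha1(rendered_page).hexdigest()

    def weak_etag(entry_mtimes) -> str:
        # Weak validator: the tag changes only when the underlying entries change,
        # even if a template tweak changes the exact bytes of the rendered page.
        newest = max(entry_mtimes) if entry_mtimes else 0
        return 'W/"%d-%d"' % (len(entry_mtimes), int(newest))

    print(strong_etag(b"<html>...rendered output...</html>"))   # a quoted SHA-1 digest
    print(weak_etag([1103274480, 1103360880]))                  # W/"2-1103360880"

The “W/” prefix is how HTTP marks an entity tag as a weak validator.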

Having a strong validator as described above is important in cases where knowing about bit-level changes is absolutely required. The most common example of this is downloading large binary files where the download might be interrupted for some reason and the client wishes to resume from the point at which it was interrupted (as opposed to restarting the download from the beginning). For this purpose HTTP provides a mechanism whereby clients can request a range of bytes for a resource, so that (for example) a client can tell the server “give me bytes 737878-1643324 for this resource (I already have the others).”

In order for this to work properly, when the client goes back to the server to pick up the rest of the file the client has to know that the version of the file for which it’s getting the new set of bytes is exactly the same as the version for which it got the first (interrupted) set of bytes; otherwise it will end up with a corrupted copy. The ETag header can provide the necessary version information, but only if it’s a strong validator as described above.
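
As an illustration of what a resuming client does with that version information, here is a sketch using HTTP’s If-Range header (not discussed above, but defined for exactly this purpose): the server sends only the missing bytes if the entity tag still matches, and otherwise sends the whole new file. The host, path, and values are placeholders.

    import http.client

    saved_etag = '"v1-strong"'   # strong ETag from the interrupted download (placeholder)
    bytes_so_far = 737878        # how much of the file we already have (placeholder)

    conn = http.client.HTTPConnection("example.org")
    conn.request("GET", "/files/big.iso", headers={
        "Range": "bytes=%d-" % bytes_so_far,
        "If-Range": saved_etag,  # only honor the Range if this version is still current
    })
    resp = conn.getresponse()
    if resp.status == 206:       # Partial Content: safe to append to what we already have
        rest = resp.read()
    elif resp.status == 200:     # the file changed (or ranges unsupported): start over
        whole = resp.read()
    conn.close()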

However, using the ETag as a weak validator is arguably a better approach for a typical blog, both because it better fits the nature of the content (most people care more about the prose content of a page than about the exact bytes making it up) and because correctly implementing strong validation for dynamic content can be more difficult and more costly in terms of server load, at least for Blosxom.

However, implementing weak validation has problems of its own. In particular, for Blosxom there are changes that at least some readers would consider part of the “meaning” of a page but that are difficult to detect in practice without checking for byte-for-byte changes; the most important examples are new comments appearing on an individual entry’s page, or a changed comment count for an entry listed on an index page. In these cases the change is typically introduced through interpolating variables when processing flavour templates, and hence you can’t use the date/time modified for either the entry file or the flavour template as a guide to when the change occurred.

(You could potentially look at the date/time modified for comment-related files as stored in the Blosxom plugin state directory or elsewhere, but this would require knowing exactly what plugin is being used to generate comments, and how it stores comment-related information. This is one of the drawbacks to Blosxom’s minimal approach to blogging, in which a comments capability isn’t a standard feature of the software but has to be implemented by add-on software, with different sites using different comments plugins.)

I discuss this and other implementation issues in more detail in my next post.

For more information

Here are some useful reference documents and related material I consulted while researching the issue of validating and caching dynamic content in the course of creating the lastmodified2 plugin. The main documents of interest are the following:

  • “Caching Tutorial for Web Authors” by Mark Nottingham is the best introductory document I’ve found on page validation and caching.
  • The HTTP 1.1 protocol specification (RFC 2616, “Hypertext Transfer Protocol – HTTP/1.1”) is the ultimate authority for how validation and caching should work. See in particular sections 10.3.5 (“304 Not Modified”), 13 (“Caching in HTTP”), 14.9 (Cache-control header), 14.13 (Content-length header), 14.19 (ETag header), 14.21 (Expires header), 14.25 (If-modified-since header), and 14.26 (If-none-match header).

You may also find the following documents of interest:

  • “Conditional GET for RSS Hackers” by Charles Miller is a basic tutorial on implementing conditional GETs in the context of a blog. However it lacks an in-depth discussion of strong vs. weak validation and why the distinction matters.
  • The post “Joel’s RSS Problem” on Phil Ringnalda’s blog is a good example of various views on how to address the problem of blogs being overloaded by aggregator requests, including links to related blog posts and articles.
  • Chapter 16 of the book Practical mod_perl has some good in-depth information on the issue of validation and caching of dynamic content and strong vs. weak validators.
  • Though it’s been superseded by RFC 2616, the original version of the HTTP 1.1 protocol specification (RFC 2068, “Hypertext Transfer Protocol – HTTP/1.1”) is worth consulting for its description (in section 19.7.1) of the HTTP 1.0 “Keep-Alive” extension for persistent connections.
  • The original HTTP 1.0 protocol specification (RFC 1945, “Hypertext Transfer Protocol – HTTP/1.0”) is mainly of historical interest. (The HTTP 1.1 specification addresses backwards compatibility for HTTP 1.0 clients.)