UPDATE 2023-03-27: This page is obsolete, as it refers to a prior version of this blog. However, it may be of historical interest.

In a previous post I discussed the general problem of validating and caching dynamic content. To implement the strategy outlined in that post I decided to create a new version of the lastmodified plugin originally created by Bob Schumaker. The lastmodified plugin was a good base to build on, but it didn’t do exactly what I wanted, so I couldn’t resist trying to improve on it.

The following material documents the lastmodified2 plugin that I created, including my notes on how I implemented page validation according to my interpretation of the HTTP 1.1 specification.

The strategy revisited

As you may recall, in my previous post I outlined an overall strategy for supporting validation and caching of dynamic content. Here’s a recap of that strategy, with more detail on the subject of validation:

  • When sending responses to requests, add a Content-length header to identify the total number of bytes in the response.
  • When sending responses to requests, add an ETag header to identify the “version number” (entity tag) for this particular version of the page, and/or a Last-Modified header to identify the date/time the page was last modified. These are computed as follows, depending on whether weak or strong validation is used:
    • For weak validation, the Last-modified header should reflect the date/time modified of the most recently-updated “semantically-significant” component of the page. (For example, for Blosxom we consider entries to be semantically significant, but not flavour templates.) The ETag header can then supply a weak etag directly derived from this date/time.
    • For strong validation, the ETag header should change if even a single bit on a page changes; for example, it could be derived from the MD5 or SHA-1 digest of the page. A Last-modified header value could then be determined by consulting a cached copy of the ETag and Last-modified values for the URI; if there is a cache match then the Last-modified value can be taken from the cache, otherwise it can be arbitrarily assigned to be a date in the recent past.
  • When sending responses to requests, also add Cache-control and Expires headers to the response to provide a “use by” date/time to clients doing caching.
  • When processing requests, look for the If-none-match and If-modified-since headers. If one or both are present, return the full page in the response only if necessary: if the version of the page currently available is different from the version requested in the If-none-match header, or if the page has been modified since the date in the If-modified-since header.
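
The last step of the strategy can be illustrated with a minimal sketch. It is written in Python rather than the plugin's Perl and is not taken from the plugin's code; the function names (strip_weak, etag_matches, not_modified_since, can_send_304) are hypothetical. It assumes entity tags are compared weakly (ignoring any W/ prefix), that the If-none-match list can be split on commas, and that the special value * is not handled.

```python
from email.utils import parsedate_to_datetime


def strip_weak(tag):
    """Drop a leading W/ so that two tags can be compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def etag_matches(if_none_match, current_etag):
    """True if any tag listed in If-None-Match weakly matches the current tag."""
    candidates = [strip_weak(t) for t in if_none_match.split(",")]
    return strip_weak(current_etag) in candidates


def not_modified_since(if_modified_since, last_modified):
    """True if the page was last modified at or before the client's date."""
    try:
        return (parsedate_to_datetime(last_modified)
                <= parsedate_to_datetime(if_modified_since))
    except (TypeError, ValueError):
        return False  # unparseable date: fall back to a full response


def can_send_304(request_headers, current_etag, last_modified):
    """Decide whether a 304 (Not Modified) is consistent with every condition sent."""
    inm = request_headers.get("If-None-Match")
    ims = request_headers.get("If-Modified-Since")
    if inm is None and ims is None:
        return False  # unconditional GET: always send the full page
    if inm is not None and not etag_matches(inm, current_etag):
        return False
    if ims is not None and not not_modified_since(ims, last_modified):
        return False
    return True
```

When both headers are present, the sketch returns True only if both tests pass, which is how I read the requirement that a 304 be consistent with all of the conditional headers in the request.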

Implementation overview

This section and the next describe in more depth how I implemented the above strategy.

First, the plugin is designed to have its behavior easily modifiable using configurable variables, as is done with other Blosxom plugins. In particular, it is possible to specify whether the plugin should do strong or weak validation ($strong boolean variable) and whether it should generate an ETag header, Last-modified header, or both ($generate_etag and $generate_mod boolean variables). By default the plugin is configured to be a “plug-compatible” replacement for the lastmodified plugin, doing weak validation and generating both ETag and Last-modified headers.

The basic plan of the lastmodified2 plugin is as follows:

  • start subroutine: Read in the cached information containing the previous ETag and Last-modified values for this URI.
  • filter subroutine: Get the information necessary for weak validation by traversing the list of entries to be displayed on the page and determining the date/time any of the entries was most recently modified. Use this last-modified date/time to create a weak ETag value.
  • skip subroutine: For weak validation interpret any If-none-match or If-modified-since headers and determine whether or not we need to send a full response. If not we can skip the actual story processing after setting Status to 304 (Not Modified) and generating any other headers appropriate for a 304 response.
  • last subroutine: For strong validation generate an MD5 digest of the page and use this to create the ETag value. Create the Last-modified header by using the cached Last-modified value if the new ETag value matches the cached ETag value, otherwise assigning a new Last-modified value in the very recent past. Then interpret any If-none-match or If-modified-since headers in the request and determine whether or not we need to send a full response. In either case send the appropriate headers.

Note that there is also a story subroutine in the lastmodified2 plugin, but its purpose is restricted to setting output variables (e.g., for use in flavour templates) for compatibility with the lastmodified plugin. It does not affect the actual caching and validation processes.

Implementation details

Like the lastmodified plugin, this version of the plugin looks for and acts upon the If-modified-since header itself, instead of letting the underlying web server deal with it. Note that the 1.3 and 2.0 versions of Apache in common use today have a feature whereby the underlying web server will handle If-modified-since checks as long as the CGI script simply sets the Last-modified header; this can be used to easily implement simple validation. (Previous versions of this plugin relied on this feature.)

The plugin also looks for and acts upon the If-none-match header. (Apache does not do this for CGI scripts, so the plugin has no choice but to do it itself.) Note that for weak validation ($generate_etag set to 1 but $strong set to 0) we generate entity tag values using the date/time modified of the most recent entry, so both the If-modified-since check and the If-none-match check can be done as soon as we compute the Last-modified value, which is done in the filter subroutine. This allows the plugin to save processing time by skipping story processing (i.e., using the skip subroutine) when it does not need to return a full response to a conditional GET with If-modified-since or If-none-match.
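
To make the weak case concrete, here is a minimal sketch in Python (not the plugin's Perl). The function name weak_validators and the exact tag format (the epoch time in quotes) are assumptions for illustration only; the plugin's actual tag format may differ.

```python
from email.utils import formatdate


def weak_validators(entry_mtimes):
    """Derive a weak ETag and a Last-Modified value from entry modification times.

    entry_mtimes is an iterable of epoch seconds, one per entry on the page.
    """
    latest = int(max(entry_mtimes))
    last_modified = formatdate(latest, usegmt=True)  # HTTP date, always in GMT
    etag = 'W/"%d"' % latest  # the W/ prefix marks the tag as a weak validator
    return etag, last_modified
```

Because both values depend only on the entries' modification times, they are available before any story has been rendered, which is what makes the early exit in the skip subroutine possible.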

For strong validation using ETag ($generate_etag and $strong both set to 1) the ETag value is computed as an MD5 digest of the entire page as it will be returned to the user, in order to distinguish changes that affect even a single bit of the page. We can’t skip story processing in this case since we need the complete output (including the results of interpolating variables) in order to compute the correct MD5 digest.

For strong validation using Last-modified ($generate_mod and $strong both set to 1) we also compute an MD5 digest of the entire page as it will be returned to the user, and we compare that value against a cached MD5 digest computed for the page on previous requests. If they match then we know that no changes have occurred since the previous requests, and we set the Last-modified value to the value cached with the MD5 digest. Otherwise we know that some change has occurred since the time of the previous requests, but do not know exactly when that change occurred; we therefore arbitrarily set the Last-modified value to a time just prior to the time of the current request. Note that we can’t skip story processing in this case either, since again we need the complete output (including the results of interpolating variables) in order to compute the correct MD5 digest.
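
The strong case can be sketched the same way, again in Python rather than the plugin's Perl. The function name strong_validators, the in-memory dictionary standing in for the validator cache file, and the cache entry layout are all assumptions; the 5-second fallback is the value mentioned in the Bugs section.

```python
import hashlib
import time

FALLBACK_AGE = 5  # seconds before "now" used when no cached value applies


def strong_validators(uri, page_bytes, cache, now=None):
    """Return (etag, last_modified_epoch) for the fully rendered page.

    cache maps a URI to {"etag": ..., "last_modified": ...} from earlier requests.
    """
    now = int(time.time()) if now is None else int(now)
    etag = '"%s"' % hashlib.md5(page_bytes).hexdigest()
    cached = cache.get(uri)
    if cached is not None and cached["etag"] == etag:
        # Same digest as before: the page is unchanged, so reuse the old time.
        last_modified = cached["last_modified"]
    else:
        # Page changed at some unknown moment: pick a time just before now.
        last_modified = now - FALLBACK_AGE
        cache[uri] = {"etag": etag, "last_modified": last_modified}
    return etag, last_modified
```

Note that the digest is taken over the complete output, so nothing can be computed until the page has been fully generated.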

Note that for weak validation ($strong set to 0) the Last-modified header does not necessarily provide the date/time at which the actual (bit-for-bit) contents of the page last changed; instead it provides the date/time at which the meaning of the page last changed, i.e., because the contents of at least one entry on the page were changed. It is possible for other elements on the page such as headers, footers, or comments to change without changing the meaning of the page in this sense, so in this case the Last-modified value is only a “weak validator” as defined by section 13.3.3 of the HTTP 1.1 specification. When $strong is set to 0 the entity tag provided by the ETag header is derived from the Last-modified value and hence is also only a weak validator, and we explicitly mark it as such by prefixing it with “W/”, as described in section 3.11 of the HTTP 1.1 specification.

The net effect is that with weak validation if browsers, web caches, and news aggregators caching the page send a conditional GET request (i.e., with an If-none-match and/or If-modified-since header) to check the current status of the page, they will be given a brand new copy of the page only if there have been “semantically significant” changes to the page (in the words of the HTTP 1.1 specification). With strong validation they will get a new copy of the page if the page has changed in any way, no matter how slight.

Generation of the Cache-control and Expires headers is relatively straightforward: We use the value of $freshness_time directly with the max-age directive of the Cache-control header, and add it to the current date/time to create a date/time in the future for the Expires header. If $freshness_time is set to 0 then we instead send the no-cache directive with the Cache-control header and set the Expires header to a date in the past.
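
As an illustration of this logic, here is a small Python sketch (not the plugin's Perl). The function name cache_headers is hypothetical, and the one-minute offset for the "date in the past" follows the choice described in the Bugs section.

```python
from email.utils import formatdate


def cache_headers(freshness_time, now):
    """Build Cache-Control and Expires values.

    freshness_time is in seconds; now is the current time in epoch seconds.
    """
    if freshness_time > 0:
        return {
            "Cache-Control": "max-age=%d" % freshness_time,
            "Expires": formatdate(now + freshness_time, usegmt=True),
        }
    # A freshness time of 0 means caching is prohibited.
    return {
        "Cache-Control": "no-cache",
        "Expires": formatdate(now - 60, usegmt=True),  # a minute in the past
    }
```

For example, cache_headers(3000, now) would produce max-age=3000 together with an Expires date 50 minutes after now.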

Generation of the Content-length header is also straightforward: We simply use the length of $blosxom::output. (This assumes of course that no other plugin will subsequently be changing that output.) Note that for HEAD requests Apache will not actually send the output but will send the Content-length header if set; in that case the Content-length value reflects the length of the output that would have been sent for a GET request, in compliance with section 14.13 of the HTTP 1.1 specification.
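
Since the header counts bytes rather than characters, the following Python sketch shows the distinction. The function name content_length and the choice of UTF-8 are assumptions for illustration; how the plugin's own length computation treats multi-byte characters depends on how Perl is handling the output string.

```python
def content_length(output, encoding="utf-8"):
    """Return the number of bytes the output occupies once encoded."""
    return len(output.encode(encoding))
```

For pure ASCII output the two counts agree; a page containing a curly apostrophe or other non-ASCII character will be longer in bytes than in characters.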

Note that the plugin does not generate the Last-modified and Content-length headers for a 304 (Not Modified) response, in accordance with section 10.3.5 of the HTTP 1.1 protocol specification.

Finally, for upward compatibility the lastmodified2 plugin supports the following features present in the original lastmodified plugin:

  • Optionally checking the %others hash for last-modified dates: Checking %others is one way to detect changes other than changes in the entries themselves; in particular it can be used to detect changes to flavour files used in creating the page. Unfortunately some of the entries in %others are not relevant for the page being created (e.g., flavour files for flavours other than the one currently being generated) and may cause the Last-modified time to be computed incorrectly. Also, checking %others will not detect page changes due to interpolating variables into flavour files (e.g., for comments). Finally, some plugins that replace the default Blosxom entries subroutine (including the entries_cache_meta plugin in particular) do not create the %others hash at all.

    For the above reasons this feature is deprecated; you should not use it unless you need it for upward compatibility with your current lastmodified configuration. If you want to check for changes to a page outside the entries themselves then you should simply enable strong validation.

  • Exporting variables with the last-modified time and other times in RFC 822 and ISO 8601 formats: The lastmodified2 plugin computes these variables essentially in the same way as the lastmodified plugin; see the code and the plugin documentation below for more information.

    Note that the variables $latest_rfc822 and $latest_iso8601 always refer to the date/time modified for the most recently updated entry, regardless of whether weak or strong validation is being used. The problem with interpreting $latest_rfc822 or $latest_iso8601 as a Last-modified value is that when using strong validation we wouldn’t have values for these variables until we completed generating output for the page, too late for the variables to be of any use.

If you want to use the lastmodified2 plugin to replace an existing configuration of the lastmodified plugin, change the plugin’s filename and Perl package name (i.e., in the package statement at the beginning of the code) to “lastmodified” and set the $generate_mod, $generate_etag, and $use_others configurable variables to match your current values. All other configurable variables can be left as is.

Description

This section and the succeeding ones contain more in-depth documentation of the lastmodified2 plugin to supplement the material included in the plugin itself.

The lastmodified2 plugin enables caching and validation of dynamically-generated Blosxom pages by web browsers, web proxies, news aggregators, and other clients by generating various cache-related HTTP headers in the response and supporting conditional GET requests, as described below. This can reduce excess network traffic and server load caused by requests for RSS or Atom feeds or for web pages for popular entries or categories.

The plugin generates an ETag header to identify the particular version of the page, as well as a Last-modified header based on the plugin’s determination of when the contents of the page were most recently modified. The plugin also recognizes and properly acts on an If-none-match and/or If-modified-since header in a request, enabling a client to check whether the page has changed since it last requested the page. This reduces network traffic for the site, because the server can skip returning a copy of the page if in fact it has not changed.

The plugin can also optionally generate Cache-control and/or Expires headers to specify how long copies of a page should be retained by caches. This reduces server load for the site, because web proxies and other caching clients can use a cached copy of the page and avoid sending additional requests for the page (including conditional GET requests) to the site’s server for as long as the page remains fresh. Alternatively you can use the Cache-control and Expires headers to specify that pages should not be cached at all under any circumstance. This helps ensure that users always get the most up-to-date content, at the expense of increased server load.

Finally, the plugin also generates a Content-length header containing the length in bytes of the content (“entity body” in HTTP 1.1 jargon). Providing a Content-length header supports persistent connections for clients that use the HTTP 1.0 “keep-alive” mechanism (as documented in section 19.7.1 of RFC 2068); this can reduce the number of connections to the site in some cases.

Note that at present this plugin can be used as a replacement for the lastmodified plugin and its default configuration is essentially equivalent to that of the lastmodified plugin, as discussed below.

Installation and configuration

To install the lastmodified2 plugin, copy the plugin file into your Blosxom plugin directory. You should not normally need to rename the plugin; however, see the discussion below.

Configurable variables specify how the plugin handles validation ($generate_etag, $generate_mod, and $strong), caching ($generate_cache, $generate_expires, and $freshness_time), whether or not to generate any other recommended headers ($generate_length), and whether to implement features from the lastmodified plugin for compatibility ($use_others and $export_dates).

For validation the most common configurations are the following:

  • No validation: $generate_etag and $generate_mod both set to 0. The plugin does not generate ETag or Last-modified headers, and does not check If-none-match and If-modified-since headers in the request. Use this configuration if you plan to allow caching of responses (as discussed below) but for some reason you don’t want to do validation.

  • Weak validation: $generate_etag and $generate_mod both set to 1, $strong set to 0. The plugin generates both ETag and Last-modified headers based on the most recent time that any entry on the page was modified; it checks for If-none-match and/or If-modified-since headers in the request, and sends a 304 (Not Modified) response with no output when it can do so. Use this configuration if changes to your pages are only (or at least primarily) due to changes to the entries themselves. This is the default configuration, for compatibility with the lastmodified plugin.

  • Strong validation: $generate_etag, $generate_mod, and $strong all set to 1. The plugin generates both ETag and Last-modified headers based on the current page’s contents and our estimate as to when the contents were last modified; the plugin checks for If-none-match and/or If-modified-since headers in the request, and sends a 304 response when it can do so. Use this configuration if your pages contain comments or other material that is updated more frequently than the entries themselves.

  • Strong validation using ETag only: $generate_etag and $strong set to 1, $generate_mod set to 0. The plugin generates only an ETag header (not Last-modified) and checks only for an If-none-match header in the request (not If-modified-since). Use this configuration if you want to support strong validation but don’t want the performance overhead of caching Last-modified values as previously described. Note that this configuration does not support validation for HTTP 1.0 clients or other clients that do not support validation using If-none-match.

Note that if you set $generate_mod and $strong to 1 then you might as well set $generate_etag to 1 as well, since correctly using Last-modified as a strong validator requires that we generate and cache MD5 digests of the page in order to detect any changes, and these digests are also what we use to generate ETag values.

For caching the most common configurations are the following:

  • No caching: $generate_cache and $generate_expires both set to 0. The plugin does not generate either a Cache-control or Expires header, and thus web proxies and other clients will typically not cache returned pages. This is the default configuration; use it if you don’t care about caching.

  • Caching allowed: $generate_cache and $generate_expires both set to 1, and $freshness_time set to a positive integer value. The plugin generates Cache-control and Expires headers that allow for caching of returned pages for up to $freshness_time seconds from the time of the request. Use this configuration if you’d like to allow caching by proxies and other clients to reduce server hits due to GET requests (whether conditional or not), and set $freshness_time to a value comparable to the frequency with which your site is updated.

    (By default $freshness_time is set to 3,000 seconds, long enough to provide some benefit through caching by web proxies, especially during periods of heavy load, but short enough to ensure that news aggregators doing hourly polling will always use up-to-date copies of feeds.)

  • Caching prohibited: $generate_cache and $generate_expires both set to 1, and $freshness_time set to 0. The plugin generates Cache-control and Expires headers that specifically prohibit caching of returned pages. Use this configuration if you want all clients to always see the most up-to-date content.

Note that if you set $generate_cache to 1 then you might as well set $generate_expires to 1 and vice versa, in order to properly support both HTTP 1.1 and HTTP 1.0 clients; there is no performance penalty for doing so.

The other configurable variables are as follows:

  • $generate_length controls whether or not to generate a Content-length header. The default is to generate the header; you can disable this by setting $generate_length to 0. Note that support of HTTP 1.0 persistent connections using Content-length requires that your web server be configured to support persistent connections in the first place; for Apache this is done using the KeepAlive On directive in the Apache configuration file.

    Also note that HTTP 1.1 clients can use persistent connections even if the Content-length header is not present, if (like Apache) the underlying web server supports HTTP 1.1 persistent connections for CGI scripts using the Connection header and chunked transfer coding. However we generate a Content-length header by default because it’s recommended by section 14.13 of the HTTP 1.1 specification.

  • $use_others controls whether changes to flavour files and other non-entry files in the Blosxom data directory should also be considered semantically significant for weak validation. Note that this feature is provided only for compatibility with the lastmodified plugin and its use is deprecated; by default it is disabled.

  • $export_dates controls whether or not the plugin should set the following variables for use in flavour templates and other plugins:

    • $now_rfc822 and $now_iso8601: Current date/time, in RFC 822 and ISO 8601 formats respectively. These variables can be used in any flavour template.

    • $latest_rfc822 and $latest_iso8601: Date/time modified of the most recently modified entry to be displayed on the page, in RFC 822 and ISO 8601 formats respectively. These variables can be used in any flavour template.

    • $others_rfc822 and $others_iso8601: Date/time modified of the most recently modified non-entry file in the Blosxom data directory, in RFC 822 and ISO 8601 formats respectively. These variables can be used in any flavour template, but are set only if $use_others is set to 1.

    • $story_rfc822 and $story_iso8601: Date/time modified of the current entry, in RFC 822 and ISO 8601 formats respectively. These variables can be used in the story and date templates.

      Note that the ISO 8601 format produced is the complete date plus hours, minutes and seconds: YYYY-MM-DDThh:mm:ssTZD (e.g., 1997-07-16T19:20:30+01:00).

  • You can set the variable $debug to 1 or greater to produce additional information useful in debugging the operation of the plugin; the debug output is sent to your web server’s error log.
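
The two date formats exported under $export_dates can be illustrated with a short Python sketch (not the plugin's Perl). The function name export_dates and its utc_offset_hours parameter are hypothetical; the RFC 822 style shown uses a four-digit year and a numeric zone, which may differ in detail from what the plugin emits.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def export_dates(epoch_seconds, utc_offset_hours=0):
    """Return (rfc822_style, iso8601) strings for the given time and zone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    dt = datetime.fromtimestamp(epoch_seconds, tz)
    return format_datetime(dt), dt.isoformat()
```

Applied to the example time given above with a one-hour offset, the ISO 8601 result is 1997-07-16T19:20:30+01:00.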

This plugin supplies filter, skip, story, and last subroutines. It needs to run after any other plugin whose filter subroutine changes the list of entries included in the response; otherwise the Last-modified date may be computed incorrectly. It needs to run after any other plugin whose skip subroutine does redirection (e.g., the canonicaluri plugin) or otherwise conditionally sets the HTTP status to any value other than 200. Finally, this plugin needs to run after any other plugin whose last subroutine changes the output for the page; otherwise the Content-length value (and the ETag and Last-modified values, if you are using strong validation) may be computed incorrectly. If you are encountering problems in any of these regards then you can force the plugin to run after other plugins by renaming it to, e.g., 99lastmodified2.

Bugs

Several of the following items are not in fact bugs, but the behaviors in question may cause confusion in some cases; hence their inclusion here:

  • As discussed above, with weak validation the Last-modified header generated may not always reflect the date/time at which the bit-for-bit contents of the page most recently changed, and the contents of the page may change without changing the ETag value. In particular, if changes are made to flavour files used in generating the page or comments are added to a page via variable interpolation (e.g., as done by the writeback plugin and others) then a user will not necessarily see such changes without forcing a full reload of the page (i.e., using an unconditional GET request). This should be considered a feature and not a bug; if you are not comfortable with this behavior then you should set $strong to 1 to enable strong validation.

  • When ETag generation is enabled and Last-modified disabled (or vice versa) and a request includes both an If-none-match and If-modified-since header, the plugin will not return a 304 response under any circumstances. This is not a bug, but rather complies with section 13.3.4 of the HTTP 1.1 specification: “An HTTP/1.1 origin server, upon receiving a conditional request that includes both a Last-Modified date (e.g., in an If-Modified-Since or If-Unmodified-Since header field) and one or more entity tags (e.g., in an If-Match, If-None-Match, or If-Range header field) as cache validators, MUST NOT return a response status of 304 (Not Modified) unless doing so is consistent with all of the conditional header fields in the request.”

    In other words, if a conditional request contains both tests and we can’t perform one of the tests (because we’re not generating the header value used in the test) then we can’t return a 304 regardless of the results of the other test.

  • If the Cache-control and/or Expires headers are enabled then a user requesting to view a page will not necessarily see updates to that page even if the underlying entries have been changed since the last time the user viewed the page. This should be considered a feature and not a bug; if you are not comfortable with this behavior then you should not enable generation of the Cache-control and/or Expires headers, or you should explicitly prohibit caching by setting the freshness time to 0.

  • When using the Expires header to prohibit caching, for strict consistency with the HTTP 1.1 specification (section 14.21) the date/time sent with the Expires header should be equal to the date/time sent with the Date header. However we don’t necessarily know what the exact Date value is (at least not for Apache, where it is generated by the server itself), and it’s possible that the current date/time as measured in the plugin itself may be a little bit later than the time in the Date header, so instead we set the Expires value to be a minute before the current date/time (as measured in the plugin itself).

    This should produce correct behavior for HTTP 1.0 clients relying on the Expires header, per section 10.7 of the HTTP 1.0 specification, as well as for HTTP 1.1 clients in the absence of a Cache-control header, per section 14.9.3 of the HTTP 1.1 specification.

  • As noted previously, if we’re doing strong validation using Last-modified and we don’t have a cached Last-modified value then we have to make one up; we arbitrarily set it to 5 seconds prior to the current time. Since our value for the current time may be later than that used in the Date header (as noted in the previous item), it’s possible that the Last-modified value generated may be in the future relative to that sent with the Date header, especially if the CGI script takes a long time to run (e.g., because of heavy load). This violates the HTTP 1.1 specification (see section 14.29).

    The probability of this happening could be lessened by setting an earlier Last-modified time; however this increases the possibility of having two updates occur within the n-second time window between the ostensible Last-modified time and the current time, and there may be race conditions associated with this that could cause other problems, such as sending a Last-modified value that’s earlier than one sent previously for the same URI.

  • With strong validation using Last-modified it’s possible that the plugin may attempt to update the cache file while another plugin invocation (resulting from a simultaneous request) may attempt to read it; more seriously, two plugin invocations may attempt to both update the validator cache file simultaneously. I’ve tried to minimize problems relating to this by having the plugin write out cache data to a temporary file and then rename it to the real file; if the rename is an atomic operation then this should eliminate the problem of a plugin invocation trying to read from a partially-written validator cache file.

    As for simultaneous updates, presumably the worst that can happen is that one of the plugin invocations will fail to update the cache entry for its URI (since its changes will be overwritten by the second plugin invocation); however this simply means that the plugin won’t be able to send a 304 on a subsequent conditional GET for that URI, and will then have to update the cache file again.

  • As noted above, you may experience problems if you install this plugin with other plugins that set HTTP status in the skip subroutine. Blosxom stops executing skip subroutines as soon as one returns a true value, so whichever plugin is first in the execution order will get to set the final HTTP status.
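
The write-to-a-temporary-file-then-rename approach mentioned in the list above can be sketched as follows, in Python rather than the plugin's Perl. The function names, the JSON file format, and the temporary file naming are all assumptions for illustration; the plugin's cache file format is not shown here.

```python
import json
import os
import tempfile


def write_cache_atomically(path, cache):
    """Write the cache to a temporary file, then rename it over the real file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays
    # on one filesystem; mkstemp also picks a name not already in use.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".validator-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(cache, handle)
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_cache(path):
    """Read the cache, treating a missing file as an empty cache."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

This protects readers from partially written files, but it does not prevent the lost-update case described above, where two invocations each write a complete file and the later rename wins.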

To do

Here are some ideas for ways in which the lastmodified2 plugin could be enhanced and extended:

  • Support selective use of strong or weak validation depending on the flavour. For example, weak validation would probably work fine for RSS and Atom feeds, since they typically contain content only for entries; however strong validation may be needed for the HTML flavour of individual entry pages (and, to a lesser extent, HTML index pages) in order to pick up changes due to comments.

    Note that doing this would be perfectly compatible with the HTTP 1.1 protocol specification, since different flavours correspond to different URIs; any given URI (or set of URIs) could be either strongly validated or weakly validated independent of any other URIs.

  • Support specifying different freshness times for different types of content, e.g., for different flavours, for individual entries vs. entry index pages, and/or for current index pages vs. archive index pages.

  • Try to make the filename for the validator cache temporary file more unique to minimize the possibility of name collisions by simultaneous plugin invocations. (Perhaps use Time::HiRes to get subsecond times?)

  • For completeness, support the case where the If-none-match header has the value ’*’ (which matches any entity). See section 14.26 of the HTTP 1.1 specification for the desired behavior in this case.

  • For completeness, support conditional GETs using the If-match and/or If-unmodified-since headers within the plugin itself, in addition to If-none-match and/or If-modified-since. However note that this doesn’t appear to be necessary for Apache, since it appears to correctly make these checks as long as ETag and/or Last-modified headers are returned by the CGI script.