UPDATE 2023-03-27: This page is obsolete, as it refers to a prior version of this blog. However, it may be of historical interest.
In a previous post I discussed the general problem of validating and caching dynamic content. In order to implement the strategy outlined in that post I decided to create a new version of the lastmodified plugin originally created by Bob Schumaker. The lastmodified plugin was a good base to build on; however it didn’t do exactly what I wanted to do, and hence I couldn’t resist trying to improve on it.
The following material documents the lastmodified2 plugin that I created, including my notes on how I implemented page validation according to my interpretation of the HTTP 1.1 specification.
The strategy revisited
As you may recall, in my previous post I outlined an overall strategy for how to support validating and caching dynamic content. Here’s a recap of that strategy, with additional detail added on the subject of validation:
- When sending responses to requests, add a Content-length header to identify the total number of bytes in the response.
- When sending responses to requests, add an ETag header to identify the “version number” (entity tag) for this particular version of the page, and/or a Last-Modified header to identify the date/time the page was last modified. These are computed as follows, depending on whether weak or strong validation is used:
  - For weak validation, the Last-modified header should reflect the date/time modified of the most recently-updated “semantically-significant” component of the page. (For example, for Blosxom we consider entries to be semantically significant, but not flavour templates.) The ETag header can then supply a weak etag directly derived from this date/time.
  - For strong validation, the ETag header should change if even a single bit on a page changes; for example, it could be derived from the MD5 or SHA-1 digest of the page. A Last-modified header value could then be determined by consulting a cached copy of the ETag and Last-modified values for the URI; if there is a cache match then the Last-modified value can be taken from the cache, otherwise it can be arbitrarily assigned to be a date in the recent past.
- When sending responses to requests, also add Cache-control and Expires headers to the response to provide a “use by” date/time to clients doing caching.
- When processing requests, look for the If-none-match and If-modified-since headers. If one or both are present, return the full page in the response only if necessary: if the version of the page currently available is different than the version requested in the If-none-match header, or if the page has been modified since the date in the If-modified-since header.
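To make the difference between the two kinds of validator concrete, here is a minimal sketch in Python (the plugin itself is written in Perl, so this is an illustration, not the plugin's code). The function names and the exact format of the weak entity tag are assumptions made for the example.

```python
import hashlib
from email.utils import formatdate


def weak_validators(entry_mtimes):
    """Weak validation: both validators come from the newest entry's mtime.

    entry_mtimes is an iterable of modification times in epoch seconds.
    The 'W/"<epoch>"' tag format is an assumption for this sketch.
    """
    latest = int(max(entry_mtimes))
    last_modified = formatdate(latest, usegmt=True)  # HTTP date string
    etag = 'W/"%d"' % latest                         # weak tags carry a W/ prefix
    return etag, last_modified


def strong_etag(page_bytes):
    """Strong validation: the tag changes if any byte of the page changes."""
    return '"%s"' % hashlib.md5(page_bytes).hexdigest()
```

With weak validation the validators are known as soon as the list of entries is known; with strong validation the complete page has to be generated first.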
Implementation overview
This section and the next describe in more depth how I implemented the above strategy.
First, the plugin is designed to have its behavior easily modifiable using configurable variables, as is done with other Blosxom plugins. In particular, it is possible to specify whether the plugin should do strong or weak validation ($strong
boolean variable) and whether it should generate an ETag
header, Last-modified
header, or both ($generate_etag
and $generate_mod
boolean variables). By default the plugin is configured to be a “plug-compatible” replacement for the lastmodified plugin, doing weak validation and generating both ETag
and Last-modified
headers.
The basic plan of the lastmodified2 plugin is as follows:
- start subroutine: Read in the cached information containing the previous ETag and Last-modified values for this URI.
- filter subroutine: Get the information necessary for weak validation by traversing the list of entries to be displayed on the page and determining the date/time any of the entries was most recently modified. Use this last-modified date/time to create a weak ETag value.
- skip subroutine: For weak validation interpret any If-none-match or If-modified-since headers and determine whether or not we need to send a full response. If not we can skip the actual story processing after setting Status to 304 (Not Modified) and generating any other headers appropriate for a 304 response.
- last subroutine: For strong validation generate an MD5 digest of the page and use this to create the ETag value. Create the Last-modified header by using the cached Last-modified value if the new ETag value matches the cached ETag value, otherwise assigning a new Last-modified value in the very recent past. Then interpret any If-none-match or If-modified-since headers in the request and determine whether or not we need to send a full response. In either case send the appropriate headers.
Note that there is also a story
subroutine in the lastmodified2 plugin, but its purpose is restricted to setting output variables (e.g., for use in flavour templates) for compatibility with the lastmodified plugin. It does not affect the actual caching and validation processes.
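The "do we need to send a full response?" decision made in the skip and last subroutines can be sketched roughly as follows. This is a simplified Python illustration, not the plugin's Perl code: the function name, the dictionary of request headers, and the naive comma splitting of If-none-match are all assumptions. A 304 is allowed only when every conditional header present in the request is satisfied, and a test we cannot perform (because the corresponding validator is not being generated) counts as a failure.

```python
from email.utils import parsedate_to_datetime


def _opaque(tag):
    """Drop a leading W/ so tags are compared weakly, as If-None-Match allows."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag, last_modified):
    """Return True if a 304 (Not Modified) response would be appropriate.

    etag / last_modified are the current validators, or None when the
    corresponding header is not being generated.
    """
    inm = request_headers.get("If-None-Match")
    ims = request_headers.get("If-Modified-Since")
    if inm is None and ims is None:
        return False  # unconditional request: always send the page

    if inm is not None:
        if etag is None:
            return False  # cannot perform the entity tag test
        # Naive split; assumes no commas inside the tags themselves.
        candidates = {_opaque(t) for t in inm.split(",")}
        if _opaque(etag) not in candidates:
            return False

    if ims is not None:
        if last_modified is None:
            return False  # cannot perform the date test
        try:
            if parsedate_to_datetime(last_modified) > parsedate_to_datetime(ims):
                return False  # page changed after the client's copy
        except (TypeError, ValueError):
            return False  # unparseable date: fall back to a full response

    return True
```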
Implementation details
Like the lastmodified plugin, this version of the plugin looks for and acts upon the If-modified-since
header itself, instead of letting the underlying web server deal with it. Note that the 1.3 and 2.0 versions of Apache in common use today have a feature whereby the underlying web server will handle If-modified-since
checks as long as the CGI script simply sets the Last-modified
header; this can be used to easily implement simple validation. (Previous versions of this plugin relied on this feature.)
The plugin also looks for and acts upon the If-none-match
header. (Apache does not do this for CGI scripts, so the plugin has no choice but to do it itself.) Note that for weak validation ($generate_etag
set to 1 but $strong
set to 0) we generate entity tag values using the date/time modified of the most recent entry, so both the If-modified-since
check and the If-none-match
check can be done as soon as we compute the Last-modified
value, which is done in the filter
subroutine. This allows the plugin to save processing time by skipping story processing (i.e., using the skip
subroutine) when it does not need to return a full response to a conditional GET with If-modified-since
or If-none-match
.
For strong validation using ETag
($generate_etag
and $strong
both set to 1) the ETag
value is computed as an MD5 digest of the entire page as it will be returned to the user, in order to distinguish changes that affect even a single bit of the page. We can’t skip story processing in this case since we need the complete output (including the results of interpolating variables) in order to compute the correct MD5 digest.
For strong validation using Last-modified
($generate_mod
and $strong
both set to 1) we also compute an MD5 digest of the entire page as it will be returned to the user, and we compare that value against a cached MD5 digest computed for the page on previous requests. If they match then we know that no changes have occurred since the previous requests, and we set the Last-modified
value to the value cached with the MD5 digest. Otherwise we know that some change has occurred since the time of the previous requests, but do not know exactly when that change occurred; we therefore arbitrarily set the Last-modified
value to a time just prior to the time of the current request. Note that we can’t skip story processing in this case either, since again we need the complete output (including the results of interpolating variables) in order to compute the correct MD5 digest.
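The digest-and-cache logic for strong validation can be sketched as below. This is a Python illustration under stated assumptions, not the plugin's Perl code: the cache is modelled as a plain dictionary keyed by URI, the function and parameter names are invented for the example, and the 5-second backdating interval is a parameter here.

```python
import hashlib
import time


def strong_validators(uri, page_bytes, cache, now=None, backdate=5):
    """Return (etag, last_modified_epoch) for a fully generated page.

    cache maps uri -> (md5_hexdigest, last_modified_epoch). If the digest
    matches the cached one, the cached time is reused; otherwise the page
    has changed at some unknown moment, so a time shortly before 'now' is
    assigned and remembered.
    """
    now = int(time.time()) if now is None else int(now)
    digest = hashlib.md5(page_bytes).hexdigest()
    cached = cache.get(uri)
    if cached is not None and cached[0] == digest:
        mtime = cached[1]              # unchanged since an earlier request
    else:
        mtime = now - backdate         # changed: pick a time in the recent past
        cache[uri] = (digest, mtime)
    return '"%s"' % digest, mtime
```

For example, two requests for an unchanged page report the same Last-modified time no matter how far apart they are, while any change to the output produces a new digest and a new time.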
Note that for weak validation ($strong
set to 0) the Last-modified
header does not necessarily provide the date/time at which the actual (bit for bit) contents of the page last changed; instead it provides the date/time at which the meaning of the page last changed, i.e., because the contents of at least one entry on the page were changed. It is possible for other elements on the page such as headers, footers, or comments to change without changing the meaning of the page in this sense, so in this case the Last-modified
value is only a “weak validator” as defined by section 13.3.3 of the HTTP 1.1 specification. When $strong
is set to 0 the entity tag provided by the ETag
header is derived from the Last-modified
value and hence is also only a weak validator, and we explicitly mark it as such by prefixing it with “W/”, as described in section 3.11 of the HTTP 1.1 specification.
The net effect is that with weak validation if browsers, web caches, and news aggregators caching the page send a conditional GET request (i.e., with an If-none-match
and/or If-modified-since
header) to check the current status of the page, they will be given a brand new copy of the page only if there have been “semantically significant” changes to the page (in the words of the HTTP 1.1 specification). With strong validation they will get a new copy of the page if the page has changed in any way, no matter how slight.
Generation of the Cache-control
and Expires
headers is relatively straightforward: We use the value of $freshness_time
directly with the max-age
directive of the Cache-control
header, and add it to the current date/time to create a date/time in the future for the Expires
header. If $freshness_time
is set to 0 then we instead send the no-cache
directive with the Cache-control
header and set the Expires
header to a date in the past.
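As a rough illustration of the rule just described, here is a short Python sketch (not the plugin's Perl code; the function name and the returned dictionary are assumptions). The one-minute offset used when caching is prohibited is likewise a choice made for this example.

```python
from email.utils import formatdate


def cache_headers(freshness_time, now):
    """Build Cache-Control and Expires values from a freshness time in seconds.

    now is the current time in epoch seconds.
    """
    if freshness_time > 0:
        return {
            "Cache-Control": "max-age=%d" % freshness_time,
            "Expires": formatdate(now + freshness_time, usegmt=True),
        }
    # A freshness time of 0 means caching is prohibited.
    return {
        "Cache-Control": "no-cache",
        "Expires": formatdate(now - 60, usegmt=True),  # a date in the past
    }
```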
Generation of the Content-length
header is also straightforward: We simply use the length of $blosxom::output
. (This assumes of course that no other plugin will subsequently be changing that output.) Note that for HEAD requests Apache will not actually send the output but will send the Content-length
header if set; in that case the Content-length
value reflects the length of the output that would have been sent for a GET request, in compliance with section 14.13 of the HTTP 1.1 specification.
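One detail worth illustrating is that Content-length counts bytes, not characters. The following Python sketch is only an illustration of that point; the function name and the assumption that the output is held as text and sent as UTF-8 are mine, not something taken from the plugin.

```python
def content_length(output, encoding="utf-8"):
    """Length in bytes of the body as it would be sent on the wire."""
    return len(output.encode(encoding))
```

For plain ASCII output the two counts agree, but a single accented character such as "é" occupies two bytes in UTF-8.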
Note that the plugin does not generate the Last-modified
and Content-length
headers for a 304 (Not Modified) response, in accordance with section 10.3.5 of the HTTP 1.1 protocol specification.
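A minimal sketch of that filtering step, in Python rather than the plugin's Perl and with an invented function name, might look like this:

```python
def headers_for_status(status, headers):
    """Drop Last-Modified and Content-Length when the response is a 304."""
    if status == 304:
        omitted = ("last-modified", "content-length")
        return {k: v for k, v in headers.items() if k.lower() not in omitted}
    return dict(headers)
```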
Finally, for upward compatibility the lastmodified2 plugin supports the following features present in the original lastmodified plugin:
- Optionally checking the %others hash for last-modified dates: Checking %others is one way to detect changes other than changes in the entries themselves; in particular it can be used to detect changes to flavour files used in creating the page. Unfortunately some of the entries in %others are not relevant for the page being created (e.g., flavour files for flavours other than the one currently being generated) and may cause the Last-modified time to be computed incorrectly. Also, checking %others will not detect page changes due to interpolating variables into flavour files (e.g., for comments). Finally, some plugins that replace the default Blosxom entries subroutine (including the entries_cache_meta plugin in particular) do not create the %others hash at all.

  For the above reasons this feature is deprecated; you should not use it unless you need it for upward compatibility with your current lastmodified configuration. If you want to check for changes to a page outside the entries themselves then you should simply enable strong validation.
- Exporting variables with the last-modified time and other times in RFC 822 and ISO 8601 formats: The lastmodified2 plugin computes these variables essentially in the same way as the lastmodified plugin; see the code and the plugin documentation below for more information.

  Note that the variables $latest_rfc822 and $latest_iso8601 always refer to the date/time modified for the most recently updated entry, regardless of whether weak or strong validation is being used. The problem with interpreting $latest_rfc822 or $latest_iso8601 as a Last-modified value is that when using strong validation we wouldn’t have values for these variables until we completed generating output for the page, too late for the variables to be of any use.
If you want to use the lastmodified2 plugin to replace an existing configuration of the lastmodified plugin, change the plugin’s filename and Perl package name (i.e., in the package
statement at the beginning of the code) to “lastmodified” and set the $generate_mod
, $generate_etag
, and $use_others
configurable variables to match your current values. All other configurable variables can be left as is.
Description
This section and the succeeding ones contain more in-depth documentation of the lastmodified2 plugin to supplement the material included in the plugin itself.
The lastmodified2 plugin enables caching and validation of dynamically-generated Blosxom pages by web browsers, web proxies, news aggregators, and other clients by generating various cache-related HTTP headers in the response and supporting conditional GET requests, as described below. This can reduce excess network traffic and server load caused by requests for RSS or Atom feeds or for web pages for popular entries or categories.
The plugin generates an ETag
header to identify the particular version of the page, as well as a Last-modified
header based on the plugin’s determination of when the contents of the page were most recently modified. The plugin also recognizes and properly acts on an If-none-match
and/or If-modified-since
header in a request, enabling a client to check whether the page has changed since it last requested the page. This reduces network traffic for the site, because the server can skip returning a copy of the page if in fact it has not changed.
The plugin can also optionally generate Cache-control
and/or Expires
headers to specify how long copies of a page should be retained by caches. This reduces server load for the site, because web proxies and other caching clients can use a cached copy of the page and avoid sending additional requests for the page (including conditional GET requests) to the site’s server for as long as the page remains fresh. Alternatively you can use the Cache-control
and Expires
headers to specify that pages should not be cached at all under any circumstance. This helps ensure that users always get the most up-to-date content, at the expense of increased server load.
Finally, the plugin also generates a Content-length
header containing the length in bytes of the content (“entity body” in HTTP 1.1 jargon). Providing a Content-length
header supports persistent connections for clients that use the HTTP 1.0 “keep-alive” mechanism (as documented in section 19.7.1 of RFC 2068); this can reduce the number of connections to the site in some cases.
Note that at present this plugin can be used as a replacement for the lastmodified plugin and its default configuration is essentially equivalent to that of the lastmodified plugin, as discussed below.
Installation and configuration
To install the lastmodified2 plugin copy the plugin file into your Blosxom plugin directory. You should not normally need to rename the plugin; however see the discussion below.
Configurable variables specify how the plugin handles validation ($generate_etag
, $generate_mod
, and $strong
), caching ($generate_cache
, $generate_expires
, and $freshness_time
), whether or not to generate any other recommended headers ($generate_length
), and whether to implement features from the lastmodified plugin for compatibility ($use_others
and $export_dates
).
For validation the most common configurations are the following:
- No validation: $generate_etag and $generate_mod both set to 0. The plugin does not generate ETag or Last-modified headers, and does not check If-none-match and If-modified-since headers in the request. Use this configuration if you plan to allow caching of responses (as discussed below) but for some reason you don’t want to do validation.
- Weak validation: $generate_etag and $generate_mod both set to 1, $strong set to 0. The plugin generates both ETag and Last-modified headers based on the most recent time that any entry on the page was modified; it checks for If-none-match and/or If-modified-since headers in the request, and sends a 304 (Not Modified) response with no output when it can do so. Use this configuration if changes to your pages are only (or at least primarily) due to changes to the entries themselves. This is the default configuration, for compatibility with the lastmodified plugin.
- Strong validation: $generate_etag, $generate_mod, and $strong all set to 1. The plugin generates both ETag and Last-modified headers based on the current page’s contents and our estimate as to when the contents were last modified; the plugin checks for If-none-match and/or If-modified-since headers in the request, and sends a 304 response when it can do so. Use this configuration if your pages contain comments or other material that is updated more frequently than the entries themselves.
- Strong validation using ETag only: $generate_etag and $strong set to 1, $generate_mod set to 0. The plugin generates only an ETag header (not Last-modified) and checks only for an If-none-match header in the request (not If-modified-since). Use this configuration if you want to support strong validation but don’t want the performance overhead of caching Last-modified values as previously described. Note that this configuration does not support validation for HTTP 1.0 clients or other clients that do not support validation using If-none-match.
Note that if you set $generate_mod
and $strong
to 1 then you might as well set $generate_etag
to 1 as well, since correctly using Last-modified
as a strong validator requires that we generate and cache MD5 digests of the page in order to detect any changes, and these digests are also what we use to generate ETag
values.
For caching the most common configurations are the following:
- No caching: $generate_cache and $generate_expires both set to 0. The plugin does not generate either a Cache-control or Expires header, and thus web proxies and other clients will typically not cache returned pages. This is the default configuration; use it if you don’t care about caching.
- Caching allowed: $generate_cache and $generate_expires both set to 1, and $freshness_time set to a positive integer value. The plugin generates Cache-control and Expires headers that allow for caching of returned pages for up to $freshness_time seconds from the time of the request. Use this configuration if you’d like to allow caching by proxies and other clients to reduce server hits due to GET requests (whether conditional or not), and set $freshness_time to a value comparable to the frequency with which your site is updated.

  (By default $freshness_time is set to 3,000 seconds, long enough to provide some benefit through caching by web proxies, especially during periods of heavy load, but short enough to ensure that news aggregators doing hourly polling will always use up-to-date copies of feeds.)
- Caching prohibited: $generate_cache and $generate_expires both set to 1, and $freshness_time set to 0. The plugin generates Cache-control and Expires headers that specifically prohibit caching of returned pages. Use this configuration if you want all clients to always see the most up-to-date content.
Note that if you set $generate_cache
to 1 then you might as well set $generate_expires
to 1 and vice versa, in order to properly support both HTTP 1.1 and HTTP 1.0 clients; there is no performance penalty for doing so.
The other configurable variables are as follows:
- $generate_length controls whether or not to generate a Content-length header. The default is to generate the header; you can disable this by setting $generate_length to 0. Note that support of HTTP 1.0 persistent connections using Content-length requires that your web server be configured to support persistent connections in the first place; for Apache this is done using the KeepAlive On directive in the Apache configuration file.

  Also note that HTTP 1.1 clients can use persistent connections even if the Content-length header is not present, if (like Apache) the underlying web server supports HTTP 1.1 persistent connections for CGI scripts using the Connection header and chunked transfer coding. However we generate a Content-length header by default because it’s recommended by section 14.13 of the HTTP 1.1 specification.
- $use_others controls whether changes to flavour files and other non-entry files in the Blosxom data directory should also be considered semantically significant for weak validation. Note that this feature is provided only for compatibility with the lastmodified plugin and its use is deprecated; by default it is disabled.
- $export_dates controls whether or not the plugin should set the following variables for use in flavour templates and other plugins:
  - $now_rfc822 and $now_iso8601: Current date/time, in RFC 822 and ISO 8601 formats respectively. These variables can be used in any flavour template.
  - $latest_rfc822 and $latest_iso8601: Date/time modified of the most recently modified entry to be displayed on the page, in RFC 822 and ISO 8601 formats respectively. These variables can be used in any flavour template.
  - $others_rfc822 and $others_iso8601: Date/time modified of the most recently modified non-entry file in the Blosxom data directory, in RFC 822 and ISO 8601 formats respectively. These variables can be used in any flavour template, but are set only if $use_others is set to 1.
  - $story_rfc822 and $story_iso8601: Date/time modified of the current entry, in RFC 822 and ISO 8601 formats respectively. These variables can be used in the story and date templates.

  Note that the ISO 8601 format produced is the complete date plus hours, minutes and seconds: YYYY-MM-DDThh:mm:ssTZD (e.g., 1997-07-16T19:20:30+01:00).

You can set the variable $debug to 1 or greater to produce additional information useful in debugging the operation of the plugin; the debug output is sent to your web server’s error log.
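For readers unfamiliar with the two date formats exported when $export_dates is enabled, the following Python sketch shows what they look like for a single timestamp. It is an illustration only (the plugin is Perl); the function name and the explicit UTC-offset parameter are assumptions, and the RFC 822 form is shown with a four-digit year.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def export_dates(epoch, utc_offset_minutes=0):
    """Return (rfc822_string, iso8601_string) for a time in epoch seconds."""
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    dt = datetime.fromtimestamp(epoch, tz)
    rfc822 = format_datetime(dt)              # e.g. "Wed, 16 Jul 1997 19:20:30 +0100"
    iso8601 = dt.isoformat(timespec="seconds")  # YYYY-MM-DDThh:mm:ssTZD
    return rfc822, iso8601
```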
This plugin supplies filter
, skip
, story
, and last
subroutines. It needs to run after any other plugin whose filter
subroutine changes the list of entries included in the response; otherwise the Last-modified
date may be computed incorrectly. It needs to run after any other plugin whose skip
subroutine does redirection (e.g., the canonicaluri plugin) or otherwise conditionally sets the HTTP status to any value other than 200. Finally, this plugin needs to run after any other plugin whose last
subroutine changes the output for the page; otherwise the Content-length
value (and the ETag
and Last-modified
values, if you are using strong validation) may be computed incorrectly. If you are encountering problems in any of these regards then you can force the plugin to run after other plugins by renaming it to, e.g., 99lastmodified2.
Bugs
Several of the following items are not in fact bugs, but the behaviors in question may cause confusion in some cases; hence their inclusion here:
- As discussed above, with weak validation the Last-modified header generated may not always reflect the date/time at which the bit-for-bit contents of the page most recently changed, and the contents of the page may change without changing the ETag value. In particular, if changes are made to flavour files used in generating the page or comments are added to a page via variable interpolation (e.g., as done by the writeback plugin and others) then a user will not necessarily see such changes without forcing a full reload of the page (i.e., using an unconditional GET request). This should be considered a feature and not a bug; if you are not comfortable with this behavior then you should set $strong to 1 to enable strong validation.
- When ETag generation is enabled and Last-modified disabled (or vice versa) and a request includes both an If-none-match and If-modified-since header, the plugin will not return a 304 response under any circumstances. This is not a bug, but rather complies with section 13.3.4 of the HTTP 1.1 specification: “An HTTP/1.1 origin server, upon receiving a conditional request that includes both a Last-Modified date (e.g., in an If-Modified-Since or If-Unmodified-Since header field) and one or more entity tags (e.g., in an If-Match, If-None-Match, or If-Range header field) as cache validators, MUST NOT return a response status of 304 (Not Modified) unless doing so is consistent with all of the conditional header fields in the request.”

  In other words, if a conditional request contains both tests and we can’t perform one of the tests (because we’re not generating the header value used in the test) then we can’t return a 304 regardless of the results of the other test.
- If the Cache-control and/or Expires headers are enabled then a user requesting to view a page will not necessarily see updates to that page even if the underlying entries have been changed since the last time the user viewed the page. This should be considered a feature and not a bug; if you are not comfortable with this behavior then you should not enable generation of the Cache-control and/or Expires headers, or you should explicitly prohibit caching by setting the freshness time to 0.
- When using the Expires header to prohibit caching, for strict consistency with the HTTP 1.1 specification (section 14.21) the date/time sent with the Expires header should be equal to the date/time sent with the Date header. However we don’t necessarily know what the exact Date value is (at least not for Apache, where it is generated by the server itself), and it’s possible that the current date/time as measured in the plugin itself may be a little bit later than the time in the Date header, so instead we set the Expires value to be a minute before the current date/time (as measured in the plugin itself).

  This should produce correct behavior for HTTP 1.0 clients relying on the Expires header, per section 10.7 of the HTTP 1.0 specification, as well as for HTTP 1.1 clients in the absence of a Cache-control header, per section 14.9.3 of the HTTP 1.1 specification.
- As noted previously, if we’re doing strong validation using Last-modified and we don’t have a cached Last-modified value then we have to make up one; we arbitrarily set it to 5 seconds prior to the current time. Since our value for the current time may be later than that used in the Date header (as noted in the previous item), it’s possible that the Last-modified value generated may be in the future relative to that sent with the Date header, especially if the CGI script takes a long time to run (e.g., because of heavy load). This violates the HTTP 1.1 specification (see section 14.29).

  The probability of this happening could be lessened by setting an earlier Last-modified time; however this increases the possibility of having two updates occur within the n-second time window between the ostensible Last-modified time and the current time, and there may be race conditions associated with this that could cause other problems, such as sending a Last-modified value that’s earlier than one sent previously for the same URI.
- With strong validation using Last-modified it’s possible that the plugin may attempt to update the cache file while another plugin invocation (resulting from a simultaneous request) may attempt to read it; more seriously, two plugin invocations may attempt to both update the validator cache file simultaneously. I’ve tried to minimize problems relating to this by having the plugin write out cache data to a temporary file and then rename it to the real file; if the rename is an atomic operation then this should eliminate the problem of a plugin invocation trying to read from a partially-written validator cache file.

  As for simultaneous updates, presumably the worst that can happen is that one of the plugin invocations will fail to update the cache entry for its URI (since its changes will be overwritten by the second plugin invocation); however this simply means that the plugin won’t be able to send a 304 on a subsequent conditional GET for that URI, and will then have to update the cache file again.
- As noted above, you may experience problems if you install this plugin with other plugins that set HTTP status in the skip subroutine. Blosxom stops executing skip subroutines as soon as one returns a true value, so whichever plugin is first in the execution order will get to set the final HTTP status.
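The write-to-a-temporary-file-then-rename technique mentioned in the item on concurrent cache updates can be sketched as follows. This is a Python illustration, not the plugin's Perl code: the function name and the JSON serialization are assumptions, and the atomicity of the final rename depends on the operating system and on the temporary file living on the same filesystem as the target.

```python
import json
import os
import tempfile


def write_cache_atomically(path, data):
    """Write data to path so readers never see a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp picks a unique name, so simultaneous writers do not collide
    # on the temporary file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".validator-cache-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
        os.replace(tmp_path, path)  # rename over the real file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # clean up the temporary file on failure
        raise
```

Note that this protects readers from partial files but does not merge simultaneous updates: the last writer still wins, as described above.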
To do
Here are some ideas for ways in which the lastmodified2 plugin could be enhanced and extended:
- Support selective use of strong or weak validation depending on the flavour. For example, weak validation would probably work fine for RSS and Atom feeds, since they typically contain content only for entries; however strong validation may be needed for the HTML flavour of individual entry pages (and, to a lesser extent, HTML index pages) in order to pick up changes due to comments.

  Note that doing this would be perfectly compatible with the HTTP 1.1 protocol specification, since different flavours correspond to different URIs; any given URI (or set of URIs) could be either strongly validated or weakly validated independent of any other URIs.
- Support specifying different freshness times for different types of content, e.g., for different flavours, for individual entries vs. entry index pages, and/or for current index pages vs. archive index pages.
- Try to make the filename for the validator cache temporary file more unique to minimize the possibility of name collisions by simultaneous plugin invocations. (Perhaps use Time::HiRes to get subsecond times?)
- For completeness, support the case where the If-none-match header has the value ’*’ (which matches any entity). See section 14.26 of the HTTP 1.1 specification for the desired behavior in this case.
- For completeness, support conditional GETs using the If-match and/or If-unmodified-since headers within the plugin itself, in addition to If-none-match and/or If-modified-since. However note that this doesn’t appear to be necessary for Apache, since it appears to correctly make these checks as long as ETag and/or Last-modified headers are returned by the CGI script.