March 17, 2016

Rewriting the CDX file format

CDX files are used to support URL+timestamp searching of web archives. They've been around for a long time, having first been used to catalog the contents of ARC files. Despite the advent of the WARC file format, they haven't changed much. I think it is past due that we reconsider the format from the ground up.

The current specification lists a large number of possible fields. Many are not used in typical scenarios.

The first field is a canonicalized URL, i.e. a URL with trivial elements (such as the protocol) removed so that equivalent URLs end up with the same canonical form. This serves as the primary search key.

The only problem with this is that searching for content in all subdomains of a domain is not possible without scanning the entire CDX, because the subdomain comes before the domain in the key. We should use the SURT (Sort-friendly URI Reordering Transform) form of the canonical URL instead. SURT URLs turn the domain/sub-domain structure around, making such queries fairly straightforward. There is essentially no downside to doing this and, in fact, a number of CDXs have been built in this manner, regardless of any "formal" standardization (as there isn't really any formal standard).
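To illustrate why the reordering helps, here is a minimal sketch of a subdomain query against a sorted list of SURT keys. The function name prefix_range and the sample keys are made up for this example, and it assumes keys never contain the code point U+FFFF. Because every host under example.com shares the prefix "com,example,", the matching entries form one contiguous run that a binary search can find without a full scan.

```python
import bisect

def prefix_range(sorted_keys, prefix):
    """Return the entries of sorted_keys that start with prefix."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # Assumption: no key contains U+FFFF, so prefix + "\uffff" sorts
    # after every key that begins with prefix.
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

# Hypothetical index keys, kept in sorted order as in a CDX file.
surt_keys = sorted([
    "com,example,)/",
    "com,example,blog,)/post",
    "com,example,www,)/",
    "com,example,www,)/about",
    "com,examplefoo,)/",
    "org,example,)/",
])

# All captures for example.com and its subdomains, found by binary search.
matches = prefix_range(surt_keys, "com,example,")
```

With plain canonical URLs the same hosts would be scattered across the index (blog.example.com sorts under "b", www.example.com under "w"), which is why a scan is needed there.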

I suggest that any revised CDX format mandate the use of SURT URLs for the first field. Furthermore, we should utilize the correct SURT format. In most (probably all) current CDXs with SURT URLs, an annoying mistake has been made where the closing comma is missing. A URL that should read:
   com,example,www,)
instead reads:
   com,example,www)
The protocol prefix has been removed as unnecessary, along with the opening parenthesis.
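A simplified sketch of producing such a key follows. The function name to_surt is hypothetical, and the only canonicalization it does is lowercasing the host, so it is not a substitute for a real canonicalizer. It reverses the host labels, follows every label (including the last) with a comma, drops the scheme and the opening parenthesis, and appends the path and query after the closing parenthesis.

```python
from urllib.parse import urlsplit

def to_surt(url):
    """Build a simplified SURT-style key from a URL (illustrative only)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".") if host else []
    # Each label is followed by a comma, so the last one is too.
    reversed_host = "".join(label + "," for label in reversed(labels))
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return reversed_host + ")" + path + query

key = to_surt("http://www.example.com/")
```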

The second field should remain the timestamp, with whatever precision is available in the ARC/WARC, i.e. a W3C ISO 8601 timestamp of varying precision as per this proposed revision to the WARC standard (the revision is extremely likely to be included in WARC 1.1).

The third field would remain the original URL.

The fourth field should be a content digest including the hashing algorithm. Presently, this field is missing the algorithm.
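As a sketch of what an algorithm-labelled digest could look like, the snippet below prefixes the digest with the algorithm name. The "algorithm:value" layout and the Base32 encoding are assumptions borrowed from how digests commonly appear in WARC headers, and the function name labelled_digest is made up for this example.

```python
import base64
import hashlib

def labelled_digest(payload, algorithm="sha1"):
    """Return a digest string such as 'sha1:<base32 value>' for payload bytes."""
    raw = hashlib.new(algorithm, payload).digest()
    # Assumption: Base32 encoding of the raw digest bytes.
    return algorithm + ":" + base64.b32encode(raw).decode("ascii")

digest_field = labelled_digest(b"<html>example payload</html>")
```

A reader of the CDX can then split on the first colon to learn which algorithm to use when verifying or de-duplicating content.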

The fifth field would be the WARC record type (or a special value to indicate an ARC response record). This is the most significant change, as it allows us to capture additional WARC record types (such as metadata and conversion) while also handling the existing fields in a more targeted manner (e.g. response vs revisit). It might be argued that this should be the second field to facilitate searches of a specific record type. I believe that, properly implemented, this field would allow replay tools to effectively surface any content "related" to the URL currently being viewed, a problem that I know many are trying to tackle.

The next two fields would be the filename of the WARC (or ARC) containing the record (filenames are supposed to be unique) and the offset at which the record starts within the (W)ARC. This is as it works currently. Some would argue for a more expressive resource locator here, but I believe that is best handled by a separate (W)ARC resolution service. Otherwise you may have to substantially rebuild your CDX index just because you moved your (W)ARCs to a new disk or service.
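A toy sketch of that separation is shown below. The class WarcLocator, its methods, and the paths are all hypothetical stand-ins for whatever resolution service an institution actually runs. The point is only that the CDX stores the bare filename, and a move changes the locator's mapping rather than the index.

```python
import posixpath

class WarcLocator:
    """Hypothetical stand-in for a (W)ARC resolution service."""

    def __init__(self):
        self._directories = {}

    def register(self, filename, directory):
        # Record (or update) where a uniquely named (W)ARC currently lives.
        self._directories[filename] = directory

    def locate(self, filename):
        # Turn the bare filename from a CDX line into a full path.
        return posixpath.join(self._directories[filename], filename)

locator = WarcLocator()
locator.register("EXAMPLE-00001.warc.gz", "/mnt/disk1/warcs")
before_move = locator.locate("EXAMPLE-00001.warc.gz")

# Moving the file only touches the locator; CDX lines stay unchanged.
locator.register("EXAMPLE-00001.warc.gz", "/mnt/disk2/warcs")
after_move = locator.locate("EXAMPLE-00001.warc.gz")
```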

Lastly, there should be a single-line JSON "blob" containing additional data relevant to the record type. For response records, this would include the HTTP status code and content type, which I've excluded from the "base" fields in the CDX. This part would be significantly more flexible due to the JSON format, allowing us to include optional data where appropriate. The full range of possible values is beyond the scope of this blog post.
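Putting the fields together, a line in this layout could be written and read roughly as follows. This is only a sketch: the space delimiter, the field names in FIELDS, the JSON keys "status" and "content_type", and the sample values are all assumptions, and it presumes the seven base fields never contain a space.

```python
import json

# Hypothetical names for the seven base fields, in the proposed order.
FIELDS = ("surt", "timestamp", "original_url", "digest",
          "record_type", "filename", "offset")

def format_line(record, extra):
    """Join the base fields and a compact one-line JSON blob with spaces."""
    base = [str(record[name]) for name in FIELDS]
    blob = json.dumps(extra, separators=(",", ":"), sort_keys=True)
    return " ".join(base + [blob])

def parse_line(line):
    """Split off the seven base fields; the remainder is the JSON blob."""
    *base, blob = line.rstrip("\n").split(" ", len(FIELDS))
    record = dict(zip(FIELDS, base))
    record["offset"] = int(record["offset"])
    return record, json.loads(blob)

record = {
    "surt": "com,example,www,)/",
    "timestamp": "2016-03-17T12:00:00Z",
    "original_url": "http://www.example.com/",
    "digest": "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ",
    "record_type": "response",
    "filename": "EXAMPLE-00001.warc.gz",
    "offset": 1234,
}
extra = {"status": 200, "content_type": "text/html; charset=UTF-8"}

line = format_line(record, extra)
parsed_record, parsed_extra = parse_line(line)
```

Limiting the split to the first seven delimiters means spaces inside JSON string values do not break parsing, and the leading SURT field keeps the file sortable as plain text.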

There is clearly more work to be done on the JSON aspect, plus some adjustments may be necessary to the base data, but I believe that, at minimum, this is the right direction to head in. Of course, this means we have to rebuild all our CDX files in order to implement this. That's a tall order, but the benefits should be more than enough to justify that one-time cost.

2 comments:

  1. This comment has been removed by the author.

  2. I have followed the CDX debate from the sidelines and see the CDX-format itself as usable for corpus definition, while a CDX Server API allows for implementation independence. Good things, both of them.

    From my outsider point of view, there seems to be an unspoken assumption that the CDX-format and the CDX-lookup-code implementation are tied together in that the CDX-files are used as-is by the lookup-code. But a SURTed URL contains the same information as a plain one, so both work equally well as an implementation independent CDX format.

    Granted, it makes practical sense to define the format so that it is easily usable by the common tools. But I don't think it should be a requirement.

    To put it into perspective, I did a little practical experiment with using Solr for CDX lookup functionality (details at https://github.com/tokee/solr-cdx). I am sure I missed some functionality, but the CDX Server API requirements I found were easily supported by Solr. In that scenario, the CDX format is purely for (optional) import & export.

    Rambling on, I find that a large challenge here is how normalisation of URLs is done. If the CDX format should only contain one URL, I would expect that to be the original one and normalisation to be done by the implementation. If it contains an original URL and a normalised one, then there is a lot of redundant information from a format point of view. Anyway, the exact rules for normalising need to be stated, in order to make requests across collections.
