Suggested metadata practices for legislation and regulations

The workshop held by the Legal Information Institute at the Cornell Law School on March 22 and 23, 2010, produced some general recommendations regarding legislative metadata. They are best imagined as a series of answers to a revisor of statutes, or other governmental publisher of statutes or regulations, who is asking, "What should I do?"

General considerations

First, do no harm. Any encoding should preserve any information value already in the text. Sometimes, for example when converting to XML from typesetting data that uses typography as a proxy for logical markup, this will involve encoding and preserving incompletely-understood data for later examination.

Second, use XML. A casual survey of state legislative systems, for example via one of the LII's state-law pages, shows a fair number that are clearly not using structured text in their underpinnings, some that clearly are, and many about which it is hard to tell. And of course it is increasingly a basic tenet of the entire free-access-to-law movement that all governmental data should be available freely, in bulk, in XML.

Document structure

Document structure should be captured at least at the paragraph level. Finer levels of granularity may be wanted if you are building a point-in-time legislative system or other drafting system (but see the remarks of Shetland and Bruce at JURIX 2007).

Document addressing should be supported at the same level as citation. If you can cite to a sub-sub-sub-sub-sub-sub-section, you should be able to link to it (or otherwise address it) too.
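
As a rough illustration of addressing at the same depth as citation, the sketch below turns a citation with nested subdivisions into a fragment identifier that could serve as a link target. The function name and the underscore-joined addressing scheme are assumptions made for this example; they are not drawn from the workshop or from any published standard.

```python
import re


def citation_to_fragment(citation: str) -> str:
    """Turn a citation such as '26 U.S.C. 501(c)(3)' into a fragment identifier.

    The scheme used here ('#s' + section, then one label per subdivision,
    joined by underscores) is a hypothetical convention for illustration.
    """
    # Find the trailing section number and any parenthesised subdivision labels.
    match = re.search(r"(\d+[A-Za-z]?)((?:\([0-9A-Za-z]+\))*)\s*$", citation)
    if match is None:
        raise ValueError(f"no section number found in {citation!r}")
    section, subdivisions = match.groups()
    # Pull out each label, however deeply the citation nests.
    labels = re.findall(r"\(([0-9A-Za-z]+)\)", subdivisions)
    return "#" + "_".join(["s" + section] + labels)


print(citation_to_fragment("26 U.S.C. 501(c)(3)"))  # prints "#s501_c_3"
```

Because the labels are extracted in a loop rather than to a fixed depth, every additional level that can be cited yields a correspondingly deeper address.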

A reasonable test for any structural encoding scheme is that it be transformable to encoding in the CEN/Metalex interchange standard. While CEN/Metalex is probably too abstract and generic to be desirable as the "workaday" encoding for any particular jurisdiction or agency, encodings that can be translated to it are most likely structurally workable. In any case, portability and interchange are valuable in themselves.

Document contents

Minimally, all internal crossreferences should be marked up in a way that permits construction of internal hyperlinks. To the extent that they are resolvable, external references to other documents should be marked up too.

Special consideration (and markup) should be given to text features that appear to be external references but which we do not yet know how to resolve. For example, external references that are plainly legal citations should be tagged at the boundaries even if their internal structure and resolution are not yet understood, or if suitable collections are not yet available as targets for hyperlinks. At this writing, a good-sized slice of the US Statutes at Large falls into this category -- but not all of it.
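
One way to tag citations "at the boundaries" without resolving them might look like the sketch below. It recognizes only the simple "volume Stat. page" pattern, and the element name `ref` and its `type` and `status` attributes are hypothetical, chosen for this example rather than taken from any existing schema.

```python
import re
from xml.sax.saxutils import escape

# Matches citations of the form "<volume> Stat. <page>", e.g. "124 Stat. 119".
STAT_CITATION = re.compile(r"\b\d+\s+Stat\.\s+\d+\b")


def tag_unresolved_citations(text: str) -> str:
    """Wrap Statutes at Large citations in a boundary element.

    The <ref> element and its attributes are hypothetical names. The citation
    is marked but deliberately not parsed or linked.
    """
    pieces = []
    last = 0
    for m in STAT_CITATION.finditer(text):
        # Escape the plain text between citations so the result stays well-formed.
        pieces.append(escape(text[last:m.start()]))
        pieces.append(
            '<ref type="legal-citation" status="unresolved">'
            + escape(m.group(0))
            + "</ref>"
        )
        last = m.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


print(tag_unresolved_citations("See 124 Stat. 119 & later acts."))
```

Marking the span now means a later pass can resolve it to a hyperlink once a suitable target collection exists, without having to rediscover where the citation begins and ends.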

This is not the same as external references to non-document objects or other real-world entities more properly the province of Linked Data. Linked Data may be especially useful in regulations, where rich information about the objects of regulation may be available. In any case, references to persons, official bodies, governmental structure, and so on are all useful objects for treatment.


A very minimal metadata set would consist of a title, an effective date, the name of the issuing body, and some sort of permanent identifier (preferably conforming to the URN:lex specification, for which worked examples are available). Traditional library cataloging practice seems to have been to apply this metadata at the whole-corpus level. It might make sense to apply it to smaller units.
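
The four-field minimal set could be modeled as a small record type, as in the sketch below. The class name, field names, effective date, and issuing body are all invented for illustration, and the identifier string is only a URN:lex-shaped placeholder that has not been validated against the specification.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class MinimalMetadata:
    """The four-field minimal set: title, effective date, issuer, identifier."""

    title: str
    effective_date: date
    issuing_body: str
    identifier: str  # ideally a URN:lex name; not validated here

    def to_dict(self) -> dict:
        """Return a plain dict with the date in ISO 8601 form."""
        record = asdict(self)
        record["effective_date"] = self.effective_date.isoformat()
        return record


# All values below are made up for the example.
example = MinimalMetadata(
    title="The Stoats and Weasels Act of 2010",
    effective_date=date(2010, 3, 22),
    issuing_body="Example Legislature",
    identifier="urn:lex:xx:example.legislature:act:2010-03-22;1",  # placeholder
)

print(example.to_dict())
```

Nothing in the record ties it to a whole corpus, so the same structure could be attached to a single act, a title, or a section.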

Beyond that, one might add more dates (of many different kinds, including various milestones in the process of drafting and approval), any popular names attached to the legislation (e.g., "The Stoats and Weasels Act of 2010"), compact descriptions of the legislation and its intended effects, a representation of the legislative process, responsible agencies and organizations, and so forth. As much information about provenance as practical is a good idea. It is hard to know where to stop.

Things that have standards outside of legislation should adhere to those standards within it. For example, idiosyncratic timespans such as "108th Congress" or "2009 Term of the Supreme Court" should be marked up with elements having canonical attributes that indicate the timespans involved.
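
To make the "108th Congress" case concrete, the sketch below computes the constitutional term of a numbered Congress and emits an element carrying ISO 8601 dates as attributes. The element name `timespan` and its `start` and `end` attributes are hypothetical. The date arithmetic assumes terms beginning on 3 January, which holds only from the 74th Congress (1935) onward.

```python
from datetime import date
from typing import Tuple
from xml.sax.saxutils import escape, quoteattr


def congress_timespan(number: int) -> Tuple[date, date]:
    """Return the (start, end) dates of a numbered US Congress's term.

    Valid only from the 74th Congress onward, when terms begin on 3 January.
    """
    if number < 74:
        raise ValueError("earlier Congresses began on other dates")
    start_year = 1789 + 2 * (number - 1)
    return date(start_year, 1, 3), date(start_year + 2, 1, 3)


def mark_up_congress(number: int, label: str) -> str:
    """Wrap the label in a hypothetical <timespan> element with ISO 8601 dates."""
    start, end = congress_timespan(number)
    return (
        f"<timespan start={quoteattr(start.isoformat())} "
        f"end={quoteattr(end.isoformat())}>{escape(label)}</timespan>"
    )


print(mark_up_congress(108, "108th Congress"))
# prints <timespan start="2003-01-03" end="2005-01-03">108th Congress</timespan>
```

The visible text stays as the drafter wrote it, while the attributes give software a standard form it can compare and sort.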


As with many other standards, a layered approach is indicated. Layered standards allow for an iterative development process that progresses quickly. They help control costs and raise participation in communities where markup to a fine-grained standard may be too costly or burdensome. In short, they represent a way of keeping the perfect from becoming the enemy of the good. A good example of the layered approach is TEI-Lite.

Point-in-time versus post-hoc

Point-in-time systems that manage legislation and regulatory activity as an integrated process from drafting to promulgation are desirable, but not yet widespread in the US (the first such system was in Tasmania, and there are now many examples throughout the world). They raise different issues for markup and metadata than those inherent in systems that simply "take snapshots" of legislation or regulations at successive time intervals or at different stages of process.

In general, such systems need to manage data about versioning, editing, and changes that is more complex and detailed than that of a "snapshot" system. However, it would be useful for "snapshot" systems to contain associated metadata that describes the transformation from one snapshot to the next. One might imagine this as similar to an encoding of the List of Sections Affected associated with the Code of Federal Regulations and the Federal Register, or a more granular version of the Notes associated with US Code sections and supersections.
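
A very small version of such snapshot-to-snapshot metadata can be derived by comparing two snapshots section by section, as sketched below. The function name, the section identifiers, the sample text, and the three change labels are all assumptions for this example. A real list of sections affected would carry much more detail, such as dates and the amending instrument.

```python
from typing import Dict, List, Tuple


def describe_changes(old: Dict[str, str], new: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two snapshots keyed by section identifier.

    Returns (section, change) pairs, where change is one of the hypothetical
    labels "added", "removed", or "amended". Unchanged sections are omitted.
    """
    changes = []
    for section in sorted(old.keys() | new.keys()):
        if section not in old:
            changes.append((section, "added"))
        elif section not in new:
            changes.append((section, "removed"))
        elif old[section] != new[section]:
            changes.append((section, "amended"))
    return changes


# Invented sample snapshots.
snapshot_2009 = {"1.1": "Stoats are regulated.", "1.2": "Weasels are exempt."}
snapshot_2010 = {
    "1.1": "Stoats are regulated.",
    "1.2": "Weasels are regulated.",
    "1.3": "Ferrets are exempt.",
}

print(describe_changes(snapshot_2009, snapshot_2010))
# prints [('1.2', 'amended'), ('1.3', 'added')]
```

This works only as finely as the snapshots are structured, which is one more reason to capture document structure at the paragraph level or below.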