Section identifiers (LII)

The United States Code section number

NB - 3 JUN 2008 - DAS
The assumption below that alpha extensions are lower case is causing bugs in some interactive access routines. Of the 50,000-plus section numbers in the USC, more than 14,000 have alpha extensions in lower case, but an additional 312 (as of this date) have alpha extensions in upper case. Since it is almost always necessary to stay case-aware when parsing USC text, and especially external text that cites the USC, parsers should look explicitly for both upper and lower case rather than simply going "case insensitive."
There might well be a citation out there that gives an upper-case extension in the usual lower case, and we should honor such a reasonable assumption, but internally we need to remember the way it really is.
The 312 will be added in a separate post.
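
One way to honor a wrong-case citation while still remembering the real spelling is to try an exact match first and fall back to a case-folded lookup only when that fails. The following Python is a minimal sketch of that idea, not the routine actually in use; the class name, the method name, and the three sample section numbers are all made up for illustration.

```python
from collections import defaultdict


class CaseAwareSections:
    """Keeps section numbers exactly as they appear in the Code."""

    def __init__(self, sections):
        self._exact = set(sections)
        # Map each lower-cased form to every real spelling that folds to it.
        self._folded = defaultdict(list)
        for section in sorted(self._exact):
            self._folded[section.lower()].append(section)

    def resolve(self, cited):
        """Return the real spelling(s) that a cited section number could mean."""
        if cited in self._exact:
            # The citation already matches the Code's own case.
            return [cited]
        # Otherwise honor a citation that guessed the wrong case.
        return list(self._folded.get(cited.lower(), []))


# Illustrative sample only, not the real list of 312.
sections = CaseAwareSections(["106a", "106b", "280A"])
print(sections.resolve("280a"))
```

The result is a list because nothing here rules out two real section numbers that differ only by case; in that situation an exact citation still resolves to exactly one of them.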

(Some observations, as of May 2008, by David Shetland as part of his work with US Code processing for the Legal Information Institute)

The author assumes the reader is, while reading this, vigorously curious about and actively connected to the US Code:

A US Code "section number" is an identifier. It provides a label which can be used to isolate one of the 50,000 sections of the Code for special consideration. It is unique only within one of the fifty (or so) "titles" of the Code, so it is made unique across the whole Code by prefixing it with the corresponding title number. Thus, a standard citation to a section of the US Code looks like the following...
9 USC 203
...which is to Title 9, Section 203, a real life US Code reference.
That's simple enough, but you don't have to have lived very long to suspect that life in that pile of 50,000, after eighty years of development, may not always be that simple.

The two main complicating factors are smaller collections within a title (like "chapter"), and insertions.

Effect of Chapters, etc.

When the number of sections within a title becomes substantial, it is natural and necessary to group them by subjects that are in some sense internal to the title. With even the smallest titles, there seems to be a convention of three chapters (see Title 1, Title 9, and now Title 6). The chapters usually introduce a jump in the sequence of section numbers, to correspond to what seemed a natural boundary at the time.

-- Clearly, in Title 1, it makes sense for chapter 1 to collect sections 1 through 8, chapter 2 to collect sections 101 through 114, and chapter 3 to collect sections 201 through 213.

-- Clearly, in Title 9, it makes sense for chapter 1 to collect sections 1 through 16, chapter 2 to collect sections 201 through 208, and chapter 3 to collect sections 301 through 307.

-- Clearly, in the very new Title 6, it makes sense for chapter 1 to collect sections 101 through 103 (plus some things called subchapters, a different subject entirely), chapter 2 to collect section 701 and some subchapters, and chapter 3 to collect section 901 and some subchapters. Yes, the subchapters have something to do with the strange chapter-level section number jumps.

So what is really clear is that when it comes to labeling things, we have nothing to clarify but clarity itself. But you knew that by looking in your spare closet.

Effect of Insertions (new stuff happens in the middle)

The vigorous mouse-clickers amongst you have already noticed something very important in the middle of the Title 1, Chapter 2 section number sequence:
101, 102, 103, 104, 105, 106, oops, 106a, 106b, 107,...
I just made up the explicit "oops" of course, but it really is in there - see the notes to learn the formal spelling of "oops"--
"1951—Act Oct. 31, 1951, ch. 655, § 2(a), 65 Stat. 710, added items 106a and 106b."
Title 1 is very small and quite stable, but old enough to show the effect of insertion on its so-called section numbers, namely, alphabetic extensions.

How far does this go in big, old, unstable titles? Pretty far, but there seems to be some system to it. Past performance does not guarantee future results, but so far we've gotten away with the following, based on careful rummaging through the 50,000.

A championship real section "number" is 12 USC 1749bbb-10c, which indicates a third level insertion.

The challenge is to make an efficient index, to answer questions like the following:
-- Does this section number exist?
-- If not, what is the "closest" that does exist?
-- What is its predecessor or successor?
-- What is its container (chapter, etc.)?
Oh, but isn't this what XML gives us almost for free? Yes, once you have the XML. A project is underway to make the content sources be XML, but for now, the best sources are the data that are used to typeset the print volumes.
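
Until then, a sorted list of collatable keys plus binary search answers all four questions. The following Python is a rough sketch under several assumptions: the keys are already in the normalized 16-character form used elsewhere in this note, "closest" is taken to mean the nearest existing key at or before the one asked for, and the chapter spans are the Title 1 ones described above. The names SectionIndex, floor, and container are invented for the sketch.

```python
from bisect import bisect_left, bisect_right


class SectionIndex:
    """Sorted index over normalized section keys for one title."""

    def __init__(self, keys):
        self._keys = sorted(set(keys))

    def exists(self, key):
        i = bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key

    def floor(self, key):
        """Nearest existing key at or before `key` (one reading of 'closest')."""
        i = bisect_right(self._keys, key)
        return self._keys[i - 1] if i > 0 else None

    def predecessor(self, key):
        i = bisect_left(self._keys, key)
        return self._keys[i - 1] if i > 0 else None

    def successor(self, key):
        i = bisect_right(self._keys, key)
        return self._keys[i] if i < len(self._keys) else None


# Title 1 chapter spans as described in this note: (first, last, label).
CHAPTERS = [
    ("00000001----000-", "00000008----000-", "Chapter 1"),
    ("00000101----000-", "00000114----000-", "Chapter 2"),
    ("00000201----000-", "00000213----000-", "Chapter 3"),
]


def container(key, spans=CHAPTERS):
    """Return the label of the span holding `key`, or None if it is in a gap."""
    starts = [first for first, _, _ in spans]
    i = bisect_right(starts, key) - 1
    if i >= 0 and key <= spans[i][1]:
        return spans[i][2]
    return None


# A slice of Title 1, Chapter 2: sections 105, 106, 106a, 106b, 107.
index = SectionIndex([
    "00000105----000-",
    "00000106----000-",
    "00000106---a000-",
    "00000106---b000-",
    "00000107----000-",
])
print(index.floor("00000106---c000-"), container("00000106---a000-"))
```

Every query is a single binary search, which is why the collating property of the keys matters so much.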

Analysis of the present set of all US Code section numbers indicates that the following four-part template is barely adequate:
(1) base: eight decimal digit integer
(2) ext1: four character alphabetic field
(3) ext2: three decimal digit integer
(4) ext3: one character alphabetic field
This yields, with zero or dash filling of fields, the following version of 1749bbb-10c:
00001749-bbb010c
...which naturally collates on most systems (notice the dash in the normalized version is a fill character, and unrelated to the hyphen in the raw "number").
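
The template can be sketched in a few lines of Python. This is only an illustration, not the production parser: the function name normalize and the regular expression are my own, the field widths (8, 4, 3, 1) are read off the normalized examples that appear in this note, and the right-justified dash fill for extension-1 is inferred from those same examples.

```python
import re

# Base digits, optional alphabetic ext1, then (only after a hyphen) a numeric
# ext2 with an optional single-letter ext3.
_SECTION = re.compile(r"(\d+)([A-Za-z]*)(?:-(\d+)([A-Za-z]?))?")


def normalize(raw):
    """Turn a raw section number such as '1749bbb-10c' into a collatable key."""
    match = _SECTION.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized section number: {raw!r}")
    base, ext1, ext2, ext3 = match.groups(default="")
    if len(base) > 8 or len(ext1) > 4 or len(ext2) > 3:
        raise ValueError(f"section number does not fit the template: {raw!r}")
    return (
        base.zfill(8)          # zero-filled base
        + ext1.rjust(4, "-")   # dash-filled, right-justified ext1
        + ext2.zfill(3)        # zero-filled ext2
        + (ext3 or "-")        # dash when there is no ext3
    )


print(normalize("1749bbb-10c"))
print(sorted(["107", "106b", "106", "106a"], key=normalize))
```

Right-justifying extension-1 makes shorter extensions sort ahead of longer ones, so "b" comes before "aa", because the dash fill character sorts below every letter in ASCII.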

So what about that hyphen in the wild section number? Our current working principle is that it is obligatory if there is an extension-2.

Extension-1 is not required for extension-2 to be present. Combined with the obligatory hyphen, this creates ambiguity wherever the literature uses a dash in a range citation. (Inside the US Code, the word "to" is used consistently to indicate a range within a section number, and "through" in USC-internal ranged cross references.) Thus, a section number reference of "10a-10c" in the general literature could be a single section (or a "first" section) or a range from section 10a to section 10c, and it has to be disambiguated by textual context and/or target authority lookup, if available.
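
The authority-lookup half of that disambiguation might be sketched as follows. This is a hypothetical illustration in Python: the function names are invented, the set of known sections in the example is made up, and the range reading only handles ends that contain no hyphen of their own.

```python
import re

# A full section number: base, optional ext1, optional "-ext2[ext3]".
_SECTION = re.compile(r"(\d+)([A-Za-z]*)(?:-(\d+)([A-Za-z]?))?")
# One end of a dash-style range: base plus optional ext1, no hyphen.
_SIMPLE = re.compile(r"\d+[A-Za-z]*")


def readings(ref):
    """List every way `ref` can be read: one section, a range, or both."""
    found = []
    if _SECTION.fullmatch(ref):
        found.append(("section", ref))
    left, sep, right = ref.partition("-")
    if sep and _SIMPLE.fullmatch(left) and _SIMPLE.fullmatch(right):
        found.append(("range", left, right))
    return found


def disambiguate(ref, known):
    """Keep only the readings whose section numbers really exist in `known`."""
    kept = []
    for reading in readings(ref):
        # reading[1:] holds the section number(s) the reading depends on.
        if all(section in known for section in reading[1:]):
            kept.append(reading)
    return kept


# Hypothetical title in which 10a, 10b and 10c exist but "10a-10c" does not.
print(disambiguate("10a-10c", {"10a", "10b", "10c"}))
```

If both readings survive the lookup, textual context is still needed to choose between them.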

The classic use case for needing a fast index of authoritative section numbers is the one that forces me to pull out the Ugliest of All section numbers - the one I call the "ranged" section number.

To take an extreme but real-life "section" from current data, we want to be able to cite "12 USC 1749bbb-10c", which may have crept into someone's notes, although the relevant section heading is actually
"§§ 1749bbb-10a to 1749bbb-10d. Omitted"
indicating that the current state of affairs is that our target section is in a range of sections that has been omitted. Ah, if only we had our system in place to ring up the ancient section (which for "omitted" might require more than old USC)! That would be wonderful, but even if we did, chances are very good that what we really need is the current note about the omission, which is here, not in some old document.

So, "1749bbb-10a to 1749bbb-10d" is our raw "section number" and it gets parsed into the two ends of a range, with normalized forms of "00001749-bbb010a" and "00001749-bbb010d". Since these collate very well and very quickly, it is easy not only to look up the two extremes but also to confirm that our cite falls within the range. Bear in mind that the current data set has very useful information about section 1749bbb-10c (in the "section" with the above ranged "section number"), but nothing tagged or predictably structured at all.
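
A small Python sketch of the range test, under the same assumptions as the normalized forms quoted above (field widths of 8, 4, 3, and 1, with zero or dash fill). The function names are mine, and splitting the heading on the word "to" relies on the Code's own convention for ranged section numbers.

```python
import re

_SECTION = re.compile(r"(\d+)([A-Za-z]*)(?:-(\d+)([A-Za-z]?))?")


def normalize(raw):
    """Collatable 16-character key for one raw section number."""
    match = _SECTION.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized section number: {raw!r}")
    base, ext1, ext2, ext3 = match.groups(default="")
    return base.zfill(8) + ext1.rjust(4, "-") + ext2.zfill(3) + (ext3 or "-")


def range_ends(raw_range):
    """Split 'X to Y' into normalized ends; a lone section is its own range."""
    first, sep, last = raw_range.partition(" to ")
    if not sep:
        last = first
    return normalize(first), normalize(last)


def in_range(cite, raw_range):
    """True when the cited section collates between the two ends, inclusive."""
    low, high = range_ends(raw_range)
    return low <= normalize(cite) <= high


heading = "1749bbb-10a to 1749bbb-10d"
print(range_ends(heading))
print(in_range("1749bbb-10c", heading))
```

Plain string comparison is all the test needs, which is the payoff of choosing fill characters that collate below the real content.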

To produce one of our "external IDs" like "usc_sec_01_00000101----000-" (the basis for the URI for title 1, section 101) you need a little more, but not much.
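
Judging only from that one example, the external ID looks like a fixed prefix, a two-digit title number, and the normalized section key. The Python below is a guess at that shape and nothing more: the function names are invented, the two-digit zero fill for the title is inferred from "01", and titles that are not plain integers are not handled.

```python
import re

# Shape of a normalized key: 8 digits, 4 letters or dashes, 3 digits, 1 letter or dash.
_KEY = r"\d{8}[A-Za-z-]{4}\d{3}[A-Za-z-]"
_EXTERNAL_ID = re.compile(rf"usc_sec_(\d{{2}})_({_KEY})")


def external_id(title, key):
    """Build an ID like 'usc_sec_01_00000101----000-' from a title and a key."""
    if re.fullmatch(_KEY, key) is None:
        raise ValueError(f"not a normalized section key: {key!r}")
    return f"usc_sec_{title:02d}_{key}"


def split_external_id(ext_id):
    """Recover (title, key) from an external ID, or raise ValueError."""
    match = _EXTERNAL_ID.fullmatch(ext_id)
    if match is None:
        raise ValueError(f"not an external ID: {ext_id!r}")
    return int(match.group(1)), match.group(2)


print(external_id(1, "00000101----000-"))
```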