FOLIO VIEWS TO HTML -- LII'S PROCESS

LII Working Document 94-4

MOVING HYPERTEXT DOCUMENTS FROM FOLIO VIEWS TO HTML -- THE PROMISE, SOME PROBLEMS, AND A BRIEF OUTLINE OF THE LII's PROCESS

P.W. Martin

I. The Promise

Both FOLIO Views and World Wide Web (WWW) hypertext are expressed, or can be expressed, in an SGML-like markup language. FOLIO Views infobases can be exported to and compiled from a pair of ASCII files: 1) a *.DEF file that contains all infobase-wide parameters and definitions (levels, styles, infobase name and description, and so on) and 2) a *.FFF file that represents the full infobase content including all link functionality. WWW hypertext is expressed in HTML (HyperText Markup Language). Since both markup languages represent many comparable styles and functions, translating FFF to HTML is a tempting goal. It is especially tempting to organizations like the LII that seek broader distribution, through cross-platform WWW publication, for materials published with FOLIO Views. Having that goal, the LII has pursued a pattern of building and maintaining texts in FOLIO Views with a subsequent port to HTML. But there are a number of reasons why electronic publishers focused solely on WWW publication might also find this an attractive route, assuming the port path to be a relatively straightforward one. (FOLIO Views has excellent import filters for all major wordprocessing applications. It has a fully interactive authoring environment, including link validation, and it maintains information in a much more sophisticated and flexible data structure than HTML, and so on. In the absence of HTML building and editing tools of comparable sophistication, there is much to be said for FOLIO Views as an HTML authoring tool.)

This working document describes how the LII moves material from FFF to HTML. It assumes basic familiarity with the functionality and terminology associated with both platforms. It deals first with some problems that must be addressed and then outlines the LII process.

II. Some Problems

Measured along several different dimensions, FOLIO Views is rich and HTML is lean. This difference creates numerous challenges to conversion. Since a FOLIO Views infobase has features that HTML does not support, the process of porting is a bit like moving a rich wordprocessor file to ASCII. One can simply drop all infobase features that do not have direct HTML equivalents, but a more effective translation includes finding proxies for all important ones -- something like finding a way to show emphasis in an ASCII e-mail message where the wordprocessor document would use bold, italics or underline.

These problems of translating and choosing between dropping or transposing can usefully be separated into three groups -- basic format, hypertext functionality, and data structure. In terms of the issues of porting they raise, these categories constitute an ascending scale of difficulty.

A. Format

Format, for these purposes, comprehends markup that principally determines where and how a document, paragraph, word, character or graphic element displays and prints.

At the micro-level, character for character, in-line graphic by in-line graphic, HTML is capable of matching FFF (with the important exception of white space achieved by multiple spaces or tabs). Non-ASCII characters, however, require translation to the appropriate HTML & sequence. The section and paragraph symbols important in law materials (§ and ¶) must be converted, for example, to &sect; and &para;.
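The character translation described above can be sketched in a few lines of Python. This is an illustration only, not the LII's FSR script: the function name `encode_non_ascii` and the fallback to numeric character references for characters without a listed entity are assumptions made for the example.

```python
# Named entities mentioned in the text; extend the table as needed.
NAMED_ENTITIES = {
    "\u00a7": "&sect;",  # section sign
    "\u00b6": "&para;",  # paragraph sign
}


def encode_non_ascii(text):
    """Replace non-ASCII characters with HTML & sequences.

    Characters listed in NAMED_ENTITIES get their named entity; any other
    non-ASCII character falls back to a numeric reference (an assumption of
    this sketch, not something the source prescribes).
    """
    parts = []
    for ch in text:
        if ch in NAMED_ENTITIES:
            parts.append(NAMED_ENTITIES[ch])
        elif ord(ch) > 127:
            parts.append(f"&#{ord(ch)};")
        else:
            parts.append(ch)
    return "".join(parts)


print(encode_non_ascii("\u00a7 1051. Registration of trade-marks"))
```

ASCII text, including any FFF codes still present, passes through unchanged, so a step like this can run before the markup itself is converted.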

Character styles are, with HTML, limited to a relatively short list of physical types (most importantly <B>, <I>, <U>, <TT>) and some logical types (notably <PRE>, <EM>, and <STRONG>). Conversion from FFF to HTML requires only that all font and other character designations, whether accomplished at the character level, by means of a character style, or through association with a paragraph, link, or level style, be translated into this more limited set.

When it comes to paragraph formatting, HTML knows only different header levels, and <P>, <BR>, indented block quote, and a variety of list (and nested list) types.

Developing proxies for FFF styles, both here and with character styles, is much easier if all such formatting in FOLIO Views is accomplished through an organized and comprehensive set of styles. (To illustrate, LII infobases achieve bold or italics either by associating those styles with a level or through a character style, and they implement a hierarchical indent structure by having a sequence of paragraph styles denominated "Text - Level 1", "Text - Level 2", and so on.)
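When formatting is carried by a consistent set of named styles, the translation reduces to a lookup table. The sketch below assumes the `<PS:"style name">` code form that appears in the FFF sample later in this document and uses the two proxies named there (<P> and <UL><LI>); the function name and the decision to leave unmapped styles in place for later review are choices made for this example.

```python
import re

# Matches an FFF paragraph style code such as <PS:"Text - Level 2">.
PS_CODE = re.compile(r'<PS:"([^"]+)">')

# Proxies for the LII paragraph styles named in this document.
PARAGRAPH_PROXIES = {
    "Text - Level 1": "<P>",
    "Text - Level 2": "<UL><LI>",
}


def convert_paragraph_styles(fff, proxies=PARAGRAPH_PROXIES):
    """Swap known paragraph style codes for HTML proxies.

    Returns the converted text and a sorted list of style names that had
    no proxy, so they can be handled by hand in an editor.
    """
    unmapped = set()

    def swap(match):
        name = match.group(1)
        if name in proxies:
            return proxies[name]
        unmapped.add(name)
        return match.group(0)  # leave the code untouched

    return PS_CODE.sub(swap, fff), sorted(unmapped)


html, leftovers = convert_paragraph_styles(
    '<PS:"Text - Level 2">The owner of a trade-mark<PS:"Margin Note">x'
)
print(html)
print(leftovers)
```

An infobase formatted ad hoc, character by character, offers no such table to key on, which is the practical argument for disciplined use of styles.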

B. Hypertext Functionality

Native HTML can perform only one hypertext function, the point to point (or more accurately character string to character string) jump. It does not include a search capability, although LII HTML publications do make use of an associated full-text search engine. FFF query and popup links must either be dropped or converted into some proxy. Where the query link has been used to access a list of discrete adjacent points, the HTML port can substitute a point to point link to the first of those points, allowing the user to browse to the adjacent ones. Where the query retrieves records scattered through the infobase, the link can be converted into a list of point to point links. If the set of HTML documents will be indexed, then a query link can, indeed, be brought over with suitable adjustment of search syntax. Popup links can generally be restructured as footnotes (a jump from the link launch string to a named location in an <HR> delimited area at the bottom of the HTML document which, in turn, has a "return to text" link to allow the user to do just that).
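The footnote proxy for popup links might be generated along the following lines. This is a sketch under stated assumptions: the anchor names `ref1`/`note1` and the two helper functions are hypothetical, and the popup text is taken as already extracted from the FFF.

```python
def footnote_link(launch_text, number):
    """Wrap the link launch string in a jump to footnote `number`.

    The anchor is also named so the footnote can link back to it.
    """
    return f'<A NAME="ref{number}" HREF="#note{number}">{launch_text}</A>'


def footnote_area(notes):
    """Build the <HR> delimited footnote area for the bottom of a document."""
    lines = ["<HR>"]
    for number, note in enumerate(notes, start=1):
        lines.append(
            f'<P><A NAME="note{number}">{number}.</A> {note} '
            f'<A HREF="#ref{number}">Return to text</A>'
        )
    return "\n".join(lines)


print(footnote_link("principal register", 1))
print(footnote_area(["Text that the popup displayed in FOLIO Views."]))
```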

C. Data Structure

When documents are brought into FOLIO Views, many files become one. The typical infobase holds material that in wordprocessor or online database environments would be held in separate files, documents, or records. The rich data structure of FOLIO Views, which allows use of fields, records, groups, and levels to represent boundaries and relationships, makes it possible to import many files while preserving their identity for purposes of search or display or linking.

Going to HTML, the process must be reversed. What is in FOLIO Views a single infobase must be split back into separate files that are at least as small as those that would be used in a wordprocessing situation. In many cases, for reasons of client-server performance and appropriate indexing, they should be even smaller.

In taking a complex and coherent information collection and breaking it into a large number of fragments, the FFF to HTML converter must deal with the challenge of representing relationships among those fragments, a task that FOLIO Views performs dynamically.

The basic LII approach to HTML representation of infobase structure has three components:

1. Each collection of HTML documents that is part of a data collection that FOLIO Views holds in a single infobase is mapped by one or more overview documents. The simplest form of overview document supplies the FOLIO Views table of contents functionality by representing the hierarchy of the document collection with top level items, linked to their subordinate next level items, with these, in turn, linked to the named parts that are held in individual files. These named parts are both linked to the appropriate file and named so that they can be the target of a link back to the overview from that file. (Attachment I contains a sample overview document.)
2. Each HTML document carries at its top, set off by an <HR> from the principal header and main text, a series of lines representing the higher levels in the information structure that would in FOLIO Views be displayed in the reference window. The HTML title includes both the name of the larger collection and a short representation of this particular piece (the section number, say, with statutory material). (Attachment II contains an example of this material from a section of the trademark act and the template used to create it.)
3. Each HTML document carries at its bottom, set off on both sides with <HR>s, links to the previous section, to the next section, and to the overview at the point where this document is listed. These are in addition to links to notes or comments associated with this document but held in a separate file, and they precede any footnote material contained in the document. This set of links is designed to enable the user to browse along logical lines; in many situations these will be linear moves forward or back. (Attachment III contains an example of this material from a section of the trademark act and the template used to create it.)
III. The Process
The process the LII has developed involves use of four tools to accomplish a FOLIO Views to HTML port: 1) FOLIO Views itself, 2) a regular expression utility (FSR), 3) an editor with some basic macro capability (and forward and backward search), and 4) a file chopper.
The process outlined here is that employed by the LII in moving statutes and codes from FOLIO Views to HTML; other information collections will, no doubt, require significant adaptation.
A. Preparing for the Port in FOLIO Views
Anticipating the need to divide what is a single file in Views into multiple HTML files, the first step is to rename the Level at which the file separations will occur to "File". Levels above that one are renamed (from the top down): Overview, Reference 1, Reference 2 and so on. Any levels below the "File" level are renamed "Subfile."
The process that follows assumes that all records at the File level include at or near their beginning a jump destination that can be the root of their ultimate HTML file name.
All jump destinations below the File level that do not explicitly incorporate the root of the HTML file name are renamed to do so. With LII publications this is commonly the case for defined terms. In an LII infobase the definition of a word, "patentee", say, will be a jump destination of that name. (HTML will need to know in which document that named spot lies.) To prepare for conversion, all such jump destinations are visited and renamed with a name that includes the name of the section in which the definition falls, as it is expressed in jump destination terms (35uscs156, say). An underbar separates the two elements, e.g., patentee_35uscs156. Jump destinations that begin with what is to be the HTML file name, followed by an extension representing a subpart (often the case with a subsection, e.g., 35uscs156(b)), don't require any changes as long as the FSR script is set to distinguish them from jumps to what will be the file level in HTML.
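The renaming rule can be stated compactly in code. In the LII process the renaming is done within FOLIO Views itself; the Python function below, whose name is hypothetical, only illustrates the naming convention.

```python
def qualify_destination(name, section_root):
    """Return a jump destination name that carries the file root.

    Names that already begin with the root (the section itself or a
    subpart such as a subsection) are left alone; anything else, such as
    a defined term, gets the root appended after an underbar.
    """
    if name.startswith(section_root):
        return name
    return f"{name}_{section_root}"


print(qualify_destination("patentee", "35uscs156"))
print(qualify_destination("35uscs156(b)", "35uscs156"))
```

One caveat a real implementation would have to handle: a root such as 35uscs15 is a prefix of 35uscs156, so a bare `startswith` test is looser than the rule it stands in for.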
B. FSR or Other Regular Expression Utility
The LII's FSR script converts the FFF exported from a prepared infobase to a single large file holding the multiple HTML files that will result from it, set off, if you will, by dotted lines (i.e., a chop mark that holds a file name). In doing so it performs the following steps. It:
1. replaces the section symbol and any other HTML special characters with the corresponding & sequence
2. removes all fields
3. removes unwanted jump links (e.g., the LII standard section level self link)
4. converts all character styles to HTML equivalents or proxies
5. converts all levels below the File level to HTML equivalents or proxies
6. converts all paragraph styles to HTML proxies (e.g., LII Text - Level 1 becomes <P>, Text - Level 2 becomes <UL><LI>)
7. turns Jump Destinations into Names (with the </A> being placed after the first "word" following the JD)
8. turns all Jump Links into HREFs
9. places HTML markers and file name holders at the File division level
Here is an illustrative section of FFF prior to operation of the script, followed by the resulting output.
FFF Section
```
<RD:File><JL:section,15uscs1051><JD:15uscs1051>§ <FD:"section 
number">1051</FD:"section number">. Registration of trade-marks<EL>
<RD:Subfile><JD:"15uscs1051(a)">(a) Trade-marks used in commerce.
<RD><PS:"Text - Level 2">The owner of a trade-mark <JL:definition,"used in 
commerce">used in commerce<EL> may apply to register his or her trade-mark under 
this Act on the <JL:definition,"principal register">principal register<EL> hereby 
established:
<HR><PS:"Text - Level 3">(1) By filing in the Patent and Trademark Office
<HR><PS:"Text - Level 4">(A) a written application, in such form as may be prescribed 
by the <JL:definition,commissioner>Commissioner<EL>, verified by the 
<JL:definition,applicant>applicant<EL>, or by a member of the firm or an officer of the 
corporation or association applying, specifying applicant's domicile and citizenship, the 
date of applicant's first use of the mark, the date of applicant's first use of the mark in 
<JL:definition,commerce>commerce<EL>, the goods in connection with which the 
mark is used and the mode or manner in which the mark is used in connection with such 
goods, and including a statement to the effect that the 
<JL:definition,person>person<EL> making the verification believes himself, or the 
firm, corporation, or ***
<RD:File> ***
```
HTML output
```
chop_here="1051.html"
<HTML>
<H4><A NAME="1051">&sect; 1051.</A> Registration of trade-marks</H4>
<UL><LI><B><A NAME="1051(a)">(a)</A> Trade-marks used in commerce.</B>
<LI>The owner of a trade-mark <A HREF="1127.html#used in commerce_1127">used 
in commerce</A> may apply to register his or her trade-mark under this Act on the <A 
HREF="1127.html#principal register_1127">principal register</A> hereby established:
<UL><LI>(1) By filing in the Patent and Trademark Office
<UL><LI>(A) a written application, in such form as may be prescribed by the <A 
HREF="1127.html#commissioner_1127">Commissioner</A>, verified by the <A 
HREF="1127.html#applicant_1127">applicant</A>, or by a member of the firm or an 
officer of the corporation or association applying, specifying applicant's domicile and 
citizenship, the date of applicant's first use of the mark, the date of applicant's first use 
of the mark in <A HREF="1127.html#commerce_1127">commerce</A>, the goods in 
connection with which the mark is used and the mode or manner in which the mark is 
used in connection with such goods, and including a statement to the effect that the <A 
HREF="1127.html#person_1127">person</A> making the verification believes himself, 
or the firm, corporation ***
</HTML>
```
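Steps 7 and 8 of the list above lend themselves to a regular expression sketch. The Python below is not the FSR script and does not reproduce the sample output exactly: it keeps full destination names rather than shortening them, it does not treat section headings specially, and it assumes jump destinations have already been qualified with the file root as described under "Preparing for the Port". The function names and the `file_for` rule are assumptions made for illustration.

```python
import re

# <JD:name> or <JD:"name"> followed by the first "word" of text.
JD = re.compile(r'<JD:(?:"([^"]+)"|([^>]+))>(\S+)')
# <JL:style,dest> or <JL:style,"dest"> ... <EL>
JL = re.compile(r'<JL:[^,>]+,(?:"([^"]+)"|([^>]+))>(.*?)<EL>', re.DOTALL)


def clean(name):
    """Collapse line wraps and repeated spaces inside a code's name."""
    return " ".join(name.split())


def file_for(dest):
    """Derive the HTML file name from a qualified destination name."""
    root = dest.rsplit("_", 1)[-1]   # patentee_35uscs156 -> 35uscs156
    root = root.split("(", 1)[0]     # 35uscs156(b) -> 35uscs156
    return root + ".html"


def convert_jump_destinations(text):
    """Step 7: turn each jump destination into a named anchor."""
    def swap(m):
        name = clean(m.group(1) or m.group(2))
        # The </A> goes after the first word following the destination.
        return f'<A NAME="{name}">{m.group(3)}</A>'
    return JD.sub(swap, text)


def convert_jump_links(text):
    """Step 8: turn each jump link into an HREF."""
    def swap(m):
        dest = clean(m.group(1) or m.group(2))
        target = file_for(dest)
        # A jump to the file level needs no fragment.
        href = target if target == dest + ".html" else f"{target}#{dest}"
        return f'<A HREF="{href}">{m.group(3)}</A>'
    return JL.sub(swap, text)


sample = ('<JD:"15uscs1051(a)">(a) Trade-marks used in commerce. The '
          '<JL:definition,"applicant_15uscs1127">applicant<EL> may apply.')
print(convert_jump_links(convert_jump_destinations(sample)))
```

Because the first-word rule would otherwise wrap a leftover code in the anchor, field removal (step 2) needs to run before this stage, as it does in the list above.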
C. Use of an Editor
Following application of the regular expression utility, the output file is taken into an editor for review. Any remaining <CS: or <PS: codes are found and translated. More importantly, all <QL: and <PW: codes (query links and popups) are identified and converted into suitable proxies.
Finally, a set of macros is used to load top and bottom matter templates (see attachments II and III) at the appropriate places in what will become HTML files and to fill their placeholders with the correct text and HREF references. This last step involves a series of forward and backward search and replace moves. For example, since each file will carry a link to the "previous" and "next" document, the template's references to "previous.html" and "next.html" are replaced by the actual file names associated with the adjacent material before and after.
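The previous/next substitution can be illustrated as follows. The template text here is invented for the example (the actual template is in Attachment III), and pointing the first and last documents back to a file called `overview.html` is an assumption of this sketch.

```python
# Hypothetical bottom matter template with the two placeholder file names.
BOTTOM_TEMPLATE = (
    "<HR>\n"
    '<A HREF="previous.html">Previous section</A> | '
    '<A HREF="next.html">Next section</A>\n'
    "<HR>"
)


def fill_navigation(template, file_names, index, fallback="overview.html"):
    """Replace the placeholders with the neighbors of file_names[index].

    file_names is the ordered list of HTML files; the first and last
    files have no neighbor on one side and get the fallback instead.
    """
    previous_name = file_names[index - 1] if index > 0 else fallback
    next_name = file_names[index + 1] if index + 1 < len(file_names) else fallback
    return (template.replace("previous.html", previous_name)
                    .replace("next.html", next_name))


files = ["1051.html", "1052.html", "1053.html"]
print(fill_navigation(BOTTOM_TEMPLATE, files, 1))
```

Working from an ordered list of file names replaces the editor's backward and forward searches for the neighboring chop marks.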
D. Chopping and Finishing
Following a final review of the HTML ready file (looking for any remaining FFF elements), the file is put through a chopper utility that divides it into its HTML pieces and gives each its designated file name.
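A chopper of this kind is simple to write. The sketch below assumes the chop mark format shown in the sample output above (a line reading chop_here="1051.html"); the function names are hypothetical and this is not the LII's utility.

```python
import re
from pathlib import Path

# A chop mark occupies its own line and carries the target file name.
CHOP_MARK = re.compile(r'^chop_here="([^"]+)"[ \t]*$', re.MULTILINE)


def chop(combined):
    """Split the combined text into {file name: contents}.

    Each piece runs from just after its chop mark to the next mark
    (or to the end of the text).
    """
    pieces = {}
    marks = list(CHOP_MARK.finditer(combined))
    for i, mark in enumerate(marks):
        end = marks[i + 1].start() if i + 1 < len(marks) else len(combined)
        pieces[mark.group(1)] = combined[mark.end():end].strip() + "\n"
    return pieces


def write_pieces(pieces, out_dir):
    """Write each piece to its designated file under out_dir.

    Writing as ASCII raises an error if any non-ASCII character escaped
    the earlier conversion to & sequences.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, body in pieces.items():
        (out / name).write_text(body, encoding="ascii")


combined = ('chop_here="1051.html"\n<HTML>\nA\n</HTML>\n'
            'chop_here="1052.html"\n<HTML>\nB\n</HTML>\n')
print(list(chop(combined)))
```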
Constructing the overview document comes last. Returning to the infobase, a series of level queries is used to pull off ASCII files of the levels down through the HTML file level. The overview is created from these, a step that is at the moment performed manually using a template document.
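Although the overview is built by hand in the LII process, its structure is regular enough that the step could be scripted. The following is a hypothetical sketch only: the input format (depth, label, file name) and the nested list layout are assumptions, not a reproduction of the sample overview in Attachment I.

```python
def build_overview(entries):
    """Build nested <UL> markup from (depth, label, file_name) entries.

    Entries are in document order with depth starting at 1. Entries with
    a file name become anchors that both link to the file and carry a
    NAME, so the file can link back to its place in the overview.
    """
    lines = []
    current = 0
    for depth, label, file_name in entries:
        while current < depth:      # open lists down to this depth
            lines.append("<UL>")
            current += 1
        while current > depth:      # close lists back up to this depth
            lines.append("</UL>")
            current -= 1
        if file_name:
            anchor = file_name.rsplit(".", 1)[0]
            lines.append(
                f'<LI><A NAME="{anchor}" HREF="{file_name}">{label}</A>'
            )
        else:
            lines.append(f"<LI>{label}")
    lines.extend("</UL>" for _ in range(current))
    return "\n".join(lines)


print(build_overview([
    (1, "Subchapter I", None),
    (2, "&sect; 1051. Registration of trade-marks", "1051.html"),
]))
```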
Two examples of LII infobases ported to HTML using this process can be viewed at:
```
	http://www.law.cornell.edu/usc/35/i_iv/overview.html
and
	http://www.law.cornell.edu/usc/15/22/overview.html
```
