Linearized Markup

A screenshot of XRay 1.0 displaying linearized markup.

What I call "linearized markup" or "linearized content" (I use these phrases interchangeably) is a way of representing the nested, parenthetical tree structure of a text document written in a tag-based language such as XML or HTML as a sequence of path-like lines of text. Each line denotes a particular element, attribute, or piece of content at a particular nesting level in the document, independently of all the other lines. Inductively, the path from the root of the document to a given tag is the path of the tag in which it is immediately nested, followed by a dot, followed by the name of the tag itself. The new representation thus linearizes the original document by mapping hierarchical relationships that might otherwise cross line boundaries or range over non-contiguous regions of text onto successor-predecessor relationships between dot-delimited segments of a single line. In other words, the nesting level of a tag may simply be read off the single line of text in which it occurs (it is a child of the dot-delimited path that precedes it in the line) rather than inferred from the pattern of opening and closing tags that precedes it in the document.

The linearization algorithm also preserves the order of sibling elements by emitting lines in document order, so that siblings appear in the same relative order as in the source, and it distinguishes repeated occurrences of the same element type, or "tag", at the same nesting level by indexing them in document order. Attributes and text content are represented in the dotted, linear notation as children of the element to which they are bound. Hence the entire structure of the document is preserved. For a concrete example of a linearized HTML document, view the screenshot of XRay 1.0 accompanying this page.
For a discussion of the implementation and usefulness of linearized content in the context of content filtering, see Web Content Mining with Java by Tony Loton.
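
The scheme described above can be illustrated with a short sketch. This is not the code used by XRay, only a minimal Python illustration built on the standard xml.etree.ElementTree module. The function name linearize and the notation details are assumptions made for the example: repeated sibling tags are indexed from 1 in square brackets, attributes are written as ".@name=value", and text as ".#text=value". The text above fixes only the dot-delimited paths, the indexing of repeated tags, and the treatment of attributes and text as children.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def linearize(element, parent_path="", index=None):
    """Yield one path-like line per element, attribute, and piece of text.

    The bracketed index, the "@" prefix, and the "#text" label are
    assumed notations for this sketch, not a documented format.
    """
    # Path of this tag = path of its parent + "." + its own name.
    name = element.tag if index is None else f"{element.tag}[{index}]"
    path = f"{parent_path}.{name}" if parent_path else name
    yield path

    # Attributes are emitted as children of the element that carries them.
    for attr, value in element.attrib.items():
        yield f"{path}.@{attr}={value}"

    # Text directly inside the element is emitted as a child as well.
    text = (element.text or "").strip()
    if text:
        yield f"{path}.#text={text}"

    # Only tags that repeat among the siblings receive an index.
    totals = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        seen[child.tag] += 1
        child_index = seen[child.tag] if totals[child.tag] > 1 else None
        yield from linearize(child, path, child_index)
        # Text that follows a child still belongs to the current element.
        tail = (child.tail or "").strip()
        if tail:
            yield f"{path}.#text={tail}"


if __name__ == "__main__":
    doc = '<html><body><p class="intro">Hello</p><p>World</p></body></html>'
    for line in linearize(ET.fromstring(doc)):
        print(line)
    # html
    # html.body
    # html.body.p[1]
    # html.body.p[1].@class=intro
    # html.body.p[1].#text=Hello
    # html.body.p[2]
    # html.body.p[2].#text=World
```

Because every line carries its full path, a filter can select content with a plain string test on each line (for example, keeping only lines that start with "html.body.p") without tracking opening and closing tags. Note that ElementTree accepts only well-formed XML, so real-world HTML would need a more forgiving parser in front of this sketch.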