XML Indexing

What is it?

XML indexing is a special type of embedded indexing. Embedded indexing is a technique in which the indexer inserts index terms directly within the body of the document, rather than entering them into a separate file. No page numbers are included in the index terms. When the document is ready for publication, an automated process locates, extracts, alphabetizes, and formats the index entries, and adds the correct page number(s) based on the terms' location in the original document. Embedded indexing can be done in many popular layout and word processing applications, including InDesign, FrameMaker, and even Microsoft Word.

This approach has several advantages. The most obvious is that, because the index terms are embedded within the document, they flow with any change to pagination (adding or deleting a paragraph, for example) without requiring changes to the index. Also, because the page numbers are generated and inserted automatically, the chance of typographical errors is reduced. An index scattered throughout a document is difficult to proof, so the layout software often lets the indexer generate and regenerate the index as work progresses, much as Word can regenerate a table of contents as needed. (For an excellent discussion of the pros, cons, and quirks of non-XML embedded indexing, see "Embedded Indexing" by Peg Mauer [1999], available in PDF format on the ASI website.)
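To make the automated step concrete, here is a minimal sketch in Python of how embedded markers might be collected, alphabetized, and formatted once the document has been paginated. It does not reproduce what InDesign, FrameMaker, or Word do internally; the function names and the (term, page) input format are assumptions made for illustration only.

```python
from collections import defaultdict


def build_index(markers):
    """Group (term, page) markers by term and alphabetize the result.

    `markers` is a hypothetical input: one (term, page) pair for every
    embedded index marker found in the paginated document.
    """
    pages_by_term = defaultdict(set)
    for term, page in markers:
        pages_by_term[term].add(page)
    # Alphabetize ignoring case; list each term's pages in ascending order.
    return [
        (term, sorted(pages_by_term[term]))
        for term in sorted(pages_by_term, key=str.casefold)
    ]


def format_index(entries):
    """Render each entry as 'term, page, page, ...' on its own line."""
    return "\n".join(
        f"{term}, {', '.join(str(p) for p in pages)}" for term, pages in entries
    )


# The markers carry no page numbers themselves; the pages come from layout.
markers = [("XML", 3), ("embedded indexing", 1), ("DocBook", 3), ("XML", 7)]
index_text = format_index(build_index(markers))
print(index_text)
```

If a paragraph is added and the markers move to later pages, rerunning the same two functions on the new (term, page) pairs yields a corrected index with no manual renumbering.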

In XML indexing, index terms are inserted into a document that has been marked up using XML tags. Unlike embedded indexing in FrameMaker or Word, XML indexing does not rely on a proprietary file format or on specific software. The publisher may use any DTD (Document Type Definition, a defined set of XML tags) it likes (DocBook is a common one), and the indexer may use any XML editor he or she prefers (e.g., XMetaL or Oxygen). XML indexing has the same advantages as any other form of embedded indexing, with the additional bonus that, depending on the DTD, the indexer often has far more control over sort order than in other forms of embedded indexing. However, it also has some unique challenges. The indexer needs to know at least the basics of XML, such as the use of elements and attributes, the concept of legal and illegal locations for tags, and how to validate a document against a DTD. Most XML authoring applications cannot generate an index for proofreading, and they lack features for checking common indexing problems such as circular references or missing "see" and "see also" references; the indexer must therefore find other methods for handling these quality assurance issues. Some publishing houses address these issues with specialized tools that they provide to the indexer, but many do not.
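As an illustration of the markup and of the quality assurance checks mentioned above, the following Python sketch parses a small DocBook-style fragment, sorts the headings, and reports "see" references that are circular or that point to a heading that does not exist. The indexterm, primary, and see elements and the sortas attribute are DocBook's own; the sample text, the function names, and the checking logic are assumptions made for this example and are not any publisher's tool. The parser used here checks only well-formedness and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: four DocBook index terms embedded in one paragraph.
SAMPLE = """<chapter>
  <para>XML<indexterm><primary>XML</primary></indexterm> is a markup language.
  <indexterm><primary>markup languages</primary><see>XML</see></indexterm>
  <indexterm><primary>SGML</primary><see>generalized markup</see></indexterm>
  <indexterm><primary sortas="percent">%</primary></indexterm></para>
</chapter>"""


def collect(root):
    """Return ({heading: sort key}, {heading: see target})."""
    headings, see_refs = {}, {}
    for term in root.iter("indexterm"):
        primary = term.find("primary")
        see = term.findtext("see")
        if see is None:
            # A sortas attribute, when present, overrides the sort key.
            headings[primary.text] = primary.get("sortas", primary.text)
        else:
            see_refs[primary.text] = see
    return headings, see_refs


def check_see_refs(headings, see_refs):
    """Report circular 'see' chains and targets that are not real headings."""
    problems = []
    for source, target in see_refs.items():
        seen = {source}
        # Follow the chain of 'see' references until it ends or repeats.
        while target in see_refs and target not in seen:
            seen.add(target)
            target = see_refs[target]
        if target in seen:
            problems.append(f"circular: {source}")
        elif target not in headings:
            problems.append(f"missing target: {source} -> {target}")
    return problems


headings, see_refs = collect(ET.fromstring(SAMPLE))
sorted_headings = sorted(headings, key=lambda h: headings[h].casefold())
problems = check_see_refs(headings, see_refs)
print(sorted_headings)
print(problems)
```

In this sample the "%" heading files under "percent" because of its sortas value, and the SGML entry is flagged because nothing in the document is indexed under "generalized markup".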

Who's using it?

Increasingly, the publishing world is relying on XML as a platform- and application-neutral file format for its documents. This is particularly true of publishers who do a great deal of technical publishing, such as O'Reilly, whose well-known and highly respected line of computer-related books is created, indexed, and stored in DocBook (O'Reilly & Associates was in fact one of the developers of the DocBook DTD). In addition, corporations are turning to XML to store and manage the vast amounts of electronic information they are generating. XML indexing is useful to both of these groups (though for the latter, our challenge as indexers is to demonstrate how a professionally created index can complement a simple full-text search!).

Our Qualifications

Carpe Indexum has 15 years' experience with SGML and XML documents, complemented by proven abilities in XSL. We have developed a suite of tools which, used in conjunction with XML authoring software, ensures that our XML indexes meet the same high standards as embedded indexes created using InDesign or FrameMaker, or regular indexes created using dedicated indexing software such as Cindex or Macrex.