Dead medium: Dead Digital Documents (Part Two)
From: (Patrick I. LaFollette)

(((We continue Pat LaFollette's essay on death and degradation in digital archives -- bruces)))

A major advantage of the digital medium is that so long as the physical medium is readable, copies made from it will be identical. There is no degradation from generation to generation as there is with analog reproduction processes such as microfilm and photocopying. Nothing is fugitive. And unlike books, which are produced in finite (often rather small) numbers that decrease over time, there need never be such a thing as a "rare" digital document, so long as it is periodically copied and made available.
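
The claim that digital copies do not degrade can be checked mechanically. The following is only a minimal sketch in Python, not anything an archive is known to use: the helper names (sha256_of, copy_generations) and the sample text are assumptions made up for illustration. It copies a block of bytes through ten generations, compares checksums, and shows that a single damaged byte is detected.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of a block of bytes as hex text."""
    return hashlib.sha256(data).hexdigest()


def copy_generations(original: bytes, generations: int) -> bytes:
    """Copy the data repeatedly, each copy made from the previous copy."""
    current = original
    for _ in range(generations):
        # bytearray() forces a fresh byte-for-byte copy of the previous generation.
        current = bytes(bytearray(current))
    return current


# Hypothetical sample content standing in for an archived document.
original = b"Astraea undosa\n"

# A tenth-generation copy is bit-for-bit identical to the original.
tenth_copy = copy_generations(original, 10)
copies_match = sha256_of(original) == sha256_of(tenth_copy)

# A copy read back from a damaged medium (one byte altered) no longer matches.
damaged = b"Astraea undosb\n"
damage_detected = sha256_of(original) != sha256_of(damaged)
```

Comparing checksums like this is also one way a set of distributed archives could confirm that they still hold identical copies, without shipping the full data back and forth.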

Digital media may not hold up as well as paper to adversity and neglect, but its content can be much more widely distributed. Local disasters (wars, fires, floods, hackers, budget cuts) would have a much less lasting impact on a properly managed digital archive than on a conventional library. As soon as the event is past and the equipment is replaced, identical copies of the data can be restored from other unaffected archives.

The third and most intractable issue affecting computer documents is, as William Schleihauf pointed out, the coding scheme in which the data are recorded.

"*Everything* stored in a computer is stored with a specific coding scheme. You need to have the "magic decoder ring" to get it all back. If you create a file/document with Word Perfect v 6 today, there's no guarantee that you'll be able to read it 10 years from now..."

The real problem here is the use of non-standard "proprietary" document and database formats by DBMS, word processors, page layout programs, and typesetting systems. These formats tend (often deliberately) to be mutually incompatible, and they change over time.

A solution to this problem was agreed upon eleven years ago by the International Organization for Standardization (ISO), but is only gradually gaining wide acceptance, primarily in the publishing industry, government, large corporations, and the European Common Market. The solution is SGML (Standard Generalized Markup Language), ISO 8879 (1986), which defines a single international standard for coding documents that is independent of hardware, operating system, and software.

Electronic documents in SGML format avoid the problems of proprietary formats and obsolescence, but can be converted to a proprietary format if this is necessary to perform a particular task. Most commercial typesetting and CD-ROM display software now accepts SGML documents directly, without conversion. In fact, WWW browsers are quasi-SGML viewers, in that HTML (HyperText Markup Language) is defined as an application of SGML.

The new buzzword in publishing circles is "repurposing" documents. That is, taking a computer file (the manuscript for a reference book, for example) and using it to produce a CD-ROM or online database as well as the printed book. If the text is in SGML format, it can be used for all three purposes without modification. What allows this to be done is that in SGML, it is the content, rather than the appearance, that is marked.

In most other text markup systems, one would say <start italic> Astraea undosa <end italic> to put the name in italic. In SGML, one would say <start genus> Astraea <end genus> <start species> undosa <end species>. The rule "print genus in italic" (or red or 14pt gothic) is defined separately from the document, and can be changed without changing the document itself.
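
The separation of content markup from appearance rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real SGML toolchain: it uses XML (a simplified profile of SGML) as a stand-in because Python's standard library can parse it, and the element names (taxon, genus, species), the style tables, and the render function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The document marks WHAT each piece of text is, not how it should look.
DOCUMENT = "<taxon><genus>Astraea</genus> <species>undosa</species></taxon>"

# Appearance rules live outside the document: tag name -> (prefix, suffix).
# This table plays the role of the rule "print genus in italic".
ITALIC_STYLE = {
    "genus": ("<start italic>", "<end italic>"),
    "species": ("<start italic>", "<end italic>"),
}

# A second rule set for a plain-text product; the document is untouched.
PLAIN_STYLE = {}


def render(element, style):
    """Render an element and its children using the given style table."""
    before, after = style.get(element.tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        # Text that follows the child's closing tag belongs to the parent.
        parts.append(child.tail or "")
    parts.append(after)
    return "".join(parts)


root = ET.fromstring(DOCUMENT)
for_print = render(root, ITALIC_STYLE)
for_plain_text = render(root, PLAIN_STYLE)
```

Here for_print comes out as "<start italic>Astraea<end italic> <start italic>undosa<end italic>" and for_plain_text as "Astraea undosa". Switching from italic to red or 14pt gothic would mean editing only the style table, never DOCUMENT.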

There are a variety of SGML editors that allow documents to be created and maintained directly. Unfortunately, it's still pretty much a "big boy" technology, the software expensive and clunky, and the conversion of existing electronic documents labor intensive. But this situation should improve in time, as more companies enter the arena.

The bottom line to all this is that digital documents are already, and will increasingly become, a useful supplement to the published literature and an inexpensive method of distributing large volumes of data, but they are not likely to take the place of paper any time soon.

Given that digital storage methods will continue to evolve for the foreseeable future, I would want to witness digital librarians staying ahead of the technological wave, maintaining the security and utility of their holdings, for a generation or three before I will have as much confidence in them and their holdings as I do in paper, conventional libraries, and old fashioned librarians.

On a related subject, does anyone out there have a copy of Sherborn's Index Animalium that I could borrow for a few weeks? (Or buy?) My plan is to scan and OCR it, convert the text to SGML, integrate the parts, supplements, and bibliographies, add hyperlinks between the index and the bibliography, and put the result on CD-ROM. (I'll give you a copy in return.)

Patrick I. LaFollette

Electronic Publishing

Auto-Graphics, Inc., 3201 Temple Ave., Pomona, California