XML - Hard Problems, Clean Solutions?

Copyright (c) 2000-2001 by Rich Morin
published in Silicon Carny, April 2000

XML is receiving increasing amounts of attention and is being proposed for all sorts of problems. Rich examines the reasons and gives us his own predictions for XML's prospects.

As regular readers of this column know, I am a big fan of self-describing data files. In "A Lazy Afternoon" and "Smart Data" (March and August, 1999), I covered some aspects of this technique.

Consequently, I have been watching XML (Extensible Markup Language) with a great deal of interest. XML has a good shot at being a really pervasive technology, causing changes in all parts of the computing community. Let's look at some XML fundamentals, to see how this could be.

XML looks quite a bit like HTML. The languages employ very similar syntax, though XML is distinctly pickier about compliance. Fundamentally, however, HTML and XML have rather different goals and some corresponding design differences.

The goal of a typical HTML file is to present information to a human reader, mediated by a browsing program. Although HTML documents can be written in a very abstract way, coders often use tricks to specify the exact appearance of the resulting document.

This makes web pages interesting to look at, to be sure, but it greatly increases the difficulty of writing programs to read the data. Worse, the HTML organization of a page may change at any time, subject to the whims of the developing organization.

XML documents, in contrast, are optimized for processing by computer programs. Their tight syntax rules allow both consistency checking by the creator and ease of access by random clients.

Further, the vocabulary and high-level syntax of XML files (and, by convention, much of their meaning) can be defined by a Document Type Definition (DTD) or an XML Schema. These can provide a programmer (or a particularly deft program) with a guide to reading and parsing the actual XML file. (HTML has DTDs, as well, but they tend not to be as detailed.)
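To make the "pickier about compliance" point concrete, here is a minimal sketch in Python, using only the standard library's xml.etree.ElementTree parser. The helper name is_well_formed and the two sample strings are my own inventions for illustration. Note that this checks only well-formedness (the tight syntax rules); it does not validate against a DTD or XML Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise.

    Hypothetical helper for illustration; it checks well-formedness
    only and performs no DTD or schema validation.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Sloppy markup of the kind browsers tolerate: <p> and <br> are never closed.
html_style = "<body><p>First paragraph<br><p>Second paragraph</body>"

# The same content with every element explicitly closed.
xml_style = "<body><p>First paragraph<br/></p><p>Second paragraph</p></body>"

print(is_well_formed(html_style))  # False: mismatched tags
print(is_well_formed(xml_style))   # True
```

An XML parser refuses the first string outright, which is exactly what lets the creator catch mistakes before a client ever sees the file.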

XML has many other interesting characteristics, but these should get us started. Let's explore some possible XML-based applications.

Book Catalogs

Consider the task of parsing publishers' web pages to generate a comprehensive list of books on a given topic. Each publisher uses a different format, of course, and every time a publisher rearranges a page or adds a feature, some programmer must figure out (again) how to parse the format.

Big companies such as Amazon simply step around this problem, requiring publishers to give them listings in a specified format. Unfortunately, this means that each publisher now has to generate a different listing for each online reseller.

Wouldn't it be more reasonable for publishers and resellers to agree on a single listing format? The pages could be transmitted privately or posted to the World Wide Web for more general access. In either case, however, the target audience would be programs, rather than humans.

If the format were well documented, special-purpose search programs could be hacked up in Perl, etc. As an occasional book reviewer, I would love to have a program which could generate lists of books on specified topics!

XML is aimed at precisely this kind of problem. Publishers and resellers could easily (from a technical perspective :-) define a common vocabulary and structure for XML-based catalogs.

Although this could be accomplished by a prose description, a DTD or XML Schema really should be used to specify the exact format. Existing DTDs (e.g., BiblioML and MARC) cover very similar problems, so the publishing community could probably adopt (or adapt) one for their own use.
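Here is a rough sketch of the kind of special-purpose search program I have in mind, written in Python rather than Perl. Everything about the catalog is an assumption: the element names (catalog, book, title, topic), the id attribute, and the sample titles are invented for illustration and are not taken from BiblioML, MARC, or any real publisher's data.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element names, ids, and titles are made up.
CATALOG = """\
<catalog publisher="Example Press">
  <book id="b1">
    <title>Learning XML Basics</title>
    <topic>XML</topic>
    <topic>Markup</topic>
  </book>
  <book id="b2">
    <title>Make for Builders</title>
    <topic>Unix</topic>
    <topic>Build Tools</topic>
  </book>
  <book id="b3">
    <title>Catalogs in XML</title>
    <topic>XML</topic>
    <topic>Publishing</topic>
  </book>
</catalog>
"""


def books_on(catalog_xml, topic):
    """Return the titles of all books tagged with topic (case-insensitive)."""
    root = ET.fromstring(catalog_xml)
    wanted = topic.lower()
    titles = []
    for book in root.iter("book"):
        # Collect this book's topics, ignoring empty <topic/> elements.
        topics = {t.text.strip().lower() for t in book.findall("topic") if t.text}
        if wanted in topics:
            titles.append(book.findtext("title", default="").strip())
    return titles


print(books_on(CATALOG, "xml"))
# ['Learning XML Basics', 'Catalogs in XML']
```

The program never looks at layout or position, only at element names, so a publisher could reorder or extend its catalog without breaking it.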

Once agreement has been reached on the DTD, each publisher must find a way to convert its local catalog format into (and out of) the XML format. This is a relatively trivial effort, however, compared with generating formats for an arbitrary (and steadily increasing) number of resellers.
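As a hedged illustration of how small the conversion step can be, the Python sketch below turns a made-up local format (comma-separated lines, with topics separated by semicolons) into XML. The column names, the output element names, and the function local_to_xml are all hypothetical; a real publisher's database and a real agreed-upon DTD would differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical "local" catalog format: one CSV row per book.
LOCAL_CATALOG = """\
id,title,topics
b1,Learning XML Basics,XML;Markup
b2,Tags & Trees,XML;Publishing
"""


def local_to_xml(csv_text, publisher):
    """Convert the CSV catalog into an XML string (invented vocabulary)."""
    root = ET.Element("catalog", publisher=publisher)
    for row in csv.DictReader(io.StringIO(csv_text)):
        book = ET.SubElement(root, "book", id=row["id"])
        ET.SubElement(book, "title").text = row["title"]
        for topic in row["topics"].split(";"):
            ET.SubElement(book, "topic").text = topic.strip()
    # The serializer escapes special characters such as "&" for us.
    return ET.tostring(root, encoding="unicode")


print(local_to_xml(LOCAL_CATALOG, "Example Press"))
```

Building the tree through a library, rather than pasting strings together, means the output is well formed by construction, including the ampersand in the second title.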

I would love to be able to tell you that the publishing industry is well on the way to having such a system in place. Sadly, even publishers that have lots of books about XML haven't (yet) published their catalogs in XML form. I predict that it will happen, however, and probably sooner rather than later.

Software Building and Distribution

In the Unix community, software builds are commonly controlled by a version of the make utility. Makefiles describe dependency relationships between files (e.g., "foo is built from foo.c and foo.h"), using a largely declarative syntax supplemented by snippets of shell code.
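For readers who have not used make, the core rule behind those dependency relationships can be sketched in a few lines of Python. This is a simplification for illustration, not make's actual implementation: the function name needs_rebuild is mine, modification times are passed in as a plain dictionary instead of being read from the file system, and every dependency is assumed to exist.

```python
def needs_rebuild(target, deps, mtimes):
    """Return True if target is missing or older than any dependency.

    mtimes maps file names to modification times (larger means newer).
    Simplified sketch: all dependencies are assumed to be present.
    """
    if target not in mtimes:
        return True
    return any(mtimes[dep] > mtimes[target] for dep in deps)


# "foo is built from foo.c and foo.h"
mtimes = {"foo.c": 100, "foo.h": 250, "foo": 200}

print(needs_rebuild("foo", ["foo.c", "foo.h"], mtimes))  # True: foo.h is newer

mtimes["foo"] = 300  # pretend we just rebuilt foo
print(needs_rebuild("foo", ["foo.c", "foo.h"], mtimes))  # False
```

The declarative part of a makefile supplies the target and dependency names; the embedded shell snippets say what to run when the answer is "yes".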

Because make is a very flexible language, wizards can cause it to do spectacular things. The FreeBSD Project's Jordan Hubbard, for instance, has created a 2500+ line makefile as the basis for the FreeBSD Ports Collection.

In concert with a small specification file for each package, Jordan's makefile automates the downloading, patching, building, and installation of given Open Source packages. About 3000 of these specification files currently exist, covering a very wide range of packages.

Unfortunately, the system depends heavily on Berkeley-style make, as well as having a variety of FreeBSD dependencies. Consequently, adapting the system to support Solaris (let alone Linux) might be a challenge. Extending it to cover local (e.g., site or system) preferences could be even harder, in the general case.

I have speculated about the possibility of using XML as the basis for a rewritten system. In the new system, the description files would be both abstract (no OS dependencies) and totally declarative (no embedded snippets of code). Packages could then be "formatted" into binaries and such, obeying local preferences, using a single XML "style sheet".
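To give a feel for the idea, here is a speculative Python sketch. The package description format (package, source, patch, and build elements), the preference tables, and the function render_steps are all invented for illustration; they are not the Ports Collection's format or anyone's shipping design. A small Python function also stands in for the XML style sheet, purely to keep the example self-contained, and the commands are only generated as text, never run.

```python
import xml.etree.ElementTree as ET

# Hypothetical description: abstract (no OS dependencies) and
# declarative (no embedded code). All names here are made up.
PACKAGE = """\
<package name="foo" version="1.2">
  <source url="http://example.com/foo-1.2.tar.gz"/>
  <patch file="foo-paths.diff"/>
  <build style="configure"/>
</package>
"""

# Local preferences live outside the description (illustrative values).
SITE_A_PREFS = {"prefix": "/usr/local", "fetch": "fetch", "make": "make"}
SITE_B_PREFS = {"prefix": "/opt/local", "fetch": "wget", "make": "gmake"}


def render_steps(package_xml, prefs):
    """Turn a package description into command strings for one site."""
    pkg = ET.fromstring(package_xml)
    steps = []
    for src in pkg.findall("source"):
        steps.append(f"{prefs['fetch']} {src.get('url')}")
    for patch in pkg.findall("patch"):
        steps.append(f"patch -p0 < {patch.get('file')}")
    build = pkg.find("build")
    if build is not None and build.get("style") == "configure":
        steps.append(f"./configure --prefix={prefs['prefix']}")
    steps.append(prefs["make"])
    steps.append(f"{prefs['make']} install")
    return steps


for prefs in (SITE_A_PREFS, SITE_B_PREFS):
    print("\n".join(render_steps(PACKAGE, prefs)))
    print()
```

The same description yields different commands at each site, which is the point: the per-package file stays free of OS details, and all of the local knowledge sits in one place.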

Looking around a bit, I discovered that I was not alone in considering this approach. The Open Software Description, developed by folks at Marimba and Microsoft, proposes XML as the foundation for a complete software packaging and distribution system.

Apple is also reported to be making heavy use of XML in various parts of Mac OS X. It plays a large role in WebObjects, of course, but it is also purported to be an integral part of the system administration infrastructure and the software build and distribution mechanisms. Clearly, XML isn't just for Internet applications!

Oh Yes, Web Pages

Although I have discounted the use of XML for web pages, there are some really interesting possibilities here, as well. In an effort to make web pages more interesting and dynamic, programmers are stuffing all sorts of executable code (e.g., Java, JavaScript, Perl, and Tk) into HTML pages.

This makes me more than a bit twitchy, as I have no way of knowing the real intentions (or, for that matter, simple competence) of the programmer(s) who wrote the code. So, I tend to leave these facilities off in my browser, missing pizzazz in return for a bit more safety.

Instead of sending executable code, however, programmers could send declarative descriptions of items, along with possible presentation modes. These modes, defined by style sheets, can support interactive graphics, multimedia, and more. What they do not do is pump arbitrary code into the viewer's machine.

Although I suspect that evildoers could find ways to subvert even XML, the opportunities are more limited. So, I look forward to upcoming uses of XML which will take advantage of "trusted" presentation code, giving me both pizzazz and safety.

There are many books on XML, dealing with assorted aspects of the standard. The ones listed in the Resources section are simply those I found useful as introductions.

Resources

The FreeBSD Project

Open Software Description

DocBook: The Definitive Guide (O'Reilly)
The XML Handbook, 2e (Prentice Hall)

About the author

Rich Morin (rdm@cfcl.com) operates Prime Time Freeware (www.ptf.com), a publisher of books about Open Source software. Rich lives in San Bruno, on the San Francisco peninsula.