You need to have a schema… or at least a DTD

It is often stated in the XML literature that XML was not designed to be a syntax for the manual input of data. It was designed for machine-to-machine communication, but was intended to be easily readable by humans (which, in turn, makes for easier debugging, easier building of XPath queries, and that sort of thing). This wisdom is commonly ignored: many products use manually entered XML configuration files. Why? There are several reasons:

  • It is easier than coming up with a new syntax for the configuration file.
  • Code to parse XML is readily available, making it unnecessary to write a custom parser.
  • It is not all that difficult to manually enter XML, at least in small quantities.
  • There are some unexpected benefits, like being able to use XML stylesheets to upgrade configuration files for new releases. (See “Managing XML Documents Versions and Upgrades with XSLT,” by Vadim Zaliva.)

I recently moved to a new, smaller office, and I no longer had the bookshelf space for my binders full of printed-out PDF articles. I got rid of all the binders and put the PDF files in a directory. I needed an index, so I built a manually entered XML file with an entry for each book, listing the title, any topics I wanted it listed under, and the name of the PDF file, like this:

<book>
    <title>Agile Development of Safety-Critical Software for Machinery</title>
    <topic>Agile / Safety Critical</topic>
    <topic>Agile Development</topic>
    <file>Katara-18052010.pdf</file>
</book>

All the <book> elements were wrapped in a <library> element. I added a processing instruction to the front to point to an XML stylesheet:

<?xml-stylesheet href="/libsheet.xsl" type="text/xsl"?>
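For readers who have not seen topic grouping in XSLT 1.0, here is a minimal sketch of what such a stylesheet can look like. It is not the actual libsheet.xsl: the key names and the HTML layout are assumptions made for illustration, and it assumes the <library> and <book> elements are in no namespace, as in the example above. It also assumes the PDF files sit in the same directory as the index.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>

  <!-- Hypothetical key: every topic element, indexed by its text.
       Used to find the first occurrence of each distinct topic. -->
  <xsl:key name="topic-by-name" match="topic" use="."/>

  <!-- Hypothetical key: every book, indexed under each of its topics.
       A book with two topic children is indexed twice. -->
  <xsl:key name="book-by-topic" match="book" use="topic"/>

  <xsl:template match="/library">
    <html>
      <body>
        <!-- Keep only the first topic element for each distinct topic name -->
        <xsl:for-each select="book/topic[generate-id() =
                              generate-id(key('topic-by-name', .)[1])]">
          <xsl:sort select="."/>
          <h2><xsl:value-of select="."/></h2>
          <ul>
            <!-- List every book indexed under the current topic -->
            <xsl:for-each select="key('book-by-topic', .)">
              <xsl:sort select="title"/>
              <li>
                <!-- Assumes the PDF is in the same directory as the index -->
                <a href="{file}"><xsl:value-of select="title"/></a>
              </li>
            </xsl:for-each>
          </ul>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Because the second key uses the node-set of <topic> children as its value, a book listed under several topics appears in each of those groups. If the elements are later placed in a namespace, the match patterns and select expressions would need a prefix bound to that namespace.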

The stylesheet used the Muenchian method, modified to handle membership in multiple groups, to sort and group the books by topic. I just had to point my web browser at the XML file, and I had my index.

Now, the format of the <book> entries could hardly be simpler: three kinds of child elements and no attributes, so simple that a schema seemed unnecessary. I use XML Copy Editor, which checks that my XML is well-formed before saving it. What could go wrong? But I was studying XML schemas, so I decided to write one for my simple index file, just for practice. I added an xsi:schemaLocation attribute to my <library> element to point to the schema:

<library xmlns="http://www.TheXMLAdventure.com/schemas/pdfdoc/docindex.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.TheXMLAdventure.com/schemas/pdfdoc/docindex.xsd docindex.xsd">
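A schema for a file this simple can be short. The following is a minimal sketch of what docindex.xsd might contain, not the actual schema: the type names, the requirement of at least one <topic>, and the .pdf file name pattern are assumptions made for illustration. Only the target namespace is taken from the <library> element above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.TheXMLAdventure.com/schemas/pdfdoc/docindex.xsd"
           xmlns="http://www.TheXMLAdventure.com/schemas/pdfdoc/docindex.xsd"
           elementFormDefault="qualified">

  <!-- Root element: any number of book entries -->
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" type="bookType"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Hypothetical type: one title, one or more topics, exactly one file -->
  <xs:complexType name="bookType">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="topic" type="xs:string"
                  minOccurs="1" maxOccurs="unbounded"/>
      <!-- minOccurs and maxOccurs default to 1, so a missing file is an error -->
      <xs:element name="file" type="pdfFileName"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Hypothetical type: a non-empty name ending in .pdf -->
  <xs:simpleType name="pdfFileName">
    <xs:restriction base="xs:string">
      <xs:pattern value=".+\.pdf"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

With a schema along these lines, a <book> that lacks its <file> child, or lists it before the <topic> elements, fails validation at edit time.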

XML Copy Editor can also validate an XML document against a schema, and I was shocked to find that my index, which had grown to 618 books, contained about a dozen entries that did not match the schema. Most were entries where I had forgotten the <file> element, an error I would not otherwise have detected until I tried to open the PDF file. I have come to the conclusion that even if your XML data is simple, if you are entering it manually and you care about its integrity, you need a schema. If you prefer, you could use a DTD, but that has some disadvantages:

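  • Unlike XML Schema documents, DTDs are not themselves written in XML and, at least to me, are a rather ugly construct.
  • DTDs cannot enforce things like a specific maximum number of occurrences (they offer only the ?, *, and + operators) or a data type for element content.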

But you ought to use something to keep user errors from creeping into your XML data. I have now created schemas for all my existing XML projects.

