Converting LibreOffice Calc spreadsheets to XML

I was working on an educational game about Medieval England, with an XML based GUI. For example, the new player would be asked to choose a name, a profession, and other characteristics. These were read from XML files in order to populate the list of choices.

The researcher on the team would send us spreadsheets with lists of medieval jobs, names, and so on, and I would have to convert them to XML. For the proof of concept, I just cut and pasted the information into an XML editor and manually added the XML tags. This  was clearly too tedious and error prone for production, so I needed an automated conversion.

A LibreOffice spreadsheet, which has an extension of ods, is really a zipped file under the covers. If you unzip the file, there is a file inside called content.xml that contains the data of the spreadsheet. The following XQuery program extracts the information from that file:

  • The first line of each sheet is assumed to be column names. These names, with any spaces translated to hyphens, are used for the XML element names for each column.
  • The name of the sheet is used for the element that encloses the data for each row.
  • The name of the sheet, suffexed with an “s”, is used for the element that encloses the data for the sheet.
  • A <batch> element encloses the data for all the sheets.

Here is the XQuery program, ods2xml2.xq:

(:~
 : This program converts a LibreOffice Calc spreadsheet into an XML file:
 :
 : - The extracted data is enclosed in a  element.
 : - The data for each sheet is enclosed in an element with the same name as the sheet,
 : suffixed with an "s".
 : - The data for each row is enclosed in an element with the same name as the sheet.
 : - The first row is assumed contain the name of each column. Within each data row,
 : the data for each column is enclosed in an element that matches the name of the
 : column. The program currently does not deal well with special characters in column
 : names, since they are invalid in XML element names, so be careful.
 :
 : Input:
 :
 : The input to this program is the content.xml file from a LibreOffice ods file. The ods
 : suffix should be renamed to zip, the file unzipped, and the content.xml file extracted.
 : If the file has a different name, you can specify it via the "filename" external variable.
 :
 : @author Lynn Grant
 : @version 2017-10-02
 :
 :)

(:
 :
 : License: Public Domain
 :
 : To the extent possible under law, Lynn Grant has waived all copyright and related or
 : neighboring rights to this program. This work is published from: United States.
 :
 :)

(: Declare namespaces used by LibreOffice :)
 declare namespace office="urn:oasis:names:tc:opendocument:xmlns:office:1.0";
 declare namespace style="urn:oasis:names:tc:opendocument:xmlns:style:1.0";
 declare namespace text="urn:oasis:names:tc:opendocument:xmlns:text:1.0";
 declare namespace table="urn:oasis:names:tc:opendocument:xmlns:table:1.0";
 declare namespace draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
 declare namespace fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0";
 declare namespace xlink="http://www.w3.org/1999/xlink";
 declare namespace dc="http://purl.org/dc/elements/1.1/";
 declare namespace meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0";
 declare namespace number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0";
 declare namespace presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0";
 declare namespace svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0";
 declare namespace chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0";
 declare namespace dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0";
 declare namespace math="http://www.w3.org/1998/Math/MathML";
 declare namespace form="urn:oasis:names:tc:opendocument:xmlns:form:1.0";
 declare namespace script="urn:oasis:names:tc:opendocument:xmlns:script:1.0";
 declare namespace ooo="http://openoffice.org/2004/office";
 declare namespace ooow="http://openoffice.org/2004/writer";
 declare namespace oooc="http://openoffice.org/2004/calc";
 declare namespace dom="http://www.w3.org/2001/xml-events";
 declare namespace xforms="http://www.w3.org/2002/xforms";
 declare namespace xsd="http://www.w3.org/2001/XMLSchema";
 declare namespace xsi="http://www.w3.org/2001/XMLSchema-instance";
 declare namespace rpt="http://openoffice.org/2005/report";
 declare namespace of="urn:oasis:names:tc:opendocument:xmlns:of:1.2";
 declare namespace xhtml="http://www.w3.org/1999/xhtml";
 declare namespace grddl="http://www.w3.org/2003/g/data-view#";
 declare namespace tableooo="http://openoffice.org/2009/table";
 declare namespace drawooo="http://openoffice.org/2010/draw";
 declare namespace calcext="urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0";
 declare namespace loext="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0";
 declare namespace field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0";
 declare namespace formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0";
 declare namespace css3t="http://www.w3.org/TR/css3-text/";

(: Declare external variables :)
 declare variable $filename external := "content.xml";

(: Declare local functions :)

(:~
 : This function looks up the name of a column.
 :
 : @param $hdrrow The first row (table:table-row) of the table, which contains
 : the column names.
 :
 : @param $colnum The number of the column. The first column is numbered 1.
 :
 : @return A string that contains the name of the column, with spaces translated to
 : hyphens, or null if the column number is out of range. Note that the
 : function does not currently filter out characters that are invalid in
 : XML element names.
 :
 :)

declare function local:getColName($hdrrow as element()*, $colnum as xs:integer) as xs:string
 {
 let $name := fn:translate(data($hdrrow/table:table-cell[$colnum]/text:p), " ", "-")
 return
 if ($name)
 then $name
 else ""
 };

(:~
 : This function processes a row in a table. For each column, it creates an element
 : with the same name as the column. Spaces are normalized within the value of the
 : element.
 :
 : @param $row is the row (table:table-row) to be processed.
 :
 : @param $hdrrow The first row (table:table-row) of the table, which contains
 : the column names.
 :
 : @return The data from the row.
 :
 :)

declare function local:processRow($row as element()*, $hdrrow as element()*) as element()*
 {
 let $cdata :=
 for $cell at $cellnum in $row/table:table-cell
 let $name := local:getColName($hdrrow, $cellnum)
 return
 if ($name)
 then element {$name} {fn:normalize-space(data($cell))}
 else ()
 return $cdata

};

(:~
 :
 : This function processes all the rows in a table (table:table).
 :
 : @param $table The table (table:table) to be processed.
 :
 : @return The processed data from the table.
 :
 :)

declare function local:processTable($table as element()*) as element()*
 {

let $sheet := data($table/@table:name)

let $sheets := fn:concat($sheet, "s")

let $rows := $table/table:table-row

(: Get the first row, which has the names of the columns :)
 let $hdrrow := $rows[1]

(: Extract cells from data rows :)
 let $more :=
 for $row in $rows[position() > 1] (: Skip title row :)
 return element {$sheet} {local:processRow($row, $hdrrow)}

return {element {$sheets} {$more}}

};

(: Mainline :)

(: Extract all the tables from the spreadsheet :)
 let $tables := doc($filename)/office:document-content/office:body/office:spreadsheet/table:table

(: Process each table :)
 let $batch :=
 for $table in $tables
 return local:processTable($table)

(: Return batch of tables :)
 return {$batch}

Here is an example of a Windows bat file that uses the WinZip Command Line Support Add-On to extract the content.xml file, then runs the converter under  Zorba to extract the data:

@echo off
copy %1.ods %1.zip
move %1.zip data/%1.zip
wzunzip -o data/%1.zip data "content.xml"
cd data
del %1.zip
cd ..
zorba -f -q ods2xml2.xq --indent --external-variable filename:="data/content.xml" > data/batch.xml

In a build pipeline, this can be run whenever the researcher commits a new version of the spreadsheet, to automatically populate the XML files. Since the XML for all the sheets is contained in a single XML file, it will be necessary to select the desired table. This can easily be done with an XQuery or XSLT program.

By the way, the same technique, with a few modifications, can be used to extract data from Microsoft Excel spreadsheet (xlsx) files. The two major differences are that the names of the XML tags are different, though similar in function, and empty cells are handled differently. In LibreOffice Calc, empty cells are represented by an empty <table:table-cell> element. In Microsoft Excel, empty cells are omitted, and the next non-empty cell has an attribute that indicates its column number. I will post a version of the script for Microsoft Excel files sometime in the future.

Posted in Uncategorized | Comments Off on Converting LibreOffice Calc spreadsheets to XML

An extensible information radiator

Note: This post is being shared with my other blog, TryingToKeepItAgile, since it pertains to both agile development and XML.

Teams that are practicing continuous integration often have a central information radiator to indicate when the build fails. Several different kinds have been used, such as an Ambient Orb (a translucent globe that can glow in different colors), lava lamps, talking teddy bears, and all sorts of other things. The idea is that when the information radiator indicates that the build is broken, it is everyone’s first responsibility to get the build working again, so that problems are not compounded by checking in more changes on top of the code that broke the build.

I was working with a team that was not yet ready for continuous integration, but I had some long-running test streams, where each test job, if successful, would submit the next one. Rather than constantly monitoring the test stream, I wanted to have a light, similar to what continuous integration teams use, that would turn amber when my test stream was running, green if all the tests passed, and red if it failed along the way.

First, I needed a light. I chose a red-green-amber LED from Delcom Products.  It comprised the following parts:

  • 904017 USB HID Visual Signal Indicator RGY Black Case w/ Center Cable ($91.35)
  • 804136 Straight Al. Mounting Pole 36″ ($16.80)
  • 804301 Mounting Pole Base ($9.35)

The total cost was $117.50.

I later replaced the mounting pole base with a Grooming Arm Clamp ($17.05), for a total cost of $125.20. This allowed the light to be clamped to the edge of the desk, whereas the mounting pole base had to be screwed into a block of wood.  (In case you are not familiar with grooming arms, they are used in grooming dogs. A grooming arm is a curved pole that clamps to the table, and the dog’s leash is hooked to it to keep the dog in place.)

Here is a picture of the test light in use:

To make things even more interesting, the test jobs were running on an IBM mainframe (z/OS) system, and the light was a USB device that plugged into my desktop computer, so I had to establish communication between the two.

I already had a test framework on the mainframe that would monitor the results of each test job. If the results were as expected, it would submit the next job in the sequence; if they were not, it would stop.

I modified the framework to update a file, LIGHT.DATA.XML. The first job in the sequence updates it with a status of “running”. Any failing job updates it with a status of “failed”. If all jobs are successful, the final job updates it with a status of “passed”. Here is an example of the file. The <jobname> and <system> elements are there for future expansion. Note that the encoding is specified as “EBCDIC-CP-US”, because the IBM mainframe uses the EBCDIC character code, rather than ASCII or UTF-8.

On the PC, there is a Windows batch script, LightDriver.bat, that loops continuously, calling the UpdateLight.bat script, waiting 15 seconds, and repeating the procedure. It looks like this:

@echo off
rem Call the UpdateLight script every 15 seconds to update the light status

:loop
timeout 15
call UpdateLight.bat > nul
goto loop

The UpdateLight.bat script uses FTP to retrieve the LIGHT.DATA.XML file from the mainframe. FTP translates the file from EBCDIC to ASCII as it is downloading it, but the XML header still says its encoding is EBCDIC-CP-US, so the sed stream editor is used to change the encoding specification to UTF-8.

sed "s/EBCDIC-CP-US/UTF-8/" dynamic/light_data_ebcdic.xml > dynamic/light_data.xml

The encoding is really ASCII, rather then UTF-8, but it is close enough, and it makes subsequent XML processing happier. (By the way, I probably would have been OK if I had just specified UTF-8 in the XML declaration in the EBCDIC file on MVS. But I wanted to allow for the possibility of processing the XML on MVS also, and that would require a valid specification of the EBCDIC encoding.)

The modified XML file looks like this:

The modified XML file is processed by an XQuery program, UpdateLight.xq, using the Zorba XQuery processor, to produce a new Windows batch script, setlight.bat:

zorba -f -q C:/SignalLight/eclipse/UpdateLight.xq 
      -o C:/SignalLight/dynamic/setlight.bat 
      --external-variable input:="C:/SignalLight/dynamic/light_data.xml" 
      --serialize-text

The XQuery program looks like this:

(: The $input variable is passed into Zorba, and determines the input file name :)
declare variable $input external;

(: Extract the status of the test run from the rest of the XML data :)
let $testStatus := doc( $input )/test/status/text()

(: Set up a variable to represent the Newline character :)
let $nl := "&#10;"

(: Set the light color, based on the test status :)
let $testColor := 
    if ( $testStatus = "passed" )
    then "green"
    else if ($testStatus = "failed")
    then "red"
    else if ($testStatus = "running")
    then "amber"
    else "unknown"

(: Set the parameters to the light utility, based on the color :)
let $lightParm :=
   if ( $testColor = "green" )
   then "1"
   else if ( $testColor = "red" )
   then "2"
   else if ( $testColor = "amber" )
   then "4"
   else "0"

(: Generate the lines for the light-setting script :)
let $line01 := concat( "@echo off", $nl )    
let $line02 := concat( "rem Turn signal light ", $testColor, $nl )
let $line03 := $nl
let $line04 := concat( "usbcmdapx64 0 0 101 12 ", $lightParm, " 7", $nl )

(: Assemble all the lines. We do this with concat, rather than just returning
   a sequence, to avoid a blank at the beginning of the second and 
   subsequent lines :)
let $lines  := concat( $line01, $line02, $line03, $line04 )

return {$lines}

The program extracts the value of the <status> element, and maps its three values, running, passed, or failed, into corresponding colors, amber, green, or red. If the <status> has an unexpected value, the color is set to dark.

The color is then mapped to a parameter for the command-line utility that sets the color of the LED light, 0 for dark, 1 for green, 2 for red, and 4 for amber.

The Windows command-line utility that comes with the Delcom light is USBCMDAPx64.exe. The program takes the following parameters, specified in this order:

  • v – Verbose.  If specified, this prints more informaton.
  • TID – type.  Specifying 0 means all.
  • SID – serial ID.  Specifying 0 means all.
  • Major command – Specifying 101 means send an eight-byte write command.
  • Minor command – Specifying 12 means set or reset the port 1 pins individually.
  • LSBDATA – specifies the pins to reset.  (Resetting a pin turns it on.)
  • MSBDATA – specifies the pins to set.  (Setting a pin turns it off.)

LSBDATA takes precedence, so specifying 7 for MSBDATA turns off all the pins, and then any pins specified in LSBDATA are turned on.

The pins for the various colors are:

  • 1 – Green
  • 2 – Red
  • 4 – Amber

It is possible to turn more than one color on at the same time, but it does not look very good.

The following commands will set the light to the various colors:

Green:  usbcmdapx64 0 0 101 12 1 7 
Red:    usbcmdapx64 0 0 101 12 2 7 
Amber:  usbcmdapx64 0 0 101 12 4 7 
Dark:   usbcmdapx64 0 0 101 12 0 7

The highlighted parameters are the ones that were set in the XQuery program.

The resulting Windows batch script looks like this:

@echo off
rem Turn signal light red

usbcmdapx64 0 0 101 12 2 7

The generated script is then called:

call dynamic/setlight.bat

This two-process, generating a dynamic Windows script and then calling it, is necessary because it is not easy to call programs out of an XQuery program.

This process worked well, with the LED light on my desk letting me know the status of my test jobs. But when the test stream failed, I really wanted to know which job failed. So I got a BetaBrite Prism scrolling LED sign. (These are sometimes called ticker-tape signs or Times Square signs.) Here are a couple of images of one, from a previous blog post, XML for LEDs:

The BetaBrite Prism connects to the PC via USB. The previous post was about BetaBright Classic signs, which used an RS-232 connection. The method described in that post does not work for the USB signs. Luckily, the folks at Industro Logic have a free command-line program to control BetaBrite Prism signs. You can download it here. The file you want is PRISMCOM.EXE.

If you run the program with no parameters, it gives you a help screen that shows you the options, but the basics are that you enclose control information in braces. In our case, we want to specify the color of the message, so the command would like like this:

prismcom usb {red}Test job TEST1234 failed

I modified the UpdateLight.bat script to run another XQuery program that takes the same input XML file and creates a Windows batch script, setsign.bat, that contains the prismcom command to send the message to the BetaBrite sign. As before, the script is then called.

Because the XQuery programs use XPath expressions to select the parts of the XML file that they need, I can add new information to the LIGHT.DATA.XML file without breaking anything. For example, I decided I wanted to have the job number, as well as its name, displayed on the sign, so I changed the test framework to add a <jobnum> element to the XML file. Once that was in place, I could update the XQuery program for the sign at my leisure.

Since the LightDriver.bat script invokes the UpdateLight.bat script every 15 seconds, it is easy to play with the message on the sign to see what works best. You edit the XQuery program, save it, and within 15 seconds you can see the result.

For example, after I added the job number, I had a message that looked like this:

{red}Test job TEST1234{amber}({red}JOB01723{amber}){red}failed

The parentheses surrounding the job number were amber to better set it off from the job name. The messages for running and passed status were similar, except for the colors.

The BetaBright sign scrolls rather quickly, and it was hard to read the job name and number as it scrolled by. So I changed the message to just display the job name and number, in the appropriate color:

{hold}{red}TEST1234 JOB01723

The {hold} tells the sign not to scroll. Instead, it alternates displaying the job name and the job number. I decided that the rest of the message was extraneous. The sign is located right next to the light, so people who are interested in it already know what it means. (It actually would not even need to change color, but having it match the color of the light looks better, and reassures people that the two are in sync.)

But what about color-blind coworkers? Over-reliance on color coding can be a problem for them. Right now, no one using this information radiator has that problem. But should it occur in the future, it can be easily solved by changing to a different kind of light:

This is a Patlite LED stack light. They are available on eBay, frequently for less than $50. (This light has four colors, but you really need only three.) They are not USB devices, but just have wires, one for each color plus a common wire. You will need  a 24 VDC power supply, a USB relay board from Numato Labs, and a little bit different programming. (The relay boards looks like a virtual COM port.)

With a stack light like this, people can tell what the light is saying by the position of the illuminated, even if they cannot differentiate between the colors.

This method lets the mainframe control the light and the sign without a bunch of TCP programming, or that sort of thing, so it can be set up very quickly. Using an XML file as the communication medium makes it easy to extend: new elements can be added to the XML file, and information radiators that are not interested in them will just ignore them.

With proper locking on the mainframe side, the XML file could even convey the status of several test streams, and individual information radiators could decide whether they wanted to report the status of a particular stream or all of them.

Posted in Uncategorized | Comments Off on An extensible information radiator

Extracting XML from Wiki pages

The Fitnesse testing framework uses a built-in wiki to allow end users to specify business-related tests. End users specify the tests and expected values in tables, which are then extracted and turned into tests.

Even if you are not using Fitnesse, a wiki can be a useful mechanism for entering data. Here are some useful characteristics of the Confluence wiki:

  • It is possible to restrict access to pages.
  • The wiki retains old versions of the pages. It has a built-in comparison function that makes it easy to see what changed.
  • The wiki can notify you by email when a page changes.
  • You don’t have to write a user interface. People know how to edit Wiki pages.

This last characteristic is perhaps the most important. You don’t have to write and maintain a bunch of data-entry pages, and you don’t have to train your users on how to use the pages.

There is a down side: you don’t get immediate feedback when you enter an invalid value (whatever invalid means for your application). You system would have to send the user an email indicating the error. Also, some discipline is required on the part of the users, but surprisingly little:

  • If there are multiple tables on the page, the order should not be changed.
  • The first row of the table should contain the column headings.

Users can add text before or after the table, and can add their own columns to the table.

To extract a table from a Confluence Wiki page, you first need to retrieve the page. You can do this with the curl utility, like this:

curl -u userid:password http://wiki.example.com/display/DEVWiki/Test+Team+Roster > WikiPageExtract.html

 

Here we are using basic authentication (-u userid:password). Confluence also supports  OAuth validation, so you don’t have to put your user ID and password in your scripts. However, it appears that you have to get an administrator to authorize the use of OAuth on your user ID. I haven’t tried this yet.

The HTML from Confluence Wiki pages, like HTML generated by most of applications, is not completely standard. It is necessary to run it through HTML tidy to clean it up and transform it to XML that we can work with. HTML tidy is available here, and more documentation on its use is available here. The Confluence Wiki uses some HTML5 tags: <header>, <nav>, and <section>. HTML tidy does not understand HTML5 tags, so it is necessary to set up a configuration file to define these tags. (There is an experimental version of HTML tidy that supports HTML5 here, but I have not yet tried it.) The configuration file looks like this:

new-blocklevel-tags: header,section,nav

If we call our configuration file tidyhtml5.txt, this command will clean up the HTML and transform it to XHTML, which is also valid XML:

tidy -b -i -asxml -config tidyhtml5.txt -o WikiPageExtract.xml WikiPageExtract.html

The following options are specified:

  • -b: strip out smart quotes and em dashes, etc.
  • -i: indent the output.
  • -asxml: convert HTML to well-formed XHTML.
  • -config: specifies the name of the configuration file.
  • -o: specifies the name of the output file.

HTML tidy will generate a lot of warning messages as it fixes up the HTML, but it should not generate any error messages.

The tidied XHTML begins with a standard XML header and the opening tag for the html element:

Note the XHTML namespace namespace on the html tag. It will be necessary to define this when using XQuery/XPath to process the XHTML. The next thousand or so lines are scripts and such that we are not interested in, and the we come to a div element with a class of “wiki-content”. That is where we will find the data from our table.

Here is an XQuery program extract table data from the XHTML.

(: Extract table from Confluence Wiki XHTML :) 

(: XHTML namespace :)
declare namespace x = "http://www.w3.org/1999/xhtml";

(: Input file :)
declare variable $InputFile external;

(: Table Number :)
declare variable $TableNumber external;

(: 
   Get the names of the columns
:)

declare function local:GetColumnNames( $table as element()* ) 
   as element()* {

    for $rawName in $table//x:th

        (: We are going to use the column names for XML tags, which 
           cannot have embedded spaces or slashes. So we translate 
           embedded spaces and slashes to nulls :)
        let $name := 
           translate( normalize-space( data($rawName) ), ' /', '' ) 
        return <col>{$name}</col>

};

let $wikiData := doc( $InputFile )

let $body := $wikiData/x:html/x:body
let $wc   := $body//x:div[@class="wiki-content"]
let $title := normalize-space($wc/x:h2[$TableNumber])
let $firstTable := 
   $wc/x:div[@class="table-wrap"][$TableNumber]/x:table

let $colNames := local:GetColumnNames( $firstTable )

let $tableData :=
    for $row in $firstTable//x:tr[position() > 1]  (: Skip first 
                                                      row: title 
                                                      row :)
       let $cols :=
          for $col at $colNum in data($row/x:td)
            let $colName := $colNames[$colNum]/text()
            return element{$colName} {normalize-space($col)}
    return <row>{$cols}</row>

return  <table>
            <title>{$title}</title>
            {$tableData}
        </table>

The two external variables are TableName, which specifies the name of the file containing the extracted XHTML, and TableNumber, which specifies which table on the page is desired, counting from 1. The title of the table is captured from the XHTML H2 element that precedes it.

Here is how the program would be run via the open-source Zorba XQuery processor:

 zorba -f -q program-file-name.xq -o output-file-name.xml 
       --lib-path library-path 
       --external-variable InputFile:=input-file-name.xml 
       --external-variable TableNumber:=1 
       --indent  
       --timing

The processed XML looks like this:

The elements in each row are based on the column heading in the first row of the table, with spaces and slashes removed.

Posted in Uncategorized | Comments Off on Extracting XML from Wiki pages

Documenting XML

I recently ran into a couple of useful tools for documenting XML.

The first is a tool that (among other things) can import an XML schema and display it as a UML class diagram.  This could be useful when documenting existing systems. The program is called hyperModel, and was written by Dave Carlson, author of Modeling XML Applications with UML: Practical e-Business Applications, and the xmlmodeling.com website. It can be downloaded without charge from the website. The license for the program is not displayed on the website, but it is in the download. It does not appear to contain any pitfalls.

Here is an example of a simple XML schema I had lying around:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
            targetNamespace="http://www.TheXMLAdventure.com/schemas/pdfdoc/docindex.xsd" 
            elementFormDefault="qualified">
  <xsd:element name="library">
    <xsd:complexType>
      <xsd:sequence maxOccurs="unbounded">
        <xsd:element name="book">
          <xsd:complexType>
            <xsd:sequence minOccurs="1" maxOccurs="1">
              <xsd:element name="title" type="xsd:string" />
              <xsd:element name="topic" type="xsd:string" 
                           minOccurs="1" maxOccurs="unbounded" />
              <xsd:element name="file" type="xsd:string" />              
              <xsd:element name="url" type="xsd:string" 
                           minOccurs="0" maxOccurs="1" />
            </xsd:sequence>
            <xsd:attribute name="ca-ip" type="xsd:boolean" 
                           default="false" />
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

From the XML schema, the program generated the following UML diagram:

The second tool is the XML to HTML Verbatim Formatter with Syntax Highlighting on Oliver Becker’s website. It is an XML stylesheet that reformats an XML document into an HTML page with syntax highlighting. All you have to do is add a processing instruction to the front of your document, like this:

<?xml-stylesheet href="xmlverbatimwrapper.xsl" type="text/xsl"?>

This has saved me a lot of time. The tools I use to edit XML, such as VIM and XML Copy Editor, have syntax highlighting. It makes it immeasurably easier to follow the document. But when I cut text out of the editor and paste it into a Word file or a Wiki page, the syntax highlighting is lost. I consider the increase in readability so important that I used to go and manually change the colors of the various XML elements to recreate the highlighting. As you might imagine, that is no fun at all. I had thought of writing an XML stylesheet to do the highlighting, but never got around to it. Then I discovered Oliver’s page, and I’m glad I waited, because his stylesheet does an excellent job.

The last tool is for users of the Confluence Wiki. When inserting XML code in a Wiki page, you can use the CodeBlock macro to display it. To use it, highlight the XML text, then on the toolbar, select “+” (Insert More Content), “Other Macros”, and “CodeBlock”. The macro gives you formatting options for several languages and color schemes, as well as whether you want line numbers or for the block to be collapsible.

In edit mode, the code block looks like this:

Code block in edit mode

In display mode, it looks like this:

Code block in display mode

 

Posted in Uncategorized | Comments Off on Documenting XML

You need to have a schema… or at least a DTD

It is often stated in the XML literature that XML was not designed to be an syntax for the manual input of data. It was designed for machine-to-machine communication, but was intended to be easily readable by humans (which, in turn, makes for easier debugging, easier building of XPath queries, and that sort of thing).

This wisdom is commonly ignored. Many products use manually-entered XML configuration files. Why? There are several reasons:

  • It is easier than coming up with a new syntax for the configuration file.
  • Code to parse XML is readily available, making it unnecessary to write a custom parser.
  • It’s not all that difficult to manually enter XML, at least in small quantities.
  • There are some unexpected benefits, like being able to use XML stylesheets to upgrade configuration files for new releases. (See “Managing XML Documents Versions and Upgrades with XSLT,” by Vadim Zaliva.)

I recently moved to a new, smaller, office, and I no longer had the bookshelf space for my binders full of printed-out PDF articles. I got rid of all the binders and put the PDF files in a directory. I needed an index, so I built a manually-entered XML file with an entry for each book, listing the title, any topics I wanted it listed under, and the name of the PDF file, like this:

<book>
    <title>Agile Development of Safety-Critical Software for Machinery</title>
    <topic>Agile / Safety Critical</topic>
    <topic>Agile Development</topic>
    <file>Katara-18052010.pdf</file>
</book>

All the <book> elements were wrapped in a <library> element. I added a processing instruction to the front to point to an XML stylesheet:

<?xml-stylesheet href="libsheet.xsl" type="text/xsl"?>

The stylesheet used the Muench method, modified to work with multiple group membership, to sort and group the books by topic. I just had to point my web browser to the XML file, and I had my index.

Now, the format of the <book> entries could hardly be simpler: three child elements, no attributes, so simple that a schema seemed unnecessary. I use XML Copy Editor, which validates that my XML is well-formed before it saves it. What could go wrong?

But I was studying XML schemas, so I decided to make a schema for my simple index file, just for practice. I added an xsi:schemaLocation attribute to my <library> element to point to the schema:

<library xmlns="http://www.TheXMLAdventure.com/schemas/pdfdoc/docindex.xsd"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.TheXMLAdventure.com/schemas/pdfdoc/docindex.xsd docindex.xsd">

XML Copy Editor can also validate an XML document against a schema, and I was shocked to find that in my index, which had grown to 618 books, I had about a dozen entries that did not match the schema. It was mostly entries where I had forgotten to put in the <file> element, which meant that I would not have detected the error until I tried to open the PDF file.

I have come to the conclusion that even if your XML data is simple, if you are manually entering it, you need to have a schema. At least if you care about the integrity of your data.

If you prefer, you could use a DTD, but that has some disadvantages:

  • Unlike schemas, DTDs are not XML and, at least to me, are a rather ugly construct.
  • DTDs cannot enforce things like maximum occurrences, or correct data type.

But you ought to use something to keep user errors from creeping into your XML data. I have now created schemas for all my existing XML projects.

Posted in Uncategorized | Comments Off on You need to have a schema… or at least a DTD

XML for LEDs

When I think of XML, I usually think of storing data, like customer orders or employee information, and transforming it from one form to another. But I recently ran into a couple of uses for XML that I would not have thought of. Both involve controlling LEDs.

The first is BBXML, Darin Franklin’s XML interface for BetaBrite LED signs.  These are the LED signs you see all over the place with scrolling messages; I have also heard them referred to as ticker tape signs or Times Square signs. These signs are programmed over an RS-232 connection using an arcane protocol that is not very convenient to work with.

For example, if you wanted to display “THE XML ADVENTURE” on the sign, you would send it an ASCII string like this:

_01Z00_02A0THE XML ADVENTURE_04

The meaning of the characters is as follows:

  • _01 – Start Of Header.
  • Z – Type of sign. “Z” means “all types”.
  • 00 – Address of sign. “00” means broadcast to all signs.
  • _02 – Start Of Text.
  • A – Write TEXT file.
  • 0 – Message area. “0” means priority text message.
  • THE XML ADVENTURE – The text of the message.
  • _04 – End Of Transmission.

Using BBXML, you can express the message in XML, like this:

<alphasign>
   <text label="0">THE XML ADVENTURE</text>
</alphasign>

All the processing in BBXML is done in the alphasign.xsl XML stylesheet.

So if you had the above XML in a file called commands.xml, you would just issue this command to convert the XML to the protocol required by the BetaBrite sign:

xsltproc alphasign.xsl commands.xml > commands.txt

(xsltproc is the XSLT processor from the libxslt package for Unix/Linux. If you are using a different processor you would modify the command accordingly.)

Once the XML is converted to the BetaBrite protocol, it can be sent to the sign with the Unix/Linix cat command:

cat commands.txt > /dev/betabrite

(On Windows, you would copy it to a COM port.)

BBXML would be particularly useful if you have data that you need to publish in several places, perhaps in a PDF file, on a website, and, in abbreviated form, on a BetaBrite sign. The information would start out as an XML file, with different XML stylesheets to tailor it to the various output media. The output of the stylesheet for BetaBrite signs would then be fed to BBXML.

The BBXML website has an excellent user’s guide for BBXML, which tells you pretty much everything you need to know about using BBXML with BetaBrite RS-232-based signs.

More recent BetaBrite signs are USB-based, and I have not yet found a driver (for Unix/Linux orWindows) that allows you to copy data directly to the sign. As soon as such a driver is found, BBXML will work with USB-based signs, too.

The second application comes from Front2BackDev, one of my favorite XQuery blogs. It uses XQuery and  a MarkLogic server to control Phillips Hue LED light bulbs. These light bulbs have a controller that hooks up to your network, and communicates with the bulbs over the power wiring. It makes it possible to set each light individually to a particular color and brightness. Phillips has apps for Androids and iPhones, but the controller also has a REST-based API. The example at Front2BackDev uses XQuery to call this API and step all the lights in the house through a list of colors. This could be enhanced to do all sorts of interesting things, like adjusting the lights in the house for different ambiance at different house of the day, or controlling lights being used as information radiators in Continuous Integration systems.

In a later post at Front2BackDev, they used this in conjunction with geofences defined in Google maps. An app on a smartphone would send its GPS coördinates to a Marklogic server, which would figure out where in the geofences the phone was, and change the color of all the lights in the house accordingly.


Posted in Uncategorized | Comments Off on XML for LEDs

A quick XSLT gallery

XSLT comes to the rescue again!

I have a small website of tutorials that uses the old Coppermine software. Each tutorial is a gallery. The first image of each gallery is the title of the tutorial, and the titles of subsequent images are “Step 1”, “Step 2”, etc. The description for each image explains how to do that step.

I recently had to move my website to a new host that was incompatible with Coppermine. I quickly looked around for photo gallery software, but I found that much of it does not accommodate having a long description for each image. I needed to get my site back up quickly, and I didn’t want to rush into choosing gallery software, so I made a quick gallery with the help of XSLT.

First I dumped the data from the Coppermine database to produce XML for each page that looked like this:

     <?xml version="1.0" encoding="UTF-8"?>
     <?xml-stylesheet href="../gallery.xsl" type="text/xsl"?>     
     <page>
        <title>Step 1</title>
        <image>LRG-20100804-001.jpg</image>
        <text><p>This is the first step in this process.</p>
              <p>Be <em>very</em> careful to follow these 
              instructions.</p>
        </text>
        <next>page2</next>
        <prev>page0</prev>
     </page>

Most of this I could do using an SQL select statement with the CONCAT() function and appropriate literals. I added the <next> and <prev> elements manually. The data was saved in files named page0.xml, page1.xml, etc, with a separate directory for each tutorial.

The gallery XSLT stylesheet, which was saved in the parent directory so that it could be shared by all the tutorials, looked like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
                version="1.0">
  <xsl:output method="html" indent="yes"/>
  <xsl:template match="/">
   <html>
      <head>
        <title>
          <xsl:value-of select="page/title"/>
        </title>
        <style>
  .icon { width:50px;
  height:50px;
  border-style:none;
  }
        </style>
      </head>
      <body>
        <h1>
          <xsl:value-of select="page/title"/>
        </h1>
        <!-- Image -->
        <div>
          <xsl:element name="img">
            <xsl:attribute name="src">images/
            <xsl:value-of select="page/image"/>
            </xsl:attribute>
          </xsl:element>
        </div>
        <div>
          <xsl:copy-of select="page/text"/>
        </div>
        <!-- Navigation buttons -->
        <div>
          <xsl:if test="page/next">
            <xsl:element name="a">
              <xsl:attribute name="href">
                <xsl:value-of select="page/next"/>
              </xsl:attribute>
              <img src="icons/next.png" class="icon" 
                   style="float:right;"/>
            </xsl:element>
          </xsl:if>
          <xsl:if test="page/prev">
            <xsl:element name="a">
              <xsl:attribute name="href">
                <xsl:value-of select="page/prev"/>
              </xsl:attribute>
              <img src="icons/back.png" class="icon" 
                   style="float:left;"/>
            </xsl:element>
          </xsl:if>
        </div>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

The result was a quick gallery with buttons to go forward and backwards through the tutorials.

Now, there are some disadvantages to implementing a gallery this way. By using stylesheets that are processed in the user’s browser, you are at the mercy of how each different browser implements XML stylesheets.  Also, processing the individual files this way makes it trickier to do things that require looking at all of them, like automatically generating an index of tutorials, or generating “page x of y” page numbers.

And finally, it would be nice if I could have avoided manually adding the <next> and <prev> tags. I am in transition between host sites, so my Marklogic server was not available, but if it had been, I could have used an XQuery program to look at the <title> elements and insert the appropriate <next> and <prev> elements using the xdmp:node-insert-child() function.

But if my Marklogic server had been available, I probably would have just done the whole thing using XQuery.  Still, the way I have the tutorials XML-ized now, it will be easy to use XQuery to convert them to whatever form I need, either to serve them from the Marklogic server using XQuery, or to convert them for some gallery package, if I find one I like.

 

Posted in Uncategorized | Comments Off on A quick XSLT gallery

The MarkLogic XQuery server

MarkLogic makes a commercial XQuery-based server, designed for quickly searching large quantities of XML and unstructured data. (Big Data is the current buzzword).  They also have several useful proprietary extensions to XQuery.  (Of course, extensions are always a tradeoff: along with more functionality, you get more vendor lock-in.  Only you can decide whether the tradeoff is worth it.)

MarkLogic offers a free Express license for MarkLogic 5, the latest version of its server. The free license allows for one developer, one computer, two CPUs, and 40 GB of data, and can be used for commercial sites.

You can get full details on the license here: http://developer.marklogic.com/licensing

You can download the software here: http://developer.marklogic.com/products

MarkLogic has published some documentation on using their server:

I really like the last one; it talks about using the free MarkLogic Express License, along with an Amazon EC2 instance, to run a MarkLogic-based website for only about $40 a month.

In order to show off the MarkLogic server’s searching prowess, MarkLogic has set up a website, http://markmail.org, that indexes and searches technology-oriented mailing lists.  It watches over 8500 lists, over 3500 of which are currently active.  The oldest one goes back t0 1992.  I have found this to be very handy in finding technical discussions, without the usual noise that you get when you use an ordinary search engine.

They also have a tutorial on using XQuery on MarkLogic 5 that lets you experiment with queries on a public, read-only, server containing five million emails: http://developer.marklogic.com/try/ninja/index
(Even if you are not interested in the MarkLogic 5 server, this site is great for practicing writing XQuery queries against a large store of data.)

MarkLogic seems to use an interesting business model. Apparently the idea is that one of two things will happen:

  1. By playing with their product and getting familiar with it, you will decide you really like it, and the next time you or your company needs a production database server, you will buy it, or
  2. If you are a startup building a website, you can use the free server to get your website started, and by the time you outgrow the free server, you will be making enough money from the site to buy a license from them (and by then, you will be locked into their server).

So far, I have been really impressed with the MarkLogic server, so it is not hard to believe that this is a successful marketing strategy.

Posted in Uncategorized | Comments Off on The MarkLogic XQuery server

Grouping data with multiple group membership using the Muench method

Grouping data with an XSLT 1.0 XML stylesheet is generally done using the Muench method, named after Steve Muench, who popularized the technique in his book Building Oracle XML Applications. Recently, I needed to group data where it was possible to be a member of multiple groups. I could not find anything in the literature about using the Muench method with multiple group membership, so after I figured out how, I decided to write it down, in case anyone else runs into the same problem.

Suppose you have a file of XML employee information, and you want display a list of employees, grouped by department. The employee information looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="deptlist.xsl" type="text/xsl"?>
<company>
   <emp id="01276" gender="Female">
      <dept>Accounting</dept>
      <name>
         <first>Sarah</first>
         <last>Collins</last>
      </name>
      <DOB>1978-04-11</DOB>
   </emp>
   <emp id="01001" gender="Male">
      <dept>Payroll</dept>
      <name>
         <first>Fred</first>
         <last>Smith</last>
      </name>
      <DOB>1969-10-20</DOB>
   </emp>
   <emp id="01711" gender="Male">
      <dept>Personnel</dept>
      <name>
         <first>Juan</first>
         <last>Muñoz</last>
      </name>
      <DOB>1980-01-04</DOB>
   </emp>
   <emp id="00941" gender="Male">
      <dept>Purchasing</dept>
      <name>
         <first>Sam</first>
         <last>Francisco</last>
      </name>
      <DOB>1972-08-31</DOB>
   </emp>
   <emp id="01868" gender="Female">
      <dept>Maintenance</dept>
      <name>
         <first>Betty</first>
         <last>Carson</last>
      </name>
      <DOB>1978-11-21</DOB>
   </emp>
</company>

Using the Muench method, the XML stylesheet, deptlist.xsl, might look like this:Us

01    <?xml version="1.0" encoding="UTF-8"?>
02    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
03    <xsl:key name="depts" match="emp" use="dept"/>
04    <xsl:template match="/">
05    <xsl:for-each select="//emp[generate-id(.)=generate-id(key('depts', dept)[1])]">
06        <xsl:sort select="dept"/>
07        <xsl:for-each select="key('depts', dept)">
08            <xsl:sort select="name/first"/>
09            <xsl:sort select="name/last"/>
10            <xsl:if test="position() = 1">
11                <xsl:element name="br"/>
12                <xsl:element name="h3">
13                    <xsl:value-of select="dept"/>
14                </xsl:element>
15            </xsl:if>
16            <p>
17                <xsl:value-of select="name/first"/>
18                <xsl:text> </xsl:text>
19                <xsl:value-of select="name/last"/>
20            </p>
21        </xsl:for-each>
22    </xsl:for-each>
23    </xsl:template>
24    </xsl:stylesheet>

On line 3, the xsl:key generates an index of <emp> elements, based on their <dept> values.

On line 5 in the xsl:for-each, it looks at each <emp> element, and generates a unique ID (a hash that includes position, so two different <emp> elements will always have different hashes, even if they are identical). It selects the <emp> element if its ID matches the first entry in the index for that department. The purpose of this is to select one <emp> element for each department. This lets us enumerate the departments, so we can have a group per department. The next line sorts these <emp> elements by department, so we get a sorted list of departments.

On line 7 in the xsl:for-each, it processes each <emp> element for the department we just selected. Lines 8 and 9 sort the <emp> elements by first name and last name.

Line 10 checks for the first <emp> for a department, and generates an HTML H3 heading with its department name.

Lines 16 through 20 display the first name, a space, and the last name for each employee.

The result looks like this:

Accounting

Sarah Collins

Maintenance

Betty Carson

Payroll

Fred Smith

Personnel

Juan Muñoz

Purchasing

Sam Francisco

That is how the Muench method normally works. But what if it is possible to belong to multiple groups? Suppose our employees can belong to more than one department? If we add additional <dept> tags to some of the employees, our XML employee information might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="deptlist.xsl" type="text/xsl"?>
<company>
   <emp id="01276" gender="Female">
      <dept>Accounting</dept>
      <dept>Payroll</dept>
      <name>
         <first>Sarah</first>
         <last>Collins</last>
      </name>
      <DOB>1978-04-11</DOB>
   </emp>
   <emp id="01001" gender="Male">
      <dept>Payroll</dept>
      <name>
         <first>Fred</first>
         <last>Smith</last>
      </name>
      <DOB>1969-10-20</DOB>
   </emp>
   <emp id="01711" gender="Male">
      <dept>Personnel</dept>
      <dept>Maintenance</dept>
      <name>
         <first>Juan</first>
         <last>Muñoz</last>
      </name>
      <DOB>1980-01-04</DOB>
   </emp>
   <emp id="00941" gender="Male">
      <dept>Purchasing</dept>
      <dept>Maintenance</dept>
      <name>
         <first>Sam</first>
         <last>Francisco</last>
      </name>
      <DOB>1972-08-31</DOB>
   </emp>
   <emp id="01868" gender="Female">
      <dept>Maintenance</dept>
      <dept>Sales</dept>
      <name>
         <first>Betty</first>
         <last>Carson</last>
      </name>
      <DOB>1978-11-21</DOB>
   </emp>
</company>

When we run this XML information against our XSL stylesheet, the results are not what we might expect, and definitely not what we want:

Payroll

Fred Smith
Sarah Collins

Purchasing

Sam Francisco
Juan Muñoz
Betty Carson

It is listing only the departments that have at least one employee who does not multiple <dept> elements. The reason is line 5, where it says

<xsl:for-each select=”//emp[generate-id(.)=generate-id(key(‘depts’, dept)[1])]”>

The highlighted part will be the value of the <dept> element for employees with a single department, but for the rest, it will be a concatenation of the values of the employee’s <dept> tags, which will not be found in the index. Thus, departments that consist solely of employees with multiple allegiances will not be selected.

Luckily, it does not take many changes to the XML stylesheet to make it work with multi-department employees. Here is the modified XML stylesheet:

01    <?xml version="1.0" encoding="UTF-8"?>
02    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
03    <xsl:key name="depts" match="emp" use="dept"/>
04    <xsl:template match="/">
05    <xsl:for-each select="//emp/dept[generate-id(parent::*)=generate-id(key('depts',.)[1])]">
06        <xsl:sort select="."/>
07        <xsl:variable name="thisdept" select="." />
08        <xsl:for-each select="key('depts',.)">
09            <xsl:sort select="name/first"/>
10            <xsl:sort select="name/last"/>
11            <xsl:if test="position() = 1">
12                <xsl:element name="br"/>
13                <xsl:element name="h3">
14                    <xsl:value-of select="$thisdept"/>
15                </xsl:element>
16            </xsl:if>
17            <p>
18                <xsl:value-of select="name/first"/>
19                <xsl:text> </xsl:text>
20                <xsl:value-of select="name/last"/>
21            </p>
22        </xsl:for-each>
23    </xsl:for-each>
24    </xsl:template>
25    </xsl:stylesheet>

We change line 5 so that instead of looking at each <emp> element, it looks at each <dept> within an <emp> element. We replace the current node (“.”) with the parent of the current node “parent::*”, so we are generating the same ID we did before, but we may process each <emp> element multiple times if it has more than one <dept> element. We compare that ID with the first ID in the index for an <emp> element that has that department, and select the <dept> element if they match. This selects one <emp> element per department, but since the for-each looks at each <dept> element, we can deal with having multiples.

On line 6, since we are sorting by <dept> elements, rather than <emp> elements, we change “dept” to the current node (“.”), since it already is a <dept> element.

On line 7, we set a variable to the department we are working with. We will need this in a minute.

On line 8, in the for-each where we are going through all the keys for the department we are working with, we change “dept” to the current node (“.”), since it already is a <dept> element.

On line 14, when we make the H3 heading, we use the value of the variable we set on line 7, so we get the name of the department we are working with, not a concatenation of all the departments for the first employee in the department.

Now when we run our XML employee information against our XML stylesheet, we get this result:

Accounting

Sarah Collins

Maintenance

Sam Francisco
Juan Muñoz
Betty Carson

Payroll

Fred Smith
Sarah Collins

Personnel

Juan Muñoz

Purchasing

Sam Francisco

Sales

Betty Carson

This is what we were looking for. The departments are listed in alphabetical order, and each employee is listed (in alphabetical order) under each department that he or she is a member of.

Posted in Uncategorized | Comments Off on Grouping data with multiple group membership using the Muench method