For scientific programmers, it has long been a goal to have mathematical equations in their programs displayed as they would normally write them, with superscripts and subscripts in the right places, and that sort of thing.

One early attempt was the HAL/S language, developed in the 1970s for the NASA Space Shuttle program.

An equation like this:

X = A ** 2 + B$(I) ** 2

would be displayed like this:

E       2     2
M  X = A  + B
S            I

The letters E, and S at the left margin indicated exponent and subscript lines. The line marked M was the main line. (I'm not sure if M stood for main or not.) C was used for comment lines.

That was a start, but the fixed line spacing available in 1970s line printers violated the norms of mathematical printing and make the resulting equations difficult to read accurately. But today, half a century later, we have the technology to do a much better job of this. (By the way, if I am understanding what I have been reading about HAL/S correctly, it was also possible to enter the lines that way, putting E, M, S or C in column one. Since punch cards were still the prevalent form of input at that time, that seems a bit brutal. since any change to a line might require re-punching three cards.)

This could be done much more smoothly today. Once the language parser has figured out the line, it could generate MathML to display it properly. For example, the above line, would generate the following MathML:

<m:math display='block'>
 <m:mrow>
  <m:mi>X</m:mi><m:mo>=</m:mo><m:msup>
   <m:mi>A</m:mi>
   <m:mn>2</m:mn>
  </m:msup>
  <m:mo>+</m:mo><m:msubsup>
   <m:mi>B</m:mi>
   <m:mi>I</m:mi>
   <m:mn>2</m:mn>
  </m:msubsup>
  </m:mrow>
</m:math>

and would display like this:

upper-case equation: X equals A squared plus B sub I squared

Actually, since lower-case letters, which were not available in the keypunch days are possible, it would probably be re-coded like this,

lower-case equation: x equals a squared plus b sub i squared

which would be more in keeping with norms of mathematical notation.

The generated MathML can be easily displayed using the MathJax browser plugin.

 

Even if we have to wait for someone to augment a compiler to produce MathML-enhanced output, we can still get nice looking equations in the Doxygen pages that describe our programs. By specifying USE_MATHJAX in the Doxygen configuration, you can include equations in LaTex form in your Doxygen prologue comments by surrounding them with \f tags. This is the preferred method for people who work with LaTex a lot and are comfortable with its syntax. If you prefer to insert your equations as MathMl, you can surround them with \htmlonly and \endhtmlonly tags. It will be included verbatim in the HTML output, to be interpreted by MathJax.

Either way, you do not have to know LaTex or MathML syntax. You can use a tool like MathType, by Wirris, to build your equations, and them paste the resulting LaTex or MathML into your prologue.  

Maxwell-Faraday equation

Typesetting mathematics has always been a chore, for as long as typesetting has been a thing. These articles give some insight into mathematical typesetting in the days of hot-metal type: From boiling lead and black art: An essay on the history of mathematical typography, by Eddie Smith, and The Monotype 4-Line System for Setting Mathematics, by Daniel Rhatigan. Things got better with the advent of computerized typesetting, but there were still problems:

  • How do you represent the mathematical symbols?
  • How do you represent the equations in a portable way?
  • How do you make some symbols, like summation and integral signs, expand to fit the following part of the equation?

Unicode took care of the first problem, by providing code points for an almost innumerable number of symbols. (The value of this, in all sorts of fields, should not be underestimated. For example, I have a book, Medieval Slavic Mannuscripts and SGML: Problems and Perspectives, by Marin Drinov, from the pre-Unicode era. Much of the book deals with how to represent Old Slavic characters in ASCII.)

The second and third problems were addressed by LaTeX (generally pronounced like LAY-tek or LAH-tek), a mathematical typesetting system that has become the de facto standard in the scientific and mathematical communities. (It was released in 1984, and is based on the TeX typesetting system introduced by Donald Knuth in 1978.) The Maxwell-Faraday equation shown at the top of this article would be represented like this in LaTex. (The equation states that a time-varying magnetic field is always accompanied by a spatially-varying, non-conservative electric field, and vice-versa.)

\oint_C {E \cdot d\ell = - \frac{d}{{dt}}} \int_S {B_n dA}

This is a compact notation, but not very easy to read unless you have studied LaTex. It is also not easy to process by machine.

MathML is another way of addressing the second and third problems, using an XML-based language. This makes it accessible to XML-based tools, such as XSLT and XQuery. Here is the Maxwell-Faraday equation expressed in MathML:

<?xml version="1.0" encoding="UTF-8"?>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" display='block'>
   <m:mrow>
      <m:mstyle displaystyle='true'>
         <m:mrow>
            <m:msub>
               <m:mo>&#x222E;</m:mo><!-- Unicode Character 'CONTOUR INTEGRAL' -->
               <m:mi>C</m:mi>
            </m:msub>
            <m:mrow>
               <m:mi>E</m:mi>
               <m:mo>&#x22C5;</m:mo><!-- Unicode Character 'DOT OPERATOR' -->
               <m:mi>d</m:mi>
               <m:mi>&#x2113;</m:mi><!-- Unicode Character 'SMALL SCRIPT L' -->
               <m:mo>=</m:mo>
               <m:mo>&#x2212;</m:mo><!-- Unicode Character 'MINUS SIGN' -->
               <m:mfrac>
                  <m:mi>d</m:mi>
                  <m:mrow>
                     <m:mi>d</m:mi>
                     <m:mi>t</m:mi>
                  </m:mrow>
               </m:mfrac>
            </m:mrow>
         </m:mrow>
      </m:mstyle>
      <m:mstyle displaystyle='true'>
         <m:mrow>
            <m:msub>
               <m:mo>&#x222B;</m:mo><!-- Unicode Character 'INTEGRAL' -->
               <m:mi>S</m:mi>
            </m:msub>
            <m:mrow>
               <m:msub>
                  <m:mi>B</m:mi>
                  <m:mi>n</m:mi>
               </m:msub>
               <m:mi>d</m:mi>
               <m:mi>A</m:mi>
            </m:mrow>
         </m:mrow>
      </m:mstyle>
   </m:mrow>
</m:math>

Although MathML seems more intuitive (at least to me) and regular than LaTeX, it is extremely monotonous entering equations by hand. In manually coding an equation of any complexity, one quickly gets in the many nestings. But, like XML itself,  MathML is not intended to be entered manually.

So why use MathML? For one thing, it makes it possible to use XML stylesheets (XSLT) to automatically transform equations. For example, the tangent of X is represented in the US as "tan(x)", but in some countries, it is represented as "tg(x)". With XSLT, you can easily transform your equations to suit local preferences. And since you can do it in a stylesheet, you do not have to change each equation individually. There are two different forms of MathML: Content MathML and Presentation MathML. The former is concerned with describing equations in ways that they can be consumed by mathematical engines, like Mathematica. The latter is concerned with how the equations are displayed. If you need to deal with equations in both modes, you can use an XML stylesheet to transform Content MathML into Presentation MathML.

There are editors that make it possible to enter equations in MathML by selecting the pieces from menus, and there are even tools that will convert LaTeX to MathML. My favorite MathML editor is MathType, by WIRIS Math & Science. It allows you to build equations by picking pieces from menus. Here is a screenshot of its user interface:

It also has a "math input" panel, where you can draw your equation and it will OCR it. I suspect that this feature works better for those with better handwriting than mine.

Once you have your equation entered, you can cut it in either MathML or LaTex format (or several others) and paste it into your XML document. MathType can also take LaTex and convert it to MathML. So if you are accustomed to entering equations in LaTeX, but want to use them in your XML workflow, this may be the tool for you.

The LibreOffice Equation Editor is an open-source tool that can export MathML. I do not have any experience with it yet, but it looks promising.

Even with MathML, displaying equations on the web has been difficult, because of spotty and inconsistent support among the various web browsers. MathJax, a JavaScript display engine that works in all browsers, has improved this situation, as have the STIX Fonts, free Postscript type 1 fonts for math. (By the way, in case you want to enter your equations as LaTeX, and are not interested in XML workflow, MathJax can display LaTeX, too.)

And if you want to print your equations, you can use the JEuclid plug-in for Apache FOP processor to turn your XML into PDF files. (Be warned, this is not quite as easy as it sounds. Apache FOP reads XSL-FO [Formatting Objects], which describes the page layout. Setting up page templates will be part of the setup for your workflow.)

Here is an excellent reference for MathML: Ry's MathML Tutorial [Kindle Edition], by Ryan Hodson (2014).

There is another XML-based mathematical representation called OpenMath. It is comparable to Content MathML, but concentrates on defining the semantics of each symbol. While OpenMath and MathML were once looked at as competing technologies, the two organizations have been working towards improving their compatibility, and the ability to convert between them, so that each can be used where it is stronger.

Note: This post is being shared with my other blog, TryingToKeepItAgile, since it pertains to both agile development and XML.

Teams that are practicing continuous integration often have a central information radiator to indicate when the build fails. Several different kinds have been used, such as an Ambient Orb (a translucent globe that can glow in different colors), lava lamps, talking teddy bears, and all sorts of other things. The idea is that when the information radiator indicates that the build is broken, it is everyone's first responsibility to get the build working again, so that problems are not compounded by checking in more changes on top of the code that broke the build.

I was working with a team that was not yet ready for continuous integration, but I had some long-running test streams, where each test job, if successful, would submit the next one. Rather than constantly monitoring the test stream, I wanted to have a light, similar to what continuous integration teams use, that would turn amber when my test stream was running, green if all the tests passed, and red if it failed along the way.

First, I needed a light. I chose a red-green-amber LED from Delcom Products. It comprised the following parts:

  • 904017 USB HID Visual Signal Indicator RGY Black Case w/ Center Cable ($91.35)
  • 804136 Straight Al. Mounting Pole 36" ($16.80)
  • 804301 Mounting Pole Base ($9.35)

The total cost was $117.50. I later replaced the mounting pole base with a Grooming Arm Clamp ($17.05), for a total cost of $125.20. This allowed the light to be clamped to the edge of the desk, whereas the mounting pole base had to be screwed into a block of wood. (In case you are not familiar with grooming arms, they are used in grooming dogs. A grooming arm is a curved pole that clamps to the table, and the dog's leash is hooked to it to keep the dog in place.) Here is a picture of the test light in use:

To make things even more interesting, the test jobs were running on an IBM mainframe (z/OS) system, and the light was a USB device that plugged into my desktop computer, so I had to establish communication between the two. I already had a test framework on the mainframe that would monitor the results of each test job. If the results were as expected, it would submit the next job in the sequence; if they were not, it would stop. I modified the framework to update a file, LIGHT.DATA.XML. The first job in the sequence updates it with a status of "running". Any failing job updates it with a status of "failed". If all jobs are successful, the final job updates it with a status of "passed". Here is an example of the file. The <jobname> and <system> elements are there for future expansion. Note that the encoding is specified as "EBCDIC-CP-US", because the IBM mainframe uses the EBCDIC character code, rather than ASCII or UTF-8.

 

On the PC, there is a Windows batch script, LightDriver.bat, that loops continuously, calling the UpdateLight.bat script, waiting 15 seconds, and repeating the procedure. It looks like this:

@echo off
rem Call the UpdateLight script every 15 seconds to update the light status

:loop
timeout 15
call UpdateLight.bat > nul
goto loop

 

The UpdateLight.bat script uses FTP to retrieve the LIGHT.DATA.XML file from the mainframe. FTP translates the file from EBCDIC to ASCII as it is downloading it, but the XML header still says its encoding is EBCDIC-CP-US, so the sed stream editor is used to change the encoding specification to UTF-8.

sed "s/EBCDIC-CP-US/UTF-8/" dynamic/light_data_ebcdic.xml > dynamic/light_data.xml

 

The encoding is really ASCII, rather then UTF-8, but it is close enough, and it makes subsequent XML processing happier. (By the way, I probably would have been OK if I had just specified UTF-8 in the XML declaration in the EBCDIC file on MVS. But I wanted to allow for the possibility of processing the XML on MVS also, and that would require a valid specification of the EBCDIC encoding.) The modified XML file looks like this:

 

The modified XML file is processed by an XQuery program, UpdateLight.xq, using the Zorba XQuery processor, to produce a new Windows batch script, setlight.bat:

zorba -f -q C:/SignalLight/eclipse/UpdateLight.xq 
      -o C:/SignalLight/dynamic/setlight.bat 
      --external-variable input:="C:/SignalLight/dynamic/light_data.xml" 
      --serialize-text

 

The XQuery program looks like this:

(: The $input variable is passed into Zorba, and determines the input file name :)
declare variable $input external;

(: Extract the status of the test run from the rest of the XML data :)
let $testStatus := doc( $input )/test/status/text()

(: Set up a variable to represent the Newline character :)
let $nl := "&#10;"

(: Set the light color, based on the test status :)
let $testColor := 
    if ( $testStatus = "passed" )
    then "green"
    else if ($testStatus = "failed")
    then "red"
    else if ($testStatus = "running")
    then "amber"
    else "unknown"

(: Set the parameters to the light utility, based on the color :)
let $lightParm :=
   if ( $testColor = "green" )
   then "1"
   else if ( $testColor = "red" )
   then "2"
   else if ( $testColor = "amber" )
   then "4"
   else "0"

(: Generate the lines for the light-setting script :)
let $line01 := concat( "@echo off", $nl )    
let $line02 := concat( "rem Turn signal light ", $testColor, $nl )
let $line03 := $nl
let $line04 := concat( "usbcmdapx64 0 0 101 12 ", $lightParm, " 7", $nl )

(: Assemble all the lines. We do this with concat, rather than just returning
   a sequence, to avoid a blank at the beginning of the second and 
   subsequent lines :)
let $lines  := concat( $line01, $line02, $line03, $line04 )

return {$lines}

 

The program extracts the value of the <status> element, and maps its three values, running, passed, or failed, into corresponding colors, amber, green, or red. If the <status> has an unexpected value, the color is set to dark. The color is then mapped to a parameter for the command-line utility that sets the color of the LED light, 0 for dark, 1 for green, 2 for red, and 4 for amber. The Windows command-line utility that comes with the Delcom light is USBCMDAPx64.exe. The program takes the following parameters, specified in this order:

  • v - Verbose. If specified, this prints more informaton.
  • TID - type. Specifying 0 means all.
  • SID - serial ID. Specifying 0 means all.
  • Major command - Specifying 101 means send an eight-byte write command.
  • Minor command - Specifying 12 means set or reset the port 1 pins individually.
  • LSBDATA - specifies the pins to reset. (Resetting a pin turns it on.)
  • MSBDATA - specifies the pins to set. (Setting a pin turns it off.)

LSBDATA takes precedence, so specifying 7 for MSBDATA turns off all the pins, and then any pins specified in LSBDATA are turned on. The pins for the various colors are:

  • 1 - Green
  • 2 - Red
  • 4 - Amber

It is possible to turn more than one color on at the same time, but it does not look very good. The following commands will set the light to the various colors:

Green:  usbcmdapx64 0 0 101 12 1 7 
Red:    usbcmdapx64 0 0 101 12 2 7 
Amber:  usbcmdapx64 0 0 101 12 4 7 
Dark:   usbcmdapx64 0 0 101 12 0 7

 

The highlighted parameters are the ones that were set in the XQuery program. The resulting Windows batch script looks like this:

@echo off
rem Turn signal light red

usbcmdapx64 0 0 101 12 2 7

 

The generated script is then called:

call dynamic/setlight.bat

 

This two-process, generating a dynamic Windows script and then calling it, is necessary because it is not easy to call programs out of an XQuery program. This process worked well, with the LED light on my desk letting me know the status of my test jobs. But when the test stream failed, I really wanted to know which job failed. So I got a BetaBrite Prism scrolling LED sign. (These are sometimes called ticker-tape signs or Times Square signs.) Here are a couple of images of one, from a previous blog post, XML for LEDs:

 

The BetaBrite Prism connects to the PC via USB. The previous post was about BetaBright Classic signs, which used an RS-232 connection. The method described in that post does not work for the USB signs. Luckily, the folks at Industro Logic have a free command-line program to control BetaBrite Prism signs. You can download it here. The file you want is PRISMCOM.EXE. If you run the program with no parameters, it gives you a help screen that shows you the options, but the basics are that you enclose control information in braces. In our case, we want to specify the color of the message, so the command would like like this:

prismcom usb {red}Test job TEST1234 failed

 

I modified the UpdateLight.bat script to run another XQuery program that takes the same input XML file and creates a Windows batch script, setsign.bat, that contains the prismcom command to send the message to the BetaBrite sign. As before, the script is then called. Because the XQuery programs use XPath expressions to select the parts of the XML file that they need, I can add new information to the LIGHT.DATA.XML file without breaking anything. For example, I decided I wanted to have the job number, as well as its name, displayed on the sign, so I changed the test framework to add a <jobnum> element to the XML file. Once that was in place, I could update the XQuery program for the sign at my leisure. Since the LightDriver.bat script invokes the UpdateLight.bat script every 15 seconds, it is easy to play with the message on the sign to see what works best. You edit the XQuery program, save it, and within 15 seconds you can see the result. For example, after I added the job number, I had a message that looked like this:

{red}Test job TEST1234{amber}({red}JOB01723{amber}){red}failed

 

The parentheses surrounding the job number were amber to better set it off from the job name. The messages for running and passed status were similar, except for the colors. The BetaBright sign scrolls rather quickly, and it was hard to read the job name and number as it scrolled by. So I changed the message to just display the job name and number, in the appropriate color:

{hold}{red}TEST1234 JOB01723

 

The {hold} tells the sign not to scroll. Instead, it alternates displaying the job name and the job number. I decided that the rest of the message was extraneous. The sign is located right next to the light, so people who are interested in it already know what it means. (It actually would not even need to change color, but having it match the color of the light looks better, and reassures people that the two are in sync.)

But what about color-blind coworkers? Over-reliance on color coding can be a problem for them. Right now, no one using this information radiator has that problem. But should it occur in the future, it can be easily solved by changing to a different kind of light:

This is a Patlite LED stack light. They are available on eBay, frequently for less than $50. (This light has four colors, but you really need only three.) They are not USB devices, but just have wires, one for each color plus a common wire. You will need a 24 VDC power supply, a USB relay board from Numato Labs, and a little bit different programming. (The relay boards looks like a virtual COM port.) With a stack light like this, people can tell what the light is saying by the position of the illuminated, even if they cannot differentiate between the colors.

This method lets the mainframe control the light and the sign without a bunch of TCP programming, or that sort of thing, so it can be set up very quickly. Using an XML file as the communication medium makes it easy to extend: new elements can be added to the XML file, and information radiators that are not interested in them will just ignore them. With proper locking on the mainframe side, the XML file could even convey the status of several test streams, and individual information radiators could decide whether they wanted to report the status of a particular stream or all of them.

I was working on an educational game about Medieval England, with an XML based GUI. For example, the new player would be asked to choose a name, a profession, and other characteristics. These were read from XML files in order to populate the list of choices. The researcher on the team would send us spreadsheets with lists of medieval jobs, names, and so on, and I would have to convert them to XML. For the proof of concept, I just cut and pasted the information into an XML editor and manually added the XML tags. This was clearly too tedious and error prone for production, so I needed an automated conversion. A LibreOffice spreadsheet, which has an extension of ods, is really a zipped file under the covers. If you unzip the file, there is a file inside called content.xml that contains the data of the spreadsheet. The following XQuery program extracts the information from that file:

  • The first line of each sheet is assumed to be column names. These names, with any spaces translated to hyphens, are used for the XML element names for each column.
  • The name of the sheet is used for the element that encloses the data for each row.
  • The name of the sheet, suffexed with an "s", is used for the element that encloses the data for the sheet.
  • A <batch> element encloses the data for all the sheets.

Here is the XQuery program, ods2xml2.xq:

(:~
 : This program converts a LibreOffice Calc spreadsheet into an XML file:
 :
 : - The extracted data is enclosed in a  element.
 : - The data for each sheet is enclosed in an element with the same name as the sheet,
 : suffixed with an "s".
 : - The data for each row is enclosed in an element with the same name as the sheet.
 : - The first row is assumed contain the name of each column. Within each data row,
 : the data for each column is enclosed in an element that matches the name of the
 : column. The program currently does not deal well with special characters in column
 : names, since they are invalid in XML element names, so be careful.
 :
 : Input:
 :
 : The input to this program is the content.xml file from a LibreOffice ods file. The ods
 : suffix should be renamed to zip, the file unzipped, and the content.xml file extracted.
 : If the file has a different name, you can specify it via the "filename" external variable.
 :
 : @author Lynn Grant
 : @version 2017-10-02
 :
 :)

(:
 :
 : License: Public Domain
 :
 : To the extent possible under law, Lynn Grant has waived all copyright and related or
 : neighboring rights to this program. This work is published from: United States.
 :
 :)

(: Declare namespaces used by LibreOffice :)
 declare namespace office="urn:oasis:names:tc:opendocument:xmlns:office:1.0";
 declare namespace style="urn:oasis:names:tc:opendocument:xmlns:style:1.0";
 declare namespace text="urn:oasis:names:tc:opendocument:xmlns:text:1.0";
 declare namespace table="urn:oasis:names:tc:opendocument:xmlns:table:1.0";
 declare namespace draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
 declare namespace fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0";
 declare namespace xlink="http://www.w3.org/1999/xlink";
 declare namespace dc="http://purl.org/dc/elements/1.1/";
 declare namespace meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0";
 declare namespace number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0";
 declare namespace presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0";
 declare namespace svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0";
 declare namespace chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0";
 declare namespace dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0";
 declare namespace math="http://www.w3.org/1998/Math/MathML";
 declare namespace form="urn:oasis:names:tc:opendocument:xmlns:form:1.0";
 declare namespace script="urn:oasis:names:tc:opendocument:xmlns:script:1.0";
 declare namespace ooo="http://openoffice.org/2004/office";
 declare namespace ooow="http://openoffice.org/2004/writer";
 declare namespace oooc="http://openoffice.org/2004/calc";
 declare namespace dom="http://www.w3.org/2001/xml-events";
 declare namespace xforms="http://www.w3.org/2002/xforms";
 declare namespace xsd="http://www.w3.org/2001/XMLSchema";
 declare namespace xsi="http://www.w3.org/2001/XMLSchema-instance";
 declare namespace rpt="http://openoffice.org/2005/report";
 declare namespace of="urn:oasis:names:tc:opendocument:xmlns:of:1.2";
 declare namespace xhtml="http://www.w3.org/1999/xhtml";
 declare namespace grddl="http://www.w3.org/2003/g/data-view#";
 declare namespace tableooo="http://openoffice.org/2009/table";
 declare namespace drawooo="http://openoffice.org/2010/draw";
 declare namespace calcext="urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0";
 declare namespace loext="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0";
 declare namespace field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0";
 declare namespace formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0";
 declare namespace css3t="http://www.w3.org/TR/css3-text/";

(: Declare external variables :)
 declare variable $filename external := "content.xml";

(: Declare local functions :)

(:~
 : This function looks up the name of a column.
 :
 : @param $hdrrow The first row (table:table-row) of the table, which contains
 : the column names.
 :
 : @param $colnum The number of the column. The first column is numbered 1.
 :
 : @return A string that contains the name of the column, with spaces translated to
 : hyphens, or null if the column number is out of range. Note that the
 : function does not currently filter out characters that are invalid in
 : XML element names.
 :
 :)

declare function local:getColName($hdrrow as element()*, $colnum as xs:integer) as xs:string
 {
 let $name := fn:translate(data($hdrrow/table:table-cell[$colnum]/text:p), " ", "-")
 return
 if ($name)
 then $name
 else ""
 };

(:~
 : This function processes a row in a table. For each column, it creates an element
 : with the same name as the column. Spaces are normalized within the value of the
 : element.
 :
 : @param $row is the row (table:table-row) to be processed.
 :
 : @param $hdrrow The first row (table:table-row) of the table, which contains
 : the column names.
 :
 : @return The data from the row.
 :
 :)

declare function local:processRow($row as element()*, $hdrrow as element()*) as element()*
 {
 let $cdata :=
 for $cell at $cellnum in $row/table:table-cell
 let $name := local:getColName($hdrrow, $cellnum)
 return
 if ($name)
 then element {$name} {fn:normalize-space(data($cell))}
 else ()
 return $cdata

};

(:~
 :
 : This function processes all the rows in a table (table:table).
 :
 : @param $table The table (table:table) to be processed.
 :
 : @return The processed data from the table.
 :
 :)

declare function local:processTable($table as element()*) as element()*
 {

let $sheet := data($table/@table:name)

let $sheets := fn:concat($sheet, "s")

let $rows := $table/table:table-row

(: Get the first row, which has the names of the columns :)
 let $hdrrow := $rows[1]

(: Extract cells from data rows :)
 let $more :=
 for $row in $rows[position() > 1] (: Skip title row :)
 return element {$sheet} {local:processRow($row, $hdrrow)}

return {element {$sheets} {$more}}

};

(: Mainline :)

(: Extract all the tables from the spreadsheet :)
 let $tables := doc($filename)/office:document-content/office:body/office:spreadsheet/table:table

(: Process each table :)
 let $batch :=
 for $table in $tables
 return local:processTable($table)

(: Return batch of tables :)
 return {$batch}

 

Here is an example of a Windows bat file that uses the WinZip Command Line Support Add-On to extract the content.xml file, then runs the converter under Zorba to extract the data:

@echo off
copy %1.ods %1.zip
move %1.zip data/%1.zip
wzunzip -o data/%1.zip data "content.xml"
cd data
del %1.zip
cd ..
zorba -f -q ods2xml2.xq --indent --external-variable filename:="data/content.xml" > data/batch.xml

 

In a build pipeline, this can be run whenever the researcher commits a new version of the spreadsheet, to automatically populate the XML files. Since the XML for all the sheets is contained in a single XML file, it will be necessary to select the desired table. This can easily be done with an XQuery or XSLT program. By the way, the same technique, with a few modifications, can be used to extract data from Microsoft Excel spreadsheet (xlsx) files. The two major differences are that the names of the XML tags are different, though similar in function, and empty cells are handled differently. In LibreOffice Calc, empty cells are represented by an empty <table:table-cell> element. In Microsoft Excel, empty cells are omitted, and the next non-empty cell has an attribute that indicates its column number. I will post a version of the script for Microsoft Excel files sometime in the future.

The Fitnesse testing framework uses a built-in wiki to allow end users to specify business-related tests. End users specify the tests and expected values in tables, which are then extracted and turned into test data.

Even if you are not using Fitnesse, a wiki can be a useful mechanism for entering data. Here are some useful characteristics of the Confluence wiki:

  • It is possible to restrict access to pages.
  • The wiki retains old versions of the pages. It has a built-in comparison function that makes it easy to see what changed.
  • The wiki can notify you by email when a page changes.
  • You don't have to write a user interface. People know how to edit Wiki pages.

This last characteristic is perhaps the most important. You don't have to write and maintain a bunch of data-entry pages, and you don't have to train your users on how to use the pages.

There is a down side: you don't get immediate feedback when you enter an invalid value (whatever invalid means for your application). You system would have to send the user an email indicating the error. Also, some discipline is required on the part of the users, but surprisingly little:

  • If there are multiple tables on the page, the order should not be changed.
  • The first row of the table should contain the column headings.

Users can add text before or after the table, and can add their own columns to the table.

To extract a table from a Confluence Wiki page, you first need to retrieve the page. You can do this with the curl utility, like this:

curl -u userid:password http://wiki.example.com/display/DEVWiki/Test+Team+Roster > WikiPageExtract.html

 

Here we are using basic authentication (-u userid:password). Confluence also supports OAuth validation, so you do not have to put your user ID and password in your scripts. However, it appears that you have to get an administrator to authorize the use of OAuth on your user ID. I have not tried this yet.

The HTML from Confluence Wiki pages, like HTML generated by most of applications, is not completely standard. It is necessary to run it through HTML tidy to clean it up and transform it to XML that we can work with. HTML tidy is available here, and more documentation on its use is available here. The Confluence Wiki uses some HTML5 tags:

<header>
<nav>
<section>

 

Older versions of HTML tidy do not understand HTML5 tags, so it is necessary to set up a configuration file to define these tags. (There is a version of HTML tidy that supports HTML5 here, but I have not yet tried it.) The configuration file looks like this:

new-blocklevel-tags: header,section,nav

 

If we call our configuration file tidyhtml5.txt, this command will clean up the HTML and transform it to XHTML, which is also valid XML:

tidy -b -i  -asxml -config tidyhtml5.txt -o WikiPageExtract.xml WikiPageExtract.html

 

The following options are specified:

  • -b: strip out smart quotes and em dashes, etc.
  • -i: indent the output.
  • -asxml: convert HTML to well-formed XHTML.
  • -config: specifies the name of the configuration file.
  • -o: specifies the name of the output file.

HTML tidy will generate a lot of warning messages as it fixes up the HTML, but it should not generate any error messages.

The tidied XHTML begins with a standard XML header and the opening tag for the html element:

 

Note the XHTML namespace namespace on the <html> tag. It will be necessary to define this when using XQuery/XPath to process the XHTML. The next thousand or so lines are scripts and such that we are not interested in, and the we come to a <div> element with a class of "wiki-content". That is where we will find the data from our table. 

div with class of 'wiki-content'

 

Here is an XQuery program extract table data from the XHTML.

(: Extract table from Confluence Wiki XHTML :) 

(: XHTML namespace :)
declare namespace x = "http://www.w3.org/1999/xhtml";

(: Input file :)
declare variable $InputFile external;

(: Table Number :)
declare variable $TableNumber external;

(: 
   Get the names of the columns
:)

declare function local:GetColumnNames( $table as element()* ) 
   as element()* {

    for $rawName in $table//x:th

        (: We are going to use the column names for XML tags, which 
           cannot have embedded spaces or slashes. So we translate 
           embedded spaces and slashes to nulls :)
        let $name := 
           translate( normalize-space( data($rawName) ), ' /', '' ) 
        return <col>{$name}</col>

};

let $wikiData := doc( $InputFile )

let $body := $wikiData/x:html/x:body
let $wc   := $body//x:div[@class="wiki-content"]
let $title := normalize-space($wc/x:h2[$TableNumber])
let $firstTable := 
   $wc/x:div[@class="table-wrap"][$TableNumber]/x:table

let $colNames := local:GetColumnNames( $firstTable )

let $tableData :=
    for $row in $firstTable//x:tr[position() > 1]  (: Skip first 
                                                      row: title 
                                                      row :)
       let $cols :=
          for $col at $colNum in data($row/x:td)
            let $colName := $colNames[$colNum]/text()
            return element{$colName} {normalize-space($col)}
    return <row>{$cols}</row>

return  <table>
            <title>{$title}</title>
            {$tableData}
        </table>

 

The two external variables are TableName, which specifies the name of the file containing the extracted XHTML, and TableNumber, which specifies which table on the page is desired, counting from 1. The title of the table is captured from the XHTML <H2> element that precedes it. Here is how the program would be run via the open-source Zorba XQuery processor:

 zorba -f -q program-file-name.xq -o output-file-name.xml 
       --lib-path library-path 
       --external-variable InputFile:=input-file-name.xml 
       --external-variable TableNumber:=1 
       --indent  
       --timing

 

The processed XML looks like this: 

Sample extracted table data

 

The elements in each row are based on the column heading in the first row of the table, with spaces and slashes removed.