Extracting XML from Wiki pages

The Fitnesse testing framework uses a built-in wiki to allow end users to specify business-related tests. End users specify the tests and expected values in tables, which are then extracted and turned into test data.

Even if you are not using Fitnesse, a wiki can be a useful mechanism for entering data. Here are some useful characteristics of the Confluence wiki:

  • It is possible to restrict access to pages.
  • The wiki retains old versions of the pages. It has a built-in comparison function that makes it easy to see what changed.
  • The wiki can notify you by email when a page changes.
  • You don’t have to write a user interface. People know how to edit Wiki pages.

This last characteristic is perhaps the most important. You don’t have to write and maintain a bunch of data-entry pages, and you don’t have to train your users on how to use the pages.

There is a down side: you don’t get immediate feedback when you enter an invalid value (whatever invalid means for your application). You system would have to send the user an email indicating the error. Also, some discipline is required on the part of the users, but surprisingly little:

  • If there are multiple tables on the page, the order should not be changed.
  • The first row of the table should contain the column headings.

Users can add text before or after the table, and can add their own columns to the table.

To extract a table from a Confluence Wiki page, you first need to retrieve the page. You can do this with the curl utility, like this:

curl -u userid:password http://wiki.example.com/display/DEVWiki/Test+Team+Roster > WikiPageExtract.html

Here we are using basic authentication (-u userid:password). Confluence also supports OAuth validation, so you do not have to put your user ID and password in your scripts. However, it appears that you have to get an administrator to authorize the use of OAuth on your user ID. I have not tried this yet.

The HTML from Confluence Wiki pages, like HTML generated by most of applications, is not completely standard. It is necessary to run it through HTML tidy to clean it up and transform it to XML that we can work with. HTML tidy is available here, and more documentation on its use is available here. The Confluence Wiki uses some HTML5 tags:


Older versions of HTML tidy do not understand HTML5 tags, so it is necessary to set up a configuration file to define these tags. (There is a version of HTML tidy that supports HTML5 here, but I have not yet tried it.) The configuration file looks like this:

new-blocklevel-tags: header,section,nav

if we call our configuration file tidyhtml5.txt, this command will clean up the HTML and transform it to XHTML, which is also valid XML:

tidy -b -i  -asxml -config tidyhtml5.txt -o WikiPageExtract.xml WikiPageExtract.html

The following options are specified:

  • -b: strip out smart quotes and em dashes, etc.
  • -i: indent the output.
  • -asxml: convert HTML to well-formed XHTML.
  • -config: specifies the name of the configuration file.
  • -o: specifies the name of the output file.

HTML tidy will generate a lot of warning messages as it fixes up the HTML, but it should not generate any error messages.

The tidied XHTML begins with a standard XML header and the opening tag for the html element:

Note the XHTML namespace namespace on the tag. It will be necessary to define this when using XQuery/XPath to process the XHTML. The next thousand or so lines are scripts and such that we are not interested in, and the we come to a element with a class of “wiki-content“. That is where we will find the data from our table.

Here is an XQuery program extract table data from the XHTML.

(: Extract table from Confluence Wiki XHTML :) 

(: XHTML namespace :)
declare namespace x = "http://www.w3.org/1999/xhtml";

(: Input file :)
declare variable $InputFile external;

(: Table Number :)
declare variable $TableNumber external;

   Get the names of the columns

declare function local:GetColumnNames( $table as element()* ) 
   as element()* {

    for $rawName in $table//x:th

        (: We are going to use the column names for XML tags, which 
           cannot have embedded spaces or slashes. So we translate 
           embedded spaces and slashes to nulls :)
        let $name := 
           translate( normalize-space( data($rawName) ), ' /', '' ) 
        return <col>{$name}</col>


let $wikiData := doc( $InputFile )

let $body := $wikiData/x:html/x:body
let $wc   := $body//x:div[@class="wiki-content"]
let $title := normalize-space($wc/x:h2[$TableNumber])
let $firstTable := 

let $colNames := local:GetColumnNames( $firstTable )

let $tableData :=
    for $row in $firstTable//x:tr[position() > 1]  (: Skip first 
                                                      row: title 
                                                      row :)
       let $cols :=
          for $col at $colNum in data($row/x:td)
            let $colName := $colNames[$colNum]/text()
            return element{$colName} {normalize-space($col)}
    return <row>{$cols}</row>

return  <table>

The two external variables are TableName, which specifies the name of the file containing the extracted XHTML, and TableNumber, which specifies which table on the page is desired, counting from 1. The title of the table is captured from the XHTML <H2> element that precedes it. Here is how the program would be run via the open-source Zorba XQuery processor:

zorba -f -q program-file-name.xq -o output-file-name.xml 
       --lib-path library-path 
       --external-variable InputFile:=input-file-name.xml 
       --external-variable TableNumber:=1 

The processed XML looks like this:

The elements in each row are based on the column heading in the first row of the table, with spaces and slashes removed.


Leave a Reply

Your email address will not be published. Required fields are marked *