Converting LibreOffice Calc spreadsheets to XML

I was working on an educational game about Medieval England, with an XML-based GUI. For example, the new player would be asked to choose a name, a profession, and other characteristics. These were read from XML files in order to populate the list of choices. The researcher on the team would send us spreadsheets with lists of medieval jobs, names, and so on, and I would have to convert them to XML. For the proof of concept, I just cut and pasted the information into an XML editor and manually added the XML tags. This was clearly too tedious and error-prone for production, so I needed an automated conversion. A LibreOffice spreadsheet, which has an extension of .ods, is really a zipped file under the covers. If you unzip the file, there is a file inside called content.xml that contains the data of the spreadsheet. The following XQuery program extracts the information from that file:

  • The first line of each sheet is assumed to be column names. These names, with any spaces translated to hyphens, are used for the XML element names for each column.
  • The name of the sheet is used for the element that encloses the data for each row.
  • The name of the sheet, suffexed with an “s”, is used for the element that encloses the data for the sheet.
  • A <batch> element encloses the data for all the sheets.

Here is the XQuery program, ods2xml2.xq:

(:~
 : This program converts a LibreOffice Calc spreadsheet into an XML file:
 :
 : - The extracted data is enclosed in a  element.
 : - The data for each sheet is enclosed in an element with the same name as the sheet,
 : suffixed with an "s".
 : - The data for each row is enclosed in an element with the same name as the sheet.
 : - The first row is assumed contain the name of each column. Within each data row,
 : the data for each column is enclosed in an element that matches the name of the
 : column. The program currently does not deal well with special characters in column
 : names, since they are invalid in XML element names, so be careful.
 :
 : Input:
 :
 : The input to this program is the content.xml file from a LibreOffice ods file. The ods
 : suffix should be renamed to zip, the file unzipped, and the content.xml file extracted.
 : If the file has a different name, you can specify it via the "filename" external variable.
 :
 : @author Lynn Grant
 : @version 2017-10-02
 :
 :)

(:
 :
 : License: Public Domain
 :
 : To the extent possible under law, Lynn Grant has waived all copyright and related or
 : neighboring rights to this program. This work is published from: United States.
 :
 :)

(: Declare namespaces used by LibreOffice :)
 declare namespace office="urn:oasis:names:tc:opendocument:xmlns:office:1.0";
 declare namespace style="urn:oasis:names:tc:opendocument:xmlns:style:1.0";
 declare namespace text="urn:oasis:names:tc:opendocument:xmlns:text:1.0";
 declare namespace table="urn:oasis:names:tc:opendocument:xmlns:table:1.0";
 declare namespace draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
 declare namespace fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0";
 declare namespace xlink="http://www.w3.org/1999/xlink";
 declare namespace dc="http://purl.org/dc/elements/1.1/";
 declare namespace meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0";
 declare namespace number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0";
 declare namespace presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0";
 declare namespace svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0";
 declare namespace chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0";
 declare namespace dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0";
 declare namespace math="http://www.w3.org/1998/Math/MathML";
 declare namespace form="urn:oasis:names:tc:opendocument:xmlns:form:1.0";
 declare namespace script="urn:oasis:names:tc:opendocument:xmlns:script:1.0";
 declare namespace ooo="http://openoffice.org/2004/office";
 declare namespace ooow="http://openoffice.org/2004/writer";
 declare namespace oooc="http://openoffice.org/2004/calc";
 declare namespace dom="http://www.w3.org/2001/xml-events";
 declare namespace xforms="http://www.w3.org/2002/xforms";
 declare namespace xsd="http://www.w3.org/2001/XMLSchema";
 declare namespace xsi="http://www.w3.org/2001/XMLSchema-instance";
 declare namespace rpt="http://openoffice.org/2005/report";
 declare namespace of="urn:oasis:names:tc:opendocument:xmlns:of:1.2";
 declare namespace xhtml="http://www.w3.org/1999/xhtml";
 declare namespace grddl="http://www.w3.org/2003/g/data-view#";
 declare namespace tableooo="http://openoffice.org/2009/table";
 declare namespace drawooo="http://openoffice.org/2010/draw";
 declare namespace calcext="urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0";
 declare namespace loext="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0";
 declare namespace field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0";
 declare namespace formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0";
 declare namespace css3t="http://www.w3.org/TR/css3-text/";

(: Declare external variables :)
 declare variable $filename external := "content.xml";

(: Declare local functions :)

(:~
 : This function looks up the name of a column.
 :
 : @param $hdrrow The first row (table:table-row) of the table, which contains
 : the column names.
 :
 : @param $colnum The number of the column. The first column is numbered 1.
 :
 : @return A string that contains the name of the column, with spaces translated to
 : hyphens, or null if the column number is out of range. Note that the
 : function does not currently filter out characters that are invalid in
 : XML element names.
 :
 :)

declare function local:getColName($hdrrow as element()*, $colnum as xs:integer) as xs:string
 {
 let $name := fn:translate(data($hdrrow/table:table-cell[$colnum]/text:p), " ", "-")
 return
 if ($name)
 then $name
 else ""
 };

(:~
 : This function processes a row in a table. For each column, it creates an element
 : with the same name as the column. Spaces are normalized within the value of the
 : element.
 :
 : @param $row is the row (table:table-row) to be processed.
 :
 : @param $hdrrow The first row (table:table-row) of the table, which contains
 : the column names.
 :
 : @return The data from the row.
 :
 :)

declare function local:processRow($row as element()*, $hdrrow as element()*) as element()*
 {
 let $cdata :=
 for $cell at $cellnum in $row/table:table-cell
 let $name := local:getColName($hdrrow, $cellnum)
 return
 if ($name)
 then element {$name} {fn:normalize-space(data($cell))}
 else ()
 return $cdata

};

(:~
 :
 : This function processes all the rows in a table (table:table).
 :
 : @param $table The table (table:table) to be processed.
 :
 : @return The processed data from the table.
 :
 :)

declare function local:processTable($table as element()*) as element()*
 {

let $sheet := data($table/@table:name)

let $sheets := fn:concat($sheet, "s")

let $rows := $table/table:table-row

(: Get the first row, which has the names of the columns :)
 let $hdrrow := $rows[1]

(: Extract cells from data rows :)
 let $more :=
 for $row in $rows[position() > 1] (: Skip title row :)
 return element {$sheet} {local:processRow($row, $hdrrow)}

return {element {$sheets} {$more}}

};

(: Mainline :)

(: Extract all the tables from the spreadsheet :)
 let $tables := doc($filename)/office:document-content/office:body/office:spreadsheet/table:table

(: Process each table :)
 let $batch :=
 for $table in $tables
 return local:processTable($table)

(: Return batch of tables :)
 return {$batch}

Here is an example of a Windows .bat file that uses the WinZip Command Line Support Add-On to extract the content.xml file, then runs the converter under Zorba to extract the data:

@echo off
copy %1.ods %1.zip
move %1.zip data/%1.zip
wzunzip -o data/%1.zip data "content.xml"
cd data
del %1.zip
cd ..
zorba -f -q ods2xml2.xq --indent --external-variable filename:="data/content.xml" > data/batch.xml

In a build pipeline, this can be run whenever the researcher commits a new version of the spreadsheet, to automatically populate the XML files. Since the XML for all the sheets is contained in a single XML file, it will be necessary to select the desired table. This can easily be done with an XQuery or XSLT program. By the way, the same technique, with a few modifications, can be used to extract data from Microsoft Excel spreadsheet (.xlsx) files. The two major differences are that the names of the XML tags are different, though similar in function, and empty cells are handled differently. In LibreOffice Calc, empty cells are represented by an empty element. In Microsoft Excel, empty cells are omitted, and the next non-empty cell has an attribute that indicates its column number. I will post a version of the script for Microsoft Excel files sometime in the future.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *