Converting Microsoft Excel spreadsheets to XML

Several years ago, I posted an XQuery program to convert LibreOffice Calc spreadsheets to usable XML. The following rationale was given in that article:

I was working on an educational game about Medieval England, with an XML based GUI. For example, the new player would be asked to choose a name, a profession, and other characteristics. These were read from XML files in order to populate the list of choices. The researcher on the team would send us spreadsheets with lists of medieval jobs, names, and so on, and I would have to convert them to XML. For the proof of concept, I just cut and pasted the information into an XML editor and manually added the XML tags. This was clearly too tedious and error prone for production, so I needed an automated conversion.

At the time, I promised that a program to convert Microsoft Excel spreadsheets would follow shortly. Unfortunately, that took a lot longer than originally planned.

Like LibreOffice Calc spreadsheets, Excel spreadsheets are really zip files in disguise. If you change the file type from .xlsx to .zip, you can unzip them. If you point the following XQuery program to the unzipped directory, it will produce an XML file using the following rules:

  • The first line of each sheet is assumed to be column names. These names, with any spaces translated to hyphens, are used for the XML element names for each column.
  • The name of the sheet is used for the element that encloses the data for each row.
  • The name of the sheet, suffexed with an “s”, is used for the element that encloses the data for the sheet.
  • A <batch> element encloses the data for all the sheets.

Here is the program:

(:~
 : This program extracts useable XML data from an unzipped Excel spreadsheet. 
 : 
 : An Excel .xlsx file is really a zip file, so we first have to rename it 
 : to have a .zip file type, and then unzip it. 
 : 
 : The unzipped file creates a directory structure like this: 
 : 
 :    _rels
 :    docProps
 :    xl
 :    [Content_Types].xml 
 :
 : This program is intended to be run from the topmost directory of the unzipped file, 
 : unless a different directory is specified, using the directory external variable. 
 : 
 : The xl directory looks like this:
 :
 :    rels
 :    theme
 :    worksheets
 :    sharedStrings.xml 
 :    styles.xml
 :    workbook.xml
 : 
 : The worksheets live in the xorksheets directory. The first one is called "Sheet1", 
 : the second, "Sheet2", etc. The actual names of the worksheets are in the workbook.xml 
 : file, in the <sheet> elements.  
 : 
 : If the zip file has been unzipped to a directory other than the one the program is 
 : running from,  : you can specify it with the $directory external variable, like this: 
 : 
 : --external-variable directory="C:\Users\Test User\Documents\test-sheet"
 :
 : The generated has a root element of <batch>. For each sheet, there is an element 
 : whose name is the sheet name, followed by an "s". Each row of the sheet is represented 
 : by an element with the same name as the sheet. Within the row, each element has 
 : the same name as the column name from the first row of the sheet. 
 : 
 : Note: This program is intended for spreadsheets that contain only data. It has not 
 : been tested with spreadsheets with cells that contain formulas. It will probably 
 : either blow up or give incorrect results if it encounters one. 
 : 
 : Note for future development:
 : 
 : A cell that contains a picture looks like this: 
 :
 :    <c r="D5" t="e" vm="3">
 :           <v>#VALUE!</v>
 :         </c>
 :
 : The t="e" indicates that it contains an image. The value of the vm attribute,
 : prefixed with "rId", appears to refer to the Id attribute of a <Relationship>
 : element in the xl/richData/_rels/richValueRel.xml.rels file. The Target
 : attribute points to the actual image. The value of the <v> element appears
 : to be a placeholder. 
 : 
 : @author Lynn Grant
 : @version 2023-12-23
 :
 :)

(:
 :
 : License: Public Domain
 :
 : To the extent possible under law, Lynn Grant has waived all copyright and related or
 : neighboring rights to this program. This work is published from: United States.
 :
 :)

(: Declare namespaces :)
declare namespace sst = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

(: Declare external variables :)
declare variable $directory external := "";


(:~
 : This function gets a shared string. 
 :
 : A column entry in an Excel spreadsheet takes on of the following forms: 
 :
 :     <c r="C6">
 :           <v>1</v>
 :         </c>
 :
 :         <c r="D6" t="s">
 :           <v>8</v>
 :         </c>
 :
 : In the first case, the value of <v> is the actual value of the column entry. 
 : In the second case, the value of <v> is a zero-based index into the 
 : sharedStrings.xml file. 
 :
 : This function will look up the shared string corresponding to an index.
 :
 : @param $sharedstrings - The contents of the sharedStrings.xml file. 
 : @param $index - The zero-based index of the desired string. 
 : @return - The shared string corresponding to the index.
 :
 :)

declare function local:getSharedString($sharedStrings as element()*, $index as xs:integer) as xs:string
{
   let $string := $sharedStrings[$index+1]/text()    (: $sharedStrings is one-based, so we have to bump up the index. :)
   return $string
};


(:~
 : This function gets the name of a column, based on its column number. The first row of the spreadsheet is assumed
 : to consist of column headings. 
 : 
 : @param $sharedStrings - The contents of the sharedStrings.xml file. This will be passed to local:getSharedString().
 : @param $headings - The contents of the first row of the spreadsheet.
 : @param $colNum - The number of the column.
 : @return - The name of the column, or null, in the unlikely case that the column heading
 :           cannot be found.
 :
 :)

declare function local:getColumnName($sharedStrings as element()*, $headings as element()*, $colNum as xs:integer) as xs:string
{
   let $heading := $headings/sst:c[$colNum]
   let $t := data($heading/@t)
   let $v := $heading/sst:v/text()
   let $colName := if ($t = "s")
      then {local:getSharedString($sharedStrings, $v)}
      else {$v} 
   let $colNameNb := fn:translate($colName, " ", "-")
   return if ($colNameNb)
          then $colNameNb
          else ""   
};


(:~
 : This function extracts the column number from a row-column designator.
 :
 : A column entry in an Excel spreadsheet looks like this: 
 :
 :     <c r="B6">
 :           <v>33333</v>
 :         </c>
 :
 : The value of r is the alphabetic column ID and the numeric row ID. We are interested in the column number, 
 : starting with 1, so we take the letter and convert it to a number. Note that you can have more than 26 columns, 
 : in which case you can have designators like "AA5". This program does not anticipate dealing with spreadsheets 
 : with more than 26 columns, so if it encounters one, it will not give you the right result.
 :
 : @param $rowCol - The row and column designator, such as "A5".
 : @return The column number extracted from the row-column designator.
 :
 :)
 
declare function local:getColumnNumber($rowCol as xs:string) as xs:integer
{
    let $letters := ("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")
           let $colNum := fn:index-of($letters, substring($rowCol, 1, 1))
        return $colNum
};


(:~
 : This function processes a row, in other words, the contents of a <row> element.
 :
 : @param $sharedStrings - The contents of the sharedStrings.xml file.
 : @param $headings - The contents of the first row of the spreadsheet, containing column headings.
 : @param $row - The contents of a <row> element. 
 : @return - An element whose name is the column name, and whose value is the column value.
 :
 :)

declare function local:processRow($sharedStrings as element()*, 
                $headings as element()*, 
                $row as element()*
                ) as element()*
{
    for $col in $row/sst:c
              let $t := data($col/@t)
              let $r := data($col/@r)
              let $colNum := local:getColumnNumber($r)
              let $v := $col/sst:v/text()
              let $colName := local:getColumnName($sharedStrings, $headings, $colNum)
              let $colVal := 
                  if ($t = "s")
                      then {local:getSharedString($sharedStrings, $v)}
                      else {$v}
               return element {$colName} {$colVal}
};

declare function local:processSheet($sharedStrings as element()*, 
                   $sheetName as xs:string, 
                   $sheetFile as xs:string
                   ) as element()*
{
    let $collectionName := concat($sheetName, "s")
    let $sheetData := doc($sheetFile)/sst:worksheet/sst:sheetData

    let $sheets := $sheetData/sst:row[fn:position() > 1]
 
    let $headings := $sheetData/sst:row[fn:position() = 1]

    return element     {$collectionName}
    {
        for $row in $sheets
            return 
                element {$sheetName}
                  {
                    local:processRow($sharedStrings, $headings, $row)
                } 
              } 
};

(: Mainline :)

let $directoryPrefix := 
    if ($directory = "")
    then ""
    else fn:concat($directory, "/")
 
let $sharedStringsFilename := fn:concat($directoryPrefix, "xl/", "sharedStrings.xml")
let $sharedStrings         := doc($sharedStringsFilename)/sst:sst/sst:si/sst:t

(: The list of sheets is in the workbook.xml file, in a structure that looks like this: 
   
    <workbook> 
        . . .
        <sheets>
            <sheet name="Sheet1" r:id="rId1" sheetId="1"/>
            <sheet name="dfs" r:id="rId2" sheetId="2"/>
        </sheets>
        . . .
    </workbook>

    Name is the name that is on the tab at the bottom of the sheet in Excel.
    SheetId, when prefixed by "Sheet", and suffixed by ".xml", is the filename
        of the worksheet in the xl/worksheets directory.

:)

let $workbookFilename      := fn:concat($directoryPrefix, "xl/", "workbook.xml")
let $workbook              := doc($workbookFilename)/sst:workbook/sst:sheets/sst:sheet

(: Process each worksheet :)
let $batch :=
    for $sheet in $workbook
        let $sheetId := $sheet/@sheetId
        let $sheetFile := fn:concat($directoryPrefix, "xl/worksheets/Sheet", $sheetId, ".xml") 
            let $sheetName := data($sheet/@name)
        return local:processSheet($sharedStrings, $sheetName, $sheetFile)

return element  {"batch"}
                {$batch}

Now, it may be that you want JSON, the other popular data format, rather than XML. I use a proprietary tool, XML Validator Buddy, which can convert back and forth between the two, but there are also several convertors available, both online and as open-source projects.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *