Converting CSV files to XML

I recently approached an upstream application group, requesting that they provide data in XML form. “We already provide a .csv file,” they said. “Isn’t that good enough?” I thought about at a bit and, although that was not the answer I had hoped for, I could, indeed, convert their .csv file into an XML file, with a little help from XQuery.

Consider this hypothetical CSV file. I used a CSV format where each field is enclosed in quotation marks, which makes dealing with fields that contain commas easier.

"H","100","305142","Detroit Auto Parts","2012-03-05"
"D","100","2034","License plate bolts","4"
"D","100","3442","License plate light","1"
"D","100","3443","License plate light bulb","1"
"H","101","307119","Pontiac Auto Parts Emporium","2012-04-09"
"D","101","7009","Floor mat","2"
"D","101","1004","HD Shifter knob","1"
"D","101","1010","Loud muffler","2"
"D","101","2112","Window tint film","5"
"H","102","309108","Drayton Plains Auto Service","2012-05-06"
"D","102","3336","Fuzzy Mirror Dice","1"
"D","102","3002","Toilet Paper Oil Filter","2"
"D","102","4773","Spark Plug Wolf Whistle","1"
"H","103","310143","Lake Orien Automobiles","2012-06-04"
"D","103","2331","Spark plug","8"
"D","103","2000","Tachometer","1"
"D","103","1113","Oil Pressure Gauge","1"

The lines that begin with “H” are headers, each one representing an order. The subsequent fields are order number, customer number, customer name, and order date.

The lines that begin with “D” are details, each one representing a part that is being ordered. The subsequent fields are order number, part number, part name, and quantity.

Good data normalization practice would dictate that we should not have both the customer name and number, or both the part name and number. Instead, we should have just the customer number or part number, and look up the name in another file; that way we do not have to the data duplicated, and we also do not have to deal with the case where two orders with the same customer number have different names.

In practice, it is probably a good idea to have both, particularly when the data is coming from someplace we do not have control over, in this case auto parts stores. Having both the name and the number can help resolve things if the customer mistypes the number and, in the case of customer name, can alert us to changes in the name of the business. Once we have validated the name to our satisfaction, we should probably remove it and stick with the number.

So we have our .csv file, but what we would really like is some XML that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<orders>
  <order>
    <orderNumber>100</orderNumber>
    <customerNumber>305142</customerNumber>
    <customerName>Detroit Auto Parts</customerName>
    <orderDate>2012-03-05</orderDate>
    <items>
      <item>
        <partNumber>2034</partNumber>
        <partName>License plate bolts</partName>
        <quantity>4</quantity>
      </item>
      <item>
        <partNumber>3442</partNumber>
        <partName>License plate light</partName>
        <quantity>1</quantity>
      </item>
      <item>
        <partNumbe>3443</partNumber>
        <partName>License plate light bulb</partName>
        <quantity>1</quantity>
      </item>
    </items>
  </order>
  <order>
    <orderNumber>101</orderNumber>
    <customerNumber>307119</customerNumber>
    <customerName>Pontiac Auto Parts Emporium</customerName>
    <orderDate>2012-04-09</orderDate>
    <items>
      <item>
        <partNumber>7009</partNumber>
        <partName>Floor mat</partName>
        <quantity>2</quantity>
      </item>
      <item>
        <partNumber>1004</partNumber>
        <partName>HD Shifter knob</partName>
        <quantity>1</quantity>
      </item>
      <item>
        <partNumber>1010</partNumber>
        <partName>Loud muffler</partName>
        <quantity>2</quantity>
      </item>
      <item>
        <partNumber>2112</partNumber>
        <partName>Window tint film</partName>
        <quantity>5</quantity>
      </item>
    </items>
  </order>
  <order>
    <orderNumber>102</orderNumber>
    <customerNumber>309108</customerNumber>
    <customerName>Drayton Plains Auto Service</customerName>
    <orderDate>2012-05-06</orderDate>
    <items>
      <item>
        <partNumber>3336</partNumber>
        <partName>Fuzzy Mirror Dice</partName>
        <quantity>1</quantity>
      </item>
      <item>
        <partNumber>3002</partNumber>
        <partName>Toilet Paper Oil Filter</partName>
        <quantity>2</quantity>
      </item>
      <item>
        <partNumber>4773</partNumber>
        <partName>Spark Plug Wolf Whistle</partName>
        <quantity>1</quantity>
      </item>
    </items>
  </order>
  <order>
    <orderNumber>103</orderNumber>
    <customerNumber>310143</customerNumber>
    <customerName>Lake Orien Automobiles</customerName>
    <orderDate>2012-06-04</orderDate>
    <items>
      <item>
        <partNumber>2331</partNumber>
        <partName>Spark plug</partName>
        <quantity>8</quantity>
      </item>
      <item>
        <partNumber>2000</partNumber>
        <partName>Tachometer</partName>
        <quantity>1</quantity>
      </item>
      <item>
        <partNumber>1113</partNumber>
        <partName>Oil Pressure Gauge</partName>
        <quantity>1</quantity>
      </item>
    </items>
  </order>
</orders>

Since I will probably have to convert other .csv files, I will do this in two steps. The first step will convert the .csv file to an XML structure like this:

<?xml version="1.0" encoding="UTF-8"?>
<lines>
   <line>
      <field>"H"</field>
      <field>100</field>
      <field>305142</field>
      <field>Detroit Auto Parts</field>
      <field>2012-03-05</field>
   </line>
   <line>
      <field>D</field>
      <field>100</field>
      <field>2034</field>
      <field>License plate bolts</field>
      <field>4</field>
   </line>
   . . . 
   <line>
      <field>D</field>
      <field>103</field>
      <field>2000</field>
      <field>Tachometer</field>
      <field>1</field>
   </line>
   <line>
      <field>D</field>
      <field>103</field>
      <field>1113</field>
      <field>Oil Pressure Gauge</field>
      <field>1</field>
   </line>
</lines>

This can be reused any time I need to work with a .csv file. By the way, I used to use an AWK script to wrap a <line> element around each line of the .csv file, and wrap the whole thing in a <lines> tag, so that the XQuery fn:doc() function could deal with it. With XQuery 3.0, the fn:unparsed-text() function is now available, so it is now possible to just read in the raw text.

The following XQuery program, csv2xml.xq, will do the job:

(:~ Convert a comma-delimited file of order information to XML.

For each order, there is an "H" header row that describes the
 customer, followed by one or more "D" detail rows that describe
 the parts being ordered.

For example, here is an example of a csv file containing two orders.
 Note that we add an XML declaration and enclosing <lines> element so that XQuery will be happy processing it.

<?xml version="1.0" encoding="UTF-8"?>
 <lines>
 "H","100","305142","Detroit Auto Parts","2012-03-05"
 "D","100","2034","License plate bolts","4"
 "D","100","3442","License plate light","1"
 "D","100","3443","License plate light bulb","1"
 "H","101","307119","Pontiac Auto Parts Emporium","2012-04-09"
 "D","101","7009","Floor mat","2"
 "D","101","1004","HD Shifter knob","1"
 "D","101","1010","Loud muffler","2"
 "D","101","2112","Window tint film","5"
 "H","102","309108","Drayton Plains Auto Service","2012-05-06"
 "D","102","3336","Fuzzy Mirror Dice","1"
 "D","102","3002","Toilet Paper Oil Filter","2"
 "D","102","4773","Spark Plug Wolf Whistle","1"
 "H","103","310143","Lake Orien Automobiles","2012-06-04"
 "D","103","2331","Spark plug","8"
 "D","103","2000","Tachometer","1"
 "D","103","1113","Oil Pressure Gauge","1"
 </lines>
 
 The fields for an "H" line are:
 - Type: "H" for a header line
 - Order Number
 - Customer number
 - Order date
 
 The fields for a "D" line are:
 - Type: "D" for a detail line
 - Order number
 - Part number
 - Part description
 - Quantity
 
 Each field should be enclosed in double quotes. 
 Commas inside the fields are OK, and will be preserved. 
 
:) 

(:
 :
 : License: Public Domain
 :
 : To the extent possible under law, Lynn Grant has waived all copyright and related or
 : neighboring rights to this program. This work is published from: United States.
 :
 :)

(:~ GetLines: Break a new-line-delimited file down into individal lines.
 
 @param $file - The file to be broken down into lines. Usually a
 variable obtained from the doc(...) function.
 @returns a line structure that looks like this: 
 <lines>
 <line>first line data</line>
 <line>second line data</line>
 ... etc. ...
 </lines>
:)

declare variable $filename external;
 
declare function local:GetLines( $file as xs:string ) as element()*
{
 <lines>
 {
 let $lines := fn:tokenize( $file, "\n" )
 for $line in $lines
 (: We need to temporarily get rid of any commas between double quotes, so they don't confuse
 things later when we split each line into fields. 
 - First we change any occurrences of '","', which are the commas between fields,
 into '"~!~"', a string not likely to occur in our input data. 
 - All remaining commas should be the ones within quotes, so we change them to "~#~", another
 string not likely to be in our data. 
 - Finally we change all occurrences of '"~!~"' back to '","', to put back the inter-field commas.
 :)
 return <line>{fn:replace( fn:replace( fn:replace($line, '","', '"~!~"'), ",", "~#~"), '"~!~"', '","') }</line>
 }
 </lines>
};



(:~ GetFields: Get the fields of a line. Also, remove the
 double quotes around text fields.
 
 @param $line - The <line> element to be processed.
 @returns a field structure: 
 <fields>
 <field>First field</field>
 <field>Second field</field>
 ... etc. ...
 </fields>
:)
 
declare function local:GetFields( $line as element()*) as element()*
{
 <fields>
 {
 let $fields := fn:tokenize( $line, "," )
 for $field in $fields
 (: Here we replace any fields within quotes with the contents of the field, 
 but without the quotes. In other words, we remove the quotes. 
 Then we replace the "~#~" string, which the GetLines routine changed
 all commas within quotes to, back to commas. :)
 return <field>{fn:replace( fn:replace( $field, '^"(.*)"$', "$1" ), "~#~", ",")}</field> 
 }
 </fields>
};


(:~ ProcOrder: Process a Header Line for an order
 
 This routine takes the <fields> element for an "H"
 line, extracts the fields, and creates an <order>
 element. The GetItems() routine is called to 
 generate the <items> section. 
 
 @param $order - The <fields> element for the order to be processed.
 @param $stream - The <fields> elements for all "H" records that have the
 same order number as $order.
 @returns an <order> element like this: 
 <order>
 <orderNumber>order number</orderNumber>
 <customerNumber>customer number</customerNumber>
 <customerName>customer name</customerName>
 ... <items> section from GetItems() routine ...
 </order>
 
:)
 
declare function local:ProcOrder( $order as element()*, $stream as element()* ) as element()* 
{
 let $orderNumber := $order/field[2]/string() (: Get the order number :)
 return
 <order>
 <orderNumber>{$order/field[2]/string()}</orderNumber>
 <customerNumber>{$order/field[3]/string()}</customerNumber>
 <customerName>{$order/field[4]/string()}</customerName>
 <orderDate>{$order/field[5]/string()}</orderDate>
 {
 (: Pass <fields> elements for all "D" lines with the same order number to GetItems() :)
 local:GetItems( $stream[field[1]/string() = "D" and field[2]/string() = $orderNumber] )
 }
 </order>
};

(:~ GetItems: Generate the <items> section of an <order> element.
 
 @param $items - a sequence of <fields> elements, one for each 
 "D" line for this order.
 @returns an <items> element with an <item> elemennt for each "D" line: 
 
 <items>
 <item>
 <partNumber>2034</partNumber>
 <partName>License plate bolts</partName>
 <quantity>4</quantity>
 </item>
 <item>
 <partNumber>3442</partNumber>
 <partName>License plate light</partName>
 <quantity>1</quantity>
 </item>
 <item>
 <partNumber>3443</partNumber>
 <partName>License plate light bulb</partName>
 <quantity>1</quantity>
 </item>
 </items>
 
:)

declare function local:GetItems( $items as element()* ) as element()* 
{
 <items>
 {
 for $item in $items
 return 
 <item>
 <partNumber>{$item/field[3]/string()}</partNumber>
 <partName>{$item/field[4]/string()}</partName>
 <quantity>{$item/field[5]/string()}</quantity>
 </item>
 }
 </items>
};

(:~ Mainline
 
 - Open the <orders> element.
 - Break the file into lines, breaking at new-line characters.
 - Break each line into fields, breaking at commas.
 - Process each "H" line. 
 - Close the <orders> element.
 
:)

<orders>
{
 let $csvfile := doc($filename)/lines
 let $lines := local:GetLines( $csvfile )
 (: Skip line containing <lines> :)
 (: Break each line into a <fields> element with 
 a <field> child element for each field. :)
 let $fields :=
 for $line in fn:subsequence( $lines/line, 2 )
 return {local:GetFields( $line )}
 
 (: Process only the lines with a first field of "H".
 These represent order headers. :)
 for $order in $fields[field[1]/string() = "H"]
 return {local:ProcOrder( $order, $fields )}
}
</orders>

The mainline calls GetLines to convert the lines into a series of <line> elements. For each <line> elements, it calls GetFields to convert each line into a series of <field> elements, enclosed in a <fields> element. For each <fields> element where the first field has a value of “H”, representing a header, it calls ProcOrder, passing it the <fields> element, and the whole sequence of <fields> elements. ProcOrder will return a series of <order> elements, which will be enclosed in an <orders> element.

The GetLines routine first uses fn:tokenize to split the input at new-line characters. Three nested occurrences of fn:replace are used to change temporarily change commas within quotes, so they don’t confuse the field-tokenizing process:

  • The string “,” represents the commas between fields. These are temporarily changed to ~!~.
  • Any remaining commas will be the ones within quotes. They are changed to ~#~.
  • The string ~!~ is changed back to “,”.
  • Each line is wrapped in a <line> element.
  • The resulting elements are wrapped in a <lines> element.

The result is that an input string like this:

"H","100","305142","Detroit Auto Parts, Rochester","2012-03-05"

becomes:

<lines> <line>"H","100","305143","Detroit Auto Parts~#~ Rochester","2012-03-05"</line> </lines>

The GetFields routine takes a <line> element from GetLines and converts it to a sequence of <field> elements. First, the fn:tokenize function is used to split up the <line> data at commas. Since we temporarily changed the commas within quotes to something else, this will split on the commas between fields. Next we use two nested fn:replace calls to process the line. The innermost one uses a regular expression to remove the quotes around the field’s data. The outermost one translates the ~#~ string back to comma, putting back the commas within quotes. A <field> element is wrapped around each field’s data, and a <fields> element is wrapped around all the <field> elements for the line.

The ProcOrder routine extracts header data from the header <fields> element, and builds <orderNumber>, <customerNumber>, <customerName>, and <orderDate> elements. It then calls GetItems, passing it all the <fields> elements that have the same order number as the header. The data from GetItems is appended to the header fields, and wrapped in an <order> element.

The GetItems routine takes each <fields> element passed to it, and generates <partNumber>, <partName>, and <quantity> elements, wrapped in an <item> element. The group of <item> elements is wrapped in an <items> element.

See this post for another way of converting .csv files to XML, using Python.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *