Converting CSV files to XML, revisited

In a previous post, I gave an example of how to use XQuery to convert a .csv file, in the common format of header records followed by detail records, into an XML file. In this post I will show you another approach, using Python with the pandas and xml.etree.ElementTree packages.

We will use the same example file of auto parts orders as last time, with one change: the pandas package expects to have the names of the columns in the first row. Since the columns mean different things in header or detail rows, we will just give them one-letter names, starting with “A”, like an Excel spreadsheet:

"H","100","305142","Detroit Auto Parts","2012-03-05"
"D","100","2034","License plate bolts","4"
"D","100","3442","License plate light","1"
"D","100","3443","License plate light bulb","1"
"H","101","307119","Pontiac Auto Parts Emporium","2012-04-09"
"D","101","7009","Floor mat","2"
"D","101","1004","HD Shifter knob","1"
"D","101","1010","Loud muffler","2"
"D","101","2112","Window tint film","5"
"H","102","309108","Drayton Plains Auto  Service","2012-05-06"
"D","102","3336","Fuzzy Mirror Dice","1"
"D","102","3002","Toilet Paper Oil Filter","2"
"D","102","4773","Spark Plug Wolf Whistle","1"
"H","103","310143","Lake Orien Automobiles","2012-06-04"
"D","103","2331","Spark plug","8"
"D","103","1113","Oil Pressure Gauge","1"

If you don’t have the pandas package installed, you will need to install it, like this:

pip install pandas

Here is the python program:

# Convert a comma-delimited file of order information to XML.
# The name of the file to be converted is passed as the
# first parameter.
# The name of the output file is passed as the second parameter. 
# For each order, there is an "H" header row that describes the
# customer, followed by one or more "D" detail rows that 
# describe the parts being ordered. The first row of the file 
# has the names of the columns, which in our case are letters, 
# starting with "A".
# For example, here is an example of a .csv file containing 
# two orders.
# "A","B","C","D","E"
# "H","100","305142","Detroit Auto Parts","2012-03-05"
# "D","100","2034","License plate bolts","4"
# "D","100","3442","License plate light","1"
# "D","100","3443","License plate light bulb","1"
# "H","101","307119","Pontiac Auto Parts Emporium","2012-04-09"
# "D","101","7009","Floor mat","2"
# "D","101","1004","HD Shifter knob","1"
# "D","101","1010","Loud muffler","2"
# "D","101","2112","Window tint film","5"
# The columns for a header row are: 
# - A - "H", indicating a header row
# - B - Order number
# - C - Customer number
# - D - Customer name
# - E - Order date
# The columns for a detail row are: 
# - A - "D", indicating a detail row
# - B - Order number
# - C - Part number
# - D - Part name
# - E - Quantity
# Each field should be enclosed in double quotes. 
# Commas inside the fields are OK, and will be preserved. 
# License: Public Domain
# To the extent possible under law, Lynn Grant has waived all 
# copyright and related or
# neighboring rights to this program. This work is published 
# from: United States.

import pandas as pd
import sys
import xml.etree.ElementTree as ET

# Column A is "H" for header, "D" for detail.
# Column B is the order number
# The remaining columns vary for the two kinds of rows, as 
# reflected in the following dictionaries. Note that the key
# for the last column is "E ", rather than "E". There seems to 
# be some quirk with pandas where a space is added to the name
# of the last column.
headDict   = {"C": "customerNumber", 
              "D": "customerName", 
              "E ": "orderDate"}
detailDict = {"C": "partNumber",     
              "D": "partName",     
              "E ": "quantity"} 

# Generate the subelements for a header or detail row.
# @param row    - The row of the data frame to be processed.
# @param dict   - The dictionary that maps alphabetic keys to 
#                 column names. 
# @param parent - The XML element that will be the parent of
#                 the generated subelements.
# @returns      - Zero is always returned.
def generateSubElements(row, dict, parent):

    for key in dict:
        entry = ET.SubElement(parent, dict[key])
        entry.text = str(row[key])
    return 0

# Process the parameters
inFile  = sys.argv[1]
outFile = sys.argv[2]

# Open the output file
of = open(outFile, "wt")

# Open the input .csv file and load it into a data frame. 
df = pd.read_csv(inFile)

items = ""
headerOrderNumber = ""
orders = ET.Element('orders')

for rowNum in df.index:

    # Extract the current row
    row = df.iloc[rowNum]

    # Process the header record for a new order.
    if str(row['A']) == "H":

        # Get the order number and save for checking 
        # with detail rows.
        headerOrderNumber = str(row['B'])

        # Generate the <order> element.
        order = ET.SubElement(orders, 'order')

        # Generate the <orderNumber> element.
        orderNumber = ET.SubElement(order, 'orderNumber')
        orderNumber.text = headerOrderNumber

        # Generate elements for the rest of the header columns.
        generateSubElements(row, headDict, order)

        # Generate the <items> element to hold detail
        # information.
        items = ET.SubElement(order, 'items')

    # Process a detail record for an order
    elif str(row['A']) == "D":

        # Generate the <item> element. 
        item  = ET.SubElement(items, 'item')

        # Get the order number and make sure it matches 
        # the one in the header row. 
        detailOrderNumber = str(row['B'])
        if detailOrderNumber != headerOrderNumber:
            sys.exit("Detail order number " + 
                     detailOrderNumber + 
                     " does not match header order number " + 
        # Generate items for the rest of the detail columns.
        generateSubElements(row, detailDict, item)
    # If a row has neither "H" nor "D" in the first column,
    # we just exit with an error message, since we do not
    # know what to do with the row. 
        sys.exit("Column A in row " + str(rowNum) + " is '" + 
                 str(row['A']) + "', not 'H' or 'D'") 

# Serialize the XML, put an XML header on the front, 
# and output it.    
xml_data = ET.tostring(orders)
of.write('<?xml version="1.0" encoding="UTF-8"?>' + xml_data.decode('UTF-8'))

Unlike the Zorba XQuery processor, this program will not nicely indent the XML output, which makes it difficult to read. However, you can process the output with the HTML Tidy program, specifying the -i (indent) parameter, to get the same result. (In spite of its name, it also tidies XML if you use the -xml parameter.) The -o parameter specifies the name of the output file.

python orders.csv raworders.xml
tidy -i -xml -o orders.xml raworders.xml 

Here is the resulting orders.xml file:

<?xml version="1.0" encoding="utf-8"?>
    <customerName>Detroit Auto Parts</customerName>
        <partName>License plate bolts</partName>
        <partName>License plate light</partName>
        <partName>License plate light bulb</partName>
    <customerName>Pontiac Auto Parts Emporium</customerName>
        <partName>Floor mat</partName>
        <partName>HD Shifter knob</partName>
        <partName>Loud muffler</partName>
        <partName>Window tint film</partName>
    <customerName>Drayton Plains Auto Service</customerName>
        <partName>Fuzzy Mirror Dice</partName>
        <partName>Toilet Paper Oil Filter</partName>
        <partName>Spark Plug Wolf Whistle</partName>
    <customerName>Lake Orien Automobiles</customerName>
        <partName>Spark plug</partName>
        <partName>Oil Pressure Gauge</partName>


Leave a Reply

Your email address will not be published. Required fields are marked *