Converting DBASE files from GIS shapefiles to XML

Shapefiles, also known as ESRI shapefiles, are used to convey geographic information about locales to GIS (Geographic Information System) programs like ArcGIS or QGIS. Besides the geometry, they frequently contain useful descriptive information about the locales they describe.

A shapefile is usually distributed as a .zip file containing at least the following files:

  • A main file (.shp) that contains the coordinates of the entries.
  • An index file (.shx) that is an index of the .shp file.
  • A data file (.dbf), in dBase format, that contains attributes of the points or polygons in the main file.
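
If all you have is the zipped shapefile, the .dbf member has to be pulled out before it can be converted. Here is a minimal sketch of that step using Python's standard zipfile module. The function name extract_dbf and the dest_dir parameter are made up for illustration and are not part of the conversion program.

```python
import zipfile
from pathlib import Path


def extract_dbf(zip_path, dest_dir="."):
    """Extract every .dbf member of a zipped shapefile.

    Returns a list of Path objects for the extracted files.
    (Hypothetical helper, not part of the original program.)
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Match the extension case-insensitively (.dbf or .DBF)
            if name.lower().endswith(".dbf"):
                extracted.append(Path(archive.extract(name, dest_dir)))
    return extracted
```

For example, extract_dbf("Belize_Settlements_Point_2010_Pop.zip") would write any .dbf members of that archive into the current directory and return their paths.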

The .dbf file frequently contains information that can be useful even if you are not using a GIS program. Here is a Python program to convert the .dbf file from a shapefile to XML.

The program uses the dbfread and pytidylib libraries (the latter is imported as tidylib), so you will have to install them before you run the program:

pip install dbfread
pip install pytidylib

You also need to have HTML Tidy itself installed, because the Python library is only a wrapper around it. You can get it from the HTML Tidy project website.

Here is the program:

#
# Convert a .dbf file from an ESRI shapefile to XML
#
# The first parameter is the input file name.
# The second parameter is the name to be used
# to enclose the elements from each row of the .dbf
# file. The root element will be the same name
# with an "s" on the end. For example, if you
# specified "settlement", the XML would start
# like this:
#
# <settlements>
#    <settlement>
#    ...
#    </settlement>
# </settlements>
#
# ESRI shapefiles are used to convey geographic information
# about locations. They are usually used with Geographical
# Information Systems like ArcGIS or QGIS.
#
# A shapefile is usually distributed as a .zip file
# containing at least the following files:
# - A main file (.shp) that contains the coordinates
#   of the entries.
# - An index file (.shx) that is an index of the .shp file.
# - A data file (.dbf), in dBase format, that contains
#   attributes of the points or polygons in the main file.
#
# Consider, for example, the shapefile
# Belize_Settlements_Point_2010_Pop.zip,
# which contains information about settlements
# in Belize. The corresponding .dbf file is
# Belize_Settlements_Point_2010_Pop.dbf, and
# contains the following attributes for each
# settlement:
# - The name of the settlement
# - The district it is in
# - The type of settlement
# - The population
# - The X coordinate
# - The Y coordinate
#
# Some of the information is useful even if you
# are not using a GIS program. In this case,
# all but the last two fields might be useful.
# (In order to use the X and Y coordinates,
# you would have to delve into the shapefile
# more to figure out what coordinate system
# and projection it uses.)
#
# The fields available in any particular
# shapefile will be pertinent to that shapefile.
#
# The detailed specifications of ESRI shapefiles
# are in the following document.
#
# https://www.esri.com/content/dam/esrisites/
#   sitecore-archive/Files/Pdfs/library/
#   whitepapers/pdfs/shapefile.pdf
#
# Shapefiles can be found in various places
# on the Web. For example, this site has
# shapefiles related to Belize:
#
# http://www.biological-diversity.info/GIS.htm
#
# You can also purchase them from commercial
# vendors, such as NextGIS:
#
# https://nextgis.com/
#
# License: Public Domain
#
# To the extent possible under law, Lynn Grant has waived
# all copyright and related or
# neighboring rights to this program. This work is 
# published from: United States.
#

from dbfread import DBF
import sys
from tidylib import tidy_document
import xml.etree.ElementTree as ET

tidyOptions = {"input-xml": 1, "indent": 1}

# Get filename and element name
inFile = sys.argv[1]
entryName = sys.argv[2]

# Calculate the output name
j = inFile.rfind(".")
outFile = inFile[:j] + ".xml"

# Display the file names
print("Input file:  ", inFile)
print("Output file: ", outFile)

# Open the output file
of = open(outFile, "wt", encoding="utf-8")

# The root element name is the entry name
# with an "s" tacked on the end.
rootName = entryName + "s"

# Load in the DBASE file
table = DBF(inFile, load=True, lowernames=True)

# Generate the root element
root = ET.Element(rootName)

# Process all rows
for record in table:

    # Generate an entry element for this row
    entry = ET.SubElement(root, entryName)

    # Generate subelements for each field
    for key in record:
        subEntry = ET.SubElement(entry, key)
        subEntry.text = str(record[key])

# Serialize the XML and put an XML header on the front.
xml_data = ('<?xml version="1.0" encoding="UTF-8"?>' +
            ET.tostring(root).decode('UTF-8'))

# Pretty-print the XML and write it to the output file.
document, errors = tidy_document(xml_data, tidyOptions)
if errors:
    print(errors)
of.write(document)
of.close()


To run the program, you pass it the name of the .dbf file and the name to use for the element containing the information for each row. The root element will be that same name, but with an “s” on the end.

python dbf2xml.py Belize_Settlements_Point_2010_Pop.dbf settlement

(The picture at the beginning of this post is how the shapefile looks in a GIS program, showing the various settlements in Belize.)

The output file will have the same name as the input file, but with the extension changed from .dbf to .xml.
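
The program derives the output name by cutting the input name at its last "." with rfind. If the input name has no dot at all, rfind returns -1 and the slice silently drops the last character instead. As a possible alternative, here is a small sketch that uses pathlib from the standard library. The function name xml_name_for is hypothetical.

```python
from pathlib import Path


def xml_name_for(dbf_name):
    """Return the input file name with its extension replaced by .xml.

    If the name has no extension, .xml is simply appended.
    (Hypothetical helper, not part of the original program.)
    """
    return str(Path(dbf_name).with_suffix(".xml"))


print(xml_name_for("Belize_Settlements_Point_2010_Pop.dbf"))
```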

The last step of the conversion process runs the generated XML through the HTML Tidy program (which, in spite of its name, can also tidy XML), in order to indent it nicely.
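
If you would rather not install HTML Tidy, the indentation can probably be done with the standard library alone. The sketch below swaps in ElementTree's own indent function, which needs Python 3.9 or later. This is a different technique from the one the program uses, and the function name pretty_xml and the small hand-built tree are only for illustration.

```python
import xml.etree.ElementTree as ET


def pretty_xml(root):
    """Indent the tree in place and return it as a string with an XML header.

    (Hypothetical helper; an alternative to the HTML Tidy step.)
    """
    ET.indent(root, space="  ")  # requires Python 3.9 or later
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


# A small hand-built tree standing in for the converted rows
root = ET.Element("settlements")
entry = ET.SubElement(root, "settlement")
ET.SubElement(entry, "settname").text = "Belize City"

print(pretty_xml(root))
```

The trade-off is that this only adjusts whitespace, whereas HTML Tidy also reports problems it finds in the markup.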

Here is an abbreviated version of the generated XML:

<?xml version="1.0" encoding="utf-8"?>
<settlements>
  <settlement>
    <settname>Belize City</settname>
    <district>Belize</district>
    <type>City</type>
    <population>57310.0</population>
    <x_coord>373687.0</x_coord>
    <y_coord>1934478.0</y_coord>
  </settlement>
  <settlement>
    <settname>Bermudian Landing</settname>
    <district>Belize</district>
    <type>Village</type>
    <population>183.0</population>
    <x_coord>337190.05264</x_coord>
    <y_coord>1941472.7739</y_coord>
  </settlement>
  . . .
</settlements>
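
Once the data is in XML, you can read it back with any XML tool. As a rough illustration, the following sketch uses ElementTree to pull the population of each settlement out of a fragment like the one above. The function name populations is made up, and the element names (settname, population) are the ones from this particular shapefile, so other shapefiles will need different names. Note that the numeric fields come out with a trailing ".0", so the sketch converts them through float before making them integers.

```python
import xml.etree.ElementTree as ET

# A fragment in the same shape as the generated output
SAMPLE = """<settlements>
  <settlement>
    <settname>Belize City</settname>
    <population>57310.0</population>
  </settlement>
  <settlement>
    <settname>Bermudian Landing</settname>
    <population>183.0</population>
  </settlement>
</settlements>"""


def populations(xml_text):
    """Map each settlement name to its population as an integer.

    (Hypothetical helper; element names are specific to this shapefile.)
    """
    root = ET.fromstring(xml_text)
    result = {}
    for settlement in root.findall("settlement"):
        name = settlement.findtext("settname")
        # Values look like "57310.0", so go through float first
        result[name] = int(float(settlement.findtext("population")))
    return result


print(populations(SAMPLE))
# → {'Belize City': 57310, 'Bermudian Landing': 183}
```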
