Using XPath for Efficient Parsing of XML Documents in Java
If you have ever had to write a document parser that uses the DOM API in Java, you may have mixed feelings about how simple this API really is. Yes, it is simple, and the concept of the document tree is very easy to understand. However, if you want to write really good (i.e. reliable) code, you have to take care of a number of things, including mixed comment and element nodes, optional attributes that may be absent, CDATA nodes etc. Defining a DTD or schema for your document and using a validating parser may reduce the need for extra condition checks in the code. Proper configuration of the parser may simplify the task a little more, but not entirely. Consider the following simple task. Here is the sample document:
<item-list>
    <item visible="true" id="1" >Item 1</item>
    <item visible="false" id="2" >Item 2</item>
    <item visible="true" >Item 3</item>
    <item visible="true" id="4" ><![CDATA[item 4]]></item>
</item-list>
Suppose we need to get the names of the visible items together with their IDs. Here is how it can be done in Java using DOM (assuming that you have already parsed the XML text into an instance of the Document class):
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

NodeList nodeList = doc.getElementsByTagName("item");
for (int i = 0, numElements = nodeList.getLength(); i < numElements; i++) {
    Node node = nodeList.item(i);
    if (node instanceof Element) {
        Element element = (Element) node;
        // getAttribute() returns an empty string when the attribute is absent
        String idStr = element.getAttribute("id");
        String visibleStr = element.getAttribute("visible");
        String itemName = null;
        if (idStr.length() > 0 && "true".equals(visibleStr)) {
            // found the visible element with known id
            NodeList elementContentNodes = element.getChildNodes();
            for (int ci = 0, numChildren = elementContentNodes.getLength(); ci < numChildren; ci++) {
                Node n = elementContentNodes.item(ci);
                // take the first text (or CDATA) child as the item name
                if (n instanceof Text) {
                    itemName = n.getNodeValue();
                    break;
                }
            }
            if (itemName != null) {
                System.out.println("Item #" + idStr + ": " + itemName);
            }
        }
    }
}
This code is not very compact, and it still does not cover certain cases. For example, if someone puts a comment between two parts of the text content, the inner loop stops at the first text node and returns only the first part of the name.
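Part of the comment problem can be handled when the document is parsed. The following is only a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name LenientItemReader and its methods are made up for illustration. It asks the parser to drop comments and to fold CDATA sections into ordinary text, and it reads the element text with getTextContent(), which concatenates all text pieces no matter how they are split into nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original sample programs.
final class LenientItemReader {

    /** Parses XML text with a parser that drops comments and folds CDATA into text. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(true); // comment nodes are not added to the tree
            factory.setCoalescing(true);       // CDATA sections become ordinary text
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns the complete text of the first "item" element. */
    static String firstItemText(String xml) {
        Element item = (Element) parse(xml).getElementsByTagName("item").item(0);
        // getTextContent() joins every text piece below the element
        return item.getTextContent();
    }

    public static void main(String[] args) {
        String xml = "<item-list><item visible=\"true\" id=\"1\">"
                + "Item <!-- note -->1</item></item-list>";
        System.out.println(firstItemText(xml));
    }
}
```

This removes one source of extra checks, but the attribute filtering still has to be written by hand.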
The XPath language (see the references) simplifies this task a lot. You can write an expression that selects a list of nodes by various criteria (the criteria themselves may be expressions that the engine evaluates). In our case, the expression that selects the nodes we are interested in would look like this:
/item-list/item[@visible="true" and @id!=""]
This XPath expression selects all the “item” elements that are the children of “item-list” element and have “visible” attribute set to “true” and non-empty value of the “id” attribute.
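To try the expression without any extra libraries, here is a small sketch that uses the javax.xml.xpath package bundled with the JDK. The class name VisibleItemsXPath and the SAMPLE constant are illustrative only; the XML is the sample document from the beginning of this article. Inside the Java string the XPath literals are written with single quotes, which XPath treats the same as double quotes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original sample programs.
final class VisibleItemsXPath {

    static final String SAMPLE =
            "<item-list>"
            + "<item visible=\"true\" id=\"1\">Item 1</item>"
            + "<item visible=\"false\" id=\"2\">Item 2</item>"
            + "<item visible=\"true\">Item 3</item>"
            + "<item visible=\"true\" id=\"4\"><![CDATA[item 4]]></item>"
            + "</item-list>";

    /** Returns one "Item #id: name" line per visible item that has an id. */
    static List<String> visibleItems(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The engine does all the filtering; only matching elements come back.
            NodeList items = (NodeList) xpath.evaluate(
                    "/item-list/item[@visible='true' and @id!='']",
                    doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add("Item #" + item.getAttribute("id") + ": " + item.getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Prints "Item #1: Item 1" and "Item #4: item 4"
        for (String line : visibleItems(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Item 3 is skipped because it has no id attribute at all: in XPath 1.0 a comparison between an empty node-set and a string is false, so the @id!="" test fails for it.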
XPath can be used in various ways in Java, and recent versions of J2SE support it out of the box. However, personally I find the APIs offered by the Xalan library the most convenient to use. For example, this fragment of code does the same as the DOM code above (and in addition it can handle text content mixed with comments):
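import org.apache.xpath.XPathAPI;
import org.w3c.dom.Element;
import org.w3c.dom.traversal.NodeIterator;

// Both XPathAPI calls may throw javax.xml.transform.TransformerException.
NodeIterator itemNodeIterator = XPathAPI.selectNodeIterator(
        doc.getDocumentElement(),
        "/item-list/item[@visible=\"true\" and @id!=\"\"]"); // quotes escaped for the Java literal
Element element = null;
while ((element = (Element) itemNodeIterator.nextNode()) != null) {
    // string(.) is the text of the element and all its descendants
    String itemName = XPathAPI.eval(element, "string(.)").toString();
    System.out.println("Item #" + element.getAttribute("id") + ": " + itemName);
}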
XPath queries can be combined with custom functions, which gives the caller even more power for processing XML content. However, it is important to understand what is hidden “under the hood” of the technology: the XPath-based approach lets you write compact and readable (and thus less bug-prone) code, but this code may run slower than code based on the DOM API (or on a simple SAX parser). Nothing is free in this world ;)
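As an illustration of the custom function idea, here is a sketch built on the standard javax.xml.xpath API from the JDK rather than on Xalan's own extension mechanism. Everything specific in it is an assumption made for the example: the class name ShoutFunctionDemo, the "ext" prefix, the urn:example:xpath-ext namespace and the shout() function, which simply upper-cases its argument.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Locale;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFunction;
import org.xml.sax.InputSource;

// Hypothetical demo class; the function and its namespace are invented here.
final class ShoutFunctionDemo {

    static final String EXT_NS = "urn:example:xpath-ext";

    /** Creates an XPath object that understands ext:shout(string). */
    static XPath newXPathWithShout() {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Bind the "ext" prefix so expressions can refer to ext:shout(...).
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "ext".equals(prefix) ? EXT_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return EXT_NS.equals(uri) ? "ext" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                return EXT_NS.equals(uri)
                        ? Collections.singletonList("ext").iterator()
                        : Collections.<String>emptyIterator();
            }
        });

        // Resolve ext:shout with one argument to an upper-casing function.
        xpath.setXPathFunctionResolver((name, arity) -> {
            boolean isShout = EXT_NS.equals(name.getNamespaceURI())
                    && "shout".equals(name.getLocalPart()) && arity == 1;
            if (!isShout) {
                return null; // unknown function: let the engine report it
            }
            XPathFunction shout =
                    args -> String.valueOf(args.get(0)).toUpperCase(Locale.ROOT);
            return shout;
        });
        return xpath;
    }

    /** Applies ext:shout to the text of the first item in the document. */
    static String shoutFirstItem(String xml) {
        try {
            return newXPathWithShout().evaluate(
                    "ext:shout(string(/item-list/item[1]))",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(shoutFirstItem(
                "<item-list><item id=\"1\">Item 1</item></item-list>"));
    }
}
```

Each call to such a function goes through the resolver and back into Java code, which is one more place where the convenience is paid for with some run time.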
The XPathAPI fragment above is not the most efficient way of using the Xalan XPath engine (see the javadocs for the XPathAPI class to understand why), but it demonstrates the advantages of XPath over the traditional DOM API. First of all, a properly written XPath expression simplifies the type casting: if you select a list of elements, you get either a list of Element instances or an empty list. All the filtering is done by the XPath engine, so you receive the final result. In addition, in my example I just ask the engine for the value of “string(.)”, which returns the text representation of the current node (the “item” element) and all its children. This saves me from iterating over the children of the current “item” node to collect its text content.
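One common way to reduce the per-call overhead is to compile each expression once and reuse it. The sketch below shows this with the JDK's javax.xml.xpath API, which is an assumption on my part for the sake of a self-contained example (Xalan also ships a CachedXPathAPI class aimed at the same problem, which is not shown here). The class name CompiledItemQuery and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class. XPathExpression objects are not thread-safe,
// so a real program would keep them per thread or synchronize access.
final class CompiledItemQuery {

    // Compiled once, reused for every document and every node.
    private static final XPathExpression SELECT_ITEMS =
            compile("/item-list/item[@visible='true' and @id!='']");
    private static final XPathExpression ITEM_ID = compile("string(@id)");
    private static final XPathExpression ITEM_TEXT = compile("string(.)");

    private static XPathExpression compile(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().compile(expression);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns one "Item #id: name" line per selected item. */
    static List<String> describe(Document doc) {
        try {
            NodeList items = (NodeList) SELECT_ITEMS.evaluate(doc, XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i); // no cast to Element is needed
                // string(.) joins all text below the item and skips comments
                lines.add("Item #" + ITEM_ID.evaluate(item) + ": " + ITEM_TEXT.evaluate(item));
            }
            return lines;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<item-list><item visible=\"true\" id=\"5\">"
                + "Item <!-- note -->5</item></item-list>");
        describe(doc).forEach(System.out::println);
    }
}
```

The comment inside the item does not need any special handling here, because the string value of an element is built from its text nodes only.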
You can download the sample Java programs and the XML file here. Make sure that you include xalan.jar in your classpath when running the XPath sample.