java boilerplate

Thursday, December 01, 2005

Xml Parsing

The DOM API in particular seems much better suited to writing generic XML tools than to the far more common day-to-day business of parsing specific XML, that is, XML whose structure you know in advance. Even the SAX API is generally considered quite unwieldy.

We suggest a new parser, based on direct-to-memory principles, which works by 'mapping' the structure of the XML directly onto classes. The Java code of these POJO (Plain Old Java Object) classes can be written by hand, or it can be generated from an XML Schema.

This XML system should be simple, self-contained, and small, so that it can be included in any project or, better yet (since what we're proposing here is Java boilerplate), simply ship with Mustang (Java SE 6) out of the box.

Annotations are used to identify the mappings. Example:

@Tag(namespace = "http://foo/bar")
public class RootElement {
    @Attribute String id;
    @Map String author;
    @Map(tagName="published", dateFormat="yyyy-MM-dd HH:mm")
    long publishedInMillis;
    @Map(mapType = MapType.MULTIPLE_TAGS, targetClass = IconRef.class, tagName="icon")
    List<IconRef> icons;
}

class IconRef {
    @Attribute String id;
    @SimpleContent String url;
}

The above class structure is basically all that's required. Register the classes and you can read, for example, the following XML, ending up with an instance of RootElement:

<?xml version="1.0" encoding="UTF-8"?>
<rootelement id="something">
    <author>Reinier Zwitserloot</author>
    <published>2005-11-30 08:01</published>
    <icon id="icon1">http://example/icon1.jpg</icon>
    <icon id="icon2">http://example/icon2.jpg</icon>
</rootelement>

The same system can be used to write XML.
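
To make the mapping idea a bit more concrete, here is a minimal sketch of how a reader could fill annotated fields using reflection. Everything in it is an assumption for illustration: the class name MoxReaderSketch, the read() method, and the stripped-down @Attribute and @Map annotations are hypothetical stand-ins, not the actual MOX code. It handles only String fields on the root element and leans on the JDK's own DOM parser underneath.

```java
import java.io.StringReader;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only: hypothetical names, not the actual MOX API.
class MoxReaderSketch {
    @Retention(RetentionPolicy.RUNTIME) @interface Attribute {}
    @Retention(RetentionPolicy.RUNTIME) @interface Map {}

    static class RootElement {
        @Attribute String id;
        @Map String author;
    }

    // Copies root attributes and child-tag text into annotated String fields.
    static <T> T read(Class<T> type, String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            T target = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                field.setAccessible(true);
                if (field.isAnnotationPresent(Attribute.class)
                        && root.hasAttribute(field.getName())) {
                    field.set(target, root.getAttribute(field.getName()));
                } else if (field.isAnnotationPresent(Map.class)) {
                    // Match child tags against the field name, ignoring case.
                    for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                        if (n instanceof Element
                                && n.getNodeName().equalsIgnoreCase(field.getName())) {
                            field.set(target, n.getTextContent().trim());
                        }
                    }
                }
            }
            return target;
        } catch (Exception e) {
            throw new IllegalStateException("Could not map XML onto " + type, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rootelement id=\"something\">"
                + "<author>Reinier Zwitserloot</author></rootelement>";
        RootElement r = read(RootElement.class, xml);
        System.out.println(r.id + " / " + r.author);
    }
}
```

A real implementation would also need type conversion (dates, numbers), nested classes, collections, and namespace checks; the sketch leaves all of that out to keep the reflection loop visible.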

Other features:
  • @AttributeMap - can be applied to a java.util.Map. Stores all attributes as key/value pairs in this map.

  • Getters and setters - if they exist, they are called instead of reading or writing the field directly.

  • Aside from mapType=MapType.SINGLE_TAG (the default) and MapType.MULTIPLE_TAGS (shown in the example), there's also MapType.COLLECTION_TAG, which assumes the tagName refers to a wrapper tag that has no (important) attributes of its own. The contents of that tag are loaded into the list.

  • The entire system uses sane defaults. If you don't specify an explicit tagName (either on a @Map or a @Tag annotation), the mapping matches any tag whose name equals the name of the class (or field), ignoring case, when reading, and translates the field/class name to lowercase when writing. COLLECTION_TAG and MULTIPLE_TAGS work on any standard Java collection (Lists, Sets, even the concurrent collections and such), and also work on arrays.

  • Whitespace is parsed and normalized automatically, unless this feature is explicitly turned off (by setting whitespaceNormalization = false on one of the annotations) or the data is contained inside a CDATA section. CDATA sections and text nodes are merged into a single string automatically.

  • The system consists of a reader, a writer, and an XML Schema to class generator.

  • Attributes can have defaults, for example @Attribute(defaultValue = "something") String something; the default is used if the attribute does not exist in the XML data. When writing, the attribute is omitted if the value in the object matches the default.
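
The default rules in the list above boil down to a handful of small decisions, sketched here as plain static helpers. The class and method names are hypothetical and not part of MOX, and the exact meaning of "normalized" whitespace is an assumption (trim the ends, collapse inner runs to one space), since the post does not spell it out.

```java
import java.util.Locale;

// Illustrative sketch only: hypothetical helper names, not the actual MOX API.
class MoxDefaultsSketch {
    // Assumed normalization: trim the ends, collapse whitespace runs to one space.
    static String normalizeWhitespace(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Reading: a field or class name matches a tag name regardless of case.
    static boolean matchesTag(String fieldName, String tagName) {
        return fieldName.equalsIgnoreCase(tagName);
    }

    // Writing: the field or class name is translated to lowercase.
    static String tagNameForWriting(String fieldName) {
        return fieldName.toLowerCase(Locale.ROOT);
    }

    // Reading: fall back to the declared default when the attribute is absent.
    static String attributeOrDefault(String valueFromXml, String defaultValue) {
        return valueFromXml != null ? valueFromXml : defaultValue;
    }

    // Writing: omit the attribute when its value equals the declared default.
    static boolean shouldWriteAttribute(String value, String defaultValue) {
        return !value.equals(defaultValue);
    }
}
```

Under these assumptions, a class named RootElement would match <rootelement>, <RootElement>, or <ROOTELEMENT> when reading, and would always be written as <rootelement>.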

I call it MOX (Mappable Object Xml), and it is intended for basic, everyday XML parsing. Extremely complicated XML should probably be tackled with DOM or SAX, not this system. As such, it's entirely reasonable that various 'advanced' XML things can't be done using MOX; keeping MOX simple to write, use, and understand is more important than making sure it can handle every possible type of XML. As a simple example, DTDs and processing instructions are completely ignored. The parser just skips right over them.
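
As a rough picture of what "skipping right over" DTDs and processing instructions can look like, here is a small sketch built on the JDK's StAX pull parser rather than on MOX itself. The class and method names are made up for the example; the point is only that the loop reacts to element events and lets everything else pass by.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative sketch only: uses the JDK's StAX API, not the MOX parser.
class SkipSketch {
    // Returns the names of all start tags, in document order.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
                // DTD, PROCESSING_INSTRUCTION, and COMMENT events are
                // deliberately not handled: they are simply skipped.
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return names;
    }
}
```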

Here's a sample beta implementation. Its most important limitation: it only reads and writes fields directly; setters and getters are ignored. Examples are included: MOX download page.

NB: This example implementation is public domain code. Feel free to do whatever you like with it.


  • I've written a tutorial. The end result of it is a fully functional ATOM parser, and it only takes about 10 minutes to go through (yup, write a basic ATOM parser inside of 10 minutes... not something DOM or SAX can do very easily).

    By Blogger Reinier Zwitserloot, at 5:42 AM  

  • Mission accomplished. Sort of.

    JAXB 2 has a lot in common with MOX, and it'll be included in J2SE 1.6 and J2EE 1.5, apparently.

    By Blogger Reinier Zwitserloot, at 6:04 AM  
