Functional Flow

Streaming XML Input With XElementReader

Updated on 2008/09/17: Fixed problem when skipping elements.
Updated on 2008/09/15: Fixed problem when trying to read missing attributes.

One of the new features introduced in .NET 3.5 that I welcomed the most was LINQ to XML. The old DOM API was a bit clumsy to use, and the simple fact that you don't need owner documents any more makes the new XElement much more flexible and pleasant to work with than the old XmlElement.

Also new is an API for streaming XML output, XStreamingElement, which uses deferred execution to give you SAX-like performance on a DOM-like API. There's no streaming XML input API, though, so although you no longer need XmlWriter, you still need XmlReader when you want good performance on large documents. During the LINQ to XML development, the XML Team considered such an API, but they decided not to do it for Orcas.
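To make the deferred execution concrete, here is a minimal sketch of XStreamingElement in use. The class name StreamingOutputDemo, the Produced counter and the element names are made up for illustration; only XStreamingElement and XElement come from LINQ to XML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class StreamingOutputDemo   // hypothetical name, for illustration only
{
    // Counts how many items the lazy sequence has actually produced.
    public static int Produced;

    // An iterator: its body does not run until someone enumerates it.
    static IEnumerable<int> Numbers()
    {
        for (int i = 1; i <= 3; i++)
        {
            Produced++;
            yield return i;
        }
    }

    public static XStreamingElement Build()
    {
        // The query is stored, not executed: no <Item> element exists yet.
        return new XStreamingElement("Root",
            from n in Numbers()
            select new XElement("Item", n));
    }
}
```

Right after Build() returns, Produced is still 0. The items are only generated when the element is serialized, for example with ToString() or Save(), and each serialization enumerates the query again.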

Fortunately, Ralf Lämmel proposed such an API in "API-based XML streaming with FLWOR power and functional updates". I contacted him to ask if he could publicly release the code of his prototype, but he said he couldn't do it. Nevertheless, he kindly offered to help me develop a similar library myself, so with his help I implemented a small subset of the functionality he described in the paper. It took a while to get the corner cases right, but it has been used in a real-world scenario for some months now, so I think it's stable enough.

The interface is the following:

XElementReader(string path);
XElementReader(TextReader reader);
XElementReader(XmlReader reader);
XElementReader(Stream stream);

XName Name { get; }
string Value { get; }

XAttribute Attribute(XName name);
IEnumerable<XAttribute> Attributes();

XElementReader FirstElement();
XElementReader Element(XName name);
IEnumerable<XElementReader> Elements();
IEnumerable<XElementReader> Elements(XName name);

XElement ToXElement();

Although XElementReader looks a lot like a subset of XElement, you still have to remember that it's using an XmlReader underneath, so every time you navigate to a child element you advance the reader position, and the reader can't go back.

For example, if you have an element like <Root><A/><B/><A/><B/></Root> and call .Element("A") twice and then .Element("B") twice, the second call for B will return null, because the reader has already moved past the first B. If you instead call .Elements("A") and then .Elements("B"), you'll get two A elements, but no B elements at all. So to do this right, you have to iterate on .Elements() and check the .Name property to see if you're in an A or in a B element.
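The same forward-only behavior can be reproduced with a plain XmlReader, which may help to see why the calls above behave that way. This is only an analogy under that assumption, not the XElementReader implementation; ForwardOnlyDemo and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class ForwardOnlyDemo   // hypothetical name, for illustration only
{
    const string Xml = "<Root><A/><B/><A/><B/></Root>";

    // Looks for A twice and then for B twice, never rewinding the reader.
    public static bool[] SeekInOrder()
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(Xml)))
        {
            return new[]
            {
                reader.ReadToFollowing("A"), // lands on the first <A/>
                reader.ReadToFollowing("A"), // lands on the second <A/>, passing the first <B/>
                reader.ReadToFollowing("B"), // lands on the last <B/>
                reader.ReadToFollowing("B"), // nothing left to find
            };
        }
    }

    // Visits every child once, in document order, and records its name.
    // Assumes Root has only element children with no whitespace between them.
    public static List<string> VisitChildren()
    {
        var visited = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(Xml)))
        {
            reader.MoveToContent();   // positions the reader on <Root>
            reader.Read();            // steps to the first child
            while (reader.NodeType == XmlNodeType.Element)
            {
                visited.Add(reader.LocalName); // decide here whether it is an A or a B
                reader.Skip();                 // moves past this element to the next sibling
            }
        }
        return visited;
    }
}
```

SeekInOrder() yields true, true, true, false, mirroring the null you get from the second .Element("B") call, while VisitChildren() sees A, B, A, B because it handles each child as the reader reaches it.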

This read-once nature sometimes also complicates debugging. To help with that, you can define XML_DEBUG_MODE to force XElementReader to use an XElement under the covers instead of an XmlReader, so you can add watches freely without worrying about side effects. But remember to fully test with this conditional compilation symbol off.
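As a reminder of how such a symbol works, it can be defined with #define XML_DEBUG_MODE at the very top of a source file, with the compiler's -define:XML_DEBUG_MODE option, or in the project's conditional compilation symbols. The sketch below only shows the #if mechanism; BackendProbe is a hypothetical helper and not part of XElementReader.

```csharp
using System;

static class BackendProbe   // hypothetical helper, for illustration only
{
    // Reports which branch the compiler kept for this build.
    public static string Backend
    {
        get
        {
#if XML_DEBUG_MODE
            return "XElement";   // debug-friendly: in-memory tree, safe to inspect repeatedly
#else
            return "XmlReader";  // streaming: forward-only, inspecting it has side effects
#endif
        }
    }
}
```

Because the unused branch is removed at compile time, a build with the symbol defined never exercises the streaming path, which is why the final tests have to run with the symbol off.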

Here's the full code: