Reading large XML files

Have you ever had to read and deserialize a large XML file? Something like 500 MB or 2 GB, too large to read into memory in one go and then parse.

I faced this problem with an XML file that looks like the following:

<?xml version="1.0" encoding="UTF-8"?>
<record id="0" Text="Some text for record 0" />
<record id="1" Text="Some text for record 1" />
<record id="2" Text="Some text for record 2" />
<record id="3" Text="Some text for record 3" />
...
<record id="1000000" Text="Some text for record 1000000" />

We can't just read the whole file as a string and pass it to the deserializer because it is too large. Moreover, the file has no root element, which breaks the deserializer.

Additionally, we probably don't want to load all parsed elements into memory. Instead, we need to produce an IEnumerable so that elements can be processed one by one.

And last, it would be great to have a generic version of the code :)

First, let's define the record class:

using System.Xml.Serialization;

[XmlRoot(ElementName = "record")]
public class XmlRecord
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlAttribute("Text")]
    public string Text { get; set; }
}

Make sure the class is marked with XmlRoot, because we are going to read the file element by element and treat each record as a whole XML document.
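To illustrate why the XmlRoot attribute matters, here is a minimal sketch that deserializes a single record element as if it were a complete document. The RecordParsing class and the ParseRecord helper are hypothetical names used only for illustration; XmlRecord is repeated so that the snippet compiles on its own.

```csharp
using System.IO;
using System.Xml.Serialization;

[XmlRoot(ElementName = "record")]
public class XmlRecord
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlAttribute("Text")]
    public string Text { get; set; }
}

// Hypothetical helper class, not part of the original article.
public static class RecordParsing
{
    // Treats the given text as a standalone XML document whose root is <record>.
    public static XmlRecord ParseRecord(string xml)
    {
        var serializer = new XmlSerializer(typeof(XmlRecord));
        using (var reader = new StringReader(xml))
        {
            return (XmlRecord)serializer.Deserialize(reader);
        }
    }
}
```

Calling RecordParsing.ParseRecord with a single `<record id="7" Text="hello" />` element should give an object with Id 7 and Text "hello". The root element name the serializer expects comes from the XmlRoot attribute.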

In order to read the file by fragments, we can use the XmlReader with appropriate settings:

var settings =
    new XmlReaderSettings
    {
        ConformanceLevel = ConformanceLevel.Fragment,
        IgnoreWhitespace = true
    };
 
using (var reader = XmlReader.Create(xmlFilePath, settings))
{
    // ...
}

We can start by moving the cursor to the first content node and then read elements one by one until the end of the file:

reader.MoveToContent();
while (!reader.EOF)
{
    // reader.ReadSubtree();
    // ...
}
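The loop above can be tried on a small in-memory string before pointing it at a real file. The following is a sketch under assumptions: the FragmentWalk class and the TopLevelIds method are hypothetical names, and the input is assumed to contain only elements at the top level (ReadSubtree throws if the reader is not positioned on an element).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper class, not part of the original article.
public static class FragmentWalk
{
    // Returns the "id" attribute of every top-level element in a rootless XML text.
    public static List<string> TopLevelIds(string xml)
    {
        var settings =
            new XmlReaderSettings
            {
                ConformanceLevel = ConformanceLevel.Fragment,
                IgnoreWhitespace = true
            };

        var ids = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                using (var subtree = reader.ReadSubtree())
                {
                    // Position the subtree reader on the element itself.
                    subtree.Read();
                    ids.Add(subtree.GetAttribute("id"));
                }

                // Closing the subtree reader leaves the outer reader at the end
                // of the current element, so one Read() moves to the next node.
                reader.Read();
            }
        }

        return ids;
    }
}
```

Each pass of the loop only touches the current element, which is what keeps memory usage flat regardless of how many records the input holds.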

Let's put everything together:

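using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Serialization;

public static IEnumerable<T> ReadLargeXml<T>(string xmlFilePath)
{
    var settings =
        new XmlReaderSettings
        {
            // Allow several top-level elements instead of a single root.
            ConformanceLevel = ConformanceLevel.Fragment,
            IgnoreWhitespace = true
        };

    using (var reader = XmlReader.Create(xmlFilePath, settings))
    {
        var serializer = new XmlSerializer(typeof(T));

        // Skip the XML declaration and stop at the first element, if any.
        reader.MoveToContent();
        while (!reader.EOF)
        {
            // Load only the current element into memory.
            var element = XElement.Load(reader.ReadSubtree());
            var record = (T)serializer.Deserialize(
                element.CreateReader());

            yield return record;

            // Step past the current element to the next node.
            reader.Read();
        }
    }
}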

This method is generic. It returns a lazy enumeration of elements, so we can process them one by one. The input XML file stays open until the enumeration finishes or the enumerator is disposed. The method also correctly handles the case when the file is empty or contains only the XML declaration.
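As a usage sketch, the method can be exercised against a small temporary file. The LargeXmlDemo class and the ReadSampleIds helper are hypothetical and exist only for this illustration; ReadLargeXml and XmlRecord are repeated from above so that the snippet compiles on its own.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Serialization;

[XmlRoot(ElementName = "record")]
public class XmlRecord
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlAttribute("Text")]
    public string Text { get; set; }
}

// Hypothetical demo class, not part of the original article.
public static class LargeXmlDemo
{
    // Same method as in the article.
    public static IEnumerable<T> ReadLargeXml<T>(string xmlFilePath)
    {
        var settings =
            new XmlReaderSettings
            {
                ConformanceLevel = ConformanceLevel.Fragment,
                IgnoreWhitespace = true
            };

        using (var reader = XmlReader.Create(xmlFilePath, settings))
        {
            var serializer = new XmlSerializer(typeof(T));

            reader.MoveToContent();
            while (!reader.EOF)
            {
                var element = XElement.Load(reader.ReadSubtree());
                yield return (T)serializer.Deserialize(element.CreateReader());
                reader.Read();
            }
        }
    }

    // Writes a rootless sample file with `count` records, streams it back,
    // and returns the ids in the order they were read.
    public static List<int> ReadSampleIds(int count)
    {
        var path = Path.GetTempFileName();
        try
        {
            using (var writer = new StreamWriter(path))
            {
                writer.WriteLine("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
                for (var i = 0; i < count; i++)
                {
                    writer.WriteLine(
                        $"<record id=\"{i}\" Text=\"Some text for record {i}\" />");
                }
            }

            var ids = new List<int>();
            // foreach disposes the enumerator, which closes the file.
            foreach (var record in ReadLargeXml<XmlRecord>(path))
            {
                ids.Add(record.Id);
            }

            return ids;
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Because the records are consumed inside a foreach loop, only one XmlRecord needs to be alive at a time. Calling ToList() on the result would load everything into memory and defeat the purpose.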
