Azure Web App XML Data at the root level is invalid. Line 1, position 1

Question

Azure Web App XML Data at the root level is invalid. Line 1, position 1

Given two XML files, whose content start as follows:

File A

<?xml

File B

ï»¿<?xml

File B would error. So we implemented the following code to remove the BOM, which has been working in a production environment for years:

private static string RemoveUTF8ByteOrderMark(string str)
{
    var byteOrderMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

    if (str.StartsWith(byteOrderMarkUtf8))
    {    
        str = str.Remove(0, byteOrderMarkUtf8.Length);
    }

    return str;
}

Once the BOM is removed, we parse the string to XML using:

public static XDocument ParseXmlDocumentFromText(string fileText)
{
    if (string.IsNullOrEmpty(fileText)) return null;

    var nsm = new XmlNamespaceManager(new NameTable());

    nsm.AddNamespace("*****", "*****");

    var ctx = new XmlParserContext(null, nsm, null, XmlSpace.Default);
    var settings = new XmlReaderSettings { ProhibitDtd = false, XmlResolver = null };

    using (var fs = new StringReader(fileText))
    {
        using (var reader = XmlReader.Create(fs, settings, ctx))
        {
            var doc = new XmlDocument();

            doc.Load(reader);

            return XDocument.Parse(doc.OuterXml);
        }
    }
}

Now, we have migrated our website from a dedicated server to an Azure Web App, and while File B continues to load correctly, File A errors at doc.Load(reader);.

System.Web.HttpUnhandledException (0x80004005): Exception of type 'System.Web.HttpUnhandledException' was thrown. ---> System.Xml.XmlException: Data at the root level is invalid. Line 1, position 1.

In Azure, if I disable the RemoveUTF8ByteOrderMark(...) call, then File A loads correctly, and File B errors (as expected).

When I test locally on my computer, both files load with RemoveUTF8ByteOrderMark(...) enabled, which is consistent with our old dedicated server. In all 3 environments, the XML files are being pulled from Azure blob storage, so the exact same files are always being used.

What's going on in the Azure Web App that's changing how this code is run?

Update

In Azure, when RemoveUTF8ByteOrderMark(...) is called, I can see the text being returned is as follows:

File A

?xml

File B

<?xml

So why is RemoveUTF8ByteOrderMark(...) apparently causing the leading < to be stripped off?

c#

asp.net

encoding

azure-web-app-service

asked on Stack Overflow Jul 15, 2019 by

StronglyTyped • edited Jul 16, 2019 by

StronglyTyped

1 Answer

It seems that your code has no issue for processing a file with UTF-8 encoding. However, according to the wiki page Byte order mark, there is different byte length of BOM head in a file for different encoding, as the figure below.

So a solution to remove file BOM head for general purpose, you need to detect the file encoding first, then to remove the different length bytes of the file header to get the real content without BOM.

There is a GitHub repo AutoItConsulting/text-encoding-detect in C# and C++ to help detecting the text content encoding and checking BOM, as the code below.

/// <summary>
///     Gets the BOM length for a given Encoding mode.
/// </summary>
/// <param name="encoding"></param>
/// <returns>The BOM length.</returns>
public static int GetBomLengthFromEncodingMode(Encoding encoding)
{
    int length;

    switch (encoding)
    {
        case Encoding.Utf16BeBom:
        case Encoding.Utf16LeBom:
            length = 2;
            break;

        case Encoding.Utf8Bom:
            length = 3;
            break;

        default:
            length = 0;
            break;
    }

    return length;
}

/// <summary>
///     Checks for a BOM sequence in a byte buffer.
/// </summary>
/// <param name="buffer"></param>
/// <param name="size"></param>
/// <returns>Encoding type or Encoding.None if no BOM.</returns>
public Encoding CheckBom(byte[] buffer, int size)
{
    // Check for BOM
    if (size >= 2 && buffer[0] == _utf16LeBom[0] && buffer[1] == _utf16LeBom[1])
    {
        return Encoding.Utf16LeBom;
    }

    if (size >= 2 && buffer[0] == _utf16BeBom[0] && buffer[1] == _utf16BeBom[1])
    {
        return Encoding.Utf16BeBom;
    }

    if (size >= 3 && buffer[0] == _utf8Bom[0] && buffer[1] == _utf8Bom[1] && buffer[2] == _utf8Bom[2])
    {
        return Encoding.Utf8Bom;
    }

    return Encoding.None;
}

I think you can directly use or just change these code for fixing your issue to remove possible BOM bytes in a file.

Meanwhile, if you only need to process the content of XML file, I think a simple way is to IndexOf with <?xml for a string with possible BOM bytes, because the <?xml string is a fixed content in XML DTD.

Here is my sample code, it works for my xml file with or without BOM bytes.

int index = str.IndexOf("<?xml");
return str.Substring(index);

Hope it helps.

answered on Stack Overflow Jul 16, 2019 by

Peter Pan

User contributions licensed under CC BY-SA 3.0