Tuesday, October 20, 2009

How To: Strip Illegal XML Characters from a String in VB.NET

Recently I had trouble with string data being sent to an .asmx web service I had built. Calls to the service were failing with the following exception message:

"Response is not well-formed XML System.Xml.XmlException: ' ', hexadecimal value 0x13, is an invalid character."

The cause was my users copying and pasting data from Microsoft Word into a WYSIWYG editor that preserved illegal characters, such as the one (hexadecimal 0x13, displayed as '!!') shown in the exception above.

Rather than add calls that shield the web service from the bad data, I decided to research building a method that would strip out illegal characters before the data is placed into my business object on the front end. Of course I could check it on the back end too to be thorough, but this is what was appropriate for my scenario.

There turns out to be some information on this topic, but oddly enough most of the solutions were written for Java and PHP. The .NET solutions I found were incomplete and only half worked. The best solution I came across was one written in Java at Ben J. Christensen's Blog. With help from users on the ASP.NET forums I was able to put all the information I had found together and come up with a VB.NET version of the code. I really credit Ben and the forum for the base code help; thank you.

The code's purpose is to take the passed-in string value and check each character one by one to see whether it is a legal XML character. All valid characters are appended to the output, and illegal characters are omitted.

If you need the C# version, check the ASP.NET forum thread mentioned above. The main difference is that the 'AscW' function wrapping the current character is not required in C#. This is because C# implicitly converts a char to an integer in a comparison, while VB.NET requires an explicit conversion. The final code is below, and hopefully this .NET version will help somebody in the future as it did for me.
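To make the 'AscW' point concrete, here is a minimal sketch (the module name and sample value are only for illustration). VB.NET will not compile a direct comparison between a Char and an Integer, so the character is first converted to its numeric code point:

```vbnet
Imports System

Module AscWDemo
    Sub Main()
        ' The character from the exception message: hexadecimal 0x13
        Dim current As Char = ChrW(&H13)

        ' VB.NET has no implicit Char-to-Integer conversion, so AscW is
        ' used to get the numeric code before comparing it to a range.
        Dim code As Integer = AscW(current)

        Console.WriteLine(code)          ' prints 19
        Console.WriteLine(code >= &H20)  ' prints False: below the valid range
    End Sub
End Module
```

In C# the equivalent comparison can be written directly against the char, which is why the C# version has no counterpart to 'AscW'.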


'Requires: Imports System.Text (for StringBuilder)
Public Shared Function RemoveIllegalXMLCharacters(ByVal Content As String) As String

    'Exit and return an empty string if nothing was passed in to the method
    If String.IsNullOrEmpty(Content) Then
        Return String.Empty
    End If

    'Used to hold the output.
    Dim textOut As New StringBuilder(Content.Length)

    'Loop through the length of the content one character at a time to see if there
    'are any illegal characters to be removed:
    Dim i As Integer = 0
    While i < Content.Length
        'Reference the current character and its numeric code
        Dim current As Char = Content(i)
        Dim code As Integer = AscW(current)

        'Only append valid characters back to the StringBuilder
        If code = &H9 OrElse code = &HA OrElse code = &HD _
            OrElse (code >= &H20 AndAlso code <= &HD7FF) _
            OrElse (code >= &HE000 AndAlso code <= &HFFFD) Then
            textOut.Append(current)
        ElseIf i + 1 < Content.Length AndAlso Char.IsSurrogatePair(current, Content(i + 1)) Then
            'A Char is a single UTF-16 code unit, so characters in the range
            '&H10000 to &H10FFFF arrive as a surrogate pair; keep both halves
            textOut.Append(current).Append(Content(i + 1))
            i += 1
        End If
        i += 1
    End While

    'Return the screened content with only valid characters
    Return textOut.ToString()

End Function

Someone asked how this method could be modified to accept and return an 'XmlDocument' type. Only a few small code changes are needed to support this, and the result makes a good overload of the original function. You will need to import the System.Xml and System.IO namespaces for this overload.


Public Shared Function RemoveIllegalXMLCharacters(ByVal XmlDoc As XmlDocument) As XmlDocument

    'Use a StringWriter and an XmlTextWriter to extract the raw text from the XmlDocument passed in:
    Dim sw As New StringWriter()
    Dim xw As New XmlTextWriter(sw)
    XmlDoc.WriteTo(xw)
    xw.Flush()
    Dim Content As String = sw.ToString()

    'Exit and return Nothing if the document produced no text
    If String.IsNullOrEmpty(Content) Then
        Return Nothing
    End If

    'Reuse the String overload to screen out the illegal characters
    Dim cleaned As String = RemoveIllegalXMLCharacters(Content)

    'Build a new XmlDocument to return containing the screened content with only valid characters
    Dim XmlDocClean As New XmlDocument()
    XmlDocClean.LoadXml(cleaned)
    Return XmlDocClean

End Function
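One point worth separating out: stripping illegal characters is not the same as escaping reserved ones. The method above leaves '&' (hexadecimal 0x26) in place because it falls inside the valid range, so it still has to be escaped before it goes into markup. The sketch below is only an illustration with a made-up sample string. It uses System.Security.SecurityElement.Escape to show the reverse case: the call escapes '&' but leaves a control character such as 0x13 untouched.

```vbnet
Imports System
Imports System.Security

Module EscapeDemo
    Sub Main()
        ' Sample text (hypothetical): a reserved '&' plus the illegal 0x13 character
        Dim raw As String = "Fish & Chips" & ChrW(&H13)

        ' SecurityElement.Escape replaces <, >, &, " and ' with entities
        Dim escaped As String = SecurityElement.Escape(raw)

        ' The ampersand has been escaped...
        Console.WriteLine(escaped.StartsWith("Fish &amp; Chips", StringComparison.Ordinal)) ' prints True

        ' ...but the illegal control character is still in the string
        Console.WriteLine(escaped.IndexOf(ChrW(&H13)) >= 0) ' prints True
    End Sub
End Module
```

So the two steps complement each other: strip the characters XML can never contain, then escape the reserved ones (or let an XML writer do the escaping) when the text is placed into a document.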

12 comments:

  1. Any thoughts on adapting it to take an xml file as opposed to the content of the xml file as a string?

  2. I added an overload accepting and returning an XMLDocument type to the original post that should help. Thanks for reading.

  3. This solution is great!!. Thanks!!

  4. This doesn't seem to replace/fix the "&" character which is considered invalid for XML files. Could you update this to correct that? And perhaps list the characters that it does correct?

  5. You might also wish to check for Numerically Coded References (NCR) to illegal characters in both decimal and hexadecimal form, like and , which are also illegal. I have the C# code at my blog: http://xponentsoftware.com/Articles/Removing illegal characters.aspx

  6. Does this not do the same thing?
    System.Security.SecurityElement.Escape

  7. Thanks for posting this code!

  8. Can you show how to load the xml file to use the function. As a string and xmldocument. trouble with calling the function on the returned data

  9. This method is not working for hexadecimal value 0x03! Is there any solution for that?

  10. This is exactly what I needed, thank you

  11. Thank you. Works like a charm!
