Normalizing unicode strings in C# - Info Support Blog

- By Willem Meints
- 15 years ago
- .NET
Currently I’m working on a tool that migrates titles and other information into website URIs. There is something ugly about URIs that causes quite a bit of trouble when converting normal text into a URI: many characters cannot appear in them as-is. Spaces, for example, are transformed into the encoded variant %20. The same happens to the @ sign and other characters. Some characters simply need to be replaced, but others need special care.
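To make the encoding problem concrete, here is a small sketch that uses the framework's Uri.EscapeDataString method. The class name UriEncodingDemo and the sample inputs are made up for illustration, and the tool described in this post may well use a different encoding routine.

```csharp
using System;

// Hypothetical helper, not part of the tool described in this post.
public static class UriEncodingDemo
{
    // Percent-encodes a piece of text so it can be used inside a URI.
    //   Encode("mail me @ home") returns "mail%20me%20%40%20home"
    //   Encode("\u00EB") (the letter ë) returns "%C3%AB"
    public static string Encode(string text)
    {
        return Uri.EscapeDataString(text);
    }
}
```

The second case shows why accented letters need special care: a single ë turns into six characters of percent-encoded UTF-8, which is not something you want in a readable URI.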

Take for example characters like ë, ä, ç, etc. What I wanted to do here is convert them into e, a and c respectively. After looking high and low for a simple solution, I found the following trick:

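using System.Globalization;
using System.Text;

// Decompose each accented character into its base letter plus combining marks.
resultValue = resultValue.Normalize(NormalizationForm.FormD);
StringBuilder normalizedOutputBuilder = new StringBuilder();

foreach (char c in resultValue)
{
    UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);

    // Skip the combining marks (the accents) and keep everything else.
    if (category != UnicodeCategory.NonSpacingMark)
    {
        normalizedOutputBuilder.Append(c);
    }
}

// Store the text without accents.
resultValue = normalizedOutputBuilder.ToString();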

The first step is to normalize the Unicode string to form D (canonical decomposition). This splits é into the plain letter e followed by a separate combining mark for the accent. The next step is to filter out those marks, which leaves the letters, spaces and other characters in the output string.
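The two steps can be wrapped in a reusable method. The following is a minimal sketch of that idea; the names DiacriticsHelper and RemoveDiacritics are my own choice for this example and do not come from the framework.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper class; the name is an assumption made for this example.
public static class DiacriticsHelper
{
    // Returns the input with combining accent marks removed.
    public static string RemoveDiacritics(string input)
    {
        // Step 1: decompose, so "é" (U+00E9) becomes "e" followed by
        // U+0301 (combining acute accent).
        string decomposed = input.Normalize(NormalizationForm.FormD);
        StringBuilder output = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Step 2: drop the combining marks, keep letters, spaces
            // and everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                output.Append(c);
            }
        }

        return output.ToString();
    }
}
```

Calling DiacriticsHelper.RemoveDiacritics("Citroën café") gives "Citroen cafe". Keep in mind that this only works for characters that actually decompose: a letter such as ø has no canonical decomposition, so it passes through unchanged and still needs a manual replacement.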

Pretty simple once you know how 😉
