Remove non-alphanumeric characters from a string in C#

By FoxLearn 2/5/2025 9:00:29 AM
A simple way to remove non-alphanumeric characters from a string is to use a regular expression (regex):
if (string.IsNullOrEmpty(s))
    return s;

return Regex.Replace(s, "[^a-zA-Z0-9]", "");

Note: Regex.Replace() throws an ArgumentNullException when the input string is null, which is why the code checks for a null or empty string first.

Using regex is a straightforward method for filtering characters by "category," such as retaining only alphanumeric characters. However, keep in mind that regex is slower than the alternatives benchmarked later in this article, which is something to consider if performance is critical.

This example only retains ASCII alphanumeric characters. If you need to handle other alphabets or character sets, see the sections on char.IsLetterOrDigit() and on non-ASCII characters in regex later in this article.
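To make the ASCII-only behavior concrete, here is a minimal sketch that wraps the regex snippet above in a method and runs it on two inputs. The class name RegexExample and the method name RemoveNonAlphanumeric are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexExample
{
    // Hypothetical helper name; the body is the regex snippet from this article.
    public static string RemoveNonAlphanumeric(string s)
    {
        // Regex.Replace throws on null input, so return early for null or empty.
        if (string.IsNullOrEmpty(s))
            return s;

        // [^a-zA-Z0-9] matches every character that is NOT an ASCII letter or digit.
        return Regex.Replace(s, "[^a-zA-Z0-9]", "");
    }

    public static void Main()
    {
        Console.WriteLine(RemoveNonAlphanumeric("Hello, World! 123")); // HelloWorld123
        // The accented letter is not ASCII, so it is removed as well.
        Console.WriteLine(RemoveNonAlphanumeric("Caf\u00E9 #5"));      // Caf5
    }
}
```

Note that the second call drops the "é" entirely, which is usually not what you want for text in other languages.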

For Optimal Performance, Use a Loop

An alternative method that improves performance is looping through the string and checking each character individually. In the benchmark later in this article, this method was about 7.5 times faster than regex.

if (string.IsNullOrEmpty(s))
    return s;

StringBuilder sb = new StringBuilder();
foreach (var c in s)
{
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'))
        sb.Append(c);
}
return sb.ToString();

The loop performs only a few range comparisons per character, so it avoids the overhead of the regex engine.
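The loop above can be packaged as a reusable method, sketched below. The names LoopExample and RemoveNonAlphanumeric are hypothetical, and pre-sizing the StringBuilder with s.Length is an optional tweak that is not part of the original snippet (and not reflected in the benchmark numbers in this article).

```csharp
using System;
using System.Text;

public static class LoopExample
{
    // Hypothetical helper name; same logic as the loop shown in this article.
    public static string RemoveNonAlphanumeric(string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;

        // Assumption: pre-size the buffer, since the result is never longer than the input.
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            // Keep only ASCII letters and digits.
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'))
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveNonAlphanumeric("user_name-42@example.com")); // username42examplecom
    }
}
```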

Avoid Using Compiled Regex

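Using compiled regex is unlikely to improve performance much in this case. In the benchmark later in this article it was only about 11% faster than non-compiled regex (4457 ns vs. 5016 ns) and still far slower than a loop. In some scenarios, it could even be slower than non-compiled regex. A simpler approach is to use the static Regex.Replace() method, which avoids the need to manage a compiled regex object.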

Example of using compiled regex:

private static readonly Regex regex = new Regex("[^a-zA-Z0-9]", RegexOptions.Compiled);

public static string RemoveNonAlphanumericChars(string s)
{
    if (string.IsNullOrEmpty(s))
        return s;

    return regex.Replace(s, "");
}

Use char.IsLetterOrDigit() for Unicode Alphanumeric Characters

If you need to keep all Unicode alphanumeric characters, you can use char.IsLetterOrDigit(), which returns true for any character that Unicode classifies as a letter or a decimal digit, regardless of script.

For example, this would allow characters from many languages and scripts (like Greek or Arabic):

if (string.IsNullOrEmpty(s))
    return s;

StringBuilder sb = new StringBuilder();
foreach (var c in s)
{
    if (char.IsLetterOrDigit(c))
        sb.Append(c);
}
return sb.ToString();

Note: char.IsLetterOrDigit() is broader than necessary in situations where you only need to keep a specific set of characters. In those cases, it's best to specify exactly which characters you want.
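The difference between the broad check and an explicit character set can be sketched as follows. All names here (UnicodeExample, KeepLettersAndDigits, KeepAsciiAndGreek) are hypothetical, and the "ASCII alphanumerics plus Greek letters" rule in the second method is just one assumed example of a specific set.

```csharp
using System;
using System.Text;

public static class UnicodeExample
{
    // Keeps every character that Unicode classifies as a letter or decimal digit.
    public static string KeepLettersAndDigits(string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;

        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (char.IsLetterOrDigit(c))
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Assumed narrower rule: ASCII letters/digits plus letters in the Greek block.
    public static string KeepAsciiAndGreek(string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;

        var sb = new StringBuilder();
        foreach (char c in s)
        {
            bool isAscii = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            bool isGreekLetter = c >= '\u0370' && c <= '\u03FF' && char.IsLetter(c);
            if (isAscii || isGreekLetter)
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string input = "Ελληνικά, Привет 123!";
        Console.WriteLine(KeepLettersAndDigits(input)); // ΕλληνικάПривет123
        Console.WriteLine(KeepAsciiAndGreek(input));    // Ελληνικά123
    }
}
```

The first method keeps the Cyrillic word because it is made of letters; the second drops it because it falls outside the explicitly allowed set.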

Performance Comparison of Methods

To compare performance, I benchmarked four methods for removing non-alphanumeric characters from a string. The test was done with a string of 100 characters.

  • Regex: 5016 ns (compiled regex: 4457 ns)
  • LINQ: 1506 ns
  • Loop: 663 ns

The loop-based method outperforms all other methods by a significant margin: it is roughly 2.3 times faster than LINQ and about 7.5 times faster than non-compiled regex.
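The LINQ variant from the benchmark is not shown in this article. One plausible form, offered only as a sketch and not necessarily the exact code that was measured, filters the characters with Where() and builds a new string from the result. The names LinqExample, RemoveNonAlphanumeric, and IsAsciiAlphanumeric are hypothetical.

```csharp
using System;
using System.Linq;

public static class LinqExample
{
    // Hypothetical helper: true only for ASCII letters and digits.
    private static bool IsAsciiAlphanumeric(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');

    // Assumed LINQ-based equivalent of the loop: filter, then materialize as a string.
    public static string RemoveNonAlphanumeric(string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;

        return new string(s.Where(c => IsAsciiAlphanumeric(c)).ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(RemoveNonAlphanumeric("Order #A-1001 (paid)")); // OrderA1001paid
    }
}
```

This version is shorter than the loop, but it goes through an iterator and an intermediate array, which is consistent with it landing between regex and the loop in the results above.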

Handling Non-ASCII Characters in Regex

What if you need to deal with non-ASCII characters? For instance, if you're working with Greek characters (like "Ελληνικά"), you can specify the Unicode range for Greek characters:

Regex.Replace(s, "[^\u0370-\u03FF]", "");

Alternatively, you can use a Unicode named block to make the regex more readable:

Regex.Replace(s, @"[^\p{IsGreek}]", "");

Or, you can specify exactly which Unicode characters to allow, such as the lowercase Greek letters α to ω plus the individual characters ά and Ε:

Regex.Replace(s, "[^α-ωάΕ]", "");

This approach helps ensure you're working with the exact characters you need, making the code more readable and easier to maintain.
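The three patterns above can be compared side by side with a small sketch. The class and method names (GreekRegexExample, ByRange, ByNamedBlock, ByExplicitSet) are hypothetical, and null handling is left out for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreekRegexExample
{
    // Keeps anything in the Greek block, given as an explicit code point range.
    public static string ByRange(string s) => Regex.Replace(s, "[^\u0370-\u03FF]", "");

    // Same block, written with the .NET named block IsGreek.
    public static string ByNamedBlock(string s) => Regex.Replace(s, @"[^\p{IsGreek}]", "");

    // Keeps only lowercase α to ω plus the two individually listed characters.
    public static string ByExplicitSet(string s) => Regex.Replace(s, "[^α-ωάΕ]", "");

    public static void Main()
    {
        string input = "Hello Ελληνικά 123!";
        Console.WriteLine(ByRange(input));       // Ελληνικά
        Console.WriteLine(ByNamedBlock(input));  // Ελληνικά
        Console.WriteLine(ByExplicitSet(input)); // Ελληνικά

        // The explicit set is much narrower: Ω and έ are not listed, so they are dropped.
        Console.WriteLine(ByRange("Ωμέγα"));       // Ωμέγα
        Console.WriteLine(ByExplicitSet("Ωμέγα")); // μγα
    }
}
```

All three patterns agree on the sample word, but the last one only works for input that uses the characters you listed, so it needs to be extended for other Greek text.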