How to remove HTML tags from string in C#

By FoxLearn 1/9/2025 4:45:22 AM
In this post, we will explore how to remove HTML tags from a string in C#.

This is a common task when processing text, especially when you need to extract plain content from HTML code or when you're working with user-generated HTML input.

A string in C# might contain HTML elements, and our goal is to strip them out. This is useful when displaying HTML content as simple text and removing any HTML formatting, such as bold, italic, or hyperlinks.

There are various methods to achieve this, but here we will focus on two common approaches:

Remove HTML Tags using RegEx

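RegEx is one of the simplest ways to remove HTML tags from a string. It works well for simple, well-formed markup, but it is not a full HTML parser: a > character inside an attribute value ends the match early, and the text inside <script> and <style> elements is left in the output.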

// 1. Using a regular expression to remove HTML tags
// FinalData is a string variable that already holds the HTML input.
FinalData = System.Text.RegularExpressions.Regex.Replace(FinalData, "<[^>]*>", "");

This code uses a regular expression to match all content enclosed in < and >, which is characteristic of HTML tags, and replaces them with an empty string.
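
As a minimal sketch of how this pattern might be wrapped for reuse, the helper below keeps the regex in a static field and, as an optional extra step that is an assumption of this example, decodes HTML entities such as &amp; with WebUtility.HtmlDecode. The names HtmlTagStripper and StripTags are hypothetical, not part of any library.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class HtmlTagStripper
{
    // "<", then any run of characters other than ">", then ">".
    private static readonly Regex TagPattern =
        new Regex("<[^>]*>", RegexOptions.Compiled);

    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Remove the tags first, then turn entities such as &amp; into plain characters.
        string withoutTags = TagPattern.Replace(html, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        string input = "<p>Fish &amp; <b>chips</b></p>";
        Console.WriteLine(StripTags(input)); // prints "Fish & chips"
    }
}
```

Decoding is done after stripping so that escaped text such as &lt;b&gt; ends up as the literal characters <b> in the output instead of being removed as a tag.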

Remove HTML Tags Without RegEx

If you prefer not to use RegEx, you can remove HTML tags manually by scanning the string one character at a time and collecting the output in a StringBuilder. This approach checks for the < and > characters and skips everything between them.

// Requires: using System.Text;
public string RemoveHTMLTagsManually(string html)
{
    var result = new StringBuilder();
    bool isInsideTag = false;

    foreach (char currentChar in html)
    {
        if (currentChar == '<')
        {
            // Start skipping content inside tag
            isInsideTag = true;
        }
        else if (currentChar == '>')
        {
            // Stop skipping content inside tag
            isInsideTag = false;
        }
        else if (!isInsideTag)
        {
            // Add the character to result if not inside a tag
            result.Append(currentChar);
        }
    }

    return result.ToString();
}

This method works by iterating through the input HTML string. When it encounters a <, it starts skipping characters until it finds a >, thus removing the HTML tags.
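
To see how the flag-based scan behaves, the sketch below restates the same loop as a static method and runs it on two made-up inputs (the class name TagScanner is hypothetical). The second input shows a limitation: a literal < in ordinary text is treated as the start of a tag, so the text up to the next > is lost.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the loop follows the same logic as the method above.
public static class TagScanner
{
    public static string RemoveHtmlTagsManually(string html)
    {
        var result = new StringBuilder(html.Length);
        bool isInsideTag = false;

        foreach (char currentChar in html)
        {
            if (currentChar == '<')
            {
                isInsideTag = true;          // start skipping
            }
            else if (currentChar == '>')
            {
                isInsideTag = false;         // stop skipping
            }
            else if (!isInsideTag)
            {
                result.Append(currentChar);  // keep text outside tags
            }
        }

        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveHtmlTagsManually("<p>Hello <b>world</b></p>")); // prints "Hello world"

        // The bare '<' is mistaken for a tag start, so " 2 and 3 " is dropped.
        Console.WriteLine(RemoveHtmlTagsManually("1 < 2 and 3 > 2")); // prints "1  2"
    }
}
```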

Sometimes, you may want to remove specific HTML elements, such as lists (<ul>) or tables (<table>). Below is an example of how to remove a <ul> element from an HTML string using custom logic:

// Example: Removing <ul> elements (and their contents) from an HTML string
// Requires: using System;
while (true)
{
    // Case-insensitive search, so <UL> is found as well as <ul>
    int start = htmlContent.IndexOf("<ul", StringComparison.OrdinalIgnoreCase);
    if (start < 0)
    {
        break; // No more <ul> elements
    }

    int end = htmlContent.IndexOf("</ul>", start, StringComparison.OrdinalIgnoreCase);
    if (end < 0)
    {
        break; // Unclosed <ul>: stop instead of looping forever
    }

    // Remove everything from "<ul" through the end of "</ul>"
    htmlContent = htmlContent.Remove(start, end + "</ul>".Length - start);
}

This approach locates each <ul> element and removes it, together with its contents, from the HTML string. If a <ul> is never closed, the loop gives up rather than running forever. Nested lists are not handled correctly, because the search stops at the first </ul> it finds and leaves the outer closing tag behind.
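
The same index-based idea can be generalized to other elements such as <table>. The following is a minimal sketch under that assumption; the ElementRemover class and its RemoveElements method are hypothetical names, and like the loop above it does not handle nested elements of the same type.

```csharp
using System;

// Hypothetical helper; names and behavior are illustrative only.
public static class ElementRemover
{
    public static string RemoveElements(string html, string tagName)
    {
        string open = "<" + tagName;
        string close = "</" + tagName + ">";
        int searchFrom = 0;

        while (true)
        {
            int start = html.IndexOf(open, searchFrom, StringComparison.OrdinalIgnoreCase);
            if (start < 0)
            {
                break; // no more opening tags
            }

            // Require the tag name to end here, so "<ul" does not match "<ultra>".
            int afterName = start + open.Length;
            if (afterName < html.Length && char.IsLetterOrDigit(html[afterName]))
            {
                searchFrom = afterName;
                continue;
            }

            int end = html.IndexOf(close, afterName, StringComparison.OrdinalIgnoreCase);
            if (end < 0)
            {
                break; // unclosed element: leave the rest untouched
            }

            // Remove the opening tag, the contents, and the closing tag.
            html = html.Remove(start, end + close.Length - start);
            searchFrom = start;
        }

        return html;
    }

    public static void Main()
    {
        string input = "<p>a</p><ul><li>x</li></ul><p>b</p>";
        Console.WriteLine(RemoveElements(input, "ul")); // prints "<p>a</p><p>b</p>"
    }
}
```

Tracking the position in searchFrom means the string is not rescanned from the beginning on every pass, and an unclosed element ends the loop immediately.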

C# Remove <div> Tags from HTML String

Here's a method that uses RegEx to remove <div> elements, including their contents, from an HTML string:

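// Requires: using System.Text.RegularExpressions;
public static string RemoveDivTags(string input)
{
   // \b keeps the pattern from matching other tags that merely start with "div"
   return Regex.Replace(input, @"<div\b[^>]*>.*?</div>", String.Empty,
       RegexOptions.Singleline | RegexOptions.IgnoreCase);
}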

This method removes each <div> element together with its contents. Because the match is non-greedy, it ends at the first </div>, so nested <div> elements are not removed cleanly.
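
The short sketch below illustrates that behavior on two made-up inputs. It uses a pattern with \b after div so that tags such as <divider> are not matched; the class name DivRemovalDemo and the sample strings are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class showing how a lazy match treats flat and nested <div> elements.
public static class DivRemovalDemo
{
    public static string RemoveDivElements(string input)
    {
        return Regex.Replace(input, @"<div\b[^>]*>.*?</div>", string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // A single, non-nested <div> is removed along with its contents.
        Console.WriteLine(RemoveDivElements("<p>keep</p><div class=\"ad\">drop</div>")); // prints "<p>keep</p>"

        // With nesting, the match ends at the first </div>, leaving a fragment behind.
        Console.WriteLine(RemoveDivElements("<div>a<div>b</div>c</div>d")); // prints "c</div>d"
    }
}
```

When the input may contain nested elements, a real HTML parser is usually a safer choice than a regular expression.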

How to remove all HTML tags in JavaScript

If you need to remove HTML tags in JavaScript, you can use the following regular expression:

item = item.replace(/<(.|\n)*?>/g, '');

This JavaScript code will strip out all HTML tags from the given item string, leaving only the plain text content.

By using one of these approaches, you can easily remove HTML tags from strings in both C# and JavaScript.