How to Trim a UTF-8 string to the specified number of bytes in C#

By Tan Lee. Published on Feb 05, 2025

To trim a UTF-8 string to a specified number of bytes in C#, you need to account for the fact that characters in UTF-8 encoding can vary in size, ranging from 1 to 4 bytes.
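To make the size difference concrete, here is a minimal sketch that reports how many bytes a piece of text occupies in UTF-8. The class name `Utf8ByteSizes` and the method `ByteCount` are hypothetical names chosen for illustration; only `Encoding.UTF8.GetByteCount` comes from the framework.

```csharp
using System.Text;

public static class Utf8ByteSizes
{
    // Number of bytes the text occupies once encoded as UTF-8.
    public static int ByteCount(string text) => Encoding.UTF8.GetByteCount(text);

    // ByteCount("A")          -> 1 (ASCII)
    // ByteCount("\u00E9")     -> 2 (é)
    // ByteCount("\u4E16")     -> 3 (世)
    // ByteCount("\U0001F600") -> 4 (an emoji, stored as two chars in a .NET string)
}
```

Note that the last case has a string `Length` of 2 but a byte count of 4, so neither the character count nor the byte count can be derived from the other without encoding the text.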

When you trim based on byte length, you must ensure that multi-byte characters are not split, or else you’ll end up with invalid sequences.
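The following sketch shows what goes wrong when the cut ignores character boundaries. It assumes the default behavior of `Encoding.UTF8`, which substitutes the replacement character U+FFFD for bytes it cannot decode; `NaiveUtf8Cut` and `CutWithoutBoundaryCheck` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

public static class NaiveUtf8Cut
{
    // Decodes only the first byteCount bytes of the UTF-8 encoding of text,
    // without checking whether the cut lands inside a multi-byte character.
    public static string CutWithoutBoundaryCheck(string text, int byteCount)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);

        // Clamp the requested count to the valid range of the array
        int count = Math.Min(Math.Max(byteCount, 0), bytes.Length);

        return Encoding.UTF8.GetString(bytes, 0, count);
    }
}
```

For instance, "世界" is 6 bytes, so cutting it at 4 bytes keeps "世" plus the first byte of "界". The decoded result is "世" followed by U+FFFD rather than a clean string.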

For example, here is how you can trim a UTF-8 string to a specific byte length:

using System;
using System.Text;

public static class StringExtensions
{
    public static string TrimToByteLength(this string input, int byteLength)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        int currentBytes = Encoding.UTF8.GetByteCount(input);
        if (currentBytes <= byteLength)
            return input;

        // If the string contains only 1-byte characters (like ASCII), use substring for efficiency
        if (currentBytes == input.Length)
            return input.Substring(0, byteLength);

        // Convert the string to a byte array
        byte[] bytesArray = Encoding.UTF8.GetBytes(input);

        // Decode only the first byteLength bytes. GetString takes a count,
        // so the array does not need to be resized first. A character cut
        // in the middle decodes to the replacement character U+FFFD.
        string result = Encoding.UTF8.GetString(bytesArray, 0, byteLength);

        // Trim off any trailing replacement character left by a cut multi-byte character
        return result.TrimEnd('\uFFFD');
    }
}

class Program
{
    static void Main()
    {
        string input = "Hello, 世界!"; // "世界" is 6 bytes in UTF-8
        int byteLength = 10;

        string trimmed = input.TrimToByteLength(byteLength);
        Console.WriteLine($"Original: {input}");
        Console.WriteLine($"Trimmed: {trimmed}");
    }
}

In this example:

The original string "Hello, 世界!" consists of 14 bytes in UTF-8: 7 for "Hello, ", 6 for "世界", and 1 for "!".
After trimming to 10 bytes, the result "Hello, 世" is returned. The cut falls exactly after the 3-byte character "世", so no multi-byte character is split.

Output:

Original: Hello, 世界!
Trimmed: Hello, 世

Handling Multi-Byte Characters:

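UTF-8 characters vary in length (1 to 4 bytes), so a cut at an arbitrary byte offset can land in the middle of a character. When that happens, decoding the truncated bytes produces the replacement character � (U+FFFD) in place of the incomplete sequence, and the method trims it off. The returned string can therefore be shorter than the requested number of bytes. Note that a replacement character that was genuinely part of the input and ends up at the cut point is trimmed as well.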
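As an alternative to trimming the replacement character after decoding, the cut position can be moved back to a character boundary before decoding. The sketch below is not the method shown above. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, and `Utf8Truncation`, `TruncateAtCharBoundary`, and `maxBytes` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text;

public static class Utf8Truncation
{
    // Returns the longest prefix of input whose UTF-8 encoding fits in maxBytes
    // without splitting a multi-byte character.
    public static string TruncateAtCharBoundary(string input, int maxBytes)
    {
        if (string.IsNullOrEmpty(input))
            return input;
        if (maxBytes < 0)
            throw new ArgumentOutOfRangeException(nameof(maxBytes));

        byte[] bytes = Encoding.UTF8.GetBytes(input);
        if (bytes.Length <= maxBytes)
            return input;

        // bytes[end] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut is inside a character, so
        // step back until end points at that character's lead byte.
        int end = maxBytes;
        while (end > 0 && (bytes[end] & 0xC0) == 0x80)
            end--;

        return Encoding.UTF8.GetString(bytes, 0, end);
    }
}
```

With the input "Hello, 世界!", a limit of 10 returns "Hello, 世", while limits of 8 or 9 both return "Hello, " because "世" does not fit. Because nothing is trimmed after decoding, a U+FFFD that was already present in the input is preserved.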
