How to Trim a UTF-8 string to the specified number of bytes in C#

By FoxLearn 2/5/2025 9:37:57 AM
To trim a UTF-8 string to a specified number of bytes in C#, you need to account for the fact that characters in UTF-8 encoding can vary in size, ranging from 1 to 4 bytes.
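
To make the size differences concrete, here is a minimal sketch that reports how many bytes a few sample characters need. The class and method names (Utf8WidthDemo, Utf8Length, PrintWidths) are hypothetical and only serve the illustration; Encoding.UTF8.GetByteCount and char.ConvertToUtf32 are the standard .NET calls.

```csharp
using System;
using System.Text;

public static class Utf8WidthDemo
{
    // Returns the number of bytes the UTF-8 encoding of `text` occupies.
    public static int Utf8Length(string text) => Encoding.UTF8.GetByteCount(text);

    public static void PrintWidths()
    {
        // One code point per entry, written as escapes so the source file
        // encoding does not matter: A, é, 世, and an emoji.
        string[] samples = { "A", "\u00E9", "\u4E16", "\U0001F600" };

        foreach (string s in samples)
        {
            int codePoint = char.ConvertToUtf32(s, 0);
            Console.WriteLine(
                $"U+{codePoint:X4}: {Utf8Length(s)} UTF-8 byte(s), {s.Length} UTF-16 code unit(s)");
        }
    }
}
```

The four samples take 1, 2, 3, and 4 bytes respectively. Note that string.Length counts UTF-16 code units, not bytes, which is why it cannot be used directly as a byte budget.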

When you trim based on byte length, you must ensure that multi-byte characters are not split, or else you’ll end up with invalid sequences.
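
As a small illustration of what "invalid sequence" means in practice, the sketch below decodes only a prefix of a string's UTF-8 bytes. The names SplitSequenceDemo and DecodePrefix are hypothetical; the behavior shown relies on the default Encoding.UTF8 instance, which substitutes the replacement character U+FFFD for bytes it cannot decode rather than throwing.

```csharp
using System;
using System.Text;

public static class SplitSequenceDemo
{
    // Decodes only the first `count` bytes of the UTF-8 encoding of `text`.
    // If `count` falls inside a multi-byte character, the incomplete bytes
    // are decoded as the replacement character U+FFFD.
    public static string DecodePrefix(string text, int count)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes, 0, count);
    }
}
```

For the 3-byte character "\u4E16" (世), DecodePrefix("\u4E16", 3) returns the character intact, while DecodePrefix("\u4E16", 2) returns a string containing U+FFFD instead. How many replacement characters appear for one cut sequence can differ between .NET versions, so it is safer not to assume exactly one.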

The following example shows one way to trim a string to a specific UTF-8 byte length:

using System;
using System.Text;

public static class StringExtensions
{
    public static string TrimToByteLength(this string input, int byteLength)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        int currentBytes = Encoding.UTF8.GetByteCount(input);
        if (currentBytes <= byteLength)
            return input;

        // If the string contains only 1-byte characters (like ASCII), use substring for efficiency
        if (currentBytes == input.Length)
            return input.Substring(0, byteLength);

        // Convert the string to its UTF-8 bytes
        byte[] bytesArray = Encoding.UTF8.GetBytes(input);

        // Decode only the first byteLength bytes; a multi-byte character that is
        // cut at the end decodes to the replacement character U+FFFD
        string result = Encoding.UTF8.GetString(bytesArray, 0, byteLength);

        // Trim the trailing replacement character(s) left by a cut character.
        // Note: this also removes a U+FFFD that was already at the end of the kept text.
        return result.TrimEnd('\uFFFD');
    }
}

class Program
{
    static void Main()
    {
        string input = "Hello, 世界!"; // "世界" is 6 bytes in UTF-8
        int byteLength = 10;

        string trimmed = input.TrimToByteLength(byteLength);
        Console.WriteLine($"Original: {input}");
        Console.WriteLine($"Trimmed: {trimmed}");
    }
}

In this example:

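  • The original string "Hello, 世界!" consists of 14 bytes in UTF-8: 7 for "Hello, ", 6 for "世界" (3 bytes each), and 1 for "!".
  • After trimming to 10 bytes, the result "Hello, 世" is returned. Here the limit falls exactly on the boundary after "世" (7 + 3 bytes). With a limit of 8 or 9 bytes the cut would land inside "世", and the method would drop the partial character and return "Hello, ".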

Output:

Original: Hello, 世界!
Trimmed: Hello, 世

Handling Multi-Byte Characters:

UTF-8 characters vary in length (1 to 4 bytes), so a byte limit can fall in the middle of a character. When that happens, decoding the truncated bytes produces the replacement character '�' (U+FFFD) in place of the incomplete sequence. The method then trims it off, so the result never ends with a partial character.
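
An alternative that does not depend on the replacement character is to move the cut point back to a character boundary before decoding. The sketch below is not the method from this article; it is a minimal variant under the assumption that the input encodes to well-formed UTF-8 (which Encoding.UTF8.GetBytes guarantees). The names Utf8Truncation and TruncateAtCharBoundary are hypothetical. It relies on the fact that continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.

```csharp
using System;
using System.Text;

public static class Utf8Truncation
{
    // Hypothetical helper: returns the longest prefix of `input` whose UTF-8
    // encoding fits in `maxBytes` bytes without splitting a code point.
    public static string TruncateAtCharBoundary(string input, int maxBytes)
    {
        if (string.IsNullOrEmpty(input))
            return input;
        if (maxBytes <= 0)
            return string.Empty;

        byte[] bytes = Encoding.UTF8.GetBytes(input);
        if (bytes.Length <= maxBytes)
            return input;

        // bytes[cut] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut falls inside a multi-byte
        // sequence, so step back until it points at the start of a character.
        int cut = maxBytes;
        while (cut > 0 && (bytes[cut] & 0xC0) == 0x80)
            cut--;

        return Encoding.UTF8.GetString(bytes, 0, cut);
    }
}
```

With "Hello, 世界!" this returns "Hello, 世" for a 10-byte limit and "Hello, " for a 9-byte limit. Because nothing is trimmed after decoding, a U+FFFD that was genuinely part of the input is preserved. Like the original method, it works on code points, so it can still separate a base character from a combining mark that follows it.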