How to Trim a UTF-8 string to the specified number of bytes in C#
By Tan Lee. Published on Feb 05, 2025
To trim a UTF-8 string to a specified number of bytes in C#, you need to account for the fact that characters in UTF-8 encoding can vary in size, ranging from 1 to 4 bytes.
When you trim based on byte length, you must ensure that multi-byte characters are not split, or else you’ll end up with invalid sequences.
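To make the size variation concrete, here is a small sketch that prints the UTF-8 byte count of a few sample characters and shows what happens when a multi-byte character is cut short. The class name `Utf8SizeDemo`, the helper `Utf8Length`, and the sample characters are illustrative choices, not part of the method discussed in this article.

```csharp
using System;
using System.Text;

public static class Utf8SizeDemo
{
    // Returns the number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8Length(string text) => Encoding.UTF8.GetByteCount(text);

    public static void Main()
    {
        // Escapes are used so the code points are unambiguous:
        // U+0041 (A), U+00E9 (é), U+4E16 (世), U+1F600 (emoji).
        string[] samples = { "A", "\u00E9", "\u4E16", "\U0001F600" };
        foreach (string s in samples)
            Console.WriteLine($"{s}: {Utf8Length(s)} byte(s)"); // 1, 2, 3 and 4 bytes

        // "世" is encoded as three bytes. Decoding only the first two
        // gives an incomplete sequence, which the decoder replaces with U+FFFD.
        byte[] full = Encoding.UTF8.GetBytes("\u4E16");
        string broken = Encoding.UTF8.GetString(full, 0, 2);
        Console.WriteLine(broken.IndexOf('\uFFFD') >= 0); // True
    }
}
```

The second part is the situation any byte-based trim has to guard against: the byte limit is respected, but the last character is no longer valid text.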
For example, here is how you can trim a string to a specific UTF-8 byte length:
```csharp
using System;
using System.Text;

public static class StringExtensions
{
    public static string TrimToByteLength(this string input, int byteLength)
    {
        // A negative limit is a caller error; fail early with a clear exception
        if (byteLength < 0)
            throw new ArgumentOutOfRangeException(nameof(byteLength));

        if (string.IsNullOrEmpty(input))
            return input;

        int currentBytes = Encoding.UTF8.GetByteCount(input);
        if (currentBytes <= byteLength)
            return input;

        // If the string contains only 1-byte characters (like ASCII), use Substring for efficiency
        if (currentBytes == input.Length)
            return input.Substring(0, byteLength);

        // Convert the string to a byte array
        byte[] bytesArray = Encoding.UTF8.GetBytes(input);

        // Decode only the first byteLength bytes back to a string
        string result = Encoding.UTF8.GetString(bytesArray, 0, byteLength);

        // Trim off the replacement character (U+FFFD) left behind when a multi-byte character is cut.
        // Note: this also removes a genuine U+FFFD that happens to end the kept text.
        return result.TrimEnd('\uFFFD');
    }
}

class Program
{
    static void Main()
    {
        string input = "Hello, 世界!"; // "世界" is 6 bytes in UTF-8
        int byteLength = 10;
        string trimmed = input.TrimToByteLength(byteLength);

        Console.WriteLine($"Original: {input}");
        Console.WriteLine($"Trimmed: {trimmed}");
    }
}
```
In this example:
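- The original string "Hello, 世界!" consists of 14 bytes in UTF-8: 7 for "Hello, ", 3 each for "世" and "界", and 1 for "!".
- After trimming to 10 bytes, the result "Hello, 世" is returned. The 10-byte limit falls exactly after "世", so no character is split here. With a limit of 8 or 9 bytes, the cut would land inside "世", and the method would return "Hello, " instead of a broken character.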
Output:
```
Original: Hello, 世界!
Trimmed: Hello, 世
```
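The byte arithmetic for the sample string can be checked with a short sketch that lists the running UTF-8 byte total after each character. `CumulativeByteCounts` is a hypothetical helper written for this illustration; it treats a surrogate pair as one character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ByteOffsetDemo
{
    // Returns the running UTF-8 byte total after each character of the input.
    // A valid surrogate pair is counted as a single character.
    public static int[] CumulativeByteCounts(string input)
    {
        var totals = new List<int>();
        int running = 0;
        for (int i = 0; i < input.Length; i++)
        {
            bool isPair = char.IsHighSurrogate(input[i])
                          && i + 1 < input.Length
                          && char.IsLowSurrogate(input[i + 1]);
            int charCount = isPair ? 2 : 1;

            running += Encoding.UTF8.GetByteCount(input.Substring(i, charCount));
            totals.Add(running);

            // Skip the low surrogate that was already counted.
            i += charCount - 1;
        }
        return totals.ToArray();
    }

    public static void Main()
    {
        int[] totals = CumulativeByteCounts("Hello, 世界!");
        Console.WriteLine(string.Join(", ", totals));
        // prints: 1, 2, 3, 4, 5, 6, 7, 10, 13, 14
    }
}
```

A byte limit is safe to cut at only if it appears in this list. Here 10 is a character boundary, while 8 and 9 are not, so a trim to 8 or 9 bytes has to fall back to 7.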
Handling Multi-Byte Characters:
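UTF-8 characters can vary in length (1 to 4 bytes), so a cut at an arbitrary byte offset can land in the middle of a character. When that happens, the decoder cannot interpret the incomplete trailing bytes and substitutes the replacement character � (U+FFFD), which the method then trims off. One caveat: if the kept portion of the input legitimately ends with U+FFFD, that character is trimmed as well.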
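An alternative that avoids relying on the replacement character is to find the character boundary directly in the byte array. In UTF-8, every continuation byte has the bit pattern `10xxxxxx`, so the cut position can be moved backwards until it no longer points at one. The following is a minimal sketch of that idea, not the method shown earlier; `Utf8Truncation` and `TruncateAtCharBoundary` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text;

public static class Utf8Truncation
{
    // Hypothetical helper: returns the longest prefix of input whose
    // UTF-8 encoding fits in maxBytes without splitting a character.
    public static string TruncateAtCharBoundary(string input, int maxBytes)
    {
        if (string.IsNullOrEmpty(input))
            return input;
        if (maxBytes <= 0)
            return string.Empty;

        byte[] bytes = Encoding.UTF8.GetBytes(input);
        if (bytes.Length <= maxBytes)
            return input;

        // bytes[cut] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut is inside a character,
        // so step back until it points at that character's lead byte.
        int cut = maxBytes;
        while (cut > 0 && (bytes[cut] & 0xC0) == 0x80)
            cut--;

        return Encoding.UTF8.GetString(bytes, 0, cut);
    }

    public static void Main()
    {
        string input = "Hello, 世界!";
        Console.WriteLine(TruncateAtCharBoundary(input, 10)); // Hello, 世
        Console.WriteLine(TruncateAtCharBoundary(input, 9));  // Hello, (followed by a space)
    }
}
```

Because nothing is trimmed after decoding, a genuine U+FFFD at the end of the kept text is preserved. The loop steps back at most three bytes, since a UTF-8 character has at most three continuation bytes.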