How to Trim a UTF-8 string to the specified number of bytes in C#
By Tan Lee. Published on Feb 05, 2025
To trim a UTF-8 string to a specified number of bytes in C#, you need to account for the fact that characters in UTF-8 encoding can vary in size, ranging from 1 to 4 bytes.
When you trim based on byte length, you must ensure that multi-byte characters are not split, or else you’ll end up with invalid sequences.
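To make the variable width concrete, here is a minimal sketch that prints the UTF-8 size of a few sample characters. The class name `Utf8WidthDemo` and the helper `Utf8Bytes` are illustrative names, not part of the article's code; the only library call used is `Encoding.UTF8.GetByteCount`.

```csharp
using System;
using System.Text;

public static class Utf8WidthDemo
{
    // Hypothetical helper: number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8Bytes(string text) => Encoding.UTF8.GetByteCount(text);

    public static void Main()
    {
        Console.WriteLine(Utf8Bytes("A"));          // 1 byte  (ASCII)
        Console.WriteLine(Utf8Bytes("\u00E9"));     // 2 bytes (é)
        Console.WriteLine(Utf8Bytes("\u4E16"));     // 3 bytes (世)
        Console.WriteLine(Utf8Bytes("\U0001F600")); // 4 bytes (emoji, stored as a surrogate pair in a .NET string)
    }
}
```

Note that `string.Length` counts UTF-16 code units, so it matches the UTF-8 byte count only when every character is ASCII.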
The following example shows how you can trim a UTF-8 string to a specific byte length:
using System;
using System.Text;

public static class StringExtensions
{
    public static string TrimToByteLength(this string input, int byteLength)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        int currentBytes = Encoding.UTF8.GetByteCount(input);
        if (currentBytes <= byteLength)
            return input;

        // If the string contains only 1-byte characters (like ASCII), use substring for efficiency
        if (currentBytes == input.Length)
            return input.Substring(0, byteLength);

        // Convert the string to a byte array
        byte[] bytesArray = Encoding.UTF8.GetBytes(input);

        // Resize the byte array to the desired byte length
        Array.Resize(ref bytesArray, byteLength);

        // Convert the byte array back to a string
        string result = Encoding.UTF8.GetString(bytesArray, 0, byteLength);

        // Trim off the potential replacement character '�' if multi-byte characters are cut
        return result.TrimEnd('�');
    }
}

class Program
{
    static void Main()
    {
        string input = "Hello, 世界!"; // "世界" is 6 bytes in UTF-8
        int byteLength = 10;

        string trimmed = input.TrimToByteLength(byteLength);

        Console.WriteLine($"Original: {input}");
        Console.WriteLine($"Trimmed: {trimmed}");
    }
}
In this example:
- The original string "Hello, 世界!" consists of 14 bytes in UTF-8: 7 bytes for "Hello, ", 3 bytes each for "世" and "界", and 1 byte for "!".
- After trimming to 10 bytes, the result "Hello, 世" is returned. The 10-byte limit falls exactly on the boundary after "世", so no multi-byte character is cut in the middle.
Output:
Original: Hello, 世界!
Trimmed: Hello, 世
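The 10-byte limit above happens to land on a character boundary. To see what the decode-and-trim step does when the limit falls inside "世", the following sketch repeats that step for limits of 7 to 10 bytes. `DecodePrefix` is a hypothetical helper written for this illustration; it mirrors the `GetString` plus `TrimEnd` logic but is not the article's extension method.

```csharp
using System;
using System.Text;

public static class MidCharacterCutDemo
{
    // Hypothetical helper: decode the first byteLength bytes and drop trailing U+FFFD.
    public static string DecodePrefix(string input, int byteLength)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(input);
        // Clamp the limit so it never exceeds the encoded length or goes negative.
        int count = Math.Max(0, Math.Min(byteLength, bytes.Length));
        // Encoding.UTF8 substitutes U+FFFD for an incomplete trailing sequence.
        string decoded = Encoding.UTF8.GetString(bytes, 0, count);
        return decoded.TrimEnd('\uFFFD');
    }

    public static void Main()
    {
        string input = "Hello, 世界!";
        for (int limit = 7; limit <= 10; limit++)
        {
            string trimmed = DecodePrefix(input, limit);
            int kept = Encoding.UTF8.GetByteCount(trimmed);
            Console.WriteLine($"{limit} bytes -> \"{trimmed}\" ({kept} bytes kept)");
        }
    }
}
```

Limits of 7, 8, and 9 bytes all yield "Hello, " (7 bytes kept), because "世" only fits once all 3 of its bytes are within the limit; a limit of 10 yields "Hello, 世".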
Handling Multi-Byte Characters:
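UTF-8 characters vary in length (1 to 4 bytes), so a cut at an arbitrary byte offset can land inside a multi-byte character. When that happens, the decoder substitutes the replacement character � (U+FFFD) for the incomplete sequence, and the method trims it off with TrimEnd so the result contains only complete characters. Be aware that TrimEnd also removes any genuine � characters that happen to sit at the end of the truncated text.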
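The method above relies on the decoder emitting a replacement character. As an alternative sketch (not the approach used in this article), you can step back from the cut point to the nearest character boundary before decoding, using the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx. The names `Utf8BoundaryTrim` and `TrimToByteLengthAtBoundary` are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8BoundaryTrim
{
    // Hypothetical alternative: cut only at a character boundary, never inside a sequence.
    public static string TrimToByteLengthAtBoundary(string input, int byteLength)
    {
        if (string.IsNullOrEmpty(input)) return input;
        if (byteLength <= 0) return string.Empty;

        byte[] bytes = Encoding.UTF8.GetBytes(input);
        if (bytes.Length <= byteLength) return input;

        // bytes[end] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut is inside a character, so
        // move back until bytes[end] is the lead byte of that character.
        int end = byteLength;
        while (end > 0 && (bytes[end] & 0xC0) == 0x80)
            end--;

        return Encoding.UTF8.GetString(bytes, 0, end);
    }

    public static void Main()
    {
        Console.WriteLine(TrimToByteLengthAtBoundary("Hello, 世界!", 9));  // Hello, 
        Console.WriteLine(TrimToByteLengthAtBoundary("Hello, 世界!", 10)); // Hello, 世
    }
}
```

Because no replacement character is ever produced, this variant keeps a genuine U+FFFD that appears at the end of the kept text, at the cost of always materializing the full byte array.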