How to Trim a UTF-8 string to the specified number of bytes in C#
By FoxLearn 2/5/2025 9:37:57 AM
To trim a UTF-8 string to a specified number of bytes in C#, you need to account for the fact that characters in UTF-8 encoding can vary in size, ranging from 1 to 4 bytes.
When you trim based on byte length, you must ensure that multi-byte characters are not split, or else you’ll end up with invalid sequences.
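To make the variable width concrete, here is a minimal sketch. The class Utf8Demo and its helpers ByteCount and NaiveCut are hypothetical names used only for illustration. It reports the UTF-8 size of a string and shows what decoding produces when the byte array is cut without regard to character boundaries.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // Hypothetical helper: number of bytes the string occupies in UTF-8.
    public static int ByteCount(string s) => Encoding.UTF8.GetByteCount(s);

    // Hypothetical helper: decode only the first `count` bytes,
    // ignoring character boundaries (this is the unsafe cut).
    public static string NaiveCut(string s, int count)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return Encoding.UTF8.GetString(bytes, 0, Math.Min(count, bytes.Length));
    }
}
```

With these helpers, ByteCount returns 1 for "A", 2 for "\u00E9" (é), 3 for "\u4E16" (世) and 4 for "\U0001F600" (an emoji stored as a surrogate pair). NaiveCut("Hello, 世界!", 8) keeps only the first byte of 世, so the decoded text ends with the replacement character U+FFFD instead of a valid character.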
The following example shows one way to trim a UTF-8 string to a specific byte length:
using System;
using System.Text;

public static class StringExtensions
{
    public static string TrimToByteLength(this string input, int byteLength)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        int currentBytes = Encoding.UTF8.GetByteCount(input);
        if (currentBytes <= byteLength)
            return input;

        // If the string contains only 1-byte characters (like ASCII), use substring for efficiency
        if (currentBytes == input.Length)
            return input.Substring(0, byteLength);

        // Convert the string to a byte array
        byte[] bytesArray = Encoding.UTF8.GetBytes(input);

        // Resize the byte array to the desired byte length
        Array.Resize(ref bytesArray, byteLength);

        // Convert the byte array back to a string
        string result = Encoding.UTF8.GetString(bytesArray, 0, byteLength);

        // Trim off the potential replacement character (U+FFFD) if multi-byte characters are cut
        return result.TrimEnd('\uFFFD');
    }
}

class Program
{
    static void Main()
    {
        string input = "Hello, 世界!"; // "世界" is 6 bytes in UTF-8
        int byteLength = 10;

        string trimmed = input.TrimToByteLength(byteLength);

        Console.WriteLine($"Original: {input}");
        Console.WriteLine($"Trimmed: {trimmed}");
    }
}
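In this example:
- The original string "Hello, 世界!" consists of 14 bytes in UTF-8: 7 for "Hello, ", 3 each for "世" and "界", and 1 for "!".
- After trimming to 10 bytes, the result "Hello, 世" is returned. Here the limit falls exactly on a character boundary, so nothing is split. With a limit of 8 or 9 bytes, the cut would land inside "世", and the method would return "Hello, " after stripping the replacement character.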
Output:
Original: Hello, 世界!
Trimmed: Hello, 世
Handling Multi-Byte Characters:
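UTF-8 characters vary in length (1 to 4 bytes), so cutting the byte array at an arbitrary position can leave an incomplete sequence at the end. When that sequence is decoded, Encoding.UTF8.GetString substitutes the replacement character � (U+FFFD) for it, and the method then removes it with TrimEnd. Be aware that TrimEnd also removes any replacement characters that were already present at the end of the kept text, so input that legitimately contains U+FFFD at that position is shortened further.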
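As an alternative to decoding and then stripping U+FFFD, the cut position can be moved back to a character boundary before decoding. The sketch below is not the method shown in the article. The names Utf8Trim and TrimToByteLengthAtBoundary are hypothetical, and it assumes that a non-positive byte limit should yield an empty string. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```csharp
using System;
using System.Text;

public static class Utf8Trim
{
    // Hypothetical alternative: trim at a UTF-8 character boundary
    // so that no replacement character is ever produced by the cut.
    public static string TrimToByteLengthAtBoundary(string input, int byteLength)
    {
        if (string.IsNullOrEmpty(input))
            return input;
        if (byteLength <= 0)
            return string.Empty; // assumption: no room for any character

        byte[] bytes = Encoding.UTF8.GetBytes(input);
        if (bytes.Length <= byteLength)
            return input;

        // bytes[cut] is the first byte that would be dropped.
        // If it is a continuation byte (10xxxxxx), the cut is inside a
        // character, so step back until it sits on that character's lead byte.
        int cut = byteLength;
        while (cut > 0 && (bytes[cut] & 0xC0) == 0x80)
            cut--;

        return Encoding.UTF8.GetString(bytes, 0, cut);
    }
}
```

For "Hello, 世界!", a limit of 10 returns "Hello, 世" and a limit of 9 returns "Hello, ". Because nothing is stripped after decoding, a replacement character that is genuinely part of the input is kept: trimming "ab\uFFFDcd" to 5 bytes returns "ab\uFFFD".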