How to Use Regex Capturing Groups to Extract Data in C#

By FoxLearn 12/20/2024 2:12:43 AM
In regex, capturing groups allow you to extract and save portions of text for later use. These groups can either be unnamed and referenced by index, or they can be named for easier reference.
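
To make the difference concrete, here is a minimal sketch that reads the same log level once by index and once by name. The class name GroupBasics, its method names, and the shortened patterns are illustrative assumptions, not part of the parser built later in this article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupBasics
{
    // Unnamed groups are numbered left to right starting at 1; group 0 is the whole match.
    public static string LevelByIndex(string line)
    {
        Match m = Regex.Match(line, @"\[(\w+)\] (\S+)");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    // Named groups are looked up by the name given in (?<name>...).
    public static string LevelByName(string line)
    {
        Match m = Regex.Match(line, @"\[(?<level>\w+)\] (?<ip>\S+)");
        return m.Success ? m.Groups["level"].Value : string.Empty;
    }

    public static void Main()
    {
        const string line = "2024-12-19 12:45:02 [INFO] 192.168.1.1 User logged in";
        Console.WriteLine(LevelByIndex(line));
        Console.WriteLine(LevelByName(line));
    }
}
```

Both calls return the same text. The named version keeps working if another group is later added in front of it, which is why the rest of this article uses names.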

In this article, we will demonstrate how to use named capturing groups to extract specific data from server log files, helping you analyze key information like IP addresses and timestamps.

Let's assume you have a server log file with entries similar to the following:

2024-12-19 12:45:02 [INFO] 192.168.1.1 User logged in
2024-12-19 12:46:15 [ERROR] 192.168.1.2 Failed login attempt
2024-12-19 12:47:30 [INFO] 192.168.1.3 User logged out

These logs include the timestamp, log level, IP address, and the log message.

Your goal is to extract the timestamp, log level, IP address, and message from each log entry.

Determine What Data You Want to Extract

We want to extract:

  • Timestamp: The date and time of the log entry.
  • Log Level: The type of log message, such as INFO or ERROR.
  • IP Address: The IP address that triggered the log.
  • Message: The free-text description that follows the IP address.

For example, we want to convert the following log line:

2024-12-19 12:45:02 [INFO] 192.168.1.1 User logged in

into

Timestamp               | Log Level | IP Address      | Message
--------------------------------------------------------------
2024-12-19 12:45:02     | INFO      | 192.168.1.1     | User logged in

Write the Regex

To extract this data, we will write a regex pattern. A regex tester is a convenient way to build the pattern and check it against sample log lines.

Here’s the regex pattern to extract the timestamp, log level, IP address, and message:

(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<ip>\d+\.\d+\.\d+\.\d+) (?<message>.*)
  • (?<timestamp> ...): This is a named capturing group called "timestamp".
  • \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}: Matches the timestamp in the format YYYY-MM-DD HH:MM:SS.
  • \[ and \]: Matches the literal square brackets around the log level.
  • (?<level>\w+): A named capturing group called "level" that matches one or more word characters (INFO, ERROR, etc.).
  • (?<ip> ...): A named capturing group called "ip".
  • \d+\.\d+\.\d+\.\d+: Matches the IP address pattern, consisting of four groups of digits separated by dots. It does not check that each group is in the range 0 to 255.
  • (?<message> ...): A named capturing group called "message".
  • .*: Matches the rest of the line (the log message). By default, . does not match the newline character \n, so the match stops at the end of the line.
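
Before wiring the pattern into a parser, it can help to try it against a single line. The following is a minimal sketch of such a check; the class PatternCheck and the helper TryCapture are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // The pattern from this article, unchanged.
    private const string Pattern =
        @"(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<ip>\d+\.\d+\.\d+\.\d+) (?<message>.*)";

    // Returns true and the text captured by one named group, or false when the line does not match.
    public static bool TryCapture(string line, string groupName, out string value)
    {
        Match match = Regex.Match(line, Pattern);
        value = match.Success ? match.Groups[groupName].Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        const string line = "2024-12-19 12:46:15 [ERROR] 192.168.1.2 Failed login attempt";
        foreach (string name in new[] { "timestamp", "level", "ip", "message" })
        {
            if (TryCapture(line, name, out string value))
            {
                Console.WriteLine(name + ": " + value);
            }
        }

        // A line that does not follow the log format produces no match at all.
        Console.WriteLine(TryCapture("not a log line", "level", out _));
    }
}
```

Checking Match.Success before reading the groups matters because a failed match still exposes a Groups collection, only with empty values.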

Now that we have the regex, let's write the C# code that uses it to extract the data from the logs.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace RegexCapturingGroups
{
    public class LogEntry
    {
        public string Timestamp { get; set; }
        public string Level { get; set; }
        public string IpAddress { get; set; }
        public string Message { get; set; }
    }

    public class LogParser
    {
        // Built once and reused for every call. Each named group maps to one LogEntry property.
        private static readonly Regex regex = new Regex(@"(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<ip>\d+\.\d+\.\d+\.\d+) (?<message>.*)", RegexOptions.Compiled);

        public List<LogEntry> ParseLogs(string rawLogData)
        {
            var logEntries = new List<LogEntry>();

            foreach (Match match in regex.Matches(rawLogData))
            {
                logEntries.Add(new LogEntry()
                {
                    Timestamp = match.Groups["timestamp"].Value,
                    Level = match.Groups["level"].Value,
                    IpAddress = match.Groups["ip"].Value,
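                    // '.' matches '\r' but not '\n', so drop the '\r' left behind by Windows line endings.
                    Message = match.Groups["message"].Value.TrimEnd('\r')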
                });
            }

            return logEntries;
        }
    }
}
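
As a usage illustration, the sketch below runs the same pattern over the three sample log lines and formats each match as a row of the table shown earlier. It is a self-contained variant, not the LogParser class above: the names LogTableDemo and ToRows and the column widths are assumptions chosen to reproduce that table layout.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogTableDemo
{
    // Same pattern as in the article.
    private static readonly Regex LogLine = new Regex(
        @"(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<ip>\d+\.\d+\.\d+\.\d+) (?<message>.*)");

    // Builds one table row per matching log line; text that does not match is skipped.
    public static List<string> ToRows(string rawLogData)
    {
        var rows = new List<string>();
        foreach (Match match in LogLine.Matches(rawLogData))
        {
            string timestamp = match.Groups["timestamp"].Value;
            string level = match.Groups["level"].Value;
            string ip = match.Groups["ip"].Value;
            // '.' also matches '\r', so trim it in case the input uses Windows line endings.
            string message = match.Groups["message"].Value.TrimEnd('\r');

            // Negative alignment values left-align the text and pad it to a fixed width.
            rows.Add($"{timestamp,-23} | {level,-9} | {ip,-15} | {message}");
        }
        return rows;
    }

    public static void Main()
    {
        string rawLogData = string.Join("\r\n", new[]
        {
            "2024-12-19 12:45:02 [INFO] 192.168.1.1 User logged in",
            "2024-12-19 12:46:15 [ERROR] 192.168.1.2 Failed login attempt",
            "2024-12-19 12:47:30 [INFO] 192.168.1.3 User logged out"
        });

        foreach (string row in ToRows(rawLogData))
        {
            Console.WriteLine(row);
        }
    }
}
```

Because Regex.Matches scans the whole input, one call handles every line in the text. For very large files, reading and matching one line at a time is a reasonable alternative.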

By using regex capturing groups in C#, we can easily extract specific data from a complex text structure, such as server logs. This technique can be adapted for many other scenarios where structured data needs to be extracted from unstructured text.