How to remove duplicates from a list in C#

By FoxLearn 12/25/2024 4:35:34 AM
Removing duplicates from a list is a common task in programming. The most efficient way to handle it is by iterating through the list and using a data structure that ensures uniqueness, such as a HashSet.

1. Remove Duplicates with ToHashSet() and ToList()

One of the most concise ways to remove duplicates in C# is to chain the LINQ methods ToHashSet() and ToList().

using System.Linq;

var list = new List<int> { 1, 1, 1, 2, 2, 2, 3 };
var dedupedList = list.ToHashSet().ToList();

In this example:

  • ToHashSet(): This method iterates over the list and adds each item to a HashSet. Since a HashSet only allows unique elements, any duplicate items are discarded.
  • ToList(): After removing duplicates, we convert the HashSet back to a List using ToList().

The non-LINQ equivalent of this approach would be:

var dedupedList = new List<int>(new HashSet<int>(list));
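LINQ's Distinct() is another one-liner for the same job. Note that neither HashSet enumeration nor Distinct() formally guarantees ordering in the documentation, though in practice both currently yield items in first-occurrence order:

var dedupedList = list.Distinct().ToList(); // { 1, 2, 3 }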

2. Remove Duplicates in a Loop

While the previous method is concise, iterating manually gives you more flexibility and control over the process. It has the same O(n) time complexity and is typically the fastest option in practice, since it avoids LINQ's overhead:

var list = new List<int> { 1, 1, 1, 2, 2, 2, 3 };
var dedupedList = new List<int>(list.Count); // Minimize resizing during additions
var hashSet = new HashSet<int>(list.Count);

foreach (var item in list)
{
    if (hashSet.Add(item))
        dedupedList.Add(item);
}

In this example:

  • We iterate through the original list.
  • For each item, we attempt to add it to a HashSet using hashSet.Add(item).
  • HashSet<T>.Add() returns true if the item was added successfully (i.e., it wasn't already present). When it returns true, we also add the item to the new list (dedupedList).

This method has the advantage of directly controlling how items are added and ensures that duplicates are removed with an optimal time complexity of O(n).
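Because this pattern comes up often, it is easy to wrap in a reusable extension method. Here is a minimal sketch (the Dedupe name and ListExtensions class are just illustrative):

using System.Collections.Generic;

public static class ListExtensions
{
    // Returns a new list containing only the first occurrence of each item.
    public static List<T> Dedupe<T>(this List<T> list)
    {
        var result = new List<T>(list.Count);
        var seen = new HashSet<T>(list.Count);

        foreach (var item in list)
        {
            if (seen.Add(item)) // Add() returns false for duplicates
                result.Add(item);
        }

        return result;
    }
}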

3. Remove Duplicates In-Place with List.RemoveAll()

If you want to modify the original list directly and avoid creating a new one, you can use the List.RemoveAll() method along with a HashSet.

var list = new List<int> { 1, 1, 1, 2, 2, 2, 3 };
var hashSet = new HashSet<int>();

list.RemoveAll(i => !hashSet.Add(i));

In this example:

  • List.RemoveAll() takes a predicate (a lambda function) and removes elements based on the given condition.
  • The lambda returns true (remove the element) when hashSet.Add(i) returns false, which means the element has already been encountered and is therefore a duplicate.

This method is still O(n) and avoids allocating a second list, since it modifies the original in place.
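The same trick works with a custom comparer. For example, this sketch passes StringComparer.OrdinalIgnoreCase so that strings differing only in case count as duplicates:

var names = new List<string> { "Bob", "BOB", "Linda" };
var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

names.RemoveAll(n => !seen.Add(n)); // names is now { "Bob", "Linda" }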

4. Remove Duplicate Objects Based on a Property

In real-world scenarios, you may need to remove duplicate objects based on a specific property, such as an object’s name or ID. In .NET 6+, you can use DistinctBy() from LINQ for this purpose:

using System.Linq;

var list = new List<Person>
{
    new Person() { Name = "Bob", Age = 25 },
    new Person() { Name = "Bob", Age = 40 },
    new Person() { Name = "Linda", Age = 39 }
};

var dedupedList = list.DistinctBy(p => p.Name).ToList();

class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
}

In this example:

  • DistinctBy(p => p.Name) removes duplicates based on the Name property. It keeps the first occurrence of each name and removes subsequent duplicates.
  • ToList() converts the result into a new list.
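DistinctBy() can also key on multiple properties at once. For example, keying on a tuple treats two people as duplicates only when both Name and Age match:

var dedupedList = list.DistinctBy(p => (p.Name, p.Age)).ToList();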

If you're using a version of .NET earlier than 6, or simply want a faster solution, you can manually iterate over the list while tracking unique property values with a HashSet:

var list = new List<Person>
{
    new Person() { Name = "Bob", Age = 25 },
    new Person() { Name = "Bob", Age = 40 },
    new Person() { Name = "Linda", Age = 39 }
};

var dedupedList = new List<Person>(list.Count);
var hashSet = new HashSet<string>();

foreach (var item in list)
{
    if (hashSet.Add(item.Name)) // Check for unique names
        dedupedList.Add(item);
}

In benchmarks this manual approach can run roughly twice as fast as LINQ's DistinctBy(), because it avoids LINQ's overhead and uses a HashSet directly to track unique property values.
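If you want to stay in LINQ on runtimes older than .NET 6, Distinct() also accepts a custom IEqualityComparer<T>. A minimal sketch (the PersonNameComparer class is illustrative):

using System.Collections.Generic;
using System.Linq;

var dedupedList = list.Distinct(new PersonNameComparer()).ToList();

class PersonNameComparer : IEqualityComparer<Person>
{
    // Two Person instances are considered equal if their Names match.
    public bool Equals(Person x, Person y) => x.Name == y.Name;
    public int GetHashCode(Person p) => p.Name.GetHashCode();
}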

Inefficient Approaches to Avoid

While the approaches mentioned above are all O(n) and efficient, there are several inefficient methods that should be avoided:

Sorting the list: Sorting brings duplicate values next to each other, so you can then iterate once and skip any item that equals the last one kept.

  • Time Complexity: Sorting has an average time complexity of O(n log n), which is worse than O(n), and it also destroys the original element order.

list.Sort();
var dedupedList = new List<int>();
foreach (var item in list)
{
    // After sorting, duplicates are adjacent, so compare with the last kept item
    if (dedupedList.Count == 0 || dedupedList[^1] != item)
        dedupedList.Add(item);
}

Sorting takes longer than a single HashSet pass and adds unnecessary complexity.

Using List.Contains(): For each item in the source list, checking whether it already exists in the new list via List.Contains() turns the whole operation into O(n²).

var dedupedList = new List<int>();
foreach (var item in list)
{
    if (!dedupedList.Contains(item))
        dedupedList.Add(item);
}

Contains() performs a linear search each time, making it much slower for larger lists.

Let's see how these inefficient methods compare in practice for various list sizes:

List Size    Foreach (O(n))    Sort-Dedupe (O(n log n))    Contains-Dedupe (O(n²))
1,000        0.03 ms           0.41 ms                     0.11 ms
10,000       0.13 ms           3.38 ms                     5.72 ms
100,000      1.70 ms           37.03 ms                    498.91 ms

As you can see, the performance gap widens significantly as the list size increases. The foreach loop (O(n)) consistently performs better than both the sort-dedupe (O(n log n)) and contains-dedupe (O(n²)) methods.
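To reproduce this kind of comparison on your own data, a simple Stopwatch harness is enough. The sketch below times the O(n) foreach approach (exact numbers will vary by machine and runtime):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Build a 100,000-element list containing many duplicates
var list = Enumerable.Range(0, 100_000).Select(i => i % 1_000).ToList();

var sw = Stopwatch.StartNew();

var dedupedList = new List<int>(list.Count);
var hashSet = new HashSet<int>(list.Count);
foreach (var item in list)
{
    if (hashSet.Add(item))
        dedupedList.Add(item);
}

sw.Stop();
Console.WriteLine($"Foreach dedupe: {sw.Elapsed.TotalMilliseconds:F2} ms");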

Removing duplicates in C# is straightforward and efficient when done using a HashSet. Whether you choose LINQ’s ToHashSet() method for brevity or manually iterate with a HashSet for performance, you ensure that your solution runs in O(n) time. Avoid inefficient methods like sorting the list or using List.Contains(), as they can cause unnecessary delays, especially with larger lists.