3 minute read

Till now we discovered the projection operation and the filter operation. It’s time to move forward, and put our newly created hashset into motion. This is where set operations come into play.

Our more advanced filter

Like countries or cities, we will focus on uniqueness. Each country and city is unique, if it’s not by a name, it is by a structure, culture etc. Instances of classes, can be similar, or even contain the same values. We need to ensure that our results will be unique. To guarantee this requirement, we must filter each object, but unlike the last time.

If we look inside of the DistinctIterator we can discover that it is nearly identical to the IEnumerableWhereIterator. Using this knowledge, we can jump right away into coding.

public ref struct InternalDistinctBy<TEnumerator, TIn, TKey> : IStructEnumerator<TIn>
    where TEnumerator : struct, IStructEnumerator<TIn>, allows ref struct
{
    private TEnumerator _source;
    private readonly Func<TIn, TKey> _selector;
    private MinimalHashSet<TKey> _set;

    public InternalDistinctBy(TEnumerator source, Func<TIn, TKey> selector, EqualityComparer<TKey>? comparer = null)
    {
        _source = source;
        _selector = selector;
        _set = new MinimalHashSet<TKey>(comparer ?? EqualityComparer<TKey>.Default);
    }

    public void Dispose()
    {
        _set.Dispose();
        _source.Dispose();
    }

    public bool Next(ref TIn current)
    {
        TIn? fromSource = default(TIn);
        while (_source.Next(ref fromSource))
        {
            if (_set.Add(_selector(fromSource)))
            {
                current = fromSource;
                return true;
            }
        }

        return false;
    }

    public bool GetCountToLeftEnumerate(out int count)
    {
        Unsafe.SkipInit(out count);
        
        return false;
    }

    public bool GetUnderlying(ref ReadOnlySpan<TIn> span) => false;

    public bool TryCopy(ref Span<TIn> destination, int offset) => false;
}

In this case, we will partially combine approach used in Select and Where. This enumerator takes selector as Select, but it utilizes result as a key to filter it. Moreover, our approach will require us to know which selected key already exists.
This is the place for our MinimalHashSet. MinimalHashSet stores keys selected by our selector, thus we can quickly check if the next element in _source repeats and skip it.
Putting it shortly into example, if we want to get unique city names around the world, this approach will move to next element if another Paris appears in the array.

Extension

Now we will want to use our InternalDistinctBy. So we will proceed with extensions:

public static class DistinctByStruct
{
    public static StructEnumerable<InternalDistinctBy<ArrayStructEnumerator<T>, T, TKey>, T> StructDistinctBy<T, TKey>(
        this T[] array,
        Func<T, TKey> keySelector,
        EqualityComparer<TKey>? comparer = null)
    {
        return new(new InternalDistinctBy<ArrayStructEnumerator<T>, T, TKey>(new ArrayStructEnumerator<T>(array), keySelector, comparer));
    }

    public static StructEnumerable<InternalDistinctBy<TEnumerator, T, TKey>, T> StructDistinctBy<TEnumerator, T, TKey>(
        this StructEnumerable<TEnumerator, T> source, Func<T, TKey> keySelector, EqualityComparer<TKey>? comparer = null)
    where TEnumerator : struct, IStructEnumerator<T>, allows ref struct
    {
        return new(new InternalDistinctBy<TEnumerator, T, TKey>(source.Enumerator, keySelector, comparer));
    }
}

And ToArray:

    public static TIn[] ToArray<TEnumerator, TIn, TKey>(
        this StructEnumerable<InternalDistinctBy<TEnumerator, TIn, TKey>, TIn> enumerable)
        where TEnumerator : struct, IStructEnumerator<TIn>, allows ref struct
    {
        return enumerable.Enumerator.FromArrayBuilder<InternalDistinctBy<TEnumerator, TIn, TKey>, TIn>();
    }

Of course there is a place for an optional shortcut between Select and DistinctBy, but unless you’re using both very often, there is no need for such method for now.

Benchmark

We will use the same approach as before:

[MemoryDiagnoser]
public class DistinctBenchmark
{
    public IEnumerable<Example[]> Data()
    {
        yield return Generator.Generate(500_000);
    }

    
    [Benchmark]
    [ArgumentsSource(nameof(Data))]
    public void LinqDistinct(Example[] input)
    {
        Enumerable.DistinctBy(input, x => x.ExampleEnum).ToArray().Consume(new Consumer());
    }

    [Benchmark]
    [ArgumentsSource(nameof(Data))]
    public void ZLinqDistinct(Example[] input)
    {
        input.AsValueEnumerable().DistinctBy(x => x.ExampleEnum).ToArray().Consume(new Consumer());
    }
    
    [Benchmark]
    [ArgumentsSource(nameof(Data))]
    public void StructDistinct(Example[] input)
    {
        input.StructDistinctBy(x => x.ExampleEnum).ToArray().Consume(new Consumer());
    }
}

Results

| Method         | input           | Mean     | Error     | StdDev    | Allocated |
|--------------- |-----------------|----------|-----------|-----------|-----------|
| LinqDistinct   | Example[500000] | 3.326 ms | 0.0146 ms | 0.0122 ms |     520 B |
| ZLinqDistinct  | Example[500000] | 2.986 ms | 0.0566 ms | 0.0652 ms |     216 B |
| StructDistinct | Example[500000] | 2.928 ms | 0.0273 ms | 0.0242 ms |     464 B |

Source code can be found here Used image