Equality in C#

Part of being an expert in a language is knowing all its subtle ins and outs. Personally, I am a big fan of going back to some very basic concepts and studying them in great detail. For this blog post, I dived into the finer points of object equality in C#. Because it is such a basic subject, developers with a couple of years of experience under their belt should already be familiar with most of what follows. However – as I have learned – there are still some very important subtleties to be aware of if you want to get the most out of your language.

Prerequisites

Before we begin, there are some prerequisites we have to get out of the way.

Reference types vs. value types

First, the reader must have a solid understanding of the differences between reference types and value types. Reference types are created by defining new classes and interfaces. Delegates and array types are also reference types. Examples include Stream and String[]. Value types on the other hand are defined using structs and enums. Guid, DateTime and the primitive CLR types such as Char, Byte, Int32, Int64, Single, Double, Decimal and Boolean are popular examples. The most important difference between the two is how they are passed to methods. Reference types follow call-by-reference semantics, while value types use call-by-value whereby a copy of the whole data structure is passed. This is the main reason why it is very strongly recommended to make value types immutable. Making value types mutable would give rise to some very unintuitive and obscure bugs. Reference types on the other hand can be either mutable or immutable.

A common misconception is that value types must always live on the stack, whereas reference types are allocated on the heap. This is mostly true in the Microsoft implementation of C# on the desktop CLR, but it is not formally specified anywhere. Background: https://blogs.msdn.microsoft.com/ericlippert/2010/09/30/the-truth-about-value-types

The differences between reference types and value types are not something that comes up often in the day to day lives of average C# developers. They have their language’s unified type system to thank for that. It means that all types inherit either directly or indirectly from System.Object. Contrast this with the Java language, where there are no user defined value types. The only available value types in Java are the primitive types, and they do not inherit from Object. This problem is most pronounced in the context of generics, where a compiler error pops up when a primitive type is put in place of a generic type parameter. This can easily be remedied by using one of the several predefined wrapper reference types such as Integer. This processing is called boxing. The impact of this operation must not be understated, especially in performance critical sections such as tight loops. The inverse operation is called unboxing. C# also has the concept of boxing and unboxing, but without the predefined wrapper types. The value is automatically wrapped in a reference type when assigned to a variable of type object. Confusingly, just as Java’s int vs. Integer, C# has int vs. Int32. In the latter case these are aliases, there is no boxing involved. On a related note, the generics system of Java works very differently from that of C#, using a technique called “type erasure” during compilation.

Primes and co-primes

Another prerequisite involves prime and co-prime numbers. Everyone should be familiar with primes – numbers that are only divisible by one and themselves. Two numbers are co-prime if their greatest common divisor (gcd) is one. 14 and 15 are co-prime, whereas 14 and 21 are not because they are both divisible by 7. Of course two primes are always co-prime.

Hashing

The last prerequisite is a basic understanding of hashing. Hash functions are deterministic functions that take data of arbitrary size and convert it to output of fixed size. By definition this function cannot be guaranteed to be injective. In other words, distinct inputs can potentially map to the same output. This is called a collision. If the input size is bigger than the output size, this is unavoidable due to the pigeonhole principle. It can also happen with poorly defined hash functions if the input size is less than or equal to the output size. A desirable property of hash functions is that given some output, it should be very difficult to calculate corresponding inputs. Said differently, hash functions should be one-way functions. Even better hash functions achieve avalanche effect, whereby a small difference in input causes the output to be radically different. Hash functions are used for many purposes including data structures involving cryptography, blockchain technology and big data algorithms. However, for our story we will focus on their application in hash tables.

Why were hash tables invented in the first place? If you have a collection of objects, why not always just store them in a list? One reason may be that lists can contain duplicates. Data structures imitating mathematical sets often automatically filter out duplicates, at least if object equality is properly defined. Another reason may be that lists are ordered, and you do not need that feature. For example, if you design an EmailMessage class, the To field can be modelled as a set of MailAddresses because order does not matter and duplicates are undesirable. However, the most important reason is perhaps performance characteristics. The Contains method (and some related methods) of lists typically has a linear time complexity in terms of list size because it iterates through the list from front to back to find the given item. This makes lists unsuitable for algorithms that have to process many items. Other data structures (e.g. HashSet and Dictionary/HashMap) are able to do this is quasi-constant time. In return the memory footprint grows slightly in comparison to lists. Such data structures typically use a hash table internally. Instead of storing all items one after the other in an array, hash tables sort their items into buckets – exactly one bucket per item. Ideally the items in the hash table are spread evenly across all available buckets. If this is the case, the number of items in any particular bucket remains relatively small. To check whether an item is contained in the hash table, it is a simple matter of calculating the corresponding bucket number and searching through the (small) bucket linearly. The bucket number calculation is something along the lines of item.GetHashCode() % NumberOfBuckets. While many data structures have capacities equal to a power of two, NumberOfBuckets is typically a prime number for reasons that will become clear later.

Introduction to equality

Now that we have the prerequisites out of the way, we can focus on the subject at hand. Before discussing the practical side, let us look at three fundamental rules that all equivalence relations must adhere to.

  • Reflexive: x = x
  • Symmetric: x = y \iff y = x
  • Transitive: x = y \land y = z \implies x = z

However trivial this may sound, it never hurts to double-check whether an equals implementation is compliant with these rules.

Equality can be partitioned in various ways. The most common partitioning is in referential equality on the one hand, and structural equality on the other hand. With referential equality, object references are equal only if they both point to the exact same object in memory. In C#, reference equality can always be checked using the static Object.ReferenceEquals(object, object) method. Structural equality is less strict compared to reference equality in that objects in different locations in memory can be treated as equal under certain conditions. Those conditions are typically defined by the developer by overriding the Object.Equals(object) instance method. For entities, the requirement is often that both objects simply have the same ID, regardless of other properties and fields. Of course full structural equality using all properties and fields is also an option. Note that reference equality implies structural equality.

For value types, reference equality violates the reflexive rule: Object.ReferenceEquals(x, x) always returns false. This is not surprising, it only make sense to talk about structural equality in the context of value types.

Another way to partition equality is through natural vs. custom equality. Natural equality uses a sensible default definition of equality and is specified by the Equals method in the class itself. Custom equality on the other hand is often defined on an ad hoc basis outside of the class to be compared using an equality comparer. For example, a custom equality comparer can define integers as being equal modulo some given number n. For n = 3, 2 and 5 would be considered equal. The resulting equality comparer can be passed to various methods such as LINQ’s Distinct method.

Pre-generic equality

Natural equality

After this largely theoretical introduction, we will discuss equality in C# before and after the introduction of generics (with C# 2.0). The following pre-generic methods all have a role to play in determining natural equality. The implementations of some of these methods can be studied in object.cs in the reference source.

static bool operator ==(object, object)
static bool operator !=(object, object)
static bool ReferenceEquals(object, object)
static bool Equals(object, object)
virtual bool Equals(object)
virtual int GetHashCode()

First up are the two equality operators. Unlike Java, C# explicitly exposes these as special methods and allows developers to overload them. It is important to realize that both methods are independent of each other. If you decide to overload one, you must also overload the other in order to keep their behaviour consistent. Surprisingly, the original definitions to compare objects cannot be found in object.cs. Depending on whether reference or value types are compared, the behaviour will be different. For reference types, the behaviour is – as expected – consistent with reference equality. For value types, there are no default implementations of the two operators so they cannot be used out of the box. We can work around this by first boxing both values in an object type, but the result will always be false. This is exactly what happens when calling ReferenceEquals with two value types, since that method is a simple wrapper for the ==(object, object) operator. Unlike for ReferenceEquals, developers occasionally define overloads for the equality operators. Some programming guides suggest keeping the equality operators’ behaviour consistent with that of custom Equals implementations. In my experience this is often overkill, especially for reference types. Reserving the equality operators for reference equality and using Equals for structural equality of reference types (exactly as in Java) is what I recommend.

The static Equals method is a simple helper method. It first checks whether both parameters refer to the same object using the default implementation of the == operator and returns true if this is the case. This check is not strictly necessary, but boosts performance nonetheless. Next, it checks whether any parameter equals null and returns false if so. Finally, it calls the virtual Equals method to do the heavy lifting. This method is useful because – unlike for the Equals instance method – there is no risk of a NullReferenceException being thrown because the instance on which it is invoked is null.

The Equals instance method is traditionally the one developers are most familiar with. Most of the heavy lifting happens in overrides of this method. The default implementations still contain some surpises though. For reference types, it redirects – for reasons unknown to me – to the external RuntimeHelpers.Equals method instead of to the equality operator. The behaviour is again consistent with reference equality. For value types, the implementation (as seen in valuetype.cs) contains more surprises. Because reference equality would be pointless in this context, C# attempts to offer structural equality out of the box. If a value type exclusively contains fields and properties that are themselves also value types, a fast, bitwise comparison of the structure can be performed to determine equality. Else, reflection is used to retrieve all fields at runtime and compare them one by one. Reflection is very slow so a custom Equals implementation is strongly recommended in this case.

Finally, we arrive at the odd one out in the list: GetHashCode. The first question to ask is why GetHashCode is even defined at the object level. Its only (proper) use is in data structures based on hash tables. Would it not make more sense to define an IHashable interface for such occasions? Eric Lippert seems to agree with me one this one. (A similar reasoning could be made for ToString – Haskell actually does this.) Regardless, GetHashCode in C# is here to stay on the object level. What follows are various rules and guidelines to take into account when overriding this method. These will explain why GetHashCode always appears alongside Equals, even though it must by itself never be used to determine equality.

Rules and guidelines for GetHashCode

  • Rule one: if objects are equal according to the Equals method, they must also have an equal hash code. Another way to say this is that if two object have different hash codes, they cannot possibly be equal. Note that two objects with identical hash code are not guaranteed to be equal, this could also be caused by a collision. The practical implication of this rule is that every override of Equals must be accompanied by an override of GetHashCode. If this rule is broken, hash tables could not work at all. After all, objects inserted in hash tables are internally stored in one of many buckets. This bucket is chosen based on the output of GetHashCode of the key. If two equal objects could have different hash codes, they could end up in different buckets.

  • Rule two: GetHashCode output must never change while the object is contained in a data structure that depends on the hash remaining stable. This rule is sometimes simplified to “objects used as keys in hash tables must be immutable”, but this is overly restrictive. In the case of entities for example, both Equals and GetHashCode would typically only depend on the ID. Other properties can freely change without breaking this rule. What happens when this rule is violated? Consider what happens when a mutable object is inserted in a hash table and then its internal state is changed. If this change affects the hash code, hashtable.Contains would likely return false for this updated object because it searched the wrong bucket. Some very nasty bugs can result from this. Making these objects immutable is certainly the easiest preventative measure. Some sources go as far as to suggest only implementing GetHashCode on immutable types. Otherwise, evaluate carefully whether this rule is respected.

  • Rule three: GetHashCode must never throw exceptions. In other words, it must be a total function. The reason is obvious.

  • Guideline one: the implementation must be extremely fast.

  • Guideline two: the distribution of hash codes must be “random”. There should be no obvious collision patterns. This ensures that the objects are spread across many small buckets inside the hash table, maximizing performance.

Microsoft obviously followed these rules and guidelines in their default implementations of GetHashCode, although we are unable to inspect the inner workings in the reference source. We can confidently say that it behaves consistently with reference equality for reference types and with structural equality for value types. In the latter case, performance is again very poor due to the use of reflection. Once more, it is strongly recommended to override the default implementation for structs.

GetHashCode implementation strategies

GetHashCode implementations are not necessarily complex. For integers, it simply returns itself. For booleans, it returns zero and one for false and true respectively. Keep in mind that because the output is an integer, negative values are allowed. Next, some implementation strategies for objects with multiple fields will be discussed.

  • Option one: simply return 1. If you look back at the rules and guidelines, you will notice that this option does not violate a single rule and only the last guideline. I chose one instead of zero because zero is often seen as the hash code for null. The consequence of this choice is that hash tables degenerate into linked lists because only one bucket is used. Several operations now take linear instead of constant time. Regardless, this could be an acceptable option during debugging to verify that the bug is not caused by an invalid GetHashCode implementation.
  • Option two: XOR (\oplus) all hash codes of the individual fields. At first sight, exclusive OR seems like a good operation to keep the ones and zeroes balanced. After all, the chances of one and zero are both exactly 50% according to the truth table. If we used AND instead, the result would likely contain lots of zeroes if there are lots of fields to combine. OR on the other hand would result in lots of ones. Unfortunately, XOR also has some undesirable properties. First, XOR is commutative: x \oplus y = y \oplus x. In other words, the order of the fields does not change the hash code, which leads to some obvious collisions. A Point2D class that uses this method will have the same hashcode for (1, 2) as for (2, 1). Furthermore, x \oplus x = 0 and 0 \oplus x = x = x \oplus 0, so if two fields contain equal values, their contributions cancel each other out. The fields do not even have to be adjacent because XOR is also associative: x \oplus (y \oplus z) = (x \oplus y) \oplus z. Consequently, it is easy to see that x \oplus y \oplus x = y. These limitations can be somewhat alleviated by also performing some bitwise left (x \ll c) and right shifts (x \gg c) in between, but that makes it rather complex. On an unrelated note, these properties allow you to swap the values of two variables without using a third temporary variable: https://en.wikipedia.org/wiki/XOR_swap_algorithm.

  • Option three: the anonymous object trick. Similar to value types, anonymous types (which are reference types by the way) contain predefined structural Equals and GetHashCode implementations. Simply wrap all relevant fields in a new anonymous object new { x, y, z }, and you can immediately profit from this feature. Of course this solution is not very performant due to the additional object allocation overhead, but it makes up for this in simplicity.

  • Option four: combine the hash codes of the fields together is a complex way involving multiplication, exclusive OR and some carefully chosen constants. Traditionally, tools (such as ReSharper) are used to generate such code. The whole computation is typically wrapped in an unchecked section because overflowing integers do not pose a problem in this context. The constants involved are the result of quite a bit of research, but also a lot of trial and error. As always, the goal is to make sure the result is random enough so that objects can be optimally spread across buckets. There is no solid mathematical theory supporting these constants, but it is known that constants which are co-prime with the number of buckets in the hash table generally produce good results. Since the number of buckets is already a prime number, using prime numbers as constants ensures both numbers are co-primes. This option is my favourite because it does not require any work on my part and it is still reasonably fast and random. Example:

Custom equality

Up to this point we have focussed our attention on natural equality. For custom equality, the story is very similar. an interface called IEqualityComparer defines two methods signatures: bool Equals(object, object) and int GetHashCode(object). Instead of acting directly on the instance of a class, these methods each take an extra parameter to pass on object to.

Generic equality

This concludes our discussion of equality as it is set up in C# 1.0. C# 2.0 introduced generics, which added some more methods to determine equality. More specifically, IEquatable of T and IEqualityComparer of T were added for natural and custom equality respectively. There are a few advantages over previous techniques: casting is no longer necessary, the method calls are type safe and boxing overhead for value types disappears. Unfortunately, these new techniques do not entirely replace the old ones. They are merely added as a convenience and must be kept consistent.

IEquatable of T – predictably – defines only one method: Equals(T). These days, this method does the heavy lifting, while Equals(object) simply casts and redirects to it. Optionally some simple checks – as in the static Object.Equals method – are performed before casting. By implementing this interface, the developer signals to everyone using his class that a proper equality definition is provided. In contrast, simply overriding the non-generic Equals is transparent for outsiders.

IEqualityComparer of T is very similar to IEqualityComparer; it defines the same two methods but with more specific signatures. What is more interesting is the new EqualityComparer of T abstract base class which does not have a pre-generic equivalent. It contains a Default property, which contains an equality comparer that automatically uses the natural equality methods under the hood. It checks whether the given type T implements IEquatable of T, and if so redirects its Equals(T,T) method to the Equals(T) implementation, and to Equals(object) otherwise. Constructors and methods defined in the System.Collections.Generic framework often require an IEqualityComparer of T as an optional parameter. If it is omitted, the default implementation is used, else the provided custom implementation is used instead.

Addendum: comparison

Before we wrap up, we will briefly touch upon the subject of object comparisons. Whereas object equality only determines whether objects are equal or not, object comparison defines an order on a set of objects by determining whether a given object precedes or follows another object. Mathematics again offers us an interesting conceptual framework in the form of total and partial orders. Unfortunately, the implementation in C# does not nicely map to these concepts, so we will keep the mathematics on the background for now.

For natural comparisons, the interfaces IComparable and IComparable of T come into play. Both define a unary CompareTo method. This method returns an integer signalling the order of the given object relative to the current instance. If the result is less than zero, this instance precedes the given object. If the result is greater than zero, this instance follows the given object. Else, both objects share the same position (i.e., are equal with respect to this ordering). In particular for natural ordering, the results must be consistent with those of natural equality. In fact, natural equality can be expressed completely in terms of natural ordering. In conclusion, if you implement IComparable of T, you must also implement IEquatable of T and override the Equals and GetHashCode methods.

Complementary, the four comparison operators \lneq, \leqslant, \gneq,  \geqslant can be overloaded to conform to the imposed natural ordering. Once more, it is best to use this feature sparingly.

For custom comparisons, the interfaces IComparer and IComparer of T are used. Both define a binary Compare method with the same return mechanics as discussed previously. As was the case with EqualityComparer of T, here we have a Comparer of T base class that also defined a Default property redirecting the custom comparison logic to the natural comparison logic. What is new is that this class also define a static Create method convenience method where you pass in a lambda that has the same signature as a Compare method, and you get a Comparer of T as a result.

Conclusion

For such a basic subject, this blog post turned out to be quite long. You could ask yourself the question whether it is reasonable to have to write 40 lines of boilerplate code for something as simple as proper structural equality for a class with three properties. As a shameless plug, I will leave you with this link that shows how thing can be done better in a superior language.

Update (2017-04-08)

There are two interfaces I forgot to talk about earlier in this blog post: IStructuralEquatable and IStructuralComparable. Their non-generic nature might lead you to think these are quite old, but actually they have been introduced no so long ago with .NET 4.0 in 2010. Like the name suggests, these interfaces provide structural equality and structural comparison definitions. However, was providing structural equality not a solved problem by overriding object.Equals and implementing the regular IEquatable of T? As it turns out, these interfaces apply to a very specific context only: collections. The interfaces are defined in the System.Collections namespace and are to this day only implemented by Array and the Tuple family. The methods on the interfaces are fairly predictable: Equals(object, IEqualityComparer), GetHashCode(IEqualityComparer) and CompareTo(object, IComparer).

Because these interfaces are typically implemented explicitly instead of implicitly, you must first cast an array or tuple to the IStructuralEquatable or IStructuralComparable interface before you can access the implemented methods.

The default implementations consist of two parts: logic to loop through the items in both collections and logic to test the elements pairwise. The actual testing logic is delegated to the provided comparer parameter. The default value for this parameter is either StructuralComparisons.StructuralEqualityComparer or StructuralComparisons.StructuralComparer. These implementations first check whether the given objects implement IStructuralEquatable or IStructuralComparable. If so, they cast the objects to that interface and call the appropriate method on these casted objects with themselves as the comparer.

Because of their non-generic nature, and the fact that accessing the important methods requires a cast, the resulting code is not very pleasing to the eye. Remember that each IEnumerable offers a SequenceEquals method that largely does the same in a much more readable fashion.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s