Pitfalls of lazy code evaluation

Software developers coming from a functional background are usually intimately familiar with the concept of lazy evaluation. C# also enables this feature by means of the IEnumerable interface. The lazy version of List enables us to process items as they come in, without the need to know ahead of time which or even how many will follow. The whole LINQ framework introduced in .NET 3.5 revolves around this concept. This is a small hassle for the one who must implement the lazy method, but on the other hand a big constraint is lifted for the one invoking the method.

This lazy behaviour sometimes behaves unintuitively. So much in fact that many developers would seemingly rather not use the feature at all, and solely rely on simply lists. By writing this blog post, I hope to give the reader a better intuitive sense of what lazy evaluation entails, so that avoiding this technique out of fear becomes a thing of the past.

Skip is an example of a lazy method that works on enumerables. The MSDN documentation states the following:

Bypasses a specified number of elements in a sequence and then returns the remaining elements.

Sounds straightforward enough, so let us take it for a test drive. Consider an example where we receive an IEnumerable as input, and we would like to return it in chunks of a given size. The return type of this method would be an enumerable of an enumerable. A first implementation might look something like this:

This implementation is fairly straightforward. As long as there are any items left in source, take (at most) chunkSize items and immediately return this chunk using yield return. Also update the source using Skip so that the current chunk is not returned twice.

For those not familiar with yield return, let us compare it to a regular return. A regular return indicates that the calculation has completed, that the result is known, and that control can be returned to the caller immediately. yield return on the other hand shares its motto with many of Billy Mays’ commercials: “But wait, there is more!”. It signals that part of the result is already calculated and ready for use by the consumer, but that more may be on the way. Control is returned to the caller until the next part of the result is requested. The yielding method then resumes where it left of and repeats the same process until there is nothing more to calculate. Methods containing yield typically have IEnumerable of T as return type. The C# compiler does quite a bit of work in the background (involving state machines) to make all of this possible yet easy to use.

To verify the correctness of the Chunk method, write a small unit test. Create a simple enumerable generator but avoid using non-lazy constructs such as Lists. Your can either use the following example, or use a built-in method such as Enumerable.Range.

Using this method, create chunks of the result with CountTo(20).Chunk(5), loop through the chunks to concatenate them again and verify that the result is a series of numbers from 1 to 20. If you did this correctly, the test will pass without a problem. Another job well done, or is it..?

People that have ReSharper installed will see a message warning them of “multiple enumerations” of parameter source in the Chunk method. You likely think about this for a while and then realise that it makes sense. After all, the Take and Skip methods both loop through the same section of the enumerable. One returns the items, the other drops them. Note that the Any method also causes an overlap, but only for the first item in the enumerable. Clearly this is suboptimal, but with our current understanding the algorithm still runs in linear time (O(2N)).

Your instincts may urge you to fix this problem by converting the enumerable to a list. Resist this temptation. It will certainly fix the problem, but it will cost you the lazy nature of your chunk method. In that case all items need to be known ahead of time. That does not make sense at all from a functional point of view.

So you make peace with the fact that your solution is suboptimal. At least the code is still very succinct. Unfortunately, at some point you start noticing a serious performance impact from using Chunk. Fortunately, you already have your test in place. This makes it easy to investigate the issue. Start by logging the current number right above the yield statement in CountTo. Pause for a moment and write down the output you would expect to see. Personally, I expected something along these lines for CountTo(20).Chunk(5) (newlines added for clarity):

1 | 
1 | 2 | 3 | 4 | 5 | 
1 | 2 | 3 | 4 | 5 | 
6 | 
6 | 7 | 8 | 9 | 10 | 
6 | 7 | 8 | 9 | 10 | 
11 | 
11 | 12 | 13 | 14 | 15 | 
11 | 12 | 13 | 14 | 15 | 
16 | 
16 | 17 | 18 | 19 | 20 |
16 | 17 | 18 | 19 | 20 |

The lines come in triplets and correspond to Any, Take and Skip respectively.

Now actually run the test and observe the result. This is the result:

1 | 
1 | 2 | 3 | 4 | 5 | 
1 | 2 | 3 | 4 | 5 | 6 | 
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |

Not at all what we expected, but at least it explains the terrible performance. The time complexity of this algorithm is O(N^2)! The lines come in groups of two this time, corresponding to Any and Take. What happened to Skip, and why do all sequences start from the very beginning? The output we see corresponds to the following sequence of method calls:

source = CountTo(20)
source.Any() //true
source.Take(5)
source.Skip(5).Any() //true
source.Skip(5).Take(5)
source.Skip(5).Skip(5).Any() //true
source.Skip(5).Skip(5).Take(5)
source.Skip(5).Skip(5).Skip(5).Any() //true
source.Skip(5).Skip(5).Skip(5).Take(5)
source.Skip(5).Skip(5).Skip(5).Skip(5).Any() //false

The documentation stated that Skip would bypass a number of elements, but it was not immediately clear how we should have interpreted this “bypass”. My intuition (and I would expect that of most other developers) was that Skip retrieved a certain number of items on top once, discarded them and then saved a snapshot of the current enumerable state for further use. This is reminiscent of a Queue or a Stream where you dequeue items for processing and forget about them immediately after.

Contrary to our initial understanding, Skip does not actually drop items the moment it is called. In fact, it does not drop items at all. Enumerables are actually immutable, so dropping items from them is impossible. The best Skip can do, is to simply ignore a given number of items when its first item is requested and then yield the next item. The line source = source.Skip(chunksize); chains the Skip onto the base enumerable and exposes this as a new enumerable. Of course this new enumerable is just a facade, and it actually delegates the work to the inner components every time it is iterated. This is the essence of lazy evaluation: do not apply the operation to the input at call time but remember both the input and the operation and combine them later as needed. Of course the input itself can be such a facade as well. This way, many diverse operations can be chained on top of a base enumerable while the whole can still be presented as a simple enumerable.

We can peek behind the curtain and see how Microsoft actually implemented Skip in its reference source:

The key learning point here is the difference between an enumerable and an enumerator (a.k.a. iterator). Enumerables capture the logic for generating a sequence of items, but never the state of enumeration. That is what an enumerator is for. This also implies enumerables always start from the very beginning when iterated over and are indivisible. You can use tricks like Take and Skip to give the appearance of chunking the sequence, but in practise only a specific enumerator instance can be properly chunked. Fortunately, it is easy enough to go from an IEnumerable to an IEnumerator. In fact, GetEnumerator is the only non-extension method defined by the IEnumerable interface.

I tried writing my own code for chunking the enumerator retrieved from an enumerable. It works and it is much more efficient (no multiple enumerations), but it is also complicated and much less succinct than the original code. Your best bet is to use a properly maintained and tested framework such as MoreLINQ (or RX), which – as it turns out – provides a Batch (resp. Buffer) method that does exactly what we want.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s