In my previous post I described how to create a thread-safe data cache using PFX. PFX, however, is scheduled to be released as part of the .NET Framework 4.0, which means we will have to wait a while before we can use it in real-world applications. That’s why I created an implementation using the generic Dictionary combined with a ReaderWriterLockSlim, both of which are available in .NET 3.5 today.
public class CacheDictionary<TKey, TValue>
{
    ReaderWriterLockSlim _cacheLock = new ReaderWriterLockSlim();
    Dictionary<TKey, LazyInit<TValue>> _cacheItemDictionary = new Dictionary<TKey, LazyInit<TValue>>();

    public TValue Fetch(TKey key, Func<TValue> producer)
    {
        LazyInit<TValue> cacheItem;
        bool found;

        // First look for the item under a read lock, so multiple readers can search concurrently.
        _cacheLock.EnterReadLock();
        try
        {
            found = _cacheItemDictionary.TryGetValue(key, out cacheItem);
        }
        finally
        {
            _cacheLock.ExitReadLock();
        }

        if (!found)
        {
            // Not found: take the write lock and check again, because another thread
            // may have added the item between releasing the read lock and getting here.
            _cacheLock.EnterWriteLock();
            try
            {
                if (!_cacheItemDictionary.TryGetValue(key, out cacheItem))
                {
                    cacheItem = new LazyInit<TValue>(producer);
                    _cacheItemDictionary.Add(key, cacheItem);
                }
            }
            finally
            {
                _cacheLock.ExitWriteLock();
            }
        }

        // The item itself is created lazily, outside the cache-wide lock.
        return cacheItem.Value;
    }
}
This implementation uses a (thread-safe version of the) double-checked locking pattern. The code first checks whether the item exists using only a read lock, and only if the item is not found does it acquire a write lock. To be sure the item was not added in the meantime, it checks again and only adds the new item if it is still missing. Assuming there will be far more cache hits than cache misses, this is (in theory) more efficient than exclusively locking the dictionary right away, because multiple readers can get items from the cache concurrently and we only block other threads when we need to add a new item.
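As a quick illustration of how the cache is meant to be used, a hypothetical caller could look like the snippet below. Customer and LoadCustomerFromDatabase are made-up placeholders for whatever expensive lookup you want to cache, not part of the cache itself.

// Hypothetical usage: the producer delegate is only invoked when the key is not cached yet.
// Customer and LoadCustomerFromDatabase are placeholders for illustration.
CacheDictionary<int, Customer> customerCache = new CacheDictionary<int, Customer>();

Customer customer = customerCache.Fetch(42, () => LoadCustomerFromDatabase(42));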
Just like in my previous implementation, I did not want to keep the cache locked while actually retrieving or creating the item that needs to be cached. In the PFX version I did this by using LazyInit<T>, which will also be part of .NET 4.0 (it has actually just been renamed to Lazy<T>). Because I wanted the cache to work on .NET 3.5, I created my own implementation of LazyInit<T>. It is not as versatile and highly optimized as the .NET 4.0 version, but it does the job and is actually quite simple (see the code below). While the cache-wide write lock is held, all I do is create an instance of LazyInit<T> and store that instance in the dictionary. After the cache-wide lock is released, I call LazyInit<T>.Value, which actually invokes the delegate to create the item if needed. LazyInit<T> uses a lock as well, but each instance has its own lock object. This way each item is initialized exactly once, yet different items can be initialized concurrently.
public class LazyInit<T>
{
    Func<T> _producer;
    object _lock = new object();
    T _data;
    volatile bool _created;

    public LazyInit(Func<T> producer)
    {
        _producer = producer;
    }

    public T Value
    {
        get
        {
            // Double-checked locking: the volatile flag makes the unlocked fast path safe.
            if (!_created)
            {
                lock (_lock)
                {
                    if (!_created)
                    {
                        _data = _producer.Invoke();
                        _created = true;
                        // Release the delegate so anything it captured can be collected.
                        _producer = null;
                    }
                }
            }
            return _data;
        }
    }
}
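To show the behaviour of LazyInit<T> on its own, here is a small hypothetical example; Report and CreateExpensiveReport are just placeholders for illustration.

// Hypothetical usage: the delegate runs at most once, on the first access to Value.
LazyInit<Report> lazyReport = new LazyInit<Report>(() => CreateExpensiveReport());

Report first = lazyReport.Value;   // the delegate is invoked here
Report second = lazyReport.Value;  // the cached value is returned, no second invocation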
Quite happy with this implementation, which does a minimum of locking while still being thread safe, I sat back and relaxed.
Only a short time after I created this implementation, a blog post by Joe Duffy, the lead developer/architect behind PFX, showed up in my feed reader:
Reader/writer locks and their (lack of) applicability to fine-grained synchronization
In this post Joe compares the ReaderWriterLockSlim to (among others) a plain mutual-exclusion lock, better known as the C# lock() statement. On a 4-core machine, when there are mostly readers and only a few writers, the reader/writer lock should theoretically be about 4 times as fast as the mutex: the reader/writer lock allows multiple readers to execute concurrently on all four cores, while the mutex only allows one thread to execute at a time, leaving 3 cores idle while the other holds the lock.
As it turns out, the mutex outperforms the ReaderWriterLockSlim in many scenarios, especially in those cases where the locks are held for a relatively short time (like a single lookup in a dictionary). Instead of being 4 times faster, the ReaderWriterLockSlim is actually about half as fast as the mutex! It seems that the internal bookkeeping the ReaderWriterLockSlim needs to do is far more expensive than the penalty of 3 cores being idle for a short time while 1 thread holds the lock. Even Joe’s experimental, highly optimized spinning reader/writer lock does not do much better in these scenarios.
The ReaderWriterLockSlim starts to outperform the mutex when the lock needs to be held for a longer time (which you should generally avoid anyway). The number of cores will probably influence the results as well; I expect that running the same test on a 256-core box would improve the performance of the reader/writer lock relative to the mutex, but unfortunately I have not been able to test this.
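To get a feel for the kind of comparison involved, a very rough sketch of such a micro-benchmark could look like the code below. This is nowhere near as thorough as Joe’s tests (it uses a single thread and arbitrary numbers); it merely illustrates the per-acquisition overhead of both lock types around a short dictionary lookup.

// Rough single-threaded sketch: measure the overhead of acquiring each lock type
// around a very short dictionary lookup. Iteration count is arbitrary.
// Requires System, System.Collections.Generic, System.Diagnostics and System.Threading.
Dictionary<int, int> dictionary = new Dictionary<int, int>();
for (int i = 0; i < 1000; i++) dictionary.Add(i, i);

ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();
object mutex = new object();
const int iterations = 10000000;
int value;

Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
    rwLock.EnterReadLock();
    try { dictionary.TryGetValue(i % 1000, out value); }
    finally { rwLock.ExitReadLock(); }
}
Console.WriteLine("ReaderWriterLockSlim: " + stopwatch.Elapsed);

stopwatch = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
    lock (mutex) { dictionary.TryGetValue(i % 1000, out value); }
}
Console.WriteLine("lock():               " + stopwatch.Elapsed);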
To test how this influenced my CacheDictionary, I created yet another implementation using a mutex instead of the reader/writer lock. Compared to the ReaderWriterLockSlim version this code is actually quite simple, because it does not need the double-checked locking and the lock() statement does not require an explicit finally block. I did stick to the LazyInit approach to avoid locking the whole cache while creating the items to cache.
public class CacheDictionary<TKey, TValue>
{
    object _cacheLock = new object();
    Dictionary<TKey, LazyInit<TValue>> _cacheItemDictionary = new Dictionary<TKey, LazyInit<TValue>>();

    public TValue Fetch(TKey key, Func<TValue> producer)
    {
        LazyInit<TValue> cacheItem;

        lock (_cacheLock)
        {
            if (!_cacheItemDictionary.TryGetValue(key, out cacheItem))
            {
                cacheItem = new LazyInit<TValue>(producer);
                _cacheItemDictionary.Add(key, cacheItem);
            }
        }

        return cacheItem.Value;
    }
}
I ran some simple performance tests on my dual-core machine to compare this code to the ReaderWriterLockSlim version. My results were similar to Joe’s: the version with the mutex is way faster than the ReaderWriterLockSlim version, even when there are only readers and not a single writer! The actual results depend on a lot of factors, like the number of items in the dictionary and of course the amount of work done in the delegate that produces the cache item, but in general the reader/writer version took about 1.5 times as long as the mutex version in the same scenario. As in many cases: simplicity wins!
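For reference, a test along these lines could look like the sketch below. It is not the exact code I used; the thread count, iteration count, key range and the trivial producer are all arbitrary choices, so your mileage may vary.

// Rough sketch: several threads repeatedly fetch items from a shared cache.
// With a small key range almost every call is a cache hit, so this mostly measures
// locking overhead. Swap in either CacheDictionary implementation to compare them.
// Requires System, System.Diagnostics and System.Threading.
static TimeSpan TimeCache(CacheDictionary<int, string> cache)
{
    const int threadCount = 4;
    const int iterationsPerThread = 1000000;
    const int keyRange = 100;

    Thread[] workers = new Thread[threadCount];
    for (int t = 0; t < threadCount; t++)
    {
        workers[t] = new Thread(delegate()
        {
            Random random = new Random(Thread.CurrentThread.ManagedThreadId);
            for (int i = 0; i < iterationsPerThread; i++)
            {
                int key = random.Next(keyRange);
                cache.Fetch(key, () => key.ToString());
            }
        });
    }

    Stopwatch stopwatch = Stopwatch.StartNew();
    foreach (Thread worker in workers) worker.Start();
    foreach (Thread worker in workers) worker.Join();
    return stopwatch.Elapsed;
}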
For the near future, I will stay away from reader/writer locks unless it is absolutely necessary to hold a lock for a fair amount of time. Maybe I will revisit this statement when commodity hardware reaches 32 or 64 cores.
Joe Duffy concludes his post on this topic with: “Sharing is evil, fundamentally limits scalability, and is best avoided.” While in general I think he is right, the whole point of a cache is to share the result of a previous request with future requests in order to improve performance. Since caching implies sharing, you should always be aware of the concurrency issues involved; that is why I tried to handle most of these issues in this generic utility.