Caching as an antipattern
As part of a self-imposed challenge to write a blog post about code, on average, every day for about 100 days, I am discussing a link that I previously shared on Facebook, but that I liked so much I feel the need to contemplate it some more: The Caching Antipattern.
Basically, what it says is that caching is done wrong in any of these cases:
- Caching at startup - thus admitting that your dependencies are too slow to begin with
- Caching too early in development - thus hiding the performance of the service you are developing
- Integrated cache - cache is embedded and integral in the service code, thus breaking the single responsibility principle
- Caching everything - resulting in an opaque service architecture and even recaching. Also caching things you think will be used, but might never be
- Recaching - caches of caches and the nightmare of untangling and invalidating them in cascade
- Unflushable cache - no method to invalidate the cache except restarting services
The alternative is not to cache at all, and instead to know your data and your service performance and tune them accordingly. Whenever possible, take advantage of client caches, for example through the HTTP If-Modified-Since request header and the 304 Not Modified status code.
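To make the client cache idea concrete, here is a minimal sketch of how a service might answer a conditional GET. The function name `conditional_get`, its parameters and the plain dict of headers are assumptions made for illustration, not something taken from the original article; only the `If-Modified-Since` and `Last-Modified` headers and the 304 status come from HTTP itself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(request_headers, last_modified, render_body):
    """Return (status, headers, body), honouring If-Modified-Since.

    Hypothetical helper: request_headers is a plain dict, last_modified is a
    timezone-aware datetime and render_body is the expensive call we hope to skip.
    """
    # HTTP dates are expressed in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    response_headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    header = request_headers.get("If-Modified-Since")
    if header:
        try:
            client_copy_time = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            client_copy_time = None  # unparseable header: send the full response
        if (
            client_copy_time is not None
            and client_copy_time.tzinfo is not None
            and last_modified <= client_copy_time
        ):
            # The client's copy is still current: no body, no rendering work.
            return 304, response_headers, b""

    return 200, response_headers, render_body()
```

The point of the sketch is that the 304 branch never calls `render_body`, so the client's own copy acts as the cache and the service keeps no cached state of its own.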
I've had the opportunity to work on a project that did almost everything in the list above. The performance bottleneck was the database, so the developers embedded an in-memory cache in the code for resources that were not likely to change. Eventually other anti-patterns started to emerge, like having a filtered call (let's say a call asking for a user by id) get all users and then select the one that had that id. Since everything was in memory anyway, "getting" the entire list of records just meant getting a reference to an already existing data structure. However, if I ever wanted to get rid of the cache or move it to an external service, I would have had to manually look for all such assumptions in the code and change them. If I wanted to optimize a query in the database, I had no way to measure its performance as actually used in the product. In fact, the cache led to a lack of interest in how the database was used and to a strong coupling between the data provider implementation and the actual service code. Why use memory tables for often queried data? We have it cached anyway.
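The "get everything, then filter" habit described above can be sketched roughly like this. The `CountingUserStore` class and both lookup functions are hypothetical names invented for the example; the store simply counts how many rows each call touches, standing in for the real cost once the in-memory list is gone.

```python
class CountingUserStore:
    """Hypothetical data provider that counts how many rows each call reads."""

    def __init__(self, users):
        self._users = {user["id"]: user for user in users}
        self.rows_read = 0

    def get_all(self):
        # Without the embedded cache, this is a full table read.
        self.rows_read += len(self._users)
        return list(self._users.values())

    def get_by_id(self, user_id):
        # Counted as a single-row, indexed lookup.
        self.rows_read += 1
        return self._users.get(user_id)


def find_user_via_full_list(store, user_id):
    # Anti-pattern: only cheap while get_all() returns a cached reference.
    return next((u for u in store.get_all() if u["id"] == user_id), None)


def find_user_via_filtered_call(store, user_id):
    # Asks the provider for exactly what is needed, cache or no cache.
    return store.get_by_id(user_id)
```

With a thousand users in the store, the first function reads a thousand rows to return one, while the second reads a single row. The difference is invisible as long as the list lives in memory, which is exactly why such assumptions are hard to find later.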
My take on this is that caching should always be separate from the service code. Make the code work first, as slowly as the data provider and external services allow, then measure the performance and add caching where it is actually needed, as a separate layer of indirection. These days there are many ways to do caching: in-memory tables in SQL Server, distributed memory caches offered as a service by most cloud providers (such as Memcached or Redis in AWS), content caches like Akamai, HTML output caches like Varnish, and client caches controlled by request and response headers, as in the suggestion from the original article. Adding your own version is simply wasteful and error prone. Just as the data provider should be used through a thin interface that allows you to replace it at will, the caching layer should also be plug and play, so that your application remains unchanged but can upgrade any of its core features when better alternatives arrive.
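As a rough illustration of what "a separate layer of indirection" could look like, here is a sketch in which the service code only knows a thin interface and the cache is a wrapper that can be added or removed without touching it. All the names (`UserProvider`, `SlowUserProvider`, `CachedUserProvider`, `greeting`) are made up for the example, and the plain dict inside the wrapper is only a stand-in for whatever real cache backend you would plug in.

```python
from typing import Optional, Protocol


class UserProvider(Protocol):
    """The thin interface the service code depends on (hypothetical)."""

    def get_user(self, user_id: int) -> Optional[dict]: ...


class SlowUserProvider:
    """Stand-in for the real data provider, e.g. a database or remote service."""

    def __init__(self, rows):
        self._rows = rows
        self.calls = 0

    def get_user(self, user_id):
        self.calls += 1
        return self._rows.get(user_id)


class CachedUserProvider:
    """Same interface, wrapping any UserProvider with a cache.

    The dict is a placeholder for an external cache; misses that return None
    are cached too, which may or may not be what you want.
    """

    def __init__(self, inner: UserProvider):
        self._inner = inner
        self._cache = {}

    def get_user(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._inner.get_user(user_id)
        return self._cache[user_id]

    def invalidate(self, user_id=None):
        # Flush one entry, or everything, without restarting the service.
        if user_id is None:
            self._cache.clear()
        else:
            self._cache.pop(user_id, None)


def greeting(provider: UserProvider, user_id: int) -> str:
    # Service code: unaware of whether the provider is cached or not.
    user = provider.get_user(user_id)
    return f"Hello, {user['name']}!" if user else "Hello, stranger!"
```

Calling `greeting(SlowUserProvider(rows), 1)` and `greeting(CachedUserProvider(SlowUserProvider(rows)), 1)` returns the same string; only the number of calls reaching the slow provider changes, so performance can be measured with and without the cache.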
There is also something to be said about reaching the limit of your resources. Let's say you cache in memory everything your clients ask for, when they ask for it. At one time or another you might reach the upper limit of your memory. At that point the cache should not fail; instead, it should evict the least used data, or the oldest inserted, or something like that. A cache is not supposed to hold all your data, only the part of it that is most efficient to hold, performance-wise, and it should never ever cause more problems, like out-of-memory crashes. Eek!
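One common way to get that behaviour is least-recently-used eviction. The sketch below is only an illustration of the idea, not a recommendation to write your own cache: the class name `BoundedLruCache` is invented, and it limits the number of entries rather than the bytes actually used, which is a simplifying assumption.

```python
from collections import OrderedDict


class BoundedLruCache:
    """Keeps at most max_entries items, evicting the least recently used."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()  # oldest use first, newest use last

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        # Instead of growing without limit, drop the least recently used entries.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)
```

For caching the results of a single function, Python's standard library already offers `functools.lru_cache` with a `maxsize` argument, which applies the same policy.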
Now, I need to find a suitable name for my caching layer invalidation manager ;)
Other interesting resources about caching:
Cache (computing)
Caching Best Practices
Cache me if you can Powerpoint presentation
Caching guidance
Caching Techniques