The Blurring Line
August 2010
This work is licensed under the Creative Commons Attribution-NoDerivs 3.0 Unported License. To view a copy of this
license, visit http://creativecommons.org/licenses/by-nd/3.0/ or send a letter to Creative Commons, 171 Second
Street, Suite 300, San Francisco, California, 94105, USA.
For the last 10 years or so, enterprise messaging software has enjoyed a peculiar position in 3-tier deployments. Content with just delivering messages and never owning or storing them, it has always acted as a midwife – nothing more, nothing less.
Before I start, let me make it clear that this is not a rant on messaging or data storage. I understand the vital role they play. However, messaging's ease of use sometimes leads to weird architectural decisions. Let me elaborate.
The blurring line between Messaging and Storage
Every project starts off with an innocent requirement to pull messages from another system. The value the project adds is more data, which is then fed to yet another system – upstream and downstream systems. The larger the deployment, the more sources and sinks it has.
Now imagine many such large deployments within an enterprise, and you have a maze of pipes connecting systems and shuttling messages between them.
At every stage there is something lost in translation and/or some value added to the message.
This results in an explosion of variations of the same message:
- Each project has to maintain a copy of the version of the message it received. Very likely it has already been transformed and enriched, so the original has to be kept somewhere for traceability
- Each project has to provision and manage the storage and related operational aspects of this deployment
- Each stage works at a different rate, which means throttling messages at the receiving end. Throttle too much and the sender starts to choke, which brings down all the other downstream systems
- To avoid throttling and back-pressure, downstream systems ease the load by dumping messages into a database and then consuming them at a more leisurely rate. Now that the messages are in a database or a filesystem, another set of queues is needed to pull them out and deliver them to the application. At every stage the problem is compounded, or merely postponed
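The throttling dilemma above can be sketched with a bounded hand-off: when the buffer fills, the producer is refused (backpressure) instead of the consumer dumping overflow into a database. This is a minimal illustration using java.util.concurrent; the class name, message names and capacity are made up for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: a bounded hand-off between an upstream producer and a slower
// consumer. When the queue is full, offer() times out and returns false,
// so the producer must slow down (backpressure) instead of the consumer
// spilling messages into a database.
public class BackpressureSketch {
    static final BlockingQueue<String> PIPE = new ArrayBlockingQueue<>(2);

    // Returns true if the message was accepted within the timeout,
    // false if the downstream is saturated and the caller must throttle.
    static boolean send(String msg, long timeoutMs) throws InterruptedException {
        return PIPE.offer(msg, timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(send("m1", 10)); // true - queue has room
        System.out.println(send("m2", 10)); // true - queue now full
        System.out.println(send("m3", 10)); // false - producer is throttled
    }
}
```

The point is that the refusal is explicit and local; the caller decides whether to retry, shed load or slow down, rather than silently pushing the problem into yet another store.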
The problem is much like storing a town's water supply in a network of pipes instead of a large reservoir. Since each project maintains its own version of the reference and master data, duplication drives up storage requirements. The result is a perverse interdependence that does not scale: more storage silos require more messaging to keep them in sync, and more messaging in turn spawns even more storage silos.
Inside a project
All the points I've made so far relate to just bringing messages into or out of the application. Once you are inside the application, the challenges are of a different kind.
Most modern applications run on a cluster of machines. The input is often delivered over queues – for example Orders, Shipment requests, Payment confirmations, etc. These messages are consumed by multiple application instances in parallel. The instant this happens, the whole concept of First In First Out (FIFO) delivery goes out the window, because there are now multiple consumers for a single producer. Order is lost anyway.
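The loss of global ordering is easy to see even in a deterministic sketch: round-robin dispatch of an ordered stream across just two consumers already splits the sequence, and a faster consumer will finish its share out of send order. The class and message names below are illustrative, not from any particular messaging product.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: why FIFO dies with parallel consumers. Each consumer still
// receives its own messages in order, but the global order the producer
// sent is gone the moment the stream is split across consumers.
public class FifoLossDemo {
    // Round-robin dispatch of an ordered stream across two consumers.
    static List<List<String>> dispatch(List<String> stream) {
        List<String> a = new ArrayList<>(), b = new ArrayList<>();
        for (int i = 0; i < stream.size(); i++) {
            (i % 2 == 0 ? a : b).add(stream.get(i));
        }
        return List.of(a, b);
    }

    public static void main(String[] args) {
        List<List<String>> split = dispatch(List.of("m1", "m2", "m3", "m4"));
        // Consumer A sees [m1, m3], B sees [m2, m4]. If B happens to run
        // faster, m2 completes before m1 - downstream systems no longer
        // observe the producer's order.
        System.out.println("A=" + split.get(0) + " B=" + split.get(1));
    }
}
```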
So, why do we still restrict ourselves to traditional queuing systems? This out of order / parallel
delivery also causes performance problems:
- If messages are delivered to servers without a sticky-session concept, related messages from the queue can be delivered to different application server instances at the same time. The application programmer then has to perform a kind of distributed synchronization and co-ordination to prevent all the servers from working on the same data simultaneously
- Beyond the cost of distributed co-ordination, this causes another serious performance issue – cache ping-ponging, if the application has a cache to begin with. The phenomenon is usually seen at the CPU level in multi-threaded, multi-processor/core systems, but the nature of the problem is similar
- If the traffic pattern is particularly skewed, with related messages arriving at around the same time and one behind the other, then without sticky sessions cluster-wide throughput suffers: the related messages are scattered across all the servers, and threads on several machines block while waiting for their turn
- If there are several such queues and the application is expected to perform some kind of assembly or correlation across them, the performance problem is compounded. Distributed deadlocks, timeouts and inexplicable delays are common
- Correlating data across sources that send at different rates causes another problem: the application must hold data that has arrived from the faster source until the related data from the slower sources arrives, before correlation can be performed
- Absolute reliance on ordered delivery of messages is passé. If the first message in a series causes an exception, the rest of the series arriving on the queue will still get processed. If the application is not designed to handle out-of-order messages, all those related messages have to be stored somewhere and replayed in sequence later. Storing them offline requires another queue, which just postpones the problem
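One common mitigation for the sticky-session and ordering problems listed above is key-based routing: hash a correlation key so that all related messages land on the same partition, and therefore the same consumer, preserving per-key FIFO even though global FIFO is gone. A minimal sketch, with illustrative names (the key format and partition count are made up):

```java
// Sketch: restoring per-key ordering with sticky routing. Messages that
// share a correlation key (e.g. an order id) are hashed to one fixed
// partition, so related messages are never processed in parallel on
// different servers and never need cluster-wide locking.
public class StickyRouting {
    // Maps a correlation key to one of n partitions, deterministically.
    static int partitionFor(String correlationKey, int partitions) {
        // Mask the sign bit rather than using Math.abs, which can
        // overflow for Integer.MIN_VALUE.
        return (correlationKey.hashCode() & 0x7fffffff) % partitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("order-42", 8);
        int p2 = partitionFor("order-42", 8);
        // The same key always lands on the same partition, so one
        // consumer per partition preserves per-key FIFO.
        System.out.println(p1 == p2); // prints "true"
    }
}
```

The trade-off is that a hot key creates a hot partition; the technique preserves ordering per key, not balanced load.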
The root problem: data is sent to the application by another, upstream application – server push. Plain messaging as a communication layer is good, but using it as a means to copy data generally leads to data-retention, consistency and maintenance issues later. Consumers/downstream systems also have little control over when, and at what rate, they receive data:
- Queues are usually chosen because the upstream systems do not have the capacity to handle the additional read requests for master data that would come from downstream systems
- This often results in copies of the same message at various locations
- It also results in inconsistent data spread all over the enterprise, without centralized ownership or accountability. Maintaining redundant copies is an expense
A note on architecture
With these problems out of the way, the application architect can focus on improving the performance of the application rather than worrying about external systems. A constantly available master data store that is always consistent removes a huge burden.
Project architects can now focus on simpler, lightweight solutions with fewer moving parts. Fewer moving parts mean fewer problems. There are many capable open source solutions available to the architect, most of them built in the recent past to support large-scale deployments at companies like Google, Amazon, Yahoo, Facebook, LinkedIn and the like.
Now that applications are free to pull data at their own pace, smarter architectures are possible. Applications can lease/pull/loan data relevant to a transaction, on demand, and perform the required processing. This is vastly different from being force-fed data from different sources.
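A pull-based consumer along those lines can be sketched in a few lines: the application asks the shared store for a bounded batch whenever it is ready, instead of having data pushed at it. Everything here (the class, the method names, the in-memory stand-in for the store) is illustrative:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch: pull-based consumption. The consumer decides when to ask for
// data and how much to take; the store never pushes. A Deque stands in
// for the shared, federated data source.
public class PullConsumer {
    private final Deque<String> store = new ArrayDeque<>();

    void publish(String record) { store.addLast(record); }

    // Called by the application at its own pace; returns at most `max`
    // records, so the consumer's batch size bounds its own load.
    List<String> pull(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && !store.isEmpty()) {
            batch.add(store.pollFirst());
        }
        return batch;
    }

    public static void main(String[] args) {
        PullConsumer c = new PullConsumer();
        c.publish("a"); c.publish("b"); c.publish("c");
        System.out.println(c.pull(2)); // prints "[a, b]"
        System.out.println(c.pull(2)); // prints "[c]"
    }
}
```

Because the consumer sets the pace, there is no need for the throttling and database-dumping machinery described earlier.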
Such a federated source would benefit greatly from caches to alleviate the read load. It must be noted that indiscriminate caching would bring back the very problems we are trying to solve. An overly aggressive cache would face consistency and completeness problems, which in turn require brittle distributed membership, consistency and leader-election algorithms, along with notification mechanisms. Most large deployments would fare better relying on simpler, more resilient architectures that use eventual consistency and compensating transactions when data is unavailable or the network gets partitioned.
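One simple shape for such a cache is cache-aside with a time-to-live: reads tolerate slightly stale data until the entry expires, trading strict consistency for resilience – the eventual-consistency stance described above. A sketch with made-up names and TTL (the "master store" is just a function here, standing in for a database or service call):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: cache-aside reads with a crude time-to-live. Stale entries
// are tolerated until they expire (eventual consistency) instead of
// running a distributed invalidation protocol. Time is passed in
// explicitly to keep the example deterministic.
public class CacheAside {
    static final class Entry {
        final String value; final long loadedAt;
        Entry(String v, long t) { value = v; loadedAt = t; }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMs;

    CacheAside(long ttlMs) { this.ttlMs = ttlMs; }

    // Returns the cached value if still fresh, otherwise reloads from
    // the master store and caches the result.
    String get(String key, Function<String, String> masterStore, long nowMs) {
        Entry e = cache.get(key);
        if (e == null || nowMs - e.loadedAt > ttlMs) {
            e = new Entry(masterStore.apply(key), nowMs);
            cache.put(key, e);
        }
        return e.value;
    }

    public static void main(String[] args) {
        CacheAside c = new CacheAside(1000);
        System.out.println(c.get("k", k -> "v1", 0));    // loads: "v1"
        System.out.println(c.get("k", k -> "v2", 500));  // still cached: "v1"
        System.out.println(c.get("k", k -> "v2", 2000)); // expired, reloads: "v2"
    }
}
```

The window of staleness is bounded by the TTL, which is exactly the consistency/complexity dial this section argues for turning down rather than eliminating.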
There are many open source solutions available to help build such a system. To name a few and
in no particular order:
Memcached:
http://code.google.com/p/memcached/wiki/NewProgramming#Wrapping_an_SQL_Query
http://code.google.com/p/memcached/wiki/NewProgrammingTricks#Ghetto_central_locking
http://code.google.com/p/memcached/wiki/NewProgramming#Extended_Functions
Java clients
o http://wiki.github.com/gwhalin/Memcached-Java-Client/performance
o http://stackoverflow.com/questions/731738/java-memcached-client
Voldemort:
http://project-voldemort.com/configuration.php
http://groups.google.com/group/project-voldemort/browse_frm/thread/d8d8a222dd857c57?tvc=1
Cassandra vs HBase:
http://whynosql.com/cassandra-vs-hbase/
More:
http://javaforu.blogspot.com/search/label/big%20data