About Zombies and Convoys...
This week was load-testing week for our latest set of Interfaces. Incidentally, this was our first set of Interfaces to implement Uniform Sequential Convoys in their Orchestrations. The implementation was more or less the one from my previous post: the convoys had a timeout mechanism that led the Orchestration to completion if no new messages arrived for a period of time.
So far, so good. But under a realistic load, we observed that roughly 1 out of every 5 or 6 messages failed to reach the destination. The HAT (Health and Activity Tracking) showed the following:
Completed with discarded messages
Of course, the 1-in-5 figure is not a general one; it depends on the actual convoy implementation in question. After some snooping around (mostly at Lee Graber’s blog), I figured out the source of the problem. But no amount of snooping yielded an easy or elegant solution to this little pickle.
What happens is this: when the timeout expires, control passes to the delay branch of the Listen shape, where we implement some logic to escape the loop. If a message arrives at the MessageBox during this interval, BizTalk still sees our Orchestration as “Running”, and since the Orchestration matches the message’s subscription, BizTalk routes the message to this instance. The Orchestration, however, has already left the Listen shape. It is blissfully unaware that it has one or more messages to take care of, and runs to completion. These messages are left in the MessageBox with no one to process them, or in other words, in a Zombie state.
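To make the race concrete, here is a toy model in Python (not BizTalk code). It assumes the delay branch needs a fixed `exit_time` to finish; that parameter, the function name, and the timings are made up for illustration. A message that arrives after the timeout fires but before the instance completes is counted as a Zombie.

```python
def run_convoy(arrivals, delay, exit_time):
    """Toy model of a timed-out sequential convoy.

    arrivals  -- message arrival times (any order)
    delay     -- idle time after which the delay branch fires
    exit_time -- hypothetical time the exit logic needs before completion
    Returns (processed, zombies) as lists of arrival times.
    """
    processed, zombies = [], []
    timeout_at = None     # when the running instance enters the delay branch
    completes_at = None   # when the running instance is really gone
    for t in sorted(arrivals):
        if completes_at is None or t >= completes_at:
            # No running instance: the message activates a new one.
            processed.append(t)
        elif t < timeout_at:
            # The instance is still listening: a normal convoy receive.
            processed.append(t)
        else:
            # The instance still looks "Running" but has left the Listen
            # shape, so the message is routed to it and never consumed.
            zombies.append(t)
            continue
        # A consumed message restarts the idle timer.
        timeout_at = t + delay
        completes_at = timeout_at + exit_time
    return processed, zombies


processed, zombies = run_convoy([0, 1, 2, 7.2, 7.4, 20], delay=5, exit_time=0.5)
print("processed:", processed)
print("zombies:  ", zombies)
```

In this run the instance times out at 7 and completes at 7.5, so the messages arriving at 7.2 and 7.4 fall into the gap, while the one at 20 simply starts a fresh instance.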
In my opinion, the first thing to do in this situation (and I am not being facetious here) is to verify that your correlation set is actually specific enough for your business scenario. For example, if we are transmitting patient medical records, we need to correlate messages for the same patient only, not for ALL patients. In other words, don’t correlate JUST on MessageType if you can correlate on MessageType AND PatientID.
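The effect of a narrower correlation set can be sketched in a few lines of Python. This is only an illustration: the message dictionaries and the `group_by_correlation` helper are invented here, and BizTalk of course does its routing through subscriptions, not through code like this.

```python
from collections import defaultdict


def group_by_correlation(messages, key_fields):
    """Group messages the way a convoy correlating on key_fields would."""
    convoys = defaultdict(list)
    for msg in messages:
        key = tuple(msg[f] for f in key_fields)
        convoys[key].append(msg)
    return dict(convoys)


# Hypothetical sample messages.
messages = [
    {"MessageType": "MedicalRecord", "PatientID": "P1", "body": "a"},
    {"MessageType": "MedicalRecord", "PatientID": "P2", "body": "b"},
    {"MessageType": "MedicalRecord", "PatientID": "P1", "body": "c"},
]

coarse = group_by_correlation(messages, ["MessageType"])
fine = group_by_correlation(messages, ["MessageType", "PatientID"])

print(len(coarse), "convoy(s) when correlating on MessageType only")
print(len(fine), "convoy(s) when correlating on MessageType and PatientID")
```

With the coarse key every message lands in the same convoy, so any message at all can hit that instance while it is shutting down. With the finer key only messages for the same patient can.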
Secondly, set your delay interval to an appropriate value. Setting it too low makes the Orchestration time out more often, which creates more of these race windows per unit of time. A well-chosen delay will reduce your chances of getting Zombies, but won’t give you a 100% guarantee.
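One rough way to reason about the delay value is to count how often a given traffic pattern would let the timeout fire at all, since every timeout opens one race window. The Python sketch below does that for a made-up list of arrival times; it is a simplification that ignores how long the exit logic itself takes.

```python
def count_timeouts(arrivals, delay):
    """Count how many times the idle gap between messages reaches `delay`.

    Each such gap means the delay branch fires once, i.e. one race window.
    The final instance always times out after the last message, hence the +1.
    """
    times = sorted(arrivals)
    gaps = (b - a for a, b in zip(times, times[1:]))
    return sum(1 for g in gaps if g >= delay) + (1 if times else 0)


# Hypothetical arrival times for one correlation key.
arrivals = [0, 2, 3, 9, 10, 30, 31, 33]

for delay in (1.5, 5, 25):
    print("delay", delay, "->", count_timeouts(arrivals, delay), "race window(s)")
```

The trade-off is that a longer delay keeps each instance alive longer, so the right value depends on how quickly the convoy's output is needed downstream.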
Thirdly, have a proper plan for dealing with these Zombies according to your business scenario. Retransmit? Discard? What?
Finally, ponder the “Receive Draining” pattern mentioned in Lee Graber’s blog, and if you have a nice, elegant way of doing it, tell me about it FIRST!
Comments
If you are interested and like to test it, go ahead and drop me an e-mail...
Br,
Gregory (gregory@eai.be)