At the beginning of last week a report came in from our users saying they were experiencing timeout issues with the application, which had been running stably in production since the beginning of the year. Because this was a blocking issue we quickly started investigating what went wrong. After looking at the code for about half an hour we decided we needed to turn on tracing to find out what was really happening. Once we did, we found out that the events we were sending to our workflows weren’t being delivered. We couldn’t quickly figure out what the problem was, so we decided to remove the running workflow from the persistence store, which caused the application to create a new workflow and force it into the correct state. It wasn’t an ideal solution, but at least it meant the users could continue working with the application. We did, however, request a backup of the database so we could debug the problem some more.
So after I had restored the backup from the production environment on my local development machine and got everything set up, I found out that the reason the events weren’t delivered to the workflow was that the scheduler was busy executing another activity. It took quite some debugging and reflectoring to find out that when a message is written to the workflow queue, the event doesn’t get fired until the SchedulerBusy flag on the internal StateMachineExecutionState class is set to false. But I still couldn’t figure out why this specific flag was set to true instead of false.
A colleague of mine suggested using the Workflow Monitor, one of the applications that comes with the Windows Workflow Foundation samples. It queries a SQL tracking database and displays the current state of all the running workflows. Quite a nifty tool if you ask me. So I started it up and had a look at the workflows that were working and the ones that were broken. Although it wasn’t immediately obvious, I found out that the broken workflows were all currently executing a ConfigurableDelayActivity.
This was an activity written by my colleague Robert, and it was supposed to solve two problems. The first was the requirement to be able to configure the various delays used by the workflow. The second was a bug we found in Windows Workflow Foundation. You can find the full details of this bug in the linked post, but in short it meant that if you were using the ManualWorkflowScheduler (which is recommended when hosting in IIS) and you wanted a timer that expires more than 47 days in the future, you got an exception. So with this activity we thought we had a workaround.
Unfortunately we found out that our workaround wasn’t working correctly. When the ConfigurableDelayActivity became the next activity to execute, it would create a timer queue for itself and set it to some time in the future. When that time was more than the earlier-mentioned 47 days away, it would instead create a timer for 47 days. Then, when the timer went off, the activity started executing, so the Execute method was called. There we determined whether the actual delay had expired. If it hadn’t, we would wait some more by repeating the process. After that, the activity would return ActivityExecutionStatus.Executing. This is where the problems started.
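The chunking idea itself is sound, and can be sketched independently of WF. The real activity is C# built on WF internals; this is a minimal Python model of just the timer-splitting logic, with all names (and the exact 47-day cap) taken as illustrative assumptions:

```python
from datetime import datetime, timedelta

# The scheduler rejects timers further out than roughly 47 days, so the
# workaround arms shorter timers and re-checks the real deadline each
# time one fires. MAX_CHUNK and the function names are illustrative.
MAX_CHUNK = timedelta(days=47)

def next_timer_interval(now: datetime, expires_at: datetime) -> timedelta:
    """How long the next timer should run: the full remaining delay if it
    fits under the cap, otherwise one maximum-length chunk."""
    remaining = expires_at - now
    return remaining if remaining <= MAX_CHUNK else MAX_CHUNK

def has_expired(now: datetime, expires_at: datetime) -> bool:
    """Checked from Execute() when a chunk timer fires: if the real
    deadline has passed we can finish, otherwise we arm another chunk."""
    return now >= expires_at
```

For a delay of 100 days, `next_timer_interval` yields a 47-day chunk twice and then the 6-day remainder, at which point `has_expired` finally returns true.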
Because the ConfigurableDelayActivity implements the IEventActivityListener interface, the activity itself isn’t really subscribed to the event that gets raised when a message is put in the queue. Rather, the activity hosting the ConfigurableDelayActivity, which would be an EventDrivenActivity, gets sent the message and in turn calls the Execute method on the activity implementing IEventActivityListener. The problem we were having is that once the 47-day timer went off, the activity was executed and then reported that it was still executing, which left the scheduler busy. This blocks any other EventDrivenActivity in the same state from receiving a message on a queue, because the SchedulerBusy flag is set.
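To make the blocking behavior concrete, here is a toy Python model of the interaction described above: while an activity reports it is still Executing, the scheduler stays busy and queued messages are not dispatched to anyone else. Every class and attribute here is a stand-in I made up for illustration, not a real WF type:

```python
from enum import Enum

class Status(Enum):
    EXECUTING = "Executing"
    CLOSED = "Closed"

class ToyScheduler:
    """Toy stand-in for the workflow scheduler; `busy` plays the role
    of the internal SchedulerBusy flag."""
    def __init__(self):
        self.busy = False
        self.pending = []  # messages waiting in workflow queues

    def run(self, activity):
        self.busy = True
        status = activity.execute()
        # Only an activity that closes releases the scheduler.
        if status is Status.CLOSED:
            self.busy = False

    def deliver(self, message):
        """Returns True if the message reached a listener, False if it
        is stuck in the queue because the scheduler is busy."""
        self.pending.append(message)
        if self.busy:
            return False
        self.pending.pop()
        return True
```

In this model, a delay activity that returns `Status.EXECUTING` from `execute()` leaves `busy` set, so every subsequent `deliver` call fails; one that returns `Status.CLOSED` frees the scheduler and messages flow again.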
Now, this might all be the way it is supposed to work. It’s too bad, though, that there isn’t any documentation on it, or at least we haven’t been able to find any. In fact, I had a look at what the out-of-the-box DelayActivity does, and it explicitly checks whether it is running inside an EventDrivenActivity; if it is, it will always return ActivityExecutionStatus.Closed. But as this is all internal stuff, how are we supposed to know? That’s why I thought I’d write a blog post about it, because I’m sure somebody else will run into these problems.
Although I don’t work for Microsoft anymore, I believe I can still provide some insight about this, since I was the original developer for State Machine Workflows (but not the designer of the relationship between IEventActivityListener and EventDrivenActivity).
The confusion you describe comes from the fact that IEventActivityListeners have a dual personality: one when used inside EventDriven activities and another when used as part of a regular flow.
When used as part of the regular flow, the Delay or any other Receive-style activity simply blocks waiting for the expected message to arrive, remaining in the Executing state.
EventDriven activities, on the other hand, were used to describe a “wait for the first of these messages”, which is the very nature of the State activity or of a group of EventDriven activities in a sequential workflow.
Without getting into too much detail (I don’t even think I’m allowed to, anyway), given the nature of the workflow scheduler, there were issues regarding guaranteeing that only one event would be processed in the case where two messages were received very close to each other.
So the solution adopted was to have this internal protocol that the wait is actually performed by a parent activity that “knows” about all IEventActivityListeners currently waiting.
Trust me, I’m just like any developer who hates this sort of special case. Although I had some ideas on how to clean this up, I joined the team after most of these designs were already implemented, and it was deemed too risky to make pervasive changes at that point in the project, even though a lot of people recognized the need for a better design.
The lack of documentation comes, at least in part, from the fact that the team wanted to reserve some breathing space to make these changes in the future without having to break lots and lots of customer code.
This is one of the coolest things and also the biggest curse of working on APIs that are used everywhere: you get a lot of exposure, but if you make something public that is not quite right, you have to live with it for as long as people know what the GAC stands for.
Luckily, the people (like you) who find out how these things really work under the hood are the ones best equipped to survive (and hopefully forgive) redesigns that break customer code.