Tuesday, May 06, 2014

The Sentinel Pattern

I was watching a presentation yesterday on InfoQ called Real World Akka Recipes and the presenter discussed a concept that he called a sentinel. In short, it is a process, in this case an Akka actor, that would periodically check the state of the world to see what it should look like and what it does look like. If there is a difference, then perform an appropriate action.

This got me thinking. This looks to be a good pattern for distributed systems where we have inputs and outputs. See, the issue with such systems is that they have a degree of unreliability in them. Packets can be dropped. Services can go down. A variety of different failures can occur. While there are some possible solutions, they can be unreliable and/or have a negative effect on performance. For example, we could try to use acks to know if a message was delivered. However, the ack could be lost. Then, do we really know if the message was or wasn't received?

So, how would this work? Well, since I typically work on ingesting data, let's look at that scenario. So, using Kafka, we have producers and consumers that write to/read from a broker. Consumers are pretty good about reliability since they leverage Zookeeper to know what messages have and have not been consumed. However, producers don't have that ability. Even then, there's always a chance of failure that may not report which messages were processed. So, if we use a sentinel, we can have a process that periodically checks to see what messages have been sent and which have been processed. If there is a difference, then the sentinel can make the producer re-send the messages.

Now, this is one scenario. Other scenarios could be more complex, such as replaying transactions. Or, perhaps the sentinel can't actually fix anything. In this case, the sentinel could instead report that an error has occurred. For example, it could be a transactional system where a record could be modified multiple times. A transaction request could have failed to be executed, but replaying it could have unintended consequences because it's dependent on the state of the data at a specific point in time. In this case, an error would be reported once the sentinel runs and detects that the expected state of the system does not match the current state of the system.

For this to work, the sentinel needs support from other parts of the system. The "producers" will need to keep track of what it expects the state of the system should be or at least a set of transactions that should have occurred. The actual state of the system needs to be able to be queried periodically. Both sides need to be accessible to a process that may be external to the core of the system. I state "may be" because the sentinel could be a process that lives on a separate node or a thread within a piece of software.

To conclude, I find this to be a very interesting concept. While I haven't used it myself yet, it's something that I hope to leverage at some point in time on systems that I work on. To have a clear picture of what went wrong with the potential to recover or at least get a better idea of what went wrong. I don't know for certain how I would implement it in the systems that I work on, but I will keep it in mind as we move forward. Even if I don't get a chance to use it myself, hopefully someone else finds this useful.