Tuesday, September 08, 2015

Are distributed frameworks necessary?

When someone asks how they should approach a big data problem, they are typically steered toward an existing product like Hadoop or Storm.  While these are nice, they come with some drawbacks.  First, at least in my experience, to get the best results you have to write your code in the framework's preferred language or VM, which typically seems to be the JVM.  That isn't necessarily bad, but it may mean rewriting code to fit the framework.  Frameworks like Hadoop and Storm also approach their problems very differently, which makes it harder to reuse code between them.  And if you want to do both streaming and batch analytics, you need both.  Granted, some frameworks may be able to do both; I don't know, as there are many to choose from and it's hard to keep up.

I'm currently working on a distributed system that uses none of those technologies.  It's working quite nicely.  It's not perfect, but it's getting there.  It got me wondering whether these frameworks are even necessary.  In reality, what is the difference between MapReduce and streaming frameworks?  Data is streamed through different processes.  It's just a matter of how the processes are chained together and how often they emit data.
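
To make that concrete, here's a minimal sketch of a single word-count process.  Everything in it is invented for illustration (the --mode flag, the one-JSON-object-per-line record format); it isn't tied to any framework.  The only difference between its streaming and batch behavior is when it emits results:

    #!/usr/bin/env python3
    """Word counting as either a streaming or a batch process."""
    import argparse
    import json
    import sys
    from collections import Counter

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--mode", choices=["stream", "batch"], default="stream")
        args = parser.parse_args()

        counts = Counter()
        # Input protocol: one JSON object per line, e.g. {"text": "hello world"}
        for line in sys.stdin:
            record = json.loads(line)
            counts.update(record["text"].split())
            if args.mode == "stream":
                # Streaming: emit updated counts after every record.
                print(json.dumps(dict(counts)), flush=True)
        if args.mode == "batch":
            # Batch: emit a single result once all input has been consumed.
            print(json.dumps(dict(counts)))

    if __name__ == "__main__":
        main()

Chain a few processes like that together and you have a pipeline; how often each one emits is just a knob.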

So perhaps what we really need is documentation on how to stand up multiple processes and how to tie them together.  I believe we have most, if not all, of the technology to do it.  Mesos and Kubernetes can be used to execute processes on a cluster of machines.  Queueing technologies like Kafka and NSQ can be used to pass messages between processes.  The processes themselves can be written in many different languages and packaged in containers using Docker or similar to manage dependencies.
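
As a rough illustration, here's what one worker in such a pipeline might look like.  It assumes the third-party kafka-python package and a broker at localhost:9092, and the topic names are made up:

    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    for message in consumer:
        record = json.loads(message.value)
        # ... run whatever analytic this process is responsible for ...
        result = {"words": len(record.get("text", "").split())}
        # Publish downstream; the next process in the chain subscribes to this topic.
        producer.send("word-counts", json.dumps(result).encode("utf-8"))

Package that up in a container and Kubernetes or Mesos can schedule it like any other long-running process.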

I personally find this to be the preferable way to go.  This approach puts the focus on how the processes communicate with each other, and by focusing on the protocol, I believe we can truly decouple the processes from one another.  The technology behind each analytic can also be swapped out more easily when needed.  For example, Python could be used to prototype an analytic and, later, when performance becomes more of an issue, it could be rewritten in a compiled language like D or Go.  We also get better reuse, since the same process could be used for both stream processing and batch processing/MapReduce without code changes.
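
To show what focusing on the protocol might mean in practice, here's a hypothetical sketch of a shared message envelope.  The field names are invented, but any language that can handle JSON could implement the same contract, which is exactly what makes swapping a Python prototype for a D or Go version painless:

    import json
    import time

    def wrap(payload, source):
        """Wrap an analytic's output in a common envelope before publishing."""
        return json.dumps({
            "source": source,          # which process produced this message
            "timestamp": time.time(),  # when it was produced
            "payload": payload,        # the analytic-specific body
        })

    def unwrap(message):
        """Unpack an incoming message; fail loudly if it breaks the contract."""
        envelope = json.loads(message)
        for field in ("source", "timestamp", "payload"):
            if field not in envelope:
                raise ValueError("missing required field: " + field)
        return envelope["payload"]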

Granted, this is a rough idea and doesn't cover all cases and aspects of these systems, but I believe it's a good start.  What I would love to see is a project that goes more in depth and perhaps defines a specification for such systems.  That way supporting libraries, if needed, could follow the specification to ensure compatibility.  It should also, and perhaps more importantly, describe what to do in the event of a compatibility issue.

Thoughts?