Tuesday, September 08, 2015

Are distributed frameworks necessary?

When someone asks how they should approach a big data problem, they are typically steered toward an existing product like Hadoop or Storm.  While these are nice, they come with some issues.  First, at least in my experience, to get the best results you have to write your code in the language/VM preferred by the framework, which typically means the JVM.  This isn't necessarily bad, but it may mean rewriting code to fit the framework.  Frameworks like Hadoop and Storm also do things very differently from one another, making it harder to reuse code between them.  And if you want to do both streaming and batch analytics, you need both.  Granted, some frameworks may be able to do both; I don't know, as there are many to choose from and it's hard to keep up.

I'm currently working on a distributed system that uses none of those technologies, and it's working quite nicely.  It's not perfect, but it's getting there.  It got me wondering whether these frameworks are even necessary.  What, really, is the difference between MapReduce and streaming frameworks?  In both, data is streamed through a series of processes; it's just a matter of how the processes are chained together and how often they emit data.
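To make that concrete, here's a minimal sketch of the idea in Python — the function and variable names are my own, not from any framework.  A batch word count and a streaming word count can share the exact same core analytic; the only difference is how often results are emitted:

```python
from collections import Counter

def count_words(lines):
    """Core analytic: identical for batch and streaming modes."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def batch_mode(lines):
    # Batch/MapReduce style: consume everything, emit once at the end.
    return count_words(lines)

def stream_mode(lines, window=2):
    # Streaming style: emit partial counts every `window` lines.
    buffer = []
    for line in lines:
        buffer.append(line)
        if len(buffer) == window:
            yield count_words(buffer)
            buffer = []
    if buffer:
        yield count_words(buffer)

data = ["a b a", "b c", "a"]
print(batch_mode(data))         # one tally for the whole data set
print(list(stream_mode(data)))  # one tally per window
```

Merging the streamed partial counts reproduces the batch result, which is the point: the analytic itself doesn't know which mode it's running in.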

So, perhaps what we really need is documentation on how to stand up multiple processes and tie them together.  I believe we already have most, if not all, of the technology to do it.  Mesos and Kubernetes can execute processes on a cluster of machines.  Queueing technology like Kafka and NSQ can pass messages between processes.  And the processes themselves can be written in many different languages and packaged in containers, using Docker or similar, to manage dependencies.
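The wiring might look something like the following sketch, where threads and in-memory queues stand in for separate processes and Kafka/NSQ topics just to keep the example self-contained — the stage names and sentinel convention are my own:

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker passed down the pipeline

def tokenizer(inbox, outbox):
    # First stage: split raw lines into words.
    for line in iter(inbox.get, SENTINEL):
        for word in line.split():
            outbox.put(word)
    outbox.put(SENTINEL)

def counter(inbox, results):
    # Second stage: tally words, emit once the stream ends.
    counts = {}
    for word in iter(inbox.get, SENTINEL):
        counts[word] = counts.get(word, 0) + 1
    results.put(counts)

def run_pipeline(lines):
    lines_q, words_q, results_q = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [threading.Thread(target=tokenizer, args=(lines_q, words_q)),
              threading.Thread(target=counter, args=(words_q, results_q))]
    for t in stages:
        t.start()
    for line in lines:
        lines_q.put(line)
    lines_q.put(SENTINEL)
    for t in stages:
        t.join()
    return results_q.get()

print(run_pipeline(["a b a", "b c"]))
```

Swap the in-memory queues for Kafka topics and the threads for containers scheduled by Mesos or Kubernetes, and the shape of the system stays the same: each stage only knows about its inbox and outbox.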

I personally find this to be the preferable way to go.  This approach puts the focus on how the processes communicate with each other, and by focusing on the protocol, I believe we can truly decouple the processes from one another.  The technologies used by the analytics also become easier to swap out when needed.  For example, Python could be used to prototype an analytic, and later, when performance becomes more of an issue, it could be rewritten in a compiled language like D or Go.  We also get better reuse, as the same process could serve both stream processing and batch processing/MapReduce without code modifications.

Granted, this is a rough idea and doesn't cover all cases and aspects of these systems, but I believe it's a good start.  What I would love to see is a project that goes more in depth and perhaps defines a specification for such systems.  That way supporting libraries, if needed, could follow the specification to ensure compatibility.  Also, and perhaps more importantly, it could describe what to do in the event of a compatibility issue.



Blogger scarecrow said...

In fact, I think it's easy to implement a raw distributed computing system, and I did it.
It contains less than 10,000 lines of Java source code.
I just use raw message queues and TCP sockets to connect jobs; the job manager monitors how fast jobs
consume their queues to handle back pressure, and it provides a great UI for observing how data streams from job to job.
It's simple, and more importantly, it's transparent.
And it's service oriented: all jobs communicate with the job manager as a service, and the manager itself is also a service.
I just drew up a protocol that describes how jobs communicate with the manager and defined the different kinds of messages.
I can use any language to implement the framework, and I can use any language to write jobs,
because the jobs and the framework depend on nothing other than TCP sockets and protocol messages.
If I implemented the framework in Erlang/OTP, maybe it would only need several thousand lines of Erlang code.
I don't know why those frameworks are so complicated.
Maybe my framework is too raw to use. :)
Finally, you might be interested in Akka Streams; it's nice.

1:59 AM  
Blogger blockcipher said...

Cool. I personally like the concept of using a mature, reliable queue to connect things together. Kafka is my favorite, but there are plenty of others to choose from; even ZeroMQ and nanomsg are good choices. I'd rather use something that works well than roll my own. That makes language support an issue, but it can be worked around in a number of different ways. We're actually doing that now with Kafka.

As for Akka, unfortunately I can't use it as we're not using the JVM for anything we're doing.

5:41 PM  
