Wednesday, August 21, 2013

D for Analytics

I've been working with Twitter Storm for a few weeks, and while it's pretty nice and simple to create bolts in, it's not necessarily the cleanest way to work with streaming data. Granted, I'm using the standard API, so that may be part of it; Trident looks to be closer to what I'm looking for. I've also been doing some more reading about other APIs for MapReduce, such as Cascading.

Now, the big thing that kept going through my mind is how to make this nicer. I'm not saying that what is there isn't nice, but everything that's been created for this isn't quite as nice as I would like. It's really limited by what you can do in Java. For example, until Java 8 comes out, Java doesn't have lambda syntax. There's more, but there's no sense in spending more time pointing out the deficiencies that I see.

So, what would be nicer? Perhaps I'm biased, but I feel D would be a better language for this. There are several features that make it a better choice in my opinion:

  1. Enforceable purity - Functions can be created that have no side effects, thus making them ideal for concurrent applications.
  2. Ranges - These data structures are used to deal with iterative data, either reading or writing, and are leveraged by many functions within the standard library.
  3. Message passing - This is a better concurrency model that can be used to communicate messages between threads.
  4. Lower memory footprint - No virtual machine is required, and basic data types can be placed on the stack rather than the heap, resulting in more efficient memory access.
  5. Optional garbage collection - Many parts of the standard library and the language do use it, but there are many aspects that don't require it.
  6. Scope statement - This ensures that cleanup code is executed even in the event of an error, so everything ends up in a sane state.
  7. Better standard library for algorithms.
  8. Between mixins, templates, and compile time function execution (CTFE), a significant amount of work can be done at compile time vs. execution time. This means we can create DSLs, write more reusable methods, generate static data, etc.
  9. Parallelism is part of the standard library.
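
To make a few of these concrete, here is a rough sketch of how purity, ranges, message passing, the scope statement, and std.parallelism look in practice. The function names (`square`, `sumOfEvenSquares`, `parallelSumOfSquares`, `squareWorker`, `squareViaWorker`) are made up for illustration, and none of this is tied to Storm or to any existing D analytics library; it only uses the standard library.

```d
import std.algorithm : filter, map, sum;
import std.concurrency : receiveOnly, send, spawn, thisTid, Tid;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

// 1. Purity: the compiler rejects any access to mutable global state here.
int square(int x) pure nothrow @nogc @safe
{
    return x * x;
}

// 2. Ranges: a lazy pipeline; nothing runs until `sum` consumes it.
int sumOfEvenSquares(int limit)
{
    return iota(1, limit + 1)
        .filter!(n => n % 2 == 0)
        .map!square
        .sum;
}

// 9. Parallelism: square on the default thread pool, then reduce.
int parallelSumOfSquares(int limit)
{
    auto squares = taskPool.amap!square(iota(1, limit + 1));
    return taskPool.reduce!"a + b"(0, squares);
}

// 3. Message passing: the worker shares no state, it only sends a reply.
void squareWorker(Tid owner)
{
    immutable n = receiveOnly!int();
    owner.send(square(n));
}

// Spawns a worker thread, sends it a number, and waits for the answer.
int squareViaWorker(int n)
{
    auto worker = spawn(&squareWorker, thisTid);
    worker.send(n);
    return receiveOnly!int();
}

void main()
{
    // 6. Scope statement: runs when main exits, even via an exception.
    scope (exit) writeln("pipeline finished");

    assert(sumOfEvenSquares(10) == 220);     // 4 + 16 + 36 + 64 + 100
    assert(parallelSumOfSquares(10) == 385); // 1 + 4 + ... + 100
    assert(squareViaWorker(7) == 49);
}
```

Because `square` is marked `pure`, it is safe to hand to the thread pool and to the worker thread without thinking about locks, which is the main reason I list purity first.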

There are more features that I haven't mentioned yet, but that's a good number of features that would be very useful. Just looking at the ones I mentioned, you can see that there are significant reasons to look into D for processing data: they let us write safer, more reliable code that can be very flexible and performant. Granted, there is a lack of libraries for various algorithms, such as NLP. However, I feel that the current standard library is a good base for such algorithms since it has plenty of functions and data structures.

Others appear to agree, as this paper comes to a similar conclusion, though for somewhat different reasons. The concern of the authors was the fact that two languages were used in the past: one for performance and one for flexibility and rapid development. This is another area where D is a good choice because of the reasons I mentioned above, but also because it results in very fast code. Granted, it may not always be as fast as pure C, but it's much faster than languages like Python, Ruby, and Perl. Also, if we need code that is as fast as C, we can write it in C and access it from D. However, this shouldn't be necessary, as D allows you to do everything you can do in C and modern compilers generate executables that are very efficient.
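
As a small sketch of what accessing C from D looks like, the example below declares a binding by hand for `strlen` from the C runtime and calls it. The wrapper name `cLength` is hypothetical, and in real code you would normally just import `core.stdc.string` instead of writing the declaration yourself. A binding for your own C code would look the same, assuming you link the compiled object file or library into the D program.

```d
import std.string : toStringz;

// Hand-written binding for a function that lives in the C runtime.
// extern (C) tells the compiler to use C linkage and calling conventions.
extern (C) size_t strlen(const char* s) nothrow @nogc;

// Hypothetical wrapper that hides the C-string conversion from callers.
size_t cLength(string s)
{
    // D strings are not NUL-terminated, so convert before crossing into C.
    return strlen(s.toStringz);
}

void main()
{
    assert(cLength("storm") == 5);
    assert(cLength("") == 0);
}
```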