Tuesday, September 08, 2015

Are distributed frameworks necessary?

When someone asks how they should approach a big data problem, they are typically steered toward an existing product like Hadoop or Storm.  While these are nice, they come with some drawbacks.  First, at least in my experience, to get the best results you have to write your code in the language/VM preferred by the framework, which typically seems to be the JVM.  While this isn't necessarily bad, it may mean that you have to rewrite code to fit the framework.  Frameworks like Hadoop and Storm also do different things in very different ways, which makes it harder to reuse code between them.  And if you want to do both streaming and batch analytics, you need both.  Granted, some frameworks may be able to do both; I don't know for certain, as there are many to choose from and it's hard to keep up.

I'm currently working on a distributed system and it uses none of those technologies.  It's working quite nicely.  It's not perfect, but it's getting there.  It got me wondering whether these frameworks are even necessary.  In reality, what is the real difference between MapReduce and streaming frameworks?  Data is streamed through different processes; it's just a matter of how they are chained together and how often the processes emit data.

So, perhaps what we really need is documentation on how to stand up multiple processes and how to tie them together.  I believe we have most if not all of the technology to do it.  Mesos and Kubernetes can be used to execute processes on a cluster of machines.  Queueing technology like Kafka and NSQ can be used to pass messages between processes.  Processes can be written in many different languages and can be packaged in containers using Docker or similar to manage dependencies.

I personally find this to be the preferable way to go.  This approach can focus on how the processes communicate with each other.  And by focusing on the protocol, I believe that we can truly decouple the processes from each other.  Also, technologies used by the analytics can be more easily swapped out when needed.  For example, Python could be used to prototype an analytic and later, when performance becomes more of an issue, it could be rewritten in a compiled language like D or Go.  We also get better reuse as the same process could be used for both stream processing and batch processing/MapReduce without code modifications.
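To make that concrete, here's a minimal sketch (in Python, purely as an illustration; the analytic itself is made up) of what such a decoupled process could look like: it reads records from one stream and writes results to another, so the exact same code can be fed from a file for batch work or from a queue consumer for streaming.

```python
def process(record: str) -> str:
    """Hypothetical analytic: prefix each record with its word count."""
    words = record.lower().split()
    return f"{len(words)} {record.strip().lower()}"

def run(instream, outstream):
    # The same loop serves batch (file in, file out) and streaming
    # (a queue consumer feeding instream, a producer draining outstream).
    for line in instream:
        if line.strip():
            outstream.write(process(line) + "\n")
```

In a deployed pipeline, `instream`/`outstream` would be wired to stdin/stdout or to a Kafka/NSQ client; the analytic itself never needs to know which.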

Granted, this is a rough idea and doesn't cover all cases and aspects of these systems, but I believe it's a good start.  What I would love to see is a project that goes more in depth and perhaps defines a specification for such systems.  That way, supporting libraries, if needed, can follow the specification to ensure compatibility.  Also, and perhaps more importantly, the specification could describe what to do in the event of a compatibility issue.


Tuesday, May 06, 2014

The Sentinel Pattern

I was watching a presentation yesterday on InfoQ called Real World Akka Recipes, and the presenter discussed a concept that he called a sentinel. In short, it is a process, in this case an Akka actor, that periodically checks the state of the world: what it should look like versus what it does look like. If there is a difference, it performs an appropriate action.

This got me thinking. This looks to be a good pattern for distributed systems where we have inputs and outputs. See, the issue with such systems is that they have a degree of unreliability in them. Packets can be dropped. Services can go down. A variety of different failures can occur. While there are some possible solutions, they can be unreliable and/or have a negative effect on performance. For example, we could try to use acks to know if a message was delivered. However, the ack could be lost. Then, do we really know if the message was or wasn't received?

So, how would this work? Well, since I typically work on ingesting data, let's look at that scenario. So, using Kafka, we have producers and consumers that write to/read from a broker. Consumers are pretty good about reliability since they leverage Zookeeper to know what messages have and have not been consumed. However, producers don't have that ability. Even then, there's always a chance of failure that may not report which messages were processed. So, if we use a sentinel, we can have a process that periodically checks to see what messages have been sent and which have been processed. If there is a difference, then the sentinel can make the producer re-send the messages.

Now, this is one scenario. Other scenarios could be more complex, such as replaying transactions. Or, perhaps the sentinel can't actually fix anything. In this case, the sentinel could instead report that an error has occurred. For example, it could be a transactional system where a record could be modified multiple times. A transaction request could have failed to be executed, but replaying it could have unintended consequences because it's dependent on the state of the data at a specific point in time. In this case, an error would be reported once the sentinel runs and detects that the expected state of the system does not match the current state of the system.

For this to work, the sentinel needs support from other parts of the system. The "producers" will need to keep track of what they expect the state of the system to be, or at least a set of transactions that should have occurred. The actual state of the system needs to be able to be queried periodically. Both sides need to be accessible to a process that may be external to the core of the system. I state "may be" because the sentinel could be a process that lives on a separate node or a thread within a piece of software.
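As a rough illustration (the names and shapes here are my own, not from the talk), a single sentinel pass boils down to diffing expected state against actual state and acting on the difference:

```python
def reconcile(expected_ids, processed_ids, resend=None, report=print):
    """One sentinel pass: compare what should have happened (expected_ids)
    against what actually did (processed_ids), then act on anything missing."""
    missing = sorted(set(expected_ids) - set(processed_ids))
    for msg_id in missing:
        if resend is not None:
            resend(msg_id)   # recoverable case: trigger a re-send
        else:
            report(msg_id)   # unrecoverable case: just report the discrepancy
    return missing
```

In the Kafka scenario, `expected_ids` would come from the producer's record of sent messages and `processed_ids` from querying the consumer side; the sentinel would run this on a timer.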

To conclude, I find this to be a very interesting concept. While I haven't used it myself yet, it's something that I hope to leverage at some point on the systems that I work on: to have a clear picture of what went wrong, with the potential to recover, or at least a better idea of where to look. I don't know for certain how I would implement it, but I will keep it in mind as we move forward. Even if I don't get a chance to use it myself, hopefully someone else finds this useful.

Wednesday, August 21, 2013

D for Analytics

I've been working with Twitter Storm for a few weeks, and while it's pretty nice and simple to create bolts in, it's not necessarily the cleanest way to work with streaming data. Granted, I'm using the standard API, so that may be part of it, as Trident looks to be more like what I'm looking for. I've also been doing some more reading about other APIs for MapReduce, such as Cascading.

Now, the big thing that kept going through my mind is how to make this nicer. I'm not saying that what is there isn't nice, but everything that's been created for this isn't quite as nice as I would like. It's really limited by what you can do in Java. For example, until Java 8 comes out, there's no lambda syntax. There's more, but there's no sense in spending more time pointing out the deficiencies that I see.

So, what would be nicer? Perhaps I'm biased, but I feel D would be a better language for this. There are several features that make it a better choice in my opinion:

  1. Enforceable purity - Functions can be created that have no side effects, thus making them ideal for concurrent applications.
  2. Ranges - These data structures are used to deal with iterative data, either reading or writing, and are leveraged by many methods within the standard library.
  3. Message passing - This is a better concurrency model that can be used to communicate messages between threads.
  4. Lower memory footprint - No virtual machine is required, and basic data types can be placed on the stack vs. the heap, resulting in more efficient memory access.
  5. Optional garbage collection - Many parts of the standard library and the language do use it, but there are many aspects that don't require it.
  6. Scope statement - This is used to ensure that even in the event of an error, cleanup code can be executed so everything ends up in a sane state.
  7. Better standard library for algorithms.
  8. Between mixins, templates, and compile-time function execution (CTFE), a significant amount of work can be done at compile time vs. execution time. This means we can create DSLs, write more reusable methods, generate static data, etc.
  9. Parallelism is part of the standard library.

There are more features that I haven't mentioned, but that's a good number of features that would be very useful. Just looking at the ones I mentioned, you can see that there are significant reasons to look into D for processing data: safer, more reliable code that can be very flexible and performant. Granted, there is a lack of libraries for various algorithms, such as NLP. However, I feel that the current standard library is a good base for such algorithms since it has plenty of functions and data structures.

Others appear to agree, as this paper comes to a similar conclusion, though for somewhat different reasons. The concern of the authors was that two languages were used in the past: one for performance and one for flexibility and rapid development. This is another area where D is a good choice, because of the reasons I mentioned above but also because it results in very fast code. Granted, it may not always be as fast as pure C, but it's much faster than languages like Python, Ruby, and Perl. Also, if we need code that is as fast as C, we can write it in C and access it from D. However, this shouldn't be necessary, as D allows you to do everything you can do in C, and modern compilers generate very efficient executables.

Sunday, March 11, 2012

Current Vim Configuration Setup

Vim is a great and wonderful tool. However, it can be a bit of a burden to maintain the configuration. I recently read an article that improved this and I tried it out to see how well it worked. So far, it has worked out well and has definitely improved how easy it is to maintain my vim configuration.

To start, my new vim configuration file contains the following:

runtime! config/*.vim

That's it. What this does is execute the configuration in every file in the .vim/config directory that ends with the .vim extension. I now have my configuration split into six different files. Here is my current breakout:


Common contains just basic options, such as setting the font, color scheme, default tab stops, etc. The rest are pretty self-explanatory. The vundle file is named as such because I use Vundle for managing my plugins; however, this could contain configuration information for any similar tool.

Another change I made was to leverage the ftplugin capability of vim to set file type specific configurations only for those specific files. I don't have a lot and most aren't that interesting, though for the java plugin, I do set a couple options to make java development a bit nicer.

setlocal omnifunc=javacomplete#Complete
noremap <F5> :let $CLASSPATH=system('cat .classpath')<CR>
compiler maven2

This sets omnicompletion to use the javacomplete plugin and maps F5 to set the classpath for the project to the contents of the .classpath file, which I generate via Maven. Most importantly, at least for me, maven2 is now used as the compiler. I like to use the :copen command to keep the window with the errors open when compiling so that I can see all of the compilation errors that occurred.

That's all for now. Hopefully this helps out anyone else trying to keep their vim configuration under control. Keeping anything organized on a computer, be it configuration files or code, can be a challenge and anything that can help making things as easy as possible is always welcome.

Monday, December 26, 2011

Howto: Exceptions

At work, we've been helping an intern who needs to learn Java and pass a test in a few weeks. Since we're helping him, I figure I should probably jot down some of what I've learned over my career in case it helps anyone else out.

Now, before I get started, anything I say here should be taken as something to be considered and not gospel. There are those who believe there is only the "one true way" to do things where in reality, there may be several depending on what you're doing, who you're working with/for, etc. Just look at the various coding standards on the internet for doing C coding. Oh, and at least one of them is the "one true coding standard."

I want to start with Exceptions mainly because it's the first thing I thought about writing about. Now, I'm not sure what the "correct" way to handle exceptions is, so I'm going to describe my preferred method of handling exceptions.

First, there's a big difference between errors and exceptions. Exceptions are errors that are unexpected during the normal operation of the software vs. errors which can occur as part of normal operation. For example, if you write a division function, divide by zero is an error because it can happen and should be checked for. However, if you're writing a logger and you run out of disk space, that's not something that we expect to happen, therefore it's an exception.

So, we should not rely on exceptions for any cases that we can resolve programmatically. This means there should never be a case where the catch block contains logic that affects the results of the program. Instead, if we know that something can happen during normal operation, it should be checked for without exceptions. I can't recall exactly where I found this originally, but throwing an exception is an expensive operation compared to checking for an error. Therefore, if you know an error can occur, check for it; truly exceptional conditions, since they should be very rare, are what exceptions are for.
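A tiny illustration of the distinction (sketched in Python for brevity, though the post is about Java; `divide` is a made-up example):

```python
def divide(a, b):
    # Divide-by-zero is an expected error during normal operation:
    # check for it up front rather than catching ZeroDivisionError.
    # Running out of disk while logging, by contrast, is the kind of
    # rare condition that genuinely belongs to exception handling.
    if b == 0:
        return None          # or an error code/result type, per taste
    return a / b
```

The caller handles `None` as part of its normal control flow; no exception machinery is involved for a condition we fully expect.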

Next, I'm a firm believer in capturing exceptions as soon as possible vs. bubbling them up. The big reason for this is that if an exception occurs, you can capture as much information as you need in order to know what caused it. On one project, this was not the practice, as some others wanted the exceptions to bubble up to a higher level before logging them. This to me is a mistake, as you can't capture any of the state where the exception occurred if you just bubble it up. However, if you capture the exception as soon as it occurs, you can log any and all necessary information.

Now, if you do have to raise exceptions up to a higher level, then I'd suggest that you first capture the exception at the lowest level, capture as many details as necessary, and then raise a new exception with all of the data stored in it. Remember, the goal is to ensure we have everything we need to figure out what happened so we can resolve it as fast as possible.
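Here's a hedged sketch of that wrap-and-re-raise idea (Python syntax; `load_config` is a made-up example function, not from any real codebase):

```python
def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # Capture the low-level detail (the path) at the point of failure,
        # then raise a new exception that carries it, chaining the original
        # cause so nothing is lost on the way up.
        raise RuntimeError(f"failed to load config from {path!r}") from e
```

The higher level still gets an exception it can act on, but the message and the chained cause preserve the local state that would otherwise vanish.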

On the subject of logging, I highly recommend using some sort of logger to capture exception information. If nothing else, you have a persistent copy of the message.

The last thing I want to bring up is using the generic Exception when capturing exceptions vs. specific exceptions. I prefer to capture generic exceptions, since there may be several different exceptions that can be thrown in a try/catch block, but unless something different needs to be done for a specific type of exception, I don't see the need to capture something specific. Typically, I just capture which exception occurred and perhaps the stack trace. I haven't had the need to do anything else, so it was simpler to just capture Exception. Also, if for some reason I add a method, or one changes and throws a different exception, I don't have to change my catch block.

Now, this is just what I do, and it has worked very well for me so far. I'm not saying that it must be done this way, but I wanted to show a way to do it and explain why I do it the way I do. Here's to hoping it helps someone.

Saturday, November 05, 2011

Central better than distributed?

While we all know that distributed version control systems are considered the wave of the future and the better solution for version control, there is at least one scenario where a centralized system can be, and is, the better solution.

First, what are version control systems typically used for? The simple answer is tracking changes to source code files. And source code is, typically, text. Therefore, it is easy to see what has changed and, more importantly, merge changes.

The merging of changes is the key issue here in that any version control system can merge changes between at least two versions of a file and have a single file as a result. The big benefit with distributed systems is that since the merging happens on the client, it's faster.

However, when we look at binary files, the whole thing falls apart, since in most cases a binary format cannot be easily merged. So, how do we ensure that two people can make changes without one set of changes overwriting the other? This is where a centralized system with locking comes into play. With the ability to lock a file, and proper training, the second person has to wait for the file to be unlocked before making changes to it.
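This isn't how any particular VCS implements its lock command, but the idea in miniature looks something like this (a lock file on a shared filesystem; names are illustrative):

```python
import os

def acquire_lock(lock_path):
    """Attempt to take an exclusive lock on a file.
    O_CREAT | O_EXCL makes creation atomic: exactly one party can win."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False   # someone else holds the lock; wait and retry

def release_lock(lock_path):
    os.remove(lock_path)
```

The second person's client simply refuses to check out the binary for editing until the first person releases the lock, which is the behavior that merging can't provide for binary formats.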

Now, where would this be useful? Here are a few scenarios: managing Word documents for a proposal, managing the images for a web site, managing diagrams for project planning and design, and managing data files that store complex data.

Granted, this isn't the most common scenario, but it's one that may exist and can be important for certain organizations or activities.

Thursday, August 25, 2011

Optimization bad? No!

I recently read another article that has an anti-optimization slant to it. In this case, it's stating that students should not learn the various little tricks that shave a few nanoseconds off the time to execute a block of code. While this post did have a point, it still angered me.

You see, there have been a number of articles expounding that readability, and probably a few other aspects of coding, are much more important and we should listen to Knuth and not optimize! This recent article even made it sound like such optimizations aren't important anymore and that's what truly got my goat.

Let me phrase this as succinctly as possible: optimizations are at least as important, if not more important, than they were years ago. These optimizations were originally meant for the original computers, where saving a few clock cycles made a large performance improvement. Now, with such fast processors, the theory is that we don't have to save a couple clock cycles here and there because the amount of time saved is inconsequential. Hate to say it, but that isn't true.

You see, many pieces of software got bigger and bigger. And as they grew, they gained many components that could be optimized. Because each component is such a small part of the system, in theory none of them should be optimized. However, these optimizations add up and, in many cases, may be in a library used by many parts of the system.

Now, let's take a step back and look at the reasons for optimization: performance, bandwidth, and memory. Most people talk about performance optimizations, where we attempt to reduce the amount of time it takes to execute a specific task. And yes, in many cases it probably doesn't matter too much, but to state that one should not learn some of the little tricks to save a few clock cycles is very mistaken.

There are two great examples of where these little tricks are still applicable: big data and real-time processing. Systems like Hadoop process terabytes and more of data, so squeezing out every bit of performance means we get the jobs done faster and potentially with less hardware. With respect to real-time systems, a great article was recently written about the algorithms used on Wall Street and how these companies need results as fast as possible. They even eliminated firewalls in the name of performance!

Another example of where performance matters is web servers, though not in the way you may think. Currently, most web sites use interpreted languages because they're easier to set up and use. However, the use of interpreters does incur a cost in terms of performance and other factors. In some cases there's a shift towards compiled languages. Facebook has even mentioned creating a compiler to compile PHP code down to executable code (there may be a step where it's compiled to C). The savings in these cases aren't just in performance, but in power usage, as the faster a request is completed, the fewer resources are used. This can lead to fewer servers being needed, thus reducing the energy footprint of the servers.

Memory usage is probably the one case where more optimization is desperately needed, but not done. How many applications these days use significantly more memory than their previous versions? How many apps are written in an interpreted language, including Java, where the users must account for the virtual machine as well as the application itself? How often is memory just poorly allocated? Many Linux distros are capable of running well on a system with 1GB of RAM, but recent versions of Windows require 1GB and don't necessarily run well without more.

By optimizing memory usage, you can allow for more programs to run concurrently without as much contention. Also, you can use cheaper computers to perform the same tasks. Or, better yet, be able to keep more data in memory for those data-intensive tasks. Personally, if I could find a browser that could keep memory usage very small and still be very functional, I would be very happy as it would allow me to have my browser going in my little VM while I'm doing my development.

Part of my job is to create software that can analyze data in batches. The server that this is running on isn't all that big and has several processes running on it. The one aspect of the system that is in our favor is that it is multi-core, so running processes in parallel doesn't affect CPU performance too much, but there's still the issue of memory. So, my software is optimized to use as little memory as possible. Also, this allows the file cache to use more memory, which helps keep file system performance reasonable.

Lastly, there's bandwidth optimization. Here we look at things like compression, JSON vs. XML, and the like. This works for both file system bandwidth and network bandwidth, as reading compressed data off a disk means fewer disk seeks. Similarly, compressed data over the network uses fewer packets. The short of it is, the faster we get data to where it can be used, the better.
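A quick sketch of the kind of win compression gives on repetitive structured data (the record shape here is made up for illustration):

```python
import json
import zlib

# A batch of repetitive structured records, the sort of payload
# you'd ship between services or write to disk.
records = [{"id": i, "status": "ok", "source": "sensor-1"} for i in range(1000)]
payload = json.dumps(records).encode("utf-8")
compressed = zlib.compress(payload)
# Fewer bytes means fewer packets on the network and fewer seeks on disk;
# the CPU cost of deflate is usually cheaper than the I/O it saves.
ratio = len(compressed) / len(payload)
```

For payloads this repetitive, the compressed form is a small fraction of the original, which is exactly the bandwidth savings being described.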

Now, most optimizations should be occurring at the algorithmic level. These are things like using a merge sort vs. a quick sort for sorting a multi-gigabyte file, or how one distributes work over multiple cores/machines. However, the lower-level optimizations are still important. Isn't it important to save a few clock cycles per record when you're processing billions/trillions of them? Isn't it important to save a few clock cycles when data needs to be converted to information as quickly as possible? If so, then where are people supposed to learn this?
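For instance, the merge-sort strategy for data that doesn't fit in memory can be sketched like this (chunks are kept in memory here for brevity; a real external sort would spill each sorted chunk to a temp file and stream them back):

```python
import heapq

def external_sort(records, chunk_size):
    """Sort data too large for memory: sort fixed-size chunks
    independently, then do a streaming k-way merge of the chunks."""
    chunks = []
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) == chunk_size:
            chunks.append(sorted(buf))
            buf = []
    if buf:
        chunks.append(sorted(buf))
    # heapq.merge streams from all chunks, holding only one
    # element per chunk in memory at a time.
    return list(heapq.merge(*chunks))
```

The merge phase only ever needs one record per chunk in memory, which is why this shape scales to multi-gigabyte files where an in-place quick sort cannot.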

And if people keep downplaying proper optimization, how hard will it become to find developers who can do proper optimization?

Please note that I'm not saying that optimization is the most important part of software development. What I'm saying is that we shouldn't treat it as if it's the least important. Unless we know for certain that performance doesn't matter, primarily because the code isn't used very often, then we should attempt to make the software as efficient as possible. The process is relatively simple: make it right, then make it fast. Well, efficient would be a better word, since we know that it's not always about raw speed.

Ugh...getting tired. Here's to hoping this is well written enough to be understandable. I apologize if this sounds angry, but it just burns me that optimization is becoming less and less of a priority, but then we complain about computers getting slower.