Sunday, March 08, 2009

Custom Data Storage vs. Mnesia

To better learn Erlang, I've been slowly working on a side project to build a search service. Since I wanted to be able to distribute the workload, my initial plan was to create a custom data structure to hold all of the data and to have each node in the cluster hold a different subset of it, with some redundancy. This way, if I have to search 1,000,000 records, for instance, I can have each of 10 nodes look at only 100K of them, in parallel.
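To make the idea concrete, here is a minimal sketch of that split-and-search plan. It is only an illustration: local processes stand in for cluster nodes, the module and function names (shard_sketch, partition, search) are made up for this example, and redundancy is left out entirely.

```erlang
-module(shard_sketch).
-export([search/3, demo/0]).

%% Split Records into N roughly equal shards, round-robin by position.
%% (Hypothetical helper; simple rather than fast.)
partition(Records, N) ->
    Indexed = lists:zip(lists:seq(0, length(Records) - 1), Records),
    [[R || {I, R} <- Indexed, I rem N =:= Shard]
     || Shard <- lists:seq(0, N - 1)].

%% Filter every shard in its own process, then merge the hits.
%% Each spawned fun captures its shard, so the shard is copied
%% into the new process's heap.
search(Pred, Records, N) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, lists:filter(Pred, Shard)} end),
                Ref
            end || Shard <- partition(Records, N)],
    %% Collect one reply per worker, matched on its unique reference.
    lists:append([receive {Ref, Hits} -> Hits end || Ref <- Refs]).

%% 1,000 records over 10 workers: each worker scans 100 records.
demo() ->
    Records = lists:seq(1, 1000),
    lists:sort(search(fun(X) -> X rem 250 =:= 0 end, Records, 10)).
```

Calling shard_sketch:demo() should return [250,500,750,1000]. With 1,000,000 records and 10 workers, the same split gives each worker 100K records.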

Well, after going through a few iterations of the interface that will be used to access the cluster, I came to a couple of realizations:

1) I don't know the best way to have a "shared" data structure in Erlang without creating some sort of artificial bottleneck.

2) Due to how Erlang works, I'd be making a copy of the data structure every time I spawn a new process.

I probably could have worked out number 1 in time, but number 2 is a problem. With a normal imperative language, I probably could have had a global variable that held the data in memory and, with some management code, had various processes access the same variable without making a copy of it. In Erlang, the best I could think of to keep memory usage low is a singleton: one process that owns the data and answers every request. That means there would be a bottleneck, and one that I created myself. I don't want to repeat mistakes that other people have already made.
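For illustration, the singleton approach looks roughly like the sketch below. The names (store_sketch, start, lookup) and the list-of-tuples data layout are assumptions for this example, not the project's actual code. One process owns the data and every lookup goes through its mailbox, so requests are answered one at a time, which is where the bottleneck comes from.

```erlang
-module(store_sketch).
-export([start/1, lookup/2, demo/0]).

%% Start the single owner process. Data is a list of {Key, Value}
%% tuples; it is copied once into the owner's heap when spawned.
start(Data) ->
    spawn(fun() -> loop(Data) end).

%% The owner handles one request at a time, in arrival order.
loop(Data) ->
    receive
        {lookup, From, Ref, Key} ->
            From ! {Ref, lists:keyfind(Key, 1, Data)},
            loop(Data);
        stop ->
            ok
    end.

%% Client side: send a request and wait for the matching reply.
%% Only the reply is copied back, not the whole data set.
lookup(Pid, Key) ->
    Ref = make_ref(),
    Pid ! {lookup, self(), Ref, Key},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        timeout
    end.

demo() ->
    Pid = start([{1, "first"}, {2, "second"}]),
    Found = lookup(Pid, 2),
    Missing = lookup(Pid, 3),
    Pid ! stop,
    {Found, Missing}.
```

Here store_sketch:demo() should return {{2,"second"},false}. Memory stays low because there is only one copy of the data, but no two lookups ever run at the same time.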

Now, I didn't think about using Mnesia initially because I was concerned about having a copy of the data on every node. If the dataset became large, I might need fairly beefy machines for each node, and that didn't really sit well with me. However, I came to a realization: I don't plan on storing that much data. Yes, there may be a large number of records, but each one would be small. With disks being relatively cheap, this may not be a problem, and it solves many of the other problems I was planning on dealing with, such as distributing the data. I haven't read up on Mnesia too much, so I may be able to do some of the things I wanted to do anyway, but even if I can't, just using Mnesia would be a good start. If nothing else, I should be able to create something that can withstand multiple node failures and still be useful.
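As a starting point, a single-node Mnesia table can be sketched as below. The doc record, its fields, and the module name are hypothetical, and the table is kept in RAM only so the example runs without creating a schema on disk. Worth noting: the node list given to ram_copies (or disc_copies) decides which nodes hold a replica of the table, so a copy on every node is a choice rather than a requirement.

```erlang
-module(mnesia_sketch).
-export([demo/0]).

%% Hypothetical record for one searchable item.
-record(doc, {id, text}).

demo() ->
    %% Starting without a disk schema gives a RAM-only node.
    ok = mnesia:start(),
    %% The list after ram_copies names the nodes that hold a replica;
    %% here that is just the local node.
    {atomic, ok} = mnesia:create_table(doc,
        [{attributes, record_info(fields, doc)},
         {ram_copies, [node()]}]),
    %% Writes and reads run inside transactions.
    {atomic, ok} = mnesia:transaction(fun() ->
        mnesia:write(#doc{id = 1, text = "erlang search"}),
        mnesia:write(#doc{id = 2, text = "mnesia tables"})
    end),
    {atomic, [Doc]} = mnesia:transaction(fun() ->
        mnesia:read(doc, 1)
    end),
    Doc#doc.text.
```

On a fresh node, mnesia_sketch:demo() should return "erlang search". Running it a second time on the same node fails at create_table because the table already exists.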

Regardless of how the actual implementation will work, I still plan on making it work well in a single-server environment before moving to multiple nodes. Not much of a point in building something reliable if it doesn't work right, now is there?
