Sunday, March 08, 2009

Custom Data Storage vs. Mnesia

To better learn Erlang, I've been slowly working on a side project to build a search service. Since I wanted to be able to distribute the workload, my initial plan was to create a custom data structure to hold all of the data and to have each node in the cluster hold a different subset of it, with some redundancy. This way, if I have to search 1,000,000 records, for instance, each of 10 nodes can look at only 100K of them, in parallel.
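Here's a rough sketch of that idea, not code from the actual project. The module name shard_sketch and both functions are made up for illustration, local processes stand in for cluster nodes, and redundancy is left out entirely.

```erlang
-module(shard_sketch).
-export([split/2, search/2]).

%% Deal Records out to N shards round-robin, so each shard holds
%% roughly length(Records) / N of them.
split(Records, N) when is_integer(N), N > 0 ->
    Indexed = lists:zip(lists:seq(0, length(Records) - 1), Records),
    [[R || {I, R} <- Indexed, I rem N =:= S] || S <- lists:seq(0, N - 1)].

%% Filter every shard in its own process, then merge the matches.
%% Local processes stand in for cluster nodes here. Each spawned
%% fun gets its own copy of its shard.
search(Pred, Shards) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, lists:filter(Pred, Shard)} end),
                Ref
            end || Shard <- Shards],
    %% Collect one reply per shard, matching on the unique reference.
    lists:append([receive {Ref, Matches} -> Matches end || Ref <- Refs]).
```

With 1,000,000 records and N = 10, split/2 gives each shard exactly 100,000 records, and search/2 only walks one shard per process.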

Well, after going through a few iterations of the interface that will be used to access the cluster, I came to a couple of realizations:

1) I don't know the best way to have a "shared" data structure in Erlang without creating some sort of artificial bottleneck.

2) Due to how Erlang works, I'd be making a copy of the data structure every time I spawn a new process.

Given time, I probably could have worked out number 1, but number 2 is a problem. With a normal imperative language, I probably could have had a global variable that held the data in memory and, with some management code, had various processes access the same variable without making a copy of it. In Erlang, the best I could think of to keep memory usage low is a singleton, which means there would be a bottleneck, one that I created. I don't want to repeat mistakes that other people have already made.
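To make the bottleneck concrete, here's a minimal sketch of the singleton approach, assuming the singleton is a single process that owns the data and answers lookups. The module name store_sketch and its message format are hypothetical.

```erlang
-module(store_sketch).
-export([start/1, lookup/2]).

%% Start the one process that owns the data. Pairs is a list of
%% {Key, Value} tuples; only this process keeps the full structure.
start(Pairs) ->
    Data = dict:from_list(Pairs),
    spawn(fun() -> loop(Data) end).

%% Requests are handled strictly one at a time, in mailbox order.
loop(Data) ->
    receive
        {lookup, From, Ref, Key} ->
            From ! {Ref, dict:find(Key, Data)},
            loop(Data)
    end.

%% Client side: send a request and wait for the matching reply.
%% Returns {ok, Value}, error, or timeout.
lookup(Pid, Key) ->
    Ref = make_ref(),
    Pid ! {lookup, self(), Ref, Key},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        timeout
    end.
```

Only one copy of the data exists, but every search in the system has to queue up behind that one mailbox, which is exactly the bottleneck.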

Now, I didn't think about using Mnesia initially because I was concerned about having a copy of the data on every node. If the dataset became large, it might be necessary to use fairly beefy machines for each node, and that didn't really sit well with me. However, I came to a realization: I don't plan on storing that much data. Yes, there may be a large number of records, but each one would be small. With disks being relatively cheap, this may not be a problem, and it solves many of the other problems I was planning on dealing with, such as distributing the data. Now, I haven't read up on Mnesia too much, so I may be able to do some of the things I wanted to do anyway, but even if I can't, just using Mnesia would be a good start. If nothing else, I should be able to create something that can withstand multiple node failures and still be useful.
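For reference, a disk-backed Mnesia table might be set up roughly like this. It is a sketch under assumptions, not the project's schema: the doc record, its fields, and the module name mnesia_sketch are invented, and it runs on a single node. For a cluster, the idea would be to list every node in both the create_schema call and the disc_copies option.

```erlang
-module(mnesia_sketch).
-export([init/0, store/2, fetch/1]).

%% Hypothetical record: one small searchable item per row.
-record(doc, {id, text}).

init() ->
    Nodes = [node()],
    %% Must run before mnesia:start/0. Returns an error tuple if a
    %% schema already exists on disk, which is ignored here.
    _ = mnesia:create_schema(Nodes),
    ok = mnesia:start(),
    %% disc_copies keeps a replica in RAM and on disk on each listed node.
    case mnesia:create_table(doc, [{attributes, record_info(fields, doc)},
                                   {disc_copies, Nodes}]) of
        {atomic, ok} -> ok;
        {aborted, {already_exists, doc}} -> ok
    end,
    mnesia:wait_for_tables([doc], 5000).

%% Write one record inside a transaction; returns {atomic, ok} on success.
store(Id, Text) ->
    mnesia:transaction(fun() -> mnesia:write(#doc{id = Id, text = Text}) end).

%% Read by key; returns {atomic, [Record]} or {atomic, []}.
fetch(Id) ->
    mnesia:transaction(fun() -> mnesia:read(doc, Id) end).
```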

Regardless of how the actual implementation will work, I still plan on making it work well in a single-server environment before moving to multiple nodes. Not much of a point in building something reliable if it doesn't work right, now is there?

2 Comments:

Blogger Gleber said...

Hello.

There is a module called "ets", which in fact allows you to have thread-safe shared memory. It has some performance limitations, but it should be a good place to start. To store data on disk you can use the "dets" module.

Data in mnesia tables can be fragmented, so you can avoid having multiple copies of the same data on each node. For further information read [1]

You can take a look at the Scalaris project, the ringo project, the dynamo project or some other key-value storage written in Erlang.

btw, will you make your project open source?

1: http://www.trapexit.org/Mnesia_Table_Fragmentation

6:25 AM  
Blogger blockcipher said...

Hello,

Thanks for the comment. I did look into ets, but Mnesia looks to be exactly what I want. I've actually started reading more into Mnesia after this post and it really does look like what I want, including the table fragmentation.

I haven't looked into any of the other projects mainly because I didn't think of it. Also, I want to have as much low-level access to the data as possible for when I implement any algorithms I may need. I'm in a somewhat conceptual stage right now, so I'm not 100% sure about everything I'll be doing.

As for open source? Yes, I do plan on it. I just want to get it to the point where I like it first. Then, I guess I have to find someplace to host it. I just don't know when it will be.

3:01 PM  
