Sunday, February 22, 2009

Perl rocks, again.

As much as people have been knocking it lately, I have to say it's still a great tool to have around. I'm working on a project where we have to examine data from an external source in order to generate reasonable metrics/reports. Part of this is knowing the quality of the data. Well, I tried several tools to get this information and ran into two major problems:

1) They were slow. It took a while to get the metrics, and they typically sucked up a huge chunk of memory doing so.

2) I couldn't save the results, so I had to rerun the analysis every time I wanted to see it, which meant a large chunk of memory stayed in use for as long as the program was open.

Granted, this is not to say these programs didn't do a good job. They really did, at least with the analysis. Those two points are just deal-breakers for me: if I can't get the info in a reasonable amount of time, it simply isn't useful.

Perl to the rescue. Well, mostly. I had a script I created a while back for analyzing a CSV file and making an educated guess at the data type of each column. I took that and revamped it a bit, since there were some more features I wanted to add. Between the language constructs that make things easy to program and the wealth of modules available in the Perl core and on CPAN, I managed to get a very nice script up and running that analyzes the CSV file, generates the info I need, and saves the reports as HTML files.
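The post doesn't include the script itself, but a minimal sketch of the idea, assuming the Text::CSV module from CPAN and a deliberately small integer/float/text type ladder (the actual script guessed more types and produced fuller reports), might look something like this. The file handling and the widen helper are purely illustrative:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical reconstruction, not the original script: tally a
# best-guess type for each column, then dump a tiny HTML report.
my $file = shift or die "usage: $0 file.csv > report.html\n";
my $csv  = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<', $file or die "Can't open $file: $!\n";
my $header = $csv->getline($fh) or die "No header row in $file\n";

my @types;    # per-column guess, widened as rows are read
while (my $row = $csv->getline($fh)) {
    for my $i (0 .. $#$row) {
        my $v = $row->[$i];
        next unless defined $v && length $v;
        my $guess = $v =~ /^-?\d+$/      ? 'integer'
                  : $v =~ /^-?\d*\.\d+$/ ? 'float'
                  :                        'text';
        $types[$i] = widen($types[$i], $guess);
    }
}
close $fh;

# Illustrative helper: once a column has seen text it stays text,
# and float absorbs integer.
sub widen {
    my ($old, $new) = @_;
    return $new unless defined $old;
    my %rank = (integer => 0, float => 1, text => 2);
    return $rank{$old} >= $rank{$new} ? $old : $new;
}

print "<html><body><table border='1'>\n";
print "<tr><th>Column</th><th>Guessed type</th></tr>\n";
for my $i (0 .. $#$header) {
    printf "<tr><td>%s</td><td>%s</td></tr>\n",
        $header->[$i], $types[$i] // 'empty';
}
print "</table></body></html>\n";

Running it as perl csv_types.pl data.csv > report.html (the script name is hypothetical) gives the same one-pass, save-the-output workflow described above.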

Now, I just have to export the data as CSV (which I prefer, since it makes no assumptions about types), run it through the script, and I'm good to go. Not surprisingly, this is significantly faster than running a tool that works directly against the database. Many of the tables have over 100K rows, and one had over 500K, so reading all that data into memory and analyzing it takes a good bit of memory. For the analysis of one of the larger data sets, my Perl script took over 600MB of memory.

Yes, I do realize this isn't the most efficient code for the job, but I needed it sooner rather than later, and it runs fast enough now. What used to bog down my computer for quite a while now takes a minute or two.


Tuesday, February 03, 2009

Prime Factorization Using Concurrent Processes in Erlang

Well, I started doing the Project Euler problems again using Erlang, and I got to one that required factorization, so I redid problem number 3. After trying the advertised solution, I did a bit of research and came across the Sieve of Eratosthenes. It was simple to implement and seemed reasonably fast, so I used it to do the factorization.

Well, I was thinking of trying to speed up the sieve, mainly by parallelizing parts of it. Yep, I'm thinking more about processes as the days go on. It hit me yesterday that since I use the sieve to generate prime numbers that I then test as factors, I could generate the primes in parallel with the factor testing. I'm not sure it's any faster than the single-process version, especially since I'm running this in a virtual machine, but I'm guessing that on a multi-core machine it should be somewhat faster.

Here's the code for anyone who's interested.


-module(factor).
-export([pfactor/1, pfactor1/2, pseive/2]).

%% Spawn a collector and a sieve; the sieve streams primes up to
%% sqrt(N) to the collector, which tests each one against N.
pfactor(N) ->
    SN = round(math:sqrt(N)),
    P = spawn_link(factor, pfactor1, [N, []]),
    spawn_link(factor, pseive, [lists:seq(2, SN), P]).

%% Collector process: accumulate the primes that divide N.
pfactor1(N, Factors) ->
    io:format("In pfactor1.~n"),
    receive
        {factor, Factor} ->
            io:format("Received a factor: ~p~n", [Factor]),
            if
                (N rem Factor) == 0 ->
                    pfactor1(N, Factors ++ [Factor]);
                true ->
                    pfactor1(N, Factors)
            end;
        {done, []} ->
            io:format("Done.~n"),
            io:format("Factors: ~p~n", [Factors])
    after 10000 ->
        io:format("Timed out...~n")
    end.

%% Sieve of Eratosthenes as a process: send each prime to Pid as it
%% is found, then filter its multiples out of the remaining list.
pseive([P | R], Pid) ->
    io:format("Sending the new factor: ~p~n", [P]),
    Pid ! {factor, P},
    pseive([X || X <- R, (X rem P) > 0], Pid);
pseive([], Pid) ->
    io:format("All done...~n"),
    Pid ! {done, []}.
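
To try it out, assuming the module really is named factor (that's what the spawn_link calls imply), compile it in the Erlang shell with c(factor). and kick it off with something like factor:pfactor(600851475143). (the number from Euler problem 3). The collector process prints each prime factor as it arrives and dumps the full list once the sieve signals it's done.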
