Friday, October 27, 2006

Data Mining and Software Development

Dean, your comments in data mining and software development are interesting. At this point, I largely use my own MATLAB code for data mining. I have access to the Statistics and Curve Fitting Toolboxes, which provide some modeling capability and some useful utility functions. My experience is that, very often I need something which commercial tools (at the convenient interface level) do not provide. With MATLAB, once I have the data, I can prepare the data, perform modeling and report and graph results all under one roof. MATLAB-specific benefits aside, the same sort of thing could be done in other, more conventional languages like Fortran, Java or C++, perhaps with libraries like those from IMSL.

The dark side is the responsibility. I have to do all the things which the commercial shells do, such as manage the data. Occasionally I even need to manage the RAM, on really big problems. My current work machine is a Windows workstation with 2GB RAM (soon to move to a faster machine with 4GB RAM). While I have much more flexibility than the commercial tools provide, sometimes my fingers bleed (figuratively, not literally- yuck) taking care of all the details.

Still, once a decent code base is established, it isn't so bad. For instance, my feature selection process at this point is fairly efficient and robust, being implemented as a few MATLAB functions.

1 comment:

Dean Abbott said...

I agree that software that can be customized via scripting or language support provide a big advantage for applications that require more customized processing.

You refer also to reusability--a big benefit to these (or visual programming interfaces). Those tools that are just windows drop-down apps (i.e., stateless) are at a disadvantage here.