Thursday, May 24, 2007

KDnuggets 2007 Poll

The frenzy surrounding the annual KDnuggets software poll is finally over. The results are available at:

Data Mining / Analytic Software Tools (May 2007)

A number of statistical issues have been raised about this particular survey, but I will highlight only one here: the survey now reports separate counts for votes cast by people who voted for a single item and those who voted for multiple items. This is partly in response to "get out the vote" efforts made by some vendors.

Anyway, some interesting highlights:

1. Free tools made a good showing. In the lead among free tools: Yale (103 votes).
2. "Your Own Code" (61 votes) did respectably well.
3. Despite not having data mining-specific components, MATLAB (30 votes), which is my favorite tool, was more popular than a number of well-known commercial data mining tools.

Monday, May 07, 2007

Quotes from Moneyball

I know it took me too long, but I have finally read through Moneyball and thoroughly enjoyed it. Several quotes from it capture aspects of the data mining attitude I think should be adopted.

On a personal note, I suppose my fascination with data analysis started when I was playing baseball in Farm League, and later with playing Strat-o-matic baseball. I received all the 1972 teams, and proceeded to try to play every team's complete schedule--needless to say, I didn't get too far. But more than playing the games, what I enjoyed most of all was computing the statistics and listing the leaders in the hitting and pitching categories. Once a nerd, always a nerd...

Here is the first quote:

Intelligence about baseball statistics had become equated in the public mind with the ability to recite arcane baseball stats. What [Bill] James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible; and that point, somehow, had been lost. 'I wonder,' James wrote, 'if we haven't become so numbed by all these numbers that we are no longer capable of truly assimilating any knowledge which might result from them.'

Friday, May 04, 2007

PMML Deployment

I posted this question on IT Toolbox, but thought I'd post it here as well.

I'm working on a project where the company wants to score
model(s) in real time (transactional type data). They also would
like to remain vendor-independent. With these in mind, they have
considered using PMML. However, they are having a hard time
finding vendors that have a Scoring Engine that runs PMML (many
software products have this, if you want to use those products).
We want a standalone option so that no matter what tool is used to
build the models, we can just drop in the PMML code and run it.


I've discussed the options of running source code (C or Java),
but they also want to be able to update models on the fly
without a recompile.


Anyone have experiences with PMML in production out there?
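For readers unfamiliar with the format: PMML is just XML, which is what makes the "drop in the model and run it" idea plausible. Here is a minimal sketch of a PMML 3.1 document for a simple regression-style classifier (the field names, coefficients, and model are invented for illustration, not from any real project). A standalone scoring engine would parse a file like this and apply it to each transaction, so updating the model means swapping the XML file, with no recompile:

```xml
<PMML version="3.1" xmlns="http://www.dmg.org/PMML-3_1">
  <Header description="Illustrative model only, not a real scoring model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="response" optype="categorical" dataType="string">
      <Value value="yes"/>
      <Value value="no"/>
    </DataField>
  </DataDictionary>
  <RegressionModel modelName="demoScore" functionName="classification"
                   normalizationMethod="softmax">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="age"/>
      <MiningField name="response" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="-2.0" targetCategory="yes">
      <NumericPredictor name="amount" coefficient="0.003"/>
      <NumericPredictor name="age" coefficient="0.05"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="no"/>
  </RegressionModel>
</PMML>
```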

Tuesday, May 01, 2007

Comparison of Algorithms at PAKDD2007


At the link in the title are the results from the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007). The competition dataset was a cross-selling dataset, but that is not of interest to me here. The interesting questions are these: which algorithms did the best, and were they significantly different in their performance?

A note about the image in the post: I took the results and sorted them by area under the ROC curve (AUC). The results were already color-coded into groups (winners, top 10, and top 20); I changed the colors to make them more legible. I also marked in red, bold text the algorithm implementations that included an ensemble (after doing this, I discovered that the winning Probit model was also an ensemble).

And, for those who don't want to look at the image, the top four models were as follows:

AUC      Rank   Modeling Technique
70.01%   1      TreeNet + Logistic Regression
69.99%   2      Probit Regression
69.62%   3      MLP + n-Tuple Classifier
69.61%   4      TreeNet


First, note that all four winners used ensembles, but ensembles of three different algorithms: trees (TreeNet), neural networks, and Probits. The differences between these results are quite small (arguably not significant, though more testing would be needed to show this). The conclusion I draw, then, is that the ensemble matters more than the algorithm, so long as there are good predictors, variation in the data used to build the models, and sufficient diversity in the predictions issued by the individual models.
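The ensemble-over-algorithm point can be sketched with a toy experiment (synthetic data and off-the-shelf scikit-learn models, nothing to do with the actual PAKDD entries): train a few diverse classifiers and average their predicted probabilities, then compare AUCs.

```python
# Sketch: a simple probability-averaging ensemble of three diverse models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "trees": GradientBoostingClassifier(random_state=0),
    "logit": LogisticRegression(max_iter=1000),
    "mlp": MLPClassifier(max_iter=500, random_state=0),
}
probs = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    probs[name] = m.predict_proba(X_te)[:, 1]
    print(f"{name:6s} AUC = {roc_auc_score(y_te, probs[name]):.4f}")

# The "ensemble": just the average of the three probability vectors.
blend = np.mean(list(probs.values()), axis=0)
print(f"blend  AUC = {roc_auc_score(y_te, blend):.4f}")
```

The averaging step only helps when the component models disagree somewhere, which is exactly the diversity requirement described above.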

I have not yet looked at the individual results to see how much preprocessing was necessary for each of the techniques, but I suspect that less was needed for the TreeNet models simply because of the inherent strengths of CART-style trees in handling missing data, outliers, and mixed categorical/numeric data.

Second, and related to the first: while I still argue that, generally speaking, single trees are less accurate than neural networks or SVMs, ensembles level the playing field. What surprised me most was how well the logistic regression and Probit ensembles performed. That surprise wasn't about the algorithms themselves, but rather that I hadn't yet been convinced that Probits or Logits consistently work well in ensembles. This is more evidence that they do (though I need to read further about how they were constructed before commenting on why they did as well as they did).