Saturday, April 21, 2007

Data Mining Methods Poll

The latest KDNuggets poll on data mining methods has some interesting results. Decision Trees came out on top, followed by Clustering and Regression.

A couple of observations...
1) Ensembles (Bagging and Boosting) went up. The sample size is too small to draw any inferences, but this will be interesting to track over time.
2) SVMs and Neural Networks are at about the same level, though SVM usage dropped from 2006. I do wonder if SVMs will surpass neural networks as the "complex way to model accurately", but the jury is still out on this.

Tuesday, April 17, 2007

Applications of Prediction Technology

It is interesting to learn how predictive technologies are being applied. Below are links to some cases which may prove instructive as well as novel:


An Empirical Study of Machine Learning Algorithms Applied to Modeling Player Behavior in a “First Person Shooter” Video Game, Masters thesis by Benjamin Geisler

Using Machine Learning to Break Visual Human Interaction Proofs (HIPs), by Kumar Chellapilla and Patrice Y. Simard

Spatial Clustering Of Chimpanzee Locations For Neighborhood Identification, by Sandeep Mane, Carson Murray, Shashi Shekhar, Jaideep Srivastava and Anne Pusey

Are You HOT or NOT?, by Jim Hefner and Roddy Lindsay

Predicting Student Performance, by Behrouz Minaei-Bidgoli, Deborah A. Kashy, Gerd Kortemeyer, William F. Punch

Discrimination of Hard-to-pop Popcorn Kernels by Machine Vision and Neural Networks, by W. Yang, P. Winter, S. Sokhansanj, H. Wood and B. Crerer

Predicting habitat suitability with machine learning models, by Marta Benito Garzón, Radim Blazek, Markus Neteler, Rut Sánchez de Dios, Helios Sainz Ollero and Cesare Furlanello


It is not necessary to read such cases from end to end to benefit from them. Glance through these to pick up what tips you may. Happy hunting!

Saturday, April 14, 2007

Is Data Mining still on the rise?

Another very interesting and thoughtful take on Predictive Analytics and Data Mining from Mark Madsen can be found here. I've never met him, but I think I'd like to, since he is a TDWI kind of guy and obviously well informed. I'll be in the same location this May, teaching a data mining course at the next TDWI conference in Boston on Thursday the 17th.

But back to the article... Mr. Madsen writes that Predictive Analytics was rated by the Executive Summit attendees as the number one item expected to have the most impact over the next several years.
Well, that's good news, and I think it makes sense because most companies I deal with are just starting to use predictive analytics. There will always be the powerhouse, large companies that have large data mining teams. They make for great case studies. But we'll know that data mining has "made it" when small companies can have one person working part time doing their analytics, and being effective with it. I know several companies like this already, but it takes some investment in training to get there.

Sunday, April 08, 2007

Future Data Mining Trends

In his latest post, Sandro has a nice summary of future data mining trends here. Like him, I don't do a lot of prognosticating, but I do have one idea that I still think will happen.

First, let me say that of the references provided by Sandro, the Tom Dietterich one is something I like very much, especially his treatment of model ensembles.

At the 1999 or 2000 KDD conference in San Diego, I think there was a roundtable discussion on the future of data mining, with particular emphasis on whether data mining would occur inside the database or external to it. The general consensus was that mining would move more inside the database, and I frankly agreed. This has not materialized nearly to the degree I expected, though it has progressed, especially in the past couple of years, with improvements to Oracle Data Miner and SQL Server 2005 Business Intelligence. (I'm not familiar with the current state of DB2 Data Warehouse Edition, and I don't think there has been much work done in recent years on the Teradata Warehouse Miner product, formerly TeraMiner.)

However, most folks I know who do data mining still pull data from a datamart or warehouse, build models in a standalone app, and then push models and/or scores back up to the warehouse. I think this work is going to move more and more into the warehouse, either through improved software in the warehouse (like what we're seeing with Oracle and Microsoft) or, perhaps more likely, through improved interfaces to warehouse functions in standalone data mining software. For example, Clementine from SPSS allows you to push database functions back to the database itself rather than operating on data that has been pulled from the warehouse. This speeds up basic data processing considerably, I've found. I think the latter is the more likely area of growth, both in data mining software and in how practitioners use it.
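To make the pull-versus-pushback distinction concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, the `transactions` table and its columns are made up for the example, and nothing here reflects how Clementine or any vendor product actually implements pushback. The first function pulls every row out and aggregates in the client; the second asks the database to do the aggregation and returns only the summary.

```python
import sqlite3
from collections import defaultdict


def make_warehouse():
    """Build a toy 'warehouse' (hypothetical table and data, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
    rows = [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.0), (3, 3.0), (3, 4.0)]
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    return conn


def totals_pulled(conn):
    """Pull approach: fetch every row, then aggregate in the standalone app."""
    totals = defaultdict(float)
    query = "SELECT customer_id, amount FROM transactions"
    for customer_id, amount in conn.execute(query):
        totals[customer_id] += amount
    return dict(totals)


def totals_pushed_back(conn):
    """Pushback approach: the database aggregates; only the summary crosses the wire."""
    query = "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"
    return dict(conn.execute(query).fetchall())


conn = make_warehouse()
pulled = totals_pulled(conn)
pushed = totals_pushed_back(conn)
```

Both functions give the same per-customer totals. The difference is where the work happens and how much data has to leave the database, which is where the speedup on basic data processing would come from on a real warehouse.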