Monday, March 19, 2007

Document, Document, Document!

I recently came across a cautionary list of "worst practices", penned by Dorian Pyle, titled This Way Failure Lies. No one likes filling out paperwork, but Dorian's rule 6 for disaster makes a good point:

Rule 6. Rely on memory. Most data mining projects are simple enough that you can hold most important details in your head. There's no need to waste time in documenting the steps you take. By far, the best approach is to keep pressing the investigation forward as fast as possible. Should it be necessary to duplicate the investigation or, in the unlikely event that it's necessary to justify the results at some future time, duplicating the original investigation and recreating the line of reasoning you used will be easy and straightforward.


As opposed to purely point-and-click tools, data mining tools which include "visual programming" interfaces (Insightful Miner, KNIME, Orange) or programming languages (Fortran, C++, MATLAB) allow a certain amount of self-documentation. Unless commenting is extremely thorough, though, it is probably worth producing at least a summary document which explains the purpose and basic structure of the models. As the analysis changes course, this document should be updated accordingly.

Tuesday, March 13, 2007

Missing Values and Special Values: The Plague of Data Analysis

Every so often, an article is published on data mining which includes a statistic like "Amount of data mining time spent preparing the data: 70%", or something similar, expressed as a pie chart. It is certainly worth the investment of time and effort at the beginning of a data mining project to get the data cleaned up, both to maximize model performance and to avoid problems later on.

Two related issues in data preparation are missing values and special values. Note that some "missing values" are truly "missing values" (items for which there is a true value which is not present in the data), while others are actually special values or undefined (or at least poorly defined) values. Much has already been written about truly missing values, especially in the statistical literature. See, for instance:

Dealing with Missing Data, by Judi Scheffer

Missing data, by Thomas Lumley

Working With Missing Values, by Alan C. Acock

How can I deal with missing data in my study?, by Derrick A. Bennett

Advanced Quantitative Research Methodology, G2001, Lecture Notes: Missing Data, by Gary King

Important topics to understand and keywords to search on, if one wishes to study missing data and its treatment, are: MAR ("missing at random"), MCAR ("missing completely at random"), NMAR ("not missing at random"), non-response, and imputation (single and multiple).
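As a toy illustration of the simplest of these ideas, here is a minimal sketch of single imputation by column means, written in Python with NumPy. The array X is hypothetical, and the references above cover more defensible approaches (notably multiple imputation) for data that are not missing completely at random.

    import numpy as np

    def mean_impute(X):
        """Single imputation sketch: fill each NaN with its column's mean."""
        X = np.asarray(X, dtype=float).copy()
        col_means = np.nanmean(X, axis=0)      # per-column means, ignoring NaNs
        rows, cols = np.where(np.isnan(X))     # locations of the missing entries
        X[rows, cols] = col_means[cols]        # fill each hole with its column mean
        return X

    # Hypothetical example: NaN marks truly missing values.
    X = [[1.0, 2.0], [float('nan'), 4.0], [5.0, float('nan')]]
    print(mean_impute(X))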

Special values, which are not quite the same as missing values, also require careful treatment. An example I encountered recently in my work with bank account data was a collection of variables which were defined over lagged time windows, such as "maximum balance over the last 6 months" or "worst delinquency in the last 12 months".

The first issue was that the special values were not database nulls ("missing values"), but were recorded as flag values, such as -999.

The second issue was that the flag values, while consistent within individual variables, varied across this set of variables. Some variables used -999 as the flag value, others used -999.99. Still others used -99999.

The first and second issues, taken together, meant that detecting the special values was ultimately a tedious process. Even though this was eventually semi-automated, the results needed to be checked carefully by the analyst.
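To give a flavor of what such semi-automation might look like, here is a hypothetical sketch in Python with NumPy; the list of candidate flag codes and the frequency threshold are assumptions, not the procedure actually used. The idea is to scan each variable for values that are both extreme and suspiciously common:

    import numpy as np

    SUSPECT_FLAGS = [-999.0, -999.99, -99999.0]   # assumed candidate flag codes

    def find_flag_candidates(X, names, min_share=0.01):
        """Report (variable, flag, share) where a suspect code is the column minimum
        and appears in at least min_share of the rows."""
        candidates = []
        for j, name in enumerate(names):
            col = X[:, j]
            for flag in SUSPECT_FLAGS:
                share = float(np.mean(col == flag))
                if share >= min_share and flag == col.min():
                    candidates.append((name, flag, share))
        return candidates

Any variables reported this way would still need to be reviewed by the analyst, as described above.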

The third issue was the phenomenon driving the creation of special values in the first place: many accounts had not been on the system long enough to have complete lagged windows. For instance, an account which is only 4 months old has not been around long enough to accumulate 12 months worth of delinquency data. In this particular system, such accounts received the flag value. Such cases are not quite the same as data which has an actual value which is simply unrecorded, and methods for "filling-in" such holes probably would provide spurious results.

A similar issue surrounds a collection of variables that rely on some benchmark event which may or may not have happened, such as "days since purchase" or "months since delinquency". Some accounts had never purchased anything, and others had never been delinquent. One supposes that, theoretically, such situations should have infinity recorded. In the actual data, though, they had flag values, like -999.

Simply leaving the flag values makes no sense. There are a variety of ways of dealing with such circumstances, and solutions need to be carefully chosen given the context of the problem. One possibility is to convert the original variable to one which represents, for instance, the probability of the target class (in a classification problem). A simple binning or curve-fitting procedure would act as a single-variable model of the target, and the special value would be assigned whatever probability was observed in the training data for those cases.
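A minimal sketch of that binning idea, in Python with NumPy, might look like the following. The variable names, the -999 flag code, and the number of bins are assumptions for illustration, and a real implementation would want smoothing for sparsely populated bins:

    import numpy as np

    def bin_to_target_rate(x, y, flag=-999.0, n_bins=10):
        """Replace x with the training-set rate of the target class in each bin;
        flagged ("special value") cases form their own bin."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)                 # 0/1 target
        special = (x == flag)
        edges = np.percentile(x[~special], np.linspace(0, 100, n_bins + 1))
        bins = np.digitize(x, edges[1:-1])             # bins 0 .. n_bins-1 for ordinary values
        out = np.empty_like(x)
        for b in range(n_bins):
            mask = (bins == b) & ~special
            out[mask] = y[mask].mean() if mask.any() else y[~special].mean()
        out[special] = y[special].mean() if special.any() else y.mean()
        return out

The flagged cases then carry whatever target rate was observed for them in the training data, which is exactly the single-variable model described above.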

Many important, real circumstances will give rise to these special values. Be vigilant, and treat them with care to extract the most information from them and avoid data mining pitfalls.

Monday, March 12, 2007

Oh the ways data visualization enlightens!

I came across a blog a while ago by Matthew Hurst called Data Mining: Text Mining, Visualization and Social Media, but revisited it today because of a recent post on data visualization blogs. The blogs he lists are interesting, and there is another one on his sidebar called Statistical Graphics (check out the animated 3-D graphic!).

It is just a reminder of how difficult a truly good visualization of data is to create. Mr. Hurst shows an example from the National Safety Council that is a truly opaque graphic, and a great example of data that is crying out for a TABLE rather than a graph. (See here for the article.) I have to admit, though, it looks pretty cool. Here's the graphic. Can you easily summarize the key content?




But just because a graphic is complex doesn't make it bad. (I cite as an example the graphic I posted here.)

Model Selection Poll Closed

The poll is closed, with votes as follows:

R^2 or MSE: 21%
Lift or Gains: 46%
True Alert vs. False Alert Tradeoff: 12%
PCC: 12%
Other: 8%

Broken down in another way:

Global error (R^2 or PCC): 33%
Ranked error (Lift, ROC): 58%

where ranked error means that one first sorts the records by their model scores and then evaluates the model on that ordering. The relative proportions of these two roughly correspond to what I see and use in consulting: probably about 75% of the time I use something like ROC or Lift to evaluate models.
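To show what a ranked measure involves in practice, here is a small Python/NumPy sketch of lift computed over the top-scoring fraction of records; the 10% cutoff is an arbitrary choice for illustration, not a recommendation:

    import numpy as np

    def top_fraction_lift(scores, y, fraction=0.10):
        """Lift: target rate among the top-scoring fraction of records,
        divided by the overall target rate."""
        scores = np.asarray(scores, dtype=float)
        y = np.asarray(y, dtype=float)
        order = np.argsort(scores)[::-1]               # highest scores first
        n_top = max(1, int(round(fraction * len(scores))))
        return y[order][:n_top].mean() / y.mean()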

Thanks to those who voted.