Friday, April 15, 2005

Beware of Automatic Handling of Categorical Variables

Categorical variables are often difficult to use because many data mining algorithms require that input (independent) variables be continuous. Fortunately, many data mining software tools handle this problem for you by converting a single categorical variable with "N" values into "N" new dummy variables, one for each value. For example, if you have a field "State" with 50 text labels, the tools will automatically create 50 new variables with values 0 or 1. If a record has the value "MA" in the variable State, the new dummy column representing "MA" will have the value "1", and all 49 other state dummy columns will have the value "0". Because of this, analysts don't have to convert all their text and categorical variables to numeric variables prior to modeling.
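To make the conversion concrete, here is a minimal sketch of dummy-variable expansion in plain Python. The function name one_hot and the sample labels are made up for illustration; this is not how any particular data mining tool implements it, only the idea described above.

```python
def one_hot(values):
    """Expand a list of category labels into 0/1 dummy columns.

    Returns the sorted list of distinct labels (one per dummy column)
    and one row of 0/1 flags per input record.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical sample of a "State" field with three distinct labels.
states = ["MA", "CA", "MA", "NY"]
columns, dummies = one_hot(states)

for state, row in zip(states, dummies):
    print(state, dict(zip(columns, row)))
```

Each record ends up with exactly one "1" among its dummy columns, so a field with 50 labels becomes 50 columns even though only one input variable was selected.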

However, the automatic handling of categorical variables can cause problems that are hidden from you. Instead of having one input variable in your model (as it appears when you select input variables), you could have hundreds! This can affect decision trees (which are biased toward variables with more categories) and neural network sensitivities (which are often biased toward categorical variables with large numbers of categories). In other words, there is a hidden bias toward variables with larger numbers of categories, and it can skew your interpretation of the models.
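One way to see the hidden expansion is to count how many dummy columns each selected field would produce. The sketch below assumes records are stored as dictionaries; the field names and values are invented for illustration.

```python
def dummy_column_counts(records, fields):
    """Return {field: number of distinct values} for the given fields.

    Under one-dummy-per-value encoding, the number of distinct values
    is the number of model inputs the field really contributes.
    """
    return {field: len({record[field] for record in records})
            for field in fields}


# Hypothetical records with two categorical fields.
records = [
    {"State": "MA", "Channel": "web"},
    {"State": "CA", "Channel": "web"},
    {"State": "NY", "Channel": "phone"},
    {"State": "TX", "Channel": "web"},
]

counts = dummy_column_counts(records, ["State", "Channel"])
total_inputs = sum(counts.values())
print(counts, "selected:", len(counts), "actual inputs:", total_inputs)
```

Here two selected variables turn into six model inputs; with real data the gap between the two numbers can be far larger.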

What should one do? First, be aware of these variables. During the data understanding stage of your data mining project, identify variables with large numbers of categories. This will at least alert you to the possibility of bias in your models or sensitivities. Second, if there are more than a dozen or two categories, consider binning those variables by combining the values with smaller counts into larger groups, or dropping them altogether. More on identifying the significance of categorical variable values in an upcoming Abbott Insights.
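As a rough sketch of the binning suggestion, the Python below merges every category that appears fewer than min_count times into a single catch-all group. The threshold, the "OTHER" label, and the function name are assumptions chosen for illustration; the right cutoff depends on your data, and the label should not collide with a real category.

```python
from collections import Counter


def bin_rare_categories(values, min_count, other_label="OTHER"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other_label
            for value in values]


# Hypothetical "State" field: two common labels and two rare ones.
states = ["MA"] * 5 + ["CA"] * 4 + ["VT", "WY"]
binned = bin_rare_categories(states, min_count=3)

print(Counter(states))   # four distinct categories before binning
print(Counter(binned))   # three after: MA, CA, and OTHER
```

Running the same count before and after binning is a quick check on how many dummy columns the variable will now produce.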