Tuesday, December 07, 2004

Find Correlated Variables Prior to Modeling

Many data sets contain highly correlated variables that measure the same kind of information in different ways. The same problem often arises when in-house data is appended with third-party data (census data, for example). Some algorithms build unstable models when two or more highly correlated variables are included, and others just slow down. Either way, it is a good idea to remove highly (linearly) correlated variables. But how do you identify and remove them?

Frequently, data mining software packages allow you to measure correlation between variables, but they don’t typically allow you to select a variable subset based on some correlation threshold. A trick for relatively small data sets that fit into Excel is the following. Export a sample of the real-valued columns as tab- or comma-delimited text, and load it into Excel. Use the correlation option in Excel's data analysis tools to create the correlation matrix. Then use Excel's conditional formatting to highlight the cells where high correlations occur in one color (green), medium correlations in a second color (orange), and low correlations in a third color (blue). Typically I use logic like “if the cell value is not between –0.9 and 0.9, color the cell green.”
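For readers who would rather script this step than use Excel, here is a minimal sketch of the same idea in plain Python: compute the Pearson correlation matrix, then bucket each cell into the three colors. The function names are my own, and the 0.5 cutoff for "medium" is an assumption, since only the 0.9 cutoff for "high" is stated above. It also assumes no column is constant (a constant column has zero variance and would cause a division by zero).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Assumes neither column is constant (var_x and var_y are nonzero).
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns maps a variable name to its list of values.
    Returns a nested dict so that result[a][b] is the correlation of a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def color_for(r, high=0.9, medium=0.5):
    """Mimic the conditional formatting rule.
    high=0.9 follows the text; medium=0.5 is an assumed cutoff."""
    if abs(r) > high:
        return "green"   # high correlation, positive or negative
    if abs(r) > medium:
        return "orange"  # medium correlation
    return "blue"        # low correlation
```

If pandas is available, `DataFrame.corr()` produces the same Pearson matrix in one call; the hand-written version above just keeps the sketch dependency-free.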

Once the cells are color coded, one typically sees blocks of variables that are highly correlated with one another. The threshold depends on the application, but I typically use +/- 0.9. Only one variable from each block is needed to represent that information in the model; remove the others from the list of candidate inputs. This process can remove half or more of the variables from consideration without losing the ability to build reliable models. Additionally, the visual correlation matrix provides insights into variable groupings not readily available without doing some kind of factor analysis or principal component analysis.
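The "keep one variable per block" step is done by eye in Excel, but it can be approximated in code. The sketch below uses a simple greedy pass, which is my own substitute for the visual inspection, not a standard named procedure: walk the variables in order and keep one only if it is not highly correlated with anything already kept. The function name and the tiny matrix are hypothetical, with made-up values purely for illustration.

```python
def select_uncorrelated(corr, threshold=0.9):
    """Greedy pass over a nested-dict correlation matrix corr[a][b].
    A variable is kept only if |r| <= threshold against every kept variable."""
    kept, dropped = [], []
    for name in corr:
        if any(abs(corr[name][k]) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

# Hypothetical 3-variable matrix (illustrative values, not real data).
corr = {
    "x1": {"x1": 1.00, "x2": 0.95, "x3": 0.10},
    "x2": {"x1": 0.95, "x2": 1.00, "x3": 0.20},
    "x3": {"x1": 0.10, "x2": 0.20, "x3": 1.00},
}
kept, dropped = select_uncorrelated(corr)
print(kept)     # ['x1', 'x3']
print(dropped)  # ['x2']
```

The result depends on variable order, so it may be worth listing the variables you most want to keep first (for example, the ones that are cheapest to collect or easiest to explain).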

A sample correlation matrix is shown below.

Bands 3, 4, and 5 are correlated with each other above the 0.9 level.
Bands 8, 9, and 10 are correlated with each other above the 0.9 level.
Bands 11 and 12 are correlated with each other above the 0.9 level.

Therefore, one may want to remove bands 4, 5, 9, 10, and 12 from the candidate input list.
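The same grouping can be sketched in code. Assuming each "correlated with each other" statement above means every pair within the group exceeds 0.9, the snippet below merges those pairs into groups (connected components via a small union-find) and keeps the lowest-numbered band in each group. The function name is hypothetical. One caveat: chaining can pull A and C into the same group when A and B are correlated and B and C are correlated, even if A and C are not, so the groups are worth a quick check.

```python
def correlated_groups(pairs):
    """Merge highly correlated pairs into groups (connected components)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Point the larger root at the smaller one so the
            # lowest-numbered member ends up as the group's root.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for v in parent:
        groups.setdefault(find(v), []).append(v)
    return [sorted(g) for g in groups.values()]

# Band pairs above the 0.9 level, taken from the three statements above.
pairs = [(3, 4), (3, 5), (4, 5), (8, 9), (8, 10), (9, 10), (11, 12)]
groups = correlated_groups(pairs)
to_remove = [band for g in groups for band in g[1:]]
print(groups)     # [[3, 4, 5], [8, 9, 10], [11, 12]]
print(to_remove)  # [4, 5, 9, 10, 12]
```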