Hi Super John,

First of all, say hello to that wonderful person in the pic with you.

A semi-facetious way of rephrasing the question is: When is “Not a Cat” not just “Not a Cat”? When it is also “Not a Dog”!

I think it has to do with all the underlying assumptions about the data and the error distribution that make the kind of regression we are doing valid.

Consider a generalized linear model in which we focus on just two features, x1 and x2. If we were able to calculate not just the coefficients m1 and m2 but also their standard errors (as we can in the case of Linear Regression, using the covariance matrix), then we could simply use the t-scores or p-values of the coefficients to do “feature selection”. In the absence of the ability or willingness to understand coefficient space, we use “size (of the coefficients) matters” as the criterion for selecting features. In the above example, we might find that m2 is smaller than the cutoff and hence keep only x1 in the feature set. However, if x1 and x2 are covariant, then they are not the independent features we should be looking at to begin with. In general, the right features will be some (linear) combinations

z1 = cos(theta) * x1 + sin(theta) * x2

and

z2 = -sin(theta) * x1 + cos(theta) * x2

It is the coefficients of z1 and z2 that we should be considering. If we do, it will be one of z1 or z2 that we retain as a feature, and that feature will be a combination of x1 and x2.
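To make the rotation concrete, here is a minimal sketch (my own illustration, not anything from the original article) of how one could find the theta for which z1 and z2 have zero sample covariance in the two-feature case. The helper names `covariance`, `decorrelating_angle`, and `rotate`, and the synthetic data, are assumptions made up for this example.

```python
import math
import random

def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)

def decorrelating_angle(x1, x2):
    """Angle theta satisfying tan(2*theta) = 2*cov(x1, x2) / (var(x1) - var(x2))."""
    return 0.5 * math.atan2(2.0 * covariance(x1, x2),
                            covariance(x1, x1) - covariance(x2, x2))

def rotate(x1, x2, theta):
    """Return z1 = cos*x1 + sin*x2 and z2 = -sin*x1 + cos*x2."""
    c, s = math.cos(theta), math.sin(theta)
    z1 = [c * a + s * b for a, b in zip(x1, x2)]
    z2 = [-s * a + c * b for a, b in zip(x1, x2)]
    return z1, z2

# Synthetic, deliberately correlated features (illustrative data only).
random.seed(0)
x1 = [random.gauss(0.0, 1.0) for _ in range(500)]
x2 = [0.8 * a + 0.3 * random.gauss(0.0, 1.0) for a in x1]

theta = decorrelating_angle(x1, x2)
z1, z2 = rotate(x1, x2, theta)

print("cov(x1, x2) =", covariance(x1, x2))  # clearly non-zero
print("cov(z1, z2) =", covariance(z1, z2))  # zero up to rounding error
```

With more than two features the same idea would presumably be carried out with an eigendecomposition of the full covariance matrix rather than a single angle.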

Independent of the specific question of OneHot vs. WarmFuzzy or other ways to encode categorical data, in complete generality it is a requirement that *before* we start regression or feature selection, we ensure that the features are non-covariant, that is, that their covariance matrix is diagonal. Again, to emphasize, this is simply a mathematical requirement on the data, just as it is up to us to ensure that the residuals are normally distributed or that the distribution of residuals doesn’t depend on any of the features.

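Then it is just a question of whether the OneHot encoded features satisfy the diagonal covariance requirement: they can’t, because the indicator columns of a single categorical variable are mutually exclusive and therefore negatively correlated. So the orthogonal combinations have to be found, and these are precisely the WarmFuzzy features.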
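As a quick check of the claim about OneHot columns, here is a small sketch on a toy label list I made up. The names `one_hot` and `population_covariance` are hypothetical helpers for this example, and the code says nothing about how the WarmFuzzy features themselves are built.

```python
def one_hot(labels, categories):
    """Map each category to its 0/1 indicator column."""
    return {c: [1 if lab == c else 0 for lab in labels] for c in categories}

def population_covariance(a, b):
    """Covariance with a 1/n normalisation, so it equals E[ab] - E[a]E[b]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / n

# Toy data (illustrative only): 3 cats, 3 dogs, 2 birds.
labels = ["cat", "dog", "bird", "cat", "dog", "cat", "bird", "dog"]
cols = one_hot(labels, ["cat", "dog", "bird"])

# Two indicators of the same variable are never 1 together, so E[ab] = 0
# and the covariance is -p_cat * p_dog, which cannot vanish when both
# categories actually occur.
cov_cat_dog = population_covariance(cols["cat"], cols["dog"])
p_cat = sum(cols["cat"]) / len(labels)
p_dog = sum(cols["dog"]) / len(labels)

print("cov(cat, dog) =", cov_cat_dog)
print("-p_cat * p_dog =", -p_cat * p_dog)
```

The off-diagonal entry is fixed by the category frequencies alone, which is why no amount of care in building the OneHot columns removes it.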

One could ask if there is something about the coefficients of the x1, x2 … themselves that can be used to come up with a selection criterion. First of all, there isn’t any: my playing with this question showed that as long as there is one off-diagonal covariance cov(xI, xJ) which is comparable to either var(xI) or var(xJ), there is no getting around the diagonalization problem. Second, and more importantly, any such criterion could only be applied to selecting a subset of the {xI}, but the above discussion shows that the correct set of features should be a (proper) subset of the linear combinations {zI}.

Hope that helps.