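‘vtreat’ is a package that prepares arbitrary data frames into clean data frames that are ready for analysis (usually supervised learning). A clean data frame has only numeric columns (other than the outcome) and has no Infinite/NA/NaN values in the effective variable columns.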
To effect this encoding ‘vtreat’ replaces original variables or columns with new derived variables. In this note we will use variables and columns as interchangeable concepts. This note describes the current family of ‘vtreat’ derived variable types.
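‘vtreat’ usage splits into three main cases: a categorical target (designTreatmentsC), a numeric target (designTreatmentsN), and no target (designTreatmentsZ).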
In all cases vtreat variable names are built by appending a notation
onto the original user supplied column name. In all cases the easiest
way to examine the derived variables is to look at the
scoreFrame component of the returned treatment plan.
We will outline each of these situations below:
An example categorical variable treatment is demonstrated below:
library(vtreat)
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE),
   stringsAsFactors = FALSE)
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
## [1] "vtreat 1.6.5 inspecting inputs Wed Jun 12 08:51:34 2024"
## [1] "designing treatments Wed Jun 12 08:51:34 2024"
## [1] " have initial level statistics Wed Jun 12 08:51:34 2024"
## [1] " scoring treatments Wed Jun 12 08:51:34 2024"
## [1] "have treatment plan Wed Jun 12 08:51:34 2024"
## [1] "rescoring complex variables Wed Jun 12 08:51:34 2024"
## [1] "done rescoring complex variables Wed Jun 12 08:51:34 2024"
scoreColsToPrint <- c('origName','varName','code','rsq','sig','extraModelDegrees')
print(treatmentsC$scoreFrame[,scoreColsToPrint])
##   origName   varName  code        rsq       sig extraModelDegrees
## 1        x    x_catP  catP 0.11457614 0.3289524                 2
## 2        x    x_catB  catB 0.12081050 0.3161341                 2
## 3        z         z clean 0.25792985 0.1429977                 0
## 4        z   z_isBAD isBAD 0.19087450 0.2076623                 0
## 5        x  x_lev_NA   lev 0.19087450 0.2076623                 0
## 6        x x_lev_x_a   lev 0.08170417 0.4097258                 0
## 7        x x_lev_x_b   lev 0.00000000 1.0000000                 0
For each user supplied variable or column (in this case
x and z) ‘vtreat’ proposes derived or treated
variables. The mapping from original variable name to derived variable
name is given by comparing the columns origName and
varName. One can map facts about the new variables back to
the original variables as follows:
# Build a map from vtreat names back to reasonable display names
vmap <- as.list(treatmentsC$scoreFrame$origName)
names(vmap) <- treatmentsC$scoreFrame$varName
print(vmap['x_catB'])
## $x_catB
## [1] "x"
# Map significances back to original variables
aggregate(sig~origName,data=treatmentsC$scoreFrame,FUN=min)
##   origName       sig
## 1        x 0.2076623
## 2        z 0.1429977
In the scoreFrame the sig column is the
significance of the single variable logistic regression using the named
variable (plus a constant term), and the rsq column is the
“pseudo-r-squared” or portion of deviance explained (please see here
for some notes).
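As a rough illustration of what these two columns mean, the following base-R sketch fits a single variable logistic regression by hand for the x_lev_x_a indicator and derives a pseudo-r-squared and a significance from it. It assumes the significance is a likelihood-ratio chi-squared test on the drop in deviance; that is our assumption for the sketch, not a statement about vtreat's internal code, and the names d, fit, pseudoRsq, and sig are illustrative only.

```r
# Indicator column and outcome copied from the example data above.
d <- data.frame(
  x_lev_x_a = c(1, 1, 1, 0, 0, 0),
  y = as.numeric(c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE))
)

# Single variable logistic regression (the constant term is implicit).
fit <- glm(y ~ x_lev_x_a, data = d, family = binomial(link = 'logit'))

# Pseudo-r-squared: portion of the null deviance explained by the model.
pseudoRsq <- 1 - fit$deviance / fit$null.deviance

# Assumed significance: chi-squared test on the deviance drop.
devDrop <- fit$null.deviance - fit$deviance
dfDrop <- fit$df.null - fit$df.residual
sig <- pchisq(devDrop, df = dfDrop, lower.tail = FALSE)

print(c(pseudoRsq = pseudoRsq, sig = sig))
```

Under these assumptions the two values should land close to the rsq and sig reported for x_lev_x_a in the scoreFrame above.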
Essentially a derived variable name is built by concatenating an
original variable name and a treatment type (also recorded in the
code column for convenience). The codes give the different
‘vtreat’ variable types (or really meanings, as all derived variables
are numeric).
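To make the "all derived variables are numeric" point concrete, here is a small base-R sketch of what clean, isBAD, and lev style columns could look like for the example data. It is a hand-built approximation for illustration (the data frame name byHand and the mean-replacement rule are our assumptions), not vtreat's implementation.

```r
# Example inputs, as in dTrainC above.
x <- c('a', 'a', 'a', 'b', 'b', NA)
z <- c(1, 2, 3, 4, NA, 6)

# Flag values that cannot be used directly.
zBad <- is.na(z) | is.nan(z) | is.infinite(z)

byHand <- data.frame(
  # "clean" style column: bad values replaced by the mean of the good ones
  z = ifelse(zBad, mean(z[!zBad]), z),
  # "isBAD" style column: 1 where the original value was unusable
  z_isBAD = as.numeric(zBad),
  # "lev" style columns: one 0/1 indicator per level, NA treated as a level
  x_lev_NA = as.numeric(is.na(x)),
  x_lev_x_a = as.numeric(!is.na(x) & x == 'a'),
  x_lev_x_b = as.numeric(!is.na(x) & x == 'b')
)

print(byHand)
# Every column is numeric, whatever the type of the original variable.
print(vapply(byHand, is.numeric, logical(1)))
```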
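For categorical targets the possible variable types are as follows:
clean: a numeric variable passed through with NA/NaN/infinite values replaced by a stand-in value (in the examples in this note, the mean of the remaining values).
isBAD: an indicator that is 1 when the original variable was NA/NaN/infinite.
lev: an indicator for a single level of a categorical variable. For example,
x_lev_x_a is 1 when the original x variable
had a value of “a”. These indicators are essentially variables
representing explicit encoding of levels as dummy variables. In some
cases a special level code is used to represent pooled rare values.
catP: a single variable recording how often each level was seen in the training data (its prevalence).
catB: a single variable model of the outcome given the level of the original variable, in logit units. For example,
x_catB = logit(P[y==target|x]) - logit(P[y==target]). This
encoding is especially useful for categorical variables that have a
large number of levels, but be aware it can obscure degrees of freedom
if not used properly.
An example numeric variable treatment is demonstrated below: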
library(vtreat)
dTrainN <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),y=as.numeric(c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)),
   stringsAsFactors = FALSE)
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
## [1] "vtreat 1.6.5 inspecting inputs Wed Jun 12 08:51:34 2024"
## [1] "designing treatments Wed Jun 12 08:51:34 2024"
## [1] " have initial level statistics Wed Jun 12 08:51:34 2024"
## [1] " scoring treatments Wed Jun 12 08:51:34 2024"
## [1] "have treatment plan Wed Jun 12 08:51:34 2024"
## [1] "rescoring complex variables Wed Jun 12 08:51:34 2024"
## [1] "done rescoring complex variables Wed Jun 12 08:51:34 2024"
print(treatmentsN$scoreFrame[,scoreColsToPrint])
##   origName   varName  code          rsq       sig extraModelDegrees
## 1        x    x_catP  catP 1.538462e-01 0.4418233                 2
## 2        x    x_catN  catN 1.131222e-01 0.5145190                 2
## 3        x    x_catD  catD 1.111111e-01 0.5185185                 2
## 4        z         z clean 3.045045e-01 0.2562868                 0
## 5        z   z_isBAD isBAD 2.000000e-01 0.3739010                 0
## 6        x  x_lev_NA   lev 2.000000e-01 0.3739010                 0
## 7        x x_lev_x_a   lev 1.111111e-01 0.5185185                 0
## 8        x x_lev_x_b   lev 1.110223e-16 1.0000000                 0
The treatment of numeric targets is similar to that of categorical targets. In the numeric case the possible derived variable types are:
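clean: a numeric variable passed through with NA/NaN/infinite values replaced by a stand-in value (in the examples in this note, the mean of the remaining values).
isBAD: an indicator that is 1 when the original variable was NA/NaN/infinite.
lev: an indicator for a single level of a categorical variable. For example,
x_lev_x_a is 1 when the original x variable
had a value of “a”. These indicators are essentially variables
representing explicit encoding of levels as dummy variables. In some
cases a special level code is used to represent pooled rare values.
catP: a single variable recording how often each level was seen in the training data (its prevalence).
catN: a single variable model of the outcome given the level of the original variable. For example,
x_catN = E[y|x] - E[y]. This encoding is especially useful
for categorical variables that have a large number of levels, but be
aware it can obscure degrees of freedom if not used properly.
catD: a single variable recording the deviation of y conditioned on the observed level of the original variable.
Note: for categorical targets we don’t need catD
variables as this information is already encoded in catB
variables.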
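The x_catN formula can be checked by hand. The sketch below computes an unsmoothed E[y|x] - E[y] in base R for the example data, treating NA as its own level. It is an illustration under those assumptions (the names xLevel, condMean, and x_catN_byHand are ours); vtreat's own estimate may differ, for instance when smoothing or cross-validated variants are in play.

```r
# Example inputs, as in dTrainN above.
x <- c('a', 'a', 'a', 'b', 'b', NA)
y <- as.numeric(c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE))

# Treat a missing value as one more level of x.
xLevel <- ifelse(is.na(x), 'missing', x)

# E[y]: the grand mean of the outcome.
grandMean <- mean(y)

# E[y|x]: the mean of the outcome within each level.
condMean <- tapply(y, xLevel, mean)

# Impact code: conditional mean minus grand mean, looked up per row.
x_catN_byHand <- as.numeric(condMean[xLevel] - grandMean)
print(x_catN_byHand)
```

For this tiny data set the result lines up with the x_catN column in the treated frame printed later in this note, but that agreement should not be expected in general.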
In the scoreFrame the sig column is the
significance of the single variable linear regression using the named
variable (plus a constant term), and the rsq column is the
“r-squared” or portion of variance explained (please see here
for some notes).
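The numeric-target version of these two columns can be sketched the same way with a single variable linear regression. The example below uses the z_isBAD column and assumes the significance is the p-value of the regression's overall F-test; that is our reading for illustration, not a description of vtreat's internals, and the names d, fit, rsq, and sig are hypothetical.

```r
# z_isBAD indicator and numeric outcome from the example data above.
d <- data.frame(
  z_isBAD = c(0, 0, 0, 0, 1, 0),
  y = c(0, 0, 1, 0, 1, 1)
)

# Single variable linear regression (the constant term is implicit).
fit <- summary(lm(y ~ z_isBAD, data = d))

# r-squared: portion of the outcome variance explained.
rsq <- fit$r.squared

# Assumed significance: p-value of the overall F-test.
f <- fit$fstatistic
sig <- pf(f[['value']], f[['numdf']], f[['dendf']], lower.tail = FALSE)

print(c(rsq = rsq, sig = sig))
```

With these assumptions the values should be close to the rsq and sig listed for z_isBAD in the numeric-target scoreFrame above.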
An example “no target” variable treatment is demonstrated below:
library(vtreat)
dTrainZ <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),
   stringsAsFactors = FALSE)
treatmentsZ <- designTreatmentsZ(dTrainZ,colnames(dTrainZ))
## [1] "vtreat 1.6.5 inspecting inputs Wed Jun 12 08:51:34 2024"
## [1] "designing treatments Wed Jun 12 08:51:34 2024"
## [1] " have initial level statistics Wed Jun 12 08:51:34 2024"
## [1] " scoring treatments Wed Jun 12 08:51:34 2024"
## [1] "have treatment plan Wed Jun 12 08:51:34 2024"
print(treatmentsZ$scoreFrame[,c('origName','varName','code','extraModelDegrees')])
##   origName   varName  code extraModelDegrees
## 1        x    x_catP  catP                 2
## 2        z         z clean                 0
## 3        z   z_isBAD isBAD                 0
## 4        x  x_lev_NA   lev                 0
## 5        x x_lev_x_a   lev                 0
## 6        x x_lev_x_b   lev                 0
Note: because there is no user supplied target the
scoreFrame significance columns are not meaningful, and are
populated only for regularity of code interface. Also indicator
variables are only formed by designTreatmentsZ for
vtreat 0.5.28 or newer. Beyond that the no-target
treatments are similar to the earlier treatments. Possible derived
variable types in this case are:
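clean: a numeric variable passed through with NA/NaN/infinite values replaced by a stand-in value (in the examples in this note, the mean of the remaining values).
isBAD: an indicator that is 1 when the original variable was NA/NaN/infinite.
lev: an indicator for a single level of a categorical variable. For example,
x_lev_x_a is 1 when the original x variable
had a value of “a”. These indicators are essentially variables
representing explicit encoding of levels as dummy variables. In some
cases a special level code is used to represent pooled rare values.
catP: a single variable recording how often each level was seen in the training data (its prevalence).
Both designTreatmentsX and prepare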
functions take an argument called codeRestriction that
restricts the type of variable that is created. For example, you may not
want to create catP and catD variables for a
regression problem.
dTrainN <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),y=as.numeric(c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)),
   stringsAsFactors = FALSE)
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y',
                                 codeRestriction = c('lev', 
                                                      'catN',
                                                      'clean',
                                                      'isBAD'),
                                 verbose=FALSE)
# no catP or catD variables
print(treatmentsN$scoreFrame[,scoreColsToPrint])
##   origName   varName  code          rsq       sig extraModelDegrees
## 1        x    x_catN  catN 1.131222e-01 0.5145190                 2
## 2        z         z clean 3.045045e-01 0.2562868                 0
## 3        z   z_isBAD isBAD 2.000000e-01 0.3739010                 0
## 4        x  x_lev_NA   lev 2.000000e-01 0.3739010                 0
## 5        x x_lev_x_a   lev 1.111111e-01 0.5185185                 0
## 6        x x_lev_x_b   lev 1.110223e-16 1.0000000                 0
Conversely, even if you have created a treatment plan for a
particular type of variable, you may subsequently decide not to use it.
For example, perhaps you only want to use indicator variables and not
the catN variable for modeling. You can use
codeRestriction in prepare().
dTreated <- prepare(treatmentsN, dTrainN,
                    codeRestriction = c('lev', 'clean', 'isBAD'))
## Warning in prepare.treatmentplan(treatmentsN, dTrainN, codeRestriction =
## c("lev", : possibly called prepare() on same data frame as
## designTreatments*()/mkCrossFrame*Experiment(), this can lead to over-fit.  To
## avoid this, please use mkCrossFrame*Experiment$crossFrame.
print(dTreated)
##     z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b y
## 1 1.0       0        0         1         0 0
## 2 2.0       0        0         1         0 0
## 3 3.0       0        0         1         0 1
## 4 4.0       0        0         0         1 0
## 5 3.2       1        0         0         1 1
## 6 6.0       0        1         0         0 1
varRestriction works similarly, only you must list the
explicit variables to use. See the example below.
Variables that “do not move” (don’t take on at least two values
during treatment design) or don’t achieve at least a minimal
significance are suppressed. The catB/catN
variables are essentially single variable models and are very useful for
re-encoding categorical variables that take on a very large number of
values (such as zip-codes).
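The intended use of ‘vtreat’ is to design a treatment plan with one of the designTreatments*() functions, choose derived variables from the resulting scoreFrame, and then apply the plan to data with prepare().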
‘vtreat’ attempts to compute “out of sample” significances for each
variable effect (the sig column in
scoreFrame) through cross-validation techniques.
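To give a feel for why cross-validation matters here, the sketch below builds an out-of-fold impact code on simulated data: each row's code is computed only from the other folds, and the significance is then estimated on those held-out codes. This is a generic illustration of the idea under our own assumptions (simulated data, five folds, an F-test on a linear fit); it is not vtreat's procedure, and all names in it are hypothetical.

```r
set.seed(25)
n <- 200

# Simulated data: level 'a' of x shifts the outcome by 1.
d <- data.frame(x = sample(letters[1:5], n, replace = TRUE),
                stringsAsFactors = FALSE)
d$y <- ifelse(d$x == 'a', 1, 0) + rnorm(n)

# Assign each row to one of five folds.
fold <- sample(rep(1:5, length.out = n))

oofCode <- numeric(n)
for (k in 1:5) {
  held <- fold == k
  train <- d[!held, ]
  # Impact code estimated without looking at the held-out rows.
  condMean <- tapply(train$y, train$x, mean)
  code <- condMean[d$x[held]] - mean(train$y)
  # A level unseen in the training folds gets no adjustment.
  code[is.na(code)] <- 0
  oofCode[held] <- as.numeric(code)
}

# Significance of the held-out code as a single variable model.
d$oofCode <- oofCode
f <- summary(lm(y ~ oofCode, data = d))$fstatistic
oofSig <- pf(f[['value']], f[['numdf']], f[['dendf']], lower.tail = FALSE)
print(oofSig)
```

Scoring the code on rows that were not used to build it is what keeps a many-level variable from looking better than it is.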
‘vtreat’ is primarily intended to be “y-aware” processing. Of
particular interest is using vtreat::prepare() with
scale=TRUE which tries to put most columns in ‘y-effect’
units. This can be an important pre-processing step before attempting
dimension reduction (such as principal components methods).
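One way to read ‘y-effect’ units is that a column is rescaled so that a unit change in the column corresponds to a unit change in the outcome under a single variable linear fit. The base-R sketch below implements that reading for the cleaned z column. It is our interpretation for illustration only (the helper yAwareScale is hypothetical and is not a vtreat function); the scaling vtreat applies may differ in detail.

```r
# Hypothetical helper: rescale x into the units of y using the slope of a
# single variable linear regression, then center the result at zero.
yAwareScale <- function(x, y) {
  slope <- coef(lm(y ~ x))[['x']]
  slope * (x - mean(x))
}

# Cleaned z column (NA already replaced by the mean) and numeric outcome.
z <- c(1, 2, 3, 4, 3.2, 6)
y <- c(0, 0, 1, 0, 1, 1)

zScaled <- yAwareScale(z, y)

# After scaling, regressing y on the new column gives a slope of 1,
# so columns scaled this way are comparable before something like PCA.
refit <- coef(lm(y ~ zScaled))
print(refit)
```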
The vtreat user should pick which sorts of variables they want and also filter on estimated significance. Doing this looks like the following:
dTrainN <- data.frame(x=c('a','a','a','b','b',NA),
   z=c(1,2,3,4,NA,6),y=as.numeric(c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)),
   stringsAsFactors = FALSE)
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y',
                                 codeRestriction = c('lev', 
                                                     'catN',
                                                     'clean',
                                                     'isBAD'),
                                 verbose=FALSE)
print(treatmentsN$scoreFrame[,scoreColsToPrint])
##   origName   varName  code          rsq       sig extraModelDegrees
## 1        x    x_catN  catN 0.000000e+00 1.0000000                 2
## 2        z         z clean 3.045045e-01 0.2562868                 0
## 3        z   z_isBAD isBAD 2.000000e-01 0.3739010                 0
## 4        x  x_lev_NA   lev 2.000000e-01 0.3739010                 0
## 5        x x_lev_x_a   lev 1.111111e-01 0.5185185                 0
## 6        x x_lev_x_b   lev 1.110223e-16 1.0000000                 0
pruneSig <- 1.0 # don't filter on significance for this tiny example
vScoreFrame <- treatmentsN$scoreFrame
varsToUse <- vScoreFrame$varName[(vScoreFrame$sig<=pruneSig)]
print(varsToUse)
## [1] "x_catN"    "z"         "z_isBAD"   "x_lev_NA"  "x_lev_x_a" "x_lev_x_b"
origVarNames <- sort(unique(vScoreFrame$origName[vScoreFrame$varName %in% varsToUse]))
print(origVarNames)
## [1] "x" "z"
# prepare a treated data frame using only the "significant" variables
dTreated <- prepare(treatmentsN, dTrainN, 
                    varRestriction = varsToUse)
## Warning in prepare.treatmentplan(treatmentsN, dTrainN, varRestriction =
## varsToUse): possibly called prepare() on same data frame as
## designTreatments*()/mkCrossFrame*Experiment(), this can lead to over-fit.  To
## avoid this, please use mkCrossFrame*Experiment$crossFrame.
print(dTreated)
##       x_catN   z z_isBAD x_lev_NA x_lev_x_a x_lev_x_b y
## 1 -0.1666667 1.0       0        0         1         0 0
## 2 -0.1666667 2.0       0        0         1         0 0
## 3 -0.1666667 3.0       0        0         1         0 1
## 4  0.0000000 4.0       0        0         0         1 0
## 5  0.0000000 3.2       1        0         0         1 1
## 6  0.5000000 6.0       0        1         0         0 1
We strongly suggest using the standard variables coded as ‘lev’, ‘clean’, and ‘isBAD’; and the “y aware” variables coded as ‘catN’ and ‘catB’. The non sub-model variables (‘catP’ and ‘catD’) can be useful (possibly as interactions or guards on the corresponding ‘catN’ and ‘catB’ variables) but also encode distributional facts about the data that may or may not be appropriate depending on your problem domain.
When displaying variables to end users we suggest using the original names and the min significance seen on any derived variable:
origVarNames <- sort(unique(vScoreFrame$origName[vScoreFrame$varName %in% varsToUse]))
print(origVarNames)
## [1] "x" "z"
origVarSigs <- vScoreFrame[vScoreFrame$varName %in% varsToUse,]
aggregate(sig~origName,data=origVarSigs,FUN=min)
##   origName       sig
## 1        x 0.3739010
## 2        z 0.2562868