--- title: "1. First Steps" author: "Martin Binder" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{1. First Steps} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, eval = TRUE, child = 'toc/vignettetoc.Rmd'} ``` ```{r, eval = TRUE, echo = FALSE, results = 'asis'} printToc(4) ``` ## About the Vignettes Since `mlrCPO` is a package with some depth to it, it comes with a few vignettes that each explain different aspects of its operation. These are the current document ("First Steps"), offering a short introduction and information on where to get started, "[mlrCPO Core](a_2_mlrCPO_core.html)", describing all the functions and tools offered by `mlrCPO` that are independent from specific `CPO`s, "[CPOs Built Into mlrCPO](a_3_all_CPOs.html)", listing all `CPO`s included in the `mlrCPO` package, and "[Building Custom CPOs](a_4_custom_CPOs.html)", describing the process of creating new `CPO`s that offer new functionality. All vignettes also have a "compact version" with the R output suppressed for readability. They are linked in the navigation section at the top. All vignettes assume that `mlrCPO` (and therefore its requirement `mlr`) is installed successfully and loaded using `library("mlrCPO")`. Help with installation is provided on the [project's GitHub page](https://github.com/mlr-org/mlrCPO). ## What is mlrCPO? "Composable Preprocessing Operators", "CPO", are an extension for the [mlr](https://github.com/mlr-org/mlr) ("Machine Learning in R") project which present preprocessing operations in the form of R objects. These CPO objects can be composed to form complex operations, they can be applied to data sets, and can be attached to mlr `Learner` objects to generate machine learning pipelines that combine preprocessing and model fitting. ### What is Preprocessing "Preprocessing", as understood by `mlrCPO`, is any manipulation of data used in a machine learning process to get it from its form as found in the wild into a form more fitting for the machine learning algorithm ("`Learner`") used for model fitting. It is important that the exact method of preprocessing is kept track of, to be able to perform this method when the resulting model is used to make predictions on new data. It is also important, when evaluating preprocessing methods e.g. using *resampling*, that the parameters of these methods are independent of the validation dataset and only depend on the training data set. `mlrCPO` tries to support the user in all these aspects of preprocessing: 1. By providing a large set of atomic preprocessing `CPO`s that can perform many different operations. Operations that go beyond the provided toolset can be implemented in custom `CPO`s. 2. By using "`CPOTrained`" objects that represent the preprocessing done on training data that should, in that way, be re-applied to new prediction data. 3. By making it possible to combine preprocessing objects with `mlr` "`Learner`" objects that represent the entinre machine learning pipeline to be tuned and evaluated. ## Preprocessing Operations At the centre of `mlrCPO` are "`CPO`" objects. To get a `CPO` object, it is necessary to call a *CPO Constructor*. A CPO Constructor sets up the parameters of a `CPO` and provides further options for its behaviour. Internally, CPO Constructors are *functions* that have a common interface and a friendly printer method. ```{r} cpoScale # a cpo constructor ``` ```{r} cpoAddCols ``` ```{r} cpoScale(center = FALSE) # create a CPO object that scales, but does not center, data ``` ```{r} cpoAddCols(Sepal.Area = Sepal.Length * Sepal.Width) # this would add a column ``` `CPO`s exist first to be *applied* to data. Every `CPO` represents a certain data transformation, and this transformation is performed when the `CPO` is applied. This can be done using the **`applyCPO`** function, or the **`%>>%`** operator. `CPO`s can be applied to `data.frame` objects, and to `mlr` "`Task`" objects. ```{r} iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ] tail(iris.demo %>>% cpoQuantileBinNumerics()) # bin the data in below & above median ``` A useful feature of `CPO`s is that they can be concatenated to form new operations. Two `CPO`s can be combined using the **`composeCPO`** function or, as before, the **`%>>%`** operator. When two `CPO`s are combined, the product is a new `CPO` that can itself be composed or applied. The result of a composition represents the operation of first applying the first `CPO` and then the second `CPO`. Therefore, `data %>>% (cpo1 %>>% cpo2)` is the same as `(data %>>% cpo1) %>>% cpo2`. ```{r} # first create three quantile bins, then as.numeric() all columns to # get 1, 2 or 3 as the bin number quantilenum = cpoQuantileBinNumerics(numsplits = 3) %>>% cpoAsNumeric() iris.demo %>>% quantilenum ``` The last example shows that it is sometimes not a good idea to have a `CPO` affect the whole dataset. Therefore, when a `CPO` is created, it is possible to choose what columns the `CPO` should affect. The CPO Constructor has a variety of parameters, starting with `affect.`, that can be used to choose what columns the `CPO` operates on. To prevent `cpoAsNumeric` from influencing the `Species` column, we can thus do ```{r} quantilenum.restricted = cpoQuantileBinNumerics(numsplits = 3) %>>% cpoAsNumeric(affect.names = "Species", affect.invert = TRUE) iris.demo %>>% quantilenum.restricted ``` A more convenient method in this case, however, is to use an `mlr` "`Task`", which keeps track of the target column. "Feature Operation" `CPO`s (as all the ones shown) do not influence the target column. ```{r} demo.task = makeClassifTask(data = iris.demo, target = "Species") result = demo.task %>>% quantilenum getTaskData(result) ``` ## Hyperparameters When performing preprocessing, it is sometimes necessary to change a small aspect of a long preprocessing pipeline. Instead of having to re-construct the whole pipeline, `mlrCPO` offers the possibility to change *hyperparameters* of a `CPO`. This makes it very easy e.g. for tuning of preprocessing in combination with a machine learning algorithm. Hyperparameters of `CPO`s can be manipulated in the same way as they are manipulated for `Learners` in `mlr`, using **`getParamSet`** (to list the parameters), **`getHyperPars`** (to list the parameter values), and **`setHyperPars`** (to change these values). To get the parameter set of a `CPO`, it is also possible to use **verbose printing** using the `!` (exclamation mark) operator. ```{r} cpo = cpoScale() cpo ``` ```{r} getHyperPars(cpo) # list of parameter names and values ``` ```{r} getParamSet(cpo) # more detailed view of parameters and their type / range ``` ```{r} !cpo # equivalent to print(cpo, verbose = TRUE) ``` `CPO`s use copy semantics, therefore `setHyperPars` creates a copy of a `CPO` that has the changed hyperparameters. ```{r} cpo2 = setHyperPars(cpo, scale.scale = FALSE) cpo2 ``` ```{r} iris.demo %>>% cpo # scales and centers ``` ```{r} iris.demo %>>% cpo2 # only centers ``` When chaining many `CPO`s, it is possible for the many hyperparameters to lead to very cluttered `ParamSet`s, or even for hyperparameter names to clash. `mlrCPO` has two remedies for that. First, any `CPO` also has an **`id`** that is always prepended to the hyperparameter names. It can be set during construction, using the `id` parameter, or changed later using `setCPOId`. The latter one only works on primitive, i.e. not compound, `CPO`s. Set the `id` to `NULL` to use the `CPO`'s hyperparameters without a prefix. ```{r} cpo = cpoScale(id = "a") %>>% cpoScale(id = "b") # not very useful example getHyperPars(cpo) ``` The second remedy against hyperparameter clashes is different "exports" of hyperparameters: The hyperparameters that can be changed using `setHyperPars`, i.e. that are *exported* by a `CPO`, are a subset of the parameters of the `CPOConstructor`. For each kind of `CPO`, there is a standard set of parameters that are exported, but during construction, it is possible to influence the parameters that actually get exported via the `export` parameter. `export` can be one of a set of standard export settings (among them "`export.all`" and "`export.none`") or a `character` vector of the parameters to export. ```{r} cpo = cpoPca(export = c("center", "rank")) getParamSet(cpo) ``` ## Retrafo Manipulating data for preprocessing itself is relatively easy. A challenge comes when one wants to integrate preprocessing into a machine-learning pipeline: The same preprocessing steps that are performed on the training data need to be performed on the new prediction data. However, the transformation performed for prediction often needs information from the training step. For example, if training entail performing PCA, then for prediction, the data must not undergo another PCA, instead it needs to be rotated by the *rotation matrix* found by the training PCA. The process of obtaining the rotation matrix will be called "training" the `CPO`, and the object that contains the trained information is called `CPOTrained`. For preprocessing operations that operate only on *features* of a task (as opposed to the target column), the `CPOTrained` will always be applied to new incoming data, and hence be of class `CPORetrafo` and called a "**retrafo**" object. To obtain this retrafo object, one can use **`retrafo()`**. Retrafo objects can be applied to data just as `CPO`s can, by using the `%>>%` operator. ```{r} transformed = iris.demo %>>% cpoPca(rank = 3) transformed ``` ```{r} ret = retrafo(transformed) ret ``` To show that `ret` actually represents the exact same preprocessing operation, we can feed the first line of `iris.demo` back to it, to verify that the transformation is the same. ```{r} iris.demo[1, ] %>>% ret ``` We obviously would not have gotten there by feeding the first line to `cpoPca` directly: ```{r} iris.demo[1, ] %>>% cpoPca(rank = 3) ``` `CPOTrained` objects associated with an object are automatically chained when another `CPO` is applied. To prevent this from happening, it is necessary to "clear" the retrafos and [inverters](#inverter) associated with the object using **`clearRI()`**. ```{r} t2 = transformed %>>% cpoScale() retrafo(t2) ``` ```{r} t3 = clearRI(transformed) %>>% cpoScale() retrafo(t3) ``` Note that `clearRI` has no influence on the `CPO` operations themselves, and the resulting data is the same: ```{r} all.equal(t2, t3, check.attributes = FALSE) ``` It is also possible to chain `CPOTrained` object using `composeCPO()` or `%>>%`. This can be useful if the trafo chain loses access to the `retrafo` attribute for some reason. In general, it is only recommended to compose `CPOTrained` objects that were created in the same process and in correct order, since they are usually closely associated with the training data in a particular place within the preprocessing chain. ```{r} retrafo(transformed) %>>% retrafo(t3) # is the same as retrafo(t2) above. ``` ## Inverter So far only `CPO`s were introduced that change the feature columns of a `Task`. ("Feature Operation `CPO`s"--*FOCPO*s). There is another class of `CPO`s, "Target Operation `CPO`s" or *TOCPO*s, that can change a `Task`'s target columns. This comes at the cost of some complexity when performing prediction: Since the training data that was ultimately fed into a `Learner` had a transformed target column, the predictions made by the resulting model will not be directly comparable to the original target values. Consider `cpoLogTrafoRegr`, a `CPO` that log-transforms the target variable of a regression `Task`. The predictions made with a `Learner` on a log-transformed target variable will be in log-space and need to be exponentiated (or otherwise re-transformed). This inversion operation is represented by an "**inverter**" object that is attached to a transformation result similarly to a retrafo object, and can be obtained using the **`inverter()`** function. It is of class `CPOInverter`, a subclass of `CPOTrained`. ```{r} iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width") iris.logd = iris.regr %>>% cpoLogTrafoRegr() getTaskData(iris.logd) # log-transformed target 'Petal.Width' ``` ```{r} inv = inverter(iris.logd) # inverter object inv ``` The inverter object is used by the `invert()` function that inverts the prediction made by a model trained on the transformed task, and re-transforms this prediction to fit the space of the original target data. The inverter object caches the "truth" of the data being inverted (`iris.logd`, in the example), so `invert` can give information on the truth of the inverted data. ```{r} logmodel = train("regr.lm", iris.logd) pred = predict(logmodel, iris.logd) # prediction on the task itself pred ``` ```{r} invert(inv, pred) ``` This procedure can also be done with new incoming data. In general, more than just the `cpoLogTrafoRegr` operation could be done on the `iris.regr` task in the example, so to perform the complete preprocessing *and* inversion, one needs to use the retrafo object as well. When applying the retrafo object, a new inverter object is generated, which is specific to the exact new data that was being retransformed: ```{r} newdata = makeRegrTask("newiris", iris[7:9, ], target = "Petal.Width", fixup.data = "no", check.data = FALSE) ``` ```{r} # the retrafo does the same transformation(s) on newdata that were # done on the training data of the model, iris.logd. In general, this # could be more than just the target log transformation. newdata.transformed = newdata %>>% retrafo(iris.logd) getTaskData(newdata.transformed) ``` ```{r} pred = predict(logmodel, newdata.transformed) pred ``` ```{r} # the inverter of the newly transformed data contains information specific # to the newly transformed data. In the current case, that is just the # new "truth" column for the new data. inv.newdata = inverter(newdata.transformed) invert(inv.newdata, pred) ``` ### Constant Inverters The `cpoLogTrafoRegr` is a special case of TOCPO in that its inversion operation is *constant*: It does not depend on the new incoming data, so in theory it is not necessary to get a new inverter object for every piece of data that is being transformed. Therefore, it is possible to use the *retrafo* object for inversion in this case. However, the "truth" column will not be available in this case: ```{r} invert(retrafo(iris.logd), pred) ``` Whether a retrafo object is capable of performing inversion can be checked with the **`getCPOTrainedCapability()`** function. It returns a vector with named elements `"retrafo"` and `"invert"`, indicating whether a `CPOTrained` is capable of performing retrafo or inversion. A `1` indicates that the object can perform the action and has an effect, a `0` indicates that the action would have no effect (but also throws no error), and a `-1` means that the object is not capable of performing the action. ```{r} getCPOTrainedCapability(retrafo(iris.logd)) # can do both retrafo and inversion ``` ```{r} getCPOTrainedCapability(inv) # a pure inverter, can not be used for retrafo ``` ### General Inverters As an example of a `CPO` that does not have a constant inverter, consider `cpoRegrResiduals`, wich fits a regression model on training data and returns the residuals of this fit. When performing prediction, the `invert` action is to add predictions by the `CPO`'s model to the incoming predictions made by a model trained on the residuals. ```{r, warnings = FALSE} set.seed(123) # for reproducibility iris.resid = iris.regr %>>% cpoRegrResiduals("regr.lm") getTaskData(iris.resid) ``` ```{r} model.resid = train("regr.randomForest", iris.resid) newdata.resid = newdata %>>% retrafo(iris.resid) getTaskData(newdata.resid) # Petal.Width are now the residuals of lm model predictions ``` ```{r} pred = predict(model.resid, newdata.resid) pred ``` ```{r} # transforming this prediction back to compare # it to the original 'Petal.Width' inv.newdata = inverter(newdata.resid) invert(inv.newdata, pred) ``` ## Retrafoless CPOs Besides *FOCPO*s and *TOCPO*s, there are also "*Retrafoless*" `CPO`s (*ROCPO*s). These only perform operation in the training part of a machine learning pipeline, but in turn are the only `CPO`s that may change the number of rows in a dataset. The goal of ROCPOs is to change the number of data samples, but not to transform the data or target values themselves. Examples of ROCPOs are `cpoUndersample`, `cpoSmote`, and `cpoSample`. ```{r} sampled = iris %>>% cpoSample(size = 3) sampled ``` There is no retrafo or inverter associated with the result. Instead, both of them are [NULLCPO](#nullcpo) ```{r} retrafo(sampled) inverter(sampled) ``` ## CPO Learners Until now, the `CPO`s have been invoked explicitly to manipulate data and get retrafo and inverter objects. It is good to be aware of the data flows in a machine learning process involving preprocessing, but `mlrCPO` makes it very easy to automatize this. It is possible to *attach* a `CPO` to a `Learner` using **`attachCPO`** or the `%>>%`-operator. When a `CPO` is attached to a `Learner`, a `CPOLearner` is created. The `CPOLearner` performs the preprocessing operation dictated by the `CPO` before training the underlying model, and stores and uses the retrafo and inverter objects necessary during prediction. It is possible to attach compound `CPO`s, and it is possible to attach further `CPO`s to a `CPOLearner` to extend the preprocessing pipeline. Exported hyperparamters of a `CPO` are also present in a `CPOLearner` and can be changed using `setHyperPars`, as usual with other `Learner` objects. Recreating the pipeline from [General Inverters](#general-inverters) with a `CPOLearner` looks like the following. Note the prediction `pred` made in the end is identical with the one made above. ```{r} set.seed(123) # for reproducibility lrn = cpoRegrResiduals("regr.lm") %>>% makeLearner("regr.randomForest") lrn ``` ```{r, warnings = FALSE} model = train(lrn, iris.regr) pred = predict(model, newdata) pred ``` It is possible to get the retrafo object from a model trained with a `CPOLearner` using the `retrafo()` function. In this example, it is identical with the `retrafo(iris.resid)` gotten in the example in [General Inverters](#general-inverters). ```{r} retrafo(model) ``` ## CPO Tuning Since the hyperparameters of a `CPO` are present in a `CPOLearner`, is possible to tune hyperparameters of preprocessing operations. It can be done using `mlr`'s **`tuneParams()`** function and works identically to tuning common `Learner`-parameters. ```{r} icalrn = cpoIca() %>>% makeLearner("classif.logreg") getParamSet(icalrn) ``` ```{r} ps = makeParamSet( makeIntegerParam("ica.n.comp", lower = 1, upper = 8), makeDiscreteParam("ica.alg.typ", values = c("parallel", "deflation"))) # shorter version using pSS: # ps = pSS(ica.n.comp: integer[1, 8], ica.alg.typ: discrete[parallel, deflation]) ``` ```{r} tuneParams(icalrn, pid.task, cv5, par.set = ps, control = makeTuneControlGrid(), show.info = FALSE) ``` ## Syntactic Sugar Besides the `%>>%` operator, there are a few related operators which are short forms of operations that otherwise take more typing. * **`%<<%`** is similar to `%>>%` but works in the other direction. `a %>>% b` is the same as `b %<<% a`. * **`%<>>%` and `%<<<%`** are the `%>>%` or `%<<%` operators, combined with assignment. `a %<>>% b` is the same as `a = a %>>% b`. These operators perform the operations on their right before they do the assignment, so it is not necessary to use parentheses when writing `a = a %>>% b %>>% c` as `a %<>>% b %>>% c`. * **`%>|%` and `%|<%`** feed data in a `CPO` and gets the `retrafo()`. `data %>|% a` is the same as `retrafo(data %>>% a)`. The `%>|%` operator performs the operation on its right before getting the retrafo, so it is not necessary to use parentheses when writing `retrafo(data %>>% a %>>% b)` as `data %>|% a %>>% b`. ## Inspecting CPOs As described before, it is possible to *compose* `CPO`s to create relatively complex preprocessing pipelines. It is therefore necessary to have tools to inspect a `CPO` pipeline or related objects. The first line of attack when inspecting a `CPO` is always the `print` function. `print(x, verbose = TRUE)` will often print more information about a `CPO` than the ordinary print function. A shorthand alias for this is the exclamation point "**`!`**". When verbosely printing a `CPOConstructor`, the transformation functions are shown. When verbosely printing a `CPO`, the constituent elements are separately printed, each showing their parameter sets. ```{r} cpoAsNumeric # plain print !cpoAsNumeric # verbose print ``` ```{r} cpoScale() %>>% cpoIca() # plain print !cpoScale() %>>% cpoIca() # verbose print ``` When working with compound `CPO`s, it is sometimes necessary to manipulate a `CPO` inside a compound `CPO` pipeline. For this purpose, the **`as.list()`** generic is implemented for both `CPO` and `CPOTrained` for splitting a pipeline into a list of the primitive elements. The inverse is **`pipeCPO()`**, which takes a list of `CPO` or `CPOTrained` and concatenates them using `composeCPO()`. ```{r} as.list(cpoScale() %>>% cpoIca()) ``` ```{r} pipeCPO(list(cpoScale(), cpoIca())) ``` `CPOTrained` objects contain information about the retrafo or inversion to be performed for a `CPO`. It is possible to access this information using **`getCPOTrainedState()`**. The "state" of a `CPOTrained` object often contains a `$data` slot with information about the expected input and output format ("`ShapeInfo`") of incoming data, a slot for each of its hyperparameters, and a `$control` slot that is specific to the `CPO` in question. The `cpoPca` state, for example, contains the PCA rotation matrix and a vector for scaling and centering. The contents of a state's `$control` object are described in a `CPO`'s help page. ```{r} repca = retrafo(iris.demo %>>% cpoPca()) state = getCPOTrainedState(repca) state ``` It is even possible to change the "state" of a `CPOTrained` and construct a new `CPOTrained` using **`makeCPOTrainedFromState()`**. This is fairly advanced usage and only recommended for users familiar with the inner workings of the particular `CPO`. If we get familiar with the `cpoPca` `CPO` using the `!`-print (i.e. `!cpoPca`) to look at the retrafo function, we notice that the `control$center` and `control$scale` values are given to a call of `scale()`. If we want to create a new `CPOTrained` that does *not* perform centering or scaling during before applying the rotation matrix, we can change these values. ```{r} state$control$center = FALSE state$control$scale = FALSE nosc.repca = makeCPOTrainedFromState(cpoPca, state) ``` Comparing this to the original "`repca`" retrafo shows that the result of applying `repca` has generally smaller values because of the centering. ```{r} iris.demo %>>% repca ``` ```{r} iris.demo %>>% nosc.repca ``` ## Special CPOs There is a large and growing variety of `CPO`s that perform many different operations. It is advisable to browse through [CPOs Built Into mlrCPO](a_3_all_CPOs.html) for an overview. To get a list of all built-in `CPO`s, use **`listCPO()`**. A few important or "meta" `CPO`s that can be used to influence the behaviour of other `CPO`s are described here. ### NULLCPO The value associated with "no operation" is the `NULLCPO` value. It is the neutral element of the `%>>%` operations, and the value of `retrafo()` and `inverter()` when there are otherwise no associated retrafo or inverter values. ```{r} NULLCPO ``` ```{r} all.equal(iris %>>% NULLCPO, iris) cpoPca() %>>% NULLCPO ``` ### CPO Multiplexer The multiplexer makes it possible to combine many CPOs into one, with an extra `selected.cpo` parameter that chooses between them. This makes it possible to tune over many different tuner configurations at once. ```{r} cpm = cpoMultiplex(list(cpoIca, cpoPca(export = "export.all"))) !cpm ``` ```{r} iris.demo %>>% setHyperPars(cpm, selected.cpo = "ica", ica.n.comp = 3) ``` ```{r} iris.demo %>>% setHyperPars(cpm, selected.cpo = "pca", pca.rank = 3) ``` ### CPO Wrapper A simple CPO with one parameter which gets applied to the data as CPO. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, this does not expose the argument's parameters to the outside. ```{r} cpa = cpoWrap() !cpa ``` ```{r} iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoScale()) ``` ```{r} iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoPca()) ``` Attaching the cpo applicator to a learner gives this learner a "cpo" hyperparameter that can be set to any CPO. ```{r} getParamSet(cpoWrap() %>>% makeLearner("classif.logreg")) ``` ### CBind CPO `cbind` other CPOs as operation. The `cbinder` makes it possible to build DAGs of CPOs that perform different operations on data and paste the results next to each other. It is often useful to combine `cpoCbind` with `cpoSelect` to filter out columns that would otherwise be duplciated. ```{r} scale = cpoSelect(pattern = "Sepal", id = "first") %>>% cpoScale(id = "scale") scale.pca = scale %>>% cpoPca() cbinder = cpoCbind(scale, scale.pca, cpoSelect(pattern = "Petal", id = "second")) ``` `cpoCbind` recognises that `"scale"` happens before `"pca"`, but is also fed to the result directly. The verbose print draws a (crude) ascii-art graph. ```{r} !cbinder ``` ```{r} iris.demo %>>% cbinder ``` ## Custom CPOs Even though `CPO`s are very flexible and can be combined in many ways, it may be necessary to create completely custom `CPO`s. Custom CPOs can be created using the **`makeCPO()`** and related functions. "[Building Custom CPOs](a_4_custom_CPOs.html)" is a wide topic which has its own vignette. ## Summary - **`CPO`**s are built using **`CPOConstructor`**s by calling them like functions. - The available `CPOConstructors` can be found by using **`listCPO()`** or consulting [the relevant vignette](a_3_all_CPOs.html). - Verbose printing of `CPO`s and many related objects is available using the **`!`** (exclamation mark) operator. - `CPO`s export hyperparameters that are accessible using **`getParamSet()`** and **`getHyperPars()`**, and mutable using **`setHyperPars()`**. Which parameters are exported can be controlled using the **`export`** parameter during construction. - They can be composed (**`composeCPO()`**), applied to data (**`applyCPO()`**) and attached to `Learner`s (**`attachCPO()`**) using special functions for each of these operations, or using the general **`%>>%`** operator. - There are three fundamental kinds of `CPO`: **FOCPO** (Feature Operation `CPO`s), **TOCPO** (Target Operation `CPO`s) and **ROCPO** (Retrafoless `CPO`s). The first may only change feature columns, the second only target columns. While the last one may change both feature *and* target values and even the number of rows of a dataset, it does so with the understanding that new "prediction" data will not be transformed by it and is thus mainly useful for subsampling. - Data that was transformed by a (non-Retrafoless) `CPO` has a retrafo-**`CPOTrained`** object associated with it that can be retrieved using **`retrafo()`** and used to transform new prediction data in similar way as the original training data. - `CPOTrained` objects can themselves be composed using **`composeCPO`** or **`%>>%`**, although it is only recommended to compose `CPOTrained` objects in the same order as they were created, and only if they were created in the same preprocessing pipeline. - `CPOTrained` objects can be inspected using **`getCPOTrainedState()`**, and re-built with changed state using **`makeCPOTrainedFromState()`**. - Data that was transformed by a *TOCPO* also has an *inverter* associated with it, which can be retrieved using **`inverter()`**. An inverter is also created during application of a retrafo `CPOTrained`. - While *retrafo* `CPOTrained` are created during training and used on every prediction data set, *inverter* `CPOTrained` are created anew during each `CPO` and retrafo-`CPOTrained` application and are closely associated with the data that they were created with. - `CPOTrained` objects associated with data are stored in their "attributes" and are automatically chained when more `CPO`s are applied. **`clearRI()`** is used to remove the associated `CPOTrained` objects and prevent this chaining. - `CPO`s can be attached to `Learner`s to get **`CPOLearner`**s which automatically transform training *and* prediction data and perform prediction inversion. - `CPOLearner`s have the `Learner`'s *and* the `CPO`'s hyperparameters and can thus be manipulated using **`setHyperPars()`**, and can be tuned using **`tuneParams()`**. - Notable `CPO`s are **`NULLCPO`** (the neutral element of `%>>%`), **`cpoMultiplex`**, **`cpoWrap`**, and **`cpoCbind`**. - It is possible to create custom `CPO`s using **`makeCPO`** and similar functions. These are described [in their own vignette](a_4_custom_CPOs.html).