---
title: "Using missRanger"
date: "`r Sys.Date()`"
bibliography: "biblio.bib"
link-citations: true
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using missRanger}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

## Overview

{missRanger} is a **multivariate imputation algorithm** based on random forests. It is a fast alternative to the beautiful 'MissForest' algorithm of @stekhoven, and uses the {ranger} package [@wright] to fit the random forests.

The algorithm iterates until the average out-of-bag (OOB) error of the forests stops improving. The missing values are filled by OOB predictions of the best iteration.

- {missRanger} is **fast**.
- It allows for **out-of-sample applications**.
- It is **intuitive**: E.g., calling `missRanger(data, . ~ 1)` would impute all variables univariately, while `missRanger(data, Species ~ Sepal.Width)` would use `Sepal.Width` to impute `Species`.
- It works for a **variety of data types**.
- It combines random forest imputation with **predictive mean matching**. This avoids "new" values like 0.3334 in a 0-1 coded variable and helps to raise the variance of the imputations, which is especially important for **multiple imputation** (see additional vignettes).
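
To make the idea behind predictive mean matching concrete, here is a minimal base R sketch. It is an illustration under simplifying assumptions, not the package's implementation: the helper `pmm_sketch()` and the toy vectors are hypothetical. For each raw prediction of a missing value, it looks up the `k` observed rows with the closest predictions and draws one of their observed values.

```r
# Hypothetical helper for illustration only (not part of {missRanger})
pmm_sketch <- function(pred_obs, pred_mis, y_obs, k = 3) {
  vapply(pred_mis, function(p) {
    # Indices of the k observed rows whose predictions are closest to p
    nearest <- order(abs(pred_obs - p))[seq_len(k)]
    # Draw one of these donors and return its *observed* value
    y_obs[nearest[sample.int(k, 1)]]
  }, FUN.VALUE = y_obs[1])
}

set.seed(1)
y_obs    <- c(0, 0, 1, 1, 1)            # observed values of a 0-1 coded variable
pred_obs <- c(0.1, 0.2, 0.7, 0.8, 0.9)  # model predictions for the observed rows
pred_mis <- c(0.3334, 0.75)             # raw predictions for rows with missing values

imputed <- pmm_sketch(pred_obs, pred_mis, y_obs, k = 3)
imputed  # only values that actually occur in y_obs
```

Because every imputed value is copied from an observed donor, a raw prediction like 0.3334 can never end up in the data.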

## Installation

```r
# From CRAN
install.packages("missRanger")

# Development version
devtools::install_github("mayer79/missRanger")
```

## Usage

``` {r}
library(missRanger)

set.seed(3)

iris_NA <- generateNA(iris, p = 0.1)
head(iris_NA)
 
imp <- missRanger(iris_NA, num.trees = 100)
head(imp)
```

### Predictive mean matching

It worked, but the imputed numeric values carry more decimal places than the observed ones and thus appear overly exact. To avoid this, we can add predictive mean matching (PMM) to the OOB predictions:

``` {r}
imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 100, verbose = 0)
head(imp)
```
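
One way to check the effect of PMM is to test whether the imputed values also occur among the observed values. The following sketch does this for `Sepal.Length`. The object names `imp_raw`, `imp_pmm`, and `observed` are introduced here for illustration only.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Imputation without and with predictive mean matching
imp_raw <- missRanger(iris_NA, num.trees = 100, verbose = 0)
imp_pmm <- missRanger(iris_NA, pmm.k = 5, num.trees = 100, verbose = 0)

# Values of Sepal.Length that were actually observed
observed <- unique(na.omit(iris_NA$Sepal.Length))

# Share of values in the completed column that also occur in the observed data
mean(imp_raw$Sepal.Length %in% observed)

# With PMM, every value should be an observed one
all(imp_pmm$Sepal.Length %in% observed)
```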

### Controlling the random forests

`missRanger()` offers many options. For example, how would we use one feature per split (`mtry = 1`) with 200 trees?

``` {r}
imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 200, mtry = 1, verbose = 0)
```

### Extended output

Setting `data_only = FALSE` (or `keep_forests = TRUE`) returns a "missRanger" object. With `keep_forests = TRUE`, this allows for out-of-sample applications:

```{r}
imp <- missRanger(
  iris_NA, pmm.k = 5, num.trees = 100, keep_forests = TRUE, verbose = 0
)
imp

summary(imp)

# Out-of-sample application
# saveRDS(imp, file = "imputation_model.rds")
# imp <- readRDS("imputation_model.rds")
predict(imp, head(iris_NA))
```
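
The "missRanger" object is a list, so its components can be inspected directly. The sketch below assumes the element name `data` for the imputed dataset, as described in the package documentation. The data frame `new_rows` is a made-up example of unseen data with missing values.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

imp <- missRanger(
  iris_NA, pmm.k = 5, num.trees = 100, keep_forests = TRUE, verbose = 0
)

# Which components does the object contain?
names(imp)

# The imputed data (element name assumed from the package documentation)
head(imp$data)

# Hypothetical new data with missing values in two different columns
new_rows <- iris[1:3, ]
new_rows$Sepal.Length[1] <- NA
new_rows$Species[2] <- NA

# Impute the new rows with the stored forests
new_imp <- predict(imp, new_rows)
new_imp
```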

### Formulas

By default, `missRanger()` uses all columns to impute all columns with missings.

This can be modified by passing a formula: the left-hand side specifies the variables to be imputed, while the right-hand side lists the variables used for imputation.

```{r}
# Impute all variables with all (default)
m <- missRanger(iris_NA, formula = . ~ ., pmm.k = 5, num.trees = 100, verbose = 0)

# Don't use Species for imputation
m <- missRanger(iris_NA, . ~ . - Species, pmm.k = 5, num.trees = 100, verbose = 0)

# Impute Sepal.Length by Species (or not?)
m <- missRanger(iris_NA, Sepal.Length ~ Species, pmm.k = 5, num.trees = 100)
head(m)

# Only univariate imputation was done! Why? Because Species contains missing values
# itself and needs to appear on the LHS as well:
m <- missRanger(iris_NA, Sepal.Length + Species ~ Species, pmm.k = 5, num.trees = 100)
head(m)

# Impute all variables univariately
m <- missRanger(iris_NA, . ~ 1, verbose = 0)
```
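
To see what a formula actually did, it can help to count the missing values per column before and after imputation. A small sketch using the same simulated data as above: only the variables on the left-hand side are expected to be filled, while the other columns keep their missing values.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Impute only Sepal.Length and Species
m <- missRanger(
  iris_NA, Sepal.Length + Species ~ Species,
  pmm.k = 5, num.trees = 100, verbose = 0
)

# Missing values per column before imputation
colSums(is.na(iris_NA))

# Missing values per column after imputation
colSums(is.na(m))
```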

### Speeding things up

`missRanger()` fits a random forest per variable and iteration. Thus, imputation can take a long time. Some tweaks:

- Use fewer trees, e.g., `num.trees = 100`.
- Use a smaller tree depth, e.g., `max.depth = 6`.
- Use large leaves, e.g., `min.node.size = 100`.
- Use smaller bootstrap samples, e.g., `sample.fraction = 0.2`.
- Use fewer iterations, e.g., `maxiter = 3`.

The first three items also help to greatly reduce the size of the models, which might become relevant in out-of-sample applications with `keep_forests = TRUE`.
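
The following sketch combines several of these tweaks and compares the run time with the default settings via `system.time()`. The parameter values are arbitrary choices for the small iris data, not recommendations, and the iteration limit is passed as `maxiter`, the name used in the function signature. On such a tiny dataset the difference is small; the effect becomes relevant for larger data.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Default settings
t_default <- system.time(
  m_default <- missRanger(iris_NA, verbose = 0)
)

# Lighter forests and fewer iterations (illustrative values)
t_fast <- system.time(
  m_fast <- missRanger(
    iris_NA,
    num.trees = 100,
    max.depth = 6,
    min.node.size = 10,
    sample.fraction = 0.5,
    maxiter = 3,
    verbose = 0
  )
)

# Elapsed time in seconds for both runs
c(default = t_default[["elapsed"]], fast = t_fast[["elapsed"]])
```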

### Trick: Use `case.weights` to reduce impact of rows with many missings

Using the `case.weights` argument, you can pass case weights to the imputation models. For instance, this lets you reduce the contribution of rows with many missings:

```r
m <- missRanger(
  iris_NA,
  num.trees = 100,
  pmm.k = 5,
  case.weights = rowSums(!is.na(iris_NA))
)
```

## References