Replace Missing Data • MetaPipe

library(MetaPipe)

MetaPipe can handle missing data in two ways:

Drop traits whose proportion of NAs reaches or exceeds a threshold (50% by default)
Replace NAs with half of the minimum value within each trait

Function call

replace_missing(raw_data = raw_data, 
                excluded_columns = c(2, 3, ..., M), 
                # Optional
                out_prefix = "metapipe", 
                prop_na = 0.5, 
                replace_na = FALSE)

where raw_data is a data frame containing the raw data, as described in Load Raw Data, and excluded_columns is a vector containing the indices of the properties, e.g. c(2, 3, ..., M). The other arguments are optional: out_prefix is the prefix for output files, prop_na is the proportion of NA values at or above which a trait is dropped, and replace_na is a logical flag indicating whether NAs should be replaced with half of the minimum value of each trait.

Drop traits

By default, the pipeline drops traits whose proportion of NAs is at or above a threshold; this threshold can be tuned with the parameter prop_na. Keep in mind that excluding variables from the QTL mapping has side effects, such as drawing wrong conclusions about the most significant QTLs.

set.seed(123)
example_data <- data.frame(ID = 1:5,
                           P1 = c("one", "two", "three", "four", "five"), 
                           T1 = rnorm(5), 
                           T2 = rnorm(5),
                           T3 = c(NA, rnorm(4)),                     #  20 % NAs
                           T4 = c(NA, 1.2, -0.5, NA, 0.87),          #  40 % NAs
                           T5 = NA)                                  # 100 % NAs

workdir <- tempdir()
out_prefix <- file.path(workdir, "metapipe")

# Default parameters: NA proportion = 50%
replace_missing(example_data, c(2), out_prefix = out_prefix)
#> The following trait was dropped because it has 50% or more missing values: 
#>  - T5
#>   ID    P1          T1         T2        T3    T4
#> 1  1   one -0.56047565  1.7150650        NA    NA
#> 2  2   two -0.23017749  0.4609162 1.2240818  1.20
#> 3  3 three  1.55870831 -1.2650612 0.3598138 -0.50
#> 4  4  four  0.07050839 -0.6868529 0.4007715    NA
#> 5  5  five  0.12928774 -0.4456620 0.1106827  0.87

# NA proportion = 30%
replace_missing(example_data, c(2), prop_na = 0.3, out_prefix = out_prefix)
#> The following traits were dropped because they have 30% or more missing values: 
#>  - T4
#>  - T5
#>   ID    P1          T1         T2        T3
#> 1  1   one -0.56047565  1.7150650        NA
#> 2  2   two -0.23017749  0.4609162 1.2240818
#> 3  3 three  1.55870831 -1.2650612 0.3598138
#> 4  4  four  0.07050839 -0.6868529 0.4007715
#> 5  5  five  0.12928774 -0.4456620 0.1106827
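
The threshold check can also be reproduced in base R, which may help when choosing a value for prop_na. The following is a minimal sketch, not MetaPipe's implementation: drop_sparse_traits is a hypothetical helper, the toy values are made up, and it assumes that the first column holds the IDs and that a trait is dropped when its NA proportion is at or above the threshold, as the messages above suggest.

```r
# Hypothetical helper, not part of MetaPipe: drop trait columns whose
# proportion of NA values is at or above `prop_na`.
drop_sparse_traits <- function(data, excluded_columns, prop_na = 0.5) {
  # Trait columns: everything except the ID column (assumed to be column 1)
  # and the excluded property columns
  trait_cols <- setdiff(seq_along(data), c(1, excluded_columns))
  # Proportion of missing values per trait
  na_prop <- colMeans(is.na(data[trait_cols]))
  # Indices of the traits at or above the threshold
  drop_idx <- trait_cols[na_prop >= prop_na]
  if (length(drop_idx) > 0) data[-drop_idx] else data
}

toy_data <- data.frame(ID = 1:5,
                       P1 = c("one", "two", "three", "four", "five"),
                       T3 = c(NA, 0.4, 1.1, -0.3, 0.9),   #  20 % NAs
                       T4 = c(NA, 1.2, -0.5, NA, 0.87),   #  40 % NAs
                       T5 = NA)                           # 100 % NAs

names(drop_sparse_traits(toy_data, c(2)))
#> [1] "ID" "P1" "T3" "T4"
names(drop_sparse_traits(toy_data, c(2), prop_na = 0.3))
#> [1] "ID" "P1" "T3"
```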

Replace NAs

Alternatively, NA values can be replaced with half of the minimum observed value of each trait by passing the parameter replace_na = TRUE. Use this approach with caution: replacing data points can have side effects that lead to wrong conclusions.
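
To make the replacement rule and its side effects concrete, here is a minimal base R sketch applied to a single trait. impute_half_min is a hypothetical helper written for illustration, not a MetaPipe function; it assumes the minimum is taken over the observed (non-NA) values only.

```r
# Hypothetical helper, not part of MetaPipe: replace NAs in a numeric vector
# with half of its observed minimum.
impute_half_min <- function(x) {
  if (all(is.na(x))) return(x)  # no observed values to take a minimum of
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}

T4 <- c(NA, 1.2, -0.5, NA, 0.87)
T4_imputed <- impute_half_min(T4)
T4_imputed
#> [1] -0.25  1.20 -0.50 -0.25  0.87

# The replaced values shift the trait mean
mean(T4, na.rm = TRUE)
#> [1] 0.5233333
mean(T4_imputed)
#> [1] 0.214
```

Because the minimum of this trait is negative, half of it (-0.25) lies above the minimum rather than below it, and the two replaced values lower the trait mean from about 0.52 to 0.21.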

set.seed(123)
example_data <- data.frame(ID = 1:5,
                           P1 = c("one", "two", "three", "four", "five"), 
                           T1 = rnorm(5), 
                           T2 = rnorm(5),
                           T3 = c(NA, rnorm(4)),                     #  20 % NAs
                           T4 = c(NA, 1.2, -0.5, NA, 0.87),          #  40 % NAs
                           T5 = NA)                                  # 100 % NAs

workdir <- tempdir()
out_prefix <- file.path(workdir, "metapipe")

# Default parameters: NA proportion = 50%
replace_missing(example_data, c(2), replace_na = TRUE, out_prefix = out_prefix)
#> The following trait was dropped because it has 100% missing values: 
#>  - T5
#>   ID    P1          T1         T2         T3    T4
#> 1  1   one -0.56047565  1.7150650 0.05534136 -0.25
#> 2  2   two -0.23017749  0.4609162 1.22408180  1.20
#> 3  3 three  1.55870831 -1.2650612 0.35981383 -0.50
#> 4  4  four  0.07050839 -0.6868529 0.40077145 -0.25
#> 5  5  five  0.12928774 -0.4456620 0.11068272  0.87

# NA proportion = 30%
replace_missing(example_data, 
                c(2), 
                prop_na = 0.3, 
                replace_na = TRUE, 
                out_prefix = out_prefix)
#> The following trait was dropped because it has 100% missing values: 
#>  - T5
#>   ID    P1          T1         T2         T3    T4
#> 1  1   one -0.56047565  1.7150650 0.05534136 -0.25
#> 2  2   two -0.23017749  0.4609162 1.22408180  1.20
#> 3  3 three  1.55870831 -1.2650612 0.35981383 -0.50
#> 4  4  four  0.07050839 -0.6868529 0.40077145 -0.25
#> 5  5  five  0.12928774 -0.4456620 0.11068272  0.87

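The last example shows that replace_na takes precedence over prop_na: with replace_na = TRUE, the value of prop_na is ignored, so T4 (40% NAs) is kept and its NAs are replaced even though prop_na = 0.3, and only traits with 100% missing values (T5) are dropped.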
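
The interaction between the two parameters can be summarised in a few lines of base R. This is a sketch inferred from the outputs above, not MetaPipe's source code; traits_to_drop and the toy values are hypothetical.

```r
# Hypothetical helper, not part of MetaPipe: names of the traits that would be
# dropped under each mode, following the behaviour seen in the examples.
traits_to_drop <- function(data, trait_cols, prop_na = 0.5, replace_na = FALSE) {
  na_prop <- colMeans(is.na(data[trait_cols]))
  # With replace_na = TRUE the threshold is ignored: only traits without any
  # observed value (nothing to take a minimum of) are dropped
  threshold <- if (replace_na) 1 else prop_na
  names(na_prop)[na_prop >= threshold]
}

toy_traits <- data.frame(T3 = c(NA, 0.4, 1.1, -0.3, 0.9),   #  20 % NAs
                         T4 = c(NA, 1.2, -0.5, NA, 0.87),   #  40 % NAs
                         T5 = NA)                           # 100 % NAs
trait_names <- c("T3", "T4", "T5")

traits_to_drop(toy_traits, trait_names, prop_na = 0.3)
#> [1] "T4" "T5"
traits_to_drop(toy_traits, trait_names, prop_na = 0.3, replace_na = TRUE)
#> [1] "T5"
```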

Next, see Assess Normality.