MetaPipe can handle missing data in a couple of ways:
replace_missing(raw_data = raw_data,
                excluded_columns = c(2, 3, ..., M),
                # Optional
                out_prefix = "metapipe",
                prop_na = 0.5,
                replace_na = FALSE)
where raw_data is a data frame containing the raw data, as described in Load Raw Data, and excluded_columns is a vector containing the indices of the properties, e.g. c(2, 3, ..., M). The other arguments are optional: out_prefix is the prefix for output files, prop_na is the proportion of NA values (used to drop traits), and replace_na is a logical flag indicating whether NAs should be replaced by half of the minimum value.
By default, the pipeline drops traits whose proportion of NA values reaches or exceeds a threshold (50%). The user can fine-tune this threshold with the parameter prop_na. Keep in mind that excluding variables from the QTL mapping has side effects, such as drawing wrong conclusions about the most significant QTLs.
set.seed(123)
example_data <- data.frame(ID = 1:5,
                           P1 = c("one", "two", "three", "four", "five"),
                           T1 = rnorm(5),
                           T2 = rnorm(5),
                           T3 = c(NA, rnorm(4)),                  # 20 % NAs
                           T4 = c(NA, 1.2, -0.5, NA, 0.87),       # 40 % NAs
                           T5 = NA)                               # 100 % NAs
workdir <- tempdir()
out_prefix <- file.path(workdir, "metapipe")
# Default parameters: NA proportion = 50%
replace_missing(example_data, c(2), out_prefix = out_prefix)
#> The following trait was dropped because it has 50% or more missing values:
#> - T5
#>   ID    P1          T1         T2        T3    T4
#> 1  1   one -0.56047565  1.7150650        NA    NA
#> 2  2   two -0.23017749  0.4609162 1.2240818  1.20
#> 3  3 three  1.55870831 -1.2650612 0.3598138 -0.50
#> 4  4  four  0.07050839 -0.6868529 0.4007715    NA
#> 5  5  five  0.12928774 -0.4456620 0.1106827  0.87
# NA proportion = 30%
replace_missing(example_data, c(2), prop_na = 0.3, out_prefix = out_prefix)
#> The following traits were dropped because they have 30% or more missing values:
#> - T4
#> - T5
#>   ID    P1          T1         T2        T3
#> 1  1   one -0.56047565  1.7150650        NA
#> 2  2   two -0.23017749  0.4609162 1.2240818
#> 3  3 three  1.55870831 -1.2650612 0.3598138
#> 4  4  four  0.07050839 -0.6868529 0.4007715
#> 5  5  five  0.12928774 -0.4456620 0.1106827
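The threshold rule can also be checked by hand. The following is a minimal sketch in base R, not MetaPipe's internal code: the helper drop_sparse_traits is a hypothetical name introduced for illustration, and the trait values are made up so that the NA proportions match the example (20%, 40%, and 100%).

```r
# Illustrative sketch in base R; this is NOT MetaPipe's implementation.
example_data <- data.frame(ID = 1:5,
                           P1 = c("one", "two", "three", "four", "five"),
                           T3 = c(NA, 1.22, 0.36, 0.40, 0.11),  # 20 % NAs
                           T4 = c(NA, 1.2, -0.5, NA, 0.87),     # 40 % NAs
                           T5 = NA)                             # 100 % NAs

# Hypothetical helper: drop columns whose NA proportion is >= prop_na,
# never dropping the excluded (non-trait) columns.
drop_sparse_traits <- function(data, excluded_columns, prop_na = 0.5) {
  na_prop <- colMeans(is.na(data))   # proportion of NAs per column
  na_prop[excluded_columns] <- 0     # protect excluded columns
  data[, na_prop < prop_na, drop = FALSE]
}

# NA proportion per column: ID = 0, P1 = 0, T3 = 0.2, T4 = 0.4, T5 = 1
colMeans(is.na(example_data))

# With a 30% threshold, T4 and T5 are removed
names(drop_sparse_traits(example_data, 2, prop_na = 0.3))
#> [1] "ID" "P1" "T3"
```

Computing the proportions before calling replace_missing makes it easier to choose a prop_na value that does not silently discard traits of interest.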
Alternatively, the user can replace the NA values of each trait with half of that trait's minimum value by passing the parameter replace_na = TRUE. Users should be cautious when using this approach, as replacing data points can have side effects that lead to wrong conclusions.
set.seed(123)
example_data <- data.frame(ID = 1:5,
                           P1 = c("one", "two", "three", "four", "five"),
                           T1 = rnorm(5),
                           T2 = rnorm(5),
                           T3 = c(NA, rnorm(4)),                  # 20 % NAs
                           T4 = c(NA, 1.2, -0.5, NA, 0.87),       # 40 % NAs
                           T5 = NA)                               # 100 % NAs
workdir <- tempdir()
out_prefix <- file.path(workdir, "metapipe")
# Default parameters: NA proportion = 50%
replace_missing(example_data, c(2), replace_na = TRUE, out_prefix = out_prefix)
#> The following trait was dropped because it has 100% missing values:
#> - T5
#>   ID    P1          T1         T2         T3    T4
#> 1  1   one -0.56047565  1.7150650 0.05534136 -0.25
#> 2  2   two -0.23017749  0.4609162 1.22408180  1.20
#> 3  3 three  1.55870831 -1.2650612 0.35981383 -0.50
#> 4  4  four  0.07050839 -0.6868529 0.40077145 -0.25
#> 5  5  five  0.12928774 -0.4456620 0.11068272  0.87
# NA proportion = 30%
replace_missing(example_data,
                c(2),
                prop_na = 0.3,
                replace_na = TRUE,
                out_prefix = out_prefix)
#> The following trait was dropped because it has 100% missing values:
#> - T5
#>   ID    P1          T1         T2         T3    T4
#> 1  1   one -0.56047565  1.7150650 0.05534136 -0.25
#> 2  2   two -0.23017749  0.4609162 1.22408180  1.20
#> 3  3 three  1.55870831 -1.2650612 0.35981383 -0.50
#> 4  4  four  0.07050839 -0.6868529 0.40077145 -0.25
#> 5  5  five  0.12928774 -0.4456620 0.11068272  0.87
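The last example shows that prop_na and replace_na are mutually exclusive, and that replace_na takes precedence over prop_na: when replace_na = TRUE, the prop_na threshold is ignored and only traits with 100% missing values are dropped.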
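The half-minimum replacement described above can be sketched in base R for a single trait. This is an illustration under assumptions, not MetaPipe's internal code: half_min_impute is a hypothetical helper name, and the vector reuses the T4 values from the example.

```r
# Illustrative sketch in base R; this is NOT MetaPipe's implementation.
# Hypothetical helper: replace NAs in a numeric vector with half of the
# smallest observed (non-NA) value.
half_min_impute <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}

t4 <- c(NA, 1.2, -0.5, NA, 0.87)   # same values as T4 in the example
half_min_impute(t4)
#> [1] -0.25  1.20 -0.50 -0.25  0.87
```

Two details are worth noting. A trait with only NA values has no minimum to halve, which is consistent with T5 still being dropped when replace_na = TRUE. Also, when the minimum is negative, as in T4, half of the minimum (-0.25) is larger than the observed minimum (-0.5), so the replaced values do not fall below the observed range.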
Next, see Assess Normality.