Replaces missing values (NAs) in a dataset. The user can choose between two actions to handle missing data:

  1. Drop traits (variables) whose proportion of missing (NA) to total observations exceeds a given threshold, prop_na.

  2. Replace missing values by half of the minimum within each trait.

Finally, traits for which all entries are missing are removed from the dataset and stored in an external CSV file called "<out_prefix>_NA_raw_data.csv".
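
The last step described above, setting aside traits that have no observations at all, can be sketched in base R as follows. This is only an illustration and not MetaPipe's own code: `drop_all_na_traits` is a hypothetical helper, and it assumes that only the all-NA columns are written to the CSV file, which the description does not specify.

```r
# Hypothetical helper (not part of MetaPipe): split off traits with no observations.
drop_all_na_traits <- function(raw_data, out_prefix) {
  # TRUE for every column in which all entries are NA
  all_na <- vapply(raw_data, function(x) all(is.na(x)), logical(1))
  if (any(all_na)) {
    # File name follows the "<out_prefix>_NA_raw_data.csv" pattern described above
    out_file <- paste0(out_prefix, "_NA_raw_data.csv")
    # Assumption: only the all-NA columns are stored
    utils::write.csv(raw_data[, all_na, drop = FALSE], out_file, row.names = FALSE)
  }
  # Return the dataset without the all-NA columns
  raw_data[, !all_na, drop = FALSE]
}

toy <- data.frame(ID = 1:3, T1 = c(0.4, NA, 1.1), T2 = NA)
kept <- drop_all_na_traits(toy, file.path(tempdir(), "sketch"))
names(kept)
```

Here T2 is removed and written to the CSV file, while T1 is kept because it still has observed values.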

Usage
replace_missing(
  raw_data,
  excluded_columns = NULL,
  out_prefix = file.path(tempdir(), "metapipe"),
  prop_na = 0.5,
  replace_na = FALSE
)

Arguments

raw_data

Data frame containing the raw data.

excluded_columns

Numeric vector containing the indices of the columns to exclude, that is, the non-numeric properties of the dataset.
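
One possible way to build such an index vector is sketched below in base R. The data frame `toy` and the variable `non_numeric` are made up for illustration. The sketch assumes an identifier column that is stored as numbers, so it is not detected automatically and its index is added by hand.

```r
toy <- data.frame(ID = c(1, 2, 3),
                  P1 = c("one", "two", "three"),
                  T1 = c(0.4, NA, 1.1))

# Indices of the columns that are not numeric (here only P1)
non_numeric <- which(!vapply(toy, is.numeric, logical(1)))

# ID is numeric, so it is not picked up above; add its index manually
excluded_columns <- sort(union(1, unname(non_numeric)))
excluded_columns
```

The resulting vector can then be passed as the excluded_columns argument.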

out_prefix

Prefix for output files and plots.

prop_na

Proportion of missing to total observations. If a trait exceeds this threshold and replace_na = FALSE, it will be dropped.
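
The proportion can be computed per trait with `colMeans(is.na(...))`, as in the minimal sketch below. It is not MetaPipe's implementation: `toy`, `na_rate` and `keep` are illustrative names, and the `>=` comparison is an assumption taken from the "or more" wording in the messages shown under Examples.

```r
toy <- data.frame(T3 = c(NA, 0.3, 0.4, 0.1, 1.2),      #  20 % NAs
                  T4 = c(NA, 1.2, -0.5, NA, 0.87),     #  40 % NAs
                  T5 = NA_real_)                       # 100 % NAs
prop_na <- 0.25

# Proportion of missing to total observations for each trait
na_rate <- colMeans(is.na(toy))
na_rate

# Assumption: a trait is dropped when its rate is at or above the threshold
keep <- na_rate < prop_na
names(toy)[keep]
```

With a threshold of 0.25, only T3 is retained.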

replace_na

Boolean flag to indicate whether or not missing values should be replaced by half of the minimum value within each trait.
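
Half-minimum replacement for a single trait can be sketched as below. `half_min_impute` is a hypothetical helper written for this illustration, not a MetaPipe function. It assumes the trait has at least one observed value, which holds if traits with no observations have already been removed.

```r
# Hypothetical helper: replace NAs with half of the smallest observed value
half_min_impute <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}

t4 <- c(NA, 1.2, -0.5, NA, 0.87)
half_min_impute(t4)
```

The minimum of the observed values is -0.5, so both missing entries become -0.25.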

Value

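Data frame containing the raw data after handling missing values: traits above the prop_na threshold are dropped or, if replace_na = TRUE, missing values are replaced.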

Examples

                                        
# Toy dataset
set.seed(123) # fix the rnorm() draws so the output is reproducible
example_data <- data.frame(ID = c(1,2,3,4,5), 
                           P1 = c("one", "two", "three", "four", "five"), 
                           T1 = rnorm(5), 
                           T2 = rnorm(5),
                           T3 = c(NA, rnorm(4)),                  #  20 % NAs
                           T4 = c(NA, 1.2, -0.5, NA, 0.87),       #  40 % NAs
                           T5 = NA)                               # 100 % NAs
out_prefix <- file.path(tempdir(), "metapipe")
MetaPipe::replace_missing(example_data, c(1, 2), out_prefix = out_prefix)
#> The following trait was dropped because it has 50% or more missing values: 
#>  - T5
#>   ID    P1          T1         T2        T3    T4
#> 1  1   one -0.56047565  1.7150650        NA    NA
#> 2  2   two -0.23017749  0.4609162 1.2240818  1.20
#> 3  3 three  1.55870831 -1.2650612 0.3598138 -0.50
#> 4  4  four  0.07050839 -0.6868529 0.4007715    NA
#> 5  5  five  0.12928774 -0.4456620 0.1106827  0.87
MetaPipe::replace_missing(example_data, 
                          c(1, 2), 
                          prop_na = 0.25,
                          out_prefix = out_prefix)
#> The following traits were dropped because they have 25% or more missing values: 
#>  - T4
#>  - T5
#>   ID    P1          T1         T2        T3
#> 1  1   one -0.56047565  1.7150650        NA
#> 2  2   two -0.23017749  0.4609162 1.2240818
#> 3  3 three  1.55870831 -1.2650612 0.3598138
#> 4  4  four  0.07050839 -0.6868529 0.4007715
#> 5  5  five  0.12928774 -0.4456620 0.1106827
MetaPipe::replace_missing(example_data, 
                          c(1, 2), 
                          replace_na = TRUE,
                          out_prefix = out_prefix)
#> The following trait was dropped because it has 100% missing values: 
#>  - T5
#>   ID    P1          T1         T2         T3    T4
#> 1  1   one -0.56047565  1.7150650 0.05534136 -0.25
#> 2  2   two -0.23017749  0.4609162 1.22408180  1.20
#> 3  3 three  1.55870831 -1.2650612 0.35981383 -0.50
#> 4  4  four  0.07050839 -0.6868529 0.40077145 -0.25
#> 5  5  five  0.12928774 -0.4456620 0.11068272  0.87


# F1 Seedling Ionomics dataset
data(ionomics) # Includes some missing data
out_prefix <- file.path(tempdir(), "ionomics")
ionomics_rev <- MetaPipe::replace_missing(ionomics, 
                                          c(1, 2), 
                                          out_prefix = out_prefix)
ionomics_rev <- MetaPipe::replace_missing(ionomics, 
                                          excluded_columns = c(1, 2), 
                                          prop_na = 0.025, 
                                          out_prefix = out_prefix)
#> The following trait was dropped because it has 2.5% or more missing values: 
#>  - Se78
ionomics_rev <- MetaPipe::replace_missing(ionomics, 
                                          excluded_columns = c(1, 2),
                                          replace_na = TRUE, 
                                          out_prefix = out_prefix)
knitr::kable(ionomics_rev[1:5, 1:8])
#> 
#> 
#> |ID    | SampleWeight|     Ca44|      K39|      P31|       Li7|      B11|      Na23|
#> |:-----|------------:|--------:|--------:|--------:|---------:|--------:|---------:|
#> |E_001 |           79| 15894.22| 5888.311| 1743.118| 0.0128699| 18.66673|  6.970224|
#> |E_002 |           93| 13155.45| 7013.400| 2244.684| 0.0119316| 14.47693|  5.866392|
#> |E_004 |           97| 14182.51| 7966.273| 2311.057| 0.0212316| 14.71313| 10.251955|
#> |E_005 |           82| 22550.82| 7514.089| 2315.675| 0.0233063| 20.10630| 11.773697|
#> |E_006 |           99| 15982.76| 7608.464| 1995.193| 0.0588128| 12.97801| 11.043837|