MetaPipe only accepts comma-separated values (CSV) files with the following structure:
ID | [Property]1 | … | [Property]M | [Trait]1 | … | [Trait]N |
---|---|---|---|---|---|---|
where the first column (ID) should be a unique identifier for each entry. If an ID is repeated, MetaPipe aggregates the matching entries and replaces them with a single row (the mean across entries). The data structure can have 0 to M properties, which may be categorical or numerical. Finally, at least one trait is expected.
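The aggregation rule for repeated IDs can be pictured with a minimal base R sketch. This is not MetaPipe's internal code: the toy data frame, the variable names, and the choice to keep the first property value seen for each ID are all assumptions made for illustration.

```r
# Hypothetical toy data: ID 1 appears twice
toy <- data.frame(ID = c(1, 2, 1),
                  P1 = c("a", "b", "a"),
                  T1 = c(1.0, 5.0, 3.0))

# Average the trait within each ID (base R, not MetaPipe internals)
traits <- aggregate(T1 ~ ID, data = toy, FUN = mean)

# Assumption: keep the first property value seen for each ID
props <- toy[!duplicated(toy$ID), c("ID", "P1")]

# Combine into one row per ID
collapsed <- merge(props, traits, by = "ID")
print(collapsed)
```

Here the two rows with ID 1 collapse into one row whose T1 is the mean of 1.0 and 3.0.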
The function call is as follows:
load_raw(raw_data_filename, excluded_columns)
where raw_data_filename is the name of the file containing the raw data (both absolute and relative paths are accepted) and excluded_columns is a vector containing the indices of the property columns, e.g. c(2, 3, ..., M).
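When the property columns are known by name rather than by position, the index vector can be derived from the file header. The sketch below assumes a small hypothetical CSV with properties named P1 and P2; the file and its column names are made up for the example.

```r
# Hypothetical CSV following the ID | properties | traits layout
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:2,
                     P1 = c("a", "b"),
                     P2 = c(10, 20),
                     T1 = c(0.1, 0.2)),
          csv_path,
          row.names = FALSE)

# Read only the first row to recover the column names
header <- names(read.csv(csv_path, nrows = 1))

# Positions of the property columns, by name
excluded_columns <- which(header %in% c("P1", "P2"))
print(excluded_columns)
#> [1] 2 3
```

The resulting vector can then be passed as the second argument, as in load_raw(csv_path, excluded_columns).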
# Toy dataset
set.seed(123)
example_data <- data.frame(ID = c(1, 2, 3, 4, 5),
                           P1 = c("one", "two", "three", "four", "five"),
                           T1 = rnorm(5),
                           T2 = rnorm(5),
                           T3 = c(NA, rnorm(4)),             # 20 % NAs
                           T4 = c(NA, 1.2, -0.5, NA, 0.87),  # 40 % NAs
                           T5 = NA)                          # 100 % NAs
workdir <- tempdir()
## Write to disk
write.csv(example_data,
          file.path(workdir, "example_data.csv"),
          row.names = FALSE)
## Create copy with duplicated rows
write.csv(example_data[c(1:5, 1, 2), ],
          file.path(workdir, "example_data_dup.csv"),
          row.names = FALSE)
# Load the data; column 2 (P1) is a property, so its index is passed as excluded_columns
load_raw(file.path(workdir, "example_data.csv"), c(2))
#> ID P1 T5 T1 T2 T3 T4
#> 1 1 one NA -0.56047565 1.7150650 NA NA
#> 2 2 two NA -0.23017749 0.4609162 1.2240818 1.20
#> 3 3 three NA 1.55870831 -1.2650612 0.3598138 -0.50
#> 4 4 four NA 0.07050839 -0.6868529 0.4007715 NA
#> 5 5 five NA 0.12928774 -0.4456620 0.1106827 0.87
# Load the copy with duplicated rows; the repeated IDs 1 and 2 are collapsed to one row each
load_raw(file.path(workdir, "example_data_dup.csv"), c(2))
#> ID P1 T5 T1 T2 T3 T4
#> 1 1 one NA -0.56047565 1.7150650 NA NA
#> 2 2 two NA -0.23017749 0.4609162 1.2240818 1.20
#> 3 3 three NA 1.55870831 -1.2650612 0.3598138 -0.50
#> 4 4 four NA 0.07050839 -0.6868529 0.4007715 NA
#> 5 5 five NA 0.12928774 -0.4456620 0.1106827 0.87
Next, see either Replace Missing Data [Optional] or Assess Normality.