The aim of this practical is to practice how to merge multiple datasets.
In this practical, we will work with example datasets that are created by the following syntax:
<- data.frame(id = 1:5,
basedat age = c(37.9, 22.6, 26.9, 37.3, 49.2),
sex = factor(c('male', 'female', 'female', 'male', 'male')),
center = c('EMC', 'EMC', 'LUMC', 'VUMC', 'LUMC')
)
<- data.frame(id = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 5),
FUdat time = c(3.6, 5.3, 7.8, 4.3, 8.2, 5.3, 3.5,
4.9, 6.4, 8.1, 2.2, 6.4, 7.0),
weight = c(136.7, 131.5, 135.0, 113.7, 128.9, 97.1,
168.1, 147.7, 153.1, 149.6, 152.6, 137.7, 133.1)
)
<- data.frame(id = 1:12,
labdat ID = c(1, 1, 2, 2, 2, 4, 4, 4, 4, 5, 5, 6),
labtime = c(3.1, 6.0, 4.3, 5.8, 8.2, 3.5,
4.9, 6.4, 8.1, 2.2, 6.4, 4.7),
chol = c(347, 309, 215, 436, 360, 335, 1002, 808, 615,
932, 413, 337),
center = c(rep('EMC', 5), rep('VUMC', 4), rep('LUMC', 3))
)
The resulting datasets are:
basedat, a dataset with some baseline information on the patients:id | age | sex | center |
---|---|---|---|
1 | 37.9 | male | EMC |
2 | 22.6 | female | EMC |
3 | 26.9 | female | LUMC |
4 | 37.3 | male | VUMC |
5 | 49.2 | male | LUMC |
weight
:
id | time | weight |
---|---|---|
1 | 3.6 | 136.7 |
1 | 5.3 | 131.5 |
1 | 7.8 | 135.0 |
2 | 4.3 | 113.7 |
2 | 8.2 | 128.9 |
3 | 5.3 | 97.1 |
4 | 3.5 | 168.1 |
4 | 4.9 | 147.7 |
4 | 6.4 | 153.1 |
4 | 8.1 | 149.6 |
5 | 2.2 | 152.6 |
5 | 6.4 | 137.7 |
5 | 7.0 | 133.1 |
id | ID | labtime | chol | center |
---|---|---|---|---|
1 | 1 | 3.1 | 347 | EMC |
2 | 1 | 6.0 | 309 | EMC |
3 | 2 | 4.3 | 215 | EMC |
4 | 2 | 5.8 | 436 | EMC |
5 | 2 | 8.2 | 360 | EMC |
6 | 4 | 3.5 | 335 | VUMC |
7 | 4 | 4.9 | 1002 | VUMC |
8 | 4 | 6.4 | 808 | VUMC |
9 | 4 | 8.1 | 615 | VUMC |
10 | 5 | 2.2 | 932 | LUMC |
11 | 5 | 6.4 | 413 | LUMC |
12 | 6 | 4.7 | 337 | LUMC |
basedat
and FUdat
basedat
which contains “baseline” information on each
patient and FUdat
, which contains repeated measurements of
a variable weight
.merge()
by
reading the help file.basedat
and FUdat
by the variable
id
.<- merge(basedat, FUdat)
dat1 dat1
## id age sex center time weight
## 1 1 37.9 male EMC 3.6 136.7
## 2 1 37.9 male EMC 5.3 131.5
## 3 1 37.9 male EMC 7.8 135.0
## 4 2 22.6 female EMC 4.3 113.7
## 5 2 22.6 female EMC 8.2 128.9
## 6 3 26.9 female LUMC 5.3 97.1
## 7 4 37.3 male VUMC 3.5 168.1
## 8 4 37.3 male VUMC 4.9 147.7
## 9 4 37.3 male VUMC 6.4 153.1
## 10 4 37.3 male VUMC 8.1 149.6
## 11 5 49.2 male LUMC 2.2 152.6
## 12 5 49.2 male LUMC 6.4 137.7
## 13 5 49.2 male LUMC 7.0 133.1
Check for two of the subjects that the merged data are correct.
subset()
could be used here.
# for subject 1
subset(basedat, id == 1)
## id age sex center
## 1 1 37.9 male EMC
subset(FUdat, id == 1)
## id time weight
## 1 1 3.6 136.7
## 2 1 5.3 131.5
## 3 1 7.8 135.0
subset(dat1, id == 1)
## id age sex center time weight
## 1 1 37.9 male EMC 3.6 136.7
## 2 1 37.9 male EMC 5.3 131.5
## 3 1 37.9 male EMC 7.8 135.0
Looks OK, we now have the information from the baseline dataset repeated in each row for the repeated measurements of the patient.
# for subject 4, using a different method of subsetting (just as a reminder how that worked)
$id == 4, ] basedat[basedat
## id age sex center
## 4 4 37.3 male VUMC
$id == 4, ] FUdat[FUdat
## id time weight
## 7 4 3.5 168.1
## 8 4 4.9 147.7
## 9 4 6.4 153.1
## 10 4 8.1 149.6
$id == 4, ] dat1[dat1
## id age sex center time weight
## 7 4 37.3 male VUMC 3.5 168.1
## 8 4 37.3 male VUMC 4.9 147.7
## 9 4 37.3 male VUMC 6.4 153.1
## 10 4 37.3 male VUMC 8.1 149.6
Check that the resulting dat1
has the correct
dimensions.
dim()
could be used here.
dim(dat1)
## [1] 13 6
dim(FUdat)
## [1] 13 3
dim(basedat)
## [1] 5 4
FUdat
and basedat
contain data on the exact
same set of subjects and only the id
variable is
overlapping between the two datasets. Therefore, the resulting
dat1
should have the same number of rows as the data
containing the repeated measurements (FUdat
) and has one
column less than the sum of the number of columns of both input datasets
(6 = 3 + 4 - 1).
labdat
We now want to merge the previously merged data dat1
with labdat
.
Note that in labdat
we have the variables
id
: sample numberID
: patient identifierMerge dat1
and labdat
by patient
identifier.
by.x
and by.y
.
<- merge(dat1, labdat, by.x = 'id', by.y = 'ID')
dat2 dat2
## id age sex center.x time weight id.y labtime chol center.y
## 1 1 37.9 male EMC 3.6 136.7 1 3.1 347 EMC
## 2 1 37.9 male EMC 3.6 136.7 2 6.0 309 EMC
## 3 1 37.9 male EMC 7.8 135.0 1 3.1 347 EMC
## 4 1 37.9 male EMC 7.8 135.0 2 6.0 309 EMC
## 5 1 37.9 male EMC 5.3 131.5 1 3.1 347 EMC
## 6 1 37.9 male EMC 5.3 131.5 2 6.0 309 EMC
## 7 2 22.6 female EMC 4.3 113.7 3 4.3 215 EMC
## 8 2 22.6 female EMC 4.3 113.7 5 8.2 360 EMC
## 9 2 22.6 female EMC 4.3 113.7 4 5.8 436 EMC
## 10 2 22.6 female EMC 8.2 128.9 3 4.3 215 EMC
## 11 2 22.6 female EMC 8.2 128.9 5 8.2 360 EMC
## 12 2 22.6 female EMC 8.2 128.9 4 5.8 436 EMC
## 13 4 37.3 male VUMC 3.5 168.1 6 3.5 335 VUMC
## 14 4 37.3 male VUMC 3.5 168.1 7 4.9 1002 VUMC
## 15 4 37.3 male VUMC 3.5 168.1 9 8.1 615 VUMC
## 16 4 37.3 male VUMC 3.5 168.1 8 6.4 808 VUMC
## 17 4 37.3 male VUMC 4.9 147.7 6 3.5 335 VUMC
## 18 4 37.3 male VUMC 4.9 147.7 7 4.9 1002 VUMC
## 19 4 37.3 male VUMC 4.9 147.7 9 8.1 615 VUMC
## 20 4 37.3 male VUMC 4.9 147.7 8 6.4 808 VUMC
## 21 4 37.3 male VUMC 6.4 153.1 6 3.5 335 VUMC
## 22 4 37.3 male VUMC 6.4 153.1 7 4.9 1002 VUMC
## 23 4 37.3 male VUMC 6.4 153.1 9 8.1 615 VUMC
## 24 4 37.3 male VUMC 6.4 153.1 8 6.4 808 VUMC
## 25 4 37.3 male VUMC 8.1 149.6 6 3.5 335 VUMC
## 26 4 37.3 male VUMC 8.1 149.6 7 4.9 1002 VUMC
## 27 4 37.3 male VUMC 8.1 149.6 9 8.1 615 VUMC
## 28 4 37.3 male VUMC 8.1 149.6 8 6.4 808 VUMC
## 29 5 49.2 male LUMC 2.2 152.6 10 2.2 932 LUMC
## 30 5 49.2 male LUMC 2.2 152.6 11 6.4 413 LUMC
## 31 5 49.2 male LUMC 6.4 137.7 10 2.2 932 LUMC
## 32 5 49.2 male LUMC 6.4 137.7 11 6.4 413 LUMC
## 33 5 49.2 male LUMC 7.0 133.1 10 2.2 932 LUMC
## 34 5 49.2 male LUMC 7.0 133.1 11 6.4 413 LUMC
Did you notice that in the resulting dat2
we now have
the two variables center.x
and center.y
, and
that we have a variable id.y
(which is the sample number
from labdat
)?
Moreover, there are two time variables: time
(from
dat1
) and labtime
(from
labdat
).
Re-do the merge of dat1
and labdat
taking
into account the following points
id.y
) a more informative
suffixby.x
and by.y
.
suffixes
.
<- merge(dat1, labdat,
dat3 by.x = c('id', 'center', 'time'),
by.y = c('ID', 'center', 'labtime'),
suffixes = c('_base_FU', '_lab'))
dat3
## id center time age sex weight id_lab chol
## 1 2 EMC 4.3 22.6 female 113.7 3 215
## 2 2 EMC 8.2 22.6 female 128.9 5 360
## 3 4 VUMC 3.5 37.3 male 168.1 6 335
## 4 4 VUMC 4.9 37.3 male 147.7 7 1002
## 5 4 VUMC 6.4 37.3 male 153.1 8 808
## 6 4 VUMC 8.1 37.3 male 149.6 9 615
## 7 5 LUMC 2.2 49.2 male 152.6 10 932
## 8 5 LUMC 6.4 49.2 male 137.7 11 413
Here it does not actually matter how we specify the first element of
the argument suffixes
because it is not used.
You may have noticed that the merged dat3
only contains
very few rows.
all
.
By default, the argument all
is set to
FALSE
. This means that only those rows are kept that have
identical entries for all merging variables in both datasets. Only a few
values of time
and labtime
are identical.
When we set all = TRUE
all rows will be kept. Where
time
and labtime
do not match, two rows are
created and NA
is filled in.
<- merge(dat1, labdat,
dat4 by.x = c('id', 'center', 'time'),
by.y = c('ID', 'center', 'labtime'),
suffixes = c('_base_FU', '_lab'), all = TRUE)
dat4
## id center time age sex weight id_lab chol
## 1 1 EMC 3.1 NA <NA> NA 1 347
## 2 1 EMC 3.6 37.9 male 136.7 NA NA
## 3 1 EMC 5.3 37.9 male 131.5 NA NA
## 4 1 EMC 6.0 NA <NA> NA 2 309
## 5 1 EMC 7.8 37.9 male 135.0 NA NA
## 6 2 EMC 4.3 22.6 female 113.7 3 215
## 7 2 EMC 5.8 NA <NA> NA 4 436
## 8 2 EMC 8.2 22.6 female 128.9 5 360
## 9 3 LUMC 5.3 26.9 female 97.1 NA NA
## 10 4 VUMC 3.5 37.3 male 168.1 6 335
## 11 4 VUMC 4.9 37.3 male 147.7 7 1002
## 12 4 VUMC 6.4 37.3 male 153.1 8 808
## 13 4 VUMC 8.1 37.3 male 149.6 9 615
## 14 5 LUMC 2.2 49.2 male 152.6 10 932
## 15 5 LUMC 6.4 49.2 male 137.7 11 413
## 16 5 LUMC 7.0 49.2 male 133.1 NA NA
## 17 6 LUMC 4.7 NA <NA> NA 12 337