In this practical, we will work with example datasets that are created by the following syntax:
<- data.frame(id = 1:5,
basedat age = c(37.9, 22.6, 26.9, 37.3, 49.2),
sex = factor(c('male', 'female', 'female', 'male', 'male')),
center = c('EMC', 'EMC', 'LUMC', 'VUMC', 'LUMC')
)
<- data.frame(id = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 5),
FUdat time = c(3.6, 5.3, 7.8, 4.3, 8.2, 5.3, 3.5,
4.9, 6.4, 8.1, 2.2, 6.4, 7.0),
weight = c(136.7, 131.5, 135.0, 113.7, 128.9, 97.1,
168.1, 147.7, 153.1, 149.6, 152.6, 137.7, 133.1)
)
<- data.frame(id = 1:12,
labdat ID = c(1, 1, 2, 2, 2, 4, 4, 4, 4, 5, 5, 6),
labtime = c(3.1, 6.0, 4.3, 5.8, 8.2, 3.5,
4.9, 6.4, 8.1, 2.2, 6.4, 4.7),
chol = c(347, 309, 215, 436, 360, 335, 1002, 808, 615,
932, 413, 337),
center = c(rep('EMC', 5), rep('VUMC', 4), rep('LUMC', 3))
)
The resulting datasets are:
basedatid | age | sex | center |
---|---|---|---|
1 | 37.9 | male | EMC |
2 | 22.6 | female | EMC |
3 | 26.9 | female | LUMC |
4 | 37.3 | male | VUMC |
5 | 49.2 | male | LUMC |
id | time | weight |
---|---|---|
1 | 3.6 | 136.7 |
1 | 5.3 | 131.5 |
1 | 7.8 | 135.0 |
2 | 4.3 | 113.7 |
2 | 8.2 | 128.9 |
3 | 5.3 | 97.1 |
4 | 3.5 | 168.1 |
4 | 4.9 | 147.7 |
4 | 6.4 | 153.1 |
4 | 8.1 | 149.6 |
5 | 2.2 | 152.6 |
5 | 6.4 | 137.7 |
5 | 7.0 | 133.1 |
id | ID | labtime | chol | center |
---|---|---|---|---|
1 | 1 | 3.1 | 347 | EMC |
2 | 1 | 6.0 | 309 | EMC |
3 | 2 | 4.3 | 215 | EMC |
4 | 2 | 5.8 | 436 | EMC |
5 | 2 | 8.2 | 360 | EMC |
6 | 4 | 3.5 | 335 | VUMC |
7 | 4 | 4.9 | 1002 | VUMC |
8 | 4 | 6.4 | 808 | VUMC |
9 | 4 | 8.1 | 615 | VUMC |
10 | 5 | 2.2 | 932 | LUMC |
11 | 5 | 6.4 | 413 | LUMC |
12 | 6 | 4.7 | 337 | LUMC |
basedat
and FUdat
Run the syntax given above to create the datasets basedat
which contains “baseline” information on each patient and FUdat
, which contains repeated measurements of a variable weight
.
Merge basedat
and FUdat
by the variable id
.
merge()
.
<- merge(basedat, FUdat)
dat1 dat1
## id age sex center time weight
## 1 1 37.9 male EMC 3.6 136.7
## 2 1 37.9 male EMC 5.3 131.5
## 3 1 37.9 male EMC 7.8 135.0
## 4 2 22.6 female EMC 4.3 113.7
## 5 2 22.6 female EMC 8.2 128.9
## 6 3 26.9 female LUMC 5.3 97.1
## 7 4 37.3 male VUMC 3.5 168.1
## 8 4 37.3 male VUMC 4.9 147.7
## 9 4 37.3 male VUMC 6.4 153.1
## 10 4 37.3 male VUMC 8.1 149.6
## 11 5 49.2 male LUMC 2.2 152.6
## 12 5 49.2 male LUMC 6.4 137.7
## 13 5 49.2 male LUMC 7.0 133.1
Check for two of the subjects that the merged data are correct.
subset()
could be used here.
# for subject 1
subset(basedat, id == 1)
## id age sex center
## 1 1 37.9 male EMC
subset(FUdat, id == 1)
## id time weight
## 1 1 3.6 136.7
## 2 1 5.3 131.5
## 3 1 7.8 135.0
subset(dat1, id == 1)
## id age sex center time weight
## 1 1 37.9 male EMC 3.6 136.7
## 2 1 37.9 male EMC 5.3 131.5
## 3 1 37.9 male EMC 7.8 135.0
# for subject 4, using a different method of subsetting (just as a reminder how that worked)
$id == 4, ] basedat[basedat
## id age sex center
## 4 4 37.3 male VUMC
$id == 4, ] FUdat[FUdat
## id time weight
## 7 4 3.5 168.1
## 8 4 4.9 147.7
## 9 4 6.4 153.1
## 10 4 8.1 149.6
$id == 4, ] dat1[dat1
## id age sex center time weight
## 7 4 37.3 male VUMC 3.5 168.1
## 8 4 37.3 male VUMC 4.9 147.7
## 9 4 37.3 male VUMC 6.4 153.1
## 10 4 37.3 male VUMC 8.1 149.6
Check that the resulting dat1
has the correct dimensions.
dim()
could be used here.
dim(dat1)
## [1] 13 6
dim(FUdat)
## [1] 13 3
dim(basedat)
## [1] 5 4
FUdat
and basedat
contain data on the exact same set of subjects and only the id
variable is overlapping between the two datasets. Therefore, the resulting dat1
should have the same number of rows as the data containing the repeated measurements (FUdat
) and has one column less than the sum of the number of columns of both input datasets (6 = 3 + 4 - 1).
labdat
We now want to merge the previously merged data dat1
with labdat
.
Note that in labdat
we have the variables
id
: sample numberID
: patient identifierMerge dat1
and labdat
by patient identifier.
by.x
and by.y
.
<- merge(dat1, labdat, by.x = 'id', by.y = 'ID')
dat2 dat2
## id age sex center.x time weight id.y labtime chol center.y
## 1 1 37.9 male EMC 3.6 136.7 1 3.1 347 EMC
## 2 1 37.9 male EMC 3.6 136.7 2 6.0 309 EMC
## 3 1 37.9 male EMC 7.8 135.0 1 3.1 347 EMC
## 4 1 37.9 male EMC 7.8 135.0 2 6.0 309 EMC
## 5 1 37.9 male EMC 5.3 131.5 1 3.1 347 EMC
## 6 1 37.9 male EMC 5.3 131.5 2 6.0 309 EMC
## 7 2 22.6 female EMC 4.3 113.7 3 4.3 215 EMC
## 8 2 22.6 female EMC 4.3 113.7 5 8.2 360 EMC
## 9 2 22.6 female EMC 4.3 113.7 4 5.8 436 EMC
## 10 2 22.6 female EMC 8.2 128.9 3 4.3 215 EMC
## 11 2 22.6 female EMC 8.2 128.9 5 8.2 360 EMC
## 12 2 22.6 female EMC 8.2 128.9 4 5.8 436 EMC
## 13 4 37.3 male VUMC 3.5 168.1 6 3.5 335 VUMC
## 14 4 37.3 male VUMC 3.5 168.1 7 4.9 1002 VUMC
## 15 4 37.3 male VUMC 3.5 168.1 9 8.1 615 VUMC
## 16 4 37.3 male VUMC 3.5 168.1 8 6.4 808 VUMC
## 17 4 37.3 male VUMC 4.9 147.7 6 3.5 335 VUMC
## 18 4 37.3 male VUMC 4.9 147.7 7 4.9 1002 VUMC
## 19 4 37.3 male VUMC 4.9 147.7 9 8.1 615 VUMC
## 20 4 37.3 male VUMC 4.9 147.7 8 6.4 808 VUMC
## 21 4 37.3 male VUMC 6.4 153.1 6 3.5 335 VUMC
## 22 4 37.3 male VUMC 6.4 153.1 7 4.9 1002 VUMC
## 23 4 37.3 male VUMC 6.4 153.1 9 8.1 615 VUMC
## 24 4 37.3 male VUMC 6.4 153.1 8 6.4 808 VUMC
## 25 4 37.3 male VUMC 8.1 149.6 6 3.5 335 VUMC
## 26 4 37.3 male VUMC 8.1 149.6 7 4.9 1002 VUMC
## 27 4 37.3 male VUMC 8.1 149.6 9 8.1 615 VUMC
## 28 4 37.3 male VUMC 8.1 149.6 8 6.4 808 VUMC
## 29 5 49.2 male LUMC 2.2 152.6 10 2.2 932 LUMC
## 30 5 49.2 male LUMC 2.2 152.6 11 6.4 413 LUMC
## 31 5 49.2 male LUMC 6.4 137.7 10 2.2 932 LUMC
## 32 5 49.2 male LUMC 6.4 137.7 11 6.4 413 LUMC
## 33 5 49.2 male LUMC 7.0 133.1 10 2.2 932 LUMC
## 34 5 49.2 male LUMC 7.0 133.1 11 6.4 413 LUMC
Did you notice that in the resulting dat2
we now have the two variables center.x
and center.y
, and that we have a variable id.y
(which is the sample number from labdat
)?
Moreover, there are two time variables: time
(from dat1
) and labtime
(from labdat
).
Re-do the merge of dat1
and labdat
taking into account the following points
id.y
) a more informative suffixby.x
and by.y
.
suffixes
.
<- merge(dat1, labdat,
dat3 by.x = c('id', 'center', 'time'),
by.y = c('ID', 'center', 'labtime'),
suffixes = c('_base_FU', '_lab'))
dat3
## id center time age sex weight id_lab chol
## 1 2 EMC 4.3 22.6 female 113.7 3 215
## 2 2 EMC 8.2 22.6 female 128.9 5 360
## 3 4 VUMC 3.5 37.3 male 168.1 6 335
## 4 4 VUMC 4.9 37.3 male 147.7 7 1002
## 5 4 VUMC 6.4 37.3 male 153.1 8 808
## 6 4 VUMC 8.1 37.3 male 149.6 9 615
## 7 5 LUMC 2.2 49.2 male 152.6 10 932
## 8 5 LUMC 6.4 49.2 male 137.7 11 413
Note:
Here it does not actually matter how we specify the first element of the argument suffixes
because it is not used.
You may have noticed that the merged dat3
only contains very few rows.
all
.
By default, the argument all
is set to FALSE
. This means that only those rows are kept that have identical entries for all merging variables in both datasets. Only a few values of time
and labtime
are identical.
When we set all = TRUE
all rows will be kept. Where time
and labtime
do not match, two rows are created and NA
is filled in.
<- merge(dat1, labdat,
dat4 by.x = c('id', 'center', 'time'),
by.y = c('ID', 'center', 'labtime'),
suffixes = c('_base_FU', '_lab'), all = TRUE)
dat4
## id center time age sex weight id_lab chol
## 1 1 EMC 3.1 NA <NA> NA 1 347
## 2 1 EMC 3.6 37.9 male 136.7 NA NA
## 3 1 EMC 5.3 37.9 male 131.5 NA NA
## 4 1 EMC 6.0 NA <NA> NA 2 309
## 5 1 EMC 7.8 37.9 male 135.0 NA NA
## 6 2 EMC 4.3 22.6 female 113.7 3 215
## 7 2 EMC 5.8 NA <NA> NA 4 436
## 8 2 EMC 8.2 22.6 female 128.9 5 360
## 9 3 LUMC 5.3 26.9 female 97.1 NA NA
## 10 4 VUMC 3.5 37.3 male 168.1 6 335
## 11 4 VUMC 4.9 37.3 male 147.7 7 1002
## 12 4 VUMC 6.4 37.3 male 153.1 8 808
## 13 4 VUMC 8.1 37.3 male 149.6 9 615
## 14 5 LUMC 2.2 49.2 male 152.6 10 932
## 15 5 LUMC 6.4 49.2 male 137.7 11 413
## 16 5 LUMC 7.0 49.2 male 133.1 NA NA
## 17 6 LUMC 4.7 NA <NA> NA 12 337
© Nicole Erler