In this practical, we will work with example datasets that are created by the following syntax:
basedat <- data.frame(id = 1:5,
age = c(37.9, 22.6, 26.9, 37.3, 49.2),
sex = factor(c('male', 'female', 'female', 'male', 'male')),
center = c('EMC', 'EMC', 'LUMC', 'VUMC', 'LUMC')
)
FUdat <- data.frame(id = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 5),
time = c(3.6, 5.3, 7.8, 4.3, 8.2, 5.3, 3.5,
4.9, 6.4, 8.1, 2.2, 6.4, 7.0),
weight = c(136.7, 131.5, 135.0, 113.7, 128.9, 97.1,
168.1, 147.7, 153.1, 149.6, 152.6, 137.7, 133.1)
)
labdat <- data.frame(id = 1:12,
ID = c(1, 1, 2, 2, 2, 4, 4, 4, 4, 5, 5, 6),
labtime = c(3.1, 6.0, 4.3, 5.8, 8.2, 3.5,
4.9, 6.4, 8.1, 2.2, 6.4, 4.7),
chol = c(347, 309, 215, 436, 360, 335, 1002, 808, 615,
932, 413, 337),
center = c(rep('EMC', 5), rep('VUMC', 4), rep('LUMC', 3))
)The resulting datasets are:
basedat| id | age | sex | center |
|---|---|---|---|
| 1 | 37.9 | male | EMC |
| 2 | 22.6 | female | EMC |
| 3 | 26.9 | female | LUMC |
| 4 | 37.3 | male | VUMC |
| 5 | 49.2 | male | LUMC |
| id | time | weight |
|---|---|---|
| 1 | 3.6 | 136.7 |
| 1 | 5.3 | 131.5 |
| 1 | 7.8 | 135.0 |
| 2 | 4.3 | 113.7 |
| 2 | 8.2 | 128.9 |
| 3 | 5.3 | 97.1 |
| 4 | 3.5 | 168.1 |
| 4 | 4.9 | 147.7 |
| 4 | 6.4 | 153.1 |
| 4 | 8.1 | 149.6 |
| 5 | 2.2 | 152.6 |
| 5 | 6.4 | 137.7 |
| 5 | 7.0 | 133.1 |
| id | ID | labtime | chol | center |
|---|---|---|---|---|
| 1 | 1 | 3.1 | 347 | EMC |
| 2 | 1 | 6.0 | 309 | EMC |
| 3 | 2 | 4.3 | 215 | EMC |
| 4 | 2 | 5.8 | 436 | EMC |
| 5 | 2 | 8.2 | 360 | EMC |
| 6 | 4 | 3.5 | 335 | VUMC |
| 7 | 4 | 4.9 | 1002 | VUMC |
| 8 | 4 | 6.4 | 808 | VUMC |
| 9 | 4 | 8.1 | 615 | VUMC |
| 10 | 5 | 2.2 | 932 | LUMC |
| 11 | 5 | 6.4 | 413 | LUMC |
| 12 | 6 | 4.7 | 337 | LUMC |
basedat and FUdatRun the syntax given above to create the datasets basedat which contains “baseline” information on each patient and FUdat, which contains repeated measurements of a variable weight.
Merge basedat and FUdat by the variable id.
merge().
dat1 <- merge(basedat, FUdat)
dat1## id age sex center time weight
## 1 1 37.9 male EMC 3.6 136.7
## 2 1 37.9 male EMC 5.3 131.5
## 3 1 37.9 male EMC 7.8 135.0
## 4 2 22.6 female EMC 4.3 113.7
## 5 2 22.6 female EMC 8.2 128.9
## 6 3 26.9 female LUMC 5.3 97.1
## 7 4 37.3 male VUMC 3.5 168.1
## 8 4 37.3 male VUMC 4.9 147.7
## 9 4 37.3 male VUMC 6.4 153.1
## 10 4 37.3 male VUMC 8.1 149.6
## 11 5 49.2 male LUMC 2.2 152.6
## 12 5 49.2 male LUMC 6.4 137.7
## 13 5 49.2 male LUMC 7.0 133.1
Check for two of the subjects that the merged data are correct.
subset() could be used here.
# for subject 1
subset(basedat, id == 1)## id age sex center
## 1 1 37.9 male EMC
subset(FUdat, id == 1)## id time weight
## 1 1 3.6 136.7
## 2 1 5.3 131.5
## 3 1 7.8 135.0
subset(dat1, id == 1)## id age sex center time weight
## 1 1 37.9 male EMC 3.6 136.7
## 2 1 37.9 male EMC 5.3 131.5
## 3 1 37.9 male EMC 7.8 135.0
# for subject 4, using a different method of subsetting (just as a reminder how that worked)
basedat[basedat$id == 4, ]## id age sex center
## 4 4 37.3 male VUMC
FUdat[FUdat$id == 4, ]## id time weight
## 7 4 3.5 168.1
## 8 4 4.9 147.7
## 9 4 6.4 153.1
## 10 4 8.1 149.6
dat1[dat1$id == 4, ]## id age sex center time weight
## 7 4 37.3 male VUMC 3.5 168.1
## 8 4 37.3 male VUMC 4.9 147.7
## 9 4 37.3 male VUMC 6.4 153.1
## 10 4 37.3 male VUMC 8.1 149.6
Check that the resulting dat1 has the correct dimensions.
dim() could be used here.
dim(dat1)## [1] 13 6
dim(FUdat)## [1] 13 3
dim(basedat)## [1] 5 4
FUdat and basedat contain data on the exact same set of subjects and only the id variable is overlapping between the two datasets. Therefore, the resulting dat1 should have the same number of rows as the data containing the repeated measurements (FUdat) and has one column less than the sum of the number of columns of both input datasets (6 = 3 + 4 - 1).
labdatWe now want to merge the previously merged data dat1 with labdat.
Note that in labdat we have the variables
id: sample numberID: patient identifierMerge dat1 and labdat by patient identifier.
by.x and by.y.
dat2 <- merge(dat1, labdat, by.x = 'id', by.y = 'ID')
dat2## id age sex center.x time weight id.y labtime chol center.y
## 1 1 37.9 male EMC 3.6 136.7 1 3.1 347 EMC
## 2 1 37.9 male EMC 3.6 136.7 2 6.0 309 EMC
## 3 1 37.9 male EMC 7.8 135.0 1 3.1 347 EMC
## 4 1 37.9 male EMC 7.8 135.0 2 6.0 309 EMC
## 5 1 37.9 male EMC 5.3 131.5 1 3.1 347 EMC
## 6 1 37.9 male EMC 5.3 131.5 2 6.0 309 EMC
## 7 2 22.6 female EMC 4.3 113.7 3 4.3 215 EMC
## 8 2 22.6 female EMC 4.3 113.7 5 8.2 360 EMC
## 9 2 22.6 female EMC 4.3 113.7 4 5.8 436 EMC
## 10 2 22.6 female EMC 8.2 128.9 3 4.3 215 EMC
## 11 2 22.6 female EMC 8.2 128.9 5 8.2 360 EMC
## 12 2 22.6 female EMC 8.2 128.9 4 5.8 436 EMC
## 13 4 37.3 male VUMC 3.5 168.1 6 3.5 335 VUMC
## 14 4 37.3 male VUMC 3.5 168.1 7 4.9 1002 VUMC
## 15 4 37.3 male VUMC 3.5 168.1 9 8.1 615 VUMC
## 16 4 37.3 male VUMC 3.5 168.1 8 6.4 808 VUMC
## 17 4 37.3 male VUMC 4.9 147.7 6 3.5 335 VUMC
## 18 4 37.3 male VUMC 4.9 147.7 7 4.9 1002 VUMC
## 19 4 37.3 male VUMC 4.9 147.7 9 8.1 615 VUMC
## 20 4 37.3 male VUMC 4.9 147.7 8 6.4 808 VUMC
## 21 4 37.3 male VUMC 6.4 153.1 6 3.5 335 VUMC
## 22 4 37.3 male VUMC 6.4 153.1 7 4.9 1002 VUMC
## 23 4 37.3 male VUMC 6.4 153.1 9 8.1 615 VUMC
## 24 4 37.3 male VUMC 6.4 153.1 8 6.4 808 VUMC
## 25 4 37.3 male VUMC 8.1 149.6 6 3.5 335 VUMC
## 26 4 37.3 male VUMC 8.1 149.6 7 4.9 1002 VUMC
## 27 4 37.3 male VUMC 8.1 149.6 9 8.1 615 VUMC
## 28 4 37.3 male VUMC 8.1 149.6 8 6.4 808 VUMC
## 29 5 49.2 male LUMC 2.2 152.6 10 2.2 932 LUMC
## 30 5 49.2 male LUMC 2.2 152.6 11 6.4 413 LUMC
## 31 5 49.2 male LUMC 6.4 137.7 10 2.2 932 LUMC
## 32 5 49.2 male LUMC 6.4 137.7 11 6.4 413 LUMC
## 33 5 49.2 male LUMC 7.0 133.1 10 2.2 932 LUMC
## 34 5 49.2 male LUMC 7.0 133.1 11 6.4 413 LUMC
Did you notice that in the resulting dat2 we now have the two variables center.x and center.y, and that we have a variable id.y (which is the sample number from labdat)?
Moreover, there are two time variables: time (from dat1) and labtime (from labdat).
Re-do the merge of dat1 and labdat taking into account the following points
id.y) a more informative suffixby.x and by.y.
suffixes.
dat3 <- merge(dat1, labdat,
by.x = c('id', 'center', 'time'),
by.y = c('ID', 'center', 'labtime'),
suffixes = c('_base_FU', '_lab'))dat3## id center time age sex weight id_lab chol
## 1 2 EMC 4.3 22.6 female 113.7 3 215
## 2 2 EMC 8.2 22.6 female 128.9 5 360
## 3 4 VUMC 3.5 37.3 male 168.1 6 335
## 4 4 VUMC 4.9 37.3 male 147.7 7 1002
## 5 4 VUMC 6.4 37.3 male 153.1 8 808
## 6 4 VUMC 8.1 37.3 male 149.6 9 615
## 7 5 LUMC 2.2 49.2 male 152.6 10 932
## 8 5 LUMC 6.4 49.2 male 137.7 11 413
Note:
Here it does not actually matter how we specify the first element of the argument suffixes because it is not used.
You may have noticed that the merged dat3 only contains very few rows.
all.
By default, the argument all is set to FALSE. This means that only those rows are kept that have identical entries for all merging variables in both datasets. Only a few values of time and labtime are identical.
When we set all = TRUE all rows will be kept. Where time and labtime do not match, two rows are created and NA is filled in.
dat4 <- merge(dat1, labdat,
by.x = c('id', 'center', 'time'),
by.y = c('ID', 'center', 'labtime'),
suffixes = c('_base_FU', '_lab'), all = TRUE)dat4## id center time age sex weight id_lab chol
## 1 1 EMC 3.1 NA <NA> NA 1 347
## 2 1 EMC 3.6 37.9 male 136.7 NA NA
## 3 1 EMC 5.3 37.9 male 131.5 NA NA
## 4 1 EMC 6.0 NA <NA> NA 2 309
## 5 1 EMC 7.8 37.9 male 135.0 NA NA
## 6 2 EMC 4.3 22.6 female 113.7 3 215
## 7 2 EMC 5.8 NA <NA> NA 4 436
## 8 2 EMC 8.2 22.6 female 128.9 5 360
## 9 3 LUMC 5.3 26.9 female 97.1 NA NA
## 10 4 VUMC 3.5 37.3 male 168.1 6 335
## 11 4 VUMC 4.9 37.3 male 147.7 7 1002
## 12 4 VUMC 6.4 37.3 male 153.1 8 808
## 13 4 VUMC 8.1 37.3 male 149.6 9 615
## 14 5 LUMC 2.2 49.2 male 152.6 10 932
## 15 5 LUMC 6.4 49.2 male 137.7 11 413
## 16 5 LUMC 7.0 49.2 male 133.1 NA NA
## 17 6 LUMC 4.7 NA <NA> NA 12 337
© Nicole Erler