Merging Datasets

The aim of this practical is to practice how to merge multiple datasets.

Data

In this practical, we will work with example datasets that are created by the following syntax:

basedat <- data.frame(id = 1:5,
                      age = c(37.9, 22.6, 26.9, 37.3, 49.2),
                      sex = factor(c('male', 'female', 'female', 'male', 'male')),
                      center = c('EMC', 'EMC', 'LUMC', 'VUMC', 'LUMC')
)

FUdat <- data.frame(id = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 5),
                    time = c(3.6, 5.3, 7.8, 4.3, 8.2, 5.3, 3.5, 
                             4.9, 6.4, 8.1, 2.2, 6.4, 7.0),
                    weight = c(136.7, 131.5, 135.0, 113.7, 128.9, 97.1, 
                               168.1, 147.7, 153.1, 149.6, 152.6, 137.7, 133.1)
)

labdat <- data.frame(id = 1:12,
                     ID = c(1, 1, 2, 2, 2, 4, 4, 4, 4, 5, 5, 6),
                     labtime = c(3.1, 6.0, 4.3, 5.8, 8.2, 3.5, 
                                 4.9, 6.4, 8.1, 2.2, 6.4, 4.7),
                     chol = c(347, 309, 215, 436, 360, 335, 1002, 808, 615, 
                              932, 413, 337),
                     center = c(rep('EMC', 5), rep('VUMC', 4), rep('LUMC', 3))
)

The resulting datasets are:

basedat, a dataset with some baseline information on the patients:

id	age	sex	center
1	37.9	male	EMC
2	22.6	female	EMC
3	26.9	female	LUMC
4	37.3	male	VUMC
5	49.2	male	LUMC

FUdat, a longitudinal dataset containing repeatedly measured values of weight:

id	time	weight
1	3.6	136.7
1	5.3	131.5
1	7.8	135.0
2	4.3	113.7
2	8.2	128.9
3	5.3	97.1
4	3.5	168.1
4	4.9	147.7
4	6.4	153.1
4	8.1	149.6
5	2.2	152.6
5	6.4	137.7
5	7.0	133.1

labdat, a longitudinal dataset with repeatedly measured laboratory values:

id	ID	labtime	chol	center
1	1	3.1	347	EMC
2	1	6.0	309	EMC
3	2	4.3	215	EMC
4	2	5.8	436	EMC
5	2	8.2	360	EMC
6	4	3.5	335	VUMC
7	4	4.9	1002	VUMC
8	4	6.4	808	VUMC
9	4	8.1	615	VUMC
10	5	2.2	932	LUMC
11	5	6.4	413	LUMC
12	6	4.7	337	LUMC

Merging `basedat` and `FUdat`

Task 1

Run the syntax given above to create the datasets basedat which contains “baseline” information on each patient and FUdat, which contains repeated measurements of a variable weight.
Familiarize yourself with the function merge() by reading the help file.
Merge basedat and FUdat by the variable id.

Solution 1

dat1 <- merge(basedat, FUdat)
dat1

##    id  age    sex center time weight
## 1   1 37.9   male    EMC  3.6  136.7
## 2   1 37.9   male    EMC  5.3  131.5
## 3   1 37.9   male    EMC  7.8  135.0
## 4   2 22.6 female    EMC  4.3  113.7
## 5   2 22.6 female    EMC  8.2  128.9
## 6   3 26.9 female   LUMC  5.3   97.1
## 7   4 37.3   male   VUMC  3.5  168.1
## 8   4 37.3   male   VUMC  4.9  147.7
## 9   4 37.3   male   VUMC  6.4  153.1
## 10  4 37.3   male   VUMC  8.1  149.6
## 11  5 49.2   male   LUMC  2.2  152.6
## 12  5 49.2   male   LUMC  6.4  137.7
## 13  5 49.2   male   LUMC  7.0  133.1

Task 2

Check for two of the subjects that the merged data are correct.

The function subset() could be used here.

Solution 2

# for subject 1
subset(basedat, id == 1)

##   id  age  sex center
## 1  1 37.9 male    EMC

subset(FUdat, id == 1)

##   id time weight
## 1  1  3.6  136.7
## 2  1  5.3  131.5
## 3  1  7.8  135.0

subset(dat1, id == 1)

##   id  age  sex center time weight
## 1  1 37.9 male    EMC  3.6  136.7
## 2  1 37.9 male    EMC  5.3  131.5
## 3  1 37.9 male    EMC  7.8  135.0

Looks OK, we now have the information from the baseline dataset repeated in each row for the repeated measurements of the patient.

# for subject 4, using a different method of subsetting (just as a reminder how that worked)
basedat[basedat$id == 4, ]

##   id  age  sex center
## 4  4 37.3 male   VUMC

FUdat[FUdat$id == 4, ]

##    id time weight
## 7   4  3.5  168.1
## 8   4  4.9  147.7
## 9   4  6.4  153.1
## 10  4  8.1  149.6

dat1[dat1$id == 4, ]

##    id  age  sex center time weight
## 7   4 37.3 male   VUMC  3.5  168.1
## 8   4 37.3 male   VUMC  4.9  147.7
## 9   4 37.3 male   VUMC  6.4  153.1
## 10  4 37.3 male   VUMC  8.1  149.6

Task 3

Check that the resulting dat1 has the correct dimensions.

The function dim() could be used here.

Solution 3

dim(dat1)

## [1] 13  6

dim(FUdat)

## [1] 13  3

dim(basedat)

## [1] 5 4

FUdat and basedat contain data on the exact same set of subjects and only the id variable is overlapping between the two datasets. Therefore, the resulting dat1 should have the same number of rows as the data containing the repeated measurements (FUdat) and has one column less than the sum of the number of columns of both input datasets (6 = 3 + 4 - 1).

Merging with `labdat`

We now want to merge the previously merged data dat1 with labdat.

Task 1

Note that in labdat we have the variables

id: sample number
ID: patient identifier

Merge dat1 and labdat by patient identifier.

Use the arguments by.x and by.y.

Solution 1

dat2 <- merge(dat1, labdat, by.x = 'id', by.y = 'ID')
dat2

##    id  age    sex center.x time weight id.y labtime chol center.y
## 1   1 37.9   male      EMC  3.6  136.7    1     3.1  347      EMC
## 2   1 37.9   male      EMC  3.6  136.7    2     6.0  309      EMC
## 3   1 37.9   male      EMC  7.8  135.0    1     3.1  347      EMC
## 4   1 37.9   male      EMC  7.8  135.0    2     6.0  309      EMC
## 5   1 37.9   male      EMC  5.3  131.5    1     3.1  347      EMC
## 6   1 37.9   male      EMC  5.3  131.5    2     6.0  309      EMC
## 7   2 22.6 female      EMC  4.3  113.7    3     4.3  215      EMC
## 8   2 22.6 female      EMC  4.3  113.7    5     8.2  360      EMC
## 9   2 22.6 female      EMC  4.3  113.7    4     5.8  436      EMC
## 10  2 22.6 female      EMC  8.2  128.9    3     4.3  215      EMC
## 11  2 22.6 female      EMC  8.2  128.9    5     8.2  360      EMC
## 12  2 22.6 female      EMC  8.2  128.9    4     5.8  436      EMC
## 13  4 37.3   male     VUMC  3.5  168.1    6     3.5  335     VUMC
## 14  4 37.3   male     VUMC  3.5  168.1    7     4.9 1002     VUMC
## 15  4 37.3   male     VUMC  3.5  168.1    9     8.1  615     VUMC
## 16  4 37.3   male     VUMC  3.5  168.1    8     6.4  808     VUMC
## 17  4 37.3   male     VUMC  4.9  147.7    6     3.5  335     VUMC
## 18  4 37.3   male     VUMC  4.9  147.7    7     4.9 1002     VUMC
## 19  4 37.3   male     VUMC  4.9  147.7    9     8.1  615     VUMC
## 20  4 37.3   male     VUMC  4.9  147.7    8     6.4  808     VUMC
## 21  4 37.3   male     VUMC  6.4  153.1    6     3.5  335     VUMC
## 22  4 37.3   male     VUMC  6.4  153.1    7     4.9 1002     VUMC
## 23  4 37.3   male     VUMC  6.4  153.1    9     8.1  615     VUMC
## 24  4 37.3   male     VUMC  6.4  153.1    8     6.4  808     VUMC
## 25  4 37.3   male     VUMC  8.1  149.6    6     3.5  335     VUMC
## 26  4 37.3   male     VUMC  8.1  149.6    7     4.9 1002     VUMC
## 27  4 37.3   male     VUMC  8.1  149.6    9     8.1  615     VUMC
## 28  4 37.3   male     VUMC  8.1  149.6    8     6.4  808     VUMC
## 29  5 49.2   male     LUMC  2.2  152.6   10     2.2  932     LUMC
## 30  5 49.2   male     LUMC  2.2  152.6   11     6.4  413     LUMC
## 31  5 49.2   male     LUMC  6.4  137.7   10     2.2  932     LUMC
## 32  5 49.2   male     LUMC  6.4  137.7   11     6.4  413     LUMC
## 33  5 49.2   male     LUMC  7.0  133.1   10     2.2  932     LUMC
## 34  5 49.2   male     LUMC  7.0  133.1   11     6.4  413     LUMC

Task 2

Did you notice that in the resulting dat2 we now have the two variables center.x and center.y, and that we have a variable id.y (which is the sample number from labdat)?

Moreover, there are two time variables: time (from dat1) and labtime (from labdat).

Re-do the merge of dat1 and labdat taking into account the following points

merge by patient identifier and time of measurement
make sure there is only one “center” variable
give the sample number (id.y) a more informative suffix

Use the arguments by.x and by.y.

Use the argument suffixes.

Solution 2

dat3 <- merge(dat1, labdat, 
              by.x = c('id', 'center', 'time'),
              by.y = c('ID', 'center', 'labtime'),
              suffixes = c('_base_FU', '_lab'))

dat3

##   id center time  age    sex weight id_lab chol
## 1  2    EMC  4.3 22.6 female  113.7      3  215
## 2  2    EMC  8.2 22.6 female  128.9      5  360
## 3  4   VUMC  3.5 37.3   male  168.1      6  335
## 4  4   VUMC  4.9 37.3   male  147.7      7 1002
## 5  4   VUMC  6.4 37.3   male  153.1      8  808
## 6  4   VUMC  8.1 37.3   male  149.6      9  615
## 7  5   LUMC  2.2 49.2   male  152.6     10  932
## 8  5   LUMC  6.4 49.2   male  137.7     11  413

Here it does not actually matter how we specify the first element of the argument suffixes because it is not used.

Task 3

You may have noticed that the merged dat3 only contains very few rows.

Can you explain what happened?
Re-do the last merge so that the resulting dataset has the maximal number of rows.

You need to use the argument all.

Solution 3

By default, the argument all is set to FALSE. This means that only those rows are kept that have identical entries for all merging variables in both datasets. Only a few values of time and labtime are identical.

When we set all = TRUE all rows will be kept. Where time and labtime do not match, two rows are created and NA is filled in.

dat4 <- merge(dat1, labdat, 
              by.x = c('id', 'center', 'time'),
              by.y = c('ID', 'center', 'labtime'),
              suffixes = c('_base_FU', '_lab'), all = TRUE)

dat4

##    id center time  age    sex weight id_lab chol
## 1   1    EMC  3.1   NA   <NA>     NA      1  347
## 2   1    EMC  3.6 37.9   male  136.7     NA   NA
## 3   1    EMC  5.3 37.9   male  131.5     NA   NA
## 4   1    EMC  6.0   NA   <NA>     NA      2  309
## 5   1    EMC  7.8 37.9   male  135.0     NA   NA
## 6   2    EMC  4.3 22.6 female  113.7      3  215
## 7   2    EMC  5.8   NA   <NA>     NA      4  436
## 8   2    EMC  8.2 22.6 female  128.9      5  360
## 9   3   LUMC  5.3 26.9 female   97.1     NA   NA
## 10  4   VUMC  3.5 37.3   male  168.1      6  335
## 11  4   VUMC  4.9 37.3   male  147.7      7 1002
## 12  4   VUMC  6.4 37.3   male  153.1      8  808
## 13  4   VUMC  8.1 37.3   male  149.6      9  615
## 14  5   LUMC  2.2 49.2   male  152.6     10  932
## 15  5   LUMC  6.4 49.2   male  137.7     11  413
## 16  5   LUMC  7.0 49.2   male  133.1     NA   NA
## 17  6   LUMC  4.7   NA   <NA>     NA     12  337

Merging Datasets

Biostatistics II: Introduction to R

Data

Merging basedat and FUdat

Task 1

Solution 1

Task 2

Solution 2

Task 3

Solution 3

Merging with labdat

Task 1

Solution 1

Task 2

Solution 2

Task 3

Solution 3

Merging `basedat` and `FUdat`

Merging with `labdat`