Overview

rtrackr logs every record in a dataset throughout the processing chain. In most cases, when a record is altered or split into multiple records, rtrackr simply assigns a new trackr_id and logs the change.

When data is summarised (multiple records become a single record), however, rtrackr needs to record the trackr_ids of all parent records. trackr_summarise() provides a convenient way to summarise data without losing information in the trackr_id column.

trackr_summarise() works by combining all parent ids into the trackr_id of the summarised row, separated by commas. The same convention can be used when records are combined manually outside of R.
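
As a minimal sketch of this idea (not rtrackr's actual implementation), the base R snippet below collapses the ids of each group into one comma-separated string and then splits one string back into its parent ids. The data frame and the placeholder ids are hypothetical, and the ", " separator is assumed from the output printed in the example workflow.

```r
# Toy data with placeholder ids (hypothetical, not real trackr ids).
records <- data.frame(
  a = c("a", "b", "a", "b"),
  trackr_id = c("id1", "id2", "id3", "id4"),
  stringsAsFactors = FALSE
)

# Collapse the ids of each group in `a` into one comma-separated string.
combined <- aggregate(trackr_id ~ a, data = records, FUN = paste, collapse = ", ")

# Recover the parent ids of the first summarised row.
parents <- strsplit(combined$trackr_id[1], ", ", fixed = TRUE)[[1]]
```

Because the parent ids are stored as plain text, they can be recovered with any tool that can split a string on the separator.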

Example workflow

To demonstrate the use of trackr_summarise() in a data processing chain, we will use a simple workflow. Continuing from the getting started article, we will create a new dataset and log a new processing timepoint with trackr_new().

library(rtrackr)
library(magrittr)  # provides the %>% pipe used below

trackr_dir <- '~/Documents/trackr_dir'
df <- data.frame(a = c('a', 'b', 'c'), b = c(1, 2, 3))
df <- trackr_new(df, trackr_dir = trackr_dir, suppress_success = TRUE)
a b trackr_id
a 1 b479ebc38592e0529bc3177b16fcf060d0217389_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32
b 2 b479ebc38592e0529bc3177b16fcf060d0217389_d6543aeb67806714fa9e9567dc5c46b2106ae843
c 3 b479ebc38592e0529bc3177b16fcf060d0217389_acecf5a5dfb445683737c2f7f135b18c5eaee1a7

Now, we will bind the dataset to a copy of itself in which column b has been increased by 1, and log the result as a new timepoint.

df <- rbind(df, df %>% dplyr::mutate(b = b + 1))
df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Merged dataframes', suppress_success = TRUE)
a b trackr_id
a 1 8f5b26808f687c079adce500a5e344cff43ca72e_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32
b 2 8f5b26808f687c079adce500a5e344cff43ca72e_d6543aeb67806714fa9e9567dc5c46b2106ae843
c 3 8f5b26808f687c079adce500a5e344cff43ca72e_acecf5a5dfb445683737c2f7f135b18c5eaee1a7
a 2 8f5b26808f687c079adce500a5e344cff43ca72e_9b298ab2c9539c0d7f187ce26315f459fa58e78e
b 3 8f5b26808f687c079adce500a5e344cff43ca72e_0d62e87ad3877b0c51934dd0c38cba3467375b09
c 4 8f5b26808f687c079adce500a5e344cff43ca72e_b1ba9407cf19da9a3625e46f5378484323799a05
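
In the ids printed above, each trackr_id appears to consist of two 40-character hexadecimal strings joined by an underscore, and the first part is the same for every row logged at this timepoint. This is an observation from the output rather than documented behaviour. Under that assumption, the two parts can be separated with base R (the column labels are hypothetical):

```r
# Two ids copied from the table above.
ids <- c(
  "8f5b26808f687c079adce500a5e344cff43ca72e_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32",
  "8f5b26808f687c079adce500a5e344cff43ca72e_9b298ab2c9539c0d7f187ce26315f459fa58e78e"
)

# Split each id at the underscore into a two-column character matrix.
parts <- do.call(rbind, strsplit(ids, "_", fixed = TRUE))
colnames(parts) <- c("prefix", "suffix")  # hypothetical labels

# The prefix is shared by both rows; the suffix differs per record.
shared_prefix <- length(unique(parts[, "prefix"])) == 1
```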

trackr_summarise() is a simple wrapper around dplyr::summarise() and accepts the same arguments. Here, we group the data by column a and count the rows in each group.

df <- df %>%
  dplyr::group_by(a) %>%
  trackr_summarise(n = dplyr::n())
a n trackr_id
a 2 8f5b26808f687c079adce500a5e344cff43ca72e_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32, 8f5b26808f687c079adce500a5e344cff43ca72e_9b298ab2c9539c0d7f187ce26315f459fa58e78e
b 2 8f5b26808f687c079adce500a5e344cff43ca72e_d6543aeb67806714fa9e9567dc5c46b2106ae843, 8f5b26808f687c079adce500a5e344cff43ca72e_0d62e87ad3877b0c51934dd0c38cba3467375b09
c 2 8f5b26808f687c079adce500a5e344cff43ca72e_acecf5a5dfb445683737c2f7f135b18c5eaee1a7, 8f5b26808f687c079adce500a5e344cff43ca72e_b1ba9407cf19da9a3625e46f5378484323799a05
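
For comparison, a roughly equivalent summary can be written with dplyr::summarise() directly. This is a sketch of the behaviour visible in the table above, not the source of trackr_summarise(). The placeholder ids are hypothetical, and the dplyr and magrittr packages are assumed to be installed.

```r
library(magrittr)  # provides the %>% pipe

# Toy data with placeholder ids (hypothetical, not real trackr ids).
records <- data.frame(
  a = c("a", "b", "c", "a", "b", "c"),
  b = c(1, 2, 3, 2, 3, 4),
  trackr_id = paste0("id", 1:6),
  stringsAsFactors = FALSE
)

# Count the rows per group and keep every parent id in one string.
manual <- records %>%
  dplyr::group_by(a) %>%
  dplyr::summarise(
    n = dplyr::n(),
    trackr_id = paste(trackr_id, collapse = ", ")
  )
```

Writing the id handling by hand in every summary is easy to forget, which is the convenience a wrapper such as trackr_summarise() offers.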

Now, we can log a new timepoint with trackr_timepoint().

df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Summarised dataframes', suppress_success = TRUE)
a n trackr_id
a 2 2ac72639f7e4ff60f9aaaa9ed990a286bd612138_9b298ab2c9539c0d7f187ce26315f459fa58e78e
b 2 2ac72639f7e4ff60f9aaaa9ed990a286bd612138_d6543aeb67806714fa9e9567dc5c46b2106ae843
c 2 2ac72639f7e4ff60f9aaaa9ed990a286bd612138_c7a74dd8ad5c5743a5983d3102600cc7f9df9370

We will make and log one more change, to better visualize the effect of the summarise operation.

df <- df %>% dplyr::mutate(n = n + 100)
df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Added 100', suppress_success = TRUE)
a n trackr_id
a 102 a9af0af14bc75b4aeec4e8ba01fffea302a60522_a6fdb93cf5f88eca9f0526de0e67d93bffad81f4
b 102 a9af0af14bc75b4aeec4e8ba01fffea302a60522_2e04d93f061e33d2966c96a229cfe59825948396
c 102 a9af0af14bc75b4aeec4e8ba01fffea302a60522_7bd3eee84fb112faf3c18e0765d496d4471b7432

To visualize this operation for one record, we write its lineage to a JSON file with trackr_lineage() and pass that file to trackr_network(). See the getting started article for more information.

target_id <- df$trackr_id[1]
trackr_lineage(target_id, trackr_dir)
## [1] "Successfully written a9af0af14bc75b4aeec4e8ba01fffea302a60522_a6fdb93cf5f88eca9f0526de0e67d93bffad81f4_lineage.json"
lineage_fn <- file.path(trackr_dir, paste0(target_id, '_lineage.json'))

trackr_network(lineage_fn)

Clean up

clean_trackr_dir(trackr_dir)

Article by Hamish Gibbs 2020-06-18 15:47:42. To report a problem with this package, please create an issue on GitHub.