summarising_data.Rmd
rtrackr provides data logging for every record in a dataset throughout the processing chain. In most cases, when records are altered or one record is divided into multiple records, rtrackr simply assigns a new trackr_id and logs the change when a record is updated.
When data is summarised, on the other hand (multiple records become a single record), rtrackr needs to record the trackr_ids of all parent records. trackr_summarise() provides a convenient way to summarise data without losing information in the trackr_id column.
trackr_summarise() works by combining all parent ids into one row, separated by a “,”. The same operation would work for combining records manually outside of R.
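To make the id-combining step concrete, here is a minimal base R sketch of the idea. It is not taken from the rtrackr source: the helper name combine_parent_ids() and the short ids are hypothetical, and it assumes a data frame with a trackr_id column and a single grouping column.

```r
# Hypothetical helper: collapse the trackr_id values of each group into one
# comma-separated string. Illustration only; not the rtrackr implementation.
combine_parent_ids <- function(df, group_col) {
  # split() groups the ids by the grouping column; each group is then
  # pasted into a single string.
  ids <- vapply(
    split(df$trackr_id, df[[group_col]]),
    paste,
    character(1),
    collapse = ", "
  )
  out <- data.frame(names(ids), unname(ids), stringsAsFactors = FALSE)
  names(out) <- c(group_col, "trackr_id")
  out
}

# Made-up short ids stand in for the real hashes.
records <- data.frame(
  a = c("a", "b", "a", "b"),
  trackr_id = c("t1_r1", "t1_r2", "t1_r3", "t1_r4"),
  stringsAsFactors = FALSE
)
combined <- combine_parent_ids(records, "a")
```

Each row of combined carries the ids of every record that was folded into it, which is the information a plain summary would otherwise discard.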
We will use a simple workflow to demonstrate the use of trackr_summarise() in a data processing chain. Continuing from getting started, we will create a new dataset and log a new processing timepoint with trackr_new().
trackr_dir <- '~/Documents/trackr_dir'
df <- data.frame(a = c('a', 'b', 'c'), b = c(1, 2, 3))
df <- trackr_new(df, trackr_dir = trackr_dir, suppress_success = TRUE)
a | b | trackr_id |
---|---|---|
a | 1 | b479ebc38592e0529bc3177b16fcf060d0217389_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32 |
b | 2 | b479ebc38592e0529bc3177b16fcf060d0217389_d6543aeb67806714fa9e9567dc5c46b2106ae843 |
c | 3 | b479ebc38592e0529bc3177b16fcf060d0217389_acecf5a5dfb445683737c2f7f135b18c5eaee1a7 |
Now, we will bind the dataset to itself, and make a change to one version.
df <- rbind(df, df %>% dplyr::mutate(b = b + 1))
df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Merged dataframes', suppress_success = TRUE)
a | b | trackr_id |
---|---|---|
a | 1 | 8f5b26808f687c079adce500a5e344cff43ca72e_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32 |
b | 2 | 8f5b26808f687c079adce500a5e344cff43ca72e_d6543aeb67806714fa9e9567dc5c46b2106ae843 |
c | 3 | 8f5b26808f687c079adce500a5e344cff43ca72e_acecf5a5dfb445683737c2f7f135b18c5eaee1a7 |
a | 2 | 8f5b26808f687c079adce500a5e344cff43ca72e_9b298ab2c9539c0d7f187ce26315f459fa58e78e |
b | 3 | 8f5b26808f687c079adce500a5e344cff43ca72e_0d62e87ad3877b0c51934dd0c38cba3467375b09 |
c | 4 | 8f5b26808f687c079adce500a5e344cff43ca72e_b1ba9407cf19da9a3625e46f5378484323799a05 |
trackr_summarise() is a simple wrapper around dplyr::summarise() and accepts the same arguments.
df <- df %>% dplyr::group_by(a) %>% trackr_summarise(n = dplyr::n())
a | n | trackr_id |
---|---|---|
a | 2 | 8f5b26808f687c079adce500a5e344cff43ca72e_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32, 8f5b26808f687c079adce500a5e344cff43ca72e_9b298ab2c9539c0d7f187ce26315f459fa58e78e |
b | 2 | 8f5b26808f687c079adce500a5e344cff43ca72e_d6543aeb67806714fa9e9567dc5c46b2106ae843, 8f5b26808f687c079adce500a5e344cff43ca72e_0d62e87ad3877b0c51934dd0c38cba3467375b09 |
c | 2 | 8f5b26808f687c079adce500a5e344cff43ca72e_acecf5a5dfb445683737c2f7f135b18c5eaee1a7, 8f5b26808f687c079adce500a5e344cff43ca72e_b1ba9407cf19da9a3625e46f5378484323799a05 |
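Because the parent ids are stored as a single comma-separated string, they can be recovered with ordinary string splitting. The sketch below is an illustration in base R; split_parent_ids() is a hypothetical helper, not an rtrackr function, and it assumes the ids themselves never contain a comma.

```r
# Hypothetical helper: recover the parent ids from a combined trackr_id.
split_parent_ids <- function(combined_id) {
  # Split on commas, then drop surrounding whitespace from each piece.
  trimws(strsplit(combined_id, ",", fixed = TRUE)[[1]])
}

# Combined id for group "a", copied from the summarised table above.
combined_id <- paste0(
  "8f5b26808f687c079adce500a5e344cff43ca72e_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32, ",
  "8f5b26808f687c079adce500a5e344cff43ca72e_9b298ab2c9539c0d7f187ce26315f459fa58e78e"
)
parents <- split_parent_ids(combined_id)
```

Here parents is a character vector with one element per parent record, so each parent can be looked up individually.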
Now, we can log a new timepoint with trackr_timepoint().
df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Summarised dataframes', suppress_success = TRUE)
a | n | trackr_id |
---|---|---|
a | 2 | 2ac72639f7e4ff60f9aaaa9ed990a286bd612138_9b298ab2c9539c0d7f187ce26315f459fa58e78e |
b | 2 | 2ac72639f7e4ff60f9aaaa9ed990a286bd612138_d6543aeb67806714fa9e9567dc5c46b2106ae843 |
c | 2 | 2ac72639f7e4ff60f9aaaa9ed990a286bd612138_c7a74dd8ad5c5743a5983d3102600cc7f9df9370 |
We will make and log one more change, to better visualize the effect of the summarise operation.
df <- df %>% dplyr::mutate(n = n + 100)
df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'Added 100', suppress_success = TRUE)
a | n | trackr_id |
---|---|---|
a | 102 | a9af0af14bc75b4aeec4e8ba01fffea302a60522_a6fdb93cf5f88eca9f0526de0e67d93bffad81f4 |
b | 102 | a9af0af14bc75b4aeec4e8ba01fffea302a60522_2e04d93f061e33d2966c96a229cfe59825948396 |
c | 102 | a9af0af14bc75b4aeec4e8ba01fffea302a60522_7bd3eee84fb112faf3c18e0765d496d4471b7432 |
To visualize this operation on one record, we create a trackr_lineage and trackr_network. See getting started for more information.
target_id <- df$trackr_id[1]
trackr_lineage(target_id, trackr_dir)
## [1] "Successfully written a9af0af14bc75b4aeec4e8ba01fffea302a60522_a6fdb93cf5f88eca9f0526de0e67d93bffad81f4_lineage.json"
lineage_fn <- paste0(trackr_dir, '/', target_id, '_lineage.json')
trackr_network(lineage_fn)
clean_trackr_dir(trackr_dir)
Article by Hamish Gibbs 2020-06-18 15:47:42. To report a problem with this package, please create an issue on GitHub.