Installation

rtrackr is currently available through GitHub. To install the development version, use:

remotes::install_github("hamishgibbs/rtrackr")

Logging a new dataset

Dataset tracking starts with the trackr_new() function. Load a new dataset into R and use trackr_new() to log every record in it. Here, we create a simple dataframe as an example.

df <- data.frame(a = c('a', 'b', 'c'), b = c(1, 2, 3))
a b
a 1
b 2
c 3

trackr_new() adds a trackr_id column to the dataframe, the starting point for all data logging with rtrackr. Use the trackr_dir argument to specify where rtrackr log files should be stored (you will see an error if you do not).

trackr_dir <- '~/Documents/trackr_dir'
df <- trackr_new(df, trackr_dir = trackr_dir)
## [1] "Successfully written trackr file c3541976c36e94ef6b25d3a6a70b136f58cd2e26.json"
## [1] "Successfully written trackr data log c3541976c36e94ef6b25d3a6a70b136f58cd2e26_dl.json"
a b trackr_id
a 1 c3541976c36e94ef6b25d3a6a70b136f58cd2e26_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32
b 2 c3541976c36e94ef6b25d3a6a70b136f58cd2e26_d6543aeb67806714fa9e9567dc5c46b2106ae843
c 3 c3541976c36e94ef6b25d3a6a70b136f58cd2e26_acecf5a5dfb445683737c2f7f135b18c5eaee1a7

You should see output similar to the "Successfully written ..." messages printed above. These messages tell us that the dataset has been logged and that two log files have been written. One file (the trackr file) records the lineage of each row in the dataset. The other file (with the suffix "_dl") is the data log: a record of the dataset at this point in time.
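The trackr_id values printed above appear to consist of two hashes joined by an underscore, where the first hash matches the name of the log files written at that step. The sketch below splits an ID on that assumption, using base R only. The helper name split_trackr_id is hypothetical and not part of rtrackr, and the two-part structure is inferred from the output shown here rather than from the package documentation.

```r
# Hypothetical helper (not part of rtrackr): split a trackr_id into its two parts.
# Assumes the ID has the form "<log hash>_<record hash>", as in the output above.
split_trackr_id <- function(id) {
  parts <- strsplit(id, "_", fixed = TRUE)[[1]]
  list(log_hash = parts[1], record_hash = parts[2])
}

id <- "c3541976c36e94ef6b25d3a6a70b136f58cd2e26_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32"
parts <- split_trackr_id(id)

# Names of the two log files that share this prefix
trackr_file <- paste0(parts$log_hash, ".json")
data_log_file <- paste0(parts$log_hash, "_dl.json")
print(c(trackr_file, data_log_file))
```

If the assumption holds, this gives a quick way to find which pair of log files a given row belongs to.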

Defining timepoints

rtrackr is designed to record dataset changes throughout the processing chain, allowing for a full history of important changes to each record. A timepoint marks a particular stage in the processing chain and is similar to a commit in git. You can record a timepoint with trackr_timepoint().

First, we will make a change to the logged dataset.

df <- dplyr::mutate(df, b = b + 1)
a b trackr_id
a 2 c3541976c36e94ef6b25d3a6a70b136f58cd2e26_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32
b 3 c3541976c36e94ef6b25d3a6a70b136f58cd2e26_d6543aeb67806714fa9e9567dc5c46b2106ae843
c 4 c3541976c36e94ef6b25d3a6a70b136f58cd2e26_acecf5a5dfb445683737c2f7f135b18c5eaee1a7

Then, we will log this change as a new timepoint.

df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'First processing step')
## [1] "Successfully written trackr file 60487d1e5ab8465ba4b039ed55a18342ab184454.json"
## [1] "Successfully written trackr data log 60487d1e5ab8465ba4b039ed55a18342ab184454_dl.json"
a b trackr_id
a 2 60487d1e5ab8465ba4b039ed55a18342ab184454_9b298ab2c9539c0d7f187ce26315f459fa58e78e
b 3 60487d1e5ab8465ba4b039ed55a18342ab184454_0d62e87ad3877b0c51934dd0c38cba3467375b09
c 4 60487d1e5ab8465ba4b039ed55a18342ab184454_b1ba9407cf19da9a3625e46f5378484323799a05

You should see the same success messages as you did for trackr_new(), above. Notice the argument timepoint_message, which gives an optional, plain text record of this timepoint, similar to a commit message in git. To suppress success messages, use suppress_success = TRUE (the default is FALSE).

Congratulations! Using rtrackr should be as easy as tracking a new dataset and logging subsequent processing timepoints. For the special case of summarizing multiple records into one, see Summarising data.

Please note: rtrackr uses the current UNIX timestamp (in seconds) to define a trackr_id. Logging multiple timepoints in the same second is not permitted, so trackr_timepoint() waits until the following second before writing a new trackr log file. Repeated calls to trackr_timepoint() may therefore delay code execution. For more information, see how it works.
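To give a feel for the size of this delay, here is a minimal base R sketch of waiting for the next UNIX second. It is not rtrackr's implementation; the helper wait_for_new_second is a hypothetical name used only for illustration.

```r
# Hypothetical helper (not rtrackr's implementation): wait until the clock
# reaches a UNIX second later than `last_second`, then return that second.
wait_for_new_second <- function(last_second) {
  now <- floor(as.numeric(Sys.time()))
  while (now <= last_second) {
    Sys.sleep(0.05)  # short sleep to avoid busy-waiting
    now <- floor(as.numeric(Sys.time()))
  }
  now
}

first_second <- floor(as.numeric(Sys.time()))
started <- Sys.time()
next_second <- wait_for_new_second(first_second)

# Time spent waiting, in seconds
waited <- as.numeric(difftime(Sys.time(), started, units = "secs"))
print(waited)
```

Under this model, a loop that logs many timepoints in quick succession could spend up to roughly one second per call waiting.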

Querying record history

Now that we have logged a dataset and a processing timepoint, we can query the history of a record in the most recent version of the dataset. Use trackr_lineage() to get all parent records of a given trackr_id.

target_id <- df$trackr_id[1]
trackr_lineage(target_id, trackr_dir)
## [1] "Successfully written 60487d1e5ab8465ba4b039ed55a18342ab184454_9b298ab2c9539c0d7f187ce26315f459fa58e78e_lineage.json"

For a more intelligible way to represent the lineage of a trackr_id, use trackr_network() to create an interactive network of the dataset history.

lineage_fn <- file.path(trackr_dir, paste0(target_id, '_lineage.json'))

trackr_network(lineage_fn)

Clean up

After creating, recording, and querying the history of changes to a dataset, use clean_trackr_dir() to remove all log files.

clean_trackr_dir(trackr_dir)
## [1] "Removed: 60487d1e5ab8465ba4b039ed55a18342ab184454_9b298ab2c9539c0d7f187ce26315f459fa58e78e_lineage.json"
## [2] "Removed: 60487d1e5ab8465ba4b039ed55a18342ab184454_dl.json"                                              
## [3] "Removed: 60487d1e5ab8465ba4b039ed55a18342ab184454.json"                                                 
## [4] "Removed: c3541976c36e94ef6b25d3a6a70b136f58cd2e26_dl.json"                                              
## [5] "Removed: c3541976c36e94ef6b25d3a6a70b136f58cd2e26.json"

Warning: clean_trackr_dir() is a simple wrapper function that deletes every .json file in trackr_dir. Use caution when mixing trackr log files with other files. The recommended practice is to create a separate directory for all trackr log files.
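One way to follow this recommendation is to create a dedicated directory and preview its .json files before cleaning up. The sketch below uses base R only and a temporary directory. The directory name trackr_logs and the stand-in files are assumptions for illustration, and nothing here calls rtrackr itself.

```r
# Create a dedicated directory for log files (a temporary one in this sketch).
log_dir <- file.path(tempdir(), "trackr_logs")
dir.create(log_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in files for illustration: one JSON file and one unrelated file.
writeLines("{}", file.path(log_dir, "example.json"))
writeLines("notes", file.path(log_dir, "notes.txt"))

# Preview the files with a .json extension, i.e. those the warning above
# says would be deleted.
json_files <- list.files(log_dir, pattern = "\\.json$")
print(json_files)
```

Checking this list first makes it easier to spot unrelated .json files before they are removed.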

Article by Hamish Gibbs 2020-06-18 15:47:17. To report a problem with this package, please create an issue on GitHub.