getting_started.Rmd
rtrackr is currently available through GitHub. To install the development version, use:
remotes::install_github("hamishgibbs/rtrackr")
Dataset tracking starts with the trackr_new() function. Load a new dataset into R and use trackr_new() to log each record in the dataset. Here, we create a simple dataframe as an example.
df <- data.frame(a = c('a', 'b', 'c'), b = c(1, 2, 3))
a | b |
---|---|
a | 1 |
b | 2 |
c | 3 |
trackr_new() adds a trackr_id column to the dataframe, the starting point for all data logging with rtrackr. Be sure to specify where rtrackr log files should be stored with the trackr_dir argument (you will see an error if you do not).
trackr_dir <- '~/Documents/trackr_dir'
df <- trackr_new(df, trackr_dir = trackr_dir)
## [1] "Successfully written trackr file c3541976c36e94ef6b25d3a6a70b136f58cd2e26.json"
## [1] "Successfully written trackr data log c3541976c36e94ef6b25d3a6a70b136f58cd2e26_dl.json"
a | b | trackr_id |
---|---|---|
a | 1 | c3541976c36e94ef6b25d3a6a70b136f58cd2e26_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32 |
b | 2 | c3541976c36e94ef6b25d3a6a70b136f58cd2e26_d6543aeb67806714fa9e9567dc5c46b2106ae843 |
c | 3 | c3541976c36e94ef6b25d3a6a70b136f58cd2e26_acecf5a5dfb445683737c2f7f135b18c5eaee1a7 |
You should see output similar to the “Successfully written …” messages printed above. These messages tell us that the dataset has been successfully logged and that two log files have been written. One file (the trackr file) records the lineage of each row in the dataset. The other file (with the suffix "_dl") is the data log, a record of the dataset at this point in time.
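Judging only from the output above, each trackr_id appears to consist of two hashes joined by an underscore, with the first part matching the name of the trackr file. The base R sketch below splits an ID on that assumption. split_trackr_id() is a hypothetical helper, not part of rtrackr, and the labels "timepoint" and "record" are an interpretation of the printed table rather than documented behaviour.

```r
# Hypothetical helper (not part of rtrackr): split a trackr_id into the
# two parts seen in the table above. Assumes a "<prefix>_<suffix>" layout.
split_trackr_id <- function(trackr_id) {
  parts <- strsplit(trackr_id, "_", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2)
  list(
    timepoint = parts[1],                     # shared by every row in the table
    record = parts[2],                        # differs from row to row
    trackr_file = paste0(parts[1], ".json"),  # name printed in the first message
    data_log = paste0(parts[1], "_dl.json")   # name printed in the second message
  )
}

id <- "c3541976c36e94ef6b25d3a6a70b136f58cd2e26_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32"
parts <- split_trackr_id(id)
print(parts$trackr_file)
print(parts$data_log)
```

This kind of split can be handy for finding the log files that belong to a given row without opening them.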
rtrackr is designed to record dataset changes throughout the processing chain, allowing for a full history of important changes to each record. A timepoint defines a certain time in the processing chain, and is similar to a commit in git. You can record a timepoint with trackr_timepoint(). First, we will make a change to the logged dataset.
# Call mutate() directly so the example does not depend on %>% being loaded
df <- dplyr::mutate(df, b = b + 1)
a | b | trackr_id |
---|---|---|
a | 2 | c3541976c36e94ef6b25d3a6a70b136f58cd2e26_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32 |
b | 3 | c3541976c36e94ef6b25d3a6a70b136f58cd2e26_d6543aeb67806714fa9e9567dc5c46b2106ae843 |
c | 4 | c3541976c36e94ef6b25d3a6a70b136f58cd2e26_acecf5a5dfb445683737c2f7f135b18c5eaee1a7 |
Then, we will log this change as a new timepoint.
df <- trackr_timepoint(df, trackr_dir = trackr_dir, timepoint_message = 'First processing step')
## [1] "Successfully written trackr file 60487d1e5ab8465ba4b039ed55a18342ab184454.json"
## [1] "Successfully written trackr data log 60487d1e5ab8465ba4b039ed55a18342ab184454_dl.json"
a | b | trackr_id |
---|---|---|
a | 2 | 60487d1e5ab8465ba4b039ed55a18342ab184454_9b298ab2c9539c0d7f187ce26315f459fa58e78e |
b | 3 | 60487d1e5ab8465ba4b039ed55a18342ab184454_0d62e87ad3877b0c51934dd0c38cba3467375b09 |
c | 4 | 60487d1e5ab8465ba4b039ed55a18342ab184454_b1ba9407cf19da9a3625e46f5378484323799a05 |
You should see the same success messages as you did for trackr_new(), above. Notice the argument timepoint_message, which gives an optional, plain-text record of this timepoint, similar to a commit message in git. To suppress success messages, use suppress_success = TRUE (the default is FALSE).
Congratulations! Using rtrackr should be as easy as tracking a new dataset and logging subsequent processing timepoints. For the special case of summarizing multiple records into one, see Summarising data.
Please note: rtrackr uses the current UNIX timestamp (in seconds) to define a trackr_id. Logging multiple timepoints in the same second is not permitted, and trackr_timepoint() will wait until the following second to write a new trackr log file. This means that repeated calls to trackr_timepoint() may delay code execution. For more information, see how it works.
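To make the delay concrete, here is a minimal base R sketch of waiting for the next whole second. wait_for_next_second() is a hypothetical illustration, not rtrackr's actual implementation, which may handle the wait differently.

```r
# Hypothetical illustration (not rtrackr code): block until the UNIX
# timestamp, truncated to whole seconds, has moved past `last_second`.
wait_for_next_second <- function(last_second) {
  while (floor(as.numeric(Sys.time())) <= last_second) {
    Sys.sleep(0.05)  # poll in short steps instead of busy-waiting
  }
  floor(as.numeric(Sys.time()))
}

first <- floor(as.numeric(Sys.time()))
second <- wait_for_next_second(first)
second > first  # the two timestamps now fall in different seconds
```

Under this assumption, each consecutive timepoint can cost up to one second of waiting, so a loop that logs many timepoints would slow down accordingly.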
Now that we have logged a dataset and a processing timepoint, we can query the history of a record in the most recent version of the dataset. Use trackr_lineage() to get all parent records of a given trackr_id.
target_id <- df$trackr_id[1]
trackr_lineage(target_id, trackr_dir)
## [1] "Successfully written 60487d1e5ab8465ba4b039ed55a18342ab184454_9b298ab2c9539c0d7f187ce26315f459fa58e78e_lineage.json"
For a more readable view of the lineage of a trackr_id, use trackr_network() to create an interactive network of the dataset history.
lineage_fn <- paste0(trackr_dir, '/', target_id, '_lineage.json')
trackr_network(lineage_fn)
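The lineage file name is assembled by pasting strings together. As an alternative sketch, base R's file.path() and path.expand() can build the same path while expanding the leading ~. This assumes the <trackr_id>_lineage.json naming shown in the success message, and the ID below is a placeholder for illustration, not a real trackr_id.

```r
trackr_dir <- "~/Documents/trackr_dir"
target_id <- "exampleprefix_examplerecord"  # placeholder, not a real trackr_id

# file.path() inserts the separator; path.expand() resolves the leading "~"
lineage_fn <- file.path(path.expand(trackr_dir), paste0(target_id, "_lineage.json"))

# Check that the file exists before handing the path to another function
if (!file.exists(lineage_fn)) {
  message("No lineage file found at: ", lineage_fn)
}
```

Checking for the file first gives a clearer message than a failure further down the line if the lineage has not been written yet.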
After creating, recording, and querying the history of changes to a dataset, use clean_trackr_dir() to remove all log files.
clean_trackr_dir(trackr_dir)
## [1] "Removed: 60487d1e5ab8465ba4b039ed55a18342ab184454_9b298ab2c9539c0d7f187ce26315f459fa58e78e_lineage.json"
## [2] "Removed: 60487d1e5ab8465ba4b039ed55a18342ab184454_dl.json"
## [3] "Removed: 60487d1e5ab8465ba4b039ed55a18342ab184454.json"
## [4] "Removed: c3541976c36e94ef6b25d3a6a70b136f58cd2e26_dl.json"
## [5] "Removed: c3541976c36e94ef6b25d3a6a70b136f58cd2e26.json"
Warning: clean_trackr_dir() is a simple wrapper function that will delete any .json files in the trackr_dir, not only trackr log files. Use caution when mixing trackr log files with other files. The recommended practice is to create a separate directory for all trackr log files.
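Because every .json file in the directory is removed, it can help to check what the directory contains before cleaning it. The sketch below uses only base R and a temporary directory (an assumption made so the example is safe to run); it does not call rtrackr, and the file names are stand-ins.

```r
# Create a dedicated, throwaway log directory for the demonstration
demo_dir <- file.path(tempdir(), "trackr_dir_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Two stand-in files: one looks like a log file, one is unrelated
writeLines("{}", file.path(demo_dir, "example_log.json"))
writeLines("notes", file.path(demo_dir, "notes.txt"))

# List the .json files, i.e. the ones a cleanup of all .json files would remove
json_files <- list.files(demo_dir, pattern = "\\.json$")
print(json_files)

# Remove the demonstration directory again
unlink(demo_dir, recursive = TRUE)
```

Running the same list.files() call on a real log directory shows whether any unrelated .json files would be caught by the cleanup.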
Article by Hamish Gibbs 2020-06-18 15:47:17. To report a problem with this package, please create an issue on GitHub.