How it works • rtrackr

Overview

rtrackr provides a way to track changes to individual records during program execution in R. rtrackr relies on the trackr_new() and trackr_timepoint() functions to initiate dataset tracking and log important points in the data processing chain.

Please Note:

rtrackr relies on the trackr_id column to identify a record and its lineage. If this column is deleted or altered, rtrackr can no longer track the affected records. For more details on the special case of summarising data with rtrackr see Summarising data.

rtrackr uses the sha1 hash function from the digest package (digest::sha1). sha1 was chosen because it produces relatively short hashes, which reduces the size of log files. The trade-off is a higher, though still very small, probability of a hash collision than with a longer hash. Please see more information on the sha1 hash function here.

The trackr_id column

The trackr_id column is an id field that uniquely identifies each record in a dataset. Each trackr_id has two components, separated by an underscore "_".

The first component of a trackr_id is the hash of the parent log file, which identifies the log file where a record’s parent record is stored. The parent file hash is defined as the hash of the timepoint timestamp (a UNIX timestamp) followed by the collapsed row hashes (see below).

The second component of a trackr_id is the row hash of each individual record. The row hash is the hash of a string made up of all values in a row, concatenated with no separator (an empty string ""). For a dataframe of one row with values a = "a" and b = 1, the row hash would be computed as digest::sha1("a1"), or 4751028c3d830cf93f7d1e64d5e4d58c9d01ee32.
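Because hexadecimal hashes never contain an underscore, a trackr_id can be split back into its two components with base R. The following is a minimal sketch: the id is assembled from the example hashes on this page purely for illustration, and the variable names are hypothetical rather than part of rtrackr.

```r
# Illustrative id only: "<parent file hash>_<row hash>"
id <- "566f5d1c67d708801092ded906f6180d5ae40c0a_4751028c3d830cf93f7d1e64d5e4d58c9d01ee32"

# Split on the literal underscore separator
parts <- strsplit(id, "_", fixed = TRUE)[[1]]

parent_file_hash <- parts[1]  # identifies the log file holding the parent record
row_hash <- parts[2]          # identifies the individual record

# Basic shape check: two 40-character hexadecimal hashes
is_well_formed <- grepl("^[0-9a-f]{40}_[0-9a-f]{40}$", id)
```

A check like is_well_formed can be a quick way to notice that a trackr_id column has been altered before continuing to track records.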

Details

Defining the example dataset:

df <- data.frame(a = 'a', b = 1)

a	b
a	1

An example row hash for the first (and only) row of the dataframe:

row <- 'a1'
digest::sha1(row)

## [1] "4751028c3d830cf93f7d1e64d5e4d58c9d01ee32"
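The same calculation can be applied to every row of a dataframe at once. The sketch below assumes that row values are coerced to character with R's default formatting before being concatenated; hash_rows() is a hypothetical helper written for this example and is not an rtrackr function, so its output may differ from rtrackr's for columns such as factors, dates, or long decimals.

```r
library(digest)

# Hypothetical helper: one sha1 hash per row of a dataframe
hash_rows <- function(data) {
  # Concatenate the values of each row with no separator, e.g. "a" and 1 -> "a1".
  # unname() stops column names being matched to arguments of paste0().
  row_strings <- do.call(paste0, unname(as.list(data)))
  # Hash each row string separately
  vapply(row_strings, digest::sha1, character(1), USE.NAMES = FALSE)
}

df <- data.frame(a = 'a', b = 1)
row_hashes <- hash_rows(df)

# The single row hash matches the manual calculation above
identical(row_hashes, digest::sha1('a1'))
```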

The file hash of this dataframe at UNIX time 1592300508 would therefore be digest::sha1("15923005084751028c3d830cf93f7d1e64d5e4d58c9d01ee32"), or 566f5d1c67d708801092ded906f6180d5ae40c0a.

Details

Using the same row hash as above:

file_hash <- digest::sha1(c('1592300508', digest::sha1(row)))

For a dataframe with two rows, a = "a", b = 1 and a = "b", b = 2, recorded at UNIX time 1592300508, the file hash would be digest::sha1("15923005084751028c3d830cf93f7d1e64d5e4d58c9d01ee32d6543aeb67806714fa9e9567dc5c46b2106ae843"), or 60a98eebfddc76740acc443d9687f56b37a47893.

Details

digest::sha1(c('1592300508', digest::sha1('a1'), digest::sha1('b2')))

## [1] "60a98eebfddc76740acc443d9687f56b37a47893"
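To show how the two components fit together, the sketch below joins a file hash and a set of row hashes into the `<file hash>_<row hash>` shape described earlier. Which file's hash rtrackr places in front of a record at a given timepoint is decided by the package; the file hash computed here is used only as a stand-in, and the variable names are illustrative.

```r
library(digest)

timestamp <- '1592300508'

# Row hashes for the two example rows
row_hashes <- c(digest::sha1('a1'), digest::sha1('b2'))

# File hash computed as in the example above
file_hash <- digest::sha1(c(timestamp, row_hashes))

# One id per record: file hash and row hash joined by an underscore
trackr_id <- paste(file_hash, row_hashes, sep = "_")
```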

Validating a trackr data log

For more information on getting started with rtrackr please see Getting started.

trackr_new() and trackr_timepoint() both provide the option to write a data log along with a standard trackr log file with the argument log_data = TRUE. This is recommended, as it provides a full history of all timepoints in a data processing chain, but may not be practical for repeatedly logging very large datasets.

Any data logs that have been saved using log_data = TRUE can be validated by re-computing the file hash of the logged data using the timestamp stored in the corresponding trackr log file. rtrackr provides the validate_data_log() function to validate a trackr data log automatically.
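The idea behind this validation can be sketched by hand: recompute the file hash from the logged data and the stored timestamp, then compare it with the recorded hash. The example below is an illustration under stated assumptions, not the implementation of validate_data_log(). recompute_file_hash() is a hypothetical helper, the data frame stands in for a data log holding only the original data columns, and the "recorded" hash is generated in the example rather than read from a real trackr log file.

```r
library(digest)

# Hypothetical helper: file hash from a timestamp and the rows of a dataframe
recompute_file_hash <- function(data, timestamp) {
  # Concatenate each row's values with no separator, then hash each row
  row_strings <- do.call(paste0, unname(as.list(data)))
  row_hashes <- vapply(row_strings, digest::sha1, character(1), USE.NAMES = FALSE)
  # Hash the timestamp together with the row hashes
  digest::sha1(c(as.character(timestamp), row_hashes))
}

logged <- data.frame(a = c('a', 'b'), b = c(1, 2))
timestamp <- '1592300508'

# Stand-in for the file hash stored in the corresponding trackr log file
recorded_hash <- recompute_file_hash(logged, timestamp)

# An unmodified data log reproduces the recorded hash
log_is_valid <- identical(recompute_file_hash(logged, timestamp), recorded_hash)

# A data log that was edited after logging does not
tampered <- logged
tampered$b[2] <- 3
tampered_is_valid <- identical(recompute_file_hash(tampered, timestamp), recorded_hash)
```

In practice, validate_data_log() should be preferred, since it knows the exact layout of the files that rtrackr writes.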

Other considerations

A primary motivation for the development of this package is the use case of data files that are processed in R and shared with users who further alter the data manually (for human data cleaning, manual updates, or record validation). It is important to consider the effect of manual record processing on data logged by rtrackr. When sharing data logged in rtrackr outside of R, please consider the special case of summarising data.

Article by Hamish Gibbs 2020-06-18 15:47:20. To report a problem with this package, please create an issue on GitHub.