how_it_works.Rmd
rtrackr
provides a way to track changes to individual records during program execution in R. rtrackr
relies on the trackr_new()
and trackr_timepoint()
functions to initiate dataset tracking and log important points in the data processing chain.
Please Note:
rtrackr
relies on the trackr_id
column to identify a record and its lineage. If this column is deleted or altered - it is not possible to continue tracking records using rtrackr
. For more details on the special case of summarising data with rtrackr
see Summarising data.
rtrackr
relies on the sha1
hash function exported from digest::sha1
. sha1
has been chosen because of the relatively short hashes that it produces, reducing the file size of log files. Using sha1
increases the (unlikely) probability of a hash collision. Please see more information on the sha1
hash function here.
The trackr_id
column is an id field that uniquely identifies each record in a dataset. Each trackr_id
has two components, separated by an underscore "_"
.
The first component of a trackr_id
is the hash of a the parent log file, used to identify the log file where a record’s parent record is stored. The parent file hash is defined as the hash of the timepoint timestamp (a UNIX timestamp) + collapsed row hashes (see below).
The second component of a trackr_id is the row hash of each individual record. The row hash is the hash of a string made up of all values in a row, separated by an empty string ""
. Considering a dataframe of one row with values a = "a"
and b = 1
, the row hash would be computed as digest::sha1("a1")
, or 4751028c3d830cf93f7d1e64d5e4d58c9d01ee32
.
Details
Defining the example dataset:
df <- data.frame(a = 'a', b = 1)
a | b |
---|---|
a | 1 |
An example row hash for the first (and only) row of the dataframe:
row <- 'a1' digest::sha1(row)
## [1] "4751028c3d830cf93f7d1e64d5e4d58c9d01ee32"
The file hash of this dataframe at UNIX time 1592300508
would therefore be digest::sha1("15923005084751028c3d830cf93f7d1e64d5e4d58c9d01ee32")
, or 566f5d1c67d708801092ded906f6180d5ae40c0a
.
Details
Using the same row hash as above:
For a dataframe with two rows, a = "a"
, b = 1
and a = "b"
, b = 2
, recorded at UNIX time 1592300508
, the file hash would be digest::sha1("15923005084751028c3d830cf93f7d1e64d5e4d58c9d01ee32d6543aeb67806714fa9e9567dc5c46b2106ae843")
, or 60a98eebfddc76740acc443d9687f56b37a47893
.
For more information on getting started withrtrackr
please see Getting started.
trackr_new()
and trackr_timepoint()
both provide the option to write a data log along with a standard trackr log file with the argument log_data = TRUE
. This is recommended, as it provides a full history of all timepoints in a data processing chain, but may not be practical for repeatedly logging very large datasets.
Any data logs that have been saved using log_data = TRUE
can be validated by re-computing the file hash of the logged data using the timestamp stored in the corresponding trackr log file. rtrackr
provides the validate_data_log()
function to validate a trackr data log automatically.
A primary motivation for the development of this package is the use case of data files that are processed in R and shared with users who further alter the data manually (for human data cleaning, manual updates, or record validation). It is important to consider the effect of manual record processing on data logged by rtrackr
. When sharing data logged in rtrackr
outside of R, please consider the special case of summarising data.
Article by Hamish Gibbs 2020-06-18 15:47:20. To report a problem with this package, please create an issue on GitHub.