Title: Processing Infectious Disease Datasets in IIDDA
Description: Part of an open toolchain for processing infectious disease datasets available through the IIDDA data repository.
Authors: Steve Walker [aut, cre], Samara Manzin [aut], Michael Roswell [aut], Gabrielle MacKinnon [aut], Ronald Jin [aut]
Maintainer: Steve Walker <[email protected]>
License: GPL (>= 3)
Version: 1.0.0
Built: 2024-11-20 12:43:25 UTC
Source: https://github.com/canmod/iidda-tools
Add column 'basal_disease' to tidy dataset

add_basal_disease(data, lookup)

Arguments:
data: A tidy data set with a 'disease' column.
lookup: A lookup table with 'disease' and 'nesting_disease' columns that describe a global disease hierarchy, which is applied to find the basal disease of each 'disease' in data.

Value: Tidy dataset with an added 'basal_disease' column.
Add lists of unique values and ranges of values to the metadata of an IIDDA data set.

add_column_summaries(tidy_data, dataset_name, metadata)

Arguments:
tidy_data: Data frame of prepared data that are ready to be packaged as an IIDDA tidy data set.
dataset_name: Character string giving the IIDDA identifier of the dataset.
metadata: Metadata for the dataset (typically the output of get_tracking_metadata).
Add lists of unique sets of values for a given filter group.

add_filter_group_values(tidy_data, dataset_name, metadata)

Arguments:
tidy_data: Data frame of prepared data that are ready to be packaged as an IIDDA tidy data set.
dataset_name: Character string giving the IIDDA identifier of the dataset.
metadata: Metadata for the dataset (typically the output of get_tracking_metadata).
Add title and description metadata to a table and its columns.

add_metadata(table, table_metadata, column_metadata)

Arguments:
table: Dataframe (or dataframe-like object).
table_metadata: Named list (or list-like object) giving metadata fields for the table.
column_metadata: Dataframe with rownames equal to the columns in table.

Value: Version of table with added metadata attributes.
Add provenance information to an IIDDA dataset by creating columns containing the scan and digitization IDs associated with each record.

add_provenance(tidy_data, tidy_dataset)

Arguments:
tidy_data: Data frame in IIDDA tidy form.
tidy_dataset: The IIDDA identifier associated with the dataset for which 'tidy_data' serves as an intermediate object during its creation.
Prep Script Outcomes

all_prep_script_outcomes()
successful_prep_script_outcomes()
failed_prep_script_outcomes()
error_tar(tar_name)

Arguments:
tar_name: Name of a tar archive to be created with log files of failed prep script outcomes.

Value: Data frame with all prep script outcomes in the project.

successful_prep_script_outcomes(): Data frame with all successful prep script outcomes.
failed_prep_script_outcomes(): Data frame with all failed prep script outcomes.
error_tar(): Tar archive with log files of failed prep script outcomes.
Basal Disease

basal_disease(disease, disease_lookup, encountered_diseases = character())

Arguments:
disease: Disease for which to determine the basal disease.
disease_lookup: Table with two columns: disease and nesting_disease.
encountered_diseases: Character vector of diseases already found. Typically this is left at the default value of an empty character vector.

Value: The root disease that the input disease maps to in disease_lookup.
Convert URLs in GitHub blob storage format to GitHub raw data format.

blob_to_raw(urls)

Arguments:
urls: Character vector of GitHub URLs in blob storage.

blob_to_raw("https://github.com/canmod/iidda-tools/blob/main/R/iidda/R/github_parsing.R")
Create a data frame representing a rectangular range of cells in an Excel file. This is useful for adding blank cells that do not get read in by 'xlsx_cells'.

cell_block(cells_data)

Arguments:
cells_data: Data read in using 'xlsx_cells', or any data frame with integer columns 'row' and 'col'.
Throw an error if columns in the tidy data are not in the metadata schema, or if all values in a column are NA.

check_metadata_cols(tidy_data, metadata)

Arguments:
tidy_data: data.frame resulting from data prep scripts.
metadata: Nested named list describing metadata for the tidy data.
Throw an error if columns in the metadata schema are not in the tidy data.

check_tidy_data_cols(table, column_metadata)

Arguments:
table: Dataframe (or dataframe-like object).
column_metadata: Dataframe with rownames equal to the columns in table.
Collapse all value columns into a single character column, for data frames that have one row per cell in an xlsx file.

collapse_xlsx_value_columns(data)

Arguments:
data: Data frame representing an xlsx file.
Combine data from different Excel sheets associated with specific weeks in the 1956-2000 Canadian communicable disease incidence data prep pipelines.

combine_weeks(cleaned_sheets, sheet_dates, metadata)

Arguments:
cleaned_sheets: List of data frames, one for each sheet.
sheet_dates: Data frame describing sheet dates (TODO: more info needed).
metadata: Metadata for the dataset (typically the output of get_tracking_metadata).
Get metadata for a harmonized data source, given metadata for the corresponding tidy data source and initial metadata for the harmonized data source.

convert_harmonized_metadata(
  tidy_metadata,
  harmonized_metadata,
  tidy_source,
  harmonized_dataset_id,
  tidy_source_metadata_path
)

Arguments:
tidy_metadata: Metadata for the tidy data source being harmonized.
harmonized_metadata: Initial metadata for the harmonized data source.
tidy_source: IIDDA data source ID for a data source that is being harmonized.
harmonized_dataset_id: ID of the dataset being harmonized.
tidy_source_metadata_path: Path to metadata for the tidy source (e.g., the output of convert_metadata_path).
Convert a metadata path to one corresponding to tidy data being harmonized.

convert_metadata_path(metadata_path, harmonized_source, tidy_source)

Arguments:
metadata_path: Path to a collection of tracking tables.
harmonized_source: IIDDA data source ID for a harmonized source.
tidy_source: IIDDA data source ID for a data source that is being harmonized.
Create a temporary file containing a copy of a file under git version control, at a particular revision of that file.

cp_git_version(file, version_hash)

Arguments:
file: Path to file.
version_hash: Git version hash.

Value: Path to the temporary file containing the copy.
Create a directory of JSON files from a CSV file.

csv_to_json_files(csv_path, json_dir, name_field, use_extension = FALSE)

Arguments:
csv_path: Path to the CSV file.
json_dir: Path to the directory for saving the JSON files.
name_field: Name of the field in the CSV file that contains the names for each JSON file. All values in this field must be unique.
use_extension: If there is a column in the CSV file called 'extension', should it be used to produce JSON filenames of the form 'value-in-name-field.value-in-extension-field.json'?
Create a directory of JSON files from a data frame.

data_to_json_files(data, json_dir, name_field, use_extension = FALSE)

Arguments:
data: Data frame.
json_dir: Path to the directory for saving the JSON files.
name_field: Name of the field in the data frame that contains the names for each JSON file. All values in this field must be unique.
use_extension: If there is a column in the data frame called 'extension', should it be used to produce JSON filenames of the form 'value-in-name-field.value-in-extension-field.json'?
Disease coverage heatmap. Values are TRUE if that particular disease occurred at least once in a period that ended in that particular year, and FALSE otherwise.

disease_coverage_heatmap(table, disease_col = "disease")

Arguments:
table: Dataframe (or dataframe-like object). Tidy dataset of all compiled datasets.
disease_col: Specifies the level of disease (i.e., disease_family, disease, disease_subclass).
Function used to produce the time-scale cross-check for the CANMOD digitization project.

do_time_scale_cross_check(sum_of_timescales)

Arguments:
sum_of_timescales: Dataframe with aggregations to different time scales (TODO: describe).
Drop empty rows in a table using is_empty.

drop_empty_rows(table)

Arguments:
table: Data frame.
Save the records of a dataset that contain empty values in 'columns'. This report will be saved in the 'supporting-output/dataset_id' directory.

empty_column_report(data, columns, dataset_id)

Arguments:
data: Data frame.
columns: Character vector giving the columns to check for emptiness.
dataset_id: ID for the dataset associated with 'data'.
Force empty strings to be blank. See is_empty.

empty_is_blank(x)

Arguments:
x: Object to test.
Convert all missing values to NA.

empty_to_na(data)

Arguments:
data: Data frame resulting from data prep scripts.
End-Dates of Epiweeks

epiweek_end_date(year, week)

Arguments:
year: Integer vector of years.
week: Integer vector of weeks.

Value: Date vector of the end-dates of each specified epiweek.
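A minimal usage sketch, assuming the common epidemiological-week convention in which each epiweek ends on a Saturday; the exact dates depend on the convention this function implements:

epiweek_end_date(2024, 1:2)
# e.g., as.Date(c("2024-01-06", "2024-01-13")) under a Saturday-ending convention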
Note that unless you specify an appropriate contents_pattern, extract_between_paren will not work as you probably expect if there are multiple sets of parentheses. You can use exclusion patterns to make this work better (e.g., contents_pattern = '[^)]*').

extract_between_paren(
  x, left = "\\(", right = "\\)", contents_pattern = ".*"
)
extract_all_between_paren(
  x, left = "\\(", right = "\\)", contents_pattern = ".*", max_iters = 100
)

Arguments:
x: Character vector.
left: Left parenthetical string.
right: Right parenthetical string.
contents_pattern: Regex pattern for the contents between parentheses.
max_iters: Maximum number of items to return.

Value: Character vector containing, for each element of x, the substring between the first matching parentheses, or NA for elements without parentheses.

x = c("-", "", NA, "1", "3", "1 (Alta.)", "(Sask) 20")
extract_between_paren(x)
Extract a character vector from a list, or return a blank string if it doesn't exist or if a proper list isn't passed.

extract_char_or_blank(l, e)

Arguments:
l: List.
e: Name of the focal element.
Try to extract a list element, and return a blank list if it doesn't exist or if a proper list is not passed.

extract_or_blank(l, e)

Arguments:
l: List.
e: Name of the focal element.
Convenience function to do fill_re_template and wrap_age_patterns in one step.

fill_and_wrap(re_templates, which_bound, purpose, prefix = "")

Arguments:
re_templates: A set of templates that resolve to regular expressions for matching age information contained in category names.
which_bound: Resolve the template to match lower or upper bounds, neither (the default), or single.
purpose: Character string indicating the purpose of the resulting regular expression.
prefix: Pattern to match at the beginning of the string that marks the beginning of age information.
Resolve a length-1 character vector containing a regex template into a regular expression for matching age bound information in disease category names.

fill_re_template(re_template, which_bound = "neither")

Arguments:
re_template: Template that resolves to a regular expression for matching age information contained in category names.
which_bound: Resolve the template to match lower or upper bounds, neither (the default), or single.
Fix the format of a CSV file that is not in IIDDA format.

fix_csv(filename)

Arguments:
filename: Path to the CSV file.

Value: Logical value that is 'TRUE' if the CSV needed fixing and 'FALSE' otherwise.
Convert words describing frequencies to phrases.

freq_to_by(freq)

Arguments:
freq: One of '"weekly"' (becomes '"7 days"'), '"4-weekly"' (becomes '"28 days"'), '"monthly"' (becomes '"1 month"').
Convert words describing frequencies to corresponding numbers of days.

freq_to_days(freq)

Arguments:
freq: One of '"weekly"' (becomes '7'), '"4-weekly"' (becomes '28'), '"monthly"' (returns an error).
Get all Dependencies

get_all_dependencies(source, dataset)

Arguments:
source: Source ID.
dataset: Dataset ID.
Get CANMOD digitization metadata. Superseded by functionality in 'iidda.api'.

get_canmod_digitization_metadata(tracking_list)

Arguments:
tracking_list: Output of read_tracking_tables.
Get an object with metadata information about a particular dataset from tracking tables.

get_dataset_metadata(dataset)

Arguments:
dataset: Dataset identifier.
Get Dataset Path

get_dataset_path(source, dataset, ext = "csv")

Arguments:
source: Source ID.
dataset: Dataset ID.
ext: Dataset file extension.
Synonym for the `[` operator for use in pipelines.

get_elements()
Get the first item in each sublist of sublists (ugh ... I know).

get_firsts(l, key)

Arguments:
l: A list of lists of lists.
key: Name of focal sublist (TODO: needs better description/motivation).

l = list(
  a = list(A = list(i = 1, ii = 2), B = list(i = 3, ii = 4)),
  b = list(A = list(i = 5, ii = 6), B = list(i = 7, ii = 8))
)
get_firsts(l, "A")
get_firsts(l, "B")
Get a list of items within each inner list of a list of lists.

get_items(l, keys)

Arguments:
l: A list of lists.
keys: Names of the items in the inner lists.
Get Lookup Table

get_lookup_table(table_name = c("location_iso"))

Arguments:
table_name: Name of a lookup table.
Get Main Script

get_main_script(source, dataset)

Arguments:
source: Source ID.
dataset: Dataset ID.
Get Source Path

get_source_path(source)

Arguments:
source: Source ID.
Read in CSV files that contain the single source of truth for metadata to be used in a data prep script.

get_tracking_metadata(
  tidy_dataset, digitization, tracking_path, original_format = TRUE, for_lbom = FALSE
)

Arguments:
tidy_dataset: Key to the tidy dataset being produced by the script.
digitization: Key to the digitization being used by the script.
tracking_path: String giving the path to the tracking data.
original_format: Should the original tracking table format be used?
for_lbom: Are these data being read for the LBoM repo?

Details: This function currently assumes that a single tidy dataset is being produced from a single digitized file.
Unique Column Values

get_unique_col_values(l)

Arguments:
l: List of data frames with the same column names.

Value: List of unique values in each column.
Get with Key by Regex

get_with_key(l, key, pattern, ...)

Arguments:
l: List of lists.
key: Name of item in inner lists.
pattern: Regex pattern with which to match values of the key.
...: Additional arguments to pass on to the underlying regex matching.

Value: Subset of elements of l that match the pattern.
Convert GitHub URLs into Raw Format (not working)

git_path_to_raw_github(urls, branch = "master")

Arguments:
urls: TODO
branch: TODO
Simplify String with List of Numbers Grouped by Dashes

group_with_dash(x)

Arguments:
x: Atomic vector.

Value: Length-1 character string giving a sorted list of numbers, with contiguous numbers grouped by dashes.

group_with_dash(c("3840", "34", "2", "3", "1", "33", '5-50'))
group_with_dash(group_with_dash(c("3840", "34", "2", "3", "1", "33", '5-50')))
List of lookup tables for harmonizing historical inconsistencies in naming.

harmonization_lookup_tables

A list of data frames, one for each column with historical naming inconsistencies:
* Unique names of locations found in IIDDA
* National jurisdiction codes
* Sub-national jurisdiction codes
* Unique names of sexes found in IIDDA
* Numeric sex codes

For example, NFLD and Newfoundland can both be represented using the ISO-3166-2 standard as CA-NL. These tables can be joined to data in IIDDA to produce standardized variables that harmonize historical inconsistencies.
Return the Shortest ICD-10 Codes that Match a Regex Pattern. Requires an internet connection.

icd_finder(disease_pattern, maximum_number_results = 10L, ...)

Arguments:
disease_pattern: Regex pattern describing a disease.
maximum_number_results: Integer giving the maximum number of ICD codes to return, with preference given to shorter codes.
...: Additional arguments to pass on to the underlying matching.

icd_finder("chick") ## Struck by chicken!!
Identify time scales (wk, mo, qr, yr) and location types (province or country) within a tidy dataset.

identify_scales(
  data,
  location_type_fixer = canada_province_scale_finder,
  time_scale_identifier = identify_time_scales
)

Arguments:
data: Data frame in IIDDA tidy format, to which time scale and location scale information will be added.
location_type_fixer: Function that takes a data frame in IIDDA tidy format and adds or fixes the 'location_type' field.
time_scale_identifier: Function that takes a data frame in IIDDA tidy format and adds the 'time_scale' field.
Get the global data dictionary for IIDDA.

iidda_data_dictionary()

Details: This function requires an internet connection.
Create New IIDDA Dataset from Single File

iidda_from_single_file(single_file, new_repo, lifecycle)

Arguments:
single_file: Path to a single data file.
new_repo: Path to a new IIDDA repository.
lifecycle: Character vector giving the lifecycle state (https://github.com/davidearn/iidda/blob/main/LIFECYCLE.md). Probably 'Unreleased', but it could in principle be 'Static', 'Dynamic', or 'Superseded'.

Value: No return value. Call to produce a new directory structure in a new IIDDA git repository containing a single source data file.
Convert geographical location information, as it was described in a source document, to equivalent ISO-3166 and ISO-3166-2 codes.

iso_3166_codes(tidy_data, locations_iso)

Arguments:
tidy_data: Data frame containing a field called 'location'.
locations_iso: Table containing three columns: 'location', 'iso_3166', and 'iso_3166_2'.
Convert start and end dates into ISO-8601-compliant date ranges.

iso_8601_dateranges(start_date, end_date)

Arguments:
start_date: Date vector.
end_date: Date vector.
Convert date vectors into string vectors with ISO-8601 compliant format.

iso_8601_dates(dates)

Arguments:
dates: Date vector.
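A minimal sketch; ISO-8601 calendar dates take the YYYY-MM-DD form:

iso_8601_dates(as.Date("2000-1-2")) # "2000-01-02"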
Superseded by iso_3166_codes.

iso_codes(tidy_data, locations_iso = read.csv("tracking/locations_ISO.csv"))

Arguments:
tidy_data: Data frame containing a field called 'location'.
locations_iso: Table containing three columns: 'location', 'iso_3166', and 'iso_3166_2'.
Create a CSV file from a set of JSON files.

json_files_to_csv(json_paths, csv_path)

Arguments:
json_paths: Vector of paths to JSON files.
csv_path: Path for saving the resulting CSV file.
Create a data frame from a set of JSON files.

json_files_to_data(json_paths)

Arguments:
json_paths: Vector of paths to JSON files.
Create a set of key-value pairs by extracting elements from within a list of named lists.

key_val(l, key, value)

Arguments:
l: A list of named lists.
key: A name of an element in each list in l, whose values give the keys.
value: A name of an element in each list in l, whose values give the values.

f = system.file("example_data_dictionary.json", package = "iidda")
d = jsonlite::read_json(f)
key_val(d, "name", "type")
List Dataset IDs

list_dataset_ids(source)

Arguments:
source: Source ID.
List Dataset IDs by Source

list_dataset_ids_by_source()
List Dependency IDs

list_dependency_ids(
  source, dataset, type = c("PrepScripts", "Scans", "Digitizations", "AccessScripts")
)

Arguments:
source: Source ID.
dataset: Dataset ID.
type: Type of resource.
List Dependency IDs for Source

list_dependency_ids_for_source(
  source, type = c("PrepScripts", "Scans", "Digitizations", "AccessScripts")
)

Arguments:
source: IIDDA source ID, which should correspond to metadata in 'metadata/sources/source.json' and a folder in 'pipelines'.
type: Type of dependency.
List Dependency Paths

list_dependency_paths(
  source, dataset, type = c("PrepScripts", "Scans", "Digitizations", "AccessScripts")
)

Arguments:
source: Source ID.
dataset: Dataset ID.
type: Type of resource.
Extract list items by regular expression matching on their names.

list_extract(x, pattern, ...)

Arguments:
x: A list.
pattern: A regular expression.
...: Additional arguments to pass on to the underlying regex matching.
List File ID

list_file_id(..., ext)

Arguments:
...: Path components to the directory containing the resources.
ext: Optional string giving the file extension of the resources. If missing, all resources are given.

Value: List of matching files without their extensions.
List Prep Script IDs

list_prep_script_ids(source)

Arguments:
source: Source ID.
List Resource IDs

list_resource_ids(
  source, type = c("TidyDatasets", "PrepScripts", "Scans", "Digitizations", "AccessScripts")
)

Arguments:
source: Source ID.
type: Type of resource.
Extract elements of lists using x-path-like syntax.

list_xpath(l, ...)

Arguments:
l: A hierarchical list.
...: Character strings describing the path down the hierarchy.

l = list(
  a = list(A = list(i = 1, ii = 2), B = list(i = 3, ii = 4)),
  b = list(A = list(i = 5, ii = 6), B = list(i = 7, ii = 8))
)
list_xpath(l, "A", "i")
list_xpath(l, "B", "ii")
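Following the structure of l above, each call descends the hierarchy once per path component, so the expected results are a sketch like:

# list_xpath(l, "A", "i")  -> list(a = 1, b = 5)
# list_xpath(l, "B", "ii") -> list(a = 4, b = 8)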
Lookup Value

lookup(named_keys, l)

Arguments:
named_keys: Named character vector with values giving keys to look up in l.
l: List with names to match against the values of named_keys.
Create a lookup function that takes a character vector of disease category names and returns a vector of equal length containing either the lower or upper age bounds contained in the categories. If no bound is present then NA is returned.

make_age_hash_table(
  categories,
  re_templates,
  which_bound = c("lower", "upper", "neither", "single"),
  prefix = ""
)

Arguments:
categories: Character vector of disease category names.
re_templates: List of templates that resolve to regular expressions for matching age information contained in category names.
which_bound: Resolve the template to match lower or upper bounds, neither (the default), or single.
prefix: Pattern to match at the beginning of the string that marks the beginning of age information.

Value: Vector containing either the lower or upper age bounds contained in the categories.
Create a dependency file and prep script for a dataset that is a compilation of other datasets. These files are created once, and any edits should be made manually to the created files.

make_compilation_dependencies(compilation_dataset, dataset_paths)

Arguments:
compilation_dataset: Dataset ID for which dependencies are being declared.
dataset_paths: Relative paths to dependencies.
Create IIDDA Config File

make_config(
  path = file.path(getwd(), "config.json"),
  iidda_owner = "",
  iidda_repo = "",
  github_token = "",
  .overwrite = FALSE
)

Arguments:
path: Path for storing the config file.
iidda_owner: TODO
iidda_repo: TODO
github_token: TODO
.overwrite: Should existing config.json files be overwritten?
Make DataCite JSON Metadata

make_data_cite_tidy_data(metadata, file)

Arguments:
metadata: Output of get_tracking_metadata.
file: Path to the metadata file.
Create a dependency file for a dataset. This file is created once, and any edits should be made manually to the created file.

make_dataset_dependencies(tidy_dataset, paths)

Arguments:
tidy_dataset: Dataset ID for which dependencies are being declared.
paths: Relative paths to dependencies.
Make Dataset Metadata

make_dataset_metadata(tidy_dataset, type, ...)

Arguments:
tidy_dataset: Dataset ID for which metadata is being produced.
type: Type of dataset (e.g., CDI, Mortality).
...: Additional metadata fields to provide. If invalid fields are supplied, an error message will be given.
Make one JSON metadata file for each resource (i.e., prep/access script or digitization/scan of data) in a source pipeline associated with a data source (i.e., a sub-directory of 'pipelines'). Existing metadata files will not be overwritten.

make_resource_metadata(source)

Arguments:
source: Source ID.
Make a sub-directory of 'pipelines' containing a data and/or code source.

make_source_directory(source, files)

Arguments:
source: Source ID.
files: Character vector of files that are either already in the pipeline or that should be added.
Make a JSON file associated with a new data source (i.e., a sub-directory of 'pipelines').

make_source_metadata(source, organization, location, ...)

Arguments:
source: Source ID.
organization: Organization from which the source was obtained.
location: Location for which data was collected.
...: Additional metadata fields to provide. If invalid fields are supplied, an error message will be given.
To be used in conjunction with tracking_table_keys.

melt_tracking_table_keys(keys)

Arguments:
keys: Character vector of tracking table keys.
Construct an object with functions for handling missing values.

MissingHandlers(
  unclear = c("Unclear", "unclear", "uncleaar", "uncelar", "r"),
  not_reported = c("", "Not available", "*", "Not reportable", "missing"),
  zeros = "-"
)

Arguments:
unclear: Character vector giving values corresponding to numbers that were unclear to data enterers.
not_reported: Character vector giving values corresponding to numbers that were not reported in the original source.
zeros: Character vector giving values corresponding to '0' but that were entered as another character to resemble the original source.

Value: An environment with functions for handling missing values.
Mock API Hook

mock_api_hook(repo_path)

Arguments:
repo_path: Path to an IIDDA repository.
Copied from lme4:::namedList.

nlist(...)

Arguments:
...: A list of objects.
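A usage sketch, assuming the lme4 namedList semantics in which each element is named after the expression passed in (explicit names take precedence):

x <- 1
y <- "a"
nlist(x, y)     # list(x = 1, y = "a")
nlist(x, z = y) # list(x = 1, z = "a")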
Save the records of a dataset that contain non-numeric data within a specified numeric field. This report will be saved in the 'supporting-output/dataset_id' directory.

non_numeric_report(data, numeric_column, dataset_id)

Arguments:
data: Data frame.
numeric_column: Name of a numeric column in 'data'.
dataset_id: ID for the dataset associated with 'data'.
Normalize the names of diseases to simplify the harmonization of disease names across historical sources.

normalize_diseases(diseases)

Arguments:
diseases: Character vector of disease names.
Open a Path on Mac OS or Windows

open_locally(urls, command = "open", args = character())
open_resources_locally(
  id, type = c("scans", "digitizations", "prep-scripts", "access-scripts")
)
open_all_resources_locally(id)
open_scans_locally(id)
open_digitizations_locally(id)

Arguments:
urls: Character vector of GitHub URLs in blob storage.
command: Command-line function used to open the file (not applicable on Windows systems).
args: Additional options to pass to 'command'.
id: Resource ID.
type: Type of resource.

open_resources_locally(): Open IIDDA pipeline resources locally.
open_all_resources_locally(): Open all pipeline resources regardless of resource type.
open_scans_locally(): Open scans locally.
open_digitizations_locally(): Open digitizations locally.
Construct a regex for Boolean-or.

or_pattern(x, at_start = TRUE, at_end = TRUE)

Arguments:
x: Character vector of alternative patterns.
at_start: Match only at the start of strings.
at_end: Match only at the end of strings.
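A sketch of the intended behaviour; the exact anchoring and grouping characters are assumptions based on the at_start and at_end arguments:

or_pattern(c("cat", "dog"))                 # something like "^(cat|dog)$"
or_pattern(c("cat", "dog"), at_end = FALSE) # something like "^(cat|dog)"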
Add rows to a data frame with 'cases_this_period' and 'period_end_date' for representing missing weeks. TODO: generalize to other time scales.

pad_weeks(data, ...)

Arguments:
data: Data frame with a 'cases_this_period' column and a 'period_end_date' column that is spaced weekly (but possibly with gaps).
...: Passed on to data.frame to create new constant columns.

Details: Could use https://github.com/EdwinTh/padr.

Value: The input 'data' but with new rows for missing weeks. These rows have 'NA' in 'cases_this_period' and in other columns that are not passed through '...' and were not constant in the input 'data' (constant values are carried over to the output data frame).
Pager

pager(page, n_per_page, rev = TRUE)

Arguments:
page: What page should be returned?
n_per_page: How many entries on each page?
rev: Should page one be at the end?

Value: Function of 'x' that returns the 'page'th page of size 'n_per_page' of 'x'.
Set the types of a dataset with all character-valued columns, using a data dictionary that defines the types.

parse_columns(data, data_dictionary)

Arguments:
data: Data frame with all character-valued columns.
data_dictionary: List of lists giving a data dictionary.
Syntactic sugar for common string-pasting operations.

x %_% y
x %+% y
x %.% y
x %-% y

Arguments:
x: Character vector.
y: Character vector.

%+%: Paste with a blank separator, like Python string concatenation.
%_%: Paste with underscore separator.
%.%: Paste with dot separator, useful for adding file extensions.
%-%: Paste with dash separator, useful for representing contiguous numbers.

Value: x concatenated with y.

'google' %.% 'com'
'snake' %_% 'case'
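The documented separators imply results like the following for the example above:

'google' %.% 'com' # "google.com"
'snake' %_% 'case' # "snake_case"
'abc' %+% 'def'    # "abcdef"
'1' %-% '3'        # "1-3"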
Create an R script providing a place to start when exploring an IIDDA pipeline.

pipeline_exploration_starter(script_filename, exploration_project_path, ...)

Arguments:
script_filename: Name for the generated script.
exploration_project_path: Path to the folder for containing the script. If this path doesn't exist, then it is created.
...: Additional arguments to pass on to the underlying script-creation step.

Details: The R script has the following:
1. Example code for printing out the data sources and datasets in the IIDDA pipeline repository.
2. Code for finding the paths to datasets and to the scripts for generating them.
3. Code for generating and/or reading in a user-selected IIDDA dataset.
Once the data are read in, the user is free to do whatever they want to with it.
Return a path in absolute form (if that is how it is specified) or relative to the IIDDA project root found using proj_root.

proj_path(...)

Arguments:
...: Path components for constructing the path.
Find the root path of an IIDDA-associated project (or any project with a file of a specific name in the root).

proj_root(filename = ".iidda", start_dir = getwd(), default_root = start_dir)
in_proj(filename = ".iidda", start_dir = getwd())

Arguments:
filename: String giving the name of the file that identifies the project.
start_dir: Optional directory from which to start looking for 'filename'.
default_root: Project root to use if 'filename' is not found.

Value: Recursively walk up the file tree from 'start_dir' until 'filename' is found, and return the path to the directory containing 'filename'. If 'filename' is not found, return 'default_root'.

in_proj(): Is a particular directory inside a project, as indicated by 'filename'?
Uses the Raw GitHub API.

raw_github(owner, repo, path, user = NULL, token = NULL, branch = "master")

Arguments:
owner: User or organization of the repo.
repo: Repository name.
path: Path to the file that you want to obtain.
user: Your username (only required for private repos).
token: OAuth personal access token (only required for private repos).
branch: Name of the branch (defaults to 'master').
Read Column Metadata

read_column_metadata(dataset, pattern)

Arguments:
dataset: IIDDA dataset ID.
pattern: Regular expression pattern for filtering candidate paths to be read from.
Read Data Columns

read_data_columns(filename)

Arguments:
filename: Path to a CSV file in IIDDA format.
Read in a data frame from a CSV file, using the CSV dialect adopted by IIDDA.

read_data_frame(filename, col_classes = "character")

Arguments:
filename: String giving the filename.
col_classes: See 'colClasses' in read.csv.
Read in digitized data to be prepared within the IIDDA project.

read_digitized_data(metadata)

Arguments:
metadata: Output of get_tracking_metadata.
Read Global Metadata

read_global_metadata(
  id, type = c("columns", "organization", "sources", "tidy-datasets")
)

Arguments:
id: ID of the 'type' of entity.
type: Type of entity.
Read Lookup

read_lookup(lookup_id)

Arguments:
lookup_id: IIDDA ID associated with an item in a 'lookup-tables' directory in an IIDDA repository.
Read Prerequisite Data

read_prerequisite_data(dataset_id, numeric_column_for_report = NULL)

Arguments:
dataset_id: IIDDA dataset ID.
numeric_column_for_report: Optional numeric column name to specify for producing a report using non_numeric_report.
Read Prerequisite Metadata

read_prerequisite_metadata(dataset, pattern)

Arguments:
dataset: IIDDA dataset ID.
pattern: Regular expression pattern for filtering candidate paths to metadata.
Read Prerequisite Paths

read_prerequisite_paths(dataset, pattern)

Arguments:
dataset: IIDDA dataset ID.
pattern: Regular expression pattern for filtering candidate paths to be read from.
Read Resource Metadata

read_resource_metadata(dataset, pattern)

Arguments:
dataset: IIDDA dataset ID.
pattern: Regular expression pattern for filtering candidate paths to be read from.
Read Tidy Data and Metadata Files

read_tidy_data(tidy_data_path, just_csv = FALSE)

Arguments:
tidy_data_path: Path to a folder containing four files: tidy data and the resulting metadata for each prep script.
just_csv: Should only the tidy CSV file be returned, rather than a list with the CSV and its metadata?
Read metadata tracking tables for an IIDDA project.

read_tracking_tables(path)

Arguments:
path: Path containing tracking tables.
(Deprecated)

readme_classic_iidda

An object of class character of length 1.
Convenience function for a one-time setup of all metadata required for a new prep script. The assumptions are (1) that the prep script is a '.R' file in the 'prep-scripts' directory of a directory within the 'pipelines' directory, and (2) that this script produces a CSV file in the 'derived-datasets' directory with the same 'basename()' as this '.R' file. Messages are printed with paths to newly created and/or existing metadata, derived data, and dependency files that should be checked manually. Sometimes it is helpful to delete some of these files and rerun 'register_prep_script'. However, this 'register_prep_script' function should not be used in a script that is intended to be run multiple times, as going forward the metadata and dependency files should be edited manually.

register_prep_script(script_path, type)

Arguments:
script_path: Path to the prep script being registered.
type: Type of the dataset being produced (e.g., CDI, Mortality). TODO: Give a list of acceptable values. Should be programmatically produced.
Convert a set of absolute paths to relative paths with respect to a specified 'containing_path'.

relative_paths(paths, containing_path = proj_root())

Arguments:
paths: Vector of absolute paths.
containing_path: Target working directory to be relative to.
Remove age information from a vector of category names.

remove_age(categories, re_templates, prefix = "")
memoise_remove_age(categories, re_templates, prefix = "")

Arguments:
categories: Vector of category names.
re_templates: List of templates that resolve to regular expressions for matching age information contained in category names.
prefix: Pattern to match at the beginning of the string that marks the beginning of age information.
Remove Parenthesized Substring

remove_between_paren(
  x, left = "\\(", right = "\\)", contents_pattern = ".*"
)

Arguments:
x: Character vector.
left: Left parenthetical string.
right: Right parenthetical string.
contents_pattern: Regex pattern for the contents between parentheses.

Value: Version of x with the first parenthesized substring removed from each element.

x = c("-", "", NA, "1", "3", "1 (Alta.)", "(Sask) 20")
remove_between_paren(x)
Process output from regmatches to return the correct age bound. Used in the lookup function created by make_age_hash_table.

return_matched_age_bound(x)

Arguments:
x: Character vector from the list output of regmatches, containing regex matches of age bound information contained in disease category names (one element per category name).

Value: Character string with the matched age bound.
Remove Trailing / Leading Slash

rm_trailing_slash(x)
rm_leading_slash(x)

Arguments:
x: Character vector with paths.

Value: Character vector without trailing/leading slashes.
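A small sketch of the expected behaviour (hypothetical paths):

rm_trailing_slash("pipelines/cdi/") # "pipelines/cdi"
rm_leading_slash("/pipelines/cdi")  # "pipelines/cdi"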
Save the resulting objects of a data prep script into an R data file. The names of the resulting objects are given by the names of the result list.

save_result(result, metadata)

Arguments:
result: Named list of data resulting from data prep scripts.
metadata: Nested named list describing metadata for the result.
Set Extension

set_ext(paths, ext)

Arguments:
paths: Character vector giving file paths.
ext: String giving the file extension to add to the paths.
Deprecated: the iidda.api package is more robust.

set_iidda_col_types(data)

Arguments:
data: Dataset from the IIDDA API.
Set the types of the columns of a data frame.

set_types(data, types)

Arguments:
data: Data frame.
types: Dict-like list with keys giving column names and values giving types.

Value: Data frame with changed column types. Note that the returned data frame is a plain base R data.frame (i.e., not a tibble or data.table).
Source from Digitization ID

source_from_digitization_id(digitization_ids)

Arguments:
digitization_ids: Character vector of digitization IDs.

Value: Character vector of source IDs associated with each digitization.
Version of the sprintf base R function that adds basic templating (https://stackoverflow.com/a/55423080/2047693).

sprintf_named(template, ..., .check = TRUE)

Arguments:
template: Template string.
...: Named arguments with strings that fill template variables of the same name between %{ and }s.
.check: Should the consistency between the arguments and the template be checked?

Details: Because this is based on the sprintf function, use %% when you would like a single % to appear in the template. However, supplying a single % to a named argument will result in a single % in the output. You can use syntactically invalid names for arguments by enclosing them in backticks in the argument list, but not in the template.

sprintf_named("You might like to download datasets from %{repo}s.", repo = "IIDDA")
List of lists of lists that exploits tab completion to make it convenient to get vectors of all synonyms associated with a particular standard code. This mechanism is useful when searching for data in IIDDA.

standards

List of lists of character vectors containing the original historical names:
* Historical national names associated with each ISO-3166 code.
* Historical national and sub-national names associated with each ISO-3166 code.
* Historical sub-national names associated with each ISO-3166-2 code.
* Historical names referring to sexes associated with each ISO-5218 code.
Prepare Mortality Data from Statistics Canada

statcan_mort_prep(data)

Arguments:
data: Data read in for preparation (e.g., the output of read_digitized_data).

Value: Data frame complying with the IIDDA requirements for tidy datasets.
Strip the 'blob part' of a GitHub URL so that it is a path relative to a local clone of the associated repo.

strip_blob_github(urls)

Arguments:
urls: Character vector of GitHub URLs in blob storage.

strip_blob_github("https://github.com/canmod/iidda-tools/blob/main/R/iidda/R/github_parsing.R")
Sum Timescales

sum_timescales(data, filter_out_bad_time_scales = TRUE)

Arguments:
data: Data frame to aggregate to different time scales.
filter_out_bad_time_scales: Should time scales be filtered out if they do not have a reasonable number of periods (e.g., 600 weeks in a year would be filtered out)?
Consecutive or overlapping date ranges are summarised into a single date range; non-consecutive date ranges are kept as is.

summarise_dates(x_start, x_end, range_operator = " to ", collapse = TRUE)

Arguments:
x_start: Vector of period starting dates.
x_end: Vector of period ending dates.
range_operator: String to go between the start and end date; defaults to " to ".
collapse: Boolean indicating whether to collapse all dates into one comma-separated string; defaults to TRUE.

Value: Vector or single string of summarised date ranges.
Summarise disease name columns in an IIDDA dataset.

summarise_diseases(data)

Arguments:
data: Data frame hopefully containing at least one of 'disease' or 'historical_disease'. If all are missing then the output summary is a blank string.

Value: A string summarizing the data in the columns.
Consecutive or overlapping integers separated by commas or semicolons are summarised into a single integer range; non-consecutive integer ranges are kept as is.

summarise_integers(x, range_operator = "-", collapse = TRUE)

Arguments:
x: Vector of integers.
range_operator: String to go between the starting and ending integer in the range; defaults to "-".
collapse: Boolean indicating whether to collapse all integer ranges into one comma-separated string; defaults to TRUE.

Value: Vector or single string of summarised integer ranges.
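A usage sketch under the documented defaults, assuming x contains comma- or semicolon-separated integers as described above (hypothetical input and output):

summarise_integers(c("1; 2; 3", "7, 8")) # something like "1-3, 7-8"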
Summarise several columns in an IIDDA dataset that specify the geographic location of each row.

summarise_locations(data)

Arguments:
data: Data frame hopefully containing at least one of 'iso_3166', 'iso_3166_2', or 'location'. If all are missing then the output summary is a blank string.

Value: A string summarizing the data in the columns.
Summarise time periods in an IIDDA dataset.

summarise_periods(data, cutoff = 50)
summarise_periods_vec(period_start_date, period_end_date, cutoff = 50)

Arguments:
data: Data frame hopefully containing both 'period_start_date' and 'period_end_date'. If either is missing, an error results.
cutoff: Number of characters, above which the output string takes the form 'max-date to min-date (with gaps)'.
period_start_date: Column with the start dates of the periods.
period_end_date: Column with the end dates of the periods.

Value: A string summarizing the data in the columns.

summarise_periods_vec(): For use inside 'mutate' and 'summarise' functions.
Summarise a vector of strings separated by commas or semicolons into a single separated string. Removes empty strings and repeated strings, and trims white space.

summarise_strings(x, sep = ", ")

Arguments:
x: Vector of strings.
sep: Character separator; defaults to ", ".

Value: Single string of summarised strings.
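A usage sketch with hypothetical input, based on the documented behaviour (empty strings dropped, duplicates removed, white space trimmed):

summarise_strings(c("measles; mumps", " measles ", "")) # something like "measles, mumps"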
Test the results of a data prep script (not finished).

test_result(result)

Arguments:
result: Named list of data resulting from data prep scripts.
Find 'island rows' in a dataset with ordered rows. Islands have a series variable that is not 'NA', surrounded by 'NA' values in that same variable. This function could work well with pad_weeks if you are looking for weekly 'islands'.

time_series_islands(data, series_variable, time_variable = NULL)

Arguments:
data: A dataset (must be ordered if 'time_variable' is 'NULL').
series_variable: Name of a series variable.
time_variable: Optional variable to use for ordering the dataset before islands are located.
Tracking Table Keys

tracking_table_keys

An object of class list of length 5.
Which Tracking Tables have a Particular Column

tracking_tables_with_column(metadata, col_nm)

Arguments:
metadata: Output of read_tracking_tables.
col_nm: Name of a column.
Attempt to automatically convert a dataset from the 'disease|_subclass|_family' format of disease ID to the '|nesting_disease' format.

two_field_format(dataset)

Arguments:
dataset: A tidy data set with 'disease|_subclass|_family' columns.
Replace list elements with list('') for each element that is NULL, not a character vector, or of length zero.

unlist_char_list(x)

Arguments:
x: List of character vectors.
Vectorized String Substitution

vsub(pattern, replacement, x, ...)

Arguments:
pattern, replacement, x: First three arguments of sub.
...: Additional arguments to pass on to sub.
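A sketch assuming vsub vectorizes sub over pattern and replacement (hypothetical values):

vsub(c("a", "o"), c("A", "O"), c("cat", "dog")) # c("cAt", "dOg")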
Wrap a list of regular expressions for matching age bounds in disease category names, so that the resulting regular expressions can be used for different purposes (extraction, removal, or validation).

wrap_age_patterns(
  patterns, purpose = c("extraction", "removal", "validate"), prefix = ""
)

Arguments:
patterns: Vector of regular expressions for matching age bound information in disease category names.
purpose: Character string indicating the purpose of the resulting regular expression.
prefix: Pattern to match at the beginning of the string that marks the beginning of age information.
Write a data frame to a CSV file, using the CSV dialect adopted by IIDDA.

write_data_frame(data, filename)

Arguments:
data: A data frame to write.
filename: String giving the filename.
Write Local Data Dictionaries

write_local_data_dictionaries(metadata, path)

Arguments:
metadata: Metadata for the dataset (typically the output of get_tracking_metadata).
path: Path to a new JSON file.
Write Tidy Digitized Data and Metadata

write_tidy_data(tidy_data, metadata, tidy_dir = NULL)

Arguments:
tidy_data: Data frame of prepared data that are ready to be packaged as an IIDDA tidy data set.
metadata: Output of get_tracking_metadata.
tidy_dir: Optional output directory for the tidy data and metadata.

Value: File names where data were written.
Report on the differences between two xlsx files.

xlsx_diff(path_one, path_two, ...)

Arguments:
path_one: Path to an Excel file.
path_two: Path to an Excel file.
...: Additional arguments to pass on to the function used to read the Excel files.

Value: Either 'TRUE' if the two files are identical, or a list with the following items.
* 'all_equal': Result of applying all.equal to the data frames representing each Excel file.
* 'in_both_but_different': Data frame containing cells that are in both Excel files but with different values.
* 'in_one_only': Data frame containing cells that are in the first Excel file but not the second.
* 'in_two_only': Data frame containing cells that are in the second Excel file but not the first.
Convert an Excel file to a CSV file.

xlsx_to_csv(xlsx_path, csv_path)

Arguments:
xlsx_path: Path to an Excel file.
csv_path: Path to a new CSV file.