Package 'LBoM.tools' reference manual

Title:	Tools for Curating London Bills of Mortality and Registrar General Data
Description:	Part of an open toolchain for processing infectious disease datasets available through the IIDDA data repository.
Authors:	Steve Walker [aut, cre], Jen Freeman [aut]
Maintainer:	Steve Walker <[email protected]>
License:	GPL (>= 3)
Version:	0.1.0
Built:	2024-10-24 06:00:02 UTC
Source:	https://github.com/canmod/LBoM.tools

Assert Current Scope

Description

Scope of the data.

Usage

assert_current_scope(data_table, metadata)
assert_current_scope(data_table, metadata)

Arguments

`data_table`	table containing data from tidyxl::xlsx_cells output
`metadata`	result of `get_tracking_metadata`

Value

filtered data list after entities that are out of scope have been removed. Data is in tidyxl::xlsx_cells format where each record corresponds to one cell in source data.

Combine Sheets

Description

Combine Sheets

Usage

combine_sheets(data_table, time_metadata, dataset_type)
combine_sheets(data_table, time_metadata, dataset_type)

Arguments

`data_table`	Data frame returned by 'read_digitized_data'.
`time_metadata`	List returned by `make_time_metadata`.
`dataset_type`	Type of dataset (e.g., Mortality, Birth)

Convert to date

Description

Convert date string to date data type.

Usage

convert_to_date(y, m, d, period)

convert_vec_to_date(y, m, d, period)
convert_to_date(y, m, d, period)

convert_vec_to_date(y, m, d, period)

Arguments

`y`	string containing 4-digit year
`m`	string containing month numbers (1-12) or 'NA'.
`d`	string containing day numbers (1-31) or 'NA'.
`period`	a character string indicating whether the years are the start or end of a period. This argument is only available when 'm' and 'd' are 'NA'. Valid inputs are "start"(January 1) or "end"(December 31), default is NULL.

Value

Date object either a valid date or NA_Date_

Functions

convert_vec_to_date(): Vectorized version of date conversion

Create date fields

Description

Creates two new fields of class 'Date' named 'period_start_date' and 'period_end_date' in input table using existing fields containing date range information.

Usage

create_date_fields(data_table)
create_date_fields(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

'data_table' containing additional fields named 'period_start_date' and 'period_end_date'

Date range issues

Description

Identify gaps and overlaps in weekly date ranges in Excel source data sheets.

Usage

date_range_issues(data_table)
date_range_issues(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to gaps and overlaps in weekly date range data

Identify entire columns in Excel sheets that contain no numeric or text data. These columns are outside the boundaries of the data table contained in the sheet but they may contain Excel formatting that causes the data reading functionality to perceive these columns as containing data.

Usage

empty_cols(data_table)
empty_cols(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to entire Excel sheet columns with no data

Empty rows

Description

Identify entire rows in Excel sheets that contain no numeric or text data. These rows are outside the boundaries of the data table contained in the sheet but they may contain Excel formatting that causes the data reading functionality to perceive these rows as containing data.

Usage

empty_rows(data_table)
empty_rows(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to entire Excel sheet rows with no data

Exclude fields

Description

Temporarily exclude some fields that require additional work/thought on how they should be incorporated in the tidy data set. These include: - some parish total fields - parish grouping fields are now included in scope for the mortality data set using reference tables that map parish groups with individual parishes. Parish grouping fields for birth data is currently not in scope, however it can be included in scope by following the framework used for the mortality data set. - relative fields - fields that contain relative counts (ex. "increases in burials this week")

Usage

exclude_fields(data_table)
exclude_fields(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to excluded fields

Data quality filter

Description

Create data quality issue report from input table with the option to filter out data quality issues.

Usage

filter_out_data_quality(data_table, metadata, filter_data = TRUE, skip = TRUE)
filter_out_data_quality(data_table, metadata, filter_data = TRUE, skip = TRUE)

Arguments

`data_table`	table containing data from tidyxl::xlsx_cells output
`metadata`	result of `get_tracking_metadata`
`filter_data`	boolean flag to filter out data quality issues. For the 'repeated_field_name' data quality issue, records are always filtered to create unique records despite this flag to ensure later pipeline processing steps can function.
`skip`	completely skip data quality checking and just return the input dataset

Value

Returns input ‘data_table' with possibly additional unique records corresponding to the ’repeated_field_name' data quality issue when 'filter_data=FALSE'. If 'filter_data=TRUE' function will return 'data_table' after data quality issue records have been removed.

Fix Tidy Data Columns

Description

Fix Tidy Data Columns

Usage

fix_tidy_data_columns(data_set)
fix_tidy_data_columns(data_set)

Arguments

data_set

processed output from one of the data processing functions (e.g. process_mortality)

Header rows

Description

Identify entire rows in Excel sheets that contain header comments in at least one of the cells. These rows appear at the top of the Excel sheets and are assumed to start with a hash tag character.

Usage

header_rows(data_table)
header_rows(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to header rows

Inconsistent time column names

Description

Identify sheets that contain time data formatted differently than all other sheets. (One Excel file contains weekly time data with no ".from" or ".to" suffix for "year", "month", and "day" columns.)

Usage

inconsistent_column_names(data_table)
inconsistent_column_names(data_table)

Arguments

data_table

table containing mortality data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to inconsistent time column names

Invalid dates

Description

Identifies data records that correspond to invalid dates in Excel source data sheets. Invalid dates refer to all dates that are 'NA_Date_' after 'create_date_fields()' was executed. For example, non-existent dates (i.e.'2000-02-30') are invalid dates.

Usage

invalid_dates(data_table)
invalid_dates(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

all records that correspond to invalid dates in 'data_table'.

LBoM Preprocessing

Description

LBoM Preprocessing

Usage

lbom_pre_processing(data, metadata)
lbom_pre_processing(data, metadata)

Arguments

`data`	Dataset to be preprocessed in IIDDA pipelines.
`metadata`	Metadata in form returned by iidda::get_dataset_metadata

Concatenate Year, Month, and Day Fields into a yyyy-mm-dd String.

Description

Concatenate Year, Month, and Day Fields into a yyyy-mm-dd String.

Usage

make_date_string(y, m, d, period = NULL)
make_date_string(y, m, d, period = NULL)

Arguments

`y`	vector containing years
`m`	vector containing month numbers (1-12) or 'NA'.
`d`	vector containing day-of-month numbers (1-31) or 'NA'.
`period`	a character string indicating whether the years are the start or end of a period. This argument is only available when 'm' and 'd' are 'NA'. Valid inputs are "start"(January 1) or "end"(December 31), default is NULL.

Make Date String

Description

Concatenate Year, Month, and Day Fields into a yyyy-mm-dd String.

Usage

make_date_string_vec(y, m, d, period = NULL)

make_date_string_vec_year(y, m, d, period = NULL)
make_date_string_vec(y, m, d, period = NULL)

make_date_string_vec_year(y, m, d, period = NULL)

Arguments

`y`	vector containing years
`m`	vector containing month numbers (1-12) or 'NA'.
`d`	vector containing day-of-month numbers (1-31) or 'NA'.
`period`	a character string indicating whether the years are the start or end of a period. This argument is only available when 'm' and 'd' are 'NA'. Valid inputs are "start"(January 1) or "end"(December 31), default is NULL.

Functions

make_date_string_vec_year(): Make date string from year information only

Make Report Path

Description

Make the path to be used for storing reports and supporting output, by modifying the path to the tidy data to be found in the associated metadata.

Usage

make_report_path(metadata)
make_report_path(metadata)

Arguments

metadata

result of get_tracking_metadata

Make Time Metadata

Description

Make Time Metadata

Usage

make_time_metadata(data_table)
make_time_metadata(data_table)

Arguments

data_table

Data frame returned by 'read_digitized_data'.

Missing fields

Description

Identify columns in Excel sheets that contain numeric data with no associated field name. Given there is no field name the tidy data splitting functionality incorrectly assigns this data.

Usage

missing_fields(data_table)
missing_fields(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to entire Excel sheet columns with no field name

Process Data

Description

Process all data to convert from tidyxl::xlsx_cells format to long format. The output of this function contains data provenance information that allow one to trace each record back to a particular cell in the digitized Excel data.

Usage

process_population(data_table, metadata)

process_all_cause(data_table, metadata)

process_births(data_table, metadata)

process_plague(data_table, metadata)

process_mortality(data_table, metadata)
process_population(data_table, metadata)

process_all_cause(data_table, metadata)

process_births(data_table, metadata)

process_plague(data_table, metadata)

process_mortality(data_table, metadata)

Arguments

`data_table`	table containing data in tidyxl::xlsx_cells format
`metadata`	result of `get_tracking_metadata`

Value

table containing counts in long format (by sex if applicable) and compact time metadata format, age data extracted from cause field names (if applicable)

Functions

process_population(): Process population data
process_all_cause(): Process all-cause mortality data
process_births(): Process birth data
process_plague(): Process plague data
process_mortality(): Process mortality data

Process Excel Sheet

Description

Process excel sheet to convert from tidyxl::xlsx_cells format to tidy long format

Usage

process_sheet(
  sheet,
  col_time_metadata,
  address_time_metadata,
  numeric_time_metadata,
  data_type = c("mortality", "all-cause-mortality", "births", "plague", "population")
)
process_sheet(
  sheet,
  col_time_metadata,
  address_time_metadata,
  numeric_time_metadata,
  data_type = c("mortality", "all-cause-mortality", "births", "plague", "population")
)

Arguments

`sheet`	table containing one Excel sheet in tidyxl::xlsx_cells format
`col_time_metadata`	time metadata fields with "col_" prefix
`address_time_metadata`	time metadata fields with "address_" prefix
`numeric_time_metadata`	time metadata fields with "numeric_" prefix
`data_type`	TODO: describe data type

Value

processed sheet as a tibble

Question marks

Description

Identify cells that contain question marks. It is not clear what type of missing data is being represented by a question mark.

Usage

question_marks(data_table)
question_marks(data_table)

Arguments

data_table

table containing mortality data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to cells with question marks

Read Source RDS

Description

Read in source RDS data

Usage

read_source_RDS(
  source_data_folder,
  data_category = list("mortality", "acm", "birth", "plague", "population",
    "unknown_data_category")
)
read_source_RDS(
  source_data_folder,
  data_category = list("mortality", "acm", "birth", "plague", "population",
    "unknown_data_category")
)

Arguments

`source_data_folder`	top level folder where source RDS files are saved
`data_category`	data category

Value

tibble containing all data in specified data category

Repeated field names

Description

Field names that appear multiple times in Excel source data sheets. In order to have one data point per each time period (row in the sheet), the decision was made (https://github.com/davidearn/data_work/issues/192) to select the maximum numeric value for each record for these fields. Additionally, for each time period if both numeric and non-numeric values exist we prioritize numeric values, and in the presence of numeric ties we select only one of the numeric values. All records identified by these decisions are updated to have a new column number (that does not contain data in the digitized files) so they can easily be identified, and original columns are excluded in the final data set.

Usage

repeated_field_names(data_table)
repeated_field_names(data_table)

Arguments

data_table

table containing data from tidyxl::xlsx_cells output

Value

all records in 'data_table' that correspond to repeated field names

Split Source Excel

Description

Read in source .xlsx data files and perform some data pre-processing. Splits data into data categories and saves resulting table as RDS file.

Usage

split_source_excel(
  source_data_folder,
  intermediate_data_folder,
  data_type = c("mortality", "all-cause-mortality", "births", "plague", "population")
)
split_source_excel(
  source_data_folder,
  intermediate_data_folder,
  data_type = c("mortality", "all-cause-mortality", "births", "plague", "population")
)

Arguments

`source_data_folder`	folder where source data is saved.
`intermediate_data_folder`	top level folder for output, resulting RDS output file will be saved in intermediate_data_folder > data_type location.
`data_type`	data type to save as RDS file. Valid inputs are mortality, all-cause-mortality, births, plague, population. The default is `mortality`.

Value

the combined data frame that is saved in an RDS file

Unclassified field names

Description

Identify field names in Excel source data sheets that match regular expression patterns in representative_categories.csv that have been labelled data quality issues. Additionally, find field names that do not have a regular expression match, or have more that one match.

Usage

unclassified_field_names(data_table, representative_test_categories)
unclassified_field_names(data_table, representative_test_categories)

Arguments

`data_table`	table containing data from tidyxl::xlsx_cells output
`representative_test_categories`	data frame containing the reference table

Value

all records in 'data_table' that correspond to unclassified field names in tidyxl::xlsx_cells format

Package 'LBoM.tools'

Help Index

Assert Current Scope

Description

Usage

Arguments

Value

See Also

Combine Sheets

Description

Usage

Arguments

Convert to date

Description

Usage

Arguments

Value

Functions

Create date fields

Description

Usage

Arguments

Value

Date range issues

Description

Usage

Arguments

Value

See Also

Empty cols

Description

Usage

Arguments

Value

See Also

Empty rows

Description

Usage

Arguments

Value

See Also

Exclude fields

Description

Usage

Arguments

Value

See Also

Data quality filter

Description

Usage

Arguments

Value

Fix Tidy Data Columns

Description

Usage

Arguments

Header rows

Description

Usage

Arguments

Value

See Also

Inconsistent time column names

Description

Usage

Arguments

Value

See Also

Invalid dates

Description

Usage

Arguments

Value

See Also

LBoM Preprocessing

Description

Usage

Arguments

Concatenate Year, Month, and Day Fields into a yyyy-mm-dd String.

Description