Title: | Tools for Analyzing IIDDA Datasets |
---|---|
Description: | This package contains tools for working with data obtained from the International Infectious Disease Data Archive. |
Authors: | Steven Walker |
Maintainer: | Steven Walker <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.0 |
Built: | 2024-11-20 12:43:44 UTC |
Source: | https://github.com/canmod/iidda-tools |
Adds entries to a user-defined lookup table. Entries should have names or columns from the user lookup table; standards can be used for entries.
add_user_entries(entries, user_table_path)
entries |
dataframe or named list of entries to add |
user_table_path |
string indicating path to user lookup table |
user lookup table with added entries in the path
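A minimal usage sketch. The entry names and file path are hypothetical, and the lookup table CSV is assumed to already exist at 'user_table_path'.

```r
library(iidda.analysis)

# Hypothetical entries; names should match columns of the user lookup table
entries <- list(age_group = "5-9", bin_desc = "[5, 9]")

# Hypothetical path to an existing user lookup table
add_user_entries(entries, user_table_path = "lookup-tables/user-age-groups.csv")
```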
Open a browser at the locations of the dependencies associated with a set of datasets.
browse_pipeline_dependencies( dataset_ids, dependency_types = c("IsCompiledBy", "IsDerivedFrom", "References"), metadata = iidda.api::ops_staging$metadata(dataset_ids = dataset_ids) )
dataset_ids |
Character vector of dataset identifiers. |
dependency_types |
Vector of types of dependencies to browse. Possible
values include "IsCompiledBy", "IsDerivedFrom", and "References". |
metadata |
Optional list giving dataset metadata. The default uses the IIDDA API, which requires the internet. |
Order Canadian Provinces Geographically
ca_iso_3166_2(data)
data |
Dataset containing an 'iso_3166_2' field with Canadian province and territory codes. |
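A sketch with hypothetical data; only the 'iso_3166_2' field is required by the function.

```r
library(iidda.analysis)

# Hypothetical data with Canadian province/territory codes
data <- data.frame(
    iso_3166_2 = c("ON", "BC", "NS")
  , cases = c(10, 5, 2)
)
ordered_data <- ca_iso_3166_2(data)  # provinces reordered geographically
```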
Test if x is a Date, coerce if not
check_date(x)
x |
vector of putative dates |
vector with class Date, or error
d1 <- check_date("1920-01-01")
d1
class(d1)
# returns an error if x can't be coerced to Date easily
# check_date("may 29th")
The important cleaning steps include (1) removing 'CA-' from ISO-3166-2 codes (because within Canada this is redundant) and (2) filtering out all time scales but the 'best', so that there is no chance of double-counting cases.
clean_canmod_cdi(canmod_cdi, ...)
canmod_cdi |
Dataset from IIDDA of type 'CANMOD CDI'. |
... |
Arguments to pass on to |
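A sketch, assuming 'canmod_cdi' has already been obtained from IIDDA (the download step is omitted here):

```r
library(iidda.analysis)

# 'canmod_cdi' is assumed to be a data frame of IIDDA type 'CANMOD CDI'
cleaned <- clean_canmod_cdi(canmod_cdi)

# After cleaning, iso_3166_2 codes no longer carry the redundant 'CA-' prefix
# and only the 'best' time scale remains for each series
```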
Compute Moving Average of Time Series
ComputeMovingAverage(ma_window_length = 52)
ma_window_length |
length of moving average window; this will depend on the time scale in the data. Defaults to 52, so that weekly data is averaged over years. |
a function to compute the moving average of a time series variable
## Returned Function
- Arguments:
  - 'data': data frame containing time series data
  - 'series_variable': column name of series variable in 'data', default is "deaths"
  - 'time_variable': column name of time variable in 'data', default is "period_end_date"
- Return: all fields in 'data' with the 'series_variable' data replaced with the moving average.
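A sketch of the factory pattern with hypothetical weekly data:

```r
library(iidda.analysis)

# Hypothetical weekly mortality series
weekly <- data.frame(
    period_end_date = seq(as.Date("1900-01-07"), by = "1 week", length.out = 104)
  , deaths = rpois(104, 20)
)

# The factory returns a function; a 52-week window averages weekly data over years
ma <- ComputeMovingAverage(ma_window_length = 52)
smoothed <- ma(weekly, series_variable = "deaths", time_variable = "period_end_date")
```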
Create age bin descriptions for joining age_group lookup table
create_bin_desc(age_df)
age_df |
data frame of data with age_group column |
data frame of data with bin_desc column
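A sketch with hypothetical age groups:

```r
library(iidda.analysis)

# Hypothetical data with an age_group column
age_df <- data.frame(age_group = c("0-4", "5-9", "10+"), cases = c(3, 1, 7))
with_desc <- create_bin_desc(age_df)  # gains a bin_desc column for joining
```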
Factor Time Scale
factor_time_scale(data)
data |
A tidy data set with a 'time_scale' column. |
A data set with a factored time_scale column.
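A sketch; the 'time_scale' values shown are illustrative, not a definitive list:

```r
library(iidda.analysis)

# Hypothetical tidy data with a time_scale column
data <- data.frame(
    time_scale = c("wk", "mo", "yr")
  , cases = c(1, 4, 52)
)
factored <- factor_time_scale(data)  # time_scale becomes a factor
```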
Make new records for instances when the sum of leaf diseases is less than the reported total for their basal disease. The difference between these counts gets the disease name '<basal_disease>_unaccounted'.
find_unaccounted_cases(data)
data |
A tidy data set with a 'basal_disease' column. |
A data set containing records that are the difference between a reported total for a basal_disease and the sum of their leaf diseases.
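A sketch with hypothetical data; the column names shown are assumptions about the minimal tidy structure, and real CANMOD CDI data carries more fields:

```r
library(iidda.analysis)

# Hypothetical tidy data: leaf diseases sum to 8, basal total reports 10
data <- data.frame(
    disease = c("disease-a", "disease-b", "basal")
  , nesting_disease = c("basal", "basal", "")
  , basal_disease = "basal"
  , cases_this_period = c(5, 3, 10)
)
extra <- find_unaccounted_cases(data)  # records for the total minus the leaf sum
```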
Data for a Particular Disease
generate_disease_df(canmod_cdi, disease_name, years = NULL, add_gaps = TRUE)
canmod_cdi |
Dataset from IIDDA of type 'CANMOD CDI'. |
disease_name |
Name to match in the 'nesting_disease' column of a 'CANMOD CDI' dataset. |
years |
If not 'NULL', a vector of years to keep in the output data. |
add_gaps |
If 'TRUE', add records with 'NA' in 'cases_this_period' that correspond to time-periods without any data. |
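A sketch, assuming 'canmod_cdi' is a previously obtained 'CANMOD CDI' dataset; the disease name and year range are hypothetical:

```r
library(iidda.analysis)

measles <- generate_disease_df(
    canmod_cdi
  , disease_name = "measles"  # matched in the nesting_disease column
  , years = 1950:1955
  , add_gaps = TRUE           # insert NA records for periods without data
)
```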
Creates an empty table in a specified directory using column names from another data frame
generate_empty_df(dir_path, lookup_table, csv_name)
dir_path |
string indicating path to directory |
lookup_table |
data frame with column names to include in table |
csv_name |
string indicating name of the created .csv file |
empty csv file with columns from lookup_table
in the directory if successfully generated
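A sketch; the lookup table columns and file name are hypothetical:

```r
library(iidda.analysis)

# Columns of the new empty table mirror this (hypothetical) lookup table
lookup_table <- data.frame(age_group = character(), bin_desc = character())
generate_empty_df(
    dir_path = tempdir()
  , lookup_table = lookup_table
  , csv_name = "my-age-groups"
)
```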
Creates an empty user-defined lookup table in a specified directory
generate_user_table(path, lookup_table_type)
path |
string indicating path to directory |
lookup_table_type |
string indicating type of lookup table |
csv file of empty lookup table with columns from lookup_table_type
in the directory if successful
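A sketch; "age_group" is an assumed 'lookup_table_type', so consult the package for the supported types:

```r
library(iidda.analysis)

generate_user_table(path = tempdir(), lookup_table_type = "age_group")
```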
Add zeros to data set that are implied by a '0' reported at a coarser timescale.
get_implied_zeros(data)
data |
A tidy data set with the following minimal set of columns: 'disease', 'nesting_disease', 'year', 'original_dataset_id', 'iso_3166_2', 'basal_disease', 'time_scale', 'period_start_date', 'period_end_date', 'period_mid_date', 'days_this_period', 'dataset_id' |
A tidy data set with inferred 0s.
Get label of associated time unit
get_unit_labels(unit)
unit |
time unit, one of iidda.analysis:::time_units |
label of associated time unit
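A one-line sketch:

```r
library(iidda.analysis)

get_unit_labels("week")  # label associated with the "week" time unit
```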
Wrapper of 'seq.Date()' and 'lubridate::floor_date'
grid_dates( start_date = "1920-01-01", end_date = "2020-01-01", by = "1 week", unit = "week", lookback = TRUE, week_start = 7 )
start_date |
starting date |
end_date |
end date |
by |
increment of the sequence. Optional. See ‘Details’. |
unit |
a string giving the time unit passed to 'lubridate::floor_date()', e.g. "week", "month", or "year". |
lookback |
Logical, should the first value start before 'start_date' |
week_start |
week start day (default is 7, Sunday; set to 1 for Monday). |
vector of Dates at the first of each week, month, year
grid_dates(start_date = "2023-04-01", end_date = "2023-05-16")
grid_dates(start_date = "2023-04-01", end_date = "2023-05-16", lookback = FALSE)
grid_dates(start_date = "2020-04-01", end_date = "2023-05-16", by = "2 months", unit = "month")
grid_dates(start_date = "2020-04-01", end_date = "2023-05-16", by = "2 months")
Remove or replace values that are 'NA'.
HandleMissingValues(na_remove = FALSE, na_replace = NULL)
na_remove |
boolean value, if 'TRUE' remove 'NA's in series variable |
na_replace |
numeric value to replace 'NA's in series variable, if NULL no replacement is performed |
a function to remove or replace missing values.
## Returned Function
- Arguments:
  - 'data': data frame containing time series data
  - 'series_variable': column name of series variable in 'data', default is "deaths"
- Return: all fields in 'data' with 'NA' records either removed or replaced.
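A sketch of the factory pattern with a hypothetical series:

```r
library(iidda.analysis)

# Configure a handler that replaces NA with 0 rather than dropping records
handler <- HandleMissingValues(na_remove = FALSE, na_replace = 0)

# Hypothetical series with a missing value
data <- data.frame(
    period_end_date = as.Date(c("1900-01-07", "1900-01-14"))
  , deaths = c(NA, 12)
)
filled <- handler(data, series_variable = "deaths")
```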
Remove or replace series variable values that are zero.
HandleZeroValues(zero_remove = FALSE, zero_replace = NULL)
zero_remove |
boolean value, if 'TRUE' remove zeroes in series variable |
zero_replace |
numeric value to replace zeroes in series variable, if NULL no replacement is performed |
a function to remove or replace zero values.
## Returned Function
- Arguments:
  - 'data': data frame containing time series data
  - 'series_variable': column name of series variable in 'data', default is "deaths"
- Return: all fields in 'data' with zero records either removed or replaced.
Get starting time period, ending time period and mortality cause name from the data set for use in axis and main plot titles.
iidda_get_metadata( data, time_variable = "period_end_date", descriptor_variable = "cause" )
data |
data frame containing time series data |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
descriptor_variable |
column name of the descriptor variable in 'data', default is "cause" for mortality data sets. |
a list in order containing minimum time period, maximum time period and cause name.
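A sketch with hypothetical mortality data:

```r
library(iidda.analysis)

# Hypothetical mortality data
data <- data.frame(
    period_end_date = as.Date(c("1900-01-07", "1900-12-30"))
  , cause = "smallpox"
  , deaths = c(3, 5)
)
meta <- iidda_get_metadata(data)  # list: min time, max time, descriptor name
```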
Add a bar plot to an existing ggplot plot object. Graphical choices were made to closely reflect plots generated with 'LBoM::monthly_bar_graph' and 'LBoM::weekly_bar_graph'.
iidda_plot_bar( plot_object, data = NULL, series_variable = "deaths", time_unit = "week" )
plot_object |
a 'ggplot2' plot object |
data |
data frame containing data prepped for bar plotting, typically output from 'iidda_prep_bar()'. If 'NULL' data is inherited from 'plot_object' |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_unit |
time unit to display bar graphs on the x-axis. Must be "week" or one of iidda.analysis:::time_units that starts with "month"; defaults to "week". Should generalize at some point to be able to take any time_unit argument. |
a ggplot2 plot object containing a bar graph of time series data
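A sketch of the prep-then-plot pipeline with hypothetical weekly data:

```r
library(iidda.analysis)
library(ggplot2)

# Hypothetical weekly mortality data
data <- data.frame(
    period_end_date = seq(as.Date("1900-01-07"), by = "1 week", length.out = 104)
  , deaths = rpois(104, 20)
)
prepped <- iidda_prep_bar(data, time_unit = "week")
p <- iidda_plot_bar(ggplot(), data = prepped, time_unit = "week")
```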
Add a box plot to an existing ggplot plot object. Graphical choices were made to closely reflect plots generated with 'LBoM::monthly_box_plot'.
iidda_plot_box( plot_object, data = NULL, series_variable = "deaths", time_unit = "week", ... )
plot_object |
a 'ggplot2' plot object |
data |
data frame containing data prepped for box plotting, typically output from 'iidda_prep_box()'. If 'NULL' data is inherited from 'plot_object' |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_unit |
time unit to display box plots on the x-axis. Defaults to "week"; should be able to handle any time_unit from iidda.analysis:::time_units. |
... |
other arguments to be passed to 'scale_x_discrete' |
a ggplot2 plot object containing box plots of time series data
Add a yearly vs. weekly heatmap to an existing ggplot plot object. Graphical choices were made to closely reflect plots generated with 'LBoM::seasonal_heat_map'.
iidda_plot_heatmap( plot_object, data = NULL, series_variable = "deaths", start_year_variable = "Year", end_year_variable = "End Year", start_day_variable = "Day of Year", end_day_variable = "End Day of Year", colour_trans = "log2", NA_colour = "black", palette_colour = "RdGy", ... )
plot_object |
a 'ggplot2' plot object |
data |
data frame containing data prepped for yearly vs. weekly heatmaps, typically output from 'iidda_prep_heatmap()'. If 'NULL' data is inherited from 'plot_object'. |
series_variable |
column name of series variable in 'data', default is "deaths" |
start_year_variable |
column name of time variable containing the year of the starting period, defaults to "Year" |
end_year_variable |
column name of time variable containing the year of the ending period, defaults to "End Year" |
start_day_variable |
column name of time variable containing the day of the starting period, defaults to "Day of Year" |
end_day_variable |
column name of time variable containing the day of the ending period, defaults to "End Day of Year" |
colour_trans |
string indicating colour transformation, one of "log2", "sqrt" or "linear" |
NA_colour |
colour for 'NA' values, defaults to "black" |
palette_colour |
colour of heatmap palette, defaults to "RdGy". Should specify what type of palette colours are accepted by this argument. |
... |
Not currently used. |
a ggplot2 plot object containing a yearly vs. weekly heatmap of time series data
Add a rectangular highlighted region to an existing ggplot2 plot object
iidda_plot_highlight( plot_object, data = NULL, series_variable = "deaths", time_variable = "period_end_date", filter_variable = "period_end_date", filter_start = "1700-01-01", filter_end = "1800-01-01", ... )
plot_object |
a 'ggplot2' plot object |
data |
data frame containing time series data. If 'NULL' data is inherited from 'plot_object'. This has only been tested with data output from 'iidda_plot_ma'. |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
filter_variable |
column name of variable to filter on in 'data', default is "period_end_date" |
filter_start |
value of 'filter_variable' for starting range, default is "1700-01-01" |
filter_end |
value of 'filter_variable' for ending range, default is "1800-01-01" |
... |
other arguments to be passed to 'ggforce::geom_mark_rect', for example annotating with text |
a ggplot2 plot object with a rectangular plot highlight
Add a moving average time series line to an existing ggplot plot object. Graphical choices were made to closely reflect plots generated with 'LBoM::plot.LBoM'.
iidda_plot_ma( plot_object, data = NULL, series_variable = "deaths", time_variable = "period_end_date" )
plot_object |
a 'ggplot2' plot object |
data |
data frame containing moving average time series data, typically output from 'iidda_prep_ma()'. If 'NULL' data is inherited from 'plot_object' |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
a ggplot2 plot object containing a moving average time series
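A sketch of the moving-average pipeline with hypothetical weekly data:

```r
library(iidda.analysis)
library(ggplot2)

# Hypothetical weekly mortality data
data <- data.frame(
    period_end_date = seq(as.Date("1900-01-07"), by = "1 week", length.out = 156)
  , deaths = rpois(156, 15)
)
prepped <- iidda_prep_ma(data)                # 52-week moving average by default
p <- iidda_plot_ma(ggplot(), data = prepped)  # layer added onto an empty plot
```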
Add a rohani heatmap to an existing ggplot plot object. Possibly to be extended to include time series in a separate facet.
iidda_plot_rohani_heatmap( plot_object, data = NULL, series_variable = "deaths", start_year_variable = "Year", end_year_variable = "End Year", start_day_variable = "Day of Year", end_day_variable = "End Day of Year", grouping_variable = "cause", colour_trans = log1p_modified_trans(), n_colours = (scales::brewer_pal(palette = "YlOrRd"))(9), NA_colour = "black", palette_colour = "YlOrRd" )
plot_object |
a 'ggplot2' plot object |
data |
data frame containing data prepped for yearly vs. weekly heatmaps, typically output from 'iidda_prep_heatmap()'. If 'NULL' data is inherited from 'plot_object'. |
series_variable |
column name of series variable in 'data', default is "deaths" |
start_year_variable |
column name of time variable containing the year of the starting period, defaults to "Year" |
end_year_variable |
column name of time variable containing the year of the ending period, defaults to "End Year" |
start_day_variable |
column name of time variable containing the day of the starting period, defaults to "Day of Year" |
end_day_variable |
column name of time variable containing the day of the ending period, defaults to "End Day of Year" |
grouping_variable |
column name of grouping variable to appear on the y-axis of the heatmap. |
colour_trans |
function to scale colours, to be supplied to trans argument of scale_fill_gradientn() |
n_colours |
vector of colours to be supplied to scale_fill_gradientn() |
NA_colour |
colour for 'NA' values, defaults to "black" |
palette_colour |
colour of heatmap palette, defaults to "YlOrRd". Should specify what type of palette colours are accepted by this argument. |
a ggplot2 plot object containing a yearly vs. weekly heatmap of time series data
Scale time series data by transformation.
iidda_plot_scales(plot_object, data = NULL, scale_transform = "log1p")
plot_object |
a 'ggplot2' plot object |
data |
data frame containing time series data. If 'NULL' data is inherited from 'plot_object'. |
scale_transform |
transformation to apply to the series data, default is "log1p" |
a ggplot2 plot object with scaled y data
Add a time series line to an existing ggplot plot object.
iidda_plot_series( plot_object, data = NULL, series_variable = "deaths", time_variable = "period_end_date", time_unit = "year" )
plot_object |
a 'ggplot2' plot object |
data |
data frame containing time series data, typically output from 'iidda_prep_series()'. If 'NULL' data is inherited from 'plot_object' |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
time_unit |
time unit to display on the x-axis. |
a ggplot2 plot object containing a time series
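A sketch of the series pipeline with hypothetical weekly data and a grouping column:

```r
library(iidda.analysis)
library(ggplot2)

# Hypothetical weekly mortality data with a grouping column
data <- data.frame(
    period_end_date = seq(as.Date("1900-01-07"), by = "1 week", length.out = 104)
  , deaths = rpois(104, 20)
  , cause = "measles"
)
prepped <- iidda_prep_series(data, time_unit = "year")
p <- iidda_plot_series(ggplot(), data = prepped, time_unit = "year")
```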
Add basic features to a ggplot2 plot object including title, subtitle and classic 'ggplot2::theme_bw' theme.
iidda_plot_settings( plot_object, data, min_time = "min_time", max_time = "max_time", descriptor_name = "descriptor_name", theme = iidda_theme )
plot_object |
a 'ggplot2' plot object |
data |
list containing metadata. If 'NULL' data is inherited from 'plot_object'. |
min_time |
name of field in data containing the minimum time period range, defaults to "min_time". |
max_time |
name of field in data containing the maximum time period range, defaults to "max_time". |
descriptor_name |
either the name of a field in data containing the descriptor or a string to be used as the plot title. If there are more than 3 elements in the descriptor field, then 'descriptor_variable' is used as the plot title. |
theme |
ggplot theme |
a ggplot2 plot object with title, subtitle and adjusted theme.
Plot wavelet to look similar to base R plot of WaveletComp::wt.image
using ggplot2 functionality.
Some visual choices were made to reflect work done by Steven Lee (https://github.com/davidearn/StevenLee)
and Kevin Zhao (https://github.com/davidearn/KevinZhao).
iidda_plot_wavelet( plot_object, data = NULL, wavelet_data, contour_data, y_variable_name = "Period (years)", fill_variable_name = "Power", max_period = 10, colour_levels = 250, start_hue = 0, end_hue = 0.7, sig_lvl = 0.05 )
plot_object |
a 'ggplot2' plot object |
data |
data frame containing wavelet data prepped for use in |
wavelet_data |
list containing raw wavelet transformed data, typically output from |
contour_data |
data set containing contour data prepped for use in |
y_variable_name |
name of y variable in plot, defaults to "Period (years)". |
fill_variable_name |
name of colour fill variable in plot, defaults to "Power". |
max_period |
maximum period to appear on the plot, defaults to 10 years. |
colour_levels |
number of colours to pass to |
start_hue |
starting hue colour to pass to |
end_hue |
ending hue colour to pass to |
sig_lvl |
significance level for white contours |
a ggplot2 object of a wavelet
Prep data for plotting bar graphs. Prep steps were taken from 'LBoM::monthly_bar_graph' and 'LBoM::weekly_bar_graph' and they include handling missing values and aggregating series data by time unit grouping variable.
iidda_prep_bar( data, series_variable = "deaths", time_variable = "period_end_date", time_unit = "week", handle_missing_values = HandleMissingValues(na_remove = FALSE, na_replace = NULL), handle_zero_values = HandleZeroValues(zero_remove = FALSE, zero_replace = NULL) )
data |
data frame containing time series data |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
time_unit |
time unit to sum series data over, must be one of iidda.analysis:::time_units, defaults to "week". |
handle_missing_values |
function to handle missing values, defaults to HandleMissingValues |
handle_zero_values |
function to handle zero values, defaults to HandleZeroValues |
'data' with records prepped for plotting bar graphs, with 'series_variable' and 'time_unit' fields. The resulting 'time_unit' field is named from lubridate_funcs.
Prep data for plotting box plots. Prep steps were taken from 'LBoM::monthly_box_plot' and they include handling missing values and creating additional time unit fields.
iidda_prep_box( data, series_variable = "deaths", time_variable = "period_end_date", time_unit = "week", handle_missing_values = HandleMissingValues(na_remove = FALSE, na_replace = NULL), handle_zero_values = HandleZeroValues(zero_remove = FALSE, zero_replace = NULL) )
data |
data frame containing time series data |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
time_unit |
time unit to create field from 'time_variable'. Must be one of iidda.analysis:::time_units, defaults to "week". |
handle_missing_values |
function to handle missing values, defaults to HandleMissingValues |
handle_zero_values |
function to handle zero values, defaults to HandleZeroValues |
all fields in 'data' with records prepped for plotting box plots. The new 'time_unit' field is named from lubridate_funcs.
Prep data for plotting moving average. Prep steps were taken from 'LBoM::plot.LBoM' and they include handling missing values and zeroes, optionally trimming time series and computing the moving average.
iidda_prep_ma( data, series_variable = "deaths", time_variable = "period_end_date", trim_zeroes = TRUE, trim_series = TrimSeries(zero_lead = FALSE, zero_trail = FALSE), handle_missing_values = HandleMissingValues(na_remove = FALSE, na_replace = NULL), handle_zero_values = HandleZeroValues(zero_remove = FALSE, zero_replace = NULL), compute_moving_average = ComputeMovingAverage(ma_window_length = 52) )
data |
data frame containing time series data |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
trim_zeroes |
boolean value to filter data to exclude leading and trailing zeroes |
trim_series |
function to trim leading and trailing series zeroes, defaults to TrimSeries |
handle_missing_values |
function to handle missing values, defaults to HandleMissingValues |
handle_zero_values |
function to handle zero values, defaults to HandleZeroValues |
compute_moving_average |
function to compute the moving average of 'series_variable' |
all fields in 'data' with records prepped for plotting moving average time series
Prep data for rohani plots. Prep steps include creating additional time unit fields, summarizing the series variable by time unit and grouping variable (the x and y axis variables), and optionally normalizing series data to be in the range (0,1). By default, the grouping variable is ranked in order of the summarized series variable. Needs to be generalized more; might need to handle the case where the desired y-axis is a second time unit, as in the seasonal heatmap plot, and therefore making use of the year_end_fix function.
iidda_prep_rohani( data, series_variable = "deaths", time_variable = "period_end_date", start_time_variable = "period_end_date", time_unit = c("year"), grouping_variable = "cause", ranking_variable = NULL, normalize = FALSE, handle_missing_values = HandleMissingValues(na_remove = FALSE, na_replace = NULL), handle_zero_values = HandleZeroValues(zero_remove = FALSE, zero_replace = NULL), create_nonexistent = FALSE )
data |
data frame containing time series data |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
start_time_variable |
column name of time variable in 'data', default is "period_end_date" |
time_unit |
a vector of new time unit fields to create from 'start_time_variable' and 'end_time_variable'. Defaults to c("year"). The current functionality expects that "year" is included; should be made more general to incorporate any of iidda.analysis:::time_units. |
grouping_variable |
column name of grouping variable to appear on the y-axis of the heatmap. |
ranking_variable |
column name of variable used to rank the grouping variable. |
normalize |
boolean flag to normalize 'series_variable' data to be between 0 and 1. |
handle_missing_values |
function to handle missing values, defaults to HandleMissingValues |
handle_zero_values |
function to handle zero values, defaults to HandleZeroValues |
create_nonexistent |
boolean flag to create |
all fields in 'data' with records prepped for plotting rohani heatmaps. The new 'time_unit' fields are named from lubridate_funcs.
Prep data for seasonal heatmap plots. Prep steps were taken from 'LBoM::seasonal_heat_map' and they include creating additional time unit fields, splitting weeks that cover the year end, and optionally normalizing series data to be in the range (0,1).
iidda_prep_seasonal_heatmap( data, series_variable = "deaths", start_time_variable = "period_start_date", end_time_variable = "period_end_date", time_unit = c("yday", "year"), prepend_string = "End ", normalize = FALSE, ... )
data |
data frame containing time series data |
series_variable |
column name of series variable in 'data', default is "deaths" |
start_time_variable |
column name of time variable in 'data', default is "period_start_date" |
end_time_variable |
column name of time variable in 'data', default is "period_end_date" |
time_unit |
a vector of new time unit fields to create from 'start_time_variable' and 'end_time_variable'. Defaults to c("yday", "year"). The current functionality expects that both "yday" and "year" are included; should be made more general to incorporate any of iidda.analysis:::time_units. |
prepend_string |
string to prepend to newly created time_unit fields to distinguish between time_unit fields corresponding to starting versus ending time periods. Defaults to "End ". For example, a 'time_unit' of "year" will create a field name "Year" from 'start_time_variable' and a field called "End Year" created from 'end_time_variable'. |
normalize |
boolean flag to normalize 'series_variable' data to be between 0 and 1. |
... |
optional arguments to 'year_end_fix()' |
all fields in 'data' with records prepped for plotting seasonal heatmaps. The new 'time_unit' fields are named from lubridate_funcs.
Prep data for basic time series plot. Prep steps were taken from 'LBoM::plot.LBoM' and they include handling missing values and zeroes, and optionally trimming time series.
iidda_prep_series( data, series_variable = "deaths", time_variable = "period_end_date", grouping_variable = "cause", time_unit = "year", summarize_series = TRUE, trim_zeroes = TRUE, trim_series = TrimSeries(zero_lead = FALSE, zero_trail = FALSE), handle_missing_values = HandleMissingValues(na_remove = FALSE, na_replace = NULL), handle_zero_values = HandleZeroValues(zero_remove = FALSE, zero_replace = NULL) )
data |
data frame containing time series data |
series_variable |
column name of series variable in 'data', default is "deaths" |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
grouping_variable |
column name of the grouping variable in 'data' to summarize the series variable over, if 'summarize=TRUE' |
time_unit |
time unit to sum series data over, must be one of iidda.analysis:::time_units, defaults to "year". |
summarize_series |
boolean value to indicate summarizing by 'time_unit' over the series variable |
trim_zeroes |
boolean value to filter data to exclude leading and trailing zeroes |
trim_series |
function to trim leading and trailing series zeroes, defaults to TrimSeries |
handle_missing_values |
function to handle missing values, defaults to HandleMissingValues |
handle_zero_values |
function to handle zero values, defaults to HandleZeroValues |
all fields in 'data' with records prepped for plotting time series
Prep data for wavelet plot. Prep steps were taken from code provided by Steven Lee (https://github.com/davidearn/StevenLee) and Kevin Zhao (https://github.com/davidearn/KevinZhao).
iidda_prep_wavelet( data, trend_data, time_variable = "period_end_date", series_variable = "deaths", trend_variable = "deaths", series_suffix = "_series", trend_suffix = "_trend", wavelet_variable = "detrend_norm", output_emd_trend = "emd_trend", output_norm = "norm", output_sqrt_norm = "sqrt_norm", output_log_norm = "log_norm", output_emd_norm = "emd_norm", output_emd_sqrt = "emd_sqrt", output_emd_log = "emd_log", output_detrend_norm = "detrend_norm", output_detrend_sqrt = "detrend_sqrt", output_detrend_log = "detrend_log", data_harmonizer = SeriesHarmonizer(time_variable, series_variable), trend_data_harmonizer = SeriesHarmonizer(time_variable, trend_variable), data_deheaper = WaveletDeheaper(time_variable, series_variable), trend_deheaper = WaveletDeheaper(time_variable, trend_variable), joiner = WaveletJoiner(time_variable, series_suffix, trend_suffix), interpolator = WaveletInterpolator(time_variable, series_variable, trend_variable, series_suffix, trend_suffix), normalizer = WaveletNormalizer(time_variable, series_variable, trend_variable, series_suffix, trend_suffix, output_emd_trend, output_norm, output_sqrt_norm, output_log_norm, output_emd_norm, output_emd_sqrt, output_emd_log, output_detrend_norm, output_detrend_sqrt, output_detrend_log), transformer = WaveletTransformer(time_variable, wavelet_variable, dt = 1/52, dj = 1/50, lowerPeriod = 1/2, upperPeriod = 10, n.sim = 1000, make.pval = TRUE, date.format = "%Y-%m-%d") )
iidda_prep_wavelet( data, trend_data, time_variable = "period_end_date", series_variable = "deaths", trend_variable = "deaths", series_suffix = "_series", trend_suffix = "_trend", wavelet_variable = "detrend_norm", output_emd_trend = "emd_trend", output_norm = "norm", output_sqrt_norm = "sqrt_norm", output_log_norm = "log_norm", output_emd_norm = "emd_norm", output_emd_sqrt = "emd_sqrt", output_emd_log = "emd_log", output_detrend_norm = "detrend_norm", output_detrend_sqrt = "detrend_sqrt", output_detrend_log = "detrend_log", data_harmonizer = SeriesHarmonizer(time_variable, series_variable), trend_data_harmonizer = SeriesHarmonizer(time_variable, trend_variable), data_deheaper = WaveletDeheaper(time_variable, series_variable), trend_deheaper = WaveletDeheaper(time_variable, trend_variable), joiner = WaveletJoiner(time_variable, series_suffix, trend_suffix), interpolator = WaveletInterpolator(time_variable, series_variable, trend_variable, series_suffix, trend_suffix), normalizer = WaveletNormalizer(time_variable, series_variable, trend_variable, series_suffix, trend_suffix, output_emd_trend, output_norm, output_sqrt_norm, output_log_norm, output_emd_norm, output_emd_sqrt, output_emd_log, output_detrend_norm, output_detrend_sqrt, output_detrend_log), transformer = WaveletTransformer(time_variable, wavelet_variable, dt = 1/52, dj = 1/50, lowerPeriod = 1/2, upperPeriod = 10, n.sim = 1000, make.pval = TRUE, date.format = "%Y-%m-%d") )
data |
data frame containing time series data |
trend_data |
data frame containing time series trend data |
time_variable |
column name of time variable in 'data', default is "period_end_date" |
series_variable |
column name of series variable in 'data', default is "deaths" |
trend_variable |
column name of trend variable in 'trend_data', default is "deaths" |
series_suffix |
suffix to be appended to series data fields |
trend_suffix |
suffix to be appended to trend data fields |
wavelet_variable |
name of the field in 'data' to be wavelet transformed |
output_emd_trend |
name of output field for the empirical mode decomposition applied to 'trend_variable' |
output_norm |
name of output field for the 'series_variable' normalized by 'output_emd_trend' |
output_sqrt_norm |
name of output field for the square root of 'output_norm' |
output_log_norm |
name of output field for the logarithm of ('output_norm' + 'eps') |
output_emd_norm |
name of output field for the empirical mode decomposition applied to 'output_norm' |
output_emd_sqrt |
name of output field for the empirical mode decomposition applied to 'output_sqrt_norm' |
output_emd_log |
name of output field for the empirical mode decomposition applied to 'output_log_norm' |
output_detrend_norm |
name of output field for the computed field 'output_norm'-'output_emd_norm' |
output_detrend_sqrt |
name of output field for the computed field 'output_sqrt_norm'-'output_emd_sqrt' |
output_detrend_log |
name of output field for the computed field 'output_log_norm'-'output_emd_log' |
data_harmonizer |
function that harmonizes time scales and series names so there is one data point per time unit |
trend_data_harmonizer |
function that harmonizes time scales and trend names so there is one data point per time unit |
data_deheaper |
function that fixes heaping errors on series data |
trend_deheaper |
function that fixes heaping errors on trend data |
joiner |
function that joins series and trend data sets |
interpolator |
function that linearly interpolates series and trend data |
normalizer |
function that computes normalized fields |
transformer |
function that computes wavelet transform |
list containing:
* transformed_data
- wavelet transformed data
* tile_data_to_plot
- data set of the wavelet transformed data prepped for plotting with ggplot2::geom_tile
* contour_data_to_plot
- data set of the wavelet transformed data prepped for plotting with ggplot2::geom_contour
Themes for ggplot2
iidda_theme() iidda_theme_time() iidda_theme_heat() iidda_theme_above()
iidda_theme() iidda_theme_time() iidda_theme_heat() iidda_theme_above()
iidda_theme_time()
: Theme for plots where the x-axis represents time.
No x-axis titles will be plotted with this theme, because the meaning of a
time axis is obvious.
iidda_theme_heat()
: Theme for heatmaps where the x-axis represents time.
No x-axis titles will be plotted with this theme, because the
meaning of a time axis is obvious. Grid lines are not plotted with this
theme because interpretation can be compromised when grid lines
are visible through the colours of the heatmap.
iidda_theme_above()
: Theme for plots where the x-axis represents time,
but for which time information is not displayed because there are vertically
aligned plots below with the same time axis.
Joins lookup table in API to data
join_lookup_table(raw_data, lookup_type, api_hook)
join_lookup_table(raw_data, lookup_type, api_hook)
raw_data |
data frame of table to be harmonized |
lookup_type |
string indicating type of lookup table from API to join |
api_hook |
API operations list |
data frame of harmonized data with keys from API
Joins user-defined lookup table to data
join_user_table(raw_data, user_table_path, lookup_type, join_by)
join_user_table(raw_data, user_table_path, lookup_type, join_by)
raw_data |
data frame of table to be harmonized |
user_table_path |
string indicating path to user-defined lookup table |
lookup_type |
string indicating type of lookup table (disease, location, sex). Used to determine columns to join by if 'join_by' is missing. |
join_by |
vector of strings indicating columns to join by (optional if 'lookup_type' is given) |
data frame of harmonized data with user-defined keys
Slight modification of 'log1p_trans()' to include better breaks that are log1p-based (log-based and shifted by 1 so that breaks can be computed in the presence of zeroes).
log1p_modified_trans(n = 10)
log1p_modified_trans(n = 10)
n |
number of desired breaks |
a scales::trans_new
function
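The advantage of a log1p-based transformation over a plain log can be illustrated in base R. This is a minimal sketch of the idea only, not the scales code used by 'log1p_modified_trans':

```r
# log1p maps zero counts to zero, so zero-heavy data can still be transformed
x <- c(0, 9, 99, 999)
transformed <- log1p(x)              # log(x + 1); log1p(0) is 0, not -Inf
breaks <- expm1(pretty(transformed)) # back-transform evenly spaced breaks
```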
Left joins a lookup table to a data frame of data.
lookup_join(raw_data, lookup_table, join_by, verbose = FALSE)
lookup_join(raw_data, lookup_table, join_by, verbose = FALSE)
raw_data |
Data frame of data to be harmonized. |
lookup_table |
Data frame of lookup table. |
join_by |
Vector of strings indicating columns to left_join by
(can use |
verbose |
Print information about the lookup. |
Data frame of newly harmonized and resolved data. Note that all entries in the returned data frame are strings.
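The harmonization idea behind 'lookup_join' can be sketched with a base-R left join (the package uses 'dplyr::left_join'; the toy disease names and the 'disease_harmonized' column here are made up for illustration):

```r
# toy data with inconsistent disease names
raw <- data.frame(disease = c("TB", "tuberculosis"), cases = c("10", "20"))
# lookup table mapping raw names to harmonized names
lookup <- data.frame(disease = c("TB", "tuberculosis"),
                     disease_harmonized = c("tuberculosis", "tuberculosis"))
harmonized <- merge(raw, lookup, by = "disease", all.x = TRUE)  # left join
```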
lubridate functions with desired interpretable labels
lubridate_funcs
lubridate_funcs
An object of class character
of length 10.
Get associated lubridate function to compute time unit.
make_time_trans(unit = unname(time_units))
make_time_trans(unit = unname(time_units))
unit |
time unit, one or more of iidda.analysis:::time_units |
function to compute time unit
Compute a vector giving the mid-points of a vector of temporal periods, defined by start dates and one of either a vector of end dates or a vector of period lengths in days (see num_days). You can either return a date, with mid_dates, or a date-time, with mid_times. In addition to the type of return value (date vs time), the former rounds down to the nearest date whereas the latter is accurate to the nearest hour and so can account for periods whose mid-point falls partway through a day.
mid_dates(start_date, end_date, period_length) mid_times(start_date, end_date, period_length)
mid_dates(start_date, end_date, period_length) mid_times(start_date, end_date, period_length)
start_date |
Vector of period starting dates |
end_date |
Vector of period ending dates. If missing then it is computed from 'start_date' and 'period_length'. |
period_length |
Vector of integers giving the period length in days. If missing then it is calculated using 'num_days'. |
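A base-R sketch of the mid-point idea, assuming periods include both endpoints (that inclusiveness is an assumption; this is illustrative, not the package implementation):

```r
start <- as.Date(c("1920-01-01", "1920-01-01"))
end   <- as.Date(c("1920-01-07", "1920-01-08"))   # 7-day and 8-day periods
num_days <- as.numeric(end - start) + 1           # assumes inclusive endpoints
mid_time <- as.POSIXct(start, tz = "UTC") + (num_days / 2) * 86400
mid_date <- as.Date(mid_time)                     # rounds down to nearest date
```

The 7-day period has a mid-time at noon on its fourth day, which rounds down to that date; the 8-day period's mid-time falls exactly at midnight.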
Create new time unit fields
mutate_time_vars( data, unit = unname(time_units), input_nm = "period_end_date", output_nm = get_unit_labels(unit) )
mutate_time_vars( data, unit = unname(time_units), input_nm = "period_end_date", output_nm = get_unit_labels(unit) )
data |
data set containing an input time field |
unit |
time unit, one of iidda.analysis:::time_units |
input_nm |
field name in 'data' containing input time field |
output_nm |
field name of newly created time unit field, by default uses get_unit_labels(). |
all fields in 'data' with additional time unit field
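A hypothetical sketch of deriving a time-unit field in base R (the package uses lubridate accessors; the column name below follows the 'period_end_date' default):

```r
data <- data.frame(period_end_date = as.Date(c("1920-01-15", "1921-06-02")))
data$year <- as.integer(format(data$period_end_date, "%Y"))  # like lubridate::year
```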
Defines column names to join by for a type of lookup table
names_to_join_by(lookup_type)
names_to_join_by(lookup_type)
lookup_type |
string indicating type of lookup table (disease, location, sex, age group) |
vector of column names to join by for the type of lookup table
Take a tidy data set with a potentially complex disease hierarchy and flatten this hierarchy so that, at any particular time and location (or some other context), all diseases in the 'disease' column have the same 'nesting_disease'.
normalize_disease_hierarchy( data, disease_lookup, grouping_columns = c("period_start_date", "period_end_date", "location"), basal_diseases_to_prune = character(), find_unaccounted_cases = TRUE, specials_pattern = "_unaccounted$" )
normalize_disease_hierarchy( data, disease_lookup, grouping_columns = c("period_start_date", "period_end_date", "location"), basal_diseases_to_prune = character(), find_unaccounted_cases = TRUE, specials_pattern = "_unaccounted$" )
data |
A tidy data set with the following minimal set of columns: 'disease', 'nesting_disease', 'basal_disease', 'period_start_date', 'period_end_date', and 'location'. Note that the latter three can be modified with 'grouping_columns'. |
disease_lookup |
A lookup table with 'disease' and 'nesting_disease' columns that describe a global disease hierarchy that will be applied locally to flatten disease hierarchy at each point in time and space in the tidy data set in the 'data' argument. |
grouping_columns |
Character vector of column names to use when grouping to determine the context. |
basal_diseases_to_prune |
Character vector of 'disease's to remove from 'data'. |
find_unaccounted_cases |
Make new records for instances when the sum of leaf diseases is less than the reported total for their basal disease. |
specials_pattern |
Optional regular expression to use to match 'disease' names in 'data' that should be added to the lookup table. This is useful for disease names that are not historical and produced for harmonization purposes. The most common example is '"_unaccounted$"', which is the default. Setting this argument to 'NULL' avoids adding any special disease names to the lookup table. |
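The 'find_unaccounted_cases' bookkeeping can be illustrated with a toy calculation (the disease names and counts here are hypothetical):

```r
total_hepatitis <- 20                               # reported total for the basal disease
sub_counts <- c(hepatitis_a = 8, hepatitis_b = 7)   # reported leaf diseases
hepatitis_unaccounted <- total_hepatitis - sum(sub_counts)  # shortfall becomes a new record
```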
Filter out overlapping sources for the same 'disease/nesting_disease/basal_disease', 'period_start_date', 'period_end_date', and 'iso_3166_2', with the choice to keep either national-level data (i.e. from Statistics Canada / Dominion Bureau of Statistics / Health Canada) or provincial-level data (from a provincial Ministry of Health).
normalize_duplicate_sources(data, preferred_jurisdiction = "national")
normalize_duplicate_sources(data, preferred_jurisdiction = "national")
data |
A tidy data set with columns 'dataset_id', 'period_start_date', 'period_end_date', 'disease', 'nesting_disease', 'basal_disease', and 'time_scale'. |
preferred_jurisdiction |
'national' or 'provincial', indicating which jurisdiction level will be kept if these sources overlap. |
A data set with no overlapping sources.
Set geographic order of provinces and territories and remove country-level data.
normalize_location(data)
normalize_location(data)
data |
Tidy dataset with an iso_3166_2 column. |
Tidy dataset without country-level data and with provinces and territories geographically ordered.
Normalize Population
normalize_population(data, harmonized_population)
normalize_population(data, harmonized_population)
data |
Tidy dataset with columns period_start_date, period_end_date, and iso_3166_2. |
harmonized_population |
Harmonized population data with columns date, iso_3166_2, and population (other columns will be dropped). |
Tidy dataset joined with harmonized population.
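The population join enables rate calculations downstream; a toy sketch with illustrative numbers (the 'rate_per_100k' column is made up for this example):

```r
data <- data.frame(iso_3166_2 = c("CA-ON", "CA-QC"), cases = c(100, 50))
pop  <- data.frame(iso_3166_2 = c("CA-ON", "CA-QC"), population = c(1e6, 5e5))
joined <- merge(data, pop, by = "iso_3166_2", all.x = TRUE)  # left join on location
joined$rate_per_100k <- 1e5 * joined$cases / joined$population
```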
Choose a single best 'time_scale' for each year in a dataset, grouped by nesting disease. This best 'time_scale' is defined as the longest of the shortest time scales in each location and sub-disease.
normalize_time_scales( data, initial_group = c("year", "iso_3166", "iso_3166_2", "disease", "nesting_disease", "basal_disease"), final_group = c("basal_disease"), get_implied_zeros = TRUE, aggregate_if_unavailable = TRUE )
normalize_time_scales( data, initial_group = c("year", "iso_3166", "iso_3166_2", "disease", "nesting_disease", "basal_disease"), final_group = c("basal_disease"), get_implied_zeros = TRUE, aggregate_if_unavailable = TRUE )
data |
A tidy data set with columns 'time_scale', 'period_start_date' and 'period_end_date'. |
initial_group |
Character vector naming columns for defining the initial grouping used to compute the shortest time scales. |
final_group |
Character vector naming columns for defining the final grouping used to compute the longest of the shortest time scales. |
get_implied_zeros |
Add zeros that are implied by a '0' reported at a coarser timescale. |
aggregate_if_unavailable |
If a location is not reporting for the determined 'best timescale', but is reporting at a finer timescale, aggregate this finer timescale to the 'best timescale'. |
A data set only containing records with the optimal time scale.
Compute a vector giving the number of days in a set of periods, given equal-length vectors of the start date and end date of these periods.
num_days(start_date, end_date) num_days_util(start_date, end_date)
num_days(start_date, end_date) num_days_util(start_date, end_date)
start_date |
Vector of period starting dates |
end_date |
Vector of period ending dates |
num_days_util()
: Low-level interface for 'num_days'.
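Assuming periods include both the start and end dates (an assumption, since the documentation does not state it), a one-week period spans seven days:

```r
start <- as.Date("1920-01-01")
end   <- as.Date("1920-01-07")
as.numeric(end - start) + 1   # 7 days if both endpoints are included
```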
Obtain period midpoints and average daily rates for count data
period_averager( data, count_col = "cases_this_period", start_col = "period_start_date", end_col = "period_end_date", norm_col = NULL, norm_const = 1e+05, keep_raw = TRUE, keep_cols = names(data) )
period_averager( data, count_col = "cases_this_period", start_col = "period_start_date", end_col = "period_end_date", norm_col = NULL, norm_const = 1e+05, keep_raw = TRUE, keep_cols = names(data) )
data |
Data frame with rows at minimum containing period start and end dates and a count variable. |
count_col |
Character, name of count data column. |
start_col |
Character, name of start date column. |
end_col |
Character, name of end date column. |
norm_col |
Character, name of column giving data for normalization.
A good option is often |
norm_const |
Numeric value for multiplying the |
keep_raw |
Logical value indicating whether to force all |
keep_cols |
Character vector containing the names of columns in the
input |
Data frame containing the following fields.

Columns from the original dataset specified using keep_raw and keep_cols.

year
: Year of the period_start_date.

num_days
: Length of the period in days, from the beginning of the period_start_date to the end of the period_end_date.

period_mid_time
: Timestamp of the middle of the period.

period_mid_date
: Date containing the period_mid_time.

daily_rate
: Daily count rate, which by default is given by daily_rate = count_col / num_days. If the name of norm_col is specified then daily_rate = norm_const * count_col / num_days / norm_col. When interpreting these formulas, please keep in mind that norm_const is a numeric constant, num_days is a derived numeric column, and count_col and norm_col are columns supplied within the input data object.
set.seed(666) data <- data.frame(disease = "senioritis" , period_start_date = seq(as.Date("2023-04-03"), as.Date("2023-06-05"), by = 7) , period_end_date = seq(as.Date("2023-04-09"), as.Date("2023-06-11"), by = 7) , cases_this_period = sample(0:100, 10, replace = TRUE) , location = "college" ) period_averager(data, keep_raw = TRUE, keep_cols = c("disease", "location"))
set.seed(666) data <- data.frame(disease = "senioritis" , period_start_date = seq(as.Date("2023-04-03"), as.Date("2023-06-05"), by = 7) , period_end_date = seq(as.Date("2023-04-09"), as.Date("2023-06-11"), by = 7) , cases_this_period = sample(0:100, 10, replace = TRUE) , location = "college" ) period_averager(data, keep_raw = TRUE, keep_cols = c("disease", "location"))
Create function that aggregates information over time periods, normalizes a count variable, and creates new fields to summarize this information.
PeriodAggregator( time_variable = "period_mid_time", period_width_variable = "num_days", count_variable = "cases_this_period", norm_variable = "population_reporting", rate_variable = "daily_rate", norm_exponent = 5 )
PeriodAggregator( time_variable = "period_mid_time", period_width_variable = "num_days", count_variable = "cases_this_period", norm_variable = "population_reporting", rate_variable = "daily_rate", norm_exponent = 5 )
time_variable |
Name of the variable to characterize the temporal location of the time period. |
period_width_variable |
Name of variable to characterize the width of the time period. |
count_variable |
Name of variable to characterize the count variable being normalized. |
norm_variable |
Name of variable to be used to normalize the count variable. |
rate_variable |
Name of variable to be used to store the normalized count variable. |
norm_exponent |
Exponent to use in normalization. The default is '5', which means 'per 100,000'. |
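The normalization implied by the defaults can be written out as a toy calculation ('norm_exponent = 5' means per 100,000; the numbers are illustrative):

```r
norm_exponent <- 5
cases_this_period <- 14; num_days <- 7; population_reporting <- 2e5
# normalized daily rate per 10^norm_exponent people
daily_rate <- 10^norm_exponent * cases_this_period / num_days / population_reporting
```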
Quantile transformation, adapted from https://stackoverflow.com/questions/38874741/transform-color-scale-to-probability-transformed-color-distribution-with-scale-f
quantile_trans(x)
quantile_trans(x)
x |
vector to be transformed |
a scales::trans_new
function
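The underlying idea is the empirical CDF, which maps data values to probabilities in [0, 1]; a base-R sketch of that idea (not the transformation object itself):

```r
x <- c(1, 2, 2, 3, 10)
probs <- ecdf(x)(x)   # proportion of observations at or below each value
```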
Read IIDDA Dataset into a Dataframe
read_iidda_dataset(dataset_id)
read_iidda_dataset(dataset_id)
dataset_id |
ID for a dataset in the IIDDA |
Resolves any duplicate columns that result after left_join due to shared columns between data frames. Rule: keep the old values if all newly joined values are NA; otherwise keep the new values (even if some entries are empty).
resolve_join(df)
resolve_join(df)
df |
data frame with duplicate columns ending in |
data frame with one remaining column for duplicates
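The rule can be sketched for a single duplicated column pair (illustrative only; 'resolve_one' is a hypothetical helper, not the package code):

```r
df <- data.frame(id = 1:3,
                 cases.x = c("1", "2", "3"),  # values before the join
                 cases.y = c(NA, NA, NA))     # newly joined values, all NA
# keep old values only when every newly joined value is NA
resolve_one <- function(old, new) if (all(is.na(new))) old else new
df$cases <- resolve_one(df$cases.x, df$cases.y)
df$cases.x <- NULL
df$cases.y <- NULL
```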
Harmonizes the series variable in 'data' so there is one data value for each time unit in the time variable (to account for variations in disease/cause names)
SeriesHarmonizer(time_variable = "period_end_date", series_variable = "deaths")
SeriesHarmonizer(time_variable = "period_end_date", series_variable = "deaths")
time_variable |
column name of time variable in 'data', default is "period_end_date" |
series_variable |
column name of series variable in 'data', default is "deaths" |
function to harmonize disease/cause names
## Returned Function

Arguments:

* 'data': data frame containing time series data

Returns: all fields in 'data' with the series variable summarized for each unique value of the time variable
Length of time in days represented by an object
time_extent(x, time_id)
time_extent(x, time_id)
x |
an object |
time_id |
identifier for finding time axis information in the object |
Default Time Scale Picker
time_scale_picker(data)
time_scale_picker(data)
data |
Data to transform. |
Vector of all possible time units; most or all are derived from lubridate functions
time_units
time_units
An object of class character
of length 29.
Time Scale Picker
TimeScalePicker( time_scale_variable = "time_scale", time_group_variable = "year" )
TimeScalePicker( time_scale_variable = "time_scale", time_group_variable = "year" )
time_scale_variable |
Variable identifying the time scale of records. The values of such a variable should be things like '"wk"', '"mo"', '"yr"'. |
time_group_variable |
Variable identifying a grouping variable for the time scales (e.g. a column identifying the year.). |
Convert a character vector (i.e. a character column) into a title for a plot.
titleize(title_info, max_items = 3L, max_chars = 15L)
titleize(title_info, max_items = 3L, max_chars = 15L)
title_info |
Character vector to be summarized into a title |
max_items |
TODO |
max_chars |
TODO |
Remove leading or trailing zeroes in a time series data set.
TrimSeries(zero_lead = FALSE, zero_trail = FALSE)
TrimSeries(zero_lead = FALSE, zero_trail = FALSE)
zero_lead |
boolean value, if 'TRUE' remove leading zeroes in 'data' |
zero_trail |
boolean value, if 'TRUE' remove trailing zeroes in 'data' |
a function to remove leading and/or trailing zeroes
## Returned Function

Arguments:

* 'data': data frame containing time series data
* 'series_variable': column name of series variable in 'data', default is "deaths"
* 'time_variable': column name of time variable in 'data', default is "period_end_date"

Returns: all fields in 'data' with records filtered to trim leading and/or trailing zeroes
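The trimming rule amounts to keeping the span between the first and last non-zero values; a base-R sketch of the idea:

```r
deaths <- c(0, 0, 3, 0, 5, 0)
nz <- which(deaths != 0)                 # positions of non-zero values
trimmed <- deaths[seq(min(nz), max(nz))] # drops leading and trailing zeroes only
```

Note that the interior zero is kept; only the leading and trailing runs are removed.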
Combine two time series data sets with the option to handle overlapping time periods. This is particularly useful for data sets that come from two sources (e.g., LBoM and RG). Assumes both data sets have the same number of columns with the same names.
union_series(x, y, overlap = TRUE, time_variable = "period_end_date")
union_series(x, y, overlap = TRUE, time_variable = "period_end_date")
x |
first data frame containing time series data |
y |
second data frame containing time series data |
overlap |
boolean to indicate if 'x' should get priority with overlapping time periods in 'y'. If 'TRUE' the returned data frame will contain all data from 'x', and the filtered 'y' data that does not overlap with 'x'. If FALSE, a union between 'x' and 'y' is returned. |
time_variable |
column name of time variable in 'x' and 'y', default is "period_end_date" |
combined 'x' and 'y' data frames with optional filtering for overlaps
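The 'overlap = TRUE' behaviour, where 'x' gets priority over 'y', can be sketched in base R (illustrative data; not the package implementation):

```r
x <- data.frame(period_end_date = as.Date(c("1900-01-07", "1900-01-14")),
                deaths = c(5, 7))
y <- data.frame(period_end_date = as.Date(c("1900-01-14", "1900-01-21")),
                deaths = c(9, 4))
# keep all of x, plus only the y rows whose time periods are not already in x
combined <- rbind(x, y[!y$period_end_date %in% x$period_end_date, , drop = FALSE])
```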
Get unique tokens from iidda metadata
unique_entries(entries, metadata_search)
unique_entries(entries, metadata_search)
entries |
List returned by |
metadata_search |
Character, field from which unique tokens are desired |
Character vector of unique tokens for a given field from all iidda datasets
Validate if variable is a date data type in the data set.
valid_time_vars(var_nm, data)
valid_time_vars(var_nm, data)
var_nm |
string of variable name |
data |
data frame |
boolean of validation status
Fixes heaping errors in time series. The structure of this function was taken from the function 'find_heap_and_deheap' created by Kevin Zhao (https://github.com/davidearn/KevinZhao/blob/main/Report/make_SF_RData.R). This needs to be better documented.
WaveletDeheaper( time_variable = "period_end_date", series_variable = "deaths", first_date = "1830-01-01", last_date = "1841-12-31", week_start = 45, week_end = 5 )
WaveletDeheaper( time_variable = "period_end_date", series_variable = "deaths", first_date = "1830-01-01", last_date = "1841-12-31", week_start = 45, week_end = 5 )
time_variable |
column name of time variable in 'data', default is "period_end_date" |
series_variable |
column name of series variable in 'data', default is "deaths" |
first_date |
string containing earliest date to look for heaping errors |
last_date |
string containing last date to look for heaping errors |
week_start |
numeric value of the first week number to start looking for heaping errors |
week_end |
numeric value of the last week number to look for heaping errors |
function to fix heaping errors
## Returned Function

Arguments:

* 'data': data frame containing time series data

Returns: all fields in 'data' with an additional field called "deheaped_" concatenated with 'series_variable'. If no heaping errors are found, this additional field is identical to the field 'series_variable'.
Linearly interpolates 'NA' values in both series and trend variables.
WaveletInterpolator( time_variable = "period_end_date", series_variable = "deaths", trend_variable = "deaths", series_suffix = "_series", trend_suffix = "_trend" )
WaveletInterpolator( time_variable = "period_end_date", series_variable = "deaths", trend_variable = "deaths", series_suffix = "_series", trend_suffix = "_trend" )
time_variable |
column name of time variable in 'data', default is "period_end_date" |
series_variable |
column name of series variable in 'data', default is "deaths" |
trend_variable |
column name of trend variable in 'data', default is "deaths" |
series_suffix |
suffix to be appended to series data fields |
trend_suffix |
suffix to be appended to trend data fields |
function that linearly interpolates series and trend data.
## Returned Function

Arguments:

* 'data': data frame containing time series data

Returns: 'data' with linearly interpolated 'series_variable' and 'trend_variable'
Joins series data and trend datasets and keeps all time units in one of the datasets.
WaveletJoiner( time_variable = "period_end_date", series_suffix = "_series", trend_suffix = "_trend", keep_series_dates = TRUE )
WaveletJoiner( time_variable = "period_end_date", series_suffix = "_series", trend_suffix = "_trend", keep_series_dates = TRUE )
time_variable |
column name of time variable in 'data', default is "period_end_date" |
series_suffix |
suffix to be appended to series data fields |
trend_suffix |
suffix to be appended to trend data fields |
keep_series_dates |
boolean flag to indicate if the dates in 'series_data' should be kept and data from 'trend_data' is left joined, if 'FALSE' dates from 'trend_data' are left joined instead |
function to join data and trend data sets
## Returned Function

Arguments:

* 'series_data': data frame containing time series data
* 'trend_data': data frame containing trend data

Returns: data sets joined by 'time_variable' with updated field names
Creates normalizing fields in 'data'
WaveletNormalizer( time_variable = "period_end_date", series_variable = "deaths", trend_variable = "deaths", series_suffix = "_series", trend_suffix = "_trend", output_emd_trend = "emd_trend", output_norm = "norm", output_sqrt_norm = "sqrt_norm", output_log_norm = "log_norm", output_emd_norm = "emd_norm", output_emd_sqrt = "emd_sqrt", output_emd_log = "emd_log", output_detrend_norm = "detrend_norm", output_detrend_sqrt = "detrend_sqrt", output_detrend_log = "detrend_log", eps = 0.01 )
WaveletNormalizer( time_variable = "period_end_date", series_variable = "deaths", trend_variable = "deaths", series_suffix = "_series", trend_suffix = "_trend", output_emd_trend = "emd_trend", output_norm = "norm", output_sqrt_norm = "sqrt_norm", output_log_norm = "log_norm", output_emd_norm = "emd_norm", output_emd_sqrt = "emd_sqrt", output_emd_log = "emd_log", output_detrend_norm = "detrend_norm", output_detrend_sqrt = "detrend_sqrt", output_detrend_log = "detrend_log", eps = 0.01 )
time_variable |
column name of time variable in 'data', default is "period_end_date" |
series_variable |
column name of series variable in 'data', default is "deaths" |
trend_variable |
column name of trend variable in 'data', default is "deaths" |
series_suffix |
suffix to be appended to series data fields |
trend_suffix |
suffix to be appended to trend data fields |
output_emd_trend |
name of output field for the empirical mode decomposition applied to 'trend_variable' |
output_norm |
name of output field for the 'series_variable' normalized by 'output_emd_trend' |
output_sqrt_norm |
name of output field for the square root of 'output_norm' |
output_log_norm |
name of output field for the logarithm of ('output_norm' + 'eps') |
output_emd_norm |
name of output field for the empirical mode decomposition applied to 'output_norm' |
output_emd_sqrt |
name of output field for the empirical mode decomposition applied to 'output_sqrt_norm' |
output_emd_log |
name of output field for the empirical mode decomposition applied to 'output_log_norm' |
output_detrend_norm |
name of output field for the computed field 'output_norm'-'output_emd_norm' |
output_detrend_sqrt |
name of output field for the computed field 'output_sqrt_norm'-'output_emd_sqrt' |
output_detrend_log |
name of output field for the computed field 'output_log_norm'-'output_emd_log' |
eps |
numeric value for normalized data to be perturbed by before computing the logarithm |
function that creates normalized trend and de-trended fields
## Returned Function

Arguments:

* 'data': data frame containing time series data

Returns: 'data' with additional normalized fields
Compute the wavelet transform.
WaveletTransformer( time_variable = "period_end_date", wavelet_variable = "detrend_norm", ... )
WaveletTransformer( time_variable = "period_end_date", wavelet_variable = "detrend_norm", ... )
time_variable |
name of the time variable field in 'data' |
wavelet_variable |
name of the field in 'data' to be wavelet transformed |
... |
Arguments passed on to |
function that computes wavelet transform
## Returned Function

Arguments:

* 'data': data frame containing time series data

Returns: the wavelet transform object from WaveletComp::analyze.wavelet applied to 'wavelet_variable' in 'data'
Weeks covering the year end are split into two records. The first week is adjusted to end on day 365 (or 366 in leap years), and the second week starts on the first day of the year. This was adapted from 'LBoM::edge_fix', which keeps the same series variable value for both of the newly created weeks. This doesn't seem to make much difference when viewing the seasonal heatmap; however, it might make more sense to divide the series variable value in half and allocate half to each of the two new weeks.
year_end_fix( data, series_variable = "deaths", start_year_variable = "Year", end_year_variable = "End Year", start_day_variable = "Day of Year", end_day_variable = "End Day of Year", temp_year_variable = "yr" ) year_end_fix( data, series_variable = "deaths", start_year_variable = "Year", end_year_variable = "End Year", start_day_variable = "Day of Year", end_day_variable = "End Day of Year", temp_year_variable = "yr" )
year_end_fix( data, series_variable = "deaths", start_year_variable = "Year", end_year_variable = "End Year", start_day_variable = "Day of Year", end_day_variable = "End Day of Year", temp_year_variable = "yr" ) year_end_fix( data, series_variable = "deaths", start_year_variable = "Year", end_year_variable = "End Year", start_day_variable = "Day of Year", end_day_variable = "End Day of Year", temp_year_variable = "yr" )
data |
data frame containing time series data |
series_variable |
column name of series variable in 'data', default is "deaths" |
start_year_variable |
column name of time variable containing the year of the starting period, defaults to "Year" |
end_year_variable |
column name of time variable containing the year of the ending period, defaults to "End Year" |
start_day_variable |
column name of time variable containing the day of the starting period, defaults to "Day of Year" |
end_day_variable |
column name of time variable containing the day of the ending period, defaults to "End Day of Year" |
temp_year_variable |
temporary variable name when pivoting the data frame |
all fields in 'data' with only records corresponding to year end weeks that have been split
all fields in 'data' with only records corresponding to year end weeks that have been split