---
title: "Fitting to Real Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fitting to Real Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: macpan2.bib
link-citations: TRUE
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=7,
  fig.height=7
)
```

[![status](https://img.shields.io/badge/status-mature%20draft-yellow)](https://canmod.github.io/macpan2/articles/vignette-status#mature-draft)

The `vignette("quickstart")` and `vignette("calibration")` articles work with simulated data, but it is more interesting to fit models to real data. The [International Infectious Disease Data Archive (IIDDA)](https://canmod.github.io/iidda-tools/iidda.api) is a useful place to look for incidence, mortality, and population data for illustrating `macpan2`. This archive contains data from [this data digitization project](http://canmod.net/digitization), which we will use here. What we find is that with real data, even with relatively simple models and calibration problems, issues can arise that require careful thought.

## Setup

As usual we need the following packages.  If you don't have them, please get them in the usual way (i.e. using `install.packages`).

```{r other_packages, message=FALSE}
library(macpan2)
library(dplyr)
library(ggplot2)
library(broom.mixed)
```

In addition, we need the `iidda.api` package.

```{r installation, message=FALSE}
if (!require(iidda.api)) {
  repos = c(
      "https://canmod.r-universe.dev"
    , "https://cran.r-project.org"
  )
  install.packages("iidda.api", repos = repos)
}
api_hook = iidda.api::ops_staging
```

And we set a few options for convenience.

```{r api_hook}
options(iidda_api_msgs = FALSE, macpan2_verbose = FALSE)
```


## One Scarlet Fever Outbreak in Ontario

The following code will pull data for a single scarlet fever outbreak in Ontario that ended in 1930.

```{r scarlet_fever_ontario}
scarlet_fever_ontario = api_hook$filter(
    resource_type = "Compilation"
  , dataset_ids = "canmod-cdi-harmonized"
  , iso_3166_2 = "CA-ON"  ## get ontario data
  , time_scale = "wk"  ## weekly incidence data only 
  , disease = "scarlet-fever"
  
  # get data between 1929-08-01 and 1930-10-01
  , period_end_date = "1929-08-01..1930-10-01"
)
print(scarlet_fever_ontario)
```

This is what the data looks like.

```{r scarlet_fever_ontario_plot}
(scarlet_fever_ontario
  |> ggplot(aes(period_end_date, cases_this_period))
  + geom_line() + geom_point()
  + ggtitle("Scarlet Fever Incidence in Ontario, Canada")
  + theme_bw()
)
```


## SIR Model

We will begin by pulling the simple sir model from the model library.

```{r}
sir = mp_tmb_library("starter_models", "sir", package = "macpan2")
print(sir)
```


## Modify Model for Reality

Once we take any library model off the shelf we need to modify it a bit to be more realistic. We do this with the `mp_tmb_insert()` function, which inserts new default values and model expressions.

We first change the population size to the population of Ontario at the time.

```{r}
ontario_population = api_hook$filter(
    resource_type = "Compilation"
  , dataset_ids = "canmod-pop-normalized"
  , iso_3166_2 = "CA-ON"  ## get ontario data
  
  # get data between 1929-08-01 and 1930-10-01
  , date = "1929-08-01..1930-10-01"
)
sf_sir = mp_tmb_insert(sir
  , default = list(N = median(ontario_population$population))
)
```

Second, our SIR model assumes that all of the cases are recorded in the data, but this is not realistic.  We therefore modify the model to include under-reporting, by creating a case `reports` variable that is given by the product of incidence (i.e. weekly `infection` rate) and a reporting probability.

```{r}
sf_sir = mp_tmb_insert(sf_sir
    ## insert this expression ...
  , expressions = list(reports ~ infection * report_prob)
    
    ## at the end (i.e. Infinity) of the expressions evaluated
    ## 'during' each iteration of the simulation loop ...
  , at = Inf  
  , phase = "during"
  
    ## add a new default value for the reporting probability
  , default = list(report_prob = 1/300)
)
```

Here we can print out the modified model to see if our changes were made successfully.

```{r}
print(sf_sir)
```


## Preparing the Data

The first step in preparing data for `macpan2` is to simulate from the model that you are considering. Here we simulate the `"reports"` variable because it corresponds to the reported incidence (the number of new cases per time-step, which in our case is one week, multiplied by a reporting probability).
```{r}
sir_simulator = mp_simulator(sf_sir
  , time_steps = 5
  , outputs = "reports"
)
head(mp_trajectory(sir_simulator))
```

The next step is to get our data into a format that is compatible with the format of these simulations. In particular, we need `matrix`, `time`, and `value` columns. We can omit the `row` and `col` columns, because all of the 'matrices' in the model are 1-by-1 scalars.

```{r}
observed_data = (scarlet_fever_ontario
  ## select the variables to be modelled -- a time-series of case reports.
  |> select(period_end_date, cases_this_period)
  
  ## change the column headings so that they match the columns
  ## in the simulated trajectories.
  |> mutate(matrix = "reports")
  |> rename(value = cases_this_period)
  
  ## create a `time` column with the time-step IDs that will correspond
  ## to the time-steps in the simulation. this column heading also 
  ## must match the column with the time-steps in the simulated trajectories
  |> mutate(time = seq_along(period_end_date))
)
print(head(observed_data))
```


## Set up the Optimizer

Now we can create an object that can be calibrated.

```{r}
sir_cal = mp_tmb_calibrator(
    spec = sf_sir
  , data = observed_data
  
  ## name the trajectory variable, with a name that
  ## is the same in both the spec and the data
  , traj = "reports"  
  
  ## fit the following parameters
  , par = c("beta", "gamma", "I", "report_prob")
)
```

Here we assert that we will fit `beta`, `gamma`, (the initial value of) `I`, and `report_prob`. Note that we can only choose to fit parameters in the `default` list of the model spec. In particular, `I` is in the default list because the model requires the initial number of infectious individuals, whereas `S` is not because it is derived before the simulation loop as `S ~ N - I - R`.

```{r}
print(sf_sir)
```

## Run the Optimization

```{r}
sir_opt = mp_optimize(sir_cal)
```

Not off to a good start.  Those warnings are not necessarily bad, but they might get us thinking.

## Examine the fit

```{r}
print(sir_opt)
```

OK, things are not great.  The `convergence` code is `1`, indicating that the model did _not_ converge (`convergence == 0` is good). Examining the parameter estimates, which are stored internally in the calibrator object, things get worse.

```{r}
mp_tmb_coef(sir_cal, conf.int = TRUE)
```

That doesn't look right!  Those are very high `beta` and `gamma` values, and the confidence intervals are enormous and overlap zero.

But the model fit doesn't look all that bad.

```{r}
fitted_data = mp_trajectory_sd(sir_cal, conf.int = TRUE)
(observed_data
  |> ggplot()
  + geom_point(aes(time, value))
  + geom_line(aes(time, value)
    , data = fitted_data
  )
  + geom_ribbon(aes(time, ymin = conf.low, ymax = conf.high)
    , data = fitted_data
    , alpha = 0.2
    , colour = "red"
  )
  + theme_bw()
  + facet_wrap(~matrix, ncol = 1, scales = "free")
)
```

What is going on? Can we modify this model and/or fitting procedure to get both reasonable parameter estimates and a reasonable trajectory?

## Fixing the Optimization Problem

One way to address the lack of convergence of the optimizer is to assess the degree to which the parameter estimates are biologically reasonable. Take the recovery rate, which is estimated as `gamma ~ 26.5` per time-step. Given that a time-step is one week in this model, this estimate implies that individuals recover from scarlet fever in `1/26.5` weeks -- much less than a day. A quick search suggests to me that this recovery rate is not reasonable for scarlet fever, and that a rough guess that the infectious stage lasts one week is much more plausible than `1/26.5` weeks. Now look at the transmission rate, `beta`.  It is also estimated to be pretty large, but what is more interesting is the large standard errors for both `beta` and `gamma`. These standard errors suggest that the estimates are not precise. However, the estimated correlation between `beta` and `gamma` is very close to `1`, suggesting that many values for these parameters would fit the data well as long as they are both of similar magnitude.

```{r}
mp_tmb_fixef_cov(sir_cal) |> cov2cor()
```

This diagnosis, if correct, suggests that the data are not sufficiently informative to identify values for both `beta` and `gamma`. Resolving such identifiability issues is often best done by introducing prior information, such as our rough guess that `gamma` is close to `1`. The simplest way to include such prior information is to move `gamma` out of the `pars` argument to `mp_tmb_calibrator` and into the `default` argument, as below.

```{r}
sir_cal_assume_gamma = mp_tmb_calibrator(
    spec = sf_sir
  , data = observed_data
  
  ## name the trajectory variable, with a name that
  ## is the same in both the spec and the data
  , traj = "reports"  
  
  ## fit the following parameters
  , par = c("beta", "I", "report_prob")
  , default = list(gamma = 1)
)
```

This calibration specification does not try to jointly fit both `beta` and `gamma`, but rather fits only `beta` while assuming that `gamma = 1`.

```{r}
sir_opt_assume_gamma = mp_optimize(sir_cal_assume_gamma)
print(sir_opt_assume_gamma)
```

More warnings, but now the optimizer converges. And the standard errors in the coefficient table and fixed effect correlations seem more plausible.

```{r}
mp_tmb_coef(sir_cal_assume_gamma)
mp_tmb_fixef_cov(sir_cal_assume_gamma) |> cov2cor()
```

And the fit looks similar to the four-parameter model, which is consistent with our diagnosis of non-identifiability. 

```{r}
fitted_data = mp_trajectory_sd(sir_cal_assume_gamma, conf.int = TRUE)
(observed_data
  |> ggplot()
  + geom_point(aes(time, value))
  + geom_line(aes(time, value)
    , data = fitted_data
  )
  + geom_ribbon(aes(time, ymin = conf.low, ymax = conf.high)
    , data = fitted_data
    , alpha = 0.2
    , colour = "red"
  )
  + theme_bw()
  + facet_wrap(~matrix, ncol = 1, scales = "free")
)
```

Caution: Ben, what is that paper that you always recommend about being careful when prior information is included without prior uncertainty?
This one [@elderdUncertainty2006]

## Learning the Functional form of Time Variation in Transmission (New!)

Let's get a bit more data to see two seasons.
```{r}
scarlet_fever_ontario = api_hook$filter(
    resource_type = "Compilation"
  , dataset_ids = "canmod-cdi-harmonized"
  , iso_3166_2 = "CA-ON"  ## get ontario data
  , time_scale = "wk"  ## weekly incidence data only 
  , disease = "scarlet-fever"
  
  # get data between 1929-08-01 and 1931-10-01
  , period_end_date = "1929-08-01..1931-10-01"  
)
(scarlet_fever_ontario
  |> ggplot(aes(period_end_date, cases_this_period))
  + geom_line() + geom_point()
  + ggtitle("Scarlet Fever Incidence in Ontario, Canada")
  + theme_bw()
)
observed_data = (scarlet_fever_ontario
  ## select the variables to be modelled -- a time-series of case reports.
  |> select(period_end_date, cases_this_period)
  
  ## change the column headings so that they match the columns
  ## in the simulated trajectories.
  |> mutate(matrix = "reports")
  |> rename(value = cases_this_period)
  
  ## create a `time` column with the time-step IDs that will correspond
  ## to the time-steps in the simulation. this column heading also 
  ## must match the column with the time-steps in the simulated trajectories
  |> mutate(time = seq_along(period_end_date))
)
print(head(observed_data))
```


Prepare to fit to the data. We make a function so that we can easily update the dimension of the radial basis.

```{r}
make_rbf_calibrator = function(dimension) {
  mp_tmb_calibrator(
      spec = sf_sir
    , data = observed_data
    , traj = "reports"  
    
    ## -----------------------------
    ## this is the key bit
    , tv = mp_rbf("beta", dimension)
    ## -----------------------------
    
    , par = c("gamma", "I", "report_prob")
  )
}
```

And it is also convenient to make a function for plotting the results
```{r}
plot_fit = function(cal_object) {
  fitted_data = mp_trajectory_sd(cal_object, conf.int = TRUE)
  (observed_data
    |> ggplot()
    + geom_point(aes(time, value))
    + geom_line(aes(time, value)
      , data = fitted_data
      , colour = "red"
    )
    + geom_ribbon(aes(time, ymin = conf.low, ymax = conf.high)
      , data = fitted_data
      , alpha = 0.2
      , colour = "red"
    )
    + theme_bw()
  )
}
```

Now we try fitting for a number of different dimensions.

```{r}
sir_cal = make_rbf_calibrator(dimension = 1)
mp_optimize(sir_cal)
plot_fit(sir_cal)
```


```{r}
sir_cal = make_rbf_calibrator(dimension = 2)
mp_optimize(sir_cal)
plot_fit(sir_cal)
```

```{r}
sir_cal = make_rbf_calibrator(dimension = 3)
mp_optimize(sir_cal)
plot_fit(sir_cal)
```

```{r}
sir_cal = make_rbf_calibrator(dimension = 4)
mp_optimize(sir_cal)
plot_fit(sir_cal)
```

```{r}
sir_cal = make_rbf_calibrator(dimension = 5)
mp_optimize(sir_cal)
plot_fit(sir_cal)
```

```{r}
sir_cal = make_rbf_calibrator(dimension = 6)
mp_optimize(sir_cal)
plot_fit(sir_cal)
```

```{r}
sir_cal = make_rbf_calibrator(dimension = 7)
mp_optimize(sir_cal)
plot_fit(sir_cal)
```

Interestingly we have now managed to fit `gamma`
```{r}
mp_tmb_coef(sir_cal)
```

This is a much slower recovery rate. Expected time in the R box is about one year! Not believeable, but the point is to make it easier to fit models so you can try more things.


```{r}
mp_tmb_calibrator(
    spec = sf_sir
  , data = observed_data
  , traj = "reports"  
  , tv = mp_rbf("beta", dimension = 7)
  , par = c("gamma", "I", "report_prob")
  , outputs = c("reports", "beta")
)
```

## References