Possible API for models #160
@pratikunterwegs @bahadzie @BlackEdder @rozeggo @Bisaloo @chartgerink. Just a sketch / brain dump after our discussions yesterday. Thoughts appreciated. |
Thanks @TimTaylor for this sketch. Broadly, the external user-facing 'safe' and internal 'unsafe' split already exists for all C++ implementations (default, Vacamole, diphtheria), where the C++/Rcpp function is called by an exported R function. I'm happy to hear suggestions from anyone who was not at the meeting regarding this.
Contact interventions and vaccinations combine well with objects of the same type. Rate interventions also combine, but the combination must target the same parameter. Users aiming to run two different scenarios, differing in their interventions, would indeed need to pass a list of lists. While unwieldy, we do need to see it from a sustainability perspective - the codebase is already unapproachable for the majority of the team, so it would be good to avoid further complication (as bundling interventions and vaccinations together would require unbundling the object once passed to C++).
This case assumes parameter uncertainty, hence the seed. I would say the user could/should pass a vector of parameter values, which could be drawn from a distribution and passed to each function call. Eventually, this functionality could be added to e.g. {epiparameter}.
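As a sketch of the vector-of-parameters idea above (`run_model()` and its toy internals are hypothetical placeholders, not {epidemics} functions):

```r
# Hypothetical stand-in for a single model run taking one parameter value;
# the real {epidemics} functions have a much richer interface.
run_model <- function(transmissibility) {
  data.frame(
    transmissibility = transmissibility,
    final_size = 1 - exp(-7 * transmissibility) # toy output, not a real model
  )
}

# Draw the parameter values once, up front; the same vector can then be
# reused across intervention scenarios so runs remain comparable.
beta_draws <- rnorm(1000, mean = 1.3 / 7, sd = 0.01)

results <- do.call(rbind, lapply(beta_draws, run_model))
nrow(results) # one row per parameter draw
```

Drawing the vector once, outside the model call, is what lets two scenarios share identical parameter draws without any seed machinery inside the package.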
Suggestions on what works here would be great. The key consideration is that the output would eventually be ingested by a function that calculates differences between discrete scenarios (e.g. intervention 1 vs intervention 2), and there would need to be a way to identify these scenarios within the output.
I think we decided yesterday to eventually class the output. Happy to hear suggestions. |
Unless I misunderstood, the seed issue is more about stochastic models, where you need to compare interventions across hundreds of realisations and so need to start with the same seeds??? |
Well those too, my omission. |
Let's ensure we're on the same page re: seeds and stochasticity. AFAICT:

Simple case: a list (of lists) of multiple interventions within one function call:

```r
seir(
  alpha, beta, gamma, contact_matrix,
  intervention = list(interventions_1, interventions_2),
  n = 1000
)
```

In this case we can easily ensure comparability. More complicated case: different interventions across two function calls:

```r
res1 <- seir(
  alpha, beta, gamma, contact_matrix,
  intervention = list(interventions_1),
  n = 1000
)
res2 <- seir(
  alpha, beta, gamma, contact_matrix,
  intervention = list(interventions_2),
  n = 1000,
  res1$random_seeds # list of the seeds - something along the lines of this???
)
```

We need some way to ensure comparability for the user. Now (arguably) this is where {scenarios} could come in, but I think it's easier to think about it within {epidemics} first, as I'm not sure how easy making a generic {scenarios} package would be. |
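A minimal sketch of the seed bookkeeping being discussed (the function and field names here are hypothetical, not an actual {epidemics} interface): capture `.Random.seed` before each repetition, return those states, and restore them in the comparison call.

```r
# Hypothetical sketch: record the RNG state before each repetition so a
# second call (e.g. with a different intervention) can replay the streams.
run_with_seeds <- function(n, seeds = NULL) {
  results <- numeric(n)
  used_seeds <- vector("list", n)
  for (i in seq_len(n)) {
    if (!is.null(seeds)) {
      # restore the stream used by the corresponding repetition of call 1
      assign(".Random.seed", seeds[[i]], envir = globalenv())
    }
    used_seeds[[i]] <- .Random.seed # state before this repetition
    results[i] <- rnorm(1)          # stand-in for one stochastic realisation
  }
  list(results = results, random_seeds = used_seeds)
}

set.seed(1)
res1 <- run_with_seeds(3)
res2 <- run_with_seeds(3, seeds = res1$random_seeds)
identical(res1$results, res2$results) # TRUE: realisations are paired
```

In a real model the two calls would draw different values (the interventions differ), but each repetition would start from the same stream state, which is what makes the pairwise comparison valid.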
Yes, this is basically what we would need - if we go for this approach of parameter and seed management for users. |
I'm tackling the various sub-issues in this thread one by one:
First up is the API for parameter uncertainty, which is relatively easy to implement if users draw parameter values and pass them to the function. A draft of the internal workings for the default model, using {data.table}, is in a draft branch. Some questions, happy to discuss here or on Slack next week:
Attaching a small reprex here:

```r
library(epidemics)
library(socialmixr)
#>
#> Attaching package: 'socialmixr'
#> The following object is masked from 'package:utils':
#>
#>     cite
library(microbenchmark)

polymod <- socialmixr::polymod
contact_data <- socialmixr::contact_matrix(
  polymod,
  countries = "United Kingdom",
  age.limits = c(0, 20, 40),
  symmetric = TRUE
)
#> Using POLYMOD social contact data. To cite this in a publication, use the 'cite' function
#> Removing participants that have contacts without age information. To change this behaviour, set the 'missing.contact.age' option
contact_matrix <- t(contact_data$matrix)
demography_vector <- contact_data$demography$population

# Prepare some initial objects
uk_population <- population(
  name = "UK population",
  contact_matrix = contact_matrix,
  demography_vector = demography_vector,
  initial_conditions = matrix(
    c(0.9999, 0.0001, 0, 0, 0),
    nrow = nrow(contact_matrix), ncol = 5L,
    byrow = TRUE
  )
)

data <- model_default_cpp(
  uk_population,
  transmissibility = rnorm(1000, 1.3 / 7, 0.01)
)
data
#>             population transmissibility infectiousness_rate recovery_rate
#>                 <list>            <num>               <num>         <num>
#>    1: <population[4]>        0.1992952                 0.5     0.1428571
#>    2: <population[4]>        0.1688840                 0.5     0.1428571
#>    3: <population[4]>        0.1958680                 0.5     0.1428571
#>    4: <population[4]>        0.1798174                 0.5     0.1428571
#>    5: <population[4]>        0.1896007                 0.5     0.1428571
#>   ---
#>  996: <population[4]>        0.1779147                 0.5     0.1428571
#>  997: <population[4]>        0.1905737                 0.5     0.1428571
#>  998: <population[4]>        0.1799411                 0.5     0.1428571
#>  999: <population[4]>        0.1824868                 0.5     0.1428571
#> 1000: <population[4]>        0.1764790                 0.5     0.1428571
#>       intervention vaccination time_dependence time_end increment
#>             <list>      <list>          <list>    <num>     <num>
#>    1:                                               100         1
#>    2:                                               100         1
#>    3:                                               100         1
#>    4:                                               100         1
#>    5:                                               100         1
#>   ---
#>  996:                                               100         1
#>  997:                                               100         1
#>  998:                                               100         1
#>  999:                                               100         1
#> 1000:                                               100         1
#>                       data run_id
#>                     <list>  <int>
#>    1: <data.frame[1515x4]>      1
#>    2: <data.frame[1515x4]>      2
#>    3: <data.frame[1515x4]>      3
#>    4: <data.frame[1515x4]>      4
#>    5: <data.frame[1515x4]>      5
#>   ---
#>  996: <data.frame[1515x4]>    996
#>  997: <data.frame[1515x4]>    997
#>  998: <data.frame[1515x4]>    998
#>  999: <data.frame[1515x4]>    999
#> 1000: <data.frame[1515x4]>   1000

# benchmarking
microbenchmark(
  model_default_cpp(
    uk_population,
    transmissibility = rnorm(1000, 1.3 / 7, 0.01)
  ),
  times = 10
)
#> Warning in microbenchmark(model_default_cpp(uk_population, transmissibility =
#> rnorm(1000, : less accurate nanosecond times to avoid potential integer
#> overflows
#> Unit: seconds
#>                                                                          expr
#>  model_default_cpp(uk_population, transmissibility = rnorm(1000, 1.3/7, 0.01))
#>       min       lq     mean  median       uq      max neval
#>  1.781095 1.816511 1.839991 1.84262 1.865887 1.878012    10
```

Created on 2024-02-02 with reprex v2.0.2 |
Adding some thoughts on this API for @TimTaylor. Re: a safe/unsafe implementation; composable elements are cross-checked against each other. E.g. interventions and vaccinations must have the same number of coefficients as demography groups in the population. Allowing a list of (lists of) interventions requires cross-checking each element. This could be avoided if each model function call allowed only a single list of interventions (on contacts and parameters; i.e., current functionality), as this would reduce input cross-checking to one instance. This still allows parameter uncertainty, as parameter values are not needed for the cross-checks on composable elements. The user would have to run the model for each intervention scenario, but it would make for a faster run time. |
I'm still getting familiar with the package specifics, so I'm happy to follow this discussion but unable to provide meaningful insights on the detailed elements. For what it's worth, I would recommend doing as little seed handling within the functions as possible. If at all possible, it would be helpful to rely on seed setting outside of the function, so it is taken as a given. I do not know to what degree that is possible in this specific use case, of course 😊 If you end up including seed setting, please make sure to set the seeds in a non-global manner (as is best practice). |
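For illustration of what non-global seed handling means (this helper is hypothetical; {withr} provides `withr::with_seed()` for the same purpose): save the caller's RNG state, run with a local seed, then restore.

```r
# Hypothetical helper: evaluate `code` with its own seed, then restore the
# caller's random number stream (the same idea as withr::with_seed()).
with_local_seed <- function(seed, code) {
  has_seed <- exists(".Random.seed", envir = globalenv())
  if (has_seed) old_seed <- get(".Random.seed", envir = globalenv())
  on.exit(
    if (has_seed) assign(".Random.seed", old_seed, envir = globalenv())
  )
  set.seed(seed)
  force(code) # lazy evaluation: `code` runs here, under the local seed
}

set.seed(42)
before <- rnorm(1)
set.seed(42) # reset so `after` starts from the same state as `before`
ignored <- with_local_seed(1, rnorm(1)) # uses its own seed internally
after <- rnorm(1)
identical(before, after) # TRUE: the caller's stream was untouched
```

The point is that the package's internal draws do not perturb the user's global random number stream, which keeps their own `set.seed()` calls meaningful.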
I agree with this. Setting a seed in R is pretty easy so I think this should be part of training and not built into the tool. Still, it's a balance between building a robust package and a tool whose convenience promotes uptake, I've found. |
@pratikunterwegs / @chartgerink - I won't really have too much time to look at this this week (I've not yet worked through the reprex you showed). But to add some more context / commentary to your comments:
Would allowing parameter uncertainty and different interventions but restricting to a single population matrix be a solution? I think allowing multiple lists of interventions within the call is likely to be useful, as managing the random number streams will be awkward otherwise for users. If we went the single-list-of-interventions route then we would need to return a list of seeds.
At a high level users will still be able to set.seed() at the start of their scripts. The issue here is that managing streams across multiple stochastic realisations with multiple interventions and parameters is much more fiddly. If it's just one repetition at a time with just one list of interventions then yes, it wouldn't be too onerous on the user (they'd still need to manage seeds between repetitions and have a good way of doing so). Basically, how much we need to manage the seeds is closely tied to whether we go with this sort of API or not.
We wouldn't be setting the initial seed - only ensuring the seeds align between replications so AFAICT nothing would need restoring. I'd be worried if a stochastic function didn't change the seed 😅
Generally yes, but here there is likely a lot of bookkeeping for people to do (and possibly get wrong), involving both getting and setting seeds across interventions and parameters. Repeating a sentence from earlier as it is very relevant here:
|
Thanks @TimTaylor - we (@rozeggo, @bahadzie and I) just met again, and I'll be working through the various sub-issues here bit by bit. So rather than one giant PR with a fully thought-through implementation, I'll make several smaller PRs, where it would be great if you could take a look once you have time.
No - each intervention would have to be checked against the population matrix (or the vector of demography group sizes), as this determines whether they are compatible.
This is still quite OTT imo. The simpler way to start, imo, is to allow passing a vector of parameters for parameter uncertainty, and to have users reuse that vector for alternative scenarios.
The issue of combined parameter uncertainty and stochastic uncertainty is one I'll set aside for now while focusing on parameter uncertainty, but definitely to be taken up in the coming weeks.
This strikes me as a nice-to-have, hand-holding feature, rather than critically necessary for the tool. I don't mind working towards this but perhaps something to keep to one side for now.
We have decided to go with this API - but seed management can be left to one side until we get to a doing-comparisons stage. That's still a while away. I don't personally see the benefits of storing or returning seeds just yet, but perhaps they will become clearer as this effort progresses. |
Yeah that's cool. Think we are getting confused at the mo.
Yeah, I don't see this as a problem. The main saving comes from not having to check e.g. 3000 different parameters, just 3. I think it's fine to check multiple intervention lists against the population matrix, as I doubt there is much overhead there.
The issue manifests if you allow users to do multiple repetitions for a set of parameters and want to compare interventions across multiple calls (not a problem if done within one call). Whilst a user could match the seed for the first repetition, they couldn't for subsequent repetitions, as the intervention will affect the random number stream. |
Thanks for sharing the previous discussion and the interesting points raised here. Generally, I'm in favour of having a dual API, which allows us to solve the simplicity-flexibility trade-off by placing two points on the scale: access to the full flexibility via a low-level API, and a simplified, less flexible high-level API. The way I had imagined it, however, is that the high-level API / the vectorised function would be part of {scenarios}. This may seem like a minor point, but I believe it would be helpful to ensure we follow good design patterns and limit coupling between both APIs. On this note, the current proposal doubles the size of the namespace, which may not be ideal. An alternative approach would be to have a single generic function for the high-level API. On a related note, I think we're probably in agreement about this, but it's worth explicitly mentioning: we should follow the vctrs strategy on recycling to avoid unexpected behaviour. In terms of class, I agree a nested data.frame is probably the best choice, rather than a list. Even though it's more complex than a non-nested data.frame, it still has a similar feel for users, and tidyverse tools are directly usable with it.
While I appreciate the need for data.table internally, is it possible to convert it to a data.frame before returning, so non-data.table users don't get confused? I'm also quite uncomfortable with the seed management, as I suspect it contains a couple of issues and edge cases, such as if we want to allow parallel processing in the future. But I'll need to think about it more, and it seems anyway that we don't really have a choice. |
I strongly suggest that there is another meeting/discussion to agree the API to work towards before any further development is done, as I think there's a risk of going in circles otherwise. |
Thanks @Bisaloo, just a few clarifications below.
I agree that the internal function should not be exported; in any case, the two-level structure already exists for Rcpp model functions, which are called by R-only exported functions. Additionally, I'm opening an issue to propose removal of the R-only versions of ODE model code in favour of the C++ implementation, which would reduce the namespace and the codebase. This is because our new understanding of the use case (1000s of runs per modelling script/task) suggests that a slower implementation is no longer useful, adds to maintenance, and is too deeply buried to be a good teaching tool.
I agree with this, although users might be hoping for base R style recycling, and this might need to be addressed - something for later.
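To illustrate the difference being flagged (a sketch; `vec_recycle()` is a real {vctrs} function, the surrounding values are illustrative): base R recycles any shorter vector, while vctrs only recycles vectors of size 1.

```r
library(vctrs)

# Base R silently recycles the shorter vector:
base_result <- c(1, 2, 3, 4) + c(10, 20) # c(11, 22, 13, 24)

# vctrs-style recycling: size 1 recycles to any size...
vec_recycle(1, size = 4) # c(1, 1, 1, 1)

# ...but any other size mismatch is an error rather than silent recycling.
try(vec_recycle(c(10, 20), size = 4)) # error: can't recycle size 2 to size 4
```

Following the vctrs rule would mean a length-2 parameter vector against 1000 runs errors loudly instead of being quietly repeated 500 times.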
Agreed - all {epidemics} outputs are data.frames; this is only for an initial prototype on the branch. @TimTaylor - happy to have another meeting, but I think this would be more productive if we were discussing a prototype rather than a hypothetical. I'll get an initial version of parameter uncertainty in the default model up and running this week, and would be happy to meet to discuss that. Will put this in the Slack now. |
For future reference, I am now convinced the approach proposed in this issue's first message is the best path forward. In particular, @TimTaylor mentioned the following benefits:
Also for future reference, I should clarify that the main benefit of this implementation IMO is not really performance or sparing users from writing their own loops (this level of R proficiency will likely be necessary to use any of our packages anyway), but that it allows us to have a stable and predictable output format (or class) for scenario modelling, which facilitates integration with downstream packages. I have mentioned reservations about how parallel computing would work in this framework, but we agreed:
|
The structure of the classed output probably warrants raising an additional issue. Points to consider:
|
Thanks both, I'm opening some related issues now, which will be collected under this project. The structure of the classed output is already open for discussion under #156 if you have any thoughts. |
The code that this issue really affects is the stochastic Ebola model, which is also the slowest one, as it is written in pure R. A good way forward here could be to refactor it. Edit: an alternative is of course to calculate the relevant output ahead of time. |
@pratikunterwegs - I've submitted a small PR (#171) that may improve performance. |
Thanks @TimTaylor, will take a look and merge soon. My initial thought was to check whether I could find this functionality in GSL through RcppGSL; my next thought was to replace the Erlang with the Gamma distribution. But the simpler solution is of course to peel off these initial parts from the later iteration step. There is one other speed bottleneck as well. |
I did actually have a C++ version which, iirc, was not much faster. It was removed in #140. Perhaps speeding up that step is still worth exploring. |
Overview
Expose two functions to users:
An unsafe (little / no input checking) one that works on individual runs and interventions. This is more like an internal function, but we expose it for "advanced" users. It only returns results and is not classed.
A vectorised one that allows comparisons of interventions.
Motivation for this API
Input checking overhead adds up over a large number of runs. As we expect users to explore different levels of uncertainty, the number of runs they perform will be large. We want to provide a safe way for users to do multiple runs.
For stochastic models, comparing interventions across multiple runs involves careful handling of random number streams. This is easy to ignore and get wrong. To this end we do want to support users in managing this. Whether this should be part of a scenarios package or not can be addressed later once we have experience of the best API.
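The two-level design above can be sketched roughly as follows (names like `seir_unsafe()` and the toy model body are hypothetical, not the actual {epidemics} API): validation happens once in the vectorised wrapper, not once per run.

```r
# Unsafe core: no input checking, one run, plain unclassed output.
seir_unsafe <- function(beta, gamma) {
  # stand-in for a single model run
  data.frame(beta = beta, gamma = gamma, r0 = beta / gamma)
}

# Safe vectorised wrapper: validate inputs once, then map over parameters.
seir <- function(beta, gamma) {
  stopifnot(is.numeric(beta), is.numeric(gamma), all(gamma > 0))
  do.call(rbind, Map(seir_unsafe, beta, gamma)) # one row per combination
}

res <- seir(beta = rnorm(100, 1.3 / 7, 0.01), gamma = 1 / 7)
nrow(res) # 100 runs, checked only once
```

"Advanced" users who have already validated their inputs can call the core directly and skip the checking overhead entirely.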
SEIR Example
Challenges / questions
Do interventions combine nicely (e.g. vaccinations and contact reductions) into a single intervention, or will we need to take lists of lists? Implementation-wise this is not a problem, but will it feel unwieldy for users?
If a user wants to do separate function calls to compare interventions then we need to provide a way for them to match random seeds for each internal iteration. This means that we would need to return the random seed for each iteration. For the default random number generator in R this corresponds to an integer vector of length 626. Having to return `length(param) * n` vectors of length 626 could get unwieldy (and large) - thoughts???
What format for output of the vectorised function? List or nested data.frame / tibble / data.table?
Do we class the output?
Are the random seed challenges the things that could motivate a scenarios package? Would handling it in epidemics be too much? Too early to say?
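On the size concern raised above: for R's default Mersenne-Twister generator, `.Random.seed` is an integer vector of length 626 (one element encoding the RNG kind plus 625 for the generator state), so storing one per iteration grows quickly. A rough sketch of the arithmetic (the 1000 x 100 run counts are illustrative only):

```r
set.seed(1)
length(.Random.seed) # 626 for the default Mersenne-Twister generator
typeof(.Random.seed) # "integer"

# Illustrative sizes: 1000 parameter draws x 100 repetitions.
n_params <- 1000
n_reps <- 100
n_ints <- n_params * n_reps * length(.Random.seed)
n_ints * 4 / 2^20 # ~239 MB of 4-byte integers returned alongside results
```

This is the "unwieldy (and large)" problem in concrete terms: the seed bookkeeping can dwarf the model output itself.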