Using a split dataset, estimate parameters by optimisation on each data subset and evaluate the resulting models on every subset.

crossValidate(
  MODEL,
  periods,
  name.Model.str = paste(MODEL$sma, MODEL$routing),
  name.Cal.objfn = "unknown",
  name.Catchment = as.character(MODEL$call$DATA),
  fitBy,
  ...,
  trace = isTRUE(hydromad.getOption("trace")),
  parallel = hydromad.getOption("parallel")[["crossValidate"]]
)

Arguments

MODEL

an object of class hydromad.

periods

named list of start and end dates, passed to splitData

name.Model.str

Name to give to this model structure to allow a combined analysis

name.Cal.objfn

Name to give to this model identification process (e.g. name of objective function and/or optimisation algorithm), to allow a combined analysis

name.Catchment

Name to give to this catchment to allow a combined analysis

fitBy

function to estimate parameters of MODEL, e.g. fitByOptim, fitBySCE

...

Arguments passed to fitBy

trace

Whether to report messages.

parallel

name of method to use for parallelisation ("foreach" or "none"), or list giving settings for parallelisation. See hydromad_parallelisation.

Value

A runlist of subclass crossvalidation containing n*n models: parameters estimated from each of the n periods are evaluated on each of the n periods.

Parallelisation

crossValidate optionally allows the separate optimisations to be run concurrently using the parallel option method="foreach". This is usually only worthwhile for longer-running optimisations such as fitBySCE, rather than relatively fast methods such as fitByOptim.
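For example, a minimal sketch of running the per-period optimisations concurrently, assuming the doParallel backend is available and using the modx and periods objects from the Examples below:

library(doParallel)
registerDoParallel(cores = 2) ## register a foreach backend with 2 workers
runs <- crossValidate(modx,
  periods = periods, fitBy = fitBySCE,
  parallel = "foreach" ## fit each period concurrently via foreach
)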

The total runtime is limited by the slowest of the optimisations, e.g. the one with the longest data subset, the most complex objective function response surface, or the slowest computer on which an optimisation is being run. Some workers may therefore be idle (potentially wasting money) even though others are still running.

The evaluation of parameters on validation data subsets can also optionally be parallelised through the function update.runlist, by setting hydromad.options(parallel=list(update.runlist="clusterApply")). The advantage of this is likely to be minor unless a large number of cross-validation periods are used, because of the overhead involved and because evaluation is fast relative to the optimisation. Note that this requires parallelisation to be set up on the worker, which is where the evaluation occurs.
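For example, a sketch of enabling this option (the cluster itself still needs to be set up on the worker as described in hydromad_parallelisation):

## distribute the validation runs made by update.runlist over a cluster
hydromad.options(parallel = list(update.runlist = "clusterApply"))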

If the parallelisation backend for foreach supports it, the cross-validations can be set to run in the background using parallel=list(method="foreach",async=TRUE). In this case, the function returns immediately, and progress and results can be retrieved using functions provided by the parallelisation backend. This can be useful for submitting a number of cross-validations whose results are not immediately needed. As with a single cross-validation, mixing long- and short-running optimisations can make it difficult to fully utilise the available workers.
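For example, a sketch of submitting a cross-validation in the background, assuming the registered foreach backend supports asynchronous evaluation:

runs_async <- crossValidate(modx,
  periods = periods, fitBy = fitBySCE,
  parallel = list(method = "foreach", async = TRUE)
)
## progress and results are then retrieved with the backend's own functions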

In future, it may also be possible to parallelise each optimisation itself in addition to or instead of parallelising the optimisation of each data period.

Author

Joseph Guillaume

Examples


data(Cotter)
modx <- hydromad(Cotter,
  sma = "cwi", routing = "expuh",
  tau_s = c(2, 100), v_s = c(0, 1)
)
periods <- list(
  P1 = as.Date(c("1966-05-01", "1976-04-30")),
  P2 = as.Date(c("1976-05-01", "1986-04-30"))
)
## Estimate parameters using single fitByOptim run
## from single initial parameter set
runs <- crossValidate(modx, periods = periods, fitBy = fitByOptim, samples = 1)
summary(runs)
#>          timesteps missing   mean.P    mean.Q    runoff      rel.bias r.squared
#> P1_calP1      3553       0 2.936248 1.0010380 0.3409242  1.647550e-17 0.7838178
#> P2_calP1      3552       0 2.791132 0.7544418 0.2702996  2.220760e-01 0.6733349
#> P1_calP2      3553       0 2.936248 1.0010380 0.3409242 -1.916880e-01 0.7043730
#> P2_calP2      3552       0 2.791132 0.7544418 0.2702996 -2.930517e-09 0.7614177
#>          r.sq.sqrt  r.sq.log sim.period calib.period Model.str Cal.objfn
#> P1_calP1 0.7887613 0.7629356         P1           P1 cwi expuh   unknown
#> P2_calP1 0.7552158 0.7761833         P2           P1 cwi expuh   unknown
#> P1_calP2 0.7532489 0.7349019         P1           P2 cwi expuh   unknown
#> P2_calP2 0.8411148 0.8464667         P2           P2 cwi expuh   unknown
#>          Catchment
#> P1_calP1    Cotter
#> P2_calP1    Cotter
#> P1_calP2    Cotter
#> P2_calP2    Cotter
## Cross-validation statistics can then be analysed with other methods
paretoTimeAnalysis_areModelsDominated(summary(runs))
#>   calib.period Model.str Cal.objfn Catchment objective        P1        P2
#> 1           P1 cwi expuh   unknown    Cotter r.squared 0.7838178 0.6733349
#> 2           P2 cwi expuh   unknown    Cotter r.squared 0.7043730 0.7614177
#>   dominated
#> 1     FALSE
#> 2     FALSE
library(reshape) ## melt() and cast() are provided by the reshape package
cast(
  melt(summary(runs), id.vars = c("calib.period", "sim.period")),
  calib.period ~ variable + sim.period
)
#>   calib.period timesteps_P1 timesteps_P2 missing_P1 missing_P2        mean.P_P1
#> 1           P1         3553         3552          0          0 2.93624824092316
#> 2           P2         3553         3552          0          0 2.93624824092316
#>          mean.P_P2        mean.Q_P1        mean.Q_P2         runoff_P1
#> 1 2.79113175675676 1.00103801325108 0.75444183767805 0.340924176402861
#> 2 2.79113175675676 1.00103801325108 0.75444183767805 0.340924176402861
#>           runoff_P2          rel.bias_P1           rel.bias_P2
#> 1 0.270299614431207 1.64755036639954e-17      0.22207598877803
#> 2 0.270299614431207   -0.191687990812397 -2.93051684043601e-09
#>        r.squared_P1      r.squared_P2      r.sq.sqrt_P1      r.sq.sqrt_P2
#> 1  0.78381782651507 0.673334854810548 0.788761332725465 0.755215815313714
#> 2 0.704372971173514 0.761417709020757 0.753248892472808 0.841114791953054
#>         r.sq.log_P1       r.sq.log_P2 Model.str_P1 Model.str_P2 Cal.objfn_P1
#> 1  0.76293564308086 0.776183292137543    cwi expuh    cwi expuh      unknown
#> 2 0.734901880638413 0.846466696367463    cwi expuh    cwi expuh      unknown
#>   Cal.objfn_P2 Catchment_P1 Catchment_P2
#> 1      unknown       Cotter       Cotter
#> 2      unknown       Cotter       Cotter
paretoTimeAnalysis(runs)
#> 
#> Cross-validation Pareto analysis
#> Which models cannot be rejected, due to dataset uncertainty/non-stationarity?
#> 
#> == Eliminating Pareto-dominated models ==
#>  (worse than another model in all periods)
#> 
#> How many models are dominated (and therefore eliminated)
#> Cotter.FALSE 
#>            2 
#> 
#> Which model structures are non-dominated in each catchment?
#> Proportion of model instances that are non-dominated
#>        cwi expuh
#> Cotter         1
#> 
#> Specify show.models=TRUE to show non-dominated and dominated models
#> Specify show.models=prefix to obtain csv of whether models are dominated
#> 
#> == Performance across all periods ==
#> What is the range of non-dominated performance (RNDP) across all periods?
#> Is it large - is changing datasets causing problems?
#>  Catchment value.min value.max     RNDP
#>     Cotter 0.6733349 0.7838178 0.110483
#> 
#> == Performance in each period ==
#> What is the RNDP in each period?
#> Is it low even though total RNDP is high? Why?
#>  Is there reason to believe the objective function is not comparable over time?
#>  Catchment sim.period value.min value.max       RNDP
#>     Cotter         P1 0.7043730 0.7838178 0.07944486
#>     Cotter         P2 0.6733349 0.7614177 0.08808285
#> 
#> == Worst non-dominated models in each period ==
#> Do any non-dominated models have unacceptable performance?
#> Which non-dominated model has the worst performance in each period? Why?
#>  Is it consistently the same dataset? Is there reason for that dataset to be problematic?
#>  Is it consistently the same model structure? Should another model structure have been selected?
#>  Is it consistently the same calibration objective function? Is it overfitting part of the record?
#> 
#>  Cotter 
#>  sim.period worst.performance calib.period Model.str Cal.objfn        P1
#>          P1         0.7043730           P2 cwi expuh   unknown 0.7043730
#>          P2         0.6733349           P1 cwi expuh   unknown 0.7838178
#>         P2
#>  0.7614177
#>  0.6733349
#> 
#> == Variation in inferred internal behaviour - Range of non-dominated parameters ==
#> Is the range of parameters of non-dominated models large for any single model structure?
#> Does the difference in performance correspond to different internal model behaviour?
#> 
#>  Catchment Model.str variable   value.mean    value.min    value.max  range
#>     Cotter cwi expuh       tw 70.079841318 59.078470550 81.081212085 22.003
#>     Cotter cwi expuh        f  2.465770963  2.189511638  2.742030287  0.553
#>     Cotter cwi expuh    scale  0.001281642  0.001277323  0.001285962  0.000
#>     Cotter cwi expuh        l  0.000000000  0.000000000  0.000000000  0.000
#>     Cotter cwi expuh        p  1.000000000  1.000000000  1.000000000  0.000
#>     Cotter cwi expuh    t_ref 20.000000000 20.000000000 20.000000000  0.000
#>     Cotter cwi expuh    tau_s 14.818779726 11.550794573 18.086764880  6.536
#>     Cotter cwi expuh      v_s  0.885253120  0.880118594  0.890387646  0.010
#>  range.as.%.of.mean
#>               31.40
#>               22.43
#>                0.00
#>                 NaN
#>                0.00
#>                0.00
#>               44.11
#>                1.13
#> 
#> == Performance with other statistics ==
#> Use the set of non-dominated models as an ensemble to predict a quantity of interest
#> Is the model unacceptable in any period? Is the uncertainty too large?
#>                variable sim.period       min       max      range
#>  q90.in.units.of.runoff         P1 1.8186625 2.2699434 0.45128086
#>  q90.in.units.of.runoff         P2 1.8974348 2.3766883 0.47925350
#>                r.sq.log         P1 0.7349019 0.7629356 0.02803376
#>                r.sq.log         P2 0.7761833 0.8464667 0.07028340