Title: | Out-of-Sample R² with Standard Error Estimation |
---|---|
Description: | Estimates out-of-sample R² through bootstrap or cross-validation as a measure of predictive performance. In addition, a standard error for this point estimate is provided, and confidence intervals are constructed. |
Authors: | Stijn Hawinkel [cre, aut] |
Maintainer: | Stijn Hawinkel <[email protected]> |
License: | GPL-2 |
Version: | 1.0.11 |
Built: | 2024-10-30 06:45:24 UTC |
Source: | https://github.com/sthawinke/oosse |
The .632 bootstrap estimation of the MSE
boot632(y, x, id, fitFun, predFun)
y |
The vector of outcome values |
x |
The matrix of predictors |
id |
the sample indices resampled with replacement |
fitFun |
The function for fitting the prediction model |
predFun |
The function for evaluating the prediction model |
The implementation follows (Efron and Tibshirani 1997)
The MSE estimate
Efron B, Tibshirani R (1997). “Improvements on cross-validation: The 632+ bootstrap method.” J. Am. Stat. Assoc., 92(438), 548 - 560.
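As a rough illustration of the idea behind the .632 estimator, the sketch below combines the apparent (in-sample) error with the out-of-bag error of a single bootstrap resample, using the weights 0.368 and 0.632. The function name boot632Sketch and the simulated data are assumptions made for this example; this is not the package's implementation, which may differ in detail.

```r
# Illustrative single-resample .632 estimate (hypothetical helper, not oosse code)
fitFunLM <- function(y, x) lm.fit(y = y, x = cbind(1, x))
predFunLM <- function(mod, x) cbind(1, x) %*% mod$coefficients

boot632Sketch <- function(y, x, id, fitFun, predFun) {
  # Fit on the bootstrap sample, evaluate on the observations left out of it
  modBoot <- fitFun(y[id], x[id, , drop = FALSE])
  oob <- setdiff(seq_along(y), id)
  mseOob <- mean((y[oob] - predFun(modBoot, x[oob, , drop = FALSE]))^2)
  # Apparent error: model fitted and evaluated on the full data
  modFull <- fitFun(y, x)
  mseIn <- mean((y - predFun(modFull, x))^2)
  # Weighted combination of the optimistic and pessimistic estimates
  0.368 * mseIn + 0.632 * mseOob
}

# Simulated data, for illustration only
set.seed(1)
n <- 50
x <- matrix(rnorm(n * 3), n, 3)
y <- drop(x %*% c(1, -1, 0.5)) + rnorm(n)
id <- sample(n, replace = TRUE)
mse632 <- boot632Sketch(y, x, id, fitFunLM, predFunLM)
```

In practice the out-of-bag error is averaged over many resamples rather than taken from a single one.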
Repeated .632 bootstrap
boot632multiple(nBootstraps, y, ...)
nBootstraps |
The number of .632 bootstraps |
y |
The vector of outcome values |
... |
Passed on to boot632 |
The estimated MSE
The out-of-bag (OOB) bootstrap (smooths leave-one-out CV)
bootOob(y, x, id, fitFun, predFun)
y |
The vector of outcome values |
x |
The matrix of predictors |
id |
sample indices sampled with replacement |
fitFun |
The function for fitting the prediction model |
predFun |
The function for evaluating the prediction model |
The implementation follows (Efron and Tibshirani 1997)
matrix of errors and inclusion times
Efron B, Tibshirani R (1997). “Improvements on cross-validation: The 632+ bootstrap method.” J. Am. Stat. Assoc., 92(438), 548 - 560.
RNA-sequencing data of genetically identical Brassica napus plants in autumn, with 5 phenotypes next spring, as published by De Meyer S, Cruz DF, De Swaef T, Lootens P, Block JD, Bird K, Sprenger H, Van de Voorde M, Hawinkel S, Van Hautegem T, Inzé D, Nelissen H, Roldán-Ruiz I, Maere S (2022). “Predicting yield traits of individual field-grown Brassica napus plants from rosette-stage leaf gene expression.” bioRxiv. doi:10.1101/2022.10.21.513275, https://www.biorxiv.org/content/early/2022/10/23/2022.10.21.513275.full.pdf.
Brassica
A list with two components Expr and Pheno
Matrix with Rlog values of 1000 most expressed genes
Data frame with 5 phenotypes and x and y coordinates of the plants in the field
(De Meyer et al. 2022)
Calculate a confidence interval for R², MSE and MST
buildConfInt(oosseObj, what = c("R2", "MSE", "MST"), conf = 0.95)
oosseObj |
The result of the R2oosse call |
what |
The quantity for which the confidence interval is calculated: R² (default), MSE or MST |
conf |
the confidence level required |
The upper bound of the interval is truncated at 1 for the R² and the lower bound at 0 for the MSE
The confidence intervals for R² and the MSE are based on standard errors and normal approximations. The confidence interval for the MST is based on the chi-squared distribution as in equation (16) of (Harding et al. 2014), but with inflation by a factor (n+1)/n. All quantities are out-of-sample.
A vector of length 2 with lower and upper bound of the confidence interval
Harding B, Tremblay C, Cousineau D (2014). “Standard errors: A review and evaluation of standard error estimators using Monte Carlo simulations.” The Quantitative Methods for Psychology, 10(2), 107 - 123.
data(Brassica)
fitFunLM = function(y, x){lm.fit(y = y, x = cbind(1, x))}
predFunLM = function(mod, x) {cbind(1,x) %*% mod$coef}
R2lm = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, 1:10],
    fitFun = fitFunLM, predFun = predFunLM, nFolds = 10)
buildConfInt(R2lm)
buildConfInt(R2lm, what = "MSE")
buildConfInt(R2lm, what = "MST")
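The normal-approximation intervals with truncation described for R² and the MSE can be sketched as follows. The helper normalCISketch and the numeric inputs are hypothetical and chosen only to trigger the truncation; the chi-squared interval for the MST is not reproduced here.

```r
# Hypothetical helper: estimate +/- z * SE, truncated to [lower, upper]
normalCISketch <- function(est, se, conf = 0.95, lower = -Inf, upper = Inf) {
  z <- qnorm(1 - (1 - conf) / 2)
  ci <- est + c(-1, 1) * z * se
  c(max(ci[1], lower), min(ci[2], upper))
}

# R²: the upper bound is truncated at 1 (made-up estimate and standard error)
ciR2 <- normalCISketch(est = 0.9, se = 0.1, upper = 1)
# MSE: the lower bound is truncated at 0 (made-up estimate and standard error)
ciMSE <- normalCISketch(est = 0.5, se = 0.4, lower = 0)
```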
Check whether supplied prediction function meets the requirements
checkFitFun(fitFun, reqArgs = c("y", "x"))
fitFun |
The prediction function, or its name as character string |
reqArgs |
The vector of required arguments |
Throws an error when requirements not met, otherwise returns the function
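A check of this kind could look roughly like the sketch below, which accepts a function or its name and verifies that the required arguments appear among its formal arguments. The name checkFunSketch and the error message are assumptions for illustration; the package's own check may work differently.

```r
# Hypothetical sketch of an argument check (not the oosse implementation)
checkFunSketch <- function(fun, reqArgs = c("y", "x")) {
  fun <- match.fun(fun)  # accepts a function or its name as a character string
  missingArgs <- setdiff(reqArgs, names(formals(fun)))
  if (length(missingArgs) > 0) {
    stop("Function lacks required argument(s): ",
         paste(missingArgs, collapse = ", "))
  }
  fun  # return the function when all requirements are met
}

goodFun <- function(y, x) lm.fit(y = y, x = cbind(1, x))
checkedFun <- checkFunSketch("goodFun")
# A function without arguments y and x triggers an error
badResult <- try(checkFunSketch(function(a, b) NULL), silent = TRUE)
```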
Estimate correlation between MSE and MST estimators
estCorMSEMST(y, x, fitFun, predFun, methodMSE, methodCor, nBootstrapsCor, nFolds, nBootstraps)
y |
The vector of outcome values |
x |
The matrix of predictors |
fitFun |
The function for fitting the prediction model |
predFun |
The function for evaluating the prediction model |
methodMSE |
The method to estimate the MSE, either "CV" for cross-validation or "bootstrap" for .632 bootstrap |
methodCor |
The method to estimate the correlation between MSE and MST estimators, either "nonparametric" or "jackknife" |
nBootstrapsCor |
The number of bootstraps to estimate the correlation |
nFolds |
The number of outer folds for cross-validation |
nBootstraps |
The number of .632 bootstraps |
the estimated correlation
Estimate MSE and its standard error
estMSE(y, x, fitFun, predFun, methodMSE, nFolds, nInnerFolds, cvReps, nBootstraps)
y |
The vector of outcome values |
x |
The matrix of predictors |
fitFun |
The function for fitting the prediction model |
predFun |
The function for evaluating the prediction model |
methodMSE |
The method to estimate the MSE, either "CV" for cross-validation or "bootstrap" for .632 bootstrap |
nFolds |
The number of outer folds for cross-validation |
nInnerFolds |
The number of inner cross-validation folds |
cvReps |
The number of repeats for the cross-validation |
nBootstraps |
The number of .632 bootstraps |
The nested cross-validation scheme follows (Bates et al. 2023), the .632 bootstrap is implemented as in (Efron and Tibshirani 1997)
A vector with MSE estimate and its standard error
Bates S, Hastie T, Tibshirani R (2023).
“Cross-validation: What does it estimate and how well does it do it?”
J. Am. Stat. Assoc., 118(ja), 1 - 22.
doi:10.1080/01621459.2023.2197686, https://doi.org/10.1080/01621459.2023.2197686.
Efron B, Tibshirani R (1997).
“Improvements on cross-validation: The 632+ bootstrap method.”
J. Am. Stat. Assoc., 92(438), 548 - 560.
Format seconds into human readable format
formatSeconds(seconds, digits = 2)
seconds |
The number of seconds to be formatted |
digits |
the number of digits for rounding |
A character vector expressing time in human readable format
Calculate standard error on MSE from nested CV results
getSEsNested(cvSplitReps, nOuterFolds, n)
cvSplitReps |
The list of outer and inner CV results |
nOuterFolds |
The number of outer folds |
n |
The sample size |
The calculation of the standard error of the MSE as proposed by (Bates et al. 2023)
The estimate of the MSE and its standard error
Bates S, Hastie T, Tibshirani R (2023). “Cross-validation: What does it estimate and how well does it do it?” J. Am. Stat. Assoc., 118(ja), 1 - 22. doi:10.1080/01621459.2023.2197686, https://doi.org/10.1080/01621459.2023.2197686.
Helper function to check if matrix is positive definite
isPD(mat, tol = 1e-06)
mat |
The matrix |
tol |
The tolerance |
A boolean indicating positive definiteness
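One common way to test positive definiteness is to require all eigenvalues of the symmetric matrix to exceed the tolerance. The sketch below follows that approach; the name isPDSketch is hypothetical, and whether the package uses this exact criterion is an assumption.

```r
# Hypothetical eigenvalue-based check (an assumption, not necessarily the oosse code)
isPDSketch <- function(mat, tol = 1e-06) {
  # All eigenvalues of the symmetric matrix must exceed the tolerance
  ev <- eigen(mat, symmetric = TRUE, only.values = TRUE)$values
  all(ev > tol)
}

pdIdentity <- isPDSketch(diag(2))                        # eigenvalues 1 and 1
pdIndefinite <- isPDSketch(matrix(c(1, 2, 2, 1), 2, 2))  # eigenvalues 3 and -1
```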
Process the out-of-bag bootstraps to obtain standard errors, following Efron and Tibshirani (1997)
processOob(x)
x |
The list with out-of-bag bootstrap results |
out-of-bag MSE estimate and standard error
Estimate out-of-sample R² and its standard error
R2oosse(y, x, fitFun, predFun, methodMSE = c("CV", "bootstrap"), methodCor = c("nonparametric", "jackknife"), printTimeEstimate = TRUE, nFolds = 10L, nInnerFolds = nFolds - 1L, cvReps = 200L, nBootstraps = 200L, nBootstrapsCor = 50L, ...)
y |
The vector of outcome values |
x |
The matrix of predictors |
fitFun |
The function for fitting the prediction model |
predFun |
The function for evaluating the prediction model |
methodMSE |
The method to estimate the MSE, either "CV" for cross-validation or "bootstrap" for .632 bootstrap |
methodCor |
The method to estimate the correlation between MSE and MST estimators, either "nonparametric" or "jackknife" |
printTimeEstimate |
A boolean, should an estimate of the running time be printed? |
nFolds |
The number of outer folds for cross-validation |
nInnerFolds |
The number of inner cross-validation folds |
cvReps |
The number of repeats for the cross-validation |
nBootstraps |
The number of .632 bootstraps |
nBootstrapsCor |
The number of bootstraps to estimate the correlation |
... |
Passed on to fitFun and predFun |
Implements the calculation of the out-of-sample R² and its standard error as proposed by Hawinkel et al. (2023). Multithreading is used as provided by the BiocParallel or doParallel packages. A rough estimate of the expected computation time is printed when printTimeEstimate is TRUE, but this is purely indicative. The options to estimate the mean squared error (MSE) are cross-validation (Bates et al. 2023) or the .632 bootstrap (Efron and Tibshirani 1997).
A list with components
R2 |
Estimate of the R² with standard error |
MSE |
Estimate of the MSE with standard error |
MST |
Estimate of the MST with standard error |
corMSEMST |
Estimated correlation between MSE and MST estimators |
params |
List of parameters used |
fullModel |
The model trained on the entire dataset using fitFun |
n |
The sample size of the training data |
Bates S, Hastie T, Tibshirani R (2023).
“Cross-validation: What does it estimate and how well does it do it?”
J. Am. Stat. Assoc., 118(ja), 1 - 22.
doi:10.1080/01621459.2023.2197686, https://doi.org/10.1080/01621459.2023.2197686.
Efron B, Tibshirani R (1997).
“Improvements on cross-validation: The 632+ bootstrap method.”
J. Am. Stat. Assoc., 92(438), 548 - 560.
Hawinkel S, Waegeman W, Maere S (2023).
“Out-of-sample R2: Estimation and inference.”
Am. Stat., 1 - 16.
doi:10.1080/00031305.2023.2216252, https://doi.org/10.1080/00031305.2023.2216252.
data(Brassica)
# Linear model
fitFunLM = function(y, x){lm.fit(y = y, x = cbind(1, x))}
predFunLM = function(mod, x) {cbind(1,x) %*% mod$coef}
y = Brassica$Pheno$Leaf_8_width
R2lm = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, 1:10],
    fitFun = fitFunLM, predFun = predFunLM, nFolds = 10)
Calculate out-of-sample R² and its standard error based on MSE estimates
RsquaredSE(MSE, margVar, SEMSE, n, corMSEMST)
MSE |
An estimate of the mean squared error (MSE) |
margVar |
The marginal variance of the outcome, not scaled by (n+1)/n |
SEMSE |
The standard error on the MSE estimate |
n |
the sample size of the training data |
corMSEMST |
The correlation between MSE and marginal variance estimates |
This function is exported to allow users to supply their own estimates of the MSE, its standard error, and the correlation between the MSE and MST estimators. The marginal variance is scaled internally by (n+1)/n to obtain the out-of-sample MST, so the user should supply the unscaled value.
A vector with the R² and standard error estimates
Hawinkel S, Waegeman W, Maere S (2023). “Out-of-sample R2: Estimation and inference.” Am. Stat., 1 - 16. doi:10.1080/00031305.2023.2216252, https://doi.org/10.1080/00031305.2023.2216252.
# The out-of-sample R² calculated using externally provided estimates
RsquaredSE(MSE = 3, margVar = 4, SEMSE = 0.4, n = 50, corMSEMST = 0.75)
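For the inputs above, the point estimate can be worked out by hand, assuming the usual definition R² = 1 - MSE/MST with the MST obtained by scaling the marginal variance by (n+1)/n. This sketch covers the point estimate only; the standard error calculation of Hawinkel et al. (2023) is not reproduced.

```r
# Worked point estimate, assuming R2 = 1 - MSE / MST
MSE <- 3
margVar <- 4
n <- 50
MST <- margVar * (n + 1) / n  # out-of-sample MST: 4.08
R2 <- 1 - MSE / MST           # approximately 0.2647
```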
Perform simple CV, and return the MSE estimate
simpleCV(y, x, fitFun, predFun, nFolds)
y |
The vector of outcome values |
x |
The matrix of predictors |
fitFun |
The function for fitting the prediction model |
predFun |
The function for evaluating the prediction model |
nFolds |
The number of outer folds for cross-validation |
The MSE estimate
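A plain K-fold cross-validation estimate of the MSE, using the same fitFun and predFun interface, might be sketched as below. The function simpleCVSketch, the random fold assignment and the simulated data are assumptions for illustration and need not match the package internals.

```r
# Hypothetical K-fold CV sketch (not the oosse implementation)
fitFunLM <- function(y, x) lm.fit(y = y, x = cbind(1, x))
predFunLM <- function(mod, x) cbind(1, x) %*% mod$coefficients

simpleCVSketch <- function(y, x, fitFun, predFun, nFolds) {
  # Randomly assign every observation to one of nFolds folds
  folds <- sample(rep(seq_len(nFolds), length.out = length(y)))
  sqErr <- numeric(length(y))
  for (k in seq_len(nFolds)) {
    test <- folds == k
    # Train on all other folds, predict the held-out fold
    mod <- fitFun(y[!test], x[!test, , drop = FALSE])
    sqErr[test] <- (y[test] - predFun(mod, x[test, , drop = FALSE]))^2
  }
  mean(sqErr)  # average squared prediction error over all observations
}

# Simulated data, for illustration only
set.seed(2)
n <- 60
x <- matrix(rnorm(n * 2), n, 2)
y <- drop(x %*% c(2, -1)) + rnorm(n)
mseCV <- simpleCVSketch(y, x, fitFunLM, predFunLM, nFolds = 10)
```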