Assessing Model Accuracy
UFPE
A key aim of this course is to introduce a wide range of statistical learning methods that extend far beyond the standard linear regression approach.
Why is it necessary to introduce so many different statistical learning approaches, rather than just a single best method?
There is no free lunch in statistics: no one method dominates all others over all possible data sets.
On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set.
Hence, for any given data set, it is an important task to decide which method produces the best results.
Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.
We will discuss some of the most important concepts that arise in selecting a statistical learning procedure for a specific data set.
We’ll cover:
Measuring the Quality of Fit: How to evaluate model performance.
The Bias-Variance Trade-Off: A fundamental concept in model selection.
In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data.
We need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation.
In the regression setting, the most commonly-used measure is the mean squared error (MSE):
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 \]
where \(\hat{f}(x_i)\) is the prediction that \(\hat{f}\) gives for the \(i\)th observation and \(y_i\) is the corresponding observed response.
The MSE is the average of the squared differences between the predicted and observed values.
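As a quick illustration, the MSE can be computed directly from a vector of observed responses and a vector of predictions; the numbers below are invented for the example.

```python
import numpy as np

# Invented observed responses and model predictions, just for illustration
y = np.array([3.1, 0.5, 2.7, 4.2, 1.9])
y_hat = np.array([2.8, 0.9, 2.5, 4.6, 1.4])

# MSE: average of the squared differences between observed and predicted values
mse = np.mean((y - y_hat) ** 2)
print(f"MSE = {mse:.3f}")
```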
How do we measure how well a prediction function \(\hat{f}: \mathbb{R}^p \to \mathbb{R}\) performs?
We need a criterion. In regression, we often use a Loss Function \(\mathcal{L}(\text{predicted}, \text{actual})\) to quantify the error of a single prediction.
Examples:
- Quadratic loss: \(\mathcal{L}(\hat{y}, y) = (y - \hat{y})^2\)
- Absolute loss: \(\mathcal{L}(\hat{y}, y) = |y - \hat{y}|\)
The Risk of a function \(\hat{f}\) is its expected loss over data \((X, Y)\).
\[R(\hat{f}) = E[\mathcal{L}(\hat{f}(X), Y)]\]
Using the quadratic loss, the risk in this case is called the Mean Squared Error (MSE):
\[R(\hat{f}) = E[(Y - \hat{f}(X))^2]\]
Let \((X_{new}, Y_{new})\) be a new observation not used to estimate or train \(\hat{f}\).
Using the quadratic loss, the prediction risk in this case is called the MSE of prediction:
\[R_{pred}(\hat{f}) = E[(Y_{new} - \hat{f}(X_{new}))^2]\]
Minimizing prediction risk (with quadratic loss) \(R_{pred}(\hat{f})\) is closely related to estimating the regression function \(f(x) = E[Y|X=x]\).
Let’s define the regression risk as the MSE for estimating \(f(X)\):
\[R_{reg}(\hat{f}) = E \left[ \left(f(X) - \hat{f}(X) \right)^2\right]\]
The following theorem formalizes the connection.
Theorem 1: Suppose we define the prediction risk of \(\hat{f}: \mathbb{R}^p \to \mathbb{R}\) via quadratic loss: \(R_{pred}(\hat{f}) = E \left[ \left(Y - \hat{f}(X) \right)^2\right]\), where \((X, Y)\) is a new observation. Let the regression risk be \(R_{reg}(\hat{f}) = E \left[ \left(f(X) - \hat{f}(X) \right)^2\right]\), where \(f(X) = E[Y|X]\). Then:
\[R_{pred}(\hat{f}) = R_{reg}(\hat{f}) + E[Var[Y|X]]\]
Proof.: \[ \begin{aligned} R_{pred}(\hat{f}) &= E\left[\left(Y - \hat{f}(X)\right)^2\right] \\ &= E\left[\left(Y - f(X) + f(X) - \hat{f}(X)\right)^2\right] \\ &= E\left[ \left(f(X) - \hat{f}(X)\right)^2 + \left(Y - f(X)\right)^2 + 2 \left(f(X) - \hat{f}(X) \right) \left(Y - f(X) \right) \right] \\ &= E\left[\left(f(X) - \hat{f}(X) \right)^2 \right] + E\left[\left(Y - f(X) \right)^2 \right] + 2 E\left[\left(f(X) - \hat{f}(X) \right) \left( Y - f(X) \right) \right] \end{aligned} \]
Now consider the cross term, using the Law of Total Expectation: \[E[A] = E[E[A|X]]\]
\[ \begin{aligned} E \left[ \left(f(X) - \hat{f}(X) \right) \left(Y - f(X) \right) \right] &= E\Big[ E \left[ \left( f(X) - \hat{f}(X) \right) \left( Y - f(X) \right) | X \right] \Big] \\ &= E\Big[ \left( f(X) - \hat{f}(X) \right) \underbrace{E \left[ \left( Y - f(X) \right) | X \right]}_{= E[Y|X] - f(X) = 0} \Big] \\ &= E \left[ \left( f(X) - \hat{f}(X) \right) \cdot 0 \right] = 0 \end{aligned} \]
So, the cross term is zero. We are left with:
\[R_{pred}(\hat{f}) = E[(f(X) - \hat{f}(X))^2] + E[(Y - f(X))^2]\]
The first term is the definition of the regression risk: \[E[(f(X) - \hat{f}(X))^2] = R_{reg}(\hat{f})\]
The second term is the expected conditional variance: \[E[(Y - f(X))^2] = E[ E[ (Y - f(X))^2 | X ] ] = E[Var[Y|X]]\]
Therefore: \[R_{pred}(\hat{f}) = R_{reg}(\hat{f}) + E[Var[Y|X]]\]
\(\square\)
\[R_{pred}(\hat{f}) = \underbrace{E \left[ \left( f(X) - \hat{f}(X) \right)^2 \right]}_{R_{reg}(\hat{f}) \text{ (Reducible Error)}} + \underbrace{E \left[ Var[Y|X] \right]}_{\text{Irreducible Error}}\]
\[\underset{\hat{f}}{\operatorname{arg min}} R_{pred}(\hat{f}) = \underset{\hat{f}}{\operatorname{arg min}} R_{reg}(\hat{f}) = f(x)\]
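To make the decomposition concrete, here is a minimal Monte Carlo sketch; the true f, the noise level, and the (deliberately misspecified) estimate \(\hat{f}\) are all assumptions chosen for illustration. On a large simulated sample of new observations, the empirical prediction risk should match the empirical regression risk plus \(E[Var[Y|X]] = \sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup for this sketch: known true f and homoskedastic noise,
# so the irreducible error E[Var[Y|X]] is simply sigma^2.
f = lambda x: np.sin(2 * np.pi * x)      # true regression function f(x) = E[Y|X=x]
f_hat = lambda x: 0.5 - 1.0 * x          # a fixed, misspecified estimate of f
sigma = 0.5                              # standard deviation of the noise

# Large Monte Carlo sample of new observations (X, Y)
n = 1_000_000
X = rng.uniform(0, 1, n)
Y = f(X) + rng.normal(0, sigma, n)

R_pred = np.mean((Y - f_hat(X)) ** 2)    # prediction risk (MSE of prediction)
R_reg = np.mean((f(X) - f_hat(X)) ** 2)  # regression risk (reducible error)
irreducible = sigma ** 2                 # E[Var[Y|X]]

print(f"R_pred              = {R_pred:.4f}")
print(f"R_reg + E[Var[Y|X]] = {R_reg + irreducible:.4f}")  # approximately equal
```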
Why is low conditional risk \(R(\hat{f}) = E \left[\mathcal{L}(\hat{f}(X), Y) \right]\) desirable?
Imagine we observe a large, new (validation) set of data \((X_{n+1}, Y_{n+1}), \dots, (X_{n+m}, Y_{n+m})\), drawn i.i.d from the same distribution as \((X, Y)\).
By the Law of Large Numbers, if \(m\) is large, the average loss on this new data will approximate the risk:
\[ \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{f}(X_{n+i}), Y_{n+i}) \approx R(\hat{f}) \]
So, if \(R(\hat{f})\) is small, we expect \(\hat{f}(X_{new}) \approx Y_{new}\) on average for future observations. This holds for any loss function \(\mathcal{L}\).
From a predictive viewpoint, the objective is therefore:
To provide methods that yield good estimators \(\hat{f}\) of the true function \(f(x)\), meaning estimators with low risk \(R(\hat{f})\).
In practice, we don’t know \(f(x)\) and can’t calculate \(R(\hat{f})\) exactly, so we need ways to estimate risk or choose models that are likely to have low risk on unseen data.
Figure (left): simulated data; the black curve is the true f, orange is a linear regression fit, and blue and green are smoothing splines with different levels of flexibility.
Figure (right): grey shows the training MSE, red shows the validation MSE, and the dashed line marks the minimum possible validation MSE; squares mark the training/validation MSE of the fits on the left.
Here, f is highly non-linear. Both training and validation MSE decrease rapidly before the validation MSE starts to increase slowly. More flexibility is needed than in the previous example. The linear model is underfitting.
Here, the true f is much closer to linear. Linear regression provides a good fit. The validation MSE decreases only slightly before increasing. The simple linear model performs best. The more flexible models are overfitting to noise.
- The flexibility level corresponding to the minimum validation MSE varies considerably between datasets.
- We need methods to estimate the validation MSE using the available training data.
Two common approaches:
Data Splitting (and Validation Sets)
A simple strategy to estimate validation error is data splitting:
Fit the model using the training set. This means estimating any parameters of the model (e.g., coefficients in linear regression).
Predict the responses for the observations in the validation set using the fitted model.
Calculate the MSE (or other error metric) on the validation set. This provides an estimate of the validation MSE.
Suppose we have \(n\) observations: \((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\).
We might split the data into: a training set \((x_1, y_1), \dots, (x_s, y_s)\) of size \(s\), used to fit the model, and a validation set \((x_{s+1}, y_{s+1}), \dots, (x_n, y_n)\) of size \(n - s\), used to estimate the error.
We estimate the risk with \(\hat{R}(\hat{f})\), an approximation of the true risk \(R(\hat{f}) = E[\mathcal{L}(\hat{f}(X), Y)]\), using the validation set:
\[ \hat{R}(\hat{f}) = \frac{1}{n-s} \sum_{i=s+1}^{n} (y_i - \hat{f}(x_i))^2 \]
where \(\hat{f}\) is the model fit using the training data only, and the sum is over the validation set.
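A minimal sketch of the data-splitting strategy, using simulated data and a plain linear model as stand-ins (scikit-learn's train_test_split and LinearRegression are just one convenient choice here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Simulated data for illustration: a nonlinear f observed with noise
n = 200
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# Split the n observations into a training set and a validation set
x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

# Fit the model using the training set only
model = LinearRegression().fit(x_tr.reshape(-1, 1), y_tr)

# Predict on the validation set and compute the validation MSE (estimated risk)
y_pred = model.predict(x_val.reshape(-1, 1))
risk_hat = np.mean((y_val - y_pred) ** 2)
print(f"Estimated validation MSE: {risk_hat:.4f}")
```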
The U-shape observed in the validation MSE curves (see the figures described above) is a result of two competing properties of statistical learning methods: bias and variance.
Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set.
Ideally, the estimate for f should not vary too much between training sets. The model should be relatively stable.
High variance: Small changes in the training data can result in large changes in \(\hat{f}\). The model is very sensitive to the specific training data it sees.
More flexible statistical methods generally have higher variance. They can fit the training data very closely, but this can lead to overfitting and high variability.
Bias refers to the error introduced by approximating a real-life problem (which may be very complex) with a simpler model.
The optimal flexibility level (where validation MSE is minimized) differs across datasets, depending on the true f and the noise level.
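The following small simulation sketches this trade-off under an assumed true f and noise level (both invented for the example): for each level of flexibility (polynomial degree), the model is refit on many independent training sets, and the average squared bias and variance of \(\hat{f}\) over a grid of x values are reported. Higher degrees reduce bias but increase variance.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed true regression function and noise level, chosen for illustration
f = lambda x: np.sin(2 * np.pi * x)
sigma, n_train, n_sims = 0.3, 50, 200
x_grid = np.linspace(0.05, 0.95, 50)       # points where f_hat is evaluated

for degree in (1, 3, 10):                  # increasing flexibility
    preds = np.empty((n_sims, x_grid.size))
    for s in range(n_sims):
        # A fresh training set for each simulation
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)   # polynomial least-squares fit
        preds[s] = np.polyval(coefs, x_grid)

    # Squared bias: how far the average fit is from the true f
    bias_sq = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
    # Variance: how much the fit changes from one training set to another
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```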
So far, we’ve focused on regression (where the response variable \(Y\) is quantitative). The concepts of bias-variance trade-off also apply to classification (where \(Y\) is qualitative), with some modifications.
In classification, the most common measure of accuracy is the training error rate:
\[ \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i) \]
As in regression, we are most interested in the error rate on validation observations:
\[ \text{Ave}\big(I(y \neq \hat{y})\big) = \frac{1}{n_{\text{val}}} \sum_{i \in \mathcal{T}} I(y_i \neq \hat{y}_i) \]
\(\hat{y}_i\) is the predicted class label for validation observation \(i\), with predictor \(x_i\).
A good classifier has a low validation error rate.
\(\mathcal{T}\) is the index set of validation observations.
Ave: abbreviation of “Average”
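As a tiny illustration, the validation error rate is simply the proportion of misclassified validation observations; the label vectors below are invented for the example.

```python
import numpy as np

# Invented class labels on a small validation set
y_val = np.array(["yes", "no", "no", "yes", "no", "yes", "no", "no"])
y_hat = np.array(["yes", "no", "yes", "yes", "no", "no", "no", "no"])

# Validation error rate: average of the indicators I(y_i != y_hat_i)
error_rate = np.mean(y_val != y_hat)
print(f"Validation error rate: {error_rate:.3f}")  # 2 of 8 misclassified -> 0.250
```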
Structure (for a binary classification problem):
|                  | Predicted: Positive | Predicted: Negative | Total Actual |
|------------------|---------------------|---------------------|--------------|
| Actual: Positive | TP                  | FN                  | TP + FN      |
| Actual: Negative | FP                  | TN                  | FP + TN      |
| Total Predicted  | TP + FP             | FN + TN             | n            |
Scenario: Validation set (\(n=100\)), 10 Actual Positive (Minority), 90 Actual Negative (Majority).
Model predictions: 7 of the 10 actual positives are classified as positive (TP = 7, FN = 3), and 2 of the 90 actual negatives are classified as positive (FP = 2, TN = 88).
Confusion Matrix:
|                  | Predicted: Positive | Predicted: Negative | Total Actual |
|------------------|---------------------|---------------------|--------------|
| Actual: Positive | 7                   | 3                   | 10           |
| Actual: Negative | 2                   | 88                  | 90           |
| Total Predicted  | 9                   | 91                  | 100          |
Accuracy: \(\frac{TP + TN}{n} = \frac{7 + 88}{100} = 0.95\), i.e. 95%. High, right?! But let’s look deeper.
Formula: \[Precision = \frac{TP}{TP + FP}\]
Focus: Minimizing False Positives (FP). High precision means the model is trustworthy when it predicts the positive class.
Relevance: Important when the cost of a False Positive is high (e.g., flagging a legitimate transaction as fraud, sending unnecessary alerts).
\[Precision = \frac{TP}{TP + FP}\]
Using Example:
TP = 7
FP = 2
Predicted Positives = TP + FP = 7 + 2 = 9
Precision: \(\frac{7}{7 + 2} = \frac{7}{9} \approx 0.778\) or 77.8%
Interpretation: When this model predicts “Positive”, it is correct about 77.8% of the time.
\[Recall = \frac{TP}{TP + FN}\]
Using Example:
TP = 7
FN = 3
Actual Positives = TP + FN = 7 + 3 = 10
Recall: \(\frac{7}{7 + 3} = \frac{7}{10} = 0.70\) or 70.0%
Interpretation: This model only found 70% of the actual Positive cases. 30% were missed (FN). This highlights the model’s weakness, unlike the 95% accuracy.
Formula: \[F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}\]
Concept: The harmonic mean of Precision and Recall. It gives more weight to lower values, meaning it penalizes models where one metric is high and the other is very low more than a simple average would.
Relevance: Useful when you need a balance between Precision and Recall, and there isn’t a strong reason to prioritize one significantly over the other.
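For intuition (with invented values): if \(Precision = 0.9\) and \(Recall = 0.1\), the simple average would be \(0.5\), but \(F_1 = 2 \times \frac{0.9 \times 0.1}{0.9 + 0.1} = 0.18\), so the low recall dominates the score.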
\[F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}\]
Using Example:
Precision ≈ 0.778
Recall = 0.70
\(F_1\)-Score: \(2 \times \frac{0.778 \times 0.70}{0.778 + 0.70} = 2 \times \frac{0.5446}{1.478} \approx 2 \times 0.368 \approx 0.737\)
Interpretation: The \(F_1\)-Score of 0.737 provides a more balanced view of performance than the 95% accuracy, reflecting the trade-off between the model’s precision and its ability to find positive cases.
\[MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\]
Using Example (TP = 7, TN = 88, FP = 2, FN = 3):
MCC: \(\frac{7 \times 88 - 2 \times 3}{\sqrt{(7+2)(7+3)(88+2)(88+3)}} = \frac{610}{\sqrt{737100}} \approx 0.71\)
Interpretation: The MCC value of 0.71 indicates a reasonably good correlation between the model’s predictions and the actual classes, taking into account both correct and incorrect predictions for both classes in a balanced way.
\[\kappa = \frac{P_o - P_e}{1 - P_e}\]
where \(P_o\) is the observed agreement (the accuracy) and \(P_e\) is the agreement expected by chance.
Using the same example (TP = 7, TN = 88, FP = 2, FN = 3, n = 100):
Observed agreement: \(P_o = \frac{7 + 88}{100} = 0.95\)
Chance agreement: \(P_e = \frac{(7 + 3)(7 + 2) + (2 + 88)(3 + 88)}{100^2} = \frac{90 + 8190}{10000} = 0.828\)
Kappa: \(\kappa = \frac{0.95 - 0.828}{1 - 0.828} = \frac{0.122}{0.172} \approx 0.71\)
Interpretation: The Kappa value of ~0.71 indicates substantial agreement between the model’s predictions and the actual values beyond what would be expected by chance, reinforcing that the model has learned real patterns.
Recommendation: Always look at the Confusion Matrix first.
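As a closing sketch, the metrics above can be reproduced in code; the label vectors below are reconstructed to match the confusion matrix of the worked example (TP = 7, FN = 3, FP = 2, TN = 88), and scikit-learn's metric functions are one convenient way to compute them.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score)

# Labels reconstructed to match the worked example:
# 10 actual positives (7 predicted positive, 3 negative) and
# 90 actual negatives (2 predicted positive, 88 negative).
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 7 + [0] * 3 + [1] * 2 + [0] * 88)

# Rows/columns ordered as Positive, Negative to match the table above
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))

print(f"Accuracy : {accuracy_score(y_true, y_pred):.3f}")    # 0.95
print(f"Precision: {precision_score(y_true, y_pred):.3f}")   # ~0.778
print(f"Recall   : {recall_score(y_true, y_pred):.3f}")      # 0.70
print(f"F1-score : {f1_score(y_true, y_pred):.3f}")          # ~0.737
print(f"MCC      : {matthews_corrcoef(y_true, y_pred):.3f}") # ~0.71
print(f"Kappa    : {cohen_kappa_score(y_true, y_pred):.3f}") # ~0.71
```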
Aprendizado de Máquina: uma abordagem estatística, Izbicki, R. and Santos, T. M., 2020, link: https://rafaelizbicki.com/AME.pdf.
An Introduction to Statistical Learning: with Applications in R, James, G., Witten, D., Hastie, T. and Tibshirani, R., Springer, 2013, link: https://www.statlearning.com/.
Mathematics for Machine Learning, Deisenroth, M. P., Faisal, A. A. and Ong, C. S., Cambridge University Press, 2020, link: https://mml-book.com.
An Introduction to Statistical Learning: with Applications in Python, James, G., Witten, D., Hastie, T., Tibshirani, R. and Taylor, J., Springer, 2023, link: https://www.statlearning.com/.
Matrix Calculus (for Machine Learning and Beyond), Bright, P., Edelman, A. and Johnson, S. G., 2025, link: https://arxiv.org/abs/2501.14787.
Machine Learning Beyond Point Predictions: Uncertainty Quantification, Izbicki, R., 2025, link: https://rafaelizbicki.com/UQ4ML.pdf.
Mathematics of Machine Learning, Petersen, P. C., 2022, link: http://www.pc-petersen.eu/ML_Lecture.pdf.
Machine Learning - Prof. Jodavid Ferreira