Classification Methods
UFPE
represent random variables (r.v.);
represent realizations of random variables;
represent random vectors;
represent realizations of random vectors;
represent random matrices;
represent realizations of random matrices;
dimension of features, variables, parameters;
sample size;
\(i\)-th observation, instance (e.g., \(i=1, ..., n\));
\(j\)-th feature, variable, parameter (e.g., \(j=1, ..., p\));
number of classes (i.e., cardinality of class set \(\mathcal{C}\));
number of folds in CV / number of nearest neighbors in \(k\)-NN.
\[R(f) := \mathbb{E}[I(Y \neq f(\mathbf{x}))] = P(Y \neq f(\mathbf{x}))\]
\[L(f(\mathbf{x}), Y) = I(Y \neq f(\mathbf{x}))\]
Theorem: Suppose we define the risk of a prediction function \(\hat{f}: \mathbb{R}^p \to \mathcal{C}\) via 0-1 loss: \(R(\hat{f}) = P(Y \neq \hat{f}(\mathbf{x}))\), where \((\mathbf{x},Y)\) is a new draw from the joint distribution.
Then the function \(\hat{f}\) that minimizes \(R(\hat{f})\) is given by:
\[\hat{f}(\dot{\mathbf{x}}) = \underset{c \in \mathcal{C}}{\text{arg max }} P(Y=c|\mathbf{x}=\dot{\mathbf{x}})\]
\[\begin{align} R(\hat{f}) = & \mathbb{E}[I(Y \neq \hat{f}(\mathbf{x}))] = P(Y \neq \hat{f}(\mathbf{x})) \\ = & \int_{\mathbb{R}^p} P(Y \neq \hat{f}(\dot{\mathbf{x}})|\mathbf{x}=\dot{\mathbf{x}}) p(\dot{\mathbf{x}}) d\dot{\mathbf{x}} \\ = & \int_{\mathbb{R}^p} [I(\hat{f}(\dot{\mathbf{x}})=c_2)P(Y=c_1|\mathbf{x}=\dot{\mathbf{x}}) + \\ & I(\hat{f}(\dot{\mathbf{x}})=c_1)P(Y=c_2|\mathbf{x}=\dot{\mathbf{x}})] p(\dot{\mathbf{x}}) d\dot{\mathbf{x}} \end{align}\]
where \(p(\dot{\mathbf{x}})\) is the probability density function of \(\mathbf{x}\).
\[ \hat{f}(\dot{\mathbf{x}}) = c_1 \iff P(Y=c_1|\mathbf{x}=\dot{\mathbf{x}}) \ge \frac{1}{2} \]
\[ R(\hat{f}) = 1 - \mathbb{E}\left[ \max_{c \in \mathcal{C}} P(Y=c | \mathbf{x}) \right] \]
\[ R(\hat{f}) = \mathbb{E}\left[ \min\left( P(Y=c_1 | \mathbf{x}), P(Y=c_2 | \mathbf{x}) \right) \right] \]
Theorem: Let \(\mathcal{G} = \{\hat{f}_1, \dots, \hat{f}_T\}\) be a finite class of \(T\) candidate classifiers estimated from a training set. Let \(\tilde{R}(\hat{f})\) be the estimated error of \(\hat{f}\) on an independent validation set of size \(m = n - s\). Let \(\hat{f}^*\) be the model that minimizes the true risk \(R(\hat{f})\) among \(\hat{f} \in \mathcal{G}\), and let \(\hat{\tilde{f}}\) be the model that minimizes the estimated validation risk \(\tilde{R}(\hat{f})\) among \(\hat{f} \in \mathcal{G}\).
Then, for any \(\epsilon > 0\), with probability at least \(1-\epsilon\): \[ R(\hat{\tilde{f}}) \le R(\hat{f}^*) + 2 \sqrt{\frac{1}{2m} \log \frac{2T}{\epsilon}} \]
Plug-in classifiers generally fall into two main paradigms:
Equivalently, the model assumes a linear relationship for the log-odds (logit): \[ \log \left( \frac{P(Y=1|\mathbf{x}=\dot{\mathbf{x}})}{1 - P(Y=1|\mathbf{x}=\dot{\mathbf{x}})} \right) = \boldsymbol{\beta}^\top \dot{\mathbf{x}} = \beta_0 + \beta_1 \dot{x}_1 + \dots + \beta_p \dot{x}_p \]
We do not assume this relationship is perfectly true in nature, but we use it as a parametric approximation to build a classifier.
Coefficients \(\boldsymbol{\beta}\) are estimated using Maximum Likelihood Estimation (MLE).
Given i.i.d. observations \((\dot{\mathbf{x}}_i, y_i)_{i=1}^n\), the conditional likelihood function is: \[ L(\boldsymbol{\beta}; \dot{\mathbf{y}}, \dot{\mathbf{X}}) = \prod_{i=1}^n [p_i(\boldsymbol{\beta})]^{y_i} [1 - p_i(\boldsymbol{\beta})]^{1-y_i} \] where \(p_i(\boldsymbol{\beta}) = P(Y_i=1|\mathbf{x}_i=\dot{\mathbf{x}}_i, \boldsymbol{\beta}) = \frac{1}{1 + e^{-\boldsymbol{\beta}^\top \dot{\mathbf{x}}_i}}\).
Maximizing the likelihood is equivalent to maximizing the log-likelihood: \[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ y_i \log p_i(\boldsymbol{\beta}) + (1-y_i) \log(1 - p_i(\boldsymbol{\beta})) \right] = \sum_{i=1}^n \left[ y_i \boldsymbol{\beta}^\top \dot{\mathbf{x}}_i - \log(1 + e^{\boldsymbol{\beta}^\top \dot{\mathbf{x}}_i}) \right] \]
Since there is no closed-form solution for \(\hat{\boldsymbol{\beta}}\), numerical optimization (e.g., Newton-Raphson or Iteratively Reweighted Least Squares - IRLS) is used.
When the number of classes \(K = |\mathcal{C}| > 2\):
Multinomial Logistic Regression (Softmax Regression): Directly models the multi-class probabilities using: \[ P(Y=c|\mathbf{x}=\dot{\mathbf{x}}) = \frac{e^{\boldsymbol{\beta}_c^\top \dot{\mathbf{x}}}}{\sum_{s \in \mathcal{C}} e^{\boldsymbol{\beta}_s^\top \dot{\mathbf{x}}}} \] To ensure identifiability, one class is chosen as a reference (e.g., \(c_{ref}\)), setting \(\boldsymbol{\beta}_{c_{ref}} = \mathbf{0}\).
One-vs-Rest (OvR):
\[ P(Y=c|\mathbf{x}=\dot{\mathbf{x}}) = \frac{p(\dot{\mathbf{x}}|Y=c)P(Y=c)}{\sum_{s \in \mathcal{C}} p(\dot{\mathbf{x}}|Y=s)P(Y=s)} \]
The Naive Bayes classifier simplifies this by making the “naive” assumption that features are conditionally pairwise independent given the class: \[ p(\dot{\mathbf{x}}|Y=c) = p((\dot{x}_1, \dots, \dot{x}_p)|Y=c) = \prod_{j=1}^p p_j(\dot{x}_j|Y=c) \]
This assumption is often unrealistic in practice, yet Naive Bayes frequently yields competitive classification performance.
Under this assumption, we only need to estimate \(p\) univariate densities \(p_j(\dot{x}_j|Y=c)\) for each class.


Thus, the predicted class of observation \(\dot{\mathbf{x}}_0\) is defined as:
\[ \hat{y}_0 = \underset{i \in \mathcal{N}_{\dot{\mathbf{x}}_0}}{\text{mode }} y_{i} \]
A distance function \(d(\dot{\mathbf{x}}_i, \dot{\mathbf{x}}_j)\) must satisfy the following properties:
\(d(\dot{\mathbf{x}}_i, \dot{\mathbf{x}}_j) \geq 0\) (Non-negativity);
\(d(\dot{\mathbf{x}}_i, \dot{\mathbf{x}}_j) = 0 \iff \dot{\mathbf{x}}_i = \dot{\mathbf{x}}_j\) (Identity of indiscernibles);
\(d(\dot{\mathbf{x}}_i, \dot{\mathbf{x}}_j) = d(\dot{\mathbf{x}}_j, \dot{\mathbf{x}}_i)\) (Symmetry);
\(d(\dot{\mathbf{x}}_i, \dot{\mathbf{x}}_j) \leq d(\dot{\mathbf{x}}_i, \dot{\mathbf{x}}_k) + d(\dot{\mathbf{x}}_k, \dot{\mathbf{x}}_j)\) (Triangle inequality).
The two most common distance metrics are the Euclidean distance and the Mahalanobis distance.
The Euclidean distance between two vectors \(\dot{\mathbf{x}}_i\) and \(\dot{\mathbf{x}}_j\) is given by:
\[d(\dot{\mathbf{x}}_i, \dot{\mathbf{x}}_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - x_{jl})^2} = \sqrt{(\dot{\mathbf{x}}_i - \dot{\mathbf{x}}_j)^\top (\dot{\mathbf{x}}_i - \dot{\mathbf{x}}_j)}.\]
The Mahalanobis distance is given by:
\[d(\dot{\mathbf{x}}_i, \dot{\mathbf{x}}_j) = \sqrt{(\dot{\mathbf{x}}_i - \dot{\mathbf{x}}_j)^\top \hat{\boldsymbol{\Sigma}}^{-1} (\dot{\mathbf{x}}_i - \dot{\mathbf{x}}_j)},\] where \(\hat{\boldsymbol{\Sigma}}\) is the sample covariance matrix of the training data features.
Tree models typically suffer from high variance. Ensemble methods mitigate this:

Aprendizado de Máquina: uma abordagem estatística, Izibicki, R. and Santos, T. M., 2020, link: https://rafaelizbicki.com/AME.pdf.
An Introduction to Statistical Learning: with Applications in R, James, G., Witten, D., Hastie, T. and Tibshirani, R., Springer, 2013, link: https://www.statlearning.com/.
Mathematics for Machine Learning, Deisenroth, M. P., Faisal. A. F., Ong, C. S., Cambridge University Press, 2020, link: https://mml-book.com.
An Introduction to Statistical Learning: with Applications in python, James, G., Witten, D., Hastie, T. and Tibshirani, R., Taylor, J., Springer, 2023, link: https://www.statlearning.com/.
Matrix Calculus (for Machine Learning and Beyond), Paige Bright, Alan Edelman, Steven G. Johnson, 2025, link: https://arxiv.org/abs/2501.14787.
Machine Learning Beyond Point Predictions: Uncertainty Quantification, Izibicki, R., 2025, link: https://rafaelizbicki.com/UQ4ML.pdf.
Mathematics of Machine Learning, Petersen, P. C., 2022, link: http://www.pc-petersen.eu/ML_Lecture.pdf.
Machine Learning - Prof. Jodavid Ferreira