Data and Models
UFPE
represent random variables (r.v.);
represent realizations of random variables;
represent random vectors;
represent realizations of random vectors;
represent random matrices;
represent realizations of random matrices;
\(p\): dimension of features, variables, parameters
\(n\): sample size
\(i\)-th observation, instance (e.g., \(i=1, ..., n\))
\(j\)-th feature, variable, parameter (e.g., \(j=1, ..., p\))
Machine learning combines data, models, and learning through optimization.
Three essential elements:
Core objective:
“What defines a good model?”
Performance on unseen data + objective metrics
The function \(f\) is unknown. It represents the IDEAL solution.
ML algorithms aim to find a function \(\hat{f} \approx f\).
The data are organized into a matrix \(\dot{\mathbf{X}} \in \mathbb{R}^{n \times p}\), where \(n\) is the number of observations (rows) and \(p\) is the number of features (columns).
Example:
| ID | Age | Salary |
|---|---|---|
| 1 | 25 | 3000 |
| 2 | 40 | 5000 |
| 3 | 60 | 7000 |
| 4 | 30 | 3500 |
| 5 | 40 | 8000 |
where each row is a vector \(\dot{\mathbf{x}}_i \in \mathbb{R}^p\), \(i = 1, \ldots, n\), referred to as an instance, observation, or even a sample.
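As a minimal sketch, the example table can be stored row by row. It assumes that ID is only an identifier and not a feature (the text does not say either way), so \(p = 2\) here. The names `ids`, `X`, `n`, `p`, and `x_3` are illustrative.

```python
# Example table stored row by row; ID is kept apart as an identifier (assumption).
ids = [1, 2, 3, 4, 5]
X = [
    [25.0, 3000.0],  # [Age, Salary]
    [40.0, 5000.0],
    [60.0, 7000.0],
    [30.0, 3500.0],
    [40.0, 8000.0],
]

n = len(X)     # sample size: number of rows (observations)
p = len(X[0])  # dimension: number of columns (features)

x_3 = X[2]     # third observation (i = 3; Python indexes from 0)
print(n, p, x_3)
```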
A model can be represented by a mathematical function \(f: \mathbb{R}^p \rightarrow \mathbb{R}\):
\[f(\dot{\mathbf{x}}) = \boldsymbol{\theta}^\top \dot{\mathbf{x}} + \theta_0\]
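where \(\boldsymbol{\theta} \in \mathbb{R}^p\) is the vector of parameters (weights) and \(\theta_0 \in \mathbb{R}\) is the intercept.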
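The linear function above can be evaluated directly. This is a minimal sketch; the function name `linear_model` and the parameter values are hypothetical and chosen only for illustration.

```python
def linear_model(x, theta, theta_0):
    """Evaluate f(x) = theta^T x + theta_0 for one observation x."""
    if len(x) != len(theta):
        raise ValueError("x and theta must have the same length p")
    return sum(t * v for t, v in zip(theta, x)) + theta_0

# Hypothetical parameter values, chosen only for illustration.
theta = [2.0, 0.5]
theta_0 = 1.0
y_hat = linear_model([3.0, 4.0], theta, theta_0)
print(y_hat)
```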
In probabilistic models, we associate a probability distribution with the data, thereby incorporating uncertainty through these distributions.
In the example below, featuring regression with Gaussian noise, it is assumed that the value \(y_i\) follows a normal distribution centered at \(\boldsymbol{\theta}^\top \dot{\mathbf{x}}_i\):
\[p(y_i|\dot{\mathbf{x}}_i, \boldsymbol{\theta}) = \mathcal{N}(y_i|\boldsymbol{\theta}^\top \dot{\mathbf{x}}_i, \sigma^2)\]
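To make the density above concrete, the sketch below evaluates its logarithm and sums it over a small dataset. Summing assumes the observations are independent given \(\boldsymbol{\theta}\), which the text does not state explicitly. The toy data and the names `gaussian_log_density` and `log_likelihood` are hypothetical.

```python
import math

def gaussian_log_density(y, mean, sigma2):
    """Log of N(y | mean, sigma2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (y - mean) ** 2 / (2.0 * sigma2)

def log_likelihood(X, y, theta, sigma2):
    """Sum of log p(y_i | x_i, theta), assuming independent observations."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        mean_i = sum(t * v for t, v in zip(theta, x_i))  # theta^T x_i
        total += gaussian_log_density(y_i, mean_i, sigma2)
    return total

# Hypothetical toy data, for illustration only.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
ll_good = log_likelihood(X, y, theta=[1.0, 2.0], sigma2=1.0)
ll_bad = log_likelihood(X, y, theta=[0.0, 0.0], sigma2=1.0)
print(ll_good, ll_bad)
```

Parameters whose predictions sit closer to the observed \(y_i\) receive a higher log-likelihood, which is the quantity a probabilistic fit would try to maximize.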
Key concepts include:

Supervised learning is the most prevalent form of machine learning. It is employed to predict an outcome based on a set of input variables.
It is termed “supervised” because the algorithm learns from a labeled dataset—that is, a dataset containing both inputs and their corresponding desired outputs.
Supervised learning is utilized across a wide range of applications, typically involving classification and regression algorithms.
When the target output is a qualitative (categorical) variable, with classes encoded as labels \((y \in \mathbb{N})\), the task is referred to as CLASSIFICATION.
An optimized model is obtained through training and learning based on observations \(\mathbf{x}_i\), for \(i = 1, 2, \ldots, n\), for which the corresponding desired response \(y_i\) is available.
Each observation \(\mathbf{x}_i\) may consist of \(p\) independent variables (features), \(\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ip})\), where \(p \geq 1\).
For example:
\[\mathbf{x}_{new} = [\text{Weather = 'Rainy', Temperature = 'Warm', Wind Speed = 'Weak'}] \rightarrow y_{new}= ?\]
When the desired output is a quantitative variable, belonging either to the set of integers \((y \in \mathbb{Z})\) or to the real numbers \((y \in \mathbb{R})\), the task is referred to as REGRESSION.
Examples:
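A minimal regression sketch, reusing the Age and Salary columns of the example table. Treating Salary as the target and Age as the single feature is an assumption made here for illustration, as is the choice of an ordinary least-squares fit. The function name `fit_simple_regression` is hypothetical.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = theta_0 + theta_1 * x for a single feature."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    theta_1 = s_xy / s_xx
    theta_0 = y_bar - theta_1 * x_bar
    return theta_0, theta_1

age = [25, 40, 60, 30, 40]
salary = [3000, 5000, 7000, 3500, 8000]
theta_0, theta_1 = fit_simple_regression(age, salary)
prediction = theta_0 + theta_1 * 35  # predicted salary at a hypothetical age of 35
print(theta_0, theta_1, prediction)
```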
Unsupervised learning is the branch of machine learning employed to derive inferences from unlabeled datasets.
It is termed “unsupervised” because the algorithm is not trained on labeled data. Instead, the algorithm identifies underlying patterns within the training data and is capable of making inferences about new data.
Unsupervised learning is utilized across a wide range of applications, primarily through clustering and dimensionality reduction algorithms.
Clustering is typically applied to a dataset where identifying the clusters is itself the primary objective, but new observations can also be assigned to the established groups based on their similarity to the existing cluster members.
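Assigning a new observation to an existing group can be sketched as a nearest-centroid rule. Using Euclidean distance to the cluster mean as the similarity measure is an assumption; the text does not fix one. The clusters and the names `centroid` and `assign_to_cluster` are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def assign_to_cluster(x_new, clusters):
    """Return the label of the cluster whose centroid is closest to x_new."""
    return min(clusters, key=lambda label: math.dist(x_new, centroid(clusters[label])))

# Hypothetical clusters, as if produced by a previous clustering step.
clusters = {
    "A": [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0]],
    "B": [[8.0, 8.0], [9.0, 9.5], [8.5, 9.0]],
}
label = assign_to_cluster([7.5, 8.0], clusters)
print(label)
```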
In reinforcement learning, instead of a static sample, the process involves a sequence of actions performed within an environment; the algorithm learns the optimal policy to maximize a cumulative reward.
For every action, feedback is provided, indicating whether the action was favorable or unfavorable.
For example: Winning a game of checkers (draughts) after a series of strategic moves.
Figure 1
Note that the process is iterative, meaning that revisiting previous stages is necessary to refine the model.
Dataset: Broadly refers to the collection of data utilized in machine learning. Each record is termed an observation, example, instance, or sample, and is composed of variables or features. Features represent the relevant attributes used to characterize the observations.
Training Set: The data subset used to train the model (learning phase). It comprises a set of observations along with their corresponding target outputs.
Test Set: The data subset used to evaluate the trained model. It simulates real-world scenarios where the model is applied to unseen data. It consists of a set of input observations.
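The split into training and test sets can be sketched as a shuffled partition of the observation indices. The 20% test fraction, the fixed seed, and the function name `train_test_split` are assumptions for illustration; the data reuse the Age and Salary columns of the example table, with Salary treated as the target. The held-out targets are returned as well, since they are needed later to score the predictions.

```python
import random

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the indices and split observations into training and test sets."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of observations")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(test_fraction * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

X = [[25], [40], [60], [30], [40]]   # Age, from the example table
y = [3000, 5000, 7000, 3500, 8000]   # Salary, treated as the target (assumption)
X_train, y_train, X_test, y_test = train_test_split(X, y, test_fraction=0.2)
print(len(X_train), len(X_test))
```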
If the data is good, there is no guarantee that the model will be good
If the data is not good, we can guarantee that the model will be bad
Cleaning \(\rightarrow\) Removal of duplicate data, outliers, missing data, avoiding inconsistencies and errors in data reading;
Dimensionality Reduction \(\rightarrow\) Avoiding dimensionality explosion by reducing the number of variables;
Normalization \(\rightarrow\) Rescaling the data to a common scale, so that variables with larger ranges do not dominate the model, which can improve model performance.
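Two of the steps above, cleaning and normalization, can be sketched as follows. The sketch assumes missing values are encoded as `None` and uses z-score standardization as the normalization; outlier handling and dimensionality reduction are left out. The raw data and the names `clean` and `standardize` are hypothetical.

```python
import statistics

def clean(rows):
    """Drop rows with missing values (None) and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row)
        if None in key or key in seen:
            continue
        seen.add(key)
        cleaned.append(list(row))
    return cleaned

def standardize(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical raw data: one duplicate row and one row with a missing value.
raw = [[25, 3000], [40, 5000], [40, 5000], [60, None], [30, 3500]]
clean_rows = clean(raw)
scaled = standardize(clean_rows)
print(clean_rows)
print(scaled)
```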
The process begins with data cleaning. Note the distinction between real data, raw data, and clean data, the last being the result of the cleaning process.
Generally, machine learning models do not perform well with missing, duplicate, or inconsistent data.
It is common to have collected data that can only be used after a preparation step, which may include:
The conversion stage occurs between preprocessing and feature selection; it entails transforming the data into a format suitable for the model.
Examples include:
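One possible conversion, sketched here as an assumption since the text does not name a specific technique, is one-hot encoding of a categorical feature. The weather values and the function name `one_hot_encode` are hypothetical.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 indicator vector.

    Categories are ordered alphabetically so the encoding is reproducible.
    """
    categories = sorted(set(values))
    position = {category: k for k, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        encoded.append(row)
    return categories, encoded

# Hypothetical weather column, for illustration only.
weather = ["Rainy", "Sunny", "Rainy", "Cloudy"]
categories, encoded = one_hot_encode(weather)
print(categories)
print(encoded)
```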

Datasets typically contain a large number of features. Some may be redundant or irrelevant to the target prediction and, therefore, can be excluded.
This stage is independent of the specific machine learning algorithm employed (algorithm-agnostic).
There are four primary motivations for implementing feature selection.
Three main categories of methods are highlighted:
Filter methods: this category encompasses techniques based on the correlation between features and the target variable.
Generally, these methods utilize statistical parameters to select the most relevant features by applying a threshold or establishing a feature ranking.
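A correlation-based ranking with a threshold can be sketched as follows. Pearson correlation is used as the statistic and 0.5 as the cutoff; both are assumptions, as are the made-up `noise` column and the function names. The `age` column and the target reuse the example table.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        return 0.0  # a constant column carries no linear information
    return s_xy / math.sqrt(s_xx * s_yy)

def filter_features(features, target, threshold):
    """Rank features by |correlation| with the target; keep those at or above threshold."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    selected = [name for name in ranking if scores[name] >= threshold]
    return ranking, selected

# 'age' comes from the example table; 'noise' is an arbitrary made-up column.
features = {
    "age": [25, 40, 60, 30, 40],
    "noise": [3, 1, 2, 1, 3],
}
target = [3000, 5000, 7000, 3500, 8000]
ranking, selected = filter_features(features, target, threshold=0.5)
print(ranking, selected)
```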
Common Methods:
Limitations:
Wrapper methods: this category encompasses techniques that use a machine learning model to evaluate feature importance.
Generally, these methods train the model on candidate sets of features, assess how much each feature contributes, and select the most relevant subset.
Common Methods:
Limitations:
Embedded methods: this category encompasses techniques in which feature selection is intrinsically incorporated into the model training process.
The most prevalent form of embedded feature selection is regularization.
Common Methods:
Limitations:
L1 Regularization (Lasso):
For example: weights of redundant features are naturally nullified…
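One way to see why weights become exactly zero: proximal and coordinate-descent solvers for L1-penalized least squares pass each weight update through the soft-thresholding operator, which maps any value with magnitude at most \(\lambda\) to zero. The sketch below shows only that operator, not a full Lasso solver; the weights and the value of `lam` are hypothetical.

```python
def soft_threshold(value, lam):
    """Soft-thresholding operator: shrink value toward zero by lam, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

# Hypothetical unpenalized weights; lam is the regularization strength (assumed value).
weights = [2.5, -0.3, 0.1, -1.8]
lam = 0.5
shrunk = [soft_threshold(w, lam) for w in weights]
print(shrunk)
```

Small weights are set to zero while larger ones are only shrunk, which is the behavior that removes features from the model.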
General overview of the model construction process (Steps 1 through 6). Note the significance of Step 2!
An example of Supervised Learning \(\rightarrow\) The Learning Phase
An example of Supervised Learning \(\rightarrow\) The Prediction Phase
What defines a “good” model?
What are the right questions to ask?
Model Validity - Definitions
Generalization Capability: The ability of the model to provide accurate predictions for unseen data.
Error: The discrepancy between the predicted value and the ground truth (actual value). There are two primary types:
Model Capacity: The model’s ability to fit the training data effectively while maintaining its ability to generalize to new data.
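The notion of error defined above can be made concrete with a summary metric. Mean squared error is used here as an assumption, since the text does not name a specific metric, and all values are hypothetical. Comparing the error on training data with the error on held-out data gives a rough view of generalization.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared discrepancy between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual values and predictions, for illustration only.
y_train, y_train_pred = [3.0, 5.0, 7.0], [3.1, 4.9, 7.2]
y_test, y_test_pred = [4.0, 6.0], [4.8, 5.1]

train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)
generalization_gap = test_error - train_error
print(train_error, test_error, generalization_gap)
```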
Preprocessing: Feature encoding and Normalization (where applicable)
Training: Conducted via optimization methods
Evaluation: Performed through performance metrics and cross-validation
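The cross-validation mentioned in the evaluation step can be sketched as k-fold index splitting: each fold serves once as validation data while the remaining folds are used for training. Contiguous, unshuffled folds and the function name `k_fold_indices` are assumptions made to keep the sketch short.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    fold_sizes = [n // k + (1 if fold < n % k else 0) for fold in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(n=5, k=3))
for train, validation in folds:
    print(train, validation)
```

In practice the model would be trained on each `train` index set and scored on the matching `validation` set, with the scores averaged across folds.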
Machine Learning - Prof. Jodavid Ferreira