Programa de Pós-Graduação em Estatística
UFPE
Publication in the Annals of the Brazilian Academy of Sciences:
Machine learning classification based on \(k\)-Nearest Neighbors for PolSAR data
Authors: Ferreira, J.A., Rodrigues, A.K.G., Ospina, R. & Gomez, L.
Year: 2024 - DOI:10.1590/0001-3765202420230064
\(Z\) (uppercase, non-bold letters): random variables (r.v.);
\(z\) (lowercase, non-bold letters): realizations of random variables;
\(\mathbf{z}\) (bold lowercase letters): random vectors;
\(\dot{\mathbf{z}}\): realizations of random vectors;
\(\mathbf{Z}\) (bold uppercase letters): random matrices;
\(\dot{\mathbf{Z}}\): realizations of random matrices;
\(p\): dimension of the feature space (number of features, variables, or parameters);
\(n\): sample size;
\(i\): \(i\)-th observation or instance (e.g., \(i=1, \ldots, n\));
\(j\): \(j\)-th feature, variable, or parameter (e.g., \(j=1, \ldots, p\)).



Speckle in a SAR image. Source: (NASCIMENTO, 2012).
Speckle gives SAR/PolSAR data a multiplicative nature, so that SAR intensities (or the multidimensional intensities of a PolSAR image) do not follow the normal (Gaussian) law.
Methods that account for the effect of speckle have been proposed in several areas:
\[ \mathbf{S}=\left( \begin{array}{cc} S_{hh} & S_{hv} \\ S_{vh} & S_{vv} \end{array} \right) \label{matrizpolar} \]
\[ \mathbf{z}=\left( \begin{array}{c} S_{hh}\\ S_{hv}\\ S_{vv} \end{array} \right) \label{matrizpolarreduzida} \]

\(\mathbf{z}_i = (S_1^{(i)}\,S_2^{(i)}\,\cdots\,S_p^{(i)})^{\top} \, \in \mathbb{C}^{p}\) is the \(i\)-th vector of \(p\) polarization channels in a sample of \(L\) observations (looks) extracted from the same scene, for \(i=1,\ldots,L\).
\[ d(\mathbf{x}_1,\mathbf{x}_2) = \parallel \mathbf{x}_1 - \mathbf{x}_2 \parallel = \sqrt{\sum\limits_{j=1}^{p}(x_{1j} - x_{2j})^2}. \]
Consider \(c\) non-empty classes (partitions) \(C_1, C_2, \ldots, C_c\), such that \(C_t \subset \mathbb{R}^p\) is a subset of size \(n_t\) for \(t = 1, \ldots, c\), with \(n = \sum_{t=1}^{c} n_t\) and \(C_u\cap C_v = \emptyset\) for all \(u \neq v\).
Let \(\mathbf{x}_r \in \mathbb{R}^p\) be an observation to be classified, with \(r \notin \{1, \ldots, n\}\), and let \(\{\mathbf{x}_{i_1},\mathbf{x}_{i_2},\ldots,\mathbf{x}_{i_k} \}\) be the set of its \(k\) nearest neighbors, where \(k\leq n\) and \(\{{i_1},{i_2},\ldots,{i_k} \} \subset \{1,2,\ldots,n\}\). Then the \(k\)-NN classification rule is defined as:
\[\begin{equation} \widehat{c}_{_{kNN}} = \underset{c}{\mathrm{arg max}} \ \widehat{P}_k(c | \mathbf{x}_r), \end{equation}\]
where \(\widehat{P}_k(c | \mathbf{x}_r) = k^{-1}\sum_{\nu=1}^{k}\mathbb{I}_c(\mathbf{x}_{i_\nu})\) and
\[\begin{eqnarray} \mathbb{I}_c(\mathbf{x}_{i_\nu}) = \left\{ \begin{array}{rl} 1,&\text{if } \mathbf{x}_{i_\nu} \in C_c,\\ 0,& \text{if } \mathbf{x}_{i_\nu} \notin C_c. \end{array} \right. \nonumber \end{eqnarray}\]
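As an illustration of this majority-vote rule, a minimal from-scratch sketch in Python/NumPy (the function name, the toy data, and \(k=5\) are illustrative choices, not taken from the paper):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training
    points under the Euclidean distance, as in the k-NN rule above."""
    dists = np.linalg.norm(X_train - x_new, axis=1)        # d(x_new, x_i)
    nearest = np.argsort(dists)[:k]                        # indices i_1, ..., i_k
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                       # arg max_c P_k(c | x_new)

# toy example: two Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([2.5, 2.5]), k=5))        # should print 1
```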
The kernel \(k\)-NN classifier is a natural extension of the \(k\)-NN algorithm in which the training samples are mapped into a feature space through a nonlinear transformation.
Thus, consider a transformation from the \(p\)-dimensional feature space to an \(s\)-dimensional feature space (usually with \(p \leq s\)): given an arbitrary vector \(\mathbf{x} \in E_1\) (the original \(p\)-dimensional feature space), let \(\psi(\mathbf{x})\) be the corresponding vector in \(E_2\) (the new \(s\)-dimensional feature space). The transformation is obtained as follows:
\[\begin{eqnarray} \begin{array}{c} \mathbf{x} = (x_1,x_2,\ldots, x_p), \quad\quad \mathbf{x} \in E_1\\ \quad\quad\quad\quad \quad\quad\quad\quad \Bigg\downarrow ( \text{feature transformation})\\ \psi(\mathbf{x}) = \left(\phi_1(\mathbf{x}),\phi_2(\mathbf{x}), \ldots, \phi_s(\mathbf{x})\right), \quad\quad \psi(\mathbf{x}) \in E_2 \\ \end{array} \label{transformation_of_char} \end{eqnarray}\]
\[\begin{equation} K(\mathbf{x}_1,\mathbf{x}_2) = \langle \psi(\mathbf{x}_1),\psi(\mathbf{x}_2) \rangle, \end{equation}\]
where \(\langle \psi(\mathbf{x}_1),\psi(\mathbf{x}_2) \rangle\) is the inner product of \(\psi(\mathbf{x}_1)\) and \(\psi(\mathbf{x}_2)\).
\[\begin{equation} K(\mathbf{x}_1,\mathbf{x}_2) = \exp{\left\lbrace - \dfrac{\parallel \mathbf{x}_1-\mathbf{x}_2 \parallel^2}{2\sigma^2} \right\rbrace}, \label{radial} \end{equation}\]
where \(\sigma > 0\) is an adjustable hyperparameter.
\[\begin{equation} K(\mathbf{x}_1,\mathbf{x}_2) = (1 + \langle \mathbf{x}_1,\mathbf{x}_2 \rangle)^\delta; \end{equation}\]
\[\begin{equation} K(\mathbf{x}_1,\mathbf{x}_2) = \tanh(\alpha \langle \mathbf{x}_1,\mathbf{x}_2 \rangle + \beta), \end{equation}\]
in which \(\delta\), \(\alpha\) and \(\beta\) are hyperparameters.
\[\begin{eqnarray} d^2(\psi(\mathbf{x}_1),\psi(\mathbf{x}_2)) = \parallel \psi(\mathbf{x}_1) - \psi(\mathbf{x}_2) \parallel^2 = K(\mathbf{x}_1,\mathbf{x}_1) -2K(\mathbf{x}_1,\mathbf{x}_2) + K(\mathbf{x}_2,\mathbf{x}_2). \label{distnovoespcarac} \end{eqnarray}\]
Thus, the distance between neighbors in the new feature space is computed directly from a kernel function applied to the original data. The classification rule is the same as in \(k\)-NN.
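A minimal sketch of this kernelized distance and the resulting vote, assuming the radial (RBF) kernel above; the function names and \(\sigma\) are illustrative:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    """Gaussian (RBF) kernel K(x1, x2)."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def kernel_knn_predict(X_train, y_train, x_new, k=5, sigma=1.0):
    """k-NN in the kernel-induced feature space, using
    d^2(psi(x), psi(x_i)) = K(x, x) - 2 K(x, x_i) + K(x_i, x_i)."""
    d2 = np.array([rbf_kernel(x_new, x_new, sigma)
                   - 2 * rbf_kernel(x_new, xi, sigma)
                   + rbf_kernel(xi, xi, sigma) for xi in X_train])
    nearest = np.argsort(d2)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

Note that for the RBF kernel \(K(\mathbf{x},\mathbf{x})=1\), so the squared distance simplifies to \(2 - 2K(\mathbf{x}_1,\mathbf{x}_2)\), as used later for the kernel fuzzy \(k\)-NN.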
The fuzzy \(k\)-NN algorithm combines the classification rule of the \(k\)-NN algorithm with fuzzy set theory.
The algorithm computes membership degrees for the training data through a membership function: the membership of \(\mathbf{x}_i\) in each class is given as a function of the number of nearest neighbors of that class in the training sample.
The membership degree \(u_{h}\) is defined as:
\[\begin{eqnarray} u_{h}(\mathbf{x}_i) = \left\lbrace \begin{array}{rc} \alpha + (k_h/k)(1-\alpha), & \text{if } h=c \\ (k_h/k)(1-\alpha), & \text{if } h \neq c, \\ \end{array} \right. \label{calculopertinenclass} \end{eqnarray}\]
where \(c\) is the class to which \(\mathbf{x}_i\) belongs, \(k_h\) is the number of neighbors belonging to the \(h\)-th class, and \(k\) is the total number of neighbors.
\[\begin{eqnarray} \sum\limits_{h=1}^{c} u_{hi}= 1, \forall i \text{ and}\\ u_{hi} \in [0,1], \quad\quad 0 < \sum\limits_{i=1}^{n} u_{hi} < n. \end{eqnarray}\]
Let \(\mathbf{x}_r \in \mathbb{R}^p\) be an observation from the test set to be classified, and \(\{\mathbf{x}_{i_1},\mathbf{x}_{i_2},\ldots,\mathbf{x}_{i_k} \}\) the set of its \(k\) nearest neighbors, with \(k\leq n\) and \(\{{i_1},{i_2},\ldots,{i_k} \} \subset \{1,2,\ldots,n\}\).
The algorithm works similarly to \(k\)-NN, and the membership obtained for the test sample is as follows:
\[\begin{equation} u_h(\mathbf{x}_r) = \dfrac{\sum\limits_{i=1}^{k} u_{hi} \left( sim\lbrace \mathbf{x}_r, \mathbf{x}_i \rbrace\right)}{\sum\limits_{i=1}^{k}\left( sim\lbrace \mathbf{x}_r, \mathbf{x}_i \rbrace\right)}, \label{pertinenciay} \end{equation}\]
where \(sim\lbrace \mathbf{x}_r, \mathbf{x}_i \rbrace\) is a similarity function between \(\mathbf{x}_r\) and \(\mathbf{x}_i\), for example, \[ sim\lbrace \mathbf{x}_r, \mathbf{x}_i \rbrace = \frac{1}{\parallel \mathbf{x}_r-\mathbf{x}_i \parallel ^{\frac{2}{(m-1)}}}, \] with \(m > 1\) a fuzzification parameter.
\[\begin{equation} \widehat{c}_{_{FkNN}} = \underset{h \in \{1,\ldots, c\}}{\mathrm{arg max}} \ u_h(\mathbf{x}_r). \end{equation}\]
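A minimal sketch of the two fuzzy \(k\)-NN steps, the training memberships and the similarity-weighted vote; the helper names and the value \(\alpha = 0.51\) are illustrative assumptions, not settings from the paper:

```python
import numpy as np

def fuzzy_memberships(X_train, y_train, n_classes, k=5, alpha=0.51):
    """Training memberships u_h(x_i): the true class of x_i receives
    alpha + (k_h/k)(1 - alpha), the other classes (k_h/k)(1 - alpha),
    where k_h counts neighbors of x_i belonging to class h."""
    n = len(y_train)
    U = np.zeros((n, n_classes))
    for i in range(n):
        d = np.linalg.norm(X_train - X_train[i], axis=1)
        nearest = np.argsort(d)[1:k + 1]                  # exclude x_i itself
        counts = np.bincount(y_train[nearest], minlength=n_classes)
        U[i] = (counts / k) * (1 - alpha)
        U[i, y_train[i]] += alpha
    return U

def fuzzy_knn_predict(X_train, U, x_new, k=5, m=2.0):
    """Class memberships of x_new: similarity-weighted average of the
    memberships of its k nearest neighbors; predict the arg max."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] ** (2.0 / (m - 1)) + 1e-12)     # sim{x_new, x_i}
    u = (w[:, None] * U[nearest]).sum(axis=0) / w.sum()
    return int(np.argmax(u)), u
```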
We now combine the ideas of the previous algorithms, the kernel transformation and the fuzzy membership degrees, to obtain the kernel fuzzy \(k\)-NN algorithm.
Assuming \(m=2\) and a kernel with \(K(\mathbf{x},\mathbf{x})=1\) (e.g., the radial kernel), the kernel-induced distance reduces to
\[\begin{equation} d^{2}(\psi(\mathbf{x}_1),\psi(\mathbf{x}_2)) = 2 - 2K(\mathbf{x}_1,\mathbf{x}_2), \label{euclidquadrado2} \end{equation}\]
Then, combining this distance with the membership expression above, we obtain:
\[\begin{equation} u_h(\psi(\mathbf{x}_r)) = \dfrac{\sum\limits_{i=1}^{k} u_{hi} \left(\dfrac{1}{d^{2}(\psi(\mathbf{x}_r),\psi(\mathbf{x}_i))} \right)}{\sum\limits_{i=1}^{k}\left(\dfrac{1}{d^{2}(\psi(\mathbf{x}_r),\psi(\mathbf{x}_i))}\right)}, \label{pertinenciaykernel} \end{equation}\]
where \(u_{hi}\) are the membership degrees.
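Combining the two previous sketches (RBF kernel, \(m=2\), and the training memberships \(U\) from the fuzzy \(k\)-NN sketch), a minimal version of the kernel fuzzy \(k\)-NN prediction could look like this; names and \(\sigma\) are again illustrative:

```python
import numpy as np

def kernel_fuzzy_knn_predict(X_train, U, x_new, k=5, sigma=1.0):
    """Kernel fuzzy k-NN with m = 2 and an RBF kernel, so that
    d^2(psi(x), psi(x_i)) = 2 - 2 K(x, x_i)."""
    K = np.exp(-np.sum((X_train - x_new) ** 2, axis=1) / (2 * sigma ** 2))
    d2 = 2.0 - 2.0 * K                                    # kernel-induced squared distance
    nearest = np.argsort(d2)[:k]
    w = 1.0 / (d2[nearest] + 1e-12)                       # 1 / d^2 weights (m = 2)
    u = (w[:, None] * U[nearest]).sum(axis=0) / w.sum()   # memberships u_h(psi(x_new))
    return int(np.argmax(u)), u
```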

For simplicity, assume that the input data points are \(\mathbf{x}_i \in \mathbb{R}^p\) and \(y_i \in \{ -1, 1 \}\) are the class labels, with \(i = 1, \ldots, n\).
Intuitively, based on the training set, the SVM estimates a hyperplane that serves as a decision boundary:
\[ \langle \mathbf{w},\mathbf{x}_1 \rangle + b = w_1 x_{11} + w_2 x_{12} + \cdots + w_p x_{1p} + b = 0, \]
where the weight vector \(\mathbf{w} \in \mathbb{R}^p\) and the bias \(b \in \mathbb{R}\).
\[ \hat{y}_r = \mathrm{sign}(\langle \mathbf{w},\mathbf{x}_r \rangle + b). \]
\[\begin{align} \label{EQ:SVM1} \min_{ \mathbf{w} \in \mathbb{R}^p,\, b \in \mathbb{R},\, \boldsymbol{\xi} \geq 0 } f_{sv} = \dfrac{1}{2} \| \mathbf{w}\|_2^2 + C \mathbf{1}^\top \boldsymbol{\xi}, \end{align}\]
subject to the margin constraints \(y_i (\langle \mathbf{w}, \psi(\mathbf{x}_i)\rangle + b) \geq 1 - \xi_i\), \(\forall i\), where \(\boldsymbol{\xi}\) is the vector of slack variables and \(\mathbf{1}\) is a vector of ones. The Lagrangian dual is readily derived:
\[\begin{align} \max_{ \boldsymbol{\alpha} } \mathbf{1}^\top \boldsymbol{\alpha} - \dfrac{1}{2} \boldsymbol{\alpha}^\top \left( K \circ \mathbf{y}\mathbf{y}^\top \right) \boldsymbol{\alpha}, \, \text{s.t.}\,0 \leq \boldsymbol{\alpha} \leq C, \mathbf{y}^\top \boldsymbol{\alpha} = 0, \end{align}\]
here \(C\) is the trade-off parameter; \(K\) is the kernel matrix with \(K_{lh} = K(\mathbf{x}_l,\mathbf{x}_h)\); and \(\circ\) denotes element-wise matrix multiplication (Hadamard product).
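As a hedged illustration of fitting a kernel SVM (the paper does not prescribe a particular library; scikit-learn is assumed here, and the data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# toy two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([-1] * 50 + [1] * 50)

# C is the trade-off parameter of the primal/dual problems above
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict(X[:5]))        # sign(<w, psi(x)> + b) for each x
```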
\[\begin{equation} \widehat{y}_i = \sum^B_{b=1} f_b(\mathbf{x}_i), \ \ f_b\in \mathcal{F}, \end{equation}\]
where \(B\) is the total number of trees, \(f_b\) is the function associated with the \(b\)-th tree in the functional space \(\mathcal{F}\), and \(\mathcal{F}\) is the set of all possible regression trees (also known as CART).
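The expression above describes an additive ensemble of \(B\) regression trees. As a hedged sketch (the exact tree implementation used in the paper is not reproduced here), such an ensemble can be fit with scikit-learn's gradient boosting, where `n_estimators` plays the role of \(B\):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

# an additive ensemble of B = 100 shallow CART trees
model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))
```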
The naive Bayes (NB) classifier estimates class-conditional probabilities of the feature values from the training samples. Formally, assume a dataset \(\mathcal{D} = \{(\mathbf{x}_i, y_i),\ i=1,\ldots,n\}\), with \(\mathbf{x}_i\in \mathbb{R}^p\) and target variable \(y_i \in C\), where \(C\) is the set of class labels.
NB predicts the class label of an instance \(\mathbf{x}_r\) as follows:
\[\begin{equation} \label{key1} \hat{c}_{NB}=\underset{c}{\mathrm{arg \, max}} P(c) \prod_{j=1}^p P(x_{rj}|c), \end{equation}\]
where \(P(c)\) is the prior probability of class \(c\), and \(P(x_{rj} | c)\) is the conditional probability of the attribute value \(x_{rj}\) given the class \(c\).
\[\begin{equation}\label{key2} P(c)=\frac{\sum_{i=1}^n {\mathbb{I}(c_i,c)}} {n} \end{equation}\]
\[\begin{equation}\label{key3} P(x_{ij}|c)=\frac{\sum_{i=1}^n {\mathbb{I}(c_i,c)\mathbb{I}(x_{ij},A_{j})}}{\sum_{i=1}^n {\mathbb{I}(c_i,c)}}, \end{equation}\]
where \(A_j\) represents all the values of the \(j\)-th attribute in training instances, \(c_i\) denotes the correct class label for the \(i\)-th instance, and \(\mathbb{I}(a,b)\) is an indicator function, which takes the value 1 if \(a\) and \(b\) are identical and 0 otherwise.
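A from-scratch sketch of these two frequency estimates for discrete (integer-coded) attributes; the helper names are illustrative, no smoothing is applied, and for continuous PolSAR intensities a Gaussian class-conditional model would typically replace the frequency counts:

```python
import numpy as np

def nb_fit(X, y):
    """Empirical priors P(c) and conditionals P(x_j = a | c),
    as in the two estimates above."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {}
    for c in classes:
        Xc = X[y == c]
        cond[c] = [{a: np.mean(Xc[:, j] == a) for a in np.unique(X[:, j])}
                   for j in range(X.shape[1])]
    return priors, cond

def nb_predict(priors, cond, x_new):
    """arg max_c  P(c) * prod_j P(x_j | c)."""
    scores = {c: priors[c] * np.prod([cond[c][j].get(x_new[j], 0.0)
                                      for j in range(len(x_new))])
              for c in priors}
    return max(scores, key=scores.get)
```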
Under the fully developed speckle model, the scattering vector \(\mathbf{z}\) follows the zero-mean multivariate complex Gaussian distribution with density \[f_{\mathbf{z}}(\dot{\mathbf{z}};\mathbf{\Sigma})= \dfrac{1}{\pi^{p}|\mathbf{\Sigma}|}\exp \left(-\dot{\mathbf{z}}^{*} \mathbf{\Sigma}^{-1} \dot{\mathbf{z}}\right), \]
where \(|\cdot|\) is the determinant of a matrix or absolute value of a scalar, \((\cdot)^{*}\) is the conjugate transpose of a vector, and \(\mathbf{\Sigma}\) is the positive definite Hermitian covariance matrix that contains all necessary information to characterize the polarimetric return of the data under analysis.
The multilook covariance matrix \(\mathbf{Z} = \frac{1}{L}\sum_{i=1}^{L} \mathbf{z}_i\mathbf{z}_i^{*}\) follows the scaled complex Wishart distribution with density \[ f_{\mathbf{Z}}(\dot{\mathbf{Z}};{\mathbf{\Sigma}},L) = \dfrac{L^{pL}|\dot{\mathbf{Z}}|^{L-p}}{|\mathbf{\Sigma}|^{L}\Gamma_p(L)}\exp\left[-L\text{tr}(\mathbf{{\Sigma}}^{-1}\dot{\mathbf{Z}})\right], \]
where \(\Gamma_p(L) = \pi^{p(p-1)/2}\prod_{i=0}^{p-1}\Gamma(L-i)\), with \(L \geq p\), is the multivariate gamma function, \(\Gamma(\cdot)\) is the gamma function, and \(\text{tr}(\cdot)\) is the trace operator.
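A sketch of how the multilook covariance matrix and this log-density could be evaluated with NumPy/SciPy; the function names are illustrative, and the covariance estimator is the standard average of \(\mathbf{z}_i\mathbf{z}_i^{*}\) over the \(L\) looks:

```python
import numpy as np
from scipy.special import gammaln

def multilook_covariance(z):
    """z: (L, p) complex array of scattering vectors; returns Z = (1/L) sum_i z_i z_i^*."""
    return z.T @ z.conj() / z.shape[0]

def logdet_hpd(A):
    """log|A| for a Hermitian positive definite matrix (a real number)."""
    _, logdet = np.linalg.slogdet(A)
    return logdet

def scaled_wishart_logpdf(Z, Sigma, L):
    """Log of the scaled complex Wishart density shown above."""
    p = Z.shape[0]
    log_gamma_p = 0.5 * p * (p - 1) * np.log(np.pi) \
        + sum(gammaln(L - i) for i in range(p))           # log Gamma_p(L)
    return (p * L * np.log(L)
            + (L - p) * logdet_hpd(Z)
            - L * logdet_hpd(Sigma)
            - log_gamma_p
            - L * np.real(np.trace(np.linalg.solve(Sigma, Z))))
```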
Let \(\mathbf{X}\) and \(\mathbf{Y}\) be two random matrices defined in the common support \(\mathcal{Z}\) with densities \(f_{\mathbf{X}}(\dot{\mathbf{Z}};\boldsymbol{\theta}_1)\) and \(f_{\mathbf{Y}}(\dot{\mathbf{Z}};\boldsymbol{\theta}_2)\), respectively, where \(\boldsymbol{\theta}_1\) and \(\boldsymbol{\theta}_2\) are parameter vectors.
Then, a function is considered a stochastic distance, say \(d(\cdot,\cdot)\), between two random matrices if the following properties are satisfied:
\(d(\dot{\mathbf{X}},\dot{\mathbf{Y}}) \geq 0\) (non-negativity);
\(d(\dot{\mathbf{X}},\dot{\mathbf{Y}}) = 0 \Leftrightarrow \dot{\mathbf{X}} = \dot{\mathbf{Y}}\) (identity);
\(d(\dot{\mathbf{X}},\dot{\mathbf{Y}}) = d(\dot{\mathbf{Y}},\dot{\mathbf{X}})\) (symmetry);
\[\begin{align} d_\text{KL}(\mathbf{X},\mathbf{Y})=&\frac{L_1-L_2}{2}\bigg\{\log\frac{|\boldsymbol{\Sigma}_1|}{|\boldsymbol{\Sigma}_2|}-p\log\frac{L_1}{L_2} +\psi_p^{(0)}(L_1)-\psi_p^{(0)}(L_2)\bigg\} \nonumber \\ &-\frac{p(L_1+L_2)}{2} + \frac{\operatorname{tr}(L_2\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1+L_1\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_2)}{2}, \label{expreKL} \end{align}\]
where \(\psi_p^{(0)}(\cdot)\) denotes the zeroth-order multivariate polygamma (digamma) function.
When \(L_1 = L_2 = L\), this expression reduces to: \[\begin{equation} d_{\text{KL}}(\boldsymbol{X},\boldsymbol{Y}) = L\bigg\{ \frac{\operatorname{tr}(\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_1+\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_2)}{2} - p\bigg\}. \end{equation}\]
\[ \hat{c}_{KL} = \underset{c}{\mathrm{arg min}} \ d_{\text{KL}}(\mathbf{X},\mathbf{Y}), \] that is, the observation is assigned to the class \(c\) whose reference sample minimizes the KL distance.
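A sketch of this minimum-distance rule using the reduced expression for equal numbers of looks; the per-class reference covariance matrices are assumed to be estimated beforehand (e.g., from training regions), and the function names are illustrative:

```python
import numpy as np

def kl_distance(Sigma1, Sigma2, L, p):
    """KL distance between two scaled Wishart models with equal looks L
    (the reduced expression above)."""
    t = np.real(np.trace(np.linalg.solve(Sigma2, Sigma1)
                         + np.linalg.solve(Sigma1, Sigma2)))
    return L * (t / 2.0 - p)

def kl_classify(Sigma_obs, class_Sigmas, L):
    """Assign the observed covariance matrix to the class whose
    reference covariance minimizes the KL distance (arg min rule)."""
    p = Sigma_obs.shape[0]
    d = [kl_distance(Sigma_obs, S, L, p) for S in class_Sigmas]
    return int(np.argmin(d)), d
```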
In this section, we describe the experiments carried out to validate our analysis of the selected classification methods.
PolSAR data (\(L = 4\) looks) of size \(1600 \times 1200\) pixels, acquired by the C-band RADARSAT-2 PolSAR system over the San Francisco area.
Pauli-basis image of the RADARSAT-2 San Francisco data (left) and ground truth (right).
Confusion Matrix:
\[ M = \begin{bmatrix} n_{11} & \dots & n_{1c}\\ \vdots & \ddots & \vdots\\ n_{c1} & \dots& n_{cc} \end{bmatrix} \] where \(n_{uv}\) is the number of observations of class \(u\) assigned to class \(v\), and \(N = \sum_{u,v} n_{uv}\) is the total number of classified observations.
\[\begin{equation} AC={\sum^{c}_{i=1}n_{ii}}/N, \end{equation}\]
\[\begin{equation} \kappa={(AC-P_{c})}/{(1-P_{c})}, \end{equation}\] where \(P_c= \sum_{i=1}^{c} (n_{i.}n_{.i})/N^2\). The ideal value for both indices is 1 (\(100\%\) in percentages).
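A short sketch computing both indices from a confusion matrix; the example matrix is illustrative, not a result from the paper:

```python
import numpy as np

def accuracy_and_kappa(M):
    """Overall accuracy and Cohen's kappa from a confusion matrix M."""
    M = np.asarray(M, dtype=float)
    N = M.sum()
    ac = np.trace(M) / N
    p_c = (M.sum(axis=1) * M.sum(axis=0)).sum() / N ** 2   # chance agreement P_c
    kappa = (ac - p_c) / (1 - p_c)
    return ac, kappa

# illustrative 2-class confusion matrix
print(accuracy_and_kappa([[90, 10], [5, 95]]))
```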












In this study we used eight classification algorithms: four based on \(k\)-NN, one based on naive Bayes, one using the Support Vector Machine (SVM), one based on randomized decision trees, and one based on a stochastic distance. The latter naturally exploits the stochastic nature of PolSAR data, that is, it is well suited to this kind of remote sensing data.
The algorithms were used to classify the PolSAR data intensities of the regions of a well-known PolSAR dataset: the San Francisco Bay region.
All classifiers for the San Francisco (USA) image achieved a kappa coefficient between 80% and 93%. It is true that modern deep learning techniques can reach classification results with kappa \(\approx\) 1. However, as mentioned above, deep learning techniques require extensive training and their results are difficult to reproduce.
ANFINSEN SN, DOULGERIS AP & ELTOFT T. 2009. Estimation of the equivalent number of looks in polarimetric synthetic aperture radar imagery. IEEE Trans Geosci Remote Sens 47(11): 3795-3809.
BISHOP CM. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). 1st ed. Springer-Verlag, 738 p.
CINTRA RJ, FRERY AC & NASCIMENTO AD. 2013. Parametric and nonparametric tests for speckled imagery. Pattern Anal Appl 16(2): 141-161.
FERREIRA JA, COÊLHO H & NASCIMENTO AD. 2021. A family of divergence-based classifiers for Polarimetric Synthetic Aperture Radar (PolSAR) imagery vector and matrix features. Int J Remote Sens 42(4): 1201-1229.
FRERY AC, NASCIMENTO ADC & CINTRA RJ. 2011. Information Theory and Image Understanding: An Application to Polarimetric SAR Imagery. Chil J Stat 2(2): 81-100.
LEE JS & POTTIER E. 2009. Polarimetric Radar Imaging From Basics to Applications. CRC Press.
Paper link: https://doi.org/10.1590/0001-3765202420230064
THANK YOU!
Slides produced with Quarto
Lattes: http://lattes.cnpq.br/4617170601890026
LinkedIn: jodavidferreira
Personal website: https://jodavid.github.io/
e-mail: jodavid.ferreira@ufpe.br

Machine learning classification based on \(k\)-Nearest Neighbors for PolSAR data - Jodavid Ferreira