Aprendizaje y clasificación basados en criterios de vecindad. Métodos alternativos y análisis comparativo.
Neighbourhood-based learning and classification: alternative methods and comparative analysis.

PhD Thesis

Author: J.S. Sánchez
Supervisor: Filiberto Pla and F.J. Ferri


Abstract

Classification is a problem consisting of implicitly or explicitly generating decision boundaries which can successfully distinguish the distinct classes usually in a d-dimensional feature space. From this general definition, classification in pattern recognition has traditionally been tackled through two alternative approaches; namely, parametric and non-parametric.
While the parametric classifiers assume a (known) functional distribution of given samples, the non-parametric approaches do not assume any particular distribution of the set of prototypes. Although the parametric approaches have theoretically been shown to be potentially capable of yielding optimal results, in practice, they often tend to actually fail because of inappropriate assumptions of a priori distributions.
Among non-parametric methods, those which are based on sample-to-sample distances are specially remarkable and in particular, the family of k-Nearest Neighbours (k-NN) techniques has been widely used in several application domains. When applied to classification, these schemes require the classes to be represented by appropriate sets of prototypes (namely, training set) and then, the decision rule is generally reduced to label each given sample with the class that contains most of its k nearest neighbours among all prototypes in the training set. It is the conceptual simplicity of such a classifier, along with its asymptotical (that is, when an infinite number of prototypes is available in the training set) trend towards the optimal Bayes rule in terms of minimum classification error, which makes the k-NN approach particularly appealing in many practical situations.
However, these rules also present some drawbacks. First, when the number of prototypes in the training set is not large enough, the k-NN rule is no longer optimal. This problem becomes more relevant when having few prototypes compared to the intrinsic dimensionality of the feature space, which is a very usual practical situation. On the other hand, an increase in the training set size gives rise to a large computational cost. Second, the set of prototypes may contain noisy or mislabelled prototypes which usually lead to a decrease in performance. Two different families of prototype selection methods have been proposed in the literature as a way of minimizing the problems just mentioned. First, condensing techniques aim at selecting a sufficiently small subset of prototypes that leads to approximately the same performance than the 1-NN rule using the whole set. And second, editing algorithms eliminate erroneous prototypes from the original training set and ``cleans'' possible overlapping among classes, what usually leads to significant improvements in performance.
This thesis contains a number of contributions within the general context of neighbourhood-based classification. Thus, some alternative neighbourhood definitions are firstly introduced. From these novel concepts, different non-parametric classifiers are defined, trying to partially overcome the practical drawbacks pointed out for the k-NN rule. At the same time, several editing as well as combined editing-condensing techniques are also proposed. Further, a method for binary decision tree induction, whose classification result is equivalent to the 1-NN rule, is presented. An important benefit of the resulting tree-like classifier refers to the fact that the computational cost is expected to be lower than using a traditional 1-NN approach since it does not require to compute distances from a sample to all prototypes in the training set. Finally, all of these new techniques are empirically compared with respect to the most important conventional methods.
Related to the definitions of neighbourhood proposed in the thesis, it is worth pointing out that they are mainly based on the general idea that the neighbours should be as close to a sample as possible, but also the neighbours should lie as homogeneously distributed around that sample as possible. Although the second condition is a consequence of the first in the asymptotic case, in some practical situations the geometrical location of prototypes can become much more important than the actual distances to appropriately characterize a sample by its neighbourhood. As the classical neighbourhood concept takes into account the first property only, these neighbours may not be placed symmetrically around the sample if the neighbourhood in the training set is not homogeneously distributed. Hence, in order to tackle the problem just described, we propose to make use of some kind of neighbourhood, here so-called Surrounding Neighbourhood (SN), in such a way that the neighbours of a sample will be considered not only in terms of proximity but also in terms of their spatial distribution with respect to that sample.
The first SN definition considered in the thesis comes from the Nearest Centroid Neighbourhood (NCN) concept. Let p be a sample whose k neighbours should be found in a training set. These k neighbours are such that (a) they are as near to p as possible, and (b) their centroid is also as close to p as possible. This definition gives rise to a neighbourhood in which both closeness and geometrical distribution of neighbours are taken into account because of the centroid criterion. On the other hand, proximity of the k nearest centroid neighbours to the sample is guaranteed because of the incremental nature of the way in which they are obtained from the first nearest neighbour.
The second SN definition utilized here is derived from two analogous geometrical structures widely used in pattern recognition: the Gabriel Graph (GG) and the Relative Neighbourhood Graph (RNG), which are two well-known examples of proximity graphs. In this case, taking into account that two points are graph neighbours if no other point lies inside a certain zone of influence between them (that is, a hypersphere in the GG and a lune in the RNG), it is possible to completely surround a given sample by means of its graph neighbourhood. After introducing a number of distinct realizations for the SN concept, the second part of the thesis concentrates on defining alternative non-parametric classification schemes, which will be particularly adequate for small sample size problems (few prototypes compared to the intrinsic dimensionality of the feature space). Concretely, the k-NCN approach along with the Gabriel Neighbours and the Relative Neighbours classifiers are proposed as a kind of majority rule. The results obtained from the experiments carried out demonstrate a much better behaviour in the solutions presented here than in the classical decision rules used in the comparative study. Hence, it seems clear enough that the SN concept is able to estimate the class membership of a given sample in a more accurate way than the simple nearest neighbourhood.
As a second contribution, the thesis presents a set of learning algorithms, that is, prototype selection techniques, in order to improve the performance of the traditional ones. In this context, it must be basically distinguished between editing and combined editing-condensing approaches. While the former comes from the application of the SN-based classification schemes in order to estimate and discard erroneously labelled prototypes, the editing-condensing algorithms try to obtain a conveniently reduced set of prototypes in which the classes are clustered. In fact, editing and condensing are two closely related and complementary techniques, that is, condensing makes sense only when there is no overlapping among different classes (which constitutes the major aim of editing algorithms). An exhaustive empirical analysis shows that the NCN- and RNG-based procedures give excellent results, while the performance of the GG-based ones does not become better than that of the conventional algorithms.
The last part of the thesis attends to increasing efficiency with NN classifiers. So, in order to overcome the large computational cost of the NN rules, it is introduced a method for oblique decision tree induction, whose classification result is equivalent to the application of a conventional 1-NN classifier. In a few words, the algorithm is based on a particular partition of the feature space from the Voronoi diagram associated with the training set. It is worth mentioning that only those hyperplanes corresponding to the Voronoi boundaries (that is, hyperplanes separating two Voronoi cells which belong to different classes) are used to define a set of convex regions with points from a same class. The design procedure makes use of linear programming techniques to expand the tree structure and finally, some pruning strategies are applied to reduce the size of the resulting classifier. The equivalence property mentioned above is guaranteed by the fact of using the hyperplanes of the Voronoi boundaries to build the decision tree structure, as proven in the thesis.