PhD Thesis
Author: J.S. Sánchez
Supervisor: Filiberto Pla and F.J. Ferri
Abstract
Classification is a problem consisting of implicitly or explicitly
generating decision boundaries which can successfully distinguish the
distinct classes usually in a d-dimensional feature space. From this
general definition, classification in pattern recognition has traditionally
been tackled through two alternative approaches; namely, parametric and
non-parametric.
While the parametric classifiers assume a (known) functional
distribution of given samples, the non-parametric approaches do not assume
any particular distribution of the set of prototypes. Although the
parametric approaches have theoretically been shown to be potentially
capable of yielding optimal results, in practice, they often tend
to actually fail because of inappropriate assumptions of a priori
distributions.
Among non-parametric methods, those which are based on sample-to-sample
distances are specially remarkable and in particular, the family of
k-Nearest Neighbours (k-NN) techniques has been widely used in several
application domains. When applied to classification, these schemes require the
classes to be represented by appropriate sets of prototypes (namely, training
set) and then, the decision rule is generally reduced to label each given
sample with the class that contains most of its k nearest neighbours among
all prototypes in the training set. It is the conceptual simplicity of such
a classifier, along with its asymptotical (that is, when an infinite number
of prototypes is available in the training set) trend towards the optimal
Bayes rule in terms of minimum classification error, which makes the
k-NN approach particularly appealing in many practical situations.
However, these rules also present some drawbacks. First, when the number
of prototypes in the training set is not large enough, the k-NN rule is no longer
optimal. This problem becomes more relevant when having few prototypes compared to
the intrinsic dimensionality of the feature space, which is a very usual practical
situation. On the other hand, an increase in the training set size gives rise to a
large computational cost. Second, the set of prototypes may contain noisy or
mislabelled prototypes which usually lead to a decrease in performance.
Two different families of prototype selection methods have been proposed in the
literature as a way of minimizing the problems just mentioned. First,
condensing techniques aim at selecting a sufficiently small subset of prototypes
that leads to approximately the same performance than the 1-NN rule using the
whole set. And second, editing algorithms eliminate erroneous prototypes from the
original training set and ``cleans'' possible overlapping among classes, what usually
leads to significant improvements in performance.
This thesis contains a number of contributions within the general context of
neighbourhood-based classification. Thus, some alternative neighbourhood definitions
are firstly introduced. From these novel concepts, different non-parametric
classifiers are defined, trying to partially overcome the practical drawbacks pointed
out for the k-NN rule. At the same time, several editing as well as combined
editing-condensing techniques are also proposed. Further, a method for binary decision
tree induction, whose classification result is equivalent to the 1-NN rule, is
presented. An important benefit of the resulting tree-like classifier
refers to the fact that the computational cost is expected to be lower than
using a traditional 1-NN approach since it does not require to compute
distances from a sample to all prototypes in the training set. Finally, all of these
new techniques are empirically compared with respect to the most important
conventional methods.
Related to the definitions of neighbourhood proposed in the thesis, it
is worth pointing out that they are mainly based on the general idea that the
neighbours should be as close to a sample as possible, but also the neighbours
should lie as homogeneously distributed around that sample as possible. Although
the second condition is a consequence of the first in the asymptotic case,
in some practical situations the geometrical location of prototypes can become
much more important than the actual distances to appropriately characterize a
sample by its neighbourhood. As the classical neighbourhood concept takes into
account the first property only, these neighbours may not be placed symmetrically
around the sample if the neighbourhood in the training set is not
homogeneously distributed. Hence, in order to tackle the problem just described,
we propose to make use of some kind of neighbourhood, here so-called Surrounding
Neighbourhood (SN), in such a way that the neighbours of a sample will be
considered not only in terms of proximity but also in terms of their spatial
distribution with respect to that sample.
The first SN definition considered in the thesis comes from the Nearest Centroid
Neighbourhood (NCN) concept. Let p be a sample whose k neighbours should be
found in a training set. These k neighbours are such that (a) they are as near
to p as possible, and (b) their centroid is also as close to p as possible.
This definition gives rise to a neighbourhood in which both closeness and geometrical
distribution of neighbours are taken into account because of the centroid criterion. On the
other hand, proximity of the k nearest centroid neighbours to the sample is
guaranteed because of the incremental nature of the way in which they are obtained from the
first nearest neighbour.
The second SN definition utilized here is derived from two analogous geometrical
structures widely used in pattern recognition: the Gabriel Graph (GG) and the
Relative Neighbourhood Graph (RNG), which are two well-known examples of
proximity graphs. In this case, taking into account that two points are graph
neighbours if no other point lies inside a certain zone of influence between them
(that is, a hypersphere in the GG and a lune in the RNG), it is possible to
completely surround a given sample by means of its graph neighbourhood.
After introducing a number of distinct realizations for the SN concept, the
second part of the thesis concentrates on defining alternative non-parametric
classification schemes, which will be particularly adequate for small sample size
problems (few prototypes compared to the intrinsic dimensionality of the feature
space). Concretely, the k-NCN approach along with the Gabriel Neighbours and
the Relative Neighbours classifiers are proposed as a kind of majority rule. The
results obtained from the experiments carried out demonstrate a much better
behaviour in the solutions presented here than in the classical decision rules
used in the comparative study. Hence, it seems clear enough that the SN concept
is able to estimate the class membership of a given sample in a more accurate
way than the simple nearest neighbourhood.
As a second contribution, the thesis presents a set of learning algorithms,
that is, prototype selection techniques, in order to improve the performance
of the traditional ones. In this context, it must be basically distinguished
between editing and combined editing-condensing approaches. While the former comes
from the application of the SN-based classification schemes in order to
estimate and discard erroneously labelled prototypes, the editing-condensing
algorithms try to obtain a conveniently reduced set of prototypes in which
the classes are clustered. In fact, editing and condensing are two closely
related and complementary techniques, that is, condensing makes sense only
when there is no overlapping among different classes (which constitutes the
major aim of editing algorithms). An exhaustive empirical analysis shows
that the NCN- and RNG-based procedures give excellent results, while
the performance of the GG-based ones does not become better than that of the
conventional algorithms.
The last part of the thesis attends to increasing efficiency with NN
classifiers. So, in order to overcome the large computational cost of the NN rules, it
is introduced a method for oblique decision tree induction, whose
classification result is equivalent to the application of a conventional 1-NN
classifier. In a few words, the algorithm is based on a particular partition
of the feature space from the Voronoi diagram associated with the training
set. It is worth mentioning that only those hyperplanes corresponding to the
Voronoi boundaries (that is, hyperplanes separating two Voronoi cells which belong
to different classes) are used to define a set of convex regions with points from a
same class. The design procedure makes use of linear programming techniques to
expand the tree structure and finally, some pruning strategies are applied
to reduce the size of the resulting classifier. The equivalence property
mentioned above is guaranteed by the fact of using the hyperplanes of the
Voronoi boundaries to build the decision tree structure, as proven in
the thesis.