[Journal article][Full-length article]


A distributional framework for evaluation, comparison and uncertainty quantification in soft clustering

Authors:
Andrea Campagner; Davide Ciucci; Thierry Denœux

Year: 2023

Article number: 109008
Publisher: Elsevier BV


Abstract:

In this article, we propose and study a general framework for comparing and evaluating soft clusterings, viewed as a form of uncertainty quantification for clustering tasks, with the aim of studying the uncertainty that arises from such a comparison. Our proposal is based on the interpretation of a soft clustering as describing uncertain information, represented in the form of a mass function, about an unknown, target hard clustering. We present a general construction to this purpose, called distributional measures, whereby any evaluation measure can be naturally extended to soft clustering. We study the theoretical properties of the proposed approach, both in terms of computational complexity and of metric properties. Furthermore, we study its relationship with other existing proposals, providing, in particular, necessary and sufficient conditions for the equivalence between distributional measures and the recently proposed framework of transport-theoretic measures. We also propose sampling-based approximation algorithms with convergence guarantees, making it possible to apply the proposed method to real-world datasets. Finally, we apply the distributional measures in a computational experiment in order to illustrate their usefulness.

Introduction

Clustering [36] refers to the act of (as well as to the algorithms for) partitioning a set of objects into groups, called clusters, supposed to encode some unknown classification defined according to a given property. Depending on the formalism adopted to represent this partitioning of the data, one can distinguish two main families of clusterings (and, consequently, of clustering algorithms): hard clustering and soft clustering. In the case of hard clustering, the assignment of objects to clusters is crisp: each object is assigned to one and only one cluster.
In the case of soft clustering [32], [22], by contrast, the assignment of objects to clusters is affected by uncertainty, which is explicitly represented in the partitioning through some uncertainty quantification formalism. Popular approaches in this category include: probabilistic and fuzzy clustering [35], in which the assignment is represented through a fuzzy partition; three-way and rough clustering [31], [45], [7], in which the assignment is represented through a rough partition; possibilistic clustering [26], in which the assignment is represented through a possibilistic partition; and evidential clustering [19], [17], [15], in which the assignment is represented through an evidential partition, the most general formalism among those mentioned. When clustering is applied to real data, with the aim of discovering interesting relationships among a given set of instances, one of the most critical steps is clustering evaluation [44], that is, the evaluation of the obtained results. In this context, internal validation criteria are measures that evaluate the goodness of a given clustering based only on characteristics of the clustering itself, such as its intra-cluster homogeneity or inter-cluster separateness, while external validation criteria are evaluation approaches and metrics that compare two different clusterings of the data at hand. While internal validation criteria can be important as goodness-of-fit measures, only external validation criteria can be used to objectively assess the quality of a given clustering [34], whenever the reference clustering it is compared with is interpreted as the ground truth. For this reason, external validation is of fundamental importance in the evaluation of newly developed clustering algorithms: we will focus solely on this family of measures, which we simply refer to as validation measures.
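As a rough illustration of how the four soft partition formalisms listed above differ in representation, the following sketch encodes each of them for four instances and two clusters; all membership values are invented for illustration and do not come from the paper:

```python
import numpy as np

# Hard partition: each of 4 instances belongs to exactly one of 2 clusters.
hard = np.array([0, 0, 1, 1])

# Fuzzy/probabilistic partition: membership degrees, each row sums to 1.
fuzzy = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
    [0.0, 1.0],
])
assert np.allclose(fuzzy.sum(axis=1), 1.0)

# Possibilistic partition: degrees in [0, 1], with no row-sum constraint.
possibilistic = np.array([
    [1.0, 0.3],
    [0.7, 0.7],
    [0.2, 1.0],
    [0.1, 0.9],
])

# Evidential partition: per-instance mass function over *subsets* of clusters;
# mass assigned to {0, 1} expresses ignorance between the two clusters, which
# is why this representation strictly generalizes the previous ones.
evidential = [
    {frozenset({0}): 0.8, frozenset({0, 1}): 0.2},
    {frozenset({0}): 0.5, frozenset({1}): 0.3, frozenset({0, 1}): 0.2},
    {frozenset({1}): 1.0},
    {frozenset({1}): 0.9, frozenset({0, 1}): 0.1},
]
for m in evidential:
    assert abs(sum(m.values()) - 1.0) < 1e-9
```

Note how the hard and fuzzy partitions are recovered as special cases of the evidential one when all mass is placed on singleton sets.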
While several external validation measures have been proposed in the context of hard clustering (including, among others, the Rand index [33], the mutual information [43], and the partition distance [11]), how to properly evaluate the results of a clustering analysis is much less clear in the case of soft clustering methods. Indeed, the interplay between the uncertainty represented in the object-cluster assignments and the errors in the given partitioning (in comparison with the given ground truth partitioning of the data) makes the evaluation of such soft clustering algorithms particularly difficult [9]. For this reason, the development of evaluation measures for soft clustering has largely focused on the extension of common measures [1], notably the Rand index [10], [18], [23], [24], while a general approach to extending other comparison measures to soft clustering has so far been lacking. As a way to bridge this gap, in a series of recent articles [8], [9] we proposed some principles for developing new evaluation metrics for soft clustering, based on the intuition that any given soft clustering can be represented as a distribution (in particular, a mass function) that encodes the belief about an underlying, but unknown, true hard clustering. Indeed, an evidential clustering (the most general form of soft clustering among the ones considered in this article) associates with each instance a mass function over the clusters, which represents the uncertainty about the assignment of that specific instance.
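As a toy illustration of this intuition (a simplified sketch in the probabilistic special case, not the paper's algorithm), the pair-counting Rand index, defined only for hard clusterings, can be "lifted" to a fuzzy partition by sampling compatible hard clusterings and looking at the resulting distribution of index values; the membership matrix `U` and all numbers are invented:

```python
import numpy as np

def rand_index(a, b):
    """Fraction of instance pairs on which two hard clusterings agree
    (same-cluster in both, or different-cluster in both)."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# The index depends only on co-assignment of pairs, hence is label-invariant.
assert rand_index([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0

# A fuzzy partition read as a distribution over hard clusterings: sample a
# hard assignment per instance from its membership row, score each sampled
# clustering against a reference hard clustering, and summarize the
# empirical distribution of Rand-index values.
rng = np.random.default_rng(0)
U = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])
reference = [0, 0, 1, 1]

scores = [
    rand_index([rng.choice(2, p=row) for row in U], reference)
    for _ in range(2000)
]
low, high = np.quantile(scores, [0.05, 0.95])  # empirical 90% range
```

This treats the memberships as independent sampling distributions, which is only one possible reading; the distributional measures studied in the paper handle the general evidential case.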
Obviously, such instance-wise uncertainty quantification can be transformed into partition-wise uncertainty quantification by simply transferring the mass functions for the single instances to a global mass function over hard clusterings: this latter mass function is considered as a representation of the uncertainty (determined by the evidential clustering algorithm that generated the evidential clustering at hand) about a set of compatible hard clusterings of the data, one of which is assumed to be optimal (i.e., the hard clustering, among the ones compatible with the given soft clustering, that is as similar as possible to the true, unknown clustering of the data). In particular, we noted that one can distinguish two different purposes in such an evaluation. First, to objectively evaluate the quality of a soft clustering with respect to a known ground truth partitioning: since any given soft clustering, as mentioned above, can be represented as a distribution of hard clusterings, this amounts to finding bounds on the quality, with respect to the given ground truth, of the hard clusterings compatible with the given soft clustering. Second, to compare two different soft clusterings and to quantify the uncertainty arising from their comparison: that is, to compare all the hard clusterings compatible with the given soft clusterings, and to propagate the degree of uncertainty represented by the two soft clusterings to the quality comparison. Based on these two principles, we proposed a general mathematical framework [9] aimed at addressing the former need (i.e., evaluating the result of a soft clustering algorithm in comparison with a given ground truth). This approach, grounded in optimal transport theory and a novel representation of soft clustering in distributional terms, allows us to extend any validation measure from the hard clustering setting to the soft clustering one in a principled way.
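One natural way to realize this transfer, sketched below under an independence assumption across instances (an assumption of this sketch, which need not coincide with the paper's exact construction), is to take products of focal sets: each combination of per-instance focal sets determines the set of hard clusterings compatible with it, weighted by the product of the corresponding masses:

```python
from itertools import product

# Instance-wise mass functions over focal sets of clusters (invented numbers):
# instance 1 is probably in cluster 0; instance 2 hesitates between 0 and 1.
masses = [
    {(0,): 0.7, (0, 1): 0.3},
    {(1,): 0.6, (0, 1): 0.4},
]

# Global mass function over hard clusterings: each focal element is the set
# of hard assignments compatible with one choice of per-instance focal sets,
# with mass equal to the product of the per-instance masses (independence).
global_mass = {}
for combo in product(*(m.items() for m in masses)):
    focal_sets = [fs for fs, _ in combo]
    weight = 1.0
    for _, w in combo:
        weight *= w
    compatible = tuple(product(*focal_sets))  # hard clusterings in this focal element
    global_mass[compatible] = global_mass.get(compatible, 0.0) + weight

# Mass is conserved, and the most committed focal element is the single
# hard clustering (0, 1), with mass 0.7 * 0.6 = 0.42.
assert abs(sum(global_mass.values()) - 1.0) < 1e-9
assert abs(global_mass[((0, 1),)] - 0.42) < 1e-9
```

The focal elements with more than one compatible hard clustering are exactly where the evidential clustering remains undecided, and they are what the paper's distributional measures reason over.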
In this article, which is an extension of our previous conference paper [8], we continue our study of validation measures for soft clustering. After recalling our previous contributions in [9], we provide three main new contributions. First, we propose an alternative approach for the comparison of soft clusterings, which aims at comprehensively evaluating the uncertainty represented by a soft clustering, through the definition of so-called distributional measures. Second, we draw connections between the transport-theoretic measures introduced in [9] and the framework of distributional measures, showing their equivalence in the special but important case in which one of the two clusterings to be compared is a hard clustering. Finally, we propose algorithms (both exact and approximate) for computing distributional measures and we show their application to a sample collection of datasets, so as to evaluate their practical applicability and illustrate the kind of reasoning they enable. The rest of the paper is organized as follows. The necessary background on soft clustering and evaluation measures is first recalled in Section 2. Our new framework is then introduced in Section 3, and approximation methods are presented in Section 4. Finally, illustrative experiments are reported in Section 5, and Section 6 concludes the paper.

Section snippets

Background and related work

General background on clustering is first exposed in Section 2.1. The distributional representation introduced in [9] is then recalled in Section 2.2, and an overview of evaluation measures for soft clustering is provided in Section 2.3.

A general framework for soft clustering evaluation measures

As shown in Section 2.3, most of the research on comparison measures for soft clustering has focused on the analysis of some specific indices [24], [18]. Furthermore, we have described a recent proposal for a general framework to extend validation and comparison measures to the setting of evidential clustering, aimed at addressing the need for measures that enable the objective comparison of two clusterings. In this section, we propose another framework aimed at the second goal described in the

Approximation methods

In the previous section we proposed distributional measures as a general approach to extend any hard clustering comparison measure to a soft clustering comparison measure. Nonetheless, the computation of these distributional measures is, in general, intractable. For this reason, in this section, we introduce some approximation methods and algorithms, based on a sampling approach, which can be applied to any base distance between hard clusterings. We start with the case of the summarized

Illustrative experiments

In this section we discuss two sets of experiments, with the aim of illustrating the application of the proposed sampling-based approximations for the distributional measures, as well as evaluating their accuracy and convergence speed. In particular, in Section 5.1 we analyze the accuracy and speed with which the sampling-based approximations converge to the exact values of the (interval representations of the) distributional measures. By contrast, in Section 5.2 we illustrate the

Conclusion

In this article, which is an extension of [8], we proposed and studied a mathematical approach, as well as computational algorithms, making it possible to lift any clustering comparison measure from hard clustering to evidential clustering (hence, as special cases, also to rough, fuzzy and possibilistic clustering). We studied the metric and complexity-theoretic properties of the proposed distributional measures.
We have shown that computing these measures, or their interval representation, is

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (47)

- V. Antoine et al., Fast semi-supervised evidential clustering, Int. J. Approx. Reason. (2021)
- A. Campagner et al., Orthopartitions and soft clustering: soft mutual information measures for clustering validation, Knowl.-Based Syst. (2019)
- A. Campagner et al., Belief functions and rough sets: survey and new insights, Int. J. Approx. Reason. (2022)
- A. Campagner et al., A general framework for evaluating and comparing soft clusterings, Inf. Sci. (2023)
- R.J. Campello, A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment, Pattern Recognit. Lett. (2007)
- W.H. Day, The complexity of computing metric distances between partitions, Math. Soc. Sci. (1981)
- T. Denœux, Decision-making with belief functions: a review, Int. J. Approx. Reason. (2019)
- T. Denœux, NN-EVCLUS: neural network-based evidential clustering, Inf. Sci. (2021)
- H. Frigui et al., Clustering and aggregation of relational data with applications to image database categorization, Pattern Recognit. (2007)
- M.H. Masson et al., ECM: an evidential version of the fuzzy c-means algorithm, Pattern Recognit. (2008)
- M. Naaman, On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality, Stat. Probab. Lett. (2021)
- G. Peters, Rough clustering utilizing the principle of indifference, Inf. Sci. (2014)
- G. Peters et al., Soft clustering: fuzzy and rough approaches and their extensions and derivatives, Int. J. Approx. Reason. (2013)
- A. Saxena et al., A review of clustering techniques and developments, Neurocomputing (2017)
- K. Zhou et al., Evidential prototype-based clustering based on transfer learning, Int. J. Approx. Reason. (2022)
- D.T. Anderson et al., Comparing fuzzy, probabilistic, and possibilistic partitions, IEEE Trans. Fuzzy Syst. (2010)
- D.T. Anderson et al., Comparing fuzzy, probabilistic, and possibilistic partitions using the Earth mover's distance, IEEE Trans. Fuzzy Syst. (2012)
- J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (1981)
- R.K. Brouwer, Extending the Rand, adjusted Rand and Jaccard indices to fuzzy partitions, J. Intell. Inf. Syst. (2009)
- A. Campagner et al., A distributional approach for soft clustering comparison and evaluation
- A. Dempster, Upper and lower probabilities induced by a multivalued mapping, Ann. Math. Stat. (1967)
- J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. (2006)
- T. Denœux et al., Representations of uncertainty in AI: beyond probability and possibility

© 2023 Elsevier Inc. All rights reserved.



Keywords:

None


Journal
International Journal of Approximate Reasoning
ISSN: 0888-613X
Publisher: Elsevier BV