Subproject 2: Questioning similarities for expert-informed statistical learning (SIMILAR)
Distance- and similarity-based methods (nearest-neighbour classification, phylogenetic trees, multidimensional scaling, etc.) are at the heart of many predictive approaches in statistical data science and machine learning. Yet distance and similarity functions are often chosen off the shelf, without questioning whether they are adequate for the task at hand. For instance, the Euclidean and Gower distances typically appear as default choices when dealing with real-valued and mixed continuous-categorical covariates, respectively. Although metric learning has emerged in recent years as an automatic way to tune distances in distance-based machine learning, many questions remain open about how to diagnose whether prescribed or tuned distance functions suit the task of interest. In this subproject, we will pioneer this area with novel diagnostic tools and algorithms that extend approaches from spatial statistics and beyond (including variography and cross-validation) to efficiently choose among and combine expert-informed distance functions, with the aim of improving predictive performance and uncertainty quantification.
Collaborators: Prof. David Ginsbourger and Tim Steinert