# DigiK Project "Perception in Statistics, Econometrics and Stochastics"

This project is a collaboration between the Institute of Mathematical Statistics and Actuarial Science and the Department of Economics at the University of Bern. It is co-funded by the Commission for Digitalisation at the University of Bern.

Digitalisation enables faster and more efficient processes, but it also leads to an enormous increase in the amount of data to be analysed. For instance, in medical studies, numerous quantities and features are recorded for each person or procedure, in the hope that this leads to additional insights, e.g. by means of artificial intelligence.

One problem is that the human mind is able to grasp and imagine only two- or three-dimensional objects. If the data consist of high-dimensional observation vectors, there are surprising effects which contradict human intuition. These effects make the detection of interesting structures rather difficult, like a search for a needle in a haystack.

Another problem is that when analysing massive data, there is a danger of detecting apparent associations and other interesting effects which would turn out to be spurious in future experiments or studies.

In our project we shall work on both problems. A particular goal is a deeper understanding of when and how to apply certain methods of machine learning purposefully, instead of naive trial and error.

### Subproject 1: Misleading perceptions in high-dimensional statistics (HIGHDIM)

When visualising high-dimensional data, we often face the so-called Diaconis-Freedman effect: most linear projections onto two- or three-dimensional subspaces look rather similar and unstructured. Projection pursuit is a general paradigm for finding, (almost) automatically, interesting projections which reveal relevant structures such as clusters. We are currently investigating local projection pursuit based on kernel mean embeddings.
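
The effect can be illustrated with a small simulation. The following is a minimal sketch under assumptions of our own choosing, not the method studied in the subproject: the data are hypothetical points with independent coordinates equal to -1 or +1, and excess kurtosis (0 for a Gaussian distribution) serves as a crude measure of how "structured" a one-dimensional projection looks. All function names are illustrative.

```python
import math
import random


def excess_kurtosis(values):
    """Sample excess kurtosis; close to 0 for Gaussian-looking data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def random_unit_vector(dim, rng):
    """Direction drawn uniformly from the unit sphere in dimension dim."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(data, direction):
    """One-dimensional linear projection of each observation."""
    return [sum(x * u for x, u in zip(row, direction)) for row in data]


rng = random.Random(0)
dim, n_obs = 100, 2000

# Hypothetical data: every coordinate is -1 or +1, i.e. far from Gaussian.
data = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(n_obs)]

# Projection onto a coordinate axis: the two-point structure is fully visible.
axis = [1.0] + [0.0] * (dim - 1)
kurt_axis = excess_kurtosis(project(data, axis))

# Projection onto a random direction: the structure is hidden.
kurt_random = excess_kurtosis(project(data, random_unit_vector(dim, rng)))

print(f"excess kurtosis, coordinate axis:  {kurt_axis:.2f}")
print(f"excess kurtosis, random direction: {kurt_random:.2f}")
```

In this toy setting the coordinate projection has excess kurtosis close to -2, whereas a typical random projection is close to 0 and thus hard to distinguish from Gaussian noise. Projection pursuit searches for the rare directions of the first kind.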

A second line of research is about minimum distance estimation via kernel mean embeddings in special generative models.

Collaborators: Prof. Lutz Dümbgen and Oliver Warth

### Subproject 2: Questioning similarities for expert-informed statistical learning (SIMILAR)

Distance- and similarity-based methods (nearest-neighbour classification, phylogenetic trees, multi-dimensional scaling, etc.) are at the heart of many predictive approaches in statistical data science and machine learning. Yet, distance/similarity functions are often chosen off the shelf, without necessarily questioning their adequacy for the task at hand. For instance, the Euclidean and Gower distances typically appear as default choices when dealing with real-valued and mixed continuous-categorical covariates, respectively. While metric learning has arisen in recent years as an automatic method to tune distances in distance-based machine learning, many issues remain open in diagnosing the suitability of prescribed and tuned distance functions for the tasks of interest. In this thesis, we will pioneer the field with novel diagnostic tools and algorithms extending approaches from spatial statistics and beyond (including variography, cross-validation, and more) to efficiently choose from and combine expert-informed distance functions towards improved predictivity and uncertainty quantification.
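
To make the two default choices concrete, here is a minimal sketch of a simplified Gower distance next to the Euclidean distance. The records, the feature ranges and the function name `gower_distance` are hypothetical and chosen only for illustration; the sketch assumes no missing values and equal feature weights.

```python
import math


def gower_distance(a, b, ranges):
    """Simplified Gower distance between two records.

    ranges[k] is the range (max - min) of numeric feature k in the data set,
    or None if feature k is categorical. Numeric features contribute
    |x - y| / range, categorical features contribute 0 (equal) or 1 (different);
    the result is the average over all features and lies in [0, 1].
    """
    total = 0.0
    for x, y, feature_range in zip(a, b, ranges):
        if feature_range is None:
            total += 0.0 if x == y else 1.0
        elif feature_range > 0:
            total += abs(x - y) / feature_range
        # A numeric feature with zero range is constant and contributes 0.
    return total / len(ranges)


# Hypothetical records: (age in years, income, profession).
person_a = (30, 50000.0, "nurse")
person_b = (40, 60000.0, "teacher")
ranges = (50, 100000.0, None)  # assumed ranges of age and income in the data

gower = gower_distance(person_a, person_b, ranges)

# The Euclidean distance only applies to the numeric part and is
# dominated by the feature with the largest scale (here: income).
euclidean = math.dist(person_a[:2], person_b[:2])

print(f"Gower distance:     {gower:.4f}")
print(f"Euclidean distance: {euclidean:.2f}")
```

The example shows how strongly an off-the-shelf choice shapes the notion of similarity: the Euclidean distance is driven almost entirely by income, while the Gower distance weights all three features equally. Neither is necessarily adequate for a given prediction task.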

Collaborators: Prof. David Ginsbourger and Tim Steinert

### Subproject 3: Spuriously perceived as significant: p-hacking in the era of big data (P-HACK)

P-hacking (also known as data dredging) is the misuse of data analysis to find patterns in the data that can be presented as statistically significant although they may be spurious. For instance, researchers may test multiple hypotheses and report only the significant ones, or they may perform naive inference after data-driven model selection. In these cases, the reported significance is misleading: the actual probability of a false positive exceeds the nominal level. The increasing availability of data expands the pool of variables that can be used and, therefore, the risk that multiple testing and data-driven model specification contaminate the reported significance of the results. In this project, we will propose algorithms that provide correct significance levels despite the complications mentioned above. More specifically, we will focus on quantiles in setups where naive inference is biased. Quantiles are especially interesting because they allow for analysing the heterogeneity of the impact of policy variables as well as their effect on inequality.
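
The first mechanism, testing many hypotheses and reporting only the significant ones, can be illustrated with a short simulation. This is a minimal sketch under assumptions of our own: each hypothetical study measures 20 independent outcomes with no true effect, each outcome is tested with a two-sided z-test with known unit variance, and the study "reports a finding" if its smallest p-value falls below 5%. The Bonferroni correction shown for comparison is a textbook remedy, not the algorithm to be developed in this subproject.

```python
import math
import random


def two_sided_p_value(sample):
    """z-test of H0: mean = 0 for observations with known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n_studies=1000, n_outcomes=20, n_obs=25, alpha=0.05, seed=1):
    """Fraction of null studies reporting at least one 'significant' outcome."""
    rng = random.Random(seed)
    naive_hits = 0
    bonferroni_hits = 0
    for _ in range(n_studies):
        # All outcomes are pure noise, so every rejection is a false positive.
        p_min = min(
            two_sided_p_value([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
            for _ in range(n_outcomes)
        )
        naive_hits += p_min < alpha                    # report the best outcome
        bonferroni_hits += p_min < alpha / n_outcomes  # corrected threshold
    return naive_hits / n_studies, bonferroni_hits / n_studies


naive_rate, bonferroni_rate = simulate()
expected_naive = 1.0 - (1.0 - 0.05) ** 20  # about 0.64 for independent tests

print(f"false-positive rate, naive:      {naive_rate:.3f}")
print(f"false-positive rate, Bonferroni: {bonferroni_rate:.3f}")
print(f"theoretical naive rate:          {expected_naive:.3f}")
```

With 20 independent tests at the 5% level, roughly two out of three studies without any true effect would still report a "significant" result, while the corrected threshold keeps the rate near the nominal 5%.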

Collaborators: Prof. Costanza Naguib, Prof. Blaise Melly and Nina Dorta