Machine Learning & Knowledge Extraction (MAKE)
with application on Health Informatics (MAKE-HEALTH)

In a joint effort with international colleagues, the Holzinger-Group HCI-KDD pursues theoretical, algorithmic, and experimental studies in machine learning in order to contribute to solving the problem of knowledge extraction from complex data: to discover unknown unknowns, to make predictions, and to support decision making under uncertainty – the grand goal of health informatics, which is our application domain.

Fundamentally, we are excited to help the international research community answer a grand question: how can we perform a new task by exploiting knowledge extracted during the solving of previous tasks? Contributions to this problem would have a major impact on Artificial Intelligence (AI) generally, and Machine Learning (ML) specifically, as we could develop software that learns from previous experience – much as we humans do.

Ultimately, to reach a level of usable computational intelligence, we need

  1. to learn from prior data,
  2. to extract knowledge,
  3. to generalize – i.e. guessing where probability mass/density concentrates,
  4. to fight the curse of dimensionality (see the sketch following this list), and
  5. to disentangle the underlying explanatory factors of the data – i.e. sensemaking in the context of an application domain.
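
As a small illustration of point 4, the following sketch (our own illustrative example, not from the original text) shows how pairwise Euclidean distances concentrate as the dimensionality of uniformly sampled points grows – one face of the curse of dimensionality:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def distance_concentration(dim, n_points=500):
    """Relative spread (std/mean) of pairwise distances between points
    sampled uniformly in the unit hypercube of the given dimension."""
    x = rng.random((n_points, dim))
    d = pdist(x)                      # all pairwise Euclidean distances
    return d.std() / d.mean()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative spread of distances = {distance_concentration(dim):.3f}")
# The spread shrinks as the dimension grows: nearest and farthest neighbours
# become nearly indistinguishable, which hurts distance-based learners.
```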

However, the application of fully automatic machine learning (aML) algorithms in complex domains such as health informatics seems elusive at present. A good example is Gaussian processes: here aML approaches (e.g. kernel machines) struggle with function extrapolation problems that are trivial for human learners.
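
As a rough sketch of this behaviour (assuming scikit-learn is available; the data and kernel choice are ours, purely for illustration), a Gaussian process with a standard RBF kernel fitted to a periodic signal falls back towards its prior mean outside the training range instead of continuing the obvious pattern:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Train on one region of a periodic signal ...
X_train = np.linspace(0.0, 6.0, 60).reshape(-1, 1)
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
gp.fit(X_train, y_train)

# ... and extrapolate beyond it.
X_test = np.linspace(6.0, 12.0, 7).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
for x, m, s in zip(X_test.ravel(), mean, std):
    print(f"x={x:5.1f}  prediction={m:+.2f} (std {s:.2f})   true sin(x)={np.sin(x):+.2f}")
# Far from the training data the predictions collapse towards the prior mean
# with large uncertainty, while a human would simply continue the sine wave.
```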

Consequently, interactive machine learning (iML) with a human-in-the-loop, which makes use of human cognitive abilities, is of particular interest for solving problems where learning algorithms suffer from insufficient training samples, complex data, rare events, or computational hardness, e.g. subspace clustering, protein folding, or k-anonymization. Here human experience and knowledge can help to reduce an exponential search space through heuristic selection of samples. What would otherwise remain an NP-hard problem can thus be greatly reduced in complexity through the input and assistance of a human agent involved in the learning phase.
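
To make the idea of shrinking an exponential search space concrete, here is a deliberately simple sketch (our own toy example, not the group's actual iML algorithms; the feature names are hypothetical): an exhaustive search over feature subsets becomes much smaller once a human expert pins down a few features that must, or must not, be included.

```python
from itertools import combinations

features = ["age", "bmi", "hr", "sbp", "dbp", "glucose", "crp",
            "lactate", "creatinine", "wbc"]          # hypothetical biomarkers

def candidate_subsets(features, must_include=(), must_exclude=()):
    """Enumerate feature subsets, honouring human-supplied constraints."""
    free = [f for f in features if f not in must_include and f not in must_exclude]
    for r in range(len(free) + 1):
        for combo in combinations(free, r):
            yield tuple(must_include) + combo

# Unconstrained: 2**10 = 1024 candidate subsets.
print(sum(1 for _ in candidate_subsets(features)))

# A clinician insists on "lactate" and "crp" and rules out "dbp":
# the search space shrinks to 2**7 = 128 subsets.
print(sum(1 for _ in candidate_subsets(features,
                                        must_include=("lactate", "crp"),
                                        must_exclude=("dbp",))))
```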

We work consistently on a synergistic combination of methods, techniques, and approaches from two fields that offer ideal conditions for supporting human intelligence with computational intelligence: Human–Computer Interaction (HCI) and Knowledge Discovery & Data Mining (KDD).

Successful Machine Learning & Knowledge Extraction (MAKE) pipelines require a concerted effort of integrative research across seven fields, listed below and described in the numbered sections that follow.

❶ Data preprocessing (fusion, mapping, knowledge representation);
❷ Machine learning algorithms (automatic ML/interactive ML with the human-in-the-loop);
❸ Graphical models/network science (i.e. graph-based data mining);
❹ Topological data analytics (i.e. topological data mining);
❺ Time/entropy (i.e. entropy-based data mining);
❻ Data visualization (i.e. visual analytics), and last but not least:
❼ Privacy (data protection, safety, security, privacy-aware machine learning).

Visit the CD-MAKE conference.

1

Before we can apply machine learning algorithms to heterogeneous data sets, we have to work carefully on data integration, data fusion, and data mapping in arbitrarily high-dimensional spaces, and perform adequate data pre-processing to avoid the danger of modelling artifacts. We also want to understand the underlying physics of complex, high-dimensional, and weakly structured data sets.
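
A minimal pre-processing sketch (assuming pandas and scikit-learn; the tables and column names are hypothetical) that fuses two heterogeneous sources, imputes missing values, and scales the result before any learning takes place:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical heterogeneous sources: clinical records and lab measurements.
clinical = pd.DataFrame({"patient_id": [1, 2, 3],
                         "age":        [54, 61, 47],
                         "bmi":        [27.1, None, 31.4]})
labs     = pd.DataFrame({"patient_id": [1, 2, 3],
                         "crp":        [4.2, 11.8, None],
                         "glucose":    [5.4, 7.9, 6.1]})

# Fuse on the shared key, keeping every patient that occurs in either source.
fused = clinical.merge(labs, on="patient_id", how="outer")

# Impute missing values and bring all features to a comparable scale, so that
# later distance- or gradient-based learners are not dominated by one unit.
features = fused.drop(columns="patient_id")
X = SimpleImputer(strategy="median").fit_transform(features)
X = StandardScaler().fit_transform(X)
print(pd.DataFrame(X, columns=features.columns).round(2))
```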

2

Of course we are interested in fully automatic machine learning (aML); however, aML approaches often fail in complex domains such as health informatics. Consequently, interactive machine learning (iML) with a human-in-the-loop, which makes use of human cognitive abilities, is of particular interest for solving problems where learning algorithms suffer from insufficient training samples, complex data, rare events, or computational hardness, e.g. subspace clustering, protein folding, or k-anonymization.
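
One simple way to put a human into the loop is uncertainty-based query selection: the learner asks the expert to label only those samples it is least sure about. The sketch below is a generic active-learning illustration using scikit-learn, not the group's specific iML method; the `human_expert` oracle is a stand-in for a real user.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=300, n_features=10, random_state=0)

def human_expert(idx):
    """Stand-in for the human-in-the-loop: in a real iML system this would
    be an interactive labelling step; here we just reveal the true label."""
    return y_true[idx]

rng = np.random.default_rng(0)
labelled = list(rng.choice(len(X), size=10, replace=False))  # tiny seed set
labels = {i: human_expert(i) for i in labelled}

model = LogisticRegression(max_iter=1000)
for round_ in range(5):
    model.fit(X[labelled], [labels[i] for i in labelled])
    # Ask the human about the sample the model is most uncertain of.
    proba = model.predict_proba(X)[:, 1]
    uncertainty = np.abs(proba - 0.5)
    uncertainty[labelled] = np.inf          # never re-query known samples
    query = int(np.argmin(uncertainty))
    labels[query] = human_expert(query)
    labelled.append(query)
    print(f"round {round_}: queried sample {query}, "
          f"accuracy = {model.score(X, y_true):.2f}")
```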

3

Graph theory provides powerful tools to map data structures and to discover novel connections among data sets, and graph structures can be analyzed with statistical and machine learning techniques. Our goal is to develop promising approaches for interactive knowledge discovery that blend graph entropy with multi-touch interaction, which requires investigations beyond small-world and random networks.
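
As a small illustration (assuming networkx is available; this uses one simple definition of graph entropy based on the degree distribution, which is only one of several variants), we can compare the structural entropy of a random graph with that of a perfectly regular one:

```python
import math
import networkx as nx

def degree_entropy(G):
    """Shannon entropy of the degree distribution - one simple graph-entropy
    measure; other graph entropies exist and behave differently."""
    degrees = [d for _, d in G.degree()]
    total = sum(degrees)
    probs = [d / total for d in degrees if d > 0]
    return -sum(p * math.log2(p) for p in probs)

random_graph  = nx.erdos_renyi_graph(n=100, p=0.05, seed=1)
regular_graph = nx.random_regular_graph(d=5, n=100, seed=1)

print(f"Erdos-Renyi graph : {degree_entropy(random_graph):.3f} bits")
print(f"5-regular graph   : {degree_entropy(regular_graph):.3f} bits")
# A regular graph has a perfectly flat degree sequence, so its degree-based
# entropy reaches the maximum log2(n); degree heterogeneity lowers the value.
```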

4

Information entropy, originally a measure of uncertainty in data, has evolved into a vast research area. Our goal is to contribute towards advances in combining learning algorithms with entropy measures for use in knowledge discovery and data mining, in order to discover unknown unknowns in complex data sets, e.g. for biomarker discovery in biomedical data.
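
A minimal sketch of entropy at work in feature ranking (assuming scikit-learn; the synthetic "biomarkers" are purely illustrative): mutual information, an entropy-based quantity, scores how informative each candidate feature is about a class label.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
label = rng.integers(0, 2, size=n)                     # e.g. healthy vs. diseased

# Three hypothetical candidate biomarkers with different signal strength.
informative = label * 2.0 + rng.normal(0, 0.5, n)      # strongly related
weak        = label * 0.3 + rng.normal(0, 1.0, n)      # weakly related
noise       = rng.normal(0, 1.0, n)                    # unrelated

X = np.column_stack([informative, weak, noise])
scores = mutual_info_classif(X, label, random_state=0)
for name, s in zip(["informative", "weak", "noise"], scores):
    print(f"{name:12s} mutual information = {s:.3f} nats")
# The entropy-based score ranks the truly informative biomarker first and the
# pure-noise feature (a score close to 0) last.
```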

5

Often we are confronted with point cloud data sets sampled from an unknown high-dimensional space (e.g. in proteomics). We use the shape of the data to identify features, aiming at recovering the topology of the underlying space. Our goal is a deep understanding of the underlying data and a contribution towards the application of topological data analysis for advances in machine learning.
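
As a small, self-contained sketch (using SciPy on a toy point cloud we made up), the 0-dimensional part of persistent homology – how long connected components of a point cloud "live" as a distance threshold grows – can be read off a single-linkage clustering: the merge heights are exactly the death times of the components.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
# Toy point cloud: two well-separated blobs in 3-D.
cloud = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(30, 3)),
                   rng.normal(loc=5.0, scale=0.3, size=(30, 3))])

# Single-linkage merge heights = death times of the 0-dimensional
# persistent-homology classes of the Vietoris-Rips filtration.
deaths = np.sort(linkage(cloud, method="single")[:, 2])

print(f"longest-lived component dies at distance {deaths[-1]:.2f}")
print(f"second longest at distance {deaths[-2]:.2f}")
# One merge distance is much larger than all the others, revealing that the
# cloud consists of two topologically separate components (the two blobs).
```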

6

Humans are excellent at recognizing patterns in dimensions of three or fewer, but most biomedical data sets have far more than three dimensions, which often makes manual analysis impossible. Our goal is to reduce results from these arbitrarily high-dimensional spaces to lower dimensions and to make them accessible to the human end user. For machine learning the visualization part is perhaps the most important one, because at the end of the day it is our customers (e.g. medical doctors) who must comprehend, in R², results that we may have found in arbitrarily high-dimensional spaces!
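
A minimal projection sketch (using scikit-learn's PCA on a bundled biomedical data set, purely as an illustration of the mapping into R²):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()                     # 30-dimensional biomedical data
X = StandardScaler().fit_transform(data.data)   # scale before projecting

# Project the 30-dimensional feature space down to R^2 for a human viewer.
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print("projected shape:", X2.shape)             # (569, 2)
print("variance kept  : {:.0%}".format(pca.explained_variance_ratio_.sum()))
# X2 can now be shown as a simple scatter plot, coloured by data.target,
# so that a domain expert can inspect the structure of the data at a glance.
```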

7

As soon as we deal with biomedical data sets, addressing privacy, data protection, safety, and security becomes mandatory. Our goal is to contribute towards the generation of open data sets, in order to support the international research community and to make results openly available and replicable – a central goal of science.
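
A small sketch of one privacy building block mentioned above, k-anonymity (using pandas; the table and the quasi-identifier columns are hypothetical): a released table is k-anonymous when every combination of quasi-identifiers occurs at least k times.

```python
import pandas as pd

# Hypothetical patient table; 'age_group' and 'zip3' act as quasi-identifiers.
records = pd.DataFrame({
    "age_group": ["40-49", "40-49", "40-49", "50-59", "50-59", "50-59"],
    "zip3":      ["801",   "801",   "801",   "802",   "802",   "803"],
    "diagnosis": ["A",     "B",     "A",     "C",     "A",     "B"],
})

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return df.groupby(quasi_identifiers).size().min() >= k

qi = ["age_group", "zip3"]
print(is_k_anonymous(records, qi, k=2))   # False: ('50-59', '803') is unique
print(is_k_anonymous(records, qi, k=1))   # True (trivially)
```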