Transparency & Trust in Machine Learning: Making AI interpretable and explainable

A huge motivation for us in continuing to study interactive Machine Learning (iML) [1] – with a human in the loop [2] (see our project page) is that modern deep learning models are often considered to be “black-boxes” [3]. A further drawback is that such models have no explicit declarative knowledge representation, hence have difficulty in generating the required explanatory structures – which considerably limits the achievement of their full potential [4].

Even if we understand the mathematical theory behind a machine learning model, it is still difficult to gain insight into its internal workings. Black-box models therefore lack transparency, which raises the question: “Can we trust our results?”

More precisely: “Can we explain how and why a result was achieved?” A classic example is the question “Which objects are similar?”; an even more interesting question is “Why are those objects similar?”
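The difference between the two questions can be illustrated with a minimal sketch. It assumes objects are described by numeric feature vectors and compared by cosine similarity; the feature names and values are made up for illustration and do not come from our projects. The score answers “which objects are similar?”, while the per-feature breakdown is one simple way to approach “why?”.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def explain_similarity(a, b, feature_names):
    """Split the cosine similarity into one additive contribution per feature."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    contributions = {
        name: (x * y) / (norm_a * norm_b)
        for name, x, y in zip(feature_names, a, b)
    }
    # Sort so that the features driving the similarity come first.
    return sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical, made-up feature vectors for two objects (here: two patients).
features = ["age", "blood_pressure", "cholesterol"]
patient_a = [0.9, 0.1, 0.4]
patient_b = [0.8, 0.7, 0.1]

score = cosine_similarity(patient_a, patient_b)
ranking = explain_similarity(patient_a, patient_b, features)

print(f"similarity = {score:.3f}")  # the "which" answer
for name, part in ranking:          # a simple "why" answer
    print(f"  {name}: {part:+.3f}")
```

The contributions sum to the similarity score, so a reader can see which features account for most of it. For models with learned, non-linear representations such a direct decomposition is generally not available, which is exactly the difficulty described above.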

We believe that there is a growing demand for machine learning approaches that are not only well performing, but also transparent, interpretable and trustworthy. We are currently working on methods and models to reenact the machine decision-making process, and to reproduce and comprehend the learning and knowledge extraction process. This is important because decision support requires an understanding of the causality of learned representations [5], [6]. If human intelligence is complemented by machine learning, and in at least some cases even overruled, humans must still be able to understand and, above all, interactively influence the machine decision process. This requires context awareness and sensemaking to close the gap between human thinking and machine “thinking”.

A recent, and very interesting discussion with Daniel S. WELD (Artificial Intelligence, Crowdsourcing, Information Extraction) on Explainable AI can be found here:

In essence, the interview brings out that most machine learning models are very complicated: deep neural networks operate incredibly quickly, considering thousands of possibilities in seconds before making a decision. As Dan Weld points out, “The human brain simply can’t keep up”. He cites the example of AlphaGo making an unexpected decision: it is not possible to understand why the algorithm made exactly that choice. This may not be critical in a game, where no one gets hurt; however, deploying intelligent machines that we cannot understand could set a dangerous precedent, e.g. in our domain, health informatics. According to Dan Weld, understanding and trusting machines is “the key problem to solve” in AI safety, security, data protection and privacy, and solving it is urgent. He further explains, “Since machine learning is nowadays at the core of pretty much every AI success story, it’s really important for us to be able to understand what is it that the machine learned.”

When a machine learning system is confronted with a “known unknown”, it may recognize its uncertainty about the situation in the given context. However, when it encounters an unknown unknown, it will not even recognize that the situation is uncertain: the system will have extremely high confidence that its result is correct, yet it will still be wrong. Dan points to the example of classifiers “trained on data that had some regularity in it that’s not reflected in the real world”, which is a problem of having little or even no available training data (see [1]). The problem of “unknown unknowns” is definitely underestimated in the traditional AI community. Governments and businesses cannot afford to deploy highly intelligent AI systems that make unexpected, harmful decisions, especially if these systems operate in safety-critical environments.
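The “unknown unknown” effect can be reproduced even with a very small model. The following sketch is an illustration under simplifying assumptions, not the setting discussed in the interview: it trains a one-dimensional logistic regression (instead of a deep network) on made-up data between -2 and 2, and then asks for its confidence on an input far outside anything it has seen.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_1d(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def confidence(w, b, x):
    """Probability the model assigns to whichever class it predicts."""
    p = sigmoid(w * x + b)
    return max(p, 1.0 - p)

# Made-up training data: class 0 on the left, class 1 on the right.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_1d(xs, ys)

near = confidence(w, b, 0.1)   # close to the decision boundary
far = confidence(w, b, 50.0)   # far outside the training range

print(f"confidence near the boundary: {near:.3f}")
print(f"confidence far from all data: {far:.3f}")
```

The model is uncertain near its decision boundary (a “known unknown”), but its confidence approaches 1 for the far-away input, although it has never seen anything like it. Nothing in the model's output signals that the input lies outside the training distribution.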

A further strong motivation for this approach is the rise of legal and privacy requirements: the new European General Data Protection Regulation (GDPR, and ISO/IEC 27001), which becomes enforceable on May 25, 2018, will make black-box approaches difficult to use in business, because they are not able to explain why a decision has been made.

This will stimulate research in this area with the goal of making decisions interpretable, comprehensible and reproducible. In health informatics, for example, this is useful not only for machine learning research and for clinical decision making, but is at the same time a big asset for the training of medical students.

The General Data Protection Regulation (GDPR) (Regulation (EU) 2016/679) is a regulation by which the European Parliament, the Council of the European Union and the European Commission intend to strengthen and unify data protection for all individuals within the European Union (EU). It also addresses the export of personal data outside of the European Union (this will affect data-centric projects between the EU and e.g. the US). The GDPR aims primarily to give control back to citizens and residents over their personal data and to simplify the regulatory environment for international business by unifying the regulation within the EU. The GDPR replaces the Data Protection Directive (95/46/EC) of 1995. The regulation was adopted on 27 April 2016 and becomes enforceable from 25 May 2018, after a two-year transition period. Unlike a directive, it does not require national governments to pass any enabling legislation and is thus directly binding, which affects practically all data-driven businesses, and particularly machine learning and AI technology. Note that the “right to be forgotten” [7] established by the European Court of Justice has been extended to become a “right to erasure”: it will no longer be sufficient to remove a person’s data from search results when requested to do so; data controllers must now erase that data. However, if the data is encrypted, it may be sufficient to destroy the encryption keys rather than go through the prolonged process of ensuring that the data has been fully erased [8].
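The idea of erasing data by destroying its key can be sketched in a few lines. The example below is a toy illustration only: it uses a one-time pad built from the Python standard library so that it stays self-contained, whereas a real system would use a vetted encryption library and a proper key management service. The `KeyStore` class and the identifier `patient-42` are hypothetical names introduced for this sketch; whether key destruction satisfies the regulation in a given case is a legal question that the code does not answer.

```python
import secrets

def encrypt(plaintext):
    """Encrypt bytes with a one-time pad: a random key as long as the data."""
    key = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return ciphertext, key

def decrypt(ciphertext, key):
    """XOR with the same key restores the plaintext."""
    return bytes(c ^ k for c, k in zip(ciphertext, key))

class KeyStore:
    """Hypothetical per-person key store (assumption, for illustration only)."""

    def __init__(self):
        self._keys = {}

    def put(self, person_id, key):
        self._keys[person_id] = key

    def get(self, person_id):
        return self._keys.get(person_id)

    def erase(self, person_id):
        # Dropping the reference is enough for this sketch; a real key store
        # would also have to wipe backups and any in-memory copies.
        self._keys.pop(person_id, None)

store = KeyStore()
record = b"example record for one person"

# The ciphertext may live in many places: databases, backups, logs.
ciphertext, key = encrypt(record)
store.put("patient-42", key)
del key  # keep the key only in the key store

assert decrypt(ciphertext, store.get("patient-42")) == record

# Erasure request: destroy the key instead of hunting down every copy.
store.erase("patient-42")
print(store.get("patient-42"))  # no key left, the ciphertext is unreadable
```

The design point is that the ciphertext can remain in backups and replicas, because without the key it carries no recoverable information about the person.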

References:

[1] Holzinger, A. 2016. Interactive Machine Learning for Health Informatics: When do we need the human-in-the-loop? Brain Informatics, 3, (2), 119-131, doi:10.1007/s40708-016-0042-6.

[2] Holzinger, A., Plass, M., Holzinger, K., Crisan, G. C., Pintea, C.-M. & Palade, V. 2017. A glass-box interactive machine learning approach for solving NP-hard problems with the human-in-the-loop. arXiv:1708.01104.

[3] Lipton, Z. C. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.

[4] Bologna, G. & Hayashi, Y. 2017. Characterization of Symbolic Rules Embedded in Deep DIMLP Networks: A Challenge to Transparency of Deep Learning. Journal of Artificial Intelligence and Soft Computing Research, 7, (4), 265-286, doi:10.1515/jaiscr-2017-0019.

[5] Pearl, J. 2009. Causality: Models, Reasoning, and Inference (2nd Edition), Cambridge, Cambridge University Press.

[6] Gershman, S. J., Horvitz, E. J. & Tenenbaum, J. B. 2015. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349, (6245), 273-278, doi:10.1126/science.aac6076.

[7] Malle, B., Kieseberg, P., Schrittwieser, S. & Holzinger, A. 2016. Privacy Aware Machine Learning and the “Right to be Forgotten”. ERCIM News (special theme: machine learning), 107, (3), 22-23.

[8] Kingston, J. 2017. Using artificial intelligence to support compliance with the general data protection regulation. Artificial Intelligence and Law, doi:10.1007/s10506-017-9206-9.

Links:

https://de.wikipedia.org/wiki/Datenschutz-Grundverordnung

https://en.wikipedia.org/wiki/General_Data_Protection_Regulation

http://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:31995L0046

2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY

http://googleblog.blogspot.com/2015/07/neon-prescription-or-rather-new.html

https://sites.google.com/site/nips2016interpretml

 

Interpretable Machine Learning Workshop

Andrew G Wilson, Jason Yosinski, Patrice Simard, Rich Caruana, William Herlands

https://nips.cc/Conferences/2017/Schedule?showEvent=8744

 

Journal “Artificial Intelligence and Law”

https://link.springer.com/journal/volumesAndIssues/10506

ISSN: 0924-8463 (Print) 1572-8382 (Online)

Glossary:

AI = Artificial Intelligence (today often used interchangeably with Machine Learning (ML); the two are highly interrelated but not the same)

Causality = a concept extending from Greek philosophy to today’s neuropsychology; assumptions about the nature of causality may be shown to be functions of a previous event preceding a later event.

Explainability = fundamental topic within AI

Etiology = in medicine (many) factors coming together to cause an illness (see causality)

CD-MAKE machine learning and knowledge extraction

Marta Milo and Neil Lawrence in Reggio di Calabria at CD-MAKE 2017

CD-MAKE 2017, held in the context of the ARES conference series, was a complete success in beautiful Reggio di Calabria.

In the middle: Marta Milo and Neil Lawrence, the keynote speakers of CD-MAKE 2017, flanked by Francesco Buccafurri (on the right) and Andreas Holzinger

Machine Learning & Knowledge Extraction (MAKE) Journal launched

Inaugural Editorial Paper published:

Holzinger, A. 2017. Introduction to Machine Learning & Knowledge Extraction (MAKE). Machine Learning and Knowledge Extraction, 1, (1), 1-20, doi:10.3390/make1010001.

http://www.mdpi.com/2504-4990/1/1/1

Machine Learning and Knowledge Extraction (MAKE) is an inter-disciplinary, cross-domain, peer-reviewed, scholarly open access journal to provide a platform to support the international machine learning community. It publishes original research articles, reviews, tutorials, research ideas, short notes and Special Issues that focus on machine learning and applications. Papers which deal with fundamental research questions to help reach a level of useable computational intelligence are very welcome.

Machine learning deals with understanding intelligence in order to design algorithms that can learn from data, gain knowledge from experience and improve their learning behaviour over time. The challenge is to extract relevant structural and/or temporal patterns (“knowledge”) from data; such patterns are often hidden in high-dimensional spaces and thus not accessible to humans. Machine learning in many application domains, e.g. smart health and smart factory, affects our daily life through, e.g., recommender systems, speech recognition and autonomous driving. The grand challenge is to understand context in the real world under uncertainty. Probabilistic inference can be of great help here, as inverse probability allows us to learn from data, to infer unknowns, and to make predictions to support decision making.
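As a minimal sketch of what inverse probability means in practice, the following example applies Bayes' rule to a diagnostic test. All numbers (prevalence, sensitivity, specificity) are made-up assumptions chosen for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(prior, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' rule.

    prior:        P(disease) before seeing the test result
    sensitivity:  P(positive | disease)
    specificity:  P(negative | no disease)
    """
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # Total probability of observing a positive test.
    evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_disease * prior / evidence

# Assumed values: 1 % prevalence, 90 % sensitivity, 95 % specificity.
p1 = posterior(prior=0.01, sensitivity=0.90, specificity=0.95)

# Learning from data: the posterior becomes the prior for a second,
# independent positive test.
p2 = posterior(prior=p1, sensitivity=0.90, specificity=0.95)

print(f"after one positive test:  {p1:.3f}")
print(f"after two positive tests: {p2:.3f}")
```

The first result is much lower than the test's sensitivity suggests, because the disease is rare. Each further observation updates the belief, which is the sense in which inference supports decision making under uncertainty.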

NOTE: To support the training of a new kind of machine learning graduate, the journal accepts peer-reviewed high-end tutorial papers, as the IEEE Signal Processing Magazine (SCI IF = 9.654) does:
http://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=79#AimsScope

Call for Papers: Open Data for Discovery Science (due July 31, 2017)

The journal BMC Medical Informatics and Decision Making (SCI IF (2015): 2.042) invites submissions to a new thematic series on open data for discovery science:

https://bmcmedinformdecismak.biomedcentral.com/articles/collections/odds

Note: Excellent submissions to the IFIP Cross Domain Conference on Machine Learning and Knowledge Extraction (CD-MAKE) (submission due May 15, 2017) that are relevant to the topics described below will be invited to expand their work into this thematic series:
The use of open data for discovery science has gained much attention recently as its full potential is unfolding and being explored in projects spanning all areas of healthcare research. A plethora of data sets are now available thanks to drives to make data universally accessible and usable for discovery science. However, with these advances come inherent challenges with the processing and management of ever expanding data sources. The computational and informatics tools and methods currently used in most investigational settings are often labor intensive and rely upon technologies that have not been designed to scale and support reasoning across multi-dimensional data resources. In addition, there are many challenges associated with the storage and responsible use of open data, particularly medical data, such as privacy, data protection, safety, information security and fair use of the data. There are therefore significant demands from the research community for the development of data management and analytic tools supporting heterogeneous analytic workflows and open data sources. Effective anonymisation tools are also of paramount importance to protect data security whilst preserving the usability of the data.

The purpose of this thematic series is to bring together articles reporting advances in the use of open data including the following:

  • The development of tools and methods targeting the reproducible and rigorous use of open data for discovery science, including but not limited to: syntactic and semantic standards, platforms for data sharing and discovery, and computational workflow orchestration technologies that enable the creation of data analytics, machine learning and knowledge extraction pipelines.
  • Practical approaches for the automated and/or semi-automated harmonization, integration, analysis, and presentation of data products to enable hypothesis discovery or testing.
  • Theoretical and practical approaches for solutions to make use of interactive machine learning to put a human-in-the-loop, answering questions including: could human intelligence lead to general heuristics that we can use to improve heuristics?
  • Frameworks for the application of open data in hypothesis generation and testing in projects spanning translational, clinical, and population health research.
  • Applied studies that demonstrate the value of using open data either as a primary or as an enriching source of information for the purposes of hypothesis generation/testing or for data-driven decision making in the research, clinical, and/or population health environments.
  • Privacy preserving machine learning and knowledge extraction algorithms that can enable the sharing of previously “privileged” data types as open data.
  • Evaluation and benchmarking methodologies, methods and tools that can be used to demonstrate the impact of results generated through the primary or secondary use of open data.
  • Socio-cultural, usability, acceptance, ethical and policy issues and frameworks relevant to the sharing, use, and dissemination of information and knowledge derived from the analysis of open data.

Submission is open to everyone, and all submitted manuscripts will be peer-reviewed through the standard BMC Medical Informatics and Decision Making review process. Manuscripts should be formatted according to the submission guidelines and submitted via the online submission system. Please indicate clearly in the covering letter that the manuscript is to be considered for the ‘Open data for discovery science’ collection. The deadline for submissions will be 31 July 2017.

For further information, please email the editors of the thematic series:
Andreas HOLZINGER, a.holzinger@hci-kdd.org,
Philip PAYNE, prpayne@wustl.edu, or the BMC in-house editor
Emma COOKSON at emma.cookson@biomedcentral.com

Link to the IFIP Cross-Domain Conference on Machine Learning and Knowledge Extraction (CD-MAKE):
https://cd-make.net

Integrated interactomes and pathways in precision medicine by Igor Jurisica, Toronto

Machine learning is the fastest growing field in computer science, and Health Informatics is amongst the greatest application challenges, providing benefits in improved medical diagnoses, disease analyses, and pharmaceutical development – towards future precision medicine.

Talk announcement: Friday, May 12, 2017, 10:00, Seminar Room 137, ground floor, Inffeldgasse 16c

Integrated interactomes and pathways in precision medicine

by Igor Jurisica, University of Toronto and Princess Margaret Cancer Center Toronto

Abstract: Fathoming cancer and other complex disease development processes requires systematically integrating diverse types of information, including multiple high-throughput datasets and diverse annotations. This comprehensive and integrative analysis will lead to data-driven precision medicine, and in turn will help us to develop new hypotheses and answer complex questions such as: what factors cause disease; which patients are at high risk; will patients respond to a given treatment; and how to rationally select a combination therapy for an individual patient.
Thousands of potentially important proteins remain poorly characterized. Computational biology methods, including machine learning, knowledge extraction, data mining and visualization, can help to fill this gap with accurate predictions, making disease modeling more comprehensive. Intertwining computational prediction and modeling with biological experiments will lead to more useful findings faster and more economically.

Short Bio: Igor Jurisica is Tier I Canada Research Chair in Integrative Cancer Informatics, Senior Scientist at Princess Margaret Cancer Centre, Professor at University of Toronto and Visiting Scientist at IBM CAS. He is also an Adjunct Professor at the School of Computing, Pathology and Molecular Medicine at Queen’s University, Computer Science at York University, scientist at the Institute of Neuroimmunology, Slovak Academy of Sciences and an Honorary Professor at Shanghai Jiao Tong University in China. Since 2015, he has also served as Chief Scientist at the Creative Destruction Lab, Rotman School of Management. Igor has published extensively on data mining, visualization and cancer informatics, including multiple papers in Science, Nature, Nature Medicine, Nature Methods, Journal of Clinical Oncology, and received over 9,960 citations since 2012. He has been included in Thomson Reuters 2016, 2015 & 2014 list of Highly Cited Researchers, and The World’s Most Influential Scientific Minds: 2015 & 2014 Reports.

Jurisica Lab, IBM Life Sciences Discovery Center: http://www.cs.toronto.edu/~juris/

Canada Tier I Research Chair: http://www.chairs-chaires.gc.ca/chairholders-titulaires/profile-eng.aspx?profileId=2347

On Nutrigenomics [1]: http://www.uhn.ca/corporate/News/Pages/Igor_Jurisica_talks_nutrigenomics.aspx

[1] Nutrigenomics tries to define the causal relationship between specific nutrients or nutrient regimes (diets) and human health. The underlying idea is personalized nutrition based on the *omics background, which may help to foster personal dietary recommendations. Ultimately, nutrigenomics will allow effective dietary-intervention strategies to recover normal homeostasis and to prevent diet-related diseases, see: Muller, M. & Kersten, S. 2003. Nutrigenomics: goals and strategies. Nature Reviews Genetics, 4, (4), 315-322.

What is machine learning?

Many services in our everyday life now rely on machine learning, a field of science and a powerful technology that allows machines to learn from data. A very nice infographic by the Royal Society, interactive and with a quiz, can be found here:

Royal Society Infographic “What is machine learning?”

This is part of an information campaign about machine learning by the Royal Society:

https://royalsociety.org/topics-policy/projects/machine-learning/

The Royal Society was formed by a group of natural scientists influenced by Francis Bacon (1561-1626). The first ‘learned society’ meeting, on 28 November 1660, followed a lecture at Gresham College by Christopher Wren. Joined by Robert Boyle, John Wilkins and others, the group received royal approval from King Charles II (1630-1685) in 1663 and has since been known as ‘The Royal Society of London for Improving Natural Knowledge’.

Machine Learning Guide

An excellent podcast that I can fully recommend to my students is the Machine Learning Guide by Tyler RENELLE (TensorFlow). The series aims to teach the high-level fundamentals of machine learning, with a focus on algorithms and some of the underlying mathematics, which is really great.

http://ocdevel.com/podcasts/machine-learning

Cross Domain Conference for Machine Learning & Knowledge Extraction

cd-make.net

Call for Papers – due May 15, 2017

http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=61244&copyownerid=17803

International IFIP Cross Domain Conference for Machine Learning & Knowledge Extraction CD-MAKE
in Reggio di Calabria (Italy) August 29 – September 1, 2017

https://cd-make.net

CD stands for Cross-Domain and means the integration and appraisal of different fields and application domains (e.g. Health, Industry 4.0, etc.) to provide an atmosphere that fosters different perspectives and opinions. The conference is dedicated to offering an international platform for novel ideas and a fresh look at the methodologies needed to put crazy ideas into business for the benefit of humans. Serendipity is a desired effect and should cross-fertilize methodologies and the transfer of algorithmic developments.

MAKE stands for MAchine Learning & Knowledge Extraction.

CD-MAKE is a joint effort of IFIP TC 5, IFIP WG 8.4, IFIP WG 8.9 and IFIP WG 12.9 and is held in conjunction with the International Conference on Availability, Reliability and Security (ARES).
Keynote Speakers are Neil D. LAWRENCE (Amazon) and Marta MILO (University of Sheffield).

IFIP, the International Federation for Information Processing, is the leading multi-national, non-governmental, apolitical organization in Information & Communications Technologies and Computer Sciences. It is recognized by the United Nations and was established in 1960 under the auspices of UNESCO as an outcome of the first World Computer Congress, held in Paris in 1959.

Papers are sought from the following seven topical areas (see image below). Papers which deal with fundamental questions and theoretical aspects in machine learning are very welcome.

❶ Data science (data fusion, preprocessing, data mapping, knowledge representation),
❷ Machine learning (both automatic ML and interactive ML with the human-in-the-loop),
❸ Graphs/network science (i.e. graph-based data mining),
❹ Topological data analysis (i.e. topology data mining),
❺ Time/entropy (i.e. entropy-based data mining),
❻ Data visualization (i.e. visual analytics), and last but not least
❼ Privacy, data protection, safety and security (i.e. privacy aware machine learning).

Proposals for Workshops, Special Sessions, Tutorials: April 19, 2017
Submission Deadline: May 15, 2017
Author Notification: June 14, 2017
Camera Ready Deadline: July 7, 2017

 

 https://cd-make.net/call-for-papers

 

Machine Learning Podcast: Data Skeptic (recommended)

Data Skeptic is a weekly podcast that is skeptical of and with data. They explain methods and algorithms that power our world in an accessible manner through short mini-episode discussions and longer interviews with experts in the field, see:

http://dataskeptic.com

 

Call for Papers – Privacy Aware Machine Learning PAML due April 1, 2017

Privacy Aware Machine Learning (PAML)
for Health Data Science

Special Session on September 1, 2017, organized by Andreas HOLZINGER, Peter KIESEBERG, Edgar WEIPPL and A Min TJOA in the context of the 12th International Conference on Availability, Reliability and Security (ARES and CD-ARES), Reggio di Calabria, Italy, August 29 – September 2, 2017

Session Homepage

Supported by the International Federation for Information Processing (IFIP): TC 5, WG 8.4 and WG 8.9
http://cd-ares-conference.eu
http://www.ares-conference.eu

Keynote Talk by Neil D. LAWRENCE, University of Sheffield and Amazon

With the new European data protection and privacy regulations coming into effect on May 25, 2018, issues that have so far been nice-to-have are becoming must-haves. Privacy aware machine learning will be one of the most important fields for the European research community and the IT business in particular. Most affected is the whole area of biology, medicine and health, particularly driven by the fact that the health sciences are becoming more and more data intensive.

This special session will bring together scientists with diverse background, interested in both the underlying theoretical principles as well as the application of such methods for practical use in the biomedical, life sciences and health care domain. The cross-domain integration and appraisal of different fields will provide an atmosphere to foster different perspectives and opinions; it will offer a platform for novel crazy ideas and a fresh look on the methodologies to put these ideas into business.

All papers will be peer-reviewed by three members of the international PAML committee. The paper acceptance rate of the last session was 35 %. Accepted papers will be published in a Springer Lecture Notes in Computer Science (LNCS) volume, and excellent contributions will be invited to be extended for a special issue of a journal (planned: Springer MACH and/or BMC MIDM).

Research topics covered by this special session include but are not limited to the following topics:

– Production of Open Data Sets
– Synthetic data sets for learning algorithm testing
– Privacy preserving machine learning, data mining and knowledge discovery
– Data leak detection
– Data citation
– Differential privacy
– Anonymization and pseudonymization
– Securing expert-in-the-loop machine learning systems
– Evaluation and benchmarking

This picture was taken by our local host, Francesco Buccafurri, on January 3, 2017: from the conference venue you have a direct view of the Mount Etna volcano:

Picture taken by Francesco Buccafurri on January 3, 2017
