Apr, 14, 2015 Seminar Talks Deep Learning

Title:  Using Deep Learning for Discovering Knowledge from Images: Pitfalls and Best Practices

Lecturer: Marcus BLOICE <expertise>

Abstract: Neural networks have been shown to be adept at image analysis and image classification. Deep layered neural networks especially so. However, deep learning requires two things in order to work proficiently: large amounts of data and lots of processing power. In this talk both aspects are covered, allowing you to maximise the potential of deep learning. Firstly, we will learn how the computational power of GPUs can be used to speed up learning by orders of magnitude, making it possible to learn from very large datasets on commodity hardware. Thanks to software such as Theano, Caffe, and Pylearn2, the GPU can be leveraged without needing to be an expert in parallel programming. This talk will discuss how. Secondly, data preprocessing, data augmentation, and artificial data generation are discussed. These methods allow you to ensure you are making the most of the data you possess, by expanding your dataset and preparing your data properly before analysis. This means discussing best practices in data preparation, using methods such as histogram equalisation, contrast stretching or normalisation, and discussing artificial data generation in detail. The tools you require to do so are described, using multi-platform software that is freely available. Finally, the talk will touch on hyper-parameters and the best practices and pitfalls of hyper-parameter choice when training deep neural networks.
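To give a flavour of the preprocessing and augmentation steps named in the abstract, here is a minimal NumPy sketch. It is not taken from the talk: the function names, the percentile defaults, and the choice of flips and 90-degree rotations as augmentations are assumptions made purely for illustration, and 8-bit greyscale images are assumed throughout.

```python
import numpy as np

def contrast_stretch(img, low_pct=2.0, high_pct=98.0):
    """Linearly rescale intensities so the chosen percentiles map to 0 and 255."""
    lo, hi = np.percentile(img, [low_pct, high_pct])
    if hi <= lo:  # constant image: nothing to stretch
        return np.zeros_like(img, dtype=np.uint8)
    out = (img.astype(np.float64) - lo) / (hi - lo)
    return (np.clip(out, 0.0, 1.0) * 255).round().astype(np.uint8)

def equalise_histogram(img):
    """Histogram equalisation of an 8-bit greyscale image via its CDF."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # CDF value at the darkest occupied bin
    total = img.size
    if total == cdf_min:  # constant image: leave unchanged
        return img.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]  # apply the lookup table pixel by pixel

def augment(img, rng):
    """Return a randomly flipped and rotated (multiple of 90 degrees) copy."""
    if rng.random() < 0.5:
        img = np.fliplr(img)
    k = int(rng.integers(0, 4))
    return np.rot90(img, k)
```

Calling `augment` several times on each training image, with `rng = np.random.default_rng()`, is one simple way to expand a small dataset; whether flips and rotations are label-preserving depends on the task.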

Title: Pitfalls for applying Machine Learning in HCI-KDD: Things to be aware of and how to avoid them

Lecturer: Christof STOCKER <expertise>

Abstract: When dealing with big and unstructured data sets, we often try to be creative and to experiment with a number of different approaches for the purpose of knowledge discovery. This can lead to new insights and even spark novel ideas. However, ignorant application of algorithms to unknown data is dangerous and can lead to false conclusions – with high statistical significance. In finite data sets, structure can emerge from sheer randomness. Furthermore, hidden variables can lead to significant correlations that in turn might result in wrong conclusions. Beyond this, data science as a discipline has developed into a complex area in which mistakes can occur with ease and even lead experienced scientists astray. In this talk we will investigate these pitfalls together on simple examples and discuss how we can address these concerns with manageable effort.
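The claim that structure can emerge from sheer randomness is easy to reproduce. The following sketch is an illustration under assumed settings (sample size, feature count and seed are arbitrary and not from the talk): it searches pure noise for the feature that correlates best with a random target, then re-checks that feature on held-out samples.

```python
import numpy as np

def pearson_columns(X, y):
    """Pearson correlation of every column of X with the vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

def chance_correlation_demo(n_samples=60, n_features=2000, seed=0):
    """Pick the 'best' feature on one half of pure noise, validate on the other."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))  # features: noise only
    y = rng.standard_normal(n_samples)                # target: unrelated noise
    half = n_samples // 2
    # Discovery: search all features for the strongest correlation.
    r_disc = pearson_columns(X[:half], y[:half])
    best = int(np.argmax(np.abs(r_disc)))
    # Validation: evaluate only the selected feature on unseen samples.
    r_val = pearson_columns(X[half:, [best]], y[half:])[0]
    return float(r_disc[best]), float(r_val)
```

With thousands of candidate features and only a few dozen samples, the discovery correlation is typically large even though no relationship exists, while the held-out value has no reason to be. Holding data back, or correcting for the number of comparisons, is one manageable way to guard against this.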

 

 

Open PhD machine learning

PhD position in “Biomedical data sciences and machine learning” + 2 open MSc positions
in the context of the new competence center for biomarker discovery cbmed.org located at the Medical University Graz.

You
… have an MSc related to Information & Computer Science (e.g. Informatics, Software Engineering, Telematics, Mathematics, …)
… are eligible to enroll in the Doctoral School Computer Science at Graz University of Technology
… are interested in working within the hci-kdd.org group embedded in the international research community
… have experience with and interest in scientific work in an international context
… have a high interest in the topics data science and machine learning
… like undertaking theoretical, algorithmic, and experimental machine learning studies
… want to understand the problem of knowledge discovery from complex high-dimensional data sets

We
… are offering a PhD position (30 hours per week, 2100 Euro gross per month, 14 x, FWF salary) available immediately
(no closing date, the position will be filled when the ideal candidate has been found)
… are offering a contract for four years, with the opportunity to continue in a PostDoc position for another four years
… do research in information integration in the life sciences, particularly the integration of multiple heterogeneous data sources (e.g., -omics data, text data, image data, etc.), which constitutes the foundation for further machine-learning-based data analytics for biomarker discovery. Selected topics you will deal with at the beginning include researching how to integrate and analyse available data sources in the biomedical domain, developing a common representation and information fusion model for heterogeneous data sets, and developing and testing model-based infrastructures for information integration and fusion
… are offering a workplace within the vibrant, beautiful and student-friendly city of Graz in charming Austria

If
you are interested and motivated, please prepare
… a) your scientific résumé,
… b) a sample paper, and
… c) a research statement about your targeted scientific work within the four years (a PhD proposal)
by using the templates which you find here

and send everything as one single PDF file directly to a.holzinger@hci-kdd.org

We look forward to welcoming you to our group!

Geometric, Topological and Harmonic Trends to Image Processing: submissions due June 1, 2015

Special Issue on Geometric, Topological and Harmonic Trends to Image Processing

Pattern Recognition Letters

Submission deadline: June 1, 2015

Advanced topological measures from the numerical and algebraic perspective, combined with the geometric representations of physical objects and the sparse decomposition using harmonic transforms are generating novel methods for the study of n-dimensional digital or continuous images. The mutual interdependence between harmonic analysis, geometry and topology supports the thesis that these different sources of mathematical information are necessary to fully characterize the spatially structured clouds of points at any dimension. In this special issue, the focus will be on novel methods of multi-dimensional and multi-variate image analysis and image processing using computational harmonic or geometric-topological techniques and algorithms.

The applications envisaged are in multidisciplinary engineering, paying particular attention to recent trends in the industrial setting and in any image-related topic situated at the interplay between these computational areas.

Main Topics of Interest:

Use of harmonic analysis, topological and/or geometric information in image applications;
Computational harmonic analysis, topology or geometry applied to image processing;
Interactions between computational harmonic analysis, geometry and topology in image context;
Geometric and/or harmonic modeling guided by topological constraints;
Algorithm optimization for image applications, transfer of mathematical tools, parallel computation in image context and hierarchical approaches;
Pattern recognition from a harmonic, topological and/or geometrical viewpoint;
Combinatorial, geometric, topological, fractal or multi-resolution models;
Algebraic-topological and/or geometric invariants and features for n-dimensional images and their computation.

Submission Information:

Papers can have a maximum length of 10 pages in the journal template. See the detailed Guide for Authors here: http://www.elsevier.com/journals/pattern-recognition-letters/0167-8655/guide-for-authors

Submit your paper here: http://ees.elsevier.com/patrec. Make sure to select “SI: GeToHa” as the article type. Submission is possible from May 1, 2015. The submission deadline is June 1, 2015.

Papers will be reviewed according to the normal journal standards. Papers will receive at most two rounds of reviews. We will strive to finish the first round of review four to six weeks after submission.

For more information, please contact the Managing Guest Editor.

Pedro Real, Managing Guest Editor
Institute of Mathematics of Seville University (IMUS)
ETS. Ingeniería Informática, University of Seville, Spain

real@us.es

Darian Onchis Moaca, Guest Editor
Eftimie-Murgu University, Romania
http://homepage.univie.ac.at/darian.onchis/

Helena Molina-Abril, Guest Editor
The Maimonides Institute for Biomedical Research of Cordoba (IMIBIC), Spain

Mihail Gaianu, Guest Editor
West University of Timisoara, Romania

The future is in Open Data Sets

The idea of “open data” is not new. Many researchers have followed the notion that science is a public enterprise and that certain data should be openly available [1], and it has recently also become a big topic in the biomedical domain [2], [3]; e.g., the British Medical Journal (BMJ) started a big open data campaign [4]. The goal of the movement is similar to that of open source, open content or open access. With the launch of open government data initiatives, the open data movement gained momentum [5], and some already speak of an Open Knowledge Foundation [6]. Consequently, there are plenty of research challenges on this topic. Cancer research, for example, could dramatically benefit from science without any boundaries.

[1]   L. Rowen, G. K. S. Wong, R. P. Lane, and L. Hood, “Intellectual property – Publication rights in the era of open data release policies,” Science, vol. 289, pp. 1881-1881, Sep 2000.

[2]  G. Boulton, M. Rawlins, P. Vallance, and M. Walport, “Science as a public enterprise: the case for open data,” The Lancet, vol. 377, pp. 1633-1635, 2011.

[3]   A. Hersey, S. Senger, and J. P. Overington, “Open data for drug discovery: learning from the biological community,” Future Medicinal Chemistry, vol. 4, pp. 1865-1867, Oct 2012.

[4]  M. Thompson and C. Heneghan, “BMJ OPEN DATA CAMPAIGN We need to move the debate on open clinical trial data forward,” British Medical Journal, vol. 345, Dec 2012.

[5]  N. Shadbolt, K. O’Hara, T. Berners-Lee, N. Gibbins, H. Glaser, W. Hall, et al., “Open Government Data and the Linked Data Web: Lessons from data.gov.uk,” IEEE Intelligent Systems, pp. 16-24, 2012.

[6]   J. C. Molloy, “The Open Knowledge Foundation: Open Data Means Better Science,” Plos Biology, vol. 9, Dec 2011.

Here are some sample data sets:

[7] 1000 Genomes: A deep catalog of human genetic variation. The project sequenced the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. It contains about 2,500 samples from 2010 and 2011:
http://www.1000genomes.org/ftpsearch

1000 Genomes Project Consortium and others. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-1073.

[8] Tiny Images dataset: The data set consists of over 79 million color images, stored in a 227 GB binary file. A Matlab toolbox to access the images is provided. Automatic annotation data is available for all images, but manual annotation data is only available for a smaller portion:

A. Torralba and R. Fergus and W. T. Freeman. 2008. 80 Million Tiny Images: a Large Database for Non-Parametric Object and Scene Recognition. IEEE PAMI, 30 (11), 1958-1970.

[9] For becoming familiar with the abundance of different skin diseases, Healthline.com provides a very informative collection of skin images: http://www.healthline.com/health/skin-disorders

[10]  Breast Tumor (gene expression) data of Van’t Veer (2002): The training data set consists of 78 primary breast cancers, of which 34 patients developed metastases within 5 years. The test set contains 19 breast cancer patients, of which 12 developed metastases within 5 years. The data contain 24188 gene expression levels. The general goal is to predict metastases in order to improve the therapy strategy:
http://www.stats.uwo.ca/faculty/aim/2015/9850/microarrays/FitMArray/chm/Veer.html

Van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., Van Der Kooy, K., Marton, M. J. & Witteveen, A. T. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, (6871), 530-536. 

[11] The UCI Machine Learning Repository of the Center for Machine Learning and Intelligent Systems, University of California, Irvine, maintains 313 data sets as a service to the machine learning community:
http://archive.ics.uci.edu/ml/datasets.html

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[12] Data.Medicare.gov: Data sets from Medicare.gov for downloading, exploring, and visualizing. Direct access to data sets, including data sets from hospitals, nursing homes, physicians, homes, suppliers and other facilities, is provided. The data give general information about the quality of care in these facilities:
https://data.medicare.gov/

[13] re3data.org: a Registry of Research Data Repositories. Research data repositories from different academic disciplines are featured here. The project promotes a culture of sharing between researchers. It started in 2012 and is funded by the German Research Foundation:
http://www.re3data.org/

Pampel H, Vierkant P, Scholze F, Bertelmann R, Kindling M, et al. 2013. Making Research Data Repositories Visible: The re3data.org Registry. PLoS ONE, 8 (11). 

[14] Time series data, as sequences of point sets collected over a time interval, are widely used, e.g. in biomedicine (heart rate, ECG, EEG, etc.), but also in many other fields, e.g. in astronomy or earthquake prediction. The University of California Riverside (UCR) Time Series Classification and Clustering Collection has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering:
http://www.cs.ucr.edu/~eamonn/time_series_data/

Keogh, E. & Kasetty, S. 2003. On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and knowledge discovery, 7, (4), 349-371.

[15] The MNIST database of handwritten digits includes a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image:
http://yann.lecun.com/exdb/mnist/

Liu, C. L., Nakashima, K., Sako, H. & Fujisawa, H. 2003. Handwritten digit recognition: benchmarking of state-of-the-art techniques. Pattern Recognition, 36, (10), 2271-2285.
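The MNIST files are distributed in the simple IDX binary format: a four-byte header (two zero bytes, a data-type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension and then the raw data. The sketch below reads such a file into a NumPy array. It handles only the unsigned-byte type used by MNIST, and the function names and the local file path in the usage note are assumptions for illustration.

```python
import gzip
import struct
import numpy as np

def parse_idx(data):
    """Decode the bytes of an IDX file holding unsigned-byte data."""
    zero, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian 32-bit integer per dimension follows the header.
    shape = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    body = np.frombuffer(data, dtype=np.uint8, offset=4 + 4 * ndim)
    return body.reshape(shape)

def read_idx(path):
    """Read an IDX file from disk, transparently handling .gz archives."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        return parse_idx(f.read())
```

Assuming the training images have been downloaded to the working directory, `read_idx("train-images-idx3-ubyte.gz")` should yield an array of shape (60000, 28, 28).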

[16] KONECT (the Koblenz Network Collection) gathers large network datasets of various types. The over 200 open datasets are collected by the Institute of Web Science and Technologies at the University of Koblenz-Landau.
http://konect.uni-koblenz.de/

Kunegis, Jérôme (2013). KONECT – The Koblenz Network Collection. Proc. Int. Conf. on World Wide Web Companion, pages 1343-1350.

[17] Kaggle offers competitions and thus provides many different kinds of real-world open data for scientists.
https://www.kaggle.com/

[18] CKAN serves as a data management tool used by organizations, research institutions and governments since 2006. It has been developed by the Open Knowledge Foundation.
http://datahub.io/

[19] The goal of healthdata.gov is to make health data more accessible for research. It contains over 1800 datasets at the moment.
http://www.healthdata.gov/

[20] Socrata is a cloud software company which also provides open datasets of many different topics.
https://opendata.socrata.com/
 

 

Machine Learning in Nature

Apart from occasional news entries, computer science rarely makes it into Nature. A quick count in the Web of Science results in 33 articles, the last one – a year ago – by Ekert, A. & Renner, R. 2014. The ultimate physical limits of privacy. Nature, 507, (7493), 443-447, and the most prominent one surely the one with 3,200 citations by Strogatz, S. H. 2001. Exploring complex networks. Nature, 410, (6825), 268-276.

Now, machine learning has made it into Nature: The group of DeepMind Technologies founded by Demis Hassabis in 2011 as a start-up company, and purchased by Google for approx. 400 Million USD in 2014, has published a paper, which appeared today, 26.02.2015:

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature, 518, (7540), 529-533.

Abstract: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. 
This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
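For readers unfamiliar with the temporal difference idea behind the deep Q-network, the following is a minimal tabular Q-learning sketch. It is not the method of the paper: the table stands in for the deep network, the five-state corridor environment is invented for illustration, and actions are chosen uniformly at random during learning (which is permissible here because Q-learning is off-policy).

```python
import numpy as np

N_STATES = 5           # corridor cells 0..4 (hypothetical toy environment)
GOAL = N_STATES - 1    # reaching the right-most cell ends the episode
ACTIONS = (-1, +1)     # action 0: move left, action 1: move right

def step(state, action_index):
    """Move within the corridor; reward 1 only when the goal is reached."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values with the temporal difference Q-learning update."""
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = int(rng.integers(len(ACTIONS)))  # random exploration
            next_state, reward, done = step(state, action)
            # TD target: reward plus discounted best value of the next state.
            target = reward if done else reward + gamma * q[next_state].max()
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
    return q
```

After training, `q_learning()[:GOAL].argmax(axis=1)` gives the greedy action in each non-terminal cell, which in this toy setting is to move right everywhere.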

Subject terms: Computer Science

The editor’s summary: For an artificial agent to be considered truly intelligent it needs to excel at a variety of tasks considered challenging for humans. To date, it has only been possible to create individual algorithms able to master a single discipline – for example, IBM’s Deep Blue beat the human world champion at chess but was not able to do anything else. Now a team working at Google’s DeepMind subsidiary has developed an artificial agent – dubbed a deep Q-network – that learns to play 49 classic Atari 2600 ‘arcade’ games directly from sensory experience, achieving performance on a par with that of an expert human player. By combining reinforcement learning (selecting actions that maximize reward – in this case the game score) with deep learning (multilayered feature extraction from high-dimensional data – in this case the pixels), the game-playing agent takes artificial intelligence a step nearer the goal of systems capable of learning a diversity of challenging tasks from scratch.

More information: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html#tables

 

Feb, 17, 2015 > Seminar Talk by Hubert Wagner

Title: Topological analysis of text data.

Lecturer: Hubert WAGNER <expertise>

Abstract: In this talk an ongoing effort will be described to apply persistent homology in the area of text data mining. Persistent homology is the main tool of topological data analysis. In essence, it allows one to robustly describe the shape of a data set, and to compare the shapes of different data sets.
First, persistent homology will be explained, emphasizing its intuitive side.
Then, it will be demonstrated how persistent homology can be applied in the context of analyzing sets of text documents. Using the vector space model interpretation, each document becomes a point in a high-dimensional space, and it is intuitive to ask about the shape of such a point cloud. It will be discussed how this information can be used for knowledge discovery. Finally, an algorithmic aspect is emphasized, which is crucial if industrial applications are to be tackled.
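As a rough illustration of the idea, the sketch below turns documents into bag-of-words vectors and computes only the simplest part of persistent homology, namely dimension 0: the scales at which connected components of the point cloud merge. It is not the speaker's implementation. The whitespace tokenisation, the use of cosine distance as the dissimilarity, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Vector space model: a document becomes a sparse term-count vector."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return max(0.0, 1.0 - dot / (na * nb))

def h0_persistence(dist):
    """Scales at which components merge as the distance threshold grows.

    Every point is born as its own component at scale 0; each returned value
    is the death scale of one component. One component never dies.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge joins two components: one of them dies
            parent[ri] = rj
            deaths.append(d)
    return deaths

def document_deaths(documents):
    """Dimension-0 persistence of a small document collection."""
    vectors = [bag_of_words(doc) for doc in documents]
    dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
    return h0_persistence(dist)
```

Components that die late correspond to groups of documents that stay separated over a wide range of scales, which is the kind of robust shape information the abstract refers to. Higher-dimensional features (loops, voids) require a full persistent homology library.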

Biography: Hubert Wagner is a computer scientist, currently working as a Postdoc at the Institute of Science and Technology Austria (IST-Austria) at the Edelsbrunner Group. Having worked as a software engineer, he moved towards science and obtained a PhD degree in 2014 from the Jagiellonian University in Krakow, Poland. Hubert is interested in the application of computational geometry and topology and related algorithmic questions. He is convinced that tools such as persistent homology may offer novel and robust solutions to many problems he encountered as an engineer, including e.g. problems in text mining. This line of his research was supported by a Google Research Grant from 2011 to 2012 (with Prof. Marian Mrozek and Dr. Paweł Dłotko) and is now continued within the Topological Complex Systems (TOPOSYS) grant. Efficient algorithms and their implementations are an important part of his work.

More Information: https://publist.ist.ac.at/ist/people/180-Hubert_Wagner/works

Topological Analysis for Text Data

January, 27, 2015, Seminar Talk by Barbara Di Fabio

Title: Geometric-topological tools for shape description

Lecturer: Barbara DI FABIO

Abstract: In shape comparison a widely used scheme is to measure the dissimilarity between signatures associated with each shape rather than matching the shapes themselves. In this context, computational topology plays an important role, offering a series of techniques and measures with an extremely high abstraction power. Persistent homology and Reeb graphs provide signatures able to describe shapes from topological and geometrical perspectives, both approaches being grounded in classical Morse theory. The common idea underlying these methods is to perform a topological exploration of the shape according to some quantitative geometric properties provided by a real-valued function defined on the shape and chosen to extract shape features. This seminar provides an overview of these shape descriptors with related comparison methods, their main properties and drawbacks, some of the main theoretical and experimental results, recent developments, open issues and future perspectives.

Biography: Barbara DI FABIO was born in Lanciano (Italy) in 1977. In 2004, she graduated cum laude and, in 2009, received her Ph.D. degree in Mathematics at the University of Bologna with a work on the enhancement of geometrical tools for pattern recognition. Since then, she has been a post-doctoral fellow at the excellence centre ARCES “E. De Castro” (University of Bologna) and at the Department of Mathematics (Prof. Massimo FERRI, University of Bologna). Barbara’s main research interests are focused on computational geometry and topology and include problems of shape analysis and understanding with related applications in computer vision, computer graphics and pattern recognition – highly relevant for machine learning and knowledge discovery. She has attended several postgraduate schools and workshops, and has participated in and authored several communications at national and international scientific conferences. She is author of 9 peer-reviewed papers, 5 proceedings and 1 preprint. She is a referee for several international journals. Since 2005 she has been teaching in undergraduate courses in Engineering and Economics, University of Bologna. At present, supported by an ESF exchange visit grant, she is working with Professor Neza Mramor Kosta at the Faculty of Computer and Information Science, University of Ljubljana.

More Information: http://www.dm.unibo.it/~difabio/

Geometric-topological tools for shape description

 

Merry Christmas and a Happy 2015 from the Holzinger Group

Open Postdoc Position in interactive Machine Learning with complex biomedical data

A postdoc position in “knowledge discovery and interactive machine learning with complex biomedical data sets” is available immediately at the Holzinger Group (hci-kdd.org) in Graz, Austria. The position will be financed for four years, with an option to continue for another four years, by the newly formed CBmed – Center for Biomarker Discovery and is supported by the PhD school “Biomarker discovery”, which starts on October 1, 2015.

The challenge: Worldwide there is rising interest in biomarker discovery as an important step towards P4-medicine. The data result from various sources in different structural dimensions, and a systematic and comprehensive exploration of all these data provides a mechanism for data-driven hypothesis generation. A grand challenge is to make sense of these complex data sets by applying machine learning algorithms based on the “human-in-the-loop” concept, which is of emerging interest for the international research community.

The applicant should:
1) hold a PhD in machine learning, data mining, knowledge discovery or related area of modern data science;
2) have a strong research record, documented by publications at first-tier related conferences and journals;
3) have an interest in advanced methodological approaches and enjoy working in a young research group following the motto
“Science is to test crazy ideas, engineering is to bring these ideas into Business”

The successful candidate shall take an active role in the further development of our research group. Communication skills and fluency in English are required.
Conditions of employment: This post-doctoral position is provided for four years with an option for another four years. The starting date is flexible; there is no fixed deadline, so applications will be considered until the position is filled with the optimal candidate.

Application procedure: Formal applications should include:
1) A scientific curriculum vitae, including a full list of publications;
2) A statement of research interests with an outlook for the coming 4 (8) years;
3) Contact details of three reference persons.

Apply by sending your application as one single PDF document, indicating Postdoc HCI-KDD in the header directly to
Prof.Dr. Andreas HOLZINGER via e-Mail: a.holzinger@hci-kdd.org

About the group: The Holzinger Group works consistently on a synergistic combination of methodologies and approaches of two areas that offer ideal conditions towards unraveling these problems: Human-Computer Interaction (HCI) and Knowledge Discovery/Data Mining (KDD), with the goal of supporting human intelligence with machine learning – human-in-the-loop – to discover novel, previously unknown insights into the data.
For more details please refer to: http://hci-kdd.org/about-us

Note: The working language of both the Holzinger Group and the PhD school is English.

Keywords: interactive machine learning, knowledge discovery, data mining, human-in-the-loop, biomedical informatics

Nature repository for openly accessible scientific data from all disciplines

Nature Scientific Data is a recently launched open-access, online-only journal for openly accessible scientific data from all disciplines. The articles are called Data Descriptors, and combine traditional narrative content with curated descriptions of research data to support reproducibility, as this may accelerate scientific discovery, see: http://www.nature.com/sdata/

Goudiaby, V., Zuidema, P. A. & Mohren, G. M. J. 2014. Data storage: Overcome hurdles to global databases. Nature, 511, (7510), 410-410.

Editorial to Nature Volume 515, Issue 7527 > Data-access practices strengthened