The Holzinger Group fully supports the “open” movement, i.e. open access, open source and open data. The idea of “open data” is not new. Many researchers in the past followed the notion that science is a public enterprise and that certain data should be openly available [1], and it has recently also become a big topic in the biomedical domain [2], [3]; e.g. the British Medical Journal (BMJ) started a big open data campaign [4]. The goal of the movement is similar to approaches of open source, open content or open access. With the launch of open data government initiatives the open data movement gained momentum [5] and some speak already about an Open Knowledge Foundation [6]. Consequently, there are plenty of research challenges on this topic. Cancer research, for example, could dramatically benefit from science without any boundaries.

[1] L. Rowen, G. K. S. Wong, R. P. Lane, and L. Hood, “Intellectual property – Publication rights in the era of open data release policies,” Science, vol. 289, pp. 1881-1881, Sep 2000.

[2] G. Boulton, M. Rawlins, P. Vallance, and M. Walport, “Science as a public enterprise: the case for open data,” The Lancet, vol. 377, pp. 1633-1635, 2011.

[3] A. Hersey, S. Senger, and J. P. Overington, “Open data for drug discovery: learning from the biological community,” Future Medicinal Chemistry, vol. 4, pp. 1865-1867, Oct 2012.

[4] M. Thompson and C. Heneghan, “BMJ Open Data Campaign: We need to move the debate on open clinical trial data forward,” British Medical Journal, vol. 345, Dec 2012.

[5] N. Shadbolt, K. O’Hara, T. Berners-Lee, N. Gibbins, H. Glaser, W. Hall, et al., “Open Government Data and the Linked Data Web: Lessons from data.gov.uk,” IEEE Intelligent Systems, pp. 16-24, 2012.

[6] J. C. Molloy, “The Open Knowledge Foundation: Open Data Means Better Science,” Plos Biology, vol. 9, Dec 2011.

Here are some sample data sets:

[7] 1000 Genomes: A deep catalog of human genetic variation. The project sequenced the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. It contains about 2,500 samples from 2010 and 2011:
http://www.1000genomes.org/ftpsearch

1000 Genomes Project Consortium and others. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-1073.

[8] Tiny Images dataset: The data set consists of over 79 million color images, stored in a single 227 GB binary file. A Matlab toolbox to access the images is provided. Automatic annotation data is available for all images, but manual annotation data is only available for a smaller portion:

A. Torralba and R. Fergus and W. T. Freeman. 2008. 80 Million Tiny Images: a Large Database for Non-Parametric Object and Scene Recognition. IEEE PAMI, 30 (11), 1958-1970.

[9] To get familiar with the abundance of different skin diseases, Healthline.com provides a very informative collection of skin images: http://www.healthline.com/health/skin-disorders

[10] Breast Tumor (gene expression) data of Van’t Veer (2002): The training data set consists of 78 primary breast cancers, of which 34 patients developed metastases within 5 years. The test set contains 19 breast cancer patients, of which 12 developed metastases within 5 years. The data contains 24,188 gene expression levels. The general goal is to predict metastases in order to improve the therapy strategy:
http://www.stats.uwo.ca/faculty/aim/2015/9850/microarrays/FitMArray/chm/Veer.html

Van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., Van Der Kooy, K., Marton, M. J. & Witteveen, A. T. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, (6871), 530-536.

[11] UCI Machine Learning Repository: The Center for Machine Learning and Intelligent Systems at the University of California, Irvine, maintains 313 data sets as a service to the machine learning community:
http://archive.ics.uci.edu/ml/datasets.html

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[12] Data.Medicare.gov: Data sets from Medicare.gov for downloading, exploring, and visualizing. Direct access to data sets, including data sets from hospitals, nursing homes, physicians, homes, suppliers and other facilities, is provided. The data gives general information about the quality of care in these facilities:
https://data.medicare.gov/

[13] re3data.org: a Registry of Research Data Repositories. Research data repositories from different academic disciplines are featured here. The project promotes a culture of sharing between researchers. It started in 2012 and is funded by the German Research Foundation:
http://www.re3data.org/

Pampel H, Vierkant P, Scholze F, Bertelmann R, Kindling M, et al. 2013. Making Research Data Repositories Visible: The re3data.org Registry. PLoS ONE, 8 (11).

[14] Time series data, as a sequence of point sets collected over a time interval, are widely used, e.g. in biomedicine (heart rate, ECG, EEG, etc.), but also in many other fields, e.g. in astronomy or earthquake prediction. The University of California Riverside (UCR) Time Series Classification and Clustering Collection has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering:
http://www.cs.ucr.edu/~eamonn/time_series_data/

Keogh, E. & Kasetty, S. 2003. On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and knowledge discovery, 7, (4), 349-371.
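
To illustrate what time series classification means in practice, the following is a minimal sketch of a one-nearest-neighbour classifier with Euclidean distance on z-normalized series, a simple baseline for this kind of data. The function names and the toy data are hypothetical and are not taken from the UCR collection; the sketch also assumes that all series have the same length.

```python
import math

def z_normalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        # A constant series carries no shape information.
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]

def euclidean(a, b):
    """Euclidean distance between two series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_1nn(train, query):
    """Return the label of the training series closest to the query.

    `train` is a list of (series, label) pairs (assumed layout).
    """
    normalized_query = z_normalize(query)
    best_label, best_distance = None, math.inf
    for series, label in train:
        distance = euclidean(z_normalize(series), normalized_query)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

# Toy example: two labelled shapes and one unlabelled query.
train = [([1, 2, 3, 4], "rising"), ([4, 3, 2, 1], "falling")]
print(classify_1nn(train, [10, 20, 30, 40]))
```

Because of the z-normalization, the query is matched by its shape rather than by its absolute values, which is why a series on a different scale can still be assigned to the "rising" class.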

[15] The MNIST database of handwritten digits includes a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image:
http://yann.lecun.com/exdb/mnist/

Liu, C. L., Nakashima, K., Sako, H. & Fujisawa, H. 2003. Handwritten digit recognition: benchmarking of state-of-the-art techniques. Pattern Recognition, 36, (10), 2271-2285.
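
As a rough sketch of how such fixed-size images can be read: the image files at the URL above are, to our knowledge, distributed in the big-endian IDX format, i.e. a 16-byte header (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The function name parse_idx_images is hypothetical, and the code is an illustration under these assumptions rather than an official loader.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # assumed magic number of IDX image files (0x00000803)

def parse_idx_images(raw):
    """Parse an IDX image buffer into a list of images.

    Each image is returned as a list of rows, each row a list of
    pixel intensities in the range 0..255.
    """
    if len(raw) < 16:
        raise ValueError("buffer too short for an IDX image header")
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError("unexpected magic number: %d" % magic)
    if len(raw) != 16 + count * rows * cols:
        raise ValueError("buffer size does not match the header")

    images = []
    offset = 16
    for _ in range(count):
        image = [
            list(raw[offset + r * cols: offset + (r + 1) * cols])
            for r in range(rows)
        ]
        images.append(image)
        offset += rows * cols
    return images

# Synthetic buffer with two 2x3 "images", used here instead of a real download.
demo = struct.pack(">IIII", IDX_IMAGE_MAGIC, 2, 2, 3) + bytes(range(12))
print(parse_idx_images(demo)[0])
```

The downloadable files are gzip-compressed, so in practice the bytes would typically be obtained with gzip.open(path, "rb").read(), where path is whatever local file name the download was saved under.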

[16] KONECT (the Koblenz Network Collection) gathers large network datasets of various types. Its more than 200 open datasets are collected by the Institute of Web Science and Technologies at the University of Koblenz-Landau.
http://konect.uni-koblenz.de/

Kunegis, Jérôme (2013). KONECT – The Koblenz Network Collection. Proc. Int. Conf. on World Wide Web Companion, pages 1343-1350.

[17] Kaggle offers competitions and thus provides many different kinds of real-world open data for scientists.
https://www.kaggle.com/

[18] CKAN serves as a data management tool used by organizations, research institutions and governments since 2006. It has been developed by the Open Knowledge Foundation.
http://datahub.io/

[19] The goal of healthdata.gov is to make health data more accessible for research. It contains over 1800 datasets at the moment.
http://www.healthdata.gov/

[20] Socrata is a cloud software company which also provides open datasets on many different topics.
https://opendata.socrata.com/

Data repositories of general interest:

  • GenBank: GenBank is a genetic sequence database from the National Institutes of Health, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.
    http://www.ncbi.nlm.nih.gov/genbank/
  • EMBL: The European Bioinformatics Institute is part of the European Molecular Biology Laboratory and maintains the world’s most comprehensive range of freely available and up-to-date molecular databases. The services let users share data, perform complex queries and analyse the results. Everybody can download data and software, or use web services. More information can be found in the journal Nucleic Acids Research.
    http://www.ebi.ac.uk/services
  • HMCA: The Health and Medical Care Archive is a data archive of the Robert Wood Johnson Foundation that preserves and disseminates data collected by selected research projects and facilitates secondary analyses of the data. The data collections in HMCA include surveys of health care professionals and organizations, investigations of access to medical care, surveys on substance abuse, and evaluations of innovative programs for the delivery of health care. Its goal is to increase understanding of health and health care in the United States through secondary analysis.
    http://www.icpsr.umich.edu/icpsrweb/HMCA/index.jsp

Data repositories specialized by data types:

  • UniProtKB/Swiss-Prot: is a manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB).
    It is a high-quality, annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions.
    http://www.uniprot.org/uniprot/
  • MMMP: is an open access interactive multidatabase for research on melanoma biology and treatment.
    http://www.mmmp.org/MMMP/
  • KEGG: is a database for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
    http://www.genome.jp/kegg/
  • PDB: Since 1971, the Protein Data Bank archive has served as a repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies. The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that the PDB is freely and publicly available to the global community.
    http://www.wwpdb.org/

Data repositories specialized by organism:

  • WormBase: is an international consortium of biologists and computer scientists dedicated to providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes. Founded in 2000, the WormBase Consortium is led by Paul Sternberg of CalTech, Paul Kersey of the EBI, Matt Berriman of the Wellcome Trust Sanger Institute, and Lincoln Stein of the Ontario Institute for Cancer Research.
    http://www.wormbase.org
  • FlyBase: is a data repository project carried out by a consortium of Drosophila researchers and computer scientists at: Harvard University, University of Cambridge (UK), Indiana University and the University of New Mexico.
    http://flybase.org/
  • Human Brain NeuroMorpho: is a centrally curated inventory of digitally reconstructed neurons associated with peer-reviewed publications. It contains contributions from over 100 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared. To date, NeuroMorpho.Org is the largest collection of publicly accessible 3D neuronal reconstructions and associated metadata.
    http://neuromorpho.org/