4th Workshop on Data Mining for Medicine and Healthcare

May 2, 2015, Vancouver, BC, Canada

To be held in conjunction with the 15th SIAM International Conference on Data Mining (SDM 2015)


In virtually every country, the cost of healthcare is increasing more rapidly than the willingness and the ability to pay for it. At the same time, more and more data is being captured around healthcare processes in the form of Electronic Health Records (EHR), health insurance claims, medical imaging databases, disease registries, spontaneous reporting sites, and clinical trials. As a result, data mining has become critical to the healthcare world. On the one hand, EHRs offer the kind of data that gets data miners excited; on the other hand, they come with challenges such as 1) the unavailability of large sources of data to academic researchers, and 2) limited access to data-mining experts. Healthcare entities are reluctant to release their internal data to academic researchers, and in most cases there is limited interaction between industry practitioners and academic researchers working on related problems.

The objectives of this workshop are:
1. Bring together researchers (from both academia and industry) as well as practitioners to present their latest problems and ideas.
2. Attract healthcare providers who have access to interesting sources of data and problems but lack the expertise in data mining to use the data effectively. 
3. Enhance interactions between data mining, text mining and visual analytics communities working on problems from medicine and healthcare.

Topics of Interest

Topic areas for the workshop include (but are not limited to) the following:

  • Statistical analysis and characterization of healthcare data
  • Text mining - mining free text in electronic medical records
  • Visual analysis and exploration of longitudinal clinical trial data
  • Meaningful use of healthcare data for improved patient care and cost-reduction
  • Data quality assessment and improvement: preprocessing, cleaning, missing data treatment etc.
  • Pattern detection and hypothesis generation from observational data
  • Visualization of prescription drugs and interactions
  • Privacy and security issues in healthcare
  • Information fusion and knowledge transfer in healthcare
  • Evolutionary and longitudinal patient and disease models
  • Medical fraud detection
  • Hospital readmission analytics
  • Support for ICD-9 to ICD-10 conversions
  • Health information exchanges

Program Committee

Riccardo Bellazzi, University of Pavia
Andreas Holzinger, Medical University Graz
Zoran Obradovic, Temple University
Mykola Pechenizkiy, Technical University Eindhoven
Chandan Reddy, Wayne State University
Niels Peek, University of Manchester
Igor Pernek, Research Studios Austria
Ping Zhang, IBM T.J. Watson Research Center
Jiayu Zhou, Samsung Research America

Important Dates

Paper Submission: February 2, 2015 (extended)

Notification of Acceptance: February 5, 2015

Camera Ready Paper Due: February 9, 2015

Workshop: May 2, 2015

Submission Information

All submissions must be made electronically at https://easychair.org/conferences/?conf=sdmdmmh2015.

Papers submitted to this workshop must not have been accepted by, or be under review at, another conference with published proceedings or a journal. The work may be either theoretical or applied.

The workshop accepts short papers (4-6 pages) and long papers (up to 9 pages) formatted for US Letter (8.5" x 11") paper size (single-spaced, two-column, 10-point font, with at least a 1" margin on each side). Papers must have an abstract of at most 300 words and a keyword list of no more than 6 keywords.

We would like to encourage you to prepare your paper in LaTeX2e. Papers should be formatted using the SIAM SODA macro, which is available through the SIAM website. You can access it at http://www.siam.org/proceedings/macros.php. The filename is soda2e.all. Make sure you use the macros for SODA and Data Mining Proceedings; papers prepared using other proceedings macros will not be accepted.

Microsoft Word users should convert their documents to PDF format. Since there is no Microsoft Word template, please visit http://www.siam.org/proceedings/ to view the format of previous papers.

All submissions should clearly present the author information, including the authors' names, affiliations, and email addresses.

Workshop Schedule

May 2, Saturday

8:30 – 9:30

Invited talk I

Big Data for Personalized Medicine: a case study of Biomarker Discovery

Speaker: Raymond Ng

9:30 – 10:00

Improving confidence while predicting trends in temporal disease networks

Djordje Gligorijevic, Jelena Stojanovic and Zoran Obradovic


A simple and effective method for anomaly detection in healthcare

Luiz Fernando Carvalho, Carlos Teixeira, Ester Dias, Wagner Meira Jr. and Osvaldo Carvalho

10:00 – 10:30

Coffee Break

10:30 – 11:30

Invited talk II

Outlier Detection for Clinical Monitoring and Alerting

Speaker: Milos Hauskrecht

11:30 – 12:00

Mining Strongly Correlated Intervals with Hypergraphs on Health Social Network

Hao Wang, Dejing Dou, Yue Fang and Yongli Zhang


Data Analytics Framework for A Game-based Rehabilitation System

Jiongqian Liang, David Fuhry, David Maung, Roger Crawfis, Lynne Gauthier, Arnab Nandi and Srinivasan Parthasarathy

12:00 – 13:30

Lunch Break (on your own)

13:30 – 14:30

Invited talk III

Semantic Data Mining and Deep Learning in Health Datasets

Speaker: Dejing Dou

14:30 – 15:00

Hospital Corners and Wrapping Patients in Markov Blankets

Dusan Ramljak, Adam Davey, Alexey Uversky, Shoumik Roychoudhury and Zoran Obradovic


Modelling Trajectories for Diabetes Complications

Pranjul Yadav, Lisiane Pruinelli, Andrew Hangsleben, Sanjoy Dey, Katherine Hauwiller, Bonnie Westra, Connie Delaney, Vipin Kumar, Michael Steinbach and Gyorgy Simon

15:00 – 15:30

Short Break

15:30 – 16:30

Invited talk IV

Efficient and Effective Algorithms for Analyzing Large Scale Genetic Interactions

Speaker: Xiang Zhang

16:30 – 16:35


Invited Speakers

Raymond Ng
University of British Columbia

Title: Big Data for Personalized Medicine: a case study of Biomarker Discovery

Abstract: Personalized medicine has been hailed as one of the main frontiers for medical research in this century. In the first half of the talk, we will give an overview of our projects that use gene expression, proteomics, DNA and clinical features for biomarker discovery. In the second half of the talk, we will describe some of the challenges involved in biomarker discovery. One of the challenges is the lack of quality assessment tools for data generated by ever-evolving genomics platforms. We will conclude the talk by giving an overview of some of the techniques we have developed for data cleansing and pre-processing.

Bio: Dr. Raymond Ng is a professor in Computer Science at the University of British Columbia. His main research area for the past two decades has been data mining, with a specific focus on health informatics and text mining. He has published over 180 peer-reviewed publications on data clustering, outlier detection, OLAP processing, health informatics and text mining. He is the recipient of two best paper awards: one from the 2001 ACM SIGKDD conference, which is the premier data mining conference worldwide, and one from the 2005 ACM SIGMOD conference, which is one of the top database conferences worldwide. He was one of the program co-chairs of the 2009 International Conference on Data Engineering, and one of the program co-chairs of the 2002 ACM SIGKDD conference. He was also one of the general co-chairs of the 2008 ACM SIGMOD conference.
For the past decade, Dr. Ng has co-led several large-scale genomic projects, funded by Genome Canada, Genome BC and industrial collaborators. The total funding of those projects well exceeds 40 million Canadian dollars. He now holds the Chief Informatics Officer position of the PROOF Centre of Excellence, which focuses on biomarker development for end-stage organ failures.

Dejing Dou
University of Oregon

Title: Semantic Data Mining and Deep Learning in Health Datasets 

Abstract: Ontologies have been widely used in the biomedical and health domains. Semantic data mining refers to data mining tasks that systematically incorporate domain knowledge, especially formal semantics, into the process. The use of ontologies in mining healthcare and medical data is a natural fit. In this talk, we focus on the use of formal (Semantic Web) ontologies in two data mining and deep learning tasks on health datasets. We applied semantic association mining to electronic health records to discover indirect (hidden) associations among diseases and drugs. We also applied ontology-based deep learning to a health social network to predict human physical activities.

Bio: Dr. Dejing Dou is an Associate Professor in the Computer and Information Science Department at the University of Oregon and leads the Advanced Integration and Mining (AIM) Lab. He received his bachelor's degree from Tsinghua University, China in 1996 and his Ph.D. degree from Yale University in 2004. His research areas include ontologies, data mining, data integration, biomedical and health informatics, and the Semantic Web. Dejing Dou has published more than 50 research papers, some of which appear in prestigious conferences and journals such as KDD, ICDM, SDM, CIKM, ISWC, JIIS and JoDS. His KDD'07 paper was nominated for the best research paper award. He is on the Editorial Boards of Journal on Data Semantics and Journal of Intelligent Information Systems. Dejing Dou has received over $4 million in research grants as PI from the NSF and the NIH.

Xiang Zhang
Case Western Reserve University

Title: Efficient and Effective Algorithms for Analyzing Large Scale Genetic Interactions

Abstract: Advanced biotechnologies have made high-throughput data collection feasible in human and other model organisms. The availability of such data holds promise for dissecting complex biological processes. Making sense of the flood of biological data poses great statistical and computational challenges.
In this talk, I will discuss the problem of finding gene-gene interactions in high-throughput genetic data. Finding genetic interactions is an important biological problem since many common diseases are caused by joint effects of genes. Previously, it was considered intractable to find genetic interactions at the whole-genome scale due to the enormous search space. The problem was commonly addressed using heuristics, which do not guarantee the optimality of the solution. I will show that by utilizing the upper bound of the test statistic and effectively indexing the data, we can dramatically prune the search space and reduce the computational burden. Moreover, our algorithms are guaranteed to find the optimal solution. In addition to handling specific statistical tests, our algorithms can be applied to a wide range of study types by utilizing convexity, a property shared by many commonly used statistics.

Bio: Xiang Zhang is the T&D Schroeder Assistant Professor in the Electrical Engineering and Computer Science Department at Case Western Reserve University. He received his PhD in Computer Science from the University of North Carolina at Chapel Hill in 2011. His research interests include data mining, bioinformatics, and databases. His publications have received several awards, including the Best Research Paper Award at SIGKDD’08, the Best Student Paper Award at ICDE’08, and selection as one of the best papers at SDM'12. His PhD dissertation won an honorable mention for the SIGKDD Dissertation Award in 2012.

Zenglin Xu (Canceled)
University of Electronic Science and Technology of China

Title: Association Discovery and Diagnosis of Alzheimer’s Disease

Abstract: In biological and biomedical research, the analysis and diagnosis of many complex diseases, e.g., Alzheimer’s disease, can be based on a number of data sources or views, such as genetic variations and phenotypic traits. This brings a new machine learning setting where the objectives are twofold: to make a diagnosis and to study the association between the genetic variations and the phenotypic traits.
In this talk, we discuss a new sparse Bayesian approach for joint association study and disease diagnosis. In this approach, common latent features are extracted from different data sources based on sparse projection matrices and used to predict multiple disease severity levels; in return, the disease status can guide the discovery of relationships between data sources. I will also discuss how to take advantage of linkage disequilibrium (LD), which measures the non-random association of alleles, to guide the selection of genes. Finally, I present an analysis of imaging genetics datasets for the study of Alzheimer’s disease.

Bio: Zenglin Xu is a Professor in the School of Computer Science and Engineering at the University of Electronic Science and Technology of China. He obtained his PhD in Computer Science and Engineering from the Chinese University of Hong Kong. His research interests include machine learning and its applications to social network analysis, health informatics, and cyber security analytics. He has published over 30 papers in prestigious journals and conferences such as NIPS, ICML, IJCAI, AAAI, UAI, CIKM, ICDM, IEEE PAMI, IEEE TNN, etc. He is also the recipient of the best student paper honorable mention of AAAI 2015. Dr. Xu has been a PC member or reviewer for top conferences such as NIPS, ICML, AAAI, IJCAI, etc. He regularly serves as a reviewer for IEEE TPAMI, JMLR, PR, IEEE TNN, IEEE TKDD, ACM TKDD, etc.

Milos Hauskrecht
University of Pittsburgh

Title: Outlier Detection for Clinical Monitoring and Alerting

Abstract: Outlier detection has been successfully applied to the identification of unusual data instances, unusual behaviors and outcomes. In this work, I present an outlier detection framework for clinical monitoring and alerting that aims to identify unusual patient-management actions in electronic health record data. Our hypothesis is that patient-management actions that are unusual with respect to past patients may be due to errors and that it is worthwhile to raise an alert if such a condition is encountered prospectively. Our methodology was tested on data derived from electronic health records, and the quality of alerts was evaluated using a panel of clinicians. Our results indicate that outlier-based alerting can lead to reasonably low false alert rates and that stronger outliers are correlated with higher alert rates.

Bio: Dr. Milos Hauskrecht is a professor of computer science at the University of Pittsburgh. He received his PhD from MIT in 1997. His research interests are in probabilistic modeling, machine learning, data mining and their applications in medicine. He has authored or co-authored over 100 publications in these areas. His research work is funded by grants from NIH and NSF. He serves regularly on program committees of top artificial intelligence and biomedical informatics conferences. He is the recipient of the 2010 Homer R. Warner Award for his work on outlier-based clinical monitoring and alerting.

Workshop Chairs

 Honorary Chair

Zoran Obradovic
Temple University

 Workshop Chairs

Nitesh Chawla
University of Notre Dame
Gregor Stiglic
University of Maribor
Fei Wang
University of Connecticut

Note: for inquiries, please send email to gregor.stiglic@um.si and fei_wang@engr.uconn.edu

Previous DMMH Workshops

The first Workshop on Data Mining for Medicine and Healthcare was organized at the KDD 2011 conference in San Diego, CA. It was held as a full-day workshop with 2 invited speakers, 6 full papers and 4 short papers.

Keynote lectures and the panel are available at http://videolectures.net/datamining2011_san_diego/.

Information on the 2nd and 3rd Workshops on Data Mining for Medicine and Healthcare can be found at the DMMH-SDM 2013 and DMMH-SDM 2014 websites.