2nd Workshop on Data Mining for Medicine and Healthcare

May 4, 2013, Austin, Texas

To be held in conjunction with the 13th SIAM International Conference on Data Mining (SDM 2013)



In virtually every country, the cost of healthcare is increasing more rapidly than the willingness and the ability to pay for it. At the same time, more and more data is being captured around healthcare processes in the form of Electronic Health Records (EHR), health insurance claims, medical imaging databases, disease registries, spontaneous reporting sites, and clinical trials. As a result, data mining has become critical to the healthcare world. On the one hand, EHR offers the kind of data that excites data miners; on the other hand, it comes with challenges such as 1) the unavailability of large sources of data to academic researchers, and 2) limited access to data-mining experts. Healthcare entities are reluctant to release their internal data to academic researchers, and in most cases there is limited interaction between industry practitioners and academic researchers working on related problems.

The objectives of this workshop are:

  1. Bring together researchers (from both academia and industry) as well as practitioners to present their latest problems and ideas.
  2. Attract healthcare providers who have access to interesting sources of data and problems but lack the expertise in data mining to use the data effectively.
  3. Enhance interactions between data mining, text mining and visual analytics communities working on problems from medicine and healthcare.


Topics of Interest

In addition to the more classical data mining approaches, this workshop aims to include two new topic areas: visual analytics and text mining in medicine and healthcare. With this extension, we aim to foster interactions among the multiple communities that work at the intersection of data mining, medicine and healthcare.

Topic areas for the workshop include (but are not limited to) the following:

  • Statistical analysis and characterization of healthcare data
  • Text mining - mining free text in electronic medical records
  • Visual analysis and exploration of longitudinal clinical trial data
  • Meaningful use of healthcare data for improved patient care and cost-reduction
  • Data quality assessment and improvement: preprocessing, cleaning, missing data treatment etc.
  • Pattern detection and hypothesis generation from observational data
  • Visualization of prescription drugs and interactions
  • Privacy and security issues in healthcare
  • Information fusion and knowledge transfer in healthcare
  • Evolutionary and longitudinal patient and disease models
  • Medical fraud detection
  • Help with ICD-9 to ICD-10 conversions
  • Health information exchanges


Workshop Chairs

David Gotz, IBM T.J. Watson Research Center

Nigam Shah, Stanford University

Gregor Stiglic, University of Maribor

Fei Wang, IBM T.J. Watson Research Center

Note: for inquiries please send e-mail to gregor.stiglic@um.si and fwang@us.ibm.com

Program Committee

Sophia Ananiadou, University of Manchester

David Buckeridge, McGill University

Nitesh Chawla, University of Notre Dame 

Rave Harpaz, Stanford University

Andreas Holzinger, Medical University Graz

Jin-Dong Kim, Database Center for Life Science

Peter Kokol, University of Maribor

Nada Lavrac, Institute Jozef Stefan

Zoran Obradovic, Temple University

Mykola Pechenizkiy, Technical University Eindhoven

Niels Peek, University of Amsterdam

Igor Pernek, University of Maribor

Martijn Schuemie, Erasmus University Medical Center

David Sontag, New York University

Jimeng Sun, IBM T.J. Watson Research Center

Jieping Ye, Arizona State University

Ping Zhang, IBM T.J. Watson Research Center


Important Dates

Paper Submission: January 14, 2013 (extended)

Notification of Acceptance: January 25, 2013

Camera Ready Paper Due: February 6, 2013

Workshop: May 4, 2013 

Submission Information

All submissions must be made electronically at: https://www.easychair.org/conferences/?conf=sdmdmmh2013 

Papers submitted to this workshop must not have been accepted by, or be under review at, another conference with published proceedings or a journal. The work may be either theoretical or applied.

The workshop accepts short (4-6 pages) and long papers (up to 9 pages) with US Letter (8.5" x 11") paper size (single-spaced, 2 column, 10 point font, and at least 1" margin on each side). Papers must have an abstract with a maximum of 300 words and a keyword list with no more than 6 keywords.

We would like to encourage you to prepare your paper in LaTeX2e. Papers should be formatted using the SIAM SODA macro, which is available through the SIAM website. You can access it at http://www.siam.org/proceedings/macros.php. The filename is soda2e.all. Make sure you use the macros for SODA and Data Mining Proceedings; papers prepared using other proceedings macros will not be accepted.

Microsoft Word users should convert their documents to PDF format. Since there is no Microsoft Word template, please visit http://www.siam.org/proceedings/ to view the format of previous papers.

All submissions should clearly present the author information including the names of the authors, the affiliations and the emails.

Workshop Schedule

May 4, Saturday

Morning Session

8:30 – 8:35

Workshop opening

8:35 – 9:25

Invited Talk I

Speaker: Joydeep Ghosh


9:25 – 10:20

Classification of Clinical Data using Sequential Patterns: A case study in Amyotrophic Lateral Sclerosis

André V. Carreiro, Susana Pinto, Mamede de Carvalho, Sara C. Madeira and Cláudia



Incremental SampleBoost for Efficient Learning from Multi-Class Data Sets

Mohamed Abouelenien and Xiaohui Yuan


Discover Temporal Dynamics of Biomarkers in Predictive Modeling with Longitudinal Data

Jiayu Zhou, Jimeng Sun, Fei Wang, Jianying Hu, Shahram Ebadollahi and Jieping Ye 


10:20 – 10:30



10:30 – 11:20

Invited Talk II

Speaker: Marc Suchard


11:20 – 12:00

A Temporal Association Mining Method for Signaling Drug-Drug Interactions

Yanqing Ji, Hao Ying, John Tran, Peter Dews, Ayman Mansour, See Yan and R Michael Massanari


Temporal Mining of Integrated Healthcare Data: Methods, Revealings and Implications

Rui Henriques, Sílvia Moura Pina and Cláudia Antunes 


12:00 – 13:30

Lunch Break (on your own)

May 4, Saturday

Afternoon Session

13:30 – 14:20

Invited Talk III

Speaker: Jimeng Sun


14:20 – 15:00

Data Mining for Integrated Sensing of Multiple Wearable Body Sensors: A Healthcare Application

Gaurav Pradhan and B Prabhakaran


Multivariate Methods for Classifying Physiological Data

Patricia Ordóñez, Tom Armstrong, Tim Oates, Jim Fackler and Christoph U. Lehmann

15:00 – 15:15



15:15 – 16:05

Invited Talk IV

Speaker: Nitesh Chawla


16:05 – 16:55

Real-Time Digital Flu Surveillance using Twitter Data

Kathy Lee, Ankit Agrawal and Alok Choudhary


Classification and Diagnosis of Myopathy from EMG Signals

Brian Bue, Erzsébet Merényi and James Killian 


Mining hospital admission-discharge data to discover the chance of readmission

Arian Hosseinzadeh, Aman Verma, Masoumeh Izadi, Doina Precup and David Buckeridge 


16:55 – 17:00

Best Paper Award

Sponsored by IBM


Invited Speakers

Nitesh Chawla, University of Notre Dame


Nitesh Chawla is the Frank Freimann Collegiate Chair and Associate Professor of Computer Science and Engineering. He started his tenure-track position at Notre Dame in 2007, was promoted and tenured in 2011, and was recognized with the Frank Freimann Collegiate Chair in 2012. He is the Director of the Notre Dame Interdisciplinary Center for Network Science and Applications (iCeNSA) and the Data Inference Analytics and Learning Lab (DIAL).

Dr. Chawla's research interests lie primarily in data mining and network science, specifically in the areas of imbalanced data, concept drift and dataset shift, adversarial learning, evaluation issues, heterogeneous networks, co-evolving networks, and link prediction. He is also at the frontier of interdisciplinary applications, with innovative work and key contributions in patient-centered healthcare, social networks, analytics, and climate/environmental sciences. He is the recipient of multiple awards for research and teaching innovation, including outstanding teacher awards (2008 and 2011), the National Academy of Engineering New Faculty Fellowship, and a number of best paper awards and nominations. He received the IBM Watson Faculty Award in 2012. He has over 150 publications and serves as PI/Co-PI on over $10 million in research funding, with research supported by federal agencies including NSF, DOD, DARPA, ARL, and DOE, and by a number of industry partners. He is the former chair of the IEEE CIS Data Mining Technical Committee. He also serves on a number of editorial boards and organizing/program committees of conferences. Dr. Chawla is also the founder of Aunalytics, Inc.


Big Data and Patient-centered Healthcare

Proactive personalized medicine is expected to bring fundamental changes, offering recommendations of lifestyle adjustments and treatments to avoid diseases that a patient is at high risk of developing in the future. No matter how unique our medical experiences, chances are that other patients among millions have experienced genetic and environmental risk factors that closely mirror ours. These medical experiences, risks, and symptoms are tapped in the vast repositories of electronic medical records. Can we then take a data-driven approach to discover nuggets of knowledge and insight from the Big Data in healthcare for patient-centered outcomes and personalized healthcare? Can we answer the question: What are my disease risks? This talk will focus on our work that applies data- and network-driven thinking to personalized healthcare and patient-centered outcomes.

Joydeep Ghosh, University of Texas at Austin


Joydeep Ghosh is currently the Schlumberger Centennial Chair Professor of Electrical and Computer Engineering at the University of Texas, Austin. He joined the UT-Austin faculty in 1988 after being educated at, (B. Tech '83) and The University of Southern California (Ph.D. '88). He is the founder-director of IDEAL (Intelligent Data Exploration and Analysis Lab) and a Fellow of the IEEE. Dr. Ghosh has taught graduate courses on data mining and web analytics every year, to both UT students and industry, for over a decade. He was voted "Best Professor" in the Software Engineering Executive Education Program at UT.

Dr. Ghosh's research interests lie primarily in data mining and web mining, predictive modeling / predictive analytics, machine learning approaches such as adaptive multi-learner systems, and their applications to a wide variety of complex real-world problems. He has published more than 300 refereed papers and 50 book chapters, and co-edited over 20 books. His research has been supported by the NSF, Yahoo!, Google, ONR, ARO, AFOSR, Intel, IBM, and several others. He has received 14 Best Paper Awards over the years, including the 2005 Best Research Paper Award across UT and the 1992 Darlington Award given by the IEEE Circuits and Systems Society for the overall Best Paper in the areas of CAS/CAD. Dr. Ghosh has been a plenary/keynote speaker on several occasions, such as MICAI'12, KDIR'10, ISIT'08, ANNIE'06 and MCS 2002, and has lectured widely on intelligent analysis of large-scale data. He served as the Conference Co-Chair or Program Co-Chair for several top data mining oriented conferences, including SDM'13, SDM'12, KDD 2011, CIDM'07, ICPR'08 (Pattern Recognition Track) and SDM'06. He was the Conf. Co-Chair for Artificial Neural Networks in Engineering (ANNIE) '93 to '96 and '99 to '03 and the founding chair of the Data Mining Tech. Committee of the IEEE Computational Intelligence Society. He has also co-organized workshops on high dimensional clustering, Web Analytics, Web Mining and Parallel/Distributed Knowledge Discovery.

Dr. Ghosh has served as a co-founder, consultant or advisor to successful startups (Stadia Marketing, Neonyoyo and Knowledge Discovery One) and as a consultant to large corporations such as IBM, Motorola and Vinson & Elkins. 


Predictive Modeling of Large Healthcare Data under Privacy Constraints 

As medical records move to the digital age and the bedside gets increasingly instrumented, a wealth of information is being acquired, with the potential of providing unprecedented insights into the cause, prevention, treatment and management of illness. Analysis of such data also promises numerous opportunities for much more effective and efficient delivery of healthcare. However, (valid) privacy concerns and restrictions pose a major impediment to realizing this potential. In this talk I will outline two approaches that we have recently and successfully taken that provide privacy-aware predictive modeling with little degradation in model quality despite restrictions on what can be shared or analyzed. The first approach focuses on extracting predictive value from data that has been aggregated at various levels due to privacy concerns, while the second introduces a novel, non-parametric Gibbs sampler that can generate "realistic but not real" data given a dataset that cannot be shared as is.

[Joint work with Yubin Park and Shankar Mallikarjun]

Marc Suchard, University of California, Los Angeles


Marc A. Suchard is Professor in the Departments of Biostatistics, of Biomathematics and of Human Genetics in the UCLA School of Public Health and David Geffen School of Medicine at UCLA. He earned his Ph.D. in biomathematics from UCLA in 2002 and continued for an M.D. degree, which he received in 2004. Dr. Suchard is a leading Bayesian statistician who focuses on inference of stochastic processes in genomics and for massive datasets in healthcare. His training in both Medicine and Applied Probability helps bridge the gap of understanding between statistical theory and clinical practicality. He has been awarded several prestigious statistical awards such as the 2003 Savage Award, the 2006 and 2011 Mitchell Prizes, as well as a 2007 Alfred P. Sloan Research Fellowship in computational and molecular evolutionary biology and a 2008 Guggenheim Fellowship to further computational statistics. Recently, he received the 2011 Raymond J. Carroll Young Investigator Award for a leading statistician within 10 years post-Ph.D.


Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this talk, I discuss how high-performance statistical computation, including graphics processing units, can enable complex inference methods in these massive datasets. I focus on algorithm restructuring through techniques like block relaxation (Gibbs, cyclic coordinate descent, MM) to exploit increased data/parameter conditional independence within traditional serial structures. I find orders-of-magnitude improvement in overall run-time when fitting models involving tens of millions of observations.

These approaches are ubiquitous in high-dimensional biological problems modeled through stochastic processes. To drive this point home, I conclude with a seemingly unrelated example developing nonparametric models to study the genomic evolution of infectious diseases. These infinite hidden Markov models (HMMs) generalize both Dirichlet process mixtures and the usual finite-state HMM to capture unknown heterogeneity in the evolutionary process. Data squashing strategies, coupled with massive parallelization, yield novel algorithms that bring these flexible models finally within our grasp.

[Joint work with Subha Guha, David Madigan, and Steve Scott]

Jimeng Sun, IBM T.J. Watson Research Center


Jimeng Sun is a research staff member in the Healthcare Analytic Department of the IBM T.J. Watson Research Center. He leads research projects in medical informatics, especially in developing large-scale predictive and similarity analytics for healthcare applications.

Sun has an extensive track record in core and applied data mining research, specializing in healthcare analytics, big data analytics, similarity metric learning, social network analysis, predictive modeling and visual analytics. He has published over 70 papers and filed over 20 patents (4 granted). He received the ICDM best research paper award in 2007, the SDM best research paper award in 2007, and the KDD Dissertation runner-up award in 2008.

Sun received his B.S. and M.Phil. in Computer Science from Hong Kong University of Science and Technology in 2002 and 2003, and his PhD in Computer Science from Carnegie Mellon University in 2007, specializing in data mining on streams, graphs and tensor data.


Large volumes of heterogeneous Electronic Health Records (EHR) data are becoming available at many healthcare institutions. Such EHR data from millions of patients serve as a huge collective memory of doctors and patients over time. How can we leverage that EHR data to help caregivers and patients make better decisions in the future? How can we efficiently use these data to support clinical and pharmaceutical research?

My research focuses on developing large-scale algorithms and systems for healthcare analytics. First, I will describe our healthcare analytic research framework, which provides an intuitive collaboration mechanism across interdisciplinary teams and an efficient computation framework for handling heterogeneous patient data. Second, I will present a core component of this framework, patient similarity learning, which addresses the following questions:

- How to incorporate physician feedback into the similarity computation?

- How to integrate multiple patient similarity measures into a single consistent similarity measure? 

- How to incrementally update the existing patient similarity functions as new data or feedback arrive? 

- How to present the similarity results in an intuitive and interactive way to users? 

I will illustrate the effectiveness of our proposed algorithms for patient similarity learning in several different healthcare scenarios. I will demonstrate an interactive visual analytic system that allows users to cluster patients and to refine the underlying patient similarity metric. Finally, I will highlight some current/future work that I am pursuing.

Previous Edition of the Workshop

The First Workshop on Data Mining in Medicine and Healthcare was held at the KDD 2011 conference in San Diego, CA. It was a full-day workshop with 2 invited speakers, 6 full papers and 4 short papers.

Keynote lectures and the panel are available at http://videolectures.net/datamining2011_san_diego/.

Information about SDM-DMMH 2013 is also available at KDnuggets.