3rd Workshop on Data Mining for Medicine and Healthcare

April 26, 2014, Philadelphia, PA

To be held in conjunction with the 14th SIAM International Conference on Data Mining (SDM 2014)


In virtually every country, the cost of healthcare is increasing more rapidly than the willingness and the ability to pay for it. At the same time, more and more data is being captured around healthcare processes in the form of Electronic Health Records (EHR), health insurance claims, medical imaging databases, disease registries, spontaneous reporting sites, and clinical trials. As a result, data mining has become critical to the healthcare world. On the one hand, EHRs offer the data that gets data miners excited; on the other hand, they are accompanied by challenges such as 1) the unavailability of large sources of data to academic researchers, and 2) limited access to data-mining experts. Healthcare entities are reluctant to release their internal data to academic researchers, and in most cases there is limited interaction between industry practitioners and academic researchers working on related problems.

The objectives of this workshop are:

1. Bring together researchers (from both academia and industry) as well as practitioners to present their latest problems and ideas.

2. Attract healthcare providers who have access to interesting sources of data and problems but lack the expertise in data mining to use the data effectively.

3. Enhance interactions between data mining, text mining and visual analytics communities working on problems from medicine and healthcare.

SDM is a unique venue for this workshop as leading researchers and practitioners from academia and industry will be able to participate. A workshop where healthcare professionals can have an audience, present and discuss their problems, views and ideas on the field as well as pose research challenges will attract them to SDM. The organizers of this proposed workshop have continuous and in-depth contact with people working on healthcare applications of data mining and with healthcare professionals in the US and Europe, which will attract a broad and varied set of participants. We believe that this workshop will serve as a bridge between the traditional SDM community and healthcare professionals - two groups of participants that have a lot to learn from and share with each other.

Topics of Interest

Topic areas for the workshop include (but are not limited to) the following:

  • Statistical analysis and characterization of healthcare data
  • Text mining - mining free text in electronic medical records
  • Visual analysis and exploration of longitudinal clinical trial data
  • Meaningful use of healthcare data for improved patient care and cost-reduction
  • Data quality assessment and improvement: preprocessing, cleaning, missing data treatment etc.
  • Pattern detection and hypothesis generation from observational data
  • Visualization of prescription drugs and interactions
  • Privacy and security issues in healthcare
  • Information fusion and knowledge transfer in healthcare
  • Evolutionary and longitudinal patient and disease models
  • Medical fraud detection
  • Hospital readmission analytics
  • Help with ICD-9 to ICD-10 conversions
  • Health Information exchanges

Workshop Chairs

Nitesh Chawla
University of Notre Dame
Gregor Stiglic
University of Maribor
Fei Wang
IBM T.J. Watson Research Center

Publicity chair:
Ping Zhang
IBM T.J. Watson Research Center

Proceedings chair:
Xiang Wang
IBM T.J. Watson Research Center

Note: for inquiries please send e-mail to gregor.stiglic@um.si and fwang@us.ibm.com

Program Committee

Riccardo Bellazzi, University of Pavia
David Buckeridge, McGill University
Andreas Holzinger, Medical University Graz
Siddhartha Jonnalagadda, Northwestern University
Jin-Dong Kim, Database Center for Life Science
Zoran Obradovic, Temple University
Mykola Pechenizkiy, Technical University Eindhoven
Chandan Reddy, Wayne State University
Niels Peek, University of Amsterdam
Igor Pernek, University of Maribor
Patrick Ryan, Janssen Research and Development
Lucia Sacchi, University of Pavia
Martijn Schuemie, Janssen Research and Development
Xiang Wang, IBM T.J. Watson Research Center
Ping Zhang, IBM T.J. Watson Research Center
Jiayu Zhou, Arizona State University

Important Dates

Paper Submission: January 10, 2014

Notification of Acceptance: January 31, 2014

Camera Ready Paper Due: February 10, 2014

Workshop: April 26, 2014

Submission Information

All submissions must be made electronically at https://www.easychair.org/conferences/?conf=sdmdmmh2014.

Papers submitted to this workshop must not have been accepted by, or be under review at, another conference with published proceedings or a journal. The work may be either theoretical or applied.

The workshop accepts short (4-6 pages) and long papers (up to 9 pages) with US Letter (8.5" x 11") paper size (single-spaced, 2 column, 10 point font, and at least 1" margin on each side). Papers must have an abstract with a maximum of 300 words and a keyword list with no more than 6 keywords.

We would like to encourage you to prepare your paper in LaTeX2e. Papers should be formatted using the SIAM SODA macro, which is available through the SIAM website. You can access it at http://www.siam.org/proceedings/macros.php. The filename is soda2e.all. Make sure you use the macros for SODA and Data Mining Proceedings; papers prepared using other proceedings macros will not be accepted.

For Microsoft Word users, please convert your document to the PDF format.  Since there is no Microsoft Word Template, please visit http://www.siam.org/proceedings/ to view the format of previous papers.

All submissions should clearly present the author information, including the authors' names, affiliations, and email addresses.

Workshop Schedule

April 26, Saturday

Morning Session

8:30 – 9:20

Workshop opening

Invited talk I

Speaker: Zoran Obradovic

9:20 – 10:10

Invited talk II

Speaker: Chandan K. Reddy

10:10 – 10:20


10:20 – 11:10

Invited talk III

Speaker: Christopher C. Yang

11:10 – 12:00

Invited talk IV

Speaker: Anastasia Christianson 

12:00 – 13:30

Lunch break (on your own)

April 26, Saturday

Afternoon Session

13:30 – 14:20

Invited talk V

Speaker: Patrick Ryan

14:20 – 15:10

Short presentations of accepted papers

15:10 – 15:35

Poster session

15:35 – 16:25

Invited talk VI

Speaker: Zhi Wei

16:25 – 17:00

Panel discussion with invited speakers

Moderator: Nitesh Chawla

Short presentations schedule (14:20 – 15:10) with max. 4 slides (4 minutes) per presentation:


Label Many with a Few: Semi-Automatic Medical Image Modality Discovery in a Large Image Collection

Szilárd Vajda, Daekeun You, Sameer Antani, and George R. Thoma


Robust Feature Selection Framework for High Dimensional Electronic Health Records Data

Chandrima Sarkar, Prasanna Desikan, and Jaideep Srivastava


Machine Learning for the Detection of MRI-Elusive Epileptogenic Lesions: A Surface Based MRI Morphometric Approach

Bilal Ahmed, Thomas Thesen, Karen Blackmon, Ruben Kuzniecky, Chad Carlson, Brian T. Quinn, Werner Doyle, Jacqueline French, Orrin Devinsky, and Carla E. Brodley


Temporal Modeling in Clinical Artificial Intelligence, Decision-Making, and Cognitive Computing: Empirical Exploration of Practical Challenges

Casey C. Bennett and Thomas W. Doub


Analysis of Surface Motion Patterns Changes for Detecting Baseline Shifts in Respiratory Tumor Motion Data

Arvind Balasubramanian, Duk-Jin Kim, B. Prabhakaran, Yam Cheung, and Amit Sawant


A Hierarchical Algorithm for Exercise Intensity Recognition

Igor Pernek, Gregorij Kurillo, Gregor Stiglic, and Ruzena Bajcsy 


On Recent Advances in Supervised Ranking for Metabolite Profiling

Charanpal Dhanjal and Stéphan Clémençon 


MedTrend: Interactive Visualization System for Trend Analysis in Medical Data

Naouel Baili, Basheer Hawwash, Jeffrey Spaeder, Joseph Goodgame, Winfred Shaw, and Joseph Bastante


A Data-Driven Model for Optimizing Therapy Duration for Septic Patients

Mohamed Ghalwash and Zoran Obradovic 


Ranking with Distance Metric Learning for Biomedical Severity Detection

Feiyu Xiong, Moshe Kam, Leonid Hrebien, and Yanjun Qi 


Learning Decision Lists with Lags for Physiological Time Series

Erik Hemberg, Kalyan Veeramachaneni, Prashan Wanigasekara, Hormoz Shahrzad, Babak Hodjat, and Una-May O'Reilly


Workshop Proceedings

Workshop notes are available for download here.

Invited Speakers

Zoran Obradovic, Temple University

Modeling Patient's Response in Acute Inflammation Treatment

Uncontrolled inflammation accompanied by an infection that results in septic shock is the most common cause of death in intensive care units and the 10th leading cause of death overall. In principle, spectacular mortality rate reduction can be achieved by early diagnosis and accurate prediction of response to therapy. This is a very difficult objective due to the fast progression and complex multi-stage nature of acute inflammation. Our ongoing DARPA DLT project is addressing this challenge by developing and validating effective predictive modeling technology for the analysis of temporal dependencies in high-dimensional, multi-source sepsis-related data. This lecture will provide an overview of the results of our project, which show potential for significant mortality reduction in severe sepsis patients.

Zoran Obradovic is an L.H. Carnell Professor of Data Analytics at Temple University, Professor in the Department of Computer and Information Sciences with a secondary appointment in Statistics, and the Director of the Center for Data Analytics and Biomedical Informatics. His research interests include data mining, machine learning and complex networks applications in climate modeling and health management. Zoran is the executive editor of the journal Statistical Analysis and Data Mining, which is the official publication of the American Statistical Association, and is an editorial board member at eleven journals. He was general co-chair for the 2013 and 2014 SIAM International Conference on Data Mining and was the program or track chair at many data mining and biomedical informatics conferences. In 2014-2015 he chairs the SIAM Activity Group on Data Mining and Analytics.

Christopher C. Yang, Drexel University

Harnessing Social Media to Empower Health Consumers and Discover Healthcare Knowledge 

The healthcare system is currently in the process of transforming from reactive care to proactive and preventive care. Patients are taking an active role in seeking health information to acquire a better understanding of their health conditions and the treatments or medications they are receiving. Patients often go online to search for authoritative information or to communicate with peers who have similar health conditions. Social media plays an important role in empowering patients to acquire healthcare knowledge. The latest sensor and mobile technologies also enable patients to track their health conditions ubiquitously.  In this talk, we’ll look at how the big data contributed by health consumers in social media and the increased connectivity between patients, caregivers, health professionals, and life science researchers are changing the way we conduct patient-centered research. 

Christopher C. Yang is an associate professor in the College of Computing and Informatics at Drexel University. His recent research interests include healthcare informatics, social intelligence and technology, and social media analytics. He has published over 230 refereed journal and conference papers in ACM Transactions, IEEE Transactions, JASIST, IEEE Computer, IEEE Intelligent Systems, Information Processing and Management, International Journal of Electronic Commerce, and more.
In his recent work on healthcare informatics, he is closely collaborating with the Children's Hospital of Philadelphia, UPenn Medical School, USC Keck School of Medicine, UCSF School of Medicine, Marshfield Clinic Research Institute, and Johnson & Johnson. Actonnect, one of his collaborations with Marshfield Clinic Research Institute, University of Wisconsin at Madison, and a couple of other institutions, won first place for conceptual model in the PCORI Patient-Research Matching Challenge in 2013. Actonnect is a web-based search engine and interface that aims to enable patients, clinicians, researchers, and others to conduct searches of health information gleaned from dozens of patient forums and social media sites and share their results graphically. His work has been supported by the National Science Foundation, National Institutes of Health, Pennsylvania Department of Health, Children's Hospital of Philadelphia, Hong Kong Research Grant Council, Hong Kong Innovation and Technology Fund, and Hong Kong SAR Government.
He is currently serving as the co-editor of Electronic Commerce Research and Applications (Elsevier) and the associate editor-in-chief of Security Informatics (Springer). He has edited several special issues on social media, healthcare informatics, Web mining, multilingual information systems, and electronic commerce in IEEE Transactions, ACM Transactions, IEEE Intelligent Systems, JASIST, DSS, IPM, and others. He is the founding general chair and steering committee chair of the IEEE International Conference on Healthcare Informatics. He has also served as a chair at many international conferences and workshops such as ACM SIGHIT International Health Informatics Symposium, IEEE International Conference on Intelligence and Security Informatics, ACM International Conference on Information and Knowledge Management, International Conference on Electronic Commerce, International Conference on Social Intelligence and Technology, ACM SIGKDD Workshop on Intelligence and Security Informatics, ACM SIGKDD Workshop on Health Informatics, International Workshop on Smart Health and Wellbeing, and International Conference on Asia-Pacific Digital Libraries.

Patrick Ryan, Janssen Research and Development

Learning from observational health data: Lessons from OMOP and OHDSI

Observational healthcare data, such as administrative claims and electronic health records, offer tremendous opportunities to explore the real-world effects of medical products in support of medical product safety surveillance and comparative effectiveness. However, statistical best practices have not yet been fully established for observational analyses. High-profile examples of conflicting results between observational studies and randomized trials, or among epidemiologic studies, have caused concern and skepticism within the medical research community about the relative role of observational data in the hierarchy of evidence. Since 2009, the Observational Medical Outcomes Partnership (OMOP) has conducted methodological research experiments to empirically measure the performance of observational analysis approaches across a distributed network of administrative claims and electronic health records. OMOP demonstrated that traditional epidemiologic study designs, such as cohort, case-control, and self-controlled case series, can yield valuable insights about potential effects of exposure, but are all highly susceptible to biases of various sorts that can result in both false negative and false positive findings. OMOP proposed an approach through systematic evaluation and empirical calibration whereby observational study results could be interpreted in the context of their predictive accuracy and operating characteristics. Building on the lessons from OMOP, the Observational Health Data Sciences and Informatics (OHDSI, http://ohdsi.org) program was formed as a multi-stakeholder, interdisciplinary collaborative to create open-source solutions that bring out the value of observational health data through large-scale analytics. In this talk, we will discuss the opportunities and challenges for use of observational data in medical decision-making.
We will highlight the progress toward establishing scientific best practices and standardized tools to enable the entire research community to improve the reliability of evidence from observational studies. We will also explore the emerging opportunities to use observational data to go beyond estimating average treatment effects to generating patient-level predictions that provide personalized evidence that’s tailored to the medical history of an individual.

Patrick Ryan, PhD is the Head of Epidemiology Analytics at Janssen Research and Development, where he is leading efforts to develop and apply analysis methods to better understand the real-world effects of medical products. He is currently a collaborator in Observational Health Data Sciences and Informatics (OHDSI), a multi-stakeholder, interdisciplinary collaborative to create open-source solutions that bring out the value of observational health data through large-scale analytics. He served as a principal investigator of the Observational Medical Outcomes Partnership (OMOP), a public-private partnership chaired by the Food and Drug Administration. As part of OMOP, he led methodological research to assess the appropriate use of observational health care data to identify and evaluate drug safety issues. Patrick received his undergraduate degrees in Computer Science and Operations Research at Cornell University, his Master of Engineering in Operations Research and Industrial Engineering at Cornell, and his PhD in Pharmaceutical Outcomes and Policy from the University of North Carolina at Chapel Hill. Patrick has worked in various positions within the pharmaceutical industry at Pfizer and GlaxoSmithKline, and also in academia at the University of Arizona Arthritis Center.

Anastasia Christianson, AstraZeneca Pharmaceuticals

Data Mining and Big Data in Pharma R&D – From disease to therapies

Data mining is used at every stage of drug discovery and development. From understanding stages of disease to understanding the mechanism of action of drugs and their effects on patients and disease, data mining is an integral part of developing drugs. As technological advances are enabling us to generate more and more data, we are becoming more dependent on data mining tools, approaches, and skills to help us derive knowledge from our data. Given the ever-increasing volume, variety, and velocity of data being generated today, we will explore the impact of data mining on drug development in this presentation and show examples of how data mining is enabling informed decision making in Pharma R&D and how big data is impacting drug development.

Anastasia Christianson is Head of Translational R&D IT at Bristol-Myers Squibb (BMS), where she is responsible for delivering all IT and information needs for Translational Medicine across Research and Development. Prior to BMS, Anastasia spent 20 years at AstraZeneca Pharmaceuticals in various roles across Discovery and Clinical Development, ranging from leading drug projects to establishing Genomics in Discovery and Biomedical Informatics in Clinical Development. She has experience supporting the end-to-end information needs across R&D for various functions including Personalized Healthcare and Biomarkers, Safety Assessment, Strategy, Portfolio and Performance, Discovery Medicine, and multiple therapeutic areas.
Anastasia received her Ph.D. in Biological Chemistry from the University of Pennsylvania followed by postdoctoral training at Harvard University in Cellular and Developmental Biology. She started her working career at a biotech company, DNX Biotherapeutics, and has over 20 years of experience in biotechnology and pharmaceutical R&D. She is a member of the DIA and PhRMA Translational Medicine Advisory Committees, has held adjunct appointments at various area universities (currently at the University of Delaware and Delaware Biotechnology Institute), and is an editor of the Journal of Data Mining and Bioinformatics. She helped establish a degree program in Quantitative Biology at the University of Delaware and chaired the external advisory committee of this degree program during its establishment. She was also a member of the External Advisory Panel for the collaboration between Drexel University, the University of Louisiana, and NSF in establishing the Center for Visual and Decision Informatics.
Anastasia’s passion is “data exploitation”, including mining "Big Data", for evidence-based decision making.

Chandan K. Reddy, Wayne State University

Privacy Preserving and Knowledge Transfer approaches to Information Exchange in Healthcare

There have been significant efforts in the last few years to improve the accessibility of patient records between disparate health care information systems. In addition to the critical challenges associated with data transformation and data integration, inferring useful knowledge using such data from different sites can introduce several challenges in terms of preserving patient privacy and transferring potentially useful knowledge. This talk consists of two parts. In the first part, we will describe new ways to transfer knowledge between different healthcare systems under patient privacy constraints. In essence, our privacy-preserving predictive models provide a novel mechanism to share knowledge acquired from multiple data repositories without revealing the actual individual patient-specific information. These models can be almost as accurate as the models built upon centralized data repositories where the entire patient data is revealed. In the second part of this talk, we will describe a new transfer learning approach based on a constrained Elastic Net method that can inherently select the appropriate features to effectively transfer knowledge across multiple sites. This approach can also quantify the difference between patient data distributions conditioned upon the corresponding predictive models. Such a difference measure is critical for making knowledge transfer decisions, especially to avoid negative and unsuccessful knowledge transfer. We will demonstrate the performance of the proposed models in two important healthcare problems related to lung cancer and diabetes. The data consists of several electronic health records collected from different states and regions within the U.S. The performance of both models under various evaluation metrics will be demonstrated and compared with existing state-of-the-art methods.

Chandan Reddy is an Associate Professor in the Department of Computer Science at Wayne State University. He received his Ph.D. from Cornell University and M.S. from Michigan State University. He is the Director of the Data Mining and Knowledge Discovery (DMKD) Laboratory and a scientific member of Karmanos Cancer Institute. His primary research interests are Data Mining and Machine Learning with applications to Healthcare Analytics, Bioinformatics and Social Network Analysis. His research is funded by the National Science Foundation, the National Institutes of Health, the Department of Transportation, and the Susan G. Komen for the Cure Foundation. He has published over 45 peer-reviewed articles in leading conferences and journals including TPAMI, TKDE, SIGKDD, ICDM, SDM, and CIKM. He received the Best Application Paper Award at the ACM SIGKDD conference in 2010, and was a finalist in the INFORMS Franz Edelman Award Competition in 2011. He is a member of IEEE, ACM, and SIAM.

Zhi Wei, New Jersey Institute of Technology

Change-point models for detecting aberrant gene expression patterns in cancers

In cancer cells, the protein products of oncogenes can be overexpressed without alteration of the proto-oncogene but with shorter 3' untranslated regions (UTRs). During the past few years, RNA-seq has matured as a powerful tool for characterizing gene expression because of its affordable cost and highly accurate digital resolution. For cancer studies, the introduction of RNA-seq technology, equipped with new analytic methods, makes it possible to capture aberrant shortening or lengthening expression patterns of oncogenes.
Change-point models are a classical approach to determining whether a change has taken place and where the changes occur. This lecture will introduce change-point models for detecting aberrant gene expression patterns. We develop appropriate parametric models for characterizing RNA-seq data. In the multiple-testing framework, we will introduce Type I error control and testing efficiency issues for pattern recognition. The numerical performance of the approaches will be illustrated using both simulation studies and applications to real cancer data.

Dr. Zhi Wei is an associate professor in the Department of Computer Science, New Jersey Institute of Technology. He received his Ph.D. from the University of Pennsylvania and M.S. from Rutgers University-New Brunswick. His research interests include multiple testing, statistical modelling, machine learning and data mining with applications to bioinformatics and genetics. His recent research focuses on developing statistical models and data mining algorithms for analysis of high dimensional data. His research is funded by the National Institutes of Health, the Pheo Para Alliance, the Henry M. Jackson Foundation, and the Robert Mapplethorpe Foundation. His methodological work has been published in prestigious journals and conferences including JASA, AJHG, AOAS, Bioinformatics, Biostatistics, PLoS Genetics, NAR and NIPS. He is an editorial board member of PLoS ONE, Frontiers in Bioinformatics and Computational Biology, and Frontiers in Applied Genetic Epidemiology.

Previous Edition of the Workshop

The First Workshop on Data Mining for Medicine and Healthcare was held at the KDD 2011 conference in San Diego, CA. It was run as a full-day workshop with 2 invited speakers, 6 full papers and 4 short papers.

Keynote lectures and the panel are available at http://videolectures.net/datamining2011_san_diego/.

Information on the Second Workshop on Data Mining for Medicine and Healthcare can be found at the DMMH-SDM 2013 website.

SDM-DMMH 2014 is also featured at KDnuggets.