Automated feature extraction and selection for high-throughput phenotyping

Nov. 11, 2016

CBI seminar
Title: Automated feature extraction and selection for high-throughput phenotyping
Speaker: Dr. Yu Sheng, Assistant Professor of Statistics, Center for Statistical Science, Tsinghua University
Time: 14:00-15:00, Friday, November 18, 2016
Location: Room 311, Wang Ke-Zhen Building, Peking University
Abstract:
With the rapid adoption of electronic medical records (EMR), medicine and healthcare have become one of the most important fields for big data applications. An important application in medical research is EMR-based phenotyping, which uses machine learning algorithms to identify patients with certain phenotypes. The conventional procedure for designing a phenotyping algorithm requires medical experts to work with statisticians and medical informaticians to decide which variables to use and which medical terms to search for, and a single algorithm typically takes months to finalize. We propose a data-driven method that automates the algorithm design process and can achieve even higher accuracy than expert-designed algorithms. We use publicly available knowledge sources, such as Wikipedia, to collect an initial set of candidate features. Billing codes and the natural language variable of the target phenotype are used to create surrogates of the gold-standard labels, and penalized logistic regression models are trained repeatedly with bootstrap to predict the surrogates in order to evaluate the informativeness of the candidate features. Only a succinct set of highly informative features passes the data-driven screening and enters the final model that predicts the true gold-standard labels. This method has been used in the development of large-scale biobanks at top-ranked hospitals in the U.S.
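The screening step described in the abstract (repeatedly fitting penalized logistic regression on bootstrap samples to predict surrogate labels, then keeping only features that prove informative) could be sketched roughly as follows. This is an illustrative sketch, not the speaker's implementation: the abstract does not specify the penalty or the selection rule, so the L1 penalty, the proximal-gradient solver, the penalty weight `lam`, the number of bootstrap rounds, the 0.8 selection-frequency threshold, and the synthetic data are all assumptions, as are the function names.

```python
import numpy as np


def fit_l1_logistic(X, y, lam=0.05, n_iter=300):
    """L1-penalized logistic regression via proximal gradient descent.

    The L1 penalty is an assumption; the abstract only says "penalized".
    """
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    # Step size 1/L, using an upper bound L on the Lipschitz constant
    # of the gradient of the mean logistic loss.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30.0, 30.0)
        prob = 1.0 / (1.0 + np.exp(-z))
        # Gradient step on the smooth logistic loss.
        w = w - step * (X.T @ (prob - y) / n)
        b = b - step * np.mean(prob - y)
        # Soft-thresholding: proximal step for the L1 penalty (intercept unpenalized).
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
    return w, b


def selection_frequency(X, surrogate, n_boot=50, lam=0.05, seed=0):
    """Fraction of bootstrap fits in which each feature gets a nonzero weight."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        w, _ = fit_l1_logistic(X[idx], surrogate[idx], lam=lam)
        counts += (w != 0)
    return counts / n_boot


# Synthetic stand-in for EMR data (assumption): 400 patients, 20 candidate
# features, of which only the first three drive the surrogate label.
rng = np.random.default_rng(42)
n_patients, n_features = 400, 20
X = rng.standard_normal((n_patients, n_features))
logits = 2.0 * X[:, 0] + 2.0 * X[:, 1] - 2.0 * X[:, 2]
surrogate = (rng.random(n_patients) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

freq = selection_frequency(X, surrogate)
selected = np.flatnonzero(freq >= 0.8)  # 0.8 threshold is an assumption
print("selection frequency:", np.round(freq, 2))
print("features passing the screen:", selected)
```

In this sketch, only the features that pass the screen would be carried into a final model trained on the (much smaller) set of gold-standard labels; that final step is not shown here.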
Speaker Bio:
Dr. Yu Sheng is Assistant Professor of Statistics in the Center for Statistical Science of Tsinghua University. Dr. Yu received his BS and MA degrees in statistics from Nankai University and the University of Michigan, and his PhD degree in systems engineering (operations research) from the George Washington University. He began his research in medical informatics during his research work at Harvard University. His current research interests include deep understanding of medical language using machine learning methods, internet and data-driven knowledge extraction, and supervised and unsupervised EMR analysis.
Welcome!