- A Probit Tensor Factorization Model For Relational Learning, Journal of Computational and Graphical Statistics (2022)
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework, Journal of the American Statistical Association (2022)
- Rule mining over knowledge graphs via reinforcement learning, Knowledge-Based Systems (2022)
- Concordance and Value Information Criteria for Optimal Treatment Decision, Annals of Statistics (2021)
- GEAR: On optimal decision making with auxiliary data, Stat (2021)
- Multi-Objective Model-based Reinforcement Learning for Infectious Disease Control, KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (2021)
- On estimating optimal regime for treatment initiation time based on restricted mean residual lifetime, Biometrics (2021)
- Online Testing of Subgroup Treatment Effects Based on Value Difference, 2021 21st IEEE International Conference on Data Mining (ICDM 2021) (2021)
- Statistical inference of the value function for reinforcement learning in infinite-horizon settings, Journal of the Royal Statistical Society Series B: Statistical Methodology (2021)
- A New Framework for Online Testing of Heterogeneous Treatment Effect, Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (2020)
The Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) is the nation's third largest food and nutrition assistance program. In 2019, WIC served 6.4 million low-income pregnant and postpartum women, infants and children up to age five at a cost of $5.2 billion. The economic literature is sparse concerning the effects of WIC on food purchases and prices. Efforts to expand the literature are hampered by underreporting of WIC in household surveys. Our objective is to bridge these gaps with the following Specific Aims: 1. Apply machine learning (ML) to improve the precision of WIC participation status in scanner data. 2. Use the ML-based WIC participation variable from Aim 1 to quantify the causal effects of the 2009 WIC food package revision on participants' food purchases and nutrition. 3. Use ML to quantify the effects of the 2009 revision on retail prices, and, by extension, nonparticipants' food purchases and nutrition. We will train ML models using the restricted-use FoodAPS, where WIC participation is more accurately reported, and state-level variation in WIC-eligible food brands and timing of benefit issuance. The trained ML models will then predict WIC participation for households in the IRI Consumer Network household scanner panel. The 2009 WIC food package revision serves as a policy experiment to identify the effects of WIC on participants' food purchases and their overall nutritional quality. The policy-driven changes in WIC-eligible food types, brands and amounts will be leveraged to identify the effects of the revision on retail prices and nonparticipants.
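The train-on-one-dataset, predict-on-another workflow described above can be sketched with a minimal classifier. This is only an illustration: the logistic-regression model, the synthetic "survey" and "panel" arrays, and all feature dimensions below are invented stand-ins, not the proposal's actual ML models or data.

```python
import numpy as np

# Hypothetical sketch: fit a classifier on labeled survey-like data (analogous
# to FoodAPS, where participation is well reported), then predict participation
# in an unlabeled scanner-panel-like dataset. All data here are synthetic.

rng = np.random.default_rng(1)

# Labeled training data: 3 invented purchase-pattern features per household.
X_train = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.5])                 # synthetic signal
y_train = (X_train @ true_w + rng.normal(size=500) > 0).astype(float)

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Plain gradient descent on the logistic log-loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))                # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)            # average-gradient step
    return w

w = fit_logistic(X_train, y_train)

# Unlabeled panel data (analogous to the IRI Consumer Network households).
X_panel = rng.normal(size=(200, 3))
participation_prob = 1 / (1 + np.exp(-X_panel @ w))
predicted_wic = participation_prob > 0.5            # imputed participation flag
```

The imputed `predicted_wic` indicator plays the role of the "ML-based WIC participation variable" of Aim 1; in practice richer features and models would replace this toy logistic fit.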
Despite the tremendous impact that RL has achieved in areas such as games and robotics, a direct deployment of RL algorithms in precision health can be costly, risky, unethical, or even infeasible, due to significant real-world challenges: (1) Existing RL algorithms typically require very large samples, while real-world data collection is often expensive, so the sample size can be limited. (2) Unlike typical clinical trials with finite-horizon settings, in data from mobile health (mHealth) and electronic medical record (EMR) applications for precision health, the number of decision points for each subject is often much larger and not necessarily fixed (infinite-horizon). (3) EMR data are often created by aggregating many data sources corresponding to different sub-populations and are thus heterogeneous. Naive applications of existing RL methods to large-scale data sets such as EMRs may generate misleading results. (4) Observational data are possibly confounded by unobserved variables that causally affect the agent and the environment simultaneously, which may lead to misleading estimation and evaluation of a policy. The main objective of this proposal is to develop new statistical offline RL methods to handle these challenges by developing flexible and efficient off-policy learning methods and robust and efficient off-policy evaluation methods. The plan for analyzing real-world precision health data with the proposed methods will also be thoroughly discussed.
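As a concrete illustration of off-policy evaluation, the classical importance-sampling estimator re-weights trajectories logged under a behavior policy to estimate the value of a target policy. The toy environment, both policies, and all numbers below are invented for the example; this is the generic textbook estimator, not the proposal's robust OPE methods.

```python
import numpy as np

# Hypothetical two-action environment with a short horizon, used only to
# illustrate importance-sampling off-policy evaluation on logged data.

rng = np.random.default_rng(0)

def behavior_policy(state):
    # Logging policy: picks each action with probability 0.5.
    return np.array([0.5, 0.5])

def target_policy(state):
    # Policy to evaluate: prefers action 1.
    return np.array([0.2, 0.8])

def step(state, action):
    # Toy dynamics: action 1 yields higher expected reward.
    reward = action + rng.normal(0, 0.1)
    return state, reward

def ope_importance_sampling(n_traj=5000, horizon=3, gamma=0.9):
    """Estimate the target policy's value from behavior-policy trajectories."""
    estimates = []
    for _ in range(n_traj):
        state, weight, ret = 0, 1.0, 0.0
        for t in range(horizon):
            probs_b = behavior_policy(state)
            action = rng.choice(2, p=probs_b)
            # Accumulate the likelihood ratio between target and behavior.
            weight *= target_policy(state)[action] / probs_b[action]
            state, reward = step(state, action)
            ret += gamma**t * reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

Here the true discounted value of the target policy is 0.8 × (1 + 0.9 + 0.81) ≈ 2.17, and the estimator recovers it without ever executing the target policy, which is exactly the property that makes OPE attractive in precision health, where running a new policy directly may be unsafe.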
Overview Big Data, characterized by high dimensionality and large sample size, pose three unique challenges in statistics (Fan et al. 2014): (i) noise accumulation and spurious correlations brought on by high dimensionality, which can cause trouble for statistical inference; (ii) heavy computational cost and algorithmic instability due to the combination of high dimensionality and large sample size; and (iii) issues of heterogeneity. For example, one source of heterogeneity can be created by the aggregation of massive samples from multiple sources at different time points using different technologies. An interesting new application that becomes possible in the Big Data era is personalized medicine, with a major goal of discovering individualized treatment rules using health-related metrics, such as individual molecular characteristics, human activities and environmental factors, that are now available. A salient feature and major source of heterogeneity in personalized medicine is that patients can show significant heterogeneity in response to treatments. For example, in some cases, a drug that works for a majority of individuals may not work for a subset of patients with certain characteristics. Intellectual Merit Semiparametric and nonparametric modelling provide a flexible framework that is being applied in public health, economics, biostatistics and other scientific fields. We call the semiparametric and nonparametric models and methods semi-nonparametric when the parametric and nonparametric components are both of interest. Due to the rapid development of computing power, the popularity of semi-nonparametric methods is increasing in the era of Big Data. 
These methods are flexible and adaptive enough for Big Data, which often involve estimation and inference procedures that yield rates of convergence different from the usual root-n rate and that have non-Gaussian limit distributions. Moreover, nonasymptotic analysis plays an important role in the analysis of Big Data. The PI is motivated to consider several different but interrelated statistical problems in this proposal to address the aforementioned challenges. First, unified semi-nonparametric and machine learning methods are proposed to discover optimal individualized treatment rules for single-stage and dynamic treatment regimens. Second, asymptotically optimal inference will be developed for high-dimensional statistical models using semiparametric methods. Third, the PI will explore novel bootstrap methods and non-asymptotic theorems for these methods in the Big Data setting to enhance practical computational performance. The proposed research will develop new theory, methodologies and algorithms for statistical inference for Big Data. Broader Impacts The outlined research project not only tackles fundamental problems in processing Big Data, but also explores new directions in semi-nonparametrics. In addition, this project involves the development of new empirical process tools. The proposed research will have significant impact and many applications in fields ranging from genomics and health sciences to economics and finance, where Big Data are often available. For example, the proposed method and theory can be directly applied to inference problems for decision making in personalized medicine, thereby contributing to improved disease treatment or prevention. Therefore, the PI expects to stimulate interest from a diverse group of scientists and researchers in numerous fields. Another key aspect of this project is the integration of research and education. 
New courses on Big Data statistical learning and semi-nonparametric inference will be developed. These courses will broaden the areas of specialized training in a department that has a strong history of attracting under-represented groups. The PI will also reach out to the K-12 level by training high school teachers.
This program project, entitled "Statistical Methods for Cancer Clinical Trials," is a joint venture of Duke University, North Carolina State University (NCSU), and the University of North Carolina at Chapel Hill (UNC). Biostatistician and clinician researchers from these three top research institutions collaborate on project research and share project-related resources. The scientific goal of this ambitious program project is to develop innovative statistical methods for cancer clinical trials that can hasten the successful introduction of effective new therapies into practice, with a focus on personalized cancer treatment. The method of approach is to leverage recent advances in statistical and computational science to create new clinical trial designs and data analysis tools that resolve many of the key scientific limitations of current clinical trial methodology. The program project involves five interrelated research projects focusing on developing these tools for personalized medicine. The proposed methods have the potential to alter the prevailing clinical trial paradigm and increase the discovery and translation of new treatments into clinical practice. The multi-institutional approach, which exploits the complementary strengths of each of the three universities, includes an effective and energetic process for coordinated implementation, communication, and dissemination of the results, including the development of new software for public dissemination to practitioners. The project will lead to significant improvements in cancer clinical trial practice that will result in improved health and outcomes for cancer patients. The NCSU investigators will collaborate with each other and investigators from Duke and UNC to address the research problems jointly through synergistic interactions that exploit the complementary expertise from all three institutions.
Dynamic treatment regimens (DTRs) are sequential decision rules for individual patients that can adapt over time to an evolving illness. The ultimate goal is to accommodate heterogeneity among patients and find the DTR that will produce the best long-term outcome. Patients often receive treatments at multiple decision times, the effects of the covariates are often complex, and the number of covariates is often very high. The broad, long-term objectives of this project are to adapt recent statistical learning methodologies and to develop new methods for optimal, personalized, single-stage or dynamic treatment regimens. In particular, this project aims 1) to develop flexible semiparametric modeling tailored to single-stage or dynamic treatment regimens; 2) to develop penalized Q-learning and valid statistical inference for estimating optimal dynamic regimens with censored outcomes; 3) to develop effective variable selection strategies that can simplify and improve the implementation and reproducibility of personalized treatment regimens. For all of these goals, we will rigorously establish the desired asymptotic properties and provide suitable numerical algorithms. The outlined research project will bring more insight into multi-stage, high-dimensional statistical learning and will benefit future studies in this area. This study will develop flexible models and Q-learning methods for estimating personalized regimens. It will enrich the family of personalized medicine methodologies in general. A successful completion of this research will not only fill important gaps in statistical theory, but will also yield new tools for applied statisticians and other scientists. This project will foster more intensive collaborations among investigators from the Department of Statistics, the Department of Mathematics and the Health Systems Engineering group in the Edward P. 
Fitts Department of Industrial and Systems Engineering (ISE) at North Carolina State University. The proposed study will promote teaching, training and learning at North Carolina State University. Research conducted in this study will help develop advanced graduate courses in statistical learning and semiparametric methodology. It will create challenging statistical projects for graduate students.
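The Q-learning approach to dynamic treatment regimens can be sketched in a minimal two-stage example: fit the stage-2 Q-function, derive the optimal stage-2 rule, form a pseudo-outcome, and regress it on stage-1 history. The linear Q-functions, the synthetic data-generating process, and all coefficients below are invented for illustration; this is plain backward-induction Q-learning, not the proposal's penalized estimator with censoring.

```python
import numpy as np

# Hypothetical two-stage DTR sketch with linear Q-functions fit by least
# squares. All states, treatments, and outcomes are synthetic.

rng = np.random.default_rng(2)
n = 2000

# Stage 1: covariate x1, randomized treatment a1 in {0, 1}.
x1 = rng.normal(size=n)
a1 = rng.integers(0, 2, size=n)
# Stage-2 covariate depends on the stage-1 history.
x2 = 0.5 * x1 + 0.3 * a1 + rng.normal(size=n)
a2 = rng.integers(0, 2, size=n)
# Final outcome: stage-2 treatment helps only when x2 > 0.
y = x2 + a2 * x2 + rng.normal(scale=0.5, size=n)

def lsq_q(features, outcome):
    """Least-squares fit of a linear Q-function."""
    return np.linalg.lstsq(features, outcome, rcond=None)[0]

# Backward induction, stage 2: Q2(x2, a2) = b0 + b1*x2 + a2*(b2 + b3*x2).
F2 = np.column_stack([np.ones(n), x2, a2, a2 * x2])
beta2 = lsq_q(F2, y)

def optimal_a2(x2_val):
    # Treat at stage 2 iff the estimated treatment contrast is positive.
    return float(beta2[2] + beta2[3] * x2_val > 0)

# Pseudo-outcome: predicted outcome under the optimal stage-2 rule.
contrast = beta2[2] + beta2[3] * x2
v2 = beta2[0] + beta2[1] * x2 + np.maximum(contrast, 0.0)

# Stage 1: regress the pseudo-outcome on the stage-1 history.
F1 = np.column_stack([np.ones(n), x1, a1, a1 * x1])
beta1 = lsq_q(F1, v2)
```

The fitted rules recover the tailoring structure built into the simulation (treat at stage 2 exactly when x2 > 0); the penalized Q-learning in Aim 2 would replace the plain least-squares fits with penalized ones to support inference and censored outcomes.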
Influenza is a serious public health concern. Seasonal epidemics annually affect 5-15% of the world's population, resulting in 3-5 million cases of severe illness and up to 500,000 deaths. Influenza also has significant economic impacts, which include direct medical costs and working days lost because of illness. The first line of defense against seasonal influenza is getting the flu shot every year. However, flu viruses frequently mutate, and multiple strains co-circulate in one season. The World Health Organization (WHO) updates the flu shot annually based on global surveillance. Empirical studies have documented that one to three mutations on the surface protein of circulating viruses can reduce the antigenicity and efficacy of inactivated vaccines. The main goal of this proposal is to identify amino acid residues on the surface protein of the influenza virus that are mainly responsible for antigenic variation. Our results will allow the WHO to rapidly determine whether the current vaccine strains provide enough immune protection against an emerging virus when updating the annual influenza vaccine composition.
This program project, entitled "Statistical Methods for Cancer Clinical Trials," will be a joint venture of Duke University, North Carolina State University (NCSU), and the University of North Carolina at Chapel Hill (UNC). Biostatistician and clinician researchers from these three top research institutions will collaborate on project research and share project-related resources. The scientific goal of this ambitious program project is to develop innovative statistical methods for cancer clinical trials that can hasten the successful introduction of effective new therapies into practice. The method of approach is to leverage recent advances in statistical and computational science to create new clinical trial designs and data analysis tools that resolve many of the key scientific limitations of current clinical trial methodology. The program project involves five interrelated research projects focusing on practical design and analysis problems in Phase II and III clinical trials, the problem of missing data and efficient use of prognostic information, post-marketing surveillance and comparative effectiveness research using clinical trial data, pharmacogenetics and individualized therapies, and the potential of dynamic treatment regimes to improve cancer treatment. The proposed methods have the potential to alter the prevailing clinical trial paradigm and increase the discovery and translation of new treatments into clinical practice. The multi-institutional approach, which exploits the complementary strengths of each of the three universities, includes an effective and energetic process for coordinated implementation, communication, and dissemination of the results, including the development of new software for public dissemination to practitioners. 
The project will lead to significant improvements in cancer clinical trial practice that will result in improved health and outcomes for cancer patients. The NCSU component of the project will be administered by the NCSU Center for Quantitative Sciences in Biomedicine (CQSB) and involves seven faculty in the Department of Statistics, who will participate in all facets of the research. The investigators will collaborate with each other and investigators from Duke and UNC to address the research problems jointly through synergistic interactions that exploit the complementary expertise from all three institutions.
In contrast to the standard treatment discovery framework, which is used for finding single treatments for a homogeneous group of patients, personalized medicine involves finding therapies that are tailored to each individual in a heterogeneous group. This is currently of great interest to cancer clinicians, since it holds the promise of better outcomes for more patients. The broad, long-term objectives of this project are to adapt recent statistical advances and to develop new methods for optimal, personalized, single-stage or dynamic treatment regimens for cancer and other chronic and life-threatening diseases. In particular, this project aims 1) to develop flexible semiparametric modeling tailored to personalized medicine in single-stage or dynamic treatment regimens; 2) to develop penalized Q-learning and proper statistical inference for estimating optimal dynamic regimens with censored outcomes; 3) to develop effective variable selection strategies that can simplify and improve the implementation and reproducibility of personalized treatment regimens. For all of these goals, we will rigorously establish the desired asymptotic properties and provide suitable numerical algorithms. We will assess the performance of the proposed methods through extensive simulation studies and provide applications to real studies.
With rapid advances in computing power and other modern technology, high-throughput data of unprecedented size and complexity are becoming commonplace in diverse fields. Examples include data from genetics, microarrays, proteomics, fMRI, cancer clinical trials and high-frequency finance. These high-dimensional data characterize many important contemporary problems in statistics, and feature selection plays a pivotal role in these problems. This research project aims to develop cutting-edge statistical theory and methods for high-dimensional variable selection. In particular, the PI proposes the following interrelated research topics for investigation: (1) grouped-variable screening with sparse linear models; (2) nonparametric component screening with sparse additive models; (3) parametric component screening with sparse semiparametric models; and (4) their further extensions. The proposed methods will be studied theoretically for their sure screening behavior and compared with some of the existing methods empirically in terms of computational expediency, statistical accuracy and algorithmic stability. The outlined research project on variable selection in high dimensions tackles fundamental problems in statistical learning and will stimulate interest from a large group of scientists and researchers in diverse fields of science, engineering and the humanities, ranging from genomics and health sciences to economics and finance. Another key aspect of this project is the integration of research and education, which will be achieved by developing two new courses on statistical learning and non- and semi-parametric inference and by proposing specific projects for students during the teaching of classes. It will enable the participation of all citizens from various disciplines, including underrepresented groups of students.
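The sure screening idea for sparse linear models can be illustrated with the simplest marginal version: rank predictors by absolute marginal correlation with the response and keep the top d. The dimensions, the active set, and the coefficients below are synthetic; this sketch shows the generic screening step only, not the proposal's grouped or semiparametric extensions.

```python
import numpy as np

# Hypothetical sketch of marginal screening in a sparse linear model with
# far more predictors than samples. All data are synthetic.

rng = np.random.default_rng(3)
n, p = 200, 1000                      # many more predictors than samples
X = rng.normal(size=(n, p))
active = [0, 1, 2]                    # true sparse support (invented)
y = X[:, active] @ np.array([3.0, -2.0, 1.5]) + rng.normal(size=n)

def sis(X, y, d):
    """Keep the d predictors with the largest absolute marginal correlation."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[::-1][:d]

selected = sis(X, y, d=20)
```

The "sure screening" property studied in the proposal is exactly the guarantee that, with probability tending to one, the retained set `selected` contains the true support, after which a refined method can be run on the much smaller screened model.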