Monday, May 2

Room A: 1220 JSMBS
Room B: 2220A JSMBS
Room C: 2220B JSMBS

Session 1, 9:00-10:30 am, Room A

9:00-9:28 am Welcome and intro to the conference
- 9:00-9:10 am Dr. Allison Brashear, Vice President for Health Sciences and Dean, Jacobs School of Medicine & Biomedical Sciences, University at Buffalo
- 9:11-9:17 am Dr. Sanjay Sethi, Associate Director, CTSI, University at Buffalo
- 9:18-9:28 am Dr. Jean Wactawski-Wende, Dean of the SPHHP, University at Buffalo
9:30-10:30 am Keynote 1 (Professor Regina Liu)
- Chair: Marianthi Markatou, University at Buffalo

Break, 10:30-10:50 am

Session 2, 10:50 am-12:00 noon

Room A

Organizer/Chair: Brent Johnson, University of Rochester

Speakers:

Linbo Wang, University of Toronto
Co-authors: Meng, Xiang; Richardson, Thomas; Robins, James
Presentation Title: Coherent modeling of longitudinal causal effects on binary outcomes
Abstract: Analyses of biomedical studies often necessitate modeling longitudinal causal effects. The current focus on personalized medicine and effect heterogeneity makes this task even more challenging. Towards this end, structural nested mean models (SNMMs) are fundamental tools for studying heterogeneous treatment effects in longitudinal studies. However, when outcomes are binary, current methods for estimating multiplicative and additive SNMM parameters suffer from variation dependence between the causal parameters and the non-causal nuisance parameters. This leads to a series of difficulties in interpretation, estimation and computation. These difficulties have hindered the uptake of SNMMs in biomedical practice, where binary outcomes are very common. We solve the variation dependence problem for the binary multiplicative SNMM via a reparametrization of the non-causal nuisance parameters. Our novel nuisance parameters are variation independent of the causal parameters, and hence allow for coherent modeling of heterogeneous effects from longitudinal studies with binary outcomes. Our parametrization also provides a key building block for flexible doubly robust estimation of the causal parameters. Along the way, we prove that an additive SNMM with binary outcomes does not admit a variation independent parametrization, thereby justifying the restriction to multiplicative SNMMs.

Indrabati Bhattacharya, University of Rochester
Co-authors:
Presentation Title: Nonparametric Bayesian Q-learning for adjusting partial compliance in SMART trials
Abstract: Q-learning is a well-known reinforcement learning approach for estimation of optimal dynamic treatment regimes. Existing methods for estimation of dynamic treatment regimes are limited to intention-to-treat analyses--which estimate the effect of randomization to a particular treatment regime without considering the compliance behavior of patients. In this article, we propose a novel Bayesian nonparametric Q-learning approach based on stochastic decision rules for adjusting partial compliance. We consider the popular potential compliance framework, where some potential compliances are latent and need to be imputed. For each stage, we fit a locally weighted Dirichlet process mixture model for the conditional distribution of potential outcomes given the compliance values and baseline covariates. The key challenge is learning the joint distribution of the potential compliances, which we do using a Dirichlet process mixture model. Our approach provides two sets of decision rules: (1) conditional decision rules given the potential compliance values; and (2) marginal decision rules where the potential compliances are marginalized. Extensive simulation studies show the effectiveness of our method compared to intention-to-treat analyses. We apply our method on the Adaptive Treatment For Alcohol and Cocaine Dependence Study (ENGAGE), where the goal is to construct optimal treatment regimes to engage patients in therapy.

William Artman, University of Rochester
Co-authors: Ertefaie, Ashkan; Lynch, Kevin; McKay, James; Johnson, Brent
Presentation Title: A Marginal Structural Model for Partial Compliance in SMARTs
Abstract: Sequential, multiple assignment, randomized trials (SMART) are a clinical trial design which allows for the comparison of sequences of treatment decision rules tailored to the individual patient, i.e., dynamic treatment regime (DTR). The standard approach to analyzing a SMART is intention-to-treat (ITT) which may lead to biased estimates of DTR outcomes in the presence of partial compliance. A major causal inference challenge is that adjusting for observed compliance directly leads to the post-treatment adjustment bias. Principal stratification is a powerful tool which stratifies patients according to compliance classes allowing for a causal interpretation of the effect of compliance on DTR outcomes. Importantly, differential compliance behavior may lead to different optimal DTRs. We extend existing methods from the single-stage setting to the SMART setting by developing a principal stratification framework that leverages a flexible Bayesian non-parametric model for the compliance distribution and a parametric marginal structural model for the outcome. We conduct simulation studies to validate our method.

Room B Tutorial: Community Detection in Complex Networks

Room C Tutorial: An Introduction to Bayesian Thinking: Gaining Insights into a Theory that would not die!

Lunch, 12:00-1:00 pm, Room A

Session 3, 1:00-2:10 pm

Room A

Organizer: Sumanta Basu, Cornell University

Chair: Yuxin Ding, Eli Lilly and Company

Speakers:

Marie-Christine Duker, Cornell University
Co-authors: Betken, Annika
Presentation Title: Higher-order approximation to construct confidence intervals in time series
Abstract: For time series with high correlation, the empirical process converges extremely slowly to its limiting distribution. Many relevant statistics like the median and the Wilcoxon statistic can be written as functionals of the empirical process and inherit the slow convergence. Inference based on the limiting distribution of those quantities becomes highly impacted by relatively small sample sizes. This talk proposes a novel approach to calculate confidence intervals for the empirical process and the median based on a higher-order approximation of the empirical process. We establish the theoretical validity of our method for statistics that are functionals of the empirical process. In a simulation study, we compare coverage rate and interval length of our confidence intervals with classical results and highlight the improvements. The talk concludes with a discussion of further utilization of the proposed approximation in change-point analysis and high-dimensional time series analysis.

Pramita Bagchi, George Mason University
Co-authors: Bruce, Scott
Presentation Title: Adaptive Frequency Band Analysis for Functional Time Series
Abstract: The frequency-domain properties of nonstationary functional time series often contain valuable information. These properties are characterized through their time-varying power spectrum. Practitioners seeking low-dimensional summary measures of the power spectrum often partition frequencies into bands and create collapsed measures of power within bands. However, standard frequency bands have largely been developed through manual inspection of time series data and may not adequately summarize power spectra. In this article, we propose a framework for adaptive frequency band estimation of nonstationary functional time series that optimally summarizes the time-varying dynamics of the series. We develop a scan statistic and search algorithm to detect changes in the frequency domain. We establish the theoretical properties of this framework and develop a computationally-efficient implementation. The validity of our method is also justified through numerous simulation studies and an application to analyzing electroencephalogram data in participants alternating between eyes open and eyes closed conditions.

Sayar Karmakar, University of Florida
Co-authors: Chudy, Marek; Wu, Wei Biao
Presentation Title: Long-term predictions with many covariates
Abstract: Time-aggregated prediction intervals are constructed for a univariate response time series in a high-dimensional regression regime. A simple quantile-based approach on the LASSO residuals seems to provide reasonably good prediction intervals. We allow for a very general possibly heavy-tailed, possibly long-memory and possibly non-linear dependent error process and discuss both the situations where the predictors are assumed to form a fixed or stochastic design. Finally, we construct prediction intervals for hourly electricity prices over horizons spanning 17 weeks and compare them to selected Bayesian and bootstrap interval forecasts

Room B Tutorial: Community Detection in Complex Networks (cont.)

Room C Tutorial: An Introduction to Bayesian Thinking: Gaining Insights into a Theory that would not die! (cont.)

Session 4, 2:15-3:50 pm

Room A

Organizer: Albert Vexler, University at Buffalo

Chair: Jihnhee Yu, University at Buffalo

Speakers:

Paul Albert, National Cancer Institute
Co-authors:
Presentation Title: Latent Variable Modeling Approaches for Chemical Mixtures
Abstract: Understanding the relationships between biomarkers of exposure and disease incidence is an important problem in environmental epidemiology. Typically, a large number of these exposures are measured, and it is found either that a few exposures transmit risk or that each exposure transmits a small amount of risk, but, taken together, these may pose a substantial disease risk. Importantly, these effects can be highly non-linear and can be in different directions. We develop a latent functional approach, which assumes that the individual joint effects of each biomarker exposure can be characterized as one of a series of unobserved functions, where the number of latent functions is less than or equal to the number of exposures. We propose Bayesian methodology to fit models with a large number of exposures. An efficient Markov chain Monte Carlo sampling algorithm is developed for carrying out Bayesian inference. The deviance information criterion is used to choose an appropriate number of nonlinear latent functions. We demonstrate the good properties of the approach using simulation studies. Further, we show that complex exposure relationships can be represented with only a few latent functional curves. The proposed methodology is illustrated with an analysis of the effect of cumulative pesticide exposure on cancer risk in a large cohort of farmers.

Yaakov Malinovsky, University of Maryland, Baltimore County
Co-authors:
Presentation Title: Group Testing: Some Results and Open Challenges
Abstract: Group testing has its origins in the identification of syphilis in the U.S. Army during World War II. The aim of the method is to test groups of people instead of single individuals in such a way that infected individuals are detected while the testing costs are reduced. In the last few years of the COVID-19 pandemic, the mostly-forgotten practice of group testing has been raised again in many countries as an efficient method for addressing the epidemic while facing restrictions of time and resources. Consider a finite population of N items, where item i has a probability p to be defective, independent of the other units (in the generalized group testing allows different pi). A group test is a binary test on an arbitrary group of items with two possible outcomes: all items are good, or at least one item is defective. The goal is to identify all items through group testing with the minimum expected number of tests. The optimum procedure, with respect to the expected total number of tests, is unknown even in the case where all pi are equal. In this talk, I shall review established results in the group testing literature and present new results characterizing the optimality of group testing procedures. In addition, I will discuss some open problems and conjectures.

Aleksey Polunchenko, Binghamton University
Co-authors: Li, Kexuan
Presentation Title: On the performance of the Generalized Shiryaev-Roberts control chart in continuous time
Abstract: The topic of interest is the performance of the Generalized Shiryaev-Roberts (GSR) control chart in continuous time, where the goal is to detect a possible onset of a drift in a standard Brownian motion observed live. We derive analytically and in a closed-form all of the relevant performance characteristics of the chart. By virtue of the obtained performance formulae we show numerically that the GSR chart with a carefully designed headstart is far superior to such mainstream charts as CUSUM and EWMA. More importantly, the Fast Initial Response feature exhibited by the headstarted GSR chart makes the latter not only better than the mainstream charts, but nearly the best one can do overall for a fixed in-control Average Run Length level. This is a stronger conclusion than that previously reached about CUSUM in the seminal 1982 Technometrics paper by J.M. Lucas and R.B. Crosier.

Albert Vexler, UB, Biostatistics
Co-authors:
Presentation Title: Univariate Likelihood Projections and Characterizations of the Multivariate Normal Distribution Applicable to a Multivariate Change Point Detection
Abstract: The problem of characterizing a multivariate distribution of a random vector using examination of univariate combinations of vector components is an essential issue of multivariate analysis. The likelihood principle plays a prominent role in developing powerful statistical inference tools. In this context, we raise the question: can the univariate likelihood function based on a random vector be used to provide the uniqueness in reconstructing the vector distribution? In multivariate normal (MN) frameworks, this question links to a reverse of Cochran’s theorem that concerns the distribution of quadratic forms in normal variables. We characterize the MN distribution through the univariate likelihood type projections. The proposed principle is employed to illustrate simple techniques for testing the hypothesis: “observed vectors are from a MN distribution” versus that “firs data points are from a MN distribution, and then, starting from an unknown position, observations are non-MN distributed”. In this context, the proposed characterizations of MN distributions allow us to employ well-known mechanisms that use univariate observations. The displayed testing strategy can exhibit high and stable power characteristics, when observed vectors satisfy the alternative hypothesis, whereas their components are normally distributed random variables. In such cases, classical change point detections based on, e.g., Shapiro-Wilk, Henze-Zirklers and Mardia type engines, may break down completely.

Room B

Organizers: Jeffrey Miecznikowski and Rachael Hageman Blair, University at Buffalo

Chair: Alexander Foss, Sandia National Labs

Speakers:

Jingwen Yan, Indiana University Purdue University Indianapolis
Co-authors:
Presentation Title: Integrative -omics for discovery of system level markers of Alzheimer’s disease
Abstract: A large number of genetic variations have been identified to be associated with Alzheimer’s disease (AD) and related quantitative traits. However, majority of existing studies focused on single types of omics data, lacking the power of generating a community including multi-omic markers and their functional connections. Because of this, the immense value of multi-omics data on AD has attracted much attention. Leveraging genomic, transcriptomic and proteomic data, and their backbone network through functional relations, we proposed a modularity-constrained logistic regression model to mine the association between disease status and a group of functionally connected multi-omic features, i.e. single-nucleotide polymorphisms (SNPs), genes and proteins. This new model was applied to the real data collected from the frontal cortex tissue in the Religious Orders Study and Memory and Aging Project cohort. Compared with other state-of-art methods, it provided overall the best prediction performance during cross-validation. This new method helped identify a group of densely connected SNPs, genes and proteins predictive of AD status. These SNPs are mostly expression quantitative trait loci in the frontal region. Brain-wide gene expression profile of these genes and proteins were highly correlated with the brain activation map of ‘vision’, a brain function partly controlled by frontal cortex. These genes and proteins were also found to be associated with the amyloid deposition, cortical volume and average thickness of frontal regions. Taken together, these results suggested a potential pathway underlying the development of AD from SNPs to gene expression, protein expression and ultimately brain functional and structural changes.

Rachael Hageman Blair, University of Buffalo
Co-authors: Han Yu
Presentation Title: Integrative modeling with regulatory and metabolic networks
Abstract: Mathematical models of biological networks can provide important predictions and insights into complex disease. Constraint-based models of cellular metabolism and probabilistic models of gene regulatory networks are two distinct areas that have progressed rapidly in parallel over the past decade. In principle, gene regulatory networks and metabolic networks underlay the same complex phenotypes and diseases. However, systematic integration of these two model systems remains a fundamental challenge. In this work, we address this challenge by fusing probabilistic models of gene regulatory networks into constraint-based models of metabolism. The novel approach utilizes probabilistic reasoning in Bayesian Network models of regulatory networks to serve as the “glue” that enables a natural interface between the two systems. Probabilistic reasoning is used to predict and quantify system-wide effects of perturbation to the regulatory network in the form of constraints for flux estimation. In this setting, both regulatory and metabolic networks inherently account for uncertainty. This framework demonstrates that predictive modeling of enzymatic activity can be facilitated using probabilistic reasoning, thereby extending the predictive capacity of the network. Integrated models are developed for brain and used in applications for Alzheimer’s disease and Glioblastoma to assess the role of potential drug targets on downstream metabolism. Applications highlight the ability of integrated models to prioritize drug targets and drug target combinations and the importance of accounting for the complex structure of the regulatory network that inherently influences the metabolic model.

Jeffrey Miecznikowski, SUNY University at Buffalo
Co-authors: Tritchler, David; Zhang, Fan
Presentation Title: Identification of supervised and sparse functional genomic pathways
Abstract: Functional pathways involve a series of biological alterations that may result in the occurrence of many diseases including cancer. With the widespread availability of “omics” technologies it has become feasible to integrate information from a hierarchy of biological layers to provide a more comprehensive understanding of disease. This talk provides a brief overview of a particular method to discover these functional networks across biological layers of information that are correlated with the phenotype. Simulations and a real data analysis from the The Cancer Genome Atlas (TCGA) will be shown to demonstrate the performance of the method.

Ramana Davuluri, Stony Brook University
Co-authors: Ji, Yanrong; Dutta, Pratik;
Presentation Title: Deep Multi-Omics Integration by Learning Correlation-Maximizing Representation Identifies Prognostically Better Stratified Cancer Subtypes
Abstract: Molecular subtyping of cancer is among the most important topics in translational cancer research. Over the years, there have been numerous subtyping studies performed on different types of cancers using various genomic data. Majority of studies utilized microarray or RNA-Seq-based expression profiling data, although there have been also many studies using data from other -omics platforms, such as DNA methylation, copy number variation, and microRNA. Nevertheless, great inconsistency on the number and assignment of clusters was often observed between subtyping studies utilizing data from different -omics types, and sometimes even data from same platforms, lowering the reproducibility of such research. Subtypes that consistently appear across multiple levels are considered robust and proven to enhance the predictions on patient prognosis. From a deep multi-view learning perspective, different -omics levels can be seen as individual views, or modalities, collected on same data, while the objective is to learn a shared nonlinear representation that simultaneously encodes all modalities by preserving maximal information through deep neural networks. In this talk, we present DeepMOIS-MC (Deep Multi-Omics Integrative Subtyping by Maximizing Correlation), a novel deep learning-based method that achieves multi-omics integration and subtyping of cancer by finding a low-dimensional shared representation that maximizes the correlation between multiple views. DeepMOIS-MC extends DGCCA (Deep Generalized Canonical Correlation Analysis), a canonical correlation analysis-based algorithm that can simultaneously learn nonlinear relationships between more than two views. The hypothesis is that the shared embedded space that maximizes correlation between views should contain most useful information for robust subtyping, since this indicates that certain patterns are repeatedly seen across multiple -omics platforms. We show that DeepMOIS-MC is indeed capable of robustly and accurately identifying cancer subtypes with enhanced prognostic stratification that are translatable across platforms.

Room C

Chair: Elisavet Sofikitou

Speakers:

Anitha Sivasubramanian, John Wiley and Sons
Co-authors: N/A
Presentation Title: Smart defaults and updating estimates over time using conjugate prior Bayesian models
Abstract: Alta is Wiley’s fully integrated, adaptive learning courseware. Alta is designed to optimize the way students study and learn while completing assignments. It does this using a mastery learning-based approach, where students continue to work and learn until they reach “mastery” of their assigned topics. If a student struggles on an assignment, alta recognizes their knowledge gap immediately and provides just-in-time learning supports — even when it requires reaching back to prerequisite concepts. As a result, students can better retain, recall and apply what they’re learning in their course. One implication of mastery learning is that different students will require different amounts of time for their homework. When instructors are assigning homework, it’s important for them to have a view of how much work they’re assigning. We recently revamped how we compute the assignment time and question estimates surfaced in Alta using prior data. The challenge is that prior data is sparse and evolving — each learning objective (LO) is different and new LOs/domains will not have any historical data. So, we need to build a model that will start with smart defaults. As we receive more LO completion information, we must keep updating our best estimate based on new data. This presentation will be a closer look into how Wiley empowers instructors and students by providing data-based metric representing an estimate for assignment length/time. Our model answers this question by estimating the range of number of questions/time for assignment completion at learning objective level. We start with a prior belief about the estimate and use Bayesian updating to transform these prior beliefs into posterior every time new data arrives. By using Bayesian updating we can quickly scale from no data to little data to lots of data smoothly, explicitly predict uncertainties and guarantee transparency in the user/model behaviors.

Lyubov Doroshenko, L’Université Laval
Co-authors: Liseo, Brunero; Macaro, Christian
Presentation Title: A Mixture of Heterogeneous Models with Dirichlet Time-Varying Weights
Abstract: Understanding stock market volatility is a major task for analysts, policy makers, and investors. Due to the complexity of stock market data, development of efficient models for predicting is far from trivial. In this work, we provide a way to analyze this kind of data using regression mixture models. We introduce a novel mixture of heterogeneous models with mixing weights characterized by an autoregressive structure. In comparison to the static mixture, the proposed model is based on time-dependent weights which allows one to learn whether the data-generating mechanism changes over time. A Bayesian approach based on MCMC algorithm is adopted. Through extensive analysis in both real and simulated data settings, we show the potential usefulness of our mixture model defined in a dynamic fashion over its static counterpart. We illustrate and compare their performance in the context of the stock market expectation of a 30-day forward-looking volatility expressed by Chicago Board Options Exchange’s Volatility Index.

Break, 3:50-4:00 pm

Session 5, 4:00-5:15 pm

Room A

Organizer/Chair: Sumanta Basu, Cornell University

Speakers:

Sanjeena Dang, Carleton University
Co-authors: Silva, Anjali; Rothstein, Steven; Qin, Xiaoke; McNicholas, Paul
Presentation Title: Clustering matrix-variate count data
Abstract: Three-way data structures or matrix-variate data are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix variate distributions offer a natural way to model three-way data and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as means of discovering gene co-expression networks. In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, the number of covariance parameters to be estimated is reduced and the components of resulting covariance matrices provide a meaningful interpretation. We propose three different frameworks for parameter estimation - a Markov chain Monte Carlo based approach, a variational Gaussian approximation-based approach, and a hybrid approach. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.

Y. Samuel Wang, Cornell University
Co-authors: Zhao, Boxin; Wang, Y. Samuel; Kolar, Mladen
Presentation Title: Functional differential graph estimation
Abstract: The problem of estimating the difference between two functional undirected graphical models with shared structures is considered. In many applications, data are naturally regarded as a vector of random functions rather than a vector of scalars. For example, electroencephalography (EEG) data are more appropriately treated as functions of time. In these problems, not only can the number of functions measured per sample be large, but each function is itself an infinite-dimensional object, making estimation of model parameters challenging. This is further complicated by the fact that the curves are usually only observed at discrete time points. We first define a functional differential graph that captures differences between two functional graphical models and formally characterize when the functional differential graph is well defined. We then propose a method, FuDGE, that directly estimates the functional differential graph without first estimating each individual graph. This is particularly beneficial in settings where the individual graphs are dense, but the differential graph is sparse. We show that FuDGE consistently estimates the functional differential graph even in a high-dimensional setting for both discretely observed and fully observed function paths.

Dana Yang, Cornell University
Co-authors: Ding, Jian; Wu, Yihong; Xu, Jiaming; Yang, Dana
Presentation Title: Recovery of planted matching in bipartite graphs
Abstract: Motivated by the application of tracking moving particles from snapshots, we study the problem of recovering a planted perfect matching hidden in an Erdos–Rényi bipartite graph. We establish the information-theoretic threshold for almost exact recovery of the hidden matching. Our result extends to general weighted graphs across both dense and sparse regimes. Furthermore, in the special case of exponential weights, we prove that the optimal reconstruction error is infinitely smooth at the threshold, confirming the infinite-order phase transition conjectured in [Semerjian et al. 2020].

Room B

Organizer/Chair: Guan Yu, University at Buffalo

Speakers:

Xingye Qiao, Binghamton University
Co-authors: Wang, Zhou
Presentation Title: Learning Acceptance Regions for Many Classes with Anomaly Detection
Abstract: Set-valued classification, a new classification paradigm that aims to identify all the plausible classes that an observation belongs to, can be obtained by learning the acceptance regions for all classes. Many existing set-valued classification methods do not consider the possibility that a new class that never appeared in the training data appears in the test data. Moreover, they are computationally expensive when the number of classes is large. We propose a Generalized Prediction Set (GPS) approach to estimate the acceptance regions while considering the possibility of a new class in the test data. The proposed classifier minimizes the expected size of the prediction set while guaranteeing that the class-specific accuracy is at least a pre-specified value. Unlike previous methods, the proposed method achieves a good balance between accuracy, efficiency, and anomaly detection rate. Moreover, our method can be applied in parallel to all the classes to alleviate the computational burden. Both theoretical analysis and numerical experiments are conducted to illustrate the effectiveness of the proposed method.

Saptarshi Chakraborty, University at Buffalo
Co-authors: Guan, Zoe; Martin, Axel; Begg, Colin B.; Shen, Ronglai
Presentation Title: Mining mutation contexts across the cancer genome to map tumor site of origin
Abstract: The vast preponderance of somatic mutations in a typical cancer are either extremely rare or have never been previously recorded in available databases that track somatic mutations. These constitute a hidden genome that contrasts the relatively small number of mutations that occur frequently, the properties of which have been studied in depth. Here we demonstrate that this hidden genome contains much more accurate information than common mutations for the purpose of identifying the site of origin of primary cancers in settings where this is unknown. We accomplish this using a projection-based statistical method that achieves a highly effective signal condensation, by leveraging DNA sequence and epigenetic contexts using a set of meta-features that embody the mutation contexts of rare variants throughout the genome.

Guan Yu, University at Buffalo
Co-authors:
Presentation Title: High-dimensional Cost-constrained Regression via Non-convex Optimization
Abstract: In modern predictive modeling process, budget constraints become a very important consideration due to the high cost of collecting data using new techniques such as brain imaging and DNA sequencing. This motivates us to develop new and efficient high-dimensional costconstrained predictive modeling methods. In this paper, to address this challenge, we first study a new non-convex high-dimensional cost-constrained linear regression problem, that is, we aim to find the cost-constrained regression model with the smallest expected prediction error among all models satisfying a budget constraint. The non-convex budget constraint makes this problem NP-hard. In order to estimate the regression coefficient vector of the cost-constrained regression model, we propose a new discrete extension of recent first-order continuous optimization methods. In particular, our method delivers a series of estimates of the regression coefficient vector by solving a sequence of 0-1 knapsack problems that can be addressed by many existing algorithms such as dynamic programming efficiently. Next, we show some extensions of our proposed method for statistical learning problems using loss functions with Lipschitz continuous gradient. It can be also extended to problems with groups of variables or multiple constraints. Theoretically, we prove that the series of the estimates generated by our iterative algorithm converge to a first-order stationary point, which can be a globally optimal solution to the nonconvex high-dimensional cost-constrained regression problem. Computationally, our numerical studies show that the proposed method can solve problems of fairly high dimensions and has promising estimation, prediction, and model selection performance.

Room C

Chair: Rachael Hageman Blair

Speakers:

Li Yan, Roswell Park Comprehensive Cancer Center
Co-authors: Buas, Matthew; Drescher, Charles; Urban, Nicole; Li, Christopher; Bettcher, Lisa;, Hait, Nitai; Moysich, Kirsten; Odunsi, Kunle; Raftery, Daniel;
Presentation Title: Utilize global lipidomics and CA125 for diagnosis and triage ovarian cancer versus benign adnexal mass with Receiver operating characteristic (ROC) curve analysis
Abstract: Altered lipid metabolism has emerged as an important feature of ovarian cancer (OC), yet the translational potential of lipid metabolites to aid in diagnosis and triage remains unproven. We conducted a multi-level interrogation of lipid metabolic phenotypes in patients with adnexal masses, integrating quantitative lipidomics profiling of plasma and ascites with publicly-available tumor transcriptome data. We assessed concentrations of > 500 plasma lipids in two patient cohorts—(i) a pilot set of 100 women with OC (50) or benign tumor (50), and (ii) an independent set of 118 women with malignant (60) or benign (58) adnexal mass. 249 lipid species and several lipid classes were significantly reduced in cases versus controls in both cohorts (FDR < 0.05). 23 metabolites—triacylglycerols, phosphatidylcholines, cholesterol esters—were validated at Bonferroni significance (P < 9.16 × 10–5). Certain lipids exhibited greater alterations in early- (diacylglycerols) or late-stage (lysophospholipids) cases, and multiple lipids in plasma and ascites were positively correlated. Lipoprotein receptor gene expression differed markedly in OC versus benign tumors. Importantly, several plasma lipid species, such as DAG(16:1/18:1), improved the accuracy of CA125 in differentiating early-stage OC cases from benign controls, and conferred a 15–20% increase in specificity at 90% sensitivity in multivariate models adjusted for age and BMI. In addition, Monte Carlo cross validation method and LASSO were used to investigate the robustness of the ROC analyses and the effect of combination multiple lipids with CA125. This study provides novel insight into systemic and local lipid metabolic differences between OC and benign disease, further implicating altered lipid uptake in OC biology, and advancing plasma lipid metabolites as a complementary class of circulating biomarkers for OC diagnosis and triage.

Adam Cunningham, University at Buffalo
Co-authors:
Presentation Title: Modeling and Predicting Persistent Symptoms in Adolescent Patients with Sport-Related Concussion Injuries
Abstract: It is estimated that approximately 5% – 10% of children in the US will experience a concussion at some point. Although most recover within a few weeks, approximately 30% take longer than a month to recover, and are said to experience Persistent Post-Concussive Symptoms (PPCS). Since children with PPCS are far more likely to experience psychosocial adjustment issues and learning difficulties in school, evidence-based tests are needed to indicate when persistent symptoms are a possibility requiring additional monitoring and early intervention. There are, however, currently no objective blood or imaging biomarkers to diagnose concussion or to identify early on those patients who will take longer to recover.

Working with clinicians from the University at Buffalo Concussion Management Clinic, I have developed a simple scoring system, the Risk for Delayed Recovery (RDR)-Score, which predicts the risk of PPCS in adolescents with concussion injuries. The data used to develop this system was collected by the clinic on 270 adolescent concussion patients over three years using the Buffalo Concussion Physical Examination (BCPE). This is a brief physical examination which identifies dysfunction within physiological and neurological sub-systems known to be affected by concussion. Developing the RDR-Score involved fitting Cox proportional-hazards models, accelerated failure time models, and binomial generalized linear models to the BCPE data, evaluating each model using cross-validation, and choosing an optimal subset of predictors to include in the final model. I then developed a technique to convert the coefficients of the best model into a set of small integer weighting factors that optimally preserved the characteristics of the continuous model. The resulting weighted scoring system allows physicians in an outpatient setting to more accurately predict which children are at greater risk for PPCS early after their injury, and who would benefit most from targeted therapies.

Sreelekha Guggilam, ORNL
Co-authors: Chandola, Varun; Patra, Abani
Presentation Title: Classifying Anomalous Members in a Collection of Multivariate Time Series Data Using Large Deviations Principle: An Application to COVID-19 Data
Abstract: Anomaly detection for time series data is often aimed at identifying extreme behaviors within an individual time series. However, identifying extreme trends relative to a collection of other time series is of significant interest, like in the fields of public health policy, social justice and pandemic propagation. We propose an algorithm that can scale to large collections of time series data using the concepts from the theory of large deviations. Exploiting the ability of the algorithm to scale to high-dimensional data, we propose an online anomaly detection method to identify anomalies in a collection of multivariate time series. We demonstrate the applicability of the proposed Large Deviations Anomaly Detection (LAD) algorithm in identifying counties in the United States with anomalous trends in terms of COVID-19 related cases and deaths. Several of the identified anomalous counties correlate with counties with documented poor response to the COVID pandemic.

Tuesday, May 3

Room A: 1220 JSMBS
Room B: 1225A
Room C: 1225B

Session 1, 9:00-10:00 am, Room A

Panel on Data Science

Session 2, 10:05-11:40 am, Room A

10:05-11:00 am Keynote 2 (Dr. Jianying Hu)
- Chair: Gregory Wilding, University at Buffalo
11:05-11:30 am Discussant Presentation
- Chair: Gregory Wilding, University at Buffalo
11:30-11:40 am Floor Discussion

Lunch, 11:40 am-1:00 pm, Room A

Session 3, 1:00-2:00 pm, Room A

Panel on AI

Session 4, 2:05-3:15 pm

Room A

Organizer/Chair: Dongliang Wang, SUNY Upstate Medical University

Speakers:

Xueya Cai, University of Rochester Department of Biostatistics and Computational Biology
Co-authors: Gao, Shan; Li, Yue
Presentation Title: Super Learner Prediction and Variable Importance Analyses of Nursing Home Resident Suicide Ideation
Abstract: The super learner method combines the stacking algorithm and regression analysis to obtain weighted predictions from varied statistical strategies for model prediction. It is shown to perform no worse than any single prediction method as well as to provide consistent estimates. In addition to model predictions, Rose developed nonparametric double robust machine learning in variable importance analyses. The targeted maximum likelihood estimation (TMLE) method was introduced for variable importance analyses, in which super learner predictions were compared between the saturated model and reduced models when each variable was left out. Variable importance was profiled by corresponding p-values.

In the study of nursing home resident suicide ideation, we first performed individual modeling for each of the statistical strategies, including the logistic regression model with model selection, LASSO (least absolute shrinkage and selection operator), Ridge model, Polynomial model, neuro network, and random forest. Ten-fold cross-validation was implemented in each strategy, and the aggregated estimates from ten-fold validations for each algorithm were approached. We further estimated the composite parameter estimates by enameling all model specific estimates, in which mean squared error (MSE) was used to identify best weights for the assembling. The TMLE method was used to identify 10 most important risk factors associated with nursing home resident suicide ideation.

Jianxuan Liu, Syracuse University
Co-authors:
Presentation Title: causal inference with error-prone covariates
Abstract: The goal of most empirical studies in policy research and medical research is to determine whether an alteration in an intervention or a treatment will cause a change in the desired outcome response. Unlike randomized designs, establishing the causal relationship based on observational studies is a challenging problem because the ceteris paribus condition is violated. When the covariates of interest are measured with errors, evaluating the causal effects becomes a thorny issue. Additional challenge arises from confounding variables which are often of high dimensional or correlated with the error-prone covariates. Most of the existing methods for estimating the average causal effect heavily rely on parametric assumptions about the propensity score or the outcome regression model one way or the other. In reality, both models are prone to misspecification, which can have undue influence on the estimated average causal effect. To the best of our knowledge, all the existing methods cannot handle high-dimensional covariates in the presence of error-prone covariates. We propose a semiparametric method to establish the causal relationship, which yields a consistent estimator of the average causal effect. The method we proposed results in efficient estimators of the covariate effects. We investigate their theoretical properties and demonstrate their finite sample performance through extensive simulation studies.

Wei Li, Syracuse University
Co-authors: Subhashis Ghosal
Presentation Title: Bayesian Inference for Level Sets
Abstract: This talk will focus on some recent results from nonparametric Bayesian statistics on level set estimation. Some theoretical properties of the plug-in Bayesian estimator for the level sets, such as posterior contraction and frequentist coverage of the credible regions are discussed. The links between the natural metric for level sets and the supremum-norm distance of the functions, and the difference between the frequentist approach and the proposed Bayesian approach are to be highlighted.

Room B

Organizer/Chair: Gregory Babbitt, RIT

Speakers:

John Werren, University of Rochester
Co-authors: Varela, Austin; Chen, Sammy
Presentation Title: Novel ACE2 protein interactions relevant to COVID-19 predicted by evolutionary rate correlations
Abstract: As Angiotensin-converting enzyme 2 (ACE2) is the cell receptor that SARS-CoV-2 uses to enter cells, a better understanding of the proteins that ACE2 normally interact with could reveal information relevant to COVID-19 disease manifestations and possible avenues for treatment. We have undertaken an evolutionary approach to identify mammalian proteins that “coevolve” with ACE2 based on their evolutionary rate correlations (ERCs). The rationale is that proteins that coevolve with ACE2 are likely to have functional interactions. ERCs reveal both proteins that have previously been reported to be associated with severe COVID-19, but are not currently known to interact with ACE2, and others with novel associations relevant to the disease, such as coagulation and cytokine signally pathways. Using reciprocal rankings of protein ERCs, we have identified ACE2 connections to coagulation pathway proteins Coagulation Factor V and fibrinogen components FGA, FGB, and FGG, the latter possibly mediated through ACE2 connections to Clusterin (which clears misfolded extracellular proteins) and GPR141 (whose functions are relatively unknown). ACE2 also connects to proteins involved in cytokine signaling and immune response (e.g. XCR1, IFNAR2 and TLR8), and to Androgen Receptor (AR). We propose that ACE2 has novel protein interactions that are disrupted during SARS-CoV-2 infection, contributing to the spectrum of COVID-19 pathologies. More broadly, ERCs can be used to predict protein interactions in many different biological pathways. Already, ERCs indicate possible functions for some relatively uncharacterized proteins, as well as possible new functions for well-characterized ones.

Miranda Lynch, Hauptman-Woodward Medical Research Institute
Co-authors:
Presentation Title: Statistical analyses of network representations of key SARS-CoV-2 proteins
Abstract: Network and graph theoretic approaches have been invaluable in modeling and analyzing many aspects of the Covid-19 pandemic, providing a valuable framework for understanding events at scales from population-level (such as transmission dynamics) to host- and virus-levels (such as cellular dynamics, phylogenetic genomic patterns, and protein-protein networks). In this work, we demonstrate how statistical analyses of networks can also contribute basic science insights into subcellular events associated with SARS-CoV-2 infection and treatment, via graph representations of viral proteins and their molecular level associations with host proteins and with therapeutic ligands. Statistical analyses applied to these network models permit comparative assessment of different protein states, which inform on mechanistic aspects of the host-pathogen interface.

Greg Babbitt, Rochester Institute of Technology
Co-authors:
Presentation Title: The functional evolution of target receptor binding in coronaviruses in the past, present, and uncertain future
Abstract: Comparative functional analysis of the dynamic interactions between various Betacoronavirus mutant strains and broadly utilized target proteins such as ACE2, is crucial for a more complete understanding of zoonotic spillovers of viruses that cause severe respiratory diseases such as COVID-19. Here, we employ machine learning to replicate sets of nanosecond scale GPU accelerated molecular dynamics simulations to statistically compare and classify atom motions of these target proteins in both the presence and absence of different endemic and emergent strains of the viral receptor binding domain (RBD) of the S spike glycoprotein. With this method of comparative protein dynamics, we demonstrate some important recent trends in the functional evolution of viral binding to human ACE2 in both endemic and recent emergent human SARS-like strains (hCoV-OC43, hCoV-HKU1, SARS-CoV-1 and 2, MERS-CoV). We also examine how genetic differences between the endemic bat strain batCoV-HKU4, the SARS-CoV-2 progenitor bat strain RaTG13, and the SARS-CoV-2 Variants Of Concern (VOC) alpha to omicron have progressively enhanced the binding dynamics of the spike protein RBD as it has functionally evolved towards more effective human ACE2 interaction and potentially less effective interactions with a Rhinolophus bat ACE2 ortholog. We conclude that while some human adaptation in the virus has occurred across the VOC’s, it has not yet progressed to the point of preventing reverse spillovers into a broad range of mammals. This raises the ominous specter of future unpredictability when SARS-like strains may become more widely held in reservoir in intermediate hosts.

Room C

Organizer/Chair: Saptarshi Chakraborty, University at Buffalo

Speakers:

Qian Qin, University of Minnesota
Co-authors: Jones, Galin L
Presentation Title: Convergence rates and asymptotic variances of two-component Gibbs samplers
Abstract: In this work, we compare the deterministic- and random-scan Gibbs samplers in the two-component case. It is found that the deterministic-scan version converges faster, while the random-scan version can be situationally superior in terms of asymptotic variance. Results herein take computational cost into account.

Subhadip Pal, University of Louisville
Co-authors: Gaskins Jeremy
Presentation Title: Data Augmentation Algorithms for Bayesian Analysis of Directional Data
Abstract: We develop new data augmentation algorithms for Bayesian analysis of the directional data using the von Mises-Fisher distribution in arbitrary dimensions. The approach leads to a new class of distributions, called the Modi ed Polya-Gamma distribution, which we construct in detail. The proposed data augmentation strategies circumvent the need for analytic approximations to integration, numerical integration, or Metropolis-Hastings for the corresponding posterior inference. Simulations and real data examples are presented to demonstrate the applicability and to apprise the performance of the proposed procedures.

Suman Bhattacharya, TEVA PHARMACEUTICALS USA, INC.
Co-authors: Khare, Kshitij; Pal, Subhadip
Presentation Title: Geometric ergodicity of Gibbs samplers for the Horseshoe and its regularized variants
Abstract: The Horseshoe is a widely used and popular continuous shrinkage prior for high-dimensional Bayesian linear regression. Recently, regularized versions of the Horseshoe prior have also been introduced in the literature. Various Gibbs sampling Markov chains have been developed in the literature to generate approximate samples from the corresponding intractable posterior densities. Establishing geometric ergodicity of these Markov chains provides crucial technical justification for the accuracy of asymptotic standard errors for Markov chain based estimates of posterior quantities. In this paper, we establish geometric ergodicity for various Gibbs samplers corresponding to the Horseshoe prior and its regularized variants in the context of linear regression. First, we establish geometric ergodicity of a Gibbs sampler for the original Horseshoe posterior under strictly weaker conditions than existing analyses in the literature. Second, we consider the regularized Horseshoe prior introduced in, and prove geometric ergodicity for a Gibbs sampling Markov chain to sample from the corresponding posterior without any truncation constraint on the global and local shrinkage parameters. Finally, we consider a variant of this regularized Horseshoe prior introduced in, and again establish geometric ergodicity for a Gibbs sampling Markov chain to sample from the corresponding posterior.

Break, 3:20-3:30 pm

Session 5, 3:30-4:30 pm, Room A

Industrial Panel

Session 6, 4:35-5:45 pm

Room A

Organizer/Chair: Virginia Filiaci, Roswell Park Cancer Institute

Speakers:

Mark Brady, Roswell Park Cancer Institute
Co-authors:
Presentation Title: Design and Statistical Power for Prospective-Retrospective Predictive Biomarker Study with a Dichotomous Biomarker and a Time-to-Event Endpoint
Abstract: Modern clinical trials that evaluate targeted therapies collect biological specimens from each study subject to evaluate study objectives involving predictive biomarkers that have been integrated into the study design. The goal of these integrated biomarkers is to identify patients who are more (or less) likely to respond to the targeted treatment. Aliquots of the biologic specimens are often banked for evaluating future biomarkers that become available after the clinical trial has been completed. When a new biomarker becomes available after the clinical trial has been completed it is then necessary to design and develop the study which will assess the new biomarker. Even though the original clinical trial was conducted prospectively, the biomarker study is considered a retrospective study because each subject’s treatment group and outcome (time to death, progression, or an adverse event) have already been determined. The patient’s biomarker status is the unknown variable. Whereas the statistical designs of prospective studies typically assume the distribution of the unknown event times follow a continuous parametric function, in retrospective studies the event times are known and an estimate of the true distribution of these times can be determined nonparametrically.

To avoid confusing these biomarker studies with non-experimental observational studies some authors have suggested that when these biomarker studies arise from randomized clinical trials and are conducted under rigorous conditions, these study designs can be classified as prospective-retrospective (P-R). This presentation proposes a closed-form formula for calculating the statistical power for P-R studies which incorporates the nonparametric estimate of the survival functions. The calculated power is then compared to simulated results. Other important, but often neglected statistical considerations for the design of this type of study are also presented.

Mei-Yin Polley, The University of Chicago
Co-authors: Dai, Buyue
Presentation Title: A Group Sequential Design for Biomarker Signature Development and Validation
Abstract: Biomarkers, singly or combined into a multivariable signature, are increasingly used for disease diagnosis, individual risk prognostication, and treatment selection. Rigorous and efficient statistical approaches are needed to appropriately develop, evaluate, and validate biomarkers that will be used to inform clinical decision-making. In this talk, I will present a two-stage prognostic biomarker signature design with a futility analysis that affords the possibility to stop the biomarker study early if the prognostic signature is not sufficiently promising at an early stage, thereby allowing investigators to preserve the remaining specimens for future research. This design integrates important elements necessary to meet statistical rigor and practical demands for developing and validating prognostic biomarker signatures. Our simulation studies demonstrated desirable operating characteristics of the method in that when the biomarker signature has weak discriminant potential, the proposed design has high probabilities of terminating the study early. This work provides a practical tool for designing efficient and rigorous biomarker studies and represents one of the few papers in the literature that formalize statistical procedures for early stopping in prognostic biomarker studies.

Han Yu, Roswell Park Comprehensive Cancer Center
Co-authors: Hutson, Alan
Presentation Title: A Robust Spearman Correlation Coefficient Permutation Test
Abstract: In this work, we show that Spearman's correlation coefficient test about H0:ρs=0 found in most statistical software packages is theoretically incorrect and performs poorly when bivariate normality assumptions are not met or the sample size is small. The historical works about these tests make an unverifiable assumption that the approximate bivariate normality of original data justifies using classic approximations. In general, there is common misconception that the tests about ρs=0 are robust to deviations from bivariate normality. In fact, we found under certain scenarios violation of the bivariate normality assumption has severe effects on type I error control for the most commonly utilized tests. To address this issue, we developed a robust permutation test for testing the general hypothesis H0:ρs=0. The proposed test is based on an appropriately studentized statistic. We will show that the test is theoretically asymptotically valid in the general setting when two paired variables are uncorrelated but dependent. This desired property was demonstrated across a range of distributional assumptions and sample sizes in simulation studies, where the proposed test exhibits robust type I error control across a variety of settings, even when the sample size is small. We demonstrated the application of this test in real world examples of transcriptomic data of the TCGA breast cancer patients and a data set of PSA levels and age.

Kristopher Attwood, Roswell Park Comprehensive Cancer Center
Co-authors: Hutson, Alan
Presentation Title: A Generalized ROC Curve in 3-Dimensional Space
Abstract: In practice, there exist many disease processes with multiple states; for example, in Alzheimer’s disease a patient can be classified as healthy, mild cognitive impairment, or full disease. Identifying a patient’s disease state is important in selecting the appropriate intervention and its effectiveness. Therefore, it is important to develop and evaluate a biomarker’s ability to discriminate between multiple disease states. The current literature focuses on extensions of standard 2-state ROC methods to multi-state settings, such as the ROC surface and corresponding volume under the surface for the ordinal 3-state setting. However, the extension of these methodologies have some documented limitations. In this paper we propose, for the ordinal 3-state setting, a 3-dimensional ROC line (ROC3) with corresponding measures of global performance and cut-point selection. We demonstrate the simple interpretation of the model and how it can be extended to the general multi-state setting. A numerical study is provided to compare the existing methods with our proposed ROC3 model, which demonstrates some gains in efficiency and bias. These methods are then further contrasted using real data from a cohort study of Glycan biomarkers for early detection of hepatocellular carcinoma.

Room B

Organizer/Chair: Ernest Fokoue, RIT

Speakers:

Ernest Fokoue, Rochester Institute of Technology
Co-authors:
Presentation Title: On the Ubiquity of the Bayesian Paradigm is Statistical Machine Learning and Data Science
Abstract: This talk explores the myriad of ways in which the Bayesian paradigm permeates the entire landscape of Statistical Machine Learning and Data Science. Despite some of the major challenges underlying its practical use, the Bayesian paradigm has proven to be ubiquitous, often appearing directly and indirectly in virtually every single aspect of statistical machine learning and data science, and artificial intelligence. This presentation highlights some of the emerging ways in which the Bayesian paradigm is playing an impactful role in the Data Science Revolution.

Xingchen Yu, Climate Corporation
Co-authors:
Presentation Title: Spatial voting models in circular spaces
Abstract: The use of spatial models for inferring members’ preferences from voting data has become widespread in the study of deliberative bodies, such as legislatures. Most established spatial voting models assume that ideal points belong to a Euclidean policy space. However, the geometry of Euclidean spaces (even multidimensional ones) cannot fully accommodate situations in which members at the opposite ends of the ideological spectrum reveal similar preferences by voting together against the rest of the legislature. This kind of voting behavior can arise, for example, when extreme conservatives oppose a measure because they see it as being too costly, while extreme liberals oppose it for not going far enough for them. This paper introduces a new class of spatial voting models in which preferences live in a circular policy space. Such geometry for the latent space is motivated by both theoretical (the so-called “horseshoe theory” of political thinking) and empirical (goodness of fit) considerations. Furthermore, the circular model is flexible and can approximate the one-dimensional version of the Euclidean voting model when the data supports it. We apply our circular model to roll-call voting data from the U.S. Congress between 1988 and 2019 and demonstrate that, starting with the 112th House of Representatives, circular policy spaces consistently provide a better explanation of legislators’s behavior than Euclidean ones and that legislators’s rankings, generated through the use of the circular geometry, tend to be more consistent with those implied by their stated policy positions.

Zhiyuan Li, Rochester Institute of Technology
Co-authors:
Presentation Title: Auto-encoding Variational Bayes for Continual Learning and Disentangling Self-Organizing Representations
Abstract: Variational auto-encoder (VAE), a framework that can efficiently approximate intractable posterior, has achieved great success in learning representation from a stationary data environment. However, limited progress has been made in expanding VAE to learn representation from streaming data environments, where data arrive sequentially with changing distributions. Main challenges of continual representation learning lies in reusing, expanding, and continually disentangling learned semantic factors across data environments. We argue that this is because existing approaches treat continually-arrived data independently, without considering how they are related based on the underlying semantic factors. We address this by a new generative model describing a topologically-connected mixture of spike-and-slab distributions in the latent space, learned end-to-end in a continual fashion via principled variational inference. The learned mixture is able to automatically discover the active semantic factors underlying each data environment and to accumulate their relational structure based on that. This distilled knowledge of different data environments can further be used for generative replay and guiding continual disentangling of new semantic factors.

Room C

Organizer: Marianthi Markatou

Chair: Gregory Wilding

Speakers:

Alexander Foss, Sandia National Laboratories
Co-authors:
Presentation Title: Dynamic Model Updating for Streaming Classification and Clustering
Abstract: A common challenge in the cybersecurity realm is the proper handling of high-volume streaming data. Typically in this setting, analysts are restricted to techniques with computationally cheap model-fitting and prediction algorithms. In many situations, however, it would be beneficial to use more sophisticated techniques. In this talk, a general framework is proposed that adapts a broad family of statistical and machine learning techniques to the streaming setting. The techniques of interest are those that can generate computationally cheap predictions, but which require iterative model-fitting procedures. This broad family of techniques includes various clustering, classification, regression, and dimension reduction algorithms. We discuss applied and theoretical issues that arise when using these techniques for streaming data whose distribution is evolving over time.

Yuxin Ding, Eli Lilly and Company
Co-authors: Jonathan I. Silverberg¹ , Mark Boguniewicz² , Jill Waibel³ , Jamie Weisman⁴ , Lindsay Strowd⁵ , Luna Sun⁶ , Yuxin Ding⁶ , Orin Goldblum⁷, Eric L. Simpson⁸ 1 Department of Dermatology, George Washington University School of Medicine, Washington DC, USA
2 Division of Allergy-Immunology, Department of Pediatrics, National Jewish Health and University of Colorado School of Medicine, Denver, CO, USA
3 Miami Dermatology and Laser Institute, Miami, FL, USA
4 Medical Dermatology Specialists, Atlanta, GA, USA
5 Department of Dermatology, Wake Forest University School of Medicine, Winston-Salem, NC, USA
6 Eli Lilly and Company, Indianapolis, IN, USA
7 Formerly with Eli Lilly and Company, Indianapolis, IN, USA
8 Department of Dermatology, Oregon Health and Science University, Portland, OR, USA Presentation Title: Clinical tailoring of baricitinib 2-mg in atopic dermatitis: baseline body surface area and rapid onset of action identifies response at Week 16
Acknowledgement to Fabio P. Nunes.
Abstract: Baricitinib, an oral Janus kinase (JAK)1/JAK2 inhibitor, improved moderate-to-severe atopic dermatitis (AD) in 5 Phase 3 clinical trials. Understanding which patients are likely to benefit most from treatment with baricitinib 2-mg would significantly improve patient experience. This post-hoc analysis proposed a clinical tailoring approach based on baseline body surface area affected (BSA) and early clinical improvement from the Phase 3 monotherapy trial BREEZE-AD5 (NCT03435081).

Classification and regression trees were applied to baseline patient demographics and disease characteristics to identify a patient population most likely to benefit from baricitinib 2-mg. Two-by-two contingency tables evaluated the association between speed of onset on improvement in skin inflammation or itch (assessed at Week 4 or Week 8) and response at Week 16 for the proportion of patients achieving ≥75% improvement in Eczema Area and Severity Index (EASI75), validated Investigator Global Assessment for AD (vIGA-AD™) score of 0 or 1, or ≥4-point improvement in Itch (Itch≥4). Response rates were summarized over time for identified subgroups with non-responder imputation for missing data.

At Week 16, EASI75 and vIGA-AD (0,1) were achieved by 37.5% and 31.7% of baricitinib 2-mg-treated patients with baseline BSA 10-50% compared to 9.5% and 4.8% of patients with BSA>50%. Early response in skin inflammation or itch at Week 4 or 8 was associated with corresponding EASI75, vIGA-AD (0,1), and Itch≥4 of 55.4% or 66.7%, 48.2% or 56.1%, and 39.3% or 42.1% at Week 16, respectively.

This analysis suggests that patients with BSA 10-50% account for the majority of responders to baricitinib 2-mg. In addition, clinical assessment of patients after 4 or 8 weeks of baricitinib 2-mg treatment showed a meaningful clinical benefit to patients, providing positive feedback to patients who are likely to benefit from long-term therapy and allowing for rapid decision on treatment discontinuation in those who are not.

Elisavet Sofikitou, University at Buffalo
Co-authors: Markatou, Marianthi
Presentation Title: Control Charts: Methods, Computation and Application to HCV Data
Abstract: Biomedical datasets contain health-related information and are comprised of variables measured in both interval/ratio and categorical scale. The analysis of such data is challenging, due to the difference in measurement scale and volume of available data. We introduce a methodology that leverages the basic idea of clustering to the statistical process control framework to monitor data obtained over time. The methodology provides alerts when issues arise. The major contribution and novelty of our work is that it suggests four new monitoring techniques for mixed-type data. This is a valuable addition to the relevant literature which has not been studied satisfactorily yet. The existing techniques for analyzing and monitoring mixed-type data are very limited and there is no associated software. We construct several algorithms for the implementation of the suggested control charts, and we create four test statistics that also represent the plotting statistics. We provide algorithmic procedures for the evaluation of their control limits and compute the false alarm rate and the average run length. Moreover, we developed the associated software in the R language to facilitate usage of the proposed methods. The advantages of our schemes are a) computational ease of implementation, b) ability to harness multivariate mixed-type data, c) applicability in high dimensions and semiparametric nature of methods, d) robustness and e) fast algorithmic convergence. We illustrate the proposed methods using a real-world medical data set that contains information about Egyptian patients who underwent treatment for Hepatitis C virus (HCV). The Fibrosis-4 (FIB-4) score estimates the amount of scaring in the liver. Patients with FIB-4 = 3.25 represent those with early of mild-to-moderate fibrosis, while patients with FIB-4 > 3.25 have advanced/severe liver problems (fibrosis or cirrhosis). Based on the FIB-4 index, all four new charts are capable of quickly distinguishing patients with early or mild-to-moderate fibrosis from those with advanced fibrosis or cirrhosis and alert the patients when their condition deteriorates.

Wednesday, May 4

Room A: 1220 JSMBS
Room B: 2220A JSMBS
Room C: 2220B JSMBS

Session 1, 9:00-10:10 am, Room A

Specially Invited Session (Professor Sarah Muldoon)
- Chair: Tanzy Love, University of Rochester

Break, 10:10-10:30 am

Session 2, 10:35-11:55 am

Room A

Organizer/Chair: Saptarshi Chakraborty, University at Buffalo

Speakers:

Jyotishka Datta, Virginia Tech
Co-authors: Sagar, Ksheera; Banerjee, Sayantan; Bhadra, Anindya
Presentation Title: Precision Matrix Estimation under the Horseshoe-like Prior-Penalty Dual
Abstract: Precision matrix estimation in a multivariate Gaussian model is fundamental to network estimation. Although there exist both Bayesian and frequentist approaches to this, it is difficult to obtain good Bayesian and frequentist properties under the same prior–penalty dual. To bridge this gap, our contribution is a novel prior–penalty dual that closely approximates the graphical horseshoe prior and penalty, and performs well in both Bayesian and frequentist senses. A chief difficulty with the horseshoe prior is a lack of closed form expression of the density function, which we overcome in this article. In terms of theory, we establish posterior convergence rate of the precision matrix that matches the oracle rate, in addition to the frequentist consistency of the MAP estimator. In addition, our results also provide theoretical justifications for previously developed approaches that have been unexplored so far, e.g. for the graphical horseshoe prior. Computationally efficient EM and MCMC algorithms are developed respectively for the penalized likelihood and fully Bayesian estimation problems. In numerical experiments, the horseshoe-based approaches echo their superior theoretical properties by comprehensively outperforming the competing methods. A protein–protein interaction network estimation in B-cell lymphoma is considered to validate the proposed methodology.

Ray Bai, University of South Carolina
Co-authors:
Presentation Title: Spike-and-slab group lassos for grouped regression and sparse generalized additive models
Abstract: We introduce the spike-and-slab group lasso (SSGL) for Bayesian estimation and variable selection in linear regression with grouped variables. We further extend the SSGL to sparse generalized additive models (GAMs), thereby introducing the first nonparametric variant of the spike-and-slab lasso methodology. The model simultaneously performs group selection and estimation. Meanwhile, our fully Bayes treatment of the mixture proportion allows for model complexity control and automatic self-adaptivity to different levels of sparsity. We develop theory to uniquely characterize the global posterior mode under the SSGL and introduce a highly efficient block coordinate ascent algorithm for maximum a posteriori (MAP) estimation. We further employ de-biasing methods to provide uncertainty quantification of our estimates. Thus, implementation of our model avoids the use of Markov chain Monte Carlo (MCMC) in high dimensions. We derive posterior concentration rates for both grouped linear regression and sparse GAMs when the number of covariates grows at nearly exponential rate with sample size. Finally, we illustrate our methodology through extensive simulations and data analysis.

This is joint work with Gemma Moran, Joseph Antonelli, Yong Chen, and Mary Boland.

Yeonhee Park, University of Wisconsin-Madison
Co-authors: Liu, Suyu; Thall, Peter; Yuan, Ying
Presentation Title: Bayesian group sequential enrichment designs based on adaptive regression of response and survival time on baseline biomarkers
Abstract: Precision medicine relies on the idea that only a subpopulation of patients are sensitive to a targeted agent and thus may benefit from it. In practice, based on pre-clinical data, it often is assumed that the sensitive subpopulation is known and the agent is substantively efficacious in that subpopulation. Subsequent patient data, however, often show that one or both of these assumptions are false. This paper provides a Bayesian randomized group sequential enrichment design to compare an experimental treatment to a control based on survival time. Early response is used as an ancillary outcome to assist with adaptive variable selection, enrichment, and futility stopping. The design starts by enrolling patients under broad eligibility criteria. At each interim decision, submodels for regression of response and survival time on a possibly high dimensional covariate vector and treatment are fit, variable selection is used to identify a covariate subvector that characterizes treatment-sensitive patients and determines a personalized benefit index, and comparative superiority and futility decisions are made. Enrollment of each cohort is restricted to the most recent adaptively identified treatment-sensitive patients. Group sequential decision cutoffs are calibrated to control overall type I error and account for the adaptive enrollment restriction. The design provides an empirical basis for precision medicine by identifying a treatment-sensitive subpopulation, if it exists, and determining whether the experimental treatment is substantively superior to the control in that subpopulation. A simulation study shows that the proposed design accurately identifies a sensitive subpopulation if it exists, yields much higher power than a conventional group sequential design, and is robust.

Room B

Organizer/Chair: Zi-Jia Gong, RIT

Speakers:

Zichen Ma, Clemson University
Co-authors: Fokoue, Ernest
Presentation Title: Bayesian Variable Selection for Linear Regression with the k-G Priors
Abstract: In this presentation, we propose a method that balances between variable selection and variable shrinkage in linear regression. A diagonal matrix G is injected to the covariance matrix of prior distribution of the regression coefficient vector, with each of the diagonal elements, bounded between 0 and 1, serving as a stabilizer of the corresponding regression coefficient. Mathematically, a stabilizer with value close to 0 indicates that the regression coefficient is nonzero, and hence the corresponding variable should be selected, whereas the value of the stabilizer close to 1 indicates otherwise. We prove this property under orthogonality. Computationally, the proposed method is easy to fit using automated programs such as JAGS. We provide three examples to verify the capability of this methodology in variable selection and shrinkage.

Ernest Fokoue, Rochester Institute of Technology
Co-authors:
Presentation Title: On the Emerging Platonic View of Statistical Learning Theory
Abstract: Learning Using Statistical Invariants (LUSI) is a relatively recent incarnation in the world of statistical learning theory paradigms. In their effort to propose what they hope to be a complete statistical theory of learning, Vapnik and Izmailov (2019) develop the LUSI framework, partly using their early tool known as the V-matrix but crucially borrowing heavily on Plato's philosophical teachings on ideas and things (forms) to extend the classical statistical learning theory from its purely empirical nature (known seen as brute force learning) to a learning theory based on predicates that minimize the true error. This talk will review the merits and the promises of LUSI and explore the ways in which Plato's philosophical teachings contain the potential of helping usher in a new era in Statistical Learning Theory.

Zi-Jia Gong, Rochester Institute of Technology
Co-authors:
Presentation Title: A Complete Statistical Theory of Learning based on Predicates: Computational Implementation and Demonstrations with Python and scikit-learn
Abstract: LUSI (Learning Using Statistical Invariants) is a new machine-learning paradigm proposed by Vapnik and Izmailov. In LUSI, a classification function is searched in the reproducing kernel Hilbert space (RKHS) by minimizing the loss function, while a set of ‘predicates’, functionals on the training data that incorporate the specific knowledge of the machine-learning problem, are preserved invariant. In this project, we implemented several versions of LUSI algorithms in Python in order to evaluate their performance in classification. We use the MNIST and CIFAR10 datasets to fit various LUSI models and compare the test accuracy of each model. Different predicates are designed and their impact on the classification performance is investigated. The LUSI code package is designed to be compatible with scikit-learn, and is open-source on GitHub.

Room C

Chair: Jeffrey Miecznikowski

Speakers:

Elle Schultz, Niagara University
Co-authors: Renko, Janelle; Mason, Susan
Presentation Title: Perspectives on Peer Mentoring in Psychology Statistics Courses
Abstract: This presentation is based on a literature review and personal experience. It outlines the benefits of upper-class students serving as peer mentors for a psychology statistics course. The two main presenters are two students who successfully completed the course and now serve as student assistants, mentoring and tutoring current students. Benefits to students in the course include greater involvement in their education, better awareness of their strengths and weaknesses, and having an upper-level student who can be an advocate, a leadership role model, and a trusted friend. The peer mentors also benefit from the experience in several ways. Peer mentors find satisfaction in helping other students; they enjoy working with others; and they value the opportunity to review the material, which can help them in other undergraduate courses, in graduate school, and in their careers.

Elizabeth Reid, Marist College
Co-authors:
Presentation Title: Statistically Significant
Abstract: Far too often students question why they need to know statistics. Because of this, it has been challenging to motivate students, and the pandemic has not helped. During the height of the pandemic, I taught several statistics classes online. Through this experience I learned more effective ways to make statistics important and meaningful to students. In this talk, we will discuss projects that were designed to engage students and ultimately teach them why statistics is significant.

Lunch, 11:55 am-1:00 pm, Room A

Student Sessions Abstracts 1:00-2:30 pm

Time: 1-2:30 pm

Room: B

Chair: Changxing Ma, University at Buffalo

Speakers:

Anran Liu, University at Buffalo

Co-authors: Markatou, Marianthi; Liu, Anran

Presentation Title: Statistical Distances in Goodness-of-Fit

Abstract: One of the conventional approaches to the problem of model selection is to view it as a hypothesis testing problem. When the hypothesis testing framework for model selection is adopted, one usually thinks about likely alternatives to the model, or alternatives that seem to be most dangerous to the inference, such as “heavy tails”. In this context, goodness of fit problems consist of a fundamental component of model selection viewed via the lens of hypothesis testing. Statistical distances or divergences have a long history in the scientific literature, where they are used for a variety of purposes, including that of testing for goodness of fit. We develop a goodness of fit test that is locally quadratic. Our proposed test statistic for testing a simple null hypothesis is based on measures of statistical distance. The asymptotic distribution of the statistic is obtained and a test of normality is presented as an example of the derived distributional results. Our simulation study shows the test statistic is powerful and able to detect alternatives close to the null hypothesis.

Shuyi Liang, University at Buffalo

Co-authors: Ma, Changxing

Presentation Title: Homogeneity Test of Prevalence and Sample Size Calculation Under Dallal’s Model

Abstract: The homogeneity test of prevalences among multiple groups is of general interest under paired Bernoulli settings. Dallal (1988) proposed a model by parameterizing the probability of an occurrence at one site given an occurrence at the other site and derived the maximum likelihood-ratio test. In this paper, we propose two alternative test statistics and evaluate their performances regarding the type I error controls and powers. Our simulation results show that the score test is the most robust. An algorithm for sample size calculation is developed based on the score test. Data from ophthalmologic studies are used to illustrate our proposed test procedures.

Xinwei Huang, University at Buffalo

Co-authors: Emura, Takeshi

Presentation Title: Likelihood-based inference for copula-based Markov chain models and its applications

Abstract: Copula modeling for serial dependence has been extensively discussed in the literature. However, model diagnostic methods in copula-based Markov chain models are rarely discussed in the literature. Also, copula-based Markov modeling for serially dependent survival data is challenging due to the complex censoring mechanisms. We propose the likelihood-based model fitting methods under copula-based Markov chain models on three types of data structures, continuous, discrete and, survival data. For continuous and discrete data, we propose model diagnostic procedures, including a goodness-of-fit test and a likelihood-based model selection method. For survival data, we propose a novel copula-based Markov chain model for modeling serial dependence in recurrent event times. We also use a copula for modeling dependent censoring. Due to the complex likelihood function with the two copulas, we adopt a two-stage estimation method for fitting the survival data, whose asymptotic variance is derived by the theory of estimating functions. We propose a jackknife method for interval estimates, which is shown to be consistent for the asymptotic variance. We develop user-friendly R functions for simulating the data and fitting the models for continuous, discrete, and survival data. We conduct simulation studies to see the performance of all the proposed methods. For illustration, we analyze five datasets (chemical data, financial data, baseball data, stock market data, and survival data).

Abhishek Pughazhendhi, University at Buffalo, The State University of New York

Co-authors: Muruganandam, Sri Balaji; Karuppayammal Chinnasamy, Akshayah

Presentation Title: Analyzing & Modelling Social Media Data To Predict An Individual’s Popularity Index

Abstract: The proposed model/application could potentially be a social media platform that indexes people based on how popular they are. In simple terms, it answers the question of how popular a person really is or lets an individual keep track of the count of people that he/she actually knows on a global scale. The application would collect, scrap, and process data that are predominantly from social media platforms of selected individuals and remove redundant records. To expand, one person could be following another individual on multiple platforms (assume 3) but this doesn’t mean the “popularity index” of the given individual is 3. The model/application would also be built with the capability to add new records over time to give users control over their popularity index or ranking inside the application. As the “users” get to know more people or vice versa, they consent to update their records which each others’ credentials (which could potentially be a QR code) their “popularity index” or ranking increments. This would gamify the entire user experience. Eventually, as the dataset multiplies, it can be analyzed and visualized to show what really makes individual popular, time-growth visualization, etc.,

Venkata Sai Rohit Ayyagari, University at Buffalo

Co-authors:

Presentation Title: Disease and Treatment Prediction using Principles of Machine Learning and Artificial Intelligence

Abstract: The concepts of Artificial Intelligence and Machine Learning have had a tremendous impact on various industries such as IT, research, retail and e-commerce, marketing and business analytics. A key domain where artificial intelligence and machine learning may be applied with a surfeit of benefits is the health care and medicine industry. Good health of people plays an important role in contributing to the economic growth of a country. The health care and medicine industry generates enormous amounts of health care records on a daily basis. Such a large volume of patient data can be utilized in a more effective and efficient manner in the diagnosis and treatment of patients. The proposed system aims at utilizing this vast patient data and providing accurate and efficient disease and treatment prediction using the concepts and principles of artificial intelligence and machine learning. The system aims at using datasets for disease and symptoms and corresponding treatments and applying machine learning algorithms to obtain efficient and accurate disease-treatment prediction based on the patient input. Such a system would ultimately simplify numerous processes in the health care industry and also speed-up diagnosis of life threatening diseases.

William Consagra, University of Rochester

Co-authors: Zhang, Zhengwu; Venkataraman, Arun

Presentation Title: Optimized Diffusion Imaging for Brain Structural Connectome Analysis

Abstract: High angular resolution diffusion imaging (HARDI) is a type of diffusion magnetic resonance imaging (dMRI) that measures diffusion signals on a sphere in q-space. It has been widely used in data acquisition for human brain structural connectome analysis. For accurate structural connectome estimation, dense samples in q-space are often acquired, resulting in long scanning times and logistical challenges. To overcome these issues, we develop a statistical framework that incorporates relevant dMRI data from prior large-scale imaging studies in order to improve the efficiency of human brain structural connectome estimation under sparse sampling. Our approach leverages the historical dMRI data to calculate a prior distribution characterizing local diffusion variability in each voxel in a template space. The priors are used to parameterize a sparse sample estimator and corresponding approximate optimal design algorithm to select the most informative q-space samples. Through both simulation studies and real data analysis using Human Connectome Project data, we demonstrate significant advantages of our method over existing HARDI sampling and estimation frameworks.

Time: 1-2:30 pm

Room: A

Chair: Ernest Fokoue, RIT

Co-chairs: Marianthi Markatou, University at Buffalo
Ernest Fokoue, RIT

Speakers:

Haiyang Sheng, University at Buffalo

Co-authors:

Presentation Title: TNN: a transfer learning classifier based on weighted nearest neighbors

Abstract: Weighted nearest neighbors (WNN) classifiers are popular non-parametric classifiers. Despite the significant progress in WNN, most existing WNN classifiers are designed for classic supervised learning problems where both training samples and test samples are assumed to be independent and identically distributed. However, in many real applications, it could be difficult or expensive to obtain training samples from the distribution of interest. Therefore, data collected from some related distributions are often used as supplementary training data for the classification task under the distribution of interest. It is essential to develop effective classification methods that could incorporate both training samples from the distribution of interest (if they exist) and the supplementary training samples from a different but related distribution. To address this challenge, we propose a novel Transfer learning weighted Nearest Neighbors (TNN) classifier. As a WNN classifier, TNN determines the weights on the class labels of training samples for different test samples adaptively by minimizing the worst-case upper bound on the conditional expectation of the estimation error of the regression function. It puts decreasing weights on the class labels of the successively more distant neighbors. To accommodate the difference between training samples from the distribution of interest and the supplementary training samples, TNN adds a non-negative offset distance to the training samples not from the distribution of interest, which tends to downgrade them to some extent. Our theoretical studies show that, under certain conditions, TNN is consistent and minimax optimal (up to a logarithmic term) in the covariate shift setting. Under the posterior drift or the more general setting where both covariate shift and posterior drift exist, the excess risk of TNN depends on the maximum discrepancy between the distribution of the supplementary training samples and the distribution of interest. We also demonstrate the finite sample performance of TNN via extensive simulation studies and an application to the land use/land cover mapping problem in geography.

Tianmou Liu, University at Buffalo

Co-authors: Blair, Rachael Hageman

Presentation Title: Out-of-bag stability estimation for k-means clustering

Abstract: Clustering data is a challenging problem in unsupervised learning where there is no gold standard. The selection of a clustering method, measures of dissimilarity, parameters and the determination of the number of reliable groupings, are often viewed as subjective processes. Stability has become a valuable surrogate to performance and robustness that can guide an investigator in the selection of a clustering and as a means to prioritize clusters. In this work, we develop a framework for stability measurements that are based on resampling and out-of-bag estimation. Bootstrapping methods for cluster stability can be prone to overfitting in a setting that is analogous to poor delineation of test and training sets in supervised learning. This work develops out-of-bag stability, which overcomes this issue, is observed to be consistently lower than traditional measures and is uniquely not conditional on a reference clustering. Furthermore, out-of-bag stability estimates can be estimated at different levels: item level, cluster level and as an overall summary, which has good interpretive value for the investigator. This framework is extended to develop stability estimates for determining the number of clusters (model selection) through contrasts between stability estimates on clustered data, and stability estimates of clustered reference data with no signal. These contrasts form stability profiles that can be used to identify the largest differences in stability and do not require a direct threshold on stability values, which tend to be data specific. These approaches can be implemented using the R package bootcluster that is available on the Comprehensive R Archive Network (CRAN).

Sara Venkatraman, Cornell University

Co-authors:

Presentation Title: Sparse reconstruction of dynamical systems with inference

Abstract: In many scientific disciplines, time-evolving phenomena are frequently modeled by nonlinear ordinary differential equations (ODEs). We present an approach to learning ODEs with rigorous statistical inference from time series data. Our methodology builds on a popular technique for this task in which the ODEs to be estimated are assumed to be sparse linear combinations of several candidate functions, such as polynomials. In addition to producing point estimates of the nonzero terms in the estimated equations, we propose leveraging recent advances in high-dimensional inference to quantify the uncertainty in the estimate of each term. We use both frequentist and Bayesian versions of regularized regression to estimate ODE systems as sparse combinations of terms that are statistically significant or have high posterior probabilities, respectively. We demonstrate through simulations that this approach allows us to recover the correct terms in the dynamics more often than existing methods that do not account for uncertainty.

Xiajun Jiang, Rochester Institute of Technology

Co-authors:

Presentation Title: Few-shot Generation of Personalized Neural Surrogates for Cardiac Simulation via Byesian Meta-Learning

Abstract: Clinical adoption of personalized virtual heart simulations faces challenges in model personalization and expensive computation. While an ideal solution is an efficient neural surrogate that at the same time is personalized to an individual subject, the state-of-the-art is either concerned with personalizing an expensive simulation model, or learning an efficient yet generic surrogate. This paper presents a completely new concept to achieve personalized neural surrogates in a single coherent framework of meta-learning (metaPNS). Instead of learning a single neural surrogate, we learn the process of learning a personalized neural surrogate using a small number of context data from a subject, in a novel formulation of few-shot generative modeling underpinned by: 1) a set-conditioned neural surrogate for cardiac simulation that, conditioned on subject-specific context data, learns to generate query simulations not included in the context set, and 2) a meta-model of amortized variational inference that learns to condition the neural surrogate via simple feed-forward embedding of context data. As test time, metaPNS delivers a personalized neural surrogate by fast feed-forward embedding of a small and flexible number of data available from an individual, achieving – for the first time – personalization and surrogate construction for expensive simulations in one end-to-end learning framework. Synthetic and real-data experiments demonstrated that metaPNS was able to improve personalization and predictive accuracy in comparison to conventionally-optimized cardiac simulation models, at a fraction of computation.

Tiange Shi, University at Buffalo

Co-authors: Yu, Han; Hageman Blair, Rachael

Presentation Title: Integrated regulatory and metabolic networks to prioritize therapeutic targets in the tumor microenvironment

Abstract: Recent advances in single-cell sequencing technologies have accelerated discoveries provided insights into the heterogenous tumor microenvironment. Despite this progress, the translation to clinical endpoints and drug discovery has not kept pace. Mathematical models of cellular metabolism and regulatory networks have emerged as powerful tools in systems biology that have progressed methodologically in parallel. Although cellular metabolism and regulatory networks are intricately linked, differences in their mathematical representations has made integration challenging. This work presents a framework for integration of Bayesian Network representations of regulatory networks into constraint-based metabolism model. Fully integrated models of this type can be used to perform computational experiments to predict the effects of perturbations to the signaling pathway on the downstream metabolism. This framework was applied single-cell sequencing data to develop cell-specific computational models of glioblastoma. Models were used to predict the pharmaceutical effects of 177 curated drugs published in drug repurposing hub library, and their pairwise combinations, on metabolism in the tumor microenvironment. The integrated model is used to predict the effects of pharmaceutical interventions on the system, providing insights on therapeutic targets prioritization, formulation of combination therapies and future drug discovery. Results show that predicted drug combinations inhibiting STAT3 (e.g. Niclosamide) with other transcription factors (e.g. AR inhibitor Enzalutamide) will strongly suppress anaerobic metabolism in malignant cells, without major interference to other cell types metabolism, suggesting a potential combination therapy for anticancer treatment. This framework of model integration is generalizable to other applications, such as different cell-types, organisms and diseases.

Alejandro Nieto Ramos, Rochester Institute of Technology

Co-authors: Cherry, Elizabeth; Krapu, Christopher; Fenton, Flavio

Presentation Title: Employing Gaussian process priors for studying spatial variation in parameter space for a cardiac action potential model

Abstract: Cardiac cells exhibit variability in the shape and duration of their action potentials in space within a single individual. To create a mathematical model of cardiac action potentials (AP) which captures this spatial variability and also allows for rigorous uncertainty quantification regarding within-tissue spatial correlation structure, we developed a novel hierarchical probabilistic model making use of a latent Gaussian process prior on the parameters of a simplified cardiac AP model which is used to map forcing behavior to observed voltage signals. This model allows for prediction of cardiac electrophysiological dynamics at new points in space and also allows for reconstruction of surface electrical dynamics with a relatively small number of spatial observation points. Furthermore, we make use of Markov chain Monte Carlo methods via the Stan modeling framework for parameter estimation. We employ a synthetic data case study oriented around the reconstruction of a sparsely-observed spatial parameter surface to highlight how this approach can be used for spatial or spatiotemporal analyses of cardiac electrophysiology.

Jiefei Wang, University at Buffalo

Co-authors: Jeffrey C. Miecznikowski; Jiefei Wang

Presentation Title: Multiple Testing for Exploratory Research

Abstract: Multiple testing methods to control the number of false discoveries play a central role in data analyses involving multiple hypothesis tests. Common examples include clinical and omics datasets. The traditional methods to control type I errors such as the Bonferroni adjustment for family-wise error control and the Benjamini-Hochberg procedure for false discovery rate control yield rejections based on the observed data. However, these methods generally do not allow the researcher to incorporate information obtained after viewing the data which may be considered wasteful as the information contained in the collection of p-values cannot be used. In this seminar, we present a simple but flexible method to give the upper bound of the false discovery proportion for a rejection set. The bound holds simultaneously for any possible rejection sets, which in turn gives the user possibility to explore any reasonable rejections even after observing the data. We demonstrate our method using the clinical data as well as a genome study to show its generality.

Virtual Student Poster Session, 1:00-2:00 pm, Location TBD

Time: 2:30-3:15 pm

Room: A

Chair: Changxing Ma, University at Buffalo

Matthew Jehrio (Competing), University at Buffalo
Co-authors: Adithya Narayanan, Rachael Hageman Blair
Title: Controlling the Spread of Disease With Network-Based Models of Influence
Abstract: The ability of public health professionals and decision makers to minimize the impact of disease spread, depends on their ability to quickly and efficiently deploy limited public health resources. This work examines the network-based prioritization algorithm PRINCE, and its ability to stem the spread of disease through simulated outbreaks over a network constructed with real world social network interaction data collected from a small town in England. This work examines a range of mitigation strategies that aims to isolate those infected along select susceptible individuals. Isolation of the susceptible includes different scenarios such as: those identified through primary and secondary contact tracing, those that are susceptible with the potential to be super-spreaders as identified through PRINCE, and a mixture thereof. Results from these simulations indicate that prophylactically isolating nodes based on PRINCE influence scores can significantly minimize transmission and overall disease burden above and beyond even a random isolation strategy in all of the scenarios tested, including tests with relatively high initial rates of infection. Additionally, there was a particularly strong effect in cases where resources were focused on isolating susceptible individuals. This lays the groundwork for a scalable and adaptable tool to maximize the public health response in the face of emerging epidemics.

Sam Weisenthal (Competing), University of Rochester
Co-authors: Thurston, Sally; Ertefaie, Ashkan
Title: Relative Sparsity
Abstract: Methods developed in dynamic treatment regimes and reinforcement learning can be used to estimate a policy, or a mapping from covariates to decisions, which can then instruct decision makers. There is great interest in using such data-driven policies to help health care providers and their patients make optimal decisions. In health care, however, if one is advocating for the adoption of a new policy, it is often important to explain to the provider and patient how this new policy differs from the current standard of care, or the behavioral policy. More generally, identifying the covariates that figure prominently in the shift from behavior to optimality might be of independent clinical or scientific interest. These ends are facilitated if one can pinpoint the parameters that change most when moving from the behavioral policy to the optimal policy. To do so, we adapt ideas from policy search, specifically trust region policy optimization, but, unlike current methods of this type, we focus on interpretability and statistical inference. In particular, we consider a class of policies parameterized by a finite-dimensional vector and jointly maximize value while employing an adaptive L1 norm penalty on divergence from the behavioral policy. This yields adaptive “relative sparsity,” where, as a function of a tuning parameter, we can approximately control the number of parameters in our suggested policy that are allowed to differ from their behavioral counterparts. We develop our method for the off-policy, observational data setting. We perform extensive simulations, prove asymptotic normality for an adaptive Lasso formulation of our objective, and show prelim any analyses of an observational health care dataset. This work is a step toward helping us better explain, in the context of the current standard of care, the policies that have been estimated using techniques from dynamic treatment regimes and reinforcement learning, which promotes the safe adoption of data-driven decision tools in high-stakes settings.

Nan Nan (Competing), University at Buffalo
Co-authors:
Title: New accuracy metric for biomarker evaluation in multiple classification when sub-classes are involved
Abstract: The development of biomarkers into diagnostic and prognostic tests can be categorized into three broad phases: discovery, performance evaluation, and impact determination when added to existing clinical measures. For the performance evaluation phase, the importance of proper evaluation metrics could not be emphasized enough. Researchers have proposed a variety of metrics for assessments of biomarkers under binary and multiple-class classification, especially in ROC framework.

This research project focuses on the classification setting which involves multiple main classes and at least one main class involves several subclasses. We are concerned about the problem of evaluating the accuracy of a biomarker measured on a continuous scale correctly identifying main classes without requiring specification of an ordering for marker values for subclasses within each main class. Such settings are very common in practice. For example, subjects enrolled in an Alzheimer’s disease (AD) study can be separated into five cohorts: CN (cognitive normal), SMC (significant memory concerns), EMCI (early mild cognitive impairment), LMCI (late mild cognitive impairment) and AD (Alzheimer’s dementia), where SMC, EMCI, and LMCI usually are grouped as “early stage” of AD. Biomarkers are evaluated for their accuracy of distinguishing among CN, “early stage”, and AD. Traditionally, data from SMC, EMCI, LMCI are pooled together and consequently this problem becomes a simple 3-class classification. Such practice of “naive pooling” is common in biomarker evaluation and accuracy metrics estimated using pooled data are reported routinely in scientific journals. The consequences of “naïve pooling” in biomarker evaluation and the inappropriateness of accuracy measures based on “naïve pooling” in multiple classification has not been investigated thoroughly.

This research project aims to addresses this common pitfall caused by “naïve pooling” in biomarker evaluation by proposing a new measure for the setting under consideration. The proposed new metric is a bona fide measure for evaluating the performance of biomarkers in distinguishing main classes when subclasses are involved. One of the advantages of the new metric over the existing measures based on “naive pooling” is that the new metric does not depend on the relative frequencies among subclasses, and hence it is a universal measure which allows for comparisons between biomarker evaluation studies with different sample sizes. Furthermore, it can accommodate general cases with arbitrary number of main classes K ≥ 2 and arbitrary numbers of subclasses involved in each main class. Parametric and non-parametric inference methods for estimating confidence intervals of the proposed measure are investigated. Finally, a subset from ADNI (Alzheimer’s Disease Neuroimaging Initiative) is analyzed.

Yihao Tan (Competing), University at Buffalo
Co-authors: Marianthi Markatou
Title: The Performance of Clustering Algorithms under Measurement Error
Abstract: Measurement error is often referred to “noise” in the data science literature. It occurs when the measured value of a variable differs from its true value. Measurement error can be either occurring by chance, without a specific pattern, or systematically. Measurement error manifests itself differently in the case of interval/ratio scale data than data measured on a categorical (nominal or ordinal) scale. Few studies have investigated the relationship between measurement error and clustering. We study the impact of measurement error on clustering algorithms. Our aim is to understand its impact on the number and quality of clusters. The quality of clusters is measured via ARI, Silhouette index, Dunn Index and Calinski-Harabasz index. Further, we provide alternative conceptualizations of data with measurement error and study the performance of algorithms based on deconvolution of measurement error data in clustering. Finally, we illustrate the methods on a Fibroscan data set.

Soyun Park (Competing), University at Buffalo
Co-authors:
Title: A Novel Network Architecture Combining Central-Peripheral Deviation with Image-Based Convolutional Neural Networks for Diffusion Tensor Imaging Studies
Abstract: Brain imaging research is a very challenging topic due to complex structure and lack of explicitly identifiable features in the image. With the advancement of magnetic resonance imaging (MRI) technologies, such as diffusion tensor imaging (DTI), developing classification methods to improve clinical diagnosis is crucial. This paper proposes a classification method for DTI data based on a novel neural network strategy that combines a convolutional neural network (CNN) with a multilayer neural network using central-peripheral deviation (CPD), which reflects diffusion dynamics in the white matter by spatially evaluating the deviation of diffusion coefficients between the inner and outer parts of the brain. In our method, a multilayer perceptron (MLP) using CPD is combined with the final layers for classification after reducing the dimensions of images in the convolutional layers of the neural network architecture. In terms of training loss and the classification error, the proposed classification method improves the existing image classification with CNN. For real data analysis, we demonstrate how to process raw DTI image data sets obtained from a traumatic brain injury study (MagNeTS) and a brain atlas construction study (ICBM), and apply the proposed approach to the data, successfully improving classification performance with two age groups.

Tiana Hose (Competing), Rochester Institute of Technology

Co-authors: Tedeschi, Mason; Mehlman, Emily; Franklin, Scott; Wong, Tony E.

Presentation Title: Measuring the downstream impact of Learning Assistants with Markov chains

Abstract: Evaluations of localized academic interventions often focus on the course performance, primarily attrition (DFW rate). We use a regularly updating Markov chain model to analyze the downstream impact of Learning Assistants (LAs), undergraduates that receive pedagogical instruction in order to help faculty implement research-based pedagogical strategies that focus on small-group interactions. LA programs have been shown to improve success in individual courses but, for a variety of reasons, little research has connected the program with downstream success. In this study, we compare yearly retention and graduation rates of 3500+ students that took courses supported by LAs with a matched sample that took the same courses that were not supported by LAs (but were often supported by untrained undergraduate Teaching Assistants). Our results show that exposure to LA support in courses designated as “high-DFW” is associated with an 11% increase in both first year retention and six-year graduation rates, compared with students that took the same course that was not LA-supported. This is larger than the reduction in DFW rate, implying that LA support not only results in more students passing a class, but better prepares all students for the rest of their academic career.

Betsy McCall (Non-Competing), University at Buffalo

Co-authors: Nowicki, Sophie Crooks; Csatho, Beata; Pitman, E. Bruce

Presentation Title: Insight into ice sheet dynamics from statistical models applied to surface elevation observations

Abstract: Satellite and airborne observations of the surface elevations of the Greenland Ice Sheet have been collected in recent decades to better understand the impact that climate change is having on the cryosphere. After processing, these observations produce approximately 100,000 irregular time series of the behavior of the ice. Separating out known seasonal variation leaves data about the dynamic changes in the ice over these decades. We examine these time series and explore several modeling approaches to this data such as polynomial regression, LOESS models, spline regression models and Gaussian process regression models for interpolating the data. We compare the flexibility of the models to capture local features in dynamic changes, and the ability of each type of model to capture sudden changes in behavior. Finally, we consider the ability of each type of model to accurately quantify uncertainty associated with the interpolated results, and the impact that uncertainty quantification will have on applications for the use of these interpolations in other applications, such as in ice sheet model validation.

Lan Zhang (Non-Competing), University of Rochester

Co-authors: Zhang Lan; McDavid Andrew

Presentation Title: Making out-of-sample predictions from unsupervised clustering of single cell RNASeq

Abstract: Background: A common goal in single cell RNA sequencing is to categorize subtypes of cells (observations) using unsupervised clustering on thousands of gene expression features. Each input cell is assigned a discrete label, interpreted as a cellular subpopulation. However, it has been challenging to characterize the robustness of the clustering, because most of the steps do not directly provide out-of-sample predictions.

Methods: We introduce extensions to the steps in a common clustering workflow (i.e feature selection of highly variable genes, dimension reduction using principal component analysis, Louvain community detection) that allow out-of-sample prediction. These are implemented as wrappers around the R packages SingleCellExperiment and scran. The data is partitioned into a training set, where the workflow parameters are learned, and a test set where parameters are fixed and predictions are made. We compare the clustering of a set of observations in training vs test using the Adjusted Rand Index (ARI), which is a measure of the similarity between two data clusterings that ranges from 0 and 1.

Result: We illustrate the approach using cells from the mouse brain originally published in Zeisel et al. 2015. We compare the impact on clustering concordance when splitting the cells into test/train subsets either a) uniformly at random or b) stratified by biological replicates (mice). Although we found agreement of clustering (approx. 0.80 ARI), the number of identified subpopulations was less stable. The ARI was further reduced (approx. 0.68) when our held out-data consisted of independent biological replicates.

Conclusion: Typical clustering workflows contain steps that only implicitly learn various parameters. Formalizing the estimation of these implicit parameters allows quantification of the sensitivity of the clustering to changes in the input data, and can interrogate the generalizability of cell population discoveries made using single cell RNA-seq data.

Junyu Nie (Non-Competing), University at Buffalo

Co-author: Jihnhee Yu

Presentation Title: An inferential approach for coefficient omega in application to complex survey data

Abstract: A class of coefficient omega indicates popular statistics to estimate the internal consistency reliability or general factor saturation of various psychological, sociological questionnaire instruments and health surveys, and has been recommended to be used in place of the Cronbach alpha. Coefficient omega has a few definitions but is generally explained by the factor models with one or multiple latent factors. While many surveys include various research instruments, the inference of the general class of coefficient omega has not been well addressed, particularly in the context of complex survey data analysis. In this article, we discuss a generally applicable scheme for a relevant inference of the class of coefficient omega based on the influence function approach in application to complex survey data, which allows incorporating unequal selection probabilities. Through the Monte Carlo study, we show adequate coverage rates for the confidence intervals of coefficient omega based on scenarios of stratified multi-stage cluster sampling. Using the data from the Medical Expenditure Panel Survey (MEPS), we provide the confidence intervals for the two types of coefficient omega (i.e., omega-hierarchical and omega-total) to assess the Short Form-12 version 2 (SF-12v2) which is a widely-used health survey instrument for assessing quality of life, and we evaluate reliabilities of the instrument by different demographics.

Monday, May 2

Session 1, 9:00-10:30 am, Room A

Break, 10:30-10:50 am

Session 2, 10:50 am-12:00 noon

Room A

Invited Session: Some Recent Developments in Dynamic Treatment Regimes

Lunch, 12:00-1:00 pm, Room A

Session 3, 1:00-2:10 pm

Room A

Invited Session: Recent Advances in Large-Scale Time Series

Session 4, 2:15-3:50 pm

Room A

Invited Session: Statistical Decision Mechanisms and their Application

Room B

Invited Session: Integrative Statistical Methods for High Dimensional-omics Data

Room C

Contributed Session 2: Bayesian Modeling of Complex Datasets

Break, 3:50-4:00 pm

Session 5, 4:00-5:15 pm

Room A

Invited Session: Statistical Learning Methods for Large and Complex Datasets

Room B

Invited Session: Statistical Machine Learning in High Dimensional Data

Room C

Contributed Session 1: Modern Statistical Methods for Complex Medical Datasets

Tuesday, May 3

Session 1, 9:00-10:00 am, Room A

Session 2, 10:05-11:40 am, Room A

Lunch, 11:40 am-1:00 pm, Room A

Session 3, 1:00-2:00 pm, Room A

Session 4, 2:05-3:15 pm

Room A

Invited Session: Recent Application and Methodology Development on Causal Inference and Level Set Estimation

Room B

Invited Session: Analysis of Coronavirus Pathogen-Host Interactions via Sequence/Structure/Dynamics-Based Methods

Room C

Invited Session: Recent Advances in Computational Bayesian Methods II

Break, 3:20-3:30 pm

Session 5, 3:30-4:30 pm, Room A

Session 6, 4:35-5:45 pm

Room A

Invited Session: Biomarker Trial Design and Analysis Methods

Room B

Invited Session: On the Impact of the Bayesian Paradigm in Machine Learning and Knowledge Discovery

Room C

Invited Session: Alumni Invited Session

Wednesday, May 4

Session 1, 9:00-10:10 am, Room A

Break, 10:10-10:30 am

Session 2, 10:35-11:55 am

Room A

Invited Session: Recent Advances in Computational Bayesian Methods I

Room B

Invited Session: New Frontiers in Statistical Learning Theory

Room C

Contributed Session 3: Training the Next Generation of Statisticians: Some Notes on Effective Teaching and Mentoring

Lunch, 11:55 am-1:00 pm, Room A

Student Sessions Abstracts 1:00-2:30 pm

Student Paper Session 1 (Non-competing)

Student Paper Session 2 (Competing)

Virtual Student Poster Session, 1:00-2:00 pm, Location TBD

Controlling the Spread of Disease With Network-Based Models of Influence

Relative Sparsity

New accuracy metric for biomarker evaluation in multiple classification when sub-classes are involved

The Performance of Clustering Algorithms under Measurement Error

A Novel Network Architecture Combining Central-Peripheral Deviation with Image-Based Convolutional Neural Networks for Diffusion Tensor Imaging Studies

Measuring the downstream impact of Learning Assistants with Markov chains

Insight into ice sheet dynamics from statistical models applied to surface elevation observations

Making out-of-sample predictions from unsupervised clustering of single cell RNASeq

An inferential approach for coefficient omega in application to complex survey data

Break, 3:15-3:30 pm

Student Award Ceremony, 3:30-4:00 pm

Research

Visit and Explore

Academics

Departments

Learning Resources

Resources

Faculty and Staff

Support Us