Qiao, Xingye

qiao@math.binghamton.edu

Binghamton University

Set-valued classification, a new classification paradigm that aims to identify all the plausible classes that an observation belongs to, can be obtained by learning the acceptance regions for all classes. Many existing set-valued classification methods do not consider the possibility that a new class that never appeared in the training data appears in the test data. Moreover, they are computationally expensive when the number of classes is large. We propose a Generalized Prediction Set (GPS) approach to estimate the acceptance regions while considering the possibility of a new class in the test data. The proposed classifier minimizes the expected size of the prediction set while guaranteeing that the class-specific accuracy is at least a pre-specified value. Unlike previous methods, the proposed method achieves a good balance between accuracy, efficiency, and anomaly detection rate. Moreover, our method can be applied in parallel to all the classes to alleviate the computational burden. Both theoretical analysis and numerical experiments are conducted to illustrate the effectiveness of the proposed method.

Yu, Han

han.yu@roswellpark.org

Roswell Park Comprehensive Cancer Center

In this work, we show that the test of H0:ρs=0 for Spearman's correlation coefficient found in most statistical software packages is theoretically incorrect and performs poorly when bivariate normality assumptions are not met or the sample size is small. The historical works on these tests make an unverifiable assumption that the approximate bivariate normality of the original data justifies using classic approximations. In general, there is a common misconception that tests of ρs=0 are robust to deviations from bivariate normality. In fact, we found that, under certain scenarios, violation of the bivariate normality assumption has severe effects on type I error control for the most commonly utilized tests. To address this issue, we developed a robust permutation test for the general hypothesis H0:ρs=0. The proposed test is based on an appropriately studentized statistic. We show that the test is asymptotically valid in the general setting where two paired variables are uncorrelated but dependent. This desired property was demonstrated across a range of distributional assumptions and sample sizes in simulation studies, where the proposed test exhibits robust type I error control across a variety of settings, even when the sample size is small. We demonstrate the application of this test in real-world examples: transcriptomic data from TCGA breast cancer patients and a data set of PSA levels and age.
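For orientation, the sketch below shows the mechanics of a plain (unstudentized) permutation test for Spearman's correlation. It is not the studentized test proposed in the abstract, whose statistic is not specified here; the naive version is exactly the kind of test that may fail when the variables are uncorrelated but dependent. The function names, the number of permutations, and the seed are illustrative assumptions.

```python
import random

def ranks(values):
    """Return 1-based ranks, assigning the mid-rank to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mid-rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided naive permutation p-value for H0: rho_s = 0 (assumed setup)."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the pairing between x and y
        if abs(spearman(x, y_perm)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Calling `permutation_pvalue(x, y)` on paired samples returns a Monte Carlo p-value; a studentized variant would replace `spearman` inside the loop with a statistic divided by an estimate of its standard error.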

Park, Yeonhee

ypark56@wisc.edu

University of Wisconsin-Madison

Precision medicine relies on the idea that only a subpopulation of patients are sensitive to a targeted agent and thus may benefit from it. In practice, based on pre-clinical data, it often is assumed that the sensitive subpopulation is known and the agent is substantively efficacious in that subpopulation. Subsequent patient data, however, often show that one or both of these assumptions are false. This paper provides a Bayesian randomized group sequential enrichment design to compare an experimental treatment to a control based on survival time. Early response is used as an ancillary outcome to assist with adaptive variable selection, enrichment, and futility stopping. The design starts by enrolling patients under broad eligibility criteria. At each interim decision, submodels for regression of response and survival time on a possibly high dimensional covariate vector and treatment are fit, variable selection is used to identify a covariate subvector that characterizes treatment-sensitive patients and determines a personalized benefit index, and comparative superiority and futility decisions are made. Enrollment of each cohort is restricted to the most recent adaptively identified treatment-sensitive patients. Group sequential decision cutoffs are calibrated to control overall type I error and account for the adaptive enrollment restriction. The design provides an empirical basis for precision medicine by identifying a treatment-sensitive subpopulation, if it exists, and determining whether the experimental treatment is substantively superior to the control in that subpopulation. A simulation study shows that the proposed design accurately identifies a sensitive subpopulation if it exists, yields much higher power than a conventional group sequential design, and is robust.

Liu, Tianmou

tianmoul@buffalo.edu

University at Buffalo

Clustering data is a challenging problem in unsupervised learning where there is no gold standard.

The selection of a clustering method, measures of dissimilarity, parameters and the determination of the number of reliable groupings, are often viewed as subjective processes. Stability has become a valuable surrogate to performance and robustness that can guide an investigator in the selection of a clustering and as a means to prioritize clusters. In this work, we develop a framework for stability measurements that are based on resampling and out-of-bag estimation. Bootstrapping methods for cluster stability can be prone to overfitting in a setting that is analogous to poor delineation of test and training sets in supervised learning. This work develops out-of-bag stability, which overcomes this issue, is observed to be consistently lower than traditional measures and is uniquely not conditional on a reference clustering. Furthermore, out-of-bag stability estimates can be estimated at different levels: item level, cluster level and as an overall summary, which has good interpretive value for the investigator. This framework is extended to develop stability estimates for determining the number of clusters (model selection) through contrasts between stability estimates on clustered data, and stability estimates of clustered reference data with no signal. These contrasts form stability profiles that can be used to identify the largest differences in stability and do not require a direct threshold on stability values, which tend to be data specific. These approaches can be implemented using the R package bootcluster that is available on the Comprehensive R Archive Network (CRAN).

Yu, Guan

guanyu@buffalo.edu

University at Buffalo

In the modern predictive modeling process, budget constraints have become a very important consideration due to the high cost of collecting data using new techniques such as brain imaging and DNA sequencing. This motivates us to develop new and efficient high-dimensional cost-constrained predictive modeling methods. In this paper, to address this challenge, we first study a new non-convex high-dimensional cost-constrained linear regression problem; that is, we aim to find the cost-constrained regression model with the smallest expected prediction error among all models satisfying a budget constraint. The non-convex budget constraint makes this problem NP-hard. In order to estimate the regression coefficient vector of the cost-constrained regression model, we propose a new discrete extension of recent first-order continuous optimization methods. In particular, our method delivers a sequence of estimates of the regression coefficient vector by solving a sequence of 0-1 knapsack problems, which can be solved efficiently by many existing algorithms such as dynamic programming. Next, we show some extensions of our proposed method for statistical learning problems using loss functions with Lipschitz continuous gradient. It can also be extended to problems with groups of variables or multiple constraints. Theoretically, we prove that the sequence of estimates generated by our iterative algorithm converges to a first-order stationary point, which can be a globally optimal solution to the non-convex high-dimensional cost-constrained regression problem. Computationally, our numerical studies show that the proposed method can solve problems of fairly high dimensions and has promising estimation, prediction, and model selection performance.
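As a rough illustration of the 0-1 knapsack subproblem mentioned above, the following sketch solves one instance by dynamic programming. It assumes integer costs and a per-variable "value" score; how the actual algorithm derives those scores at each iteration is not described in the abstract, so `values`, `costs`, and `budget` here are hypothetical inputs.

```python
def knapsack_01(values, costs, budget):
    """Pick a subset of items maximizing total value with total cost <= budget.

    Assumes non-negative integer costs and an integer budget.
    Returns (best_value, sorted list of chosen item indices).
    """
    n = len(values)
    # best[i][b]: maximum value achievable using the first i items with budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, c = values[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip item i-1
            if c <= b and best[i - 1][b - c] + v > best[i][b]:
                best[i][b] = best[i - 1][b - c] + v  # option 2: take item i-1
    # Backtrack through the table to recover which items were taken.
    chosen = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    chosen.reverse()
    return best[n][budget], chosen
```

In a cost-constrained regression setting, each item would correspond to a candidate predictor, its cost to the price of measuring it, and the selected indices to the support of the next coefficient estimate; that mapping is an assumption made for illustration.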

Markatou, Marianthi

anranliu@buffalo.edu

Department of Biostatistics, University at Buffalo

One of the conventional approaches to the problem of model selection is to view it as a hypothesis testing problem. When the hypothesis testing framework for model selection is adopted, one usually thinks about likely alternatives to the model, or alternatives that seem to be most dangerous to the inference, such as “heavy tails”. In this context, goodness of fit problems constitute a fundamental component of model selection viewed through the lens of hypothesis testing.

Statistical distances or divergences have a long history in the scientific literature, where they are used for a variety of purposes, including that of testing for goodness of fit. We develop a goodness of fit test that is locally quadratic. Our proposed test statistic for testing a simple null hypothesis is based on measures of statistical distance. The asymptotic distribution of the statistic is obtained and a test of normality is presented as an example of the derived distributional results. Our simulation study shows the test statistic is powerful and able to detect alternatives close to the null hypothesis.

Yaakov Malinovsky

yaakovm@umbc.edu

University of Maryland, Baltimore County

Group testing has its origins in the identification of syphilis in the U.S. Army during World War II. The aim of the method is to test groups of people instead of single individuals in such a way that infected individuals are detected while the testing costs are reduced. In the last few years of the COVID-19 pandemic, the mostly-forgotten practice of group testing has been raised again in many countries as an efficient method for addressing the epidemic while facing restrictions of time and resources.

Consider a finite population of N items, where each item i has probability p of being defective, independently of the other items (generalized group testing allows a different p_i for each item). A group test is a binary test on an arbitrary group of items with two possible outcomes: all items are good, or at least one item is defective. The goal is to classify all items through group testing with the minimum expected number of tests. The optimal procedure, with respect to the expected total number of tests, is unknown even in the case where all p_i are equal. In this talk, I shall review established results in the group testing literature and present new results characterizing the optimality of group testing procedures. In addition, I will discuss some open problems and conjectures.
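To make the cost saving concrete, the sketch below evaluates the classical two-stage (Dorfman) procedure: test a pool of k items, and retest each item individually only if the pool is positive. This is a well-known baseline, not the optimal procedure and not one of the new results of the talk. It assumes a common defect probability p, independent items, and error-free tests; the function names are illustrative.

```python
def dorfman_tests_per_item(p, k):
    """Expected number of tests per item under two-stage pooling with group size k.

    With k == 1 every item is simply tested once. For k >= 2 there is one pooled
    test per k items, plus k individual retests whenever the pool is positive,
    which happens with probability 1 - (1 - p) ** k.
    """
    if k == 1:
        return 1.0
    return 1.0 / k + 1.0 - (1.0 - p) ** k

def best_group_size(p, max_k=100):
    """Group size in 1..max_k minimizing the expected tests per item."""
    return min(range(1, max_k + 1), key=lambda k: dorfman_tests_per_item(p, k))
```

For a low prevalence such as p = 0.01, `best_group_size(0.01)` selects a pool size for which the expected number of tests per item falls well below one, which is the efficiency gain that motivates group testing.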

Albert, Paul

albertp@mail.nih.gov

National Cancer Institute

Understanding the relationships between biomarkers of exposure and disease incidence is an important problem in environmental epidemiology. Typically, a large number of these exposures are measured, and it is found either that a few exposures transmit risk or that each exposure transmits a small amount of risk, but, taken together, these may pose a substantial disease risk. Importantly, these effects can be highly non-linear and can be in different directions. We develop a latent functional approach, which assumes that the individual joint effects of each biomarker exposure can be characterized as one of a series of unobserved functions, where the number of latent functions is less than or equal to the number of exposures. We propose Bayesian methodology to fit models with a large number of exposures. An efficient Markov chain Monte Carlo sampling algorithm is developed for carrying out Bayesian inference. The deviance information criterion is used to choose an appropriate number of nonlinear latent functions. We demonstrate the good properties of the approach using simulation studies. Further, we show that complex exposure relationships can be represented with only a few latent functional curves. The proposed methodology is illustrated with an analysis of the effect of cumulative pesticide exposure on cancer risk in a large cohort of farmers.

Liang, Shuyi

sliang34@buffalo.edu

University at Buffalo

The homogeneity test of prevalences among multiple groups is of general interest under paired Bernoulli settings. Dallal (1988) proposed a model by parameterizing the probability of an occurrence at one site given an occurrence at the other site and derived the maximum likelihood-ratio test. In this paper, we propose two alternative test statistics and evaluate their performances regarding the type I error controls and powers. Our simulation results show that the score test is the most robust. An algorithm for sample size calculation is developed based on the score test. Data from ophthalmologic studies are used to illustrate our proposed test procedures.

Artman, William

william_artman@urmc.rochester.edu

University of Rochester

Sequential, multiple assignment, randomized trials (SMART) are a clinical trial design which allows for the comparison of sequences of treatment decision rules tailored to the individual patient, i.e., dynamic treatment regime (DTR). The standard approach to analyzing a SMART is intention-to-treat (ITT) which may lead to biased estimates of DTR outcomes in the presence of partial compliance. A major causal inference challenge is that adjusting for observed compliance directly leads to the post-treatment adjustment bias. Principal stratification is a powerful tool which stratifies patients according to compliance classes allowing for a causal interpretation of the effect of compliance on DTR outcomes. Importantly, differential compliance behavior may lead to different optimal DTRs. We extend existing methods from the single-stage setting to the SMART setting by developing a principal stratification framework that leverages a flexible Bayesian non-parametric model for the compliance distribution and a parametric marginal structural model for the outcome. We conduct simulation studies to validate our method.

Brady, Mark F

Mark.Brady@RoswellPark.org

Roswell Park Cancer Institute

Modern clinical trials that evaluate targeted therapies collect biological specimens from each study subject to evaluate study objectives involving predictive biomarkers that have been integrated into the study design. The goal of these integrated biomarkers is to identify patients who are more (or less) likely to respond to the targeted treatment. Aliquots of the biologic specimens are often banked for evaluating future biomarkers that become available after the clinical trial has been completed. When a new biomarker becomes available after the clinical trial has been completed it is then necessary to design and develop the study which will assess the new biomarker. Even though the original clinical trial was conducted prospectively, the biomarker study is considered a retrospective study because each subject’s treatment group and outcome (time to death, progression, or an adverse event) have already been determined. The patient’s biomarker status is the unknown variable. Whereas the statistical designs of prospective studies typically assume the distribution of the unknown event times follow a continuous parametric function, in retrospective studies the event times are known and an estimate of the true distribution of these times can be determined nonparametrically.

To avoid confusing these biomarker studies with non-experimental observational studies some authors have suggested that when these biomarker studies arise from randomized clinical trials and are conducted under rigorous conditions, these study designs can be classified as prospective-retrospective (P-R). This presentation proposes a closed-form formula for calculating the statistical power for P-R studies which incorporates the nonparametric estimate of the survival functions. The calculated power is then compared to simulated results. Other important, but often neglected statistical considerations for the design of this type of study are also presented.

Huang, Xinwei

xinweihu@buffalo.edu

University at Buffalo

Copula modeling for serial dependence has been extensively discussed in the literature. However, model diagnostic methods for copula-based Markov chain models are rarely discussed. Also, copula-based Markov modeling for serially dependent survival data is challenging due to the complex censoring mechanisms. We propose likelihood-based model fitting methods under copula-based Markov chain models for three types of data structures: continuous, discrete, and survival data. For continuous and discrete data, we propose model diagnostic procedures, including a goodness-of-fit test and a likelihood-based model selection method. For survival data, we propose a novel copula-based Markov chain model for modeling serial dependence in recurrent event times. We also use a copula for modeling dependent censoring. Due to the complex likelihood function with the two copulas, we adopt a two-stage estimation method for fitting the survival data, whose asymptotic variance is derived by the theory of estimating functions. We propose a jackknife method for interval estimates, which is shown to be consistent for the asymptotic variance. We develop user-friendly R functions for simulating the data and fitting the models for continuous, discrete, and survival data. We conduct simulation studies to assess the performance of all the proposed methods. For illustration, we analyze five datasets (chemical data, financial data, baseball data, stock market data, and survival data).

Sofikitou, Elisavet

esofikit@buffalo.edu

University at Buffalo

Biomedical datasets contain health-related information and comprise variables measured on both interval/ratio and categorical scales. The analysis of such data is challenging due to the differences in measurement scale and the volume of available data. We introduce a methodology that brings the basic idea of clustering into the statistical process control framework to monitor data obtained over time. The methodology provides alerts when issues arise.

The major contribution and novelty of our work is that it suggests four new monitoring techniques for mixed-type data. This is a valuable addition to the relevant literature which has not been studied satisfactorily yet. The existing techniques for analyzing and monitoring mixed-type data are very limited and there is no associated software. We construct several algorithms for the implementation of the suggested control charts, and we create four test statistics that also represent the plotting statistics. We provide algorithmic procedures for the evaluation of their control limits and compute the false alarm rate and the average run length. Moreover, we developed the associated software in the R language to facilitate usage of the proposed methods. The advantages of our schemes are a) computational ease of implementation, b) ability to harness multivariate mixed-type data, c) applicability in high dimensions and semiparametric nature of methods, d) robustness and e) fast algorithmic convergence.

We illustrate the proposed methods using a real-world medical data set that contains information about Egyptian patients who underwent treatment for Hepatitis C virus (HCV). The Fibrosis-4 (FIB-4) score estimates the amount of scarring in the liver. Patients with FIB-4 ≤ 3.25 represent those with early or mild-to-moderate fibrosis, while patients with FIB-4 > 3.25 have advanced/severe liver problems (fibrosis or cirrhosis). Based on the FIB-4 index, all four new charts are capable of quickly distinguishing patients with early or mild-to-moderate fibrosis from those with advanced fibrosis or cirrhosis, and of alerting patients when their condition deteriorates.

Shi, Tiange

tiangesh@buffalo.edu

University at Buffalo

Recent advances in single-cell sequencing technologies have accelerated discoveries and provided insights into the heterogeneous tumor microenvironment. Despite this progress, the translation to clinical endpoints and drug discovery has not kept pace. Mathematical models of cellular metabolism and regulatory networks have emerged as powerful tools in systems biology that have progressed methodologically in parallel. Although cellular metabolism and regulatory networks are intricately linked, differences in their mathematical representations have made integration challenging. This work presents a framework for integrating Bayesian Network representations of regulatory networks into constraint-based models of metabolism. Fully integrated models of this type can be used to perform computational experiments to predict the effects of perturbations to the signaling pathway on the downstream metabolism. This framework was applied to single-cell sequencing data to develop cell-specific computational models of glioblastoma. Models were used to predict the pharmaceutical effects of 177 curated drugs published in the Drug Repurposing Hub library, and their pairwise combinations, on metabolism in the tumor microenvironment. The integrated model is used to predict the effects of pharmaceutical interventions on the system, providing insights on therapeutic target prioritization, formulation of combination therapies, and future drug discovery. Results show that predicted drug combinations inhibiting STAT3 (e.g., Niclosamide) with other transcription factors (e.g., AR inhibitor Enzalutamide) will strongly suppress anaerobic metabolism in malignant cells without major interference with the metabolism of other cell types, suggesting a potential combination therapy for anticancer treatment. This framework of model integration is generalizable to other applications, such as different cell types, organisms, and diseases.

Foss, Alexander

alexanderhfoss@gmail.com

Sandia National Laboratories

A common challenge in the cybersecurity realm is the proper handling of high-volume streaming data. Typically in this setting, analysts are restricted to techniques with computationally cheap model-fitting and prediction algorithms. In many situations, however, it would be beneficial to use more sophisticated techniques. In this talk, a general framework is proposed that adapts a broad family of statistical and machine learning techniques to the streaming setting. The techniques of interest are those that can generate computationally cheap predictions, but which require iterative model-fitting procedures. This broad family of techniques includes various clustering, classification, regression, and dimension reduction algorithms. We discuss applied and theoretical issues that arise when using these techniques for streaming data whose distribution is evolving over time.

Vexler, Albert

avexler@buffalo.edu

UB, Biostatistics

The problem of characterizing a multivariate distribution of a random vector by examining univariate combinations of vector components is an essential issue of multivariate analysis. The likelihood principle plays a prominent role in developing powerful statistical inference tools. In this context, we raise the question: can the univariate likelihood function based on a random vector be used to uniquely reconstruct the vector distribution? In multivariate normal (MN) frameworks, this question links to a reverse of Cochran's theorem, which concerns the distribution of quadratic forms in normal variables. We characterize the MN distribution through univariate likelihood-type projections. The proposed principle is employed to illustrate simple techniques for testing the hypothesis “observed vectors are from a MN distribution” versus the alternative “the first data points are from a MN distribution, and then, starting from an unknown position, observations are non-MN distributed”. In this context, the proposed characterizations of MN distributions allow us to employ well-known mechanisms that use univariate observations. The displayed testing strategy can exhibit high and stable power characteristics when observed vectors satisfy the alternative hypothesis while their components are normally distributed random variables. In such cases, classical change point detections based on, e.g., Shapiro-Wilk, Henze-Zirkler, and Mardia type engines may break down completely.

Hageman Blair, Rachael

hageman@buffalo.edu

University at Buffalo

Mathematical models of biological networks can provide important predictions and insights into complex disease. Constraint-based models of cellular metabolism and probabilistic models of gene regulatory networks are two distinct areas that have progressed rapidly in parallel over the past decade. In principle, gene regulatory networks and metabolic networks underlie the same complex phenotypes and diseases. However, systematic integration of these two model systems remains a fundamental challenge. In this work, we address this challenge by fusing probabilistic models of gene regulatory networks into constraint-based models of metabolism. The novel approach utilizes probabilistic reasoning in Bayesian Network models of regulatory networks to serve as the “glue” that enables a natural interface between the two systems. Probabilistic reasoning is used to predict and quantify system-wide effects of perturbation to the regulatory network in the form of constraints for flux estimation. In this setting, both regulatory and metabolic networks inherently account for uncertainty. This framework demonstrates that predictive modeling of enzymatic activity can be facilitated using probabilistic reasoning, thereby extending the predictive capacity of the network. Integrated models are developed for the brain and used in applications for Alzheimer’s disease and glioblastoma to assess the role of potential drug targets on downstream metabolism. Applications highlight the ability of integrated models to prioritize drug targets and drug target combinations, and the importance of accounting for the complex structure of the regulatory network that inherently influences the metabolic model.

Sivasubramanian, Anitha

anitha.sivasubramanian@gmail.com

John Wiley and Sons

Alta is Wiley’s fully integrated, adaptive learning courseware. Alta is designed to optimize the way students study and learn while completing assignments. It does this using a mastery learning-based approach, where students continue to work and learn until they reach “mastery” of their assigned topics. If a student struggles on an assignment, alta recognizes their knowledge gap immediately and provides just-in-time learning supports — even when it requires reaching back to prerequisite concepts. As a result, students can better retain, recall and apply what they're learning in their course.

One implication of mastery learning is that different students will require different amounts of time for their homework. When instructors are assigning homework, it’s important for them to have a view of how much work they’re assigning. We recently revamped how we compute the assignment time and question estimates surfaced in Alta using prior data. The challenge is that prior data is sparse and evolving: each learning objective (LO) is different, and new LOs/domains will not have any historical data. So, we need to build a model that will start with smart defaults. As we receive more LO completion information, we must keep updating our best estimate based on new data. This presentation will take a closer look at how Wiley empowers instructors and students by providing a data-based metric representing an estimate of assignment length/time. Our model does this by estimating the range of the number of questions and the time needed for assignment completion at the learning objective level. We start with a prior belief about the estimate and use Bayesian updating to transform this prior belief into a posterior every time new data arrives. By using Bayesian updating we can scale smoothly from no data to little data to lots of data, explicitly quantify uncertainty, and guarantee transparency in the user/model behaviors.
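The updating idea can be sketched with a conjugate model. The abstract does not say which likelihood or prior Alta uses, so the Gamma-Poisson choice below is purely an assumption for illustration: the number of questions a student needs for a learning objective is treated as Poisson with an unknown rate, and the prior on that rate is Gamma(alpha, beta). All names are hypothetical.

```python
def update_gamma_poisson(alpha, beta, counts):
    """Return posterior (alpha, beta) after observing a list of question counts.

    Assumed model (not taken from the abstract):
    counts ~ Poisson(rate), rate ~ Gamma(alpha, beta) with beta as a rate parameter.
    Conjugacy gives posterior Gamma(alpha + sum(counts), beta + len(counts)).
    """
    return alpha + sum(counts), beta + len(counts)

def posterior_mean(alpha, beta):
    """Mean of a Gamma(alpha, beta) distribution with rate parameter beta."""
    return alpha / beta

# "Smart default" prior: about 10 questions expected, worth one pseudo-observation.
prior = (10.0, 1.0)

# With no data the estimate is just the prior mean.
no_data = update_gamma_poisson(*prior, [])

# As completions arrive, the estimate moves toward the observed average.
some_data = update_gamma_poisson(*prior, [12, 8, 15])
```

Because the posterior after one batch serves as the prior for the next, the same function handles the transition from no data to a little data to a lot of data without any special cases.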

Attwood, Kristopher

kristopher.attwood@roswellpark.org

Roswell Park Comprehensive Cancer Center

In practice, there exist many disease processes with multiple states; for example, in Alzheimer’s disease a patient can be classified as healthy, as having mild cognitive impairment, or as having full disease. Identifying a patient’s disease state is important for selecting the appropriate intervention and evaluating its effectiveness. Therefore, it is important to develop and evaluate a biomarker’s ability to discriminate between multiple disease states. The current literature focuses on extensions of standard 2-state ROC methods to multi-state settings, such as the ROC surface and corresponding volume under the surface for the ordinal 3-state setting. However, the extensions of these methodologies have some documented limitations. In this paper we propose, for the ordinal 3-state setting, a 3-dimensional ROC line (ROC3) with corresponding measures of global performance and cut-point selection. We demonstrate the simple interpretation of the model and how it can be extended to the general multi-state setting. A numerical study is provided to compare the existing methods with our proposed ROC3 model, which demonstrates some gains in efficiency and bias. These methods are then further contrasted using real data from a cohort study of glycan biomarkers for early detection of hepatocellular carcinoma.

Bhattacharya, Indrabati

indrabati_bhattacharya@urmc.rochester.edu

University of Rochester

Q-learning is a well-known reinforcement learning approach for estimation of optimal dynamic treatment regimes. Existing methods for estimation of dynamic treatment regimes are limited to intention-to-treat analyses, which estimate the effect of randomization to a particular treatment regime without considering the compliance behavior of patients. In this article, we propose a novel Bayesian nonparametric Q-learning approach based on stochastic decision rules for adjusting for partial compliance. We consider the popular potential compliance framework, where some potential compliances are latent and need to be imputed. For each stage, we fit a locally weighted Dirichlet process mixture model for the conditional distribution of potential outcomes given the compliance values and baseline covariates. The key challenge is learning the joint distribution of the potential compliances, which we do using a Dirichlet process mixture model. Our approach provides two sets of decision rules: (1) conditional decision rules given the potential compliance values; and (2) marginal decision rules where the potential compliances are marginalized. Extensive simulation studies show the effectiveness of our method compared to intention-to-treat analyses. We apply our method to the Adaptive Treatment for Alcohol and Cocaine Dependence Study (ENGAGE), where the goal is to construct optimal treatment regimes to engage patients in therapy.

Bai, Ray

rbai@mailbox.sc.edu

University of South Carolina

We introduce the spike-and-slab group lasso (SSGL) for Bayesian estimation and variable selection in linear regression with grouped variables. We further extend the SSGL to sparse generalized additive models (GAMs), thereby introducing the first nonparametric variant of the spike-and-slab lasso methodology. The model simultaneously performs group selection and estimation. Meanwhile, our fully Bayes treatment of the mixture proportion allows for model complexity control and automatic self-adaptivity to different levels of sparsity. We develop theory to uniquely characterize the global posterior mode under the SSGL and introduce a highly efficient block coordinate ascent algorithm for maximum a posteriori (MAP) estimation. We further employ de-biasing methods to provide uncertainty quantification of our estimates. Thus, implementation of our model avoids the use of Markov chain Monte Carlo (MCMC) in high dimensions. We derive posterior concentration rates for both grouped linear regression and sparse GAMs when the number of covariates grows at nearly exponential rate with sample size. Finally, we illustrate our methodology through extensive simulations and data analysis.

This is joint work with Gemma Moran, Joseph Antonelli, Yong Chen, and Mary Boland.

Schultz, Elle

sem@niagara.edu

Niagara University

This presentation is based on a literature review and personal experience. It outlines the benefits of upper-class students serving as peer mentors for a psychology statistics course. The two main presenters are two students who successfully completed the course and now serve as student assistants, mentoring and tutoring current students.

Benefits to students in the course include greater involvement in their education, better awareness of their strengths and weaknesses, and having an upper-level student who can be an advocate, a leadership role model, and a trusted friend.

The peer mentors also benefit from the experience in several ways. Peer mentors find satisfaction in helping other students; they enjoy working with others; and they value the opportunity to review the material, which can help them in other undergraduate courses, in graduate school, and in their careers.

Junyu Nie

junyunie@buffalo.edu

University at Buffalo

Coefficient omega refers to a class of popular statistics used to estimate the internal consistency reliability or general factor saturation of various psychological and sociological questionnaire instruments and health surveys, and it has been recommended as a replacement for Cronbach's alpha. Coefficient omega has a few definitions but is generally explained by factor models with one or multiple latent factors. While many surveys include various research instruments, inference for the general class of coefficient omega has not been well addressed, particularly in the context of complex survey data analysis. In this article, we discuss a generally applicable scheme for inference on the class of coefficient omega based on the influence function approach, applied to complex survey data, which allows incorporating unequal selection probabilities. Through a Monte Carlo study, we show adequate coverage rates for the confidence intervals of coefficient omega based on scenarios of stratified multi-stage cluster sampling. Using data from the Medical Expenditure Panel Survey (MEPS), we provide confidence intervals for the two types of coefficient omega (i.e., omega-hierarchical and omega-total) to assess the Short Form-12 version 2 (SF-12v2), a widely used health survey instrument for assessing quality of life, and we evaluate the reliability of the instrument across different demographics.

D'Andrea, Joy

jdandrea@mail.usf.edu

USF

In this talk we will briefly discuss and distinguish between the Ordinary Bayesian Analysis and the Empirical Bayesian Analysis. Ordinary Bayesian analysis is a statistical procedure which endeavors to estimate parameters of an underlying distribution based on the observed distribution. Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed.
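
The contrast between the two approaches can be made concrete with a small sketch. The example below is illustrative only and is not taken from the talk: it assumes binomial data with a Beta prior, uses a fixed Beta(1, 1) prior for the ordinary Bayesian analysis, and estimates the Beta prior from the observed proportions by a crude method of moments (ignoring binomial sampling noise) for the empirical Bayes analysis. The function names and the data are hypothetical.

```python
from statistics import mean, pvariance


def posterior_mean(successes, trials, a, b):
    """Posterior mean of a binomial rate under a Beta(a, b) prior."""
    return (a + successes) / (a + b + trials)


def estimate_beta_prior(proportions):
    """Method-of-moments Beta(a, b) fit to observed proportions.

    Simplifying assumption: the proportions are treated as direct draws
    from the prior, so binomial sampling noise is ignored.
    """
    m = mean(proportions)
    v = pvariance(proportions)
    if v <= 0:
        raise ValueError("proportions must not all be equal")
    common = m * (1 - m) / v - 1
    if common <= 0:
        raise ValueError("proportions are too dispersed for a Beta fit")
    return m * common, (1 - m) * common


# Hypothetical data: three groups with 10 trials each.
successes = [2, 5, 8]
trials = [10, 10, 10]

# Ordinary Bayes: the prior Beta(1, 1) is fixed before seeing any data.
fixed = [posterior_mean(s, n, 1.0, 1.0) for s, n in zip(successes, trials)]

# Empirical Bayes: the prior is estimated from the data themselves.
a_hat, b_hat = estimate_beta_prior([s / n for s, n in zip(successes, trials)])
eb = [posterior_mean(s, n, a_hat, b_hat) for s, n in zip(successes, trials)]
```

In both cases the group estimates are pulled toward the prior mean; the only difference is where the prior parameters come from.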

Datta, Jyotishka

jyotishka@vt.edu

Virginia Tech

Precision matrix estimation in a multivariate Gaussian model is fundamental to network estimation. Although there exist both Bayesian and frequentist approaches to this, it is difficult to obtain good Bayesian and frequentist properties under the same prior--penalty dual. To bridge this gap, our contribution is a novel prior--penalty dual that closely approximates the graphical horseshoe prior and penalty, and performs well in both Bayesian and frequentist senses. A chief difficulty with the horseshoe prior is a lack of closed form expression of the density function, which we overcome in this article. In terms of theory, we establish posterior convergence rate of the precision matrix that matches the oracle rate, in addition to the frequentist consistency of the MAP estimator. In addition, our results also provide theoretical justifications for previously developed approaches that have been unexplored so far, e.g. for the graphical horseshoe prior. Computationally efficient EM and MCMC algorithms are developed respectively for the penalized likelihood and fully Bayesian estimation problems. In numerical experiments, the horseshoe-based approaches echo their superior theoretical properties by comprehensively outperforming the competing methods. A protein--protein interaction network estimation in B-cell lymphoma is considered to validate the proposed methodology.

Zhang Lan

lzhang95@u.rochester.edu

University of Rochester

Background:

A common goal in single cell RNA sequencing is to categorize subtypes of cells (observations) using unsupervised clustering on thousands of gene expression features. Each input cell is assigned a discrete label, interpreted as a cellular subpopulation. However, it has been challenging to characterize the robustness of the clustering, because most of the steps do not directly provide out-of-sample predictions.

Methods:

We introduce extensions to the steps in a common clustering workflow (i.e., feature selection of highly variable genes, dimension reduction using principal component analysis, Louvain community detection) that allow out-of-sample prediction. These are implemented as wrappers around the R packages SingleCellExperiment and scran. The data is partitioned into a training set, where the workflow parameters are learned, and a test set where parameters are fixed and predictions are made. We compare the clustering of a set of observations in training vs test using the Adjusted Rand Index (ARI), a measure of the similarity between two data clusterings that typically ranges from 0 (chance-level agreement) to 1 (identical clusterings).
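
For readers unfamiliar with the ARI, the following is a minimal sketch of the standard pair-counting formula in plain Python. It is not part of the authors' R wrappers; the function name and the toy labelings are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same observations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Count pairs of observations that share a cluster in both labelings,
    # in the first labeling only, and in the second labeling only.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)


# Cluster labels are arbitrary names, so a pure relabeling scores 1.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Labelings that disagree more than chance would predict score below 0.
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, it is invariant to how the clusters are named, which is what makes it suitable for comparing training and test clusterings.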

Results:

We illustrate the approach using cells from the mouse brain originally published in Zeisel et al. 2015. We compare the impact on clustering concordance when splitting the cells into test/train subsets either a) uniformly at random or b) stratified by biological replicates (mice). Although we found agreement of clustering (approx. 0.80 ARI), the number of identified subpopulations was less stable. The ARI was further reduced (approx. 0.68) when our held-out data consisted of independent biological replicates.

Conclusion:

Typical clustering workflows contain steps that only implicitly learn various parameters. Formalizing the estimation of these implicit parameters allows quantification of the sensitivity of the clustering to changes in the input data, and can interrogate the generalizability of cell population discoveries made using single cell RNA-seq data.

Ma, Zichen

zichenm@clemson.edu

Clemson University

In this paper, we propose a method that balances variable selection and variable shrinkage in linear regression. A diagonal matrix G is injected into the covariance matrix of the prior distribution of the regression coefficient vector, with each diagonal element g, bounded between 0 and 1, serving as a stabilizer of the corresponding regression coefficient. Mathematically, a value of g close to 0 indicates that the corresponding regression coefficient is nonzero, and hence the corresponding variable should be selected, whereas a value of g close to 1 indicates otherwise. We prove this property under orthogonality. Computationally, the proposed method is easy to fit using automated programs such as JAGS. We provide three examples to verify the capability of this methodology in variable selection and shrinkage.

Wang, Linbo

linbo.wang@utoronto.ca

University of Toronto

Analyses of biomedical studies often necessitate modeling longitudinal causal effects. The current focus on personalized medicine and effect heterogeneity makes this task even more challenging. Towards this end, structural nested mean models (SNMMs) are fundamental tools for studying heterogeneous treatment effects in longitudinal studies. However, when outcomes are binary, current methods for estimating multiplicative and additive SNMM parameters suffer from variation dependence between the causal parameters and the non-causal nuisance parameters. This leads to a series of difficulties in interpretation, estimation and computation. These difficulties have hindered the uptake of SNMMs in biomedical practice, where binary outcomes are very common. We solve the variation dependence problem for the binary multiplicative SNMM via a reparametrization of the non-causal nuisance parameters. Our novel nuisance parameters are variation independent of the causal parameters, and hence allow for coherent modeling of heterogeneous effects from longitudinal studies with binary outcomes. Our parametrization also provides a key building block for flexible doubly robust estimation of the causal parameters. Along the way, we prove that an additive SNMM with binary outcomes does not admit a variation independent parametrization, thereby justifying the restriction to multiplicative SNMMs.

Qin, Qian

qqin@umn.edu

University of Minnesota

In this work, we compare the deterministic- and random-scan Gibbs samplers in the two-component case. It is found that the deterministic-scan version converges faster, while the random-scan version can be situationally superior in terms of asymptotic variance. Results herein take computational cost into account.
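
As an illustration of the two scan orders, here is a minimal sketch of a two-component Gibbs sampler. The target (a standard bivariate normal with correlation rho) and all names are assumptions chosen for the example, not the setting analyzed in the talk.

```python
import random


def gibbs_bivariate_normal(n_iter, rho, scan="deterministic", seed=0):
    """Two-component Gibbs sampler for a standard bivariate normal.

    Both full conditionals are normal: X | Y = y ~ N(rho * y, 1 - rho**2),
    and symmetrically for Y | X = x.
    """
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x, y = 0.0, 0.0
    draws = []
    for _ in range(n_iter):
        if scan == "deterministic":
            # Update both components in a fixed order: two draws per iteration.
            x = rng.gauss(rho * y, sd)
            y = rng.gauss(rho * x, sd)
        elif scan == "random":
            # Update one component chosen uniformly at random: one draw per iteration.
            if rng.random() < 0.5:
                x = rng.gauss(rho * y, sd)
            else:
                y = rng.gauss(rho * x, sd)
        else:
            raise ValueError("scan must be 'deterministic' or 'random'")
        draws.append((x, y))
    return draws


det_draws = gibbs_bivariate_normal(20000, rho=0.5, scan="deterministic")
rand_draws = gibbs_bivariate_normal(20000, rho=0.5, scan="random")
```

In this sketch one deterministic-scan iteration costs two conditional draws while one random-scan iteration costs one, so a fair comparison has to put the two samplers on the same computational budget.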

Cunningham, Adam

adamcunn@buffalo.edu

University at Buffalo

It is estimated that approximately 5% – 10% of children in the US will experience a concussion at some point. Although most recover within a few weeks, approximately 30% take longer than a month to recover, and are said to experience Persistent Post-Concussive Symptoms (PPCS). Since children with PPCS are far more likely to experience psychosocial adjustment issues and learning difficulties in school, evidence-based tests are needed to indicate when persistent symptoms are a possibility requiring additional monitoring and early intervention. There are, however, currently no objective blood or imaging biomarkers to diagnose concussion or to identify early on those patients who will take longer to recover.

Working with clinicians from the University at Buffalo Concussion Management Clinic, I have developed a simple scoring system, the Risk for Delayed Recovery (RDR)-Score, which predicts the risk of PPCS in adolescents with concussion injuries. The data used to develop this system was collected by the clinic on 270 adolescent concussion patients over three years using the Buffalo Concussion Physical Examination (BCPE). This is a brief physical examination which identifies dysfunction within physiological and neurological sub-systems known to be affected by concussion. Developing the RDR-Score involved fitting Cox proportional-hazards models, accelerated failure time models, and binomial generalized linear models to the BCPE data, evaluating each model using cross-validation, and choosing an optimal subset of predictors to include in the final model. I then developed a technique to convert the coefficients of the best model into a set of small integer weighting factors that optimally preserved the characteristics of the continuous model. The resulting weighted scoring system allows physicians in an outpatient setting to more accurately predict which children are at greater risk for PPCS early after their injury, and who would benefit most from targeted therapies.

Jeffrey C. Miecznikowski, Jiefei Wang

jwang96@buffalo.edu

University at Buffalo

Multiple testing methods to control the number of false discoveries play a central role in data analyses involving multiple hypothesis tests. Common examples include clinical and omics datasets. The traditional methods to control type I errors, such as the Bonferroni adjustment for family-wise error control and the Benjamini-Hochberg procedure for false discovery rate control, yield rejections based on the observed data. However, these methods generally do not allow the researcher to incorporate information obtained after viewing the data, which may be considered wasteful because the information contained in the collection of p-values cannot be used. In this seminar, we present a simple but flexible method that gives an upper bound on the false discovery proportion for a rejection set. The bound holds simultaneously for all possible rejection sets, which in turn gives the user the possibility to explore any reasonable rejection set even after observing the data. We demonstrate our method using clinical data as well as a genome study to show its generality.
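
For reference, the two traditional procedures named above can be sketched in a few lines of Python. This shows only the standard Bonferroni and Benjamini-Hochberg baselines, not the simultaneous bound presented in the seminar; the function names and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices rejected by the Bonferroni adjustment (family-wise error control)."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure (FDR control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    # Find the largest rank whose p-value falls under the line alpha * rank / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank
    # Reject every hypothesis ranked at or below that rank.
    return sorted(order[:k])


# Hypothetical p-values for eight hypotheses.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonf_rejected = bonferroni(p)
bh_rejected = benjamini_hochberg(p)
```

Both procedures return a single rejection set fixed by the data and the level alpha, which is the limitation that a bound valid for every rejection set is meant to address.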

Saptarshi Chakraborty

chakrab2@buffalo.edu

University at Buffalo

The vast preponderance of somatic mutations in a typical cancer are either extremely rare or have never been previously recorded in available databases that track somatic mutations. These constitute a hidden genome that contrasts with the relatively small number of mutations that occur frequently, the properties of which have been studied in depth. Here we demonstrate that this hidden genome contains much more accurate information than common mutations for the purpose of identifying the site of origin of primary cancers in settings where this is unknown. We accomplish this using a projection-based statistical method that achieves a highly effective signal condensation, by leveraging DNA sequence and epigenetic contexts using a set of meta-features that embody the mutation contexts of rare variants throughout the genome.

Venkatraman, Sara

skv24@cornell.edu

Cornell University

In many scientific disciplines, time-evolving phenomena are frequently modeled by nonlinear ordinary differential equations (ODEs). We present an approach to learning ODEs with rigorous statistical inference from time series data. Our methodology builds on a popular technique for this task in which the ODEs to be estimated are assumed to be sparse linear combinations of several candidate functions, such as polynomials. In addition to producing point estimates of the nonzero terms in the estimated equations, we propose leveraging recent advances in high-dimensional inference to quantify the uncertainty in the estimate of each term. We use both frequentist and Bayesian versions of regularized regression to estimate ODE systems as sparse combinations of terms that are statistically significant or have high posterior probabilities, respectively. We demonstrate through simulations that this approach allows us to recover the correct terms in the dynamics more often than existing methods that do not account for uncertainty.

Consagra, William

william_consagra@urmc.rochester.edu

University of Rochester

High angular resolution diffusion imaging (HARDI) is a type of diffusion magnetic resonance imaging (dMRI) that measures diffusion signals on a sphere in q-space. It has been widely used in data acquisition for human brain structural connectome analysis. For accurate structural connectome estimation, dense samples in q-space are often acquired, resulting in long scanning times and logistical challenges. To overcome these issues, we develop a statistical framework that incorporates relevant dMRI data from prior large-scale imaging studies in order to improve the efficiency of human brain structural connectome estimation under sparse sampling. Our approach leverages the historical dMRI data to calculate a prior distribution characterizing local diffusion variability in each voxel in a template space. The priors are used to parameterize a sparse sample estimator and corresponding approximate optimal design algorithm to select the most informative q-space samples. Through both simulation studies and real data analysis using Human Connectome Project data, we demonstrate significant advantages of our method over existing HARDI sampling and estimation frameworks.

Hose, Tiana

tiamarie99999@gmail.com

Rochester Institute of Technology

Evaluations of localized academic interventions often focus on course performance, primarily attrition (DFW rate). We use a regularly updating Markov chain model to analyze the downstream impact of Learning Assistants (LAs), undergraduates who receive pedagogical instruction in order to help faculty implement research-based pedagogical strategies that focus on small-group interactions. LA programs have been shown to improve success in individual courses but, for a variety of reasons, little research has connected the program with downstream success. In this study, we compare yearly retention and graduation rates of 3500+ students who took courses supported by LAs with a matched sample who took the same courses without LA support (but often with support from untrained undergraduate Teaching Assistants). Our results show that exposure to LA support in courses designated as “high-DFW” is associated with an 11% increase in both first-year retention and six-year graduation rates, compared with students who took the same course without LA support. This is larger than the reduction in DFW rate, implying that LA support not only results in more students passing a class, but also better prepares all students for the rest of their academic career.

Guggilam, Sreelekha

sreelekh@buffalo.edu

ORNL

Anomaly detection for time series data is often aimed at identifying extreme behaviors within an individual time series. However, identifying extreme trends relative to a collection of other time series is of significant interest, for example in the fields of public health policy, social justice and pandemic propagation. We propose an algorithm that can scale to large collections of time series data using concepts from the theory of large deviations. Exploiting the ability of the algorithm to scale to high-dimensional data, we propose an online anomaly detection method to identify anomalies in a collection of multivariate time series. We demonstrate the applicability of the proposed Large Deviations Anomaly Detection (LAD) algorithm in identifying counties in the United States with anomalous trends in terms of COVID-19 related cases and deaths. Several of the identified anomalous counties correlate with counties with documented poor response to the COVID pandemic.

Yan, Li

li.yan@roswellpark.org

Roswell Park Comprehensive Cancer Center

Altered lipid metabolism has emerged as an important feature of ovarian cancer (OC), yet the translational potential of lipid metabolites to aid in diagnosis and triage remains unproven. We conducted a multi-level interrogation of lipid metabolic phenotypes in patients with adnexal masses, integrating quantitative lipidomics profiling of plasma and ascites with publicly available tumor transcriptome data. We assessed concentrations of > 500 plasma lipids in two patient cohorts: (i) a pilot set of 100 women with OC (50) or benign tumor (50), and (ii) an independent set of 118 women with malignant (60) or benign (58) adnexal mass. 249 lipid species and several lipid classes were significantly reduced in cases versus controls in both cohorts (FDR < 0.05). 23 metabolites (triacylglycerols, phosphatidylcholines, cholesterol esters) were validated at Bonferroni significance (P < 9.16 × 10^-5). Certain lipids exhibited greater alterations in early- (diacylglycerols) or late-stage (lysophospholipids) cases, and multiple lipids in plasma and ascites were positively correlated. Lipoprotein receptor gene expression differed markedly in OC versus benign tumors. Importantly, several plasma lipid species, such as DAG(16:1/18:1), improved the accuracy of CA125 in differentiating early-stage OC cases from benign controls, and conferred a 15–20% increase in specificity at 90% sensitivity in multivariate models adjusted for age and BMI. In addition, Monte Carlo cross-validation and LASSO were used to investigate the robustness of the ROC analyses and the effect of combining multiple lipids with CA125. This study provides novel insight into systemic and local lipid metabolic differences between OC and benign disease, further implicating altered lipid uptake in OC biology, and advancing plasma lipid metabolites as a complementary class of circulating biomarkers for OC diagnosis and triage.

Jingwen Yan

jingyan@iupui.edu

Indiana University Purdue University Indianapolis

A large number of genetic variations have been identified to be associated with Alzheimer’s disease (AD) and related quantitative traits. However, the majority of existing studies focused on single types of omics data, lacking the power to generate a community including multi-omic markers and their functional connections. Because of this, the immense value of multi-omics data on AD has attracted much attention. Leveraging genomic, transcriptomic and proteomic data, and their backbone network through functional relations, we proposed a modularity-constrained logistic regression model to mine the association between disease status and a group of functionally connected multi-omic features, i.e. single-nucleotide polymorphisms (SNPs), genes and proteins. This new model was applied to real data collected from the frontal cortex tissue in the Religious Orders Study and Memory and Aging Project cohort. Compared with other state-of-the-art methods, it provided overall the best prediction performance during cross-validation. This new method helped identify a group of densely connected SNPs, genes and proteins predictive of AD status. These SNPs are mostly expression quantitative trait loci in the frontal region. Brain-wide gene expression profiles of these genes and proteins were highly correlated with the brain activation map of ‘vision’, a brain function partly controlled by the frontal cortex. These genes and proteins were also found to be associated with the amyloid deposition, cortical volume and average thickness of frontal regions. Taken together, these results suggested a potential pathway underlying the development of AD from SNPs to gene expression, protein expression and ultimately brain functional and structural changes.

Davuluri, Ramana

Ramana.Davuluri@stonybrookmedicine.edu

Stony Brook University

Molecular subtyping of cancer is among the most important topics in translational cancer research. Over the years, there have been numerous subtyping studies performed on different types of cancers using various genomic data. The majority of studies utilized microarray or RNA-Seq-based expression profiling data, although there have also been many studies using data from other -omics platforms, such as DNA methylation, copy number variation, and microRNA. Nevertheless, great inconsistency in the number and assignment of clusters was often observed between subtyping studies utilizing data from different -omics types, and sometimes even data from the same platforms, lowering the reproducibility of such research. Subtypes that consistently appear across multiple levels are considered robust and have been shown to enhance predictions of patient prognosis.

From a deep multi-view learning perspective, different -omics levels can be seen as individual views, or modalities, collected on the same data, while the objective is to learn a shared nonlinear representation that simultaneously encodes all modalities by preserving maximal information through deep neural networks. In this talk, we present DeepMOIS-MC (Deep Multi-Omics Integrative Subtyping by Maximizing Correlation), a novel deep learning-based method that achieves multi-omics integration and subtyping of cancer by finding a low-dimensional shared representation that maximizes the correlation between multiple views. DeepMOIS-MC extends DGCCA (Deep Generalized Canonical Correlation Analysis), a canonical correlation analysis-based algorithm that can simultaneously learn nonlinear relationships between more than two views. The hypothesis is that the shared embedded space that maximizes correlation between views should contain the most useful information for robust subtyping, since this indicates that certain patterns are repeatedly seen across multiple -omics platforms. We show that DeepMOIS-MC is indeed capable of robustly and accurately identifying cancer subtypes with enhanced prognostic stratification that are translatable across platforms.

McCall, Betsy

betsymcc@buffalo.edu

University at Buffalo

Satellite and airborne observations of the surface elevations of the Greenland Ice Sheet have been collected in recent decades to better understand the impact that climate change is having on the cryosphere. After processing, these observations produce approximately 100,000 irregular time series of the behavior of the ice. Separating out known seasonal variation leaves data about the dynamic changes in the ice over these decades. We examine these time series and explore several modeling approaches for interpolating the data, such as polynomial regression, LOESS models, spline regression models and Gaussian process regression models. We compare the flexibility of the models to capture local features in dynamic changes, and the ability of each type of model to capture sudden changes in behavior. Finally, we consider the ability of each type of model to accurately quantify uncertainty associated with the interpolated results, and the impact that uncertainty quantification will have on the use of these interpolations in other applications, such as ice sheet model validation.

Vir, Mayank

mayankvi@buffalo.edu

University at Buffalo

No Abstract

Venkata Sai Rohit Ayyagari

vayyagar@buffalo.edu

University at Buffalo

The concepts of Artificial Intelligence and Machine Learning have had a tremendous impact on various industries such as IT, research, retail and e-commerce, marketing and business analytics. A key domain where artificial intelligence and machine learning may be applied with a surfeit of benefits is the health care and medicine industry. Good health of people plays an important role in contributing to the economic growth of a country. The health care and medicine industry generates enormous amounts of health care records on a daily basis. Such a large volume of patient data can be utilized in a more effective and efficient manner in the diagnosis and treatment of patients. The proposed system aims at utilizing this vast patient data and providing accurate and efficient disease and treatment prediction using the concepts and principles of artificial intelligence and machine learning. The system aims at using datasets for disease and symptoms and corresponding treatments and applying machine learning algorithms to obtain efficient and accurate disease-treatment prediction based on the patient input. Such a system would ultimately simplify numerous processes in the health care industry and also speed up diagnosis of life-threatening diseases.

Pughazhendhi, Abhishek

apughazh@buffalo.edu

University at Buffalo, The State University of New York

The proposed model/application could potentially be a social media platform that indexes people based on how popular they are. In simple terms, it answers the question of how popular a person really is or lets an individual keep track of the count of people that he/she actually knows on a global scale. The application would collect, scrape, and process data, predominantly from the social media platforms of selected individuals, and remove redundant records. To expand, one person could be following another individual on multiple platforms (assume 3) but this doesn't mean the "popularity index" of the given individual is 3. The model/application would also be built with the capability to add new records over time to give users control over their popularity index or ranking inside the application. As the “users” get to know more people or vice versa, they consent to update their records with each other's credentials (which could potentially be a QR code), and their “popularity index” or ranking increments. This would gamify the entire user experience. Eventually, as the dataset multiplies, it can be analyzed and visualized to show what really makes an individual popular, time-growth visualization, etc.

Navyaka Kandula

navyakak@buffalo.edu

University at Buffalo, The State University of New York

The established molecular heterogeneity of human cancers and the subsequent stratification of conventional diagnostic categories require new paradigms for developing a reliable basis for predictive medicine. We review clinical trial designs for the development of new therapeutics and predictive biomarkers to inform their use. We cover designs for a wide range of settings. At one extreme is the development of a new drug with a single biomarker and strong biological evidence that marker-negative patients are unlikely to benefit from the new drug. At the other extreme are phase III clinical trials involving both genome-wide discovery and internal validation of a predictive classifier that identifies the patients most likely and unlikely to benefit from the new drug.

Pal, Subhadip

subhadippal@gmail.com

University of Louisville

We develop new data augmentation algorithms for Bayesian analysis of directional data using the von Mises-Fisher distribution in arbitrary dimensions. The approach leads to a new class of distributions, called the Modified Polya-Gamma distribution, which we construct in detail. The proposed data augmentation strategies circumvent the need for analytic approximations to integration, numerical integration, or Metropolis-Hastings for the corresponding posterior inference. Simulations and real data examples are presented to demonstrate the applicability and to appraise the performance of the proposed procedures.

Miecznikowski, Jeffrey

jcm38@buffalo.edu

SUNY University at Buffalo

Functional pathways involve a series of biological alterations that may result in the occurrence of many diseases including cancer. With the widespread availability of “omics” technologies it has become feasible to integrate information from a hierarchy of biological layers to provide a more comprehensive understanding of disease. This talk provides a brief overview of a particular method to discover these functional networks across biological layers of information that are correlated with the phenotype. Simulations and a real data analysis from The Cancer Genome Atlas (TCGA) will be shown to demonstrate the performance of the method.

Karmakar, Sayar

sayarkarmakar@ufl.edu

University of Florida

Time-aggregated prediction intervals are constructed for a univariate response time series in a high-dimensional regression regime. A simple quantile-based approach on the LASSO residuals seems to provide reasonably good prediction intervals. We allow for a very general, possibly heavy-tailed, possibly long-memory, and possibly non-linear dependent error process, and discuss both the fixed-design and the stochastic-design settings for the predictors. Finally, we construct prediction intervals for hourly electricity prices over horizons spanning 17 weeks and compare them to selected Bayesian and bootstrap interval forecasts.
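
One way to picture a quantile-based interval for an aggregated response is sketched below. This is a simplified illustration under stated assumptions, not the authors' procedure: the residuals are taken as given (in the abstract they would come from a LASSO fit), rolling sums of the residuals over the forecast horizon stand in for the aggregated error, and the function names and numbers are hypothetical.

```python
def empirical_quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def aggregated_interval(residuals, point_forecasts, alpha=0.1):
    """Prediction interval for the sum of the next len(point_forecasts) responses.

    Assumption for this sketch: the aggregated forecast error behaves like
    the rolling sums of past residuals over windows of the same length.
    """
    h = len(point_forecasts)
    if h == 0 or len(residuals) < h:
        raise ValueError("need at least as many residuals as forecast steps")
    window_sums = [sum(residuals[i:i + h]) for i in range(len(residuals) - h + 1)]
    total = sum(point_forecasts)
    lower = total + empirical_quantile(window_sums, alpha / 2)
    upper = total + empirical_quantile(window_sums, 1 - alpha / 2)
    return lower, upper


# Hypothetical in-sample residuals and two-step-ahead point forecasts.
residuals = [-1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0]
interval = aggregated_interval(residuals, [10.0, 10.0], alpha=0.2)
```

Working with window sums rather than single residuals lets the interval reflect dependence between consecutive errors without modeling that dependence explicitly.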

Bagchi, Pramita

pbagchi@gmu.edu

George Mason University

The frequency-domain properties of nonstationary functional time series often contain valuable information. These properties are characterized through their time-varying power spectrum. Practitioners seeking low-dimensional summary measures of the power spectrum often partition frequencies into bands and create collapsed measures of power within bands. However, standard frequency bands have largely been developed through manual inspection of time series data and may not adequately summarize power spectra. In this article, we propose a framework for adaptive frequency band estimation of nonstationary functional time series that optimally summarizes the time-varying dynamics of the series. We develop a scan statistic and search algorithm to detect changes in the frequency domain. We establish the theoretical properties of this framework and develop a computationally-efficient implementation. The validity of our method is also justified through numerous simulation studies and an application to analyzing electroencephalogram data in participants alternating between eyes open and eyes closed conditions.

Polley, Mei-Yin

mcpolley@uchicago.edu

The University of Chicago

Biomarkers, singly or combined into a multivariable signature, are increasingly used for disease diagnosis, individual risk prognostication, and treatment selection. Rigorous and efficient statistical approaches are needed to appropriately develop, evaluate, and validate biomarkers that will be used to inform clinical decision-making. In this talk, I will present a two-stage prognostic biomarker signature design with a futility analysis that affords the possibility to stop the biomarker study early if the prognostic signature is not sufficiently promising at an early stage, thereby allowing investigators to preserve the remaining specimens for future research. This design integrates important elements necessary to meet statistical rigor and practical demands for developing and validating prognostic biomarker signatures. Our simulation studies demonstrated desirable operating characteristics of the method in that when the biomarker signature has weak discriminant potential, the proposed design has high probabilities of terminating the study early. This work provides a practical tool for designing efficient and rigorous biomarker studies and represents one of the few papers in the literature that formalize statistical procedures for early stopping in prognostic biomarker studies.

Liu, Jianxuan

jianxuanliu7@gmail.com

Syracuse University

The goal of most empirical studies in policy research and medical research is to determine whether an alteration in an intervention or a treatment will cause a change in the desired outcome response. Unlike in randomized designs, establishing a causal relationship from observational studies is challenging because the ceteris paribus condition is violated. When the covariates of interest are measured with error, evaluating the causal effects becomes a thorny issue. An additional challenge arises from confounding variables, which are often high-dimensional or correlated with the error-prone covariates. Most of the existing methods for estimating the average causal effect rely heavily on parametric assumptions about the propensity score or the outcome regression model. In reality, both models are prone to misspecification, which can have undue influence on the estimated average causal effect. To the best of our knowledge, none of the existing methods can handle high-dimensional covariates in the presence of error-prone covariates. We propose a semiparametric method to establish the causal relationship, which yields a consistent estimator of the average causal effect. The proposed method results in efficient estimators of the covariate effects. We investigate their theoretical properties and demonstrate their finite sample performance through extensive simulation studies.

Polunchenko, Aleksey

aleksey@binghamton.edu

Binghamton University

The topic of interest is the performance of the Generalized Shiryaev-Roberts (GSR) control chart in continuous time, where the goal is to detect a possible onset of a drift in a standard Brownian motion observed live. We derive analytically and in closed form all of the relevant performance characteristics of the chart. By virtue of the obtained performance formulae, we show numerically that the GSR chart with a carefully designed headstart is far superior to such mainstream charts as CUSUM and EWMA. More importantly, the Fast Initial Response feature exhibited by the headstarted GSR chart makes the latter not only better than the mainstream charts, but nearly the best one can do overall for a fixed in-control Average Run Length level. This is a stronger conclusion than that previously reached about CUSUM in the seminal 1982 Technometrics paper by J.M. Lucas and R.B. Crosier.
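To make the role of the headstart concrete, here is a minimal sketch of a GSR-type chart. It is an illustrative assumption, not the setting of the talk: it uses the standard discrete-time analogue with unit-variance Gaussian observations rather than continuous-time Brownian motion, and the function name `gsr_alarm_time` and its parameters are hypothetical.

```python
import math
from typing import Iterable, Optional


def gsr_alarm_time(
    observations: Iterable[float],
    drift: float,
    threshold: float,
    headstart: float = 0.0,
) -> Optional[int]:
    """Index (1-based) of the first alarm, or None if no alarm is raised.

    Recursion: R_n = (1 + R_{n-1}) * LR_n, with R_0 = headstart, where LR_n
    is the likelihood ratio of N(drift, 1) against N(0, 1) at observation n.
    An alarm is raised the first time R_n reaches the threshold.
    """
    r = headstart
    for n, x in enumerate(observations, start=1):
        likelihood_ratio = math.exp(drift * x - 0.5 * drift ** 2)
        r = (1.0 + r) * likelihood_ratio
        if r >= threshold:
            return n
    return None
```

With the same data and threshold, a positive headstart can only bring the alarm earlier, which is the Fast Initial Response effect mentioned above; the price is a shorter in-control run length unless the threshold is raised to compensate.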

Li, Wei

wli169@syr.edu

Syracuse University

This talk will focus on some recent results from nonparametric Bayesian statistics on level set estimation. Some theoretical properties of the plug-in Bayesian estimator for the level sets, such as posterior contraction and frequentist coverage of the credible regions are discussed. The links between the natural metric for level sets and the supremum-norm distance of the functions, and the difference between the frequentist approach and the proposed Bayesian approach are to be highlighted.

Bhattacharya, Suman

suman.bhattacharya@tevapharm.com

Teva Pharmaceuticals USA, Inc.

The Horseshoe is a widely used continuous shrinkage prior for high-dimensional Bayesian linear regression. Recently, regularized versions of the Horseshoe prior have also been introduced in the literature. Various Gibbs sampling Markov chains have been developed to generate approximate samples from the corresponding intractable posterior densities. Establishing geometric ergodicity of these Markov chains provides crucial technical justification for the accuracy of asymptotic standard errors for Markov chain based estimates of posterior quantities. In this paper, we establish geometric ergodicity for various Gibbs samplers corresponding to the Horseshoe prior and its regularized variants in the context of linear regression. First, we establish geometric ergodicity of a Gibbs sampler for the original Horseshoe posterior under strictly weaker conditions than existing analyses in the literature. Second, we consider the regularized Horseshoe prior introduced in Piironen and Vehtari (2017), and prove geometric ergodicity for a Gibbs sampling Markov chain to sample from the corresponding posterior without any truncation constraint on the global and local shrinkage parameters. Finally, we consider a variant of this regularized Horseshoe prior introduced in Nishimura and Suchard (2019), and again establish geometric ergodicity for a Gibbs sampling Markov chain to sample from the corresponding posterior.

Babbitt, Gregory

gabsbi@rit.edu

Rochester Institute of Technology

Comparative functional analysis of the dynamic interactions between various Betacoronavirus mutant strains and broadly utilized target proteins such as ACE2 is crucial for a more complete understanding of zoonotic spillovers of viruses that cause severe respiratory diseases such as COVID-19. Here, we employ machine learning to replicate sets of nanosecond scale GPU accelerated molecular dynamics simulations to statistically compare and classify atom motions of these target proteins in both the presence and absence of different endemic and emergent strains of the viral receptor binding domain (RBD) of the S spike glycoprotein. With this method of comparative protein dynamics, we demonstrate some important recent trends in the functional evolution of viral binding to human ACE2 in both endemic and recent emergent human SARS-like strains (hCoV-OC43, hCoV-HKU1, SARS-CoV-1 and 2, MERS-CoV). We also examine how genetic differences between the endemic bat strain batCoV-HKU4, the SARS-CoV-2 progenitor bat strain RaTG13, and the SARS-CoV-2 Variants Of Concern (VOCs) alpha to omicron have progressively enhanced the binding dynamics of the spike protein RBD as it has functionally evolved towards more effective human ACE2 interaction and potentially less effective interactions with a Rhinolophus bat ACE2 ortholog. We conclude that while some human adaptation in the virus has occurred across the VOCs, it has not yet progressed to the point of preventing reverse spillovers into a broad range of mammals. This raises the ominous specter of future unpredictability when SARS-like strains may become more widely held in reservoir in intermediate hosts.

Lynch, Miranda

mlynch@hwi.buffalo.edu

Hauptman-Woodward Medical Research Institute

Network and graph theoretic approaches have been invaluable in modeling and analyzing many aspects of the Covid-19 pandemic, providing a valuable framework for understanding events at scales from population-level (such as transmission dynamics) to host- and virus-levels (such as cellular dynamics, phylogenetic genomic patterns, and protein-protein networks). In this work, we demonstrate how statistical analyses of networks can also contribute basic science insights into subcellular events associated with SARS-CoV-2 infection and treatment, via graph representations of viral proteins and their molecular level associations with host proteins and with therapeutic ligands. Statistical analyses applied to these network models permit comparative assessment of different protein states, which inform on mechanistic aspects of the host-pathogen interface.

Yu, Xingchen

xvy5021@gmail.com

Climate Corporation

The use of spatial models for inferring members’ preferences from voting data has become widespread in the study of deliberative bodies, such as legislatures. Most established spatial voting models assume that ideal points belong to a Euclidean policy space. However, the geometry of Euclidean spaces (even multidimensional ones) cannot fully accommodate situations in which members at the opposite ends of the ideological spectrum reveal similar preferences by voting together against the rest of the legislature. This kind of voting behavior can arise, for example, when extreme conservatives oppose a measure because they see it as being too costly, while extreme liberals oppose it for not going far enough for them. This paper introduces a new class of spatial voting models in which preferences live in a circular policy space. Such geometry for the latent space is motivated by both theoretical (the so-called “horseshoe theory” of political thinking) and empirical (goodness of fit) considerations. Furthermore, the circular model is flexible and can approximate the one-dimensional version of the Euclidean voting model when the data support it. We apply our circular model to roll-call voting data from the U.S. Congress between 1988 and 2019 and demonstrate that, starting with the 112th House of Representatives, circular policy spaces consistently provide a better explanation of legislators’ behavior than Euclidean ones and that legislators’ rankings, generated through the use of the circular geometry, tend to be more consistent with those implied by their stated policy positions.

Dang, Sanjeena

sanjeena.dang@carleton.ca

Carleton University

Three-way data structures or matrix-variate data are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix variate distributions offer a natural way to model three-way data, and mixtures of matrix variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks. In this work, a mixture of matrix variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix variate structure, the number of covariance parameters to be estimated is reduced and the components of the resulting covariance matrices provide a meaningful interpretation. We propose three different frameworks for parameter estimation: a Markov chain Monte Carlo based approach, a variational Gaussian approximation-based approach, and a hybrid approach. The models are applied to both real and simulated data, and we demonstrate that the proposed approaches can recover the underlying cluster structure. In simulation studies where the true model parameters are known, our proposed approach shows good parameter recovery.

Werren, John

jack.werren@rochester.edu

University of Rochester

As Angiotensin-converting enzyme 2 (ACE2) is the cell receptor that SARS-CoV-2 uses to enter cells, a better understanding of the proteins that ACE2 normally interacts with could reveal information relevant to COVID-19 disease manifestations and possible avenues for treatment. We have undertaken an evolutionary approach to identify mammalian proteins that “coevolve” with ACE2 based on their evolutionary rate correlations (ERCs). The rationale is that proteins that coevolve with ACE2 are likely to have functional interactions. ERCs reveal both proteins that have previously been reported to be associated with severe COVID-19, but are not currently known to interact with ACE2, and others with novel associations relevant to the disease, such as coagulation and cytokine signaling pathways. Using reciprocal rankings of protein ERCs, we have identified ACE2 connections to coagulation pathway proteins Coagulation Factor V and fibrinogen components FGA, FGB, and FGG, the latter possibly mediated through ACE2 connections to Clusterin (which clears misfolded extracellular proteins) and GPR141 (whose functions are relatively unknown). ACE2 also connects to proteins involved in cytokine signaling and immune response (e.g. XCR1, IFNAR2 and TLR8), and to Androgen Receptor (AR). We propose that ACE2 has novel protein interactions that are disrupted during SARS-CoV-2 infection, contributing to the spectrum of COVID-19 pathologies. More broadly, ERCs can be used to predict protein interactions in many different biological pathways. Already, ERCs indicate possible functions for some relatively uncharacterized proteins, as well as possible new functions for well-characterized ones.

Gong, Zi-Jia

zg3988@rit.edu

Rochester Institute of Technology

LUSI (Learning Using Statistical Invariants) is a new machine-learning paradigm proposed by Vapnik and Izmailov. In LUSI, a classification function is searched in the reproducing kernel Hilbert space (RKHS) by minimizing the loss function, while a set of ‘predicates’, functionals on the training data that incorporate the specific knowledge of the machine-learning problem, are preserved invariant. In this project, we implemented several versions of LUSI algorithms in Python in order to evaluate their performance in classification. We use the MNIST and CIFAR10 datasets to fit various LUSI models and compare the test accuracy of each model. Different predicates are designed and their impact on the classification performance is investigated. The LUSI code package is designed to be compatible with scikit-learn, and is open-source on GitHub.

Zhao, Boxin; Wang, Y. Samuel

Ysw7@cornell.edu

Cornell

The problem of estimating the difference between two functional undirected graphical models with shared structures is considered. In many applications, data are naturally regarded as a vector of random functions rather than a vector of scalars. For example, electroencephalography (EEG) data are more appropriately treated as functions of time. In these problems, not only can the number of functions measured per sample be large, but each function is itself an infinite-dimensional object, making estimation of model parameters challenging. This is further complicated by the fact that the curves are usually only observed at discrete time points. We first define a functional differential graph that captures differences between two functional graphical models and formally characterize when the functional differential graph is well defined. We then propose a method, FuDGE, that directly estimates the functional differential graph without first estimating each individual graph. This is particularly beneficial in settings where the individual graphs are dense, but the differential graph is sparse. We show that FuDGE consistently estimates the functional differential graph even in a high-dimensional setting for both discretely observed and fully observed function paths.

Duker, Marie

duker@cornell.edu

Cornell University

For time series with high correlation, the empirical process converges extremely slowly to its limiting distribution. Many relevant statistics like the median and the Wilcoxon statistic can be written as functionals of the empirical process and inherit the slow convergence. Inference based on the limiting distribution of those quantities becomes highly impacted by relatively small sample sizes. This talk proposes a novel approach to calculate confidence intervals for the empirical process and the median based on a higher-order approximation of the empirical process. We establish the theoretical validity of our method for statistics that are functionals of the empirical process. In a simulation study, we compare coverage rate and interval length of our confidence intervals with classical results and highlight the improvements. The talk concludes with a discussion of further utilization of the proposed approximation in change-point analysis and high-dimensional time series analysis.

Reid, Elizabeth

elizabeth.reid@marist.edu

Marist College

Far too often students question why they need to know statistics. Because of this, it has been challenging to motivate students, and the pandemic has not helped. During the height of the pandemic, I taught several statistics classes online. Through this experience I learned more effective ways to make statistics important and meaningful to students. In this talk, we will discuss projects that were designed to engage students and ultimately teach them why statistics is significant.

Fokoue, Ernest

epfeqa@rit.edu

Rochester Institute of Technology

Learning Using Statistical Invariants (LUSI) is a relatively recent incarnation in the world of statistical learning theory paradigms. In their effort to propose what they hope to be a complete statistical theory of learning, Vapnik and Izmailov (2019) develop the LUSI framework, partly using their earlier tool known as the V-matrix but crucially borrowing heavily from Plato's philosophical teachings on ideas and things (forms) to extend classical statistical learning theory from its purely empirical nature (now seen as brute force learning) to a learning theory based on predicates that minimize the true error. This talk will review the merits and the promises of LUSI and explore the ways in which Plato's philosophical teachings contain the potential of helping usher in a new era in Statistical Learning Theory.

Fokoue, Ernest

epfeqa@rit.edu

Rochester Institute of Technology

This talk explores the myriad ways in which the Bayesian paradigm permeates the entire landscape of Statistical Machine Learning and Data Science. Despite some of the major challenges underlying its practical use, the Bayesian paradigm has proven to be ubiquitous, appearing directly or indirectly in virtually every aspect of statistical machine learning, data science, and artificial intelligence. This presentation highlights some of the emerging ways in which the Bayesian paradigm is playing an impactful role in the Data Science Revolution.

Li, Zhiyuan

zl7904@rit.edu

Rochester Institute of Technology

The variational auto-encoder (VAE), a framework that can efficiently approximate intractable posteriors, has achieved great success in learning representations from a stationary data environment. However, limited progress has been made in extending VAEs to learn representations from streaming data environments, where data arrive sequentially with changing distributions. The main challenges of continual representation learning lie in reusing, expanding, and continually disentangling learned semantic factors across data environments. We argue that this is because existing approaches treat continually arriving data independently, without considering how they are related based on the underlying semantic factors. We address this with a new generative model describing a topologically-connected mixture of spike-and-slab distributions in the latent space, learned end-to-end in a continual fashion via principled variational inference. The learned mixture is able to automatically discover the active semantic factors underlying each data environment and to accumulate their relational structure based on that. This distilled knowledge of different data environments can further be used for generative replay and for guiding continual disentangling of new semantic factors.

Ding, Jian; Wu, Yihong; Xu, Jiaming; Yang, Dana

dana.yang@cornell.edu

Cornell University

Motivated by the application of tracking moving particles from snapshots, we study the problem of recovering a planted perfect matching hidden in an Erdős–Rényi bipartite graph. We establish the information-theoretic threshold for almost exact recovery of the hidden matching. Our result extends to general weighted graphs across both dense and sparse regimes. Furthermore, in the special case of exponential weights, we prove that the optimal reconstruction error is infinitely smooth at the threshold, confirming the infinite-order phase transition conjectured in [Semerjian et al. 2020].

Jiang, Xiajun

xj7056@rit.edu

Rochester Institute of Technology

Clinical adoption of personalized virtual heart simulations faces challenges in model personalization and expensive computation. While an ideal solution is an efficient neural surrogate that at the same time is personalized to an individual subject, the state of the art is either concerned with personalizing an expensive simulation model, or learning an efficient yet generic surrogate. This paper presents a completely new concept to achieve personalized neural surrogates in a single coherent framework of meta-learning (metaPNS). Instead of learning a single neural surrogate, we learn the process of learning a personalized neural surrogate using a small number of context data from a subject, in a novel formulation of few-shot generative modeling underpinned by: 1) a set-conditioned neural surrogate for cardiac simulation that, conditioned on subject-specific context data, learns to generate query simulations not included in the context set, and 2) a meta-model of amortized variational inference that learns to condition the neural surrogate via simple feed-forward embedding of context data. At test time, metaPNS delivers a personalized neural surrogate by fast feed-forward embedding of a small and flexible number of data available from an individual, achieving – for the first time – personalization and surrogate construction for expensive simulations in one end-to-end learning framework. Synthetic and real-data experiments demonstrated that metaPNS was able to improve personalization and predictive accuracy in comparison to conventionally-optimized cardiac simulation models, at a fraction of the computation.

Nieto Ramos, Alejandro

axn2780@rit.edu

Rochester Institute of Technology

Cardiac cells exhibit variability in the shape and duration of their action potentials in space within a single individual. To create a mathematical model of cardiac action potentials (AP) which captures this spatial variability and also allows for rigorous uncertainty quantification regarding within-tissue spatial correlation structure, we developed a novel hierarchical probabilistic model making use of a latent Gaussian process prior on the parameters of a simplified cardiac AP model which is used to map forcing behavior to observed voltage signals. This model allows for prediction of cardiac electrophysiological dynamics at new points in space and also allows for reconstruction of surface electrical dynamics with a relatively small number of spatial observation points. Furthermore, we make use of Markov chain Monte Carlo methods via the Stan modeling framework for parameter estimation. We employ a synthetic data case study oriented around the reconstruction of a sparsely-observed spatial parameter surface to highlight how this approach can be used for spatial or spatiotemporal analyses of cardiac electrophysiology.

Cai, Xueya

xueya_cai@urmc.rochester.edu

University of Rochester Department of Biostatistics and Computational Biology

The super learner method combines the stacking algorithm and regression analysis to obtain weighted predictions from varied statistical strategies for model prediction. It is shown to perform no worse than any single prediction method as well as to provide consistent estimates. In addition to model predictions, Rose developed nonparametric double robust machine learning in variable importance analyses. The targeted maximum likelihood estimation (TMLE) method was introduced for variable importance analyses, in which super learner predictions were compared between the saturated model and reduced models when each variable was left out. Variable importance was profiled by corresponding p-values.

In the study of nursing home resident suicide ideation, we first performed individual modeling for each of the statistical strategies, including the logistic regression model with model selection, LASSO (least absolute shrinkage and selection operator), ridge regression, a polynomial model, a neural network, and random forest. Ten-fold cross-validation was implemented in each strategy, and the aggregated estimates from the ten folds were obtained for each algorithm. We further estimated the composite parameter estimates by ensembling all model-specific estimates, in which mean squared error (MSE) was used to identify the best weights for the ensemble. The TMLE method was used to identify the 10 most important risk factors associated with nursing home resident suicide ideation.
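The MSE-based weighting step can be illustrated with a deliberately reduced sketch. It assumes only two candidate learners whose out-of-fold (cross-validated) predictions are already available, whereas the study combines six; the function names `mse`, `blend`, and `stack_two` are hypothetical. For two learners the MSE-optimal convex weight has a closed form, which is clipped to the interval [0, 1].

```python
from typing import List, Sequence


def mse(y: Sequence[float], pred: Sequence[float]) -> float:
    """Mean squared error between outcomes and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


def blend(w: float, pred_a: Sequence[float], pred_b: Sequence[float]) -> List[float]:
    """Convex combination w * pred_a + (1 - w) * pred_b."""
    return [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]


def stack_two(
    y: Sequence[float],
    pred_a: Sequence[float],
    pred_b: Sequence[float],
) -> float:
    """Weight on learner A that minimizes the MSE of the blended prediction.

    The inputs are meant to be out-of-fold predictions, so the weight is
    chosen on data that each learner did not see during fitting.
    """
    num = sum((yi - b) * (a - b) for yi, a, b in zip(y, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:
        return 0.5  # identical predictions: any weight gives the same blend
    return min(1.0, max(0.0, num / den))
```

With more than two learners the same idea becomes a constrained least-squares problem over a weight vector, which needs a numerical solver rather than a closed form.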

Weisenthal, Samuel

Samuel_weisenthal@urmc.Rochester.edu

University of Rochester Medical Center

Methods developed in dynamic treatment regimes and reinforcement learning can be used to estimate a policy, or a mapping from covariates to decisions, which can then instruct decision makers. There is great interest in using such data-driven policies to help health care providers and their patients make optimal decisions. In health care, however, if one is advocating for the adoption of a new policy, it is often important to explain to the provider and patient how this new policy differs from the current standard of care, or the behavioral policy. More generally, identifying the covariates that figure prominently in the shift from behavior to optimality might be of independent clinical or scientific interest. These ends are facilitated if one can pinpoint the parameters that change most when moving from the behavioral policy to the optimal policy. To do so, we adapt ideas from policy search, specifically trust region policy optimization, but, unlike current methods of this type, we focus on interpretability and statistical inference. In particular, we consider a class of policies parameterized by a finite-dimensional vector and jointly maximize value while employing an adaptive L1 norm penalty on divergence from the behavioral policy. This yields adaptive “relative sparsity,” where, as a function of a tuning parameter, we can approximately control the number of parameters in our suggested policy that are allowed to differ from their behavioral counterparts. We develop our method for the off-policy, observational data setting. We perform extensive simulations, prove asymptotic normality for an adaptive Lasso formulation of our objective, and show preliminary analyses of an observational health care dataset. This work is a step toward helping us better explain, in the context of the current standard of care, the policies that have been estimated using techniques from dynamic treatment regimes and reinforcement learning, which promotes the safe adoption of data-driven decision tools in high-stakes settings.
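One way to write down the penalized objective described in this abstract is sketched below. The notation is an assumption made for illustration and is not taken from the talk: \theta is the p-dimensional policy parameter, \hat{\theta}^{b} an estimate of the behavioral policy parameter, \hat{V} an off-policy estimate of the value, \lambda \ge 0 the tuning parameter, and w_j \ge 0 data-dependent weights of the kind used in the adaptive Lasso.

```latex
\hat{\theta}_{\lambda}
  = \arg\max_{\theta \in \mathbb{R}^{p}}
    \Big\{ \hat{V}(\theta)
           - \lambda \sum_{j=1}^{p} w_j \,
             \bigl| \theta_j - \hat{\theta}^{\,b}_{j} \bigr| \Big\}
```

Under this form, a larger \lambda forces more coordinates of \hat{\theta}_{\lambda} to coincide with their behavioral counterparts, so the nonzero differences \hat{\theta}_{\lambda,j} - \hat{\theta}^{b}_{j} mark the covariates that drive the shift from the behavioral policy.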