Abstract: Expected shortfall, measuring the average outcome (e.g., portfolio loss) above a given quantile of its probability distribution, is a common financial risk measure. The same measure can be used to characterize treatment effects in the tail of an outcome distribution, with applications ranging from policy evaluation in economics and public health to biomedical investigations. Expected shortfall regression is a natural approach to modeling covariate-adjusted expected shortfalls. Because the expected shortfall cannot be written as the minimizer of an expected loss function at the population level, computational as well as statistical challenges around expected shortfall regression have led to stimulating research. We discuss some recent developments in this area, with a focus on a new optimization-based semiparametric approach to estimation of conditional expected shortfall that adapts well to data heterogeneity with minimal model assumptions. The talk is based on joint work with Yuanzhi Li and Shushu Zhang.
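For orientation, the unconditional and covariate-adjusted expected shortfall can be written as follows. These are standard textbook definitions (stated assuming a continuous outcome distribution), not a formulation specific to the talk:

```latex
% Expected shortfall at level \alpha of an outcome Y (e.g., a portfolio loss),
% written via the quantile q_\alpha; standard definition, continuous case.
\[
  \mathrm{ES}_{\alpha}(Y) \;=\; \mathbb{E}\bigl[\,Y \mid Y \ge q_{\alpha}(Y)\,\bigr],
  \qquad
  q_{\alpha}(Y) \;=\; \inf\{\, y : \Pr(Y \le y) \ge \alpha \,\}.
\]
% The covariate-adjusted (conditional) version modeled by expected shortfall
% regression replaces the marginal distribution of Y with that of Y given X = x:
\[
  \mathrm{ES}_{\alpha}(Y \mid X = x)
  \;=\; \mathbb{E}\bigl[\,Y \mid Y \ge q_{\alpha}(Y \mid X = x),\, X = x\,\bigr].
\]
```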
Abstract: Given a dataset comprising a single network, we consider inference on a parameter selected from the data. We focus on the setting where the selected parameter is a linear combination of the mean connectivities within and between estimated communities. Inference in this setting poses a challenge, since the communities are themselves estimated from the data. Furthermore, since only a single realization of the network is available, sample splitting is not possible. We show that it is possible to split a single realization of a network with $n$ nodes into two (or more) networks involving the same $n$ nodes; the first network can be used to select a data-driven parameter, and the second to conduct inference on that parameter. In the case of weighted networks with Poisson or Gaussian edges, we obtain two independent realizations of the network; by contrast, in the case of Bernoulli edges, the two realizations are dependent, and so extra care is required. We establish theoretical properties of our procedure, showing that the resulting confidence intervals attain the nominal (selective) coverage, and demonstrate its utility in numerical simulations and in an application to a dataset representing the relationships among dolphins in Doubtful Sound, New Zealand. This is joint work with Ethan Ancell and Daniela Witten of the University of Washington.
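As a rough illustration of the Poisson case, one standard splitting device is binomial thinning of the edge counts: if an edge weight is Poisson, a binomial split of it yields two independent Poisson pieces. The sketch below is illustrative code under that assumption (the function name `split_poisson_network` and the parameter `eps` are ours, not the authors' implementation):

```python
# If A_ij ~ Poisson(lambda_ij) and, given A_ij, B_ij ~ Binomial(A_ij, eps), then
# B_ij ~ Poisson(eps * lambda_ij) and A_ij - B_ij ~ Poisson((1 - eps) * lambda_ij),
# and the two pieces are independent.
import numpy as np

def split_poisson_network(A, eps=0.5, rng=None):
    """Split a Poisson-weighted adjacency matrix A into two independent networks."""
    rng = np.random.default_rng(rng)
    A = np.asarray(A)
    # Thin only the upper triangle and mirror it, so an undirected network
    # stays symmetric after the split.
    upper = np.triu(A, k=1)
    B_upper = rng.binomial(upper.astype(int), eps)
    B = B_upper + B_upper.T
    return B, A - B   # one copy for selection, the other for inference

# Example: a small undirected network with Poisson(3) edge weights.
rng = np.random.default_rng(0)
n = 6
W = np.triu(rng.poisson(3.0, size=(n, n)), k=1)
A = W + W.T
A_select, A_infer = split_poisson_network(A, eps=0.5, rng=1)
```

The first copy can then be used to estimate communities and select the parameter, and the held-out copy to conduct inference, mirroring the sample-splitting logic described above.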
Abstract: Evaluating and validating the performance of prediction models is a crucial task in statistics, machine learning, and their diverse applications, including precision medicine. However, developing robust prediction performance measures, particularly for time-to-event data, poses unique challenges. In this talk, I will highlight how conventional performance metrics for time-to-event data, such as the C Index, Brier Score, and time-dependent AUC, may yield unexpected results when comparing prediction models/algorithms. I will then introduce a novel time-dependent pseudo R-squared measure and demonstrate its utility as a prediction performance measure for both uncensored and right-censored time-to-event data. Additionally, I will discuss its extension to time-dependent prediction performance measures and to competing risks scenarios. Its effectiveness will be showcased through simulations and real-world examples.
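To make one of the conventional metrics concrete, here is a simplified sketch of Harrell's C index for right-censored data (illustrative only; tied observation times are handled naively, and this is not code from the talk):

```python
# Harrell's C-index: among comparable pairs, the fraction where the subject who
# failed earlier also received the higher predicted risk.
import numpy as np

def harrell_c_index(time, event, risk_score):
    """A pair (i, j) is comparable if subject i had the event and was observed
    before time[j]; it is concordant if risk[i] > risk[j]. Risk ties count 1/2."""
    time = np.asarray(time, float)
    event = np.asarray(event, bool)
    risk = np.asarray(risk_score, float)
    num, den = 0.0, 0.0
    for i in range(len(time)):
        if not event[i]:
            continue
        for j in range(len(time)):
            if time[j] > time[i]:
                den += 1.0
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den if den > 0 else np.nan

# Toy example: higher risk scores align with earlier events, so C = 1.
t = [2.0, 5.0, 3.0, 8.0]; d = [1, 0, 1, 1]; r = [0.9, 0.1, 0.7, 0.2]
print(harrell_c_index(t, d, r))
```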
Abstract: We present a scalable manifold learning framework motivated by the challenge of estimating activation manifolds from functional magnetic resonance imaging data in the Human Connectome Project. Our key contribution is an efficient estimation strategy for heat kernel Gaussian processes within exponential family models. The method is designed to handle very large sample sizes while preserving the intrinsic geometry of the data. By introducing a reduced-rank approximation of the graph Laplacian transition matrix and employing a truncated singular value decomposition for eigenpair computation, we reduce computational complexity from O(n^3) to nearly O(n). Numerical experiments demonstrate that the proposed approach achieves both scalability and improved accuracy, making it well-suited for large-scale manifold learning tasks in complex biomedical and other data domains. This is joint work with Junhui He, Guoxuan Ma and Ying Yang.
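A rough sketch of the low-rank ingredient, under our own simplifying assumptions (a kNN graph, the symmetrically normalized Laplacian, and a sparse truncated eigensolver); this is not the authors' implementation, only an illustration of why truncated eigenpair computation avoids the full O(n^3) eigendecomposition:

```python
# Build a sparse kNN graph, form the normalized graph Laplacian, and compute
# only the leading eigenpairs with a truncated (sparse) solver.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenpairs(X, n_neighbors=15, n_components=50):
    """Leading eigenpairs of the normalized graph Laplacian of a kNN graph."""
    W = kneighbors_graph(X, n_neighbors, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T)                          # symmetrize the kNN graph
    d = np.asarray(W.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = sp.identity(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    # Only n_components eigenpairs are computed, never a full decomposition.
    vals, vecs = eigsh(L_sym, k=n_components, which="SM")
    return vals, vecs

# Example: 2000 points sampled near a circle (a one-dimensional manifold).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 2000)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((2000, 2))
vals, vecs = laplacian_eigenpairs(X, n_neighbors=10, n_components=20)
```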
Abstract: Unsupervised node clustering (or community detection) is a classical graph learning task. In this work, we study algorithms that exploit the local geometry of the graph to identify densely connected substructures, which form clusters or communities. Our method implements discrete Ricci curvatures and their associated geometric flows, under which the edge weights of the graph evolve to reveal its community structure. We consider several discrete curvature notions and analyze the utility of the resulting algorithms. In contrast to prior literature, we study not only single-membership community detection, where each node belongs to exactly one community, but also mixed-membership community detection, where communities may overlap. For the latter, we argue that it is beneficial to perform community detection on the line graph. We provide both theoretical and empirical evidence for the utility of our curvature-based clustering algorithms. In addition, we give several results on the relationship between the curvature of a graph and its line graph, which enable the efficient implementation of our proposed mixed-membership community detection approach and which may be of independent interest for curvature-based network analysis.
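To make the curvature idea concrete, below is a self-contained sketch of one simple discrete notion, the augmented Forman-Ricci curvature of an edge in an unweighted graph. It is illustrative only; the talk considers several curvature notions and evolves edge weights under the associated geometric flows:

```python
# Augmented Forman-Ricci curvature of an edge (u, v) in an unweighted graph:
# F(u, v) = 4 - deg(u) - deg(v) + 3 * (#triangles containing the edge).
import networkx as nx

def forman_curvature(G):
    """Edge-level augmented Forman-Ricci curvature of an unweighted graph."""
    curv = {}
    for u, v in G.edges():
        triangles = len(set(G[u]) & set(G[v]))   # common neighbors of u and v
        curv[(u, v)] = 4 - G.degree(u) - G.degree(v) + 3 * triangles
    return curv

# Intuition for community detection: edges inside densely connected groups tend
# to have high curvature, while "bridge" edges between groups are negatively curved.
G = nx.barbell_graph(5, 1)          # two 5-cliques joined through a single node
for edge, c in forman_curvature(G).items():
    print(edge, c)
```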
Abstract: This talk explores the deep connections between two foundational pillars of twentieth-century statistics—survey sampling and experimental design. Though these connections became somewhat esoteric in the late twentieth century, they have experienced a revival through the modern framework of finite-population causal inference. Central to this connection is the concept of potential outcomes (or counterfactuals), first introduced by Neyman in 1923 and later expanded and formalized by Rubin in the 1970s. Through illustrative examples, we will show how the classical results developed in the early twentieth century can be reinterpreted and extended to address contemporary challenges, particularly as randomized experiments gain renewed prominence across the social, behavioral, and biomedical sciences.
Abstract: We describe a model for a network time series whose evolution is governed by an underlying stochastic process, known as the latent position process; under this model, the evolution of the network can be represented in Euclidean space by a curve, called the Euclidean mirror. We define the notion of a first-order changepoint for a time series of networks, and construct a family of latent position process networks with first-order changepoints. We show how a spectral estimate of the associated Euclidean mirror can localize these changepoints and provide simulated and real data examples of such localization.
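A rough illustration of the spectral idea, under our own simplifying choices (adjacency spectral embeddings, Procrustes-aligned distances, and classical multidimensional scaling); this is a toy sketch in the spirit of the Euclidean mirror, not the authors' estimator:

```python
# Embed each network spectrally, measure pairwise dissimilarities between the
# embeddings, and apply classical MDS to obtain a low-dimensional curve whose
# kinks suggest changepoints in the network time series.
import numpy as np

def ase(A, d=2):
    """Adjacency spectral embedding: top-d scaled eigenvectors."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def procrustes_dist(X, Y):
    """Distance between two embeddings up to an orthogonal rotation."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return np.linalg.norm(X @ (U @ Vt) - Y)

def cmds(D, k=1):
    """Classical multidimensional scaling of a distance matrix into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Toy time series of two-block networks whose within-block probability drifts
# after t = 10; the estimated 1-D curve bends near the changepoint.
rng = np.random.default_rng(0)
n, T = 200, 20
embeddings = []
for t in range(T):
    p_in = 0.5 if t < 10 else 0.5 + 0.02 * (t - 9)
    P = np.full((n, n), 0.2)
    P[:n // 2, :n // 2] = P[n // 2:, n // 2:] = p_in
    A = (rng.uniform(size=(n, n)) < P).astype(float)
    A = np.triu(A, 1); A = A + A.T
    embeddings.append(ase(A, d=2))
D = np.array([[procrustes_dist(X, Y) for Y in embeddings] for X in embeddings])
mirror = cmds(D, k=1)
```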
Abstract: Image-on-scalar regression has been a popular approach to modeling the association between brain activities and scalar characteristics in neuroimaging research. These associations can be heterogeneous across individuals in the population, as indicated by recent large-scale neuroimaging studies, e.g., the Adolescent Brain Cognitive Development (ABCD) study. The ABCD data can inform our understanding of such heterogeneous associations and how to leverage the heterogeneity to tailor interventions so that more youths benefit. It is of great interest to identify subgroups of individuals from the population such that: 1) within each subgroup the brain activities have homogeneous associations with the clinical measures; 2) across subgroups the associations are heterogeneous; and 3) the group allocation depends on individual characteristics. We propose a latent subgroup image-on-scalar regression model (LASIR) to analyze large-scale, multi-site neuroimaging data with diverse sociodemographics. LASIR introduces a latent subgroup for each individual and group-specific, spatially varying effects, with an efficient stochastic expectation-maximization algorithm for inference. We demonstrate that LASIR outperforms existing alternatives for subgroup identification of brain activation patterns with functional magnetic resonance imaging data via comprehensive simulations and applications to the ABCD study. We further discuss how insights into subgroup-specific heterogeneity can inform the generalizability of findings in population neuroscience.
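As a toy illustration of the latent-subgroup idea, here is a deliberately simplified EM for a mixture of voxelwise regressions. It omits LASIR's spatially varying effects, covariate-dependent group allocation, multi-site structure, and the stochastic E-step; all names and structure are ours, shown only to make the subgrouping mechanism concrete:

```python
# Toy EM: each individual i has a vectorized image Y_i and a scalar covariate x_i;
# each latent subgroup g has its own voxelwise coefficient vector beta_g.
import numpy as np

def em_latent_subgroup(Y, x, G=2, n_iter=100, seed=0):
    """Y: (n, V) vectorized images; x: (n,) scalar covariate."""
    rng = np.random.default_rng(seed)
    n, V = Y.shape
    beta = rng.standard_normal((G, V))           # subgroup-specific voxel effects
    pi, sigma2 = np.full(G, 1.0 / G), 1.0
    for _ in range(n_iter):
        # E-step: responsibilities from Gaussian log-likelihoods (log-sum-exp trick).
        resid = Y[None, :, :] - x[None, :, None] * beta[:, None, :]      # (G, n, V)
        loglik = (-0.5 * (resid ** 2).sum(axis=2) / sigma2
                  - 0.5 * V * np.log(2 * np.pi * sigma2))                # (G, n)
        logr = np.log(pi)[:, None] + loglik
        logr -= logr.max(axis=0, keepdims=True)
        r = np.exp(logr); r /= r.sum(axis=0, keepdims=True)
        # M-step: weighted least squares per subgroup and voxel.
        pi = r.mean(axis=1)
        denom = (r * x[None, :] ** 2).sum(axis=1, keepdims=True)
        beta = (r[:, :, None] * x[None, :, None] * Y[None, :, :]).sum(axis=1) / denom
        resid = Y[None, :, :] - x[None, :, None] * beta[:, None, :]
        sigma2 = (r[:, :, None] * resid ** 2).sum() / (n * V)
    return beta, pi, r

# Toy data: two subgroups with opposite-signed effects in half of the voxels.
rng = np.random.default_rng(1)
n, V = 200, 50
x = rng.standard_normal(n)
true_beta = np.zeros((2, V)); true_beta[0, :25] = 1.0; true_beta[1, :25] = -1.0
g = rng.integers(0, 2, n)
Y = x[:, None] * true_beta[g] + 0.5 * rng.standard_normal((n, V))
beta_hat, pi_hat, resp = em_latent_subgroup(Y, x, G=2)
```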
Abstract: In this article, we discuss a Bayesian Empirical Likelihood (BayesEL)-based method for complex survey data, leading to applications in non-probability sampling. Bayesian formulation of complex survey data presents several computational, methodological, and philosophical problems. Since the observations are sampled with unequal probabilities, the distributions of the observations in the population and in the sample differ. Thus, when constructing the posterior, a practitioner must choose between these two distributions. Related to this, we also need to consider whether the posterior is constructed based on the sample or on the whole of the finite population. This question is in turn related to the method used to incorporate the information in the sampling weights into the procedure. We will show that BayesEL provides a systematic way to construct a posterior for complex survey data by addressing all of the above questions. Some properties of the posterior, e.g., asymptotic validity and objective prior construction, will be discussed. Finally, an application to the non-probability sampling problem will be presented.
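For orientation, a generic BayesEL construction is sketched below. This is the standard formulation built from estimating equations; the survey-weighted version discussed here may differ in how the weights enter:

```latex
% Given estimating functions g(x_i, \theta) for sampled units i = 1, ..., n,
% the empirical likelihood profiles out the cell probabilities p_i:
\[
  L_{\mathrm{EL}}(\theta)
  \;=\; \max\Bigl\{\, \prod_{i=1}^{n} n\,p_i \;:\;
        p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1,\
        \sum_{i=1}^{n} p_i\, g(x_i,\theta) = 0 \Bigr\},
\]
% and the BayesEL (pseudo-)posterior combines this with a prior \pi(\theta):
\[
  \pi(\theta \mid x_1,\dots,x_n) \;\propto\; \pi(\theta)\, L_{\mathrm{EL}}(\theta).
\]
% For unequal-probability samples, the sampling weights typically enter through
% the estimating functions g (e.g., Horvitz--Thompson-type constraints).
```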
Abstract: Many modern learning tasks require models that can take inputs of varying sizes. Consequently, dimension-independent architectures have been proposed for domains where the inputs are graphs, sets, and point clouds. Recent work on graph neural networks has explored whether a model trained on low-dimensional data can transfer its performance to higher-dimensional inputs. We extend this body of work by introducing a general framework for transferability across dimensions. We show that transferability corresponds precisely to continuity in a limit space formed by identifying small problem instances with equivalent large ones. This identification is driven by the data and the learning task. We instantiate our framework on existing architectures and implement the necessary changes to ensure their transferability. Finally, we provide principles for designing new transferable models. Numerical experiments support our findings.
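As a small, self-contained illustration of a dimension-independent architecture (our own toy, not from the talk), consider a DeepSets-style model with mean pooling: its parameters do not depend on the input size, so a model fit on small sets evaluates directly on larger ones, and transferability becomes a question of how its outputs behave as the set grows:

```python
# A set model whose parameter shapes are independent of the number of elements n.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 16
W1 = rng.standard_normal((d_in, d_hidden)) / np.sqrt(d_in)    # per-element map
w2 = rng.standard_normal(d_hidden) / np.sqrt(d_hidden)        # readout

def set_model(X):
    """X: (n, d_in) set of n elements; the same parameters handle any n."""
    H = np.tanh(X @ W1)          # feature map shared across elements
    pooled = H.mean(axis=0)      # permutation- and size-invariant pooling
    return pooled @ w2

# The same parameters evaluate sets of any size; with mean pooling the output
# stabilizes as n grows, one concrete sense of "continuity in the limit".
for n in (10, 100, 1000, 10000):
    X = rng.standard_normal((n, d_in))
    print(n, set_model(X))
```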