Abstract: We consider the problem of graph matchability in non-identically distributed networks. In a general class of edge-independent networks, we demonstrate that graph matchability can be lost with high probability when matching the networks directly. We further demonstrate that under mild model assumptions, matchability is almost perfectly recovered by centering the networks using Universal Singular Value Thresholding before matching. These theoretical results are then demonstrated in both real and synthetic simulation settings. We also recover analogous core-matchability results in a very general core-junk network model, wherein some vertices do not correspond between the graph pair.
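The centering step can be illustrated with a minimal numpy sketch of Universal Singular Value Thresholding; the threshold constant and the toy constant-probability network below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def usvt(A, eta=0.01):
    """Universal Singular Value Thresholding (Chatterjee): estimate
    P = E[A] by keeping only singular values above a universal
    threshold of order sqrt(n), then clipping to [0, 1]."""
    n = A.shape[0]
    U, s, Vt = np.linalg.svd(A)
    keep = s >= (2.0 + eta) * np.sqrt(n)   # universal threshold (assumed constant)
    P_hat = (U[:, keep] * s[keep]) @ Vt[keep, :]
    return np.clip(P_hat, 0.0, 1.0)

# centering: subtract the estimated mean structure before matching
rng = np.random.default_rng(0)
P = np.full((100, 100), 0.3)              # assumed toy edge probabilities
A = rng.binomial(1, np.triu(P, 1))
A = A + A.T                               # symmetric, hollow adjacency
A_centered = A - usvt(A)
```

The matching step would then operate on `A_centered` rather than `A` itself.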
Abstract: Data analysis should be adaptive: researchers should be able to modify their analyses based on data exploration and previous analysis. Holdout methods allow for this; however, multiple reuse of the holdout set can lead to incorrect conclusions. Researchers have previously shown that holdout sets can be reused for adaptive analysis using differential privacy techniques. In this talk, I present an extension of this research from a binomial response variable to a continuous response, with potential applications in my research at the Institute for Defense Analyses (IDA).
IDA is a not-for-profit company that runs three Federally Funded Research and Development Centers (FFRDCs). FFRDCs are centers that are sponsored by, and conduct research for, various government agencies. Graduate students in STEM fields have likely heard of some of the more well-known FFRDCs without ever learning the term "FFRDC". For example, the "National Labs", such as Los Alamos National Laboratory and Oak Ridge National Laboratory, are FFRDCs sponsored by the Department of Energy. The public-private partnerships offered by FFRDCs provide unique opportunities to meet the research needs of government organizations in challenging, cooperative environments.
Abstract: An interplay between coherence and logistic regression is discussed. Interaction terms expressed as products of covariates may prove useful in logistic regression for binary time series, even when their factors are not significant. To identify potentially useful interaction terms, a graphical spectral tool, a function of lag or delay referred to as residual coherence, is introduced. Potentially useful interaction terms are identified by the size or prominence of their residual coherence. Instead of direct significance testing in terms of the residual coherence, the identified covariates are tested for their significance within the logistic regression.
Abstract: Graph embeddings, a class of dimensionality reduction techniques designed for relational data, have proven useful in exploring and modeling network structure. Most dimensionality reduction methods allow out-of-sample extensions, by which an embedding can be applied to observations not present in the training set. Applied to graphs, the out-of-sample extension problem concerns how to compute the embedding of a vertex that is added to the graph after an embedding has already been computed. In this talk, we will consider the out-of-sample extension problem for two graph embedding procedures: the adjacency spectral embedding and the Laplacian spectral embedding. In both cases, we prove that when the underlying graph is generated according to a latent space model called the random dot product graph, which includes the popular stochastic block model as a special case, an out-of-sample extension based on a least-squares objective obeys a central limit theorem. Our results also yield a convenient framework in which to analyze trade-offs between estimation accuracy and computational expenses, which we will explore briefly.
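The least-squares out-of-sample extension for the adjacency spectral embedding can be sketched in a few lines of numpy. The rank-one random dot product graph and the latent position 0.6 below are illustrative assumptions for the toy example.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: top-d eigenvectors scaled by
    the square roots of the eigenvalue magnitudes."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def out_of_sample(X_hat, a):
    """Embed a new vertex from its adjacency vector `a` to the n
    in-sample vertices by least squares: argmin_w ||a - X_hat w||."""
    w, *_ = np.linalg.lstsq(X_hat, a, rcond=None)
    return w

# toy rank-one random dot product graph with latent position 0.6 (assumed)
rng = np.random.default_rng(1)
n = 200
x = np.full(n, 0.6)
A = rng.binomial(1, np.triu(np.outer(x, x), 1))
A = A + A.T
X_hat = ase(A, d=1)
a_new = rng.binomial(1, x * 0.6)      # new vertex with the same latent position
w_new = out_of_sample(X_hat, a_new)   # recovers ~0.6 up to sign
```

The computational appeal is that the extension costs a single least-squares solve rather than re-embedding the augmented graph.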
Abstract: The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. The common subspace independent-edge (COSIE) multiple random graph model addresses this gap, by describing a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The COSIE model is both flexible to account for important graph differences and tractable to allow for accurate spectral inference. In both simulated and real data, the model can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing, and community detection.
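The shared-subspace structure of COSIE can be sketched by sampling from it directly; the two-community basis, block sizes, and score matrices below are illustrative assumptions.

```python
import numpy as np

def sample_cosie(V, Rs, rng):
    """Sample a COSIE collection: a shared orthonormal basis V (n x d)
    and per-graph score matrices R_i give P_i = V R_i V^T."""
    graphs = []
    for R in Rs:
        P = np.clip(V @ R @ V.T, 0.0, 1.0)
        A = rng.binomial(1, np.triu(P, 1))   # sample the upper triangle
        graphs.append(A + A.T)               # symmetric, hollow adjacency
    return graphs

# illustrative two-community basis shared by both graphs
rng = np.random.default_rng(3)
n = 100
V = np.zeros((n, 2))
V[:50, 0] = V[50:, 1] = 1 / np.sqrt(50)
Rs = [np.array([[15.0, 5.0], [5.0, 15.0]]),    # graph 1: assortative blocks
      np.array([[10.0, 10.0], [10.0, 10.0]])]  # graph 2: flat connectivity
A1, A2 = sample_cosie(V, Rs, rng)
```

The two graphs share the vertex subspace spanned by `V` but differ in connectivity through their score matrices, which is exactly the heterogeneity the model is built to capture.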
Abstract: Determining how certain properties are related to other properties is fundamental to scientific discovery; further investigations into the geometry of the relationship and future predictions are warranted only if two properties are significantly related. To better discover any type of relationship underlying paired sample data, we introduce the multiscale graph correlation (MGC), which combines distance correlation, the locality principle, and smoothed maximum to yield a new and powerful dependency measure. We prove that MGC is consistent for testing independence, enjoys a number of desirable theoretical properties, exhibits empirical power advantages against a wide range of nonlinear and high-dimensional dependencies, and can be efficiently implemented and utilized for real data exploration.
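MGC itself adds local scale selection and a smoothed maximum, but its distance-correlation backbone is compact enough to sketch; the code below is plain sample distance correlation for one-dimensional data, not the full MGC statistic.

```python
import numpy as np

def dcorr(x, y):
    """Sample distance correlation (the global statistic that MGC
    localizes): double-center the pairwise distance matrices and
    correlate them entrywise."""
    def centered(z):
        D = np.abs(z[:, None] - z[None, :])                         # pairwise distances
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    A, B = centered(np.asarray(x, float)), centered(np.asarray(y, float))
    dcov2 = (A * B).mean()
    dvarx, dvary = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvarx * dvary))

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
r_dep = dcorr(x, x ** 2)                      # nonlinear dependence: clearly positive
r_ind = dcorr(x, rng.standard_normal(200))    # independence: near zero
```

Unlike Pearson correlation, which is zero for the quadratic relationship above, distance correlation detects it; MGC improves power further by restricting the sums to local neighborhoods.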
Abstract: We consider sparse Bayesian estimation in the classical multivariate linear regression model with p regressors and q response variables. In univariate Bayesian linear regression with a single response y, shrinkage priors that can be expressed as scale mixtures of normal densities are a popular approach for obtaining sparse estimates of the coefficients. In this paper, we extend the use of these priors to the multivariate case to estimate a p × q coefficient matrix B. Our method can be used for any sample size n and any dimension p; moreover, we show that the posterior distribution can consistently estimate B even when p grows at a nearly exponential rate with the sample size. Our method's finite-sample performance is demonstrated through simulations and data analysis.
Abstract: When searching for gene pathways leading to specific disease outcomes, additional information on gene characteristics is often available that may help differentiate genes related to the disease from irrelevant background genes when connections involving both types of genes are observed and their relationships to the disease are unknown. We propose a method that singles out irrelevant background genes with the help of auxiliary information through a logistic regression, and clusters relevant genes into cohesive groups using the adjacency matrix. An expectation-maximization algorithm is modified to maximize a joint pseudo-likelihood, assuming latent indicators for relevance to the disease and latent group memberships, as well as Poisson or multinomial distributed link numbers within and between groups. A robust version allowing arbitrary linkage patterns within the background is further derived. Asymptotic consistency of label assignments under the stochastic blockmodel is proven. Superior performance and robustness in finite samples are observed in simulation studies. The proposed robust method identifies previously missed gene sets underlying autism-related neurological diseases using diverse data sources, including de novo mutations, gene expressions, and protein-protein interactions. In addition, we propose an integrative network analysis framework that combines the likelihoods or pseudo-likelihoods of heterogeneous network data. For example, in studying gene expression and protein-protein interaction data, when the cluster structure is reflected in the mean values of gene expression, an empirical Bayesian hierarchical model is combined with the stochastic block model to identify functional groups. In analyzing protein-protein interaction and gene ontology data, a correlation coefficient matrix with blocked structure is combined with the stochastic block model to identify protein complexes. Asymptotic consistency of the group membership estimates is proven. Superior performance of the integrative methods compared to single-source methods is observed in simulation studies, and empirical guidelines for choosing between integrative and separate analysis are provided.
Abstract: Wind tunnel tests are crucial to the design of tall structures. Scale models are outfitted with pressure taps at many locations of interest, such as the center of the roof. Each tap measures pressure at one location for the duration of the test. Since the tap measurement is recorded at regular time intervals, the data produced form a regular time series. Wind engineers are typically concerned with very high and very low (suction) pressures. Peaks-over-threshold (POT) extreme value models are one of two main approaches used by wind engineers for these data. However, POT models require the choice of a threshold, which can influence the final results, sometimes substantially, because the threshold choice controls the data that enter into the analysis. In this talk a method for combining results from multiple thresholds is considered, thereby eliminating the need to choose only one. The focus is on estimating the distribution of the maximum or minimum value in wind tunnel tests. The new method is compared to several techniques for choosing a single threshold using a large collection of pressure series from wind tunnel tests. The comparison shows that choosing a single threshold underestimates the uncertainty associated with predicting a future peak value.
Abstract: In this talk, I present computational methodologies for extracting dynamic neural functional networks that underlie behavior. These methods aim at capturing the sparsity, dynamicity and stochasticity of these networks, by integrating techniques from high-dimensional statistics, point processes, state-space modeling, and adaptive filtering. I demonstrate their utility using several case studies involving auditory processing, including 1) functional auditory-prefrontal interactions during attentive behavior in the ferret brain, 2) network-level signatures of decision-making in the mouse primary auditory cortex, and 3) cortical dynamics of speech processing in the human brain.
Abstract: In modern psychological and biomedical research with diagnostic purposes, scientists often formulate the key task as inferring fine-grained latent information under structural constraints. These structural constraints usually come from domain experts' prior knowledge or insight. The emerging family of Structured Latent Attribute Models (SLAMs) accommodates these modeling needs and has received substantial attention in psychology, education, and epidemiology. SLAMs bring exciting opportunities and unique challenges. In particular, with high-dimensional discrete latent attributes and structural constraints encoded by a design matrix, one needs to balance the gain in the model's explanatory power and interpretability against the difficulty of understanding and handling the complex model structure.
In the first part of this talk, I present identifiability results that advance the theoretical knowledge of how the design matrix influences the estimability of SLAMs. The new identifiability conditions guide real-world practices of designing cognitive diagnostic tests and also lay the foundation for drawing valid statistical conclusions. In the second part, I introduce a statistically consistent penalized likelihood approach to selecting significant latent patterns in the population. I also propose a scalable computational method. These developments explore an exponentially large model space involving many discrete latent variables, and they address the estimation and computation challenges of high-dimensional SLAMs arising in large-scale scientific measurements. The application of the proposed methodology to the data from an international educational assessment reveals meaningful knowledge structures and latent subgroups of the student populations.
Abstract: This talk is concerned with noisy matrix completion: given partial and corrupted entries of a large low-rank matrix, how does one estimate and infer the underlying matrix? Arguably one of the most popular paradigms for tackling this problem is convex relaxation, which achieves remarkable efficacy in practice. However, the statistical stability guarantees of this approach are still far from optimal in the noisy setting, falling short of explaining its empirical success. Moreover, it is generally very challenging to pin down the distribution of the convex estimator, which presents a major roadblock in assessing the uncertainty, or "confidence", of the obtained estimates, a crucial task at the core of statistical inference.
Our recent work makes progress towards understanding stability and uncertainty quantification for noisy matrix completion. When the rank of the unknown matrix is a constant: (1) we demonstrate that the convex estimator achieves near-optimal estimation errors vis-à-vis random noise; (2) we develop a de-biased estimator that admits accurate distributional characterizations, thus enabling asymptotically optimal inference of the low-rank factors and the entries of the matrix. All of this is enabled by bridging convex relaxation with the nonconvex Burer-Monteiro approach, a seemingly distinct algorithmic paradigm that is provably robust against noise.
Abstract: In many large-scale surveys, estimates are produced for numerous small domains defined by cross-classifications of demographic, geographic, and other variables. Even though the overall sample size of such surveys might be very large, sample sizes for domains are sometimes too small for reliable estimation. We propose an improved estimation approach that is applicable when "natural" or qualitative relationships (such as orderings or other inequality constraints) can be formulated for the domain means at the population level. We stay within a design-based inferential framework but impose constraints representing these relationships on the sample-based estimates. The resulting constrained domain estimator is shown to be design consistent and asymptotically normally distributed as long as the constraints are asymptotically satisfied at the population level. The estimator and its associated variance estimator are readily implemented in practice. The applicability of the method is illustrated on data from the 2015 U.S. National Survey of College Graduates.
Abstract: In this talk, several aspects of complex data in event history analysis, with applications to process data from educational measurement and electronic health records, will be discussed. In an exploratory analysis of process data, a large number of possibly time-varying covariates may be included. These covariates, along with the high-dimensional counting processes, often exhibit a low-dimensional structure that has a meaningful interpretation. We explore such structure by specifying random coefficients in a low-dimensional space. Furthermore, to obtain a parsimonious model and to improve the interpretation of its parameters, variable selection and estimation for both fixed and random effects are developed via penalized likelihood. We establish a set of sufficient conditions for the identifiability of the model and show that the proposed penalized estimators perform as well as the oracle procedure in variable selection. For electronic health records, we illustrate that joint modeling of disease occurrence and drug prescription is preferred, and we propose a multivariate proportional intensity model with generalized random coefficients, in which a flexible time-varying effect of the random coefficients can be included. Furthermore, in the presence of rare events and sparse covariates, standard asymptotic theory is no longer applicable; we establish consistency and asymptotic normality in this setting.
Abstract: Performing statistical analyses on collections of graphs is of import to many disciplines, but principled, scalable methods for multisample graph inference are few. In this talk, we describe a joint, or "omnibus," spectral embedding in which multiple graphs on the same vertex set are jointly embedded into a single space with a distinct representation for each graph. We prove a central limit theorem for this omnibus embedding, and we show that this simultaneous embedding into a single common space allows for the comparison of graphs without further pairwise subspace alignments. The existence of multiple embedded points for each vertex renders possible the resolution of important multiscale graph inference goals, such as the identification of specific subgraphs or vertices as drivers of similarity or difference across large networks. We conclude with two analyses of connectomic graphs generated from MRI scans of the brain in human subjects, and we show how the omnibus embedding can be used to detect statistically significant differences, at multiple scales, across these networks.
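The omnibus construction itself is simple to sketch in numpy: block (i, j) of the omnibus matrix averages graphs i and j, and a single spectral embedding of that matrix yields one point per vertex per graph. The graph sizes and densities below are illustrative.

```python
import numpy as np

def omnibus(As):
    """Omnibus matrix: block (i, j) is (A_i + A_j) / 2, so the
    diagonal blocks are the A_i themselves."""
    m, n = len(As), As[0].shape[0]
    M = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(m):
            M[i*n:(i+1)*n, j*n:(j+1)*n] = (As[i] + As[j]) / 2.0
    return M

def omni_embed(As, d):
    """Joint spectral embedding: rows i*n..(i+1)*n-1 embed graph i."""
    M = omnibus(As)
    vals, vecs = np.linalg.eigh(M)
    idx = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

rng = np.random.default_rng(2)
def er(n, p):
    """Helper: symmetric, hollow Bernoulli(p) adjacency matrix."""
    A = np.triu(rng.binomial(1, p, (n, n)), 1)
    return A + A.T

A1, A2 = er(50, 0.3), er(50, 0.3)
Z = omni_embed([A1, A2], d=1)   # 100 x 1: two embedded points per vertex
```

Because all graphs land in one common space, rows of `Z` belonging to the same vertex in different graphs can be compared directly, with no pairwise subspace alignment.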
Abstract: A two-stage normal hierarchical model called the Fay-Herriot model and the empirical Bayes estimator are widely used to provide indirect and model-based estimates of means in small areas. However, the performance of the empirical Bayes estimator might be poor when the assumed normal distribution is misspecified. In this article, we propose a simple modification by using density power divergence and suggest a new robust empirical Bayes small area estimator. The mean squared error and estimated mean squared error of the proposed estimator are derived based on the asymptotic properties of the robust estimator of the model parameters. We investigate the numerical performance of the proposed method through simulations.
Abstract: This talk will first address the way in which design-based sample survey research questions seem to differ from questions in mainstream Statistics. These questions will be shown to fit naturally into a semiparametric view of biased-sampling inference. In the context where population data has random effects shared within clusters, the problem of design- and model-based inference from informative probability-survey cluster samples has been partially but not completely resolved.
Known and new results on this problem will be discussed in the setting of finite 'super'-population data assumed to satisfy a two-way random-effects ANOVA model. A new EM algorithm based on a census pseudolikelihood (augmented by unobservable random cluster effects) will be described and shown to provide consistent parameter estimates, using only single-inclusion weights, whenever sampling within clusters is noninformative. A simulation study is used to show that this kind of consistency does not hold for some previously proposed methods based on survey-weighted loglikelihoods. General reasoning will be given to show that consistency of estimation of variance components under all kinds of informative sampling, based only on single-inclusion weights, is an impossible goal.
Abstract: The computation of parametric estimates often involves iterative numerical approximations, which introduce numerical error. But when these estimates depend on random observations, they necessarily involve statistical error as well. Thus the common approach of minimizing numerical error without accounting for inherent statistical error can be both costly and wasteful, since it results in no improvement to the estimator's accuracy. We quantify this tradeoff between numerical and statistical error in a problem of estimating the eigendecomposition for the mean of a random matrix from its observed value, and show that one can save significant computation by terminating the iterative procedure early, with no loss of accuracy. We demonstrate this in a setting of estimating the latent positions of a random network from the observed adjacency matrix, on real and simulated data.
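The idea can be illustrated with a power iteration stopped at a tolerance on the order of the statistical error rather than machine precision. The n**-0.5 tolerance and the toy network below are illustrative assumptions, not the paper's exact analysis.

```python
import numpy as np

def power_iteration(A, tol, max_iter=1000):
    """Leading eigenpair by power iteration, stopped once successive
    iterates move less than `tol`; driving numerical error far below
    the statistical error already present in A buys no accuracy."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(max_iter):
        w = A @ v
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            v = w
            break
        v = w
    return v @ A @ v, v

# toy network: spectral estimation error scales like n**-0.5,
# so that is a natural (assumed) stopping tolerance
rng = np.random.default_rng(4)
n = 400
A = np.triu(rng.binomial(1, 0.1, (n, n)), 1)
A = A + A.T
lam, v = power_iteration(A, tol=1.0 / np.sqrt(n))
```

With the loose tolerance the loop exits after only a handful of multiplications, while the eigenvalue estimate remains well within the statistical fluctuation of the observed adjacency matrix.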
Abstract: The problem of variable selection in finite mixture of regression models has been the focus of some research over the last decade. The goal is to uncover latent classes and identify component-specific relevant predictors in a unified manner. This is achieved by combining ideas of mixture models, regression models and variable selection. I will present some of the methods we have proposed in this context, including a stochastic partitioning method to relate two high-dimensional datasets, a penalized mixture of multivariate generalized linear regression models, and a mixture of regression trees approach. I will illustrate the methods with various applications.
Abstract: Although combination antiretroviral therapy (ART) is highly effective in suppressing viral load for people with HIV (PWH), many ART agents may exacerbate central nervous system (CNS)-related adverse effects, including depression. Therefore, understanding the effects of ART drugs on CNS function, especially mental health, can help clinicians personalize medicine with fewer adverse effects for PWH and prevent them from discontinuing their ART, avoiding undesirable health outcomes and an increased likelihood of HIV transmission. The emergence of electronic health records offers researchers unprecedented access to HIV data, including individuals' mental health records, drug prescriptions, and clinical information over time. However, modeling such data is very challenging due to the high dimensionality of the drug combination space, individual heterogeneity, and the sparseness of the observed drug combinations. We develop a Bayesian nonparametric approach to learn drug combination effects on mental health in PWH, adjusting for socio-demographic, behavioral, and clinical factors. The proposed method is built upon the subset-tree kernel method, which represents drug combinations in a way that synthesizes known regimen structure into a single mathematical representation. It also utilizes a distance-dependent Chinese restaurant process to cluster the heterogeneous population while taking into account individuals' treatment histories. Our method has clinical utility in guiding clinicians to prescribe more informed and effective personalized treatments based on individuals' treatment histories and clinical characteristics.
Abstract: The singular value matrix decomposition plays a ubiquitous role throughout statistics and related fields. Myriad applications including clustering, classification, and dimensionality reduction involve studying and exploiting the geometric structure of singular values and singular vectors. This paper provides a novel collection of technical and theoretical tools for studying the geometry of singular subspaces using the two-to-infinity norm. Motivated by preliminary deterministic Procrustes analysis, we consider a general matrix perturbation setting in which we derive a new Procrustean matrix decomposition. Together with flexible machinery developed for the two-to-infinity norm, this allows us to conduct a refined analysis of the induced perturbation geometry with respect to the underlying singular vectors even in the presence of singular value multiplicity. Our analysis yields singular vector entry-wise perturbation bounds for a range of popular matrix noise models, each of which has a meaningful associated statistical inference task. In addition, we demonstrate how the two-to-infinity norm is the preferred norm in certain statistical settings. Specific applications discussed in this paper include covariance estimation, singular subspace recovery, and multiple graph inference.
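For reference, the norm itself is elementary to compute: ||A||_{2->inf} is the largest Euclidean row norm of A, which is why it yields row-wise (entrywise) perturbation control rather than the aggregate control offered by the spectral norm. A minimal sketch:

```python
import numpy as np

def two_to_infinity(A):
    """||A||_{2->inf}: the maximum Euclidean row norm. It bounds the
    worst-affected row of a perturbation, whereas the spectral norm
    only bounds the perturbation in aggregate."""
    return float(np.max(np.linalg.norm(A, axis=1)))

A = np.array([[3.0, 4.0],
              [1.0, 0.0]])
val = two_to_infinity(A)     # rows have norms 5 and 1, so the value is 5
```

It always sits below the spectral norm, which is what makes entrywise bounds in this norm strictly finer for tall matrices with flat row profiles.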