Abstract: We consider the problem of graph matchability in non-identically distributed networks. In a general class of edge-independent networks, we demonstrate that graph matchability can be lost with high probability when matching the networks directly. We further demonstrate that under mild model assumptions, matchability is almost perfectly recovered by centering the networks using Universal Singular Value Thresholding before matching. These theoretical results are then demonstrated in both real and synthetic simulation settings. We also recover analogous core-matchability results in a very general core-junk network model, wherein some vertices do not correspond between the graph pair.
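The centering step mentioned above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the threshold constant `c`, the helper names `usvt_estimate` and `sample_graph`, and the two-block example model are all assumptions made here. The matching step itself is omitted.

```python
import numpy as np

def usvt_estimate(A, c=1.5):
    """Estimate the edge-probability matrix by singular value thresholding.

    Singular values at or below c * sqrt(n) are discarded; the constant c
    is an illustrative assumption, not a value taken from the abstract.
    """
    n = A.shape[0]
    U, s, Vt = np.linalg.svd(A)
    keep = s > c * np.sqrt(n)
    P_hat = (U[:, keep] * s[keep]) @ Vt[keep, :]
    # Edge probabilities must lie in [0, 1].
    return np.clip(P_hat, 0.0, 1.0)

def sample_graph(P, rng):
    """Draw a symmetric, hollow adjacency matrix with independent edges."""
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1).astype(float)
    return upper + upper.T

rng = np.random.default_rng(0)
n = 300
labels = np.repeat([0, 1], n // 2)          # hypothetical two-block structure
B = np.array([[0.6, 0.3], [0.3, 0.6]])
P = B[labels][:, labels]                    # true edge probabilities

A = sample_graph(P, rng)
P_hat = usvt_estimate(A)
A_centered = A - P_hat                      # centered network, ready for matching

err_raw = np.abs(A - P).mean()              # error of A itself as an estimate of P
err_usvt = np.abs(P_hat - P).mean()         # error of the thresholded estimate
```

In this sketch each network of a pair would be centered by its own estimate before being passed to a graph matching routine.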
Abstract: Data analysis should be adaptive, and researchers should be able to modify their analyses based on data exploration and previous analysis. Holdout methods allow for this; however, repeated reuse of the holdout set can lead to incorrect conclusions. Researchers have previously shown that holdout sets can be reused for adaptive analysis using differential privacy techniques. In this talk, I present an extension of that research from a binomial response variable to a continuous response, with potential applications in my research at the Institute for Defense Analyses (IDA).
IDA is a not-for-profit company that runs three Federally Funded Research and Development Centers (FFRDCs). FFRDCs are centers that are sponsored by, and conduct research for, various government agencies. Graduate students in STEM fields have likely heard of some of the more well-known FFRDCs without ever learning the term “FFRDC”. For example, the “National Labs”, such as Los Alamos National Laboratory and Oak Ridge National Laboratory, are FFRDCs sponsored by the Department of Energy. The public-private partnerships that FFRDCs provide offer unique opportunities to meet the research needs of government organizations in challenging, cooperative environments.
Abstract: An interplay between coherence and logistic regression is discussed. Interaction terms expressed as products of covariates may prove useful in logistic regression for binary time series, even when their factors are not significant. To identify potentially useful interaction terms, a graphical spectral tool, a function of lag or delay referred to as residual coherence, is introduced. Potentially useful interaction terms are identified by the size or prominence of their residual coherence. Instead of direct significance testing in terms of the residual coherence, the identified covariates are tested for their significance within logistic regression.
Abstract: Graph embeddings, a class of dimensionality reduction techniques designed for relational data, have proven useful in exploring and modeling network structure. Most dimensionality reduction methods allow out-of-sample extensions, by which an embedding can be applied to observations not present in the training set. Applied to graphs, the out-of-sample extension problem concerns how to compute the embedding of a vertex that is added to the graph after an embedding has already been computed. In this talk, we will consider the out-of-sample extension problem for two graph embedding procedures: the adjacency spectral embedding and the Laplacian spectral embedding. In both cases, we prove that when the underlying graph is generated according to a latent space model called the random dot product graph, which includes the popular stochastic block model as a special case, an out-of-sample extension based on a least-squares objective obeys a central limit theorem. Our results also yield a convenient framework in which to analyze trade-offs between estimation accuracy and computational expenses, which we will explore briefly.
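A least-squares out-of-sample extension for the adjacency spectral embedding can be sketched as below. This is an illustrative sketch under assumptions made here (the function names, the two-block latent positions, and the sample size are hypothetical), not the speaker's code. Because the embedding is identifiable only up to an orthogonal transformation, the check compares estimated edge probabilities rather than raw coordinates.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed a symmetric adjacency matrix into R^d using its d largest-magnitude eigenpairs."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def out_of_sample_ls(X_hat, a):
    """Embed a new vertex with edge vector a by solving min_w ||a - X_hat w||^2."""
    w, *_ = np.linalg.lstsq(X_hat, a, rcond=None)
    return w

rng = np.random.default_rng(0)
n, d = 500, 2

# Hypothetical latent positions: a two-block stochastic block model written as an RDPG.
X = np.where(rng.random(n)[:, None] < 0.5, [0.8, 0.2], [0.2, 0.8])
P = X @ X.T

# In-sample graph: independent edges, symmetric, no self-loops.
upper = np.triu(rng.random((n, n)) < P, k=1).astype(float)
A = upper + upper.T
X_hat = adjacency_spectral_embedding(A, d)

# A new vertex arrives with latent position x_new and edges to the in-sample vertices.
x_new = np.array([0.8, 0.2])
a = (rng.random(n) < X @ x_new).astype(float)
w_hat = out_of_sample_ls(X_hat, a)

# Estimated vs. true edge probabilities between the new vertex and the old ones.
prob_error = np.abs(X_hat @ w_hat - X @ x_new).mean()
```

The least-squares step costs one small linear solve per new vertex, whereas recomputing the full embedding would require a new eigendecomposition; this is the kind of accuracy versus computation trade-off the abstract alludes to.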
Abstract: The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. The common subspace independent-edge (COSIE) multiple random graph model addresses this gap by describing a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The COSIE model is both flexible enough to account for important graph differences and tractable enough to allow for accurate spectral inference. In both simulated and real data, the model can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing, and community detection.
Abstract: Determining how certain properties are related to other properties is fundamental to scientific discovery; further investigations into the geometry of the relationship and future predictions are warranted only if two properties are significantly related. To better discover any type of relationship underlying paired sample data, we introduce the multiscale graph correlation (MGC), which combines distance correlation, the locality principle, and smoothed maximum to yield a new and powerful dependency measure. We prove that MGC is consistent for testing independence, enjoys a number of desirable theoretical properties, exhibits empirical power advantages against a wide range of nonlinear and high-dimensional dependencies, and can be efficiently implemented and utilized for real data exploration.
Abstract: We consider sparse Bayesian estimation in the classical multivariate linear regression model with p regressors and q response variables. In univariate Bayesian linear regression with a single response y, shrinkage priors that can be expressed as scale mixtures of normal densities are a popular approach for obtaining sparse estimates of the coefficients. In this paper, we extend the use of these priors to the multivariate case to estimate a p × q coefficient matrix B. Our method can be used for any sample size n and any dimension p; moreover, we show that the posterior distribution can consistently estimate B even when p grows at a nearly exponential rate with the sample size. Our method's finite-sample performance is demonstrated through simulations and data analysis.
Abstract: When searching for gene pathways leading to specific disease outcomes, additional information on gene characteristics is often available that may help differentiate genes related to the disease from irrelevant background genes when connections involving both types of genes are observed and their relationships to the disease are unknown. We propose a method to single out irrelevant background genes with the help of auxiliary information through a logistic regression, and to cluster relevant genes into cohesive groups using the adjacency matrix. The expectation-maximization algorithm is modified to maximize a joint pseudo-likelihood assuming latent indicators for relevance to the disease and latent group memberships, as well as Poisson or multinomial distributed link numbers within and between groups. A robust version allowing arbitrary linkage patterns within the background is further derived. Asymptotic consistency of label assignments under the stochastic blockmodel is proven. Superior performance and robustness in finite samples are observed in simulation studies. The proposed robust method identifies previously missed gene sets underlying autism-related neurological diseases using diverse data sources, including de novo mutations, gene expressions, and protein-protein interactions. We further propose an integrative network analysis framework that combines the likelihoods or pseudo-likelihoods of heterogeneous network data. For example, in studying gene expression and protein-protein interaction data, when the cluster structure appears in the mean values of gene expression, an empirical Bayesian hierarchical model is combined with a stochastic block model to identify functional groups. In analyzing protein-protein interaction and gene ontology data, a correlation coefficient matrix with block structure is combined with a stochastic block model to identify protein complexes. Asymptotic consistency of the group membership estimates is proven.
Superior performance of the integrative methods compared to methods using a single data source is observed in simulation studies, and empirical guidelines for choosing between integrative and separate analysis are provided.
Abstract: Wind tunnel tests are crucial to the design of tall structures. Scale models are outfitted with pressure taps at many locations of interest, such as the center of the roof. Each tap measures pressure at one location for the duration of the test. Since the tap measurement is recorded at regular time intervals, the data produced form a regular time series. Wind engineers are typically concerned with very high and very low (suction) pressures. Peaks-over-threshold (POT) extreme value models are one of two main approaches used by wind engineers for these data. However, POT models require the choice of a threshold, which can influence the final results, sometimes substantially, because the threshold choice controls the data that enter into the analysis. In this talk a method for combining results from multiple thresholds is considered, thereby eliminating the need to choose only one. The focus is on estimating the distribution of the maximum or minimum value in wind tunnel tests. The new method is compared to several techniques for choosing a single threshold using a large collection of pressure series from wind tunnel tests. The comparison shows that choosing a single threshold underestimates the uncertainty associated with predicting a future peak value.
Abstract: In this talk, I present computational methodologies for extracting dynamic neural functional networks that underlie behavior. These methods aim at capturing the sparsity, dynamicity and stochasticity of these networks, by integrating techniques from high-dimensional statistics, point processes, state-space modeling, and adaptive filtering. I demonstrate their utility using several case studies involving auditory processing, including 1) functional auditory-prefrontal interactions during attentive behavior in the ferret brain, 2) network-level signatures of decision-making in the mouse primary auditory cortex, and 3) cortical dynamics of speech processing in the human brain.
Abstract: In modern psychological and biomedical research with diagnostic purposes, scientists often formulate the key task as inferring fine-grained latent information under structural constraints. These structural constraints usually come from the domain experts' prior knowledge or insight. The emerging family of Structured Latent Attribute Models (SLAMs) accommodates these modeling needs and has received substantial attention in psychology, education, and epidemiology. SLAMs bring exciting opportunities and unique challenges. In particular, with high-dimensional discrete latent attributes and structural constraints encoded by a design matrix, one needs to balance the gain in the model's explanatory power and interpretability against the difficulty of understanding and handling the complex model structure.
In the first part of this talk, I present identifiability results that advance the theoretical knowledge of how the design matrix influences the estimability of SLAMs. The new identifiability conditions guide real-world practices of designing cognitive diagnostic tests and also lay the foundation for drawing valid statistical conclusions. In the second part, I introduce a statistically consistent penalized likelihood approach to selecting significant latent patterns in the population. I also propose a scalable computational method. These developments explore an exponentially large model space involving many discrete latent variables, and they address the estimation and computation challenges of high-dimensional SLAMs arising in large-scale scientific measurements. The application of the proposed methodology to the data from an international educational assessment reveals meaningful knowledge structures and latent subgroups of the student populations.
Abstract: This talk is concerned with noisy matrix completion: given partial and corrupted entries of a large low-rank matrix, how to estimate and infer the underlying matrix? Arguably one of the most popular paradigms to tackle this problem is convex relaxation, which achieves remarkable efficacy in practice. However, the statistical stability guarantees of this approach are still far from optimal in the noisy setting, falling short of explaining empirical success. Moreover, it is generally very challenging to pin down the distributions of the convex estimator, which presents a major roadblock in assessing the uncertainty, or “confidence”, of the obtained estimates, a crucial task at the core of statistical inference.
Our recent work makes progress towards understanding stability and uncertainty quantification for noisy matrix completion. When the rank of the unknown matrix is a constant: (1) we demonstrate that the convex estimator achieves near-optimal estimation errors vis-à-vis random noise; (2) we develop a de-biased estimator that admits accurate distributional characterizations, thus enabling asymptotically optimal inference of the low-rank factors and the entries of the matrix. All of this is enabled by bridging convex relaxation with the nonconvex Burer-Monteiro approach, a seemingly distinct algorithmic paradigm that is provably robust against noise.
Abstract: In many large-scale surveys, estimates are produced for numerous small domains defined by cross-classifications of demographic, geographic and other variables. Even though the overall sample size of such surveys might be very large, sample sizes for domains are sometimes too small for reliable estimation. We propose an improved estimation approach that is applicable when "natural" or qualitative relationships (such as orderings or other inequality constraints) can be formulated for the domain means at the population level. We stay within a design-based inferential framework but impose constraints representing these relationships on the sample-based estimates. The resulting constrained domain estimator is shown to be design consistent and asymptotically normally distributed as long as the constraints are asymptotically satisfied at the population level. The estimator and its associated variance estimator are readily implemented in practice. The applicability of the method is illustrated on data from the 2015 U.S. National Survey of College Graduates.
Abstract: In this talk, several aspects of complex data in event history analysis will be discussed, with applications to process data from educational measurement and electronic health records. In an exploratory analysis of process data, a large number of possibly time-varying covariates may be included. These covariates, along with the high-dimensional counting processes, often exhibit a low-dimensional structure that has a meaningful interpretation. We explore such structure by specifying random coefficients in a low-dimensional space. Furthermore, to obtain a parsimonious model and to improve the interpretation of its parameters, variable selection and estimation for both fixed and random effects are developed by penalized likelihood. We establish a set of sufficient conditions for the identifiability of the model and show that the proposed penalized estimators perform as well as the oracle procedure in variable selection. In electronic health records, we illustrate that joint modeling of disease occurrence and drug prescription is preferred, and a multivariate proportional intensity model with generalized random coefficients is proposed, in which a flexible time-varying effect of the random coefficients can be included. Furthermore, in the presence of rare events and sparse covariates, standard asymptotic theory is no longer applicable. We establish consistency and asymptotic normality under such a setting.