Statistics Archives for Fall 2021 to Spring 2022

Statistical methods for merged data from multiple sources

When: Thu, September 10, 2020 - 3:30pm
Where: Zoom
Speaker: Takumi Saegusa (UMD)
Abstract: Various data sets collected from numerous sources have a
great potential to enhance the quality of inference and accelerate
scientific discovery. Inference for merged data is, however, quite
challenging because such data may contain unidentified duplication
from overlapping inhomogeneous sources, and each data set, often
opportunistically collected, induces complex dependence. In public
health research, for example, epidemiological studies have different
inclusion and exclusion criteria in contrast to hospital records
without a well-defined target population, and when combined with a
disease registry, patients appear in multiple data sets. In this talk,
we present several examples in public health research that could
benefit from data integration. We review existing
research, such as the random effects model approach and multiple frame
surveys, and discuss their limitations in view of inferential goals,
privacy protection, and large sample theory. We then propose our
estimation method in the context of the Cox proportional hazards
model. We illustrate our theory with simulations and real data examples.
If time permits, we will discuss extensions of our proposed method in
several directions.

High-Dimensional Low-Rank Tensor Autoregressive Time Series Modelling

When: Thu, September 17, 2020 - 3:30pm
Where: Zoom
Speaker: Yao Zheng (University of Connecticut)
Abstract: Modern technological advances have enabled an unprecedented amount of structured data with complex temporal dependence, creating a need for new methods to efficiently model and forecast high-dimensional tensor-valued time series. This work serves as the first thorough attempt in this direction via autoregression. By considering a low-rank Tucker decomposition for the transition tensor, the proposed tensor autoregression can flexibly capture the underlying low-dimensional tensor dynamics, providing both substantial dimension reduction and meaningful dynamic factor interpretation. For this model, we introduce both a low-dimensional rank-constrained estimator and high-dimensional regularized estimators, and derive their asymptotic and non-asymptotic properties. In particular, a novel convex regularization approach, based on the sum of nuclear norms of square matricizations, is proposed to efficiently encourage low-rankness of the coefficient tensor. A truncation method is further introduced to consistently select the Tucker ranks. Simulation experiments and real data analysis demonstrate the advantages of the proposed approach over various competing ones.
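
As a rough illustration of the low-rank Tucker structure mentioned in the abstract, the NumPy sketch below builds a small third-order tensor from a core and three factor matrices and checks the rank of each mode unfolding. The dimensions, ranks, and helper names (`tucker_tensor`, `unfold`) are assumptions chosen for illustration and are not taken from the talk. The nuclear norms are computed for ordinary one-mode unfoldings, a simpler stand-in for the square matricizations the abstract refers to.

```python
import numpy as np

def tucker_tensor(core, factors):
    """Reconstruct a third-order tensor from a Tucker core and three factor matrices."""
    u1, u2, u3 = factors
    return np.einsum("abc,ia,jb,kc->ijk", core, u1, u2, u3)

def unfold(tensor, mode):
    """Mode-`mode` matricization: rows index the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Hypothetical sizes: a 6 x 5 x 4 tensor with Tucker ranks (2, 3, 2).
rng = np.random.default_rng(0)
dims, ranks = (6, 5, 4), (2, 3, 2)
core = rng.standard_normal(ranks)
factors = [rng.standard_normal((d, r)) for d, r in zip(dims, ranks)]
tensor = tucker_tensor(core, factors)

# Each unfolding has rank equal to the corresponding Tucker rank.
mode_ranks = [int(np.linalg.matrix_rank(unfold(tensor, m))) for m in range(3)]
# Nuclear norm (sum of singular values) of each one-mode unfolding.
nuclear_norms = [float(np.linalg.norm(unfold(tensor, m), "nuc")) for m in range(3)]
```

The tensor has 120 entries but is determined by a 12-entry core and three thin factor matrices, which is the kind of dimension reduction the abstract describes.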

Properties and numerical evaluation of the Rosenblatt distribution

When: Thu, September 24, 2020 - 3:30pm
Where: Zoom
Speaker: Murad Taqqu (Boston University)
Abstract: The Rosenblatt distribution is named after Murray Rosenblatt, who passed away very recently, in October 2019. It is the simplest non-Gaussian distribution that arises in non-central limit theorems involving long-range dependent random variables. However, no closed form is known, which is a problem if one wants to compute confidence intervals, for example. In this talk, I shall describe a numerical method which provides insight into characteristic features of that distribution and seems to produce very good results. This method also gives rise to an interesting conjecture.

This is joint work with Mark Veillette.

Bias-Variance Tradeoffs in Joint Spectral Embeddings

When: Thu, October 1, 2020 - 3:30pm
Where: Zoom
Speaker: Daniel L. Sussman (Boston University)
Abstract: We consider the ramifications of utilizing biased latent position estimates in subsequent statistical analysis in exchange for sizable variance reductions in finite networks. We establish an explicit bias-variance tradeoff for latent position estimates produced by the omnibus embedding in the presence of heterogeneous network data. We reveal an analytic bias expression, derive a uniform concentration bound on the residual term, and prove a central limit theorem characterizing the distributional properties of these estimates.

Deep neural assisted integro-difference equation statistical models for spatio-temporal forecasting

When: Thu, October 8, 2020 - 3:30pm
Where: Zoom
Speaker: Christopher K. Wikle (University of Missouri)
Abstract: Spatio-temporal data are ubiquitous in the sciences and engineering, and their study is important for understanding and predicting a wide variety of processes. One of the difficulties with statistical modeling of spatial processes that change in time is the complexity of the dependence structures that must describe how such a process varies, and the presence of high-dimensional complex datasets and large prediction domains. It is particularly challenging to specify parameterizations for nonlinear dynamic spatio-temporal models (DSTMs) that are simultaneously useful scientifically, efficient computationally, and allow for proper uncertainty quantification. Here we describe a recent approach that utilizes a deep convolutional neural network to learn the kernel mapping function in a state-dependent integro-difference equation (IDE) DSTM. We implement the approach using an ensemble Kalman filter for efficient computation. The model is trained on daily sea surface temperatures in the Atlantic Ocean and is used to generate forecasts. Importantly, the model has the remarkable “transfer learning” ability to predict a process (weather radar storm cell movement) completely different from the sea surface temperature data on which it was trained.

This is joint work with Andrew Zammit-Mangion, University of Wollongong, Australia.

Estimation of Possibly Non-Stationary Long Memory Processes via Adaptive Overdifferencing

When: Thu, October 15, 2020 - 3:30pm
Where: Zoom
Speaker: Maryclare Griffin (University of Massachusetts, Amherst)
Abstract: Time series data is prevalent; however, common time series models can fail to capture the complex dynamics of time series data in practice. In this paper, we focus on a specific popular model - the ARFIMA(p, d, q) model - and the assumption of stationarity. An ARFIMA(p, d, q) model is stationary when the d-th differences of the time series data are distributed according to a stationary ARMA(p, q) model, and the differencing parameter d is constrained to be less than or equal to 0.5. Assuming that the differencing parameter d is less than or equal to 0.5 is technically convenient, but may not be appropriate in practice. In this paper, we make a simple observation that facilitates exact and approximate likelihood-based inference for the parameters of the ARFIMA(p, d, q) model given an upper bound for the differencing parameter d that can exceed 0.5. We explore how estimation of the differencing parameter d depends on the upper bound, and introduce adaptive procedures for choosing the upper bound. Via simulations we demonstrate that our adaptive exact likelihood procedures estimate the differencing parameter d well even when the true differencing parameter d is as large as 2.5, can be used to obtain confidence intervals for the differencing parameter that achieve nominal coverage rates, perform favorably relative to existing alternatives, and can be made approximate. We conclude by applying our adaptive procedures to several real data sets.
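
For readers unfamiliar with fractional differencing, the sketch below computes the coefficients of the filter (1 - B)^d, where B is the backshift operator, and applies a truncated version of it to a short series. It only illustrates what "d-th differences" means for non-integer d; it is not the likelihood-based procedure of the talk, and the function names `frac_diff_weights` and `frac_diff` are assumptions made for this example.

```python
import numpy as np

def frac_diff_weights(d, n_weights):
    """Coefficients w_k of (1 - B)^d = sum_k w_k B^k.

    Uses the recursion w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k.
    """
    w = np.empty(n_weights)
    w[0] = 1.0
    for k in range(1, n_weights):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(x, d):
    """Apply the filter (1 - B)^d to x, truncated at the start of the series."""
    x = np.asarray(x, dtype=float)
    w = frac_diff_weights(d, len(x))
    # Entry t is sum_{k=0}^{t} w_k * x[t - k].
    return np.array([w[: t + 1] @ x[t::-1] for t in range(len(x))])

x = np.array([1.0, 3.0, 6.0, 10.0, 15.0])
first_diff = frac_diff(x, 1.0)                  # ordinary first difference
half_twice = frac_diff(frac_diff(x, 0.5), 0.5)  # two half-differences
```

With d = 1 the weights reduce to (1, -1, 0, ...), so the filter is the usual first difference, and applying d = 0.5 twice gives the same result. Differencing a series an integer number of times lowers the remaining d by that integer.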

Subsampling and Jackknife for networks

When: Thu, October 22, 2020 - 3:30pm
Where: Zoom
Speaker: Purnamrita Sarkar (University of Texas, Austin)
Abstract: Networks show up in a variety of applications, from social networks to brain networks in neuroscience, from recommender systems to the internet, and from who-eats-whom networks in ecosystems to protein-protein interaction networks. An important question that often arises is how to estimate the underlying distribution of network statistics, such as eigenvalues and subgraph densities. In this talk I will discuss some recent work on subsampling and the jackknife for these inferential tasks. Despite the dependence of network data induced by edges between pairs of nodes, under the graphon model, we see that jackknife and subsampling behave similarly to their IID counterparts. This is joint work with Robert Lunde and Qiaohui Lin.
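
One way to picture a jackknife for network statistics is to delete one node at a time and recompute the statistic on the induced subgraph. The sketch below does this for edge density on a simulated graph. The leave-one-node-out scheme, the Erdős–Rényi simulation (a trivial graphon), and the names `edge_density` and `node_jackknife_variance` are assumptions made for illustration; the abstract does not spell out the speakers' exact procedure.

```python
import numpy as np

def edge_density(adj):
    """Fraction of possible edges present in an undirected simple graph."""
    n = adj.shape[0]
    return adj[np.triu_indices(n, k=1)].mean()

def node_jackknife_variance(adj, statistic):
    """Leave-one-node-out jackknife variance estimate of `statistic`."""
    n = adj.shape[0]
    loo = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        # Statistic on the subgraph induced by all nodes except i.
        loo[i] = statistic(adj[np.ix_(keep, keep)])
    return (n - 1) / n * np.sum((loo - loo.mean()) ** 2)

# Simulated undirected graph with independent edges (probability 0.3).
rng = np.random.default_rng(1)
n = 40
upper = np.triu(rng.random((n, n)) < 0.3, k=1)
adj = (upper | upper.T).astype(int)

var_hat = node_jackknife_variance(adj, edge_density)
```

Any other function of the adjacency matrix, such as a triangle density, can be passed in place of `edge_density`.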

A Prior for Record Linkage Based on Allelic Partitions

When: Thu, November 5, 2020 - 3:30pm
Where: Zoom
Speaker: Brenda Betancourt (University of Florida, Department of Statistics)
Abstract: In database management, record linkage aims to identify multiple records that correspond to the same individual. This task can be treated as a clustering problem, in which a latent entity is associated with one or more noisy database records. However, in contrast to traditional clustering applications, a large number of clusters with a few observations per cluster is expected in this context. In this work, we introduce a new class of prior distributions based on allelic partitions that is especially suited for the small cluster setting of record linkage. Our approach makes it straightforward to introduce prior information about the cluster size distribution at different scales, and naturally enforces sublinear growth of the maximum cluster size, known as the microclustering property. We evaluate the performance of our proposed class of priors using official statistics data sets and show that our models provide competitive results compared to state-of-the-art microclustering models in the record linkage literature.

Variable selection in mixture of regression models: Uncovering cluster structure and relevant features

When: Thu, November 12, 2020 - 3:30pm
Where: Zoom
Speaker: Mahlet Tadesse (Georgetown University)
Abstract: The problem of variable selection in finite mixture of regression models has been the focus of some research over the last decade. The goal is to uncover latent classes and identify component-specific relevant predictors in a unified manner. This is achieved by combining ideas of mixture models, regression models and variable selection. I will present some of the methods we have proposed in this context, including (1) a stochastic partitioning method to relate two high-dimensional datasets, (2) a penalized mixture of multivariate generalized linear regression models, and (3) a mixture of regression trees approach. I will illustrate the methods with various applications.

Three-dimensional cosmography of the high redshift Universe using intergalactic absorption

When: Thu, November 19, 2020 - 3:30pm
Where: Zoom
Speaker: Collin Politsch (Carnegie Mellon University)
Abstract: The Lyman-α forest – a dense series of hydrogen absorptions seen in the spectra of distant quasars – provides a unique observational probe of the early Universe. The density of spectroscopically measured quasars across the sky has recently risen to a level that has enabled secure measurements of large-scale structure in the three-dimensional distribution of intergalactic gas using the inhomogeneous hydrogen absorption patterns imprinted in the densely sampled quasar sightlines. In principle, these modern Lyman-α forest observations can be used to statistically reconstruct three-dimensional density maps of the intergalactic medium over the massive cosmological volumes illuminated by current spectroscopic quasar surveys. However, until now, such maps have been impossible to produce without the development of scalable and statistically rigorous spatial modeling techniques. Using a sample of approximately 160,000 quasar sightlines measured across 25 percent of the sky by the SDSS-III Baryon Oscillation Spectroscopic Survey, here we present a 154 Gpc^3 large-scale structure map of the redshift 1.98≤z≤3.15 intergalactic medium — the largest volume large-scale structure map of the Universe to date — accompanied by rigorous quantification of the statistical uncertainty in the reconstruction.

A unified approach for solving sequential selection problems

When: Thu, December 10, 2020 - 3:30pm
Where: Zoom
Speaker: Yaakov Malinovsky (UMBC)
Abstract: In this work we develop a unified approach for solving a wide class of sequential selection problems. This class includes, but is not limited to, selection problems with no-information, rank-dependent rewards, and covers both fixed and random problem horizons. The proposed framework is based on a reduction of the original selection problem to one of optimal stopping for a sequence of judiciously constructed independent random variables. We demonstrate that our approach allows exact and efficient computation of optimal policies and various performance metrics thereof for a variety of sequential selection problems, several of which have not been solved to date.
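
The best-known member of this class is the classical secretary (best-choice) problem, where only relative ranks are observed and the goal is to pick the single best of n candidates. The sketch below evaluates cutoff rules exactly for that one textbook case. It is offered as background and does not implement the unified framework of the talk; the function names are assumptions.

```python
from fractions import Fraction

def success_probability(n, r):
    """P(select the overall best of n) under a cutoff rule.

    The rule skips the first r candidates, then accepts the first later
    candidate who is better than everyone seen before.
    """
    if r == 0:
        # With no skipping, the first candidate is always accepted.
        return Fraction(1, n)
    return Fraction(r, n) * sum(Fraction(1, j) for j in range(r, n))

def best_cutoff(n):
    """Cutoff r in 0..n-1 that maximizes the success probability."""
    return max(range(n), key=lambda r: success_probability(n, r))

n = 10
r_star = best_cutoff(n)
p_star = success_probability(n, r_star)
```

Exact rational arithmetic avoids rounding questions when comparing cutoffs for small n.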

Disentangling confounding and nonsense associations due to dependence

When: Thu, February 11, 2021 - 3:30pm
Where: Zoom
Speaker: Elizabeth (Betsy) Ogburn (Johns Hopkins University)
Abstract: Nonsense associations can arise when an exposure and an outcome of interest exhibit similar patterns of dependence. Confounding is present when potential outcomes are not independent of treatment. This talk will describe how confusion about these two phenomena results in shortcomings in popular methods in three areas: causal inference with multiple treatments and unmeasured confounding; causal and statistical inference with social network data; and causal inference with spatial data. For each of these three areas I will demonstrate the flaws in existing methods and describe new methods that were inspired by careful consideration of dependence and confounding.

Scalable estimation of random graph models with dependent edges and increasing numbers of parameters

When: Thu, February 18, 2021 - 3:30pm
Where: Zoom
Speaker: Jonathan Stewart (Florida State University)
Abstract: An important question in statistical network analysis is how to estimate models of dependent network data with intractable likelihood functions, without sacrificing computational scalability and statistical guarantees. In this talk, we demonstrate that scalable estimation of random graph models with dependent edges is possible, by establishing consistency results and convergence rates for pseudo-likelihood-based M-estimators for parameter vectors of increasing dimension based on a single observation of dependent random variables with finite sample spaces. To showcase consistency results and convergence rates, we introduce a novel class of generalized beta-models with dependent edges and parameter vectors of increasing dimension. We establish consistency results and convergence rates for pseudo-likelihood-based M-estimators of generalized beta-models with dependent edges, in dense- and sparse-graph settings. These results demonstrate that all assumptions of our main results can be verified in applications to models capturing dependencies encountered in real-world networks.

Statistical Learning for High-dimensional Tensor Data

When: Thu, March 4, 2021 - 3:30pm
Where: Zoom
Speaker: Anru Zhang (University of Wisconsin, Madison)
Abstract: The analysis of tensor data has become an active research topic in this era of big data. Datasets in the form of tensors, or high-order matrices, arise from a wide range of applications, such as financial econometrics, genomics, and materials science. In addition, tensor methods provide unique perspectives and solutions to many high-dimensional problems, such as topic modeling and high-order interaction pursuit, where the observations are not necessarily tensors. High-dimensional tensor problems generally possess distinct characteristics that pose unprecedented challenges to the data science community. There is a clear need to develop new methods, efficient algorithms, and fundamental theory to analyze high-dimensional tensor data.

In this talk, we discuss some recent advances in high-dimensional tensor data analysis through the consideration of several fundamental and interrelated problems, including tensor SVD and tensor regression. We illustrate how we develop new statistically optimal methods and computationally efficient algorithms that exploit useful information from high-dimensional tensor data based on the modern theories of computation, high-dimensional statistics, and non-convex optimization. Through tensor SVD, we are able to achieve good performance in the denoising of 4D scanning transmission electron microscopy images. Using tensor regression, we are able to use MRI images for the prediction of attention-deficit/hyperactivity disorder.

Manifold structure in graph embeddings

When: Thu, March 11, 2021 - 3:30pm
Where: Zoom
Speaker: Patrick Rubin-Delanchy (University of Bristol)
Abstract: Statistical analysis of a graph often starts with embedding, the process of representing its nodes as points in space. How to choose the embedding dimension is a nuanced decision in practice, but in theory a notion of true dimension is often available. In spectral embedding, this dimension may be very high. However, in this talk I will show that many existing random graph models predict the data should live near a much lower-dimensional (curved) subset. One may therefore circumvent the curse of dimensionality by employing methods which exploit hidden manifold structure. Results are illustrated on simple, weighted, multilayer and multipartite graphs originating from various (cyber-)security applications, as we strive towards more robust anomaly detection in such problems.
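
A small numerical sketch may help make the idea concrete. Below, nodes are given latent positions on a quarter circle (a one-dimensional curve in two dimensions), edges are drawn with probabilities equal to inner products of those positions, and the graph is embedded with a standard adjacency spectral embedding. The latent curve, the sample size, and the function name are assumptions chosen for illustration rather than an example from the talk.

```python
import numpy as np

def adjacency_spectral_embedding(adj, dim):
    """Embed nodes using the `dim` largest-magnitude eigenpairs of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(adj)
    top = np.argsort(np.abs(vals))[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

# Hypothetical latent positions on a quarter circle of radius 0.7.
rng = np.random.default_rng(3)
t = rng.random(30)
latent = 0.7 * np.column_stack([np.cos(t * np.pi / 2), np.sin(t * np.pi / 2)])

# Edge probabilities are inner products of latent positions.
prob = latent @ latent.T
upper = np.triu(rng.random(prob.shape) < prob, k=1)
adj = (upper | upper.T).astype(float)

emb_noisy = adjacency_spectral_embedding(adj, 2)   # embedding of the observed graph
emb_exact = adjacency_spectral_embedding(prob, 2)  # embedding of the noise-free matrix
```

Embedding the noise-free probability matrix recovers the latent positions up to an orthogonal transformation, so every embedded point lies on a circle of radius 0.7. The embedding of the sampled graph scatters around that curve.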

PCA, Double Descent, and Gaussian Processes

When: Thu, March 25, 2021 - 3:30pm
Where: Zoom
Speaker: Soledad Villar (Johns Hopkins University)
Abstract: Overparameterization in deep learning has been shown to be powerful: very large models can fit the training data perfectly and yet generalize well. Investigation of overparameterization brought back the study of linear models, which, like more complex models, show a “double descent” behavior. This involves two features: (1) The risk (out-of-sample prediction error) can grow arbitrarily when the number of samples n approaches the number of parameters p (from either side), and (2) the risk decreases with p at p > n, sometimes achieving a lower value than the lowest risk at p < n. The divergence of the risk at p = n is related to the condition number of the empirical covariance in the feature set. For this reason, it can be avoided with regularization. In this work we show that performing a PCA-based dimensionality reduction also avoids the divergence at p = n; we provide a finite upper bound for the variance of the estimator that decreases with p. This result contrasts with recent work that shows that a different form of dimensionality reduction (one based on the population covariance instead of the empirical covariance) does not avoid the divergence. We connect these results to an analysis of adversarial attacks, which become more effective as they raise the condition number of the empirical covariance of the features. We show that ordinary least squares is arbitrarily susceptible to data-poisoning attacks in the overparameterized regime, unlike the underparameterized regime, and how regularization and dimensionality reduction improve its robustness. We also translate the results on the highly overparameterized linear regression regime to Gaussian Processes.
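
The two estimators contrasted in the abstract can be written in a few lines. The sketch below fits a minimum-norm least-squares solution in an overparameterized setting (p > n) and a PCA-based alternative that regresses on the top k empirical principal directions. The simulated data, the choice k = 5, the omission of centering, and the function names are assumptions made for illustration; the sketch does not reproduce the risk bounds of the talk.

```python
import numpy as np

def min_norm_ols(x, y):
    """Minimum-norm least-squares solution via the pseudoinverse."""
    return np.linalg.pinv(x) @ y

def pca_regression(x, y, k):
    """Least squares restricted to the top-k right singular directions of x.

    x is used as given, without centering.
    """
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    v_k = vt[:k].T                        # p x k matrix of principal directions
    return v_k @ min_norm_ols(x @ v_k, y)

# Simulated overparameterized problem: n = 20 samples, p = 50 features.
rng = np.random.default_rng(2)
n, p = 20, 50
x = rng.standard_normal((n, p))
y = x @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

beta_interp = min_norm_ols(x, y)       # interpolates the training data
beta_pca = pca_regression(x, y, k=5)   # uses only 5 directions
cond = np.linalg.cond(x @ x.T / n)     # conditioning of the n x n Gram matrix
```

With p > n and a full-rank design, the minimum-norm solution fits the training responses exactly. The PCA-based solution discards the directions with small singular values, which are the ones that inflate the condition number.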

Machine learning methods for causal inference from complex observational data

When: Thu, April 8, 2021 - 3:30pm
Where: Zoom
Speaker: Alexander Volfovsky (Duke University)
Abstract: A classical problem in causal inference is that of matching treatment units to control units in an observational dataset. This problem is distinct from simple estimation of treatment effects as it provides additional practical interpretability of the underlying causal mechanisms that is not available without matching. Some of the main challenges in developing matching methods arise from the tension among (i) inclusion of as many relevant covariates as possible in defining the matched groups, (ii) having matched groups with enough treated and control units for a valid estimate of average treatment effect in each group, (iii) computing the matched groups efficiently for large datasets, and (iv) dealing with complicating factors such as non-independence among units. Many matching methods require expert input into the choice of distance metric that guides which covariates to match on and how to match on them. This task becomes impractical for modern electronic health record and large online social network data simply because humans are not naturally adept at constructing high dimensional functions manually. We propose the Almost Matching Exactly (AME) framework to tackle these problems for categorical covariates. At its core this framework proposes an optimization objective for match quality that captures covariates that are integral for making causal statements while encouraging as many matches as possible. We demonstrate that this framework is able to construct good matched groups on relevant covariates and leverage these high quality matches to estimate conditional average treatment effects (CATEs) in the study of the effects of a mother’s smoking status on pregnancy outcomes. We further extend the methodology to incorporate continuous and other complex covariates.

Collapsible Cox-regression and non-collapsible Aalen regression

When: Thu, April 15, 2021 - 3:30pm
Where: Zoom
Speaker: Sven Ove Samuelsen (University of Oslo, Norway)
Abstract: It is known that the additive hazards model is collapsible, in the sense that when omitting one covariate from a model with two independent covariates, the marginal model is still an additive hazards model with the same regression coefficient. In contrast, for the proportional hazards model under the same covariate assumption, it is well-known that the marginal model is no longer a proportional hazards model and hence not collapsible. These results, however, relate to the model specification and not necessarily to the regression estimates.

I point out that if covariates in risk sets at all event times are independent, then both Cox and Aalen regression estimates are collapsible, in the sense that there is no systematic change in the parameter estimates. Conversely, if this assumption fails, then the estimates will change systematically for both Cox and Aalen regression. In particular, if the data are generated by an Aalen model with censoring independent of covariates, both Cox and Aalen regression are collapsible, but if they are generated by a proportional hazards model, neither estimator is. We will also discuss settings where survival times are generated by proportional hazards models with censoring patterns providing uncorrelated covariates and hence collapsible Cox and Aalen regression estimates. Furthermore, possible consequences for instrumental variable analyses with survival data are discussed.

Graph matching between bipartite and unipartite networks: to collapse, or not to collapse, that is the question

When: Thu, April 29, 2021 - 3:30pm
Where: Zoom
Speaker: Jesus Arroyo (UMD)
Abstract: Graph matching consists of aligning the vertices of two unlabeled graphs in order to maximize the shared structure across networks; when the graphs are unipartite, this is commonly formulated as minimizing their edge disagreements. In this paper, we address the common setting in which one of the graphs to match is a bipartite network and one is unipartite. Commonly, the bipartite networks are collapsed or projected into a unipartite graph, and graph matching proceeds as in the classical setting. This potentially leads to noisy edge estimates and loss of information. We formulate the graph matching problem between a bipartite and a unipartite graph using an undirected graphical model, and introduce methods to find the alignment with this model without collapsing. We theoretically demonstrate that our methodology is consistent, and provide non-asymptotic conditions that ensure exact recovery of the matching solution. In simulations and real data examples, including a co-authorship-citation network pair and brain structural and functional data, we show that our methods can produce a more accurate matching than the naive approach of transforming the bipartite networks into unipartite ones.
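
The classical unipartite formulation mentioned at the start of the abstract, minimizing edge disagreements over vertex relabelings, can be shown on a toy example. The sketch below searches all permutations by brute force, which is only feasible for very small graphs. It does not implement the bipartite-to-unipartite graphical model approach of the talk, and the function names and example graph are assumptions.

```python
import numpy as np
from itertools import permutations

def edge_disagreements(a, b, perm):
    """Number of vertex pairs whose edge status differs between a and b relabeled by perm."""
    perm = np.asarray(perm)
    b_perm = b[np.ix_(perm, perm)]
    # Each disagreeing pair appears twice in a symmetric matrix.
    return int(np.sum(a != b_perm) // 2)

def brute_force_match(a, b):
    """Permutation minimizing edge disagreements (exhaustive search)."""
    n = a.shape[0]
    return min(permutations(range(n)), key=lambda p: edge_disagreements(a, b, p))

# Path graph 0-1-2-3 and a copy with shuffled vertex labels.
a = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
shuffle = [2, 0, 3, 1]
b = a[np.ix_(shuffle, shuffle)]

best = brute_force_match(a, b)
```

Because the path graph has a mirror symmetry, more than one permutation attains zero disagreements; the search returns the first one it finds.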

Analysis of multivariate longitudinal data using dynamic lasso-regularized copula models with application to large pediatric cardiovascular studies

When: Thu, May 13, 2021 - 3:30pm
Where: Zoom
Speaker: Colin Wu (NIH/NHLBI)
Abstract: The National Heart, Lung and Blood Institute Growth and Health Study (NGHS) is a large longitudinal study of childhood health. A main objective of the study is to estimate the joint distributions of cardiovascular risk outcomes at any two time points conditioning on a large number of covariates. Existing multivariate longitudinal methods are not suitable for outcomes at multiple time points. We present a dynamic copula approach for estimating an outcome's joint distributions at two time points given a large number of time-varying covariates. Our models depend on the outcome's time-varying distributions at one time point, the bivariate copula densities and the functional copula parameters. We develop a three-step procedure for variable selection and estimation, which selects the influential covariates using a machine learning procedure based on spline Lasso-regularized least squares, computes the outcome's single-time distribution using splines, and estimates the functional copula parameter of the dynamic copula models. Pointwise confidence intervals are constructed through the resampling-subject bootstrap. We apply our procedure to the NGHS cardiovascular risk data and illustrate the clinical interpretations of the conditional distributions of a set of risk outcomes. We demonstrate the statistical properties of the dynamic models and estimation procedure through a simulation study.

This is joint work with Wei Zhang, Xiaoyang Ma, Xin Tian, and Qizhai Li.