Abstract: Various data sets collected from numerous sources have great potential to enhance the quality of inference and accelerate scientific discovery. Inference for merged data is, however, quite challenging: such data may contain unidentified duplication from overlapping, inhomogeneous sources, and each data set, often collected opportunistically, induces complex dependence. In public health research, for example, epidemiological studies have well-defined inclusion and exclusion criteria, in contrast to hospital records, which lack a well-defined target population; when these are combined with a disease registry, patients may appear in multiple data sets. In this talk, we present several examples in public health research that could benefit from data integration. We review existing approaches, such as random effects models and multiple frame surveys, and discuss their limitations with respect to inferential goals, privacy protection, and large sample theory. We then propose our estimation method in the context of the Cox proportional hazards model. We illustrate our theory with simulation and real data examples. If time permits, we will discuss extensions of our proposed method in several directions.
Abstract: Modern technological advances have enabled an unprecedented amount of structured data with complex temporal dependence, urging the need for new methods to efficiently model and forecast high-dimensional tensor-valued time series. This work serves as the first thorough attempt in this direction via autoregression. By considering a low-rank Tucker decomposition for the transition tensor, the proposed tensor autoregression can flexibly capture the underlying low-dimensional tensor dynamics, providing both substantial dimension reduction and meaningful dynamic factor interpretation. For this model, we introduce both a low-dimensional rank-constrained estimator and high-dimensional regularized estimators, and derive their asymptotic and non-asymptotic properties. In particular, a novel convex regularization approach, based on the sum of nuclear norms of square matricizations, is proposed to efficiently encourage low-rankness of the coefficient tensor. A truncation method is further introduced to consistently select the Tucker ranks. Simulation experiments and real data analysis demonstrate the advantages of the proposed approach over various competing ones.
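As a concrete illustration of the model class, the following sketch simulates a matrix-valued autoregression whose transition tensor has a Tucker (multilinear low-rank) structure. All dimensions, ranks, and variable names here are chosen for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each observation X_t is an m1 x m2 matrix; the transition tensor has
# Tucker ranks (r1, r2, r1, r2). Values are illustrative assumptions.
m1, m2 = 4, 3
r1, r2 = 2, 2

# Transition tensor A = G x_1 U1 x_2 U2 x_3 U3 x_4 U4, with the core G
# rescaled so the square matricization has spectral norm 0.7 (stationarity).
G = rng.standard_normal((r1, r2, r1, r2))
G *= 0.7 / np.linalg.norm(G.reshape(r1 * r2, r1 * r2), 2)
U1 = np.linalg.qr(rng.standard_normal((m1, r1)))[0]
U2 = np.linalg.qr(rng.standard_normal((m2, r2)))[0]
U3 = np.linalg.qr(rng.standard_normal((m1, r1)))[0]
U4 = np.linalg.qr(rng.standard_normal((m2, r2)))[0]
A = np.einsum('abcd,ia,jb,kc,ld->ijkl', G, U1, U2, U3, U4)

# Simulate X_t = <A, X_{t-1}> + E_t, contracting A's last two modes
# with the lagged observation.
T = 200
X = np.zeros((T, m1, m2))
X[0] = rng.standard_normal((m1, m2))
for t in range(1, T):
    X[t] = np.tensordot(A, X[t - 1], axes=([2, 3], [0, 1])) \
        + 0.1 * rng.standard_normal((m1, m2))

# The square matricization of A has rank at most r1 * r2: this is the
# low-rankness that the nuclear-norm regularization targets.
print(np.linalg.matrix_rank(A.reshape(m1 * m2, m1 * m2)))
```

With orthonormal Tucker factors, the square matricization of the transition tensor factors as a product through the reshaped core, so its rank is bounded by the product of the Tucker ranks.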
Abstract: The Rosenblatt distribution is named after Murray Rosenblatt, who passed away very recently, in October 2019. It is the simplest non-Gaussian distribution that arises in non-central limit theorems involving long-range dependent random variables. However, no closed form is known, which is a problem if, for example, one wants to compute confidence intervals. In this talk, I shall describe a numerical method which provides insight into characteristic features of that distribution and seems to produce very good results. This method also gives rise to an interesting conjecture.
Abstract: We consider the ramifications of utilizing biased latent position estimates in subsequent statistical analysis in exchange for sizable variance reductions in finite networks. We establish an explicit bias-variance tradeoff for latent position estimates produced by the omnibus embedding in the presence of heterogeneous network data. We reveal an analytic bias expression, derive a uniform concentration bound on the residual term, and prove a central limit theorem characterizing the distributional properties of these estimates.
Abstract: Spatio-temporal data are ubiquitous in the sciences and engineering, and their study is important for understanding and predicting a wide variety of processes. One of the difficulties with statistical modeling of spatial processes that change in time is the complexity of the dependence structures that must describe how such a process varies, and the presence of high-dimensional complex datasets and large prediction domains. It is particularly challenging to specify parameterizations for nonlinear dynamic spatio-temporal models (DSTMs) that are simultaneously useful scientifically, efficient computationally, and allow for proper uncertainty quantification. Here we describe a recent approach that utilizes a deep convolutional neural network to learn the kernel mapping function in a state-dependent integro-difference equation (IDE) DSTM. We implement the approach using an ensemble Kalman filter for efficient computation. The model is trained on daily sea surface temperatures in the Atlantic Ocean and is used to generate forecasts. Importantly, the model has the remarkable "transfer learning" ability to predict a process (weather radar storm cell movement) completely different from the sea surface temperature data on which it was trained.
This is joint work with Andrew Zammit-Mangion, University of Wollongong, Australia.
Paper link: https://www.sciencedirect.com/science/article/pii/S2211675320300026
Abstract: Time series data are prevalent; however, common time series models can fail to capture the complex dynamics of time series data in practice. In this paper, we focus on a specific popular model - the ARFIMA(p, d, q) model - and the assumption of stationarity. An ARFIMA(p, d, q) model is stationary when the d-th differences of the time series data are distributed according to a stationary ARMA(p, q) model, and the differencing parameter d is constrained to be less than or equal to 0.5. Assuming that the differencing parameter d is less than or equal to 0.5 is technically convenient, but may not be appropriate in practice. In this paper, we make a simple observation that facilitates exact and approximate likelihood-based inference for the parameters of the ARFIMA(p, d, q) model given an upper bound for the differencing parameter d that can exceed 0.5. We explore how estimation of the differencing parameter d depends on the upper bound, and introduce adaptive procedures for choosing the upper bound. Via simulations we demonstrate that our adaptive exact likelihood procedures estimate the differencing parameter d well even when the true differencing parameter d is as large as 2.5, can be used to obtain confidence intervals for the differencing parameter that achieve nominal coverage rates, perform favorably relative to existing alternatives, and can be made approximate. We conclude by applying our adaptive procedures to several real data sets.
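The fractional differencing operator (1 - B)^d at the heart of ARFIMA models has a simple recursive binomial expansion; the sketch below (illustrative helper names, standard recursion) shows the operator reducing to ordinary differencing at integer d.

```python
import numpy as np

def frac_diff_weights(d, n):
    """Coefficients of (1 - B)^d up to lag n - 1, via the recursion
    w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k."""
    w = np.ones(n)
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(x, d):
    """Apply the fractional differencing operator (1 - B)^d to a series."""
    n = len(x)
    w = frac_diff_weights(d, n)
    # y_t = sum_{k=0}^{t} w_k x_{t-k}
    return np.array([w[:t + 1] @ x[t::-1] for t in range(n)])

# Sanity check: for integer d the operator reduces to ordinary differencing.
# A quadratic trend is annihilated by (1 - B)^2, leaving a constant.
x = np.cumsum(np.cumsum(np.ones(50)))   # x_t = t(t+1)/2, an I(2) trend
y = frac_diff(x, 2.0)
print(y[2:5])  # → [1. 1. 1.]
```

For non-integer d the weights decay slowly (hyperbolically), which is exactly the long-memory behavior that makes inference with large d delicate.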
Abstract: Networks show up in a variety of applications, from social networks to brain networks in neuroscience, from recommender systems to the internet, from who-eats-whom networks in ecosystems to protein-protein interaction networks. An important question that often arises centers on estimating the underlying distribution of network statistics, like eigenvalues, subgraph densities, etc. In this talk I will discuss some recent work on subsampling and the jackknife for these inferential tasks. Despite the dependence in network data induced by edges between pairs of nodes, under the graphon model we see that the jackknife and subsampling behave similarly to their IID counterparts. This is joint work with Robert Lunde and Qiaohui Lin.
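Node subsampling of the kind discussed can be sketched in a few lines. The graphon, subsample size, and variance rescaling below are illustrative assumptions, not the talk's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a graph from a simple graphon W(u, v) = u * v (chosen for
# illustration; any bounded symmetric W works the same way).
n = 300
u = rng.uniform(size=n)
P = np.outer(u, u)
A = (rng.uniform(size=(n, n)) < P).astype(int)
A = np.triu(A, 1)
A = A + A.T                         # symmetric adjacency, no self-loops

def edge_density(A):
    m = A.shape[0]
    return A.sum() / (m * (m - 1))

# Node subsampling: recompute the statistic on induced subgraphs of b
# nodes and use the spread of the replicates to estimate sampling error.
b = 60
reps = []
for _ in range(500):
    idx = rng.choice(n, size=b, replace=False)
    reps.append(edge_density(A[np.ix_(idx, idx)]))

# Rescale: edge density fluctuates at rate 1/sqrt(n) under the graphon
# model, so Var(full) is approximately (b / n) * Var(subsample).
var_hat = (b / n) * np.var(reps)
print(edge_density(A), var_hat)
```

The point of the talk's theory is that, despite edge dependence, such node-level resampling schemes inherit the familiar IID-style validity under the graphon model.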
Abstract: In database management, record linkage aims to identify multiple records that correspond to the same individual. This task can be treated as a clustering problem, in which a latent entity is associated with one or more noisy database records. However, in contrast to traditional clustering applications, a large number of clusters with a few observations per cluster is expected in this context. In this work, we introduce a new class of prior distributions based on allelic partitions that is specially suited for the small cluster setting of record linkage. Our approach makes it straightforward to introduce prior information about the cluster size distribution at different scales, and naturally enforces sublinear growth of the maximum cluster size, known as the microclustering property. We evaluate the performance of our proposed class of priors using official statistics data sets and show that our models provide competitive results compared to state-of-the-art microclustering models in the record linkage literature.
Abstract: The problem of variable selection in finite mixture of regression models has been the focus of some research over the last decade. The goal is to uncover latent classes and identify component-specific relevant predictors in a unified manner. This is achieved by combining ideas of mixture models, regression models and variable selection. I will present some of the methods we have proposed in this context, including (1) a stochastic partitioning method to relate two high-dimensional datasets, (2) a penalized mixture of multivariate generalized linear regression models, and (3) a mixture of regression trees approach. I will illustrate the methods with various applications.
Abstract: The Lyman-α forest, a dense series of hydrogen absorptions seen in the spectra of distant quasars, provides a unique observational probe of the early Universe. The density of spectroscopically measured quasars across the sky has recently risen to a level that has enabled secure measurements of large-scale structure in the three-dimensional distribution of intergalactic gas using the inhomogeneous hydrogen absorption patterns imprinted in the densely sampled quasar sightlines. In principle, these modern Lyman-α forest observations can be used to statistically reconstruct three-dimensional density maps of the intergalactic medium over the massive cosmological volumes illuminated by current spectroscopic quasar surveys. However, until now, such maps have been impossible to produce without the development of scalable and statistically rigorous spatial modeling techniques. Using a sample of approximately 160,000 quasar sightlines measured across 25 percent of the sky by the SDSS-III Baryon Oscillation Spectroscopic Survey, here we present a 154 Gpc^3 large-scale structure map of the redshift 1.98 ≤ z ≤ 3.15 intergalactic medium, the largest volume large-scale structure map of the Universe to date, accompanied by rigorous quantification of the statistical uncertainty in the reconstruction.
Abstract: In this work we develop a unified approach for solving a wide class of sequential selection problems. This class includes, but is not limited to, selection problems with no information, rank-dependent rewards, and considers both fixed as well as random problem horizons. The proposed framework is based on a reduction of the original selection problem to one of optimal stopping for a sequence of judiciously constructed independent random variables. We demonstrate that our approach allows exact and efficient computation of optimal policies and various performance metrics thereof for a variety of sequential selection problems, several of which have not been solved to date.
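The classical no-information secretary problem is a standard member of this class. A minimal sketch of its exact optimal threshold policy, computed by scanning the well-known success-probability formula (function and variable names are illustrative, not from the paper):

```python
def secretary_policy(n):
    """Classical no-information secretary problem: skip the first r - 1
    applicants, then stop at the first one that is the best so far.
    Returns the optimal r and its probability of selecting the overall best,
    using P(r) = ((r - 1) / n) * sum_{j=r}^{n} 1 / (j - 1)."""
    best_r, best_p = 1, 1.0 / n          # r = 1: always take the first applicant
    for r in range(2, n + 1):
        p = (r - 1) / n * sum(1.0 / (j - 1) for j in range(r, n + 1))
        if p > best_p:
            best_r, best_p = r, p
    return best_r, best_p

r, p = secretary_policy(100)
print(r, round(p, 4))  # r is near n / e ≈ 37 and p is near 1 / e ≈ 0.368
```

The optimal policy here is exactly an optimal stopping rule for a sequence of independent indicators (whether each new applicant is a running maximum), which is the flavor of reduction the abstract describes.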
Abstract: Nonsense associations can arise when an exposure and an outcome of interest exhibit similar patterns of dependence. Confounding is present when potential outcomes are not independent of treatment. This talk will describe how confusion about these two phenomena results in shortcomings in popular methods in three areas: causal inference with multiple treatments and unmeasured confounding; causal and statistical inference with social network data; and causal inference with spatial data. For each of these three areas I will demonstrate the flaws in existing methods and describe new methods that were inspired by careful consideration of dependence and confounding.
Abstract: An important question in statistical network analysis is how to estimate models of dependent network data with intractable likelihood functions, without sacrificing computational scalability and statistical guarantees. In this talk, we demonstrate that scalable estimation of random graph models with dependent edges is possible, by establishing consistency results and convergence rates for pseudo-likelihood-based M-estimators for parameter vectors of increasing dimension based on a single observation of dependent random variables with finite sample spaces. To showcase consistency results and convergence rates, we introduce a novel class of generalized beta-models with dependent edges and parameter vectors of increasing dimension. We establish consistency results and convergence rates for pseudo-likelihood-based M-estimators of generalized beta-models with dependent edges, in dense- and sparse-graph settings. These results demonstrate that all assumptions of our main results can be verified in applications to models capturing dependencies encountered in real-world networks.
Abstract: The analysis of tensor data has become an active research topic in this era of big data. Datasets in the form of tensors, or high-order matrices, arise from a wide range of applications, such as financial econometrics, genomics, and material science. In addition, tensor methods provide unique perspectives and solutions to many high-dimensional problems, such as topic modeling and high-order interaction pursuit, where the observations are not necessarily tensors. High-dimensional tensor problems generally possess distinct characteristics that pose unprecedented challenges to the data science community. There is a clear need to develop new methods, efficient algorithms, and fundamental theory to analyze high-dimensional tensor data.
In this talk, we discuss some recent advances in high-dimensional tensor data analysis through the consideration of several fundamental and interrelated problems, including tensor SVD and tensor regression. We illustrate how we develop new statistically optimal methods and computationally efficient algorithms that exploit useful information from high-dimensional tensor data based on the modern theories of computation, high-dimensional statistics, and non-convex optimization. Through tensor SVD, we are able to achieve good performance in the denoising of 4D scanning transmission electron microscopy images. Using tensor regression, we are able to use MRI images for the prediction of attention-deficit/hyperactivity disorder.
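One simple baseline in the tensor SVD family is the truncated higher-order SVD (HOSVD). The sketch below, with illustrative dimensions and noise level, shows it denoising a low-Tucker-rank tensor; this is a generic baseline, not the talk's specific estimator.

```python
import numpy as np

rng = np.random.default_rng(4)

def unfold(T, mode):
    """Mode-k matricization of a tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_truncate(T, ranks):
    """Truncated HOSVD: project each mode onto its leading left singular
    vectors, then reconstruct (a simple tensor denoising baseline)."""
    Us = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        Us.append(U[:, :r])
    core = T
    for mode, U in enumerate(Us):       # compress each mode
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    out = core
    for mode, U in enumerate(Us):       # expand back to the original shape
        out = np.moveaxis(np.tensordot(U, np.moveaxis(out, mode, 0), axes=1), 0, mode)
    return out

# A noisy rank-(2, 2, 2) tensor: truncation removes most of the noise.
r = 2
A = rng.standard_normal((20, r))
Bm = rng.standard_normal((20, r))
C = rng.standard_normal((20, r))
G = rng.standard_normal((r, r, r))
signal = np.einsum('abc,ia,jb,kc->ijk', G, A, Bm, C)
noisy = signal + 0.3 * rng.standard_normal(signal.shape)
denoised = hosvd_truncate(noisy, (r, r, r))
print(np.linalg.norm(denoised - signal) / np.linalg.norm(signal))
```

Because the noise spread over all 8,000 entries is projected onto a Tucker subspace with far fewer degrees of freedom, the reconstruction error drops well below the raw noise level.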
Abstract: Statistical analysis of a graph often starts with embedding, the process of representing its nodes as points in space. How to choose the embedding dimension is a nuanced decision in practice, but in theory a notion of true dimension is often available. In spectral embedding, this dimension may be very high. However, in this talk I will show that many existing random graph models predict the data should live near a much lower-dimensional (curved) subset. One may therefore circumvent the curse of dimensionality by employing methods which exploit hidden manifold structure. Results are illustrated on simple, weighted, multilayer and multipartite graphs originating from various (cyber-)security applications, as we strive towards more robust anomaly detection in such problems.
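Adjacency spectral embedding, the standard spectral embedding referred to here, can be sketched in a few lines. The two-block stochastic block model below is an illustrative special case of a low-rank random graph model, not the talk's data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-block stochastic block model (illustrative assumption).
n = 200
z = np.repeat([0, 1], n // 2)
B = np.array([[0.5, 0.1],
              [0.1, 0.5]])
P = B[z][:, z]
A = (rng.uniform(size=(n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                         # symmetric adjacency, no self-loops

# Adjacency spectral embedding into d dimensions: leading eigenvectors
# (by magnitude) scaled by the square roots of their eigenvalues.
d = 2
vals, vecs = np.linalg.eigh(A)
order = np.argsort(np.abs(vals))[::-1][:d]
Xhat = vecs[:, order] * np.sqrt(np.abs(vals[order]))

# Nodes from the two blocks concentrate around two distinct points, i.e.
# the embedded cloud lives near a low-dimensional subset of R^d.
c0 = Xhat[z == 0].mean(axis=0)
c1 = Xhat[z == 1].mean(axis=0)
print(np.linalg.norm(c0 - c1))
```

In this toy case the "hidden manifold" is just two points; the talk's message is that richer models similarly confine the embedding to low-dimensional (possibly curved) structure that downstream methods can exploit.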
Abstract: Overparameterization in deep learning has been shown to be powerful: very large models can fit the training data perfectly and yet generalize well. Investigation of overparameterization brought back the study of linear models, which, like more complex models, exhibit a "double descent" behavior. This involves two features: (1) the risk (out-of-sample prediction error) can grow arbitrarily when the number of samples n approaches the number of parameters p (from either side), and (2) the risk decreases with p for p > n, sometimes achieving a lower value than the lowest risk for p < n. The divergence of the risk at p = n is related to the condition number of the empirical covariance of the features. For this reason, it can be avoided with regularization. In this work we show that performing a PCA-based dimensionality reduction also avoids the divergence at p = n; we provide a finite upper bound for the variance of the estimator that decreases with p. This result contrasts with recent work showing that a different form of dimensionality reduction, one based on the population covariance instead of the empirical covariance, does not avoid the divergence. We connect these results to an analysis of adversarial attacks, which become more effective as they raise the condition number of the empirical covariance of the features. We show that ordinary least squares is arbitrarily susceptible to data-poisoning attacks in the overparameterized regime, unlike the underparameterized regime, and how regularization and dimensionality reduction improve its robustness. We also translate the results on the highly overparameterized linear regression regime to Gaussian processes.
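A small simulation (an illustrative setup, not the paper's exact one) shows the instability of min-norm least squares near p = n and how a PCA-based reduction of the design tames it.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 50                                   # samples (illustrative)
beta_dim = 120                           # total available features
beta = rng.standard_normal(beta_dim) / np.sqrt(beta_dim)

def risk(p, k=None, reps=100):
    """Monte Carlo out-of-sample error of min-norm least squares with p
    features; if k is given, first project onto the top k principal
    components of the empirical design."""
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = X @ beta[:p] + 0.5 * rng.standard_normal(n)
        if k is None:
            Z, Vt = X, None
        else:
            _, _, Vt = np.linalg.svd(X, full_matrices=False)
            Z = X @ Vt[:k].T             # PCA-based dimensionality reduction
        bhat = np.linalg.pinv(Z) @ y     # min-norm least squares
        xt = rng.standard_normal(p)      # fresh test point
        zt = xt if k is None else xt @ Vt[:k].T
        errs.append((zt @ bhat - xt @ beta[:p]) ** 2)
    return np.mean(errs)

# At the interpolation threshold p = n, the empirical covariance is
# ill-conditioned and the risk explodes; after PCA reduction it does not.
print(risk(50), risk(50, k=25))
```

The divergence in the first number is driven entirely by near-zero singular values of the design, which is the condition-number mechanism the abstract describes.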
Abstract: A classical problem in causal inference is that of matching treatment units to control units in an observational dataset. This problem is distinct from simple estimation of treatment effects as it provides additional practical interpretability of the underlying causal mechanisms that is not available without matching. Some of the main challenges in developing matching methods arise from the tension among (i) inclusion of as many relevant covariates as possible in defining the matched groups, (ii) having matched groups with enough treated and control units for a valid estimate of average treatment effect in each group, (iii) computing the matched groups efficiently for large datasets, and (iv) dealing with complicating factors such as non-independence among units. Many matching methods require expert input into the choice of distance metric that guides which covariates to match on and how to match on them. This task becomes impractical for modern electronic health record and large online social network data simply because humans are not naturally adept at constructing high dimensional functions manually. We propose the Almost Matching Exactly (AME) framework to tackle these problems for categorical covariates. At its core this framework proposes an optimization objective for match quality that captures covariates that are integral for making causal statements while encouraging as many matches as possible. We demonstrate that this framework is able to construct good matched groups on relevant covariates and leverage these high quality matches to estimate conditional average treatment effects (CATEs) in the study of the effects of a mother's smoking status on pregnancy outcomes. We further extend the methodology to incorporate continuous and other complex covariates.
Abstract: It is known that the additive hazards model is collapsible, in the sense that when omitting one covariate from a model with two independent covariates, the marginal model is still an additive hazards model with the same regression coefficient. In contrast, for the proportional hazards model under the same covariate assumption, it is well-known that the marginal model is no longer a proportional hazards model and hence not collapsible. These results, however, relate to the model specification and not necessarily to the regression estimates.
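The collapsibility of the additive hazards model can be seen directly. A sketch, with notation chosen here for illustration (two covariates, $X_1 \perp X_2$):

```latex
\lambda(t \mid X_1, X_2) = \beta_0(t) + \beta_1 X_1 + \beta_2 X_2 .
```

Among survivors at time $t$, the conditional law of $X_2$ factorizes free of $x_1$, since the joint survival function separates multiplicatively:

```latex
f(x_2 \mid T \ge t, x_1) \;\propto\; f(x_2)\, \exp\{-\beta_2 x_2 t\},
```

so $E[X_2 \mid T \ge t, X_1] = E[X_2 \mid T \ge t]$ depends only on $t$ and is absorbed into the baseline:

```latex
\lambda(t \mid X_1)
= \underbrace{\beta_0(t) + \beta_2\, E[X_2 \mid T \ge t]}_{\tilde{\beta}_0(t)}
+ \beta_1 X_1 ,
```

which is again an additive hazards model with the same coefficient $\beta_1$. No analogous cancellation occurs for the proportional hazards model, where marginalizing over $X_2$ destroys proportionality.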
I point out that if covariates in risk sets at all event times are independent, then both Cox and Aalen regression estimates are collapsible, in the sense that there is no systematic change in the parameter estimates. Conversely, if this assumption fails, then the estimates will change systematically for both Cox and Aalen regression. In particular, if the data are generated by an Aalen model with censoring independent of covariates, both Cox and Aalen regression are collapsible, but if generated by a proportional hazards model, neither estimator is. We will also discuss settings where survival times are generated by proportional hazards models with censoring patterns providing uncorrelated covariates and hence collapsible Cox and Aalen regression estimates. Furthermore, possible consequences for instrumental variable analyses with survival data are discussed.
4176 Campus Drive - William E. Kirwan Hall
College Park, MD 20742-4015
P: 301.405.5047 | F: 301.314.0827