Abstract: Data sets collected from numerous sources have great potential to enhance the quality of inference and accelerate scientific discovery. Inference for merged data is, however, quite challenging: such data may contain unidentified duplication from overlapping, inhomogeneous sources, and each data set, often collected opportunistically, induces complex dependence. In public health research, for example, epidemiological studies have different inclusion and exclusion criteria, in contrast to hospital records, which lack a well-defined target population; when combined with a disease registry, patients appear in multiple data sets. In this talk, we present several examples in public health research that could benefit from data integration. We review existing work, such as random effects model approaches and multiple frame surveys, and discuss their limitations with respect to inferential goals, privacy protection, and large sample theory. We then propose an estimation method in the context of the Cox proportional hazards model. We illustrate our theory with simulations and real data examples. If time permits, we will discuss extensions of our proposed method in several directions.
Abstract: Modern technological advances have enabled an unprecedented amount of structured data with complex temporal dependence, urging the need for new methods to efficiently model and forecast high-dimensional tensor-valued time series. This work serves as the first thorough attempt in this direction via autoregression. By considering a low-rank Tucker decomposition for the transition tensor, the proposed tensor autoregression can flexibly capture the underlying low-dimensional tensor dynamics, providing both substantial dimension reduction and meaningful dynamic factor interpretation. For this model, we introduce both a low-dimensional rank-constrained estimator and high-dimensional regularized estimators, and derive their asymptotic and non-asymptotic properties. In particular, a novel convex regularization approach, based on the sum of nuclear norms of square matricizations, is proposed to efficiently encourage low-rankness of the coefficient tensor. A truncation method is further introduced to consistently select the Tucker ranks. Simulation experiments and real data analysis demonstrate the advantages of the proposed approach over various competing ones.
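The abstract does not spell out the estimation details, but the model structure itself can be sketched. Below is a minimal, hypothetical numpy illustration of a matrix-valued AR(1) whose transition tensor has a low-rank Tucker form; the dimensions, Tucker ranks, and scaling constant are arbitrary choices for illustration, not taken from the work:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                    # dimensions of the matrix-valued series (illustrative)
r1, r2, r3, r4 = 2, 2, 2, 2    # hypothetical Tucker ranks

# Low Tucker-rank transition tensor A (m x n x m x n):
# A = G x1 U1 x2 U2 x3 U3 x4 U4, with core G and factor matrices U1..U4
G = rng.normal(size=(r1, r2, r3, r4))
U = [rng.normal(size=(dim, r)) for dim, r in zip((m, n, m, n), (r1, r2, r3, r4))]
A = np.einsum('abcd,ia,jb,kc,ld->ijkl', G, *U)
# Rescale so the (mn x mn) matricization has spectral radius 0.3 (stationarity)
A *= 0.3 / np.abs(np.linalg.eigvals(A.reshape(m * n, m * n))).max()

# Simulate the tensor AR(1): Y_t = <A, Y_{t-1}> + E_t
T = 200
Y = np.zeros((T, m, n))
for t in range(1, T):
    Y[t] = np.einsum('ijkl,kl->ij', A, Y[t - 1]) + 0.1 * rng.normal(size=(m, n))

# Dimension reduction: unconstrained transition tensor vs. Tucker factors
full_params = (m * n) ** 2
tucker_params = G.size + sum(u.size for u in U)
print(full_params, tucker_params)   # 144 vs 44
```

Even in this tiny example, the Tucker structure cuts the parameter count from 144 to 44; the savings grow rapidly with the tensor dimensions.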
Abstract: The Rosenblatt distribution is named after Murray Rosenblatt, who passed away very recently, in October 2019. It is the simplest non-Gaussian distribution that arises in non-central limit theorems involving long-range dependent random variables. However, no closed form is known, which is a problem if one wants, for example, to compute confidence intervals. In this talk, I shall describe a numerical method which provides insight into characteristic features of that distribution and seems to produce very good results. This method also gives rise to an interesting conjecture.
Abstract: We consider the ramifications of utilizing biased latent position estimates in subsequent statistical analysis in exchange for sizable variance reductions in finite networks. We establish an explicit bias-variance tradeoff for latent position estimates produced by the omnibus embedding in the presence of heterogeneous network data. We reveal an analytic bias expression, derive a uniform concentration bound on the residual term, and prove a central limit theorem characterizing the distributional properties of these estimates.
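The omnibus embedding construction itself is standard and can be sketched: stack the adjacency matrices into a block matrix whose off-diagonal blocks are pairwise averages, then take a scaled spectral decomposition. The toy example below uses two graphs drawn from a rank-one random dot product model; the sizes, dimension, and latent-position distribution are illustrative assumptions, not from the abstract:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 1   # nodes and embedding dimension (illustrative choices)

# Two independent graphs sharing the same latent positions
X = rng.uniform(0.3, 0.7, size=(n, d))      # true latent positions
P = X @ X.T                                  # edge probability matrix
A = []
for _ in range(2):
    upper = np.triu((rng.uniform(size=(n, n)) < P), 1).astype(float)
    A.append(upper + upper.T)                # symmetric, hollow adjacency

# Omnibus matrix: graphs on the diagonal, their average off the diagonal
avg = (A[0] + A[1]) / 2
M = np.block([[A[0], avg], [avg, A[1]]])

# Adjacency spectral embedding of M: 2n latent position estimates,
# one row per node per graph
vals, vecs = np.linalg.eigh(M)
idx = np.argsort(vals)[::-1][:d]             # top-d eigenpairs
Xhat = vecs[:, idx] * np.sqrt(vals[idx])
```

Because both graphs share the off-diagonal averaging, the two copies of each node are pulled toward a common estimate, which is the source of the variance reduction (and the bias) that the abstract analyzes.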
Abstract: Spatio-temporal data are ubiquitous in the sciences and engineering, and their study is important for understanding and predicting a wide variety of processes. One of the difficulties with statistical modeling of spatial processes that change in time is the complexity of the dependence structures that must describe how such a process varies, and the presence of high-dimensional complex datasets and large prediction domains. It is particularly challenging to specify parameterizations for nonlinear dynamic spatio-temporal models (DSTMs) that are simultaneously useful scientifically, efficient computationally, and allow for proper uncertainty quantification. Here we describe a recent approach that utilizes a deep convolutional neural network to learn the kernel mapping function in a state-dependent integro-difference equation (IDE) DSTM. We implement the approach using an ensemble Kalman filter for efficient computation. The model is trained on daily sea surface temperatures in the Atlantic Ocean and is used to generate forecasts. Importantly, the model has the remarkable "transfer learning" ability to predict a process (weather radar storm cell movement) completely different from the sea surface temperature data on which it was trained.
This is joint work with Andrew Zammit-Mangion, University of Wollongong, Australia.
Paper link: https://www.sciencedirect.com/science/article/pii/S2211675320300026
Abstract: Time series data are prevalent; however, common time series models can fail to capture the complex dynamics of time series data in practice. In this paper, we focus on a specific popular model, the ARFIMA(p, d, q) model, and the assumption of stationarity. An ARFIMA(p, d, q) model is stationary when the d-th differences of the time series data are distributed according to a stationary ARMA(p, q) model and the differencing parameter d is less than 0.5. Constraining the differencing parameter d to be less than 0.5 is technically convenient, but may not be appropriate in practice. In this paper, we make a simple observation that facilitates exact and approximate likelihood-based inference for the parameters of the ARFIMA(p, d, q) model given an upper bound for the differencing parameter d that can exceed 0.5. We explore how estimation of the differencing parameter d depends on the upper bound, and introduce adaptive procedures for choosing the upper bound. Via simulations, we demonstrate that our adaptive exact likelihood procedures estimate the differencing parameter d well even when the true differencing parameter d is as large as 2.5, can be used to obtain confidence intervals for the differencing parameter that achieve nominal coverage rates, perform favorably relative to existing alternatives, and can be made approximate. We conclude by applying our adaptive procedures to several real data sets.
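The fractional differencing operator at the heart of ARFIMA can be illustrated directly. The sketch below is ours, not the paper's method: it applies (1 - B)^d via the truncated binomial expansion, whose weights satisfy the recursion w_k = w_{k-1}(k - 1 - d)/k. Setting d = 1 recovers the ordinary first difference, which serves as a sanity check:

```python
import numpy as np

def fracdiff(x, d, m=None):
    """Apply the fractional differencing operator (1 - B)^d to a series x,
    truncating the binomial expansion at m lags (m = len(x) keeps all terms)."""
    m = len(x) if m is None else m
    w = np.zeros(m)
    w[0] = 1.0
    for k in range(1, m):
        w[k] = w[k - 1] * (k - 1 - d) / k   # binomial weights of (1 - B)^d
    # y_t = sum_k w_k * x_{t-k}, using only the observed past
    return np.array([w[:t + 1][::-1] @ x[max(0, t + 1 - m):t + 1]
                     for t in range(len(x))])

x = np.array([1.0, 3.0, 6.0, 10.0])
print(fracdiff(x, 1.0))   # d = 1 gives the first differences: 1, 2, 3, 4
```

For non-integer d between 0 and 0.5 the weights decay slowly, which is exactly the long-memory behavior the ARFIMA model is designed to capture; values of d above 0.5, as studied in the paper, make the untransformed series nonstationary.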
Abstract: Networks show up in a variety of applications, from social networks to brain networks in neuroscience, from recommender systems to the internet, from who-eats-whom networks in ecosystems to protein-protein interaction networks. An important question that often arises is centered around estimating the underlying distribution of network statistics, like eigenvalues, subgraph densities, etc. In this talk I will discuss some recent work on subsampling and the jackknife for these inferential tasks. Despite the dependence of network data induced by edges between pairs of nodes, under the graphon model, we see that the jackknife and subsampling behave similarly to their IID counterparts. This is joint work with Robert Lunde and Qiaohui Lin.
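As a toy illustration of node subsampling (the parameters and the choice of statistic are ours, not from the talk), one can recompute a network statistic on induced subgraphs to approximate its sampling variability. Here the statistic is the normalized leading eigenvalue of an Erdos-Renyi graph, the simplest graphon model:

```python
import numpy as np

rng = np.random.default_rng(2)
n, b, B = 200, 60, 50   # network size, subsample size, number of subsamples
p = 0.2                 # Erdos-Renyi edge probability (constant graphon)

# Generate a symmetric adjacency matrix with no self-loops
upper = np.triu((rng.uniform(size=(n, n)) < p), 1).astype(float)
A = upper + upper.T

def stat(adj):
    """Leading eigenvalue normalized by the number of nodes."""
    return np.linalg.eigvalsh(adj).max() / adj.shape[0]

# Node subsampling: recompute the statistic on induced subgraphs of b nodes
reps = []
for _ in range(B):
    idx = rng.choice(n, size=b, replace=False)
    reps.append(stat(A[np.ix_(idx, idx)]))

print(stat(A), np.mean(reps), np.std(reps))
```

The spread of the subsampled replicates serves as an estimate of the statistic's sampling variability, mirroring how subsampling is used for IID data despite the edge-induced dependence.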
Abstract: In database management, record linkage aims to identify multiple records that correspond to the same individual. This task can be treated as a clustering problem, in which a latent entity is associated with one or more noisy database records. However, in contrast to traditional clustering applications, a large number of clusters with a few observations per cluster is expected in this context. In this work, we introduce a new class of prior distributions based on allelic partitions that is specially suited for the small cluster setting of record linkage. Our approach makes it straightforward to introduce prior information about the cluster size distribution at different scales, and naturally enforces sublinear growth of the maximum cluster size, known as the microclustering property. We evaluate the performance of our proposed class of priors using official statistics data sets and show that our models provide competitive results compared to state-of-the-art microclustering models in the record linkage literature.
Abstract: The problem of variable selection in finite mixture of regression models has been the focus of some research over the last decade. The goal is to uncover latent classes and identify component-specific relevant predictors in a unified manner. This is achieved by combining ideas of mixture models, regression models and variable selection. I will present some of the methods we have proposed in this context, including (1) a stochastic partitioning method to relate two high-dimensional datasets, (2) a penalized mixture of multivariate generalized linear regression models, and (3) a mixture of regression trees approach. I will illustrate the methods with various applications.
Abstract: The Lyman-α forest, a dense series of hydrogen absorptions seen in the spectra of distant quasars, provides a unique observational probe of the early Universe. The density of spectroscopically measured quasars across the sky has recently risen to a level that has enabled secure measurements of large-scale structure in the three-dimensional distribution of intergalactic gas, using the inhomogeneous hydrogen absorption patterns imprinted in the densely sampled quasar sightlines. In principle, these modern Lyman-α forest observations can be used to statistically reconstruct three-dimensional density maps of the intergalactic medium over the massive cosmological volumes illuminated by current spectroscopic quasar surveys. However, until now, such maps have been impossible to produce without the development of scalable and statistically rigorous spatial modeling techniques. Using a sample of approximately 160,000 quasar sightlines measured across 25 percent of the sky by the SDSS-III Baryon Oscillation Spectroscopic Survey, here we present a 154 Gpc^3 large-scale structure map of the intergalactic medium at redshift 1.98 ≤ z ≤ 3.15, the largest-volume large-scale structure map of the Universe to date, accompanied by rigorous quantification of the statistical uncertainty in the reconstruction.
Abstract: In this work we develop a unified approach for solving a wide class of sequential selection problems. This class includes, but is not limited to, selection problems with no-information, rank-dependent rewards, and considers both fixed as well as random problem horizons. The proposed framework is based on a reduction of the original selection problem to one of optimal stopping for a sequence of judiciously constructed independent random variables. We demonstrate that our approach allows exact and efficient computation of optimal policies and various performance metrics thereof for a variety of sequential selection problems, several of which have not been solved to date.
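The flavor of such reductions can be illustrated on the classical best-choice (secretary) problem, a canonical no-information, rank-dependent selection problem. The sketch below is a generic backward-induction computation, not the paper's construction: at each stage one compares the probability of winning by stopping on a relative best against the value of continuing:

```python
def secretary_policy(n):
    """Backward induction for the classical best-choice (secretary) problem.
    At stage t, a candidate that is the best seen so far is the overall best
    with probability t/n; stopping is optimal iff that beats continuing."""
    v = 0.0            # value of continuing after the last stage
    threshold = n
    for t in range(n, 0, -1):
        stop = t / n   # win probability if we stop on a relative best at t
        # a relative best arrives at stage t with probability 1/t
        v = (1 / t) * max(stop, v) + (1 - 1 / t) * v
        if stop >= v:
            threshold = t   # earliest stage at which stopping is optimal
    return threshold, v

k, win_prob = secretary_policy(100)
print(k, win_prob)   # threshold near 100/e, win probability near 1/e
```

The optimal rule that emerges is the familiar one: observe roughly the first n/e candidates without stopping, then select the first candidate better than all of them, winning with probability about 1/e.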
4176 Campus Drive - William E. Kirwan Hall
College Park, MD 20742-4015
P: 301.405.5047 | F: 301.314.0827