Abstract: Spectral inference on multiple networks is a rapidly-developing subfield of graph statistics. Recent work has demonstrated that joint, or simultaneous, spectral embedding of multiple independent networks can deliver more accurate estimation than individual spectral decompositions of those same networks. Such inference procedures typically rely heavily on independence assumptions across the multiple network realizations, and even in this case, little attention has been paid to the induced network correlation in such joint embeddings. Here, we present a generalized omnibus embedding methodology and provide a detailed analysis of this embedding across both independent and correlated networks, the latter of which significantly extends the reach of such procedures. We describe how this omnibus embedding can itself induce correlation, leading us to distinguish between inherent correlation -- the correlation that arises naturally in multisample network data -- and induced correlation, which is an artifice of the joint embedding methodology. We show that the generalized omnibus embedding procedure is flexible and robust, and prove both consistency and a central limit theorem for the embedded points. We examine how induced and inherent correlation can impact inference for network time series data, and we provide network analogues of classical questions such as the effective sample size for more generally correlated data. Further, we show how an appropriately calibrated generalized omnibus embedding can detect changes in real biological networks that previous embedding procedures could not discern, confirming that the effect of inherent and induced correlation can be subtle and transformative, with import in theory and practice.
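As a rough illustration of joint spectral embedding, the sketch below builds the classical omnibus matrix, in which block (i, j) is the average of adjacency matrices A_i and A_j, and embeds it using its d largest-magnitude eigenpairs. The averaging rule is an assumption made here for concreteness: the abstract does not spell out the generalized construction. The function names and the toy Erdos-Renyi graphs are hypothetical.

```python
import numpy as np

def omnibus_matrix(adjs):
    """Stack m symmetric n x n adjacency matrices into an (m*n) x (m*n) matrix.

    Assumption: classical omnibus weighting, block (i, j) = (A_i + A_j) / 2.
    """
    m, n = len(adjs), adjs[0].shape[0]
    omni = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(m):
            omni[i * n:(i + 1) * n, j * n:(j + 1) * n] = (adjs[i] + adjs[j]) / 2.0
    return omni

def spectral_embed(mat, d):
    """Embed a symmetric matrix using its d largest-magnitude eigenpairs."""
    vals, vecs = np.linalg.eigh(mat)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def sample_graph(rng, n, p):
    """Draw a symmetric, hollow Erdos-Renyi adjacency matrix (toy data only)."""
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
    return upper + upper.T

rng = np.random.default_rng(0)
n, m, d = 30, 3, 2
graphs = [sample_graph(rng, n, 0.3) for _ in range(m)]
omni = omnibus_matrix(graphs)
# Rows i*n .. (i+1)*n - 1 of the embedding correspond to the vertices of graph i.
embedding = spectral_embed(omni, d)
print(embedding.shape)
```

Because every off-diagonal block mixes two graphs, the embedded points for different graphs share information even when the graphs are independent, which is one way to picture the induced correlation the abstract describes.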
Abstract: Interest in multiple-frame surveys has grown in recent years as a way to save survey costs and reduce different types of nonsampling errors. Following the pioneering work by Hartley, methods and theory have been developed for such surveys. A key underlying assumption of current papers on multiple-frame surveys is that the domain membership of each unit of the finite population is known. This assumption, however, is rarely met in practice, and the effect of violating it on finite population inference is not fully understood. We first investigate the effect of misspecification of the domain membership on estimation and variance estimation. We then exploit recent developments in probabilistic record linkage techniques to adjust for biases due to domain membership misspecification in finite population inference. We study the properties of the proposed estimators and the associated variance estimators analytically and through Monte Carlo simulations.
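To see why domain membership matters, the sketch below computes a Hartley-type dual-frame estimate of a population total. It splits each sample into its frame-only and overlap domains and blends the two overlap estimates with a weight theta. The function name, the label strings, and the toy data are hypothetical, and the abstract does not specify which estimators the talk studies.

```python
import numpy as np

def hartley_total(y_a, w_a, dom_a, y_b, w_b, dom_b, theta=0.5):
    """Dual-frame estimate: Y_a + theta * Y_ab(A) + (1 - theta) * Y_ab(B) + Y_b.

    dom_a holds labels "a" or "ab"; dom_b holds "b" or "ab" (hypothetical labels).
    """
    y_a, w_a, dom_a = np.asarray(y_a, float), np.asarray(w_a, float), np.asarray(dom_a)
    y_b, w_b, dom_b = np.asarray(y_b, float), np.asarray(w_b, float), np.asarray(dom_b)
    only_a = np.sum((w_a * y_a)[dom_a == "a"])        # frame A only
    overlap_a = np.sum((w_a * y_a)[dom_a == "ab"])    # overlap, estimated from sample A
    overlap_b = np.sum((w_b * y_b)[dom_b == "ab"])    # overlap, estimated from sample B
    only_b = np.sum((w_b * y_b)[dom_b == "b"])        # frame B only
    return only_a + theta * overlap_a + (1 - theta) * overlap_b + only_b

# Toy samples with design weights (made-up numbers).
y_a, w_a = [1, 2, 3], [10, 10, 10]
y_b, w_b = [4, 5], [5, 5]

correct = hartley_total(y_a, w_a, ["a", "ab", "ab"], y_b, w_b, ["ab", "b"])
# Same data, but the second unit of sample A is wrongly recorded as frame-A-only.
misclassified = hartley_total(y_a, w_a, ["a", "a", "ab"], y_b, w_b, ["ab", "b"])
print(correct, misclassified)
```

In this toy case a single mislabeled unit moves the estimate from 70 to 80, since that unit's contribution is no longer down-weighted by theta.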
Abstract: Nonlinear phenomena in random processes can be modeled by a class of nonlinear polynomial functionals relating input and output. Residual coherence, a variation of the well-known measure of linear coherence, is a graphical tool for detecting and selecting potential second-order interactions as functions of a single time series and its lags. We extend residual coherence to account for interaction terms involving multiple time series. The method is applied to analyze the relationship between the implied market volatility of the stock market and that of the commodity market.
Abstract: The frequency-domain properties of nonstationary functional time series often contain valuable information. These properties are characterized through the time-varying power spectrum, which describes the contribution to the variability of a functional time series from waveforms oscillating at different frequencies over time. Practitioners seeking low-dimensional summary measures of the power spectrum often partition frequencies into bands and create collapsed measures of power within these bands. However, standard frequency bands have largely been developed through subjective inspection of time series data and may not provide adequate summary measures of the power spectrum. In this work we propose an adaptive frequency band estimation method for nonstationary functional time series that adequately summarizes the time-varying dynamics of the series and simultaneously accounts for the complex interaction between the functional and temporal dependence structures. We develop a scan statistic that takes a high value around any change in the frequency domain. We establish the theoretical properties of this statistic and develop a computationally efficient, scalable algorithm to implement it. The validity of our method is also supported by numerous simulation studies and an application to EEG data.
Abstract: The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal subsampling (OSS) approach for big data with a focus on linear regression models. The approach is inspired by the fact that an orthogonal array of two levels provides the best experimental design for linear regression models in the sense that it minimizes the average variance of the estimated parameters and provides the best predictions. The merits of OSS are three-fold: (i) it is easy to implement and fast; (ii) it is suitable for distributed parallel computing and ensures the subsamples selected in different batches have no common data points; and (iii) it outperforms existing methods in minimizing the mean squared errors of the estimated parameters and maximizing the efficiencies of the selected subsamples. Theoretical results and extensive numerical results show that the OSS approach is superior to existing subsampling approaches. It is also more robust to the presence of interactions among covariates and, when they do exist, OSS provides more precise estimates of the interaction effects than existing methods. The advantages of OSS are also illustrated through analysis of real data.
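The motivating fact about two-level orthogonal designs can be checked numerically. The sketch below is not the OSS selection rule, which the abstract does not detail. It only compares the average variance of the least-squares coefficients (up to the noise variance) for a full 2^3 factorial design against the same number of points drawn at random inside the cube [-1, 1]^3. The choice of three covariates and the random comparison design are assumptions for illustration.

```python
import itertools
import numpy as np

# Full 2^3 factorial: every combination of -1/+1 for three covariates.
design = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
X = np.column_stack([np.ones(len(design)), design])   # prepend intercept column

gram = X.T @ X                                         # equals 8 * identity here
avg_var = np.trace(np.linalg.inv(gram)) / X.shape[1]   # average coefficient variance / sigma^2

# Comparison design: 8 points drawn uniformly inside the same cube.
rng = np.random.default_rng(1)
X_rand = np.column_stack([np.ones(8), rng.uniform(-1, 1, size=(8, 3))])
avg_var_rand = np.trace(np.linalg.inv(X_rand.T @ X_rand)) / X_rand.shape[1]

print(avg_var, avg_var_rand)
```

With covariates scaled to [-1, 1], no set of eight points can give a smaller average variance than the orthogonal design's 1/8, which is the intuition behind preferring subsamples that resemble an orthogonal array.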
Abstract: We are living in the era of “Big Data.” A significant portion of big data is big spatial data captured through advanced technologies or large-scale simulations. Explosive growth in spatial and spatiotemporal data emphasizes the need for new, computationally efficient methods, with credible theoretical support, tailored to analyzing such large-scale data. Parallel statistical computing has proved to be a handy tool when dealing with big data. In general, it uses multiple processing elements simultaneously to solve a problem. However, conventional spatial regressions are hard to execute in parallel. This talk will introduce a novel parallel smoothing technique for generalized partially linear spatially varying coefficient models, which can be used under different hardware parallelism levels. Moreover, combined with concurrent computing, the proposed method can be easily extended to distributed systems. Regarding the theoretical support for estimators from the proposed parallel algorithm, we first establish the asymptotic normality of the linear estimators. Second, we show that the spline estimators reach the same convergence rate as the global spline estimators. The proposed method is evaluated through extensive simulation studies and an analysis of the US loan application data.