Statistics

Statistics Archives for Fall 2021 to Spring 2022

The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks

When: Thu, September 9, 2021 - 3:30pm
Where: Kirwan Hall 1308
Speaker: Vincent Lyzinski (UMD)
Abstract: Spectral inference on multiple networks is a rapidly developing subfield of graph statistics. Recent work has demonstrated that joint, or simultaneous, spectral embedding of multiple independent networks can deliver more accurate estimation than individual spectral decompositions of those same networks. Such inference procedures typically rely heavily on independence assumptions across the multiple network realizations, and even in this case, little attention has been paid to the induced network correlation in such joint embeddings. Here, we present a generalized omnibus embedding methodology and provide a detailed analysis of this embedding across both independent and correlated networks, the latter of which significantly extends the reach of such procedures. We describe how this omnibus embedding can itself induce correlation, leading us to distinguish between inherent correlation -- the correlation that arises naturally in multisample network data -- and induced correlation, which is an artifact of the joint embedding methodology. We show that the generalized omnibus embedding procedure is flexible and robust, and prove both consistency and a central limit theorem for the embedded points. We examine how induced and inherent correlation can impact inference for network time series data, and we provide network analogues of classical questions such as the effective sample size for more generally correlated data. Further, we show how an appropriately calibrated generalized omnibus embedding can detect changes in real biological networks that previous embedding procedures could not discern, confirming that the effect of inherent and induced correlation can be subtle and transformative, with import in theory and practice.
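
For readers unfamiliar with joint spectral embedding, the following is a minimal sketch of the classical omnibus construction that the generalized method in the talk builds on; it is not the speaker's generalized procedure. It assumes m symmetric adjacency matrices on the same n vertices, places each network on the block diagonal, and uses the pairwise average (A_i + A_j)/2 in the off-diagonal blocks. The function names and the choice to keep the d eigenvalues largest in magnitude are illustrative assumptions.

```python
import numpy as np

def omnibus_matrix(adjs):
    """Stack m symmetric n x n adjacency matrices into one mn x mn matrix.

    Block (i, j) is the average (A_i + A_j) / 2, so block (i, i) is A_i itself.
    """
    m = len(adjs)
    n = adjs[0].shape[0]
    M = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(m):
            M[i * n:(i + 1) * n, j * n:(j + 1) * n] = (adjs[i] + adjs[j]) / 2
    return M

def spectral_embed(M, d):
    """Embed rows of a symmetric matrix using its d leading eigenpairs.

    Assumption: "leading" means largest in absolute value; each eigenvector
    is scaled by the square root of the magnitude of its eigenvalue.
    """
    vals, vecs = np.linalg.eigh(M)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

# Hypothetical usage: two small networks on the same 4 vertices.
A1 = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], dtype=float)
Z = spectral_embed(omnibus_matrix([A1, A2]), d=2)
# Rows 0..3 of Z embed the vertices of A1, rows 4..7 those of A2.
```

Because every off-diagonal block mixes two networks, the embedded points for different networks share information even when the networks are independent, which is one way to see how a joint embedding can induce correlation.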

Probabilistic Record Linkage in Data Integration

When: Thu, September 23, 2021 - 3:30pm
Where: Kirwan Hall 1308
Speaker: Takumi Saegusa (UMD)
Abstract: Interest in multiple-frame surveys has grown in recent years as a way to save survey costs and reduce different types of nonsampling errors. Following the pioneering work by Hartley, methods and theories have been developed. A key underlying assumption of current papers on multiple-frame surveys is that the domain membership of each unit of the finite population is known. But this assumption is rarely met in practice. The effect of violating this critical assumption on finite population inference is not fully understood. We first investigate the effect of misspecification of the domain membership on estimation and variance estimation. We then exploit recent developments in probabilistic record linkage techniques to adjust for biases due to domain membership misspecification in finite population inference. We study the properties of the proposed estimators and the associated variance estimators analytically and through Monte Carlo simulations.

Extended Residual Coherence with a Financial Application

When: Thu, September 30, 2021 - 3:30pm
Where: Kirwan Hall 1308
Speaker: Xuze Zhang (UMD)
Abstract: Nonlinear phenomena in random processes can be modeled by a class of nonlinear polynomial functionals relating input and output. Residual coherence, a variation of the well-known measure of linear coherence, is a graphical tool to detect and select potential second-order interactions as functions of a single time series and its lags. An extension of residual coherence is made to account for interaction terms of multiple time series. The method is applied to analyze the relationship between the implied volatilities of the stock market and the commodity market.

Dynamic frequency band analysis for nonstationary functional time series

When: Thu, October 7, 2021 - 3:30pm
Where: Room 4122, CSIC Building
Speaker: Pramita Bagchi (George Mason University)
Abstract: The frequency-domain properties of nonstationary functional time series often contain valuable information. These properties are characterized through the time-varying power spectrum, which describes the contribution to the variability of a functional time series from waveforms oscillating at different frequencies over time. Practitioners seeking low-dimensional summary measures of the power spectrum often partition frequencies into bands and create collapsed measures of power within these bands. However, standard frequency bands have largely been developed through subjective inspection of time series data and may not provide adequate summary measures of the power spectrum. In this work we provide an adaptive frequency band estimation for nonstationary functional time series that adequately summarizes the time-varying dynamics of the series and simultaneously accounts for the complex interaction between the functional and temporal dependence structures. We develop a scan statistic that takes a high value around any change in the frequency domain. We establish the theoretical properties of this statistic and develop a computationally efficient, scalable algorithm to implement it. The validity of our method is also justified through numerous simulation studies and an application to EEG data.

Orthogonal Subsampling for Big Data Linear Regression

When: Thu, October 14, 2021 - 3:30pm
Where: CSIC 4122
Speaker: Lin Wang (George Washington University)
Abstract: The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal subsampling (OSS) approach for big data with a focus on linear regression models. The approach is inspired by the fact that an orthogonal array of two levels provides the best experimental design for linear regression models in the sense that it minimizes the average variance of the estimated parameters and provides the best predictions. The merits of OSS are three-fold: (i) it is easy to implement and fast; (ii) it is suitable for distributed parallel computing and ensures the subsamples selected in different batches have no common data points; and (iii) it outperforms existing methods in minimizing the mean squared errors of the estimated parameters and maximizing the efficiencies of the selected subsamples. Theoretical results and extensive numerical results show that the OSS approach is superior to existing subsampling approaches. It is also more robust to the presence of interactions among covariates and, when they do exist, OSS provides more precise estimates of the interaction effects than existing methods. The advantages of OSS are also illustrated through analysis of real data.

Big Spatial Data Learning: a Parallel Solution

When: Thu, October 21, 2021 - 3:30pm
Where: CSIC 4122
Speaker: Lily Wang (George Mason University)
Abstract: Nowadays, we are living in the era of “Big Data.” A significant portion of big data is big spatial data captured through advanced technologies or large-scale simulations. Explosive growth in spatial and spatiotemporal data emphasizes the need for developing new and computationally efficient methods and credible theoretical support tailored for analyzing such large-scale data. Parallel statistical computing has proved to be a handy tool when dealing with big data. In general, it uses multiple processing elements simultaneously to solve a problem. However, it is hard to execute the conventional spatial regressions in parallel. This talk will introduce a novel parallel smoothing technique for generalized partially linear spatially varying coefficient models, which can be used under different hardware parallelism levels. Moreover, combined with concurrent computing, the proposed method can be easily extended to distributed systems. Regarding the theoretical support of estimators from the proposed parallel algorithm, we first establish the asymptotic normality of linear estimators. Second, we show that the spline estimators reach the same convergence rate as the global spline estimators. The proposed method is evaluated through extensive simulation studies and an analysis of the US loan application data.

Novel variable screening methods for omics data integration

When: Thu, October 28, 2021 - 3:30pm
Where: CSIC 4122
Speaker: Tianzhou Ma (UMD (EPIB))
Abstract: Sure screening methods are a family of simple and effective dimension reduction methods that reduce noise accumulation for variable selection in high-dimensional regression and classification problems. Since the first method was proposed by Fan and Lv (2008), numerous sure screening methods have been developed for various model settings and have shown their advantage for big data analysis, with the desired scalability and theoretical guarantees. However, none of these methods is directly applicable to dimension reduction and variable selection in omics data integration problems. In this talk, I will introduce two novel variable screening methods recently developed in our group for both horizontal and vertical omics data integration. In the first project, we proposed a general framework and a two-step procedure to perform variable screening when combining the same type of omics data from multiple related studies, and showed that the inclusion of multiple studies provides more evidence to reduce dimension. In the second project, we developed a fast and robust variable screening method to detect epigenetic regulators of gene expression over the whole genome by combining epigenomic and transcriptomic data, where both the predictor and response spaces are high-dimensional. We used extensive simulations and real data to demonstrate the strengths of our methods compared to existing screening methods.
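
As background for the screening idea, here is a minimal sketch of marginal-correlation screening in the spirit of the original sure independence screening of Fan and Lv (2008), not the omics integration methods from the talk. Each predictor is ranked by the absolute value of its sample correlation with the response, and only the top d are kept. The function name, the simulated data, and the choice d = 10 are illustrative assumptions; the sketch also assumes no predictor column is constant.

```python
import numpy as np

def marginal_screening(X, y, d):
    """Return indices of the d predictors most correlated with y.

    X is an (n, p) design matrix and y an (n,) response. Columns of X are
    assumed to have nonzero variance so that standardization is defined.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    # Absolute sample correlation between each column of X and y.
    corr = np.abs(Xs.T @ ys) / len(y)
    return np.argsort(corr)[::-1][:d]

# Hypothetical usage: p = 500 predictors, only columns 0 and 7 matter.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 7] + rng.standard_normal(n)
kept = marginal_screening(X, y, d=10)
```

A full variable selection method (for example, a penalized regression) would then typically be run on the retained columns only.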

Dual Principal Component Pursuit

When: Thu, November 4, 2021 - 3:30pm
Where: CSIC 4122
Speaker: Rene Vidal (Johns Hopkins University)
Abstract: We consider the problem of learning a union of subspaces from data corrupted by outliers. State-of-the-art methods based on convex l1 and nuclear norm minimization require the subspace dimensions and the number of outliers to be sufficiently small. In this talk I will present a non-convex approach called Dual Principal Component Pursuit (DPCP), which can provably learn subspaces of high relative dimension and tolerate a large number of outliers by solving a non-convex l1 minimization problem on the sphere. Specifically, I will present both geometric and probabilistic conditions under which every global solution to the DPCP problem is a vector in the orthogonal complement to one of the subspaces. Such conditions show that DPCP can tolerate as many outliers as the square of the number of inliers. I will also present various optimization algorithms for solving the DPCP problem and show that a Projected Sub-Gradient Method admits linear convergence to the global minimum of the underlying non-convex and non-smooth optimization problem. Experiments show that the proposed method is able to handle more outliers and higher relative dimensions than state-of-the-art methods. Joint work with Tianjiao Ding, Daniel Robinson, Manolis Tsakiris and Zhihui Zhu.
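
To make the optimization problem concrete, the following is a rough sketch of a projected subgradient iteration for the single-hyperplane case: minimize the l1 norm of X^T b over unit vectors b, where the columns of X are the data points. It is an illustration under stated assumptions, not the speaker's implementation: the objective is scaled by 1/N, the starting point is the eigenvector of X X^T with the smallest eigenvalue, and the step size decays geometrically with hypothetical defaults mu0 = 0.1 and beta = 0.9.

```python
import numpy as np

def dpcp_psgm(X, n_iter=200, mu0=0.1, beta=0.9):
    """Projected subgradient sketch for min ||X^T b||_1 subject to ||b||_2 = 1.

    X is (D, N) with unit-norm columns. Returns a unit vector b that, when
    the method succeeds, is approximately normal to the inlier hyperplane.
    """
    N = X.shape[1]
    # Spectral initialization: eigh sorts eigenvalues in ascending order,
    # so column 0 is the direction of least energy in the data.
    _, vecs = np.linalg.eigh(X @ X.T)
    b = vecs[:, 0]
    mu = mu0
    for _ in range(n_iter):
        g = X @ np.sign(X.T @ b) / N   # subgradient of the scaled objective
        b = b - mu * g                 # subgradient step
        b = b / np.linalg.norm(b)      # project back onto the unit sphere
        mu *= beta                     # geometrically diminishing step size
    return b

# Hypothetical usage: 200 inliers in the plane z = 0 of R^3 plus 20 outliers.
rng = np.random.default_rng(0)
inliers = np.vstack([rng.standard_normal((2, 200)), np.zeros((1, 200))])
outliers = rng.standard_normal((3, 20))
X = np.hstack([inliers, outliers])
X = X / np.linalg.norm(X, axis=0)
b = dpcp_psgm(X)
# b should be close to (0, 0, 1) or (0, 0, -1), the normal of the inlier plane.
```

Working with the normal vector rather than a basis of the subspace is what lets this formulation handle subspaces of high relative dimension.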

Jump Q-Learning for Optimal Interval-Valued Treatment Decision Rule

When: Thu, November 11, 2021 - 3:00pm
Where: CSIC 4122
Speaker: Wenbin Lu (North Carolina State University) - https://www4.stat.ncsu.edu/~lu/
Abstract: An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this work, we focus on the continuous treatment setting and propose a jump Q-learning method to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump Q-learning method estimates the conditional mean of the response given the treatment and the covariates (the Q-function) via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated Q-function. The regressor is allowed to be either linear for clear interpretation or a deep neural network to model complex treatment-covariate interactions. To implement jump Q-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the Q-function. Statistical properties of the resulting I2DR are established when the Q-function is either a piecewise or a continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the estimated optimal policy. Extensive simulations and a real data application to a warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.

Qualitative Insights from Social Media Data

When: Thu, November 18, 2021 - 3:30pm
Where: CSIC 4122
Speaker: Robyn Ferg (Westat)
Abstract: Social media has been used, with varying degrees of success, to quantify public opinion on various topics. An implicit analogy underlying this work is that social media data may be regarded, at least in some ways, as similar to data gathered from traditional designed public opinion surveys. At the same time there has been enthusiasm about using the content of social media posts as a source of qualitative data, to enrich the information produced in traditional sample surveys. In the work presented here, we adopt an analogy based on the latter view, namely that social media data may be regarded as similar to data gathered from a focus group -- but a very large one.

In this framework, the primary challenge is the sheer volume of data; it is simply not possible for a human to read and digest all of the relevant material posted to social media. Yet the volume of these data is in many ways their principal benefit. A spectrum of approaches to this problem is possible, with varying levels of automation: bottom-up topic modeling algorithms, semi-supervised topic modeling, and manual human coding of different samples of social media content. To facilitate qualitative exploration of social media corpora that allows one to see both "the forest and the trees," we have been developing an interactive tool--a Tweet Browser--that allows users to identify tweets related in content based on new clustering approaches. The tool is intended to support analysts' guided exploration of social media, at various levels of detail, so as to generate rich insights based on a large quantity of data. We have tested the tool in a preliminary way using a collection of tweets that mention the Census Bureau or Census data. Potential benefits include not only better understanding the public's view of the Census Bureau and data but potentially discovering new phenomena that the Bureau may not yet measure systematically, improving and empirically informing questionnaire development, and informing data collection with real-time information.

Bayesian wavelet-packet historical functional linear models

When: Thu, February 10, 2022 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Mark Meyer (Georgetown University)
Abstract: Historical functional linear models (HFLMs) quantify associations between a functional predictor and functional outcome where the predictor is an exposure variable that occurs before, or at least concurrently with, the outcome. Prior work on the HFLM has largely focused on estimation of a surface that represents a time-varying association between the functional outcome and the functional exposure. This existing work has employed frequentist and spline-based estimation methods, with little attention paid to formal inference or adjustment for multiple testing and no approaches that implement wavelet bases. In this work, we propose a new functional regression model that estimates the time-varying, lagged association between a functional outcome and a functional exposure. Building off of recently developed function-on-function regression methods, the model employs a novel use of the wavelet-packet decomposition of the exposure and outcome functions that allows us to strictly enforce the temporal ordering of exposure and outcome, which is not possible with existing wavelet-based functional models. Using a fully Bayesian approach, we conduct formal inference on the time-varying lagged association, while adjusting for multiple testing. We investigate the operating characteristics of our wavelet-packet HFLM and compare them to those of two existing estimation procedures in simulation. We also assess several inference techniques and use the model to analyze data on the impact of lagged exposure to particulate matter finer than 2.5 μm, or PM2.5, on heart rate variability in a cohort of journeyman boilermakers during the morning of a typical day’s shift.

Understanding bias in microbiome sequencing studies and ways to address it

When: Thu, February 24, 2022 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Ni Zhao (Johns Hopkins University) - https://www.biostat.jhsph.edu/~nzhao/
Abstract: Bias is ubiquitous in microbiome sequencing studies. The observed relative abundances are only a distorted version of their true values due to the differential efficiency of the PCR and sequencing processes. Bias leads to invalid statistical inference even in a well-designed study in which all samples were processed under the same conditions (i.e., no batch effect). In this talk, I will focus on two topics. Topic one involves a novel log-linear model for understanding the bias generation process using mock communities, i.e., microbiome communities for which the true relative abundances are known a priori. Topic two involves a bias-resistant model of microbiome differential abundances using rank-based regression. We show via extensive simulations the benefit of the proposed method compared to its potential competitors.

Bayesian sparse regression for large-scale observational health data

When: Thu, March 17, 2022 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Akihiko Nishimura (Johns Hopkins University) - https://aki-nishimura.github.io
Abstract: The growing availability of large healthcare databases presents opportunities to investigate how patients' responses to treatments vary across subgroups. Even with the large cohort sizes found in these databases, however, low incidence rates make it difficult to identify causes of treatment effect heterogeneity among a large number of clinical covariates. Sparse regression provides a potential solution. The Bayesian approach is particularly attractive in our setting, where the signals are weak and heterogeneity across databases is substantial. Applications of Bayesian sparse regression to large-scale data sets, however, have been hampered by the lack of scalable computational techniques. We adapt ideas from numerical linear algebra and computational physics to tackle the critical bottleneck in computing posteriors under Bayesian sparse regression. For linear and logistic models, we develop the conjugate gradient sampler for high-dimensional Gaussians along with the theory of prior-preconditioning. For more general regression and survival models, we develop the curvature-adaptive Hamiltonian Monte Carlo to efficiently sample from high-dimensional log-concave distributions. We demonstrate the scalability of our method on an observational study involving n = 1,065,745 patients and p = 15,779 clinical covariates, designed to compare the effectiveness of the most common first-line hypertension treatments. The large cohort size allows us to detect evidence of treatment effect heterogeneity previously unreported by clinical trials.

Weight Calibration to Improve Efficiency for Estimating Absolute Risks from Nested Case-control Design

When: Thu, April 7, 2022 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Yei Eun Shin (National Cancer Institute)
Abstract: We study the efficiency of absolute risk estimates when some covariates are only available for case-control samples nested in a cohort. Researchers have calibrated design-based inclusion probability weights to increase the efficiency of relative hazard estimates under the Cox proportional hazards model. We extend weight calibration approaches to improve the precision of estimates of both relative hazards and absolute risks by additionally using follow-up time information available in the cohort. We derive explicit variance formulas for the weight-calibrated estimates based on influence functions. Simulations show the improvement in precision from using weight calibration and confirm the consistency of variance estimators and the validity of inference based on asymptotic normality. Further studies applying weight calibration techniques to the semiparametric additive hazards model and to risk model validation will also be discussed. Examples are provided using data from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial.

Applying constrained estimation in complex surveys

When: Thu, April 21, 2022 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Jean Opsomer (Westat)
Abstract: In many surveys conducted by governments and other organizations, estimates are often produced for large numbers of domains. When a priori qualitative constraints between domain means can be specified, it is natural to attempt to ensure that the estimates likewise satisfy the constraints, with the goal of improving the precision of the estimates and their acceptability by data users. We describe a general framework for design-based constrained estimation for domains and briefly review its statistical properties. An R package that implements constrained estimation and inference for complex survey data has recently been released. We will use it to analyze a number of existing datasets in order to illustrate the applicability of the methods.

Three different ideas for model and feature selection for large or complex data

When: Thu, April 28, 2022 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Jiayang Sun (George Mason University) - https://mason.gmu.edu/~jsun21/
Abstract: The prolific accumulation of data from multiple domains provides a wonderful landscape of many interacting factors to target outcomes, such as those for developing drug targets or understanding reproductive success at high altitudes. These data challenge existing model and feature selection procedures used in statistics and data science. This talk presents three methods/ideas for overcoming some of these challenges. The first is called the Subsampling Winner Algorithm (SWA). It subsamples from the feature space for feature selection in large-p regression problems, unlike methods that subsample from the n observations or that use a penalized likelihood or shrinkage estimation procedure, such as LASSO, SCAD, Elastic Net, or MCP. SWA has the best-controlled FDR in comparison with the benchmark and randomForest procedures, while having a competitive true feature discovery rate in the linear regression setting. The second is called nFCA, providing a simultaneous network and clustering procedure to aid feature selection, for example, for drug discovery. The third is our ongoing work on selecting different non-linear transformations for many-variable problems using a semi-parametric pipeline with some sufficiency or necessity guarantees. We demonstrate its application to our study of social, physiological, and genetic contributions to reproductive success among Tibetan women. These non-linear transformations can suggest curves with change points and other non-linear forms that are biologically meaningful. This is joint work with Y. Fan, J. Ma, C. Beall, S. Ye, and M. Meyer.

Conditional distributions of pattern statistics and other inferential procedures in states of hidden sparse Markov models

When: Thu, May 5, 2022 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Donald Eugene Kemp Martin (North Carolina State University)
Abstract: Whereas the number of parameters in a general higher-order Markov model is exponential in the order of dependence and the model has limited flexibility, sparse Markov models help with these problems. A sparse Markov model is a higher-order Markov model for which conditioning histories are grouped into classes such that the conditional probability distribution given any history of the class is the same. We introduce a model where variables following a sparse Markov structure are latent, and all inference over the latent states is conditioned on observed data. Then several tasks are considered in this sparse Markov setting: determining an appropriate model and parameter estimation, methodology for efficient computation of conditional distributions of pattern statistics over the hidden states, determining the likelihood of the observations, and obtaining the most likely hidden state at each time point and the most likely hidden state sequence, given the observations. An application is given to modeling fluctuations in the price of the S&P 500.