Abstract: Linear model prediction with a large number of potential predictors is both statistically and computationally challenging. Traditional approaches are largely based on shrinkage selection/estimation methods, which are applicable even when the number of potential predictors is (much) larger than the sample size. One instance of the latter scenario occurs when the candidate predictors include many binary indicators corresponding to categories of categorical predictors as well as their interactions. In such a case, we propose an alternative to the shrinkage prediction methods based on mixed model prediction, which effectively treats combinations of the categorical effects as random effects. We establish the theoretical validity of the proposed method and demonstrate empirically its advantage over the shrinkage methods. We also develop measures of uncertainty for the proposed method and evaluate their performance empirically. A real-data example is considered. This is joint work with Hanmei Sun of Shandong Normal University, China.
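The abstract's idea of treating categorical-effect combinations as random effects can be illustrated, in a deliberately simplified form that is not the speakers' actual method, by the classical BLUP under a one-way random-effects model: each cell (category-combination) mean is shrunk toward the grand mean by a factor determined by the variance components. All data and variance values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: outcomes grouped by 20 cells, e.g. combinations of two
# categorical predictors, with 5 observations per cell.
n_cells, n_per_cell = 20, 5
true_effects = rng.normal(0.0, 1.0, n_cells)                 # cell-level random effects
y = true_effects[:, None] + rng.normal(0.0, 2.0, (n_cells, n_per_cell))

cell_means = y.mean(axis=1)
grand_mean = y.mean()

# Variance components (assumed known here for simplicity).
sigma2_e, sigma2_u = 4.0, 1.0
shrink = sigma2_u / (sigma2_u + sigma2_e / n_per_cell)       # BLUP shrinkage factor

# Best linear unbiased predictor of each cell mean: raw cell means
# shrunk toward the grand mean.
blup = grand_mean + shrink * (cell_means - grand_mean)
```

Because the shrinkage factor lies in (0, 1), every predicted cell mean sits between the raw cell mean and the grand mean, which is the sense in which the random-effects treatment regularizes the many categorical effects.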
Abstract: A rate ratio (RR) is an important metric for comparing cancer risks among subpopulations. Inference for the RR becomes complicated when the populations used for calculating age-standardized cancer rates involve sampling errors, a situation that arises increasingly often as sample surveys must be used to obtain the population data. We compare several strategies for estimating the standardized RR and propose bias-corrected ratio estimators, together with the corresponding variance estimators and confidence intervals, that simultaneously account for the sampling error in estimating populations and the traditional Poisson error in the occurrence of cancer cases or deaths. The performance of the proposed methods is evaluated empirically in simulation studies. An application to immigration disparities in cancer mortality among Hispanic Americans is discussed. Our simulation studies show that a bias-corrected RR estimator performs best in reducing bias without increasing the coefficient of variation, and that the proposed variance estimators for the RR estimators and the associated confidence intervals are fairly accurate. The findings of our application study are both interesting and consistent with common sense as well as with the results of our simulation studies.
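For readers unfamiliar with the quantity being estimated, a minimal sketch of a directly age-standardized rate ratio follows; the counts, populations, and standard weights are hypothetical, and the sketch deliberately ignores the sampling and Poisson errors that the talk addresses.

```python
import numpy as np

# Hypothetical age-specific cancer counts and populations for two subpopulations,
# over three age groups.
weights = np.array([0.4, 0.35, 0.25])     # standard population weights (sum to 1)
cases_a = np.array([20, 80, 150])
pop_a = np.array([50_000, 40_000, 20_000])
cases_b = np.array([15, 60, 90])
pop_b = np.array([60_000, 45_000, 15_000])

def std_rate(cases, pop, w):
    """Directly age-standardized rate: weighted sum of age-specific rates."""
    return float(np.sum(w * cases / pop))

# The rate ratio compares the two standardized rates.
rr = std_rate(cases_a, pop_a, weights) / std_rate(cases_b, pop_b, weights)
```

When the population denominators come from a survey rather than a census, they carry sampling error, and a plug-in ratio like this one becomes biased; that is the motivation for the bias-corrected estimators in the talk.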
Abstract: The spectral density plays a pivotal role in time series analysis. Since the classical spectral density is defined as the Fourier transform of the autocovariance function, it fails to capture distributional features. To overcome this drawback, we consider a copula-based spectral density and show the weak convergence of integrated copula spectra. This result, combined with a subsampling procedure, enables us to construct uniform confidence bands, a test for time-reversibility, and a test for tail symmetry. This talk is based on joint work with T. Kley (Georg-August-Univ. Göttingen), R. Van Hecke (Ruhr-Univ. Bochum), S. Volgushev (Univ. of Toronto), H. Dette (Ruhr-Univ. Bochum), and M. Hallin (Univ. libre de Bruxelles).
Abstract: Modern machine learning algorithms are compelling in prediction problems. However, owing to their black-box nature, the performance of machine learning algorithms is hard to evaluate statistically and can vary across datasets and underlying setups. This issue is even more pronounced when multiple machine learning algorithms are considered, including penalized regression, random forests, gradient boosting, etc. Among these algorithms, it is notoriously challenging to determine the most appropriate one to use in practice, especially in the context of causal inference. In this talk, I will cover two topics. The first is a robust causal machine learner in the context of mean estimation. The proposed learner enables valid statistical inference and enjoys multiple robustness: it accommodates multiple machine learning algorithms and remains robust as long as one of the candidate algorithms works well. The second topic is an advanced scheme for integrating extra information from auxiliary data into the causal machine learner, which can substantially boost estimation efficiency. Extensive numerical studies demonstrate the superiority of our method over competing methods in terms of smaller estimation bias and variability. In addition, the validity of the proposed method is assessed in a real application using UK Biobank data.
Abstract: Drug development is a complex and expensive scientific endeavor. Statisticians, with their quantitative training, play a unique role in the pharmaceutical industry. Their work impacts business decisions that drive the success of drug development. They are involved in all stages of a clinical trial, from trial design, protocol writing, and data collection to data analysis and results interpretation. Their contributions do not stop there: they play a pivotal role in the entire clinical development program and its lifecycle management. They have the opportunity to innovate clinical trials, to advance science, and to create new drugs that benefit millions of patients. In this presentation, we will provide an overview of how statisticians contribute to drug development and what they do in their daily work, from an early-phase biometrics point of view. In particular, we will cover some of the exciting innovations we are working on at AstraZeneca.
Abstract: The frequency-domain properties of nonstationary functional time series often contain valuable information. These properties are characterized through the time-varying power spectrum. Practitioners seeking low-dimensional summary measures of the power spectrum often partition frequencies into bands and create collapsed measures of power within bands. However, standard frequency bands have largely been developed through manual inspection of time series data and may not adequately summarize power spectra. In this project, we propose a framework for adaptive frequency band estimation for nonstationary functional time series that optimally summarizes the time-varying dynamics of the series. We develop a scan statistic and a search algorithm to detect changes in the frequency domain. We establish the theoretical properties of this framework and develop a computationally efficient implementation. The validity of our method is also demonstrated through numerous simulation studies and an application to electroencephalogram data from participants alternating between eyes-open and eyes-closed conditions.
Abstract: High-dimensional distributions, especially those with heavy tails, are notoriously difficult for off-the-shelf MCMC samplers: the combination of unbounded state spaces, diminishing gradient information, and local moves results in empirically observed "stickiness" and poor theoretical mixing properties, namely a lack of geometric ergodicity. In this talk, we introduce a new class of MCMC samplers that map the original high-dimensional problem in Euclidean space onto a sphere and remedy these notorious mixing problems. In particular, we develop random-walk Metropolis-type algorithms as well as versions of the Bouncy Particle Sampler that are uniformly ergodic for a large class of light- and heavy-tailed distributions and also empirically exhibit rapid convergence in high dimensions. In the best scenario, the proposed samplers enjoy the "blessings of dimensionality" in that the mixing time decreases with dimension.
Abstract: Many high-dimensional problems involve reconstruction of a low-rank matrix from highly incomplete and noisy observations. Despite substantial progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained low-rank estimates, and how to construct valid yet short confidence intervals for the unknown low-rank matrix.
In this talk, I will discuss how to perform inference and uncertainty quantification for two examples of low-rank models: (1) noisy matrix completion, and (2) heteroskedastic PCA with missing data. For both problems, we identify statistically efficient estimators that admit non-asymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals for, say, the unseen entries of the low-rank matrix of interest. Our inferential procedures do not rely on sample splitting, thus avoiding unnecessary loss of data efficiency. All this is accomplished by a powerful leave-one-out analysis framework that originated in probability and random matrix theory.
Abstract: Pfeffermann and Sverchkov (JASA, 2007) considered small area estimation (SAE) under informative sampling of areas and within the areas. Under informative sampling, the sample inclusion probabilities are related to the values of the target variable (unobserved area means or observed individual values), even after conditioning on the model covariates. In this presentation, I additionally assume not-missing-at-random (NMAR) nonresponse within the sampled areas. As illustrated in the presentation, ignoring informative sampling or NMAR nonresponse may result in highly biased predictors.
Abstract: How can increasingly available observational data be used to improve the design and analysis of randomized controlled trials (RCTs)? One approach is to couple an RCT with an observational study using shrinkage estimation, leaning on the observational data more heavily when it exhibits greater congruence with estimates from the RCT. We operate in a stratified setting, and consider two questions: 1) how can we develop shrinkage estimators that combine causal estimates from observational and experimental sources, and 2) with these estimators at our disposal, how might we design experiments more efficiently?
To answer the former question, we extend results from the Stein shrinkage literature. We propose a generic procedure for deriving shrinkage estimators that leverage observational and randomized data together, making use of a generalized unbiased risk estimate. We develop two new estimators and prove finite-sample conditions under which they have lower risk than an estimator using only experimental data. We also draw connections between our approach and results from sensitivity analysis, and propose a method for evaluating estimator feasibility.
We next consider designing a prospective randomized trial. If we intend to shrink the experiment’s causal estimates toward those of a completed observational study, how do we optimize the experimental design? We show that the risk of the shrinkage estimator can be computed efficiently via numerical integration. We then propose algorithms for determining the best allocation of units to strata, accounting for the imperfect parameter estimates we would have from the observational study.
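To give a concrete feel for shrinking experimental estimates toward observational ones, here is a minimal James-Stein-style sketch with hypothetical stratum-level numbers. It is a generic illustration, not the estimators derived in the talk: the RCT estimates are treated as unbiased but noisy, the observational estimates as precise but possibly biased, and the shrinkage weight is data-driven, pulling harder toward the observational study when the two sources agree.

```python
import numpy as np

# Hypothetical stratum-level causal estimates.
tau_rct = np.array([1.6, 0.4, 2.0, 1.1])   # RCT estimates per stratum
se_rct  = np.array([0.3, 0.3, 0.3, 0.3])   # their standard errors
tau_obs = np.array([1.0, 1.0, 1.1, 0.9])   # observational estimates

# James-Stein-style shrinkage toward the observational estimates:
# the standardized distance between the two sources sets the weight,
# so greater congruence means heavier reliance on the observational data.
k = len(tau_rct)
dist2 = np.sum(((tau_rct - tau_obs) / se_rct) ** 2)
lam = max(0.0, 1.0 - (k - 2) / dist2)      # data-driven weight on the RCT
tau_shrunk = tau_obs + lam * (tau_rct - tau_obs)
```

Each shrunken estimate lands between its RCT and observational counterparts; when the sources strongly disagree, lam approaches 1 and the estimator falls back on the unbiased RCT data.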
Abstract: We consider network autoregressive models for count data with a non-random neighborhood structure. The main methodological contribution is the development of conditions that guarantee stability and valid statistical inference for such models. We consider both fixed and increasing network dimension, and we show that quasi-likelihood inference provides consistent and asymptotically normally distributed estimators. The work is complemented by simulation results and a data example. This is joint work with M. Armillotta.
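As a rough illustration of the model class (not necessarily the exact specification in the talk), a linear Poisson network autoregression lets each node's intensity depend on its own past count and on the past counts of its neighbors through a fixed, row-normalized adjacency matrix; the parameter values below are hypothetical and chosen so that the autoregressive coefficients sum to less than one, a typical stability-type condition.

```python
import numpy as np

rng = np.random.default_rng(1)

# lambda_{i,t} = beta0 + beta1 * (W y_{t-1})_i + beta2 * y_{i,t-1},
# with W a fixed (non-random) row-normalized adjacency matrix.
n, T = 10, 200
A = (rng.random((n, n)) < 0.3).astype(float)
np.fill_diagonal(A, 0.0)
W = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # row-normalize, avoid 0/0

beta0, beta1, beta2 = 1.0, 0.3, 0.4                    # beta1 + beta2 < 1
y = np.zeros((T, n))
y[0] = rng.poisson(beta0, n)
for t in range(1, T):
    lam = beta0 + beta1 * (W @ y[t - 1]) + beta2 * y[t - 1]
    y[t] = rng.poisson(lam)                            # conditionally Poisson counts
```

Under this kind of condition the simulated counts fluctuate around a finite mean rather than exploding, which is the stability property the talk's conditions are designed to guarantee.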
4176 Campus Drive - William E. Kirwan Hall
College Park, MD 20742-4015
P: 301.405.5047 | F: 301.314.0827