Statistics

Statistics Archives for Fall 2015 to Spring 2016

On the Performance of Sequential Procedures for Detecting a Change and Information Quality

When: Thu, September 17, 2015 - 3:30pm
Where: MTH 1313
Speaker: Professor Ron S. Kenett (The KPA Group, Raanana, Israel & Department of Mathematics, University of Turin, Italy) -
Abstract: The literature on statistical process control has focused on the Average Run Length (ARL) to an alarm as a performance criterion of sequential schemes. When the process is in control, ARL0 denotes the ARL to false alarm and represents the in-control operating characteristic of the procedure. The average run length from the occurrence of a change to its detection, typically denoted by ARL1, represents the out-of-control operating characteristic. These indices, however, do not tell the whole story. The concept of information quality (InfoQ) is defined as the potential of a dataset to achieve a specific (scientific or practical) goal using a given empirical analysis method. InfoQ is derived from the utility (U) of applying an analysis (f) to a data set (X) for a given purpose (g). Formally, the concept of Information Quality is defined as: InfoQ(f, X, g) = U(f(X | g)). In this talk, we suggest the use of probability of false alarm (PFA) and conditional expected delay (CED) as alternatives to ARL0 and ARL1 that enhance the information quality of statistical process control methods. As an extension, we discuss the concept of a system for statistical process control in the context of a life cycle view of statistics.
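
As a rough illustration of the indices named in this abstract, the sketch below computes ARL0, the PFA over a fixed horizon, and the expected detection delay for a two-sided Shewhart chart on independent standard normal observations. The Shewhart chart, the 3-sigma limits, the shift of one standard deviation, and all function names are assumptions chosen for simplicity; none of them come from the talk. Because this chart has no memory, the run length is geometric and the expected delay does not depend on when the change occurs; for procedures with memory, such as CUSUM, it generally does.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def alarm_probability(limit=3.0, shift=0.0):
    """Probability that one N(shift, 1) observation falls outside +/- limit."""
    return 1.0 - (normal_cdf(limit - shift) - normal_cdf(-limit - shift))


def average_run_length(limit=3.0, shift=0.0):
    """Mean of the geometric run length: 1 / per-observation alarm probability."""
    return 1.0 / alarm_probability(limit, shift)


def prob_false_alarm(horizon, limit=3.0):
    """Probability of at least one false alarm within `horizon` in-control observations."""
    p = alarm_probability(limit, 0.0)
    return 1.0 - (1.0 - p) ** horizon


# Hypothetical settings: 3-sigma limits, a mean shift of one standard deviation.
arl0 = average_run_length(limit=3.0, shift=0.0)
expected_delay = average_run_length(limit=3.0, shift=1.0)
pfa_100 = prob_false_alarm(horizon=100)

print(f"ARL0 = {arl0:.1f}")
print(f"Expected delay after a 1-sigma shift = {expected_delay:.1f}")
print(f"PFA within 100 observations = {pfa_100:.3f}")
```

Under these assumptions an ARL0 of several hundred observations still leaves a sizeable chance of a false alarm within the first hundred, which is the kind of information a single average can hide.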

A Scalable Empirical Bayes Approach to Variable Selection

When: Thu, October 1, 2015 - 3:30pm
Where: MTH 1313
Speaker: Professor Haim Bar (Department of Statistics, University of Connecticut - Storrs) -
Abstract: We develop a model-based empirical Bayes approach to variable selection problems in which the number of predictors is very large, possibly much larger than the number of responses (the so-called “large p, small n” problem). We consider the multiple linear regression setting, where the response is assumed to be a continuous variable and it is a linear function of the predictors plus error. The explanatory variables in the linear model can have a positive effect on the response, a negative effect, or no effect. We model the effects of the linear predictors as a three-component mixture in which a key assumption is that only a small (unknown) fraction of the candidate predictors have a non-zero effect on the response variable. By treating the coefficients as random effects we develop an approach that is computationally efficient because the number of parameters that have to be estimated is small, and remains constant regardless of the number of explanatory variables. The model parameters are estimated using the EM algorithm which is scalable and leads to significantly faster convergence, compared with simulation-based methods. This work is joint with James Booth and Martin T. Wells.

Biased Sampling, Missing at Random, Missing not at Random Data: Connections and Some Solutions

When: Thu, October 22, 2015 - 3:30pm
Where: MTH 1313
Speaker: Jing Qin, Ph.D. (Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, NIH) -
Abstract: The problem of missing response data is ubiquitous in medical and social science studies. Data are said to be missing at random if the missingness does not depend on unobserved quantities; otherwise they are called missing not at random, or non-ignorable missing data. Biased sampling occurs when an investigator records an observation by nature according to a certain stochastic model; the recorded observation will then not have the original distribution unless every observation is given an equal chance of being recorded. Biased sampling problems arise in many areas, including survey sampling, epidemiology, economics, and meta-analysis, and they occur more frequently than it may appear. As Professor James Heckman (1979), 2000 Nobel Laureate in Economics, pointed out, "Sample selection bias may arise in practice for two reasons. First, there may be self selection by the individuals or data units being investigated. Second, sample selection decisions by analysts or data processors operate in much the same fashion as self selection".

In this talk, the connections between biased sampling problems and missing data will be reviewed, and the existing methods for handling missing at random data problems will be discussed briefly. The emphasis will be on missing not at random data problems. As examples, the use of the number of failed contact attempts (also called paradata in surveys) to adjust for non-ignorable non-response, and capture-recapture problems, will be considered. Some useful statistical tools, such as empirical likelihood, pairwise conditional likelihood for nuisance parameter elimination, pseudo-likelihood, and profile likelihood, will be presented.

Convex Geometry and Gaussian Processes

When: Thu, November 5, 2015 - 3:30pm
Where: MTH 1313
Speaker: Professor Rick Vitale (Department of Statistics, University of Connecticut - Storrs) -
Abstract: The use of convex geometric methods in the study of Gaussian processes has a long history, especially in the treatment of general bounds and inequalities. More recently there has been a reciprocal use of Gaussian processes in the study of convex geometric questions. The talk will describe interactions of both types with special reference to a class of geometric functionals called intrinsic volumes. Along the way, we will describe a LLN for convex bodies and other results joining the two areas.

Inference when Models are Approximations Rather than Truths

When: Thu, November 19, 2015 - 3:30pm
Where: MTH 1313
Speaker: Professor Andreas Buja (Department of Statistics, The Wharton School, University of Pennsylvania) -
Abstract: We address two problems with statistical inference: (1) Modern approaches to data analysis rarely follow the protocol that lends validity to statistical inference. (2) Statistical inference often makes assumptions that are difficult to justify. Problem (1) arises every time that model selection (stepwise, AIC, Lasso,...) is applied. Problem (2) is forced on us by the treatment of regressors/covariates as fixed values rather than random variables. We will give a tour of the issues and pointers to partial solutions. In particular we will describe the PoSI method for post-selection inference, and we will untangle the problem of regressor ancillarity which is overcome by inference methods based on sandwich estimators and pairs bootstrap. This work is joint with Richard Berk, Lawrence Brown, Mikhail Traskin, Kai Zhang, Emil Pitkin, Linda Zhao, Ed George.
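
To make the pairs bootstrap mentioned in this abstract concrete, here is a minimal sketch for the slope of a simple linear regression. It resamples (x, y) pairs jointly, so the regressors are treated as random rather than fixed. The simulated heteroscedastic data, the number of resamples, and the function names are all assumptions for illustration; this is not the speakers' implementation, and it does not cover PoSI or sandwich estimators.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope of y on x; returns None if all x values coincide."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def pairs_bootstrap_se(xs, ys, n_boot=2000, seed=0):
    """Bootstrap standard error of the slope, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slope = ols_slope([xs[i] for i in idx], [ys[i] for i in idx])
        if slope is not None:  # skip degenerate resamples with a single x value
            slopes.append(slope)
    return statistics.stdev(slopes)


# Hypothetical data: the noise standard deviation grows with x (heteroscedastic).
data_rng = random.Random(1)
xs = [i / 10 for i in range(1, 41)]
ys = [1.0 + 2.0 * x + x * data_rng.gauss(0.0, 1.0) for x in xs]

print("OLS slope:", ols_slope(xs, ys))
print("Pairs-bootstrap SE of slope:", pairs_bootstrap_se(xs, ys))
```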

On Some New Advances in Exploiting and Understanding Kendall's Tau

When: Thu, December 3, 2015 - 3:30pm
Where: MTH 1313
Speaker: Professor Fang Han (Visiting Assistant Professor, Department of Biostatistics, Johns Hopkins University) -
Abstract: Kendall's tau is a well-known correlation measure that has been used for decades. Methodologically, it has proved its usefulness in giving robust alternatives for solving many fundamentally important statistics problems. In the Big Data era, such robustness is especially desired. Theoretically, as a U-statistic with a discontinuous but bounded kernel, Kendall's tau motivates new techniques in empirical processes (EP), random matrix theory (RMT), and time series analysis. In this talk, I will give a brief introduction to some recent advances in developing and understanding Kendall's tau based approaches. At the core of my analysis are several new and general EP and RMT results under both independent and mixing conditions.
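
The description of Kendall's tau as a U-statistic with a bounded, discontinuous kernel can be sketched directly: average the kernel sign(x_i - x_j) * sign(y_i - y_j) over all pairs. The code below is a plain O(n^2) version without a tie correction (often called tau-a); the function names and the toy data are assumptions for illustration only.

```python
from itertools import combinations


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def kendall_tau(xs, ys):
    """Kendall's tau as a second-order U-statistic (no tie correction)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    total = sum(
        sign(xs[i] - xs[j]) * sign(ys[i] - ys[j])
        for i, j in combinations(range(n), 2)
    )
    return total / (n * (n - 1) / 2)


# Hypothetical data: inflating the largest y value leaves tau unchanged,
# because the kernel depends only on the ordering of the observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
y_outlier = [2, 1, 4, 3, 5000]

print(kendall_tau(x, y))
print(kendall_tau(x, y_outlier))
```

Since each kernel value lies in {-1, 0, 1}, a single extreme observation can change the statistic only through its rank, which is one way to see the robustness the abstract refers to.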

Visualization, Statistical Modeling and Discovery in Computational Epigenomics

When: Thu, February 11, 2016 - 3:30pm
Where: MTH 1313
Speaker: Hector Corrada Bravo (Assistant Professor, Department of Computer Science, UMD) -
Abstract: The use of epigenomics to study mechanisms in development and disease using high-throughput techniques has been one of the most active areas in life and clinical sciences in the last five years. In this talk, I will present advances in statistical learning methods and data visualization for computational epigenomics and fundamental discoveries of molecular mechanisms in cancer facilitated by these tools. Along the way, I will describe novel methods, systems, and tools for effective computational and visually interactive exploratory data analysis.

Generalized Additive Coefficient Models with High-Dimensional Covariates for Genome-Wide Association

When: Thu, February 18, 2016 - 3:30pm
Where: MTH 1313
Speaker: Hua Liang (Professor, Department of Statistics, George Washington University) -
Abstract: In the low-dimensional case, the generalized additive coefficient model (GACM) has been demonstrated to be a powerful tool for studying nonlinear interaction effects of variables. In this paper, we propose estimation and inference procedures for the GACM when the dimension of the variables is high. Specifically, we propose a groupwise penalization based procedure to distinguish significant covariates for the “large p, small n” setting. The procedure is shown to be consistent for model structure identification. Further, we construct simultaneous confidence bands for the coefficient functions in the selected model based on a refined two-step spline estimator. We also discuss how to choose the tuning parameters. To estimate the standard deviation of the functional estimator, we adopt the smoothed bootstrap method. We conduct simulation experiments to evaluate the numerical performance of the proposed methods and analyze an obesity data set from a genome-wide association study as an illustration.

Medical Device Clinical Trials and Bayesian Statistics

When: Thu, March 3, 2016 - 3:30pm
Where: MTH 1313
Speaker: Gregory Campbell, Ph.D. (Statistical Consultant) -
Abstract: The world of medical devices is vastly different from the world of pharmaceutical drugs. For a number of reasons, medical device clinical trials have been quicker to adopt innovative statistical methods, in part due to the pivotal role FDA has played in encouraging such innovation. An example of this is the effort to encourage medical device submissions using Bayesian statistical methods for the design and analysis of clinical trials. In the past 15 years there has been a noticeable increase in the use of Bayesian statistics in medical device clinical trials. One approach is to use Bayesian hierarchical models to “borrow strength” from pre-identified earlier data. Another approach is adaptive, using accumulating information in the trial, often with a non-informative prior, that allows for early stopping for success or futility but also uses predictive probabilities to curtail patient enrollment earlier. Some of this progress is reviewed. Focus is then turned to research challenges in Bayesian statistics for the future, including: robust hierarchical models of historical information, subgroup identification for successful (or even possibly failed) studies, non-inferiority studies, and decision theoretic trials (based on benefit-risk decisions). This invaluable experience has helped FDA work through the issues concerning adaptive designs and consequently the Center for Devices and Radiological Health has a great deal of experience with such designs.

Guarding from Spurious Discoveries in High Dimension

When: Thu, March 24, 2016 - 3:30pm
Where: MTH 1313
Speaker: Jianqing Fan, Professor (Department of Operations Research and Financial Engineering, Princeton University) -
Abstract: Many data-mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis due to the enormous number of possible selections. How can we know statistically whether our discoveries are better than those arising by chance? In this paper, we define a measure of goodness of spurious fit, which shows how well a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. It coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of such goodness of spurious fit for generalized linear models and $L_1$-regression. Such an asymptotic distribution depends on the sample size, ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by multiplier bootstrapping and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, which considers only candidate models whose goodness of fit is better than that of spurious fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials. This work is joint with Wenxing Zhou.

Tensor Completion via Nuclear Norm Minimization

When: Thu, March 31, 2016 - 3:30pm
Where: MTH 1313
Speaker: Cunhui Zhang, Professor (Department of Statistics, Rutgers University) -
Abstract: Many problems can be formulated as recovering a low-rank tensor. Although an increasingly common task, tensor recovery remains a challenging problem because of the delicacy associated with the decomposition of higher order tensors. To overcome these difficulties, existing approaches often proceed by unfolding tensors into matrices and then applying techniques for matrix completion. We show here that such matricization fails to exploit the tensor structure and may lead to a suboptimal procedure. More specifically, we investigate a convex optimization approach to tensor completion by directly minimizing a tensor nuclear norm and prove that this leads to an improved sample size requirement. To establish our results, we develop a series of algebraic and probabilistic techniques, such as a characterization of the subdifferential for the tensor nuclear norm and concentration inequalities for tensor martingales, which may be of independent interest and could be useful in other tensor related problems. This is joint work with Ming Yuan.

On the Asymptotic Behaviour of the Likelihood Ratio Test Statistic under Nonstandard Conditions

When: Thu, April 7, 2016 - 3:30pm
Where: MTH 1313
Speaker: Yong Chen, Ph.D. (Biostatistics Research, School of Medicine, University of Pennsylvania) -
Abstract: In this talk, we consider the asymptotic distribution of the likelihood ratio statistic $T$ for hypothesis testing. We consider several important likelihoods including pseudolikelihood (Gong and Samaniego, 1981) and composite likelihood (Lindsay, 1988). We provide the asymptotic distribution of $T$ under the null hypothesis. We also extend our result to a new test by conditioning on the observed data. Examples in parametric and semiparametric models are given.

Respondent Privacy through Randomized Response Models

When: Thu, April 28, 2016 - 3:30pm
Where: MTH 1313
Speaker: Sat N. Gupta, Professor (Department of Mathematics and Statistics, University of North Carolina – Greensboro) -
Abstract: Respondent privacy is an important issue in sample surveys, more so when dealing with sensitive topics where social desirability response bias is a major worry. Randomized response techniques (RRT) are important tools to deal with this problem. In this talk, we will discuss some real applications of RRT models, discuss relevance of these models in reference to data confidentiality and respondent privacy, and present some estimators that are more efficient than the usual RRT mean estimators. We will also present simulation results to validate theoretical findings.
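
For readers unfamiliar with randomized response, the sketch below shows the classical Warner design for a yes/no sensitive question, which is an assumption here: the abstract does not say which RRT models the talk covers. Each respondent privately answers the sensitive statement with probability p and its complement otherwise, so P(yes) = p*pi + (1 - p)*(1 - pi), and the prevalence pi can be recovered from the observed share of "yes" answers. The function name and the survey counts are hypothetical.

```python
def warner_estimate(yes_count, n, p):
    """Estimate sensitive-trait prevalence under Warner's randomized response design.

    yes_count: number of "yes" answers observed
    n: number of respondents
    p: probability that the randomizing device selects the sensitive statement
    Returns (prevalence estimate, plug-in variance estimate).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if p == 0.5:
        raise ValueError("p = 0.5 makes the prevalence unidentifiable")
    yes_share = yes_count / n
    # Invert P(yes) = p*pi + (1 - p)*(1 - pi) for pi.
    pi_hat = (yes_share - (1.0 - p)) / (2.0 * p - 1.0)
    # Binomial variance of the yes-share, scaled by the inversion factor.
    var_hat = yes_share * (1.0 - yes_share) / (n * (2.0 * p - 1.0) ** 2)
    return pi_hat, var_hat


# Hypothetical survey: 1000 respondents, 420 "yes" answers, device probability 0.7.
pi_hat, var_hat = warner_estimate(yes_count=420, n=1000, p=0.7)
print(f"Estimated prevalence: {pi_hat:.3f}, standard error: {var_hat ** 0.5:.3f}")
```

The factor (2p - 1)^2 in the denominator shows the privacy-efficiency trade-off: the closer p is to 0.5, the better individual answers are protected and the larger the variance of the estimate.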

DAS Estimator: A New Method for Parameter Estimation Under Model Misspecification

When: Thu, May 5, 2016 - 3:30pm
Where: MTH 1313
Speaker: Emre Barut, Assistant Professor (Department of Statistics, George Washington University) -
Abstract: In parameter estimation problems, maximum likelihood (ML) approaches possess a series of advantageous properties, which have led to their common use in everyday statistical applications. Unfortunately, ML based methods are also known not to be robust to model misspecification or outliers. In that spirit, we provide a new framework for parameter estimation called the DAS estimator, which is given as the empirical minimizer of a second order U-statistic. When estimating parameters in the exponential family, the estimator is shown to be a solution of a quadratic convex problem that can be efficiently solved. For parameter estimation, our approach significantly improves upon MLE when outliers are present, or when the model is misspecified. Furthermore, we show how the DAS estimator can be used to efficiently fit distributions with unknown normalizing constants. We demonstrate the validity of this approach on non-parametric Bayesian models. Extensions of DAS estimators for regression and their implications for statistical modeling are discussed.