Statistics Archives for Fall 2014 to Spring 2015


A Review of Talks at the August JSM

When: Thu, September 11, 2014 - 3:30pm
Where: MTH 1313
Speaker: Prof. Paul Smith (Department of Mathematics, UMCP) -
Abstract: The speaker will review a few talks that he attended at the JSM and found of general interest.


Small Perturbations of Statistical Models and a Robust Property of the Gaussian Distribution

When: Thu, September 18, 2014 - 3:30pm
Where: MTH 1313
Speaker: Abram Kagan (Department of Mathematics, UMCP) -
Abstract: After a brief review of the (location parameter) contamination model and a (general
parameter) misspecification model, a natural perturbation of the setup of direct measurements will be discussed in detail. It turns out that under regularity-type conditions (unfortunately, rather strong ones), the Gaussian distribution is the most robust under small additive perturbations. This should be compared with the well-known property that the Gaussian distribution minimizes the Fisher information (about a location parameter) in the class of distributions with a fixed variance.
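For reference, the extremal property cited here can be stated as follows: among all densities f with a fixed variance sigma^2, the Fisher information about a location parameter satisfies

```latex
I(f) \;=\; \int \left(\frac{f'(x)}{f(x)}\right)^{2} f(x)\,dx \;\ge\; \frac{1}{\sigma^{2}},
```

with equality if and only if f is Gaussian with variance sigma^2.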


Semi-Parametric Generalized Linear Models for Time-Series Data

When: Thu, September 25, 2014 - 3:30pm
Where: MTH 1313
Speaker: Dr. Thomas Fung (Department of Statistics, Macquarie University, Sydney, Australia) -
Abstract: We introduce a semiparametric generalized linear model framework for time-series data that does not require specification of a working distribution or variance function for the data. Rather, the conditional response distribution is treated as an infinite-dimensional parameter, which is estimated simultaneously with the usual finite-dimensional parameters via a maximum empirical likelihood approach. A general consistency result for the resulting estimators is shown. Simulations suggest that estimation and inference using the proposed method can perform as well as a correctly specified parametric model even for moderate sample sizes, while being much more robust than parametric methods under model misspecification. The method is used to analyze the Polio dataset from Zeger (1988). This talk represents joint research with Dr. Alan Huang.
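As a point of reference (and not the speaker's method), a fully parametric baseline for count-valued time series is a Poisson log-linear model with lagged terms; the semiparametric framework above generalizes such models by leaving the response distribution unspecified. A minimal sketch on simulated data (a stand-in for the Polio series), using statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated count time series (placeholder for real data)
n = 200
y = rng.poisson(lam=3.0, size=n)

# Lag-1 log-counts as the covariate; intercept added explicitly
X = sm.add_constant(np.log1p(y[:-1]))
fit = sm.GLM(y[1:], X, family=sm.families.Poisson()).fit()
print(fit.params)
```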


A Sample Approach to Estimation under Informative Sampling Design

When: Thu, October 9, 2014 - 3:30pm
Where: MTH 1313
Speaker: Dr. Michael Sverchkov (Bureau of Labor Statistics) -
Abstract: As shown in Sverchkov and Pfeffermann (2004), under general conditions, when the sample is sufficiently large, the population distribution is close to that of a with-replacement subsample drawn from the observation units with probabilities proportional to the sampling weights. (We call this subsample a "pseudo-population".) The near equivalence of these distributions suggests a simple method for estimation under an informative sampling design: generate a pseudo-population and apply any estimator that ignores the design to this pseudo-population. In this talk we discuss advantages and disadvantages of this approach and illustrate how it works for simple Small Area Estimation models.
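A minimal sketch of the pseudo-population idea, assuming the data come as arrays of observations y and sampling weights w (the unweighted mean here stands in for any design-ignoring estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_population_estimate(y, w, estimator=np.mean):
    """Draw a with-replacement 'pseudo-population' with probabilities
    proportional to the sampling weights, then apply an estimator
    that ignores the sampling design."""
    n_pseudo = int(round(w.sum()))      # weights sum approximates population size
    p = w / w.sum()
    idx = rng.choice(len(y), size=n_pseudo, replace=True, p=p)
    return estimator(y[idx])
```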

Propensity-Score-Adjustment Method for Nonignorable Nonresponse

When: Thu, October 16, 2014 - 3:30pm
Where: 1208, LeFrak Hall, University of Maryland
Speaker: Dr. Jae Kwang Kim (Professor of Statistics and Member of Center for Survey Statistics and Methodology at Iowa State University) -
Abstract: Propensity score adjustment is a popular technique for handling unit nonresponse in sample surveys. If the response probability depends on the study variable that is subject to missingness, estimating the response probability often relies on additional distributional assumptions about the study variable. Instead of making fully parametric assumptions about the population distribution of the study variable and the response mechanism, we propose a new maximum likelihood estimation approach based on distributional assumptions about the observed part of the sample. Since the model for the observed part of the sample can be verified from the data, the proposed method is less sensitive to failure of the assumed model for the outcomes. The generalized method of moments can be used to improve the efficiency of the proposed estimator. Results from two limited simulation studies are presented to compare the performance of the proposed method with existing methods. The proposed method is applied to the missing data in the exit poll for the 19th legislative election in Korea.
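For orientation, a sketch of the standard propensity-score-adjusted estimator under an ignorable response mechanism (propensities depending only on fully observed covariates); the talk's method extends this setting to nonignorable nonresponse, so treat this as background rather than the proposed approach:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def psa_mean(x, y, r):
    """Propensity-score-adjusted mean of y.
    x : (n, p) covariates observed for all sampled units
    y : (n,) study variable, valid only where r == 1
    r : (n,) response indicator (1 = observed, 0 = missing)
    """
    # Estimate response propensities from the fully observed covariates
    phat = LogisticRegression().fit(x, r).predict_proba(x)[:, 1]
    w = 1.0 / phat[r == 1]
    return np.sum(w * y[r == 1]) / np.sum(w)   # Hajek-type weighted mean
```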

The Normal Distribution: Photographic Confusions

When: Thu, October 23, 2014 - 3:30pm
Where: MTH 1313
Speaker: Robert W. Jernigan (Department of Mathematics and Statistics, American University) -
Abstract: The normal distribution might not be what you think it is. For some, the iconic bell curve seems to obscure what a normal distribution really is. Although the normal distribution is central to much statistical theory, it is often confused with the very thing that describes it. The culprit is exactly that iconic bell-shaped curve. There is a difference between the normal distribution and the normal curve, a difference that confuses many. We will illustrate with photographs and cartoons.

A Discussion of Three Talks from This Year's Joint Statistical Meetings

When: Thu, October 30, 2014 - 3:30pm
Where: MTH 1313
Speaker: Eric Slud (Mathematics Department, UMCP) -
Abstract: In this talk, I will start with some brief general remarks about prominent topics at this past August's Joint Statistical Meetings held in Boston, and will then summarize and discuss three talks I heard there: one by Steven Stigler (the Fisher Lecture) on "The Seven Pillars of Statistical Wisdom", another by Andreas Buja on "Valid Inference after selecting predictors and variable transformations", and a third on "Bayesian post-stratification models using multilevel penalized spline regression". The first talk was general, the second was about regression and model-selection problems, and the third relates to "propensity matching" ideas in survey sampling.

Testing for Presence of Signal in the Signal Plus Noise Model

When: Thu, November 6, 2014 - 3:30pm
Where: MTH 1313
Speaker: Abram Kagan (UMCP) -

Abstract: The problem is to decide whether what is observed is a (known) signal plus noise or pure noise, where the noise consists of independent (not necessarily identically distributed) random variables. The talk will discuss how measure theorists and statisticians approach the problem and how the Fisher information naturally arises in this setup.
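For intuition, consider the textbook special case (not the talk's general setting) of a known signal s in independent Gaussian noise N(0, sigma_i^2). The optimal test statistic is linear in the observations, and the quantity sum s_i^2 / sigma_i^2, which plays the role of a Fisher information, governs detectability. A minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def signal_test(y, s, sigma):
    """Test H0: y = noise  vs  H1: y = s + noise,
    with noise_i ~ N(0, sigma_i^2) independent and signal s known."""
    t = np.sum(s * y / sigma**2)      # optimal linear (matched-filter) statistic
    info = np.sum(s**2 / sigma**2)    # Fisher-information-like quantity
    z = t / np.sqrt(info)             # standard normal under H0
    pvalue = norm.sf(z)               # one-sided: large t favors H1
    return z, pvalue
```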

Adaptive Sparse Reduced-rank Regression

When: Thu, November 13, 2014 - 3:30pm
Where: MTH 1313
Speaker: Tingni Sun (UMCP) -
Abstract: This talk concerns the reduced-rank regression model in the high-dimensional setting, which contains multivariate or even high-dimensional response variables together with a large number of predictors, while the sample size can be much smaller. We propose a new estimation scheme for the coefficient matrix in which both dimension reduction and variable selection are taken into account. We derive error bounds with respect to a class of squared Schatten norm loss functions for the proposed estimators and show that they achieve near-optimal rates adaptively. The practical competitiveness of the estimator is further demonstrated through numerical studies.
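For context, the classical fixed-rank estimator (without the sparsity and adaptivity of the proposed method) takes only a few lines: fit ordinary least squares, then project the fitted values onto their top-r right singular vectors. A sketch, assuming more observations than predictors:

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Classical rank-r estimator of B in Y ~ X B (identity weighting)."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:r].T                  # top-r right singular vectors of the fit
    return B_ols @ V_r @ V_r.T      # project the OLS solution to rank r
```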

Repeated Out of Sample Fusion in Interval Estimation of Small Tail Probabilities in Food Safety

When: Thu, November 20, 2014 - 3:30pm
Where: MTH 1313
Speaker: Ben Kedem (UMCP) -
Abstract: In food safety and bio-surveillance, it is often desired to estimate the probability that a contaminant, such as an insecticide or pesticide, exceeds an unsafe, very high threshold. The probability in question is then very small. To estimate such a probability we need information about large values; however, in many cases the data contain no information about exceedingly large contamination levels, which ostensibly makes the problem impossible to solve. A solution is provided whereby more information about small tail probabilities is obtained by fusing the real data with computer-generated random data. The method provides short but reliable interval estimates from moderately large samples. An illustration is provided using exposure data on methylmercury, dichlorophenol, and trichlorophenol obtained from the National Health and Nutrition Examination Survey (NHANES).

(Partial) Distance Correlation

When: Thu, February 12, 2015 - 3:30pm
Where: MATH 1313
Speaker: Dr. Gabor Szekely (NSF and Renyi Institute of the Hungarian Academy of Sciences) -

Abstract: Distance covariance and distance correlation are scalar coefficients that characterize independence of random vectors in arbitrary dimension. Properties, extensions, and applications of distance correlation have been discussed in the recent literature, but the problem of defining the partial distance correlation has remained an open question of considerable interest. The problem of partial distance correlation is more complex than partial correlation partly because the squared distance covariance is not an inner product in the usual linear space. For the definition of partial distance correlation we introduce a new Hilbert space where the squared distance covariance is the inner product. We define the partial distance correlation statistics with the help of this Hilbert space, and develop and implement a test for zero partial distance correlation. Our intermediate results provide an unbiased estimator of squared distance covariance, and a neat solution to the problem of distance correlation for dissimilarities rather than distances.
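For readers new to the topic, the (biased, V-statistic) sample distance correlation for univariate samples can be computed directly from double-centered distance matrices; a minimal numpy sketch:

```python
import numpy as np

def distance_correlation(x, y):
    """Biased (V-statistic) sample distance correlation of two
    univariate samples x and y of equal length."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])                  # pairwise distances in x
    b = np.abs(y[:, None] - y[None, :])                  # pairwise distances in y
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()    # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0
```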

Shrinkage methods utilizing auxiliary information from external Big Data sources to improve prediction models with many covariates

When: Thu, February 19, 2015 - 3:30pm
Where: MATH 1313
Speaker: Professor Bhramar Mukherjee (Departments of Statistics and Biostatistics, Univ. of Michigan, Ann Arbor) -
Abstract: We consider predicting an outcome Y using a large number of covariates X. However, most of the data we have to fit the model contain only Y and W, a noisy surrogate for X, and only on a small number of observations do we observe Y, X, and W. We develop ridge-type shrinkage methods that trade off bias and variance in a data-adaptive way, using information from both datasets to yield smaller prediction error. We also demonstrate how the problem can be treated in a fully Bayesian context with different forms of adaptive shrinkage. Finally, we introduce the notion of a hyper-penalty for guiding choices of the tuning parameter to perform adaptive shrinkage.

Our work is motivated by the rapid development of genomic assay technologies. In our application, mRNA expression of a selected number of genes is measured by both quantitative real-time polymerase chain reaction (qRT-PCR, X) and microarray technology (W) on a small number of lung cancer patients. In addition, only microarray measurements (W) are available on a larger number of patients. For future patients, the goal is to predict survival time (Y) using qRT-PCR (X). The question of interest is whether the large dataset containing only W can aid prediction of Y using X.
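As background, plain ridge regression, the starting point that these adaptive and hyper-penalized variants build on, has a simple closed form; the talk's methods choose the amount of shrinkage data-adaptively using the auxiliary W-only data. A sketch:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```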


Some properties of var E(X|X+Y) for independent X, Y

When: Thu, February 26, 2015 - 3:30pm
Where: MATH 1313
Speaker: Abram Kagan (Dept. of Mathematics, Univ. of Maryland) -
Abstract: For an arbitrary X with finite variance, we look for the least favorable Y, the one that makes filtering X from an observation of X plus additive noise most difficult. Under a natural assumption on the class of admissible Y, we find the least favorable Y for infinitely divisible X, while the case of arbitrary X remains open.

Some related problems are also discussed.
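The quantity in the title is the "recoverable" part of var X in the law of total variance:

```latex
\operatorname{var} X \;=\; E\,\operatorname{var}(X \mid X+Y) \;+\; \operatorname{var} E(X \mid X+Y),
```

so a least favorable Y is one that makes var E(X | X+Y) small, i.e., leaves little of X recoverable from the noisy observation X+Y.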

Understanding the Effects of Concussions Using Big Data

When: Thu, March 12, 2015 - 3:30pm
Where: MATH 1313
Speaker: Dr. Jesus Caban (Chief of Clinical and Research Informatics, National Intrepid Center of Excellence at Walter Reed Bethesda) -
Abstract: From cyber security to healthcare, big data is changing our world and the way scientific breakthroughs are discovered. The benefits, challenges, and enormous research opportunities of leveraging massive collections of multi-modal data are evident in the healthcare domain, where integration of numerous disparate measurements is often needed to develop a comprehensive understanding of the patient’s condition.

In this talk we will discuss a large-scale informatics database that has been developed by the Department of Defense (DoD) to enable research into the effects of concussions. The database consists of millions of longitudinal clinical data points, including diagnoses, deployments, medications, imaging findings, and many additional data elements. The focus of the talk will be on advanced multi-modal analytical techniques that have been developed to model the short- and long-term effects of concussions. Special attention will be given to (a) a hierarchical classification model to predict PTSD, which has been tested on over 100,000 patients and shows an accuracy of 86%, (b) a semi-supervised classification method that has been used to detect subtle abnormalities in MRI images of TBI patients, and (c) a visual analytics framework that has been designed to enable clinicians and researchers to interactively explore large collections of clinical data.

Consensus and Flocking in Social Dynamics

When: Thu, March 26, 2015 - 3:30pm
Where: MATH 1313
Speaker: Professor Eitan Tadmor (Dept. of Mathematics, CSCAMM and IPST, Univ. of Maryland) -
Abstract: We discuss the dynamics of systems driven by the “social engagement” of their agents with their local neighbors through local gradients. Prototype examples include models for opinion dynamics in human networks; flocking, swarming, and bacterial self-organization in biological organisms; and rendezvous in mobile systems.

Two natural questions arise in this context: what is the large-time behavior of such systems as the time T tends to infinity, and what is the effective dynamics of such systems as the number of agents N tends to infinity? The underlying issue is how different rules of engagement influence the formation of clusters, and in particular the tendency to form a “consensus of opinions”. We analyze the flocking dynamics of agent-based models, present novel numerical methods which confirm the large-time formation of Dirac masses at the kinetic level, and end with critical threshold phenomena at the level of social hydrodynamics.
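A prototypical agent-based alignment model of this kind is the Cucker-Smale system,

```latex
\dot{x}_i = v_i, \qquad
\dot{v}_i = \frac{1}{N}\sum_{j=1}^{N} \phi\!\left(|x_j - x_i|\right)(v_j - v_i),
\qquad i = 1,\dots,N,
```

where the influence kernel phi >= 0 encodes the rule of engagement; whether the velocities flock to a consensus depends on how quickly phi decays with distance.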

Modelling and Analysis of Treatment Effect for Survival Data When There May Be Treatment by Time Interaction

When: Thu, April 2, 2015 - 3:30pm
Where: MATH 1313
Speaker: Dr. Song Yang (National Heart, Lung, and Blood Institute, NIH) -
Abstract: For risk and benefit assessment in clinical trials and observational studies with time-to-event data, the Cox model has usually been the model of choice. When the hazards are possibly non-proportional, a piece-wise Cox model over a partition of the time axis is often used. Here we propose to analyze clinical trials or observational studies with time-to-event data using a certain semiparametric model. The model allows for a time-dependent treatment effect. It includes the important proportional hazards model as a sub-model, and can accommodate various patterns of time-dependence of the hazard ratio. After estimation of the model parameters using a pseudo-likelihood approach, simultaneous confidence intervals for the hazard ratio function are established using a Monte Carlo method to assess the time-varying pattern of the treatment effect. To assess the overall treatment effect, an estimated average hazard ratio and its confidence intervals are also obtained. The proposed methods are applied to data from the Women’s Health Initiative. Compared to the piece-wise Cox model, the proposed model does not require partitioning of the time axis and yields a better model fit.
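For concreteness, the piece-wise Cox alternative mentioned above specifies, over a partition 0 = tau_0 < tau_1 < ... < tau_J of the time axis,

```latex
\lambda(t \mid Z) \;=\; \lambda_0(t)\, e^{\beta_j Z}, \qquad t \in (\tau_{j-1}, \tau_j], \quad j = 1,\dots,J,
```

so the hazard ratio is constant within each interval; the proposed semiparametric model instead lets the hazard ratio vary over time without choosing a partition.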

Benford's Law and Forensics

When: Thu, April 9, 2015 - 3:30pm
Where: MATH 1313
Speaker: Prof. James Alexander (University of Maryland) -
Abstract: The so-called Benford's law describes the frequencies with which the digits 1, 2, …, 9 occur as the leading digit of naturally occurring numbers. The mathematics of Benford's law was clarified in the 1990s.

In the 1970s it was suggested that the law could be used for investigating the validity of data, and more recently it has been used in such areas as fraud investigation (financial, voting, etc.). We develop a simple way to perform a valid statistical test on any set of data. However, we eventually conclude that Benford's law is not a particularly reliable tool for forensic investigation.
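One simple test of this kind, a chi-square goodness-of-fit of the observed leading digits against the Benford frequencies P(d) = log10(1 + 1/d), can be sketched as follows (the talk's own test may differ in detail):

```python
import numpy as np
from scipy.stats import chisquare

def benford_test(values):
    """Chi-square goodness-of-fit test of leading digits against
    Benford's law: P(d) = log10(1 + 1/d), d = 1, ..., 9."""
    # Scientific notation puts the leading digit first, e.g. "3.14...e+02"
    digits = [int(f"{abs(v):e}"[0]) for v in values if v != 0]
    observed = np.bincount(digits, minlength=10)[1:]
    p = np.log10(1 + 1 / np.arange(1, 10))
    return chisquare(observed, f_exp=p * observed.sum())
```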


Use of Sample Model for Modeling Complex Survey Data

When: Thu, April 23, 2015 - 3:15pm
Where: MATH 1313 (NOTE Time Change)
Speaker: Dr. Michael Sverchkov (Bureau of Labor Statistics) -
Abstract: One of the unique features of sample survey data is that the sample is often drawn with unequal probabilities, at least at one stage of the sampling process. The selection probabilities are generally known and accessible for the sampled units in the form of sampling weights (inverse of the sampling probabilities adjusted for nonresponse or calibration), and they are in common use for randomization based inference on finite population quantities. In this talk I focus on estimation of models, either for studying the functional relationships between variables or for prediction.


When the selection probabilities are correlated with the model's dependent variable after conditioning on the model covariates (informative sampling), the population model holding for the population data can be very different from the sample model holding for the sample data, given the sample selection. In such cases, the sampling process needs to be accounted for in the modelling process or when using the model for prediction. A common approach for dealing with this problem is to estimate the population model by weighting the sample values in the model estimating equations by the sampling weights. However, this approach has some serious limitations. Including the sampling weights among the model covariates and integrating them out at a later stage does not always provide a satisfactory solution either.
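For a linear model, the weighting approach just described amounts to solving probability-weighted normal equations; a minimal sketch (this is the approach whose limitations the talk discusses, not the proposed sample-model approach):

```python
import numpy as np

def pw_linear_fit(X, y, w):
    """Probability-weighted estimating equations for a linear model:
    solve sum_i w_i * x_i * (y_i - x_i' beta) = 0 for beta."""
    Xw = X * w[:, None]                        # rows scaled by sampling weights
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)
```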

I therefore discuss an alternative approach that overcomes the limitations of the other approaches proposed in the literature, at the price of also having to model the sampling weights. The idea is to fit a model to the sample data and base the inference on the sample model. As shown in the talk, the population model and the sample-complement model holding for the data outside the sample (needed for prediction) can be obtained from the sample model.
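The key identity behind this approach (due to Pfeffermann and Sverchkov; the exact form below is stated from memory and should be treated as an assumption) recovers the population model from the sample model through sample expectations of the weights w_i:

```latex
f_p(y_i \mid x_i) \;=\; \frac{E_s(w_i \mid y_i, x_i)}{E_s(w_i \mid x_i)}\; f_s(y_i \mid x_i),
```

which is why modelling the sampling weights, given the sample data, is the price of the approach.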

The main advantages of the use of the sample model are as follows:

1. It accounts for selection bias under informative sampling.
2. Once the model is specified, it lends itself to standard model based inference such as maximum likelihood estimation, Bayesian inference or semi-parametric estimation.
3. The use of the sample model lends itself to conditional inference, given the selected sample.
4. As illustrated in many studies, the use of the sample model generally yields estimators with lower variances than the variances of randomization based estimators.
5. The sample-complement model allows predicting the outcome values for nonsampled units or areas.
6. The use of the sample model enables testing whether the sampling process is informative.


Dynamic Spatial Panel Models, Common Shocks, and Sequential Exogeneity

When: Thu, April 30, 2015 - 3:30pm
Where: MATH 1313
Speaker: Prof. Guido Kuersteiner (Economics Department, Univ. of Maryland) -
Abstract: The talk considers a class of GMM estimators for general dynamic panel models, allowing for cross-sectional dependence due to spatial lags and to unspecified common shocks. We significantly expand the scope of the existing literature by allowing for endogenous spatial weight matrices, time-varying interactive effects, and weakly exogenous covariates. The model is expected to be useful for empirical work in both macro- and microeconomics. An important area of application is social interaction and network models, where our specification can accommodate data-dependent network formation; we discuss explicit examples from the recent social interaction literature. Identification of spatial interaction parameters is achieved through a combination of linear and quadratic moment conditions. We develop an orthogonal forward-differencing transformation to aid in the estimation of factor components while maintaining orthogonality of moment conditions. This is an important ingredient for a tractable asymptotic distribution of our estimators.
This is joint work with I. Prucha and D. Drukker.
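Schematically, identification through "a combination of linear and quadratic moment conditions" takes the form standard in the spatial GMM literature: with disturbance vector u(delta), instrument matrix Z, and weighting matrices A_j normalized to have zero trace,

```latex
E\left[\,Z'\,u(\delta)\,\right] = 0 \quad \text{(linear)}, \qquad
E\left[\,u(\delta)'\,A_j\,u(\delta)\,\right] = 0, \;\; \operatorname{tr}(A_j) = 0 \quad \text{(quadratic)}.
```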


Fast Computing for Statistical Dependency (Notice Date Change)

When: Tue, May 12, 2015 - 3:30pm
Where: MATH 1313
Speaker: Prof. Xiaoming Huo (Georgia Tech and NSF) -
Abstract: Distance correlation was introduced as a better alternative to the classical Pearson correlation. The existing method for computing the distance correlation seemingly requires O(n^2) operations; I will show how it can be done in O(n log n) time. Moreover, many other statistical-dependency-related quantities can be computed efficiently, and I will give some other examples. This is based on joint work with my NSF colleague, Dr. Gabor Szekely. I will assume that some of the audience has heard Gabor's talk on energy statistics, though this is not a prerequisite for understanding this talk.
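To give a flavor of how sorting removes the quadratic cost (a sketch of one ingredient, for univariate data, not the full algorithm): the pairwise sum of absolute differences, which appears throughout distance-correlation computations, has a closed form in terms of the order statistics, since the k-th smallest value is the larger element of k-1 pairs and the smaller element of n-k pairs.

```python
import numpy as np

def pairwise_abs_sum_naive(x):
    # O(n^2): sum of |x_i - x_j| over all pairs i < j
    return sum(abs(xi - xj) for i, xi in enumerate(x) for xj in x[i + 1:])

def pairwise_abs_sum_fast(x):
    # O(n log n): after sorting, x_(k) contributes (2k - 1 - n) * x_(k)
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    k = np.arange(1, n + 1)
    return np.sum((2 * k - 1 - n) * xs)

x = np.random.randn(1000).tolist()
assert np.isclose(pairwise_abs_sum_naive(x), pairwise_abs_sum_fast(x))
```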