Statistics

Statistics Archives for Fall 2012 to Spring 2013

Bayesian Quantile Regression with Endogenous Censoring

When: Thu, September 6, 2012 - 3:30pm
Where: Math 1313
Speaker: Professor Yuan Liao (Department of Mathematics, UMCP)
Abstract: This talk presents a new framework for quantile regression with censored
data, based on a quasi-Bayesian approach. The traditional approach to censored data
assumes that, conditional on regressors, the survival time is independent
of censoring. Such an assumption is restrictive in many cases and may fail whenever
the censoring mechanism is endogenous (e.g., when something else determines
the survival time and censoring simultaneously). The proposed framework allows
endogenous censoring. There are three highlights of the talk: 1) We allow arbitrary
dependence between survival time and censoring, even after conditioning on regressors.
2) In this case the regression coefficient is either point or partially identified by a set of
moment inequalities (shown by Khan and Tamer 09 in J. Econometrics), so the set in
which the parameter is identified, which may not be a singleton, becomes the target of interest.
3) We propose a Bayesian approach based on empirical likelihood, which is robust when applied researchers are unsure about the true likelihood. Other moment-condition-based Bayesian approaches, such as Bayesian GMM (Hansen 82), will work too.

We will show the posterior consistency, i.e., asymptotically the empirical likelihood posterior will concentrate on a neighborhood around the "truth" (either the true coefficient parameter or its identified set, depending on whether or not it is identified). We will also generalize these techniques to a more general instrumental variable regression with interval censored data, which has many applications in economics and social sciences.

Flexible Bayesian Models for Process Monitoring of Paradata Survey Quality Indicators

When: Thu, September 13, 2012 - 3:30pm
Where: Math 1313
Speaker: Dr. Joseph L. Schafer (Area Chief for Statistical Computing, Center for Statistical Research & Methodology, U.S. Census Bureau)
Abstract: As data-collecting agencies obtain responses from survey participants, they are also gleaning increasingly large amounts of paradata about the survey process: number of contact attempts, interview duration, reasons for nonresponse, and so on. Paradata may be used to assess the performance of field staff, to describe the effects of interventions on the data collection process, and to alert survey managers to unexpected developments that may require remedial action. With direct visual inspection and simple plotting of paradata variables over time, it may be difficult to distinguish ordinary random fluctuations from systematic change and long-term trends. The field of Statistical Process Control (SPC), which grew in the context of manufacturing, provides some computational and graphical tools (such as the Shewhart control chart) for process monitoring. Those methods, however, generally assume that the process mean is stable over time, and they may be ill-suited to paradata variables that are often “out of control.” In this talk, I present a flexible class of semiparametric models for monitoring paradata which allow the mean function to vary over time in ways that are not specified in advance. The mean functions are modeled as natural splines with penalties for roughness that are estimated from the data. These splines are allowed to vary over groupings (e.g., regional offices, interview teams and interviewers) by creating a generalized linear mixed model with multiple levels of nested and crossed random effects. I describe efficient Markov chain Monte Carlo strategies for simulating random draws of model parameters from the high-dimensional posterior distribution and produce graphical summaries for process monitoring. I illustrate these methods on monthly paradata series from the National Crime Victimization Survey.
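
For readers unfamiliar with the Shewhart control chart mentioned in the abstract, the following is a minimal sketch of the classical idea only, not of the speaker's semiparametric model. It assumes a stable process mean, estimates a center line and three-sigma limits from a baseline period, and flags later observations outside those limits. The function names, the use of the plain sample standard deviation, and the data values are all illustrative assumptions.

```python
from statistics import mean, stdev

def shewhart_limits(baseline, k=3.0):
    """Return (lower, center, upper) control limits from baseline data.

    Assumes the process mean is stable over the baseline period, which is
    the assumption the abstract says may fail for paradata series.
    """
    center = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation (a simplification)
    return center - k * sigma, center, center + k * sigma

def out_of_control(series, limits):
    """Indices of observations that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(series) if x < lower or x > upper]

# Hypothetical monthly averages of contact attempts (made-up numbers).
baseline = [4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9]
new_months = [4.0, 4.1, 5.2, 3.9]

limits = shewhart_limits(baseline)
flagged = out_of_control(new_months, limits)
```

A chart of this kind cannot separate a slow trend from random fluctuation, which is the gap the time-varying mean functions in the talk are meant to address.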

Adding one more observation to the data

When: Thu, September 20, 2012 - 3:30pm
Where: Math 1313
Speaker: Professor Abram Kagan (UMCP)
Abstract: We discuss ways of incorporating the (n+1)-st observation in an estimator of a population characteristic based on the previous n observations.
Non-parametric estimators such as the empirical distribution function, the sample mean and variance are jackknife extensions and lack novelty.
Classical estimators in the parametric models are expected to be innovative. We prove it for the Pitman estimators of a location parameter and
illustrate by a few more examples.
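
As a concrete picture of what "incorporating the (n+1)-st observation" can mean for the sample mean and variance, here is a minimal sketch of the standard one-step update formulas. It does not reproduce the talk's results on jackknife extensions or Pitman estimators; the function name and the data are hypothetical.

```python
from statistics import mean, variance

def add_observation(n, mean_n, var_n, x):
    """Update (n, sample mean, unbiased sample variance) with a new value x.

    Requires n >= 2 so that the unbiased variance var_n is defined.
    """
    new_n = n + 1
    new_mean = mean_n + (x - mean_n) / new_n
    # Sum of squared deviations after the update (Welford's recurrence).
    m2 = var_n * (n - 1) + (x - mean_n) * (x - new_mean)
    return new_n, new_mean, m2 / (new_n - 1)

# Hypothetical data: four earlier observations and one new one.
old = [2.0, 4.0, 4.0, 5.0]
x_new = 7.0
n1, mean1, var1 = add_observation(len(old), mean(old), variance(old), x_new)
```

The updated values agree with recomputing the mean and variance from all five observations, so nothing beyond the previous estimates and the new value is needed.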

CANCELED - Random Dirichlet and Quasi Bernoulli Probabilities and Perpetuities

When: Thu, September 27, 2012 - 3:30pm
Where: MATH 1313
Speaker: Prof. Gerard Letac (University of Toulouse, France)
Abstract: Diaconis and Freedman have observed that if $Z,W,U$ are independent real random variables such that $Z$ and $ZU+(1-U)W$ are identically distributed and such that $U$ has density $au^{a-1}$ on $(0,1)$ then $Z=\int zD(\alpha)(dz)$ where $\alpha /a$ is the distribution of $W$ and $D(\alpha)$ is the random Dirichlet probability governed by the bounded measure $\alpha.$ The extension of this result to the case where $Z$ and $W$ are valued in a convex set and where $U$ follows an arbitrary beta distribution (well, not quite) leads to the study of new random probabilities called quasi Bernoulli where the Ewens distribution on Ferrers diagrams plays an important role.
Some values of the parameters of the beta distribution of $U$ are challenging.
(Joint work with Pawel Hitczenko, Drexel University)

Inference for High Frequency Financial Data: Local Likelihood and Contiguity

When: Thu, October 4, 2012 - 3:30pm
Where: MATH 1313
Speaker: Prof. Per Mykland (University of Chicago)
Abstract: Recent years have seen a rapid growth in high frequency financial data.
This has opened the possibility of accurately determining volatility and similar quantities in small time periods, such as one day or even less. We introduce the types of data, and then present a local parametric approach to estimation in the relevant data structures. Using contiguity, we show that the technique quite generally yields asymptotic properties (consistency, normality) that are correct
subject to an ex post adjustment involving asymptotic likelihood ratios. Several examples of estimation are provided: powers of volatility, leverage effect, and integrated betas. The approach provides substantial gains in transparency when it comes to defining and analyzing estimators. The theory relies on the interplay
between stable convergence and measure change, and on asymptotic expansions for martingales.

Some Recent Developments of the Support Vector Machine

When: Thu, October 11, 2012 - 3:30pm
Where: MATH 1313
Speaker: Dr. Yufeng Liu (Dept. of Statistics and Operations Research and Carolina Center for Genome Sciences, Univ. of North Carolina at Chapel Hill)
Abstract: The Support Vector Machine (SVM) has been a popular margin-based technique for classification problems in both machine learning and statistics. It has a wide range of applications, from computer science to engineering to bioinformatics. As a statistical method, the SVM has weak distributional assumptions and great flexibility in dealing with high dimensional data. In this talk, I will present various aspects of the SVM as well as some of its recent developments. Issues including statistical properties of the SVM, multi-category SVM, as well as class probability estimation of the SVM will be discussed. Applications in cancer genomics will be included as well.

A Statistical Paradox

When: Thu, October 18, 2012 - 3:30pm
Where: MATH 1313
Speaker: Dr. Abram Kagan (Department of Mathematics, UMCP)
Abstract: An interesting paradox observed recently by Moshe Pollak of Hebrew University will be presented. It deals with comparing the conditional distribution of the number of boys in a family having at least m boys (m given) with that in a family in which the first m children are boys.

On the Nile problem by Ronald Fisher

When: Thu, October 25, 2012 - 3:30pm
Where: MATH 1313
Speaker: Dr. Yaakov Malinovsky (Dept. of Mathematics and Statistics, UMBC)
Abstract: The Nile problem by Ronald Fisher may be interpreted as the problem of making statistical inference for a special curved exponential family when the minimal sufficient statistic is incomplete. The problem itself and its versions for general curved exponential families pose a mathematical-statistical challenge: studying the subalgebras of ancillary statistics within the σ-algebra of the (incomplete) minimal sufficient statistics and a closely related question on the structure of UMVUEs. In the talk a new method is presented that proves that in the classical Nile problem no statistic subject to mild natural conditions is a UMVUE. The result almost solves an old problem on the existence of the UMVUEs. The method is purely statistical (vs. analytical) and requires the existence of an ancillary subalgebra. An analytical method that uses only first order ancillarity (and thus works in the setups when the existence of an ancillary subalgebra is an open problem) proves the nonexistence of UMVUEs for curved exponential families with polynomial constraints on the parameters. (Joint work with Abram Kagan)

Semiparametric Regression Based on Multiple Sources

When: Thu, November 1, 2012 - 3:30pm
Where: MATH 1313
Speaker: Professor Benjamin Kedem (UMCP)
Abstract:
It is possible to approach regression analysis with random covariates from
a semiparametric perspective where information is combined from multiple
multivariate sources. The approach assumes a semiparametric density ratio
model where multivariate distributions are "regressed" on a reference
distribution. A kernel density estimator can be constructed from many data
sources in conjunction with the semiparametric model, and is shown to be
more efficient than the traditional single-sample kernel density estimator.
Each multivariate distribution and the corresponding conditional expectation
(regression) of interest are estimated from the combined data using all
sources. Graphical and quantitative diagnostic tools are suggested to assess
model validity. The method is applied in quantifying the effect of height
and age on weight of germ cell testicular cancer patients. A comparison is
made with multiple regression, generalized additive models (GAM) and
nonparametric kernel regression.

Joint work with Anastasia Voulgaraki and Barry Graubard.

Penalized Quantile Regression for Ultra-high Dimensional Data

When: Thu, November 8, 2012 - 3:30pm
Where: MATH 1313
Speaker: Professor Runze Li (Department of Statistics, Penn State University)
Abstract: Ultra-high dimensional data often display heterogeneity due to either heteroscedastic variance or other forms of non-location-scale covariate effects. To accommodate heterogeneity, we advocate a more general interpretation of sparsity which assumes that only a small number of covariates influence the conditional distribution of the response variable given all candidate covariates; however, the sets of relevant covariates may differ when we consider different segments of the conditional distribution. In this talk, I first introduce recent developments in the methodology and theory of nonconvex penalized quantile linear regression in ultra-high dimension. I further propose a two-stage feature screening and cleaning procedure to study the estimation of the index parameter in heteroscedastic single-index models with ultra-high dimensional covariates.
Sampling properties of the proposed procedures are studied. Finite sample performance of the proposed procedures is examined by Monte Carlo simulation studies. A real example is used to illustrate the proposed methodology.

Quality Assurance Tests of Tablet Content Uniformity: Small Sample US Pharmacopeia and Large Sample Tests

When: Thu, November 15, 2012 - 3:30pm
Where: MATH 1313
Speaker: Professor Yi Tsong (Office of Biostatistics, CDER, FDA; based on joint work with Meiyu Shen, Jinglin Zhong and Xiaoyu Dong of CDER, FDA)
Abstract: The small sample United States Pharmacopeia (USP) content uniformity sampling acceptance plan, a two-stage sampling plan with criteria on the sample mean and the number of out-of-range tablets, was the standard for the compendium. It is, however, often used mistakenly for quality assurance of a lot. Both FDA and EMA (European Medicines Agency) proposed large sample quality assurance tests using a tolerance interval approach. EMA proposed a test using a modified two-sided tolerance interval as an extension of USP. Their quality assurances are characterized by controlling the percentage required of the lot within the pre-specified specification limits. On the other hand, FDA statisticians proposed an approach based on two one-sided tolerance intervals that provides quality assurance by controlling the below-specification (low efficacy) and above-specification (potential overdose) portions of the lot separately. FDA further proposed the large sample approach with sample size adjusted specifications in order to assure that the accepted lot will have a more than 90% chance of passing the small sample USP compendia test during the lot’s lifetime. The operating characteristic curves of the approaches are generated to characterize the approaches and demonstrate the difference between the two approaches.

Understanding and Improving Propensity Score Methods

When: Thu, November 29, 2012 - 3:30pm
Where: MATH 1313
Speaker: Prof. Zhiqiang Tan (Dept. of Statistics, Rutgers University)
Abstract: Consider estimating the mean of an outcome in the presence of missing data or estimating population average treatment effects in causal inference. The propensity score is the conditional probability of non-missingness given explanatory variables. In this talk, we will discuss propensity score methods including doubly robust estimators that are consistent if either a propensity score model or an outcome regression model is correctly specified. The focus will be to understand propensity score methods (compared with those based on outcome regression) and to show recent advances of these methods.
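
To make the doubly robust idea concrete, here is a minimal sketch of one common estimator of this type, the augmented inverse-probability-weighted (AIPW) estimator of an outcome mean under missing data. It is a textbook form, not necessarily the estimators discussed in the talk. The fitted propensity scores `pi_hat` and fitted outcome regression values `m_hat` are assumed to come from models fitted elsewhere, and all names and numbers are hypothetical.

```python
def aipw_mean(y, r, pi_hat, m_hat):
    """AIPW estimate of E[Y] when some outcomes are missing.

    y      : outcomes (ignored where r == 0, so None is allowed there)
    r      : 1 if the outcome is observed, 0 if missing
    pi_hat : fitted propensity scores P(R = 1 | X), each in (0, 1]
    m_hat  : fitted outcome regression values E[Y | X]
    """
    total = 0.0
    for yi, ri, pi, mi in zip(y, r, pi_hat, m_hat):
        observed = yi if ri == 1 else 0.0
        # Inverse-probability-weighted term plus an augmentation term
        # that uses the outcome regression.
        total += ri * observed / pi - (ri - pi) / pi * mi
    return total / len(r)

# Made-up example: the outcome model fits the observed outcomes exactly,
# while the propensity scores are arbitrary (possibly misspecified).
y = [1.0, None, 3.0, None]
r = [1, 0, 1, 0]
pi_hat = [0.3, 0.9, 0.5, 0.2]
m_hat = [1.0, 2.0, 3.0, 4.0]
estimate = aipw_mean(y, r, pi_hat, m_hat)
```

In this toy case the estimate equals the average of the outcome-model predictions whatever the propensity scores are, which illustrates one direction of the "either model correct" property in a small numerical example.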

Statistical Significance of Clustering for High Dimensional Data

When: Thu, December 6, 2012 - 3:30pm
Where: MATH 1313
Speaker: Dr. Yufeng Liu (Dept. of Statistics and Operations Research & Carolina Center for Genome Sciences, Univ. of North Carolina at Chapel Hill)
Abstract: Clustering methods provide a powerful tool for the exploratory analysis of high dimensional datasets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are “really there,” as opposed to being artifacts of the natural sampling variation. In this talk, I will present Statistical Significance of Clustering (SigClust) as a cluster evaluation tool. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. Under this hypothesis testing framework, the cornerstone of our SigClust analysis is accurate estimation of those eigenvalues of the covariance matrix of the null multivariate Gaussian distribution. A likelihood based soft thresholding approach is proposed for the estimation of the covariance matrix eigenvalues. Our theoretical work and simulation studies show that our proposed SigClust procedure works remarkably well. Applications to some cancer microarray data examples demonstrate the usefulness of SigClust.

Asymptotic Properties of the Sample Distribution under Informative Selection

When: Mon, January 7, 2013 - 3:30pm
Where: MTH 1313
Speaker: Dr. Daniel Bonnery (Joint Program in Survey Methodology, Univ. of Maryland)
Abstract: Consider informative selection of a sample from a finite population.
Responses are realized as independent and identically distributed (iid) random variables with a probability density function (pdf) $f$, referred to as the superpopulation model. The selection is informative in the sense that the sample responses, given that they were selected, are not iid $f$. A limit sample pdf is defined, which corresponds to the limit distribution of the response of a unit given it was selected, when population and sample sizes grow to $\infty$. It is a weighted version $\rho \cdot f$ of the population pdf. In general, the informative selection mechanism may induce dependence among the selected observations.
The impact of the dependence among the selected observations on the behavior of basic distribution estimators, the (unweighted) empirical cumulative distribution function (cdf) and the kernel density estimator of the pdf, is studied. An asymptotic framework and weak conditions on the informative selection mechanism are developed under which these statistics computed on sample responses behave as if they were computed from an iid sample of observations from $\rho \cdot f$.
In particular, the empirical cdf converges uniformly, in $L_2$ and almost surely, to the corresponding version of the superpopulation cdf, yielding an analogue of the Glivenko-Cantelli theorem. Further, we compute the rate of convergence of the kernel density estimator to the limit sample pdf. When weak conditions on the selection are satisfied, one can consider that the responses are iid $\rho \cdot f$ in order to make inference on the population distribution. For example, if the response pdf belongs to a parametrized set $\{f_\theta\}$, and the stochastic dependence between the design and response variables is well known, then the likelihood derived as the product of limit sample pdf's can be used to compute a maximum sample likelihood estimator of $\theta$. Convergence and asymptotic normality of this estimator are established.

Calibrated Elastic Regularization in Matrix Completion

When: Thu, January 31, 2013 - 3:30pm
Where: MTH 1313
Speaker: Dr. Tingni Sun (Dept. of Statistics, Wharton School, Univ. of Pennsylvania)
Abstract: This talk concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a new method called calibrated elastic regularization and develop an iterative algorithm that alternates between imputing the missing entries and estimating the matrix by a scaled soft-thresholding singular value decomposition. Under proper coherence conditions and for suitable penalty levels, we prove that the proposed estimator achieves an error bound of nearly optimal rate and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems.

Renyi Entropy and Large Probability Sets

When: Thu, February 7, 2013 - 3:30pm
Where: MTH 1313
Speaker: Himanshu Tyagi (Dept. of Electrical Engineering, Univ. of Maryland)
Abstract: We provide an estimate of the size of a large probability set, associated with a
general random variable (rv), in terms of Renyi entropy. This result has several potential applications. For instance, in data compression, Renyi entropy serves to represent a general sequence of rvs just as Shannon entropy represents the minimum rate of bits needed to represent i.i.d. rvs. We also discuss another application in the context of multiterminal security.

This talk is based on joint work with Prakash Narayan.
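
For reference, the Renyi entropy of order α of a finite distribution p is (1/(1-α)) log Σ p_i^α, with Shannon entropy as the limit α → 1. The following is a minimal sketch of that definition only; it does not implement the large probability set estimate from the talk, and the function names and example distribution are illustrative assumptions.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in bits) of a finite probability vector p."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (in bits); alpha >= 0.

    The order alpha = 1 is defined as the Shannon limit.
    """
    if alpha == 1:
        return shannon_entropy(p)
    return math.log2(sum(q ** alpha for q in p if q > 0)) / (1 - alpha)

# Hypothetical distributions for illustration.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.25]

h2_uniform = renyi_entropy(uniform, 2)   # equals log2(4) for a uniform pmf
h2_skewed = renyi_entropy(skewed, 2)
h1_skewed = renyi_entropy(skewed, 1)
```

For a uniform distribution every order gives the same value, while for a non-uniform one the entropy decreases as the order increases.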

Higher order asymptotics: two applications

When: Thu, February 28, 2013 - 3:30pm
Where: MTH 1313
Speaker: Prof. Thomas Mathew (Dept. of Mathematics and Statistics, UMBC)
Abstract: For small sample problems, higher order asymptotics appear
to be an attractive option to handle likelihood based parametric
inference problems. In this talk, some higher order asymptotic
procedures will be described with a view to two applications:
(i) computation of tolerance intervals in a general mixed or random
effects model with balanced or unbalanced data, and (ii) testing
the homogeneity of relative potencies in multivariate bioassays. It
will be shown that in small sample size scenarios, higher order
corrections provide significant improvements to the usual
likelihood based asymptotic procedures. In the talk, numerical results
along this direction will be shown, and the methodology will be
illustrated with examples.

Nonparametric Instrumental Variable Regression

When: Thu, March 7, 2013 - 3:30pm
Where: MTH 1313
Speaker: Prof. Yuan Liao (Dept. of Mathematics, UMCP)
Abstract: In nonparametric regressions, when the regressor is correlated with
the error term, both the estimation and identification of the
nonparametric function are ill-posed problems. In the econometric
literature, instrumental variables have been used to address
the problem. But the problem is still very difficult because the
identification involves inverting a "Fredholm integral equation of the first
kind", whose inverse either does not exist or is unbounded. I will
start by motivating this problem with an application to the effect of
education on wage, then explain the concepts of instrumental variables
and Fredholm integral equations of the first kind. My proposed Bayesian
method does not require the nonparametric function to be identified;
when it is not identified, it can never be consistently estimated. Instead, a new consistency
concept based on "partial identification" will be introduced. This is
joint work with my Ph.D. advisor Professor Wenxin Jiang.

Improving Predictive Performance Using Dimension Reduction

When: Thu, March 14, 2013 - 3:30pm
Where: MTH 1313
Speaker: David Shaw (Dept. of Mathematics, UMCP)
Abstract: In many statistical applications, fitting models to handle cases in
which the distribution of the predictors changes between the training
and testing phases is a common problem. Specifically, in computer
vision, it is often the case that both well-defined geometric changes,
such as rotation in images, and more abstract transformations can be
estimated. It has recently been proposed that using dimension
reduction techniques to estimate a series of intermediate subspaces
between the training and testing data can yield good prediction
results under a shift in the covariate distribution. We adopt a general
method for learning shared representations between training and testing
data on which we build models to perform both regression
and classification. We show that this method is more beneficial than
similar methods from both a practical and a theoretical
standpoint, and it is used to improve predictive performance in
various computer vision tasks.

A Reversible Jump Hidden Markov Model Analysis of Search for Cancerous Nodules in X-ray Images

When: Thu, March 28, 2013 - 3:30pm
Where: MTH 1313
Speaker: Jin Yan (Department of Mathematics, UMCP)
Abstract: Nodules that may represent lung cancer are often missed in chest X-rays. Research has investigated factors affecting search effectiveness based on eye movement patterns, but statistical modeling of these patterns is rare. We analyze eye tracking data of participants looking at chest X-rays with a potential cancerous nodule to find out what areas on the images attract participants’ attention more, how their eyes jump among these areas, and which scan pattern is related to an effective capture of the nodule. By using the hidden Markov model and a modified reversible jump Markov chain Monte Carlo algorithm, we estimated the total number of areas of interest (AOIs) on each image, as well as their centers, sizes and orientations. We also use the pixel luminance as prior information, as nodules are often brighter and luminance may thus affect the AOIs. We found that the average number of AOIs per image is about 7, and that participants’ switching rate between AOIs is 4.1% on average. One of the AOIs covers the nodule precisely. Differences in scan patterns between those who found the nodule and those who didn't are discussed.

KEY WORDS: Eye tracking; Areas of interest; Reversible jump Markov chain Monte Carlo.

Coherence Structure and its Application in Mortality Forecasting

When: Thu, April 11, 2013 - 3:30pm
Where: MTH 1313
Speaker: Prof. Benjamin Kedem (Department of Mathematics, UMCP)
Abstract: Interaction terms expressed as products of covariates may prove
useful in regression problems. A graphical method is presented for identifying
potentially useful interaction and related covariates in modeling mortality time series using lagged coherence, a nonlinear extension of the squared coherence, and its residual coherence offshoot. The identified covariates are tested for their
significance within a regression model using quasi-likelihood and log link.

Shrink Large Covariance Matrix without Penalty: An Empirical Nonparametric Bayesian Framework for Brain Connectivity Network Analysis

When: Thu, April 25, 2013 - 3:30pm
Where: MTH 1313
Speaker: Dr. Shuo Chen (Dept. of Epidemiology and Biostatistics, Univ. of Maryland)
Abstract: In neuroimaging, brain connectivity generally refers to associations between neural units from distinct brain locations. We use nodes (vertices) to represent the neural processing units and edges to denote connectivity between those units, as in graph theory. In brain network analysis, the edge intensities (connectivity strengths) are usually taken as input data. For statistical modeling, the covariance between edges yields important information because it reflects not only the correlation structure between edges but also the spatial structure of nodes. However, the dimension of the covariance parameters is very high; for example, 300 nodes will lead to more than one billion covariance parameters between edges. Also, the correlations between edges within and outside of brain networks show different distributions. We propose a novel empirical nonparametric Bayesian framework that can efficiently shrink the number of covariance parameters between edges with a spatial structure constraint rather than a penalty term and yield inferences of brain networks. We apply this method to an fMRI study and simulated data sets to demonstrate the properties of our method.
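
The "more than one billion" figure in the abstract can be checked with a short count. This sketch assumes an undirected network in which every pair of nodes defines one edge, and it counts one covariance parameter per pair of distinct edges; the function name is hypothetical.

```python
from math import comb

def covariance_parameter_count(n_nodes):
    """Return (number of edges, number of edge-pair covariances).

    Assumes an undirected graph with an edge for every pair of nodes.
    """
    n_edges = comb(n_nodes, 2)   # unordered pairs of nodes
    n_cov = comb(n_edges, 2)     # unordered pairs of distinct edges
    return n_edges, n_cov

n_edges, n_cov = covariance_parameter_count(300)
# n_edges is 44850 and n_cov is 1005738825, just over one billion.
```

Counting the edge variances as well adds only 44,850 more parameters, so the order of magnitude is the same under either convention.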

Testing block-circularity of the covariance matrix and bridging to the compound symmetry and block-sphericity tests

When: Tue, April 30, 2013 - 3:30pm
Where: MTH 1313 (Note Change of Date)
Speaker: Prof. Carlos A. Coelho (Departamento de Matemática and Centro de Matemática e Aplicações, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Portugal)
Abstract: Using a suitable diagonalization of the block-circulant structure and by
adequately splitting the null hypothesis of block-circularity, it is possible to easily
define the likelihood ratio test statistics to test for compound symmetry or block-sphericity, once the block-circulant structure of the covariance matrix is assumed. This approach also enables us to easily build similar tests for complex multivariate normal random variables. Near-exact distributions, which lie very close to the exact distributions, are developed for the likelihood ratio test statistics. Keywords: characteristic function, composition of hypotheses, decomposition of the null hypothesis, distribution of likelihood ratio statistics, near-exact distributions, product of independent Beta random variables, sum of independent Gamma random variables.

Escort Distributions Minimizing the Kullback-Leibler Divergence for a Large Deviations Principle and Tests of Entropy Level

When: Thu, May 2, 2013 - 3:30pm
Where: MTH 1313
Speaker: Prof. Valérie Girardin (Laboratoire de Mathématiques Nicolas Oresme, Université de Caen Basse Normandie, France)
Abstract: Escort distributions, first introduced in multifractals and non-extensive
statistical physics, are now involved in numerous fields including coding
theory and large deviations principles (LDP). In this talk, some properties of escort distributions related to entropy will be investigated. Their role in information geometry, the Riemannian geometry induced by the Kullback-Leibler divergence on the linear space of all distributions on a given finite set, is thus highlighted. The need to minimize the divergence under constraints arises in numerous applications, including statistics. We are here interested in highly nonlinear entropic constraints that amount to determining first projections and finally the distance between entropic spheres in information geometry. This allows an LDP to be stated for the sequence of plug-in estimators of Shannon entropy of any finite distribution. Depending on the number of modes, the associated good rate function is shown to be either the divergence with respect to the distribution of one of its escorts or a function of its modes' weight. Finally, tests of entropy level using both the LDP and the distance between spheres are constructed and shown to behave well in terms of error probabilities.
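
The basic objects named in the abstract can be written down in a few lines. This minimal sketch uses the usual definition of the escort distribution of order q of a finite distribution p, with weights proportional to p_i^q, together with the Kullback-Leibler divergence and the plug-in estimator of Shannon entropy. It does not implement the large deviations principle or the tests from the talk. The function names and data are hypothetical, and p is assumed to be strictly positive.

```python
import math

def escort(p, q):
    """Escort distribution of order q: weights p_i**q, renormalized.

    Assumes every p_i is strictly positive.
    """
    weights = [pi ** q for pi in p]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(a, b):
    """Kullback-Leibler divergence K(a | b) in nats (b_i > 0 where a_i > 0)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def plug_in_entropy(counts):
    """Shannon entropy (nats) of the empirical distribution of the counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

# Hypothetical distribution and sample counts for illustration.
p = [0.5, 0.3, 0.2]
e2 = escort(p, 2.0)           # puts more weight on the mode of p
d = kl_divergence(e2, p)      # divergence of the escort from p
h_hat = plug_in_entropy([5, 5])
```

Order q = 1 returns p itself and q = 0 returns the uniform distribution on the support, so the escort family interpolates between p and distributions concentrated on its modes as q grows.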