### Statistics Archives for Fall 2014 to Spring 2015

#### Averaged Regression Quantiles

When: Thu, September 19, 2013 - 3:30pm
Where: MATH 1313
Speaker: Prof. Jana Jureckova (Dept. of Statistics, Charles University, Prague) -
Abstract: See this link: www2.math.umd.edu/~akagan/average_abstract.pdf

#### Lecture 2: Complexity Penalization in Low Rank Matrix Estimation

When: Wed, September 25, 2013 - 11:00am
Where: MTH 3206, Colloquium
Speaker: Prof. Vladimir Koltchinskii (School of Mathematics, Georgia Institute of Technology) - (NOTE DAY, TIME and ROOM CHANGE)

#### Lecture 3: Low Rank Estimation of Smooth Kernels on Graphs

When: Thu, September 26, 2013 - 3:30pm
Where: MATH 1313
Speaker: Prof. Vladimir Koltchinskii (School of Mathematics, Georgia Institute of Technology) -

#### Some Old Problems Revisited

When: Thu, October 3, 2013 - 3:30pm
Where: MTH 1313
Speaker: Abram Kagan (Dept. of Math, UMCP) -

#### Within-cluster resampling methods for clustered ROC data

When: Thu, October 10, 2013 - 3:30pm
Where: MATH 1313
Speaker: Dr. Larry Tang (Dept. of Statistics, George Mason University) -

Abstract:

Clustered ROC data are data in which each subject has multiple diseased and nondiseased observations. Within the same subject, observations are naturally correlated, and the cluster sizes may be informative of the subject’s disease status. The traditional ROC methods on clustered data could result in large bias and lead to incorrect statistical inference. We introduce within-cluster resampling (WCR) methods for clustered ROC data to account for within-cluster correlation and informative cluster sizes. The WCR methods work as follows. First, one observation is randomly selected from each patient, and then the traditional ROC methods are applied to the resampled data to obtain ROC estimates. These steps are performed multiple times and the average of resampled ROC estimates is the final estimator. The proposed method does not require a specific within-cluster correlation structure and yields a valid estimator when the cluster sizes are informative. We compare the proposed methods to existing methods in extensive simulation studies.
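The resampling loop described in the abstract can be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes the ROC summary of interest is the empirical AUC (Mann-Whitney form), that each observation is a `(score, is_diseased)` pair, and that resamples lacking either class are skipped. The names `auc_mann_whitney` and `wcr_auc` are hypothetical.

```python
import random


def auc_mann_whitney(diseased, nondiseased):
    """Empirical AUC: fraction of (diseased, nondiseased) pairs ranked
    correctly, counting ties as one half."""
    wins = 0.0
    for d in diseased:
        for n in nondiseased:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(diseased) * len(nondiseased))


def wcr_auc(clusters, n_resamples=500, seed=0):
    """Within-cluster resampling estimate of the AUC.

    clusters: one list per subject, each holding (score, is_diseased) pairs.
    Assumption: a resample with no diseased or no nondiseased pick is skipped.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Step 1: draw exactly one observation from every subject.
        picks = [rng.choice(observations) for observations in clusters]
        diseased = [score for score, is_d in picks if is_d]
        nondiseased = [score for score, is_d in picks if not is_d]
        # Step 2: apply the ordinary (independent-data) ROC method.
        if diseased and nondiseased:
            estimates.append(auc_mann_whitney(diseased, nondiseased))
    if not estimates:
        raise ValueError("no resample contained both classes")
    # Step 3: average the resampled estimates.
    return sum(estimates) / len(estimates)
```

Because each resample contains a single observation per subject, the ordinary estimator is applied to data without within-cluster correlation, and every subject carries the same weight regardless of cluster size.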

#### Option Prices in Terms of Distribution Functions

When: Thu, October 17, 2013 - 3:30pm
Where: MATH 1313
Speaker: Prof. Ju-Yi Yen (Dept. of Mathematical Sciences, Univ. of Cincinnati) -
Abstract: The Black-Scholes model is an important starting point for studying financial derivatives. In the Black-Scholes formula, the evolution of prices of a risky asset is described by an exponential martingale associated to a Brownian motion and, as a consequence, the Black-Scholes function is increasing and bounded and can be written as a distribution function. We shall explore the connection between Black-Scholes type functions and their distribution functions. We study the distribution function in terms of the last passage times, and extend the underlying martingale beyond the Brownian framework. Explicit examples of computations of these laws are given.
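As a small numerical illustration of the "increasing and bounded" claim, the sketch below evaluates a normalized Black-Scholes put price as a function of maturity. It rests on assumptions not stated in the abstract: zero interest rate, unit volatility, and initial price 1, so that the asset price is exp(B_t - t/2); the choice of the put (rather than the call) and the helper names `norm_cdf` and `normalized_put` are also illustrative only.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normalized_put(strike, t):
    """E[(K - exp(B_t - t/2))^+] / K for t > 0.

    Assumptions: zero interest rate, unit volatility, initial price 1.
    """
    sqrt_t = math.sqrt(t)
    d1 = (-math.log(strike) + 0.5 * t) / sqrt_t
    d2 = d1 - sqrt_t
    # Black-Scholes put: K * Phi(-d2) - Phi(-d1), divided here by K.
    return norm_cdf(-d2) - norm_cdf(-d1) / strike


# Evaluate on a grid of maturities: the values increase in t and stay in [0, 1].
maturities = [0.01, 0.1, 1.0, 10.0, 100.0]
values = [normalized_put(1.0, t) for t in maturities]
```

Under these assumptions the normalized price increases in the maturity and approaches 1, which is the behavior of a distribution function in t.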

#### Nonparametric Inference Based on Conditional Moment Inequalities

When: Thu, October 24, 2013 - 3:30pm
Where: MTH 1313
Speaker: Prof. Donald Andrews (Dept. of Economics, Dept. of Statistics, Yale University) -
Abstract: This paper develops methods of inference for nonparametric and semiparametric parameters defined by conditional moment inequalities and/or equalities. The parameters need not be identified. Confidence sets and tests are introduced. The correct uniform asymptotic size of these procedures is established. The false coverage probabilities and power of the CS's and tests are established for fixed alternatives and some local alternatives. Finite-sample simulation results are given for a nonparametric conditional quantile model with censoring and a nonparametric conditional treatment effect model. The recommended CS/test uses a Cramér-von-Mises-type test statistic and employs a generalized moment selection critical value.

#### Asymptotic Normality and Optimalities in Estimation of Large Gaussian Graphical Model

When: Thu, October 31, 2013 - 3:30pm
Where: MTH 1313
Speaker: Dr. Tingni Sun (Wharton School of Business, Univ. of Pennsylvania) -
Abstract: The Gaussian graphical model has a wide range of applications. In this talk, we consider a fundamental question: When is it possible to estimate low-dimensional parameters at parametric square-root rate in a large Gaussian graphical model? A novel regression approach is proposed to obtain asymptotically efficient estimation of each entry of a precision matrix under a sparseness condition relative to the sample size. The proposed estimator is also applied to test the presence of an edge in the Gaussian graphical model or to recover the support of the entire model. Theoretical properties are studied under a sparsity condition on the precision matrix and a side condition on the range of its spectrum, which significantly relaxes some commonly imposed conditions, e.g. the irrepresentable condition or an $\ell_1$ constraint on the precision matrix.

This is joint work with Zhao Ren, Cun-Hui Zhang and Harrison Zhou.

#### Asymptotic properties of the sample distribution under informative selection

When: Thu, November 7, 2013 - 3:30pm
Where: MTH 1313
Speaker: Dr. Daniel Bonnery (Joint Program in Survey Methodology, UMCP) -
Abstract: Consider informative selection of a sample from a finite population. Responses are realized as independent and identically distributed (iid) random variables with a probability density function (pdf) $f$, referred to as the superpopulation model. The selection is informative in the sense that the sample responses, given that they were selected, are not iid $f$. A limit sample pdf is defined, which corresponds to the limit distribution of the response of a unit given it was selected, when population and sample sizes grow to $\infty$. It is a weighted version $\rho f$ of the population pdf. In general, the informative selection mechanism may induce dependence among the selected observations.
The impact of the dependence among the selected observations on the behavior of basic distribution estimators, the (unweighted) empirical cumulative distribution function (cdf) and the kernel density estimator of the pdf, is studied. An asymptotic framework and weak conditions on the informative selection mechanism are developed under which these statistics computed on sample responses behave as if they were computed from an iid sample of observations from $\rho f$.
In particular, the empirical cdf converges uniformly, in $L_2$ and almost surely, to the corresponding version of the superpopulation cdf, yielding an analogue of the Glivenko-Cantelli theorem. Further, we compute the rate of convergence of the kernel density estimator to the limit sample pdf. When weak conditions on the selection are satisfied, one can consider that the responses are iid $\rho f$ in order to make inference on the population distribution. For example, if the response pdf belongs to a parametrized set $\{f_\theta\}$, and the stochastic dependence between the design and response variables is well known, then the likelihood derived as the product of limit sample pdf's can be used to compute a maximum sample likelihood estimator of $\theta$. Convergence and asymptotic normality of this estimator are established.

#### Some Properties of UMVUEs

When: Thu, November 14, 2013 - 3:30pm
Where: MTH 1313
Speaker: Prof. Abram Kagan (UMCP) -
Abstract: In all setups when the structure of the UMVUEs from data X is known, there exists a subalgebra U (a statistic U(X)) of the basic sigma-algebra such that all U-measurable statistics (depending on X only through U(X)) with finite second moments and only they are UMVUEs.
It is shown that these MVE-algebras (statistics) are, in a sense, similar to the subalgebras generated by complete sufficient statistics but the existence of the former does not require the existence of the latter. Examples are given when these subalgebras differ.

#### Statistical Inferences Using Large Estimated Covariances for Panel Data and Factor Models

When: Thu, November 21, 2013 - 3:30pm
Where: MTH 1313
Speaker: Prof. Yuan Liao (UMCP) -
Abstract: While most of the convergence results in the literature on high dimensional covariance matrices concern the accuracy of estimating the covariance matrix (and precision matrix), relatively less is known about the effect of estimating large covariances on statistical inferences. We study two important models: factor analysis and panel data model with interactive effects, and focus on the statistical inference and estimation efficiency of structural parameters based on large covariance estimators. For efficient estimation, both models call for a weighted principal components (WPC) method, which relies on a high dimensional weight matrix. This paper derives an efficient and feasible WPC using the covariance matrix estimator of Fan et al. (2013).
However, we demonstrate that existing results on large covariance estimation based on absolute convergence are not suitable for statistical inferences of the structural parameters. What is needed is some weighted consistency and the associated rate of convergence, which are obtained in this paper. Finally, the proposed method is applied to the US divorce rate data. We find that the efficient WPC identifies the significant effects of divorce-law reforms on the divorce rate, and it provides more accurate estimation and tighter confidence intervals than existing methods.

#### L1-Penalization in Functional Linear Regression with Gaussian Design

When: Thu, February 6, 2014 - 3:30pm
Where: Math 1313
Speaker: Stanislav Minsker (Duke) -
Abstract: The goal of this talk is to discuss the functional regression model with random Gaussian design and real-valued response. The main focus is on the problems in which the regression function can be well-approximated by a functional linear model with the slope function being "sparse" in the sense that it can be represented as a sum of a small number of well-separated "spikes".

This can be viewed as an extension of now classical sparse estimation problems to the infinite-dimensional case. We study an estimator of the regression function which is based on penalized empirical risk minimization with quadratic loss and the complexity penalty defined in terms of L1-norm (a continuous version of LASSO). We will introduce several important parameters characterizing sparsity in this class of models and present sharp oracle inequalities showing how the L2-error of the continuous LASSO estimator depends on the underlying sparse structure of the problem. As a corollary of our general results, we obtain new bounds for performance of the usual LASSO estimator applied to the problems with highly correlated design.

This talk is based on joint work with Vladimir Koltchinskii.

#### The Estimation of Leverage Effect with High Frequency Data

When: Tue, February 11, 2014 - 3:30pm
Where: MTH 1313
Speaker: Dr. Christina D. Wang (Princeton University) - (NOTE CHANGE OF DAY)
Abstract: The leverage effect has become an extensively studied phenomenon which describes the (usually) negative relation between stock returns and their volatility. Although this characteristic of stock returns is well acknowledged, most studies of the phenomenon are based on cross-sectional calibration with parametric models. On the statistical side, most previous works are conducted over daily or longer return horizons, and few of them have carefully studied its estimation, especially with high frequency data. However, estimation of the leverage effect is important because sensible inference is possible only when the leverage effect is estimated reliably. In this study, we are the first to provide nonparametric estimation for a class of stochastic measures of leverage effect. In order to construct estimators with good statistical properties, we introduce a new stochastic leverage effect parameter. The estimators and their statistical properties are provided in cases both with and without microstructure noise, under the stochastic volatility model. In asymptotics, the consistency and limiting distribution of the estimators are derived and corroborated by simulation results. Applications of the estimators are also explored. This estimator provides the opportunity to study high frequency regression, which leads to the prediction of volatility using not only previous volatility but also the leverage effect. An empirical study shows the significant prediction power of the return scaled by the leverage effect. The estimator also reveals a theoretical connection between skewness and the leverage effect, which further leads to the prediction of skewness. Furthermore, adopting the ideas similar to the estimation of the leverage effect, it is easy to extend the methods to study other important aspects of stock returns, such as volatility of volatility.

#### On the Identifiability of Q-matrix Based CDM’s

When: Thu, February 13, 2014 - 3:30pm
Where: MTH 1313
Speaker: Stephanie Zhang (Dept. of Statistics, Columbia University) -
Abstract: There has been growing interest in recent years in using cognitive diagnosis models (CDMs) for diagnostic measurement. However, many CDMs suffer from issues of identifiability, which limits their application. We begin addressing the issue by describing necessary and sufficient conditions for identifiability in two popular CDMs. Depending on the area of application and the researcher’s degree of control over the experiment design, fulfilling these conditions may be difficult; we also propose new methods for parameter estimation and respondent classification when non-identifiability is unavoidable. In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it.

#### Weighted likelihood estimation under two-phase sampling

When: Tue, February 18, 2014 - 3:30pm
Where: Math 1313
Speaker: Takumi Saegusa (University of Washington) - https://sites.google.com/site/tsaegusa/
Abstract: Two-phase sampling is a sampling design for cost reduction and improved efficiency adopted in epidemiological studies. At the first phase, a sample is obtained from the population to collect variables for stratification. At the second phase, a stratified sample is drawn without replacement to measure variables of interest. Examples include case-cohort studies and stratified case-control studies. Various statistical methods have been proposed for this design, but dependence due to sampling without replacement has been largely ignored for mathematical convenience. In this talk, I will discuss large sample theory for two-phase sampling that accounts for dependence of observations. Specifically, I will consider three statistical problems (estimation, its efficiency improvement, and the bootstrap) arising from the RV144 immune correlate study based on the culminating RV144 HIV vaccine trial. For estimation, an asymptotic distribution of the Weighted Likelihood Estimator (WLE) is derived in a general semiparametric model, and its asymptotic variance is shown to be generally smaller than under the convenient assumption of independence. For improving efficiency of the WLE, the standard methods of estimating weights and calibration are shown to improve the efficiency only under the independence assumption, and a new calibration method is proposed. For the bootstrap, a novel bootstrap procedure is proposed to yield randomness from different phases and strata. Theoretical justification behind these methods will be briefly discussed. Our methods are applied to the analysis of the National Wilms Tumor Study data.

#### Statistical and Computational Tradeoffs in High Dimensional Learning

When: Thu, February 20, 2014 - 3:30pm
Where: Math 1313
Speaker: Quentin Berthet (Princeton) - http://www.princeton.edu/~qberthet/
Abstract: With the recent data revolution, statisticians are considering larger datasets, more sophisticated models, more complex problems. As a consequence, the algorithmic aspect of statistical methods can no longer be neglected in a world where computational power is the bottleneck, not the lack of observations. In this context, we will establish fundamental limits in the statistical performance of computationally efficient procedures, for the problem of sparse principal component analysis. We will show how it is achieved through average-case reduction to the planted clique problem, and introduce further areas of research in this promising field.

#### Is there a needle in the haystack? Marginal screening and non-standard asymptotics

When: Thu, February 27, 2014 - 3:30pm
Where: MTH 1313
Speaker: Prof. Ian McKeague (Dept. of Biostatistics, Columbia University) -
Abstract: This talk discusses marginal screening for detecting the presence of significant predictors in high-dimensional regression (is there a needle in the haystack?). Screening large numbers of predictors is a challenging problem due to the non-standard limiting behavior of post-model-selected estimators. There is a common misconception that the oracle property for such estimators is a panacea, but the oracle property only holds away from the null hypothesis of interest in marginal screening. To address this difficulty, we propose an adaptive resampling test (ART). Our approach provides an alternative to the popular (yet conservative) Bonferroni method of controlling familywise error rates. ART is adaptive in the sense that thresholding is used to decide whether the centered percentile bootstrap applies, and otherwise adapts to the non-standard asymptotics in the tightest way possible. The talk is based on joint work with Min Qian.

#### On the Identifiability of Q-matrix Based CDM’s

When: Thu, March 6, 2014 - 3:30pm
Where: MTH 1313
Speaker: Stephanie Zhang (Dept. of Statistics, Columbia University) -
Abstract: There has been growing interest in recent years in using cognitive diagnosis models (CDMs) for diagnostic measurement. However, many CDMs suffer from issues of identifiability, which limits their application. We begin addressing the issue by describing necessary and sufficient conditions for identifiability in two popular CDMs. Depending on the area of application and the researcher’s degree of control over the experiment design, fulfilling these conditions may be difficult; we also propose new methods for parameter estimation and respondent classification when non-identifiability is unavoidable. In addition, our framework allows consistent estimation of the severity of the non-identifiability problem, in terms of the proportion of the population affected by it.

#### Martingale Difference Correlation and High Dimensional Feature Screening

When: Thu, March 13, 2014 - 3:30pm
Where: MTH 1313
Speaker: Prof. Xiaofeng Shao (Univ. of Illinois, Urbana-Champaign) -
Abstract: In this talk, I will introduce a new metric, the so-called martingale difference correlation to measure the conditional mean dependence between a scalar response variable V and a vector predictor variable U. Our metric is a natural extension of the recently proposed distance correlation, which is used to measure the dependence between V and U. The martingale difference correlation and its empirical counterpart inherit a number of desirable features of distance correlation and sample distance correlation, such as algebraic simplicity and elegant theoretical properties. We further use martingale difference correlation as a marginal utility to do high dimensional feature screening to screen out variables that do not contribute to conditional mean of the response given the covariates. An extension to conditional quantile screening will be described and sure screening consistency for both screening procedures will also be presented.
I will conclude the talk by showing selected simulation results and a real data illustration, which demonstrate the effectiveness of martingale difference correlation based screening procedures in comparison with the existing counterparts.

#### Link Prediction for Partially Observed Networks

When: Thu, April 3, 2014 - 3:30pm
Where: MTH 1313
Speaker: Prof. Yunpeng Zhao (George Mason University) -
Abstract: Link prediction is one of the fundamental problems in network analysis. In many applications, notably in genetics, a partially observed network may not contain any negative examples of absent edges, which creates a difficulty for many existing supervised learning approaches. In this talk, we propose a new method which treats the observed network as a sample of the true network with different sampling rates for positive and negative examples. We obtain a relative ranking of potential links by their probabilities, utilizing information on node covariates as well as on network topology. Empirically, the method performs well under many settings, including when the observed network is sparse. We apply the method to a protein-protein interaction network and a school friendship network. If time allows, I will briefly discuss a modification of this method using low-rank matrix decomposition technique.

#### Drift in Transaction-Level Asset Price Models

When: Thu, April 10, 2014 - 3:30pm
Where: MTH 1313
Speaker: Prof. Clifford Hurvich (New York University) -
Abstract: We study the effect of drift in pure-jump transaction-level models for asset prices in continuous time, driven by point processes. The drift is assumed to arise from a nonzero mean in the efficient shock series. It follows that the drift is proportional to the driving point process itself, i.e. the cumulative number of transactions. This link reveals a mechanism by which properties of intertrade durations (such as heavy tails and long memory) can have a strong impact on properties of average returns, thereby potentially making it extremely difficult to determine growth rates.

We focus on a basic univariate model for log price, coupled with general assumptions on durations that are satisfied by several existing flexible models, allowing for both long memory and heavy tails in durations. Under our pure-jump model, we obtain the limiting distribution for the suitably normalized log price. This limiting distribution need not be Gaussian, and may have either finite variance or infinite variance. We show that the drift can affect not only the limiting distribution for the normalized log price, but also the rate in the corresponding normalization. Therefore, the drift (or equivalently, the properties of durations) affects the rate of convergence of estimators of the growth rate, and can invalidate standard hypothesis tests for that growth rate.
Our analysis also sheds some new light on two longstanding debates as to whether stock returns have long memory or infinite variance.

#### A Semiparametric Approach to Source Separation Using Independent Component Analysis

When: Thu, April 17, 2014 - 3:30pm
Where: MTH 1313
Speaker: Prof. Sujit Ghosh (Department of Statistics, NC State University, National Science Foundation (on IPA Assignment)) -
Abstract: Data processing and source identification using lower dimensional hidden structure plays an essential role in many fields of applications, including image processing, neural networks, genome studies, signal processing and other areas where large datasets are often encountered. Representations of a higher dimensional random vector using a lower dimensional vector provide a statistical framework for the identification and separation of the sources. One of the common methods for source separation using lower dimensional structure involves the use of Independent Component Analysis (ICA), which is based on a linear representation of the observed data in terms of independent hidden sources. A distinguishing feature of ICA compared with other source separation methods is that the lower dimensional random variables are extracted as independent sources in contrast to uncorrelated variables (e.g., as in PCA). The problem thus involves the estimation of the linear mixing matrix and the densities of the independent latent sources. However, the solution to the problem depends on the identifiability of the sources. First, a set of sufficient conditions is established to resolve the identifiability of the sources using moment restrictions of the hidden source variables. Under such sufficient conditions, semi-parametric maximum likelihood estimates of the mixing matrix and source densities are derived. The consistency of the proposed estimates is established under additional mild regularity conditions. The proposed method is illustrated and compared with existing methods using simulated data scenarios and also using data sets on brain imaging that are commonly used in practice.

#### Binary Response Models for Recognition and Classification of Anti-Microbial Peptides

When: Thu, April 24, 2014 - 3:30pm
Where: MTH 1313
Speaker: Dr. Elena Rantou (FDA/CDER) -
Abstract: There is now great urgency in developing new antibiotics to combat bacterial resistance. Recent attention has turned to naturally-occurring antimicrobial peptides (AMPs) that can serve as templates for antibacterial drug research. Several experiments strongly indicate that the physicochemical properties of a peptide influence its antimicrobial activity. This work focuses on the recognition of AMPs based on such physicochemical properties (global features) and attempts to assign probabilities to whether a peptide is an AMP. In particular, it presents a robust, randomized test of the relevance of features for AMP recognition. Additionally, it suggests a rigorous approach for constructing predictive models that employ relevant features and their combinations to assign to a novel peptide sequence a probability of having antimicrobial activity. These are binary-response logistic regression models, which are capable of associating probabilities with single-feature or feature-interaction predictors. Based on different criteria, we arrive at the model of best fit. Furthermore, the use of classification trees seems to be a promising robust alternative. Taken together, the present work provides the means of elucidating features of importance for antimicrobial activity and is a first step towards modification or design of novel AMPs for treatment.

Keywords:

Recognition of antimicrobial peptides; AMP; significance testing; logistic regression models; tree classification methods

*:This work was completed prior to joining the US Food and Drug Administration and has not been cleared by the agency.

#### Combining Survival Trials Using Aggregate Data Based on Misspecified Models

When: Thu, May 1, 2014 - 3:30pm
Where: MATH 1313
Speaker: Dr. Tinghui Yu (FDA/HHS) -
Abstract: The non-linear structure of the (log) hazard ratio estimates based on a Cox proportional hazard model leads to challenges in the development of algorithms for combining multiple clinical trials when only aggregate patient data are available. In particular, if the treatment effects of the same therapeutic observed in the different clinical trials of concern are essentially different, one does not expect that a convex combination of the hazard ratio estimates can lead to a precise description of the treatment effect in the overall population. In this paper, we propose a combined hazard ratio estimate using aggregate data. An interpretation of the method is provided in the framework of robust data analyses with misspecified models. Its asymptotic efficiency and the power of a Wald test for the combined treatment effects are demonstrated using simulations.

_______________________________________
FDA, Center for Devices and Radiological Health.
Merck Research Laboratories.
Independent consultant

#### Semiparametric Factor Analysis

When: Thu, May 8, 2014 - 3:30pm
Where: MATH 1313
Speaker: Prof. Yuan Liao (Department of Mathematics, UMCP) -
Abstract: This paper studies a high-dimensional semi-parametric factor model with nonparametric loading curves that depend on a few observed characteristic variables. We propose a projected principal components method to estimate the unknown factors, loadings, and number of factors. It is shown that after projecting the response variable onto the sieve space spanned by the characteristic variables, the projected-PC method yields a significant improvement in the rates of convergence over the regular methods in classical factor analysis. In particular, consistency can be achieved without a diverging sample size, as long as the dimensionality grows. This demonstrates that the high dimensionality for semi-parametric factor analysis is actually a blessing, and thus the proposed method is useful in the typical high-dimension-low-sample-size situations. In addition, we also propose a new specification test for the nonparametric loading curves, which fills a gap in the testing literature for semi-parametric factor models.