Statistics Archives for Fall 2016 to Spring 2017


Large Sample Theory for Multiple-Frame Sampling

When: Thu, September 8, 2016 - 3:30pm
Where: MTH 1313
Speaker: Takumi Saegusa, Assistant Professor (Statistics Program, Department of Mathematics, University of Maryland - College Park)
Abstract: Multiple-frame sampling is a commonly used sampling technique in sample surveys that takes multiple samples from distinct but overlapping sampling frames. The main statistical issues are (1) the same unit can be sampled multiple times from different frames with different probabilities, and (2) a sample from each frame is dependent due to sampling without replacement. We study a weighted empirical process based on Hartley’s estimator, and extend empirical process theory to our non-i.i.d. setting without requiring additional design conditions. We apply our results to general semiparametric models and the optimal calibration problem.
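
For readers unfamiliar with Hartley's estimator, the following is a minimal sketch of the textbook dual-frame form, not the speaker's weighted empirical process. It assumes two frames A and B, known design weights (inverse inclusion probabilities), and a fixed mixing constant theta for units in the overlap; the function name, the tuple layout, and the toy numbers are illustrative assumptions.

```python
def hartley_total(sample_a, sample_b, theta=0.5):
    """Dual-frame estimate of a population total (illustrative sketch).

    Each sample is a list of (y, weight, in_overlap) tuples:
      y          -- observed value for the sampled unit
      weight     -- design weight in that frame (1 / inclusion probability)
      in_overlap -- True if the unit belongs to both frames
    Overlap units from frame A are down-weighted by theta and overlap units
    from frame B by (1 - theta), so the overlap domain is not double counted.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    total = 0.0
    for y, weight, in_overlap in sample_a:
        total += (theta if in_overlap else 1.0) * weight * y
    for y, weight, in_overlap in sample_b:
        total += ((1.0 - theta) if in_overlap else 1.0) * weight * y
    return total


# Toy data (hypothetical): one A-only unit, one overlap unit seen in each
# frame's sample, and one B-only unit.
sample_a = [(10.0, 2.0, False), (4.0, 2.0, True)]
sample_b = [(6.0, 3.0, True), (1.0, 3.0, False)]
print(hartley_total(sample_a, sample_b, theta=0.5))
```

Setting theta to 0 or 1 uses only one frame's information about the overlap domain; intermediate values pool both.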

Statistical Strategies in Analyzing Data with Unequal Prior Knowledge

When: Thu, September 15, 2016 - 3:30pm
Where: MTH 1313
Speaker: Heping Zhang, S.D. Bliss Professor (Biostatistics, School of Public Health, Yale University)
Abstract: The advent of technologies including high-throughput genotyping and computer information technologies has produced ever larger and more diverse databases that are potentially information rich. This creates the need to develop statistical strategies that have a sound mathematical foundation and are computationally feasible and reliable. In statistics, we commonly deal with relationships between variables using correlation and regression models. With diverse databases, the quality of the variables may vary and we may know more about some variables than the others. I will present some ideas on how to conduct statistical inference with unequal prior knowledge. Specifically, how do we define correlation between two sets of random variables conditional on a third set of random variables, and how do we select predictors when we have information from sources other than the databases with raw data? I will address some mathematical and computational challenges in order to answer these questions. Analysis of real genomic data will be presented to support the proposed methods and highlight remaining challenges.

Estimation of a Directed Acyclic Gaussian Graph

When: Thu, September 29, 2016 - 3:30pm
Where: MTH 1313
Speaker: Xiaotong Shen, J.B. Johnson Distinguished Professor (School of Statistics, University of Minnesota)
Abstract: Directed acyclic graphs are widely used to describe, among interacting units, causal relations. Causal relations are estimated by reconstructing a directed acyclic graph's structure, presenting a great challenge when the unknown total ordering of a DAG needs to be estimated. In such a situation, it remains unclear if a graph's structure is reconstructable in the absence of an identifiable likelihood with regard to graphs, and in facing super-exponentially many candidate graphs in the number of nodes. In this talk, I will introduce a global approach for observational data and interventional data, to identify all estimable causal directions and estimate model parameters. This approach uses constrained maximum likelihood with nonconvex constraints reinforcing the non-loop requirement to yield an estimated directed acyclic graph, where super-exponentially many constraints characterize the major challenge. Computational issues will be discussed in addition to some theoretical aspects. This work is joint with Y. Yuan, W. Pan and Z. Wang.

Unravelling How the Human Microbiome Impacts Health and Disease

When: Thu, October 13, 2016 - 3:30pm
Where: MTH 1313
Speaker: Mihai Pop, Professor (Department of Computer Science, University of Maryland - College Park)
Abstract: Metagenomics studies aim to characterize microbial communities through the direct sequencing of their collective DNA. While initial studies have been focused on simply extending existing approaches developed in microbial genomics, recently scientists have started to explore the potential of metagenomic data to provide biological insights not apparent in isolate genomes. During my talk I will provide an overview of the field and meta-analyses made possible by looking at communities as a whole, and describe some recent results from my lab related to human health.

Mechanism-Based Modeling for Transcriptional Regulation

When: Thu, October 20, 2016 - 3:30pm
Where: MTH 1313
Speaker: Sridhar Hannenhalli, Professor (Department of Cell Biology and Molecular Genetics)
Abstract: In this short talk, I will first briefly discuss our evolving understanding of transcriptional regulation, starting with proximal promoters and moving to distal enhancers aided by chromatin structure. I will then summarize recent work in the lab that exploits the current understanding of distal enhancers to better infer regulatory polymorphisms and how that can aid in a mechanism-based model for GWAS. I will then summarize our recent work on modeling of transcription factor and DNA interactions. Finally, time permitting, I will present a novel in vivo mechanism for boosting transcription factor occupancy that relies on crowdsourcing the task to a multitude of spatially clustered homotypic binding sites.

On High-dimensional Misspecified Mixed Model Analysis in Genome-wide Association Study

When: Thu, October 27, 2016 - 3:30pm
Where: MTH 1313
Speaker: Jiming Jiang, Professor (Department of Statistics, University of California - Davis)
Abstract: We study the behavior of the restricted maximum likelihood (REML) estimator under a misspecified linear mixed model (LMM) that has received much attention in recent genome-wide association studies. The asymptotic analysis establishes consistency of the REML estimator of the variance of the errors in the LMM, and convergence in probability of the REML estimator of the variance of the random effects in the LMM to a certain limit, which is equal to the true variance of the random effects multiplied by the limiting proportion of the nonzero random effects present in the LMM. The asymptotic results also establish the asymptotic distribution of the REML estimator. The asymptotic results are fully supported by the results of empirical studies, which include extensive simulations comparing the performance of the REML estimator (under the misspecified LMM) with other existing methods, and real data applications (only one example will be presented) that have important genetic implications. This work is joint with Debashis Paul of UC Davis and Cong Li, Can Yang, and Hongyu Zhao of Yale University.

Survival Analysis Approach to Demand Forecasting in a Request-Based Service System

When: Thu, November 3, 2016 - 3:30pm
Where: MTH 1313
Speaker: Ta-Hsin Li, Ph.D. (IBM T. J. Watson Research Center)
Abstract: In some service businesses, service requests are initiated by the customer who specifies not only what type of service is needed but also when it is needed. Recorded in a service request management system, these requests are utilized to forecast the demand of different categories of service at different time horizons. Because the requests are subject to revision, the customer-specified time of service is not entirely reliable for demand forecasting. In this talk, we consider a resource-pool based software development service operation and discuss a survival analysis approach that explores the statistical characteristics of historical request data with the aim of providing more accurate demand forecasts. The survival models are constructed on the basis of a large hierarchy of requests, defined by the demand categories and the customers. A nonparametric approach is taken to handle the large scale of the hierarchy and the diversity of survival patterns. We also employ the regularized Cox's proportional hazards regression method and the Dirichlet-prior-based empirical Bayesian method to overcome the inevitable challenge of data sparsity in training the category- and customer-specific survival functions. Different techniques of estimating the hyperparameter are compared for their performance in demand forecasting.

Frequentist and Bayesian Approaches for Characterizing Heterogeneity in Transcriptomic Meta-Analysis

When: Thu, December 1, 2016 - 3:30pm
Where: MTH 1313
Speaker: George Tseng, Professor (Department of Biostatistics, University of Pittsburgh)
Abstract: Transcriptomic meta-analysis combines multiple gene expression datasets to increase accuracy and robustness in detecting differentially expressed genes. There has been increasing attention to characterizing heterogeneity while combining information across multiple genomic studies. In this talk, we will present an adaptively weighted Fisher’s method from a frequentist perspective for this purpose. We will show its appealing features in applications, and related theoretical and computational advances. In the second part of the talk, we will present a Bayesian hierarchical model for meta-analysis of RNA-seq data. The full Bayesian model overcomes the efficiency loss of conventional frequentist two-stage approaches (e.g. Fisher’s method) and the new method is particularly powerful for detecting genes with low expression counts. We will use the two methods to demonstrate pros and cons of using frequentist and Bayesian approaches in genomic applications.
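
As background, the classical (unweighted) Fisher's method that the talk builds on can be sketched in a few lines. This is the standard textbook procedure for independent p-values, not the adaptively weighted variant presented by the speaker; the function name and the example p-values are illustrative assumptions.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with the classical Fisher's method.

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the global null. For even degrees of
    freedom the upper tail has the closed form
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    so no external statistics library is needed.
    """
    k = len(pvalues)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("need at least one p-value, each in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    term = 1.0    # current series term (x/2)^i / i!, starting at i = 0
    series = 1.0  # running sum of the series
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return math.exp(-half_stat) * series


# Hypothetical per-study p-values for one gene across three studies.
print(fisher_combined_pvalue([0.04, 0.10, 0.20]))
```

Because every study contributes equally to the statistic, this version cannot indicate which studies drive a significant result, which is the kind of heterogeneity the abstract is concerned with.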

High Dimensions, Inference and Combinatorics. A Journey Through the Data Jungle.

When: Tue, January 3, 2017 - 2:00pm
Where: 1311 Math Building
Speaker: Matey N. Neykov (Postdoc, ORFE, Princeton University)
Abstract: This talk takes us on a journey through modern high-dimensional statistics. We begin with a brief discussion on variable selection and estimation and the challenges they bring to high-dimensional inference, and we formulate a new family of inferential problems for graphical models. Our aim is to conduct hypothesis tests on graph properties such as connectivity, maximum degree and cycle presence. The testing algorithms we introduce are applicable to properties which are invariant under edge addition. In parallel, we also develop a minimax lower bound showing the optimality of our tests over a broad family of graph properties. We apply our methods to study neuroimaging data.

Borrowing Information over Time in Binomial/Logit Models for Small Area Estimation

When: Fri, January 6, 2017 - 11:00am
Where: MTH 1308
Speaker: Carolina Franco, Ph.D. (Census Bureau)
Abstract: In the analysis of survey data, linear mixed models, such as that of Fay-Herriot (1979), have been widely studied and applied to exploit the availability of covariates from administrative records and other sources when predicting population parameters. Such models face challenges when applied to discrete survey data as commonly arise from survey estimates of the number of persons possessing some characteristic, such as the number of persons in poverty. For such applications, we examine a binomial/logit normal (BLN) model that assumes a binomial distribution for rescaled survey estimates and a normal distribution with a linear regression mean function for logits of the true proportions. Effective sample sizes are defined so variances given the true proportions equal corresponding sampling variances of the direct survey estimates. We extend the BLN model to bivariate and time series versions to permit borrowing information from past survey estimates, then apply these models to data used by the U.S. Census Bureau Small Area Income and Poverty Estimates (SAIPE) program to predict county poverty for school-age children. For this application, we compare prediction results from the alternative models to see how much the bivariate and time series models reduce prediction error variances from those of the univariate BLN model. More generally, we explore analytically and empirically under what circumstances one might expect bivariate or time series extensions of small area models to result in significant improvements in prediction.
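
One way to write down the univariate model described above is sketched below. The notation (area index i, covariates x_i, area effect u_i, direct-estimate sampling variance v_i) is assumed for illustration and is not taken from the talk.

```latex
% y_i: rescaled survey estimate for area i; p_i: true proportion
y_i \mid p_i \sim \mathrm{Binomial}(\tilde n_i,\, p_i), \qquad
\operatorname{logit}(p_i) = x_i^{\top}\beta + u_i, \qquad
u_i \sim N(0, \sigma_u^2).

% Effective sample size: match the binomial variance of y_i / \tilde n_i
% to the sampling variance v_i of the direct survey estimate
\frac{p_i (1 - p_i)}{\tilde n_i} = v_i
\quad\Longrightarrow\quad
\tilde n_i = \frac{p_i (1 - p_i)}{v_i}.
```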

Information Recovery in Shuffled Graphs via Graph Matching

When: Tue, January 10, 2017 - 1:30pm
Where: Kirwan Hall 1308
Speaker: Vincent Lyzinski (JHU)
Abstract: In a number of methodologies for inference across multiple graphs, it is assumed that an explicit vertex correspondence is a priori known across the vertex sets of the graphs. While this assumption is often reasonable, in practice these correspondences may be unobserved and/or errorfully observed, and graph matching---aligning a pair of graphs to minimize their edge disagreements---is used to align the graphs before performing subsequent inference. We provide an information theoretic foundation for answering the following questions: What is the increase in uncertainty (i.e., loss of the mutual information) between two graphs when the labeling across graphs is errorfully observed, and can this lost information be recovered via graph matching? Working in the correlated stochastic blockmodel setting, we prove that when graph matching can perfectly recover an errorfully observed correspondence, relatively little information is lost due to shuffling. Although we demonstrate that this lost information can have a dramatic effect on the performance of subsequent inference, we also show that asymptotically almost all of the lost information can be recovered via graph matching, which has the effect of recovering much of the lost inferential performance. Lastly, we demonstrate the practical impact of vertex shuffling and subsequent matching in a pair of inference tasks: two sample graph hypothesis testing and joint graph clustering.

A Systematic Study on Weighing Schemes for Functional Data

When: Thu, January 26, 2017 - 3:30pm
Where: MTH 1313
Speaker: Professor Xiaoke Zhang (Department of Statistics, University of Delaware)
Abstract: In the past few decades, especially with the advent of the “Big Data” era, functional data, that arise from a sample of functions, have become increasingly common, and functional data analysis (FDA) has received substantial attention. Nonparametric estimation of the mean function is a fundamental issue in FDA, and plenty of methods have been developed and intensively studied. Although not always emphasized, each method usually adopts a pre-specified weighing scheme in the estimation procedure, i.e., a strategy of allocating weights to observations in the objective function. In this talk, we systematically explore the effect of weighing schemes on the local linear smoother for the mean function. We first establish the unified asymptotic property of this estimator under a general weighing scheme. Then we focus on two special but commonly used schemes in FDA, the equal-weight-per-observation (OBS) scheme, and the equal-weight-per-subject (SUBJ) scheme. We comprehensively compare their asymptotic properties and numerical performances. Finally, to improve both OBS and SUBJ estimators, and to provide practical guidance, we propose the optimal weighing scheme in terms of L2 rate of convergence. At the end of the talk, I will briefly introduce my other recent work on FDA.

Adaptive Estimation in Two-Way Sparse Reduced-Rank Regression

When: Thu, February 2, 2017 - 3:30pm
Where: MTH 1313
Speaker: Professor Tingni Sun (Statistics Program, Department of Mathematics, University of Maryland - College Park)
Abstract: This talk considers the problem of estimating a large coefficient matrix in a multiple response linear regression model in the high-dimensional setting, where the numbers of predictors and response variables can be much larger than the number of observations. The coefficient matrix is assumed not only to be of low rank, but also to have a small number of nonzero rows and nonzero columns. We propose a new estimation scheme and provide its nearly optimal non-asymptotic minimax rates of estimation error under a collection of squared Schatten norm losses simultaneously. Some numerical studies will also be discussed.

Multiple Changepoint Detection in Climate Time Series

When: Thu, February 16, 2017 - 3:30pm
Where: MTH 1313
Speaker: Professor Robert Lund (Department of Mathematical Sciences, Clemson University & NSF)
Abstract: This talk presents methods to estimate the number of changepoints and their locations in time-ordered data sequences. A penalized likelihood objective function is developed from minimum description length (MDL) information theory principles. Optimizing the objective function yields estimates of the changepoint number(s) and location time(s). Our MDL penalty depends on where the changepoint(s) lie, and not solely on the total number of changepoints (as classical AIC and BIC penalties do). Specifically, changepoint configurations that occur relatively close to one another are penalized more heavily than sparsely arranged changepoints. The techniques allow for autocorrelation in the observations and mean shifts at each changepoint time. A genetic algorithm, which is an intelligent random walk search, is developed to rapidly optimize the penalized likelihood. The scenario where a "metadata" record exists documenting some (not necessarily all) of the station changes is next considered. A prior distribution of changepoint times is constructed for this situation and the MDL penalty is modified to describe the scenario. This allows us to analyze documented and undocumented changepoint times throughout. Applications are presented throughout.

Modeling Longitudinal Data with a Random Change Point and No Time-Zero: Applications to Inference and Prediction in Single and Consecutive Labor Curves

When: Thu, February 23, 2017 - 3:30pm
Where: Math Building 1313
Speaker: Paul Albert, Senior Investigator and Chief (Division of Cancer Epidemiology and Genetics, NCI, NIH)
Abstract: In some longitudinal studies the initiation time of the process is not clearly defined, yet it is important to make inference or do predictions about the longitudinal process. The application of interest in this article is to provide a framework for modeling individualized labor curves (longitudinal cervical dilation measurements) where the start of labor is not clearly defined. This is a well-known problem in obstetrics where the benchmark reference time is often chosen as the end of the process (individuals are fully dilated at 10 cm) and time is run backwards. This approach results in valid and efficient inference unless subjects are censored before the end of the process (due to a c-section, for example), or if we are focused on prediction. Providing dynamic individualized predictions of the longitudinal labor curve prospectively (where backwards time is unknown) is of interest to help obstetricians determine if a labor is on a suitable trajectory. We propose a model for longitudinal labor dilation that uses random effects with an unknown time-zero and a random change point. We present a maximum likelihood approach for parameter estimation that uses adaptive Gaussian quadrature for the numerical integration. A Monte Carlo approach for dynamic prediction of the future longitudinal dilation trajectory from past dilation measurements is proposed. Further, we discuss an extension of this work to the setting in which consecutive pregnancies are available, in the hope of using prior pregnancy information to predict the labor curves in subsequent pregnancies. We illustrate this methodology with labor dilation data from the Consortium on Safe Labor (CSL) and the Consecutive Pregnancy Study (CPS), both NICHD intramural projects. This work is collaborative research with Drs. Alex McLain of the University of South Carolina and Olive Buhule of NICHD.

Ontology-based Biomedical Data Standardization, Integration, and Statistical Analysis

When: Thu, March 2, 2017 - 3:30pm
Where: Math Building 1313
Speaker: Oliver He, Associate Professor (Department of Microbiology and Immunology, University of Michigan Medical School, Ann Arbor)
Abstract: A biomedical ontology is a human- and computer-interpretable set of terms and relations that represent entities in a specific biomedical domain and how they relate to each other. In cutting-edge biomedical research, ontologies have played critical roles, for example, serving as advanced controlled terminologies, knowledge bases, metadata standards, and supporting integrative statistical data analysis. The Ontology of Biological and Clinical Statistics (OBCS) is a community-based open source ontology that represents statistics-related terms and their relations in a rigorous fashion. There also exist many domain-specific biomedical ontologies, such as the Vaccine Ontology (VO) for the domain of vaccines and vaccination, and the Ontology of Adverse Events (OAE) for the domain of adverse events following various medical interventions. The usage of these and other ontologies supports standard and reproducible data representation and statistical analysis in different biomedical domains. For example, ontologies and ontology-based statistical methods support: (i) advanced literature mining and analysis of vaccine-mediated gene-gene interaction networks, and (ii) data standardization and analysis of clinically reported vaccine and drug adverse event cases. A theory-oriented OneNet framework is finally proposed to integrate different ontologies and ontology-supported statistical approaches for integrative and systematic life science research.

Dirichlet Mixtures, the Dirichlet Process, and the Topography of Amino-Acid Multinomial Space

When: Thu, March 9, 2017 - 3:30pm
Where: Math Building 1313
Speaker: Stephen Altschul, Senior Investigator (National Center for Biotechnology Information, NIH)
Abstract: The Dirichlet Process is used to estimate probability distributions that are mixtures of an unknown and unbounded number of components. Amino acid frequencies at homologous positions within related proteins have been fruitfully modeled by Dirichlet mixtures, and we have used the Dirichlet Process to construct such distributions. The resulting mixtures describe multiple alignment data substantially better than do those previously derived. They consist of over 500 components, in contrast to fewer than 40 previously, and provide a novel perspective on protein structure. Individual protein positions should be seen not as falling into one of several categories, but rather as arrayed near probability ridges winding through amino-acid multinomial space.

Tight and Probabilistic Tight Frames

When: Thu, March 16, 2017 - 3:30pm
Where: Math Building 1313
Speaker: Kasso Okoudjou, Professor (Department of Mathematics, University of Maryland - College Park)
Abstract: A tight frame for a finite-dimensional Euclidean space is an overcomplete spanning set. The class of tight frames and some of its subclasses appear in many areas including coding theory, discrete geometry, and statistics. Probabilistic tight frames are the probabilistic counterpart of tight frames. In the first part of the talk, I will introduce the class of tight frames and derive some of their properties. In the second part, I will talk about the probabilistic counterpart and indicate the connection and applications of this theory to statistics.

Mutual Exclusivity Analysis in Tumor Sequencing Studies

When: Thu, March 30, 2017 - 3:30pm
Where: MTH 1313
Speaker: Jianxin Shi, Ph.D. (Senior Investigator, Division of Cancer Epidemiology and Genetics, NCI, NIH)
Abstract: In this talk, I will first give an overview of modern cancer genetic and genomic studies, including genetic mapping of cancers, genetic risk prediction, and integrative tumor genomic analyses. I will then present our recent work on mutual exclusivity analysis in tumor sequencing studies. The central challenge in tumor sequencing studies is to identify driver genes and pathways, investigate their functional relationships and nominate drug targets. The efficiency of these analyses, particularly for infrequently mutated genes, is compromised when patients carry different combinations of driver mutations. Mutual exclusivity analysis helps address these challenges. To identify mutually exclusive gene sets (MEGS), we developed a powerful and flexible analytic framework based on a likelihood ratio test and a model selection procedure. Extensive simulations demonstrated that our method outperformed existing methods for both statistical power and the capability of identifying the exact MEGS, particularly for highly imbalanced MEGS. Our method can be used for de novo discovery, pathway-guided searches or for expanding established small MEGS. We applied our method to the whole exome sequencing data for fourteen cancer types from The Cancer Genome Atlas (TCGA). We identified multiple previously unreported non-pairwise MEGS in multiple cancer types. For acute myeloid leukemia, we identified a novel MEGS with five genes (FLT3, IDH2, NRAS, KIT and TP53) and a MEGS (NPM1, TP53 and RUNX1) whose mutation status was strongly associated with survival (P = 6.7×10^-4). For breast cancer, we identified a significant MEGS consisting of TP53 and four infrequently mutated genes (ARID1A, AKT1, MED23 and TBL1XR1), providing support for their role as cancer drivers.

"Thinking Out of the Sample": Repeated out of Sample Fusion in the Estimation of Small Tail Probabilities

When: Thu, April 13, 2017 - 3:30pm
Where: MTH 1313
Speaker: Ben Kedem, Professor (Statistics Program, Department of Mathematics, University of Maryland - College Park)
Abstract: Often, it is required to estimate the probability that a quantity such as mercury, lead, toxicity level, plutonium, temperature, rainfall, damage, wind speed, risk, etc., exceeds an unsafe high threshold. The probability in question is then very small. To estimate such a probability, we need information about large values of the quantity of interest. However, in many cases, the data only contain values far below the designated threshold, let alone exceedingly large values, which ostensibly renders the problem insolvable. It is shown that by repeated fusion of the data with externally generated random data, more information about small tail probabilities is obtained with the aid of certain new statistical functions. This provides short, yet reliable interval estimates based on moderately large samples. A comparison of the approach with the well-known Peaks over Threshold (POT) method from extreme value theory, using both artificial and real data, points to the merit of repeated out of sample fusion (ROSF).

Functional Regression Models for Gene-based Association Studies of Complex Traits

When: Thu, April 20, 2017 - 3:30pm
Where: MTH 1313
Speaker: Ruzong Fan, Professor (Department of Biostatistics, Georgetown University)
Abstract: By using functional data analysis techniques, fixed effect functional regression models are developed to test associations between complex traits and genetic variants, which can be rare variants, common variants, or a combination of the two, adjusting for covariates. We treat multiple genetic variants of an individual in a human population as a realization of an underlying stochastic process. The genome of an individual is viewed as a stochastic function which contains both genetic position and linkage disequilibrium (LD) information of the genetic markers. To overcome the curse of high dimensions of modern genetic data, functional regression models are developed to reduce the dimensionality. In the talk, I will show how to build test statistics for fixed effect functional regression models to test association between quantitative/dichotomous/survival traits and genetic variants. Results of extensive simulation analysis and real data analysis will be shown to demonstrate the performance of the proposed models and tests. A comparison with the popular existing sequence kernel association test (SKAT) and its optimal unified test (SKAT-O) will be made to facilitate an understanding of the proposed methods, and to answer whether fixed or mixed models should be used in association analysis of complex disorders.

Statistical Problems Related to the Stam Inequality

When: Thu, May 11, 2017 - 3:30pm
Where: MTH 1313
Speaker: Professor Abram Kagan (Statistics Program, Department of Mathematics, University of Maryland - College Park)
Abstract: If X, Y are independent random variables with the Fisher information (on a location parameter) I(X) and I(Y), the Stam inequality claims that 1/I(X+Y) ≥ 1/I(X) + 1/I(Y). The inequality is a corollary of a classical entropy power inequality. It will be shown that a small sample version of the Stam inequality is a property of the Pitman estimators and the Stam inequality is a property of the Pitman estimators in large samples. A generalization of the Stam inequality to the case of a general parameter will be presented. Some open problems will be discussed.
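
As a quick sanity check on the inequality, consider independent Gaussian X and Y. This is a standard worked case added here for illustration, not part of the talk. In its usual statement the inequality is non-strict, and the Gaussian case attains equality:

```latex
% Fisher information on a location parameter for a Gaussian variable
X \sim N(\mu_X, \sigma_X^2) \;\Rightarrow\; I(X) = \frac{1}{\sigma_X^2},
\qquad
Y \sim N(\mu_Y, \sigma_Y^2) \;\Rightarrow\; I(Y) = \frac{1}{\sigma_Y^2}.

% The sum of independent Gaussians is Gaussian with the variances added
X + Y \sim N(\mu_X + \mu_Y,\; \sigma_X^2 + \sigma_Y^2)
\;\Rightarrow\;
\frac{1}{I(X+Y)} = \sigma_X^2 + \sigma_Y^2 = \frac{1}{I(X)} + \frac{1}{I(Y)}.
```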