### Statistics Archives for Academic Year 2019

#### Inference on weak signals in presence of an additive noise

When: Thu, September 6, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Abram Kagan (Dept. of Math. (Statistics program)) - http://math.umd.edu/~amk

#### Semiparametric Transformation Probit Models with Current-Status Data

When: Thu, September 13, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Dr. Jing Qin (National Institutes of Health) -

#### Generalized Group Testing: Some Results and Open Problems

When: Thu, September 20, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Yaakov Malinovsky (Dept. of Math. and Stat., UMBC) -

#### Biostatistical Methods for Wearable and Implantable Technology (WIT)

When: Thu, September 27, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Prof. Ciprian Crainiceanu (Dept. of Biostatistics, Johns Hopkins University) -
Abstract: Wearable and Implantable Technology (WIT) is rapidly changing the Biostatistics data analytic landscape due to its reduced bias and measurement error as well as to the sheer size and complexity of the signals. In this talk I will review some of the most used and useful sensors in Health Sciences and the ever-expanding WIT analytic environment. I will describe the use of WIT sensors including accelerometers, heart monitors, glucose monitors and their combination with ecological momentary assessment (EMA). This rapidly expanding data eco-system is characterized by multivariate densely sampled time series with complex and highly non-stationary structures. I will introduce an array of scientific problems that can be answered using WIT and I will describe methods designed to analyze WIT data from the micro-scale (sub-second level) to the macro-scale (minute, hour, or day level).

#### An efficient procedure to combine biomarkers with limits of detection for risk prediction

When: Thu, October 11, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Dr. Ruth Pfeiffer (National Cancer Institute, NIH) -
Abstract: Much research seeks biomarkers for diagnosing disease and understanding disease etiology.
As high-throughput technologies allow measuring multiple markers simultaneously, strategies for combining markers are needed, particularly if no single marker is highly discriminating. Statistical procedures to combine information from multiple markers need to account for correlations and for left and/or right censoring of the markers due to lower or upper limits of detection of the laboratory assays. We thus extend dimension reduction approaches, specifically likelihood-based sufficient dimension reduction, to regression or classification with censored predictors. Using an expectation maximization (EM) algorithm, we find linear combinations that contain all or most of the information contained in correlated markers for modeling and prediction of an outcome variable, while accounting for left and right censoring due to detection limits. We also allow for selection of important variables through penalization. We assess the performance of our methods extensively in simulations and apply them to data from a study conducted to assess associations of 47 inflammatory markers and lung cancer risk and build prediction models.

This is joint work with Diego Tomassi, Liliana Forzani and Efstathia Bura.

#### Accuracy of High-Dimensional Deep Learning Networks

When: Tue, October 16, 2018 - 3:30pm
Where: Kirwan Hall 1313 (notice change of time)
Speaker: Jason Klusowski (Dept. of Statistics, Rutgers University) -
Abstract: It has been experimentally observed in recent years that
multi-layer artificial neural networks have a surprising ability to
generalize, even when trained with far more parameters than
observations. Is there a theoretical basis for this? The best available
bounds on their metric entropy and associated complexity measures are
essentially linear in the number of parameters, which is inadequate to
explain this phenomenon. Here we examine the statistical risk (mean
squared predictive error) of multi-layer networks with $\ell^1$-type
controls on their parameters and with ramp activation functions (also
called lower-rectified linear units). In this setting, the risk is shown
to be upper bounded by $[(L^3 \log d)/n]^{1/2}$, where $d$ is the input
dimension to each layer, $L$ is the number of layers, and $n$ is the
sample size. In this way, the input dimension can be much larger than
the sample size and the estimator can still be accurate, provided the
target function has such $\ell^1$ controls and that the sample size is
at least moderately large compared to $L^3\log d$. The heart of the
analysis is the development of a sampling strategy that demonstrates the
accuracy of a sparse covering of deep ramp networks. Lower bounds show
that the identified risk is close to being optimal. This is joint work
with Andrew R. Barron.
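
To get a feel for the rate quoted in the abstract, the sketch below evaluates $[(L^3 \log d)/n]^{1/2}$ for a few illustrative settings. It assumes the natural logarithm and ignores any constant factor in the bound; the function name `risk_bound` and the chosen values of $L$, $d$ and $n$ are hypothetical and not taken from the talk.

```python
import math


def risk_bound(num_layers: int, input_dim: int, sample_size: int) -> float:
    """Evaluate [(L^3 * log d) / n]^(1/2), with constants ignored (assumption)."""
    return math.sqrt(num_layers**3 * math.log(input_dim) / sample_size)


# Illustrative settings only: the input dimension d can far exceed n,
# yet the rate stays small as long as n is large relative to L^3 * log d.
for L, d, n in [(2, 10**6, 10**4), (3, 10**6, 10**4), (3, 10**6, 10**6)]:
    print(f"L={L}, d={d}, n={n}: rate = {risk_bound(L, d, n):.4f}")
```

The rate grows only logarithmically in $d$ but polynomially in $L$, which is why depth matters more than width in this expression.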

#### On the construction of unbiased estimators for the group testing problem

When: Thu, October 25, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Dr. Gregory Haber (National Cancer Institute, NIH) -
Abstract: While the use of group testing as a tool for estimation has been on the rise in recent decades, classical problems such as the large bias of the maximum likelihood estimator continue to hinder the implementation of such methods. This has led to the development of many estimators minimizing bias and, most recently, an unbiased estimator based on sequential binomial sampling. Previous research, however, has focused heavily on the simple case where no misclassification is assumed and only one trait is to be tested. In this talk, we consider the problem of unbiased estimation in these broader areas, giving constructions of such estimators for several cases. We show that, outside of the standard case addressed previously in the literature, it is impossible to find any proper unbiased estimator, that is, an estimator giving only values in the parameter space. This is shown to hold generally under any binomial or multinomial sampling plans.
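
The bias of the maximum likelihood estimator mentioned above can be illustrated with a small exact calculation. This is a minimal sketch for the simple case only: a fixed number of equal-sized pools, one trait, and no misclassification. Under those assumptions the number of positive pools is binomial with success probability $1-(1-p)^k$, and the MLE is $\hat p = 1-(1-T/n)^{1/k}$. The function name and the parameter values are hypothetical, and the sketch does not implement the unbiased estimators discussed in the talk.

```python
from math import comb


def mle_bias(p: float, pool_size: int, num_pools: int) -> float:
    """Exact bias E[p_hat] - p of the group-testing MLE under binomial sampling."""
    theta = 1.0 - (1.0 - p) ** pool_size  # probability that a pool tests positive
    expected = 0.0
    for t in range(num_pools + 1):
        pmf = comb(num_pools, t) * theta**t * (1.0 - theta) ** (num_pools - t)
        p_hat = 1.0 - (1.0 - t / num_pools) ** (1.0 / pool_size)
        expected += pmf * p_hat
    return expected - p


# Illustrative values: prevalence 5%, 20 pools, growing pool size.
for k in (1, 5, 10, 20):
    print(f"pool size {k:2d}: bias = {mle_bias(0.05, k, 20):+.5f}")
```

With pool size 1 the estimator reduces to a sample proportion and is unbiased; for larger pools the map from the positive-pool fraction to $\hat p$ is convex, so the bias is positive.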

#### Sample-Size Re-estimation in Two-Stage Bioequivalence Trials

When: Thu, November 1, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Eric Slud (Dept. of Mathematics (Statistics Program)) -
Abstract: Bioequivalence studies are an essential part of the evaluation of generic drugs. The most common in-vivo bioequivalence (BE) study design is the two-period two-treatment open label crossover design, with a metric of bioavailability such as the log of an approximate integral of the measured concentration of the drug in the blood (log AUC). The observation of interest for each subject is the difference between the measurement in the first and second period of the crossover. When this quantity is assumed approximately normally distributed, the sample size for BE studies using the "Two One-sided Tests" approach is a function of the assumed mean difference, the assumed variance, equivalence margins, type I error rate, and desired power. Since BE studies are often rather small, there is a serious possibility that they are under-powered when the assumed variance turns out to be too small, and it would be preferable to have a blinded study design based on re-estimating the sample-size using only a preliminary estimate of variance calculated without unmasking the treatment labels. However, up to this time there has not been such a two-stage study design guaranteed to maintain experimentwise type I error rate in small samples, apart from inefficient procedures related to Stein's 1945 two-stage procedure.
In the research described in this talk, expanding on a portion of Meiyu Shen's 2015 UMD thesis, a two-stage sample-size re-estimation design will be presented. The idea, for second-stage sample size expressed as a function of first-stage estimated sample variance, is to calculate the second-stage rejection threshold in such a way that the experimentwise type I error probability maximized over the (unknown) true variance is equal to the prescribed alpha (usually 0.05). This idea is shown to be computationally and practically feasible in the setting of BE studies.

This work is joint with Meiyu Shen and Estelle Russek-Cohen of FDA.

#### Calibrating Dependence between Random Elements

When: Thu, November 8, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Abram Kagan (UMCP) -
Abstract: Properties of a measure of dependence will be presented that, in my opinion, should be satisfied by any natural measure of dependence.

The main goal is construction of a calibrated scale of dependence between random elements $X$ and $Y$ that is based on the dimension of the range of the projector of the subspace $L^{2}(X)$ of $L^{2}(X, Y)$ into $L^{2}(Y)$.

For independent X, Y the range is one-dimensional and this property is characteristic of independence.

#### On Reproducibility of Research Findings, Boundary of Meaning and Type S Errors

When: Thu, November 29, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Prof. Ron Kenett (KPA Ltd and the Samuel Neaman Institute, Technion, Israel) -
Abstract: The question of reproducibility of research outcomes is discussed now in the open press with a potential negative
impact on science as a whole. In dealing with this question, from a statistical viewpoint, several methodological
advances have been proposed (like FDR) and several clarification attempts have been published (like the ASA
statement on the p value). These attempts seem to only partially address the rising concerns of the public and
research funding agencies.
Kenett and Shmueli, in "Clarifying the terminology that describes scientific reproducibility" (Nature Methods, 12(8), p. 699, 2015),
review the terminology used in this debate and refer to generalizability as a dimension that can clarify
which research claims should be scrutinized as reproducible. Generalizability is one of the eight dimensions
of the information quality (InfoQ) framework presented in Kenett and Shmueli, On information quality: The
Potential of Data and Analytics to Generate Knowledge, John Wiley and Sons, 2016.
In this talk, we expand on the idea of generalizability of research findings by referring to Type S errors proposed in
Gelman and Carlin (2014) [Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors,
Perspectives on Psychological Science, Vol. 9(6), pp. 641–651]. The talk will first discuss methods for setting up a
boundary of meaning used in generalizing research findings. It will then show how Type S errors and directional
FDR methods fit with this generalizability approach. An example from research in localized colon cancer
diagnostics will be used to demonstrate the approach.
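
As a rough illustration of the Type S (sign) error idea cited above, the sketch below computes the probability that a statistically significant estimate has the wrong sign. It assumes a normally distributed estimate with known standard error and a two-sided test, which is a simplification; the function name `type_s_rate` and the input values are hypothetical and are not taken from the talk.

```python
from statistics import NormalDist


def type_s_rate(true_effect: float, std_error: float, alpha: float = 0.05):
    """Return (Type S error rate, power) under a normal approximation (assumption)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ratio = abs(true_effect) / std_error
    # Significant with the correct sign vs. significant with the wrong sign.
    p_correct_sign = 1 - std_normal.cdf(z_crit - ratio)
    p_wrong_sign = std_normal.cdf(-z_crit - ratio)
    power = p_correct_sign + p_wrong_sign
    return p_wrong_sign / power, power


# Illustrative inputs: a small true effect measured with a large standard error.
rate, power = type_s_rate(true_effect=0.1, std_error=3.28)
print(f"power = {power:.3f}, Type S rate = {rate:.3f}")
```

When the true effect is small relative to the standard error, power is close to the significance level and nearly half of the significant results point in the wrong direction.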

#### Mathematical Aspects of Machine Learning

When: Thu, December 6, 2018 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Wojtek Czaja (Dept. of Mathematics, UMCP) -
Abstract: In recent years machine learning, with its focus on the predictive and
generative abilities of learning algorithms, has become a focus of attention for
researchers across many fields, including mathematics. In this talk we will
present some of the aspects of mathematical contributions to machine
learning, devoting our attention to approximation theory, optimization,
and convolutional networks.

#### Marginal-ancillary parametric family of distributions

When: Thu, February 7, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Abram Kagan (Dept. of Mathematics (Statistics program)) - http://math.umd.edu/~amk
Abstract: A parametric family of distributions of a pair (X, Y) of random elements is called marginal-ancillary if the marginal distributions of X and Y are parameter free. Thus all the information on the parameter is contained in the dependence between X and Y. A lower bound for the Fisher information on the parameter is obtained in the case when the parameter is the correlation coefficient.
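
A standard example of a marginal-ancillary family, added here for illustration and not taken from the talk, is the bivariate normal distribution with zero means, unit variances and unknown correlation $\rho$. Both marginals are $N(0,1)$ whatever the value of $\rho$, and the Fisher information on $\rho$ in a single observation $(X, Y)$ is

$$
I(\rho) = \frac{1+\rho^{2}}{(1-\rho^{2})^{2}}, \qquad -1 < \rho < 1 .
$$

In this example $I(0) = 1$ and the information grows without bound as $|\rho| \to 1$.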

#### A new method for the analysis of categorical data with repeated measurements

When: Thu, February 14, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Dr. Tinghui Yu (MedImmune) -
Abstract: The quality of an assay/survey with categorical output is usually characterized by its accuracy (bias) and precision (variation). To assess these parameters, one needs to perform a study testing a set of properly selected samples repeatedly under different conditions. A generalized linear mixed model (GLMM) can be fitted to the test results, providing control over the correlation structure within and between each design factor of concern. However, interpretation of the resulting GLMM, especially for the random effects, is not straightforward because the random effects are usually defined through a non-linear transformation (i.e., a link function). We introduce a new statistic to measure the variation in categorical data generated with multiple levels of control factors. The new method is based on the average agreement between the observed outcomes and hence offers intuitive probabilistic interpretations. It can be shown that this new statistic is closely related to the GLMM. We will also demonstrate the new method through simulations and examples with applications to clinical diagnostics.

#### Understanding Generative Adversarial Networks (GANs) in the Gaussian setting

When: Thu, February 21, 2019 - 3:45pm
Where: Kirwan Hall 1313
Speaker: Prof. Soheil Feizi (Dept. of Computer Sci., Univ. of Maryland) -
Abstract: Generative Adversarial Networks (GANs) have become a popular method to learn a probability model from data. In this talk, I will provide an understanding of some of the basic issues surrounding GANs, including their formulation, generalization and stability, on a simple benchmark where the data has a high-dimensional Gaussian distribution. Even in this simple benchmark, the GAN problem has not been well understood, as we observe that existing state-of-the-art GAN architectures may fail to learn a proper generative distribution owing to (1) stability issues (i.e., convergence to bad local solutions or not converging at all), (2) approximation issues (i.e., having improper global GAN optimizers caused by inappropriate GAN loss functions), and (3) generalizability issues (i.e., requiring a large number of samples for training). In this setup, we propose a GAN architecture which recovers the maximum-likelihood solution and demonstrates fast generalization. Moreover, we analyze global stability of different computational approaches for the proposed GAN and highlight their pros and cons. Finally, we outline an extension of our model-based approach to design GANs in more complex setups than the considered Gaussian benchmark.

#### Uncover genotype-phenotype relationship through multiple-outcome multivariate regression

When: Thu, February 28, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Dr. Yong Chen (Dept. of Biostatistics, Epidemiology and Informatics, Univ. of Pennsylvania) -
Abstract: Pleiotropic and polygenic effects, where the former means a genetic locus affects multiple
phenotypes, and the latter refers to many loci affecting one trait, offer significant insights in
understanding the complex genotype-phenotype relationship. The increasing availability of
medical and genomic data provides the opportunity to uncover such relationships through joint
modeling of multiple phenotypes and genetic variants simultaneously. In this talk, I will share a
few recently developed statistical models for detecting pleiotropic and polygenic effects. I will
discuss some key techniques and considerations on modeling large-scale genetic information. I
will also share our analyses on a large-scale biobank linked electronic health record (EHR) data,
the Penn Medicine Biobank (PMBB), for studying complex genetic architectures and their
impacts on multiple phenotypes.

#### A Case Study in Comparing Bayes Estimated Fixed Effects vs Frequentist Random Effects

When: Thu, March 7, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Prof. Eric Slud (Dept. of Mathematics (Statistics Program)) -
Abstract: Using data from the Current Population Survey, we consider model-based estimates of population subgroups in different employment categories in two successive months (June and July 2017), cross-classified by education, age, and state. These cross-classified population counts are often rather small, too small to be well estimated by design-based survey methods, but seem amenable to 'small area estimation' models in which state- and other subgroup-effects are viewed as random. The random effects would be viewed differently in a Bayesian analysis and a frequentist one, although each of these different data analysis approaches provides useful information to the other. The talk will discuss computation, display and interpretation of model results, with particular reference to packages and computational tools in R. The theme of the data analysis is the contrast (by likelihood and prediction metrics) between fixed and random-effect models for area-level intercept effects.

#### Fisher Information, Mean Functions and Matrix Inequalities

When: Thu, March 14, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Paul J. Smith (STAT Program) -

#### The lag-lead debate on global temperature and carbon dioxide: a statistical look through curve registration

When: Thu, March 28, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Prof. Debasis Sengupta (Indian Statistical Institute) -
Abstract: The close connection between global temperature variation and atmospheric carbon dioxide concentration has been central to the issue of climate change. The lag/lead between sets of longitudinal data on the two variables has implications for the causality of that connection. We consider this problem as one of curve registration. Most of the available solutions for this problem have been designed for the growth data application, where the number of observations is small and the number of replicates is large. We argue that a different emphasis is needed for the paleoclimatic application. We provide a new method, which is able to pool local information without smoothing and to match sharp landmarks without manual identification. We prove the consistency of the proposed method under fairly general conditions. Simulation results show superiority of the performance of the proposed method over two existing methods. Application of the proposed method to Antarctic ice core data leads to some interesting conclusions.

#### Data Privacy for a $\rho$-Recoverable Function

When: Thu, April 18, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Prof. Prakash Narayan (Dept. of Comp. and Electrical Engineering, Univ. of Maryland) -
Abstract: This talk is based on joint work with Ph.D. student Ajaykrishnan Nageswaran.
A user's data is represented by a finite-valued random variable.
Given a function of the data, a querier is required to recover,
with at least a prescribed probability, the value of the function
based on a query response provided by the user. The user devises
the query response, subject to the recoverability requirement,
so as to maximize privacy of the data from the querier.
Privacy is measured by the probability of error incurred
by the querier in estimating the data from the query response.
We analyze single and multiple independent query responses,
with each response satisfying the recoverability requirement,
that provide maximum privacy to the user. Achievability schemes
with explicit randomization mechanisms for query responses are given
and their privacy compared with converse upper bounds.
More stringent forms of privacy, viz. predicate privacy and
list privacy will also be mentioned.

#### An Overview of Statistical Machine Learning Techniques with Applications

When: Tue, April 30, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Dr. Amita Pal (Indian Statistical Institute) -
Abstract: Statistical Machine Learning involves an algorithmic approach, derived from statistical models and implementable on computers, for solving certain problems that arise in the domain of Artificial Intelligence. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions. Depending on whether the training data is labeled or unlabeled, a variety of supervised or unsupervised Statistical Machine Learning methods are available. An overview of the most widely used ones will be provided in this talk, and applications to the problems of automatic speaker recognition (ASR) and content-based image retrieval (CBIR) will be briefly described.

#### Event-Specific Win Ratios and Testing with Terminal and Non-Terminal Events

When: Thu, May 2, 2019 - 3:30pm
Where: Kirwan Hall 1313
Speaker: Dr. Song Yang (National Heart, Lung, and Blood Institute, NIH) -
Abstract: In clinical trials the primary outcome is often a composite one, defined as time to the first of two or more types of clinical events,
such as cardiovascular death, a terminal event, and heart failure hospitalization, a non-terminal event. Thus if a patient experiences both types of events,
then the terminal event after a non-terminal event does not contribute to the primary outcome, even though the terminal event is more important than the
non-terminal event. If there is a substantial number of patients who experience multiple events, the power of the test for treatment effect may be reduced due
to omission of some of the available data. In the win ratio approach, priorities are given to the clinically more important events, and potentially all available data are used. However, the win ratio approach may have low power in detecting a treatment effect if the effect is predominantly on the non-terminal events. We propose
event-specific win ratios obtained separately on the terminal and non-terminal events. These ratios can then be used to form global tests such as a linear combination
test, the maximum test, or a Chi-square test. In simulations these tests often improve the power of the original win ratio test. Furthermore, when
the terminal and non-terminal events experience differential treatment effects, the new tests often improve the power of the log-rank test for the
composite outcome. Thus whether the treatment effect is primarily on the terminal events or the non-terminal events, the new tests based on the event-specific win ratios can be useful alternatives for testing treatment effect in clinical trials with time-to-event outcomes when different types of events are present.
We illustrate the new tests with the primary outcome in the trial Aldosterone Antagonist Therapy for Adults With
Heart Failure and Preserved Systolic Function (TOPCAT), where the new tests all reject the null hypothesis of no treatment effect while the composite outcome approach used in TOPCAT did not.
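
For readers unfamiliar with the basic win ratio that the talk builds on, the sketch below computes an unmatched win ratio with a terminal event prioritized over a non-terminal one. It is a deliberately simplified illustration: every patient is assumed to be followed over the same window with no censoring, so each event time is either observed or absent (`None`). The data layout, function names and toy records are hypothetical, and the sketch does not implement the event-specific win ratios or the global tests proposed in the talk.

```python
from typing import List, Optional, Tuple

# (death_time, hospitalization_time); None means the event did not occur.
Patient = Tuple[Optional[float], Optional[float]]


def compare_event(t_trt: Optional[float], t_ctl: Optional[float]) -> int:
    """+1 if the treated patient fares better on this event, -1 if worse, 0 if tied."""
    if t_trt is None and t_ctl is None:
        return 0
    if t_trt is None:
        return 1
    if t_ctl is None:
        return -1
    return (t_trt > t_ctl) - (t_trt < t_ctl)  # a later event is better


def win_ratio(treated: List[Patient], control: List[Patient]) -> float:
    """Compare all treated-control pairs, terminal event first, then non-terminal."""
    wins = losses = 0
    for trt in treated:
        for ctl in control:
            result = compare_event(trt[0], ctl[0])
            if result == 0:  # tie on the terminal event: fall back to non-terminal
                result = compare_event(trt[1], ctl[1])
            if result > 0:
                wins += 1
            elif result < 0:
                losses += 1
    return wins / losses if losses else float("inf")


# Toy records, invented for illustration only.
treated = [(None, None), (None, 5.0), (8.0, 3.0)]
control = [(None, 2.0), (6.0, 1.0), (4.0, None)]
print(win_ratio(treated, control))
```

A ratio above 1 favors the treatment arm. Handling censored follow-up requires additional rules for pairs whose ordering cannot be determined, which this sketch omits.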