Abstract: Advancing medical informatics tools and high-throughput biological experimentation are making large-scale data routinely accessible to researchers, administrators, and policy-makers. This "data deluge" poses new challenges and critical barriers for quantitative researchers, as existing statistical methods and software grind to a halt when analyzing these large-scale datasets, and it calls for methods that can readily fit large-scale data. In this talk I will present a new sparse Cox regression method for high-dimensional, massive-sample-size survival data. Our method is an L0-based iteratively reweighted L2-penalized Cox regression, which inherits appealing properties of both L0- and L2-penalized Cox regression while overcoming their limitations. We establish that it has an oracle property for selection and estimation and a grouping property for highly correlated covariates. We develop an efficient implementation for high-dimensional, massive-sample-size survival data, which exhibits up to a 20-fold speedup over a competing method in our numerical studies. We also adapt our method to high-dimensional, small-sample-size data. The performance of our method is illustrated using simulations and real data examples.
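The abstract does not give implementation details, but the general idea of an "L0-based iteratively reweighted L2" penalty can be sketched on an ordinary linear model. The following is a minimal illustrative sketch, not the authors' Cox implementation; the function name and the specific reweighting rule (penalty weight 1/(beta_j^2 + eps), which makes the effective penalty approach an L0-type penalty) are my assumptions for illustration.

```python
import numpy as np

def reweighted_ridge(X, y, lam=1.0, n_iter=50, eps=1e-8):
    """Sketch of an iteratively reweighted L2 (adaptive ridge) estimator.

    Each iteration solves a ridge problem whose penalty weights come from
    the previous estimate, lam / (beta_j**2 + eps); coefficients near zero
    receive huge penalties and are driven to (effectively) exact zeros,
    mimicking an L0 penalty, while large coefficients are barely shrunk.
    """
    n, p = X.shape
    # Plain ridge solution as the starting point.
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    for _ in range(n_iter):
        W = np.diag(lam / (beta**2 + eps))  # penalty weights from current estimate
        beta = np.linalg.solve(X.T @ X + W, X.T @ y)
    return beta
```

On a sparse simulated signal, the zero coefficients collapse to (numerical) zero while the strong ones stay essentially unshrunk.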
Abstract: Non-parametric probability density function estimation is an important statistical problem in movement ecology, where researchers are interested in quantifying animal "space use" and delineating "home range" areas. The most common statistical approach to this problem is kernel density estimation, which traditionally assumes independently sampled data. Unfortunately, animal tracking data are invariably correlated in time (non-independent), as the continuity of movement dictates that each animal location is in close proximity to the next. Moreover, as GPS and battery technology improve, researchers are increasing their sampling rates commensurately, which increases the autocorrelation between sequential locations and further violates the assumption of independence. Here I describe the recent development of kernel density methods derived to accommodate autocorrelated data, which are currently being applied in the field of movement ecology.
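As background for the talk, here is a minimal sketch of the classical kernel density estimator the abstract refers to. It is illustrative only (the function name and bandwidth choice are my assumptions): the independence assumption enters through the rule-of-thumb bandwidth, which is calibrated to the nominal sample size and therefore under-smooths when sequential locations are autocorrelated.

```python
import numpy as np

def gaussian_kde_1d(x, grid, h=None):
    """Classical 1-D Gaussian kernel density estimate.

    Assumes the sample x is i.i.d.; Silverman's rule-of-thumb bandwidth
    treats len(x) as the effective sample size, which is exactly the
    assumption autocorrelated tracking data violates.
    """
    x = np.asarray(x, float)
    if h is None:
        # Silverman's rule of thumb (derived under independence).
        h = 1.06 * x.std(ddof=1) * len(x) ** (-1 / 5)
    # Evaluate the average of Gaussian kernels centered at each datum.
    u = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
```

The resulting estimate is a proper density (non-negative, integrating to one); the autocorrelation-aware methods of the talk modify the bandwidth selection, not this basic kernel-averaging form.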
Abstract: For use in connection with the general and complete observations that would be known from a full census, Kiaer (1895, 1897) presents a purposive "Representative Method" for sampling from a finite population to provide "…more penetrating, more detailed, and more specialized surveys…" Many credit this method with laying seeds for current sampling methods used in producing official social and economic statistics. At a time when just about all official statistics were produced by censuses, Kiaer had much opposition, especially from the statistician von Mayr, who said (in translation), "…no calculations when observations can be made."
Neyman (1934) brought probability to this Representative Method using stratified random sampling. Probability makes it possible to express uncertainty about the results from the Representative Method and to say how good the results are. Neyman presents details for the well-known and widely used optimal allocation of the fixed sample size among the various strata to minimize sampling error. When sample sizes are rounded to integers from Neyman's allocation, minimum sampling error is not guaranteed. Wright (2012) improves Neyman's result with a simple derivation obtaining exact results that always yield integer sample size allocations while minimizing sampling error. Wright (2014, 2016, 2017) obtains exact integer optimal allocation results when there are mixed constraints on sample sizes for each stratum or when there are desired precision constraints. With exact optimal allocation, we demonstrate a decrease in needed sample size for the same precision using 2007 Economic Census data in the sample design for part of the subsequent Service Annual Survey.
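The exact integer allocation problem has a clean structure worth sketching: minimizing the variance term sum over strata of (N_h S_h)^2 / n_h subject to an integer budget is a separable convex problem, so assigning units one at a time to the stratum with the largest marginal variance reduction is exactly optimal. The sketch below is my own illustrative greedy implementation of that optimization (function name and interface are assumptions), not Wright's published algorithm verbatim.

```python
import heapq
import numpy as np

def neyman_integer_allocation(N, S, n):
    """Exact integer minimizer of sum_h (N_h*S_h)**2 / n_h
    subject to sum_h n_h = n and n_h >= 1.

    Adding a unit to stratum h at current size m reduces the objective by
    (N_h*S_h)**2 / (m*(m+1)); these gains shrink as m grows, so a greedy
    priority rule on N_h*S_h / sqrt(m*(m+1)) is exactly optimal.
    """
    H = len(N)
    n_h = [1] * H  # every stratum gets at least one unit
    # Max-heap (via negated priorities) of the next unit's priority per stratum.
    heap = [(-N[h] * S[h] / np.sqrt(1 * 2), h) for h in range(H)]
    heapq.heapify(heap)
    for _ in range(n - H):
        _, h = heapq.heappop(heap)
        n_h[h] += 1
        m = n_h[h]
        heapq.heappush(heap, (-N[h] * S[h] / np.sqrt(m * (m + 1)), h))
    return n_h
```

Unlike rounding the real-valued Neyman allocation n_h proportional to N_h*S_h, this always returns integers that sum exactly to n and attain the minimum variance.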
We conclude by calling on the phrase "…no calculation when observation can be made" to muse about current world-wide considerations to make greater use of data from additional sources (e.g., administrative records, commercial data, big data…) to produce official statistics.
Abstract: In the past 130 years many dependence measures have been introduced. One of the most recent, distance correlation, was introduced by the speaker in 2005.
Is there a universally acceptable system of axioms that helps to select the correlation for the 21st century? In this talk we propose four simple axioms for dependence measures and then discuss whether classical and new measures of dependence satisfy them. A general framework connects Energy, Matter, and Mind. This is the starting point of a distance-based topological data analysis.
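For concreteness, the sample distance correlation can be sketched directly from its definition: double-center the pairwise distance matrices of x and y, then normalize their inner product. The function name below is my own; this is a minimal illustrative implementation of the standard V-statistic form.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of two samples of equal length.

    Unlike Pearson's r, it is zero (in the population) only under
    independence, so it detects nonlinear dependence such as y = x**2.
    """
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)

    def centered(z):
        # Pairwise Euclidean distance matrix, then double centering:
        # subtract row means and column means, add back the grand mean.
        d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()

    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()                      # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0
```

For a linear relationship the statistic equals 1, and for a symmetric quadratic relationship (where Pearson correlation is essentially zero) it remains clearly positive.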
4176 Campus Drive - William E. Kirwan Hall
College Park, MD 20742-4015