Mathematical Data Science Archives for Fall 2023 to Spring 2024
Effective Algorithms for Differentially Private Synthetic Data Generation
When: Mon, September 11, 2023 - 2:30pm
Where: https://go.umd.edu/MTHDataScience
Speaker: Yizhe Zhu (University of California, Irvine) - https://sites.google.com/uci.edu/yizhezhu
Abstract: Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. We first present a highly effective algorithmic approach for generating differentially private synthetic data in a bounded metric space with near-optimal utility guarantees under the Wasserstein distance. When the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. We then propose an algorithm to generate low-dimensional private synthetic data from a high-dimensional dataset efficiently. A key step in our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound. Based on joint work with Yiyun He (UC Irvine), Thomas Strohmer (UC Davis), and Roman Vershynin (UC Irvine).
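The private PCA step can be illustrated with a standard Gaussian-mechanism construction (in the spirit of "Analyze Gauss"). This is a generic sketch for intuition only, not necessarily the near-optimal procedure developed by the speakers; the function and parameter names are mine.

```python
import numpy as np

def private_pca(X, k, eps, delta, rng=None):
    """Differentially private PCA via the Gaussian mechanism (generic sketch).

    Assumes each row of X has Euclidean norm at most 1, so replacing one row
    changes the Gram matrix X^T X by at most 1 in Frobenius norm.
    """
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    sigma = np.sqrt(2 * np.log(1.25 / delta)) / eps   # Gaussian-mechanism noise scale
    E = rng.normal(scale=sigma, size=(d, d))
    E = np.triu(E) + np.triu(E, 1).T                  # symmetric noise matrix
    noisy_gram = X.T @ X + E
    w, V = np.linalg.eigh(noisy_gram)
    return V[:, np.argsort(w)[::-1][:k]]              # top-k private principal directions
```

Projecting the data onto these directions gives a low-dimensional representation from which synthetic data could then be generated in the reduced space.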
On Fine-Tuning Large Language Models with Less Labeling Cost
When: Mon, September 18, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Tuo Zhao (Georgia Tech) - https://www2.isye.gatech.edu/~tzhao80/
Abstract: Labeled data is critical to the success of deep learning across various applications, including natural language processing, computer vision, and computational biology. While recent advances like pre-training have reduced the need for labeled data in these domains, increasing the availability of labeled data remains the most effective way to improve model performance. However, human labeling of data continues to be expensive, even when leveraging cost-effective crowd-sourced labeling services. Further, in many domains, labeling requires specialized expertise, which adds to the difficulty of acquiring labeled data.
In this talk, we demonstrate how to utilize weak supervision together with efficient computational algorithms to reduce data labeling costs. Specifically, we investigate various forms of weak supervision, including external knowledge bases, auxiliary computational tools, and heuristic rule-based labeling. We showcase the application of weak supervision to both supervised learning and reinforcement learning across various tasks, including natural language understanding, molecular dynamics simulation, and code generation.
Light speed computation of exact solutions to generic and to degenerate assignment problems
When: Mon, September 25, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Patrice Koehl (University of California, Davis) - https://www.cs.ucdavis.edu/~koehl/index.html
Abstract: The linear assignment problem is a fundamental problem in combinatorial optimization with a wide range of applications, from operational research to data sciences. It consists of assigning "agents" to "tasks" on a one-to-one basis, while minimizing the total cost associated with the assignment. While many exact algorithms have been developed to identify such an optimal assignment, most of these methods are computationally prohibitive for large problems. In this talk, I will describe a novel approach to solving the assignment problem using techniques adapted from statistical physics. In particular, I will derive a strongly concave effective free energy function that captures the constraints of the assignment problem at a finite temperature. This free energy decreases monotonically as a function of beta, the inverse temperature, to the optimal assignment cost, providing a robust framework for temperature annealing. For large enough values of beta, the exact solution to the generic assignment problem can be recovered by simply rounding the elements of the computed assignment matrix to the nearest integer. I will also describe a provably convergent method to handle degenerate assignment problems. Finally, I will describe computer implementations of this framework that are optimized for parallel architectures, one based on CPUs, the other on GPUs. These implementations enable solving large assignment problems (on the order of a few tens of thousands of agents) in wall-clock times on the order of minutes.
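As a rough illustration of the temperature-annealing idea (not Koehl's free-energy formulation), one can Sinkhorn-scale exp(-beta*C) toward a doubly stochastic matrix, increase beta, and round; for generic costs and large enough beta the rounded matrix agrees with the exact optimum returned by SciPy's Hungarian-type solver. All names below are mine.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import linear_sum_assignment

def annealed_assignment(C, beta0=1.0, beta_max=1000.0, growth=2.0, inner=200):
    """Entropy-regularized assignment with temperature annealing (illustrative sketch)."""
    n = C.shape[0]
    f = np.zeros(n)
    g = np.zeros(n)
    beta = beta0
    while True:
        for _ in range(inner):                        # log-domain Sinkhorn scaling
            f = -logsumexp(g[None, :] - beta * C, axis=1)
            g = -logsumexp(f[:, None] - beta * C, axis=0)
        if beta >= beta_max:
            break
        beta = min(growth * beta, beta_max)
    P = np.exp(f[:, None] + g[None, :] - beta * C)    # near-permutation matrix
    return P.argmax(axis=1)                           # round to the nearest assignment

rng = np.random.default_rng(0)
n = 30
C = rng.random((n, n))
col = annealed_assignment(C)
row_exact, col_exact = linear_sum_assignment(C)
# the two total costs coincide for generic costs once beta is large enough
print(C[np.arange(n), col].sum(), C[row_exact, col_exact].sum())
```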
Robust learning: a tour through variational methods, PDE, and geometry
When: Mon, October 2, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Ryan Murray (North Carolina State University) - https://rwmurray.wordpress.ncsu.edu/
Abstract: A major consideration in many learning algorithms is how to appropriately ensure good generalization and robustness. There are many methods, both classical and contemporary, and ranging from statistical to computational, which address this issue. This talk will give a tour of different mathematical approaches to the problem of ensuring robustness in statistical learning problems, with a special focus on non-parametric settings. Special attention will be given to connections that these methods have with classical mathematical tools, such as partial differential equations, geometry, and variational methods. Specifically, I will discuss 1) Hamilton-Jacobi equations satisfied by classical non-parametric generalizations of medians, which have deep connections with convex geometry and control theory, and 2) Geometric and analytical results for adversarially robust classification methods. Time permitting, new computational methods based upon these analytical approaches will also be discussed.
Manifold learning in the Wasserstein space
When: Mon, October 16, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Caroline Moosmueller (University of North Carolina at Chapel Hill) - https://math.unc.edu/faculty-member/moosmueller-caroline/
Abstract: In this talk, I will introduce LOT Wassmap, a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space. The algorithm is motivated by the observation that many datasets are naturally interpreted as point clouds or probability measures rather than points in $\mathbb{R}^n$, and that finding low-dimensional descriptions of such datasets requires manifold learning algorithms in the Wasserstein space. Most available algorithms are based on computing the pairwise Wasserstein distance matrix, which can be computationally challenging for large datasets in high dimensions. The proposed algorithm leverages approximation schemes such as Sinkhorn distances and linearized optimal transport to speed up computations, and in particular, avoids computing a pairwise distance matrix. Experiments demonstrate that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
This is joint work with Alex Cloninger (UCSD), Keaton Hamm (UT Arlington) and Varun Khurana (UCSD).
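A bare-bones version of the linearized-optimal-transport embedding can be sketched as follows, using the POT library for exact OT plans. This is a simplified illustration of the general idea, not the LOT Wassmap implementation; the names are mine and POT is assumed to be installed.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def lot_features(point_clouds, reference):
    """Linearized OT: map each empirical measure to the tangent space at `reference`."""
    m = reference.shape[0]
    a = np.full(m, 1.0 / m)
    feats = []
    for Y in point_clouds:
        b = np.full(len(Y), 1.0 / len(Y))
        M = ot.dist(reference, Y)        # squared Euclidean cost matrix
        P = ot.emd(a, b, M)              # optimal coupling
        T = (P @ Y) / a[:, None]         # barycentric projection (approximate Monge map)
        feats.append((T - reference).ravel())
    return np.array(feats)

def pca_embed(F, k=2):
    """Classical linear embedding of the Euclidean LOT features."""
    F = F - F.mean(axis=0)
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :k] * s[:k]
```

Standard manifold learning (here plain PCA stands in for the embedding step) is then applied to these Euclidean features, so no pairwise Wasserstein distance matrix is ever formed.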
Extreme events: Efficient computation of statistics and the value of data
When: Mon, October 23, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Themis Sapsis (MIT) - https://sandlab.mit.edu/
Abstract: Analysis of physical and engineering systems is characterized by unique computational challenges associated with the high dimensionality of parameter spaces, the large cost of simulations or experiments, and the presence of uncertainty. For a wide range of these problems the goal is to quantify uncertainty and compute risk for critical events, optimize parameters or control strategies, and/or make decisions. Examples include risk quantification for extreme events in climate modeling using coarse-scale simulations, development of reduced-order models for prediction of extreme separation events around turbulent airfoils using limited observations, and targeted design of CFD experiments for real-time control of UUV systems operating within the turbulent wake of moving submarines. For this type of problem, big data or high-fidelity experiments may be available, but their value for the understanding and prediction of extremes is not clear. In this talk we introduce a new class of quantification criteria measuring the value of data that utilize a likelihood ratio which rigorously accounts for extreme rare events. This ratio acts essentially as a probabilistic sampling weight and guides the identification of data, experiments, or models that capture extreme events most effectively. We discuss optimality properties for these criteria and present their favorable behavior in the problems mentioned above.
Dr. Sapsis is Professor of Mechanical and Ocean Engineering at MIT. He is also affiliated with the Institute for Data, Systems and Society (IDSS) and the Center for Computational Science and Engineering (CSSE), both within the Schwarzman College of Computing. He received a Diploma in Naval Architecture and Marine Engineering from the Technical University of Athens, Greece, and a Ph.D. in Mechanical and Ocean Engineering from MIT. Before his faculty appointment at MIT, he served as a Research Scientist at the Courant Institute of Mathematical Sciences at New York University. He has also been visiting faculty at ETH Zurich. Prof. Sapsis's work lies at the interface of nonlinear dynamical systems, probabilistic modeling, and data-driven methods. A particular emphasis of his work is the formulation of mathematical methods for the prediction, statistical quantification, and optimization of complex engineering and physical systems with extreme transient features, such as turbulent fluid flows in engineering and geophysical settings, nonlinear waves, and extreme ship motions.
Natural Gradient Methods for Physics-Informed Neural Networks
When: Mon, October 30, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Marius Zeinhofer (Simula Research Laboratory) - https://www.simula.no/people/mariusz
Abstract: We discuss natural gradient methods as a promising choice for the training of physics-informed neural networks (PINN) and the deep Ritz method. As a main motivation we show that the update direction in function space resulting from the energy natural gradient corresponds to the Newton direction modulo an orthogonal projection onto the model’s tangent space. Empirical results demonstrate that natural gradient optimization is able to produce highly accurate solutions in the PINN approach with errors several orders of magnitude smaller than what is obtained when training PINNs with standard optimizers like gradient descent, Adam or BFGS, even when those are allowed significantly more computation time.
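Schematically, and with notation of my own choosing, an (energy) natural gradient update replaces the parameter-space gradient by a function-space-aware direction:

```latex
\theta_{k+1} \;=\; \theta_k \;-\; \eta_k\, G(\theta_k)^{+}\, \nabla_\theta L(\theta_k),
\qquad
G_{ij}(\theta) \;=\; \big\langle \partial_{\theta_i} u_\theta,\; \partial_{\theta_j} u_\theta \big\rangle,
```

where the inner product is the one induced by the loss (e.g., the energy inner product for the deep Ritz method) and $G^{+}$ denotes the Moore–Penrose pseudoinverse. The resulting update direction in function space is, as stated in the abstract, the Newton direction up to an orthogonal projection onto the model's tangent space.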
Robustness in deep learning: where are we?
When: Mon, November 6, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Ju Sun (University of Minnesota, Twin Cities) - https://sunju.org/
Abstract: Deep learning (DL) models are not robust: adversarially constructed and irrelevant natural perturbations can break them abruptly. Despite intensive research in the past few years, surprisingly, reliable tools for robustness evaluation are still lacking in the first place. I'll describe our recent efforts toward building such a reliable evaluation package. This new computational capacity leads to more concerns than hopes: we find that current empirical robustness evaluation is problematic, and adversarial training, a predominant framework toward achieving robustness, is fundamentally flawed. On the other hand, before we can obtain robust DL models, or trustworthy DL models in general, we must safeguard our models against making severe mistakes so that imperfect DL models become deployable. A promising approach is to allow DL models to refrain from making predictions on uncertain samples. I'll describe our recent lightweight, universal selective classification method that performs excellently, even under distribution shifts.
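The abstaining idea can be illustrated with the simplest possible confidence-threshold rule. This generic sketch is not the speaker's lightweight universal method; the threshold and names are placeholders.

```python
import numpy as np

def selective_predict(probs, threshold=0.9):
    """Predict only when the top softmax probability exceeds `threshold`; abstain otherwise."""
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    accept = conf >= threshold
    return preds, accept

def risk_coverage(preds, accept, labels):
    """Coverage (fraction of accepted samples) and selective accuracy on the accepted set."""
    coverage = accept.mean()
    sel_acc = (preds[accept] == labels[accept]).mean() if accept.any() else float("nan")
    return coverage, sel_acc
```

Raising the threshold trades coverage for selective accuracy; the interesting question, addressed by more sophisticated methods, is how to make this trade-off reliable under distribution shift.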
Covering Number of Real Algebraic Varieties: Improved Bound and Applications
When: Mon, November 13, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Joe Kileel (UT Austin) - https://web.ma.utexas.edu/users/jkileel/
Abstract: In this talk I will discuss covering numbers of real algebraic varieties and applications to data science. Specifically, we control the number of balls of radius epsilon needed to cover a real variety or semialgebraic set in Euclidean space, in terms of the degrees of the relevant polynomials and the number of variables. The bound remarkably improves on the best previously known general bound, and its proof is much more straightforward. On the application side, we control covering numbers of low-rank CP tensors, bound the sketching dimension for polynomial optimization problems, and bound the generalization error for deep rational and ReLU neural networks. Joint work with Yifan Zhang (UT Austin); see arXiv:2311.05116.
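For reference, the covering number in question is the standard one (definition only; the talk's specific bound is in arXiv:2311.05116):

```latex
\mathcal{N}(V,\varepsilon) \;=\; \min\Big\{\, N \in \mathbb{N} \;:\; \exists\, x_1,\dots,x_N \in \mathbb{R}^n \ \text{with}\ V \subseteq \bigcup_{i=1}^{N} B(x_i,\varepsilon) \,\Big\},
```

so the results bound $\mathcal{N}(V,\varepsilon)$ for $V$ a bounded real variety or semialgebraic set, in terms of the ambient dimension and the degrees of its defining polynomials.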
Preconditioning for Kernel Matrices
When: Mon, November 27, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Yuanzhe Xi (Emory University) - http://www.math.emory.edu/~yxi26/
Abstract: Kernel matrices play a pivotal role in various fields, ranging from computational physics and chemistry to statistics and machine learning. The need for rapid algorithms to solve large kernel matrix systems has grown over time. This talk delves deep into preconditioning techniques for iterative solutions of these systems. The challenge lies in the varying spectrum of a kernel matrix, contingent upon kernel function parameters, like length scale, making it intricate to devise a robust preconditioner. We explore the Nystrom approximation, efficient for low-rank kernel matrices, and propose a correction for those of moderate rank, resulting in an efficient block factorized form, especially for those with high numerical rank. The rank estimation of the kernel matrix and landmark point selection in the Nystrom approximation are also addressed. Moreover, we will introduce a preconditioned Single-Sample CG (PredSS-CG) estimator to provide an unbiased estimation of the Log Marginal Likelihood (LML) and its gradient for the hyperparameter tuning in the kernel methods. We demonstrate the efficiency of the proposed methods on several real-world datasets. This is a joint work with Edmond Chow, Shifan Zhao, Tianshi Xu, and Hua Huang.
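A minimal sketch of a Nyström-based preconditioner for a regularized kernel system (K + λI)x = b is shown below; it applies the inverse of the low-rank-plus-identity approximation inside CG via the Woodbury identity. This is a generic construction for intuition, not the block-factorized preconditioner or the PredSS-CG estimator from the talk, and all names are mine.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gaussian_kernel(X, Y, ell=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def nystrom_factor(K, idx, jitter=1e-10):
    """Low-rank factor L with K ≈ L @ L.T built from landmark columns `idx`."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    w, V = np.linalg.eigh(W + jitter * np.eye(len(idx)))
    w = np.clip(w, jitter, None)
    return C @ (V / np.sqrt(w))                       # L = C W^{-1/2}

def woodbury_precond(L, lam):
    """LinearOperator applying (lam*I + L L^T)^{-1} via the Woodbury identity."""
    n, r = L.shape
    S = lam * np.eye(r) + L.T @ L
    def mv(v):
        return (v - L @ np.linalg.solve(S, L.T @ v)) / lam
    return LinearOperator((n, n), matvec=mv)

# usage sketch
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
K = gaussian_kernel(X, X, ell=1.5)
lam = 1e-3
A = K + lam * np.eye(len(X))
b = rng.standard_normal(len(X))
idx = rng.choice(len(X), size=100, replace=False)     # landmark points
M = woodbury_precond(nystrom_factor(K, idx), lam)
x, info = cg(A, b, M=M)                               # preconditioned CG
```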
Randomized sparse Neural Galerkin schemes for solving evolution equations with deep networks
When: Mon, December 4, 2023 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Benjamin Peherstorfer (New York University) - https://cims.nyu.edu/~pehersto/
Abstract: Nonlinear parametrizations such as deep neural networks can circumvent the Kolmogorov barrier of classical model reduction methods that seek linear approximations in subspaces. However, numerically fitting ("training") nonlinear parametrizations is challenging because (a) training data need to be sampled (residual measurements) to estimate well the population loss with the empirical loss and (b) training errors quickly accumulate and amplify over time. This work introduces Neural Galerkin schemes that build on the Dirac-Frenkel variational principle for training nonlinear parametrizations sequentially in time. The accumulation of error over the sequential-in-time training is addressed by updating only randomized sparse subsets of the parameters, which is motivated by dropout that addresses a similar issue of overfitting due to neuron co-adaptation. Additionally, an adaptive sampling scheme is proposed that judiciously tests the residual so that few residual calculations are sufficient for training. In numerical experiments with a wide range of evolution equations, the proposed scheme outperforms classical linear methods, especially for problems with transport-dominated features and high-dimensional spatial domains.
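The sequential-in-time structure can be illustrated on a toy problem: the 1D advection equation u_t + c u_x = 0 with a three-parameter Gaussian-bump ansatz. At each step a least-squares (Dirac-Frenkel) problem is solved for the time derivative of a randomly selected subset of the parameters, using randomly sampled collocation points. This toy sketch (all names mine) only shows the shape of the scheme, not the deep-network setting of the talk.

```python
import numpy as np

c = 1.0                                    # advection speed: u_t + c u_x = 0

def jacobian(x, th):                       # dU/dtheta for U(x; a, b, w) = a*exp(-(x-b)^2/(2 w^2))
    a, b, w = th
    e = np.exp(-(x - b) ** 2 / (2 * w ** 2))
    return np.stack([e, a * e * (x - b) / w ** 2, a * e * (x - b) ** 2 / w ** 3], axis=1)

def rhs(x, th):                            # right-hand side f = -c * dU/dx
    a, b, w = th
    e = np.exp(-(x - b) ** 2 / (2 * w ** 2))
    return c * a * e * (x - b) / w ** 2

rng = np.random.default_rng(0)
th = np.array([1.0, 0.0, 0.5])             # initial condition: unit bump centered at 0
dt, n_steps, m = 1e-3, 1000, 64
for _ in range(n_steps):
    x = rng.uniform(th[1] - 3.0, th[1] + 3.0, size=m)   # residual samples near the bump
    S = rng.choice(3, size=2, replace=False)            # randomized sparse parameter subset
    vel_S, *_ = np.linalg.lstsq(jacobian(x, th)[:, S], rhs(x, th), rcond=None)
    vel = np.zeros(3); vel[S] = vel_S
    th = th + dt * vel                                  # explicit Euler step in time
# the bump center th[1] advances at roughly (2/3)*c on average, since it is updated on ~2/3 of the steps
print(th)
```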
On a McKean-Vlasov approach to optimal control
When: Mon, January 29, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Sebastian Reich (University of Potsdam) - https://www.math.uni-potsdam.de/~sreich/
Abstract: Stochastic optimal control problems are typically phrased in terms of the associated Hamilton-Jacobi-Bellman equation. Solving such partial differential equations remains challenging. In this talk, an alternative approach involving forward and reverse time McKean-Vlasov evolution equations will be considered. In its simplest incarnation, these equations become equivalent to the formulations used in score-based generative modeling. General optimal control problems lead to more complex forward-reverse McKean-Vlasov equations which require further approximations, for which we will employ variants of the ensemble Kalman filter and diffusion maps.
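For orientation, the score-based generative modeling formulation that the simplest case reduces to is the standard forward/reverse diffusion pair (stated here in its usual textbook form, not the McKean-Vlasov generalization of the talk):

```latex
\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,
\qquad
\mathrm{d}X_t = \big[f(X_t,t) - g(t)^2\,\nabla_x \log p_t(X_t)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{W}_t,
```

where $p_t$ is the marginal law of the forward process, the second equation is integrated backward in time, and the score $\nabla_x \log p_t$ is the object that must be learned or approximated.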
Canceled
When: Mon, February 5, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Cristina Cipriani (Technical University of Munich) - https://www.math.cit.tum.de/math/personen/wissenschaftliches-personal/cristina-cipriani/
Abstract: Canceled
A proof of the Kotzig–Ringel–Rosa conjecture and its applications
When: Mon, February 12, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Edinah Gnang (Johns Hopkins University) - https://engineering.jhu.edu/faculty/edinah-gnang/
Abstract: We describe a proof of the long-standing Kotzig–Ringel–Rosa conjecture (1964), also known as the graceful tree labeling conjecture, and its application to combinatorial design. Our proof stems from a functional reformulation of the conjecture as well as a marriage of Noga Alon's Combinatorial Nullstellensatz with a new composition lemma. If time permits, we will also discuss algorithmic aspects of this result.
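To make the statement concrete: a graceful labeling of a tree with n edges assigns distinct labels from {0, ..., n} to the vertices so that the induced edge labels |f(u) - f(v)| are exactly {1, ..., n}. A small checker (illustrative only; names are mine) is below.

```python
def is_graceful(edges, labels):
    """Check whether `labels` (dict: vertex -> int) gracefully labels a tree with n = len(edges) edges."""
    n = len(edges)
    values = list(labels.values())
    if len(set(values)) != len(values):               # labels must be distinct
        return False
    if not set(values) <= set(range(n + 1)):          # labels drawn from {0, ..., n}
        return False
    edge_labels = {abs(labels[u] - labels[v]) for u, v in edges}
    return edge_labels == set(range(1, n + 1))        # induced edge labels are exactly {1, ..., n}

# Example: the path 0-1-2-3 (3 edges) with vertex labels 0, 3, 1, 2 is graceful.
print(is_graceful([(0, 1), (1, 2), (2, 3)], {0: 0, 1: 3, 2: 1, 3: 2}))  # True
```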
Information in Mean Field Games and High-Dimensional Stochastic Control
When: Mon, February 26, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Aaron Palmer (UCLA) - https://www.math.ucla.edu/~azp/
Abstract: The rapidly developing field of mean field games and control builds a framework to analyze models of many interacting agents that arise in economics, robotics, and other fields. Information plays an essential role in the design of models and optimal controls. 'Information structures' in these models can dictate how the agents cooperate and how they determine their strategies from available data. In this talk, we will discuss different information structures and their role in the 'mean field limit' as the number of agents goes to infinity through mathematical results on the convergence of models and analysis of their solutions.
Enforcing Physical, Mathematical and Numerical Structure in Learning
When: Mon, March 11, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Karthik Duraisamy (University of Michigan) - https://aero.engin.umich.edu/people/duraisamy-karthik/
Abstract: First, I will review some recent developments in scientific machine learning, and will attempt to offer a structured perspective on the current, somewhat chaotic landscape. Following this, I will discuss the synergistic integration of physical and mathematical structure within machine learning models. In particular, I will touch upon three aspects: a) robust learning of Koopman operators using physics constraints with guaranteed stability; b) score-based diffusion models that are consistent with known physical laws; and c) the use of conditional parameterization to enforce numerical consistency in mesh-agnostic deep learning of spatio-temporal fields.
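For context, the unconstrained starting point for learning a Koopman operator from snapshot data is a least-squares (DMD-style) fit. The sketch below shows only that baseline; it does not enforce the physics constraints or stability guarantees discussed in the talk, and the names are mine.

```python
import numpy as np

def dmd_koopman(X, Y, r=None):
    """Least-squares approximation of a finite-dimensional Koopman operator:
    minimize ||Y - A X||_F over A, optionally truncated to rank r via the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    if r is not None:
        U, s, Vt = U[:, :r], s[:r], Vt[:r]
    return Y @ Vt.T @ np.diag(1.0 / s) @ U.T

# usage: snapshot pairs x_{k+1} = F(x_k) stacked as columns of X (inputs) and Y (outputs)
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])     # a stable linear system
X = rng.standard_normal((2, 200))
Y = A_true @ X
A_hat = dmd_koopman(X, Y)
print(np.abs(np.linalg.eigvals(A_hat)))          # spectrum inside the unit circle here, but only because the data are stable
```

In this unconstrained fit, stability of the learned operator is inherited from the data rather than guaranteed, which is exactly the gap that stability-constrained Koopman learning aims to close.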
Directed Chain Generative Adversarial Networks for Multimodal Distributed Financial Data
When: Mon, March 25, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Ruimeng Hu (University of California, Santa Barbara) - https://sites.google.com/site/ruimenghu1/
Abstract: Real-world financial data can be multimodal distributed, and generating multimodal distributed real-world data has become a challenge to existing generative adversarial networks (GANs). For example, neural stochastic differential equations (Neural SDEs), treated as infinite-dimensional GANs, are only capable of generating unimodal time series data. In this talk, we present a novel time series generator, named directed chain GANs (DC-GANs), which inserts a time series dataset (called a neighborhood process of the directed chain or input) into the drift and diffusion coefficients of the directed chain SDEs with distributional constraints. DC-GANs can generate new time series of the same distribution as the neighborhood process, and the neighborhood process will provide the key step in learning and generating multimodal distributed time series. Signature from rough path theory will be used to construct the discriminator. Numerical experiments on financial data are presented and show a consistent outperformance over state-of-the-art benchmarks with respect to measures of distribution, data similarity, and predictive ability. If time permits, I will also talk about using Signature to solve mean-field games with common noise.
Canceled
When: Mon, April 1, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Doubly Noisy Linear Systems and the Kaczmarz Algorithm
When: Mon, April 8, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Anna Ma (University of California, Irvine) - https://sites.google.com/view/annama
Abstract: Large-scale linear systems, Ax=b, frequently arise in data science and scientific computing at massive scales, thus demanding effective iterative methods to solve them. Often, these systems are noisy due to operational errors or faulty data-collection processes. In the past decade, the randomized Kaczmarz algorithm (RK) has been studied extensively as an efficient iterative solver for such systems. However, convergence studies of RK in the noisy regime are limited and consider measurement noise only in the right-hand side vector, b. Unfortunately, that is not always the case, and the coefficient matrix A can also be noisy. In this talk, we motivate and discuss doubly noisy linear systems and the performance of the Kaczmarz algorithm applied to such systems. The presented work is joint with El Houcine Bergou, Soumia Boucherouite, Aritra Dutta, and Xin Li.
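For readers unfamiliar with the method, a basic randomized Kaczmarz iteration applied to a doubly noisy system looks as follows. This is a generic sketch, not the analysis from the talk; noise levels and names are placeholders.

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iter=5000, seed=0):
    """Randomized Kaczmarz: project the iterate onto a randomly chosen row's hyperplane,
    with rows sampled proportionally to their squared norms."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms = (A ** 2).sum(axis=1)
    p = row_norms / row_norms.sum()
    x = np.zeros(n)
    for _ in range(n_iter):
        i = rng.choice(m, p=p)
        x += (b[i] - A[i] @ x) / row_norms[i] * A[i]
    return x

# doubly noisy system: both A and b are observed with noise
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 50))
x_true = rng.standard_normal(50)
A_noisy = A + 1e-3 * rng.standard_normal(A.shape)
b_noisy = A @ x_true + 1e-3 * rng.standard_normal(500)
x_hat = randomized_kaczmarz(A_noisy, b_noisy)
print(np.linalg.norm(x_hat - x_true))    # converges only to a neighborhood of x_true whose size depends on both noise sources
```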
Dynamics of Strategic Agents and Algorithms as PDEs
When: Mon, April 15, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Franca Hoffmann (Caltech) - https://francahoffmann.wordpress.com/
Abstract: We propose a PDE framework for modeling the distribution shift of a strategic population interacting with a learning algorithm. We consider two particular settings; one, where the objective of the algorithm and population are aligned, and two, where the algorithm and population have opposite goals. We present convergence analysis for both settings, including three different timescales for the opposing-goal objective dynamics. We illustrate how our framework can accurately model real-world data and show via synthetic examples how it captures sophisticated distribution changes which cannot be modeled with simpler methods.
Optimization, Sampling, and Generative Modeling in Non-Euclidean Spaces
When: Mon, April 22, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Molei Tao (Georgia Tech) - https://mtao8.math.gatech.edu/
Abstract: Machine learning in non-Euclidean spaces has been rapidly attracting attention in recent years, and this talk will give some examples of progress on its mathematical and algorithmic foundations. A sequence of developments that eventually leads to non-Euclidean generative modeling will be reported. More precisely, I will begin with variational optimization, which, together with delicate interplays between continuous- and discrete-time dynamics, enables the construction of momentum-accelerated algorithms that optimize functions defined on manifolds. Selected applications, namely a generic improvement of Transformers and a low-dimensional approximation of high-dimensional optimal transport distances, will be described. Then I will turn the optimization dynamics into an algorithm that samples from probability distributions on Lie groups. If time permits, the performance of this sampler will also be quantified, without a log-concavity condition or its common relaxations. Finally, I will describe how this sampler can lead to a structurally pleasant diffusion generative model that allows users, given training data that follow any latent statistical distribution on a Lie group, to generate more data exactly on the same manifold that follow the same distribution. If time permits, applications such as to quantum data will be briefly mentioned.
Multi-Operator Learning and Generalization for Partial Differential Equations
When: Mon, April 29, 2024 - 2:30pm
Where: Zoom Meeting ID: 927 8056 1489, Password: 0900, Link: https://go.umd.edu/MTHDataScience
Speaker: Hayden Schaeffer (UCLA) - https://sites.google.com/view/haydenschaeffer/
Abstract: We introduce a multi-modal model for scientific problems, named PROSE-PDE. Our model, designed for bi-modality to bi-modality learning, is a multi-operator learning approach which can predict future states of spatiotemporal systems while simultaneously recovering the underlying governing equations of the observed physical system. We focus on training with distinct one-dimensional, time-dependent, nonlinear, constant-coefficient partial differential equations. In addition, we will discuss extrapolation studies related to generalizing physical features and predicting PDE solutions whose models or data were unseen during training. We show through numerical experiments that the utilization of the symbolic modality in our model effectively resolves the well-posedness problems of training multiple operators and thus enhances our model's predictive capabilities.