RAMSES - Dec. 2021 SISSA, Italy
This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This work was funded by the Advanced Simulation and Computing program and the Laboratory Directed Research and Development program at Sandia National Laboratories, a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525
Francesco Rizzi
(NexGen Analytics)
(in collaboration with E. Parish, P. Blonigan, J. Tencer - Sandia National Labs)
Follow along/slides at: fnrizzi.github.io/ramses-12-2021/
Historically, a key focus of pROMs work has been:
"finding the smallest subspace that can represent/solve a problem"
Intuitively: a small system is more convenient to compute
Mathematically: intriguing but hard
Computationally: is this really the best approach?
What if we can formulate the problem such that we don't need to reduce it so much while being efficient?
This talk aims to provide a counterargument
Focus on pROMs for LTI systems
Weren't they a solved problem...?
Emphasis on computational aspects
Little math, error bounds, ML, deep learning (sorry!)
Disclaimer: this work might seem "obvious" (depending on whom you are talking to)
Format: this is going to be a "story"
Walk through how this work started and developed
Finally, we talk about generalization
Obviously a very important topic, yet there is no pROMs work on it
Relevant: pROMs for acoustic waves: V. Pereyra et al., Electron. Trans. Numer. Anal. (2008)
Surface waves: travel at the Earth's surface
Body waves: travel through the Earth
Affected by the material properties (density, modulus)
Primary (P-waves): compressional, longitudinal
Secondary or shear (S-waves): transversal (particles oscillate perpendicularly to the direction of wave propagation)
Interesting topic, so we started on this
Surface
Shear effects are negligible in liquid, so the core is not considered
Surface
Core-mantle
Given:
Find:
* H. Igel, M. Weber, Geophys. Res. Lett. 22 (6) (1995)
Sparse coefficient matrices
(depend on material prop, not on time)
Velocities
Stresses
* P. Benner, S. Gugercin, K. Willcox, SIAM, 2015
Contour plots of the velocity field: Ricker wavelet source with T = 60 s, depth = 640 km
Interference
Reflection
Refraction (from discontinuities)
PREM Earth model
time = 250 sec
time = 1000 sec
time = 2000 sec
For simplicity, assume same # of modes = K for velocity/stresses
Typical form for LTI systems, e.g.:
P. Benner, S. Gugercin, K. Willcox, SIAM, 2015
Approximation:
Many modes!
Not surprising!
Velocity field at the final time (2000 s), computed for T = 69 (extrapolation point)
ROM using 436 modes for velocity and 417 modes for stresses
~3700X reduction in # dofs!
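For reference, a generic Galerkin projection of an LTI system can be written as below. This is a minimal sketch, not the exact velocity/stress system used in the talk: the symbols x, A, f, Phi and N are assumptions introduced here for illustration, and only K (the number of modes) comes from the slides.

```latex
% Full-order LTI model with N degrees of freedom (assumed notation)
\dot{x}(t) = A\,x(t) + f(t), \qquad x(t) \in \mathbb{R}^{N}

% Approximate the state in a K-dimensional basis with orthonormal columns
x(t) \approx \Phi\,\hat{x}(t), \qquad \Phi \in \mathbb{R}^{N \times K}, \quad \Phi^{T}\Phi = I

% Galerkin projection yields a dense K x K reduced system
\dot{\hat{x}}(t) = \hat{A}\,\hat{x}(t) + \hat{f}(t), \qquad
\hat{A} = \Phi^{T} A\,\Phi, \quad \hat{f}(t) = \Phi^{T} f(t)
```

Under these assumptions, evaluating the right-hand side of the reduced system costs one dense matrix-vector product with the K x K reduced operator plus a vector addition.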
How do we evaluate the efficiency of this kernel?
Assuming a square system of size for simplicity
Data movement: (read ) + (read ) + (write )
FLOPS:
ai ~1/4 (hint: this is small)
gemv kernel:
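The 1/4 estimate can be checked with a short calculation. The sketch below assumes double precision (8 bytes per value), a dense square matrix, and that each operand crosses the memory bus exactly once; the name `n` for the system size is a placeholder chosen here, not the symbol used on the slide.

```python
def gemv_arithmetic_intensity(n: int, bytes_per_value: int = 8) -> float:
    """Estimate FLOPs per byte for y = A x with a dense n-by-n matrix A."""
    flops = 2 * n * n             # one multiply and one add per matrix entry
    values_moved = n * n + n + n  # read A, read x, write y
    return flops / (values_moved * bytes_per_value)


# The ratio approaches 2 / 8 = 1/4 FLOP per byte as n grows.
for n in (256, 1024, 4096):
    print(n, round(gemv_arithmetic_intensity(n), 4))
```

The matrix read dominates the traffic, so the intensity stays just below 1/4 regardless of the system size.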
Visual performance model obtained by plotting:
kernel performance (in GFLOP/s) against arithmetic intensity
Evaluates a kernel's resource efficiency by relating its arithmetic intensity to the hardware’s peak main-memory bandwidth and peak floating-point performance
Shows the hardware limitations for a given kernel and helps prioritize optimizations
1/4
Memory bandwidth bound!
theoretically attainable performance depending on the arithmetic intensity
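The bound described here is commonly called the roofline model: attainable performance is the smaller of the peak compute rate and the product of arithmetic intensity and peak memory bandwidth. A minimal sketch follows; the machine numbers are made-up placeholders, not measurements of any hardware mentioned in this talk.

```python
def attainable_gflops(ai: float, peak_gflops: float, peak_bandwidth_gbs: float) -> float:
    """Roofline bound: min(compute roof, memory roof) in GFLOP/s."""
    return min(peak_gflops, ai * peak_bandwidth_gbs)


def is_memory_bound(ai: float, peak_gflops: float, peak_bandwidth_gbs: float) -> bool:
    """A kernel is memory-bandwidth bound when it sits left of the ridge point."""
    ridge_point = peak_gflops / peak_bandwidth_gbs  # FLOP per byte
    return ai < ridge_point


# Hypothetical machine: 1000 GFLOP/s peak compute, 100 GB/s memory bandwidth.
PEAK_GFLOPS, PEAK_BW_GBS = 1000.0, 100.0
for ai in (0.25, 4.0, 64.0):
    print(ai, attainable_gflops(ai, PEAK_GFLOPS, PEAK_BW_GBS),
          is_memory_bound(ai, PEAK_GFLOPS, PEAK_BW_GBS))
```

On this hypothetical machine, a kernel with arithmetic intensity 1/4 is capped at a small fraction of the peak compute rate, no matter how many cores are used.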
Hardware has changed since the '80s!
Memory-bandwidth-bound (MBB) kernels: not ideal for modern many-core architectures
Best when:
cores are kept busy, data is local
access patterns are optimal for the targeted arch
Standard Galerkin ROM
This is useful when we need many solves
Let's consider M trajectories simultaneously
e.g. different forcing evaluations
~ K / 16
Standard formulation
Rank-2 formulation
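The difference between the two formulations can be illustrated with NumPy. This is a sketch under assumed names and sizes (`A_hat`, `X_hat`, K = 256, M = 32 are placeholders), not the Kokkos implementation: it only shows that stacking M reduced states as columns turns M matrix-vector products into one matrix-matrix product with the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 256, 32                       # modes and simultaneous trajectories (illustrative sizes)
A_hat = rng.standard_normal((K, K))  # stand-in for the dense reduced operator
X_hat = rng.standard_normal((K, M))  # column j holds the reduced state of trajectory j

# Standard (rank-1) formulation: one matrix-vector product per trajectory.
Y_rank1 = np.column_stack([A_hat @ X_hat[:, j] for j in range(M)])

# Rank-2 formulation: a single matrix-matrix product covers all trajectories.
Y_rank2 = A_hat @ X_hat

print(np.allclose(Y_rank1, Y_rank2))
```

In the rank-2 form the operator is read once and reused across all M columns, so the FLOPs per byte moved grow with M instead of staying fixed as in the matrix-vector case.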
Kokkos implementation with OpenMP backend;
workstation with two 18-core Intel(R) Xeon(R) Gold 6154 CPU @ 3.00 GHz (24.75MB L3 cache, 125GB total mem)*
M=1: very limited
M>1: increasing # of threads helps
Large K and M are an advantage!
They allow us to fully exploit the machine!
M = # of simultaneous trajectories
M = 1 : standard Galerkin ROM
M >= 2: rank-2 ROM formulation
* F.Rizzi et al, CMAME, 2021
What combination of thread count (n) and number of trajectories, M, would be the most efficient to obtain those P samples while satisfying the given constraints?
Launch 36 single-thread ROM runs each using M=1
and repeat until all my P samples are done
CPU 0
CPU 1
Launch 18 two-threaded ROM runs each using M=1
and repeat until all my P samples are done
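The bookkeeping behind these scenarios can be sketched as follows. The function and parameter names are hypothetical, the 36 cores match the two 18-core CPUs of the workstation, and P = 512 is only an illustrative sample count. The number of rounds alone does not decide efficiency: it has to be combined with the measured wall time of one run for each (n, M) pair.

```python
import math


def rounds_needed(p_samples: int, cores: int, threads_per_run: int, m_trajectories: int):
    """Return (concurrent runs, rounds) needed to produce p_samples trajectories."""
    concurrent_runs = cores // threads_per_run         # runs that fit on the node at once
    samples_per_round = concurrent_runs * m_trajectories
    return concurrent_runs, math.ceil(p_samples / samples_per_round)


P, CORES = 512, 36  # illustrative sample count; cores of the two-socket workstation
for n, m in [(1, 1), (2, 1), (36, 512)]:
    runs, rounds = rounds_needed(P, CORES, n, m)
    print(f"n={n:2d} M={m:3d}: {runs:2d} concurrent runs, {rounds:2d} rounds")
```

Total time for a scenario would then be estimated as rounds multiplied by the measured time of a single run with that thread count and M.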
If we increase # of modes (K), things improve!
# of modes (K) = 512
# of modes (K) = 2048
Compute speedup wrt standard Galerkin ROM (M=1)
Greener is better
For K = 256 : the rank-2 ROM is 13X more efficient than the rank-1 ROM
For K = 512 : 19X
For K = 1024: 23X
For K = 2048: 26X
The larger the number of modes, the more efficient it is to evaluate an ensemble of trajectories!
Aeroelasticity:
deforming structures modeled as linear, with a nonlinear load
Acoustic waves:
modeled with a linear PDE, but can have a number of nonlinear sources (turbulent shear layers from wakes)
Neutral particle (neutron, photon, etc.) transport
Linear circuit models
What if the matrix A changes?
What about nonlinear problems?
Tensors are getting more attention due to deep learning
ROMs for LTI can benefit from them
Leverage hardware evolution: CUDA tensor cores
Wave code is open-source: https://github.com/Pressio/SHAW
(within our Pressio ecosystem for ROMs)
(and part of the Exascale Computing Project [ECP] Proxy apps suite)
Questions?
Happy to talk offline too:
francesco.rizzi@ng-analytics.com
fnrizzi@sandia.gov
Slides at: fnrizzi.github.io/ramses-12-2021/
If you are here today, likely you use and/or study and/or believe in surrogate modeling. So I could spend minutes on this but...
Computing/hardware progresses and changes quickly
Exascale is already here: China has two machines
How does this impact surrogates (if at all)?
Can we leverage this for surrogate modeling, and if so, how?
"It allows me to run my same old surrogate faster": not ideal!
More synergistic development of surrogates and computing?
MC study: 512 trajectories sampling the forcing period T
Rank-2 ROM is 950 times faster than FOM
Rank-2 ROM is 15 times faster than rank-1 ROM
Source: https://www.alcf.anl.gov/files/DMello-Nguyen-ALCF-CP-Workshop-MKL-2019-05-01-2019.pdf