Demonstrating mathematical integrity under GPU parallelization and reduced precision.
GPU-Accelerated Symplectic Integrator demonstrates:
Target audience: NVIDIA-level research teams, quantitative researchers, HPC specialists
Integrate Hamilton’s equations: \(\frac{d\mathbf{q}}{dt} = \frac{\partial H}{\partial \mathbf{p}}, \quad \frac{d\mathbf{p}}{dt} = -\frac{\partial H}{\partial \mathbf{q}}\)
Key property: These preserve the symplectic structure — an invariant volume element in phase space.
Most numerical methods do NOT preserve this structure:
The difference between these is not just simulation accuracy—it’s about preserving fundamental mathematical properties of the system.
Properties:
Perfect test case for demonstrating structure preservation.
File: include/henon_heiles.h
struct State4D {
float x, y, px, py;
};
__host__ __device__ inline float compute_energy(float x, float y, float px, float py);
__host__ __device__ inline void compute_gradients(float x, float y, float& grad_x, float& grad_y);
Key Design:
__host__ __device__ functions → works on both CPU and GPUFiles:
include/integrators.h (declarations)src/gpu/integrators.cu (GPU kernels)src/cpu/integrators.py (CPU reference)Three integrator classes:
class EulerIntegrator(BaseIntegrator):
def integrate(self, state, dt, steps):
for n in range(steps):
grad_x, grad_y = henon_heiles_gradients(state.x, state.y)
state.x += dt * state.px
state.y += dt * state.py
state.px -= dt * grad_x
state.py -= dt * grad_y
return state
Characteristics:
class RK4Integrator(BaseIntegrator):
def integrate(self, state, dt, steps):
for n in range(steps):
# 4 stages of RK4...
# Evaluate k1, k2, k3, k4
# Combine with weights 1/6, 2/6, 2/6, 1/6
return state
Characteristics:
class SymplecticIntegrator(BaseIntegrator):
def integrate(self, state, dt, steps):
for n in range(steps):
grad_x, grad_y = henon_heiles_gradients(state.x, state.y)
state.px -= 0.5 * dt * grad_x
state.py -= 0.5 * dt * grad_y
state.x += dt * state.px
state.y += dt * state.py
grad_x, grad_y = henon_heiles_gradients(state.x, state.y)
state.px -= 0.5 * dt * grad_x
state.py -= 0.5 * dt * grad_y
return state
Characteristics:
Files:
src/cpu/analysis.py → Energy tracking and visualizationclass EnergyAnalysis:
def add_result(self, method_name, energies_history, times):
# Store results
def compute_statistics(self, method_name):
# Compute max error, mean error, etc.
def plot_energy_drift(self, output_path):
# Generate comparison plots
def print_summary(self):
# Print statistics table
File: src/gpu/integrators.cu
Kernel structure:
__global__ void symplectic_kernel(
float* x, float* y, // Position arrays
float* px, float* py, // Momentum arrays
int n, // Number of trajectories
float dt, // Timestep
int steps) { // Integration steps
// Each thread processes one trajectory
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n) return;
// Load trajectory into registers
float xi = x[i];
float yi = y[i];
float pxi = px[i];
float pyi = py[i];
// Integrate all steps
for (int t = 0; t < steps; ++t) {
// Compute gradients (arithmetic on registers only)
float grad_x = xi + 2*xi*yi;
float grad_y = yi + xi*xi - yi*yi;
// Symplectic update (all in registers)
pxi -= 0.5f * dt * grad_x;
pyi -= 0.5f * dt * grad_y;
xi += dt * pxi;
yi += dt * pyi;
grad_x = xi + 2*xi*yi;
grad_y = yi + xi*xi - yi*yi;
pxi -= 0.5f * dt * grad_x;
pyi -= 0.5f * dt * grad_y;
}
// Write results back
x[i] = xi;
y[i] = yi;
px[i] = pxi;
py[i] = pyi;
}
GPU Optimization Techniques:
┌─────────────────────────┐
│ Create Initial Ensemble │
│ (random in chaos) │
└────────────┬────────────┘
│
├──→ EulerIntegrator ─┐
│ ├─→ EnergyAnalysis ─→ Plots
├─→ RK4Integrator ───┤
│ │
└─→ SymplecticIntegrator ┘
│
└─→ Write results to data/
Host Device
───────────────────────────────────────────
Initial states Allocate GPU memory
│ │
├─→ cudaMemcpy ─────────────→ GPU buffers
│
└─→ Launch kernel
│
└────────────────────→ 1000s of threads
each integrate 1 trajectory
│
┌───────┴────────┐
│ │
Symplectic_kernel Energy_kernel
│ │
←──────────────┴────────────────┘
Results from GPU
│
├─ cudaMemcpy ←────────────────
│
└─→ Analyze/visualize
Throughput:
Limitation: Sequential execution on single core
Theoretical Peak:
Expected speedup: 100-1000x
cd src/cpu
python -m pip install -r ../../requirements.txt
python benchmark.py
No compilation needed—pure Python.
mkdir build && cd build
cmake ..
make
CMakeLists.txt structure:
find_package(CUDA REQUIRED)
add_library(symplectic_gpu SHARED src/gpu/integrators.cu)
set_target_properties(symplectic_gpu PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
set_target_compile_options(symplectic_gpu PRIVATE -O3 -arch=sm_70)
Key settings:
CUDA_SEPARABLE_COMPILATION → allows device linking-arch=sm_70 → target compute capability 7.0+ (V100/A100/RTX)-O3 → aggressive optimizationTEST(EnergyConservation, SymplecticPreserves) {
// Integrate one step
// Check energy error < 1e-6
}
TEST(IntegratorBounds, SymplecticFinalStateValid) {
// Check final state in reasonable bounds
}
python src/cpu/benchmark.py
# Check:
# 1. Symplectic max error < Euler/RK4
# 2. Symplectic energy bounded over time
# 3. All methods produce physical results
nvprof ./build/symplectic_gpu_example
# Check:
# 1. Kernel launch overhead < 1%
# 2. Memory bandwidth utilization > 50%
# 3. No kernel divergence or shared memory conflicts
| File | Purpose | Language |
|---|---|---|
src/cpu/integrators.py |
Three integrator implementations | Python |
src/cpu/analysis.py |
Energy analysis and visualization | Python |
src/cpu/benchmark.py |
CPU benchmark driver | Python |
src/gpu/integrators.cu |
GPU kernels | CUDA C++ |
include/henon_heiles.h |
System definitions (dual CPU/GPU) | C++ |
include/integrators.h |
Integrator API | C++ |
CMakeLists.txt |
Build configuration | CMake |
notebooks/01_cpu_benchmark.ipynb |
Interactive analysis | Jupyter |
from src.cpu import SymplecticIntegrator
integrator = SymplecticIntegrator()
state = create_initial_condition(1)[0]
final = integrator.integrate(state, dt=1e-3, steps=100)
print(integrator.energies) # Check if energy stays constant
nvprof --print-gpu-trace ./build/symplectic_gpu_benchmark
Modify henon_heiles_gradients() in src/cpu/integrators.py:
def henon_heiles_gradients(x, y):
# Current: Hénon-Heiles
grad_x = x + 2*x*y
grad_y = y + x**2 - y**2
# Alternative: Coupled oscillators
# grad_x = x
# grad_y = y
This architecture enables clear separation between mathematical correctness (CPU) and computational performance (GPU), making it easy to validate, optimize, and extend.