Symplectic Integrator on GPUs

Demonstrating mathematical integrity under GPU parallelization and reduced precision.

View the Project on GitHub OleBo/SymplecticIntegrator

CUDA Implementation Documentation — GPU-Accelerated Integrators

Overview

This document explains the CUDA implementation details of the GPU-accelerated integrators for the Hénon–Heiles system.

The focus is on:


1. Computational Problem

We simulate a large number (N) of independent trajectories:

[ z_i(t) = (x_i, y_i, p_{x,i}, p_{y,i}), \quad i = 1, \dots, N ]

Each trajectory evolves according to:

[ \dot{z}_i = f(z_i) ]

Key property:

All trajectories are independent → ideal for massive parallelism.


2. Parallelization Strategy

2.1 Thread Mapping

Each CUDA thread is responsible for one trajectory:

int i = blockIdx.x * blockDim.x + threadIdx.x;

If:

i < N

then thread i updates:

x[i], y[i], px[i], py[i]

2.2 Execution Model

Grid
 ├── Block 0
 │    ├── Thread 0 → Trajectory 0
 │    ├── Thread 1 → Trajectory 1
 │    └── ...
 ├── Block 1
 │    └── ...
 └── ...

3. Memory Layout

3.1 Structure of Arrays (SoA)

The implementation uses:

double* x;
double* y;
double* px;
double* py;

instead of:

struct State { double x, y, px, py; };
State* states;

3.2 Why SoA?

Benefits:


3.3 Coalesced Access Pattern

Thread 0 → x[0]
Thread 1 → x[1]
Thread 2 → x[2]
...

This allows the GPU to fetch memory in contiguous blocks.


4. Kernel Design

4.1 General Structure

Each kernel follows the pattern:

__global__ void integrator_kernel(...) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    // Load state
    double xi = x[i];
    double yi = y[i];
    double pxi = px[i];
    double pyi = py[i];

    // Time stepping loop
    for (int step = 0; step < steps; ++step) {
        // integrator update
    }

    // Store state
    x[i] = xi;
    y[i] = yi;
    px[i] = pxi;
    py[i] = pyi;
}

4.2 Design Choices


5. Gradient Computation

5.1 Device Function

__device__ void compute_gradients(double x, double y,
                                  double& dVdx, double& dVdy)

5.2 Mathematical Form

[ \frac{\partial V}{\partial x} = x + 2xy ] [ \frac{\partial V}{\partial y} = y + x^2 - y^2 ]


5.3 Optimization Notes


6. Symplectic Integrator Kernel

6.1 Algorithm in CUDA

// Half-step momentum update
px -= 0.5 * dt * dVdx;
py -= 0.5 * dt * dVdy;

// Full-step position update
x += dt * px;
y += dt * py;

// Recompute gradients
compute_gradients(x, y, dVdx, dVdy);

// Second half-step momentum update
px -= 0.5 * dt * dVdx;
py -= 0.5 * dt * dVdy;

6.2 Key Property

This update is symplectic, meaning it preserves phase-space structure.


7. RK4 Kernel

7.1 Structure

RK4 requires multiple intermediate evaluations:

k1 = f(z)
k2 = f(z + dt/2 * k1)
k3 = f(z + dt/2 * k2)
k4 = f(z + dt * k3)

7.2 GPU Implications


8. Euler Kernel

8.1 Update Rule

x += dt * px;
y += dt * py;
px -= dt * dVdx;
py -= dt * dVdy;

8.2 Role


9. Energy Kernel

9.1 Purpose

Compute:

[ E_i = \frac{1}{2}(p_x^2 + p_y^2) + V(x,y) ]


9.2 CUDA Implementation Pattern

__global__ void energy_kernel(...) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    energy[i] = ...;
}

10. Host–Device Interaction

10.1 Workflow

  1. Allocate device memory (cudaMalloc)
  2. Copy data to GPU (cudaMemcpy)
  3. Launch kernel
  4. Copy results back
  5. Free memory

10.2 Example

cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);

kernel<<<grid, block>>>(...);

cudaMemcpy(x, d_x, size, cudaMemcpyDeviceToHost);

11. Performance Considerations

11.1 Occupancy


11.2 Arithmetic Intensity

Symplectic integrator:

RK4:


11.3 Memory Bottlenecks

Avoid:


12. Scaling Behavior

12.1 Strong Scaling

Fix number of trajectories, vary GPU:


12.2 Weak Scaling

Increase (N):

[ \text{Runtime} \sim \mathcal{O}(N) ]

GPU handles millions of trajectories efficiently.


13. Precision Considerations

13.1 FP32 vs FP64


13.2 Mixed Precision (Advanced)


14. Common Pitfalls


15. Summary

This CUDA implementation demonstrates:


Final Insight

GPUs excel when the mathematical structure matches the hardware model.

Here:

This alignment is what enables high-performance scientific computing.

CUDA Implementation Documentation — Annotated Code Snippets

This section adds line-by-line annotated CUDA kernels for the three integrators and key helpers. The goal is to make the mapping from mathematics → GPU execution completely explicit.


1. Thread Indexing & Guard

// Compute global thread index
int i = blockIdx.x * blockDim.x + threadIdx.x;

// Guard against out-of-bounds threads
if (i >= N) return;

Explanation


2. Device Function: Gradient of Potential

__device__ __forceinline__
void compute_gradients(double x, double y,
                       double& dVdx, double& dVdy)
{
    // ∂V/∂x = x + 2xy
    dVdx = x + 2.0 * x * y;

    // ∂V/∂y = y + x^2 - y^2
    dVdy = y + x * x - y * y;
}

Explanation


3. Symplectic (Leapfrog) Kernel

__global__
void symplectic_kernel(double* x, double* y,
                       double* px, double* py,
                       int N, int steps, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    // --- Load state into registers (fast) ---
    double xi  = x[i];
    double yi  = y[i];
    double pxi = px[i];
    double pyi = py[i];

    double dVdx, dVdy;

    // --- Time integration loop ---
    for (int s = 0; s < steps; ++s) {

        // Compute gradient at current position
        compute_gradients(xi, yi, dVdx, dVdy);

        // (1) Half-step momentum update (Kick)
        pxi -= 0.5 * dt * dVdx;
        pyi -= 0.5 * dt * dVdy;

        // (2) Full-step position update (Drift)
        xi += dt * pxi;
        yi += dt * pyi;

        // Recompute gradient at new position
        compute_gradients(xi, yi, dVdx, dVdy);

        // (3) Half-step momentum update (Kick)
        pxi -= 0.5 * dt * dVdx;
        pyi -= 0.5 * dt * dVdy;
    }

    // --- Store results back to global memory ---
    x[i]  = xi;
    y[i]  = yi;
    px[i] = pxi;
    py[i] = pyi;
}

Key Points


4. RK4 Kernel (Annotated)

__global__
void rk4_kernel(double* x, double* y,
                double* px, double* py,
                int N, int steps, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    double xi  = x[i];
    double yi  = y[i];
    double pxi = px[i];
    double pyi = py[i];

    for (int s = 0; s < steps; ++s) {

        // k1
        double dVdx1, dVdy1;
        compute_gradients(xi, yi, dVdx1, dVdy1);

        double k1_x  = pxi;
        double k1_y  = pyi;
        double k1_px = -dVdx1;
        double k1_py = -dVdy1;

        // k2 (midpoint using k1)
        double x2  = xi  + 0.5 * dt * k1_x;
        double y2  = yi  + 0.5 * dt * k1_y;
        double px2 = pxi + 0.5 * dt * k1_px;
        double py2 = pyi + 0.5 * dt * k1_py;

        double dVdx2, dVdy2;
        compute_gradients(x2, y2, dVdx2, dVdy2);

        double k2_x  = px2;
        double k2_y  = py2;
        double k2_px = -dVdx2;
        double k2_py = -dVdy2;

        // k3 (another midpoint)
        double x3  = xi  + 0.5 * dt * k2_x;
        double y3  = yi  + 0.5 * dt * k2_y;
        double px3 = pxi + 0.5 * dt * k2_px;
        double py3 = pyi + 0.5 * dt * k2_py;

        double dVdx3, dVdy3;
        compute_gradients(x3, y3, dVdx3, dVdy3);

        double k3_x  = px3;
        double k3_y  = py3;
        double k3_px = -dVdx3;
        double k3_py = -dVdy3;

        // k4 (full step)
        double x4  = xi  + dt * k3_x;
        double y4  = yi  + dt * k3_y;
        double px4 = pxi + dt * k3_px;
        double py4 = pyi + dt * k3_py;

        double dVdx4, dVdy4;
        compute_gradients(x4, y4, dVdx4, dVdy4);

        double k4_x  = px4;
        double k4_y  = py4;
        double k4_px = -dVdx4;
        double k4_py = -dVdy4;

        // Combine increments
        xi  += dt / 6.0 * (k1_x  + 2*k2_x  + 2*k3_x  + k4_x);
        yi  += dt / 6.0 * (k1_y  + 2*k2_y  + 2*k3_y  + k4_y);
        pxi += dt / 6.0 * (k1_px + 2*k2_px + 2*k3_px + k4_px);
        pyi += dt / 6.0 * (k1_py + 2*k2_py + 2*k3_py + k4_py);
    }

    x[i]  = xi;
    y[i]  = yi;
    px[i] = pxi;
    py[i] = pyi;
}

Key Points


5. Euler Kernel (Baseline)

__global__
void euler_kernel(double* x, double* y,
                  double* px, double* py,
                  int N, int steps, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    double xi  = x[i];
    double yi  = y[i];
    double pxi = px[i];
    double pyi = py[i];

    double dVdx, dVdy;

    for (int s = 0; s < steps; ++s) {
        compute_gradients(xi, yi, dVdx, dVdy);

        // Explicit Euler update
        xi  += dt * pxi;
        yi  += dt * pyi;
        pxi -= dt * dVdx;
        pyi -= dt * dVdy;
    }

    x[i]  = xi;
    y[i]  = yi;
    px[i] = pxi;
    py[i] = pyi;
}

Key Points


6. Energy Kernel

__global__
void energy_kernel(const double* x, const double* y,
                   const double* px, const double* py,
                   double* E, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    double xi  = x[i];
    double yi  = y[i];
    double pxi = px[i];
    double pyi = py[i];

    // Kinetic energy
    double T = 0.5 * (pxi * pxi + pyi * pyi);

    // Potential energy
    double V = 0.5 * (xi * xi + yi * yi)
             + xi * xi * yi
             - (1.0 / 3.0) * yi * yi * yi;

    E[i] = T + V;
}

Key Points


7. Kernel Launch Configuration

int blockSize = 256;
int gridSize  = (N + blockSize - 1) / blockSize;

symplectic_kernel<<<gridSize, blockSize>>>(d_x, d_y, d_px, d_py,
                                           N, steps, dt);

Explanation


8. Performance Checklist (Practical)


Final Insight

These kernels illustrate a key HPC principle:

Performance emerges when mathematical structure aligns with hardware structure.