AIComputing101
October 15, 2025 · GPU Programming · Dr. Stephen Shao · CUDA, ROCm

Optimizing GPU Kernels: Strategies for NVIDIA CUDA and AMD ROCm

GPU Kernel Optimization for NVIDIA H100 and AMD MI300

The gpu-programming-101 repository has been updated to focus on the latest data center GPUs from NVIDIA and AMD. This advanced guide explores cross-platform optimization techniques that leverage the unique capabilities of these architectures while maintaining portable code patterns. Our updated 04_advanced_optimization module includes CUDA 13+ and ROCm 7.x examples, with special attention to tensor cores, matrix engines, and HBM3 memory systems.

Why Optimization Matters (With H100/MI300 Benchmarks)

Modern GPUs like the H100 and MI300 offer unprecedented compute density, but realizing their full potential requires architecture-aware programming. A naive matrix multiplication kernel demonstrates the performance gap:

  • ~450 GFLOPS on NVIDIA H100 (CUDA 13.x)
  • ~420 GFLOPS on AMD MI300X (ROCm 7.x)

With optimized implementations leveraging hardware acceleration features, we achieved:

  • 19.2 TFLOPS on H100 (using Tensor Cores with FP16)
  • 16.8 TFLOPS on MI300X (using Matrix Cores with FP16)

This represents a roughly 40-43x speedup, making proper optimization critical for exploiting these $20k+ GPUs.
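
For reference, the naive baseline looks roughly like the kernel below: one thread per output element, with every operand fetched from HBM and no data reuse. This is an illustrative sketch rather than the repository's exact benchmark kernel.

// Naive GEMM baseline: one thread per output element, no data reuse.
// Illustrative sketch; the repository's benchmark kernel may differ in detail.
__global__ void naiveGemm(float *C, const float *A, const float *B,
                          int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];   // 2*K global loads per output element
        C[row * N + col] = acc;
    }
}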

Major GPU Kernels in Modern Computing

Contemporary GPU workloads rely on several fundamental kernel patterns. Understanding these building blocks is essential for effective optimization:

1. Matrix Multiplication (GEMM)

The foundation of machine learning, scientific computing, and linear algebra. Both H100 and MI300 include specialized hardware (Tensor Cores and Matrix Cores) explicitly designed for this workload.

2. Convolution

Essential for computer vision and CNNs. Modern implementations use Winograd algorithms or FFT-based approaches to reduce arithmetic operations.
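
For orientation, a direct (non-Winograd) single-channel 2D convolution looks like the sketch below; the kernel name and layout are illustrative assumptions, and production implementations in cuDNN/MIOpen use far more sophisticated formulations.

// Direct "valid" 2D convolution (no padding): one thread per output pixel.
// Illustrative sketch only; library kernels use Winograd/FFT/implicit-GEMM formulations.
__global__ void conv2dNaive(float *out, const float *in, const float *filt,
                            int H, int W, int R, int S) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int outH = H - R + 1, outW = W - S + 1;
    if (x < outW && y < outH) {
        float acc = 0.0f;
        for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
                acc += in[(y + r) * W + (x + s)] * filt[r * S + s];
        out[y * outW + x] = acc;
    }
}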

3. Element-wise Operations

Simple arithmetic applied to each element in tensors (e.g., activation functions in neural networks). While seemingly simple, these kernels become critical in deep learning pipelines with large batch sizes.
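
A representative element-wise kernel is sketched below as a fused bias-add plus ReLU with a grid-stride loop (the function name and layout are illustrative assumptions). Such kernels are purely bandwidth-bound, so the optimization goal is simply to stream HBM efficiently.

// Element-wise kernel: fused bias-add + ReLU using a grid-stride loop.
// Bandwidth-bound: performance is governed by how efficiently HBM is streamed.
__global__ void biasRelu(float *y, const float *x, const float *bias, int n, int features) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
        float v = x[i] + bias[i % features];   // assumes a flat [batch, features] layout
        y[i] = v > 0.0f ? v : 0.0f;
    }
}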

4. Reduction Kernels

Operations like sum, min/max, and dot products that combine array elements. These require careful load balancing across thousands of threads.
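
A common structure is sketched below: each thread accumulates a grid-stride partial sum, warps reduce with shuffle instructions, and one atomic per block combines the results. The sketch assumes the block size is a multiple of 32 and that the result buffer is zeroed before launch.

// Reduction: per-block sum using warp shuffles, then one atomic per block.
// Assumes blockDim.x is a multiple of 32; *result must be zeroed before launch.
__global__ void reduceSum(float *result, const float *in, int n) {
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        v += in[i];                                     // grid-stride partial sum per thread

    for (int offset = 16; offset > 0; offset >>= 1)     // warp-level tree reduction
        v += __shfl_down_sync(0xffffffff, v, offset);

    __shared__ float warpSums[32];                      // one slot per warp in the block
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (lane == 0) warpSums[warp] = v;
    __syncthreads();

    if (warp == 0) {                                    // first warp reduces the warp sums
        v = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(result, v);            // one atomic per block
    }
}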

5. Sparse Operations

Efficiently processing sparse matrices/vectors common in graph neural networks, recommendation systems, and scientific simulations.
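
A minimal scalar CSR sparse matrix-vector product is sketched below (one thread per row, illustrative naming); the irregular gathers through colIdx are what make these kernels latency-bound rather than compute-bound.

// Sparse matrix-vector product in CSR format: one thread per row.
// Simple scalar-CSR sketch; cuSPARSE/rocSPARSE use more advanced load-balancing schemes.
__global__ void spmvCsr(float *y, const int *rowPtr, const int *colIdx,
                        const float *vals, const float *x, int numRows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float acc = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            acc += vals[j] * x[colIdx[j]];   // irregular gathers dominate the runtime
        y[row] = acc;
    }
}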

6. Batched Operations

Processing multiple small problems simultaneously, leveraging the massive parallelism of GPUs while maintaining cache efficiency.
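
Rather than launching thousands of tiny kernels, batched work is usually expressed through strided-batched library calls. The host-side sketch below uses cuBLAS's cublasSgemmStridedBatched and assumes the batch is stored contiguously (buffer names are placeholders).

// Batched GEMM via the cuBLAS strided-batched API: many small GEMMs in one launch.
// Sketch; d_A/d_B/d_C hold `batch` contiguous m x k / k x n / m x n column-major matrices.
#include <cublas_v2.h>

void batchedGemm(cublasHandle_t handle, float *d_C, const float *d_A, const float *d_B,
                 int m, int n, int k, int batch) {
    const float alpha = 1.0f, beta = 0.0f;
    // Strides jump from one matrix in the batch to the next
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k,
                              &alpha,
                              d_A, m, (long long)m * k,
                              d_B, k, (long long)k * n,
                              &beta,
                              d_C, m, (long long)m * n,
                              batch);
}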

Performance Challenges with H100 and MI300

The H100 (Hopper architecture) and MI300 (CDNA 3 architecture) introduce new optimization challenges alongside their advanced features:

1. Memory Hierarchy Complexity

Both GPUs feature multi-level memory systems: H100 with 50MB of L2 cache and 80GB of HBM3 (3.35TB/s), MI300X with 256MB of Infinity Cache (its last-level cache) and 192GB of HBM3 (5.3TB/s). Effectively utilizing this hierarchy requires sophisticated data reuse strategies.
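
One concrete lever on the H100 side is the CUDA L2 persistence API, which lets you pin a hot, reused buffer in a reserved slice of L2. The sketch below assumes a placeholder buffer d_lookup of size bytes that subsequent kernels read repeatedly.

// Pin a frequently reused buffer in a slice of the H100's 50MB L2 cache.
// Sketch using the CUDA access-policy-window API; d_lookup/bytes are placeholder names.
#include <algorithm>
#include <cuda_runtime.h>

void pinInL2(cudaStream_t stream, void *d_lookup, size_t bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Reserve a portion of L2 for persisting accesses (capped by the hardware limit)
    size_t setAside = std::min(bytes, (size_t)prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAside);

    // Mark the buffer's address range as persisting: hits stay resident, misses stream through
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr = d_lookup;
    attr.accessPolicyWindow.num_bytes = std::min(bytes, (size_t)prop.accessPolicyMaxWindowSize);
    attr.accessPolicyWindow.hitRatio = 1.0f;
    attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}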

2. Specialized Hardware Utilization

H100's Tensor Cores support FP8, BF16, FP16, and TF32 precisions with different throughput characteristics. MI300's Matrix Cores have their own precision-specific capabilities. Maximizing utilization requires careful type selection and kernel design.

3. Thread Block Scheduling

H100's GPCs (Graphics Processing Clusters) and MI300's Shader Engines have different execution resource allocations. Suboptimal block sizes can leave significant compute resources idle.
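
A quick way to avoid pathological block sizes is to ask the runtime for an occupancy-maximizing configuration, as in the sketch below (myKernel is a placeholder for any kernel in your application).

// Ask the CUDA runtime for an occupancy-maximizing launch configuration.
// Sketch: myKernel is a placeholder __global__ function.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data, int n) { /* placeholder kernel body */ }

void printSuggestedLaunch(int n) {
    int minGridSize = 0, blockSize = 0;
    // Block size (and minimum grid size) that maximize theoretical occupancy for myKernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("Suggested launch: %d blocks of %d threads\n", gridSize, blockSize);
}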

4. Power and Thermal Constraints

With 700W (H100) and 750W (MI300X) TDPs, these GPUs often operate under power caps, requiring optimization strategies that maximize performance per watt.

5. Interconnect Overhead

In multi-GPU systems (common with these data center GPUs), PCIe 5.0 and NVLink (H100) or Infinity Fabric (MI300) communication introduces new optimization considerations.

Optimization Strategies: H100 and MI300 Approaches

While core optimization principles apply across platforms, effectively utilizing H100 and MI300 requires architecture-specific techniques:

1. Hardware Acceleration for Matrix Operations

Both GPUs feature specialized matrix multiplication units that deliver the majority of their compute throughput.

NVIDIA H100: Tensor Core Utilization with CUDA C++
// H100: Tensor Core-accelerated matrix multiplication (FP16 inputs, FP32 accumulation)
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Direct Tensor Core access through the WMMA API (lower level than cuBLAS).
// Assumes M, N, K are multiples of 16, A is row-major MxK, B is column-major KxN
// (leading dimension K), and one warp per 16x16 tile of C, e.g. a block of (128, 4)
// threads covering a 4x4 grid of warp tiles.
__global__ void h100TensorCoreGEMM(float *C, const half *A, const half *B,
                                   int M, int N, int K) {
    // Identify this warp's 16x16 output tile
    const int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;  // tile row
    const int warpN = blockIdx.y * blockDim.y + threadIdx.y;               // tile column

    // Warp-level matrix fragments held in Tensor Core registers
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Initialize accumulator
    wmma::fill_fragment(c_frag, 0.0f);

    // March along K, issuing one 16x16x16 Tensor Core MMA per step
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + (warpM * 16) * K + k, K);
        wmma::load_matrix_sync(b_frag, B + (warpN * 16) * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Store the accumulated FP32 tile
    wmma::store_matrix_sync(C + (warpM * 16) * N + warpN * 16, c_frag, N, wmma::mem_row_major);
}
AMD MI300: Matrix Core Utilization with HIP
// MI300: Matrix Core-accelerated matrix multiplication (FP16 inputs, FP32 accumulation)
#include <hip/hip_runtime.h>

// Direct matrix-core access via MFMA compiler intrinsics (lower level than rocBLAS/hipBLASLt).
// 4-wide vector types expected by the MFMA builtins:
typedef _Float16 half4 __attribute__((ext_vector_type(4)));
typedef float floatx4 __attribute__((ext_vector_type(4)));

// One 64-lane wavefront (launched as a 16x4 block) computes one 16x16 tile of C.
// Assumes M, N, K are multiples of 16, A is row-major MxK, B is row-major KxN,
// and a grid of dim3(N/16, M/16) blocks of dim3(16, 4) threads.
__global__ void mi300MatrixCoreGEMM(float *C, const _Float16 *A, const _Float16 *B,
                                    int M, int N, int K) {
    const int tileM = blockIdx.y * 16;   // row offset of this wavefront's C tile
    const int tileN = blockIdx.x * 16;   // column offset of this wavefront's C tile
    const int lane  = threadIdx.x;       // 0..15: row of A / column of B and C for this lane
    const int kSub  = threadIdx.y * 4;   // 0,4,8,12: this lane's K sub-group per MFMA issue

    floatx4 c = {0.0f, 0.0f, 0.0f, 0.0f};   // FP32 accumulator fragment

    // March along K, issuing one 16x16x16 MFMA per step
    for (int k = 0; k < K; k += 16) {
        half4 a, b;
        #pragma unroll
        for (int i = 0; i < 4; ++i) {
            a[i] = A[(tileM + lane) * K + k + kSub + i];   // 4 consecutive K elements of A
            b[i] = B[(k + kSub + i) * N + tileN + lane];   // 4 consecutive K elements of B
        }
        // v_mfma_f32_16x16x16_f16 executed on the MI300 matrix cores
        c = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, c, 0, 0, 0);
    }

    // Each lane owns 4 rows of one column of the 16x16 result tile
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        C[(tileM + kSub + i) * N + tileN + lane] = c[i];
    }
}

2. Memory Optimization for HBM3

Both GPUs feature high-bandwidth memory that requires careful access patterns to saturate:

CUDA: H100 HBM3 Optimization with Shared Memory Banking
// H100: Optimized shared memory usage avoiding bank conflicts
__global__ void h100OptimizedStencil3D(float *out, const float *in,
                                       int width, int height, int depth) {
    // 32x32x4 tile with a one-element halo in each dimension; threadIdx.x indexes the
    // fastest-varying (unit-stride) dimension so each warp touches 32 distinct banks
    __shared__ float tile[6][34][34];

    // Global indices for this thread's element
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    int z = blockIdx.z * 4 + threadIdx.z;

    // Shared-memory indices, shifted by one for the halo
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1, tz = threadIdx.z + 1;

    // Load the tile interior with boundary checks
    if (x < width && y < height && z < depth) {
        tile[tz][ty][tx] = in[z * width * height + y * width + x];

        // Load the x-direction halo cells
        if (threadIdx.x == 0 && x > 0)
            tile[tz][ty][0] = in[z * width * height + y * width + (x - 1)];
        if (threadIdx.x == 31 && x < width - 1)
            tile[tz][ty][33] = in[z * width * height + y * width + (x + 1)];
        // Similar halo handling for the y and z dimensions (elided for brevity)...
    }

    __syncthreads();

    // 6-point 3D stencil on interior points, read entirely from shared memory
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1 && z > 0 && z < depth - 1) {
        out[z * width * height + y * width + x] = 0.125f * (
            tile[tz][ty][tx + 1] + tile[tz][ty][tx - 1] +
            tile[tz][ty + 1][tx] + tile[tz][ty - 1][tx] +
            tile[tz + 1][ty][tx] + tile[tz - 1][ty][tx]
        );
    }
}
HIP: MI300 HBM3 Optimization with Cache Usage Hints
// MI300: Optimized memory access with streaming (nontemporal) cache hints
__global__ void mi300OptimizedStencil3D(float * __restrict__ out,
                                        const float * __restrict__ in,
                                        int width, int height, int depth) {
    // Same padded tile layout as the H100 version: unit-stride x avoids LDS bank conflicts
    __shared__ float tile[6][34][34];

    // Global indices for this thread's element
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    int z = blockIdx.z * 4 + threadIdx.z;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1, tz = threadIdx.z + 1;

    const size_t global_idx = (size_t)z * width * height + (size_t)y * width + x;

    // Load the tile interior; input values are reused by neighbouring threads,
    // so let them be cached normally in L2
    if (x < width && y < height && z < depth) {
        tile[tz][ty][tx] = in[global_idx];

        // Load the x-direction halo cells
        if (threadIdx.x == 0 && x > 0)
            tile[tz][ty][0] = in[global_idx - 1];
        if (threadIdx.x == 31 && x < width - 1)
            tile[tz][ty][33] = in[global_idx + 1];
        // Similar halo handling for the y and z dimensions (elided for brevity)...
    }

    __syncthreads();

    // 6-point stencil from shared memory; each result is written once and never re-read,
    // so a nontemporal (streaming) store keeps it from displacing reusable data in the caches
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1 && z > 0 && z < depth - 1) {
        float result = 0.125f * (
            tile[tz][ty][tx + 1] + tile[tz][ty][tx - 1] +
            tile[tz][ty + 1][tx] + tile[tz][ty - 1][tx] +
            tile[tz + 1][ty][tx] + tile[tz - 1][ty][tx]
        );
        __builtin_nontemporal_store(result, &out[global_idx]);
    }
}

3. Advanced Platform-Specific Techniques

NVIDIA H100 Specialized Optimizations
  • FP8 Precision: Use __nv_fp8_e4m3 and __nv_fp8_e5m2 types for AI workloads, doubling throughput compared to FP16
  • Thread Block Clustering: the __cluster_dims__(2,1,1) attribute groups thread blocks into clusters that are co-scheduled on one GPC and can access each other's distributed shared memory
  • Asynchronous Copy Engines: Utilize cudaMemcpyAsync with multiple streams to overlap computation and HBM3 transfers (see the double-buffering sketch after this list)
  • NVLink P2P Communication: Direct GPU-to-GPU communication using cudaDeviceEnablePeerAccess in multi-GPU systems
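
As referenced in the asynchronous-copy bullet above, a minimal double-buffering sketch: chunks alternate between two streams so the copy engines and SMs stay busy simultaneously. The process kernel and buffer names are placeholders, and the host buffer must be pinned (cudaMallocHost) for the copies to overlap.

// Overlap HBM3 transfers with compute by double-buffering across two streams.
// Sketch: process is a placeholder kernel; h_in must be pinned host memory.
#include <cuda_runtime.h>

__global__ void process(float *chunk, int n) { /* placeholder compute kernel */ }

void pipelinedTransfer(float *h_in, float *d_buf[2], int chunkElems, int numChunks) {
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    for (int c = 0; c < numChunks; ++c) {
        int s = c % 2;   // alternate buffers/streams so copies and kernels overlap
        cudaMemcpyAsync(d_buf[s], h_in + (size_t)c * chunkElems,
                        chunkElems * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunkElems + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], chunkElems);
    }
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}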
AMD MI300 Specialized Optimizations
  • Matrix Core Programming: Use __builtin_amdgcn_mfma_* intrinsics for direct access to matrix multiplication units
  • Infinity Fabric Optimization: enable peer-to-peer access with hipDeviceEnablePeerAccess so GPU-to-GPU transfers ride the Infinity Fabric links directly
  • Cache Usage Hints: streaming loads and stores via __builtin_nontemporal_load and __builtin_nontemporal_store keep single-use data from displacing reusable data in the caches
  • Unified Addressing: hipMallocManaged and HIP's unified virtual addressing let a single pointer space span the host and multiple MI300 GPUs

4. Modern Profiling Workflows

Effective optimization requires understanding hardware utilization with modern profiling tools:

NVIDIA H100 Profiling with Nsight Compute 2025.2 and Nsight Systems
# Detailed kernel analysis with H100-specific metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed,\
lts__throughput.avg.pct_of_peak_sustained_elapsed ./h100_application

# Multi-GPU timeline analysis (CUDA API, kernels, OS runtime, CPU sampling)
nsys profile --trace=cuda,nvtx,osrt --sample=cpu --cuda-memory-usage=true \
--duration=60 -o h100_multi_gpu ./h100_multi_gpu_app

# Pipeline (including Tensor Core) utilization breakdown for GEMM kernels
ncu --section ComputeWorkloadAnalysis --kernel-name regex:gemm ./h100_app
AMD MI300 Profiling with ROCm 7.0 Tools
# Kernel time breakdown and HIP API trace
rocprof --hip-trace --stats ./mi300_application

# Hardware counters (e.g. wavefront count, L2 hit rate, HBM traffic) listed in a counter file
#   counters.txt:  pmc : SQ_WAVES L2CacheHit FetchSize WriteSize
rocprof -i counters.txt -o mi300_counters.csv ./mi300_application

# Multi-GPU communication analysis with HIP/HSA activity and ROCTX ranges
rocprof --hip-trace --hsa-trace --roctx-trace -d mi300_profile ./mi300_multi_gpu_app

# Open the generated results.json timeline in a Perfetto / chrome://tracing viewer

Production-Grade Implementation Considerations

For enterprise deployments on H100 and MI300, additional factors become critical:

Precision Selection

Choose between FP8, BF16, FP16, TF32, and FP32 based on your accuracy requirements and performance needs. On both GPUs, FP8 roughly doubles FP16 matrix throughput, pushing peak AI compute well into the petaflop range, which is why low-precision formats dominate modern training and inference recipes.
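
In code, precision selection often comes down to the storage and compute types handed to the math libraries. The sketch below uses cublasGemmEx with FP16 inputs and FP32 accumulation (buffer names are placeholders); switching these enums is how you trade accuracy for throughput.

// Precision selection in practice: FP16 storage with FP32 accumulation via cublasGemmEx.
// Sketch; d_A/d_B/d_C are placeholder device buffers in column-major layout.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemmFp16Fp32Acc(cublasHandle_t handle, float *d_C, const __half *d_A, const __half *d_B,
                     int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,          // FP16 inputs
                 d_B, CUDA_R_16F, k,
                 &beta,
                 d_C, CUDA_R_32F, m,          // FP32 output
                 CUBLAS_COMPUTE_32F,          // accumulate in FP32 on the Tensor Cores
                 CUBLAS_GEMM_DEFAULT);
}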

Power Efficiency

These GPUs frequently run under power caps below their rated TDPs (for example, 500W on H100 or 450W on MI300); in that regime, optimize for performance per watt rather than peak performance. This often means balancing compute utilization with memory bandwidth.
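
To optimize performance per watt you first need to see the power behavior. The sketch below queries the current draw and the enforced cap through NVML (link with -lnvidia-ml); AMD exposes the equivalent data through rocm-smi and the AMD SMI library.

// Report current power draw vs. the enforced power cap via NVML.
// Host-side sketch; link with -lnvidia-ml.
#include <nvml.h>
#include <cstdio>

void reportPower(unsigned int gpuIndex) {
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(gpuIndex, &dev);

    unsigned int drawMw = 0, capMw = 0;
    nvmlDeviceGetPowerUsage(dev, &drawMw);            // current draw in milliwatts
    nvmlDeviceGetEnforcedPowerLimit(dev, &capMw);     // active cap in milliwatts

    printf("GPU %u: %.0f W of %.0f W cap\n", gpuIndex, drawMw / 1000.0, capMw / 1000.0);
    nvmlShutdown();
}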

Multi-GPU Scaling

Leverage NVLink 4.0 (H100) or Infinity Fabric (MI300) for multi-GPU communication. Use techniques like tensor parallelism for large models and data parallelism for high-throughput inference.
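
The sketch below shows the basic CUDA peer-to-peer pattern: enable peer access so copies take the direct NVLink path, then issue cudaMemcpyPeerAsync (buffer names are placeholders). HIP's hipDeviceEnablePeerAccess and hipMemcpyPeerAsync follow the same pattern over Infinity Fabric.

// Direct GPU-to-GPU transfer: src buffer lives on device 0, dst buffer on device 1.
// Sketch with placeholder buffers; error checking omitted for brevity.
#include <cuda_runtime.h>

void peerCopy(float *d_dst_on1, const float *d_src_on0, size_t bytes, cudaStream_t stream) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can device 1 access device 0's memory?
    if (canAccess) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);        // route the copy directly over NVLink
    }
    // Falls back to staging through host memory if peer access is unavailable
    cudaMemcpyPeerAsync(d_dst_on1, 1, d_src_on0, 0, bytes, stream);
}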

Software Ecosystem

Utilize optimized libraries as building blocks:

  • NVIDIA: cuBLAS 12.x, cuDNN 9.x, TensorRT 10.x
  • AMD: rocBLAS 3.0+, hipBLASLt, MIOpen 3.0+, and vLLM's ROCm backend for LLM inference

Update your repository with git pull to access the H100 and MI300 optimization examples, including benchmarking scripts that measure key metrics like tensor core utilization, memory bandwidth, and power efficiency. Share your results in our discussion forum—we're particularly interested in cross-platform performance comparisons and novel optimization techniques for these flagship GPUs.