The gpu-programming-101 repository has been updated to focus on the latest data center GPUs from NVIDIA and AMD. This advanced guide explores cross-platform optimization techniques that leverage the unique capabilities of these architectures while keeping code patterns portable. The updated 04_advanced_optimization module includes CUDA 13+ and ROCm 7.x examples, with special attention to Tensor Cores, Matrix Cores, and HBM3 memory systems.
Why Optimization Matters (With H100/MI300 Benchmarks)
Modern GPUs like the H100 and MI300 offer unprecedented compute density, but realizing their full potential requires architecture-aware programming. A naive matrix multiplication kernel demonstrates the performance gap (a sketch of such a kernel follows these figures):
- ~450 GFLOPS on NVIDIA H100 (CUDA 13.x)
- ~420 GFLOPS on AMD MI300X (ROCm 7.x)
With optimized implementations leveraging hardware acceleration features, we achieved:
- 19.2 TFLOPS on H100 (using Tensor Cores with FP16)
- 16.8 TFLOPS on MI300X (using Matrix Cores with FP16)
Major GPU Kernels in Modern Computing
Contemporary GPU workloads rely on several fundamental kernel patterns. Understanding these building blocks is essential for effective optimization:
1. Matrix Multiplication (GEMM)
The foundation of machine learning, scientific computing, and linear algebra. Both H100 and MI300 include specialized hardware (Tensor Cores and Matrix Cores) explicitly designed for this workload.
2. Convolution
Essential for computer vision and CNNs. Modern implementations use Winograd algorithms or FFT-based approaches to reduce arithmetic operations.
3. Element-wise Operations
Simple arithmetic applied to each element in tensors (e.g., activation functions in neural networks). While seemingly simple, these kernels become critical in deep learning pipelines with large batch sizes.
4. Reduction Kernels
Operations like sum, min/max, and dot products that combine array elements. These require careful load balancing across thousands of threads (a warp-shuffle reduction sketch follows this list).
5. Sparse Operations
Efficiently processing sparse matrices/vectors common in graph neural networks, recommendation systems, and scientific simulations.
6. Batched Operations
Processing multiple small problems simultaneously, leveraging the massive parallelism of GPUs while maintaining cache efficiency.
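As an example of pattern 4, here is a minimal block-level sum reduction sketch using warp shuffles. The kernel name and launch assumptions (blockDim.x a multiple of 32, *out initialized to zero before launch) are illustrative, not taken from the repository:
// Sum-reduce n floats: warp-shuffle reduction, then a shared-memory combine per block
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float warpSums[32];                       // one partial sum per warp
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    float v = (idx < n) ? in[idx] : 0.0f;
    for (int offset = 16; offset > 0; offset >>= 1)      // reduce within each warp
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (lane == 0) warpSums[warp] = v;
    __syncthreads();
    if (warp == 0) {                                      // first warp reduces the partials
        v = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);                 // combine results across blocks
    }
}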
Performance Challenges with H100 and MI300
The H100 (Hopper architecture) and MI300 (CDNA 3 architecture) introduce new optimization challenges alongside their advanced features:
1. Memory Hierarchy Complexity
Both GPUs feature multi-level memory systems: the H100 pairs a 50MB L2 cache with 80GB of HBM3 (3.35TB/s), while the MI300X pairs a 256MB Infinity Cache with 192GB of HBM3 (5.3TB/s). Effectively utilizing this hierarchy requires sophisticated data-reuse strategies and overlapping transfers with compute (a simple stream-overlap sketch follows this list).
2. Specialized Hardware Utilization
H100's Tensor Cores support FP8, BF16, FP16, and TF32 precisions with different throughput characteristics. MI300's Matrix Cores have their own precision-specific capabilities. Maximizing utilization requires careful type selection and kernel design.
3. Thread Block Scheduling
H100's GPCs (Graphics Processing Clusters) and MI300's Shader Engines have different execution resource allocations. Suboptimal block sizes can leave significant compute resources idle.
4. Power and Thermal Constraints
With 700W (H100 SXM) and 750W (MI300X) TDPs, these GPUs often operate under power caps, requiring optimization strategies that maximize performance per watt.
5. Interconnect Overhead
In multi-GPU systems (common with these data center GPUs), PCIe 5.0 and NVLink (H100) or Infinity Fabric (MI300) communication introduces new optimization considerations.
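As a concrete illustration of hiding transfer latency behind compute, the host-side sketch below overlaps HBM transfers with kernel execution using two CUDA streams. Buffer names, the process kernel, and the chunking scheme are assumptions for illustration; h_in and h_out should be pinned host memory for the async copies to overlap:
// Double-buffered streaming: copy chunk c while chunk c-1 is still computing
cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
for (int c = 0; c < numChunks; ++c) {
    int i = c % 2;   // alternate between the two streams
    cudaMemcpyAsync(d_in + c * chunk, h_in + c * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
    process<<<grid, block, 0, s[i]>>>(d_out + c * chunk, d_in + c * chunk, chunk);
    cudaMemcpyAsync(h_out + c * chunk, d_out + c * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
}
for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);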
Optimization Strategies: H100 and MI300 Approaches
While core optimization principles apply across platforms, effectively utilizing H100 and MI300 requires architecture-specific techniques:
1. Hardware Acceleration for Matrix Operations
Both GPUs feature specialized matrix multiplication units that deliver the majority of their compute throughput.
NVIDIA H100: Tensor Core Utilization with CUDA C++
// H100: Tensor Core-accelerated matrix multiplication (FP16 inputs, FP32 accumulation)
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;
// Direct Tensor Core access via the WMMA API (lower level than cuBLAS).
// A is MxK row-major, B is KxN column-major, C is MxN row-major with float output.
// Assumes M, N, K are multiples of 16; launch with one warp per 16x16 output tile
// and blockDim.x a multiple of 32.
__global__ void h100TensorCoreGEMM(float *C, const half *A, const half *B,
                                   int M, int N, int K) {
    // Identify which 16x16 tile of C this warp owns
    const int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    if (warpM * 16 >= M || warpN * 16 >= N) return;
    // Declare the WMMA fragments (16x16x16 tiles held in registers)
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    // Initialize accumulator
    wmma::fill_fragment(c_frag, 0.0f);
    // March along K, issuing one 16x16x16 Tensor Core MMA per step
    for (int k = 0; k < K; k += 16) {
        const half *a_tile = A + (warpM * 16) * K + k;   // row-major A, leading dimension K
        const half *b_tile = B + (warpN * 16) * K + k;   // column-major B, leading dimension K
        wmma::load_matrix_sync(a_frag, a_tile, K);
        wmma::load_matrix_sync(b_frag, b_tile, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    // Store the accumulated 16x16 tile back to global memory
    wmma::store_matrix_sync(C + (warpM * 16) * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
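For production GEMMs, the cuBLAS library usually outperforms a hand-written WMMA kernel. The call below is a minimal sketch, assuming device pointers d_A, d_B, d_C and dimensions M, N, K are already set up in the column-major layout cuBLAS expects:
#include <cublas_v2.h>
#include <cuda_fp16.h>
// FP16 GEMM through cuBLAS: C = A * B; Tensor Cores are used automatically on H100
cublasHandle_t handle;
cublasCreate(&handle);
const __half alpha = __float2half(1.0f);
const __half beta  = __float2half(0.0f);
cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
            &alpha, d_A, M,    // A: MxK, leading dimension M
                    d_B, K,    // B: KxN, leading dimension K
            &beta,  d_C, M);   // C: MxN, leading dimension M
cublasDestroy(handle);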
AMD MI300: Matrix Core Utilization with HIP
// MI300: Matrix Core-accelerated matrix multiplication (FP16 inputs, FP32 accumulation)
#include <hip/hip_runtime.h>
// Vector types expected by the MFMA compiler builtins
typedef _Float16 half4_t  __attribute__((ext_vector_type(4)));
typedef float    float4_t __attribute__((ext_vector_type(4)));
// Direct Matrix Core access via the v_mfma_f32_16x16x16_f16 instruction builtin.
// One 64-lane wavefront (launched as 16x4 threads) computes one 16x16 tile of C.
// A is MxK row-major, B is KxN row-major, C is MxN row-major; M, N, K multiples of 16.
// The lane-to-element mapping below follows AMD's published matrix-core examples;
// consult the CDNA3 ISA / matrix instruction calculator for the authoritative layout.
__global__ void mi300MatrixCoreGEMM(float *C, const _Float16 *A, const _Float16 *B,
                                    int M, int N, int K) {
    const int tileM = blockIdx.x * 16;   // first row of this wavefront's C tile
    const int tileN = blockIdx.y * 16;   // first column of this wavefront's C tile
    half4_t  a;
    half4_t  b;
    float4_t acc = {0.0f, 0.0f, 0.0f, 0.0f};   // accumulator registers
    // March along K, issuing one 16x16x16 MFMA per step
    for (int k = 0; k < K; k += 16) {
        // Each lane loads 4 elements of A (fixed row, 4 consecutive k values)
        // and 4 elements of B (fixed column, 4 consecutive k values)
        #pragma unroll
        for (int i = 0; i < 4; ++i) {
            const int kk = k + threadIdx.y * 4 + i;
            a[i] = A[(tileM + threadIdx.x) * K + kk];
            b[i] = B[kk * N + (tileN + threadIdx.x)];
        }
        // Perform the matrix multiply-accumulate on the AMD Matrix Cores
        acc = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, acc, 0, 0, 0);
    }
    // Each lane owns 4 elements of the output tile (4 rows of one column)
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        const int row = tileM + threadIdx.y * 4 + i;
        C[row * N + (tileN + threadIdx.x)] = acc[i];
    }
}
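A minimal, illustrative launch for the sketch above (one wavefront per block; the grid assumes M and N are multiples of 16, and d_A, d_B, d_C are device pointers set up elsewhere):
// One 64-lane wavefront per block, arranged as 16x4 threads; one block per C tile
dim3 block(16, 4);
dim3 grid(M / 16, N / 16);
hipLaunchKernelGGL(mi300MatrixCoreGEMM, grid, block, 0, 0, d_C, d_A, d_B, M, N, K);
hipDeviceSynchronize();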
2. Memory Optimization for HBM3
Both GPUs feature high-bandwidth memory that requires careful access patterns to saturate:
CUDA: H100 HBM3 Optimization with Shared Memory Banking
// H100: Optimized shared memory usage avoiding bank conflicts
// Launch with blockDim = (16, 16, 4), i.e. 1,024 threads per block.
__global__ void h100OptimizedStencil3D(float *out, const float *in,
                                       int width, int height, int depth) {
    // 16x16x4 tile with a one-element halo in every dimension; the padded row
    // length also keeps neighboring threads on different shared memory banks
    __shared__ float tile[6][18][18];   // indexed [z][y][x]
    // Calculate global indices
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    int z = blockIdx.z * 4  + threadIdx.z;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1, tz = threadIdx.z + 1;
    // Load the interior of the tile with boundary checks
    if (x < width && y < height && z < depth) {
        tile[tz][ty][tx] = in[(size_t)z * width * height + y * width + x];
        // Load the x-direction halo
        if (threadIdx.x == 0 && x > 0)
            tile[tz][ty][0]  = in[(size_t)z * width * height + y * width + (x - 1)];
        if (threadIdx.x == 15 && x < width - 1)
            tile[tz][ty][17] = in[(size_t)z * width * height + y * width + (x + 1)];
        // Halo loads for the y and z directions are analogous and elided here
    }
    __syncthreads();
    // Compute the 6-point stencil from shared memory (no bank conflicts)
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1 && z > 0 && z < depth - 1) {
        out[(size_t)z * width * height + y * width + x] = 0.125f * (
            tile[tz][ty][tx + 1] + tile[tz][ty][tx - 1] +
            tile[tz][ty + 1][tx] + tile[tz][ty - 1][tx] +
            tile[tz + 1][ty][tx] + tile[tz - 1][ty][tx]
        );
    }
}
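A matching launch configuration for the kernel above; the grid uses ceiling division so partial tiles at the domain edges are still covered (d_out and d_in are assumed device pointers):
dim3 block(16, 16, 4);   // 1,024 threads, the per-block limit
dim3 grid((width  + 15) / 16,
          (height + 15) / 16,
          (depth  + 3)  / 4);
h100OptimizedStencil3D<<<grid, block>>>(d_out, d_in, width, height, depth);
cudaDeviceSynchronize();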
HIP: MI300 HBM3 Optimization with Non-temporal Memory Access
// MI300: Optimized memory access using Clang's non-temporal load/store builtins,
// which lower to cache-bypassing hints on CDNA GPUs. Launch with blockDim = (16, 16, 4).
__global__ void mi300OptimizedStencil3D(float * __restrict__ out,
                                        const float * __restrict__ in,
                                        int width, int height, int depth) {
    // Padded shared memory tile, as in the CUDA version
    __shared__ float tile[6][18][18];   // indexed [z][y][x]
    // Calculate global indices
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    int z = blockIdx.z * 4  + threadIdx.z;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1, tz = threadIdx.z + 1;
    const size_t global_idx = (size_t)z * width * height + (size_t)y * width + x;
    if (x < width && y < height && z < depth) {
        // Each input element is read once per block and then reused from shared
        // memory, so a non-temporal (streaming) load keeps it from displacing
        // other data in the caches on its way in from HBM3
        tile[tz][ty][tx] = __builtin_nontemporal_load(&in[global_idx]);
        // Halo loads for the x, y, and z directions follow the CUDA version
        // and are elided here
    }
    __syncthreads();
    // Compute the 6-point stencil from shared memory
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1 && z > 0 && z < depth - 1) {
        float result = 0.125f * (
            tile[tz][ty][tx + 1] + tile[tz][ty][tx - 1] +
            tile[tz][ty + 1][tx] + tile[tz][ty - 1][tx] +
            tile[tz + 1][ty][tx] + tile[tz - 1][ty][tx]
        );
        // The result is not re-read by this kernel, so stream it past the cache
        __builtin_nontemporal_store(result, &out[global_idx]);
    }
}
3. Advanced Platform-Specific Techniques
NVIDIA H100 Specialized Optimizations
- FP8 Precision: Use the __nv_fp8_e4m3 and __nv_fp8_e5m2 types (from cuda_fp8.h) for AI workloads, roughly doubling Tensor Core throughput compared to FP16
- Thread Block Clustering: The __cluster_dims__ attribute (e.g., __cluster_dims__(2, 1, 1)) groups thread blocks into clusters that cooperate through distributed shared memory; H100 supports up to 8 blocks per cluster portably, 16 with an opt-in
- Asynchronous Copy Engines: Utilize cudaMemcpyAsync with multiple streams to overlap computation and HBM3 transfers
- NVLink P2P Communication: Direct GPU-to-GPU communication using cudaDeviceEnablePeerAccess in multi-GPU systems (see the peer-access sketch after these lists)
AMD MI300 Specialized Optimizations
- Matrix Core Programming: Use the __builtin_amdgcn_mfma_* intrinsics for direct access to the matrix multiplication units
- Infinity Fabric P2P: Enable direct GPU-to-GPU access with hipDeviceCanAccessPeer and hipDeviceEnablePeerAccess for efficient multi-GPU communication
- Cache Control: Use the Clang __builtin_nontemporal_load and __builtin_nontemporal_store builtins to keep streaming data from polluting the caches
- Unified Addressing: Use hipMallocManaged for allocations addressable from the host and from multiple MI300 GPUs
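Both peer-to-peer items above come down to the same pair of runtime calls. A minimal sketch (device IDs are illustrative; the HIP equivalents hipDeviceCanAccessPeer and hipDeviceEnablePeerAccess mirror the CUDA calls):
// Enable direct peer access between GPUs 0 and 1 in both directions
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
}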
4. Modern Profiling Workflows
Effective optimization requires understanding hardware utilization with modern profiling tools:
NVIDIA H100 Profiling with Nsight Compute 2025.2
# Detailed kernel analysis with H100-specific metrics (the Nsight Compute CLI is ncu;
# metric names vary by tool version -- run `ncu --query-metrics` to confirm them)
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
lts__throughput.avg.pct_of_peak_sustained_elapsed,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed \
    ./h100_application
# System-wide timeline for multi-GPU runs (Nsight Systems CLI)
nsys profile --trace=cuda,nvtx,osrt --cuda-memory-usage=true \
    --duration=60 -o h100_multi_gpu_profile ./h100_multi_gpu_app
# Focused analysis of GEMM kernels (the Speed of Light section summarizes compute,
# memory, and Tensor Core utilization)
ncu --section SpeedOfLight --kernel-name regex:gemm ./h100_app
AMD MI300 Profiling with ROCm 7.0 Tools
# HIP API and kernel trace with per-kernel statistics
rocprof --hip-trace --stats ./mi300_application
# Hardware counter collection; counters are listed in an input file -- run
# `rocprof --list-basic` and `rocprof --list-derived` to see what this GPU exposes
rocprof -i counters.txt -o mi300_counters.csv ./mi300_application
# Multi-GPU communication analysis with HSA and rocTX ranges traced
rocprof --hsa-trace --roctx-trace -d mi300_profile ./mi300_multi_gpu_app
# Visualize the generated JSON timeline in a trace viewer such as ui.perfetto.dev
# or chrome://tracing
Production-Grade Implementation Considerations
For enterprise deployments on H100 and MI300, additional factors become critical:
Precision Selection
Choose between FP8, BF16, FP16, TF32, and FP32 based on your accuracy requirements and performance needs. On both GPUs, FP8 roughly doubles the peak matrix throughput of FP16, but its narrower range and precision typically require per-tensor scaling to maintain accuracy. A compute-type selection sketch follows.
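As a sketch of how precision selection surfaces in code, the cuBLAS call below picks the data and compute types explicitly. The handle, device pointers, and dimensions are assumed to be set up elsewhere; FP8 GEMMs additionally go through the cublasLt API with scaling factors:
#include <cublas_v2.h>
// FP16 inputs with FP32 accumulation and output; with FP32 inputs (CUDA_R_32F),
// CUBLAS_COMPUTE_32F_FAST_TF32 instead routes the math through TF32 Tensor Cores
const float alpha = 1.0f, beta = 0.0f;
cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
             &alpha,
             d_A, CUDA_R_16F, M,     // A: FP16, MxK, column-major
             d_B, CUDA_R_16F, K,     // B: FP16, KxN, column-major
             &beta,
             d_C, CUDA_R_32F, M,     // C: FP32 output
             CUBLAS_COMPUTE_32F,     // accumulate in FP32
             CUBLAS_GEMM_DEFAULT);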
Power Efficiency
Under typical power caps (500W for H100, 450W for MI300), optimize for performance-per-watt rather than peak performance. This often means balancing compute utilization with memory bandwidth.
Multi-GPU Scaling
Leverage NVLink 4.0 (H100) or Infinity Fabric (MI300) for multi-GPU communication. Use techniques like tensor parallelism for large models and data parallelism for high-throughput inference.
Software Ecosystem
Utilize optimized libraries as building blocks:
- NVIDIA: cuBLAS 12.x, cuDNN 9.x, TensorRT 10.x
- AMD: rocBLAS 3.0+, hipBLASLt, MIOpen 3.0+, and vLLM with ROCm support
Update your repository with git pull to access the H100 and MI300 optimization examples, including benchmarking scripts that measure key metrics like tensor core utilization, memory bandwidth, and power efficiency. Share your results in our discussion forum—we're particularly interested in cross-platform performance comparisons and novel optimization techniques for these flagship GPUs.