Building and Tuning HPL

This guide walks through building, running, and tuning HPL (High Performance Linpack) on a Slurm cluster such as Muscadine. It assumes:

  • You already have a functioning cluster ✅

  • A working MPI installation ✅

  • Slurm works ✅

  • You are comfortable compiling software and reading performance numbers


What HPL Actually Measures

HPL solves a dense system of linear equations using LU decomposition. In practice, HPL performance is dominated by:

  • DGEMM performance from your BLAS

  • Memory bandwidth and NUMA effects

  • MPI latency and bandwidth

  • Process/thread placement

If DGEMM is slow, HPL will be slow. Everything else is secondary.


Choosing a BLAS Library

This is the single most important decision.

Common BLAS Options

Intel MKL

  • Usually the best performance on Intel CPUs

  • Excellent threading and NUMA behavior

  • Proprietary but free to use

Pros:

  • Usually the fastest

  • Minimal BLAS-specific tuning required

Cons:

  • Muscadine has AMD processors

  • Closed source

We’ll consider other options for Muscadine.

OpenBLAS

  • Open source

  • Works everywhere

  • Performance varies widely by architecture

module load openblas

Pros:

  • Portable

  • Easy to build

Cons:

  • Threading is less predictable

  • NUMA handling is weaker

(AOCL-)BLIS

  • Modular, modern BLAS

  • Very strong on AMD EPYC

module load amdblis

Pros:

  • Excellent EPYC performance

  • Cleaner threading model than OpenBLAS

Cons:

  • Limited hardware support

Vendor BLAS (AOCL, Cray LibSci, etc.)

If your vendor provides a tuned BLAS, use it. They exist for a reason.

Notes on Multi-threading

If you’re unsure of the difference between Multi-Processing and Multi-Threading, read this first.

All Spack-managed BLAS libraries (the ones you access with module load) have multi-threading DISABLED. This means that if you’d like to experiment with hybrid parallelism to gain more performance, you’ll need to build your BLAS library of choice yourself, following its documentation to enable threading.
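
For example, OpenBLAS can be built with OpenMP threading enabled roughly as follows. This is a sketch, not a supported recipe: check the OpenBLAS documentation for your version, and the install prefix below is only an example.

# from an unpacked OpenBLAS source tree
make USE_OPENMP=1 -j$(nproc)              # build with OpenMP threading enabled
make install PREFIX=$HOME/openblas-omp    # example prefix; point HPL's LDFLAGS here instead of $OPENBLAS_ROOT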


MPI Implementation

If BLAS is the single most important performance factor for HPL, inter-process communication bandwidth and, more importantly, latency come second.

Common choices:

  • OpenMPI

  • MPICH

  • Intel MPI

  • Cray MPI

Key requirements:

  • Good support for your interconnect

  • Correct PMI/PMIx integration with Slurm

Muscadine uses a 200 Gbit/s Mellanox InfiniBand interconnect. MPI implementations that support it include Intel MPI, OpenMPI, and MPICH.

Currently, only OpenMPI is provided to Muscadine users. It is built with support for Slurm, PMI/PMIx, InfiniBand, AMD hardware, and so on.

module load openmpi
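
If you want to confirm what the provided OpenMPI build supports, ompi_info lists its configuration and MCA components. The exact component names vary by build, but something like the following is a quick check:

ompi_info | grep -iE 'slurm|pmix|ucx'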

Warning

Please do NOT attempt to build or install a different version of MPI. Building MPI correctly requires a rich understanding of the hardware and how it’s configured. We’ve already done that work for you; please use it.


Building HPL

Get the Source

curl -LO https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
tar xf hpl-2.3.tar.gz
cd hpl-2.3

Configure the Makefile

Since version 2.3 (the latest as of writing), HPL can be built either with static Makefiles or with GNU Autotools. Since Autotools is a lot simpler and somewhat more consistent, we’ll be using that.

./configure --prefix=$PWD/hpl-demo \
  CC=mpicc \
  LDFLAGS="-L$OPENBLAS_ROOT/lib -L$OPENMPI_ROOT/lib" \
  CPPFLAGS="-I$OPENMPI_ROOT/include" \
  CFLAGS="-O2 -march=znver4 -mtune=znver4 -DHPL_PROGRESS_REPORT"
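
(This assumes the openblas and openmpi modules are loaded, so $OPENBLAS_ROOT and $OPENMPI_ROOT are defined in your environment.)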

Build

make -j$(nproc)
make install

The binary is installed at:

hpl-demo/bin/xhpl
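
Before tuning anything, it’s worth a quick smoke test with the small default input that ships in the source tree (testing/ptest/HPL.dat). Paths below assume you’re still in the hpl-2.3 directory and inside a small interactive allocation; the default input runs many tiny problems on 4 processes:

cp testing/ptest/HPL.dat hpl-demo/bin/
cd hpl-demo/bin
srun -n 4 ./xhpl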

Hybrid MPI + OpenMP

Modern clusters rarely run optimally with just one MPI rank per core. Using shared memory (threads) within a node can improve utilization and efficiency, while at scale, thousands to tens of thousands of ranks all communicating over the fabric can significantly reduce performance.

Instead, it’s preferable to use a hybrid approach.

Why Hybrid?

  • Reduces MPI rank count

  • Improves cache reuse

  • Reduces communication overhead

Typical strategy:

  • 1 MPI rank per socket

  • OpenMP threads = cores per socket

Modern chiplet-architecture strategy (Muscadine is one):

  • 1 MPI rank per NUMA domain (CCX in the case of Muscadine)

  • OpenMP threads = cores per CCX

Example (8 CCX, 6 cores/CCX):

#SBATCH --ntasks-per-node 8
#SBATCH --cpus-per-task   6

export OMP_NUM_THREADS=6
export OMP_PROC_BIND=close
export OMP_PLACES=cores

Note

The keen-eyed who know their way around Muscadine might have noticed that \(6 \times 8 = 48\) while Muscadine has \(96\) threads. AMD EPYC and many other processors support Simultaneous Multi-Threading (SMT), otherwise known as Hyper-Threading. While this feature is available, HPL already saturates the execution pipeline well enough that SMT adds little. The scheduler is smart enough to keep each thread on its own physical core.
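
You can confirm the core/thread layout on a node with lscpu; the exact counts depend on the node:

lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)|NUMA node\(s\)'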


HPL.dat Tuning

HPL.dat controls performance-critical parameters.

Problem Size (N)

Problem size is the parameter you’ll adjust the most. HPL is a very memory-bandwidth-intensive application: the more memory you use, the more the LU algorithm will stripe across it, increasing effective bandwidth. Use too little and you won’t saturate your memory bandwidth; use too much and you’ll start swapping. A perfect hero run balances the two.

Rule of thumb:

N ≈ sqrt(0.8 × total_memory_bytes / 8)

Example: 192 GB total RAM

N ≈ sqrt(0.8 × 192e9 / 8) ≈ 138k

Use as much memory as possible without swapping.

If you’d like to play around with it, I’ve created a handy Desmos calculator.
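
If you’d rather compute it in the shell, here is a quick sketch of the same rule of thumb (192e9 is just the example above; substitute your nodes’ usable memory in bytes). In practice N is then usually rounded down to a multiple of NB:

awk 'BEGIN { mem = 192e9; printf "N ~ %d\n", sqrt(0.8 * mem / 8) }'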


Block Size (NB)

Block size is how much of the matrix each chunk works on at a time. This value is very CPU-architecture-specific, and calculating it precisely requires deep knowledge of how the cores are laid out in the CPU. Thankfully, it’s easier to find the value exhaustively. We know that NB should be a multiple of the core count, so simply run a test for each such multiple between 100 and 400, as sketched below.
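
A rough sweep sketch, assuming a 48-core node and a hand-made HPL.dat.template with an __NB__ placeholder on the NBs line (the template and placeholder are not part of HPL; create them yourself). Run it inside a job allocation:

# sweep candidate NB values (multiples of 48 between 100 and 400)
for nb in 144 192 240 288 336 384; do
    sed "s/__NB__/$nb/" HPL.dat.template > HPL.dat
    srun --cpu-bind=cores ./xhpl > "hpl_nb${nb}.out"
done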

Typical values:

  • 128

  • 192

  • 256

  • 384


Process Grid (P × Q)

Choose a grid close to square:

P × Q = number of MPI ranks

Example: 16 ranks

P=4, Q=4

Rule:

  • Q ≥ P

  • Match network topology if possible


Key Parameters Summary

N    : As large as memory allows
NB   : a multiple of the core count between 100 and 400 (benchmark!)
P,Q  : Nearly square
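
For reference, these parameters live in HPL.dat as plain annotated lines. An illustrative excerpt follows; the values are placeholders, not recommendations, sized for 32 ranks (e.g. the hybrid script later in this guide):

1            # of problems sizes (N)
276480       Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
8            Qs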

Slurm Job Script Example

MPI only

#!/bin/bash
#SBATCH -N 4
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH -t 00:30:00
#SBATCH -J hpl

module load openmpi
module load openblas

srun --cpu-bind=cores ./xhpl

MPI + OpenMP

#!/bin/bash
#SBATCH -N 4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH -t 00:30:00
#SBATCH -J hpl

module load openmpi
module load openblas    # substitute your threading-enabled BLAS build (see Notes on Multi-threading)

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close
export OMP_PLACES=cores

srun --cpu-bind=cores ./xhpl

Verify placement:

srun --cpu-bind=verbose ./xhpl

NUMA and Affinity

If NUMA is wrong, performance collapses.

Recommendations:

  • Use numactl --hardware to inspect layout

  • Align MPI ranks with sockets

  • Bind OpenMP threads tightly

Example:

srun --distribution=block:block --cpu-bind=cores ./xhpl

Benchmarking Strategy

Never trust a single run.

Suggested sweep:

  • Fix N

  • Sweep NB = {128,192,256,384}

  • Try different P×Q layouts

  • Record Gflop/s for every run (see the extraction sketch below)

Track:

  • Per-node performance

  • Scaling efficiency
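
HPL prints one result line per test, beginning with a T/V code such as WR11C2R4 (the exact code depends on the algorithm variants in HPL.dat). Assuming output files named hpl_*.out as in the NB sweep sketch above, a simple way to collect N, NB, P, Q, and Gflop/s across runs:

grep '^WR' hpl_*.out | awk '{ print $1, $2, $3, $4, $5, $7 }'   # file:T/V  N  NB  P  Q  Gflop/s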


Interpreting Results

Expected Efficiency

  • 80–90% of theoretical peak is excellent (a peak-estimate sketch follows this list)

  • 70–80% is common

  • <60% usually indicates:

    • Bad BLAS

    • Bad affinity

    • Bad NB

    • N too small or too large
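
These percentages are measured against the node’s theoretical peak, roughly cores × clock × double-precision FLOPs per cycle. A sketch with hypothetical numbers (48 cores, 2.5 GHz, 16 FLOPs/cycle; substitute your CPU’s actual figures):

awk 'BEGIN { cores = 48; ghz = 2.5; fpc = 16; printf "%.0f Gflop/s theoretical peak per node\n", cores * ghz * fpc }'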


Common Mistakes

  • Using tiny N

  • Ignoring NUMA

  • Using default OpenBLAS builds

  • Running one MPI rank per core


Final Advice

HPL is an incredibly nuanced benchmark that requires practical knowledge of every facet of the hardware. Don’t be discouraged if you cannot achieve the same scores as someone else.

If you have any questions, see the resources listed in Getting-Help.