Monitoring Slurm Jobs

Overview

Muscadine provides two ways to monitor your Slurm jobs:

  • jobstats CLI: a command-line tool that prints a summary of CPU, memory, and GPU usage for any completed or running job.

  • Grafana Dashboard (via Open OnDemand): a time-series dashboard, available under the Jobs menu in Open OnDemand, that shows how your job used resources throughout its lifetime.


1. Monitoring Jobs with Jobstats

What it shows

jobstats pulls data from the cluster’s Prometheus server and Slurm’s accounting records to produce a per-job resource efficiency report, including:

  • CPU Utilization

  • Memory Usage

  • GPU Utilization

  • VRAM Usage

  • Wall Time Efficiency

Basic usage

After submitting a Slurm job, make a note of its JobID; it is the only input jobstats needs. Run the command as follows:

jobstats <jobid>

Example:

jobstats 6035

The output looks similar to this:

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 6035
   User/Account: NetID/GroupID
       Job Name: jobstats-demo
          State: COMPLETED
          Nodes: 1
      CPU Cores: 4
     CPU Memory: 4GB (1GB per CPU-core)
           GPUs: 1
  QOS/Partition: normal/muscadine
        Cluster: muscadine
     Start Time: Wed May 13, 2026 at 3:57 PM
       Run Time: 00:02:42
     Time Limit: 00:30:00

                         Overall Utilization
================================================================================
  CPU utilization  [|||||||||||                                    22%]
  CPU memory usage [||||||                                         12%]
  GPU utilization  [||||||||||||||||||||||||||||||                 60%]
  GPU memory usage [                                                1%]

                         Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      muscadine-node-2: 00:02:22/00:10:48 (efficiency=22.0%)

  CPU memory usage per node - used/allocated
      muscadine-node-2: 530.8MB/4.3GB (132.7MB/1.1GB per core of 4)

  GPU utilization per node
      muscadine-node-2 (GPU 0): 60%

  GPU memory usage per node - maximum used/total
      muscadine-node-2 (GPU 0): 716.0MB/64GB (1.1%)
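
The percentages in the report can be checked by hand. The sketch below recomputes them from the values printed in the sample output above; it is an illustration of the arithmetic, not jobstats' own code, and the helper name hms_to_seconds is hypothetical. It assumes the report's GB figures are decimal (4.3GB taken as 4300MB, 64GB as 64000MB).

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string, as printed in the report, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the sample report above.
run_time = hms_to_seconds("00:02:42")  # wall-clock run time
cpu_time = hms_to_seconds("00:02:22")  # CPU time actually used
cores = 4                              # allocated CPU cores

# Available CPU time is the run time multiplied by the allocated cores.
available = run_time * cores           # 648 s, printed as 00:10:48
cpu_efficiency = 100 * cpu_time / available

# Memory figures from the same report, in MB (decimal units assumed).
mem_efficiency = 100 * 530.8 / 4300
gpu_mem_efficiency = 100 * 716.0 / 64000

print(f"CPU: {cpu_efficiency:.1f}%")
print(f"Memory: {mem_efficiency:.1f}%")
print(f"GPU memory: {gpu_mem_efficiency:.1f}%")
```

The results land close to the 22%, 12%, and 1.1% shown in the report; small differences are expected because the inputs here are the rounded values that jobstats printed.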

2. Viewing the Grafana Dashboard via Open OnDemand

For a full time-series view, with graphs that show how CPU, memory, and GPU usage changed minute by minute from job start to job end, use the Muscadine Slurm Job Stats Grafana dashboard.

Step 1: Log into the OOD portal

Navigate to the Muscadine Open OnDemand portal at https://muscadine-ood.hpc.msstate.edu, and log in with your MSU HPC NetID credentials.

Note

To access the Muscadine OOD portal, you must be connected to the MSU HPC VPN. For more information, see the VPN setup steps.


Step 2: Open Job Stats Helper

From the top navigation bar, click:

Jobs → Job Stats Helper

This opens a new window to the Job Stats Helper app, which generates a Grafana dashboard for a specific JobID.

Enter the JobID for the job you want to view metrics for.


Step 3: Generating the Dashboard

After typing your JobID, click Submit Query. Then, under Link with detailed statistics, click Click here for stats.

This takes you to a pre-configured Grafana dashboard that shows metrics for the job’s entire lifetime.


Step 4: Viewing the “Muscadine Slurm Job Stats” Dashboard

The dashboard opens pre-configured to your specific job ID and time range (from the moment the job started to when it ended, or the current time if it’s still running).

Panels included (panel: what it shows):

  • Job CPU Utilization (%): User, system, and total CPU usage as a percentage of allocated cores over time

  • Job CPU Memory Utilization (%): RSS, cached, used, and total allocated memory over time, plus OOM failure events

  • Allocated CPU Memory: The total memory allocation as specified in your sbatch script (--mem)

  • Allocated CPU Cores: The number of CPU cores allocated to the job (--ntasks / --cpus-per-task)

  • GPU Utilization (%): GFX engine activity percentage across all GPUs assigned to the job over time

  • GPU Memory Utilization (%): VRAM consumption as a percentage of total VRAM across all assigned GPUs over time

  • GPU Memory Utilization (MB): Raw VRAM consumption in MB across all assigned GPUs over time

  • Memory Controller Utilization (%): UMC (Unified Memory Controller) activity (how hard the GPU memory bus is being pushed)

  • GPU Clockspeed (MHz): Active GPU core clock frequency over the job’s lifetime

  • GPU Temperature (°C): Edge and junction temperatures for the assigned GPUs over the job’s lifetime

  • GPU Power Usage (W): Average package power draw in watts for the assigned GPUs over time

Additional Panels (panel: what it shows):

  • Node CPU Percentage Utilization: System, user, IO-wait, and total CPU usage at the node level (independent of job cgroup accounting)

  • Node Total Memory Utilization: Full node memory breakdown (total, used, available, free, buffered, and cached)

  • Local Disk R/W: Read and write throughput (bytes/sec) per block device on the selected node

  • Local Disk IOPS: Read and write I/O operations per second per block device on the selected node

  • NFS Stats: NFS read/write request rate and metadata operation rate (reads and writes) over NFSv3

  • Infiniband Throughput: Bytes per second received and transmitted over the node’s Infiniband port

  • Infiniband Packet Rate: Multicast and unicast packet rates in both directions over Infiniband

  • Infiniband Errors: Link errors, downed links, congestion indicators, and discarded packets on the Infiniband fabric

The time range is automatically scoped to your job, so you don’t need to adjust it manually. The dashboard uses the job’s start and end timestamps pulled from Slurm accounting.
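
To make the scoping rule concrete, here is a minimal sketch of how such a time window could be derived: from the job's start to its end, falling back to the current time while the job is still running. This is not the Job Stats Helper's actual implementation. The function name job_time_range is hypothetical, the use of epoch milliseconds (a form Grafana accepts for time ranges) is an assumption about this dashboard, and the sample start time is treated as UTC only for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def job_time_range(
    start: datetime,
    end: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> Tuple[int, int]:
    """Return (from_ms, to_ms) in epoch milliseconds covering the job.

    If the job has no end time yet (still running), the window
    extends to `now`, which defaults to the current UTC time.
    """
    if end is None:
        end = now if now is not None else datetime.now(timezone.utc)
    from_ms = int(start.timestamp() * 1000)
    to_ms = int(end.timestamp() * 1000)
    return from_ms, to_ms


# Sample job: started May 13, 2026 at 3:57 PM and ran for 00:02:42.
# The time zone is an assumption; the report does not state one.
start = datetime(2026, 5, 13, 15, 57, tzinfo=timezone.utc)
end = start + timedelta(minutes=2, seconds=42)

from_ms, to_ms = job_time_range(start, end)
print(from_ms, to_ms, (to_ms - from_ms) / 1000)
```

For a running job, calling job_time_range(start) without an end time yields a window that ends at the moment of the call.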


3. Tips for Interpreting Your Job’s Metrics

CPU efficiency below 50%? Your job may be waiting on memory, I/O, or GPU transfers rather than doing compute work. Note that because Slurm counts hyperthreads as cores, 50% is considered fully utilized for code that runs one thread per physical core. If your job does not use both hardware threads on each core, efficiency near 50% is expected and healthy. Below that, check whether you’re allocating more cores than your code can actually use in parallel.
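
One rough way to check for over-allocation is to estimate how many cores the job kept busy on average, which is the CPU time used divided by the wall-clock run time. The sketch below applies this to the sample report earlier on this page; the function name busy_cores is hypothetical and the result is a heuristic, not a jobstats feature.

```python
def busy_cores(cpu_seconds: float, run_seconds: float) -> float:
    """Average number of cores kept busy over the job's run time."""
    if run_seconds <= 0:
        raise ValueError("run time must be positive")
    return cpu_seconds / run_seconds


# Sample report: 00:02:22 of CPU time over a 00:02:42 run on 4 cores.
cpu_seconds = 2 * 60 + 22
run_seconds = 2 * 60 + 42
allocated_cores = 4

average_busy = busy_cores(cpu_seconds, run_seconds)
efficiency = 100 * average_busy / allocated_cores

print(f"average busy cores: {average_busy:.2f} of {allocated_cores}")
print(f"CPU efficiency: {efficiency:.1f}%")
```

An average well below the allocated core count, as in this sample, suggests the job could request fewer cores unless it has short bursts that need them all.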

Memory usage near the limit? You’re at risk of OOM kills on future runs. Request more memory or reduce your problem size per node.
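
A simple check along these lines compares used memory to the allocation and flags jobs that come close to the limit. In this sketch the function name near_memory_limit and the 90% threshold are illustrative assumptions, not values defined by jobstats or by Muscadine policy; the first call uses the figures from the sample report, and the second uses a made-up value.

```python
def near_memory_limit(used_mb: float, allocated_mb: float,
                      threshold: float = 0.90) -> bool:
    """Return True when usage reaches the given fraction of the allocation.

    The default threshold of 0.90 is an assumed cutoff for illustration.
    """
    if allocated_mb <= 0:
        raise ValueError("allocation must be positive")
    return used_mb / allocated_mb >= threshold


# Sample report: 530.8MB used of 4.3GB (taken as 4300MB) allocated.
print(near_memory_limit(530.8, 4300))

# Hypothetical job that used 4100MB of the same allocation.
print(near_memory_limit(4100, 4300))
```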

VRAM usage unexpectedly high? Check for memory leaks or unfreed allocations across iterations. The time-series view in Grafana is especially useful here. Look for a rising VRAM trend over the job’s lifetime rather than a stable plateau.
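
The difference between a rising trend and a stable plateau can also be checked numerically by fitting a straight line to VRAM samples. The sketch below is one possible heuristic, assuming you have exported evenly spaced VRAM readings in MB; the function names, the 10% growth cutoff, and both sample series are hypothetical.

```python
from typing import Sequence


def vram_slope(samples_mb: Sequence[float]) -> float:
    """Least-squares slope of VRAM usage, in MB per sample interval."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    numerator = sum((i - mean_x) * (y - mean_y)
                    for i, y in enumerate(samples_mb))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(samples_mb: Sequence[float],
                    min_growth_fraction: float = 0.10) -> bool:
    """Flag a series whose fitted growth over the whole run exceeds
    an assumed fraction (default 10%) of its mean usage."""
    growth = vram_slope(samples_mb) * (len(samples_mb) - 1)
    mean = sum(samples_mb) / len(samples_mb)
    return growth > min_growth_fraction * mean


# Hypothetical readings in MB, one per sampling interval.
plateau = [700, 716, 712, 715, 714, 716]
rising = [700, 900, 1100, 1300, 1500, 1700]

print(looks_like_leak(plateau))
print(looks_like_leak(rising))
```

A fitted line is less sensitive to a single spike than comparing only the first and last samples, which is why it is used here.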


More Info

Jobstats Documentation: https://princetonuniversity.github.io/jobstats/