Getting Started¶
Accessing Muscadine¶
Method One: Titan¶
First log into Titan at titan.hpc.msstate.edu. Use the following template if needed:
ssh <NetID>@titan.hpc.msstate.edu
After entering your HPC password and approving the 2FA push, you can log into Muscadine. Only one Muscadine node, muscadine-node-1, faces the HPCC network, so that is the node to ssh into. Use the following template if needed:
ssh <NetID>@muscadine-node-1.hpc.msstate.edu
Repeat the authentication process from logging into Titan and you’ll be set.
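If you log in this way often, you can optionally add a ProxyJump entry to your local ~/.ssh/config so a single command hops through Titan automatically. This is just a convenience sketch, assuming TCP forwarding is allowed on Titan; the host aliases are arbitrary and <NetID> is your own:
# ~/.ssh/config on your own machine
Host titan
    HostName titan.hpc.msstate.edu
    User <NetID>

Host muscadine
    HostName muscadine-node-1.hpc.msstate.edu
    User <NetID>
    ProxyJump titan
With this in place, ssh muscadine will prompt for both authentication steps in sequence.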
Method Two: HPC VPN¶
Method Two involves connecting to the HPC2 VPN. Go to https://servicedesk.its.msstate.edu and download the Cisco AnyConnect VPN client by following ITS's guide. Once that is done, launch the VPN and connect to vpn.hpc.msstate.edu. It will ask for your Group (HPC2), Username (your NetID), Password (your HPC password), and a Duo 2FA passcode (get one from the Duo app, or enter 'push' to receive a push notification). Once connected, you can ssh directly into Muscadine's head node, muscadine-node-1. Use the following template if needed:
ssh <NetID>@muscadine-node-1.hpc.msstate.edu
Muscadine’s Storage¶
You are provided a reasonably fast, shared home directory; this is where you land after connecting to Muscadine. You are welcome to keep files in your home directory, but /scratch and /work are also available for you to use.
/scratch is built on BeeGFS and is the most performant storage on Muscadine. The downside is that capacity is limited, so old files are wiped periodically whenever it nears capacity.
/work and /home are backed by the same storage solution, so they have the same speed. The difference is that /work is intended for project data such as datasets, AI models, output files, etc. If your program reads or writes data, that data should live in /work.
| Mount | Total Capacity | Speed | Quota | Description |
|---|---|---|---|---|
| /scratch | 2.2TB | FAST | None | Scratch is to be used for active data or programs that are not vital. SCRATCH WILL BE PERIODICALLY WIPED. |
| /home | 20TB | quick | 10GB | Home is where your and other users' personal directories live. Only put binaries, scripts, or small items here. Capacity shared with /work. |
| /work | 20TB | quick | 500GB | Work is where datasets and large quantities of input/output data are to be stored. Capacity shared with /home. |
Note
The speeds are intentionally vague. Feel free to benchmark them yourself.
Warning
Do not put anything you want to keep for longer than a day in /scratch as all files are subject to deletion to preserve free space.
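To get a rough idea of how full each filesystem is, or how much space your own files take up, standard Linux tools are enough; this is just a quick sketch, and any directory path is a placeholder:
df -h /scratch /home /work   # overall capacity and usage of each mount
du -sh ~                     # total size of your home directory
du -sh <someDirectory>       # size of any other directory you want to check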
Using Muscadine¶
Getting a node allocation¶
Say you want to build a program. You could build it on the head node, but that can interfere with other people building or running programs there. The alternative is to submit a Slurm request for a compute node to build on. Run the following command to reserve a node and open a bash terminal on it.
srun -c16 --pty bash
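A typical build session on the reserved node might then look like this; the project path and build command are placeholders for your own code:
srun -c16 --pty bash    # reserve 16 cores and drop into an interactive shell
ml load gcc/14.2.0      # load a compiler module
cd ~/my-project         # placeholder: your source tree
make -j16               # build using the reserved cores
exit                    # release the node when you are done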
Available modules¶
Modules are software stacks such as GCC, Python, OpenBLAS, or ROCm that we provide, optimized for the cluster they run on. A list of all available modules can be seen by running ml avail. Below is a listing of the modules provided by default; a few more become available after loading a compiler such as aocc or gcc.
-------------------- /apps/spack-zen4-v1.0/modulefiles/Core -----------------
amdblis/5.0 hip/6.2.4 python/3.12.5 rust/1.81.0
aocc/5.0.0 hipblas/6.2.4 rocblas/6.2.4 sqlite/3.34.1
apptainer/1.3.4 miopen-hip/6.2.4 rocm-cmake/6.2.4
contrib/0.1 openmpi/5.0.5 rocm-device-libs/6.2.4
gcc/14.2.0 python/3.11.9 rocsolver/6.2.4
To load a module, run ml load <moduleName>. Tab completion works with ml, and we encourage you to use it: it also lists the different versions we provide for each module, though most have only one.
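For example, to load the GCC and OpenMPI modules from the listing above and check what is loaded:
ml avail                  # list all available modules
ml load gcc/14.2.0        # load a specific compiler version
ml load openmpi/5.0.5     # load MPI
ml list                   # show currently loaded modules
ml unload openmpi/5.0.5   # unload a module you no longer need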
What do I do if there is a program I want?¶
The simple answer is to build it from its source repository. The complex answer is also to build it from its source repository. We do not install tools or programs for users; that is part of the user's job when using the cluster. After building a program, you should end up with a binary executable. If you want the binaries you build to be accessible from anywhere on the system, complete the following steps.
1. After building your binary, create a .bin directory in your home directory by running mkdir ~/.bin.
2. Move that binary into the .bin directory.
3. Add the path of that directory to your environment by editing ~/.zshrc and appending export PATH="$PATH:$HOME/.bin" to the end (a worked example follows the note below).
Note
You only need to edit .zshrc once.
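Putting these steps together, a session might look like the following sketch, assuming a freshly built binary named myprog (the name is only a placeholder):
mkdir -p ~/.bin                                     # create the directory if it doesn't already exist
mv myprog ~/.bin/                                   # move the binary into it
echo 'export PATH="$PATH:$HOME/.bin"' >> ~/.zshrc   # append to PATH (only needed once)
source ~/.zshrc                                     # reload your shell configuration
myprog                                              # the binary is now runnable from anywhere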
Users can submit a request to add something they have built to /apps/contrib for other users to use as well.
Using Slurm¶
Common Slurm commands¶
sinfo¶
sinfo is a command that shows the cluster status. I ran sinfo on Muscadine during a low-usage time, so all the nodes are idle. It is highly recommended to check this before submitting jobs, especially ones larger than half a node.
╭─odh49@muscadine-node-1 ~
╰─$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
muscadine* up infinite 5 idle muscadine-node-[1-5]
squeue¶
squeue is a command that shows the jobs currently on the cluster. It is very useful for checking whether your job is running or waiting for resources, along with its Job ID. I ran it on Muscadine while my meta-8B.sh job was running.
╭─odh49@muscadine-node-1 ~
╰─$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3449 muscadine meta-8B. odh49 R 0:18 1 muscadine-node-2
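On a busy cluster the full list can be long; you can restrict the output to your own jobs with the -u flag:
squeue -u <NetID>   # show only your jobs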
scancel¶
scancel kills a Slurm job. This is useful if you realize there was a mistake in your SBATCH file, you made a change before the job finished, or an interactive job froze. Here I used the Job ID from the job shown above to kill the running model.
╭─odh49@muscadine-node-1 ~
╰─$ scancel 3449
sacct¶
sacct is like squeue except it shows all of your Slurm jobs: those that have run, are running, or have errored out. It is very useful for looking at several jobs at once and seeing how they are being handled. Here is the output of sacct with a few jobs running, finished, and failed.
╭─odh49@muscadine-node-1 ~/Demos/helloWorld
╰─$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3449 meta-8B.sh muscadine sighpc 2 CANCELLED+ 0:0
3449.batch batch sighpc 2 CANCELLED 0:15
3449.0 llama-cli sighpc 2 CANCELLED 0:15
3450 llama-cli muscadine sighpc 2 FAILED 0:2
3450.0 llama-cli sighpc 2 CANCELLED 0:2
3451 meta-8B.sh muscadine sighpc 2 RUNNING 0:0
3451.batch batch sighpc 2 RUNNING 0:0
3451.0 llama-cli sighpc 2 RUNNING 0:0
3452 helloWorld muscadine sighpc 48 RUNNING 0:0
3452.batch batch sighpc 48 RUNNING 0:0
3452.0 hello_mpi sighpc 48 RUNNING 0:0
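If you want different columns, sacct accepts a --format option listing the fields to show, for example:
sacct --format=JobID,JobName,Partition,State,Elapsed,ExitCode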
srun¶
srun is the command you'll use to run a parallel program via Slurm. For the most part, you'll use it within an SBATCH script as described in Section 5.3, but you can also use it for an interactive session (e.g. srun -c2 --pty bash). srun is also useful for testing code on a small scale before writing the full SBATCH script. See https://slurm.schedmd.com/srun.html for a full list of srun arguments.
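For instance, to test the MPI binary from the example below on a small scale before writing the full SBATCH script:
ml load openmpi/5.0.5      # load the MPI module
srun -N1 -n4 ./hello_mpi   # run 4 tasks on a single node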
Building a SBATCH script¶
SBATCH scripts are what Slurm uses to allocate resources and distribute your program. Below is a template you can use for nearly any Slurm job.
#!/bin/bash

# Slurm Specific
#SBATCH --job-name="my_name"     #job name
#SBATCH --nodes=1                #number of nodes
#SBATCH --ntasks-per-node=16     #number of tasks per node
#SBATCH --cpus-per-task=1        #number of physical cores per task
#SBATCH --mem=4G                 #max memory allocated
#SBATCH --time=0-1:00:00         #max runtime (D-hh:mm:ss)
#SBATCH --output=my_run.log      #output name

# Muscadine Specific
ml load <neededModules>

# Your Application
srun <application>
Here is an example of an SBATCH script for a Hello World program spanning 2 nodes with 48 tasks each, using the openmpi/5.0.5 module and the binary I built.
#!/bin/bash

# Slurm Specific
#SBATCH --job-name=helloWorld        #job name
#SBATCH --nodes=2                    #number of nodes
#SBATCH --ntasks-per-node=48         #number of tasks per node
#SBATCH --cpus-per-task=1            #number of cpus per task
#SBATCH --mem=8G                     #max memory allocated
#SBATCH --time=0-1:00:00             #max runtime
#SBATCH --output=helloWorld-%j.out   #output name

# Muscadine Specific
ml load openmpi/5.0.5

# Your Application
srun ./hello_mpi
You can submit the SBATCH Script to Slurm via the following command:
sbatch <sbatchScript>
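For example, if the Hello World script above were saved as helloWorld.sbatch (the filename is just an example), a submission might look like:
sbatch helloWorld.sbatch     # submit the job; Slurm prints the assigned Job ID
squeue -u <NetID>            # check whether it is queued or running
cat helloWorld-<jobID>.out   # view the output file named by --output once it exists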