Every job on CSI HPCC is submitted through SLURM. This page collects annotated templates for the common job shapes. Copy one, edit the #SBATCH directives, add your module loads, and submit with sbatch.
Three rules that apply to every job:
  1. Start from /scratch/<username>, never from /global/u/<username> (your home).
  2. Use SLURM syntax. Older PBS Pro scripts must be converted; a directive mapping follows this list.
  3. Never run jobs on the login (head) node. Any job found running there will be killed and the account may be suspended.
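
If you are converting an old PBS Pro script, the common directives map almost one-to-one. A minimal sketch of the usual equivalences (check man sbatch for anything not listed here):

PBS Pro                        SLURM equivalent
#PBS -N my_job                 #SBATCH --job-name=my_job
#PBS -l nodes=2:ppn=8          #SBATCH --nodes=2 --ntasks-per-node=8
#PBS -l walltime=01:00:00      #SBATCH --time=01:00:00
#PBS -o out.log                #SBATCH --output=out.log
#PBS -e err.log                #SBATCH --error=err.log
cd $PBS_O_WORKDIR              cd $SLURM_SUBMIT_DIR
qsub script.sh                 sbatch script.sh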

Anatomy of a SLURM script

#!/bin/bash
#SBATCH --job-name=my_job          # a short name that shows up in squeue
#SBATCH --nodes=1                  # how many nodes
#SBATCH --ntasks=1                 # how many MPI tasks total
#SBATCH --cpus-per-task=1          # CPU cores per task (>1 for threaded work)
#SBATCH --mem-per-cpu=4G           # RAM per core
#SBATCH --time=01:00:00            # wall-clock limit (HH:MM:SS)
#SBATCH --output=slurm-%j.out      # stdout file (%j = job ID)
#SBATCH --error=slurm-%j.err       # stderr file
#SBATCH --qos=<qos_name>           # your project's QOS
#SBATCH --partition=<part_name>    # your project's partition

module purge
module load <modules_you_need>

cd $SLURM_SUBMIT_DIR
srun ./your_program
Real jobs on HPCC typically need --qos and --partition values matching your project (for example --qos=qoschem --partition=partchem, which the serial template below uses as placeholders). If you don’t know which values to use, ask your PI or the HPC Helpline. The other templates omit these two directives so you can paste in your own values once.
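
Both values can also be passed on the sbatch command line, which overrides any #SBATCH directive in the script; the QOS and partition names below are placeholders:

sbatch --qos=<qos_name> --partition=<part_name> serial.sh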

Partitions and QOS

Most production jobs must name a partition (--partition) and the QOS value assigned to your project (--qos). The current HPCC Wiki lists these operational partitions:
Partition    Max cores/job  Max jobs/user  Max cores/group  Wall-clock limit  Tier      GPU types listed by HPCC
partnsf      128            50             256              240 h             Advanced  K20m, V100/16, A100/40
partchem     128            50             256              No limit          Condo     A100/80, A30
partcfd      96             50             96               No limit          Condo     A40
partsym      96             50             96               No limit          Condo     A30
partasrc     48             16             16               No limit          Condo     A30
partmatlabD  128            50             256              240 h             Advanced  V100/16, A100/40
partmatlabN  384            50             384              240 h             Advanced  None
partphys     96             50             96               No limit          Condo     L40
partdev is dedicated to development. The HPCC Wiki describes it as open to all HPCC users: a 16-core node with 64 GB of memory and 2 K20m GPUs, with a four-hour wall-clock limit.
Run sinfo -s to see which partitions are currently up, and sacctmgr show assoc user=$USER format=Account,Partition,QOS to confirm which partition and QOS values your account is allowed to use.
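
For a closer look at one partition, sinfo's format specifiers are handy (the partition name is a placeholder):

sinfo -p <part_name> -o "%P %a %l %D %G"   # partition, availability, time limit, node count, GRES (GPUs)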

Submitting, watching, and cancelling

sbatch script.sh                    # submit (prints a job ID)
squeue -u $USER                     # your jobs in the queue
squeue -j <jobid>                   # one specific job
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS
scancel <jobid>                     # cancel a job
scontrol show job <jobid>           # everything SLURM knows about it
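
When scripting submissions, sbatch --parsable prints just the job ID, which makes dependency chains easy; next.sh here is a hypothetical follow-up script:

jobid=$(sbatch --parsable script.sh)
sbatch --dependency=afterok:$jobid next.sh   # starts only if the first job exits successfully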

Serial job (one core)

The simplest case: one process, one core.
serial.sh
#!/bin/bash
#SBATCH --job-name=serial_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8G
#SBATCH --time=01:00:00
#SBATCH --qos=qoschem
#SBATCH --partition=partchem

module purge
module load <your_modules>

cd $SLURM_SUBMIT_DIR
srun ./my_serial_program

Multi-threaded (OpenMP)

One task, multiple cores on the same node. Set OMP_NUM_THREADS so your program actually uses the cores SLURM allocated.
openmp.sh
#!/bin/bash
#SBATCH --job-name=omp_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G
#SBATCH --time=01:00:00

module purge
module load <your_modules>       # must include an OpenMP-capable compiler runtime

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

cd $SLURM_SUBMIT_DIR
srun ./my_openmp_program         # built with -fopenmp (or compiler equivalent)
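
To confirm the allocation before a long run, a one-line check using standard SLURM and OpenMP environment variables can be added to the script:

srun bash -c 'echo "node=$(hostname) cpus-per-task=$SLURM_CPUS_PER_TASK OMP_NUM_THREADS=$OMP_NUM_THREADS"'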

MPI (multiple nodes)

Distributed-memory parallelism across nodes.
mpi.sh
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=16G
#SBATCH --time=04:00:00

module purge
module load <compiler_module>
module load <mpi_module>

cd $SLURM_SUBMIT_DIR
srun ./my_mpi_program            # 64 ranks total: 32 × 2 nodes
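
A quick placement check before the real run can save wasted hours; with the allocation above it should print two nodes with 32 ranks each:

srun hostname | sort | uniq -c   # one line per node, prefixed with its rank count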

Hybrid MPI + OpenMP

MPI between nodes, OpenMP threads within each rank.
hybrid.sh
#!/bin/bash
#SBATCH --job-name=hybrid_job
#SBATCH --nodes=2
#SBATCH --ntasks=24
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=16G
#SBATCH --time=04:00:00

module purge
module load <compiler_module>
module load <mpi_module>

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK      # threads per MPI rank
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK   # srun no longer inherits --cpus-per-task on Slurm 22.05+

cd $SLURM_SUBMIT_DIR
srun ./my_hybrid_program         # 24 ranks × 2 OMP threads each
The prototype above allocates 12 ranks per node × 2 nodes = 24 MPI ranks, each spawning 2 OpenMP threads. Adjust --qos, --partition, and --mem-per-cpu for your project before submitting.
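
Hybrid jobs often benefit from explicit thread pinning. These are standard OpenMP 4.0 environment variables, not HPCC-specific settings; set them next to OMP_NUM_THREADS if run times vary:

export OMP_PLACES=cores       # pin each thread to a physical core
export OMP_PROC_BIND=close    # keep a rank's threads adjacent to each other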

GPU job

Request GPUs with --gres=gpu:<count>. On Arrow, the HPCC Wiki lists GPU nodes with between 2 and 8 GPUs each.
gpu.sh
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=16G
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00

module purge
module load <cuda_or_framework_module>

cd $SLURM_SUBMIT_DIR
srun ./my_gpu_program
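
To verify the job actually sees its GPU, nvidia-smi (installed with the NVIDIA driver) can be run inside the allocation; SLURM exposes the assigned devices via CUDA_VISIBLE_DEVICES:

srun nvidia-smi -L                                  # list GPUs visible to the job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"   # indices SLURM assigned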

GPU with a specific type

Several partitions host different NVIDIA GPU types. Use sinfo to inspect the GRES and feature strings the scheduler currently advertises, then constrain your job only when the workload requires a specific GPU.
sinfo -o "%P %G %f"                # partition, GRES (GPU type and count), node features

#SBATCH --gres=gpu:a100:1          # type-qualified request; the type name must match sinfo's GRES column

Job array (parameter sweep)

Run many copies of the same job, each with a different $SLURM_ARRAY_TASK_ID.
array.sh
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=16G
#SBATCH --time=01:00:00
#SBATCH --array=0-5
#SBATCH --output=slurm-%A_%a.out       # %A = array job ID, %a = task index
#SBATCH --error=slurm-%A_%a.err

module purge
module load <your_modules>

cd $SLURM_SUBMIT_DIR
echo "Array task ID: $SLURM_ARRAY_TASK_ID"
srun ./my_program --case "$SLURM_ARRAY_TASK_ID"
This submits 6 jobs (indices 0–5) sharing a single array job ID.
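A common pattern maps each task ID to one line of a parameter file; cases.txt below is a hypothetical file with one case per line. Appending a throttle such as --array=0-5%2 also caps how many tasks run concurrently.

# replace the last line of array.sh with:
CASE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" cases.txt)   # task 0 reads line 1
srun ./my_program --case "$CASE"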

Interactive debugging

For quick, interactive access to a compute node (short sessions only; don’t hold nodes idle):
srun --pty --nodes=1 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --time=00:30:00 bash
Load modules and run commands as if you were on a compute node. Exit the shell to release the allocation.
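
salloc is the alternative when you want a persistent allocation to run several srun commands against; it takes the same resource flags:

salloc --nodes=1 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --time=00:30:00
srun ./my_program    # runs on the allocated node
exit                 # ends the salloc shell and releases the allocation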

Troubleshooting cheatsheet

Symptom, followed by the first thing to check:

Job sits PENDING indefinitely
  Run squeue -j <jobid> -o "%i %T %r"; the reason column explains why (priority, resources, QOS limit, etc.).
Job fails immediately with “invalid partition / QOS”
  Your --qos or --partition values are wrong for your project.
Job runs but crashes with no output
  You launched from /global/u. Move to /scratch/$USER and resubmit.
srun: error: Unable to create TCP connection
  Usually a transient node issue; resubmit, or check with the helpline if it repeats.
GPU allocated but program can’t see it
  Add nvidia-smi to your script to confirm, and make sure you loaded the matching CUDA runtime module.
Still stuck? Open a ticket with the job ID, the command you ran, and the contents of the .out and .err files. The FAQ on the HPCC Wiki covers more edge cases.
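
One way to gather what the ticket needs in a single pass (12345 is a placeholder job ID):

sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS
tail -n 50 slurm-12345.out slurm-12345.err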