Every job on CSI HPCC is submitted through SLURM. This page collects annotated templates for the common job shapes. Copy one, edit the
#SBATCH directives, add your module loads, and submit with sbatch.
## Anatomy of a SLURM script
Real jobs on HPCC typically need `--qos` and `--partition` values matching your project (for example, `--qos=qoschem --partition=partchem`). If you don't know which values to use, ask your PI or the HPC Helpline. Those values are omitted from the examples below so you can paste in your own once.
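Every template below follows the same skeleton. A minimal sketch (the resource values, module name, and program name are placeholders, not site defaults):

```bash
#!/bin/bash
#SBATCH --job-name=myjob          # name shown in squeue
#SBATCH --qos=<your-qos>          # from your PI or the HPC Helpline
#SBATCH --partition=<your-partition>
#SBATCH --ntasks=1                # number of processes
#SBATCH --cpus-per-task=1         # cores per process
#SBATCH --mem-per-cpu=4G          # memory per core
#SBATCH --time=01:00:00           # hh:mm:ss; must fit the partition's wall-clock limit
#SBATCH --output=%x-%j.out        # %x = job name, %j = job ID
#SBATCH --error=%x-%j.err

module load mymodule              # placeholder; load whatever your code needs

cd "$SLURM_SUBMIT_DIR"            # run from the directory you submitted from
./my_program                      # placeholder executable
```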
## Partitions and QOS

Most production jobs must name a partition (`--partition`) and the QOS value assigned to your project (`--qos`). The current HPCC Wiki lists these operational partitions:
| Partition | Max cores/job | Max jobs/user | Max cores/group | Wall-clock limit | Tier | GPU types listed by HPCC |
|---|---|---|---|---|---|---|
| partnsf | 128 | 50 | 256 | 240 h | Advanced | K20m, V100/16, A100/40 |
| partchem | 128 | 50 | 256 | No limit | Condo | A100/80, A30 |
| partcfd | 96 | 50 | 96 | No limit | Condo | A40 |
| partsym | 96 | 50 | 96 | No limit | Condo | A30 |
| partasrc | 48 | 16 | 16 | No limit | Condo | A30 |
| partmatlabD | 128 | 50 | 256 | 240 h | Advanced | V100/16, A100/40 |
| partmatlabN | 384 | 50 | 384 | 240 h | Advanced | None |
| partphys | 96 | 50 | 96 | No limit | Condo | L40 |
The `partdev` partition is dedicated to development. The HPCC Wiki describes it as available to all HPCC users, with a four-hour time limit, on a 16-core node with 64 GB of memory and two K20m GPUs.
## Submitting, watching, and cancelling
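The day-to-day loop is `sbatch`, `squeue`, and `scancel`. A quick sketch (`job.sh` and `<jobid>` are placeholders; `<jobid>` is whatever `sbatch` prints back):

```bash
sbatch job.sh                     # submit; prints "Submitted batch job <jobid>"
squeue -u $USER                   # list your pending and running jobs
squeue -j <jobid> -o "%i %T %r"   # one job: ID, state, and pending reason
scancel <jobid>                   # cancel one job
scancel -u $USER                  # cancel every job you have queued
```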
## Serial job (one core)
The simplest case: one process, one core.
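A sketch of what `serial.sh` might look like (module and program names are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=serial
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=04:00:00
#SBATCH --output=%x-%j.out

module load mymodule              # placeholder

cd "$SLURM_SUBMIT_DIR"
./my_program
```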
## Multi-threaded (OpenMP)
One task, multiple cores on the same node. Set `OMP_NUM_THREADS` so your program actually uses the cores SLURM allocated.
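A sketch of `openmp.sh`, assuming a 16-thread run (adjust `--cpus-per-task` to your workload):

```bash
#!/bin/bash
#SBATCH --job-name=openmp
#SBATCH --nodes=1                 # all threads must share one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16        # illustrative thread count
#SBATCH --mem-per-cpu=2G
#SBATCH --time=08:00:00
#SBATCH --output=%x-%j.out

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # use exactly what SLURM allocated

cd "$SLURM_SUBMIT_DIR"
./my_threaded_program
```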
## MPI (multiple nodes)
Distributed-memory parallelism across nodes.
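A sketch of `mpi.sh`; the MPI module name and the tasks-per-node count are assumptions, so match them to your code and your partition's nodes:

```bash
#!/bin/bash
#SBATCH --job-name=mpi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48      # assumed core count; match your partition's nodes
#SBATCH --mem-per-cpu=2G
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.out

module load openmpi               # placeholder; use the MPI stack your code was built with

cd "$SLURM_SUBMIT_DIR"
srun ./my_mpi_program             # srun starts one process per allocated task
```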
## Hybrid MPI + OpenMP
MPI between nodes, OpenMP threads within each rank.
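A sketch of `hybrid.sh`; the 4 ranks × 12 threads per node split is illustrative, not a recommendation:

```bash
#!/bin/bash
#SBATCH --job-name=hybrid
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4       # MPI ranks per node (illustrative)
#SBATCH --cpus-per-task=12        # OpenMP threads per rank (illustrative)
#SBATCH --mem-per-cpu=2G
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.out

module load openmpi               # placeholder

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # threads per rank = cores per task

cd "$SLURM_SUBMIT_DIR"
srun ./my_hybrid_program
```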
As with every template here, set `--qos`, `--partition`, and `--mem-per-cpu` to the right values for your project before submitting.
## GPU job
Request GPUs with `--gres=gpu:<count>`. On Arrow, the HPCC Wiki lists GPU nodes ranging from 2 to 8 GPUs per node.
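A sketch of `gpu.sh` requesting a single GPU (the CUDA module name is a placeholder):

```bash
#!/bin/bash
#SBATCH --job-name=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8         # CPU cores to feed the GPU (illustrative)
#SBATCH --gres=gpu:1              # one GPU of any type
#SBATCH --mem-per-cpu=4G
#SBATCH --time=12:00:00
#SBATCH --output=%x-%j.out

module load cuda                  # placeholder; match the runtime your code needs

cd "$SLURM_SUBMIT_DIR"
nvidia-smi                        # log which GPU was actually allocated
./my_gpu_program
```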
## GPU with a specific type
Several partitions host different NVIDIA GPU types. Use `sinfo` to inspect the constraints currently advertised by the scheduler, then constrain your job only when the workload requires a specific GPU.
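A sketch of both steps; whether typed requests like `gpu:<type>:<count>` work depends on how the site defined its GRES, and the `a100` name below is an assumption:

```bash
# See which GRES (GPU types and counts) each partition advertises
sinfo -o "%P %G"

# Then, in your batch script, request one GPU of a specific type.
# "a100" is an assumed name; use a type that sinfo actually reports.
#SBATCH --gres=gpu:a100:1
```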
## Job array (parameter sweep)
Run many copies of the same job, each with a different `$SLURM_ARRAY_TASK_ID`.
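A sketch of `array.sh`; the `params.txt` line-per-task scheme is one illustrative way to feed each task its parameters:

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-100%20          # tasks 1..100, at most 20 running at once
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=02:00:00
#SBATCH --output=%x-%A_%a.out     # %A = array job ID, %a = array task ID

cd "$SLURM_SUBMIT_DIR"
# One parameter set per line of params.txt; each task reads its own line
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
./my_program $PARAMS
```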
## Interactive debugging
For quick, interactive access to a compute node (short sessions only; don’t hold nodes idle):
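For example, on `partdev` (open to all HPCC users per the wiki; keep `--time` within its four-hour limit):

```bash
# One core for 30 minutes; type "exit" to release the node
srun --partition=partdev --time=00:30:00 --ntasks=1 --cpus-per-task=1 --pty bash
```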
## Troubleshooting cheatsheet

| Symptom | First thing to check |
|---|---|
| Job sits PENDING indefinitely | Run `squeue -j <jobid> -o "%i %T %r"`; the reason column explains why (priority, resources, QOS limit, etc.). |
| Job fails immediately with “invalid partition / QOS” | Your `--qos` or `--partition` values are wrong for your project. |
| Job runs but crashes with no output | You launched from `/global/u`. Move to `/scratch/$USER` and resubmit. |
| `srun: error: Unable to create TCP connection` | Usually a transient node issue; resubmit, or check with the helpline if it repeats. |
| GPU allocated but program can’t see it | Add `nvidia-smi` to your script to confirm, and make sure you module-loaded the matching CUDA runtime. |
Whatever the symptom, read the job’s `.out` and `.err` files first. The FAQ on the HPCC Wiki covers more edge cases.