If you work with HPC clusters, you likely use sbatch every day. You submit a script and expect it to run.
But that single command triggers a full workflow inside Slurm.
Understanding this internal flow helps you debug issues faster, optimize job performance, and better understand how your cluster behaves.
⸻
Step 1: Submitting the Job
When you run:
sbatch job.sh
You are not starting the job. You are submitting a request to Slurm.
The script includes:
- Resource requirements such as CPUs, memory, GPUs
- Job metadata like name and output paths
- The actual commands to execute
At this point, Slurm simply accepts the job.
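For reference, a minimal job script might look like the sketch below. The directives are real sbatch options, but the values and the program name are placeholders, not recommendations.
#!/bin/bash
#SBATCH --job-name=example         # job metadata: name
#SBATCH --output=example_%j.out    # %j expands to the Job ID
#SBATCH --cpus-per-task=4          # resource requirements
#SBATCH --mem=8G
#SBATCH --gres=gpu:1               # only if the cluster exposes GPUs as GRES
#SBATCH --time=01:00:00
srun ./my_program input.dat        # the actual commands to execute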
⸻
Step 2: Communication with slurmctld
The sbatch command sends the job to the Slurm controller daemon, slurmctld.
This daemon:
- Assigns a Job ID
- Stores the job details
- Marks the job as PENDING
Nothing is running yet.
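You can see that handoff from the command line. The job ID, partition, and output below are illustrative only:
$ sbatch job.sh
Submitted batch job 123456
$ squeue -j 123456
  JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
 123456   compute  example  user PD  0:00     1 (Priority)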
⸻
Step 3: Job Enters the Queue
The job is now placed in the scheduling queue.
The scheduler evaluates:
- Job priority
- Fairshare usage
- Partition limits
- Resource availability
This determines when your job will run.
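To see how these factors come together for a specific pending job, Slurm exposes the breakdown directly (exact columns depend on how the cluster's priority plugin is configured):
sprio -j <jobid>                  # per-factor priority breakdown (age, fairshare, partition, ...)
sshare -U                         # fairshare usage for your own associations
squeue -j <jobid> -o "%i %T %r"   # job ID, state, and the scheduler's pending reason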
⸻
Step 4: Scheduling Decision
The scheduler continuously checks:
- Free nodes
- Resource fragmentation
- Backfill opportunities
If your job fits available resources, it gets selected. Otherwise, it stays pending.
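You can ask the scheduler for its current plan; the estimate is only a snapshot and shifts as other jobs finish early or new ones arrive:
squeue -j <jobid> --start                            # estimated start time for a pending job
scontrol show job <jobid> | grep -o "Reason=[^ ]*"   # the pending reason, e.g. Priority or Resources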
⸻
Step 5: Resource Allocation
Once selected, Slurm:
- Assigns specific compute nodes
- Reserves CPUs, memory, and GPUs
- Changes job state to RUNNING
Now your job has allocated resources.
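The allocation is visible in the job record once the state flips to RUNNING; the field names below are standard scontrol output, though exact contents vary by Slurm version:
scontrol show job <jobid> | grep -E "JobState|NodeList|NumCPUs|TRES"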
⸻
Step 6: Node-Level Communication
Each compute node runs a daemon called slurmd.
The controller sends job details to these nodes. The nodes prepare the execution environment.
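If a job fails before it ever starts, checking the slurmd side is often the quickest diagnostic; the node name is a placeholder:
sinfo -N -l                      # per-node state as reported by each slurmd
scontrol show node <nodename>    # detailed state for a single node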
⸻
Step 7: Job Execution via slurmstepd
On each allocated compute node, slurmd launches a slurmstepd process for the job.
This process:
- Starts your application
- Manages job steps
- Handles output and error streams
- Enforces resource limits using cgroups
Your script begins executing here.
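Each srun call inside the batch script becomes its own job step under slurmstepd, which is why accounting later shows separate step entries. A minimal sketch with placeholder program names:
#!/bin/bash
#SBATCH --ntasks=4
srun ./preprocess     # step 0
srun ./solve          # step 1
srun ./postprocess    # step 2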
⸻
Step 8: Monitoring During Execution
While the job runs:
- Slurm tracks resource usage
- Logs are written to output files
- Accounting data is collected
You can monitor the job using:
squeue
scontrol show job <jobid>
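For live resource usage of the running steps (as opposed to just the state), sstat helps, provided the cluster has accounting enabled; on some versions you need to name a specific step such as <jobid>.batch:
sstat -j <jobid> --format=JobID,AveCPU,MaxRSS,MaxVMSize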
⸻
Step 9: Job Completion
When the job finishes:
- slurmstepd exits
- Resources are released
- Temporary processes are cleaned up
The job state becomes COMPLETED, FAILED, TIMEOUT, or CANCELLED.
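The state and exit code are the first things to check afterwards; a non-zero ExitCode usually points at the application rather than at Slurm:
sacct -j <jobid> -X --format=JobID,State,ExitCode,Elapsed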
⸻
Step 10: Accounting and Logs
Finally:
- Job statistics are stored
- Output files remain available
- Usage data is recorded
You can check this using:
sacct
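A slightly fuller query pulls usage data alongside the state; all field names below come from sacct's standard format list:
sacct -j <jobid> --format=JobID,JobName,Partition,AllocCPUS,Elapsed,MaxRSS,State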
⸻
Full Flow Summary
- Submit job using sbatch
- slurmctld receives and queues it
- Scheduler evaluates priority
- Resources are allocated
- slurmd prepares nodes
- slurmstepd runs the job
- Job completes and resources are released
⸻
Common Misconceptions
“sbatch runs the job immediately”
It only submits the job.
“Pending means failure”
It usually means waiting for resources.
“Slurm just runs scripts”
It manages scheduling, allocation, execution, and cleanup.
⸻
Final Thought
sbatch may look simple, but it triggers a complete orchestration pipeline inside Slurm.
Once you understand this flow, debugging becomes easier, performance tuning improves, and cluster behavior becomes predictable.
⸻
Top comments (3)
This post made me think about something we rarely talk about: how much of our mental model of Slurm is shaped by assuming the "happy path."
Every step here is what happens when things go right. But the moment you hit a weird failure—job sits pending for hours with no obvious reason, or it goes straight to FAILED without ever touching a compute node—you realize the abstraction leaks badly. Suddenly you’re digging through slurmctld logs, checking why the scheduler made a decision you can’t reverse-engineer from squeue output alone.
The part about slurmstepd enforcing resource limits with cgroups stuck with me. It’s one of those details that’s invisible until it’s not—like when your job gets OOM-killed and you swear you allocated enough memory, but didn’t account for what the step daemon itself reserves. I’ve seen people chase that ghost for days.
I think the real value in understanding this pipeline isn’t just debugging, though. It’s learning to stop fighting the scheduler’s incentives. Once you realize it’s optimizing for cluster-wide throughput, not your individual job’s wait time, you start writing requests differently. Smaller, shorter, more frequent—feeding the backfill algorithm instead of resenting it.
Curious if anyone else has had that shift in how they think about job submission after seeing the internals?
That’s a great point, and very relatable.
Most of us start with the happy path model, and it works until something weird happens. Then you realize how much is hidden behind the scenes, especially when you’re digging into slurmctld logs or unexplained failures.
The slurmstepd and cgroups example is spot on. It’s one of those things you only notice when it breaks.
Also agree on the scheduler mindset shift. Once you understand it’s optimizing for the whole cluster, not individual jobs, you naturally start submitting jobs differently.
Really good insight, adds a deeper layer to the discussion.
Good breakdown. The handoff between slurmctld and slurmd is where most people get lost, so it’s nice seeing it laid out cleanly.