If you work with HPC clusters, you likely use sbatch every day. You submit a script and expect it to run.
But that single command triggers a full workflow inside Slurm.
Understanding this internal flow helps you debug issues faster, optimize job performance, and better understand how your cluster behaves.
⸻
Step 1: Submitting the Job
When you run:
sbatch job.sh
You are not starting the job. You are submitting a request to Slurm.
The script includes:
- Resource requirements such as CPUs, memory, GPUs
- Job metadata like name and output paths
- The actual commands to execute
At this point, Slurm simply accepts the job.
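For reference, a minimal job script might look like the sketch below. The directives are real sbatch options, but the values and the program name are placeholders, not recommendations.
#!/bin/bash
#SBATCH --job-name=example         # job metadata: name
#SBATCH --output=example_%j.out    # %j expands to the Job ID
#SBATCH --cpus-per-task=4          # resource requirements
#SBATCH --mem=8G
#SBATCH --gres=gpu:1               # only if the cluster exposes GPUs as GRES
#SBATCH --time=01:00:00
srun ./my_program input.dat        # the actual commands to execute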
⸻
Step 2: Communication with slurmctld
The sbatch command sends the job to the Slurm controller daemon, slurmctld.
This daemon:
- Assigns a Job ID
- Stores the job details
- Marks the job as PENDING
Nothing is running yet.
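You can see that handoff from the command line. The job ID, partition, and output below are illustrative only:
$ sbatch job.sh
Submitted batch job 123456
$ squeue -j 123456
  JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
 123456   compute  example  user PD  0:00     1 (Priority)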
⸻
Step 3: Job Enters the Queue
The job is now placed in the scheduling queue.
The scheduler evaluates:
- Job priority
- Fairshare usage
- Partition limits
- Resource availability
This determines when your job will run.
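To see how these factors come together for a specific pending job, Slurm exposes the breakdown directly (exact columns depend on how the cluster's priority plugin is configured):
sprio -j <jobid>                  # per-factor priority breakdown (age, fairshare, partition, ...)
sshare -U                         # fairshare usage for your own associations
squeue -j <jobid> -o "%i %T %r"   # job ID, state, and the scheduler's pending reason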
⸻
Step 4: Scheduling Decision
The scheduler continuously checks:
- Free nodes
- Resource fragmentation
- Backfill opportunities
If your job fits available resources, it gets selected. Otherwise, it stays pending.
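You can ask the scheduler for its current plan; the estimate is only a snapshot and shifts as other jobs finish early or new ones arrive:
squeue -j <jobid> --start                            # estimated start time for a pending job
scontrol show job <jobid> | grep -o "Reason=[^ ]*"   # the pending reason, e.g. Priority or Resources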
⸻
Step 5: Resource Allocation
Once selected, Slurm:
- Assigns specific compute nodes
- Reserves CPUs, memory, and GPUs
- Changes job state to RUNNING
Now your job has allocated resources.
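The allocation is visible in the job record once the state flips to RUNNING; the field names below are standard scontrol output, though exact contents vary by Slurm version:
scontrol show job <jobid> | grep -E "JobState|NodeList|NumCPUs|TRES"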
⸻
Step 6: Node-Level Communication
Each compute node runs a daemon called slurmd.
The controller sends job details to these nodes. The nodes prepare the execution environment.
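If a job fails before it ever starts, checking the slurmd side is often the quickest diagnostic; the node name is a placeholder:
sinfo -N -l                      # per-node state as reported by each slurmd
scontrol show node <nodename>    # detailed state for a single node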
⸻
Step 7: Job Execution via slurmstepd
On each allocated compute node, slurmd launches a slurmstepd process for the job.
This process:
- Starts your application
- Manages job steps
- Handles output and error streams
- Enforces resource limits using cgroups
Your script begins executing here.
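Each srun call inside the batch script becomes its own job step under slurmstepd, which is why accounting later shows separate step entries. A minimal sketch with placeholder program names:
#!/bin/bash
#SBATCH --ntasks=4
srun ./preprocess     # step 0
srun ./solve          # step 1
srun ./postprocess    # step 2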
⸻
Step 8: Monitoring During Execution
While the job runs:
- Slurm tracks resource usage
- Logs are written to output files
- Accounting data is collected
You can monitor the job using:
squeue
scontrol show job <jobid>
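For live resource usage of the running steps (as opposed to just the state), sstat helps, provided the cluster has accounting enabled; on some versions you need to name a specific step such as <jobid>.batch:
sstat -j <jobid> --format=JobID,AveCPU,MaxRSS,MaxVMSize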
⸻
Step 9: Job Completion
When the job finishes:
- slurmstepd exits
- Resources are released
- Temporary processes are cleaned up
The job state becomes COMPLETED, FAILED, TIMEOUT, or CANCELLED.
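The state and exit code are the first things to check afterwards; a non-zero ExitCode usually points at the application rather than at Slurm:
sacct -j <jobid> -X --format=JobID,State,ExitCode,Elapsed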
⸻
Step 10: Accounting and Logs
Finally:
- Job statistics are stored
- Output files remain available
- Usage data is recorded
You can check this using:
sacct
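A slightly fuller query pulls usage data alongside the state; all field names below come from sacct's standard format list:
sacct -j <jobid> --format=JobID,JobName,Partition,AllocCPUS,Elapsed,MaxRSS,State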
⸻
Full Flow Summary
- Submit job using sbatch
- slurmctld receives and queues it
- Scheduler evaluates priority
- Resources are allocated
- slurmd prepares nodes
- slurmstepd runs the job
- Job completes and resources are released
⸻
Common Misconceptions
“sbatch runs the job immediately”
It only submits the job.
“Pending means failure”
It usually means waiting for resources.
“Slurm just runs scripts”
It manages scheduling, allocation, execution, and cleanup.
⸻
Final Thought
sbatch may look simple, but it triggers a complete orchestration pipeline inside Slurm.
Once you understand this flow, debugging becomes easier, performance tuning improves, and cluster behavior becomes predictable.
⸻
Top comments (3)
This post made me think about something we rarely talk about: how much of our mental model of Slurm is shaped by assuming the "happy path."
Every step here is what happens when things go right. But the moment you hit a weird failure—job sits pending for hours with no obvious reason, or it goes straight to FAILED without ever touching a compute node—you realize the abstraction leaks badly. Suddenly you’re digging through slurmctld logs, checking why the scheduler made a decision you can’t reverse-engineer from squeue output alone.
The part about slurmstepd enforcing resource limits with cgroups stuck with me. It’s one of those details that’s invisible until it’s not—like when your job gets OOM-killed and you swear you allocated enough memory, but didn’t account for what the step daemon itself reserves. I’ve seen people chase that ghost for days.
I think the real value in understanding this pipeline isn’t just debugging, though. It’s learning to stop fighting the scheduler’s incentives. Once you realize it’s optimizing for cluster-wide throughput, not your individual job’s wait time, you start writing requests differently. Smaller, shorter, more frequent—feeding the backfill algorithm instead of resenting it.
Curious if anyone else has had that shift in how they think about job submission after seeing the internals?
That’s a great point, and very relatable.
Most of us start with the happy path model, and it works until something weird happens. Then you realize how much is hidden behind the scenes, especially when you’re digging into slurmctld logs or unexplained failures.
The slurmstepd and cgroups example is spot on. It’s one of those things you only notice when it breaks.
Also agree on the scheduler mindset shift. Once you understand it’s optimizing for the whole cluster, not individual jobs, you naturally start submitting jobs differently.
Really good insight, adds a deeper layer to the discussion.
Good breakdown. The handoff between slurmctld and slurmd is where most people get lost, so it’s nice seeing it laid out cleanly.