Every GPU Container Bug I've Hit on OKE (and How I Fixed Them)

Running GPU containers on Kubernetes is one of those things that works perfectly in tutorials and then breaks in confusing ways on real clusters. I've been deploying GPU workloads on OKE for a few months now, and I've built up a decent collection of debugging war stories.

This isn't a getting-started guide. This is the post I wish existed the first time I saw CrashLoopBackOff on a GPU pod with zero useful logs.

Bug 1: Pod Stuck in Pending — "0/3 nodes are available"

This was my first GPU deployment on OKE. Created a pod requesting nvidia.com/gpu: 1, and it just sat in Pending forever.

$ kubectl describe pod vllm-inference-0
Events:
  Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient nvidia.com/gpu

Three nodes, but none had GPUs. Turns out I created the GPU node pool but it hadn't finished scaling up yet. OKE provisions GPU nodes on-demand when you create the node pool, and it takes 3-5 minutes for the instances to come up.

Fix: Just wait. But also — check that your node pool is actually using a GPU shape:

# Verify GPU nodes exist and are ready
kubectl get nodes -l nvidia.com/gpu=present
kubectl describe node <gpu-node> | grep nvidia.com/gpu

# Should show:
#   nvidia.com/gpu: 1
# under Allocatable

If nvidia.com/gpu doesn't appear in Allocatable, the NVIDIA device plugin isn't running on that node. On OKE it should be automatic, but I've seen it lag behind node creation by a minute or two.
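
Before deploying the real workload, a throwaway smoke-test pod is the fastest way to confirm that scheduling, the device plugin, and the driver all line up. The pod name below is made up and the image is just the public CUDA base:

# Minimal GPU smoke-test pod. Name and image are illustrative; any CUDA base image works.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]          # print the GPU table and exit
      resources:
        limits:
          nvidia.com/gpu: 1            # forces scheduling onto a GPU node

If it schedules and kubectl logs gpu-smoke-test shows the GPU table, the node pool and device plugin are fine and the problem is in your workload.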

Bug 2: CUDA Version Mismatch

This one was nasty. The container started, then immediately crashed:

CUDA error: no kernel image is available for execution on the device

My Dockerfile used nvidia/cuda:12.4-runtime-ubuntu22.04, but the GPU node's driver only supported CUDA 12.2. The CUDA runtime inside the container was newer than what the host driver could handle.

Fix: Check the driver version on the node and match your CUDA image:

# SSH to the GPU node, or just run a throwaway debug pod
kubectl run gpu-debug --image=nvidia/cuda:12.2.0-base-ubuntu22.04 \
  --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"gpu-debug","image":"nvidia/cuda:12.2.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'

kubectl logs gpu-debug        # driver version + max supported CUDA version
kubectl delete pod gpu-debug

The output shows the driver version and maximum CUDA version it supports. Use a CUDA runtime image that doesn't exceed that version.

On OKE, Oracle controls the GPU node image. When they update the driver, you can bump your CUDA version. Don't go the other way around.
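
If you'd rather script that check than eyeball the full nvidia-smi table, these two commands pull out just the relevant fields; run them in any container that has the GPU allocated:

# Driver version in machine-readable form
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The nvidia-smi header also reports the maximum CUDA version the driver supports
nvidia-smi | grep "CUDA Version"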

Bug 3: OOM Killed During Model Loading

vLLM loaded halfway and then the pod got OOM killed:

$ kubectl describe pod vllm-0
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137

I had resources.limits.memory: 16Gi but the model needed more during the loading phase. vLLM memory-maps the model weights, and the kernel counts that against the container's memory limit even though it's not all resident.

Fix: Set memory limits higher than you think you need, or use --gpu-memory-utilization 0.85 in vLLM to cap GPU memory usage and reduce the spill to CPU memory:

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 32Gi    # was 16Gi — doubled it
  requests:
    memory: 16Gi

The gap between requests and limits gives the pod burst room during model loading without permanently reserving 32GB on the node.
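
For completeness, here's roughly where that flag goes in the container spec. The image and model name are placeholders, not what I actually run; --gpu-memory-utilization is vLLM's standard CLI flag:

containers:
  - name: inference
    image: vllm/vllm-openai:latest                     # placeholder; use your own OCIR image
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"     # placeholder model
      - "--gpu-memory-utilization=0.85"                # cap how much GPU memory vLLM claims
    resources:
      limits:
        nvidia.com/gpu: 1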

Bug 4: Liveness Probe Restart Loop

I already mentioned this in a previous post, but it's worth repeating because it gets everyone.

Model loading takes 60-120 seconds. A liveness probe with default settings starts checking within the first ten seconds and gives up after three failed checks, long before the model is ready. Kubernetes thinks the pod is dead, kills it, it restarts, starts loading again, gets killed again. Infinite loop.

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 180   # Give model time to load
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10

Use a startup probe if you want to be cleaner about it; the liveness and readiness checks don't start running until the startup probe has succeeded:

startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 30
  periodSeconds: 10
  # Gives up to 300 seconds for model loading

Bug 5: Image Pull Timeout on Large GPU Images

GPU images are 5-15GB. OCIR pulls worked fine for small images, but the GPU images would time out:

Failed to pull image: context deadline exceeded

Fix: Increase the kubelet image pull deadline and use imagePullPolicy: IfNotPresent so the massive image only downloads once:

spec:
  containers:
    - name: inference
      image: iad.ocir.io/mytenancy/vllm:v1
      imagePullPolicy: IfNotPresent

Also — pull from the same OCI region as your cluster. Cross-region OCIR pulls over the internet are slow. Same-region pulls go over the internal network and are 5-10x faster.
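
One pattern that helped on top of that (not something OKE does for you, just a common trick): pre-pull the big image onto every GPU node with a small DaemonSet so new pods always start from a warm cache. The names below are illustrative; the node label is the same one used earlier to find GPU nodes, and the image is the one from the deployment snippet above:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vllm-image-prepull
spec:
  selector:
    matchLabels:
      app: vllm-image-prepull
  template:
    metadata:
      labels:
        app: vllm-image-prepull
    spec:
      nodeSelector:
        nvidia.com/gpu: present              # only land on GPU nodes
      initContainers:
        - name: pull
          image: iad.ocir.io/mytenancy/vllm:v1
          command: ["/bin/true"]             # pull the image, then exit immediately
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # tiny placeholder that keeps the DaemonSet pods alive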

Bug 6: GPU Not Released After Pod Deletion

Deleted a pod, created a new one, and it couldn't get the GPU:

0/1 nodes are available: 1 Insufficient nvidia.com/gpu

But the old pod was gone. kubectl get pods showed nothing using the GPU.

Turns out the pod was stuck in Terminating state because the vLLM process wasn't handling SIGTERM properly. The GPU was still allocated to the zombie pod.

Fix: Add a preStop hook and set a reasonable terminationGracePeriodSeconds:

spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: inference
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "kill -TERM 1 && sleep 5"]

If a pod is truly stuck, kubectl delete pod <name> --grace-period=0 --force will release the GPU. Use it as a last resort.
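
Two quick checks that make this failure mode obvious the next time it happens; both are plain kubectl, nothing OKE-specific:

# See what is still holding the GPU allocation on the node
kubectl describe node <gpu-node> | grep -A 8 "Allocated resources"

# Find pods stuck in Terminating that might be the culprit
kubectl get pods -A | grep Terminating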

My Debugging Checklist

When a GPU pod is broken on OKE, I now run through this in order:

# 1. Is the pod even scheduled?
kubectl get pod <name> -o wide

# 2. What's the event history?
kubectl describe pod <name>

# 3. Are GPU nodes ready with GPUs allocatable?
kubectl get nodes -l nvidia.com/gpu=present
kubectl describe node <gpu-node> | grep -A5 "Allocatable"

# 4. Can nvidia-smi see the GPU from inside the pod?
kubectl exec -it <gpu-pod> -- nvidia-smi

# 5. What's the container's actual error?
kubectl logs <pod> --previous   # logs from the crashed container

# 6. Resource pressure on the node?
kubectl top node <gpu-node>

90% of the time, the answer is in steps 2 or 5. The other 10% is the CUDA version mismatch, which requires step 4.

Prevention

Most of these bugs hit me once and then I added checks so they wouldn't happen again:

  • CI validates CUDA version — build step runs nvidia-smi in a test container to verify compatibility (a sketch of the idea follows this list)
  • Startup probes on every GPU pod — no more liveness probe restart loops
  • Same-region OCIR — eliminated image pull timeouts
  • Memory limits 2x the model size — OOM kills are gone
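
That CI check is nothing fancy. Here's a sketch of the idea with the assumptions spelled out: instead of spinning up a GPU runner to run nvidia-smi, this version just compares version strings. It assumes the image is built on an nvidia/cuda base (those images export a CUDA_VERSION env var), reuses the image name from earlier in the post, and hardcodes the node driver's maximum supported CUDA version, which you'd really read off nvidia-smi on a GPU node:

# Read the CUDA runtime version baked into the image (major.minor).
# Assumes an nvidia/cuda base image, which sets the CUDA_VERSION environment variable.
IMAGE_CUDA=$(docker run --rm --entrypoint env iad.ocir.io/mytenancy/vllm:v1 \
  | grep '^CUDA_VERSION=' | cut -d= -f2 | cut -d. -f1,2)

# Maximum CUDA version the node driver supports (hardcoded here for illustration;
# nvidia-smi on the GPU node reports the real value)
NODE_MAX_CUDA="12.2"

# Fail the build if the image's CUDA runtime is newer than the driver supports
highest=$(printf '%s\n' "$NODE_MAX_CUDA" "$IMAGE_CUDA" | sort -V | tail -n1)
if [ "$highest" != "$NODE_MAX_CUDA" ]; then
  echo "Image CUDA $IMAGE_CUDA exceeds what the node driver supports ($NODE_MAX_CUDA)" >&2
  exit 1
fi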

None of this is complex. It's just stuff you learn by deploying real GPU workloads and watching them break.


Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. GitHub | LinkedIn | Website | Google Scholar | ResearchGate
