Setting up horizontal pod autoscaling on GPU worker nodes

Review the following steps to enable horizontal pod autoscaling on your GPU worker nodes.

Why horizontal pod autoscaling?
You might want to configure horizontal pod autoscaling (HPA) to scale the number of pods when a workload consumes more or less than a certain amount of GPU. Because GPUs are an expensive resource, you might not want workloads to run at full capacity for long periods of time. Instead, you can scale pods up or down based on the running workload in the cluster.

Prerequisites

To configure HPA, the following components must be installed on your cluster.

  • NVIDIA Data Center GPU Manager (DCGM) exporter to gather GPU metrics in Kubernetes. The DCGM exporter exposes GPU metrics to Prometheus, which can be visualized by using Grafana.
  • Prometheus and the Prometheus adapter to generate custom metrics.
  1. Install the NVIDIA GPU Operator. For example, you can install it with Helm, as shown in the following sketch.
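
    A minimal sketch, assuming that you install from the NVIDIA Helm chart repository. The nvidia-gpu-operator namespace matches the scrape configuration that is used later in these steps. On Red Hat OpenShift, you can also install the GPU Operator from OperatorHub.

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait gpu-operator -n nvidia-gpu-operator --create-namespace nvidia/gpu-operator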

  2. Install Prometheus by using the kube-prometheus-stack Helm chart with a values file that adds a scrape job for the DCGM exporter.
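
    If you have not already added the prometheus-community Helm repository, add it and update your local chart cache first.

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update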

    helm install prom-stack prometheus-community/kube-prometheus-stack -f ~/ca-prom-val.yaml
    
    cat ~/ca-prom-val.yaml
    
    prometheus:
      prometheusSpec:
        additionalScrapeConfigs:
        - job_name: gpu-metrics
          scrape_interval: 1s
          metrics_path: /metrics
          scheme: http
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - nvidia-gpu-operator
          relabel_configs:
          - source_labels: [__meta_kubernetes_endpoints_name]
            action: drop
            regex: .*-node-feature-discovery-master
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node
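
    Optionally, verify that GPU metrics are being exported before you continue. This sketch assumes the default DCGM exporter service name that the GPU Operator creates (nvidia-dcgm-exporter); check the actual name in your cluster with oc get svc -n nvidia-gpu-operator.

    oc port-forward -n nvidia-gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
    curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL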
    
  3. Get the Prometheus service details. Note the name of the Prometheus service, such as prom-stack-kube-prometheus-prometheus, which is used for the prometheus.url value in the next step.

    oc get svc
    
  4. Install the Prometheus adapter.

    helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter --set prometheus.url="http://prom-stack-kube-prometheus-prometheus.default.svc.cluster.local"
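
    To verify that the adapter exposes the GPU metric through the custom metrics API, you can query the API directly. This check assumes that the DCGM exporter metrics are already being scraped by Prometheus.

    oc get --raw /apis/custom.metrics.k8s.io/v1beta1 | grep DCGM_FI_DEV_GPU_UTIL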
    

Setting up HPA

Complete the following steps to create a deployment that uses HPA.

  1. Create a deployment.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cuda-test
      labels:
        app: cuda-test
    spec:
      selector:
        matchLabels:
          app: cuda-test
      template:
        metadata:
          labels:
            app: cuda-test
        spec:
          containers:
          - name: cuda-test-main
            image: "k8s.gcr.io/cuda-vector-add:v0.1"
            command: ["bash", "-c", "for (( c=1; c<=5000; c++ )); do ./vectorAdd; done"]
            resources:
              limits:
                nvidia.com/gpu: 1
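
    Save the YAML to a file and apply it. The file name cuda-test-deployment.yaml is an example; use the name that you saved the file as.

    oc apply -f cuda-test-deployment.yaml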
    
  2. Create a HorizontalPodAutoscaler resource.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: cuda-hpa
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: cuda-test
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: DCGM_FI_DEV_GPU_UTIL     # The metric that you want to use for autoscaling
          target:
            type: AverageValue
            averageValue: '5'
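
    Save and apply the HorizontalPodAutoscaler resource, then check its status. The file name cuda-hpa.yaml is an example.

    oc apply -f cuda-hpa.yaml
    oc get hpa cuda-hpa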
    
  3. Run the following commands to review the results.

    oc get pods | grep cuda
    
    cuda-test-d987464bf-brd48                                1/1     Running   0          4m19s
    cuda-test-d987464bf-gsx82                                0/1     Pending   0          4m19s
    cuda-test-d987464bf-zstzs                                1/1     Running   0          7m35s
    

    The deployment scaled from 1 replica to 3 replicas as the GPU workload increased. To review the scaling events, describe the HorizontalPodAutoscaler resource.
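
    oc describe hpa cuda-hpa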

    Min replicas:       1
    Max replicas:       3
    Deployment pods:    3 current / 3 desired
    Events:
    Type    Reason             Age   From                       Message
    ----    ------             ----  ----                       -------
    Normal  SuccessfulRescale  50s   horizontal-pod-autoscaler  New size: 3; reason: pods metric DCGM_FI_DEV_GPU_UTIL above target
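
    When you finish testing, you can remove the example resources.

    oc delete hpa cuda-hpa
    oc delete deployment cuda-test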