Operator unable to delete Kubernetes Deployment #910

Closed
thaisarcanjo-ow opened this issue Oct 14, 2024 · 3 comments
@thaisarcanjo-ow

thaisarcanjo-ow commented Oct 14, 2024

There is an issue with the default settings from the docs: the Operator tries to delete a Kubernetes Deployment using the wrong name and therefore cannot find it. It uses the worker Pod name as the Deployment name, and no Deployment with that name exists.

Reproducing steps:

  1. Install the Operator with helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator, i.e. this quick start step.
  2. Create the cluster using the default yaml available from this guide as is. At this stage, two workers are available, one per Deployment.
  3. Create an autoscaler with the minimum number of workers set to 0:
# autoscaler.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: simple
spec:
  cluster: "simple"
  minimum: 0  # we recommend always having a minimum of 1 worker so that an idle cluster can start working on tasks immediately
  maximum: 10 # you can place a hard limit on the number of workers regardless of what the scheduler requests
  4. Apply the autoscaler settings:
kubectl apply -f autoscaler.yaml
daskautoscaler.kubernetes.dask.org/simple created
  5. At this stage, the operator already tries to scale down, but it attempts to delete a Deployment resource named after the Pod, which doesn't exist:
[2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Autoscaler updated simple worker count from 2 to 1
[2024-10-14 09:22:42,559] kopf.objects         [INFO    ] [default/simple] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-14 09:22:42,662] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/kubernetes.dask.org/v1/namespaces/default/daskclusters?fieldSelector=metadata.name%3Dsimple "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,668] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/apis/apps/v1/namespaces/default/deployments?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,673] kopf.objects         [INFO    ] [default/simple-default] Scaled worker group simple-default up to 1 workers.
[2024-10-14 09:22:42,677] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,687] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/services?fieldSelector=metadata.name%3Dsimple-scheduler "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,693] kopf.objects         [WARNING ] [default/simple-default] Scaling simple-default failed via the HTTP API and the Dask RPC, falling back to LIFO scaling. This can result in lost data, see https://kubernetes.dask.org/en/latest/operator_troubleshooting.html.
[2024-10-14 09:22:42,697] httpx                [INFO    ] HTTP Request: GET https://10.96.0.1/api/v1/namespaces/default/pods?labelSelector=dask.org%2Fworkergroup-name%3Dsimple-default "HTTP/1.1 200 OK"
[2024-10-14 09:22:42,701] kopf.objects         [INFO    ] [default/simple-default] Workers to close: ['simple-default-worker-057ae426b6-79bcbdb84b-vlcn7']
[2024-10-14 09:22:42,705] httpx                [INFO    ] HTTP Request: DELETE https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7 "HTTP/1.1 404 Not Found"
[2024-10-14 09:22:42,705] kopf.objects         [ERROR   ] [default/simple-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 168, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.96.0.1/apis/apps/v1/namespaces/default/deployments/simple-default-worker-057ae426b6-79bcbdb84b-vlcn7'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 336, in delete
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: deployments.apps "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7" not found

If I check the pods, the name it tried to delete, simple-default-worker-057ae426b6-79bcbdb84b-vlcn7, does exist, but as a worker Pod rather than a Deployment:

kubectl get pods -l dask.org/cluster-name=simple
NAME                                                READY   STATUS    RESTARTS   AGE
simple-default-worker-057ae426b6-79bcbdb84b-vlcn7   1/1     Running   0          9m36s
simple-default-worker-54afdedac5-6bdb8f746b-7lzsg   1/1     Running   0          9m36s
simple-scheduler-78db7fbfd8-zmwgr                   1/1     Running   0          9m36s

However, the Deployment that controls this Pod has a different name:

kubectl get deployments -l dask.org/cluster-name=simple
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
simple-default-worker-057ae426b6   1/1     1            1           15m
simple-default-worker-54afdedac5   1/1     1            1           15m
simple-scheduler                   1/1     1            1           15m

As you can see, the Deployment that controls that worker Pod is actually named simple-default-worker-057ae426b6 instead of simple-default-worker-057ae426b6-79bcbdb84b-vlcn7. As a result, the operator is unable to delete the Deployments and the workers are never removed from the namespace. It could be coming from this line, where the deletion uses the worker name as the expected Deployment name.
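To make the mismatch concrete, here is a minimal sketch of how the two names relate. It assumes the usual naming of Deployment-managed Pods, `<deployment>-<pod-template-hash>-<suffix>`, and the helper name `deployment_name_from_pod` is hypothetical. It is not the operator's code or the actual fix.

```python
def deployment_name_from_pod(pod_name: str) -> str:
    """Return the owning Deployment name for a Deployment-managed Pod.

    Assumption: the Pod name has the form
    <deployment>-<pod-template-hash>-<random-suffix>, so dropping the
    last two dash-separated segments recovers the Deployment name.
    """
    parts = pod_name.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected Pod name: {pod_name!r}")
    return parts[0]


# Names taken from the kubectl output in this issue.
pod = "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7"
print(deployment_name_from_pod(pod))
```

String parsing like this is fragile (very long names can be truncated by Kubernetes), so following the Pod's ownerReferences or listing Deployments by label is likely the safer route in real code.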

Anything else we need to know?:
This may be related to #855

Environment:

  • Dask version: 2024.9.1
  • Python version: 3.11
  • Operating System: Mac/Linux
  • Install method (conda, pip, source): pip
@thaisarcanjo-ow
Author

thaisarcanjo-ow commented Oct 15, 2024

To provide some extra information, it seems the operator tries three ways to determine which worker/deployment to remove:

  1. Dashboard HTTP API here
  2. Dask RPC here
  3. Kubernetes API here (I think the fallback option should not be Pods but Deployments here)
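The three-step fallback above can be sketched roughly as follows. All function names here are hypothetical stand-ins, not the operator's real functions, and the failures are simulated. The last strategy picks from Deployment names rather than Pod names, as suggested in item 3.

```python
from typing import Callable, Iterable, List


def pick_workers_to_close(
    strategies: Iterable[Callable[[int], List[str]]], n: int
) -> List[str]:
    """Try each strategy in order and return the first successful answer."""
    errors = []
    for strategy in strategies:
        try:
            return strategy(n)
        except Exception as exc:  # fall through to the next strategy
            errors.append(exc)
    raise RuntimeError(f"all strategies failed: {errors}")


# Hypothetical stand-ins that simulate the behaviour seen in the logs.
def via_http_api(n: int) -> List[str]:
    raise ConnectionError("scheduler HTTP API returned 404")


def via_dask_rpc(n: int) -> List[str]:
    raise ConnectionError("Dask RPC unavailable")


def via_lifo_deployments(n: int) -> List[str]:
    # Assumption: the list is ordered oldest to newest, so LIFO takes the tail.
    deployments = [
        "simple-default-worker-057ae426b6",
        "simple-default-worker-54afdedac5",
    ]
    return deployments[-n:]


workers = pick_workers_to_close(
    [via_http_api, via_dask_rpc, via_lifo_deployments], 1
)
print(workers)
```

Because the result is a Deployment name, a subsequent DELETE against the deployments endpoint would target a resource that exists.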

From the logs, we see that the first two failed, which was a bit unexpected given that the operator can scale the workers up.
We added some parameters to the operator to get debug logs:

helm install --repo https://helm.dask.org --create-namespace -n dask-operator dask-kubernetes-operator dask-kubernetes-operator --set kopfArgs="{--all-namespaces,--verbose,--debug}"

With debug logging enabled, we could see some 404s in the response body (it would be useful to see which request they belonged to). After digging through the issues here, #807 shed some light: it suggests adding distributed.http.scheduler.api to the distributed.scheduler.http.routes Dask config, so we added that to the config map as:

    # config map settings applied to the dask-cluster
    distributed:
      scheduler:
        http:
          routes:
          - distributed.http.scheduler.prometheus
          - distributed.http.scheduler.info
          - distributed.http.scheduler.json
          - distributed.http.health
          - distributed.http.proxy
          - distributed.http.statics
          - distributed.http.scheduler.api 

We then recreated the scheduler. The first HTTP call to get the workers to retire now appears to return the right name (apparently the value of the DASK_WORKER_NAME env var, since the dashboard shows the workers under that name, i.e. matching the Deployment name), and the workers are removed once all tasks have been computed:

[2024-10-15 15:40:04,912] kopf.objects         [INFO    ] [my-namespace/dask-autoscaler] Autoscaler updated dask-cluster worker count from 2 to 1
[2024-10-15 15:40:04,912] kopf.objects         [INFO    ] [my-namespace/dask-autoscaler] Timer 'daskautoscaler_adapt' succeeded.
[2024-10-15 15:40:04,997] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/kubernetes.dask.org/v1/namespaces/my-namespace/daskclusters?fieldSelector=metadata.name%3Ddask-cluster "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,022] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments?labelSelector=dask.org%2Fworkergroup-name%3Ddask-cluster-default "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,034] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Scaled worker group dask-cluster-default up to 1 workers.
[2024-10-15 15:40:05,041] httpx                [INFO    ] HTTP Request: GET https://10.0.0.1/api/v1/namespaces/my-namespace/services?fieldSelector=metadata.name%3Ddask-cluster-scheduler "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,057] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Retired workers {'tcp://172.18.69.197:34793': {'type': 'Worker', 'id': 'dask-cluster-default-worker-9e4e522e22', 'host': '172.18.69.197', 'resources': {}, 'local_directory': '/tmp/dask-scratch-space/worker-20y99qa3', 'name': 'dask-cluster-default-worker-9e4e522e22', 'nthreads': 1, 'memory_limit': 12000000000, 'last_seen': 1729006804.7547565, 'services': {'dashboard': 44215}, 'metrics': {'task_counts': {}, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'digests_total_since_heartbeat': {'tick-duration': 0.5005748271942139, 'latency': 0.0019073486328125}, 'managed_bytes': 0, 'spilled_bytes': {'memory': 0, 'disk': 0}, 'transfer': {'incoming_bytes': 0, 'incoming_count': 0, 'incoming_count_total': 12, 'outgoing_bytes': 0, 'outgoing_count': 0, 'outgoing_count_total': 36}, 'event_loop_interval': 0.020009407997131346, 'cpu': 4.0, 'memory': 187879424, 'time': 1729006804.256867, 'host_net_io': {'read_bps': 285.6785263997705, 'write_bps': 1480.334182253356}, 'host_disk_io': {'read_bps': 8182.791917017202, 'write_bps': 270032.1332615676}, 'num_fds': 22}, 'status': 'closed', 'nanny': 'tcp://172.18.69.197:41727'}}
[2024-10-15 15:40:05,058] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Workers to close: ['dask-cluster-default-worker-9e4e522e22']
[2024-10-15 15:40:05,067] httpx                [INFO    ] HTTP Request: DELETE https://10.0.0.1/apis/apps/v1/namespaces/my-namespace/deployments/dask-cluster-default-worker-9e4e522e22 "HTTP/1.1 200 OK"
[2024-10-15 15:40:05,067] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Scaled worker group dask-cluster-default down to 1 workers.
[2024-10-15 15:40:05,068] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' succeeded.
[2024-10-15 15:40:05,068] kopf.objects         [INFO    ] [my-namespace/dask-cluster-default] Updating is processed: 1 succeeded; 0 failed.
[2024-10-15 15:40:07,830] kopf.objects         [INFO    ] [my-namespace/dask-cluster] Timer 'daskcluster_autoshutdown' succeeded.

Is adding distributed.http.scheduler.api the correct way to get the scale-down part of the autoscaler working? It wasn't required for scale-up (workers are created correctly).
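For anyone checking their own setup, a small sketch that tests whether a parsed Dask config contains the API route. It works on a plain dict mirroring the config map above, and the helper name `has_scheduler_api_route` is hypothetical. It only inspects the config; it does not confirm that the scheduler actually serves the route.

```python
from typing import Any, Mapping

API_ROUTE = "distributed.http.scheduler.api"


def has_scheduler_api_route(config: Mapping[str, Any]) -> bool:
    """Check distributed.scheduler.http.routes for the API route."""
    routes = (
        config.get("distributed", {})
        .get("scheduler", {})
        .get("http", {})
        .get("routes", [])
    )
    return API_ROUTE in routes


# Same routes as in the config map shown in this comment.
config = {
    "distributed": {
        "scheduler": {
            "http": {
                "routes": [
                    "distributed.http.scheduler.prometheus",
                    "distributed.http.scheduler.info",
                    "distributed.http.scheduler.json",
                    "distributed.http.health",
                    "distributed.http.proxy",
                    "distributed.http.statics",
                    "distributed.http.scheduler.api",
                ]
            }
        }
    }
}
print(has_scheduler_api_route(config))
```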

@fcourtial

Hello,

I went to all the trouble of understanding the issue, only to find out it had already been reported and fixed. 😊

The consequence on my cluster is that it scales up but doesn't scale down.

So if we spawn 400 pods and they don't shut down, it can get expensive quickly.

I installed the operator through:

apiVersion: v2
name: dask
version: 1.0.0
dependencies:
  - name: dask-kubernetes-operator
    version: 2024.9.0
    repository: https://helm.dask.org/

Do you know when this fix will be released? Or is there an easy way to use it on my kube cluster?

Best regards.

@jacobtomlinson
Member

I just tagged 2025.1.0 which will contain the fix for this.
