Prometheus scaler unable to connect to Grafana Cloud Prometheus #6487

Open
tylerauerbeck opened this issue Jan 15, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@tylerauerbeck

Report

When a Prometheus scaler is configured to run against a Grafana Cloud hosted Prometheus, the query fails with: Get "https://my-prometheus.grafana.net/api/prom/api/v1/query?query=buildkite_queues_scheduled_jobs_count&time=2025-01-14T05:53:39Z": context canceled

Expected Behavior

According to the documentation, this should work just like any other Prometheus, with the expected API available on Grafana Cloud at https://my-prometheus.grafana.net/api/prom/.

Actual Behavior

Error occurs.

Get "https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=buildkite_queues_scheduled_jobs_count&time=2025-01-14T05:53:39Z": context canceled

Steps to Reproduce the Problem

  1. Deploy KEDA
  2. Apply the ScaledJob below
  3. The scaler fails immediately
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-scaled-job
spec:
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 600
    backoffLimit: 6
    template:
      spec:
        containers:
        - name: pi
          image: perl:5.34.0
          command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        restartPolicy: Never
  pollingInterval: 10
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  minReplicaCount: 1
  maxReplicaCount: 35
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://my-cloud-prom.grafana.net/api/prom
        threshold: '1'
        query:  buildkite_queues_scheduled_jobs_count
        authModes: "basic"
      authenticationRef:
        name: keda-grafana-cloud-prom-creds
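
Comparing the `serverAddress` above with the URL in the error suggests the request URL is the server address plus `/api/v1/query`, with the query percent-encoded and a `time` parameter appended. The following is a minimal Python sketch of that composition, useful for reproducing the exact request by hand. The function name `build_query_url` is hypothetical and this is not KEDA's implementation; it only mirrors the URL shape seen in the logs.

```python
from urllib.parse import quote


def build_query_url(server_address: str, query: str, time_rfc3339: str) -> str:
    """Join the server address, the instant-query path, and the parameters."""
    base = server_address.rstrip("/")
    # Percent-encode everything in the PromQL expression except unreserved characters.
    encoded_query = quote(query, safe="")
    # The timestamp is appended as-is, matching the unescaped colons in the logged URL.
    return f"{base}/api/v1/query?query={encoded_query}&time={time_rfc3339}"


url = build_query_url(
    "https://my-cloud-prom.grafana.net/api/prom",
    "buildkite_queues_scheduled_jobs_count",
    "2025-01-14T05:53:39Z",
)
print(url)
```

The printed URL can be passed to curl with the same credentials to check whether the endpoint answers outside of KEDA.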

Logs from KEDA operator

2025-01-14T21:18:59Z    ERROR   prometheus_scaler       error executing prometheus query        {"type": "ScaledJob", "namespace": "default", "name": "my-scaled-job", "error": "Get \"https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=buildkite_queues_scheduled_jobs_count&time=2025-01-14T21:18:56Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
        /workspace/pkg/scalers/prometheus_scaler.go:291
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
        /workspace/pkg/scaling/cache/scalers_cache.go:151
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:847
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182

KEDA Version

2.16.1

Kubernetes Version

1.30

Platform

Any

Scaler Details

Prometheus

Anything else?

No response

@tylerauerbeck added the bug (Something isn't working) label Jan 15, 2025
@JorTurFer
Member

Hello
That error points to a connectivity issue. Can you reach the service from other pods in the namespace? Are you using a service mesh or any CNI to manage the traffic?

@JorTurFer moved this from To Triage to Proposed in Roadmap - KEDA Core Feb 13, 2025
@tylerauerbeck
Author

@JorTurFer I don't think connectivity is the issue. I spun up a Fedora container in the same namespace and got the following:

curl -u <user>:<token> "https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=custom_metric_info%7Btest%3D%22manual%22%7D&time=2025-02-15T05:26:35Z"
{"status":"success","data":{"resultType":"vector","result":[]}}

So it at least looks like I'm getting a response. I have a feeling it might be an issue with the metric that I'm testing with, but I figured I'd get something other than a context canceled for that.

@tylerauerbeck
Author

If I drop the time parameter, I'm getting results from the query. So I feel like the problem is definitely somewhere in this area.

@tylerauerbeck
Author

Alright, I wanted to blame this on me, so I went back and used the example metric with the following query:

curl -u <user>:<token> "https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=sum%28rate%28http_requests_total%7Bdeployment%3D%22my-deployment%22%7D%5B2m%5D%29%29&time=2025-02-15T06:30:05Z"
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1739601005,"0.06452386000165918"]}]}}
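
To make the difference between the two curl responses in this thread explicit (an empty `result` list versus a single sample), here is a small Python sketch that reads an instant-query response. The function name `parse_instant_vector` is hypothetical, and this is only an illustration of the response shape, not how KEDA parses it.

```python
import json
from datetime import datetime, timezone
from typing import Optional, Tuple


def parse_instant_vector(body: str) -> Optional[Tuple[str, float]]:
    """Return (UTC timestamp, value) for the first sample, or None if empty."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    if not data["result"]:
        # A successful response can still carry no samples.
        return None
    timestamp, raw_value = data["result"][0]["value"]
    when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    # Sample values arrive as strings and need converting to float.
    return when.strftime("%Y-%m-%dT%H:%M:%SZ"), float(raw_value)


empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'
populated = (
    '{"status":"success","data":{"resultType":"vector","result":'
    '[{"metric":{},"value":[1739601005,"0.06452386000165918"]}]}}'
)
print(parse_instant_vector(empty))
print(parse_instant_vector(populated))
```

The sample timestamp 1739601005 corresponds to the `time` value sent in the request, so the endpoint is evaluating the query at the requested instant.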

So I can see that even with:

  • getting a value from the prometheus response
  • verifying that I have connectivity from within the cluster

I'm still seeing a context canceled

"error": "Get \"https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=sum%28rate%28http_requests_total%7Bdeployment%3D%22my-deployment%22%7D%5B2m%5D%29%29&time=2025-02-15T06:30:05Z\": context canceled"}

@JorTurFer
Member

Have you tried increasing the global timeout a bit? The default value is 3 seconds: https://github.com/kedacore/charts/blob/main/keda/values.yaml#L543

@tylerauerbeck
Author

Bumped it up to 6000 and I still see the same immediate behavior. Here's my config for my ScaledJob in case that helps:

---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-grafana-cloud-prom-creds
spec:
  secretTargetRef:
    - parameter: username
      name: keda-grafana-cloud-prom-secret
      key: username
    - parameter: password
      name: keda-grafana-cloud-prom-secret
      key: password
---
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-scaled-job-1
spec:
  jobTargetRef:
    parallelism: 1                   
    completions: 1                            
    activeDeadlineSeconds: 600               
    backoffLimit: 6                       
    template:
      spec:
        containers:
        - name: pi
          image: fedora:latest 
          command: ["sleep",  "2m"]
        restartPolicy: Never
  pollingInterval: 10                   
  successfulJobsHistoryLimit: 3              
  failedJobsHistoryLimit: 3                  
  minReplicaCount: 1                   
  maxReplicaCount: 5                  
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://my-cloud-prom.grafana.net/api/prom
        query: sum(rate(http_requests_total{deployment="my-deployment"}[2m])) 
        threshold: '0.03'
        activationThreshold: '0'
        authModes: "basic"
      authenticationRef:
        name: keda-grafana-cloud-prom-creds 
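
Both `curl -u <user>:<token>` and a `basic` auth mode rely on HTTP Basic authentication, so one way to cross-check the stored credentials is to compute the expected `Authorization` header by hand. This is a minimal Python sketch with placeholder credentials; the function name `basic_auth_header` is hypothetical and it does not reflect how KEDA builds the request internally.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    # Basic auth is the base64 encoding of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Placeholder values; substitute the username and password held in the Secret.
print(basic_auth_header("user", "pass"))
```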

@JorTurFer
Member

Have you checked how long the request takes when you use curl?

curl -o /dev/null -s -w 'Total: %{time_total}s\n'  ....
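
The same measurement can be sketched in Python, which makes it easy to compare the elapsed time against a chosen timeout. The function name `timed_get` and the injectable `opener` parameter are assumptions made for illustration, and the 3 second default simply mirrors the default mentioned earlier in this thread.

```python
import base64
import time
import urllib.request


def timed_get(url, username, password, timeout_s=3.0, opener=urllib.request.urlopen):
    """Send a GET with basic auth and return (body, elapsed seconds)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    start = time.monotonic()
    # urlopen raises an exception if the timeout elapses before a response arrives.
    with opener(request, timeout=timeout_s) as response:
        body = response.read()
    return body, time.monotonic() - start
```

Calling `timed_get(url, "<user>", "<token>")` with the full query URL from the error message returns the response body and the total time, which can then be compared with the configured timeout.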
