Prometheus scaler unable to connect to Grafana Cloud Prometheus #6487

tylerauerbeck · 2025-01-15T00:49:47Z

Report

When a Prometheus scaler is configured to run against a Grafana Cloud hosted Prometheus, the query fails with the error "https://my-prometheus.grafana.net/api/prom/api/v1/query?query=buildkite_queues_scheduled_jobs_count&time=2025-01-14T05:53:39Z": context canceled

Expected Behavior

According to the documentation, this should work like any other Prometheus, with the expected API available on Grafana Cloud at https://my-prometheus.grafana.net/api/prom/.

Actual Behavior

The scaler fails with the following error:

Get "https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=buildkite_queues_scheduled_jobs_count&time=2025-01-14T05:53:39Z": context canceled

Steps to Reproduce the Problem

Deploy KEDA.
Apply the ScaledJob below.
The scaler fails immediately.

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-scaled-job
spec:
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 600
    backoffLimit: 6
    template:
      spec:
        containers:
        - name: pi
          image: perl:5.34.0
          command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        restartPolicy: Never
  pollingInterval: 10
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  minReplicaCount: 1
  maxReplicaCount: 35
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://my-cloud-prom.grafana.net/api/prom
        threshold: '1'
        query:  buildkite_queues_scheduled_jobs_count
        authModes: "basic"
      authenticationRef:
        name: keda-grafana-cloud-prom-creds

Logs from KEDA operator

2025-01-14T21:18:59Z    ERROR   prometheus_scaler       error executing prometheus query        {"type": "ScaledJob", "namespace": "default", "name": "my-scaled-job", "error": "Get \"https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=buildkite_queues_scheduled_jobs_count&time=2025-01-14T21:18:56Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
        /workspace/pkg/scalers/prometheus_scaler.go:291
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
        /workspace/pkg/scaling/cache/scalers_cache.go:151
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:847
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182

KEDA Version

2.16.1

Kubernetes Version

1.30

Platform

Any

Scaler Details

Prometheus

Anything else?

No response


JorTurFer · 2025-02-13T07:23:10Z

Hello,
That error points to a connectivity issue. Can you reach the service from other pods in the namespace? Are you using a service mesh or any CNI to manage the traffic?

tylerauerbeck · 2025-02-15T05:32:38Z

@JorTurFer I don't think connectivity is the issue. I spun up a Fedora container in the same namespace and got the following:

curl -u <user>:<token> "https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=custom_metric_info%7Btest%3D%22manual%22%7D&time=2025-02-15T05:26:35Z"
{"status":"success","data":{"resultType":"vector","result":[]}}

So it at least looks like I'm getting a response. I have a feeling it may be an issue with the metric that I'm testing with, but I figured I'd get something other than a context canceled for that.

tylerauerbeck · 2025-02-15T05:49:08Z

If I drop the time parameter, I'm getting results from the query. So I feel like the problem is definitely somewhere in this area.
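To compare the two cases side by side, the same request can be built with and without the time parameter. The helpers below are a minimal sketch, not anything KEDA or curl provides: `urlencode` and `build_query_url` are hypothetical names, and the commented-out curl calls assume the credentials are exported as `GRAFANA_USER` and `GRAFANA_TOKEN`. The colons in the timestamp are percent-encoded here, which a Prometheus-compatible endpoint should treat the same as the unencoded form.

```sh
# Percent-encode a string for use as a URL query value (ASCII input assumed).
urlencode() {
  _rest=$1
  _out=
  while [ -n "$_rest" ]; do
    _c=${_rest%"${_rest#?}"}   # first character of the remaining input
    _rest=${_rest#?}           # everything after the first character
    case $_c in
      [a-zA-Z0-9._~-]) _out=$_out$_c ;;                 # unreserved: keep as-is
      *) _out=$_out$(printf '%%%02X' "'$_c") ;;         # anything else: %XX
    esac
  done
  printf '%s\n' "$_out"
}

# Build an instant-query URL.
# $1 = server address, $2 = PromQL query, $3 = optional evaluation time
build_query_url() {
  _url="${1%/}/api/v1/query?query=$(urlencode "$2")"
  if [ -n "${3:-}" ]; then
    _url="$_url&time=$(urlencode "$3")"
  fi
  printf '%s\n' "$_url"
}

# Example (needs network access and real credentials):
# SERVER=https://my-cloud-prom.grafana.net/api/prom
# QUERY='custom_metric_info{test="manual"}'
# curl -sS -u "$GRAFANA_USER:$GRAFANA_TOKEN" "$(build_query_url "$SERVER" "$QUERY")"
# curl -sS -u "$GRAFANA_USER:$GRAFANA_TOKEN" "$(build_query_url "$SERVER" "$QUERY" "$(date -u +%Y-%m-%dT%H:%M:%SZ)")"
```

Running both curl calls back to back from a pod in the same namespace shows whether the time parameter alone changes the result.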

tylerauerbeck · 2025-02-15T06:37:42Z

Alright, I wanted to blame this on myself, so I went back and used the example metric with the query:

curl -u <user>:<token> "https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=sum%28rate%28http_requests_total%7Bdeployment%3D%22my-deployment%22%7D%5B2m%5D%29%29&time=2025-02-15T06:30:05Z"
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1739601005,"0.06452386000165918"]}]}}

So even after getting a value from the Prometheus response and verifying that I have connectivity from within the cluster, I'm still seeing a context canceled error:

"error": "Get \"https://my-cloud-prom.grafana.net/api/prom/api/v1/query?query=sum%28rate%28http_requests_total%7Bdeployment%3D%22my-deployment%22%7D%5B2m%5D%29%29&time=2025-02-15T06:30:05Z\": context canceled"}

JorTurFer · 2025-02-15T23:29:36Z

Have you tried increasing the global timeout a bit? The default value is 3 seconds -> https://github.com/kedacore/charts/blob/main/keda/values.yaml#L543
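For reference, a minimal sketch of how that setting might be raised through the Helm chart. It assumes KEDA was installed from the kedacore/keda chart and that `http.timeout` is the key at the linked line; the value is in milliseconds, and 10000 is only an illustrative number.

```yaml
# values.yaml override for the KEDA Helm chart (assumed install method)
http:
  # Default HTTP timeout for scalers that use raw HTTP clients, in milliseconds.
  timeout: 10000   # chart default is 3000, i.e. 3 seconds
```

If KEDA was deployed without Helm, the same setting is, as far as I know, exposed as the `KEDA_HTTP_DEFAULT_TIMEOUT` environment variable on the operator deployment.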

tylerauerbeck · 2025-02-17T17:31:46Z

Bumped it up to 6000 and I still see the same immediate failure. Here's the config for my ScaledJob in case that helps:

---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-grafana-cloud-prom-creds
spec:
  secretTargetRef:
    - parameter: username
      name: keda-grafana-cloud-prom-secret
      key: username
    - parameter: password
      name: keda-grafana-cloud-prom-secret
      key: password
---
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-scaled-job-1
spec:
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 600
    backoffLimit: 6
    template:
      spec:
        containers:
        - name: pi
          image: fedora:latest
          command: ["sleep", "2m"]
        restartPolicy: Never
  pollingInterval: 10
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://my-cloud-prom.grafana.net/api/prom
        query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
        threshold: '0.03'
        activationThreshold: '0'
        authModes: "basic"
      authenticationRef:
        name: keda-grafana-cloud-prom-creds
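
The TriggerAuthentication above reads `username` and `password` from a secret named keda-grafana-cloud-prom-secret, which is not shown in the thread. A sketch of what such a secret could look like follows; the namespace and the placeholder values are assumptions, with `<user>` and `<token>` standing in for the same credentials used in the curl tests.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: keda-grafana-cloud-prom-secret
  namespace: default   # assumption: same namespace as the ScaledJob
type: Opaque
stringData:
  username: "<user>"    # placeholder, not a real value
  password: "<token>"   # placeholder, not a real value
```

With basic auth, a key name that does not match the `key` fields in the TriggerAuthentication would leave the scaler without credentials, so it is worth checking the two against each other.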

JorTurFer · 2025-02-17T19:22:47Z

Have you checked how long the request takes when you use curl?

curl -o /dev/null -s -w 'Total: %{time_total}s\n'  ....
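
A slightly fuller version of that timing check is sketched below. It splits the total into DNS, connect, TLS and first-byte times using curl's standard `-w` variables, and adds a small comparison against a timeout in milliseconds (3000 matches the 3-second default mentioned above). The function names `time_query` and `exceeds_timeout` are hypothetical, and the credentials and URL are left for the caller to supply.

```sh
# Print curl's timing breakdown (in seconds) for one request; the body is discarded.
# $1 = "user:token", $2 = full query URL
time_query() {
  curl -o /dev/null -sS -u "$1" \
    -w 'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} first_byte=%{time_starttransfer} total=%{time_total}\n' \
    "$2"
}

# Succeed (exit 0) when an elapsed time in seconds exceeds a timeout in milliseconds.
# $1 = elapsed seconds, e.g. the total= value printed above, $2 = timeout in ms
exceeds_timeout() {
  awk -v t="$1" -v ms="$2" 'BEGIN { exit !(t * 1000 > ms) }'
}

# Example (needs network access and real credentials):
# time_query "$GRAFANA_USER:$GRAFANA_TOKEN" "$URL"
# exceeds_timeout 3.2 3000 && echo "slower than the 3 s default"
```

If the total stays well under the configured timeout from inside the cluster, the timeout itself is unlikely to explain the error.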

tylerauerbeck added the bug Something isn't working label Jan 15, 2025

keda-automation added this to Roadmap - KEDA Core Jan 15, 2025

github-project-automation bot moved this to To Triage in Roadmap - KEDA Core Jan 15, 2025

JorTurFer moved this from To Triage to Proposed in Roadmap - KEDA Core Feb 13, 2025
