What is the bug?
We are running Mimir with 7 store-gateway replicas and S3 as the backend object storage. Our PVCs were accidentally deleted, and we created new PVCs for the store-gateways. However, since this incident we are noticing that a few blocks cannot be downloaded to the store-gateways, and as a result the ruler evaluations are failing.
Here are some logs from the different components:
Ruler logs:
ts=2025-02-12T07:00:10.66262752Z caller=spanlogger.go:109 method=blocksStoreQuerier.selectSorted user=prod level=warn user=prod msg="failed consistency check after all attempts" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX"
ts=2025-02-12T06:59:35.261638482Z caller=group.go:507 level=warn name=slo_requests_28d_total index=1 component=ruler insight=true user=prod file=/data/prod/general_rules group="eb reporting count based success us-east-1" msg="Evaluating rule failed" rule="record: slo_requests_28d_total\nexpr: sum without (instance, chart, playbook, team) (increase(slo_requests_raw{category=\"success\",region=~\"us-east-1\"}[4w]))\n" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX 01JKQVKRGS66RFAAQRE1PHW1E6"
ts=2025-02-12T06:59:35.261013787Z caller=spanlogger.go:109 method=blocksStoreQuerier.selectSorted user=prod level=warn user=prod msg="failed consistency check after all attempts" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX 01JKQVKRGS66RFAAQRE1PHW1E6"
ts=2025-02-12T06:59:13.808530061Z caller=group.go:507 level=warn name=slo_requests_28d_total index=1 component=ruler insight=true user=prod file=/data/prod/general_rules group="eb reporting count based success ap-southeast-2" msg="Evaluating rule failed" rule="record: slo_requests_28d_total\nexpr: sum without (instance, chart, playbook, team) (increase(slo_requests_raw{category=\"success\",region=~\"ap-southeast-2\"}[4w]))\n" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX"
Store-gateway logs for a specific failing block:
ts=2025-02-12T08:12:13.935234084Z caller=memcached_client.go:372 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=2 firstKey=subrange:prod/01JKNWCGTB1FWRRA36HYMSPBP6/chunks/000023:212768000:212784000 err="memcache: connect timeout to 10.160.30.28:11211"
ts=2025-02-12T08:11:51.943023071Z caller=memcached_client.go:372 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=49 firstKey=subrange:prod/01JKNWCGTB1FWRRA36HYMSPBP6/chunks/000023:207872000:207888000 err=EOF
ts=2025-02-12T09:32:40.672916816Z caller=grpc_logging.go:97 level=warn method=/gatewaypb.StoreGateway/Series duration=1.117367ms msg=gRPC err="rpc error: code = Internal desc = fetch series for block 01JKEQS19460B594S9ERDQP0XX: expanded matching postings: toPostingGroups: filtering posting group: filter posting keys: cannot load sparse index-header from disk: failed to create sparse index-header reader: EOF"
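The "cannot load sparse index-header from disk: ... EOF" error suggests that the locally cached index-header for this block on the newly created PVC may be empty or truncated. In case it helps, here is a minimal sketch of how that local copy could be inspected (pod name, namespace, and the /data mount path are assumptions for illustration; the exact location depends on the configured bucket-store sync directory):

# Locate the locally synced directory for the failing block on one store-gateway pod
# (pod name, namespace, and /data path are assumptions)
kubectl exec -n mimir mimir-store-gateway-0 -- find /data -type d -name 01JKEQS19460B594S9ERDQP0XX

# List its contents; a zero-byte or truncated (sparse-)index-header file here
# would be consistent with the EOF error in the log above
kubectl exec -n mimir mimir-store-gateway-0 -- ls -la <directory-returned-by-find>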
We also tried increasing the number of index-cache replicas, but that did not help. We have confirmed there are no disk-space or other resource issues on the store-gateway pods.
Can someone please help us debug this issue further?
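For completeness, here is a rough sketch of how we could verify the failing block directly in S3 (bucket name and tenant prefix are assumptions for illustration):

# Check that the block still exists and looks complete in object storage
# (bucket name and tenant prefix are assumptions)
aws s3 ls --recursive s3://<mimir-blocks-bucket>/prod/01JKEQS19460B594S9ERDQP0XX/

# meta.json, index, and the chunks/ segment files should all be present and non-empty;
# a missing or truncated object here would point at the bucket rather than the
# store-gateway's local cache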
How to reproduce it?
1. Start Mimir.
2. Wait for the store-gateway pods' PVCs to load the blocks from S3.
3. Delete the PVCs of the store-gateways.
4. Wait for the pods to come back up and check the store-gateway logs to see whether all blocks are loaded.
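Roughly, the kubectl commands for steps 3 and 4 (namespace, pod, and PVC names are assumptions for illustration):

# Delete the PVC backing one store-gateway replica, then restart the pod so the
# StatefulSet provisions a fresh, empty volume (names are assumptions)
kubectl delete pvc -n mimir storage-mimir-store-gateway-0
kubectl delete pod -n mimir mimir-store-gateway-0

# Follow the store-gateway logs while it re-syncs blocks from S3
kubectl logs -n mimir -f mimir-store-gateway-0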
What did you think would happen?
We did not expect any blocks to fail; we expected them to load into the store-gateways from S3 just like the other blocks were loaded.
What was your environment?
Infrastructure: Kubernetes
Deployment: Helm
Version: 2.13.0
Any additional context to share?
No response