
Bug: Store gateway not loading a few blocks after PVC wipe out #10630

Open
bhargavmg opened this issue Feb 12, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@bhargavmg

What is the bug?

We are running Mimir with 7 store gateway replicas and S3 as the backend. Our PVCs were deleted accidentally, and we created new PVCs for the store gateways. However, since this incident we are noticing that a few blocks cannot be downloaded to the store gateways, which is causing ruler evaluations to fail.

Here are some of the logs from the different components.

Ruler logs:

```
ts=2025-02-12T07:00:10.66262752Z caller=spanlogger.go:109 method=blocksStoreQuerier.selectSorted user=prod level=warn user=prod msg="failed consistency check after all attempts" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX"
ts=2025-02-12T06:59:35.261638482Z caller=group.go:507 level=warn name=slo_requests_28d_total index=1 component=ruler insight=true user=prod file=/data/prod/general_rules group="eb reporting count based success us-east-1" msg="Evaluating rule failed" rule="record: slo_requests_28d_total\nexpr: sum without (instance, chart, playbook, team) (increase(slo_requests_raw{category=\"success\",region=~\"us-east-1\"}[4w]))\n" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX 01JKQVKRGS66RFAAQRE1PHW1E6"
ts=2025-02-12T06:59:35.261013787Z caller=spanlogger.go:109 method=blocksStoreQuerier.selectSorted user=prod level=warn user=prod msg="failed consistency check after all attempts" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX 01JKQVKRGS66RFAAQRE1PHW1E6"
ts=2025-02-12T06:59:13.808530061Z caller=group.go:507 level=warn name=slo_requests_28d_total index=1 component=ruler insight=true user=prod file=/data/prod/general_rules group="eb reporting count based success ap-southeast-2" msg="Evaluating rule failed" rule="record: slo_requests_28d_total\nexpr: sum without (instance, chart, playbook, team) (increase(slo_requests_raw{category=\"success\",region=~\"ap-southeast-2\"}[4w]))\n" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX"
```

Store gateway logs for a specific failing block:

```
ts=2025-02-12T08:12:13.935234084Z caller=memcached_client.go:372 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=2 firstKey=subrange:prod/01JKNWCGTB1FWRRA36HYMSPBP6/chunks/000023:212768000:212784000 err="memcache: connect timeout to 10.160.30.28:11211"
ts=2025-02-12T08:11:51.943023071Z caller=memcached_client.go:372 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=49 firstKey=subrange:prod/01JKNWCGTB1FWRRA36HYMSPBP6/chunks/000023:207872000:207888000 err=EOF
ts=2025-02-12T09:32:40.672916816Z caller=grpc_logging.go:97 level=warn method=/gatewaypb.StoreGateway/Series duration=1.117367ms msg=gRPC err="rpc error: code = Internal desc = fetch series for block 01JKEQS19460B594S9ERDQP0XX: expanded matching postings: toPostingGroups: filtering posting group: filter posting keys: cannot load sparse index-header from disk: failed to create sparse index-header reader: EOF"
```

We even tried increasing the index-cache replicas because of this, but with no luck. We also confirmed that there are no disk space or resource issues on the store gateway pods.

Can someone please help us figure out how to debug this issue further?
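For reference, here is a sketch of one check we could run on each store-gateway disk, given that the error mentions the sparse index-header. It is demonstrated in a throwaway temp directory; the `sparse-index-header` file name and the block directory layout are assumptions, and on a real pod `find` would be pointed at the store-gateway's local sync directory instead:

```shell
# Stand-in for a store-gateway data directory (real pods: the local
# blocks sync dir; exact path depends on your configuration).
DATA_DIR="$(mktemp -d)"
mkdir -p "$DATA_DIR/prod/01JKEQS19460B594S9ERDQP0XX"
# Simulate a truncated (0-byte) sparse index header left by a partial write.
: > "$DATA_DIR/prod/01JKEQS19460B594S9ERDQP0XX/sparse-index-header"

# List empty sparse index headers -- a hint that the file was never
# fully written to disk.
find "$DATA_DIR" -type f -name 'sparse-index-header' -empty -print
```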

How to reproduce it?

  1. Start Mimir.
  2. Wait for the store gateway pods to load the blocks from S3 onto their PVCs.
  3. Delete the PVCs of the store gateways.
  4. Wait for the pods to come back up and check the store gateway logs to see whether all blocks are loaded.

What did you think would happen?

We did not expect any blocks to fail; we expected them all to load into the store gateways from S3, just as the other blocks were loaded.

What was your environment?

Infrastructure: Kubernetes
Deployment: Helm
Version: 2.13.0

Any additional context to share?

No response

@bhargavmg bhargavmg added the bug Something isn't working label Feb 12, 2025
@56quarters
Contributor

It looks like you're hitting this bug, which was fixed in 2.14:

[BUGFIX] Store-gateway: store sparse index headers atomically to disk. #8485

To work around this, you can delete the sparse index header files from each store-gateway disk (not fun, I know).
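A minimal sketch of that cleanup, demonstrated here in a temp directory rather than on a live pod (the `sparse-index-header` file name and the block layout are assumptions; on a real store-gateway you would point `find` at its local blocks sync directory):

```shell
# Demo layout standing in for a store-gateway data directory.
DATA_DIR="$(mktemp -d)"
BLOCK_DIR="$DATA_DIR/prod/01JKEQS19460B594S9ERDQP0XX"
mkdir -p "$BLOCK_DIR"
touch "$BLOCK_DIR/index-header" "$BLOCK_DIR/sparse-index-header"

# Delete only the sparse index headers, leaving everything else in
# place, so the store-gateway can regenerate them on its next startup.
find "$DATA_DIR" -type f -name 'sparse-index-header' -print -delete
```

Run against each store-gateway replica's disk, then restart the pod so the headers are rebuilt cleanly.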
