
Bug: Store gateway not loading a few blocks after PVC wipe out #10630

Open
bhargavmg opened this issue Feb 12, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@bhargavmg

What is the bug?

We are running Mimir with 7 store gateway replicas and S3 as the backend. Our PVCs were deleted accidentally, and we created new PVCs for the store gateways. However, since this incident we are noticing that a few blocks cannot be downloaded to the store gateways, which is causing ruler evaluations to fail.

Here are some of the logs from the different components.

Ruler logs:

```
ts=2025-02-12T07:00:10.66262752Z caller=spanlogger.go:109 method=blocksStoreQuerier.selectSorted user=prod level=warn user=prod msg="failed consistency check after all attempts" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX"
ts=2025-02-12T06:59:35.261638482Z caller=group.go:507 level=warn name=slo_requests_28d_total index=1 component=ruler insight=true user=prod file=/data/prod/general_rules group="eb reporting count based success us-east-1" msg="Evaluating rule failed" rule="record: slo_requests_28d_total\nexpr: sum without (instance, chart, playbook, team) (increase(slo_requests_raw{category=\"success\",region=~\"us-east-1\"}[4w]))\n" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX 01JKQVKRGS66RFAAQRE1PHW1E6"
ts=2025-02-12T06:59:35.261013787Z caller=spanlogger.go:109 method=blocksStoreQuerier.selectSorted user=prod level=warn user=prod msg="failed consistency check after all attempts" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX 01JKQVKRGS66RFAAQRE1PHW1E6"
ts=2025-02-12T06:59:13.808530061Z caller=group.go:507 level=warn name=slo_requests_28d_total index=1 component=ruler insight=true user=prod file=/data/prod/general_rules group="eb reporting count based success ap-southeast-2" msg="Evaluating rule failed" rule="record: slo_requests_28d_total\nexpr: sum without (instance, chart, playbook, team) (increase(slo_requests_raw{category=\"success\",region=~\"ap-southeast-2\"}[4w]))\n" err="failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: 01JKEQS19460B594S9ERDQP0XX"
```

Store gateway logs for a specific failing block:

```
ts=2025-02-12T08:12:13.935234084Z caller=memcached_client.go:372 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=2 firstKey=subrange:prod/01JKNWCGTB1FWRRA36HYMSPBP6/chunks/000023:212768000:212784000 err="memcache: connect timeout to 10.160.30.28:11211"
ts=2025-02-12T08:11:51.943023071Z caller=memcached_client.go:372 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=49 firstKey=subrange:prod/01JKNWCGTB1FWRRA36HYMSPBP6/chunks/000023:207872000:207888000 err=EOF
ts=2025-02-12T09:32:40.672916816Z caller=grpc_logging.go:97 level=warn method=/gatewaypb.StoreGateway/Series duration=1.117367ms msg=gRPC err="rpc error: code = Internal desc = fetch series for block 01JKEQS19460B594S9ERDQP0XX: expanded matching postings: toPostingGroups: filtering posting group: filter posting keys: cannot load sparse index-header from disk: failed to create sparse index-header reader: EOF"
```

We even tried increasing the index-cache replicas because of this, but with no luck. We also confirmed that there are no disk space or resource issues on the store gateway pods.

Can someone please help us figure out how to debug this issue further?
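For reference, here is a sketch of one check we could run on each store-gateway disk, given that the error mentions the sparse index-header. It is demonstrated in a throwaway temp directory; the `sparse-index-header` file name and the block directory layout are assumptions, and on a real pod `find` would be pointed at the store-gateway's local sync directory instead:

```shell
# Stand-in for a store-gateway data directory (real pods: the local
# blocks sync dir; exact path depends on your configuration).
DATA_DIR="$(mktemp -d)"
mkdir -p "$DATA_DIR/prod/01JKEQS19460B594S9ERDQP0XX"
# Simulate a truncated (0-byte) sparse index header left by a partial write.
: > "$DATA_DIR/prod/01JKEQS19460B594S9ERDQP0XX/sparse-index-header"

# List empty sparse index headers -- a hint that the file was never
# fully written to disk.
find "$DATA_DIR" -type f -name 'sparse-index-header' -empty -print
```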

How to reproduce it?

  1. Start Mimir.
  2. Wait for the store gateway pods to load the blocks from S3 onto their PVCs.
  3. Delete the PVCs of the store gateways.
  4. Wait for the pods to come back up and check the store gateway logs to see whether all blocks are loaded.

What did you think would happen?

We did not expect any blocks to fail; we expected them all to load into the store gateways from S3, just as the other blocks were loaded.

What was your environment?

Infrastructure: Kubernetes
Deployment: Helm
Version: 2.13.0

Any additional context to share?

No response

@bhargavmg bhargavmg added the bug Something isn't working label Feb 12, 2025
@56quarters
Contributor

It looks like you're hitting this bug, which was fixed in 2.14:

[BUGFIX] Store-gateway: store sparse index headers atomically to disk. #8485

To work around this, you can delete the sparse index header files from each store-gateway disk (not fun, I know).
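A minimal sketch of that cleanup, demonstrated here in a temp directory rather than on a live pod (the `sparse-index-header` file name and the block layout are assumptions; on a real store-gateway you would point `find` at its local blocks sync directory):

```shell
# Demo layout standing in for a store-gateway data directory.
DATA_DIR="$(mktemp -d)"
BLOCK_DIR="$DATA_DIR/prod/01JKEQS19460B594S9ERDQP0XX"
mkdir -p "$BLOCK_DIR"
touch "$BLOCK_DIR/index-header" "$BLOCK_DIR/sparse-index-header"

# Delete only the sparse index headers, leaving everything else in
# place, so the store-gateway can regenerate them on its next startup.
find "$DATA_DIR" -type f -name 'sparse-index-header' -print -delete
```

Run against each store-gateway replica's disk, then restart the pod so the headers are rebuilt cleanly.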
