(Spike) Cubejs Preagg when configured with S3 bucket #31297

Open
victoralfaro-dotcms opened this issue Feb 3, 2025 · 1 comment


victoralfaro-dotcms commented Feb 3, 2025

User Story

As a software engineer, I want to research the reason behind the error and determine a solution so that pre-aggregations can be created when large volumes of data need to be queried in the CubeJS setup for the dotCMS analytics infrastructure.

Acceptance Criteria

  • Identify the root cause of the CubeJS pre-aggregation failure when using S3 for storage.
  • Document the findings with logs and possible misconfigurations.
  • Propose a fix or workaround to allow pre-aggregations to be stored correctly in production.
  • Validate the solution locally and in a staging/production-like environment.

dotCMS Version

main

Proposed Objective

Quality Assurance

Proposed Priority

Priority 2 - Important

External Links... Slack Conversations, Support Tickets, Figma Designs, etc.

to_define

Assumptions & Initiation Needs

  • CubeJS is correctly configured to use S3 storage in production.
  • Pre-aggregations work correctly in a local development environment.
  • The error is related to how CubeJS interacts with S3 storage.

Quality Assurance Notes & Workarounds

  • Verify if S3 permissions, policies, or configurations impact the upload process.
  • Test different CubeJS storage options for pre-aggregations.
  • Ensure network connectivity and access rights to S3 from the production environment.
  • Identify any missing dependencies or incorrect CubeJS settings.

Sub-Tasks & Estimates

  • Investigate error logs and analyze root cause (4h)
  • Review CubeJS and S3 configuration settings (3h)
  • Document findings and proposed solution (3h)

victoralfaro-dotcms commented Feb 20, 2025

Findings

Root Cause

The CubeJS pre-aggregation fails because the upload of a .csv.gz file does not succeed. As its extension suggests, this file holds the data queried for the pre-aggregation, in gzip-compressed CSV format.

According to Slack messages exchanged here with the platform team, the S3 bucket used to store data from CubeJS is operational.

Logs

Error logged when trying to run pre-aggregations in prod:

Error: Error during upload of prod_pre_aggregations.request_count_stats20250101_0frqnjyy_pwzhtg4i_1jpoaqq-0.csv.gz create table: CREATE TABLE prod_pre_aggregations.request_count_stats20250101_0frqnjyy_pwzhtg4i_1jpoaqq (`request__base_type` varchar(255), `request__cluster_id` varchar(255), `request__customer_id` varchar(255), `request__identifier` varchar(255), `request__title` varchar(255), `request__url` varchar(255), `request__created_at_day` timestamp, `request__count` bigint) WITH (build_range_end = '2025-01-31T01:35:53.000') INDEX request_count_stats_daily_index_0frqnjyy_pwzhtg4i_1jpoaqq (`request__customer_id`,`request__cluster_id`): Internal: AWS S3 error: Got HTTP 400 with content 
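
To make the log easier to document, the relevant fields of the error line can be pulled apart with a small script. This is a minimal sketch: the regular expressions and the function name `parse_upload_error` are illustrative assumptions derived only from the single log line above, not from a documented CubeJS log format.

```python
import re
from typing import Optional

# Shortened copy of the log line above. The column list is replaced by "(...)"
# here for brevity; the patterns below also apply to the full line.
LOG_LINE = (
    "Error: Error during upload of "
    "prod_pre_aggregations.request_count_stats20250101_0frqnjyy_pwzhtg4i_1jpoaqq-0.csv.gz "
    "create table: CREATE TABLE "
    "prod_pre_aggregations.request_count_stats20250101_0frqnjyy_pwzhtg4i_1jpoaqq (...) "
    ": Internal: AWS S3 error: Got HTTP 400 with content "
)


def parse_upload_error(line: str) -> Optional[dict]:
    """Extract the uploaded file, target table and HTTP status from an error line.

    Returns None when the line does not look like an S3 upload error.
    """
    file_match = re.search(r"Error during upload of (\S+)", line)
    status_match = re.search(r"AWS S3 error: Got HTTP (\d{3})", line)
    if not (file_match and status_match):
        return None
    table_match = re.search(r"CREATE TABLE (\S+)", line)
    return {
        "file": file_match.group(1),
        "table": table_match.group(1) if table_match else None,
        "http_status": int(status_match.group(1)),
    }


if __name__ == "__main__":
    print(parse_upload_error(LOG_LINE))
```

Note that the response content after "with content" is cut off in the captured log, so the script can only report the status code, not the S3 error details.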

Proposed solution

S3 Bucket Removal

Remove the S3 bucket configuration from the current CubeJS instances. It is not required, since the instances have local partitions to store cubestore data.
Waiting on input from @spbolton to back up this suggestion.

Troubleshooting of CubeJS with debug log level set

The env-vars below show that both CUBEJS_LOG_LEVEL and CUBESTORE_LOG_LEVEL are currently set to warn. Raising the log level should surface a more detailed message that hints at the nature of the error when uploading the file.

      CUBESTORE_S3_BUCKET:             cubestore-data-prod
      CUBESTORE_S3_REGION:             us-east-1
      CUBEJS_LOG_LEVEL:                warn
      CUBESTORE_LOG_LEVEL:             warn
      CUBEJS_TELEMETRY:                false
      CUBESTORE_TELEMETRY:             false
      AWS_STS_REGIONAL_ENDPOINTS:      regional
      AWS_DEFAULT_REGION:              us-east-1
      AWS_REGION:                      us-east-1
      AWS_ROLE_ARN:                    arn:aws:iam::948170117212:role/s3-cubestore-data-env-rw-role
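
A quick consistency check over these variables could look like the sketch below. It is only an illustration: the function name `check_env`, the choice of which variables to compare, and the idea that warn/error levels are too quiet for this investigation are assumptions, not CubeJS rules.

```python
from typing import Dict, List

# Values copied from the env-vars listed above.
PROD_ENV = {
    "CUBESTORE_S3_BUCKET": "cubestore-data-prod",
    "CUBESTORE_S3_REGION": "us-east-1",
    "CUBEJS_LOG_LEVEL": "warn",
    "CUBESTORE_LOG_LEVEL": "warn",
    "AWS_DEFAULT_REGION": "us-east-1",
    "AWS_REGION": "us-east-1",
}

REGION_VARS = ("CUBESTORE_S3_REGION", "AWS_DEFAULT_REGION", "AWS_REGION")
LOG_LEVEL_VARS = ("CUBEJS_LOG_LEVEL", "CUBESTORE_LOG_LEVEL")
# Assumption: these levels are too quiet to show the S3 response details.
QUIET_LEVELS = {"warn", "error"}


def check_env(env: Dict[str, str]) -> List[str]:
    """Return human-readable findings about the S3-related settings."""
    findings = []
    if not env.get("CUBESTORE_S3_BUCKET"):
        findings.append("CUBESTORE_S3_BUCKET is not set")
    # All region variables that are present should point at the same region.
    regions = {name: env[name] for name in REGION_VARS if env.get(name)}
    if len(set(regions.values())) > 1:
        findings.append(f"region variables disagree: {regions}")
    for name in LOG_LEVEL_VARS:
        level = env.get(name, "").lower()
        if level in QUIET_LEVELS:
            findings.append(f"{name}={level} may hide the S3 response details")
    return findings


if __name__ == "__main__":
    for finding in check_env(PROD_ENV):
        print(finding)
```

Inside a running container, the same function can be fed the real process environment instead of the `PROD_ENV` copy.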

And check for the following:

  • Invalid request format to S3
  • Malformed authentication headers
  • Incorrect content-type headers
  • Bucket permissions/policies
  • Invalid characters in the request URL
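
Once the full response is visible in the logs, the checklist above can be narrowed down from the S3 error body, which S3 normally returns as an XML document with `Code` and `Message` elements. The sketch below is hedged accordingly: the mapping from error codes to checklist items is an assumption made for this investigation, and `SAMPLE_BODY` is a hypothetical response, since the real content is cut off in the captured log.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

# Assumed mapping from S3 error codes to the checklist items above.
# "Incorrect content-type headers" has no single dedicated code, so it is
# not mapped here. AccessDenied usually arrives as HTTP 403 rather than 400.
CHECKLIST_BY_CODE = {
    "InvalidRequest": "Invalid request format to S3",
    "AuthorizationHeaderMalformed": "Malformed authentication headers",
    "InvalidURI": "Invalid characters in the request URL",
    "AccessDenied": "Bucket permissions/policies",
}

# HYPOTHETICAL body, for illustration only; not taken from the prod logs.
SAMPLE_BODY = """<Error>
  <Code>AuthorizationHeaderMalformed</Code>
  <Message>example message</Message>
</Error>"""


def classify_s3_error(body: str) -> Tuple[str, str]:
    """Return (error code, matching checklist item) for an S3 error body."""
    root = ET.fromstring(body)
    code = root.findtext("Code", default="").strip()
    hint = CHECKLIST_BY_CODE.get(
        code, "Not covered by the checklist; read the Message element"
    )
    return code, hint


if __name__ == "__main__":
    print(classify_s3_error(SAMPLE_BODY))
```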

Status: In Progress