
Why is the Spark memory set to 2gb? #127

Open
MrPowers opened this issue Jun 11, 2024 · 2 comments

@MrPowers

Here's the line: https://github.com/pola-rs/tpch/blob/6c5bbe93a04cfcd25678dd860bab5ad61ad66edb/queries/pyspark/utils.py#L24

If these benchmarks are being run on a single node, we should probably set the shuffle partitions to something like 1-4 instead of the default of 200.
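
A minimal sketch of that suggestion, assuming a local PySpark session: it builds a settings dict with a small `spark.sql.shuffle.partitions` value and applies it when the session is created. The helper names (`single_node_conf`, `build_session`), the app name, and the default of 4 are illustrative assumptions, not code from the repository.

```python
def single_node_conf(shuffle_partitions: int = 4) -> dict:
    """Return Spark settings for a single-node run (hypothetical helper)."""
    if shuffle_partitions < 1:
        raise ValueError("shuffle_partitions must be at least 1")
    # Spark's default for this setting is 200, which targets multi-node clusters.
    return {"spark.sql.shuffle.partitions": str(shuffle_partitions)}


def build_session(conf: dict):
    """Create a local SparkSession with the given settings applied."""
    # Imported lazily so single_node_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName("tpch-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session(single_node_conf(4))` would start a local session with four shuffle partitions; whether 1, 2, or 4 is best would still need to be measured per scale factor.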

@ritchie46
Member

@stinodego any clue?

@stinodego
Contributor

stinodego commented Jun 11, 2024

These are default values that will let you run scale factor 1 locally without any problems. If we use PySpark defaults, certain queries fail due to memory issues.

The benchmark blog has the actual values used during the benchmark:

> For PySpark, driver memory and executor memory were set to 20g and 10g respectively.

I have tried a few different settings, and these seemed to work best for scale factor 10.
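
For illustration, the values quoted from the blog could be expressed as the settings below. This is a sketch, not the repository's code: `benchmark_memory_conf` and `apply_conf` are hypothetical helpers, and the only values taken from this thread are 20g for the driver and 10g for the executor.

```python
def benchmark_memory_conf(driver: str = "20g", executor: str = "10g") -> dict:
    """Return the memory settings quoted from the benchmark blog (hypothetical helper)."""
    return {
        "spark.driver.memory": driver,
        "spark.executor.memory": executor,
    }


def apply_conf(builder, conf: dict):
    """Apply each setting to a builder that exposes .config(key, value).

    With PySpark, `builder` would be SparkSession.builder; any object with a
    chainable .config method works, which keeps this helper easy to test.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, `apply_conf(SparkSession.builder, benchmark_memory_conf()).getOrCreate()` would create the session. The memory settings should be applied when the session is first created, since they generally cannot be changed once the JVM is running.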

> If these benchmarks are being run on a single node, we should probably set the shuffle partitions to something like 1-4 instead of the default of 200.

Could be - I am by no means a PySpark optimization expert. Perhaps they should implement better/dynamic defaults.


3 participants