Add an error handling mechanism for collect_all #20835

Open
mjkanji opened this issue Jan 21, 2025 · 0 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

mjkanji commented Jan 21, 2025

Description

It'd be great if there were an error-handling mechanism for pl.collect_all so that the whole operation doesn't fail when a subset of the LazyFrames fails.

For example,

import polars as pl

paths = [
    "normal.csv",
    "empty.csv",
]

dfs = []
for p in paths:
    # lazily scan each CSV file
    dfs.append(pl.scan_csv(p))

pl.collect_all(dfs)

This currently raises an error and causes the whole operation to fail.

Traceback (most recent call last):
  File "script.py", line 26, in <module>
    df.explain(optimized=False)
  File "/.pixi/envs/default/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1124, in explain
    return self._ldf.describe_plan()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.NoDataError: empty CSV

It'd be great if I at least got a result for the first LazyFrame.

Another issue is that the polars.exceptions.NoDataError: empty CSV message doesn't show which file caused the error. For long lists of LazyFrames, that can make debugging rather painful.
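To illustrate the kind of per-file reporting meant here, below is a minimal sketch that keeps each path next to its result or its exception. The helper name collect_per_path is hypothetical and not part of Polars. It takes the scan function as an argument (for example pl.scan_csv) and wraps both the scan and the collect in the same try block, since the point at which the empty-file error surfaces may differ between Polars versions.

```python
from typing import Any, Callable, Iterable


def collect_per_path(
    paths: Iterable[str],
    scan: Callable[[str], Any],
) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Scan and collect each path on its own (hypothetical helper).

    `scan` is any function that turns a path into an object with a
    `.collect()` method, e.g. `pl.scan_csv`. Returns (results, errors),
    both keyed by path, so every failure names the file that caused it.
    """
    results: dict[str, Any] = {}
    errors: dict[str, Exception] = {}
    for path in paths:
        try:
            results[path] = scan(path).collect()
        except Exception as exc:  # broad on purpose; narrow it as needed
            errors[path] = exc
    return results, errors


# Intended use with Polars (assumes `import polars as pl`):
#   results, errors = collect_per_path(paths, pl.scan_csv)
#   for path, exc in errors.items():
#       print(f"{path}: {exc!r}")
```

Catching Exception is deliberately broad for the sketch; catching only Polars' own exception types would be stricter.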

In the meantime, I'm looping over the list and calling collect on each LazyFrame individually (and wrapping everything in a try/except block). I'd love any guidance on the most efficient way to speed this up, since the built-in collect_all doesn't work for this case right now. Should I use a thread pool or multiprocessing? Polars seems to use all cores even for a single file, so I'm curious how to handle parallelism when collecting multiple LazyFrames and collect_all is not an option.
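For concreteness, here is a rough sketch of what the thread pool variant of that loop could look like. The helper names and the max_workers default are assumptions for illustration, not Polars APIs. It relies only on each item having a collect method, and it returns either the collected result or the exception for each input, in input order. Whether this is any faster than a plain loop is not established here: Polars already parallelises within a single query, so it would need benchmarking.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any


def _collect_or_error(lf: Any) -> Any:
    """Return lf.collect(), or the exception it raised instead of failing."""
    try:
        return lf.collect()
    except Exception as exc:  # broad on purpose; narrow it as needed
        return exc


def collect_concurrently(lazy_frames: list[Any], max_workers: int = 4) -> list[Any]:
    """Collect every frame in a thread pool (hypothetical helper).

    The output has one entry per input, in the same order: either the
    collected result or the exception raised for that input.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(_collect_or_error, lazy_frames))


# Intended use with Polars (assumes `import polars as pl`):
#   outcomes = collect_concurrently([pl.scan_csv(p) for p in paths])
#   failed = [p for p, o in zip(paths, outcomes) if isinstance(o, Exception)]
```

Returning exceptions in place keeps each outcome aligned with its input path, which also covers the "which file failed" problem.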

@mjkanji mjkanji added the enhancement New feature or an improvement of an existing feature label Jan 21, 2025