The easiest way to perform reshard #7379

Closed Answered by lhoestq
yzhangcs asked this question in Q&A

Maybe you can create multiple iterators per file. For example, if you want 4 iterators per file, the first iterator starts at the beginning of the file and stops at the first EOL at or after the 1/4 mark of the file; the second iterator starts right after that EOL and runs until the first EOL at or after the 1/2 mark, and so on. Each line is then read by exactly one iterator, and no file has to be rewritten on disk.
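
A minimal sketch of that idea, assuming a newline-delimited file (e.g. JSON Lines) read in binary mode. The helper names `byte_ranges` and `iter_lines` are hypothetical and not part of `datasets`. A line belongs to the range that contains its first byte, so consecutive ranges neither overlap nor drop lines:

```python
import os


def byte_ranges(path, num_shards):
    """Split a file into `num_shards` contiguous (start, end) byte ranges."""
    size = os.path.getsize(path)
    bounds = [size * i // num_shards for i in range(num_shards + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def iter_lines(path, start, end):
    """Yield every line whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next EOL. If `start`
            # is already a line start, this consumes only the previous "\n".
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:  # end of file
                break
            # The last line of a range may run past `end`; the next range
            # skips its remainder in the seek/readline step above.
            yield line
```

Calling `iter_lines(path, start, end)` for each pair from `byte_ranges(path, 4)` gives four independent iterators over one file. A range that falls entirely inside a single long line simply yields nothing.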

Btw, instead of using a GeneratorBasedBuilder, it may be easier to use Dataset.from_generator or IterableDataset.from_generator.
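
A rough sketch of how the byte ranges could be fed to `IterableDataset.from_generator`, assuming `datasets` is installed. The generator `generate_examples`, the wrapper `build_iterable_dataset`, the `"text"` field, and the `(path, start, end)` shard tuples are all illustrative assumptions. It relies on the documented behavior that a list passed through `gen_kwargs` is treated as the list of shards:

```python
def generate_examples(shards):
    """Yield one example per line for each (path, start, end) shard."""
    for path, start, end in shards:
        with open(path, "rb") as f:
            if start > 0:
                # Skip to the first line that begins at or after `start`.
                f.seek(start - 1)
                f.readline()
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                yield {"text": line.decode("utf-8").rstrip("\n")}


def build_iterable_dataset(shards):
    """Wrap the generator; each list element of `shards` acts as one shard."""
    # Imported here so the generator above stays usable without `datasets`.
    from datasets import IterableDataset

    return IterableDataset.from_generator(
        generate_examples, gen_kwargs={"shards": shards}
    )
```

With, say, 4 byte ranges per file, a dataset built this way should expose 4 shards per file, which is what lets multiple workers split the data without resharding the files themselves.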

Answer selected by yzhangcs