Use takeWhile method from Range #720
base: main
Conversation
Hi @EnverOsmanov, thanks for the PR!

```scala
val lastCellNum = r.getLastCellNum
colInd
  .iterator
  .filter(_ < lastCellNum)
```
Benchmarks: here is the code showing how I read the data. Btw, I just checked the content of
The alternative approach avoids iterating over the full Range. But I'm not exactly sure what the idea behind the change in V2 was.
Hmm, maybe it is to be able to do r.getCell(_, MissingCellPolicy.CREATE_NULL_AS_BLANK)? @quanghgx, could you chime in here?
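For context, here is a minimal sketch of what that policy implies (not the actual spark-excel code; the helper name and signature are made up for illustration): with Apache POI's CREATE_NULL_AS_BLANK, missing cells come back as blank cells instead of null, which only pays off if the code walks every requested column index rather than stopping early.

```scala
import org.apache.poi.ss.usermodel.{Cell, Row}

// Hypothetical helper: materialize a cell for every requested column index.
// CREATE_NULL_AS_BLANK turns missing cells into blank cells, so the result
// has exactly one entry per index in colInd.
def cellsFor(r: Row, colInd: Seq[Int]): Seq[Cell] =
  colInd.map(i => r.getCell(i, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK))
```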
The symptoms:
I have a file with ~1 million rows, 125 columns. It takes ~12 seconds to count lines with spark-excel's API V1 and ~2 minutes with API V2.
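For reference, a rough sketch of the kind of row count being compared (not the exact code from this report; the path is a placeholder and the format/option names are taken from spark-excel's documentation as I understand them):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("excel-count").getOrCreate()

// API V1: the original data source name.
val v1Count = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .load("/data/big-file.xlsx")
  .count()

// API V2: the newer short data source name.
val v2Count = spark.read
  .format("excel")
  .option("header", "true")
  .load("/data/big-file.xlsx")
  .count()
```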
The issue:
Range does not have its own optimized filter method, so it falls back to the one from TraversableLike, which iterates over every number in the range. On top of that, r.getLastCellNum is evaluated for each number in the range. Here are some rough benchmarks with another file (a short sketch contrasting filter and takeWhile follows the list below):
filter => 50 seconds
val lastCellNum => 38 seconds
withFilter => 20 seconds
takeWhile => 12 seconds
API V1 => 12 seconds
(File taken from here and manually converted to "xlsx")
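A minimal sketch of why takeWhile wins here (not the actual spark-excel code; the Row-based signatures are illustrative): the column indices are increasing, so once an index reaches getLastCellNum every later index is out of range too, and takeWhile can stop at the first failing index instead of testing the whole Range.

```scala
import org.apache.poi.ss.usermodel.{Cell, Row}

// Slow variant: Range has no specialized filter, so the whole range is
// traversed and r.getLastCellNum is re-evaluated for every index.
def cellsWithFilter(r: Row, colInd: Range): Seq[Cell] =
  colInd.filter(_ < r.getLastCellNum).map(i => r.getCell(i))

// Fast variant: evaluate getLastCellNum once and stop at the first
// out-of-range index, as this PR proposes.
def cellsWithTakeWhile(r: Row, colInd: Range): Seq[Cell] = {
  val lastCellNum = r.getLastCellNum
  colInd.takeWhile(_ < lastCellNum).map(i => r.getCell(i))
}
```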
PS. API V2 seems great! :)