What is the expected loading time of a SparkNLP pre-trained pipeline? #10676
Unanswered · Nickersoft asked this question in Q&A
Hey folks,
Unfortunately, I hit quite the bottleneck with SparkNLP: the loading time of pre-trained pipelines. I currently have a hosted API that loads and runs pre-trained SNLP pipelines for small amounts of data passed to the API in various languages. I've noticed that on a cold start, the pipelines take anywhere from 10s to 20s to load and process data, which is unbearably slow for a user waiting for a response.
My stack is a single-node Spark cluster built on ZIO, which uses zio-cache to cache pipelines after they've loaded. Processing time is near instantaneous with cached pipelines; however, the more pipelines are cached, the larger the heap grows, which has led to OOMs in the past. So ultimately the bottleneck is the load time of these pipelines: if loading were fast, it'd be easier to just always load from disk and not have to worry about retaining pipelines in memory.
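To make the memory tradeoff concrete, here's a minimal sketch of the kind of size-bounded LRU cache I mean, in plain Scala. Note this is illustrative only: `BoundedCache`, `maxEntries`, and `getOrLoad` are names I made up for the sketch, not zio-cache's API (zio-cache's `Cache.make` does take a capacity argument, which achieves the same bound).

```scala
import scala.collection.mutable

// Illustrative size-bounded LRU cache: caps how many loaded pipelines
// stay resident so the heap stays bounded. Once the cap is exceeded,
// the least-recently-used entry is evicted and must be reloaded from
// disk on the next request (paying the cold-load cost again).
final class BoundedCache[K, V](maxEntries: Int) {
  // LinkedHashMap preserves insertion order; we re-insert on access
  // so the head is always the least-recently-used entry.
  private val entries = mutable.LinkedHashMap.empty[K, V]

  def getOrLoad(key: K)(lookup: K => V): V = synchronized {
    val value = entries.remove(key).getOrElse(lookup(key)) // hit or load
    entries.put(key, value)                 // mark as most recently used
    if (entries.size > maxEntries)
      entries.remove(entries.head._1)       // evict least recently used
    value
  }

  def size: Int = synchronized(entries.size)
}
```

With a cap like this you trade heap for latency: a small `maxEntries` keeps memory flat but means more requests hit the slow cold-load path, which is exactly why the load time itself is the number I'd like to shrink.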
I'm curious whether these load times are expected. The example notebooks on the John Snow Labs site seem to load their pipelines faster than this; however, even after stripping out all my custom annotators, the load times are still over ten seconds. Specifically, this means running a pipeline featuring a document assembler, sentence detector, word segmenter, lemmatizer, POS tagger, and NER tagger (basically an "explain" pipeline).
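For reference, the cold-start path I'm timing is roughly the following sketch. `explain_document_ml` is the stock English "explain" pipeline; the names for other languages differ, and I'm assuming the models are already downloaded so the cost measured is load-from-disk, not network.

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Cold start: this constructor call is where the 10-20s goes --
// it deserializes every stage (sentence detector, POS, NER, ...).
val t0 = System.nanoTime()
val pipeline = PretrainedPipeline("explain_document_ml", lang = "en")
println(s"load: ${(System.nanoTime() - t0) / 1e9}%.1f s")

// Annotating afterwards is near instantaneous by comparison.
val result = pipeline.annotate("Loading takes far longer than annotating.")
println(result.keys) // output columns such as token, pos, lemmas
```
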
Is there any way to speed it up? I've found similar complaints online, but without a solid solution (most people just moved to another library).
Thanks!