What is the expected loading time of a SparkNLP pre-trained pipeline? #10676
Unanswered · Nickersoft asked this question in Q&A
Hey folks,
Unfortunately, I hit quite the bottleneck with SparkNLP: the loading time of pre-trained pipelines. I currently have a hosted API that loads and runs pre-trained SNLP pipelines for small amounts of data passed to the API in various languages. I've noticed that on a cold start, the pipelines take anywhere from 10s to 20s to load and process data, which is unbearably slow for a user waiting for a response.
My stack is a single-node Spark cluster built on ZIO, which uses zio-cache to cache pipelines after they've loaded. Processing time is near instantaneous with cached pipelines; however, the more pipelines are cached, the larger the heap grows, which has led to OOMs in the past. So ultimately the bottleneck is the load time of these pipelines: if loading were fast, it'd be easier to just always load from disk and not have to worry about retaining pipelines in memory.
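To make the memory tradeoff concrete, here's a minimal sketch of the kind of size-bounded LRU cache I mean, in plain Scala. Note this is illustrative only: `BoundedCache`, `maxEntries`, and `getOrLoad` are names I made up for the sketch, not zio-cache's API (zio-cache's `Cache.make` does take a capacity argument, which achieves the same bound).

```scala
import scala.collection.mutable

// Illustrative size-bounded LRU cache: caps how many loaded pipelines
// stay resident so the heap stays bounded. Once the cap is exceeded,
// the least-recently-used entry is evicted and must be reloaded from
// disk on the next request (paying the cold-load cost again).
final class BoundedCache[K, V](maxEntries: Int) {
  // LinkedHashMap preserves insertion order; we re-insert on access
  // so the head is always the least-recently-used entry.
  private val entries = mutable.LinkedHashMap.empty[K, V]

  def getOrLoad(key: K)(lookup: K => V): V = synchronized {
    val value = entries.remove(key).getOrElse(lookup(key)) // hit or load
    entries.put(key, value)                 // mark as most recently used
    if (entries.size > maxEntries)
      entries.remove(entries.head._1)       // evict least recently used
    value
  }

  def size: Int = synchronized(entries.size)
}
```

With a cap like this you trade heap for latency: a small `maxEntries` keeps memory flat but means more requests hit the slow cold-load path, which is exactly why the load time itself is the number I'd like to shrink.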
I'm curious whether these load times are expected. The example notebooks on the John Snow Labs site seem to load their pipelines faster than this; however, even after stripping out all my custom annotators, the load times are still over ten seconds. Specifically, this means running a pipeline featuring a document assembler, sentence detector, word segmenter, lemmatizer, POS tagger, and NER tagger (basically an "explain" pipeline).
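For reference, the cold-start path I'm timing is roughly the following sketch. `explain_document_ml` is the stock English "explain" pipeline; the names for other languages differ, and I'm assuming the models are already downloaded so the cost measured is load-from-disk, not network.

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Cold start: this constructor call is where the 10-20s goes --
// it deserializes every stage (sentence detector, POS, NER, ...).
val t0 = System.nanoTime()
val pipeline = PretrainedPipeline("explain_document_ml", lang = "en")
println(s"load: ${(System.nanoTime() - t0) / 1e9}%.1f s")

// Annotating afterwards is near instantaneous by comparison.
val result = pipeline.annotate("Loading takes far longer than annotating.")
println(result.keys) // output columns such as token, pos, lemmas
```
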
Is there any way to speed it up? I've found similar complaints online, but without a solid solution (most people just moved to another library).
Thanks!