Add SIMD-accelerated APIs #7

Open
wants to merge 1 commit into
base: master

ashvardanian

This was intended as a small patch upgrading from JellyFish to StringZilla to accelerate some of the slowest and most frequently used string similarity measures. Along the way I've patched a few minor things.

  • Hamming and Levenshtein support SIMD and buffers.
  • Added docstrings for all APIs.
  • Fixed non-standard 5-char indent in functions.py.
  • Upgraded PyTest for compatibility with newer Python.
  • Added pkg_resources for setuptools for tests.

Compared to JellyFish, StringZilla is generally at least 20% faster, even on shorter strings. It is also more accurate, as JellyFish doesn't correctly handle Unicode strings. Here is a comparison of the distances reported by different packages:

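| # | Example | Jellyfish | Levenshtein | RapidFuzz | EditDistance | NLTK | StringZilla (Unicode) | StringZilla (Bytes) |
|---|---------|-----------|-------------|-----------|--------------|------|-----------------------|---------------------|
| 0 | apple vs aple | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | αβγδ vs αγδ | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 2 | école vs école | 1 | 2 | 2 | 2 | 2 | 2 | 3 |
| 3 | Schön vs Schön | 1 | 2 | 2 | 2 | 2 | 2 | 3 |
| 4 | 💖 vs 💗 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 5 | 𠜎 𠜱 𠝹 𠱓 vs 𠜎𠜱𠝹𠱓 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 6 | München vs Muenchen | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 7 | façade vs facade | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 8 | こんにちは世界 vs こんばんは世界 | 2 | 2 | 2 | 2 | 2 | 2 | 3 |
| 9 | 👩‍👩‍👧‍👦 vs 👨‍👩‍👧‍👦 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10 | Data科学123 vs Data科學321 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| 11 | 🙂🌍🚀 vs 🙂🌎✨ | 2 | 2 | 2 | 2 | 2 | 2 | 5 |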
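
For readers wondering why the last two columns can differ, here is a minimal sketch in plain Python. It does not use StringZilla or any other package from the table; `levenshtein` and `unicode_and_byte_distance` are hypothetical helper names made up for this illustration. The same textbook dynamic-programming distance is computed once over code points and once over UTF-8 bytes, and strings are written with explicit escapes so the precomposed forms are unambiguous.

```python
from typing import Sequence, Tuple


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Textbook two-row edit distance over any pair of sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def unicode_and_byte_distance(a: str, b: str) -> Tuple[int, int]:
    """Distance over code points, then over the UTF-8 encoded bytes."""
    return levenshtein(a, b), levenshtein(a.encode("utf-8"), b.encode("utf-8"))


pairs = [
    ("apple", "aple"),
    ("\u03b1\u03b2\u03b3\u03b4", "\u03b1\u03b3\u03b4"),  # Greek letters, 2 bytes each
    ("fa\u00e7ade", "facade"),                            # precomposed c-cedilla
    ("\U0001F496", "\U0001F497"),                         # emoji differing in the last byte
]
for a, b in pairs:
    # !a prints an ASCII-safe representation regardless of terminal encoding
    print(f"{a!a} vs {b!a}: {unicode_and_byte_distance(a, b)}")
```

Dropping a two-byte Greek letter costs one edit over code points but two over bytes, which matches rows 1 and 7 above; ASCII-only inputs, and emoji that differ in a single byte, give the same result either way.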

This patch introduces several SIMD-accelerated APIs
for strings and raw byte-arrays, compatible with
PySpark v2 and v3. In more detail:

- Hamming and Levenshtein support SIMD and buffers.
- Fixed non-standard 5-char indent in `functions.py`.
- Upgraded PyTest for compatibility with newer Python.
- Added `pkg_resources` for `setuptools` for tests.

On typical English words StringZilla is 15x faster than
JellyFish on both x86 and Arm CPUs.
@MrPowers
Owner

@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?

@ashvardanian
Author

ashvardanian commented Feb 22, 2024

@MrPowers I don't have benchmarks specific to Ceja, but I have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at stringzilla/scripts/bench_similarity.ipynb 🤗

Is there some specific benchmark you have in mind?


PS: There is also a portability issue I haven't mentioned. It seems jellyfish builds only 65 wheels, while today PyPI expects 105 targets. StringZilla publishes all of them.
