Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scoring: remove IDF from BM25 scoring #912

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

stefanhengl
Copy link
Member

@stefanhengl stefanhengl commented Feb 12, 2025

We remove IDF from our BM25 scoring, effectively treating it as constant.

This is supported by our evaluations which showed that for keyword style queries, IDF can down-weight the score of important keywords too much, leading to a worse ranking. The intuition is that for code search, each keyword is important independently of how frequent it appears in the corpus.

Removing IDF allows us to apply BM25 scoring to a wider range of query types. Previously, BM25 was limited to queries with individual terms combined using OR, as IDF was calculated on the fly at query time.

Test plan:
updated tests

We remove IDF because we want to use BM25 scoring for keyword search and
our query-time calculation of IDF won't work anymore if terms are
AND-ed (keyword search) instead of OR-ed (codycontext).

Our evaluations show a slight improvement which supports the decision to
treat IDF as constant.

This is also in line with how we calculate line-based BM25.

Test plan:
updated unit tests
@@ -79,8 +79,8 @@ func TestBM25(t *testing.T) {
query: &query.Substring{Pattern: "example"},
content: exampleJava,
language: "Java",
// bm25-score: 0.58 <- sum-termFrequencyScore: 14.00, length-ratio: 1.00
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remove the score from the comments because it is redundant and requires an update for every update of the scoring function. sum-termFrequencyScore and length-ratio are more robust.

@mmanela
Copy link
Contributor

mmanela commented Feb 12, 2025

Couple questions

  1. I thought we saw this helped with Cody contest originally
  2. Is it still technically BM25 without IDF?

@stefanhengl
Copy link
Member Author

Couple questions

  1. I thought we saw this helped with Cody contest originally
  2. Is it still technically BM25 without IDF?
  1. Our evaluation set was not as strong back then and we chose to add IDF because it is generally accepted to be part of a "correct" implementation of BM25. If I remember correctly, adding IDF didn't improve our evaluation or even made it slightly worse.
  2. Variations of BM25 are common, especially in specialized domains. It’s not unusual to adjust or remove certain components when they negatively affect scoring in specific contexts. By removing IDF, our scoring is technically no longer "pure BM25". Maybe we can call it "scoring inspired by BM25".

@jtibshirani
Copy link
Member

Although I had the same reaction as @mmanela, that we should really avoid deviating from the classic BM25 formula, we did an analysis here that made me feel okay about this: https://linear.app/sourcegraph/issue/SPLF-838/use-bm25-for-multi-term-keyword-searches. TL;DR: eval results on every dataset improved, and IDF may not be as critical for the code search use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants