scoring: remove IDF from BM25 scoring #912

stefanhengl · 2025-02-12T11:25:54Z

We remove IDF from our BM25 scoring, effectively treating it as constant.

This is supported by our evaluations, which showed that for keyword-style queries, IDF can down-weight important keywords too much, leading to a worse ranking. The intuition is that for code search, each keyword is important regardless of how frequently it appears in the corpus.

Removing IDF allows us to apply BM25 scoring to a wider range of query types. Previously, BM25 was limited to queries with individual terms combined using OR, as IDF was calculated on the fly at query time.
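One way to see why a query-time IDF depends on OR semantics is sketched below. It assumes that document frequencies are counted over the matched documents only, which is an illustration and not necessarily how the real implementation computes them. `queryTimeIDF` and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// idf is the textbook BM25 inverse document frequency.
func idf(docFreq, numDocs int) float64 {
	n, total := float64(docFreq), float64(numDocs)
	return math.Log(1 + (total-n+0.5)/(n+0.5))
}

// queryTimeIDF derives document frequencies from the matched documents only.
// Each document is modelled as the set of query terms it contains.
func queryTimeIDF(matched []map[string]bool, terms []string) map[string]float64 {
	out := make(map[string]float64, len(terms))
	for _, t := range terms {
		df := 0
		for _, doc := range matched {
			if doc[t] {
				df++
			}
		}
		out[t] = idf(df, len(matched))
	}
	return out
}

func main() {
	terms := []string{"http", "retryBudget"}

	// OR query: a document matches if it contains any of the terms,
	// so document frequencies differ between terms.
	orMatches := []map[string]bool{
		{"http": true},
		{"http": true},
		{"http": true, "retryBudget": true},
	}

	// AND query: every matched document contains every term,
	// so each term's document frequency equals the number of matches.
	andMatches := []map[string]bool{
		{"http": true, "retryBudget": true},
		{"http": true, "retryBudget": true},
	}

	fmt.Println("OR: ", queryTimeIDF(orMatches, terms))
	fmt.Println("AND:", queryTimeIDF(andMatches, terms))
}
```

Under this assumption, the AND case yields the same IDF for every term, so it no longer distinguishes rare keywords from common ones.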

Test plan:
updated tests

We remove IDF because we want to use BM25 scoring for keyword search, and our query-time calculation of IDF no longer works when terms are AND-ed (keyword search) instead of OR-ed (codycontext). Our evaluations show a slight improvement, which supports the decision to treat IDF as constant. This is also in line with how we calculate line-based BM25.

Test plan: updated unit tests

stefanhengl · 2025-02-12T11:27:39Z

internal/e2e/scoring_test.go

@@ -79,8 +79,8 @@ func TestBM25(t *testing.T) {
 			query:    &query.Substring{Pattern: "example"},
 			content:  exampleJava,
 			language: "Java",
-			// bm25-score: 0.58 <- sum-termFrequencyScore: 14.00, length-ratio: 1.00


I removed the score from the comments because it is redundant and has to be updated every time the scoring function changes. sum-termFrequencyScore and length-ratio are more robust.

mmanela · 2025-02-12T12:13:46Z

Couple questions

I thought we saw this helped with Cody context originally
Is it still technically BM25 without IDF?

stefanhengl · 2025-02-12T12:47:39Z

Couple questions

I thought we saw this helped with Cody context originally

Is it still technically BM25 without IDF?

Our evaluation set was not as strong back then, and we chose to add IDF because it is generally accepted to be part of a "correct" implementation of BM25. If I remember correctly, adding IDF didn't improve our evaluation results and may even have made them slightly worse.
Variations of BM25 are common, especially in specialized domains. It’s not unusual to adjust or remove certain components when they negatively affect scoring in specific contexts. By removing IDF, our scoring is technically no longer "pure BM25". Maybe we can call it "scoring inspired by BM25".

jtibshirani · 2025-02-12T16:35:51Z

Although I had the same reaction as @mmanela, that we should really avoid deviating from the classic BM25 formula, we did an analysis here that made me feel okay about this: https://linear.app/sourcegraph/issue/SPLF-838/use-bm25-for-multi-term-keyword-searches. TL;DR: eval results on every dataset improved, and IDF may not be as critical for the code search use case.

stefanhengl added 2 commits February 11, 2025 14:31

update comments

a134781

stefanhengl requested a review from jtibshirani February 12, 2025 11:25


jtibshirani approved these changes Feb 12, 2025
