Inconsistent results using KNN script score with Cosine Similarity

Hi, I am experimenting with the OpenSearch k-NN functionality using a 512-dimensional vector. The index mapping and query I am using are included below. To summarize quickly: each document has a nested set of questions, each with its corresponding vector, and one answer field. The use case is finding answers to questions based on similar meaning.

The problem I am having is that sometimes I get inconsistent results. For example, let’s say I have a document with just one question, and I query the index with a vector that I know matches that exact document. Sometimes OpenSearch will send me back a different document with a higher “Score”, even though I queried with an exact match. This only happens on occasion.

I was hoping that someone might be able to advise me:

  1. If my query or index is incorrect for what I am trying to achieve. I am using a script with a cosine similarity function. I reviewed this similar post, “Opendistro KNN score giving different scores on the same query vector” (#3 by utpal), but could not find an answer.

  2. Is it possible to get the results of the cosine similarity scoring back with my search results? What I mean is when I query and get a document back can I get the cosine similarity between my query vector and my document’s vector?

    {
      "mappings": {
        "properties": {
          "questions": {
            "type": "nested",
            "properties": {
              "question": {
                "type": "text"
              },
              "vector": {
                "type": "knn_vector",
                "dimension": 512
              }
            }
          },
          "answer": {
            "type": "text"
          },
          "enabled": {
            "type": "boolean"
          }
        }
      }
    }

When I want to query OpenSearch to find an answer to a similar question, I take the text, compute its 512-dimensional vector, and search for that vector with the following query.

    {
      "size": 1,
      "query": {
        "bool": {
          "must":  [
            {"match": {"enabled": true}}
          ],
          "should" : [
            {
              "nested": {
                "score_mode": "avg",
                "path": "questions",
                "query": {
                  "script_score": {
                    "query": {
                      "match_all": { }
                    },
                    "script": {
                      "lang": "knn",
                      "source": "knn_score",
                      "params": {
                        "field": "questions.vector",
                        "space_type": "cosinesimil",
                        "query_value": [VECTOR...]
                      }
                    }
                  }
                }
              }
            }
          ]
        }
      }
    }
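
As a reading aid for the query above: with `score_mode` set to `avg`, the nested clause contributes the mean of the per-question script scores, so a document holding one exact match plus unrelated questions can rank below a document with a single near match. The Python sketch below illustrates that arithmetic only. It assumes the `cosinesimil` script score equals 1 + cosine similarity, and it uses made-up 3-dimensional vectors and a hypothetical `nested_score` helper rather than anything from the real index.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nested_score(query, question_vectors, score_mode="avg"):
    """Combine per-question scores the way a nested score_mode would.

    Assumption: each nested question scores 1 + cosine similarity.
    """
    scores = [1.0 + cosine_similarity(query, v) for v in question_vectors]
    if score_mode == "avg":
        return sum(scores) / len(scores)
    if score_mode == "max":
        return max(scores)
    raise ValueError(f"unsupported score_mode: {score_mode}")


query = [1.0, 0.0, 0.0]

# Document A: one exact match plus two unrelated (orthogonal) questions.
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Document B: a single question that is close to the query, but not exact.
doc_b = [[0.9, 0.1, 0.0]]

avg_a = nested_score(query, doc_a, "avg")  # about 1.33
avg_b = nested_score(query, doc_b, "avg")  # about 1.99
max_a = nested_score(query, doc_a, "max")  # exactly 2.0
max_b = nested_score(query, doc_b, "max")  # about 1.99

print("avg:", avg_a, avg_b)  # B outranks A despite A's exact match
print("max:", max_a, max_b)  # A outranks B
```

This does not cover a document with only one question, where `avg` and `max` give the same value.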

@jmazane Any thoughts here?

Hi @jack

I am not sure at the moment why it isn't returning the matching document. I'll see if I can reproduce and go from there.

I have a couple questions:

  1. When you increase size greater than 1 (say 100), do you see the document?
  2. Are there any duplicates in the index?
  3. Is the enabled field on the doc you are querying set to true?

For the k-NN query, the returned score calculation can be found here: Exact k-NN with scoring script - OpenSearch documentation. “must” may impact the score in the query. If you replace that with filter, then that won't contribute to the score. I'm not sure how nesting impacts the score, so to get the cosine similarity score, maybe first do a query without the nested parameter and then use the formula in the link above to reverse engineer the cosine similarity from the score returned.
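
A minimal sketch of that reverse engineering, in Python. It assumes the scoring-script formula for `cosinesimil` is score = 2 - d with d = 1 - cosine similarity, which simplifies to score = 1 + cosine similarity; please check this against the documentation for the version you are running. The function name `cosine_from_knn_score` is made up for illustration.

```python
def cosine_from_knn_score(score):
    """Invert score = 1 + cos(theta) to recover the cosine similarity.

    Assumption: the score comes only from the knn_score script with
    space_type "cosinesimil", with nothing else added to it.
    """
    return score - 1.0


# An exact match should score 2.0, which maps back to a cosine of 1.0.
print(cosine_from_knn_score(2.0))
# Orthogonal vectors should score 1.0, which maps back to a cosine of 0.0.
print(cosine_from_knn_score(1.0))
```

The inversion only holds when the script score is the sole contributor, so a scoring “must” clause or a nested average over several questions would need to be taken out of the query first.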

Jack

Thank you very much for the quick reply, this is super helpful.

  1. When you increase size greater than 1 (say 100), do you see the document?

When I do this the variance in the results does decrease, as in I get fewer incorrect results but the odd random result does still pop up. So in theory, as I add more documents this becomes less of a problem but it just becomes significantly harder to debug at that stage.

  2. Are there any duplicates in the index?

No duplicates.

  3. Is the enabled field on the doc you are querying set to true?

Yes, it is. This is always the first thing I look for when testing to ensure I haven’t made that mistake.

“must” may impact the score in the query. If you replace that with filter

I did not know this. Will make this change and take another look, thanks.
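
For reference, a sketch of what that change could look like, written as a Python dict so the body can be built and inspected before sending. The only intended difference from the original query is `filter` in place of `must`; the helper name `build_filtered_query` and the all-zero placeholder vector are illustrative assumptions, not part of the real setup.

```python
import json


def build_filtered_query(query_vector, size=1):
    """Build the search body with 'enabled' as a non-scoring filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # filter clauses restrict the results but add nothing to the score
                "filter": [{"match": {"enabled": True}}],
                "should": [
                    {
                        "nested": {
                            "score_mode": "avg",
                            "path": "questions",
                            "query": {
                                "script_score": {
                                    "query": {"match_all": {}},
                                    "script": {
                                        "lang": "knn",
                                        "source": "knn_score",
                                        "params": {
                                            "field": "questions.vector",
                                            "space_type": "cosinesimil",
                                            "query_value": query_vector,
                                        },
                                    },
                                }
                            },
                        }
                    }
                ],
            }
        },
    }


# Placeholder vector; a real query would pass the 512-dimensional embedding.
body = build_filtered_query([0.0] * 512)
print(json.dumps(body)[:80])
```

With the filter in place, the returned score should come from the nested script score alone, which makes it easier to compare against a cosine similarity computed by hand.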

so to get the cosine similarity score maybe first do a query without the nested parameter and then use the formula in the link above to reverse engineer the cosine similarity from the score returned.

This is also very helpful. Will try now, thanks.

This alone massively impacts the score. Thanks, it has helped a lot.

No problem! However, the score contributed by the must match query should be the same for all enabled results, so I don't think that can explain why you aren't getting the expected result.

Are you able to reproduce this when you aren't using a nested query?