Different results for NMSLIB and Elasticsearch k-NN search

I created an NMSLIB index of 200,000 vectors and an Elasticsearch k-NN index from the same vectors with the same properties.

Then I searched for the 50 nearest neighbours of each of the first 10,000 vectors in both indices and compared the results.

On average, only 0.5% of the results matched.

Since the Elasticsearch k-NN plugin is based on NMSLIB, I think the results should match.


Hey @nmsaey42,

Thank you for sharing your results. Our team is looking into it and will reach out to you for additional details.
Thanks,
Pavani


@nmsaey42,

Could you please describe the parameters you used in nmslib and the type of space (l2, cosine)?

Hello Team @vamshin @bpavani

Parameters:

space: l2
ef_construction: 500
ef_search: 300
m: 40
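
For reference, this is roughly how I built the nmslib index with these parameters (a sketch; vectors and query_vector are my own data, and the parameter names follow the nmslib Python API):

import nmslib

# build an HNSW index over the 200,000 vectors in l2 space
index = nmslib.init(method='hnsw', space='l2')
index.addDataPointBatch(vectors)
index.createIndex({'M': 40, 'efConstruction': 500}, print_progress=True)

# query-time parameter
index.setQueryTimeParams({'efSearch': 300})

# 50 nearest neighbours of one query vector
ids, distances = index.knnQuery(query_vector, k=50)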

@nmsaey42,

Thank you. So you were using different parameters than the plugin defaults, which would explain the differing results.

By default, the k-NN plugin uses m=16, ef_search=512, and ef_construction=512.

You can make the settings consistent by providing them as shown below:

PUT /my_index/_settings
{
    "index" : {
        "knn": true,
        "knn.algo_param.m": 40, 
        "knn.algo_param.ef_search" : 300,
        "knn.algo_param.ef_construction" : 500
    }
}
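
Equivalently, via elasticsearch-py (a sketch; my_index and the connection details are placeholders for your own):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost'], port=9200)

# mirrors the PUT /my_index/_settings call above
es.indices.put_settings(
    index='my_index',
    body={
        "index": {
            "knn": True,
            "knn.algo_param.m": 40,
            "knn.algo_param.ef_search": 300,
            "knn.algo_param.ef_construction": 500
        }
    }
)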

Hello @vamshin

I already set the index settings using

"index": {
      "knn": True,
      "knn.space_type": "l2",
      "knn.algo_param.ef_construction": 500,
      "knn.algo_param.ef_search": 300,
      "knn.algo_param.m": 40
    }

and verified it via
GET /my_index_name/_settings

I set the same parameters in both nmslib and the k-NN plugin.

@vamshin is there a chance that this problem is because I am using k-NN through the Community AMIs section of the EC2 console?

Hi @nmsaey42,

Sorry, I did not understand the point about the Community AMI. But that does not seem to be the problem, as long as the k-NN plugin is installed and running.

There could be a slight variation in results because of sharding and the number of segments (nmslib has one giant graph; Elasticsearch has multiple smaller graphs, one for each segment).
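
As a side note, one way to reduce this source of variation when testing is to force-merge the index down to a single segment, so only one HNSW graph is searched (a sketch via elasticsearch-py, reusing an es client as elsewhere in this thread; my_index is a placeholder):

# merge all segments into one, so a single graph is built and searched
# (can be slow and I/O heavy on large indices)
es.indices.forcemerge(index='my_index', max_num_segments=1)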

  • Can you help us with the recall rate you are seeing? Maybe it's better to compare against ground-truth results and measure the deviation.
  • Share the Elasticsearch query you use to retrieve neighbours with us.
  • Which version of Elasticsearch are you on?

@vamshin

Elasticsearch version: 7.7.0

Query to retrieve neighbours:

body = {
  "size": 25,
  "query": {
    "knn": {
      "feature_vector": {
        "vector": array,
        "k": 25
      }
    }
  }
}

# using elasticsearch-py
data = es.search(index='index-nam', body=body, size=25)

Recall of First 10,000 Samples for k=1 is: 0.0011

@vamshin can you please share what recall you get for vectors of length 600 and a sample of 200,000, using the above parameters?

Hi @nmsaey42,

We did not specifically test with 200k docs and 600 dimensions. But we tested on data sets from the ann-benchmarks suite, and our default params give us recall >= 0.95 most of the time.

  • How is the recall of nmslib itself on the same sample? We are trying to understand the deviation of nmslib vs. ground truth compared to the k-NN plugin vs. ground truth. Also, k=1 might not be a great test; how about recall at k=5 for your experiment?

Is there a way we could get access to your data for running experiments?

Hi @vamshin

Recall of nmslib on the same sample is almost 99% for k=1.

I have created a similar EC2 instance with almost the same data; if you want, you can scroll through it to get the data.

['13.232.20.38'],
http_auth=('admin', 'admin'),
scheme="http",
port=9200

I will keep it open for 1 day.

Thanks


Hi @nmsaey42, I am trying to investigate the issue.

I am unable to get access to '13.232.20.38'. I am trying to connect via the Elasticsearch Python client with the following configuration:

es = Elasticsearch(
    ['13.232.20.38'],
    http_auth=('admin', 'admin'),
    scheme="http",
    port=9200,
)

Can you confirm that this is publicly accessible?

Additionally, I have a few questions about the experiment:

  1. Have you confirmed that all of the documents indexed are searchable before running the recall queries? After indexing, the documents may not be immediately searchable. You can check how many documents are available by running /_cat/indices (see the sketch after this list).

  2. What do you mean by “first 10,000 vectors in both Indices”? Is the set of queries the same for both the Elasticsearch index and the NMSLIB index?

  3. In reference to “Recall of First 10,000 Samples for k=1 is: 0.0011”, you are just checking if the document returned matches the query, correct?

  4. Are there duplicate vectors?

  5. How is recall computed? Is the process different for Elasticsearch versus NMSLIB?
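
For question 1, a minimal check via elasticsearch-py (a sketch, reusing an es client as elsewhere in this thread; my_index is a placeholder for your index name):

# force a refresh so recently indexed documents become searchable
es.indices.refresh(index='my_index')

# confirm the number of searchable documents
print(es.count(index='my_index')['count'])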

Hello @jmazane

Thanks for the reply

Yes, it is accessible publicly.

I too had the same problem, but upgrading elasticsearch-py to version 7.7.1 solved it. This exact elasticsearch-py configuration works for me:

es = Elasticsearch(
    ['13.233.184.36'],
    http_auth=('admin', 'admin'),
    scheme="http",
    port=9200,
)
  1. The output of /_cat/indices is:

yellow open security-auditlog-2020.06.15 FOrEpEcaQuWf6IVplXd2UA 1 1 1 0 12.5kb 12.5kb
green open .opendistro_security hnf07EVLQcSn8dMEqBT68w 1 0 7 0 37.1kb 37.1kb
yellow open knn-index 8bhi5QIWQtSi7PePn-SIDQ 1 1 501270 0 4gb 4gb

  2. I extracted 10,000 vectors from the nmslib index and, for each vector, I query both the nmslib index and the Elasticsearch k-NN index.

  3. Yes, just checking if the document returned matches the query.

  4. No, there are no duplicate vectors.

  5. For k = 1, I query both indexes (nmslib & Elasticsearch k-NN), and if the query returns the exact same ID I add 1, otherwise 0, to total_score. In the end, I divide total_score by the number of test elements (10,000); see the sketch just below.
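
Roughly, in code (a sketch; query_elastic_knn and query_nmslib are my own wrappers that return lists of document IDs):

total_score = 0
for vector in vectors:
    es_ids = query_elastic_knn(vector, k=1)
    nms_ids = query_nmslib(vector, k=1)
    # add 1 when the top result IDs match, 0 otherwise
    total_score += 1 if es_ids[0] == nms_ids[0] else 0
print('Recall@1:', total_score / len(vectors))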

For k > 1, considering nmslib as ground truth, I query both indexes (nmslib & Elasticsearch k-NN), compute the intersection of the two result sets, and add (length of intersection)/k to total_score for each query. In the end, I divide total_score by the number of test elements (10,000).

Here is dummy code for k > 1:

def intersection_score(lst1, lst2):
    # fraction of lst1's IDs that also appear in lst2
    # (a float under Python 3's true division)
    lst3 = [value for value in lst1 if value in lst2]
    return len(lst3) / len(lst1)

total_score = 0
for vector in vectors:
    res1 = query_elastic_knn(vector, k=5)
    res2 = query_nmslib(vector, k=5)
    total_score += intersection_score(res1, res2)

print('Final Score', total_score / len(vectors))

I consider nmslib as ground truth because Elasticsearch k-NN is based on nmslib, so I want to check the similarity of their results.

Thanks

Hi @nmsaey42,

I was able to connect to the cluster. I think I had a VPN issue.

I wanted to try to reproduce the test cases for k=1. With the following query, I was able to get a few random documents from the index:

POST /knn-index/_search
{
 "query": {
   "function_score": {
     "query": {
       "match_all": {}
     },
     "functions": [
       {
         "random_score": {}
       }
     ]
   }
 }
}

I took the vectors from 5 of the returned results (document IDs: my-7557114, ss-204754389_9303, my-2173151, my-8457237, kv-1842763) and ran the following query for each:

POST /knn-index/_search
{
  "size" : 1,
  "query": {
    "knn": {
      "feature_vector": {
        "vector": [<vals>],
        "k": 1
      }
    }
  }
}

Each query returned the corresponding document. While this is only a small subset, it suggests that recall should be much higher than 0.0011. If possible, could you provide document IDs for the Elasticsearch documents whose queries do not return their associated doc ID?

Additionally, looking at the knn-index, it has 29 segments. Each of these segments corresponds to one HNSW graph. During search, Elasticsearch runs the k-NN search over each segment. Each segment produces its top k results with a score of 1/(1 + distance from vector to query). Then, Elasticsearch takes the top size scores from all of the segment results. So, searching over many smaller graphs and then aggregating the results may improve recall (at the expense of latency) compared to searching a single large graph. Taking the NMSLIB results as ground truth is therefore not correct. It may be better to check out the ann-benchmarks data sets: they contain a set of queries and the ground-truth nearest neighbors for each of them.
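
For reference, you can sanity-check a returned score by recomputing it by hand (a sketch assuming l2 space; whether the distance here is the plain or the squared Euclidean distance may depend on the plugin version, so verify against a known-good hit first):

import numpy as np

def expected_score(query_vec, doc_vec):
    # plugin scores l2 hits as 1 / (1 + distance); plain l2 assumed here
    dist = np.linalg.norm(np.asarray(query_vec) - np.asarray(doc_vec))
    return 1.0 / (1.0 + dist)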

One more thing: in the code you used to calculate recall, do you account for integer division in the return of intersection_score (i.e., under Python 2, len(lst3)/len(lst1) will always yield 0 or 1)? This could reduce the score as well.

Hello @jmazane

Thanks for the reply.

Using elasticsearch-py,

body = {"query": { "match_all": {} } }
hits = es.search(index='knn-index', body=body, size=10)['hits']['hits']
for d in hits:
  feat = d['_source']['feature_vector']
  _id  = d['_id']
  body_n = {
    "size": 1,
    "query": {
      "knn": {
        "feature_vector": {
          "vector": feat,
          "k": 1
        }
      }
    }
  }
  new_id = es.search(index='knn-index', body=body_n, size=1)['hits']['hits'][0]['_id']
  print(_id, new_id, _id == new_id )

I am running this code. I think _id and new_id should match every time, but they match only for the first document.

And in intersection_score, len(lst3)/len(lst1) does give me a floating-point score, which I am adding as well.

Thanks for helping me @jmazane

Ah, okay I see the same failure. I played around with it a little bit.

Something seems wrong.

When I issue this query (corresponding to document ID ss-204944620_9204):

POST /knn-index/_search
{
  "stored_fields": "_none_",
  "docvalue_fields": ["_id"],
  "size" : 15,
  "query": {
    "knn": {
      "feature_vector": {
        "vector": [<ss-204944620_9204>'s vectors],
        "k": 15
      }
    }
  }
}

it returns 15 documents with scores of 1. However, manually computing these scores (for instance for docID ss-204655098_9212) does not yield a score of 1. Additionally, indexing both ss-204944620_9204 and ss-204655098_9212 into a different test cluster and then searching for ss-204944620_9204 yields only 1 result (itself) with a score of 1. Meanwhile, the score for ss-204655098_9212 is what would be expected.

Additionally, replacing body = {"query": { "match_all": {} } } with body = {"query":{"function_score":{"query":{"match_all":{}},"functions":[{"random_score":{}}]}}} yields much better results. The documents that seem to be failing most often start with ss-*.

I am going to run some more tests on the Community AMI. Would it be okay to download the documents from your cluster, so that I can ingest them into my own cluster for reproducibility?

Hi @jmazane

Yes, please download the documents from the cluster.

Is this problem only with the community AMI?

Thanks for the help

Hi @nmsaey42

I was able to reproduce the issue on the community AMI. We are still root-causing it. The community AMI may not have the most up-to-date JNI library installed.

While we root-cause it, as a workaround we were able to build the JNI library from source on the community AMI instance and copy it to /usr/lib, which solved the issue on our end. If you are able to, could you try doing this on your instance:

  1. Install cmake, g++, and Java 14
  2. Set JAVA_HOME=
  3. git clone https://github.com/opendistro-for-elasticsearch/k-NN.git
  4. git checkout v1.8.0.0
  5. cd jni && cmake . && make
  6. cp release/libKNNIndexV1_7_3_6.so /usr/lib
  7. Start the ES process
  8. Rerun the tests

Alternatively, using the Docker image may work as well.

Please update this thread with results if you can. We will update this thread once we have the final fix.

Thanks for finding this!

Hi @jmazane

I followed the steps provided above.
In step 6, there was no folder named "release"; I found the libKNNIndexV1_7_3_6.so file in the "k-NN/buildSrc" folder instead.

But this code still produces the same error:

body = {"query": { "match_all": {} } }
hits = es.search(index='knn-index', body=body, size=10)['hits']['hits']
for d in hits:
  feat = d['_source']['feature_vector']
  _id  = d['_id']
  body_n = {
    "size": 1,
    "query": {
      "knn": {
        "feature_vector": {
          "vector": feat,
          "k": 1
        }
      }
    }
  }
  new_id = es.search(index='knn-index', body=body_n, size=1)['hits']['hits'][0]['_id']
  print(_id, new_id, _id == new_id )

I will try some other methods of installing k-NN (RPM or Docker) and will let you know the results.

Thanks for the help

Hi @nmsaey42,

This looks to possibly be caused by this bug: Bad recall from ODFE1.8 · Issue #154 · opendistro-for-elasticsearch/k-NN · GitHub.
@jmazane and I are working on a fix. The same issue could happen with other distributions as well. We will update you after doing some performance tests, and you can then verify on the community AMI. Thanks for reporting this issue.

Hello @vamshin

I upgraded the Open Distro version to 1.9.0.

And

I have indexed 520,229 documents, and I am continuously indexing new documents while searching the index.

But the results are still mismatched.

When I search with the vector of an already-indexed document, the original document is not returned. Instead, a recently indexed document gets high priority and is returned as the number 1 result.

It looks like it is not searching the entire graph and is prioritising the most recently indexed documents.