Different results for NMSLIB and Elasticsearch k-NN search

I created an NMSLIB index of 200,000 vectors and an Elasticsearch k-NN index from the same vectors with the same properties.

Then I searched for the 50 nearest neighbours of each of the first 10,000 vectors in both indices and compared the results.

On average, only 0.5% of the results matched.

Since the Elasticsearch k-NN plugin is based on NMSLIB, I think the results should match.


Hey @nmsaey42,

Thank you for sharing your results. Our team is looking into it and will reach out to you for additional details.
Thanks,
Pavani


@nmsaey42,

Could you please describe the parameters you used in nmslib and the type of space (l2, cosine)?

Hello Team @vamshin @bpavani

Parameters:

space: l2
ef_construction: 500
ef_search: 300
m: 40
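
For reference, this is roughly how I built the nmslib index with these parameters (a sketch; vectors and query_vector are my own data, and the parameter names follow the nmslib Python API):

import nmslib

# build an HNSW index over the 200,000 vectors in l2 space
index = nmslib.init(method='hnsw', space='l2')
index.addDataPointBatch(vectors)
index.createIndex({'M': 40, 'efConstruction': 500}, print_progress=True)

# query-time parameter
index.setQueryTimeParams({'efSearch': 300})

# 50 nearest neighbours of one query vector
ids, distances = index.knnQuery(query_vector, k=50)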

@nmsaey42,

Thank you. So you were using different parameters than the plugin defaults, which would explain the differing results.

By default, the k-NN plugin uses m=16, ef_search=512, and ef_construction=512.

You can make the settings consistent by providing them as shown below:

PUT /my_index/_settings
{
    "index" : {
        "knn": true,
        "knn.algo_param.m": 40, 
        "knn.algo_param.ef_search" : 300,
        "knn.algo_param.ef_construction" : 500
    }
}
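
Equivalently, via elasticsearch-py (a sketch; my_index and the connection details are placeholders for your own):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost'], port=9200)

# mirrors the PUT /my_index/_settings call above
es.indices.put_settings(
    index='my_index',
    body={
        "index": {
            "knn": True,
            "knn.algo_param.m": 40,
            "knn.algo_param.ef_search": 300,
            "knn.algo_param.ef_construction": 500
        }
    }
)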

Hello @vamshin

I already set the index settings using

"index": {
      "knn": True,
      "knn.space_type": "l2",
      "knn.algo_param.ef_construction": 500,
      "knn.algo_param.ef_search": 300,
      "knn.algo_param.m": 40
    }

and verified it via
GET /my_index_name/_settings

I set the same parameters in both nmslib and the k-NN plugin.

@vamshin is there a chance that this problem is because I am using k-NN through the Community AMIs section of the EC2 console?

Hi @nmsaey42,

Sorry, I did not understand the point about the Community AMI. But that does not seem to be the problem, as long as the k-NN plugin is installed and running.

There could be a slight variation in results because of sharding and the number of segments (nmslib has one giant graph; Elasticsearch has multiple smaller graphs, one for each segment).
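
As a side note, one way to reduce this source of variation when testing is to force-merge the index down to a single segment, so only one HNSW graph is searched (a sketch via elasticsearch-py, reusing an es client as elsewhere in this thread; my_index is a placeholder):

# merge all segments into one, so a single graph is built and searched
# (can be slow and I/O heavy on large indices)
es.indices.forcemerge(index='my_index', max_num_segments=1)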

  • Can you help us with the recall rate you are seeing? Maybe it's better to compare against ground-truth results and measure the deviation.
  • Share the Elasticsearch query you use to retrieve neighbours with us.
  • Which version of Elasticsearch are you on?

@vamshin

Elasticsearch version: 7.7.0

Query to retrieve neighbours:

body = {
  "size": 25,
  "query": {
    "knn": {
      "feature_vector": {
        "vector": array,
        "k": 25
      }
    }
  }
}

# using elasticsearch-py
data = es.search(index='index-nam', body=body, size=25)

Recall of First 10,000 Samples for k=1 is: 0.0011

@vamshin can you please share what recall you get for vectors of length 600 and a sample of 200,000, using the above parameters?

Hi @nmsaey42,

We did not specifically test with 200k docs and 600 dimensions. But we tested on data sets from the ann-benchmarks suite, and our default params give us recall >= 0.95 most of the time.

  • How is the recall of nmslib itself on the same sample? We are trying to understand the deviation of nmslib vs. ground truth compared to the k-NN plugin vs. ground truth. Also, k=1 might not be a great test; how about recall at k=5 for your experiment?

Is there a way we could get access to your data for running experiments?

Hi @vamshin

Recall of nmslib on the same sample is almost 99% for k=1.

I have created a similar EC2 instance with almost the same data; if you want, you can scroll through it to get the data.

['13.232.20.38'],
http_auth=('admin', 'admin'),
scheme="http",
port=9200

I will keep it open for 1 day.

Thanks


Hi @nmsaey42, I am trying to investigate the issue.

I am unable to get access to '13.232.20.38'. I am trying to connect via the Elasticsearch Python client with the following configuration:

es = Elasticsearch(
    ['13.232.20.38'],
    http_auth=('admin', 'admin'),
    scheme="http",
    port=9200,
)

Can you confirm that this is publicly accessible?

Additionally, I have a few questions about the experiment:

  1. Have you confirmed that all of the documents indexed are searchable before running the recall queries? After indexing, the documents may not be immediately searchable. You can check how many documents are available by running /_cat/indices (see the sketch after this list).

  2. What do you mean by “first 10,000 vectors in both Indices”? Is the set of queries the same for both the Elasticsearch index and the NMSLIB index?

  3. In reference to “Recall of First 10,000 Samples for k=1 is: 0.0011”, you are just checking if the document returned matches the query, correct?

  4. Are there duplicate vectors?

  5. How is recall computed? Is the process different for Elasticsearch versus NMSLIB?
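
For question 1, a minimal check via elasticsearch-py (a sketch, reusing an es client as elsewhere in this thread; my_index is a placeholder for your index name):

# force a refresh so recently indexed documents become searchable
es.indices.refresh(index='my_index')

# confirm the number of searchable documents
print(es.count(index='my_index')['count'])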

Hello @jmazane

Thanks for the reply

Yes, it is accessible publicly.

I too had the same problem, but upgrading elasticsearch-py to version 7.7.1 solved it. This exact elasticsearch-py configuration works for me:

es = Elasticsearch(
    ['13.233.184.36'],
    http_auth=('admin', 'admin'),
    scheme="http",
    port=9200,
)
  1. The output of /_cat/indices is:

yellow open security-auditlog-2020.06.15 FOrEpEcaQuWf6IVplXd2UA 1 1 1 0 12.5kb 12.5kb
green open .opendistro_security hnf07EVLQcSn8dMEqBT68w 1 0 7 0 37.1kb 37.1kb
yellow open knn-index 8bhi5QIWQtSi7PePn-SIDQ 1 1 501270 0 4gb 4gb

  2. I extracted 10,000 vectors from the nmslib index and, for each vector, I query both the nmslib index and the Elasticsearch k-NN index.

  3. Yes, just checking if the document returned matches the query.

  4. No, there are no duplicate vectors.

  5. For k = 1, I query both indexes (nmslib & Elasticsearch k-NN), and if the query returns the exact same ID I add 1, otherwise 0, to total_score. In the end, I divide total_score by the number of test elements (10,000); see the sketch just below.
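
Roughly, in code (a sketch; query_elastic_knn and query_nmslib are my own wrappers that return lists of document IDs):

total_score = 0
for vector in vectors:
    es_ids = query_elastic_knn(vector, k=1)
    nms_ids = query_nmslib(vector, k=1)
    # add 1 when the top result IDs match, 0 otherwise
    total_score += 1 if es_ids[0] == nms_ids[0] else 0
print('Recall@1:', total_score / len(vectors))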

For k > 1, considering nmslib as ground truth, I query both indexes (nmslib & Elasticsearch k-NN), compute the intersection of the two result sets, and add (length of intersection)/k to total_score for each query. In the end, I divide total_score by the number of test elements (10,000).

Here is dummy code for k > 1:

def intersection_score(lst1, lst2):
    # fraction of lst1's IDs that also appear in lst2
    # (a float under Python 3's true division)
    lst3 = [value for value in lst1 if value in lst2]
    return len(lst3) / len(lst1)

total_score = 0
for vector in vectors:
    res1 = query_elastic_knn(vector, k=5)
    res2 = query_nmslib(vector, k=5)
    total_score += intersection_score(res1, res2)

print('Final Score', total_score / len(vectors))

I consider nmslib as ground truth because Elasticsearch k-NN is based on nmslib, so I want to check the similarity of their results.

Thanks

Hi @nmsaey42,

I was able to connect to the cluster. I think I had a VPN issue.

I wanted to try to reproduce the test cases for k=1. With the following query, I was able to get a few random documents from the index:

POST /knn-index/_search
{
 "query": {
   "function_score": {
     "query": {
       "match_all": {}
     },
     "functions": [
       {
         "random_score": {}
       }
     ]
   }
 }
}

I took the vectors from 5 of the returned results (document IDs: my-7557114, ss-204754389_9303, my-2173151, my-8457237, kv-1842763) and ran the following query for each:

POST /knn-index/_search
{
  "size" : 1,
  "query": {
    "knn": {
      "feature_vector": {
        "vector": [<vals>],
        "k": 1
      }
    }
  }
}

Each query returned the corresponding document. While this is only a small subset, it suggests that recall should be much higher than 0.0011. If possible, could you provide document IDs for the Elasticsearch documents whose queries do not return their associated doc ID?

Additionally, looking at the knn-index, it has 29 segments. Each of these segments corresponds to one HNSW graph. During search, Elasticsearch runs the k-NN search over each segment. Each segment produces its top k results with a score of 1/(1 + distance from vector to query). Then, Elasticsearch takes the top size scores from all of the segment results. So, searching over many smaller graphs and then aggregating the results may improve recall (at the expense of latency) compared to searching a single large graph. Taking the NMSLIB results as ground truth is therefore not correct. It may be better to check out the ann-benchmarks data sets: they contain a set of queries and the ground-truth nearest neighbors for each of them.
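
For reference, you can sanity-check a returned score by recomputing it by hand (a sketch assuming l2 space; whether the distance here is the plain or the squared Euclidean distance may depend on the plugin version, so verify against a known-good hit first):

import numpy as np

def expected_score(query_vec, doc_vec):
    # plugin scores l2 hits as 1 / (1 + distance); plain l2 assumed here
    dist = np.linalg.norm(np.asarray(query_vec) - np.asarray(doc_vec))
    return 1.0 / (1.0 + dist)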

One more thing: in the code you used to calculate recall, do you account for integer division in the return of intersection_score (i.e., under Python 2, len(lst3)/len(lst1) will always yield 0 or 1)? This could reduce the score as well.

Hello @jmazane

Thanks for the reply.

Using elasticsearch-py,

body = {"query": { "match_all": {} } }
hits = es.search(index='knn-index', body=body, size=10)['hits']['hits']
for d in hits:
  feat = d['_source']['feature_vector']
  _id  = d['_id']
  body_n = {
    "size": 1,
    "query": {
      "knn": {
        "feature_vector": {
          "vector": feat,
          "k": 1
        }
      }
    }
  }
  new_id = es.search(index='knn-index', body=body_n, size=1)['hits']['hits'][0]['_id']
  print(_id, new_id, _id == new_id )

I am running this code. I think _id and new_id should match every time, but they match only for the first document.

And in intersection_score, len(lst3)/len(lst1) does give me a floating-point score, which I am adding as well.

Thanks for helping me @jmazane

Ah, okay I see the same failure. I played around with it a little bit.

Something seems wrong.

When I issue this query (corresponding to document ID ss-204944620_9204):

POST /knn-index/_search
{
  "stored_fields": "_none_",
  "docvalue_fields": ["_id"],
  "size" : 15,
  "query": {
    "knn": {
      "feature_vector": {
        "vector": [<ss-204944620_9204>'s vectors],
        "k": 15
      }
    }
  }
}

it returns 15 documents with scores of 1. However, manually computing these scores (for instance for docID ss-204655098_9212) does not yield a score of 1. Additionally, indexing both ss-204944620_9204 and ss-204655098_9212 into a different test cluster and then searching for ss-204944620_9204 yields only 1 result (itself) with a score of 1. Meanwhile, the score for ss-204655098_9212 is what would be expected.

Additionally, replacing body = {"query": { "match_all": {} } } with body = {"query":{"function_score":{"query":{"match_all":{}},"functions":[{"random_score":{}}]}}} yields much better results. The documents that seem to be failing most often start with ss-*.

I am going to run some more tests on the Community AMI. Would it be okay to download the documents from your cluster, so that I can ingest them into my own cluster for reproducibility?

Hi @jmazane

Yes, please download the documents from the cluster.

Is this problem only with the community AMI?

Thanks for the help

Hi @nmsaey42

I was able to reproduce the issue on the community AMI. We are still root-causing it. The community AMI may not have the most up-to-date JNI library installed.

While we root-cause it, as a workaround we were able to build the JNI library from source on the community AMI instance and copy it to /usr/lib, which solved the issue on our end. If you are able to, could you try doing this on your instance:

  1. Install cmake, g++, and Java 14
  2. Set JAVA_HOME=
  3. git clone https://github.com/opendistro-for-elasticsearch/k-NN.git
  4. git checkout v1.8.0.0
  5. cd jni && cmake . && make
  6. cp release/libKNNIndexV1_7_3_6.so /usr/lib
  7. Start the ES process
  8. Rerun the tests

Alternatively, using the Docker image may work as well.

Please update this thread with results if you can. We will update this thread once we have the final fix.

Thanks for finding this!

Hi @jmazane

I followed the steps provided above.
In step 6, there was no folder named "release"; I found the libKNNIndexV1_7_3_6.so file in the "k-NN/buildSrc" folder instead.

But this code still produces the same error:

body = {"query": { "match_all": {} } }
hits = es.search(index='knn-index', body=body, size=10)['hits']['hits']
for d in hits:
  feat = d['_source']['feature_vector']
  _id  = d['_id']
  body_n = {
    "size": 1,
    "query": {
      "knn": {
        "feature_vector": {
          "vector": feat,
          "k": 1
        }
      }
    }
  }
  new_id = es.search(index='knn-index', body=body_n, size=1)['hits']['hits'][0]['_id']
  print(_id, new_id, _id == new_id )

I will try some other methods of installing k-NN (RPM or Docker) and will let you know the results.

Thanks for the help

Hi @nmsaey42,

This looks to possibly be caused by this bug: Bad recall from ODFE1.8 · Issue #154 · opendistro-for-elasticsearch/k-NN · GitHub.
@jmazane and I are working on a fix. The same issue could happen with other distributions as well. We will update you after doing some performance tests, and you can then verify on the community AMI. Thanks for reporting this issue.

Hello @vamshin

I upgraded the Open Distro version to 1.9.0.

And

I have indexed 520,229 documents, and I am continuously indexing new documents while searching the index.

But the results are still mismatched.

When I search with the vector of an already-indexed document, the original document is not returned. Instead, a recently indexed document gets high priority and is returned as the number 1 result.

It looks like it is not searching the entire graph and is prioritising the most recently indexed documents.