Elasticsearch Hybrid Query - No Results

I’m trying to run a hybrid search over two fields: a full-text field and a knn_vector (word embeddings) field. Over 10,000 Wikipedia documents are currently indexed on an ES stack, each with both fields (see “content” and “embeddings” in the mapping).

It is important to note that the knn_vector field is defined inside a nested object.

This is the current mapping of the items indexed:

mapping = {
    "settings": {
        "index": {
            "knn": True,
            "knn.space_type": "cosinesimil"
        }
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "elasticId": {"type": "text"},
            "owners": {"type": "text"},
            "type": {"type": "keyword"},
            "accessLink": {"type": "keyword"},
            "content": {"type": "text"},
            "embeddings": {
                "type": "nested",
                "properties": {
                    "vector": {
                        "type": "knn_vector",
                        "dimension": VECTOR_DIM
                    }
                }
            }
        }
    }
}

My goal is to compare the query scores for both fields to understand whether one is more effective than the other (full text vs. knn_vector), and how Elasticsearch decides which documents to return based on the score from each.

I understand I could simply split the queries (two separate queries), but ideally, we might want to use a hybrid search of this type in production.

This is the current query that searches on both full text and the knn_vectors:

def MakeHybridSearch(query):
    query_vector = convert_to_embeddings(query)
    result = elastic.search({
        "explain": True,
        "profile": True,
        "size": 2,
        "query": {
            "function_score": {
                "functions": [
                    {
                        "filter": {
                            "match": {
                                "text": {
                                    "query": query,
                                    "boost": "5",
                                },
                            },
                        },
                        "weight": 2
                    },
                    {
                        "filter": {
                            "script": {
                                "source": "knn_score",
                                "params": {
                                    "field": "doc_vector",
                                    "vector": query_vector,
                                    "space_type": "l2"
                                }
                            }
                        },
                        "weight": 4
                    }
                ],
                "max_boost": 5,
                "score_mode": "replace",
                "boost_mode": "multiply",
                "min_score": 5
            }
        }
    }, index='files_en', size=1000)

The problem is that none of my queries return any results.
Result:

{
"took": 3,
"timed_out": false,
"_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
},
"hits": {
    "total": {
        "value": 0,
        "relation": "eq"
    },
    "max_score": null,
    "hits": []
}
}

Even when the query does return a response, it returns hits with a score of 0.

Is there an error in the query structure? Could this be on the mapping side? If not, is there a better way of doing this?

Thank you for your help!

@vincentD Apologies for the late response. Some observations from my side:

  1. Since your knn_vector is nested, the field should be “embeddings.vector” instead of doc_vector:

{
    "script_score": {
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "embeddings.vector",
                "vector": [2.0, 3.0, 5.0, 6.0],
                "space_type": "l2"
            }
        }
    }
}

  2. Since min_score is 5, the calculated l2 score may be lower than min_score. I would set min_score to zero to check whether you get any results or not.

Also, if you only want to use custom scoring (as in your example), you can omit "index.knn": true. The benefit of this approach is faster indexing speed and lower memory usage, but you lose the ability to perform standard k-NN queries on the index.
Please let us know if you still have issues after making the above changes.
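For readers following along, here is a minimal Python sketch of how the two observations above could be folded into a single search body. It is an illustration under stated assumptions, not a query confirmed in this thread: the helper name build_hybrid_query is hypothetical, combining the clauses with bool/should is my own choice, the full-text clause targets "content" because that is the text field in the posted mapping, and the nested wrapper follows what later replies in this thread found necessary for nested vector fields. The vector parameter name is configurable because the reply above uses "vector" while later examples use "query_value". It has not been run against a cluster.

```python
def build_hybrid_query(text, query_vector, size=10, min_score=0.0,
                       vector_param="query_value"):
    """Build a search body mixing a full-text clause and a k-NN script score.

    Assumptions (not confirmed in the thread): bool/should is used to combine
    the two clauses, and the script parameter holding the query vector is
    named by `vector_param`.
    """
    # Full-text clause on the "content" field from the posted mapping.
    text_clause = {"match": {"content": {"query": text}}}

    # k-NN clause: the vector lives under the nested "embeddings" object,
    # so the script_score is wrapped in a nested query on that path.
    knn_clause = {
        "nested": {
            "path": "embeddings",
            "query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        "lang": "knn",
                        "source": "knn_score",
                        "params": {
                            "field": "embeddings.vector",
                            vector_param: list(query_vector),
                            "space_type": "l2",
                        },
                    },
                }
            },
        }
    }

    return {
        "size": size,
        # Start at 0 so low k-NN scores are not silently filtered out.
        "min_score": min_score,
        "query": {"bool": {"should": [text_clause, knn_clause]}},
    }


body = build_hybrid_query("alan turing", [2.0, 3.0, 5.0, 6.0])
```

The resulting dict can be passed as the body of a search call. Once results come back, min_score can be raised again to see at which threshold hits start to disappear.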

I have a similar question, where my index is as follows:

"mappings": {
    "properties": {
        "children": {
            "type": "nested",
            "properties": {
                "metadata": {
                    "properties": {
                        "retrieval_vectors": {
                            "properties": {
                                "vector": {
                                    "type": "knn_vector",
                                    "dimension": 512
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

So the path is children.metadata.retrieval_vectors.vector, where children is a list that I’ve set as nested. However, a query like the one below returns no results. Is it possible to query nested data like this?

{
    "size": 10,
    "query": {
        "knn": {
            "children.metadata.retrieval_vectors.vector": {
                "vector": query_vec,
                "k": 10
            }
        }
    }
}

When I instead use a match_all query with custom scoring, similar to the original post, every document gets a score of 1.4e-45, even when the query vector is in the data (so the score should be 1).

I’m confused because I would expect an error if the field could not be found, but some vector seems to be used to produce the 1.4e-45 score.

@gdd314596 Have you tried using the include_in_parent parameter? In the meantime, let me try to reproduce your use case and get back to you soon.

Thanks for the reply, @Vijay.
I tried include_in_root, and it works when there is only one child (so only one vector). However, you cannot index multiple children this way (I believe that is a general limitation of mapping vectors with include_in_root/include_in_parent).

Here is a minimal example that can help with your testing and hopefully save you some work. It assumes a working elasticsearch.Elasticsearch client object in Python called es.

First, create the index:

import numpy as np

index_name = "knn_dummy"

mapping = {
    "settings": {
        "index": {
            "knn": True,
            "knn.space_type": "l2"
        }
    },
    "mappings": {
        "properties": {
            "children": {
                "type": "nested",
                "properties": {
                    "metadata": {
                        "properties": {
                            "retrieval_vectors": {
                                "properties": {
                                    "vgg16": {
                                        "type": "knn_vector",
                                        "dimension": 8
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

# Drop any previous copy of the index, then create it with the mapping above.
es.indices.delete(index=index_name, ignore=[400, 404])
es.indices.create(index=index_name, body=mapping)

Then ingest random data. You can control the number of children: with only one child and include_in_parent, it does work. With more children, however, you cannot even ingest and have to remove those settings.

num_items = 20
num_children = 2
for i in range(num_items):
    # Each document holds num_children nested children with a random 8-d vector.
    children = [
        {"metadata": {"retrieval_vectors": {"vgg16": np.random.randn(8).tolist()}}}
        for _ in range(num_children)
    ]
    data = {"id_field": str(i), "children": children}
    res = es.index(index=index_name, id=str(i), body=data)

Then you’ll get a very small max_score with the query below

# query_vec is a list of 8 floats, e.g. one of the vectors indexed above.
qry = {
    "size": 1,
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "lang": "knn",
                "source": "knn_score",
                "params": {
                    "field": "children.metadata.retrieval_vectors.vgg16",
                    "query_value": query_vec,
                    "space_type": "l2"
                }
            }
        }
    }
}
matches = es.search(index=index_name, body=qry)
matches['hits']['max_score']

It’s also worth mentioning that the standard query simply gives no results.

qry = {
    "size": 10,
    "query": {
        "knn": {
            "children.metadata.retrieval_vectors.vgg16": {
                "vector": query_vec,
                "k": 10
            }
        }
    }
}

matches = es.search(index=index_name, body=qry)

Thanks for the script, @gdd314596.
I managed to solve one of the two problems you posted here, so I thought I would update you before moving on to the next.

  1. Custom scoring is not working (I will work on this in my next post).
  2. The standard query gives no results (see my findings below on how to make it work).

Note: I didn’t do three-level nesting, only two levels. I presume it should work if you extend it.

Step 1: Create a mapping

curl -X PUT "localhost:9200/myindex1?pretty" -H 'Content-Type: application/json' -d'
{
   "settings":{
      "index":{
         "knn":true,
         "knn.space_type":"l2"
      }
   },
   "mappings":{
      "properties":{
         "children":{
           "type":"nested",
           "properties":{
              "metadata":{
                    "type":"nested",
                    "properties":{
                        "vgg16":{
                            "type":"knn_vector",
                            "dimension":8
                        }
                    }
              }
            }
           }
      }
    }
}'

Step 2: Add some documents

curl -X PUT "localhost:9200/myindex1/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
   "children":[
      {
         "metadata":{
            "vgg16":[
               -0.56617706,
               -1.97073141,
               2.34508821,
               0.76267552,
               -0.99612565,
               1.83671205,
               -0.39932499,
               -2.17742888
            ],
            "id_field":"0"
         },
         "m_field":"hello"
      },
      {
         "metadata":{
            "vgg16":[
               1.00222467,
               0.63005195,
               1.43128642,
               0.20697815,
               1.34556994,
               0.4318985,
               0.42407732,
               -0.68597343
            ],
            "id_field":"1"
         },
         "m_field":"rate"
      }
   ]
}
'

curl -X PUT "localhost:9200/myindex1/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
   "children":[
      {
         "metadata":{
            "vgg16":[
               0.56617706,
               1.97073141,
               2.34508821,
               0.76267552,
               0.99612565,
               1.83671205,
               0.39932499,
               2.17742888
            ],
            "id_field":"0"
         },
         "m_field":"rate"
      }
   ]
}
'

Step 3: Search document with size 1

curl -X GET "localhost:9200/myindex1/_search?pretty" -H 'Content-Type: application/json' -d'{
   "size":1,
   "query":{
      "nested":{
         "path":"children",
         "query":{
            "nested":{
               "path":"children.metadata",
               "query":{
                  "knn":{
                     "children.metadata.vgg16":{
                        "vector":[
                           -0.56617706,
                           -1.97073141,
                           2.34508821,
                           0.76267552,
                           -0.99612565,
                           1.83671205,
                           -0.39932499,
                           -2.17742888
                        ],
                        "k":10
                     }
                  }
               }
            }
         }
      }
   }
}'

output:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.5900459,
    "hits" : [
      {
        "_index" : "myindex1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5900459,
        "_source" : {
          "children" : [
            {
              "metadata" : {
                "vgg16" : [
                  -0.56617706,
                  -1.97073141,
                  2.34508821,
                  0.76267552,
                  -0.99612565,
                  1.83671205,
                  -0.39932499,
                  -2.17742888
                ],
                "id_field" : "0"
              },
              "m_field" : "hello"
            },
            {
              "metadata" : {
                "vgg16" : [
                  1.00222467,
                  0.63005195,
                  1.43128642,
                  0.20697815,
                  1.34556994,
                  0.4318985,
                  0.42407732,
                  -0.68597343
                ],
                "id_field" : "1"
              },
              "m_field" : "rate"
            }
          ]
        }
      }
    ]
  }
}

You have to wrap the search in a nested query that names the parent path. I will look into your custom scoring issue and get back to you soon. Hope this helps.

Here is the query for custom scoring. The logic is the same: build the query in the nested format shown above so that the data can be accessed from Lucene.

curl -X GET "localhost:9200/myindex1/_search?pretty" -H 'Content-Type: application/json' -d'{
   "size":1,
   "query":{
      "nested":{
         "path":"children",
         "query":{
            "nested":{
               "path":"children.metadata",
               "query":{
                  "script_score":{
                     "query":{
                        "match_all":{
                           
                        }
                     },
                     "script":{
                        "lang":"knn",
                        "source":"knn_score",
                        "params":{
                           "field":"children.metadata.vgg16",
                           "query_value":[
                              -0.56617706,
                              -1.97073141,
                              2.34508821,
                              0.76267552,
                              -0.99612565,
                              1.83671205,
                              -0.39932499,
                              -2.17742888
                           ],
                           "space_type":"l2"
                        }
                     }
                  }
               }
            }
         }
      }
   }
}'

output:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.5230126,
    "hits" : [
      {
        "_index" : "myindex1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5230126,
        "_source" : {
          "children" : [
            {
              "metadata" : {
                "vgg16" : [
                  -0.56617706,
                  -1.97073141,
                  2.34508821,
                  0.76267552,
                  -0.99612565,
                  1.83671205,
                  -0.39932499,
                  -2.17742888
                ],
                "id_field" : "0"
              },
              "m_field" : "hello"
            },
            {
              "metadata" : {
                "vgg16" : [
                  1.00222467,
                  0.63005195,
                  1.43128642,
                  0.20697815,
                  1.34556994,
                  0.4318985,
                  0.42407732,
                  -0.68597343
                ],
                "id_field" : "1"
              },
              "m_field" : "rate"
            }
          ]
        }
      }
    ]
  }
}

Fantastic, that works. Thanks so much!

However, I did notice a couple of things. In my testing with a standard k-NN query (not custom scoring), I get a score of 1. This is the desired behaviour, since that is the max score over all the children and the vectors are the same. With custom scoring, as your output confirms, the score seems to be the average over the children. I’m not sure why I get a score of 1 for the standard query and you do not.
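The averaging can be checked by hand against the two outputs posted above. The sketch below assumes, as an inference from those numbers rather than from plugin documentation, that the knn_score script scores a child as 1 / (1 + squared L2 distance), that the knn query in this plugin version scores a child as 1 / (1 + L2 distance), and that the nested query then averages over the children of document 1 (avg is the default score_mode of a nested query).

```python
import math

# Query vector from the search examples; identical to the first child of doc 1.
query = [-0.56617706, -1.97073141, 2.34508821, 0.76267552,
         -0.99612565, 1.83671205, -0.39932499, -2.17742888]
# Second child of doc 1.
other = [1.00222467, 0.63005195, 1.43128642, 0.20697815,
         1.34556994, 0.4318985, 0.42407732, -0.68597343]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


child_d2 = [squared_l2(query, query), squared_l2(query, other)]

# Assumed per-child scores (inferred from the posted outputs).
script_scores = [1 / (1 + d2) for d2 in child_d2]
knn_scores = [1 / (1 + math.sqrt(d2)) for d2 in child_d2]

avg_script = sum(script_scores) / len(script_scores)
avg_knn = sum(knn_scores) / len(knn_scores)
max_script = max(script_scores)

print("custom scoring, avg over children:", avg_script)
print("knn query, avg over children:", avg_knn)
print("custom scoring, max over children:", max_script)
```

Under these assumptions the two averages agree with the reported max_score values of 0.5230126 and 0.5900459 up to rounding, while taking the max over children gives 1 for the exact match.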

In any case, this is still very helpful, as the best vectors will likely still score highly even after averaging. I guess the best method would be to run the query with a high k and then do some local calculations to get the distance for each child vector and rerank, unless you have a better idea? I will need to do this anyway because I need to figure out which child is the match.

Lastly, I think nested is only needed once, since only children is a list (i.e. the structure is {children: [{'metadata': {'retrieval_vectors': {'vgg16': []}}}]}). With the mapping I gave above, the query below works. I thought I’d mention it for others interested, as it makes both the mapping and the search a bit simpler.
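A minimal sketch of that local rerank, assuming the hits carry the full _source as in the outputs above. The function name rerank_by_best_child, the vector_path argument and the toy two-dimensional hits are all hypothetical and only there to make the example self-contained; for the three-level mapping the path would be ("metadata", "retrieval_vectors", "vgg16").

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def get_path(obj, path):
    """Follow a sequence of keys into nested dicts."""
    for key in path:
        obj = obj[key]
    return obj


def rerank_by_best_child(hits, query_vector, vector_path=("metadata", "vgg16")):
    """Rank hits by their closest child and record which child that is."""
    ranked = []
    for hit in hits:
        children = hit["_source"].get("children", [])
        if not children:
            continue  # nothing to compare against
        distances = [squared_l2(get_path(child, vector_path), query_vector)
                     for child in children]
        best = min(range(len(distances)), key=distances.__getitem__)
        ranked.append({"_id": hit["_id"],
                       "child_index": best,
                       "distance": distances[best]})
    # Smallest distance first.
    ranked.sort(key=lambda entry: entry["distance"])
    return ranked


# Toy hits (made-up 2-d vectors) in the same shape as response["hits"]["hits"].
hits = [
    {"_id": "2", "_source": {"children": [
        {"metadata": {"vgg16": [5.0, 5.0]}}]}},
    {"_id": "1", "_source": {"children": [
        {"metadata": {"vgg16": [0.0, 1.0]}},
        {"metadata": {"vgg16": [3.0, 4.0]}}]}},
]
ranked = rerank_by_best_child(hits, [0.0, 1.0])
```

Each entry in ranked names the document, the index of its closest child and that child's squared distance, which covers both the reranking and the question of which child matched.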

qry = {
    "size": 10,
    "query": {
        "nested": {
            "path": "children",
            "query": {
                "knn": {
                    "children.metadata.retrieval_vectors.vgg16": {
                        "vector": query_vec,
                        "k": 10
                    }
                }
            }
        }
    }
}

I solved the issue with the scoring taking the mean. It only needs one extra argument, score_mode, on each nested query. The beginning of the query looks like:

"query": {
    "nested": {
        "score_mode": "max",
        "path": "children",
        "query": {
            "nested": {
                "score_mode": "max",
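For anyone who wants a complete body rather than a fragment, here is a sketch that combines the single-level nested knn query posted earlier with score_mode. The helper name nested_knn_query is hypothetical, the default field path is the one from the three-level mapping in this thread, and the list of allowed modes reflects the values the nested query accepts as far as I know; the sketch has not been run against a cluster.

```python
# score_mode values accepted by a nested query (avg is the default).
ALLOWED_SCORE_MODES = {"avg", "max", "min", "sum", "none"}


def nested_knn_query(query_vector, k=10, size=10, score_mode="max",
                     path="children",
                     field="children.metadata.retrieval_vectors.vgg16"):
    """Build a nested k-NN search body that scores a parent by its children."""
    if score_mode not in ALLOWED_SCORE_MODES:
        raise ValueError(f"unsupported score_mode: {score_mode!r}")
    return {
        "size": size,
        "query": {
            "nested": {
                "path": path,
                # "max" scores the parent by its best-matching child
                # instead of the average over all children.
                "score_mode": score_mode,
                "query": {
                    "knn": {
                        field: {
                            "vector": list(query_vector),
                            "k": k,
                        }
                    }
                },
            }
        },
    }


qry = nested_knn_query([0.1] * 8)
```

With the es client from the earlier example, this would be used as es.search(index=index_name, body=qry).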