OpenSearch for text similarity searching

Hello everybody,

I am new here, nice to meet you! 🙂

I am looking for advice on using OpenSearch with my own data, which is different from the sample data provided after setup. Imagine this data structure:

text, feature vector, file

text: a single sentence
feature vector: numpy vector (computed via explosion/sense2vec, contextually-keyed word vectors)
file: link to the file that contains the text

example: “Hello I am a sentence about OpenSearch.”, [0.23455, 0.644, 0.0, 0.3446, … , 0.1395], “/path/to/file.pdf”

The workflow I have in mind looks like this:

1. search for a string (can be anything, might not be in the database)
2. receive a list of the most similar sentences (based on feature vector distance)
3. open the associated file of the closest result

Do you have a recommendation for how I could go about this? Is OpenSearch the right tool for this? I am very early in my research into how to make use of OpenSearch, so please forgive me if I am missing something obvious.

Thank you already for any tips & leads

Marcel


Hi @schwittlick,

The link below describes three methods that do roughly what you want using the k-NN feature of OpenSearch. However, you need to create your index so that the file content is stored as a k-NN vector. The calling application should then convert the search query into a vector and pass it to the k-NN query. In other words, OpenSearch provides the vector similarity search, but producing the vectors is up to you.
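To make that a bit more concrete, here is a minimal sketch of the request bodies involved, written as plain Python dicts. The field names (`text`, `feature_vector`, `file`) are my assumptions, taken from the data structure in the original post, and the helper function names are made up for illustration. The `knn_vector` field type, the `index.knn` setting and the `knn` query clause come from the k-NN plugin.

```python
def build_index_body(dimension: int) -> dict:
    """Index settings and mapping with one knn_vector field per sentence.

    Field names are assumptions based on the (text, feature vector, file)
    layout described in the question.
    """
    return {
        "settings": {"index": {"knn": True}},  # enable k-NN on this index
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "feature_vector": {"type": "knn_vector", "dimension": dimension},
                "file": {"type": "keyword"},
            }
        },
    }


def build_knn_query(vector, k: int = 5) -> dict:
    """Approximate k-NN query: the k nearest neighbours of `vector`."""
    return {
        "size": k,
        "query": {
            "knn": {
                "feature_vector": {"vector": [float(x) for x in vector], "k": k}
            }
        },
    }
```

The length of the query vector has to match the `dimension` declared in the mapping, so the same embedding model must be used at index time and at query time.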

Regards


To echo @asfoorial, k-nn is likely the way to go and it doesn’t need to involve preprocessing with numpy.

Thanks @asfoorial

The k-nn plugin seems to be exactly what I am looking for. I can’t seem to find info on how the plugin turns a piece of text into a vector, though. @searchymcsearchface, you mention it could replace my pre-processing via numpy?

My original approach was for the calling application to pass the feature vector and let the k-nn plugin’s approximate search do the heavy lifting. That way I can tweak the feature vector extraction with python/spacy.
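Roughly, I picture the calling side like the sketch below. Everything here is an assumption on my part: `embed` stands in for whatever python/spacy code produces the vector, `search` stands in for the call that sends the body to OpenSearch, and the field names `feature_vector`, `text` and `file` are just the ones from my data structure. The response is parsed according to the usual `hits.hits[]._source` layout of a search response.

```python
from typing import Callable, List, Sequence, Tuple


def search_similar(
    text: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[dict], dict],
    k: int = 5,
) -> List[Tuple[float, str, str]]:
    """Embed `text` client-side, run a k-NN query, return (score, text, file)."""
    vector = [float(x) for x in embed(text)]  # numpy floats -> plain floats
    body = {
        "size": k,
        "query": {"knn": {"feature_vector": {"vector": vector, "k": k}}},
    }
    response = search(body)
    return [
        (hit["_score"], hit["_source"]["text"], hit["_source"]["file"])
        for hit in response["hits"]["hits"]
    ]


def closest_file(results: List[Tuple[float, str, str]]) -> str:
    """File path of the best-scoring hit (hits come back sorted by score)."""
    return results[0][2]
```

With the opensearch-py client, `search` could be something like `lambda body: client.search(index="sentences", body=body)`, where the index name is hypothetical.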

What I was secretly hoping for is an example app that does k-nn based text recommendation using OpenSearch. It must be a basic use case, or am I mistaken?

Thank you 🍀

@schwittlick You may want to take a look at this post.


Thank you @asfoorial. Haystack seems to be exactly what I am looking for. How have I missed this!!!

Re your initiative to extend OpenSearch with Haystack: I also think it’d be a milestone, but I’d have to run some benchmarks on pure-Python Haystack first. Will keep you updated here. Let me know how I can help. I do have some experience with embedding Python environments in C++, but not in Java.

I think the way to approach it is to have a separate Python environment running side by side with OpenSearch. We will also need to develop APIs as follows:

  1. Haystack/Python API
    This API receives text (and optionally an OpenSearch query), performs semantic search, and returns OpenSearch doc IDs and sentence highlights (the highlights in this case would be the matching sentences).

Of course, the whole Haystack environment is going to use OpenSearch as the backend for data indexing.

The challenge here is how to integrate OpenSearch security with Haystack so that users cannot call the Haystack APIs directly, but only through the OpenSearch authentication/SSL layer. Perhaps we can host Python inside the OpenSearch nodes and let OpenSearch communicate with it locally. Then we would need to worry about resource sharing between the two!

  2. OpenSearch plugin
    This can be a REST API plugin that receives two parameters: a free-text query and an OpenSearch query to perform initial filtering. These two can then be passed directly to the Haystack API from the first point to perform the magic.
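To give an idea of the contract I have in mind for the Python side, here is a rough sketch. None of this is Haystack's actual API: `retrieve` is a hypothetical callable standing in for whatever does the semantic search, and `SemanticHit` and `semantic_search` are names I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical retriever: (text, filter_query, top_k) -> [(doc_id, sentence), ...]
Retriever = Callable[[str, Optional[dict], int], List[Tuple[str, str]]]


@dataclass
class SemanticHit:
    doc_id: str  # OpenSearch document id
    highlights: List[str] = field(default_factory=list)  # matching sentences


def semantic_search(
    text: str,
    retrieve: Retriever,
    filter_query: Optional[dict] = None,
    top_k: int = 10,
) -> List[SemanticHit]:
    """Group the retriever's matching sentences by OpenSearch doc id."""
    hits: Dict[str, SemanticHit] = {}
    for doc_id, sentence in retrieve(text, filter_query, top_k):
        hits.setdefault(doc_id, SemanticHit(doc_id)).highlights.append(sentence)
    return list(hits.values())  # dicts keep insertion order, so ranking is kept
```

The OpenSearch plugin would then only need to forward its two parameters (the free-text query and the filter query) to this function and return the doc IDs and highlights to the caller.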

Let me know your thoughts ladies and gentlemen.

Regards
