OpenSearch for text similarity searching

Hello everybody,

I am new here, nice to meet you! :slight_smile:

I am looking for advice on using OpenSearch with my own data, which is different from the sample data provided after setup. Imagine this data structure:

text, feature vector, file

text: a single sentence
feature vector: a numpy vector (computed via sense2vec, GitHub: explosion/sense2vec, contextually-keyed word vectors)
file: link to the file that contains the text

example: “Hello I am a sentence about OpenSearch.”, [0.23455, 0.644, 0.0, 0.3446, … , 0.1395], “/path/to/file.pdf”

The workflow I am imagining is like this:

1. search for a string (can be anything, might not be in the database)
2. receive a list of the most similar sentences (based on feature-vector distance)
3. open the file associated with the closest result

Do you have a recommendation for how I could go about this? Is OpenSearch the right tool for it? I am very early in my research on how to make use of OpenSearch, so in case I am missing something obvious, please forgive me.

Thank you already for any tips & leads

Marcel

Hi @schwittlick,

There are three methods described in the link below for achieving what you want using the kNN feature of OpenSearch. However, you need to create your index such that the content is stored as a knn_vector field. The calling application should then convert the search query into a vector and pass it to the kNN query. In other words, OpenSearch provides vector similarity search, but how you use it is up to you.
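To make that concrete, here is a sketch of the index mapping and query bodies for the k-NN feature (index and field names such as `my_sentences` and `feature_vector` are just examples; the dimension must match your vectors):

```python
# Sketch of the request bodies for the OpenSearch k-NN plugin.
# Field/index names are illustrative, not prescribed.

DIM = 128  # must match the length of your sense2vec vectors

def knn_index_body(dim=DIM):
    """Index settings/mappings: store the feature vector as a knn_vector."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "file": {"type": "keyword"},
                "feature_vector": {"type": "knn_vector", "dimension": dim},
            }
        },
    }

def knn_query_body(vector, k=10):
    """Query: the calling application supplies the precomputed query vector."""
    return {
        "size": k,
        "query": {"knn": {"feature_vector": {"vector": vector, "k": k}}},
    }

# With opensearch-py these would be passed along roughly as:
#   client.indices.create(index="my_sentences", body=knn_index_body())
#   client.search(index="my_sentences", body=knn_query_body(query_vec))
```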

Regards

To echo @asfoorial, k-nn is likely the way to go and it doesn’t need to involve preprocessing with numpy.

Thanks @asfoorial

The k-nn plugin seems to be exactly what I am looking for. I can’t seem to find info on how the plugin turns a piece of text into a vector; you mentioned it could replace my pre-processing via numpy, @searchymcsearchface.

My original approach was for the calling application to pass the feature vector and let the k-nn plugin’s approximate search do the heavy lifting. That way I can tweak the feature-vector extraction with python/spacy.
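For the calling-application side, a small sketch of that hand-off (the `embed` callable stands in for whatever spacy/sense2vec extraction you tweak; the spacy model named in the comment is an example):

```python
import numpy as np

def query_vector(text, embed):
    """The calling application turns the query string into a feature vector
    before querying OpenSearch; `embed` is whatever extractor you tweak, e.g.
    with spacy:  embed = lambda t: nlp(t).vector  (nlp = spacy.load("en_core_web_md")).
    Normalizing keeps cosine/inner-product scoring consistent across queries."""
    vec = np.asarray(embed(text), dtype="float32")
    norm = float(np.linalg.norm(vec))
    return vec / norm if norm else vec

# Stand-in embedder for illustration only; replace with spacy/sense2vec.
dummy_embed = lambda text: [float(len(text)), 1.0, 0.0]
v = query_vector("hello", dummy_embed)
```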

What I was secretly hoping for is an example app that does k-nn-based text recommendation using OpenSearch. It must be a basic use case, or am I mistaken?

Thank you :four_leaf_clover:

@schwittlick You may want to take a look at this post.

Thank you @asfoorial. Haystack seems to be exactly what I am looking for. How have I missed this!

Re your initiative to extend OpenSearch with Haystack, I also think it’d be a milestone, but I’d have to run some benchmarks on pure-Python Haystack first. Will keep you updated here. Let me know how I can help; I do have some experience with embedding Python environments in C++, but not in Java.

I think the way to approach it is to have a Python environment running side by side with OpenSearch. We will also need to develop APIs as follows:

  1. Haystack/Python API
    This API receives text (and optionally an OpenSearch query), performs semantic search, and returns OpenSearch doc ids and sentence highlights (the highlights in this case would be the matching sentences).

Of course, the whole Haystack environment would use OpenSearch as the backend for data indexing.

The challenge here is how to integrate OpenSearch security with Haystack such that users cannot call the Haystack APIs directly, but only through the OpenSearch authentication/SSL layer. Perhaps we could host Python inside the OpenSearch nodes and let OpenSearch communicate with it locally. Then we would need to worry about resource sharing between the two!

  2. OpenSearch plugin
    This can be a REST API plugin that receives two parameters: a free-text query and an OpenSearch query to perform initial filtering. These two can then be passed directly to the Haystack API from the first point to perform the magic.
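To pin down the contract of the Haystack-side API, a rough sketch in Python (function and field names are made up, and the string-similarity ranking is just a stand-in for a real Haystack retriever/reader):

```python
from difflib import SequenceMatcher

def _similarity(a, b):
    """Crude lexical similarity; a placeholder for semantic scoring."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_sentence(query, text):
    """Pick the sentence of `text` closest to the query (stand-in ranking;
    the real API would use a Haystack retriever/reader here)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return max(sentences, key=lambda s: _similarity(query, s))

def semantic_search(query, docs, top_k=5):
    """Proposed API contract: free text in, OpenSearch doc ids plus sentence
    highlights out. `docs` maps doc id -> text and would come from an initial
    (optional) OpenSearch filter query."""
    hits = [(doc_id, best_sentence(query, text)) for doc_id, text in docs.items()]
    hits.sort(key=lambda h: _similarity(query, h[1]), reverse=True)
    return [{"doc_id": d, "highlight": h} for d, h in hits[:top_k]]
```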

Let me know your thoughts ladies and gentlemen.

Regards

Hey, I’m from deepset :slight_smile: It’s great to see this level of interest in Haystack! Re #2 above: we’ve recently implemented an early query classifier like that. Have a look :wink: Release v0.9.0 · deepset-ai/haystack · GitHub

Have a look at one possible approach if you’re still curious :slight_smile: See the section on “Rerouting Keyword Queries” in Save Time and Resources with the Query Classifier for Neural Search | by Andrey A. | deepset-ai | Aug, 2021 | Medium

This is great. Actually, this was one of the challenges I faced while thinking about solutions to the problem. I will give it a try at some point.

Thanks

In fact, one of the main challenges I will face when using Haystack with OpenSearch relates to data security (specifically, document-level security).

I have a document repo with an authorization layer specifying which user can access which document. Currently I use OpenSearch to index all these documents while reflecting the same authorization rules defined at the source.

Now I could use Haystack to index all those documents, but would I be able to reflect the same security rules from the source? OpenSearch is already capable of doing that, but can Haystack instruct it to do so?

Also, since BERT-based models mostly work at the sentence/paragraph level, how would I pass a complete set of multi-page documents to them?

Ah, that’s a nice one. Btw, what does your current setup for ‘reflecting the same authorization rules’ look like? There’s an option to have metadata fields and filter by them with Haystack, but I am not sure if that’ll work for you (Metadata Filtering in Haystack | deepset-ai). This would also require reindexing.

My other question would be: do you need a fully-fledged question answering (QA) system on top, just a better ‘semantic search’, or something else?

Re BERT and whole documents: you can just use the “Elastic” Retriever that we have (using BM25) if you don’t want to or can’t reindex, and then it’s the job of the reader to split documents further. This would also be slower than if you fully reindexed first with Haystack. (That’s mostly about a QA pipeline.)

If you can reindex, then you can use the preprocessor to split the documents into ‘passages’, which will speed things up for the reader. Again, if you’d like to leverage the metadata filtering, that would require reindexing too.

I hope that answers your questions above :slight_smile: Happy to elaborate further here, or you’re also welcome to chat in the Haystack community channel when the time is right.

I am using the OpenSearch security plugin, which enables defining security roles that can be applied at the index level or document level. In simple words, I have an index my_docs that has content, accessors_names and other metadata as fields. I also have a role with document-level security that checks whether the calling user is listed in the accessors_names field of a particular document.
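For context, such a role can be sketched via the security plugin’s REST API roughly like this (the role name is an example; the DLS query is passed as a JSON string, and `${user.name}` is substituted with the calling user by the security plugin):

```
PUT _plugins/_security/api/roles/docs_reader
{
  "index_permissions": [{
    "index_patterns": ["my_docs"],
    "allowed_actions": ["read"],
    "dls": "{\"term\": {\"accessors_names\": \"${user.name}\"}}"
  }]
}
```

With this in place, any search against my_docs only ever returns documents whose accessors_names field lists the authenticated user.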

I want to extend my current setup with QA and semantic search capabilities. I am open to rebuilding the indices.

The questions:

  1. What would the answer be? Is it extractive or generative? That is, does Haystack return documents, or generate sentences that could come from multiple documents?

  2. How would 1 work when it comes to the security described above? Does Haystack interact with OpenSearch security in any way? (It seems to me that is a must.)

  3. Is it possible to let OpenSearch call Haystack instead? So OpenSearch does the initial keyword BM25 search and then passes the results to Haystack to execute QA on the resulting documents (I expect them to be 10-100 max). This way we ensure that security is already maintained by OpenSearch, because we know that all the resulting docs are accessible by the calling user.

That is why I mentioned earlier in this thread and in other threads that Haystack could become an internal/external component that OpenSearch can leverage.

I am also not sure if Haystack has any data authorization features at this point since it does not store data by itself.

Thanks for the above, that’s thought-provoking :slight_smile: To answer your questions:

  1. It could be either; pipelines in Haystack are very flexible, and usually the simplest one would be extractive QA. But it could also be something like this: Ask Wikipedia ELI5-like Questions Using Long-Form Question Answering on Haystack | by Vladimir Blagojevic | Towards Data Science

  2. Currently we don’t use any of it, no. But that’s a great feature request, so thanks :slight_smile:

  3. I do not think this is currently possible.

Re data authorization features: not really, aside from maybe leveraging the metadata and a basic separation by index. Something for us to look more into :slight_smile: We also try to be a bit flexible with the document stores, but I can totally see how leveraging certain advanced aspects of a particular docstore might be beneficial.

(We’ve also just [finally] added support for ANN in OpenSearch :slight_smile: as well as new components for the pipelines. Release v0.10.0 · deepset-ai/haystack · GitHub)

Great! Thanks for the update. I will give it a try.

I also need to find a proper way to make the two work together in an enterprise setup, where all security, authentication and authorization measures are met while maintaining the expected user experience for both keyword search and semantic search.