Taking OpenSearch to The Next Level - From Data Search to Information Search

Greetings all,

In this post, I would like to discuss an idea for taking OpenSearch to the next level. What is our vision for OpenSearch?

As you know, OpenSearch and Elasticsearch primarily rely on TF-IDF-style term scoring (BM25 by default) to find documents containing certain words or phrases. This is fast and efficient, but it is no longer enough to meet customer demands and expectations. It locates data but may fail to find information. For example, if I search for “Who is Arya Stark’s Father?” in OpenSearch, it is unlikely to return “Ned Stark”.
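To make the limitation concrete, here is a minimal sketch of term-based scoring in plain Python. It is not how OpenSearch computes relevance internally; the `tokenize` and `tfidf_scores` helpers and the two sample documents are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_scores(query, docs):
    """Score each document by summing tf * idf over the query terms it contains."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum(tf[t] * math.log(1 + n / df[t]) for t in query_terms if t in tf)
        scores.append(score)
    return scores


# Hypothetical sample documents.
docs = [
    "Who is the father of modern physics?",
    "Eddard Stark, Lord of Winterfell, raised Arya and her siblings.",
]
scores = tfidf_scores("Who is Arya Stark's father?", docs)
# The physics trivia line outranks the line that actually holds the answer,
# because it shares more literal words ("who", "is", "father") with the query.
print(scores)
```

The second document matches only "arya" and "stark", so word overlap alone ranks it lower even though it is the one a human would pick.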

Users today would like to type natural questions and get a good-enough answer that takes semantics and context into account.

I am currently exploring options for introducing deep NLP functionality and language models into an ODFE environment in order to have seamless semantic search within the ODFE APIs.

Yes, there is a clear challenge in maintaining a Python environment running side by side with ODFE. It may not be the only way, but I see it as a practically viable one. (Note that Python has become a default component of major database technologies, so why not OpenSearch?) Java may have libraries and frameworks for deep NLP, but its ecosystem is not as rich as Python’s, and most of the time they are only wrappers around Python.

ODFE has kNN, but this is not enough. What is needed is an API that receives text (not vectors) and returns human-understandable information (text, documents, analytical information, etc.). I believe that the transformations (text2vec and so on) can be done internally within the OpenSearch APIs.
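A rough sketch of what "text in, text out" could look like, with the text-to-vector step hidden behind the API. Everything here is hypothetical: `TextInTextOutSearch` is not an OpenSearch or ODFE class, and `toy_embed` is a hashed bag-of-words placeholder standing in for a real language model, so it carries no actual semantics.

```python
import math
import zlib


def toy_embed(text, dim=64):
    """Placeholder for a real language model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class TextInTextOutSearch:
    """Hypothetical wrapper: callers pass text and get text back, never vectors."""

    def __init__(self, embed=toy_embed):
        self.embed = embed      # swap in a real model here
        self.docs = []
        self.vectors = []

    def index(self, text):
        # The text2vec transformation happens inside the API.
        self.docs.append(text)
        self.vectors.append(self.embed(text))

    def search(self, query, k=1):
        q = self.embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc)
            for vec, doc in zip(self.vectors, self.docs)
        ]
        # Vectors are normalised, so the dot product is the cosine similarity.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:k]]


engine = TextInTextOutSearch()
engine.index("ned stark is the father of arya")
engine.index("dragons breathe fire")
print(engine.search("ned stark is the father of arya"))
```

The point of the design is that `embed` is injectable: the caller-facing contract stays text-only while the model behind it can change.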

I have also found Haystack (GitHub: deepset-ai/haystack), an end-to-end Python framework for building natural language search interfaces to data. It leverages Transformers and state-of-the-art NLP, and supports DPR, Elasticsearch, Hugging Face’s Hub, and much more. It is an Apache 2.0 project that offers what I mentioned above and more: NLP semantic search over several document stores, including ODFE (search for OpenDistroElasticsearchDocumentStore in this link: https://haystack.deepset.ai/docs/documentstoremd). In fact, I asked “Who is Arya Stark’s Father?” in Haystack and, surprisingly, got “Ned Stark” and “Eddard Stark” as answers. (Those who watched Game of Thrones will know that these are correct.)

I think that Haystack can become part of the OpenSearch platform with a bit of integration effort from our side. Having such a thing would be a game changer for OpenSearch.

Using Haystack as the main platform with ODFE/OpenSearch as a document store is not an ideal approach, in my opinion, because the Haystack interface would limit the functionality of OpenSearch. The reverse (OpenSearch as the main platform, with Haystack as a subset of its functionality) is a better approach. This way we get all the enterprise features of OpenSearch while exposing Haystack’s NLP functionality as APIs or query options within the OpenSearch APIs.

I appreciate your thoughts here.

Regards,
Hasan

3 Likes

I wonder how the demo we saw today at the meetup featuring YANG DB would fare with this?
@asfoorial were you at the meetup on June 1st?

@amitai

Unfortunately I missed it! Something went wrong with my calendar and it failed to remind me. I was looking forward to it, so I would appreciate it if you could share any related material with me.

Also I want to ask if there is any performance benchmark, or even feature comparison, between Yang DB and other projects like JanusGraph.

Thanks,
Hasan

@liorp - since you did the Yang DB presentation at the meetup, I am sure you could answer @asfoorial’s questions best :)

Hi Amitai
I would be happy to elaborate a bit more regarding this project…
You can also view the project on GitHub

The main concept (at a very high level) is to make it possible to view the data stored in OpenSearch as a knowledge graph with its own ontology schema and a graph (Cypher) query language

It differs from JanusGraph, for example, in that it is built as a graph on top of OpenSearch, which lets it use existing indexes as domain entities with a logical model, while other graph DBs may use OpenSearch only as a secondary index location…

Please see the attached presentation I’ve created for the recent meeting:

Also review the tutorial references in the git repo:

Please send me your thoughts
Lior

3 Likes

This is great and will be key to taking OpenSearch to the next level.

If I understood YangDB correctly, it assumes that each index is an entity in a graph, and it then enables queries across multiple entities (indices). However, in the case of unstructured data, entities are hidden inside text and are not extracted by default. One would need to use NER models and part-of-speech analysis to identify business entities and their relationships. Once those are extracted, YangDB can be used to create a graph.

So the intelligence part, facilitating NER and POS analysis and then generating the graph, is also a missing piece of our puzzle.

In addition, throwing natural language queries against OpenSearch, and also Elasticsearch, would not produce the desired result due to the lack of a Natural Language Understanding component.

If I can summarize, the components needed to take OpenSearch to the next level are:

  1. Graph engine. I am happy that YangDB is covering this component.

  2. Capabilities to facilitate generating a knowledge graph from unstructured data. YangDB does not have this. Please correct me if I am wrong.

  3. Natural Language Understanding capability. This requires a pre-trained or user-trained language model (such as BERT) to be integrated seamlessly with the OpenSearch kNN feature (seamlessly passing text in and returning text). One option we have here is to leverage Haystack, which already integrates with Open Distro and also covers a wide range of NLP/NLU tasks.

Regards,
Hasan

2 Likes

Hi asfoorial
Thanks for your comments. One correction to the assumption: YangDB separates the logical entity schema from the physical index structures by:

  • Allowing a single index to contain multiple entities / edges
  • Allowing embedded documents to represent relationships / entities
  • Allowing time-based partitioned indexes to represent the same relationship

All these capabilities make it possible to evolve the logical schema without re-indexing the data, and therefore provide the dynamic “unstructured” processing feature you described.

Regarding natural language processing: this is hard to solve because of the different contexts a query might have. This is why an ontology layer is important: it allows the creation of semantic patterns that closely resemble natural language:

  Match (p:Person {gender:FEMALE})-[o:Own]-(d:Dragon {color :BLACK}),
            (p:Person)-[oh:Own]-(h:Horse),
            (d:Dragon)-[f:Fire]-(other:Dragon { gender:MALE}),
            (h:Horse)-[org:OriginatedIn]->(k:Kingdom )
     where k.funds > 0  return *

1 Like

Hey, some very interesting ideas above :wink: We’d be happy to chat if this goes further. We also have an upcoming article on an OpenSearch/Haystack use case.

1 Like

@aantti I am glad to hear from the Haystack team here in the OpenSearch community.

OpenSearch offers very robust enterprise features in addition to rich search APIs. Haystack, on the other hand, offers powerful semantic search features (which OpenSearch lacks) but hides many of the other OpenSearch APIs and security features. The two systems solve different use cases, with some overlap. I think that we can combine both environments to produce a very powerful search platform that covers a wider scope with enterprise concerns in mind.

One way to tackle it is to embed Haystack inside OpenSearch and extend its API to expose Haystack semantic search capabilities.

In fact, we may not need a backend for Haystack in this setup, since OpenSearch would do the filtering in the first stage and then call Haystack to perform semantic search over the filtered subset.
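The two-stage flow could be sketched roughly as follows. This is only an illustration of the idea: `filtered_semantic_search` and `overlap_scorer` are hypothetical names, the first stage stands in for an OpenSearch filter query, and the scorer is a crude word-overlap placeholder where a Haystack model would be called.

```python
def filtered_semantic_search(docs, filters, query, scorer, k=3):
    """Stage 1: exact-match metadata filter. Stage 2: re-rank survivors with `scorer`."""
    candidates = [
        d for d in docs
        if all(d["meta"].get(field) == value for field, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda d: scorer(query, d["text"]), reverse=True)
    return ranked[:k]


def overlap_scorer(query, text):
    """Hypothetical stand-in for a semantic model: share of query words found in text."""
    q = set(query.lower().split())
    if not q:
        return 0.0
    return len(q & set(text.lower().split())) / len(q)


# Hypothetical documents with a metadata field to filter on.
docs = [
    {"text": "Ned Stark is the father of Arya Stark", "meta": {"series": "got"}},
    {"text": "Daenerys owns three dragons", "meta": {"series": "got"}},
    {"text": "The father of Luke is Darth Vader", "meta": {"series": "sw"}},
]
hits = filtered_semantic_search(docs, {"series": "got"}, "father of arya", overlap_scorer)
print([h["text"] for h in hits])
```

Only the documents that pass the filter ever reach the scorer, which is what keeps the expensive semantic step limited to a small subset.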

In this case, Haystack can be hosted on the same servers/containers as OpenSearch or on different servers. Hosting Haystack on the same server relieves us from worrying about the HTTP security layer (SSL and any other enterprise and compliance measures). However, it would compete with OpenSearch for resources.

Regards

@asfoorial We’ve discussed this internally and this really sounds quite exciting :slight_smile: What would be the best way forward to make that ‘progress in small steps’ maybe?

1 Like

A Haystack plugin for OpenSearch is one option. Another option is to extend the OpenSearch kNN plugin. This way both technologies can exist as one platform.

I’m not very familiar with the kNN plugin, but it looks more like extending OpenSearch with “vector search” capabilities, correct? I guess we’d be leaning more towards Haystack-as-a-plugin too. On a related note: we are still a very small organization, we aren’t experts in the OpenSearch code per se, and we don’t have any Java developers either… So we’d be relying a lot on the OpenSearch community. We are fully ready to help with the Haystack-related questions and parts, though. Also, do you already have an issue or something similar for the ideas above?

1 Like

That is correct, the kNN plugin provides vector similarity search. No issues have been created for these ideas yet.