Taking OpenSearch to The Next Level - From Data Search to Information Search

Greetings all,

In this post, I would like to discuss an idea for taking OpenSearch to the next level. What is our vision for OpenSearch?

As you know, OpenSearch and Elasticsearch primarily rely on a TF-IDF-based approach (BM25 by default) to find documents containing certain words or phrases. This is fast and efficient, but it is no longer enough to meet customer demands and expectations. It locates data but may fail to find information. For example, if I search for “Who is Arya Stark’s Father?” in OpenSearch, it is unlikely to return “Ned Stark”.
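To make the gap concrete, here is a minimal Python sketch of plain TF-IDF scoring. It is not Lucene's actual scoring function, and the two documents, the crude tokenizer, and the function names are all made up for illustration. It only shows that a document repeating the query words can outrank the one that actually holds the answer.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude analyzer (an assumption): lowercase, keep alphabetic runs only.
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()


def tf_idf_scores(query, docs):
    """Score each document by summing tf * idf over the distinct query terms."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency per term
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term]:  # skip terms that occur in no document
                score += tf[term] * math.log(1 + n / df[term])
        scores.append(score)
    return scores


# Hypothetical documents: the first echoes the query words, the second has the answer.
docs = [
    "Who is the father of the bride? Arya Stark is a guest.",
    "Ned is the dad of Arya.",
]
scores = tf_idf_scores("Who is Arya Stark's father?", docs)
print(scores)  # the first document scores higher than the second
```

The second document never uses the word "father", so a purely lexical scorer has no way to prefer it.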

Users today would like to type natural-language questions and get a good-enough answer that takes semantics and context into account.

I am currently exploring options for introducing deep NLP functionality and language models into an ODFE environment in order to have seamless semantic search within the ODFE APIs.

Yes, there is a clear challenge in maintaining a Python environment running side by side with ODFE. It may not be the only way, but I see it as a practically viable one. (Note that Python has become a default component of major database technologies, so why not OpenSearch?) Java may have libraries and frameworks for deep NLP, but they are not as rich as Python’s, and most of the time they only wrap Python libraries.

ODFE has k-NN, but this is not enough. What is needed is an API that receives text (not vectors) and returns human-understandable information (text, documents, analytical information, etc.). I believe that the transformations (text2vec, etc.) can be done internally within the OpenSearch APIs.
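As a rough picture of what "text in, text out" could look like, here is a small self-contained Python sketch. Everything in it is hypothetical: `TextSearchIndex` is not an OpenSearch or ODFE API, and `embed` is only a placeholder (a hashed bag of words) standing in for a real language model. The point is just that the caller never sees a vector.

```python
import math
import zlib

DIM = 4096  # size of the toy embedding space (an arbitrary choice)


def embed(text):
    """Placeholder text2vec: hashed bag of words, L2-normalised.

    A real deployment would call a language model here instead.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class TextSearchIndex:
    """Hypothetical wrapper: accepts text, hides vectors, returns text."""

    def __init__(self):
        self._docs = []  # list of (text, vector) pairs

    def add(self, text):
        self._docs.append((text, embed(text)))

    def search(self, query, k=1):
        q = embed(query)
        # Brute-force nearest neighbours by cosine similarity
        # (vectors are already unit length, so a dot product suffices).
        scored = [
            (sum(a * b for a, b in zip(q, vec)), text)
            for text, vec in self._docs
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


index = TextSearchIndex()
index.add("Ned Stark is the father of Arya Stark")
index.add("Dragons breathe fire")
print(index.search("father of arya"))
```

Swapping `embed` for a real model and the brute-force loop for the k-NN index would keep the same outward shape.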

I have also found Haystack (GitHub: deepset-ai/haystack), an end-to-end Python framework for building natural language search interfaces to data. It leverages Transformers and supports DPR, Elasticsearch, Hugging Face’s Hub, and more. It is an Apache 2.0 project that offers what I mentioned above and more: NLP semantic search over several document stores, including ODFE (search for OpenDistroElasticsearchDocumentStore at https://haystack.deepset.ai/docs/documentstoremd). In fact, I asked “Who is Arya Stark’s Father?” in Haystack and, to my surprise, got “Ned Stark” and “Eddard Stark” as answers. (Those who watched Game of Thrones will know these are correct.)

I think that Haystack can become part of the OpenSearch platform with a bit of integration effort from our side. Having such a thing would be a game changer for OpenSearch.

Using Haystack as the main platform with ODFE/OpenSearch as a document store is not ideal, in my opinion, because the Haystack interface would limit the functionality of OpenSearch. The reverse (OpenSearch as the main platform, with Haystack as a subset of its functionality) is a better approach. This way we get all the enterprise features of OpenSearch while exposing Haystack’s NLP functionality as APIs or query options within the OpenSearch APIs.

I appreciate your thoughts here.

Regards,
Hasan

3 Likes

I wonder how the demo we saw today at the meetup featuring YANG DB would fare with this?
@asfoorial were you at the meetup on June 1st?

@amitai

Unfortunately I missed it! Something went wrong with my calendar and it failed to remind me. I was looking forward to it, so I would appreciate it if you could share any related material with me.

I would also like to ask whether there are any performance benchmarks, or even feature comparisons, between Yang DB and other projects like JanusGraph.

Thanks,
Hasan

@liorp - since you did the Yang DB presentation at the meetup I am sure you could answer @asfoorial’s questions best:)

Hi Amitai
I would be happy to elaborate a bit more regarding this project…
You can also view the project on GitHub

The main concept (at a super-high-level overview) is to make it possible to view the data stored in OpenSearch as a knowledge graph with its own ontology schema and a graph (Cypher) query language.

It differs from JanusGraph, for example, in that it is built as a graph on top of OpenSearch, which lets it use the existing indexes as domain entities with a logical model, whereas other graph DBs may use OpenSearch only as a secondary index location…

Please see the attached presentation I created for the recent meeting:

Also review the tutorial references in the git repo:

Please send me your thoughts
Lior

3 Likes

This is great and will be key to taking OpenSearch to the next level.

If I understood YangDB correctly, it assumes that each index is an entity in a graph and then enables queries across multiple entities (indices). However, in the case of unstructured data, entities are hidden inside text and are not extracted by default. One will need to use NER models and part-of-speech analysis to identify business entities and their relationships. Once those are ready, YangDB can be used to create a graph.
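To show the kind of step I mean, here is a toy Python sketch that turns sentences into graph edges. It is only a stand-in: two hand-written regular expressions take the place of a real NER and relation-extraction model, the relation labels are invented, and the output is a plain dictionary rather than anything YangDB actually ingests.

```python
import re

# A "name" here is one or more capitalised words (a crude assumption,
# standing in for what a trained NER model would detect).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical relation patterns; a real system would learn these.
RELATION_PATTERNS = {
    "FATHER_OF": re.compile(rf"({NAME}) is the father of ({NAME})"),
    "OWNS": re.compile(rf"({NAME}) owns ({NAME})"),
}


def extract_edges(text):
    """Return (source, label, target) triples found in the text."""
    edges = []
    for label, pattern in RELATION_PATTERNS.items():
        for source, target in pattern.findall(text):
            edges.append((source, label, target))
    return edges


def build_graph(texts):
    """Build an adjacency map: entity -> list of (label, target)."""
    graph = {}
    for text in texts:
        for source, label, target in extract_edges(text):
            graph.setdefault(source, []).append((label, target))
            graph.setdefault(target, [])  # make sure the target is a node too
    return graph


graph = build_graph(
    ["Ned Stark is the father of Arya Stark. Arya Stark owns Needle."]
)
print(graph)
```

Once the entities and edges exist in this form, loading them into a graph layer is the easier part; the extraction itself is where the language model is needed.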

So the intelligence needed to run NER and POS analysis and then generate the graph is also a missing piece of our puzzle.

In addition, throwing natural language queries at OpenSearch (or Elasticsearch) would not produce the desired result, due to the lack of a natural language understanding component.

If I can summarize, the components needed to take OpenSearch to the next level are:

1. Graph engine. I am happy that YangDB covers this component.

2. Capabilities for generating a knowledge graph from unstructured data. YangDB does not have this. Please correct me if I am wrong.

3. Natural language understanding capability. This requires a pre-trained or user-trained language model (such as BERT) integrated seamlessly with the OpenSearch k-NN feature (passing text in and getting text back). One option we have here is to leverage Haystack, which already integrates with Open Distro and covers a wide range of NLP/NLU tasks.

Regards,
Hasan

1 Like

Hi asfoorial
Thanks for your comments. One correction to the assumption: YangDB decouples the logical entity schema from the physical index structures:

  • Allowing a single index to contain multiple entities / edges
  • Allowing embedded documents to represent relationships / entities
  • Allowing time-based partitioned indexes to represent the same relationship

All these capabilities make it possible to evolve the logical schema without re-indexing the data, and therefore provide the dynamic “unstructured” processing feature you wanted.
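The separation between logical schema and physical indexes can be pictured with a small Python sketch. This is not YangDB's actual schema format or API: the `PhysicalMapping` class, the index names, and the `type` field are all assumptions made for illustration. Only the shape of the returned query body (a `bool` query with a `term` filter) is standard OpenSearch query DSL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalMapping:
    index_pattern: str  # physical index, or a wildcard over partitions
    type_field: str     # field that tells entity types apart
    type_value: str     # value of that field for this entity type


# Hypothetical logical schema: two entity types share one index,
# and one relationship is spread over time-partitioned indexes.
LOGICAL_SCHEMA = {
    "Person": PhysicalMapping("got-entities", "type", "person"),
    "Dragon": PhysicalMapping("got-entities", "type", "dragon"),
    "Own": PhysicalMapping("got-relations-*", "type", "own"),
}


def resolve(entity, schema=LOGICAL_SCHEMA):
    """Translate a logical entity name into (index pattern, query body)."""
    mapping = schema[entity]
    body = {
        "query": {
            "bool": {
                "filter": [{"term": {mapping.type_field: mapping.type_value}}]
            }
        }
    }
    return mapping.index_pattern, body


print(resolve("Person"))
print(resolve("Own"))
```

Changing an entry in `LOGICAL_SCHEMA` changes how a logical entity is resolved without touching the stored documents, which is the property the list above describes.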

Regarding natural language processing, this is hard to solve because of the different contexts a query might carry. This is why an ontology layer matters: it allows the creation of semantic patterns that largely resemble natural language, for example:

  MATCH (p:Person {gender: FEMALE})-[o:Own]-(d:Dragon {color: BLACK}),
        (p:Person)-[oh:Own]-(h:Horse),
        (d:Dragon)-[f:Fire]-(other:Dragon {gender: MALE}),
        (h:Horse)-[org:OriginatedIn]->(k:Kingdom)
  WHERE k.funds > 0
  RETURN *


1 Like