In this post, I would like to discuss an idea aimed at taking OpenSearch to the next level. What is our vision for OpenSearch?
As you know, OpenSearch and Elasticsearch primarily rely on TF-IDF-style lexical scoring (BM25 by default) to find documents containing certain words or phrases. This is fast and efficient, but it is no longer enough to meet customer demands and expectations. It locates data but may fail to find information. For example, if I search for “Who is Arya Stark’s Father?” in OpenSearch, it is unlikely to return “Ned Stark” as an answer.
Users today would like to type natural-language questions and get a good enough answer that takes semantics and context into account.
I am currently exploring options to introduce deep NLP functionality and language models into an ODFE environment in order to provide seamless semantic search through the ODFE APIs.
Yes, there is a clear challenge in maintaining a Python environment running side by side with ODFE. It may not be the only way, but I see it as a practically viable one. (Note that Python has become a default component within major database technologies, so why not OpenSearch?) Java may have libraries and frameworks for deep NLP, but they are not as rich as Python’s, and most of the time they are only Python wrappers.
ODFE has k-NN, but this is not enough. What is needed is an API that receives text (not vectors) and returns human-understandable information (text, documents, analytical information, etc.). I believe that the transformations (text2vec, etc.) can be done internally within the OpenSearch APIs.
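To make the "text in, not vectors" idea concrete, here is a minimal sketch of what such an internal transformation could look like. It turns a question into a vector and wraps it in the `knn` query body used by the k-NN plugin. Everything named here is an assumption for illustration: `toy_embed` is a hash-based placeholder and not a language model, and `build_knn_query` and the `embedding` field name are hypothetical. A real integration would swap in an actual sentence-embedding model and send the resulting body to the search endpoint.

```python
import math
import zlib
from typing import Any, Callable, Dict, List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Placeholder text2vec step (assumption: stands in for a real model).

    Hashes each whitespace token into one of `dim` buckets and
    L2-normalises the counts. It captures no semantics at all.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_knn_query(
    text: str,
    embed: Callable[[str], List[float]],
    field: str = "embedding",  # hypothetical knn_vector field name
    k: int = 3,
) -> Dict[str, Any]:
    """Accept plain text and return a k-NN plugin style query body."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": embed(text), "k": k}}},
    }


# The caller only ever deals with text; the vector stays internal.
query = build_knn_query("Who is Arya Stark's father?", toy_embed)
```

The point of the design is that the embedding function is passed in, so the same wrapper would work whether the model runs in a Python sidecar or somewhere inside the cluster.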
Also, I have found Haystack (GitHub: deepset-ai/haystack), an end-to-end Python framework for building natural language search interfaces to data. It leverages Transformers and state-of-the-art NLP, and supports DPR, Elasticsearch, Hugging Face’s Hub, and more. It is an Apache 2.0 project that offers what I mentioned above and more: NLP semantic search over several document stores, including ODFE (search for OpenDistroElasticsearchDocumentStore at https://haystack.deepset.ai/docs/documentstoremd). In fact, I asked “Who is Arya Stark’s Father?” in Haystack and, to my surprise, got “Ned Stark” and “Eddard Stark” as answers. (Those who watched Game of Thrones will know these are correct.)
I think that Haystack can become part of the OpenSearch platform with a bit of integration effort from our side. Having such a thing would be a game changer for OpenSearch.
Using Haystack as the main platform with ODFE/OpenSearch as a document store is not an ideal approach, in my opinion, because the Haystack interface would limit the functionality of OpenSearch. The reverse (OpenSearch as the main platform, with Haystack providing a subset of its functionality) is a better approach. This way we get all the enterprise features of OpenSearch while exposing Haystack’s NLP functionality as APIs or query options within the OpenSearch APIs.
I would appreciate your thoughts here.