Indexing Causing Index to Enter Red State

When indexing ~4000 documents (two vectors per document) with the _bulk update API into 12 different indices in Elasticsearch, a small number of indices enter a red state. Cluster resources (CPU and memory) do not seem stressed. Is there any advice on indexing vectors into multiple indices on the same cluster concurrently?
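For readers unfamiliar with the workload described above, here is a minimal sketch of how a _bulk update body spanning several indices might be assembled. The index names, document IDs, and field names are hypothetical, and the use of `doc_as_upsert` is an assumption; the original post does not show the actual request.

```python
import json


def build_bulk_update_body(docs_by_index):
    """Build an NDJSON body for the _bulk API.

    docs_by_index maps an index name to {doc_id: partial_doc}.
    Each document becomes an "update" action line followed by a payload line.
    """
    lines = []
    for index_name, docs in docs_by_index.items():
        for doc_id, partial_doc in docs.items():
            # Action line: which index and document to update.
            lines.append(json.dumps({"update": {"_index": index_name, "_id": doc_id}}))
            # Payload line: partial document; doc_as_upsert is an assumption here.
            lines.append(json.dumps({"doc": partial_doc, "doc_as_upsert": True}))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical example data: two indices, one document each.
body = build_bulk_update_body({
    "catalog_a": {"1": {"vector_1": [0.1, 0.2], "vector_2": [0.3, 0.4]}},
    "catalog_b": {"7": {"vector_1": [0.5, 0.6], "vector_2": [0.7, 0.8]}},
})
```

The resulting string would be sent to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header.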

UPDATE: This is only occurring when we try writing to indices restored from a snapshot. Everything is fine if we rebuild the indices from scratch.

Hi @dnock,

Interesting that it only occurs when writing to indices restored from a snapshot. I will try to reproduce this and get back.

Jack

Hi @dnock, could you provide the following information:

  1. your cluster configuration. How many nodes? Is there a dedicated master? What distribution are you using? (RPM, Docker, Tar, etc.)
  2. sharding strategy for the indices

Additionally, when you restore from a snapshot, do you make sure the cluster is green before beginning the indexing workload?

We’re still struggling to write to indices created from snapshots.

  1. 5 data nodes, 3 master nodes, Amazon Elasticsearch Service
  2. we have 1 shard with 2 replicas per index
  3. yes, we wait for the cluster to be green
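As an illustration of the "wait for green" step, the sketch below uses the cluster health API's `wait_for_status` and `timeout` parameters. The helper names, the base URL, and the assumption that the endpoint is reachable without request signing or authentication are all hypothetical; the thread does not show how the wait was actually implemented.

```python
import json
import urllib.parse
import urllib.request


def cluster_health_url(base_url, status="green", timeout="60s"):
    """Build a _cluster/health URL that blocks until `status` or `timeout`."""
    query = urllib.parse.urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health?{query}"


def is_ready(health):
    """True only if the health response reports green and did not time out."""
    return health.get("status") == "green" and not health.get("timed_out", False)


def wait_for_green(base_url, fetch=None):
    """Return True if the cluster reports green within the timeout.

    `fetch` takes a URL and returns the parsed JSON response; it is
    injectable so the logic can be exercised without a live cluster.
    The default uses urllib, which raises on HTTP error statuses.
    """
    if fetch is None:
        fetch = lambda url: json.load(urllib.request.urlopen(url))
    return is_ready(fetch(cluster_health_url(base_url)))
```

A caller would start the bulk indexing workload only after `wait_for_green(...)` returns True.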

Hey @dnock,

This forum is for Open Distro only. Would you mind sending me an email at bpavani@amazon.com with the details, so we can get someone from the service team to help you out?

Thanks,
Pavani

Hey @dnock, I believe this is related to: https://github.com/opendistro-for-elasticsearch/k-NN/issues/204.

I followed up with @bpavani via email. The exception we’re seeing is a bit different than the one in the specified issue. It’s a merge exception during shard allocation.

[2020-09-02T06:47:07,602][WARN ][o.e.i.c.IndicesClusterStateService] [dfe88a6a13565ff8c940025d7f324a3c] [[catalog_6_c3749655-ff83-48aa-973e-7ab1cad7d51b_fds][0]] marking and sending shard failed due to [shard failure, reason [merge failed]]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.NullPointerException
at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2310) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:760) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.1.jar:7.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.NullPointerException
AMAZON_INTERNAL
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:152) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:195) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:150) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4459) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4054) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
AMAZON_INTERNAL
AMAZON_INTERNAL
AMAZON_INTERNAL
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:152) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:195) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:150) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4459) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4054) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
Caused by: java.lang.NullPointerException
at org.apache.lucene.index.SegmentDocValuesProducer.getBinary(SegmentDocValuesProducer.java:103) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
AMAZON_INTERNAL
AMAZON_INTERNAL
AMAZON_INTERNAL
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:152) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:195) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:150) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4459) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4054) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]

I followed up with @bpavani via email. The exception we’re seeing is a bit different. It’s a merge exception on shard allocation. Based on the stack trace below, I don’t think our issues are related.

[2020-09-02T06:56:58,697][WARN ][o.e.i.c.IndicesClusterStateService] [dfe88a6a13565ff8c940025d7f324a3c] [[SOME_INDEX][0]] marking and sending shard failed due to [shard failure, reason [merge failed]]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.NullPointerException
at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2310) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:760) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.1.jar:7.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.NullPointerException
AMAZON_INTERNAL
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:152) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:195) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:150) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4459) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4054) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
AMAZON_INTERNAL
AMAZON_INTERNAL
AMAZON_INTERNAL
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:152) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:195) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:150) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4459) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4054) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
Caused by: java.lang.NullPointerException
at org.apache.lucene.index.SegmentDocValuesProducer.getBinary(SegmentDocValuesProducer.java:103) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
AMAZON_INTERNAL
AMAZON_INTERNAL
AMAZON_INTERNAL
at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:152) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:195) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:150) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4459) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4054) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]
at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.0.0.jar:8.0.0-SNAPSHOT f754f8b0b8588981b899b802b6b5b14806325d78 - akjain - 2020-02-17 14:46:56]

It generally seems like ES transiently struggles to assign shards to indices with embeddings written to them.

Hi @dnock,

We are treating this as a bug and are working on a fix. My guess is that you have two vector fields defined in the index and that some documents may not contain both fields.

Workaround:

  • Have one vector field per index
  • If you plan to keep more than one vector field, make sure all the vector fields are present in every document
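The second workaround could be enforced on the client side before documents are sent to the cluster. The following is a minimal sketch under stated assumptions: the field names `vector_1` and `vector_2` are taken loosely from this thread and treated as flat top-level keys, and the helper names are hypothetical. It is not part of the k-NN plugin.

```python
def missing_vector_fields(doc, vector_fields):
    """Return the names of vector fields that are absent (or None) in doc."""
    return [name for name in vector_fields if doc.get(name) is None]


def partition_docs(docs, vector_fields):
    """Split docs into those with every vector field and those missing any."""
    ready, held_back = [], []
    for doc in docs:
        if missing_vector_fields(doc, vector_fields):
            held_back.append(doc)  # at least one vector field is missing
        else:
            ready.append(doc)      # safe to index under the workaround
    return ready, held_back


# Hypothetical example documents.
docs = [
    {"id": "a", "vector_1": [0.1, 0.2], "vector_2": [0.3, 0.4]},
    {"id": "b", "vector_1": [0.5, 0.6]},  # vector_2 missing
]
ready, held_back = partition_docs(docs, ["vector_1", "vector_2"])
```

Only the documents in `ready` would be indexed; those in `held_back` can be logged or repaired first.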

@vamshin We have two vector fields defined in a dynamic index template, so an index can end up with more than two vector fields (product.{dynamic}.vector_1 and product.{dynamic}.vector_2), but if product.{dynamic}.vector_1 is defined, product.{dynamic}.vector_2 should be as well. I can check whether the offending indices have vector_1 defined but not vector_2, or vice versa.

UPDATE:

In the offending indices, the vectors are always both defined (they are retrieved and indexed in tandem). So we have either documents with both vectors defined or documents with no vectors defined.

I wanted to provide some clarity for discussion participants (and future readers). This seems to be an issue for any existing index, not just for indices created from snapshots.

@dnock,

We appreciate your help in providing details and helping us understand the issue. The fix will be deployed to opendistro-1.10, opendistro-1.9, and opendistro-1.8.
PR with the fix: https://github.com/opendistro-for-elasticsearch/k-NN/pull/212