(Manual) Retry of IM Policy fails

This morning I discovered a few cases of my Index Management policy failing. When I try to manually retry the policy (i.e., click the RETRY POLICY button), I receive the following error message in a pop-up dialog.

```
Failed to retry: [kubernetes_cluster-kube-system-2020-03-27, RemoteTransportException[[v4m-es-data-2][10.254.5.253:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[v4m-es-data-2][10.254.5.253:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of processing of [3086976][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[.opendistro-ism-config][0]] containing [update {[.opendistro-ism-config][_doc][MJkR_h7-TaedLJIqWCuWGA], doc_as_upsert[false], doc[index {[.opendistro-ism-config][_doc][MJkR_h7-TaedLJIqWCuWGA], source[{"managed_index":{"last_updated_time":1585339823801,"enabled":true,"enabled_time":1585339823800}}]}], scripted_upsert[false], detect_noop[true]}], target allocation id: oBVL06_cTPKMGECPPq0dUQ, primary term: 4 on EsThreadPoolExecutor[name = v4m-es-data-2/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@44ed9f55[Running, pool size = 1, active threads = 1, queued tasks = 200, completed tasks = 1848255]]];]
```

I should also mention that Fluent Bit is having problems communicating with Elasticsearch in this cluster as well. I see many messages in the Fluent Bit log about failed attempts to send data to ES, although there are also many messages indicating Fluent Bit eventually succeeded on the 2nd or 3rd attempt. I see no messages in the Elasticsearch logs that appear to correlate with the Fluent Bit activity, so I'm not sure whether the events are related. Coincidentally (or not?), things started going bad right around 8 PM EDT, which corresponds to midnight GMT/UTC…which I believe is when all of my date-based indexes "roll over".

As things stand now, I don't think the ES cluster is in a healthy state, but I'd like to figure out why before I blindly redeploy everything the same way again. Any assistance is appreciated.

Upon further review, I discovered frequent (~every second) "Exception during establishing a SSL connection: java.io.IOException: Connection reset by peer" messages coming from the two ES client instances in this cluster. I suspect those are responsible for (or another symptom of) the communication problems reported by Fluent Bit. But I'm still not sure whether that's related to the IM Policy retry failure at all or is just a separate problem.

Hey @GSmith,

Haven't seen the failed-to-retry issue before. Are you using the full Open Distro for Elasticsearch distribution or just the Index Management plugin by itself? For starters, you could try disabling the Index Management plugin and see whether it's contributing to these issues. It definitely doesn't do anything ~every second, so that makes me think you're running into some unrelated issues. The retry failure looks like one of your thread pools being full; I'm guessing the retry would go through once the queue cleared out a little.
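If you want to rule ISM out without redeploying, its job execution can be toggled through the cluster settings API. A minimal sketch, assuming the Open Distro security plugin with demo credentials (the endpoint and auth below are placeholders for your cluster):

```python
import requests

ES = "https://localhost:9200"  # placeholder endpoint; point at your cluster
AUTH = ("admin", "admin")      # placeholder credentials

# Disable ISM job execution cluster-wide; no node restart required.
# Set the value back to True (or null) to re-enable it afterwards.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"opendistro.index_state_management.enabled": False}},
    auth=AUTH,
    verify=False,  # demo installs use self-signed certs
)
print(resp.json())
```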

@dbbaughe We're using the whole Open Distro, not just the Index Management plugin. I suspect the two issues (Index Management and connectivity) are unrelated, but I didn't want to make too many assumptions. I'll try disabling the plugin over the weekend to see the impact. Thanks.

This is due to your ES nodes being too busy (EsRejectedExecutionException, queued tasks = 200). The write thread pool queue is holding 200 tasks against a limit of 200, so new bulk tasks are rejected. The only ways to fix it are to reduce your traffic, add more nodes, or increase the thread pool queue size.
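You can watch the queue directly with the `_cat` thread pool API. A rough sketch (endpoint and credentials are placeholders):

```python
import requests

ES = "https://localhost:9200"  # placeholder endpoint
AUTH = ("admin", "admin")      # placeholder credentials

# Show the write thread pool per node: a "queue" sitting at "queue_size"
# with a climbing "rejected" count is exactly what produces the
# EsRejectedExecutionException in the retry error above.
resp = requests.get(
    f"{ES}/_cat/thread_pool/write",
    params={"v": "true", "h": "node_name,active,queue,queue_size,rejected"},
    auth=AUTH,
    verify=False,  # self-signed certs on demo installs
)
print(resp.text)
```

Note that `thread_pool.write.queue_size` is a static setting, so raising it means editing elasticsearch.yml and restarting the data nodes, and it only trades rejections for memory pressure.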

Thanks @adammike! Which type of ES node should I add? I've increased the number of "client" nodes from 2 to 3 and didn't see any real impact. I also have 3 data nodes and 3 master nodes in the cluster. I'm using a fairly small ingest pipeline that splits the incoming log messages based on a field, directing the messages to different indexes.
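For context, the pipeline is essentially this pattern (the field name, index prefix, and pipeline id below are placeholders, not the real ones):

```python
import requests

ES = "https://localhost:9200"  # placeholder endpoint
AUTH = ("admin", "admin")      # placeholder credentials

# A set processor rewrites _index from a field on each record, so one
# incoming stream fans out to per-source indexes at ingest time.
pipeline = {
    "description": "Route log records to an index named after a field",
    "processors": [
        {"set": {"field": "_index", "value": "logs-{{kube.namespace}}"}}
    ],
}
resp = requests.put(
    f"{ES}/_ingest/pipeline/route-logs",
    json=pipeline,
    auth=AUTH,
    verify=False,  # self-signed certs
)
print(resp.json())
```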

The rejected action was `indices:data/write/bulk`, which executes on the nodes holding the shards, so you should add data nodes.
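You can confirm where the pressure lands from the node list; coordinating-only ("client") nodes never execute the bulk write itself. A quick check, again with placeholder endpoint and credentials:

```python
import requests

ES = "https://localhost:9200"  # placeholder endpoint
AUTH = ("admin", "admin")      # placeholder credentials

# node.role containing "d" marks a data node; bulk shard writes run only
# there, which is why adding client nodes didn't move the needle.
resp = requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.percent,cpu,load_1m"},
    auth=AUTH,
    verify=False,  # self-signed certs
)
print(resp.text)
```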