Failed Transition

I have a simple index management policy defined that deletes an index after 3 days. Over the weekend, I noticed that this policy failed for a small number of the indices to which it applied but worked for several others. In the Kibana Index Management panel, on the Managed Indices page, the status for the indices where the policy failed is shown as “failed”. When I click the link in the INFO column, the dialog shows “message”: “Previous action was not able to update IndexMetaData.”

Looking through the Elasticsearch log messages, I see clusters of messages like the following, which I think correspond to the failed transition attempts:

[2020-03-23T01:16:13,519][WARN ][c.a.o.i.ManagedIndexRunner] [odfe-opendistro-es-data-2] Operation failed. Retrying in 250ms.
org.elasticsearch.transport.RemoteTransportException: [odfe-opendistro-es-master-1][10.254.2.161:9300][cluster:admin/ism/update/managedindexmetadata]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (opendistro-ism) within 30s
    at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) ~[elasticsearch-7.4.2.jar:7.4.2]
    at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
    at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:142) ~[elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.4.2.jar:7.4.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:835) ~[?:?]

Is it possible to increase the time allotted to process this cluster event, since the current 30-second limit doesn’t appear to be sufficient?

Will the Index Management plugin continue trying? The log message indicates that it will retry, and I do see multiple clusters of these messages, but some of the indices remain. So, does it “give up” at some point, or should I assume it will eventually “catch up”?

The list of indices shown in the plugin truncates the index name field, so it is hard for me to be sure which index is which, but the affected indices appear to be “small”, i.e. a few hundred MB, so it’s unclear why the request would be timing out. Is there additional logging that can be enabled that might be helpful?

I have been able to retry the action manually by clicking the RETRY POLICY button on the Managed Indices page of the IM plugin. But, obviously, I defined a policy so this would all be automated.

Can anyone offer some helpful advice/guidance?

Thanks!

Hi @GSmith,

Just to get some more information, could you let us know the setup of your cluster (i.e. number of nodes, indices, and shards)? And is the failure happening only during the transition, or also while executing any of the actions?

That error appears to be from the runner attempting to do some work and then not being able to successfully update the IndexMetaData after that work is done, because of the 30-second timeout. On the next execution, the runner doesn’t know whether it was able to complete the previous work or not, so it fails itself in case the current work to be executed is not idempotent. We currently do not make that timeout for the update call configurable, but it could be something we add. And no, the plugin won’t keep trying indefinitely, which is why you eventually see it in the failed state. It retries up to 3 times, which is also not currently configurable, but is something we could open up as a setting.

We do have some thoughts on how to improve this significantly, but we will need some time to implement them. It would be helpful if you could let us know at which part your indices are failing, though.

Thanks

Drew:

Thanks for responding. I saw this same behavior in two different clusters over the weekend. Both have the same structure: 3 master nodes, 2 data nodes, and 2 client nodes. Both clusters were deployed via Helm chart and both are running on Kubernetes clusters. One Kubernetes cluster (Cluster A) is composed of 7 nodes and the other (Cluster B) is composed of 5 nodes. The Kubernetes nodes themselves are VMs running in OpenStack, each configured with 8 vCPUs and 32 GB RAM.

The indices are configured with 1 shard and 1 replica; all are small. I suspect that the current configuration is “sub-optimal” but I’m still in prototyping mode and waiting for our larger application efforts to settle down to where I can get a more realistic sense of data volumes and user activities before tackling optimization.

I manually hit the RETRY POLICY button in Kibana for the indices that failed on Cluster B; so there are currently no indices in a failed state there. However, on Cluster A, I only did that for one of the failed indices. I’ve included the _cat/shards output for this cluster below.

The indices currently showing with a “failed” status in the IM plugin’s “Managed Indices” page are:

kubernetes_cluster-monitoring-2020-03-19
kubernetes_cluster-moss-2020-03-20
kubernetes_cluster-moss-2020-03-22

The plugin’s screens truncate the index names before the date portion, so I may be mistaken about which ones are affected. But, based on the order they appear in when I sort by name, I believe that’s the correct list. I’m not sure why the “moss” index was affected since it had not passed the 3 day mark.

By the way, I would suggest the UI be tweaked to either not truncate the index name or, at least, show the full index name in a pop-up as you mouse over the name. Being able to filter the managed indexes by status would also be helpful.

As I mentioned in my original post, the policy is fairly simple: after 3 days, move the index from the “hot” stage to a “doomed” stage, and the only action within the “doomed” stage is to delete the index. All of the indices that start with kubernetes_cluster-* have this same ISM policy assigned to them.
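Roughly, the policy looks like this (the policy ID is just a placeholder here and I’ve trimmed it down from memory, but the structure is the same):

# policy ID "delete_after_3d" is a placeholder
PUT _opendistro/_ism/policies/delete_after_3d
{
  "policy": {
    "description": "Delete indices after 3 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          {
            "state_name": "doomed",
            "conditions": {
              "min_index_age": "3d"
            }
          }
        ]
      },
      {
        "name": "doomed",
        "actions": [
          { "delete": {} }
        ],
        "transitions": []
      }
    ]
  }
}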

I was just re-reading the documentation and see that in the description of the ACTIONS, it does mention that both retries and timeout are user-configurable. Your note suggests that these are hard-coded. Is user-configurability a newly introduced feature, or is the documentation mistaken (or prematurely updated)?

Thanks again for your assistance.
Greg

.kibana_1                                               0 p STARTED      21  65.1kb 10.254.1.151 odfe-opendistro-es-data-1
.kibana_1                                               0 r STARTED      21  65.1kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-ingress-nginx-2020-03-23             0 r STARTED  884627   487mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-ingress-nginx-2020-03-23             0 p STARTED  864190 610.7mb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-ism-config                                  0 r STARTED      31  77.3kb 10.254.3.167 odfe-opendistro-es-data-2
.opendistro-ism-config                                  0 p STARTED      31  69.5kb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.22-000004 0 r STARTED      14  31.7kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.22-000004 0 p STARTED      14  31.7kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-21                0 r STARTED   89356  19.1mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-21                0 p STARTED   89356    19mb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.18-1      0 p STARTED      26  29.5kb 10.254.3.167 odfe-opendistro-es-data-2
.opendistro-ism-managed-index-history-2020.03.18-1      0 r STARTED      26  29.5kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-jackson-2020-03-21                   0 p STARTED  746472 397.4mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-21                   0 r STARTED  746472 402.1mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-istio-system-2020-03-21              0 r STARTED     629 621.2kb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-istio-system-2020-03-21              0 p STARTED     629 604.6kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-istio-system-2020-03-22              0 p STARTED     628 580.1kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-istio-system-2020-03-22              0 r STARTED     628 713.4kb 10.254.1.151 odfe-opendistro-es-data-1
security-auditlog-2020.03.21                            0 p STARTED      30  56.5kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.21                            0 r STARTED      30  56.5kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-22              0 p STARTED     394 294.9kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-22              0 r STARTED     394   295kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro_security                                    0 p STARTED       6  32.8kb 10.254.3.167 odfe-opendistro-es-data-2
.opendistro_security                                    0 r STARTED       6  32.8kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro_security                                    0 r STARTED       6    19kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-ingress-nginx-2020-03-22             0 p STARTED 1008703 550.1mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-ingress-nginx-2020-03-22             0 r STARTED 1008703 550.2mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-23                0 p STARTED   62702    15mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-monitoring-2020-03-23                0 r STARTED   69373  15.9mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-jackson-2020-03-22                   0 p STARTED  746460 401.9mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-22                   0 r STARTED  746460 397.9mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-moss-2020-03-20                      0 p STARTED  749560 396.3mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-20                      0 r STARTED  749560 396.7mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-ingress-nginx-2020-03-21             0 p STARTED 1008778 551.3mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-ingress-nginx-2020-03-21             0 r STARTED 1008778 551.7mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-istio-system-2020-03-23              0 r STARTED     564 665.8kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-istio-system-2020-03-23              0 p STARTED     564 604.7kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-jackson-2020-03-23                   0 r STARTED  636140 404.8mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-23                   0 p STARTED  657803 355.3mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-21              0 r STARTED     394 294.3kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-21              0 p STARTED     394 294.3kb 10.254.2.140 odfe-opendistro-es-data-0
security-auditlog-2020.03.18                            0 r STARTED       6  78.3kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.18                            0 p STARTED       6  78.3kb 10.254.1.151 odfe-opendistro-es-data-1
security-auditlog-2020.03.20                            0 p STARTED       7  81.7kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.20                            0 r STARTED       7  81.7kb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-job-scheduler-lock                          0 p STARTED      21   2.6mb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-job-scheduler-lock                          0 r STARTED      21   3.1mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-22                      0 r STARTED  749847   398mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-22                      0 p STARTED  749847 395.6mb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.21-000003 0 r STARTED      20  33.6kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.21-000003 0 p STARTED      20  33.6kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-19                0 p STARTED   89206  18.9mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-monitoring-2020-03-19                0 r STARTED   89206  18.6mb 10.254.1.151 odfe-opendistro-es-data-1
security-auditlog-2020.03.23                            0 p STARTED      15  83.9kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.23                            0 r STARTED      15    84kb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-22                0 p STARTED   89324  18.7mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-22                0 r STARTED   89324    19mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-23                      0 p STARTED  649696 418.5mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-23                      0 r STARTED  648895 498.8mb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.20-000002 0 p STARTED       7   9.8kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.20-000002 0 r STARTED       7   9.8kb 10.254.2.140 odfe-opendistro-es-data-0
viya_ops-2020.03.23                                     0 p STARTED   37815  14.3mb 10.254.1.151 odfe-opendistro-es-data-1
viya_ops-2020.03.23                                     0 r STARTED   37919  14.4mb 10.254.2.140 odfe-opendistro-es-data-0
security-auditlog-2020.03.19                            0 r STARTED      34 117.5kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.19                            0 p STARTED      34 117.4kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-21                      0 p STARTED  749696 398.2mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-moss-2020-03-21                      0 r STARTED  749696 398.4mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-23              0 p STARTED     588 438.6kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-23              0 r STARTED     588 438.6kb 10.254.1.151 odfe-opendistro-es-data-1
security-auditlog-2020.03.22                            0 p STARTED      20  27.3kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.22                            0 r STARTED      20  27.4kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.23-000005 0 p STARTED       3    18kb 10.254.3.167 odfe-opendistro-es-data-2
.opendistro-ism-managed-index-history-2020.03.23-000005 0 r STARTED       3    18kb 10.254.2.140 odfe-opendistro-es-data-0

Hi @GSmith,

Thanks for the information. We will fix the issue with the Kibana UI plugin truncating index names.

Regarding your “moss” index being affected even though it had not passed the 3 day mark: I’m assuming it does not say that index is in the doomed stage. Most likely it just had a timeout while attempting to check if it had any work to do and entered that failed state too.

To give a brief background on what’s happening internally:

The ISM plugin runs “jobs” based on one of our other plugins, called Job Scheduler. We refer to the jobs in ISM as Managed Indices, and they run on a pretty simple/dumb interval right now, i.e. run this job every 5 minutes (the default). Every index you apply a policy to has an associated Managed Index job that runs every 5 minutes, basically to check if it has any “work” to do.

For the vast majority of “runs” it will have no work to do, such as when it’s waiting for 3 days to move into a “doomed” state. This is one of the improvements we are working on. The issue is currently that every 5 minutes it’s checking if it should move into this “doomed” state and each one of these runs has a chance to “fail”. When we say “fail”, it could mean a number of different things, but a few things are common across all runs.

Because we currently do not have any distinction between which actions are idempotent/safe and which are dangerous, we treat them all as requiring strict safety.

What that means is every time one of these Managed Index jobs runs, we first take a lock (that we store in a document) for that index. This is to ensure no other plugin running on another node attempts to run the same job for the same index (such as when a new node joins the cluster or one temporarily drops).
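Incidentally, that lock lives in an ordinary index, the .opendistro-job-scheduler-lock index you can see in your _cat/shards output, so if you’re curious (and your security configuration allows reading it) you can look at the lock documents directly:

# lock documents for scheduled jobs; index name taken from your _cat/shards output
GET .opendistro-job-scheduler-lock/_search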

Once we have that lock and all our other checks pass (such as confirming the cluster is healthy before doing any work), we also change the state of the Managed Index to “STARTING”. We store this in the master cluster state. Once we have acknowledgement from the master node, we attempt to do the work. After that, we set the state of the Managed Index back to something other than “STARTING” (such as COMPLETED, NOOP, FAILED, etc.). We have a hardcoded 3-retry exponential backoff policy for these updates. Once we get acknowledgement of that update, we are finished.

Now if any future execution ever runs into a current status of “STARTING” before it has updated it itself, then we know we attempted to do some work and failed to update the final status. Because we don’t know what that final update was (COMPLETED vs. NOOP vs. FAILED), we can’t assume one or the other, and we enter this “failed” state to let the user decide whether to retry or not. From a user’s point of view, you generally don’t care about this and everything should just work. We are working on making this more resilient to random failures such as network issues, timeouts, etc., and already have a lot of ideas in place to improve this.
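As a side note, the metadata the Kibana UI shows (and the retry itself) should also be reachable through the ISM REST API if you prefer working outside Kibana, e.g. using one of your index names:

# show the managed index metadata, including the failed info/message
GET _opendistro/_ism/explain/kubernetes_cluster-moss-2020-03-22

# equivalent of the RETRY POLICY button
POST _opendistro/_ism/retry/kubernetes_cluster-moss-2020-03-22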

This should hopefully give some background as to why your indices are sometimes failing (and failing even if they aren’t actively trying to do something).

One thing you can do, if you don’t require an interval as low as 5 minutes, is to change the job_interval cluster setting. This can reduce the chances of these random timeout failures simply by reducing the number of times they can even happen. It does reduce the granularity of some actions, e.g. if you changed it to every 1 hour, then at worst an index might not transition to that “doomed” state until 3 days and 59 minutes, if it previously executed at 2 days, 23 hours, and 59 minutes.

As for the action retries/timeouts, those are a bit different from the internal retries we just talked about. As an example, if for whatever reason your Managed Index was executing the “delete” action to delete your index and failed, the retries/timeouts specified in the action configuration allow it to be retried instead of entering a “failed” state. So if it failed the first time because of some networking issue and you had configured the “delete” action with 5 retries, then it would continue to retry up to 5 times. The action timeout keeps track of when the first attempt of the action started; if the timeout is ever breached, then it enters the failed state. These unfortunately don’t help with the STARTING issue, because we do not know if it’s even OK to retry from that state (something like delete is definitely OK to retry, whereas something like shrink would not be).
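To make that concrete, the per-action retry/timeout settings sit alongside the action itself in the policy JSON, roughly like this (the values here are only illustrative):

{
  "name": "doomed",
  "actions": [
    {
      "retry": {
        "count": 5,
        "backoff": "exponential",
        "delay": "1m"
      },
      "timeout": "1h",
      "delete": {}
    }
  ]
}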

Hopefully that clarified the issue you’re facing.

@dbbaughe Thank you very much for the detailed explanation. I look forward to the coming enhancements. I’ll hold off on making any changes unless (until?) I see the problem re-appear. If it does, I’ll try increasing the job_interval setting to once an hour (or maybe once every few hours).

I found the following syntax works (I’m assuming the unit is minutes):
PUT _cluster/settings
{
  "persistent": {
    "opendistro.index_state_management.job_interval": 60
  }
}

Can you clarify the difference between persistent and transient cluster settings?

Other than the Index Management jobs, are there other jobs (or back-end processes) that are affected by this setting?

Thanks again for your assistance.

Hi @GSmith,

Regarding persistent vs transient:

Updates to settings can be persistent, meaning they apply across restarts, or transient, where they don’t survive a full cluster restart.
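For example, the same job_interval update you applied, made transient instead (so it would be lost after a full cluster restart), would look like:

PUT _cluster/settings
{
  "transient": {
    "opendistro.index_state_management.job_interval": 60
  }
}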

And that specific setting will only apply to the jobs running in Index State Management, which is only the Managed Indices. Any other jobs that support a configurable interval expose it through their own namespaced settings.

@dbbaughe Thanks for the clarification!

I’ve run into a problem when retrying the policy…but will start a new thread for that.