Indices stuck on "force_merge" action

I’ve been finding that many of my indices get stuck during the force_merge action and never transition to the next state, which is to delete the index. This is making it very hard to keep the cluster running: I’m continually having to juggle things manually to avoid hitting node shard limits.

I have updated my policies to remove the force_merge action, but this doesn’t help the indices that are already stuck. Is there a way to unstick them?

I’m currently running v1.6 of the plugin.

Thanks


Hi @govule,

Could you clarify what you mean by “getting stuck”? Which part of the action do they get stuck at?

If they are stuck in the middle of the force_merge action, then the quickest way would be to remove the policy from the “stuck” indices and then either a) manually do what you intended to do or b) attach a temporary policy to finish the rest of their lifecycle.
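If it helps, a rough sketch of that sequence against the ISM REST API could look like the following (the endpoint, credentials, index name, and `temporary-cleanup` policy id are all placeholders for whatever applies to your cluster):

```python
import requests

ES = "https://localhost:9200"   # placeholder: your cluster endpoint
AUTH = ("admin", "admin")       # placeholder: your credentials
INDEX = "logs-000042"           # placeholder: one of the stuck indices

# 1. Remove the current ISM policy from the stuck index
requests.post(f"{ES}/_opendistro/_ism/remove/{INDEX}", auth=AUTH, verify=False)

# 2a. Finish the lifecycle manually, e.g. delete the index outright...
# requests.delete(f"{ES}/{INDEX}", auth=AUTH, verify=False)

# 2b. ...or attach a temporary policy that only handles the remaining steps
requests.post(
    f"{ES}/_opendistro/_ism/add/{INDEX}",
    json={"policy_id": "temporary-cleanup"},   # hypothetical policy id
    auth=AUTH,
    verify=False,
)
```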

Perhaps we could consider an administrative API that allows someone to “override” a managed index to skip the current action it’s on when it gets stuck like this.

Either way, if possible please let us know how and where these got stuck so we can look into it. The best option is for us to make sure it can’t happen in the first place, so you don’t have to do anything manually.

Thanks for your response. Basically, in many cases, after an index rolled over it entered the force_merge action and never left it, and so never transitioned to the deletion phase, which should typically happen 10 days after the force_merge.

I had looked at the tasks using the ES Tasks API, and it wasn’t uncommon for these merges to take up to 5 hours to complete. It’s possible some merges were taking even longer and I just didn’t see them, but none of the merges were taking days to complete, so it seemed pretty clear that the ISM state machines were getting stuck.
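For reference, the kind of check I was running against the Tasks API looked roughly like this (a simplified sketch; it assumes in-flight force merges show up under task actions matching `*forcemerge*`):

```python
import requests

ES = "https://localhost:9200"   # placeholder: your cluster endpoint
AUTH = ("admin", "admin")       # placeholder: your credentials

# List in-flight force-merge tasks and how long each has been running
resp = requests.get(
    f"{ES}/_tasks",
    params={"actions": "*forcemerge*", "detailed": "true"},
    auth=AUTH,
    verify=False,
)
for node in resp.json().get("nodes", {}).values():
    for task_id, task in node.get("tasks", {}).items():
        hours = task["running_time_in_nanos"] / 3.6e12  # nanos -> hours
        print(f"{task_id}: {task.get('description', '')} running for {hours:.1f}h")
```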

I have since removed the force_merge action from the warm phase of all policies, and this has helped somewhat. In addition, I’ve created a little job that fires every hour, checks for pipelines in a dodgy state, and helps them along a little. I’m still seeing many pipelines seizing up due to timeouts writing their metadata, even after increasing the transition attempt frequency to 1h.
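For anyone curious, the hourly job boils down to something like the sketch below (simplified, not the exact code; the fields in the ISM explain response vary between plugin versions, and here I just treat any managed index whose current action reports `failed` as needing a nudge via the retry endpoint):

```python
import requests

ES = "https://localhost:9200"   # placeholder: your cluster endpoint
AUTH = ("admin", "admin")       # placeholder: your credentials

# Ask ISM to explain the state of every managed index
explain = requests.get(
    f"{ES}/_opendistro/_ism/explain/*", auth=AUTH, verify=False
).json()

for index, status in explain.items():
    if not isinstance(status, dict):
        continue  # skip any top-level bookkeeping fields in the response
    action = status.get("action") or {}
    # Treat a failed action (e.g. a timed-out metadata write) as "stuck"
    if action.get("failed"):
        print(f"Retrying {index} (stuck in action '{action.get('name')}')")
        requests.post(f"{ES}/_opendistro/_ism/retry/{index}", auth=AUTH, verify=False)
```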

BTW, I’m still on OpenDistro v1.6. I understand there have been significant improvements to ISM in v1.7 and would like to try and upgrade soon.

Is there a particular node type that ISM executes on? Each of our clusters has dedicated master, data, and ingest/client-coordination nodes. If the ISM plugin doesn’t execute on data nodes, then this would at least allow us to test out the new plugin very quickly on larger clusters.
