Thanks for your response. Basically, in many cases after an index rolled over it entered the force_merge action and never left it, so it never transitioned to the delete phase - which should typically have happened 10 days after the force_merge.
I had looked at the tasks using the ES Tasks API, and it wasn't uncommon for these merges to take up to 5 hours to complete. It's possible some merges took even longer and I just didn't see them, but none of them were taking days, so it seemed pretty clear that the ISM state machines were getting stuck.
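For reference, this is roughly how I was spotting long-running merges. It's just a sketch: the host is a placeholder, and the `?actions=*forcemerge*&detailed` query and `running_time_in_nanos` field come from the standard Elasticsearch Tasks API.

```python
import json
import urllib.request

ES_HOST = "http://localhost:9200"  # placeholder; point at your cluster


def long_running_tasks(tasks_json, threshold_secs):
    """Return (node_id, task_id, runtime_secs) for force-merge tasks
    that have been running longer than threshold_secs."""
    slow = []
    for node_id, node in tasks_json.get("nodes", {}).items():
        for task_id, task in node.get("tasks", {}).items():
            if "forcemerge" not in task.get("action", ""):
                continue
            runtime_secs = task.get("running_time_in_nanos", 0) / 1e9
            if runtime_secs > threshold_secs:
                slow.append((node_id, task_id, runtime_secs))
    return slow


def fetch_forcemerge_tasks():
    # GET _tasks filtered to force-merge actions, with per-task detail
    url = ES_HOST + "/_tasks?actions=*forcemerge*&detailed"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Example usage (requires a live cluster):
# for node, task, secs in long_running_tasks(fetch_forcemerge_tasks(), 3600):
#     print(f"{task} on {node}: running {secs / 3600:.1f}h")
```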
I have since removed the force_merge action from the warm phase of all policies, and this has helped somewhat. In addition, I've created a little job that fires every hour, checks for managed indices in a dodgy state, and helps them along a little - I'm still seeing many of them seizing up due to timeouts writing their metadata, even after increasing the transition attempt frequency to 1h.
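In case it helps anyone, the hourly job boils down to calling the ISM explain API and retrying any managed index whose current step has failed. This is only a sketch: the host is a placeholder, and while the `_opendistro/_ism/explain` and `_opendistro/_ism/retry` endpoints are the documented ISM APIs, the exact explain-response field names (`step`, `step_status`) are what I'm assuming from v1.6 and may differ on other versions.

```python
import json
import urllib.request

ES_HOST = "http://localhost:9200"  # placeholder; point at your cluster


def indices_needing_retry(explain_json):
    """From an ISM explain response, pick out indices whose
    current step is reported as failed."""
    stuck = []
    for index, state in explain_json.items():
        if not isinstance(state, dict):
            continue
        step = state.get("step") or {}
        if step.get("step_status") == "failed":
            stuck.append(index)
    return stuck


def retry_stuck_indices():
    # GET the managed-index status for everything ISM is managing
    url = ES_HOST + "/_opendistro/_ism/explain/*"
    with urllib.request.urlopen(url) as resp:
        explain = json.load(resp)
    for index in indices_needing_retry(explain):
        # POST _opendistro/_ism/retry/<index> re-runs the failed step
        req = urllib.request.Request(
            ES_HOST + "/_opendistro/_ism/retry/" + index, method="POST")
        urllib.request.urlopen(req)
```

I run `retry_stuck_indices()` from cron every hour; the filtering is kept in a separate function so it's easy to test against captured explain output.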
BTW, I’m still on OpenDistro v1.6. I understand there have been significant improvements to ISM in v1.7 and would like to try and upgrade soon.
Is there a particular node type that ISM executes on? Each of our clusters has dedicated master, data and ingest/client-coordination nodes. If the ISM plugin doesn't execute on data nodes then this would at least allow us to test out the new plugin very quickly on larger clusters.