Jobs stay running

Hi everyone, I’m running into an issue that I think might be slowing down my cluster and keeping it from keeping up with ingest. First I’d like to verify that it is, in fact, an issue, and then hopefully resolve it.

I have filebeat and metricbeat creating new daily indices and pushing logs into my cluster. I have an index management policy that should move indices from hot to warm to cold and finally to rotten, where they are deleted. All of my indices seem to reach the transition step, more or less, but the jobs always show as running and never complete. Can anyone give me any insight into what might be happening? The full policy follows, with an example status check after it.

{
    "policy_id": "hot_cold_rotten_workflow",
    "description": "Default policy that moves indicies from hot to warm to cold to rotten states. Indicies delete on rotten.",
    "last_updated_time": 1617990491574,
    "schema_version": 1,
    "error_notification": null,
    "default_state": "hot",
    "states": [
        {
            "name": "hot",
            "actions": [],
            "transitions": [
                {
                    "state_name": "warm",
                    "conditions": {
                        "min_index_age": "1d"
                    }
                }
            ]
        },
        {
            "name": "warm",
            "actions": [
                {
                    "replica_count": {
                        "number_of_replicas": 3
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "cold",
                    "conditions": {
                        "min_index_age": "30d"
                    }
                }
            ]
        },
        {
            "name": "cold",
            "actions": [
                {
                    "replica_count": {
                        "number_of_replicas": 1
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "rotten",
                    "conditions": {
                        "min_index_age": "90d"
                    }
                }
            ]
        },
        {
            "name": "rotten",
            "actions": [
                {
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ]
}
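
For reference, the per-index ISM status can be pulled with the explain API (a sketch; the filebeat-* pattern below is just an example, substitute your own index names):

GET _opendistro/_ism/explain/filebeat-*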

Hi @mmcdermott,

Based on your policy, indices enter the default hot state and then move to warm after one day. I see a bunch of warm indices in your screenshot, so I assume that part is working as expected. The warm state then sets the replica count to 3 and transitions to cold once the index is 30 days old. So all of the indices in the screenshot are in warm, sitting in the transition step and waiting to reach 30 days of age. Judging by the index names, they don’t appear to be over 30 days old, so what exactly is the issue? Or are some over 30 days?
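
If you want to double-check the actual ages rather than inferring them from the index names, something like the following lists creation dates (the filebeat-* pattern is a placeholder; note that min_index_age is measured from index creation time):

GET _cat/indices/filebeat-*?v&h=index,creation.date.string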

Thank you for your response @dbbaughe. My question is more or less just to make sure that everything looks OK. For some reason my cluster can’t seem to keep up with a few containers dumping logs via filebeat, plus metricbeat pulling Kubernetes and host metrics. It kept up for a few days and now can’t keep up at all, with no real change in the volume of logging. I wanted to make sure that none of these jobs were doing something that pulls a large amount of resources and causes that issue. Based on your response, it sounds like this policy and its behavior are fine, so I think I have my answer. Thank you very much!

Yeah, the jobs don’t use many resources; they get scheduled and just sleep until the next execution time, when each one performs a quick check (in this case, checking the index age against the local cluster state on the node). Everything looks fine from the ISM point of view.
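
For context, the wake-up cadence of those jobs comes from the opendistro.index_state_management.job_interval cluster setting, which defaults to 5 minutes. As a sketch, if you ever wanted the checks to run less often:

PUT _cluster/settings
{
    "persistent": {
        "opendistro.index_state_management.job_interval": 10
    }
}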