Slow shard reallocation speed, shards building up

We run an Elasticsearch cluster in Azure backed by 'premium' SSDs (up to 5000 IOPS). We take in a large amount of data daily, so we're currently running 10 hot nodes with 10 primaries/replicas per index. The indices roll over each day, and after two days they're allocated to our single cold node. The nodes are all 8 cores and 64 GB of memory; node.processors has been set accordingly, and the heap size is 25 GB on the hot nodes and 8 GB on the cold node.

The problem is that we're not seeing anywhere near the write speed on the cold node that we should be. Even with the cold disk being a 'standard' SSD (500 IOPS), some benchmarks suggest we should be seeing around 100 MB/s, but we're currently seeing 2-5 MB/s.

We've tried the following settings (applied as in the sketch after this list):
cluster.routing.allocation.node_concurrent_incoming_recoveries: 10
cluster.routing.allocation.node_concurrent_recoveries: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 10
indices.recovery.max_bytes_per_sec: 1000mb (set to an absurd level just to rule it out)
indices.recovery.max_concurrent_file_chunks: 5 (this appears to be the maximum)
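For context, all of these are dynamic, so they're applied via the cluster settings API, roughly like this (a sketch; the values are just the ones listed above, and persistent vs. transient is our choice):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 10,
    "cluster.routing.allocation.node_concurrent_recoveries": 10,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 10,
    "indices.recovery.max_bytes_per_sec": "1000mb",
    "indices.recovery.max_concurrent_file_chunks": 5
  }
}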

We do see the 10 concurrent recoveries, though it's unclear to me whether relocations (shard moves) count as recoveries (they appear to). What we don't see is anywhere close to the throughput we'd expect when writing to our cold storage. It seems like the concurrent file chunks setting may have had some effect (our writes went from 2 MB/s to 5 MB/s), but that would also imply we were only writing a single chunk at a time, which seems unlikely.
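For what it's worth, relocations do show up as peer recoveries, so a quick way to watch them and eyeball throughput is the cat recovery API (a sketch; the column selection is just what we find convenient):

GET _cat/recovery?active_only=true&v&h=index,shard,type,stage,source_node,target_node,bytes_recovered,bytes_percent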

Edit: a few pieces of anecdotal evidence from this evening:

  • thread_pool.write.queue_size was set to 200 by default; the docs suggest this should be much higher, so we tried 2000. Initially we saw disk write speeds shoot up to 180 MB/s, but after a few minutes they plummeted back to almost nothing.
  • Likely related to the above, we started seeing CircuitBreakingException errors in the logs around that time. I'm unsure whether they started before or after throughput dropped below 1 MB/s.
  • Also related to the above: the official Elastic documentation seems misleading. This page says thread_pool.write.queue_size is set to 10000, while everything else I've read (including the cluster itself) suggests the default is 200. Some guidance on calculating a sensible value would be helpful.
  • We've set indices.breaker.total.use_real_memory to false to work around the circuit breaker errors; time will tell whether it has any bearing on allocations. (Both of these changes are sketched below.)
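In case it's useful, both of those are static node settings, so they go into elasticsearch.yml on each node rather than through the cluster settings API (a sketch; a restart is needed for them to take effect):

# elasticsearch.yml (per node; static settings, picked up on restart)
thread_pool.write.queue_size: 2000
indices.breaker.total.use_real_memory: false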

Any support is appreciated, as shards are piling up from previous days and starting to cause issues. Please let me know if I can provide any more information. Thanks in advance.

What is your actual policy and how big are the shards?

Since the allocation action only made it into the ISM plugin recently, we've been using a workaround: ISM policy notifications send requests to a simple API that does the following:

PUT /<index>/_settings
{
  "index.routing.allocation.include.temp": "cold",
  "index.number_of_replicas": 0
}

Our shards are generally between 20 and 50 GB.
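For anyone curious, a quick way to confirm the sizes is the cat shards API (a sketch; the column and sort choices are just for convenience):

GET _cat/shards?v&h=index,shard,prirep,state,store,node&s=store:desc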

I should mention that since disabling the breaker and bumping queue_size up to 2000, we've seen a mild improvement: throughput is now in the 20-50 MB/s range, with occasional periods of 80-100 MB/s, but without any obvious correlation. We're still nowhere near our upper limits on write speed or IOPS, though.

Do you forcemerge the shards on the hot nodes before you migrate them to the cold box?

We do not currently. Historically we did, prior to switching to OpenDistro, but force_merge has been very flaky in ISM policies and tends to fail more often than not. As far as I can tell this is a known issue when force merges take a long time, and it's addressed in the upcoming 1.10.0 release notes. For the moment we've taken that step out.
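In the meantime the merge can still be run by hand against an index once it's on the cold node, e.g. (a sketch; max_num_segments=1 is just a common choice for indices that are no longer written to):

POST /<index>/_forcemerge?max_num_segments=1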

Beyond allocations, we also generally see low-to-moderate CPU usage across the entire cluster, both hot and cold nodes. We've got plenty of headroom, so ultimately we're just looking for knobs to turn that will make better use of it.