We run an Elasticsearch cluster in Azure backed by ‘premium’ SSDs (up to 5000 IOPS). We take in a large amount of data daily, so we’re currently running 10 hot nodes with 10 primaries (plus replicas) per index. The indices roll over each day, and after two days they’re allocated to our single cold node. The nodes themselves all have 8 cores and 64 GB of memory.
`node.processors` has been set accordingly, and the heap size is 25 GB on the hot nodes and 8 GB on the cold node.
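For context, the hot-to-cold move is done with index-level allocation filtering against a custom node attribute. The attribute name (`box_type`) and index name below are illustrative, not our exact values, but the shape is:

```
# elasticsearch.yml — per-node attribute (illustrative name)
node.attr.box_type: hot    # "cold" on the cold node

# applied to indices older than two days
PUT logs-000123/_settings
{
  "index.routing.allocation.require.box_type": "cold"
}
```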
The problem is that the cold node’s write speed is nowhere near what it should be. Even with the cold disk being a ‘standard’ SSD (500 IOPS), benchmarks suggest we should be seeing around 100 MB/s, but we’re currently seeing 2–5 MB/s.
We’ve tried the following settings:
- `indices.recovery.max_bytes_per_sec: 1000mb` (set to an absurd level just to rule it out)
- `indices.recovery.max_concurrent_file_chunks: 5` (this appears to be the maximum)
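Both were applied dynamically via the cluster settings API (whether `persistent` or `transient` shouldn’t matter for this test):

```
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "1000mb",
    "indices.recovery.max_concurrent_file_chunks": 5
  }
}
```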
We do see the 10 concurrent recoveries, though it’s unclear to me whether relocations count as recoveries (they appear to). What we don’t see is anywhere near the expected throughput when writing to cold storage. The concurrent file chunks setting may have had some effect, as our writes went from 2 MB/s to 5 MB/s, but that would also suggest we’re only writing a single chunk at a time, which seems unlikely.
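For what it’s worth, the concurrent-recoveries count comes from watching active recoveries via the cat API (relocations show up here too):

```
GET _cat/recovery?v&active_only=true&h=index,shard,type,stage,bytes_recovered,bytes_percent
```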
Edit: a couple of pieces of anecdotal evidence from this evening:
- `thread_pool.write.queue_size` was set to 200 by default; the docs suggest this should be much higher, so we tried 2000. Initially we saw disk write speeds shoot up to 180 MB/s, but after a few minutes they plummeted back to almost nothing.
- I believe related to the above, we started seeing `CircuitBreakingException` errors in the logs around that time. Unsure whether they were happening before throughput dropped to <1 MB/s.
- Also related to the above: the official Elastic documentation seems misleading; this page says that `thread_pool.write.queue_size` is set to 10000, while everywhere else I’ve read (including the cluster itself) suggests the default is 200. Some guidance on calculating the correct number would be helpful.
- We’ve set `indices.breaker.total.use_real_memory` to false to avoid the above; time will tell whether it has any bearing on allocations.
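For completeness, both of those are static node settings, so we changed them in `elasticsearch.yml` and restarted (unlike the recovery settings, which are dynamic):

```
# elasticsearch.yml
thread_pool.write.queue_size: 2000
indices.breaker.total.use_real_memory: false
```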
Any support is appreciated, as shards from previous days are piling up and starting to cause issues. Please let me know if I can provide any more information. Thanks in advance.