Hello! My team and I are using Anomaly Detection as a SIEM tool, but we have encountered several problems with our platform.
Git issue reference
We are wondering why our coordinator nodes keep failing periodically. Here is a sample of the logs we see when one of the nodes fails:
[2021-03-02T14:44:42,574][ERROR][c.a.o.s.s.h.n.OpenDistroSecuritySSLNettyHttpServerTransport] [KBN_0] Exception during establishing a SSL connection: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
[2021-03-02T14:45:02,191][WARN ][o.e.m.j.JvmGcMonitorService] [KBN_0] [gc] overhead, spent [3.2s] collecting in the last [3.2s]
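To make the second log line easier to read, here is a minimal sketch that turns the `JvmGcMonitorService` overhead message into a ratio. The function name `gc_overhead_ratio` and the regular expression are my own assumptions based on the line format shown above; they are not part of Elasticsearch.

```python
import re
from typing import Optional

# Pattern for the overhead message as it appears in the log above, e.g.
# "[gc] overhead, spent [3.2s] collecting in the last [3.2s]".
# An optional "[<number>]" after "[gc]" is tolerated (assumption).
GC_OVERHEAD = re.compile(
    r"\[gc\](?:\[\d+\])? overhead, "
    r"spent \[([\d.]+)(ms|s|m)\] collecting in the last \[([\d.]+)(ms|s|m)\]"
)

# Conversion factors to seconds for the units handled by this sketch.
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def gc_overhead_ratio(line: str) -> Optional[float]:
    """Return the fraction of the window spent in GC, or None if no match."""
    match = GC_OVERHEAD.search(line)
    if match is None:
        return None
    spent, spent_unit, window, window_unit = match.groups()
    spent_s = float(spent) * UNIT_SECONDS[spent_unit]
    window_s = float(window) * UNIT_SECONDS[window_unit]
    return spent_s / window_s


line = (
    "[2021-03-02T14:45:02,191][WARN ][o.e.m.j.JvmGcMonitorService] "
    "[KBN_0] [gc] overhead, spent [3.2s] collecting in the last [3.2s]"
)
ratio = gc_overhead_ratio(line)
```

For the line above the ratio is 1.0, meaning the node spent the entire 3.2 s window collecting garbage, which matches the heap exhaustion reported in the first log line.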
Here is our Kibana & Elasticsearch setup:
- 2 coordinator nodes (installed on the Kibana VMs to ensure load balancing on the cluster): 4 virtual cores, 15 GB RAM, 8G heap size (coordinating)
- 3 master nodes: 4 virtual cores, 15 GB RAM, 8G heap size (master); not used in this scenario, as far as we understand
- 15 data nodes: 8 virtual cores, 30 GB RAM, 16G heap size (ingest & data)
- Anomaly detectors: 26 running detectors with around 500 active entities, using in total 600 MiB each
While trying to understand the reasons behind these failures, we came up with several questions:
- How is load balancing done between several coordinator nodes? Is heap usage supposed to be equally distributed across all the coordinator nodes?
- How are the coordinator nodes linked with the anomaly detector plugin? Is the coordinator node responsible for distributing the trees across different shards? Where are the trees of the Random Cut Forest saved? Is the coordinator node responsible for collecting and aggregating the final result, such as the anomaly grade?
- When many detectors are launched, heap usage on the coordinator nodes increases greatly (here: 75% and 80%). How should we scale the coordinator nodes to keep up with demand? Add more coordinator nodes? Increase the heap size? Add more CPU cores? …
- Reducing the batched size can be used as a protection mechanism to reduce the memory overhead per search request. Would this also help in the case of detectors?
- Since the detectors are configured on indexes that use rollover, we tried pointing them at only the current write index (to reduce heap space errors). However, detectors using the write alias take 10x longer to initialize than detectors using indexes. Do you have any recommendations on this?
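To clarify the "batched size" question above: I am assuming it refers to the `batched_reduce_size` query parameter of the Elasticsearch `_search` API. Here is a minimal sketch of how we would set it on an ordinary search request. The helper `build_search_url`, the host and the alias name are hypothetical, and we do not know whether the Anomaly Detection plugin honours this setting for its own queries.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, index: str, batched_reduce_size: int) -> str:
    """Build a _search URL that sets batched_reduce_size.

    The parameter limits how many shard results are reduced at once on the
    coordinating node (Elasticsearch defaults to 512).
    """
    query = urlencode({"batched_reduce_size": batched_reduce_size})
    return f"{base_url.rstrip('/')}/{index}/_search?{query}"


# Hypothetical host and write alias, used only for illustration.
url = build_search_url("https://localhost:9200/", "my-logs-write", 64)
```

The resulting URL can then be used with any HTTP client. A lower value means the coordinating node buffers fewer shard results per request, in exchange for more reduce rounds.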
My team and I would be very grateful for the time you take to answer all these questions.