Performance and Sizing Help and Insights

Question:
We are looking for some guidance or some commentary on our OpenDistro cluster which is deployed onto our Kubernetes infrastructure via Helm. I have noted the current configuration for the cluster below.
While our cluster seems stable, any slight bump in the night quickly destabilizes it. We recently had an issue where two (2) of the client containers were destroyed and recreated. This caused the masters to disconnect and the cluster to become unstable and go yellow. Several shards had to be reinitialized, which took ~36 hours to complete. During the yellow period we saw the ingest rate drop by ~20%.
On previous occasions we noticed that garbage collection on the client nodes was impacting the ingest rate: it dropped by 10-15% per day until the client containers were destroyed and redeployed (individually and serially).
We feel as though we are at the edge of a cliff with the current configuration of the cluster: a slight gust of wind is enough to tip it over into yellow or red.
We are hoping that others can provide some insight into the next steps toward further stabilizing the cluster and enabling solid scaling for ingest. We would like to stop shooting in the dark when deciding where in the cluster to devote resources and attention. We have not found any published examples of clusters sustaining this ingest rate on Kubernetes.

Version:

  • OpenDistro: 1.1.0
  • Elasticsearch: 7.1.1

Data Type:

  • Syslog

Index Configuration:

  • Single daily index
    • Shards: 10
    • Replica: 1
  • Ingest Rate
    • Documents per Day: ~6,800,000,000
    • Size per Day: 1.8TB
  • Retention
    • Indices closed after: 7 days
    • Indices deleted after: 30 days
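For context, here is a rough back-of-the-envelope sketch of what the index configuration above implies per shard, using only the figures from this post (the ~50 GB-per-shard figure in the comment is the commonly cited Elasticsearch guideline, not a hard limit):

```python
# Shard-size arithmetic for a single daily index, using the numbers above.
daily_size_gb = 1800   # ~1.8 TB of primary data per day
primary_shards = 10
replicas = 1

per_primary_gb = daily_size_gb / primary_shards          # size of each primary shard
total_daily_gb = daily_size_gb * (1 + replicas)          # daily footprint incl. replicas

print(f"per-primary shard size: {per_primary_gb:.0f} GB")        # 180 GB (vs. the ~50 GB guideline)
print(f"daily footprint incl. replicas: {total_daily_gb:.0f} GB")  # 3600 GB
```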

Architecture:

  1. Data:
    • Number: 10
    • CPU: 4
    • Memory: 32G
    • Heap: 16G
  2. Master:
    • Number: 5
    • CPU: 4
    • Memory: 16G
    • Heap: 8G
  3. Client:
    • Number: 5
    • CPU: 2
    • Memory: 8G
    • Heap: 4G
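To make the load on this topology concrete, a quick sketch of the per-node indexing rate and the disk footprint over the retention window, assuming ingest is spread evenly across the day and across the data nodes, and that closed indices still occupy disk until deletion at 30 days:

```python
# Back-of-the-envelope load figures from the configuration above.
docs_per_day = 6_800_000_000
daily_tb = 1.8
data_nodes = 10
replicas = 1
retention_days = 30   # indices deleted after 30 days

docs_per_sec = docs_per_day / 86_400          # cluster-wide average
per_node_docs_per_sec = docs_per_sec / data_nodes
disk_tb = daily_tb * (1 + replicas) * retention_days

print(f"cluster ingest: ~{docs_per_sec:,.0f} docs/sec")
print(f"per data node:  ~{per_node_docs_per_sec:,.0f} docs/sec")
print(f"disk footprint over retention: ~{disk_tb:.0f} TB incl. replicas")
```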

Storage:

  • Storage Backed:
    • NFS mounts to data nodes
    • Disks: 7.2K

Garbage Collection:

  • JAVA_OPTS: "-XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=75"