Heap Memory Issues

We just updated to 1.13.1 and I seem to have heap memory issues.
We can see this on our data-only node:

CircuitBreakingException[[parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [33575573184/31.2gb], which is larger than the limit of [32641751449/30.3gb], real usage: [33575572272/31.2gb], new bytes reserved: [912/912b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=9908/9.6kb, accounting=4222558184/3.9gb]];

So I looked at the node stats and found the following:

"jvm" : {
    "timestamp" : 1616102416988,
    "uptime_in_millis" : 25389419,
    "mem" : {
      "heap_used_in_bytes" : 32834308920,
      "heap_used_percent" : 95,
      "heap_committed_in_bytes" : 34359738368,
      "heap_max_in_bytes" : 34359738368,
      "non_heap_used_in_bytes" : 198294112,
      "non_heap_committed_in_bytes" : 207482880,
      "pools" : {
        "young" : {
          "used_in_bytes" : 402653184,
          "max_in_bytes" : 0,
          "peak_used_in_bytes" : 20501757952,
          "peak_max_in_bytes" : 0
        },
        "old" : {
          "used_in_bytes" : 32408545328,
          "max_in_bytes" : 34359738368,
          "peak_used_in_bytes" : 34307804680,
          "peak_max_in_bytes" : 34359738368
        },
        "survivor" : {
          "used_in_bytes" : 23110408,
          "max_in_bytes" : 0,
          "peak_used_in_bytes" : 1583291440,
          "peak_max_in_bytes" : 0
        }
      }
    },
    "threads" : {
      "count" : 227,
      "peak_count" : 368
    },
    "gc" : {
      "collectors" : {
        "young" : {
          "collection_count" : 3560,
          "collection_time_in_millis" : 107619
        },
        "old" : {
          "collection_count" : 360,
          "collection_time_in_millis" : 1131408
        }
      }
    },
    "buffer_pools" : {
      "mapped" : {
        "count" : 157716,
        "used_in_bytes" : 3579872863310,
        "total_capacity_in_bytes" : 3579872863310
      },
      "direct" : {
        "count" : 190,
        "used_in_bytes" : 68222478,
        "total_capacity_in_bytes" : 68222477
      },
      "mapped - 'non-volatile memory'" : {
        "count" : 0,
        "used_in_bytes" : 0,
        "total_capacity_in_bytes" : 0
      }
    }
}

"breakers" : {
    "request" : {
      "limit_size_in_bytes" : 20615843020,
      "limit_size" : "19.1gb",
      "estimated_size_in_bytes" : 0,
      "estimated_size" : "0b",
      "overhead" : 1.0,
      "tripped" : 0
    },
    "fielddata" : {
      "limit_size_in_bytes" : 13743895347,
      "limit_size" : "12.7gb",
      "estimated_size_in_bytes" : 0,
      "estimated_size" : "0b",
      "overhead" : 1.03,
      "tripped" : 0
    },
    "in_flight_requests" : {
      "limit_size_in_bytes" : 34359738368,
      "limit_size" : "32gb",
      "estimated_size_in_bytes" : 563,
      "estimated_size" : "563b",
      "overhead" : 2.0,
      "tripped" : 0
    },
    "accounting" : {
      "limit_size_in_bytes" : 34359738368,
      "limit_size" : "32gb",
      "estimated_size_in_bytes" : 4222558184,
      "estimated_size" : "3.9gb",
      "overhead" : 1.0,
      "tripped" : 0
    },
    "parent" : {
      "limit_size_in_bytes" : 32641751449,
      "limit_size" : "30.3gb",
      "estimated_size_in_bytes" : 32834308920,
      "estimated_size" : "30.5gb",
      "overhead" : 1.0,
      "tripped" : 13593254
    }
  }
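The parent breaker figures line up with the default real-memory circuit breaker, which trips when real heap usage plus the new reservation would exceed 95% of the configured heap. A quick sanity check in Python on the numbers from the exception above (the 0.95 factor is the documented default for `indices.breaker.total.limit` when real-memory tracking is enabled):

```python
# Sanity-check the parent circuit breaker figures from the exception above.
# The default real-memory parent breaker limit is 95% of the max heap.
heap_max = 34_359_738_368           # heap_max_in_bytes (32GB)
parent_limit = int(heap_max * 0.95)

real_usage = 33_575_572_272         # "real usage" from the CircuitBreakingException
new_bytes = 912                     # "new bytes reserved"
would_trip = real_usage + new_bytes > parent_limit

print(parent_limit)                 # matches the 32641751449 (30.3gb) limit in the log
print(would_trip)                   # True: the request is rejected
```

This confirms the breaker is behaving as configured; the real problem is that the heap genuinely sits at ~95% usage.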

I looked in the gc.log and didn’t find any further answers as to why the memory usage is so high:

 [2021-03-18T19:39:04.440+0000][1][gc,start       ] GC(3275) Pause Young (Concurrent Start) (G1     Humongous Allocation)
 [2021-03-18T19:39:04.440+0000][1][gc,task        ] GC(3275) Using 43 workers of 43 for evacuation
 [2021-03-18T19:39:04.440+0000][1][gc,age         ] GC(3275) Desired survivor size 109051904 bytes, new threshold 15 (max threshold 15)
 [2021-03-18T19:39:04.461+0000][1][gc,age         ] GC(3275) Age table with threshold 15 (max threshold 15)
 [2021-03-18T19:39:04.461+0000][1][gc,age         ] GC(3275) - age   1:   12946144 bytes,   12946144 total
 [2021-03-18T19:39:04.461+0000][1][gc,phases      ] GC(3275)   Pre Evacuate Collection Set: 0.9ms
 [2021-03-18T19:39:04.461+0000][1][gc,phases      ] GC(3275)   Merge Heap Roots: 1.5ms
 [2021-03-18T19:39:04.461+0000][1][gc,phases      ] GC(3275)   Evacuate Collection Set: 13.8ms
 [2021-03-18T19:39:04.461+0000][1][gc,phases      ] GC(3275)   Post Evacuate Collection Set: 3.9ms
 [2021-03-18T19:39:04.461+0000][1][gc,phases      ] GC(3275)   Other: 1.0ms
 [2021-03-18T19:39:04.461+0000][1][gc,heap        ] GC(3275) Eden regions: 22->0(101)
 [2021-03-18T19:39:04.461+0000][1][gc,heap        ] GC(3275) Survivor regions: 0->1(13)
 [2021-03-18T19:39:04.461+0000][1][gc,heap        ] GC(3275) Old regions: 1972->1972
 [2021-03-18T19:39:04.461+0000][1][gc,heap        ] GC(3275) Archive regions: 0->0
 [2021-03-18T19:39:04.461+0000][1][gc,heap        ] GC(3275) Humongous regions: 1->0
 [2021-03-18T19:39:04.461+0000][1][gc,metaspace   ] GC(3275) Metaspace: 112607K(115708K)->112607K(115708K) NonClass: 99340K(101500K)->99340K(101500K) Class: 13267K(14208K)->13267K(14208K)
 [2021-03-18T19:39:04.461+0000][1][gc             ] GC(3275) Pause Young (Concurrent Start) (G1 Humongous Allocation) 31546M->31202M(32768M) 21.182ms
 [2021-03-18T19:39:04.461+0000][1][gc,cpu         ] GC(3275) User=0.64s Sys=0.04s Real=0.02s
 [2021-03-18T19:39:04.461+0000][1][gc             ] GC(3276) Concurrent Cycle
 [2021-03-18T19:39:04.461+0000][1][safepoint      ] Safepoint "G1TryInitiateConcMark", Time since last: 960189638 ns, Reaching safepoint: 582197 ns, At safepoint: 21526113 ns, Total: 22108310 ns
 [2021-03-18T19:39:04.462+0000][1][gc,marking     ] GC(3276) Concurrent Clear Claimed Marks
 [2021-03-18T19:39:04.462+0000][1][gc,marking     ] GC(3276) Concurrent Clear Claimed Marks 0.137ms
 [2021-03-18T19:39:04.462+0000][1][gc,marking     ] GC(3276) Concurrent Scan Root Regions
 [2021-03-18T19:39:04.482+0000][1][gc,marking     ] GC(3276) Concurrent Scan Root Regions 19.986ms
 [2021-03-18T19:39:04.482+0000][1][gc,marking     ] GC(3276) Concurrent Mark (19316.913s)
 [2021-03-18T19:39:04.482+0000][1][gc,marking     ] GC(3276) Concurrent Mark From Roots
 [2021-03-18T19:39:04.482+0000][1][gc,task        ] GC(3276) Using 11 workers of 11 for marking
 [2021-03-18T19:39:05.462+0000][1][safepoint      ] Safepoint "Cleanup", Time since last: 1000186658 ns, Reaching safepoint: 603855 ns, At safepoint: 43795 ns, Total: 647650 ns
 [2021-03-18T19:39:08.564+0000][1][gc,marking     ] GC(3276) Concurrent Mark From Roots 4082.489ms
 [2021-03-18T19:39:08.564+0000][1][gc,marking     ] GC(3276) Concurrent Preclean
 [2021-03-18T19:39:08.617+0000][1][gc,marking     ] GC(3276) Concurrent Preclean 52.501ms
 [2021-03-18T19:39:08.617+0000][1][gc,marking     ] GC(3276) Concurrent Mark (19316.913s, 19321.048s) 4135.083ms
 [2021-03-18T19:39:08.617+0000][1][gc,start       ] GC(3276) Pause Remark
 [2021-03-18T19:39:08.624+0000][1][gc             ] GC(3276) Pause Remark 31709M->31702M(32768M) 7.209ms
 [2021-03-18T19:39:08.625+0000][1][gc,cpu         ] GC(3276) User=0.17s Sys=0.04s Real=0.00s
 [2021-03-18T19:39:08.625+0000][1][safepoint      ] Safepoint "G1Concurrent", Time since last: 3154514507 ns, Reaching safepoint: 353958 ns, At safepoint: 7443548 ns, Total: 7797506 ns
 [2021-03-18T19:39:08.625+0000][1][gc,marking     ] GC(3276) Concurrent Rebuild Remembered Sets
 [2021-03-18T19:39:09.504+0000][1][gc,start       ] GC(3277) Pause Young (Normal) (G1 Humongous Allocation)
 [2021-03-18T19:39:09.504+0000][1][gc,task        ] GC(3277) Using 43 workers of 43 for evacuation
 [2021-03-18T19:39:09.504+0000][1][gc,age         ] GC(3277) Desired survivor size 109051904 bytes, new threshold 15 (max threshold 15)
 [2021-03-18T19:39:09.541+0000][1][gc,age         ] GC(3277) Age table with threshold 15 (max threshold 15)
 [2021-03-18T19:39:09.541+0000][1][gc             ] GC(3277) To-space exhausted
 [2021-03-18T19:39:09.541+0000][1][gc,phases      ] GC(3277)   Pre Evacuate Collection Set: 0.8ms
 [2021-03-18T19:39:09.541+0000][1][gc,phases      ] GC(3277)   Merge Heap Roots: 1.6ms
 [2021-03-18T19:39:09.541+0000][1][gc,phases      ] GC(3277)   Evacuate Collection Set: 14.4ms
 [2021-03-18T19:39:09.541+0000][1][gc,phases      ] GC(3277)   Post Evacuate Collection Set: 20.3ms
 [2021-03-18T19:39:09.541+0000][1][gc,phases      ] GC(3277)   Other: 0.5ms
 [2021-03-18T19:39:09.541+0000][1][gc,heap        ] GC(3277) Eden regions: 34->0(102)
 [2021-03-18T19:39:09.541+0000][1][gc,heap        ] GC(3277) Survivor regions: 1->0(0)
 [2021-03-18T19:39:09.541+0000][1][gc,heap        ] GC(3277) Old regions: 1967->1995
 [2021-03-18T19:39:09.541+0000][1][gc,heap        ] GC(3277) Archive regions: 0->0
 [2021-03-18T19:39:09.541+0000][1][gc,heap        ] GC(3277) Humongous regions: 46->0
 [2021-03-18T19:39:09.541+0000][1][gc,metaspace   ] GC(3277) Metaspace: 112607K(115708K)->112607K(115708K) NonClass: 99340K(101500K)->99340K(101500K) Class: 13267K(14208K)->13267K(14208K)
 [2021-03-18T19:39:09.541+0000][1][gc             ] GC(3277) Pause Young (Normal) (G1 Humongous Allocation) 32465M->31617M(32768M) 37.608ms
 [2021-03-18T19:39:09.541+0000][1][gc,cpu         ] GC(3277) User=0.85s Sys=0.05s Real=0.04s
 [2021-03-18T19:39:09.541+0000][1][safepoint      ] Safepoint "G1CollectForAllocation", Time since last: 877640595 ns, Reaching safepoint: 1084860 ns, At safepoint: 37973528 ns, Total: 39058388 ns
 [2021-03-18T19:39:11.416+0000][1][safepoint      ] Safepoint "PrintThreads", Time since last: 1860186604 ns, Reaching safepoint: 953011 ns, At safepoint: 13170786 ns, Total: 14123797 ns
 [2021-03-18T19:39:11.417+0000][1][safepoint      ] Safepoint "PrintJNI", Time since last: 180516 ns, Reaching safepoint: 1018775 ns, At safepoint: 37052 ns, Total: 1055827 ns
 [2021-03-18T19:39:11.418+0000][1][safepoint      ] Safepoint "FindDeadlocks", Time since last: 70564 ns, Reaching safepoint: 940848 ns, At safepoint: 57897 ns, Total: 998745 ns
 [2021-03-18T19:39:12.842+0000][1][gc,marking     ] GC(3276) Concurrent Rebuild Remembered Sets 4217.641ms
 [2021-03-18T19:39:12.843+0000][1][gc,start       ] GC(3276) Pause Cleanup
 [2021-03-18T19:39:12.856+0000][1][gc             ] GC(3276) Pause Cleanup 31877M->31877M(32768M) 12.297ms
 [2021-03-18T19:39:12.856+0000][1][gc,cpu         ] GC(3276) User=0.03s Sys=0.05s Real=0.01s
 [2021-03-18T19:39:12.856+0000][1][safepoint      ] Safepoint "G1Concurrent", Time since last: 1424551309 ns, Reaching safepoint: 848540 ns, At safepoint: 12507145 ns, Total: 13355685 ns
 [2021-03-18T19:39:12.856+0000][1][gc,marking     ] GC(3276) Concurrent Cleanup for Next Mark
 [2021-03-18T19:39:12.872+0000][1][gc,marking     ] GC(3276) Concurrent Cleanup for Next Mark 16.241ms
 [2021-03-18T19:39:12.872+0000][1][gc             ] GC(3276) Concurrent Cycle 8410.749ms
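The recurring "G1 Humongous Allocation" pauses in the log above point at large single allocations. In G1, any object larger than half a region is treated as humongous; with a 32GB heap the region size works out to 16MB, so any single allocation over ~8MB qualifies. A rough sketch of the arithmetic (assuming G1's usual ~2048-region target; the real heuristic has additional clamps between 1MB and 32MB):

```python
# Approximate G1 region sizing for a 32GB heap (sketch, not the exact JVM heuristic).
heap_bytes = 32 * 2**30             # -Xmx32g
target = heap_bytes // 2048         # G1 aims for roughly 2048 regions
region_size = 1 << (target.bit_length() - 1)   # round down to a power of two

# Any single allocation larger than half a region is "humongous"
humongous_threshold = region_size // 2

print(region_size // 2**20)         # 16 (MB per region)
print(humongous_threshold // 2**20) # 8 (MB humongous threshold)
```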

It’s running in the official Docker containers, with the default G1GC parameters:
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

I know we have a high shard count: 12,000 shards running on 6 data nodes, 3 ingest nodes and 3 master nodes.

Does anyone have an idea where to look to resolve this issue?

IMO the memory requirements are “high” because the number of shards is high. I would either make sure that your instances or Docker/Elasticsearch containers have enough memory, or add more Elasticsearch nodes to your cluster.

I have not used the official Docker containers, but I’m not sure those are the default values. I thought the default Java options were -Xms512m -Xmx512m (which would be way too low for your environment).

The memory limit for the heap is set at 32GB…
Before a low-disk-space issue last weekend, the stack worked great, even with 19,000 shards at its peak. After the rebuild, we always got the CircuitBreakingException[[parent], so I upgraded from 1.10 to 1.13 to see if it would fix the issue, but now I get it even with 12,000 shards…

There are a couple of things here…

#1 - You should not set the heap size to exactly 32GB. You need to stay slightly lower; 31GB is a common best practice, as it keeps the JVM below the threshold at which compressed object pointers (compressed oops) are disabled. For background on why this is important see the following:

#2 - 12,000 shards is an insane number of shards for an Elasticsearch cluster this size; 19,000 is even worse. Again, for background see the following blog. In particular, note this tip:

The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
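The rule of thumb above is easy to put into numbers for this cluster (the 20-shards-per-GB figure is from the quote; the cluster figures are from earlier in this thread):

```python
# Apply the 20-shards-per-GB-heap rule of thumb to this cluster's numbers.
shards_per_gb = 20                  # rule of thumb from the quote above
heap_gb = 30                        # ~30GB heap per data node
recommended_max = shards_per_gb * heap_gb     # shards per node the rule allows

total_shards = 12_000
data_nodes = 6
actual_per_node = total_shards // data_nodes  # shards per node in practice

print(recommended_max)              # 600
print(actual_per_node)              # 2000, i.e. more than 3x over the guideline
```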

I highly recommend that you revisit your sharding strategy.

  • The best option is to simply add more Elasticsearch nodes, thus reducing the number of shards per node.
  • If your indices have multiple shards on a single node, change to using a single shard per index.
  • If you have multiple shards per index because you are using JBOD and multiple data paths, it is better to configure the disks for RAID-0 (or RAID-10 for a single-node cluster). You will get better performance and be able to reduce shard count.
  • If you already have one shard per index (meaning you have a lot of indices) you need to combine multiple dataset per index. Add a field such as dataset or type so that you can more easily query specific subsets of the overall data.
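Working backwards from the rule of thumb, the options above can be quantified with some back-of-the-envelope math (numbers taken from this thread; the 600-shard ceiling assumes a ~30GB heap per node):

```python
import math

total_shards = 12_000
max_shards_per_node = 600           # 20 shards/GB * ~30GB heap

# Option 1: add nodes until each node is within the rule of thumb
nodes_needed = math.ceil(total_shards / max_shards_per_node)

# Option 2: shrink the shard count to fit the existing 6 data nodes
data_nodes = 6
max_total_shards = data_nodes * max_shards_per_node

print(nodes_needed)                 # 20 data nodes to hold 12,000 shards
print(max_total_shards)             # or at most 3,600 shards on 6 nodes
```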

First, thanks robcowart for all this information.

We currently have 6 nodes with 32GB of RAM each and a total of 12,000 shards. This is around 2,000 shards per node.
Each index has 6 shards (3 primaries and 3 replicas) and we have a total of 21TB of data (primary + replica).

Reducing the number of shards would only increase the amount of data in each of them, and since some are already over 100GB, I’m not sure it would reduce the amount of RAM consumed.

Without closing the index, is there a way to reduce the RAM it uses?
Is there a way to know how much memory a specific shard takes?

I just want to know if I can optimize the way RAM is used before adding new servers.

@Franckiboy What you should try is force merging segments. You should only do this once you are no longer writing to an index. At a low level, this combines many smaller files into a smaller number of larger files, which reduces the amount of metadata that must be maintained about which segment contains which documents (and other structures like terms and doc values). Here is an example from an index on one of our lab boxes…

First fetch the stats for the index.

GET elastiflow-flow-ecs-1.7-1.0-2021.w06/_stats/segments

Find the section indices.total.segments. In my example it looks like this:

"segments": {
  "count": 40,
  "memory_in_bytes": 1642928,
  "terms_memory_in_bytes": 1068128,
  "stored_fields_memory_in_bytes": 21056,
  "term_vectors_memory_in_bytes": 0,
  "norms_memory_in_bytes": 0,
  "points_memory_in_bytes": 0,
  "doc_values_memory_in_bytes": 553744,
  "index_writer_memory_in_bytes": 0,
  "version_map_memory_in_bytes": 0,
  "fixed_bit_set_memory_in_bytes": 0,
  "max_unsafe_auto_id_timestamp": -1,
  "file_sizes": {}
},

You can see that this index consists of 40 segments. Also note the memory_in_bytes value of 1642928.

Next, request a _forcemerge for the index.

POST elastiflow-flow-ecs-1.7-1.0-2021.w06/_forcemerge?max_num_segments=1&only_expunge_deletes=false

When the force merge is complete, get the stats again.

NOTE! For your large indices it may take a while for the force merge to complete.

"segments": {
  "count": 2,
  "memory_in_bytes": 73312,
  "terms_memory_in_bytes": 54336,
  "stored_fields_memory_in_bytes": 1680,
  "term_vectors_memory_in_bytes": 0,
  "norms_memory_in_bytes": 0,
  "points_memory_in_bytes": 0,
  "doc_values_memory_in_bytes": 17296,
  "index_writer_memory_in_bytes": 0,
  "version_map_memory_in_bytes": 0,
  "fixed_bit_set_memory_in_bytes": 0,
  "max_unsafe_auto_id_timestamp": -1,
  "file_sizes": {}
},

At a quick glance the memory_in_bytes is now only 73312, a reduction of about 1.5MB for this relatively small 2-shard index.
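That per-index saving compounds quickly at a high index count. A quick calculation using the two stats snapshots above (the 2,000-index figure is purely illustrative, not from this thread):

```python
# Per-index heap saving from the force merge, using the stats above.
before = 1_642_928                  # memory_in_bytes before the force merge
after = 73_312                      # memory_in_bytes after
saved = before - after              # ~1.5MB saved for this one index

# Purely illustrative: scale the saving across a hypothetical 2,000 indices
total_saved_gb = saved * 2000 / 2**30
print(saved)                        # 1569616 bytes
print(round(total_saved_gb, 2))     # roughly 2.92 GB of heap reclaimed
```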

Force merging an index, after data is no longer being written, is the primary way that you can minimize its overhead. You can use Open Distro’s ISM features to automatically force merge indices.


Thanks for your answer. This indeed fixed my memory issue on the data node. Thanks!