How to deal with "knn.circuit_breaker.triggered stays set. Nodes at max cache capacity"

Hi,
I’m currently seeing this message in our cluster (3 master nodes, version 1.12.0) and therefore cannot index anymore.

Each node has about 21 GB of memory in total, with a heap size of 20% of that 21 GB.
I enabled cache expiry with a 15-minute timeout.

Is there anything else I can do in a running system to unload graphs?

Hi @mafr

How many data nodes are there?

Another way to unload the cache is to toggle a setting: for example, you could set “knn.memory.circuit_breaker.limit” to null and then return it to its previous value. This will evict all entries in the cache and allow you to index.
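A minimal sketch of that reset, written as the two request bodies you would send in order. It assumes the setting is changed through the standard `_cluster/settings` API and stored as a persistent setting; the helper name `build_reset_payloads` and the "60%" value are hypothetical placeholders, so substitute whatever limit your cluster had before.

```python
import json

SETTING = "knn.memory.circuit_breaker.limit"


def build_reset_payloads(previous_limit, scope="persistent"):
    """Return the two bodies for PUT _cluster/settings, in the order to send them."""
    if scope not in ("persistent", "transient"):
        raise ValueError("scope must be 'persistent' or 'transient'")
    # Step 1: null resets the setting to its default, which evicts the cache.
    clear = {scope: {SETTING: None}}
    # Step 2: put the previous value back.
    restore = {scope: {SETTING: previous_limit}}
    return [clear, restore]


# "60%" is only an example value, not necessarily your cluster's limit.
for body in build_reset_payloads("60%"):
    print("PUT _cluster/settings")
    print(json.dumps(body))
```

The script only prints the bodies; send them with whatever HTTP client you normally use against the cluster.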

We have a long-standing issue for this here: "API to flush indices out of cache", Issue #68 in opendistro-for-elasticsearch/k-NN on GitHub.

Feel free to +1 it.

Jack

Currently none, there are only 3 master nodes.

What I do not completely understand: if the circuit breaker is triggered, it should unload graphs, right?

If memory usage exceeds this value, KNN removes the least recently used graphs.

And therefore eventually untrip? But it looks to me like this never happens.

Right, this was a design decision we made.

Assume that there are 10 graphs each occupying 6 GiB of memory and the circuit breaker limit is 59 GiB. Assuming all 10 graphs are searched, only 9 graphs will be able to fit in the cache at one time, making the maximum cache size 54 GiB.

So, we decided to trip the circuit breaker whenever there is a cache eviction due to capacity: k-NN/KNNIndexCache.java at main · opendistro-for-elasticsearch/k-NN · GitHub.

Then, to prevent the situation described above from constantly tripping and untripping the circuit breaker, we decided to untrip it only when cache usage is back down to 75% of capacity: k-NN/KNNSettings.java at main · opendistro-for-elasticsearch/k-NN · GitHub
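To make the trip/untrip behaviour concrete, here is a toy model of it. This is not the plugin's code: the class name, the point at which the untrip check runs, and the use of `<=` at the threshold are all assumptions made for illustration.

```python
class ToyGraphCache:
    """Toy model: trip on capacity eviction, untrip at or below 75% of the limit."""

    def __init__(self, limit_gib, untrip_fraction=0.75):
        self.limit_gib = limit_gib
        self.untrip_fraction = untrip_fraction
        self.graphs = {}      # name -> size in GiB; insertion order = LRU order
        self.tripped = False

    def used_gib(self):
        return sum(self.graphs.values())

    def load(self, name, size_gib):
        """Load (or re-touch) a graph, evicting least recently used graphs if needed."""
        self.graphs.pop(name, None)
        while self.graphs and self.used_gib() + size_gib > self.limit_gib:
            oldest = next(iter(self.graphs))
            del self.graphs[oldest]
            self.tripped = True   # an eviction due to capacity trips the breaker
        self.graphs[name] = size_gib

    def expire(self, name):
        """Drop a graph for a non-capacity reason (e.g. cache expiry), then re-check."""
        self.graphs.pop(name, None)
        # Assumption: the toy model re-checks the threshold right after expiry.
        if self.tripped and self.used_gib() <= self.untrip_fraction * self.limit_gib:
            self.tripped = False


cache = ToyGraphCache(limit_gib=59)
for i in range(10):
    cache.load(f"graph-{i}", 6)
print(cache.used_gib(), cache.tripped)   # 54 True
cache.expire("graph-1")
print(cache.used_gib(), cache.tripped)   # 48 True
cache.expire("graph-2")
print(cache.used_gib(), cache.tripped)   # 42 False
```

With ten 6 GiB graphs and a 59 GiB limit, loading the tenth graph evicts the oldest one and trips the breaker; it only clears once usage falls to 44.25 GiB or less.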

Thank you for that explanation. Just to understand this correctly:
75% of 59 GB = 44.25 GB
Therefore the untripping would happen as soon as there are no more than 7 graphs in memory (42 GB).
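The same arithmetic as a quick check, assuming equally sized 6 GB graphs as in the example above (the variable names are just for illustration):

```python
import math

limit_gb = 59
graph_gb = 6

# Graphs that fit under the limit at all.
max_cached = math.floor(limit_gb / graph_gb)
# Usage at or below which the breaker would untrip.
untrip_threshold_gb = 0.75 * limit_gb
# Largest number of whole graphs that stays under that threshold.
max_at_untrip = math.floor(untrip_threshold_gb / graph_gb)

print(max_cached, max_cached * graph_gb)        # 9 54
print(untrip_threshold_gb)                      # 44.25
print(max_at_untrip, max_at_untrip * graph_gb)  # 7 42
```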

The 59GB limit in your example is the memory limit of one node or the cluster?

As for my initial problem, the issue must then have been that more than 9 graphs were requested within the 15-minute cache expiry window, and therefore it was never able to unload more?

I guess you’re not planning to have a feature like: evict_complete_graphcache_if_circuit_breaker_stays_active_for=20min
:slight_smile: ?

One more question regarding:

set “knn.memory.circuit_breaker.limit” to null and then return it to its previous value

Will setting the limit to null have dangerous effects, even though I will try to set the original value directly afterwards?

Thank you very much for the explanation so far

The 59GB limit in your example is the memory limit of one node or the cluster?

Yes correct.

As for my initial problem, the issue must then have been that more than 9 graphs were requested within the 15-minute cache expiry window, and therefore it was never able to unload more?

Right, for an individual query, all graphs for the field you are interested in will be loaded into the cache.

I guess you’re not planning to have a feature like: evict_complete_graphcache_if_circuit_breaker_stays_active_for=20min
:slight_smile: ?

I do not think we have a plan for this at the moment. Once the cache clear API is implemented, this functionality could be implemented on the client side.
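A rough sketch of what such client-side logic could look like. Everything here is hypothetical: the `BreakerWatchdog` class is made up for illustration, the `triggered` flag is assumed to come from however you poll the circuit breaker state (for example the knn.circuit_breaker.triggered value), and the action taken when it returns True would be whatever eviction mechanism is available, such as the limit reset described earlier in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BreakerWatchdog:
    """Tracks how long the breaker has been continuously triggered."""

    max_active_seconds: float
    tripped_since: Optional[float] = None

    def observe(self, triggered: bool, now: float) -> bool:
        """Record one poll result; return True when an eviction is due."""
        if not triggered:
            self.tripped_since = None       # breaker cleared on its own
            return False
        if self.tripped_since is None:
            self.tripped_since = now        # first poll that saw it triggered
        if now - self.tripped_since >= self.max_active_seconds:
            self.tripped_since = None       # start a fresh window after acting
            return True
        return False


watchdog = BreakerWatchdog(max_active_seconds=20 * 60)
print(watchdog.observe(True, now=0))      # False
print(watchdog.observe(True, now=600))    # False
print(watchdog.observe(True, now=1200))   # True
print(watchdog.observe(False, now=1260))  # False
```

Passing the timestamp in explicitly keeps the logic independent of the polling loop, so you can drive it from a cron job or any scheduler.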

Will setting the limit to null have dangerous effects, even though I will try to set the original value directly afterwards?

Setting the limit to null will just return the value to its default (which will trigger a cache rebuild where all entries are evicted). This is not dangerous.

Jack
