Node crashes and kNN questions

Hello!
I've run into something inexplicable: during kNN search the data nodes crash, in varying order and not every time (e.g. two searches succeed, then one fails, and so on).

Cluster:
master: 4 CPU, 4 GB
coordinator: 8 CPU, 8 GB (I'm not sure it plays a role; no load is visible on it)
3x data-node: 16 CPU, 64 GB, SSD
circuit_breaker_limit = 85%

Elasticsearch 7.9.1, with Open Distro for Elasticsearch 1.10.1 installed (1.10.1 is used because there was a problem with kNN score calculations in 1.11.0).

Cluster contents: each index is about 10 GB and there are 22 such indexes (Elasticsearch itself moves them across the data nodes as needed). Index fields: a vector of 128 elements plus 3 fields with additional data. Number of segments per index = number of shards per index = 1.
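For reference, the indices are created roughly like this (a minimal sketch using the elasticsearch-py client; the index and field names are made up, only the knn_vector type and dimension follow the k-NN plugin docs):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical layout: one 128-dimension knn_vector field plus three metadata fields.
es.indices.create(
    index="vectors-000001",
    body={
        "settings": {
            "index": {
                "knn": True,              # enable the k-NN plugin for this index
                "number_of_shards": 1,
                "number_of_replicas": 0,
            }
        },
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 128},
                "field_a": {"type": "keyword"},
                "field_b": {"type": "keyword"},
                "field_c": {"type": "long"},
            }
        },
    },
)
```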

Nodes crash

Initially I set the heap to 30 GB, then found experimentally that even with 4 GB the search over primary shards works properly: the 22 indexes are read from disk for about 30 seconds, after which each search takes milliseconds.

After I enabled 1 replica for all indexes, search with heap = 4 GB stopped working at all; with 12 GB < heap < 30 GB it works every other time. It crashes with the following error (example; the failing node may vary):

{'error': {'root_cause': , 'type': 'search_phase_execution_exception', 'reason': '', 'phase': 'fetch', 'grouped': True, 'failed_shards': , 'caused_by': {'type': 'node_not_connected_exception', 'reason': '[data-node-1][10.250.7.90:9300] Node not connected'}}, 'status': 500}

I tried reducing circuit_breaker_limit to 50%, but it had no effect…

If I understand correctly, reducing circuit_breaker_limit should just give me a higher search time (for example, the same 15-30 s), with indexes being rotated in and out of memory? Then why do the nodes fall over? I set logging.level: DEBUG, and there are no errors in the logs, apart from a message that the given data node is working again. When I set circuit_breaker_limit=20%:
(64 GB - 12 GB (JVM)) * 20% = 10.4 GB to hold in memory. But in Zabbix I saw memory usage grow by 40 GB! How is that possible?
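For completeness, I change the limit roughly like this (a sketch via the cluster settings API; I'm assuming circuit_breaker_limit above refers to the k-NN plugin's knn.memory.circuit_breaker.limit setting):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Cap the native (off-heap) memory the k-NN plugin may use for graphs.
# 20% is the value from the experiment described above.
es.cluster.put_settings(
    body={"persistent": {"knn.memory.circuit_breaker.limit": "20%"}}
)
```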

Trouble with replicas

When I created 1 replica for the indices above, I saw in /_opendistro/_knn/stats that the replicas were also being loaded into memory, while the search speed stayed the same. Is there any way to avoid this?
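I check the per-node graph memory roughly like this (a sketch; the endpoint is the one mentioned above, and graph_memory_usage is the per-node field as I understand it from the plugin docs):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The k-NN stats API reports, per node, how much native memory the HNSW
# graphs currently occupy, which is where the replicas showed up as well.
stats = es.transport.perform_request("GET", "/_opendistro/_knn/stats")
for node_id, node_stats in stats["nodes"].items():
    print(node_id, node_stats.get("graph_memory_usage"))
```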

The balancer problem (not the most important, but still)

The balancer tries to load the third data node more than the others.
While data nodes 1 and 2 use 13-14% of total disk space, data-node-3 uses 18%. No settings were changed; I don't understand where this problem comes from.

Vector size (saving SSD space)

Is it true that each element of the vector takes 4 bytes regardless of the data type? It looks like a vector of 512 float32 elements and a vector of 512 int8 elements do not differ in size. Is it possible to reduce this?
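My back-of-envelope estimate for the raw vector data alone, assuming 4 bytes per element regardless of the declared type (the document count below is made up just to illustrate the scale):

```python
BYTES_PER_ELEMENT = 4  # assumption: every element is stored as a float32

def vector_storage_gb(num_docs: int, dimension: int) -> float:
    """Rough size of the raw vectors only; graphs and other index structures add overhead."""
    return num_docs * dimension * BYTES_PER_ELEMENT / 1024 ** 3

# e.g. 100 million docs with 512-dimension vectors (hypothetical numbers)
print(f"{vector_storage_gb(100_000_000, 512):.1f} GB")  # ~190.7 GB
```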

I would be grateful for any help with these questions. I'm ready to provide any logs.

P.S. Sorry for my English.

Hi @doc113

Initially I set the heap to 30 GB, then found experimentally that even with 4 GB the search over primary shards works properly: the 22 indexes are read from disk for about 30 seconds, after which each search takes milliseconds.

After I enabled 1 replica for all indexes, search with heap = 4 GB stopped working at all; with 12 GB < heap < 30 GB it works every other time.

It crashes with the following error (example; the failing node may vary):
{'error': {'root_cause': , 'type': 'search_phase_execution_exception', 'reason': '', 'phase': 'fetch', 'grouped': True, 'failed_shards': , 'caused_by': {'type': 'node_not_connected_exception', 'reason': '[data-node-1][10.250.7.90:9300] Node not connected'}}, 'status': 500}

I think your understanding is correct. It seems like a memory issue. If I understand correctly, the crash only occurs when replicas are enabled?

When I set circuit_breaker_limit=20%:
(64 GB - 12 GB (JVM)) * 20% = 10.4 GB to hold in memory. But in Zabbix I saw memory usage grow by 40 GB! How is that possible?

Lucene may still be mapping segments into memory, causing memory to continue to grow.

While data nodes 1 and 2 use 13-14% of total disk space, data-node-3 uses 18%. No settings were changed; I don't understand where this problem comes from.

Interesting, how many shards are on each node? Also, can you see how many docs are in each shard?

Is it true that each element of the vector takes 4 bytes regardless of the data type?

Yes, it is true. We convert all numbers to floats here: https://github.com/opendistro-for-elasticsearch/k-NN/blob/master/src/main/java/com/amazon/opendistroforelasticsearch/knn/index/KNNVectorFieldMapper.java#L315.

Thanks for the reply!

I think your understanding is correct. It seems like a memory issue. If I understand correctly, the crash only occurs when replicas are enabled?

It looks like it was about the size of the segments. On the new indices I had not run /_forcemerge?max_num_segments=1, and with circuit_breaker_limit=40% it works correctly (no node failures).
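For reference, the force merge I had skipped on the new indices (a sketch; the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Merge all segments of the index down to one, i.e. the equivalent of
# POST /my-index/_forcemerge?max_num_segments=1
es.indices.forcemerge(index="my-index", max_num_segments=1)
```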

Interesting, how many shards are on each node? Also, can you see how many docs are in each shard?

There are 56 shards on each node.
1) First index type: 18 indices with 3 primary shards and 1 replica → 36 shards on each node, about 2 million docs per shard (512-dimension vector and 3 data fields), about 216 million docs in total;
2) Second index type (imbalanced): 128-dimension vector and 3 data fields
data-node-1: 14 shards, 110 million docs
data-node-2: 13 shards, 95 million docs
data-node-3: 17 shards, 140 million docs
3) Third index type (imbalanced): 320-dimension vector and 3 data fields
data-node-1: 17 shards, 46 million docs
data-node-2: 17 shards, 50 million docs
data-node-3: 26 shards, 77 million docs

The indices are created by timestamp range → they are not the same size.

Total:
data-node-1 uses 572 GB
data-node-2 uses 566 GB
data-node-3 uses 694 GB

The balancer regularly sends shards (after data is uploaded) to data-node-3.

The balancer regularly sends shards (after data is uploaded) to data-node-3.

Interesting. Do you see any insights from the _cluster/allocation/explain output?

Usually there's nothing there. Now the imbalance is getting extreme: 25%, 27%, and 50% (data nodes 1, 2, and 3, respectively), i.e. 660, 670, and 1100 GB respectively. Perhaps this is because the indexes are created automatically by date while the data is loaded?

Elasticsearch balances shards based on the number of shards on each node, not on the size of the shards. Do you see that the shards are uniformly distributed?

_cat/allocation should give you the distribution.
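For example (a sketch; the index name and shard number are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Shard count, disk usage, and disk percent per node.
print(es.cat.allocation(v=True))

# Why a particular shard is (or is not) allocated where it is.
print(es.cluster.allocation_explain(
    body={"index": "my-index", "shard": 0, "primary": True}
))
```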