Elasticsearch Master Nodes Unable to Join Cluster - OpenDistro Security Plugin

~12 hours ago, our Kibana instance stopped and restarted, which is when instability with our Elasticsearch master nodes began. The Kibana error was:

"message": "{ [security_exception] Open Distro Security not initialized for indices:admin/get :: {\"path\":\"/.kibana\",\"query\":{},\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"security_exception\\\",\\\"reason\\\":\\\"Open Distro Security not initialized for indices:admin/get\\\"}],\\\"type\\\":\\\"security_exception\\\",\\\"reason\\\":\\\"Open Distro Security not initialized for indices:admin/get\\\"},\\\"status\\\":503}\"}\n at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:349:15)\n at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:306:7)\n at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:173:7)\n at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4929:19)\n at IncomingMessage.emit (events.js:194:15)\n at endReadableNT (_stream_readable.js:1103:12)\n at process._tickCallback (internal/process/next_tick.js:63:19)\n status: 503,\n displayName: 'ServiceUnavailable',\n message:\n '[security_exception] Open Distro Security not initialized for indices:admin/get',\n path: '/.kibana',\n query: {},\n body:\n { error:\n { root_cause: [Array],\n type: 'security_exception',\n reason: 'Open Distro Security not initialized for indices:admin/get' },\n status: 503 },\n statusCode: 503,\n response:\n '{\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"Open Distro Security not initialized for indices:admin/get\"}],\"type\":\"security_exception\",\"reason\":\"Open Distro Security not initialized for indices:admin/get\"},\"status\":503}',\n toString: [Function],\n toJSON: [Function] }"

This occurred on two separate indices. While we could still access Kibana, the GUI showed another error on the main splash page: “Tenant Indices migration failed in Kibana”

A few hours later, all three of our master nodes went down or stopped. They could not, and still cannot, rejoin the cluster. We ran the following command:

sh "/usr/share/elasticsearch/plugins/opendistro_security/tools/securityadmin.sh" -cd "/usr/share/elasticsearch/plugins/opendistro_security/securityconfig" -icl -key "/usr/share/elasticsearch/config/kirk-key.pem" -cert "/usr/share/elasticsearch/config/kirk.pem" -cacert "/usr/share/elasticsearch/config/[REDACTED]" -nhnv --accept-red-cluster --diagnose

On Master 2, it returned the following output:

Open Distro Security Admin v7
Will connect to localhost:9300 ... done
Connected as CN=kirk,OU=client,O=client,L=test,C=de
Elasticsearch Version: 7.4.2
Open Distro Security Version: 1.4.0.0
Diagnostic trace written to: /usr/share/elasticsearch/config/securityadmin_diag_trace_2020-Nov-09_16-30-35.txt
Contacting elasticsearch cluster 'elasticsearch' ...
Clustername: [REDACTED]
Clusterstate: RED
Number of nodes: 103
Number of data nodes: 101
ERR: An unexpected MasterNotDiscoveredException occured: null
Trace:
MasterNotDiscoveredException[null]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:214)
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325)
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252)
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:598)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(Thread.java:835)

When run on Master 1, the output varied between runs. A few minutes ago, the “Number of nodes” was 27 and the “Number of data nodes” was 25. However, running the same command again just now returned the following:

Open Distro Security Admin v7
Will connect to localhost:9300 ... done
ERR: Cannot connect to Elasticsearch. Please refer to elasticsearch logfile for more information
Trace:
NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{REDACTED}{localhost}{REDACTED:9300}]]
        at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:352)
        at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:248)
        at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:57)
        at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:394)
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:396)
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:385)
        at com.amazon.opendistroforelasticsearch.security.tools.OpenDistroSecurityAdmin.execute(OpenDistroSecurityAdmin.java:520)
        at com.amazon.opendistroforelasticsearch.security.tools.OpenDistroSecurityAdmin.main(OpenDistroSecurityAdmin.java:153)

When trying to get the cluster IDs for these two master nodes, I got the response “Open Distro Security not initialized.”

Here is our scrubbed elasticsearch.yml for master nodes:

cluster.name: "REDACTED"
network.host: REDACTED
network.publish_host: #{node['ipaddress']}   ##ip-address of the host
node.master: true
node.data: false
node.ingest: false
node.name: master-#{node_id} 
bootstrap.memory_lock: true
gateway.expected_data_nodes: 1
gateway.recover_after_time: 15m
gateway.recover_after_master_nodes: 2
transport.tcp.port: 9300
discovery.zen.minimum_master_nodes: 2
discovery.seed_hosts: #{ips[0]},#{ips[1]},#{ips[2]}      ##ip-addresses of master nodes
cluster.initial_master_nodes: #{ips[0]},#{ips[1]},#{ips[2]}      ##ip-addresses of master nodes

######## Start OpenDistro for Elasticsearch Security Demo Configuration ########
# WARNING: revise all the lines below before you go into production
opendistro_security.ssl.transport.pemcert_filepath: REDACTED
opendistro_security.ssl.transport.pemkey_filepath: REDACTED
opendistro_security.ssl.transport.pemtrustedcas_filepath: REDACTED
opendistro_security.ssl.transport.enforce_hostname_verification: false
opendistro_security.ssl.http.enabled: true
opendistro_security.ssl.http.pemcert_filepath: REDACTED
opendistro_security.ssl.http.pemkey_filepath: REDACTED
opendistro_security.ssl.http.pemtrustedcas_filepath: REDACTED
opendistro_security.allow_unsafe_democertificates: true
opendistro_security.allow_default_init_securityindex: true
opendistro_security.authcz.admin_dn:
  - CN=kirk,OU=client,O=client,L=test, C=de

opendistro_security.audit.type: internal_elasticsearch
opendistro_security.enable_snapshot_restore_privilege: true
opendistro_security.check_snapshot_restore_write_privileges: true
opendistro_security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
cluster.routing.allocation.disk.threshold_enabled: false
node.max_local_storage_nodes: 3

######## End OpenDistro for Elasticsearch Security Demo Configuration ########
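Worth noting for ES 7.x (this cluster is on 7.4.2): `discovery.zen.minimum_master_nodes` is deprecated and ignored; quorum is tracked automatically in the cluster's voting configuration, which can be inspected on a reachable node. A minimal sketch — the localhost endpoint and the `-k`/credential flags are assumptions about this setup, not taken from the thread:

```shell
# Inspect the committed voting configuration (the set of master-eligible node
# IDs that currently form the quorum). The admin:admin credentials are demo
# placeholders, not this cluster's real ones.
curl -sk -u admin:admin \
  "https://localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination"
```

Also, in 7.x `cluster.initial_master_nodes` is consulted only during the very first cluster bootstrap; on an already-formed cluster it is ignored, and its entries are generally expected to match each node's `node.name` (here `master-N`).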

The error message shows that when Kibana tried to connect to Elasticsearch, ES hit this error.

Did you update your security plugin authN/authZ related configurations, or restart the nodes? You may check the elasticsearch.log of the nodes, especially the node startup portion, to see if there is anything abnormal.

No, we did not restart. However, we have a coordinating node on the Kibana host, and it crashed shortly before the masters went down. We did try restarting the masters a few times with no change.

Hi Brian, thanks for this info.

Based on this setup, I suspect the master nodes' inability to join the cluster caused the “Open Distro Security not initialized for indices:admin/get” issue on the Kibana side. That is, your coordinator node crashed and restarted, but because it somehow cannot connect to a master, it cannot join the cluster and therefore cannot initialize the security configuration.

Can you look at the elasticsearch.log of the master nodes and see if there are any errors in the logs?
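One quick way to test that hypothesis is to ask any reachable node whether it currently sees an elected master. A hedged sketch — endpoint, port, and flags are assumptions about this setup, and with the security plugin uninitialized these calls may themselves return 503:

```shell
# Which node (if any) is currently elected master; an empty response or an
# error suggests no master has been elected.
curl -sk "https://localhost:9200/_cat/master?v"

# Overall cluster health, including number_of_nodes and status.
curl -sk "https://localhost:9200/_cluster/health?pretty"
```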

@zengyan-amazon I can’t find any files named elasticsearch.log on our master nodes. We have application logs going to CloudWatch - do you suspect those are the same?

We haven’t updated the security plugin authN/authZ - could you give some more guidance on that?

Also, in the hyperlinked error, the code is

            if (!eval.isInitialized()) {
                log.error("Open Distro Security not initialized for {}", action);
                listener.onFailure(new ElasticsearchSecurityException("Open Distro Security not initialized for "
                + action, RestStatus.SERVICE_UNAVAILABLE));
                return;
            }

The log.error should tell us what the action is, correct? The output I am seeing is “Open Distro Security not initialized”, which looks different from the string referenced in the error.

@everbeck32 according to the Kibana log you pasted, the action in the log is indices:admin/get

Regarding the logs: depending on your installation, they may be saved at /var/log/elasticsearch/ or elasticsearch/logs/. You can refer to https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/logs/ for details

@zengyan-amazon sorry, I thought you were referring to the output I get from Elasticsearch when trying to run a curl command to get the cluster IDs from the Elastic master nodes.

Ok - yeah, I’ve got those logs. They’re constantly repeating the same entries over and over:

[2020-11-09T19:15:32,895][ERROR][c.a.o.s.a.BackendRegistry] [master-2] Not yet initialized (you may need to run securityadmin)
[2020-11-09T19:15:33,327][ERROR][c.a.o.s.f.OpenDistroSecurityFilter] [master-2] Open Distro Security not initialized for indices:admin/create

There is also this log:

[2020-11-09T19:15:43,117][ERROR][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [master-2] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

^ But the Elasticsearch docs say this is a non-issue.

This is the only unique error log I’m seeing otherwise:

[2020-11-09T19:25:00,214][ERROR][c.a.o.s.a.s.InternalESSink] [master-1] Unable to index audit log 
{
    "audit_cluster_name": "REDACTED",
    "audit_node_name": "master-1",
    "audit_category": "SSL_EXCEPTION",
    "audit_request_origin": "TRANSPORT",
    "audit_node_id": "dXXV0zqySKiE3zPTZgBgZQ",
    "audit_request_layer": "TRANSPORT",
    "audit_request_exception_stacktrace": "javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)\n\tat java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)\n\tat java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:320)\n\tat java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:263)\n\tat java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:258)\n\tat java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:129)\n\tat java.base/sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:681)\n\tat java.base/sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:636)\n\tat java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:454)\n\tat java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:433)\n\tat java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:634)\n\tat io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:282)\n\tat io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1329)\n\tat io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224)\n\tat io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271)\n\tat io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421)\n\tat 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:835)\nCaused by: javax.crypto.BadPaddingException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)\n\tat java.base/sun.security.ssl.SSLCipher$T13GcmReadCipherGenerator$GcmReadCipher.decrypt(SSLCipher.java:1878)\n\tat java.base/sun.security.ssl.SSLEngineInputRecord.decodeInputRecord(SSLEngineInputRecord.java:240)\n\tat java.base/sun.security.ssl.SSLEngineInputRecord.decode(SSLEngineInputRecord.java:197)\n\tat java.base/sun.security.ssl.SSLEngineInputRecord.decode(SSLEngineInputRecord.java:160)\n\tat java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:108)\n\t... 27 more\n",
    "@timestamp": "2020-11-09T19:23:00.210+00:00",
    "audit_request_effective_user_is_admin": false,
    "audit_format_version": 4,
    "audit_node_host_address": "REDACTED",
    "audit_node_host_name": "REDACTED"
}

^ This looks like the SSL error mentioned above …

We’re also getting failed shard notifications, but I didn’t paste those because they didn’t seem relevant. Let me know if those would be helpful.

I expanded my search further back and have a few more error logs to send:

[2020-11-09T02:53:55,112][ERROR][o.e.c.s.ConsistentSettingsService] [master-3] unable to publish secure settings hashes
[2020-11-09T02:55:55,411][ERROR][c.a.o.a.a.AlertIndices   ] [master-1] Error creating alert indices. Alerts can't be recorded until master node is restarted.

@everbeck32 Thanks for sharing this information. I am checking with our security plugin team to see if they have more context about the node initialization failure and this SSL exception.

@everbeck32 Another thought: this could be related to cluster formation. For example, before the event happened, you had 3 master nodes, and master-1 was the elected master of the cluster at that time.

Brian mentioned that your cluster’s master nodes went down somehow. If two or more of your three master nodes were down, your cluster lost its quorum.

Would you mind looking at the logs on all 3 master nodes and checking the node-initialization entries, especially the node discovery part, to see which node was elected as the new master, or whether the cluster failed to elect a master node due to some errors?
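When going through the master nodes' logs, a few ES 7.x cluster-formation messages are worth grepping for. A sketch, assuming the package-install log path (adjust for your layout):

```shell
# "elected-as-master"              - a node won an election
# "master not discovered"          - elections are failing (quorum problem)
# "ClusterFormationFailureHelper"  - periodic report of why formation is
#                                    failing, including which peers were seen
grep -E "elected-as-master|master not discovered|ClusterFormationFailureHelper" \
  /var/log/elasticsearch/*.log | tail -n 50
```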

Yes - All three nodes went down within ~40 minutes of each other.

Are you asking me to find what discovery logs looked like after the first node went down (roughly 13 hours ago)? Or are you asking me to look at logs related to node discovery now that all three are down?

I think this issue is somewhat similar to this topic: About “Tenant indices migration failed” error in kibana

@everbeck32 I think looking at the node discovery logs from after you restarted your master nodes would help us understand the current cluster status and identify a possible fix

A Kibana tenant index migration failure won’t impact Elasticsearch cluster health status. In your case, however, it looks like the Elasticsearch cluster lost its master quorum; we need to fix that to bring ES back

I’ll send the discovery logs over a private channel. I agree that the migration failure isn’t impacting ES cluster health status, but I think the OpenDistro Security plugin failing to initialize is preventing the master nodes from fully coming up and joining the cluster.

@everbeck32 yes, I agree that the OpenDistro Security plugin failing to initialize is the problem here. Based on my understanding, the OpenDistro Security ES plugin tries to load the security configuration from the .opendistro_security index, then initializes itself according to that configuration (if the index doesn’t exist, it loads from the yml config files).

In this case, though, it looks like the master nodes are not able to join the cluster and thus cannot load the security configuration. It might be because of the SSL handshake problem or a cluster-formation issue; we are not sure about that yet
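Once a master is elected again, it may help to confirm that the .opendistro_security index exists and is assigned before re-running securityadmin.sh. A sketch using the same admin cert pair as the securityadmin command earlier in the thread (paths and endpoint are assumptions):

```shell
# Health/status of the security config index; a "red" status or a 404 here
# would explain why the plugin cannot initialize from the index.
curl -sk --cert /usr/share/elasticsearch/config/kirk.pem \
     --key  /usr/share/elasticsearch/config/kirk-key.pem \
     "https://localhost:9200/_cat/indices/.opendistro_security?v"
```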

EDIT: (I am limited to 3 replies in this topic)
@everbeck32 Thanks for sending the logs over. I see some entries stating Received response for a request that has timed out. Are master-1 or your other master nodes overloaded, e.g. experiencing high CPU utilization or high JVM pressure?

e.g. if you have a lot of indices/shards in your cluster while your master nodes are not powerful enough, you may experience problems when recovering your cluster
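Besides host-level CPU, master-node pressure usually shows up in JVM heap and pending cluster-state tasks. A hedged sketch of where to look — these are standard ES 7.x endpoints, but the localhost address and missing auth flags are assumptions:

```shell
# Per-node JVM heap usage and OS CPU as Elasticsearch itself sees them.
curl -sk "https://localhost:9200/_nodes/stats/jvm,os?pretty"

# A long queue of pending cluster-state tasks is a classic sign of an
# overloaded master.
curl -sk "https://localhost:9200/_cat/pending_tasks?v"

# Total shard count; very high counts inflate cluster state and can slow
# elections and recovery.
curl -sk "https://localhost:9200/_cat/shards" | wc -l
```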

@zengyan-amazon got it - feel free to keep editing that message with updates lol

CPU, at its highest, was around 33% for master nodes over the last 24 hours. It’s pretty consistently below 30%.

We do have a lot of indices and shards in our cluster, though - could that manifest anywhere else besides CPU/JVM?

@zengyan-amazon hey, interesting. I was looking at logs for our warm and hot data nodes and then the Elasticsearch master nodes. A warm node basically only logged “Running Full Sweep” until midday Sunday. Then, bam, all of these warm and hot nodes started being added/removed, and that behavior just keeps happening over and over.

[2020-11-08T19:05:51,224][INFO ][o.e.c.s.ClusterApplierService] [data-10-warm] removed {{data-16-warm}{JZbmhoUaTP-3-IJtYWPp7Q}{jtMZ5uwqSvSHD6Or2ofmeQ}{REDACTED}{REDACTED:9300}{di}{aws_availability_zone=us-east-1a, box_type=warm},{data-61-warm}{REDACTED}{REDACTED}{REDACTED}{REDACTED}{di}{aws_availability_zone=us-east-1a, box_type=warm},{data-19-warm}{jelSq-KrTA-XAj2hwyIouA}{uHFiIpMRQeO98ajHoQ6BDw}{REDACTED}{REDACTED}{di}{aws_availability_zone=us-east-1a, box_type=warm},}, term: 44, version: 4298678, reason: ApplyCommitRequest{term=44, version=4298678, sourceNode={master-2}{REDACTED}{REDACTED}{REDACTED}{REDACTED}{m}}
[2020-11-08T19:06:18,437][INFO ][o.e.c.s.ClusterApplierService] [data-143-warm-kibana] added {{data-61-warm}{Kk_Q6-kJSbyMvOahIyTBdA}{75JsFPFdRCCLEP7g1-kFnw}{REDACTED}{REDACTED:9300}{di}{aws_availability_zone=us-east-1a, box_type=warm},{data-16-warm}{JZbmhoUaTP-3-IJtYWPp7Q}{jtMZ5uwqSvSHD6Or2ofmeQ}{REDACTED}{REDACTED:9300}{di}{aws_availability_zone=us-east-1a, box_type=warm},{data-19-warm}{REDACTED}{REDACTED}{REDACTED}{REDACTED:9300}{di}{aws_availability_zone=us-east-1a, box_type=warm},}, term: 44, version: 4298679, reason: ApplyCommitRequest{term=44, version=4298679, sourceNode={master-2}{REDACTED}{REDACTED}{REDACTED}{REDACTED}{m}}

I started looking into the CPU again. When I look at the EC2 CPU, it’s about 20% consistently. However, when I run “docker stats”, it shows CPU at 150%.
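Those two numbers are not directly comparable: `docker stats` reports CPU as a percentage of a single core, so it can exceed 100% on a multi-core host, while EC2 CloudWatch reports a percentage of all vCPUs. A sketch of the conversion (the 8-vCPU instance size is an assumption for illustration, not from the thread):

```shell
# 150% of one core on an assumed 8-vCPU host, as a share of total capacity:
awk 'BEGIN { printf "%.2f\n", 150 / 8 }'   # prints 18.75
```

A container-level 150% on an 8-vCPU host works out to roughly 19% of total capacity, which is consistent with the ~20% EC2-level reading.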