"Connection reset by peer" error messages

I’m seeing “connection reset by peer” error messages appearing in my Elasticsearch log frequently. In a newly restaged deployment on Kubernetes, it’s occurring every 30 seconds. My assumption has been that the connection in question was from Fluent Bit as it was attempting to send new log messages to Elasticsearch. But I don’t see any messages (errors or otherwise) on the Fluent Bit side, so now I’m questioning the assumption.

I’ve included an example of the error message and associated stack trace below. Can someone verify that this is most likely related to the Fluent Bit connection? All of the errors are coming from ES client nodes. I had 2 client nodes and upped it to 3 but it hasn’t reduced the number of errors. Fluent Bit is the only thing feeding documents to ES.

I’ve deployed with the sample demo security enabled including the demo certs. Could that (or TLS, in general) be a factor here?

[2020-04-03T03:06:41,898][ERROR][c.a.o.s.s.h.n.OpenDistroSecuritySSLNettyHttpServerTransport] [v4m-es-client-5648c4cb49-kf84r] Exception during establishing a SSL connection: java.io.IOException: Connection reset by peer
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:233) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358) ~[?:?]
        at org.elasticsearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:137) ~[transport-netty4-client-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:122) ~[transport-netty4-client-7.4.2.jar:7.4.2]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
        at java.lang.Thread.run(Thread.java:835) [?:?]
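
One way to confirm the 30-second cadence is to pull the timestamps of these errors out of the ES log and look at the gaps between them. The following is only a rough sketch: it assumes the log lines start with a timestamp in the format shown above, and the function names `reset_timestamps` and `gaps_in_seconds` are made up for illustration.

```python
import re
from datetime import datetime

# Leading timestamp of an Elasticsearch log line, e.g.
# [2020-04-03T03:06:41,898][ERROR]...
# (assumed format, taken from the sample message above)
ES_TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")


def reset_timestamps(lines):
    """Return the timestamps of log lines that report a connection reset."""
    found = []
    for line in lines:
        if "Connection reset by peer" not in line:
            continue
        match = ES_TS.match(line)
        if match:  # stack-trace lines carry no timestamp and are skipped
            found.append(
                datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")
            )
    return found


def gaps_in_seconds(timestamps):
    """Return the gap, in seconds, between each pair of consecutive errors."""
    return [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
```

If nearly every gap comes out close to 30 seconds, whatever is opening and dropping the connection is running on a fixed schedule, which would be worth comparing against the Fluent Bit flush interval and anything else that connects to the client nodes on a timer.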

In another deployment where I’m seeing the same “connection reset by peer” error messages in the ES log, I do see Fluent Bit messages that seem to line up with the ES errors (as shown in the following screenshot):

The ES_ERROR_PLOT shows instances of the “Connection reset by peer” message from the ES log, the FB_FAILED_PLOT shows instances of “failed to flush chunk” messages in the Fluent Bit log, and the FB_SUCCESS_PLOT shows instances of “succeeded at retry” messages in the Fluent Bit log (which are generated when a chunk is successfully sent to ES after previously failing).

So, things seem “related” even if I can’t say one caused the other. As mentioned in the original post, I have seen instances of either error message without the corresponding one on the other side.
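
To put a number on how well the two sides line up, the events can be bucketed per minute and compared. This is a minimal sketch that assumes the timestamps from each log have already been parsed into `datetime` objects; `per_minute` and `compare` are hypothetical helper names, and one-minute buckets are an arbitrary choice.

```python
from collections import Counter
from datetime import datetime


def per_minute(timestamps):
    """Count events per minute by truncating each timestamp to the minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


def compare(es_times, fb_times):
    """Split minutes into: seen on both sides, ES only, Fluent Bit only."""
    es, fb = per_minute(es_times), per_minute(fb_times)
    both = sorted(es.keys() & fb.keys())
    only_es = sorted(es.keys() - fb.keys())
    only_fb = sorted(fb.keys() - es.keys())
    return both, only_es, only_fb
```

A large `only_es` list would suggest that many of the resets on the ES side have no matching “failed to flush chunk” message in Fluent Bit, which is the case that made me question the original assumption.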

@dbbaughe
I did my original prototyping work with the Elastic distribution of ES and didn’t see these errors (although I never went looking for them). Since the stack trace includes references to ODFE components, could these communication issues be signs of an issue with the ODFE security plugin? Hmmm, I suppose I didn’t have TLS enabled then either. Any thoughts on what may be causing the connection issues?

Any solution on this?
We have a similar issue with Opendistro for ES version 0.10.0

Sorry, @atorelli, this problem stopped for me, but I do not remember why. I don’t believe I changed any connection configuration settings, but I may have improved the parsing I was doing in Fluent Bit to make sure the data going to Elasticsearch was formatted more consistently.
