Architecture questions

#1

Hello,

I was quickly looking at the source code of https://github.com/opendistro-for-elasticsearch/performance-analyzer and I would like to ask some questions about the general architecture. Perf Analyzer seems to be a native ES plugin that stores metrics in a file on disk (MetricDB) at 5-second intervals. There is also a standalone Java daemon (PerformanceAnalyzerApp.java) that is started (for each ES node?), listens on port 9600, responds to client HTTP requests (like requests from the CLI tool), and can serve accumulated stats from one or more nodes. Am I at least close?

One thing that is not very clear to me is why in blog posts (like the recent https://aws.amazon.com/blogs/opensource/analyze-your-open-distro-for-elasticsearch-cluster-using-performance-analyzer-and-perftop/) it is often said that:

Performance Analyzer is an agent and REST API […] independent of the Java Virtual Machine (JVM)

What exactly do you mean by independence? Isn’t Perf Analyzer running in the same JVM as all other ES plugins and sharing it with the ES node itself? What happens when ES is under heavy load: can it still collect ES metrics? Maybe I am missing something.

Regards,
Lukáš

#2

Hi Lucas,

We made a conscious decision to keep only the most essential instrumentation logic inside the Elasticsearch process and move everything else into the Performance Analyzer agent. The plugin writes events, such as HTTP requests from a user, into /dev/shm. The agent then processes these events, enriches them with system and OS statistics (CPU utilization), and generates the metricsDB file every 5 seconds. A few advantages of process isolation are:

  • The agent web service does not suffer from resource contention (threads/GC/memory) and hence can service requests for metrics independently.

  • The agent can aggregate results from other nodes which are healthy, independent of the health of the Elasticsearch process on that node.

  • Any computation and memory resources consumed by the agent are separate from the Elasticsearch process. By starting the agent in a separate cgroup, you can be assured that the metrics computation piece will not contend for system resources meant for Elasticsearch.
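To make the split between the plugin and the agent more concrete, here is a minimal sketch of the handoff pattern described above: one side drops small event files into a shared directory, and a separate process drains them and enriches them with a system statistic. This is only an illustration, not the actual Performance Analyzer code. The function names, the JSON file format, and the snapshot fields are all hypothetical, and a temporary directory stands in for /dev/shm.

```python
import json
import os
import tempfile


def write_event(event_dir, event):
    """Plugin side (hypothetical): drop one event file into the shared directory."""
    final_path = os.path.join(event_dir, f"{event['id']}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(event, f)
    # Rename into place so the reader never picks up a half-written file.
    os.replace(tmp_path, final_path)


def collect_snapshot(event_dir, cpu_utilization):
    """Agent side (hypothetical): drain pending events and enrich them with an OS stat."""
    events = []
    for name in sorted(os.listdir(event_dir)):
        if not name.endswith(".json"):
            continue  # skip files that are still being written
        path = os.path.join(event_dir, name)
        with open(path) as f:
            events.append(json.load(f))
        os.remove(path)  # each event is consumed exactly once
    return {
        "cpu_utilization": cpu_utilization,  # assumed to come from the OS, not from ES
        "http_requests": len(events),
        "events": events,
    }


if __name__ == "__main__":
    # A temporary directory stands in for /dev/shm in this sketch.
    with tempfile.TemporaryDirectory() as event_dir:
        write_event(event_dir, {"id": 1, "type": "http_request"})
        write_event(event_dir, {"id": 2, "type": "http_request"})
        print(collect_snapshot(event_dir, cpu_utilization=0.42))
```

In a real deployment the two functions would run in different processes, with the reader invoked on a timer (every 5 seconds in the description above). Because the writer only appends small files and never waits on the reader, the instrumented process does very little work even when the reader is slow or stopped.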

We will be adding more detailed design docs to both Performance Analyzer and PerfTop.