Lots of beginner questions

Hi,

This is my first post, but I have been reading this forum for some time to keep up with the current state of OpenSearch and friends.
I'm currently setting up Elasticsearch + Kibana as well as OpenSearch + OpenSearch Dashboards at home (well, I already have; that was the easy part), and I'm now playing with Filebeat and Metricbeat to get some data in.

I'm still doing a lot of testing and I'm not yet 100% sure whether Elasticsearch/OpenSearch is what I need, so I hope someone can give me some feedback on whether things make sense.

What I'm trying to achieve is log and metrics monitoring of my home IT. I have a lot of small Orange Pis, some Debian machines, VMs, and Docker for a lot of services. Keeping all of them up to date and analysing why certain things go wrong is a big task. Right now I'm using Icinga for the monitoring part, but I would like to change this: I want to store the data somewhere, and also store the logs in a central location where I can play with anomaly detection and a centrally managed alerting system.
I know that I could keep Icinga and just forward its output, but that feels like making the system even more complicated than it should be. I want to make things simpler (in terms of components involved).

From reading the blogs at logz.io (excellent blog posts, by the way) I know that using OpenSearch for logs and Prometheus + Grafana for metrics is a commonly proposed setup, but to be honest that involves too many components, especially as I would still need Beats or Logstash for shipping the logs…
So right now I'm planning to store metrics in OpenSearch as well, and I am currently testing this (still with Elasticsearch, but eager to switch).
The big question right now is: does all this make sense to you? Am I on the right track?

The specific issues that are currently limiting me are the following:
Storing system metrics from just 2 machines for a few days is already eating up quite some disk space, so I would like to apply some retention to the metrics. I've set up ILM and am using Rollup Jobs, but here is the catch. When rolling up the data to 1d statistics and using a combination of raw_index + rollup_index for the dashboard, I'm limited to seeing only the aggregated data (1d). What I would like is to show the raw data when it is available (e.g. for the last 2 days) and only use the aggregated data when browsing older data. To my understanding the Elasticsearch version doesn't support this.
In addition, I would also like to have multiple rollups, similar to what InfluxDB or Graphite do: keep raw data for the last 2 days, use 1h aggregation for 1 week, use 1d aggregation for the rest of the month, etc.
Creating rollup jobs for this is not the problem (just a bit tedious, as only the time histogram changes), but in the end I can only combine a raw_index + one rollup_index in the index pattern.
I couldn't find any of those limitations in the docs for Open Distro or for OpenSearch. So the main question is: does it have the same limitation as the Elasticsearch version?
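
The tiered retention described above can be sketched in Python as follows. This only illustrates the selection logic (finest data that still covers the requested time range); the index names and the `pick_index` helper are hypothetical, and the sketch makes no claim about what the rollup features in Elasticsearch or OpenSearch actually do.

```python
from datetime import datetime, timedelta, timezone

# (maximum age, index name) pairs, ordered from finest to coarsest.
# The index names are made up for illustration only.
TIERS = [
    (timedelta(days=2), "metrics-raw"),         # raw data, last 2 days
    (timedelta(days=7), "metrics-rollup-1h"),   # 1h aggregation, 1 week
    (timedelta(days=31), "metrics-rollup-1d"),  # 1d aggregation, about a month
]


def pick_index(query_start, now):
    """Return the finest-grained index that still covers query_start."""
    age = now - query_start
    for max_age, index_name in TIERS:
        if age <= max_age:
            return index_name
    return None  # older than every tier: assumed to be deleted already


# Arbitrary reference time so the example is reproducible.
now = datetime(2021, 9, 1, 12, 0, tzinfo=timezone.utc)
print(pick_index(now - timedelta(hours=6), now))
print(pick_index(now - timedelta(days=5), now))
print(pick_index(now - timedelta(days=20), now))
```

A dashboard that wanted this behaviour would have to apply a choice like this per query, based on the start of the selected time range.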

If yes, how is this problem usually solved at an enterprise level?

Thx
Gagi

Hello, Gagi. I would not suggest storing metrics in OpenSearch or Elasticsearch; it's not well suited for that kind of data. You mentioned you are using Icinga, but you may want to switch to something like Prometheus as a better way to monitor the hosts. Since you are just using this at home, how much log data are we talking about here?

Hi,

I've considered Prometheus as well, but that means yet another service, and it also means I need yet another dashboard solution (e.g. Grafana).
I know that OpenSearch is not the perfect fit for metrics, especially not at large scale, but for something small like home lab monitoring shouldn't it still work? Could you also explain why you say OpenSearch is not well suited for it? Most of the articles I've read cite disk space and performance. Rollups should hopefully make the first manageable, and the second is probably not a problem for my amount of data?

With just Metricbeat enabled on one of my servers, using the default system module settings and sending all stats every 10s, I got 175k documents in 8h in my index, which takes 90MB. I know that this doesn't sound like much in OpenSearch terms, but for 8h of system metrics from just one system it's already quite a lot of data, and it doesn't scale very well.
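
For a rough sense of scale, the figures above can be extrapolated with a few lines of Python. This is a back-of-the-envelope sketch only: it assumes 1 MB = 1,000,000 bytes, a constant ingest rate, and that on-disk size grows linearly, none of which is guaranteed in practice.

```python
# Figures quoted in the post above.
docs = 175_000      # documents indexed
hours = 8           # collection window in hours
size_mb = 90        # index size on disk in MB
interval_s = 10     # Metricbeat collection period in seconds

# Number of collection cycles in the window.
cycles = hours * 3600 // interval_s

# Average documents written per collection cycle.
docs_per_cycle = docs / cycles

# Average on-disk size per document (assuming 1 MB = 1,000,000 bytes).
bytes_per_doc = size_mb * 1_000_000 / docs

# Linear extrapolation to one day and one year for a single host.
mb_per_day = size_mb * 24 / hours
gb_per_year = mb_per_day * 365 / 1000

print(f"{cycles} cycles, {docs_per_cycle:.1f} docs per cycle")
print(f"{bytes_per_doc:.0f} bytes per doc")
print(f"{mb_per_day:.0f} MB per day, {gb_per_year:.2f} GB per year")
```

Under those assumptions this works out to roughly 61 documents per 10s cycle, a little over 500 bytes per document, 270 MB per day, and just under 100 GB per year for one host without any retention.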

I've just been reading up on the Elasticsearch docs again, and it seems that having multiple jobs in one rollup_index is supposed to work; with that it should be possible to achieve what I would like to have.

The open question is whether the OpenSearch rollup provides this as well?

Grafana is best suited for time series analysis, and Prometheus is a very simple way of monitoring. You could remove Icinga and replace it with this combination, which is much more modern. Trying to do this in Elasticsearch or OpenSearch is really not a good idea. I can attest to this from many years of experience.

If I am allowed to dissent from the general opinion: for home use (not likely to exceed 100M events), ES/OS should be entirely suitable, although you might need to reindex a few times to split into more shards.

That said, I do agree that unless you need to do some complex on-demand analysis, there are much more viable solutions.

Do take into account that ES/OS is a very high-maintenance solution for home logging use.

For the matter at hand, I really don’t think you need any rollups for typical home monitoring (unless you’re indexing an event every few seconds - which is unnecessary).

You can use the built-in aggregations in dashboards and anomaly detection to do essentially the same job at query time, and very quickly.
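
As an illustration of what such a query-time aggregation looks like, here is a minimal Python sketch that builds a `date_histogram` search body with an hourly average. The helper name `hourly_avg_query` is hypothetical, and the field `system.cpu.total.pct` is assumed to be what the Metricbeat system module writes; adjust both to your own data.

```python
import json


def hourly_avg_query(field, days_back=30):
    """Build a search body that averages `field` per hour (hypothetical helper)."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {"range": {"@timestamp": {"gte": f"now-{days_back}d"}}},
        "aggs": {
            "per_hour": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": "1h",
                },
                "aggs": {"avg_value": {"avg": {"field": field}}},
            }
        },
    }


# Assumed field name from the Metricbeat system module.
body = hourly_avg_query("system.cpu.total.pct")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the raw index. The hourly buckets are computed on the fly, so nothing has to be pre-aggregated, at the cost of keeping the raw documents around.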

Do consider using other solutions - ES/OS is incredibly high maintenance due to sharding for a typical home setup and there are much better solutions.

Thx for your input.

You both say that ES is not suitable for this. Could you elaborate on that? And also on why Grafana is better for time-series analytics? E.g. how can things like anomaly detection be done there?
Sure, these things are not really needed for simple monitoring, but I would like to see whether they can be useful.

@hagayg What kind of maintenance are we talking about? Right now I have the feeling that once ILM and the rollups are configured, plus some additional cleanup of the rollup data after some time, it shouldn't need much maintenance.

I'm talking about rollups because I want to store some metrics for a longer time as well, e.g. several years. We are not talking about system metrics here, but rather smart home statistics like power consumption, room temperatures, etc. That's also where I see that Prometheus alone is probably not what I need, as it's not really meant for long-term storage and the recommendation is to use yet another system for that.
Could someone clarify whether the rollup API would allow me to do what I described above?

Switching to Prometheus or something else for metrics also doesn't mean I can remove ES, as I still need an instance for other components (file content search etc.).

Thx again for your input.

As your index grows, I am guessing some resharding would be required (depending on how big it is), as well as thinking about proper data models and relevant fields as you add new data.

In my honest opinion, assuming you won’t go over 50 GB (primary shards only) within the lifespan of this monitoring, there won’t be any need for rollups as aggregations can be quite efficient, especially if indexed properly.

As a side note, since much of it is likely to be highly structured data, I am guessing even a simple SQL solution with an aggregating view plus Metabase (or any other SQL dashboard) would do the trick, without using up as many resources and with almost no planning required.

To give you a 3rd opinion.

I agree with @jkowall that for most environments a proper time-series platform is better suited for most metrics/telemetry datasets. Elasticsearch/OpenSearch can be used, but it isn't very efficient in terms of time-series query speed and storage. It is the latter, storage, that will probably be the biggest challenge for a small environment. Personally I am not a fan of Prometheus, and would recommend InfluxDB and Telegraf as an alternative. However, they will all work, even ES/OS for time series, albeit with trade-offs.

Regarding @hagayg's comments about sharding as indices get larger: I suspect he made the mistake of trying to index all of his logs in a single index, rather than writing to daily, weekly, or monthly indices (monthly is probably fine for a home environment), or even rollover indices. This automatically breaks the overall dataset into multiple shards, as each new time-bounded index has its own shards. If done right, there is no reason that you can't do log collection at home or in a massive enterprise environment and not touch ES/OS for months.
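
To illustrate the idea of time-bounded indices, here is a small Python sketch that derives the target index name from an event's date. The `index_name` helper and the naming patterns are assumptions for illustration; in practice the log shipper or a rollover policy usually takes care of this.

```python
from datetime import date


def index_name(prefix, day, granularity="monthly"):
    """Return a time-bounded index name for an event date (hypothetical helper)."""
    if granularity == "daily":
        return f"{prefix}-{day:%Y.%m.%d}"
    if granularity == "weekly":
        # ISO week numbering, so weeks start on Monday.
        iso_year, iso_week, _ = day.isocalendar()
        return f"{prefix}-{iso_year}.w{iso_week:02d}"
    if granularity == "monthly":
        return f"{prefix}-{day:%Y.%m}"
    raise ValueError(f"unknown granularity: {granularity}")


# Arbitrary example date.
event_day = date(2021, 9, 1)
for granularity in ("daily", "weekly", "monthly"):
    print(index_name("logs", event_day, granularity))
```

Because every period gets its own index with its own shards, old data can be dropped by deleting whole indices, and no single index keeps growing indefinitely.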

I have not. Breaking them up, however, is exactly the sort of maintenance and effort I am talking about: planning ahead whether to use weekly indices, etc.

I'm currently reevaluating my options, including the TICK stack, and I'm also looking at Loki as a possible replacement for ES/OS.

I have already split up the indices and set up ILM to move them into the delete phase accordingly, so that's not really a problem, and for me this is not maintenance but rather "initial setup".

Remember that Loki is very limited in log search use cases. You need to narrow things down with metrics first before Loki works well. Similarly, you can't use Loki for tracing data as you can with OpenSearch. You would need to run Tempo, which has the same limitations. The other thing to consider is that Tempo and Loki are both AGPL licensed and not part of a software foundation, so in the future this may become another Elasticsearch situation with forks required. I would be very careful with technologies that may not be open source in the future.