How do you feed the Anomaly Detection Plugin existing data?

When creating a “Feature” you can preview its results against existing data from a past time range. However, the live detector and its features only seem to start collecting new logs and aren’t aware of existing data.

I was expecting to be able to “feed” it several years’ worth of historical data containing seasonal trends, then visualize past date ranges and see where it would have detected anomalies. Am I missing something?


Hi @GregT,

This feature is for real-time streaming. We are working on historical data support as well. Would you be able to share details on the years’ worth of data? Are you looking at month-over-month trends?

Thanks,
Pavani

Hi @bpavani,

I have what I assume is a typical use case: many applications storing their logs in ELK. Our business has seasonal trends, and I’d love anomaly detection to be aware of that so we could look at year-over-year trends as well as month-over-month. I have barely scratched the surface reading about how it works (the RCF algorithm), but the documentation alludes to it being aware of seasonal behavior, so I hope at some point we can feed it historical data and it will take that into account.
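For anyone curious what “feeding” a streaming detector historical data could look like in principle, here is a minimal sketch using a toy rolling-statistics detector. This is purely illustrative and is not the plugin’s RCF implementation or API; the point is only that a warm-up pass over history gives the model a baseline before live data arrives:

```python
from collections import deque
import math

class StreamingDetector:
    """Toy streaming anomaly detector: flags points more than
    `threshold` standard deviations from a rolling mean."""
    def __init__(self, window=48, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def warm_up(self, historical_values):
        """Pre-train on historical data so the model already knows
        the baseline before live streaming starts."""
        for v in historical_values:
            self.window.append(v)

    def score(self, value):
        """Return True if `value` looks anomalous, then absorb it."""
        if len(self.window) >= 2:
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            is_anomaly = abs(value - mean) > self.threshold * math.sqrt(var) if var > 0 else False
        else:
            is_anomaly = False
        self.window.append(value)
        return is_anomaly

# Warm up on a seasonal-looking history, then stream live points.
history = [100 + 10 * math.sin(i / 8) for i in range(200)]
det = StreamingDetector()
det.warm_up(history)
print(det.score(105))   # within the learned band -> False
print(det.score(500))   # far outside the band -> True
```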

Thanks!

hi, @GregT

Thanks for providing your use case. Currently, anomaly detection mainly takes in real-time streaming data and detects anomalies in it, so it may take some time for the model to become aware of seasonal patterns.

Let me try to analyze your case to make sure we understand your requirements correctly and completely. Please correct me or add more use cases.
1). You already know some patterns from your historical data and want to train the model on that history first, so that when we start detecting on streaming data, the model has already adopted the seasonal pattern and is ready to use.

2). Do you need to detect anomalies in historical data to verify that the model is good enough? As with general ML methods, you would split the historical data into a training set and a test set, so you can verify the results and tune accordingly.

3). Detect anomalies in historical data for analysis. Some historical data may not follow your seasonal behavior; do you need to know about these historical anomalies to analyze and tune your business strategy, processes, etc.?
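Use case 2) above is essentially a chronological train/test split. A minimal sketch with a toy threshold detector and hand-labeled test points (all names and numbers are illustrative, not the plugin’s API):

```python
# Split labeled historical data chronologically, fit a simple
# threshold detector on the training half, and measure precision
# and recall on the held-out half.
def fit_threshold(train_values, k=3.0):
    mean = sum(train_values) / len(train_values)
    std = (sum((v - mean) ** 2 for v in train_values) / len(train_values)) ** 0.5
    return mean, k * std

def evaluate(detector, test_points):
    mean, band = detector
    tp = fp = fn = 0
    for value, is_true_anomaly in test_points:
        predicted = abs(value - mean) > band
        if predicted and is_true_anomaly:
            tp += 1
        elif predicted:
            fp += 1
        elif is_true_anomaly:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Earlier data trains the model; later, labeled data tests it.
train = [10, 11, 9, 10, 12, 10, 11, 9]
test = [(10, False), (11, False), (30, True), (9, False), (28, True)]
model = fit_threshold(train)
print(evaluate(model, test))   # (precision, recall)
```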

Thanks!

Hey everyone,

I am thinking of using ES’s Anomaly Detection for a similar use-case. I think that the features you mentioned in 2 and 3 would be very useful.

As far as I know, the current functionality of anomaly detection lets the user mark whether an event was an anomaly or not. Generalizing this across historical data is something I’d like to see, and I’m sure it would greatly benefit the model’s accuracy.

Thanking you

hi, @amir

Thanks for your feedback.

For feature 3, I have a draft idea to run anomaly detection on historical data with a cron job. I think some users may not want near-real-time anomaly detection that runs every 5 minutes. For example, they could run anomaly detection on last week’s data and schedule the task at night or some other cluster idle time. We can create a weekly cron job and let the user specify the run time. The job would replay last week’s data points and find anomalies, and the user could review the job’s progress and results in Kibana. One benefit I can see is that the user can choose to run the detector at system idle time, either once or periodically, and the model is cleared once the job is done rather than held in memory all the time.
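The replay job described above might be sketched like this, with hypothetical stand-in callables for the data store and detector (the real job would query Elasticsearch and run the plugin’s RCF model):

```python
import datetime

def replay_window(fetch_points, detect, store_result, days=7):
    """Batch anomaly job: replay the last `days` of data through a
    fresh detector, persist the anomalies, then drop the model so it
    doesn't occupy memory between runs. `fetch_points`, `detect`, and
    `store_result` are injected callables (hypothetical stand-ins)."""
    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(days=days)
    model = {"state": []}               # fresh model for this run
    anomalies = []
    for ts, value in fetch_points(start, end):
        if detect(model, value):
            anomalies.append((ts, value))
    store_result(anomalies)
    return len(anomalies)               # model goes out of scope here

# Toy wiring: a flat hourly series with one spike mid-week.
def fake_fetch(start, end):
    for i in range(7 * 24):
        yield (start + datetime.timedelta(hours=i),
               1000 if i == 80 else 100)

def fake_detect(model, value):
    model["state"].append(value)
    baseline = sum(model["state"]) / len(model["state"])
    return len(model["state"]) > 5 and value > 3 * baseline

print(replay_window(fake_fetch, fake_detect, lambda a: None))  # 1 anomaly
```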

What do you think of this idea? Any comments or new ideas are welcome. If you have other use cases, feel free to post them here.

Thanks!

The project I’m currently working on has a use case for both scenarios (real-time anomaly detection and wider anomaly-detection analysis done in off-hours). I think it would be beneficial to implement more ML alongside anomaly detection for the off-hours analysis. Here is how I planned to use this feature:

Real time use-case:
I was hoping to use real-time anomaly detection to give the operations team a heads-up about possible indications of an impending system outage, so they could take appropriate action to mitigate the risk before it actually happens.

Cluster idle time use-case 1:
A lot of business-logic data is captured through the app logs that my system ingests. I think it would be great if I could use the existing data to build a regression model that could predict certain business parameters, such as the number of expected transactions. If we regard a higher-than-expected number of transactions as an anomaly, it would be beneficial to capture which other parameters contributed to or hinted at the appearance of those anomalies. With that knowledge, I’d like to provide my clients with information such as: based on the values of param1, param2, and param3, we expect the number of transactions to be X in a specific time period.
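A minimal sketch of that regression idea, using ordinary least squares in plain Python. The single parameter, the data, and the anomaly threshold here are all synthetic and illustrative:

```python
# Fit expected transaction volume from one business parameter with
# ordinary least squares, then flag periods where actuals run well
# above the prediction.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx   # (slope, intercept)

# Historical data: transactions grow roughly 50 per unit of param1.
param1 = [1, 2, 3, 4, 5, 6, 7, 8]
txns   = [52, 98, 151, 203, 249, 301, 348, 402]
slope, intercept = fit_line(param1, txns)

def expected_txns(p1):
    return slope * p1 + intercept

# New period: param1 = 5 predicts ~250 transactions; observing 400
# would be a higher-than-expected anomaly worth surfacing.
print(round(expected_txns(5)))
print(400 > expected_txns(5) * 1.2)   # flagged -> True
```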

I hope you find this information useful 🙂

Thank you for taking the time to go through all of this. I’m a great fan of the work the OpenDistro team is doing, and I’m excited to hear what new features you come up with next 😃

hi, @amir

Thanks for sharing your use cases and the really good suggestions. We will discuss them and update our ODFE roadmap if we plan to put resources toward them. Don’t hesitate to tell us if you have new use cases, find any bugs, or have other suggestions. And any contribution on GitHub is welcome; as you know, ODFE is completely open source.

Thanks
