Questions about ML

Hi there,

I'm a beginner here with a lot of questions. I hope someone can give me some insight.

  1. Is the ML model built from historical data, or from the data that arrives after we click Start detector?

  2. Is the ML model based on a sample of the anomaly history, or on the whole index data?

  3. Will the ML model continue to update itself, or is it fixed once it settles after the first training?

  4. Currently I have set up Open Distro locally using Docker (Open Distro 1.11.0) + nginx + Filebeat.
    What I am trying to do: count the number of documents, and detect when a huge number of docs suddenly come in, or when the doc count suddenly drops.
    First I provided sample logs (normally all returning 200), but the detector seems to fail somehow; I expected the last two cases to be flagged as anomalies.

  5. I checked the AWS example on YouTube: https://youtu.be/V1MRY5X-Anw?t=2093
    It seems he turned each HTTP status into an isolated field; however, normally there should be only one field, status, containing all kinds of status codes.
    I don't see that the model level can filter by range.
    Does this mean we need to do it at the detector level?
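For the doc-count use case in question 4, a detector feature can aggregate the document count per detector interval. Here is a minimal sketch of a feature definition in the Open Distro anomaly detection feature format; the feature name and field are illustrative (any field present on every Filebeat document, such as `@timestamp`, works for counting):

```json
{
  "feature_name": "doc_count",
  "feature_enabled": true,
  "aggregation_query": {
    "doc_count": {
      "value_count": {
        "field": "@timestamp"
      }
    }
  }
}
```

A sudden spike or drop in this per-interval count is then what the model can flag as anomalous.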

Thanks all


Thanks for using the product and asking the questions, Vincent!

Here are the answers to some of the questions.

  1. ML will be based on historical data if possible. If not, it will use the data after it’s created.
  2. The ML model uses some recent data. A few hundred data points.
  3. ML is a streaming algorithm and will update to the new data.
  4. Those should be anomalies. One possible reason the detector doesn’t raise them is that it has just raised a few anomalies earlier. The expected number of anomalies is around 0.5% of all data points by design. So if there are many anomalies in a short period of time, only the earlier ones are likely to be identified. It is also possible that the model has seen the data from indexed data already.
  5. A feature needs to have a numerical value so all the features have a known fixed total dimension. Unknown number of dimensions is not supported.
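To make the ~0.5% figure in answer 4 concrete, here is a quick back-of-the-envelope calculation. The 0.5% rate comes from the answer above; the 1-minute detector interval is an assumption for illustration:

```python
# Expected anomaly count per day for a detector, given the ~0.5% design
# rate: roughly one flagged point per 200 data points.
anomaly_rate = 0.005          # ~0.5% of data points flagged by design
points_per_day = 24 * 60      # one data point per 1-minute detector interval

expected_anomalies_per_day = anomaly_rate * points_per_day
print(expected_anomalies_per_day)  # 7.2, i.e. roughly 7 anomalies per day
```

So a burst of, say, 50 genuinely anomalous intervals in one day is far above this budget, which is consistent with only the earliest ones being raised.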

Thanks very much. I’ve been searching on these few questions and testing a few times.

About points 3 and 4, I still have some questions…

  1. Does that mean Open Distro is not using a fixed training model?
    For example, I have 7 days of data and then perform anomaly detection.
    After another 7 days,
    which of these will happen?
  • The model becomes a mix of these 2 weeks <== from your answer I assume this will happen?

  • Week 2 becomes the new model

  • The model keeps the week 1 pattern

If it mixes the 2 weeks into a new model…

  1. Based on your reply, does it mean that if a similar pattern has occurred before, the anomaly detection will not consider it an anomaly in future?

  2. Is the model stored in another index? So even if I lose/delete all the data, I can still back up the model and restore it somewhere?

  3. I recently found out that after a period of not sending data/logs to the index, the anomaly detection stops forever and cannot be resumed, even if I resume sending logs or save the detector again, unless I create another detector with the same features.

Good questions.

  1. No, the Open Distro model is not fixed. It is by design a streaming algorithm that learns from live data and identifies anomalies with respect to the current data. So in the example you gave, the model will learn from the data of recent weeks. The more recent a data point is, the more likely the model is to remember it. It uses weighted reservoir sampling, if you want to know more.

  2. It depends: if the pattern has occurred before but is still rare, the model might still identify it as an anomaly. If the pattern recurs often enough or lasts long enough, the model might learn it as normal.

  3. The model is stored in a different index, named checkpoint.

  4. Since the detector cannot run without a live data stream, it might be stopped when there is no data. But it can be restarted by manually stopping and restarting the detector; you shouldn’t need to recreate a new one. I am not sure whether this is caused by an implementation issue. If you can confirm that the new data stream is available but the restarted detector is still erroring out, that is a software issue. You can create an issue for tracking in the GitHub repo.
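The weighted reservoir sampling mentioned in answer 1 can be sketched as follows. This is a generic Efraimidis–Spirakis (A-Res) implementation, not the plugin's actual code, and the recency weighting (newer points get larger weights) is an assumption added purely to illustrate why recent data is more likely to stay in the sample:

```python
import heapq
import random

def weighted_reservoir_sample(stream, k):
    """A-Res weighted reservoir sampling: keep k items from a stream of
    (item, weight) pairs, where an item with weight w survives with
    probability proportional to w."""
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        # Each item gets a random key; heavier items tend to get larger keys.
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Illustrative recency weighting (an assumption, not the plugin's scheme):
# the point at time t gets weight 2**(t / 1000), so newer points are
# exponentially more likely to remain in the reservoir.
stream = ((t, 2 ** (t / 1000)) for t in range(5000))
sample = weighted_reservoir_sample(stream, 256)
```

With this weighting, the sample is dominated by recent timestamps, which matches the description above: the model "remembers" recent weeks and gradually forgets older ones.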

Thanks for those replies, they already helped me a lot and saved me a lot of time.

For no.9, here are my steps:

  1. I started the test and injected traffic on 11/23 from 11:00 to 12:00
  2. I created the anomaly detector (with index filebeat*)
  3. At 12:00 I stopped injecting traffic and left the setup there
  4. Some time later, the anomaly detector showed "Data is not being ingested correctly"
  5. At 09:00 on 11/24 I resumed the traffic, but the anomaly detector did not resume; it kept showing "Data is not being ingested correctly"
  6. At 10:09 on 11/24 I stopped and started the anomaly detector again; it keeps showing "Initializing", and I can confirm data has been ingested for over 30 minutes
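For reference, the stop/restart in step 6 can also be done through the anomaly detection REST API rather than Kibana. A sketch, assuming the Open Distro 1.11 endpoints; replace <detector-id> with your detector's actual ID:

```
POST _opendistro/_anomaly_detection/detectors/<detector-id>/_stop
POST _opendistro/_anomaly_detection/detectors/<detector-id>/_start
```

The detector profile endpoint can also help confirm whether the restarted detector is still initializing or has hit an error state:

```
GET _opendistro/_anomaly_detection/detectors/<detector-id>/_profile
```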