Questions about ML

Hi there,

I'm a beginner here with a lot of questions. I hope someone can give me some insight.

  1. Is the ML model built from historical data, or from the data that arrives after we click Start detector?

  2. Is the ML model based on a sample of the anomaly history, or on the whole index data?

  3. Will the ML model continue to update itself, or is it fixed once it settles after the first training?

  4. Currently I have set up Open Distro locally using Docker (Open Distro 1.11.0) + nginx + Filebeat.
    What I am trying to do: count the number of documents, and detect when a huge number of docs suddenly come in, or when the doc count suddenly drops.
    First I provided sample logs (normally all returning 200), but the detector seems to fail somehow; I expected the last two cases to be flagged as anomalies.

  5. I checked the AWS example on YouTube: https://youtu.be/V1MRY5X-Anw?t=2093
    It seems he turned each HTTP status into an isolated field; however, normally there should be only one field, status, containing all kinds of status codes.
    I don't see that the model level can filter by range.
    Does this mean we need to do it at the detector level?
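For the doc-count use case in question 4, a detector feature can aggregate the document count per detector interval. Here is a minimal sketch of a feature definition in the Open Distro anomaly detection feature format; the feature name and field are illustrative (any field present on every Filebeat document, such as `@timestamp`, works for counting):

```json
{
  "feature_name": "doc_count",
  "feature_enabled": true,
  "aggregation_query": {
    "doc_count": {
      "value_count": {
        "field": "@timestamp"
      }
    }
  }
}
```

A sudden spike or drop in this per-interval count is then what the model can flag as anomalous.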

Thanks all


Thanks for using the product and asking the questions, Vincent!

Here are the answers to some of the questions.

  1. ML will be based on historical data if possible. If not, it will use the data after it’s created.
  2. The ML model uses some recent data. A few hundred data points.
  3. ML is a streaming algorithm and will update to the new data.
  4. Those should be anomalies. One possible reason the detector doesn’t raise them is that it has just raised a few anomalies earlier. The expected number of anomalies is around 0.5% of all data points by design. So if there are many anomalies in a short period of time, only the earlier ones are likely to be identified. It is also possible that the model has seen the data from indexed data already.
  5. A feature needs to have a numerical value so all the features have a known fixed total dimension. Unknown number of dimensions is not supported.
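To make the ~0.5% figure in answer 4 concrete, here is a quick back-of-the-envelope calculation. The 0.5% rate comes from the answer above; the 1-minute detector interval is an assumption for illustration:

```python
# Expected anomaly count per day for a detector, given the ~0.5% design
# rate: roughly one flagged point per 200 data points.
anomaly_rate = 0.005          # ~0.5% of data points flagged by design
points_per_day = 24 * 60      # one data point per 1-minute detector interval

expected_anomalies_per_day = anomaly_rate * points_per_day
print(expected_anomalies_per_day)  # 7.2, i.e. roughly 7 anomalies per day
```

So a burst of, say, 50 genuinely anomalous intervals in one day is far above this budget, which is consistent with only the earliest ones being raised.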

Thanks very much. I’ve been searching on these few questions and testing a few times.

About points 3 and 4, I still have some questions…

  1. Does that mean Open Distro is not using a fixed training model?
    For example, I have 7 days of data and then perform anomaly detection.
    After another 7 days,
    which of these will happen?
  • The model becomes a mix of these 2 weeks <== from your answer I assume this will happen?

  • Week 2 becomes the new model

  • The model keeps the week 1 pattern

If it mixes the 2 weeks into a new model…

  1. Based on your reply, does it mean that if a similar pattern has occurred before, the anomaly detection will not consider it an anomaly in future?

  2. Is the model stored in another index? So even if I lose/delete all the data, I can still back up the model and restore it somewhere?

  3. I recently found out that after a period of not sending data/logs to the index, the anomaly detection stops forever and cannot be resumed, even if I resume sending logs or save the detector again, unless I create another detector with the same features.

Good questions.

  1. No, the Open Distro model is not fixed. It is by design a streaming algorithm that learns from live data and identifies anomalies with respect to the current data. So in the example you gave, the model will learn from the data of recent weeks. The more recent a data point is, the more likely the model is to remember it. It uses weighted reservoir sampling, if you want to know more.

  2. It depends: if the pattern has occurred before but is still rare, the model might still identify it as an anomaly. If the pattern recurs often enough or lasts long enough, the model might learn it as normal.

  3. The model is stored in a different index, named checkpoint.

  4. Since the detector cannot run without a live data stream, it might be stopped when there is no data. But it can be restarted by manually stopping and restarting the detector; you shouldn’t need to recreate a new one. I am not sure whether this is caused by an implementation issue. If you can confirm that the new data stream is available but the restarted detector is still erroring out, that is a software issue. You can create an issue for tracking in the GitHub repo.
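The weighted reservoir sampling mentioned in answer 1 can be sketched as follows. This is a generic Efraimidis–Spirakis (A-Res) implementation, not the plugin's actual code, and the recency weighting (newer points get larger weights) is an assumption added purely to illustrate why recent data is more likely to stay in the sample:

```python
import heapq
import random

def weighted_reservoir_sample(stream, k):
    """A-Res weighted reservoir sampling: keep k items from a stream of
    (item, weight) pairs, where an item with weight w survives with
    probability proportional to w."""
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        # Each item gets a random key; heavier items tend to get larger keys.
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Illustrative recency weighting (an assumption, not the plugin's scheme):
# the point at time t gets weight 2**(t / 1000), so newer points are
# exponentially more likely to remain in the reservoir.
stream = ((t, 2 ** (t / 1000)) for t in range(5000))
sample = weighted_reservoir_sample(stream, 256)
```

With this weighting, the sample is dominated by recent timestamps, which matches the description above: the model "remembers" recent weeks and gradually forgets older ones.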

Thanks for those replies, they already helped me a lot and saved me a lot of time.

For no.9, here are my steps:

  1. I started the test and injected traffic on 11/23 from 11:00 to 12:00
  2. I created the anomaly detector (with index filebeat*)
  3. At 12:00 I stopped injecting traffic and left the setup there
  4. Some time later, the anomaly detector showed "Data is not being ingested correctly"
  5. At 09:00 on 11/24 I resumed the traffic, but the anomaly detector did not resume; it kept showing "Data is not being ingested correctly"
  6. At 10:09 on 11/24 I stopped and started the anomaly detector again; it keeps showing "Initializing", and I can confirm data has been ingested for over 30 minutes
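For reference, the stop/restart in step 6 can also be done through the anomaly detection REST API rather than Kibana. A sketch, assuming the Open Distro 1.11 endpoints; replace <detector-id> with your detector's actual ID:

```
POST _opendistro/_anomaly_detection/detectors/<detector-id>/_stop
POST _opendistro/_anomaly_detection/detectors/<detector-id>/_start
```

The detector profile endpoint can also help confirm whether the restarted detector is still initializing or has hit an error state:

```
GET _opendistro/_anomaly_detection/detectors/<detector-id>/_profile
```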