TimeoutException loading roles from security index

Hi,

We have an application that uses document-level security, and we’re seeing a lot of timeout exceptions when updating user roles, which is causing problems. I’ve tried different approaches to reduce them without much success. The timeouts seem to happen more frequently as the size of the .opendistro_security index increases. The 5-second timeout is hardcoded in the security plugin. When the role update API does succeed, it averages 30-40 seconds.

The stacktrace we get is this:

[2020-07-17T11:51:31,115][WARN ][r.suppressed             ] [node1] path: /_opendistro/_security/api/roles/testrole, params: {name=testrole}
org.elasticsearch.ElasticsearchException: java.util.concurrent.TimeoutException: Timeout after 5SECONDS while retrieving configuration for [ROLES](index=.opendistro_security)
        at com.amazon.opendistroforelasticsearch.security.configuration.ConfigurationRepository.getConfigurationsFromIndex(ConfigurationRepository.java:349) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.AbstractApiAction.load(AbstractApiAction.java:250) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.AbstractApiAction.handlePut(AbstractApiAction.java:184) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.AbstractApiAction.handleApiRequest(AbstractApiAction.java:123) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.PatchableResourceApiAction.handleApiRequest(PatchableResourceApiAction.java:252) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.AbstractApiAction.lambda$prepareRequest$2(AbstractApiAction.java:382) ~[?:?]
        at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:113) ~[elasticsearch-7.3.2.jar:7.3.2]
        ....
Caused by: java.util.concurrent.TimeoutException: Timeout after 5SECONDS while retrieving configuration for [ROLES](index=.opendistro_security)
        at com.amazon.opendistroforelasticsearch.security.configuration.ConfigurationLoaderSecurity7.load(ConfigurationLoaderSecurity7.java:140) ~[?:?]
        at com.amazon.opendistroforelasticsearch.security.configuration.ConfigurationRepository.getConfigurationsFromIndex(ConfigurationRepository.java:339) ~[?:?]
        ... 57 more

As far as I can tell, the REST API handler is trying to load all of the roles in the security index before processing the API call.

For our application, each user has access to a set of UUIDs and every document in Elasticsearch has a uuid field with a single UUID for the value. For each user, we create a unique role that looks like this:

{
  "testuser": {
    "reserved": false,
    "hidden": false,
    "cluster_permissions": [
      "cluster_monitor"
    ],
    "index_permissions": [
      {
        "index_patterns": [
          "metrics-*"
        ],
        "dls": "{\"bool\": {\"must\": [{\"terms\": {\"uuid\": [\"uuid1\", \"uuid2\"]}}]}}",
        "fls": [],
        "masked_fields": [],
        "allowed_actions": [
          "read"
        ]
      },
      {
        "index_patterns": [
          "logs-*"
        ],
        "dls": "{\"bool\": {\"must\": [{\"terms\": {\"uuid.keyword\": [\"uuid1\", \"uuid2\"]}}]}}",
        "fls": [],
        "masked_fields": [],
        "allowed_actions": [
          "read"
        ]
      }
    ],
    "tenant_permissions": [],
    "static": false
  }
}
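
The dls value is itself a JSON query serialized into a string, so its inner quotes have to be escaped. A minimal Python sketch of one way to build the two index_permissions entries so that json.dumps handles the escaping; the helper names build_dls and build_index_permissions are hypothetical, and this only builds the body, it does not call the security API:

```python
import json


def build_dls(field, uuids):
    """Return the DLS query as a JSON string restricting `field` to `uuids`."""
    query = {"bool": {"must": [{"terms": {field: list(uuids)}}]}}
    return json.dumps(query)


def build_index_permissions(uuids):
    """Build one index_permissions entry per (index pattern, field) pair."""
    targets = [("metrics-*", "uuid"), ("logs-*", "uuid.keyword")]
    return [
        {
            "index_patterns": [pattern],
            "dls": build_dls(field, uuids),
            "fls": [],
            "masked_fields": [],
            "allowed_actions": ["read"],
        }
        for pattern, field in targets
    ]


if __name__ == "__main__":
    # Serializing the outer document escapes the quotes inside each dls string.
    print(json.dumps(build_index_permissions(["uuid1", "uuid2"]), indent=2))
```

Building the query as a dict and serializing it twice (once for dls, once for the whole body) avoids hand-written escaping.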

Each UUID is 36 characters long, and the list of UUIDs is the same for both index patterns; only the field name differs (uuid vs. uuid.keyword). Our environment consists of:

  • Elasticsearch: 7.3.2
  • OpenDistro: 1.3.0
  • Nodes: 9 (all master/data)
  • Indices: Retain 9 months with 1 index per day
  • Number of users: ~400
  • Avg number of UUIDs per role: ~3500
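
To put those numbers in perspective, here is a rough back-of-the-envelope estimate of how much UUID text the roles contain. It assumes every role lists its UUIDs twice (once per index pattern) and about 4 extra characters per UUID for quotes and separators; it says nothing about how the plugin actually stores or encodes the roles:

```python
USERS = 400                # approximate number of users (one role each)
UUIDS_PER_ROLE = 3500      # average number of UUIDs per role
UUID_LEN = 36              # characters per UUID
PATTERNS_PER_ROLE = 2      # metrics-* and logs-* each repeat the UUID list
OVERHEAD_PER_UUID = 4      # assumed: two quotes, a comma, and a space


def estimate_bytes(users=USERS, uuids=UUIDS_PER_ROLE, uuid_len=UUID_LEN,
                   patterns=PATTERNS_PER_ROLE, overhead=OVERHEAD_PER_UUID):
    """Return (bytes per role, bytes for all roles) of UUID text in DLS queries."""
    per_role = patterns * uuids * (uuid_len + overhead)
    return per_role, per_role * users


if __name__ == "__main__":
    per_role, total = estimate_bytes()
    print(f"~{per_role / 1e3:.0f} KB per role, ~{total / 1e6:.0f} MB for all roles")
```

With the figures above this comes to roughly 280 KB per role and about 112 MB across all roles, before any JSON escaping or encoding overhead.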

We update a user’s role at most once an hour, when they access the application and there are new UUIDs to add. Users access Elasticsearch via Grafana dashboards.
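
A minimal sketch of that "only update if there are new UUIDs" check, using a hypothetical helper merge_uuids; the actual call to the role update API is left out:

```python
def merge_uuids(current, incoming):
    """Merge incoming UUIDs into the current list.

    Returns (merged, changed), where `changed` is True only if `incoming`
    contained at least one UUID that was not already in `current`.
    """
    current_set = set(current)
    merged = sorted(current_set | set(incoming))
    changed = len(merged) > len(current_set)
    return merged, changed


if __name__ == "__main__":
    merged, changed = merge_uuids(["uuid1", "uuid2"], ["uuid2", "uuid3"])
    if changed:
        # Only in this case would the role need to be rewritten.
        print(f"role needs update, now {len(merged)} UUIDs")
```

Skipping the update when nothing changed keeps the number of role writes, and so the number of chances to hit the timeout, as low as possible.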

Some alternatives we’ve considered:

  1. Using a single role and putting all of the UUIDs in a single field in the user’s attributes. This takes as long as the current approach.
  2. Applying the UUID filter at the application level instead of in Elasticsearch. This isn’t an option since we’re using Grafana to do the queries and we cannot inject user-specific data.
  3. Changing our indices from metrics-YYYY.MM.DD to metrics-UUID-YYYY.MM.DD and using index permissions to limit which indices the user can see. This would require us to split several TBs of data into new indices and would multiply our index count by ~3k.
  4. Storing the user’s UUIDs in a document in another index and trying to use it in the queries. Grafana doesn’t let us query documents without timestamps so we can’t access it.

Does anyone have ideas for a better way to store this information, or a way to eliminate the timeout? It looks like the timeout occurs because all of the roles are loaded in ConfigurationLoaderSecurity7 and the total size/number of roles is large. Is that right?

Thank you,
Greg