Encountered a failure while executing in org.opensearch.replication.action.changes.GetChangesRequest

Hi,

Using OpenSearch 1.2.0, I see several warnings in the logs like the following:

[2021-12-10T08:32:44,016][WARN ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-1] [cadence-visibility][4] Encountered a failure while executing in org.opensearch.replication.action.changes.GetChangesRequest@78ef1be8. Retrying in 10 seconds.
org.opensearch.OpenSearchTimeoutException: global checkpoint not synced. Retry after a few miliseconds...
	at org.opensearch.replication.action.changes.TransportGetChangesAction$asyncShardOperation$1.invokeSuspend(TransportGetChangesAction.kt:93) ~[opensearch-cross-cluster-replication-1.2.0.0.jar:1.2.0.0]
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) [kotlin-stdlib-1.3.72.jar:1.3.72-release-468 (1.3.72)]
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:56) [kotlinx-coroutines-core-1.3.5.jar:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733) [opensearch-1.2.0.jar:1.2.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

Replication itself appears to be working fine, according to the status API:

{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility",
  "syncing_details" : {
    "leader_checkpoint" : 307987,
    "follower_checkpoint" : 307984,
    "seq_no" : 307986
  }
}
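As a sanity check, the `syncing_details` above can be read as the follower trailing the leader by only a few operations, which is normal while writes are in flight. A minimal sketch of that interpretation (the class and method names here are illustrative, not part of the plugin):

```java
// Hypothetical helper for reading the replication status above.
// A small, shrinking gap between leader_checkpoint and
// follower_checkpoint indicates healthy, in-progress syncing.
public class ReplicationLag {
    public static long lag(long leaderCheckpoint, long followerCheckpoint) {
        return leaderCheckpoint - followerCheckpoint;
    }

    public static void main(String[] args) {
        long leader = 307987;   // "leader_checkpoint" from the status API
        long follower = 307984; // "follower_checkpoint" from the status API
        // prints: ops behind: 3
        System.out.println("ops behind: " + lag(leader, follower));
    }
}
```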

Any suggestions about what is happening here?

Thanks in advance

@stdmje thanks for flagging the issue.
Tasks at the follower cluster wait for a certain duration for operations to sync at the leader cluster. If those operations are not synced within that duration, this exception is logged and the task retries.
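The wait-then-retry behaviour can be sketched roughly as below. This is a simplified illustration, not the plugin's actual code: the names `waitForCheckpoint` and `CheckpointTimeoutException` are assumptions, and the real implementation uses Kotlin coroutines rather than a polling loop.

```java
import java.util.function.LongSupplier;

// Sketch: the follower asks for changes past a sequence number; if the
// leader's global checkpoint has not advanced far enough within the
// timeout, a timeout exception is raised and the caller retries later
// (the "Retrying in 10 seconds" WARN in the log).
public class CheckpointWait {
    static class CheckpointTimeoutException extends RuntimeException {
        CheckpointTimeoutException(String msg) { super(msg); }
    }

    // Polls the leader's global checkpoint until it reaches fromSeqNo
    // or the timeout elapses.
    static long waitForCheckpoint(LongSupplier globalCheckpoint,
                                  long fromSeqNo,
                                  long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long cp = globalCheckpoint.getAsLong();
            if (cp >= fromSeqNo) {
                return cp; // enough operations are synced; fetch changes
            }
            Thread.sleep(50); // back off briefly before polling again
        }
        // Not a data-loss failure: the task simply retries later.
        throw new CheckpointTimeoutException("global checkpoint not synced");
    }

    public static void main(String[] args) throws InterruptedException {
        final long[] checkpoint = {307984};
        // Simulate the leader advancing its checkpoint in the background.
        new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) {}
            checkpoint[0] = 307987;
        }).start();
        System.out.println("synced up to "
                + waitForCheckpoint(() -> checkpoint[0], 307986, 2000));
    }
}
```

If the checkpoint never advances within the timeout, the exception is thrown and logged, which matches the benign WARN messages in the original report.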

These are not replication failures. Logging should be more minimal here to avoid the noise; I have opened an issue to track this.


Great, thank you for the information.