Replication failed

Hello,

I have found that sometimes, if a replica node restarts while it is applying changes from the master, replication fails.

Since I am running the dev and staging environments on preemptible k8s nodes, this is really annoying.

Any suggestions on how to avoid these failures?

Thanks in advance

[opensearch@opensearch-replica-master-2 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty'
{
  "status" : "FAILED",
  "reason" : "",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility"
}

Logs:

[2021-12-15T16:09:17,745][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,749][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,749][WARN ][o.o.r.m.s.ReplicationMetadataStore] [opensearch-replica-master-2] Encountered a failure while executing in org.opensearch.action.admin.cluster.health.ClusterHealthRequest@39ba4ba7. Retrying in 10 seconds.
[2021-12-15T16:09:17,784][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][3]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][3], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=I3EwkNNuRFKgo9Q4uhBbvQ]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,784][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,785][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,786][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][3]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][3], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=I3EwkNNuRFKgo9Q4uhBbvQ]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,789][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][3]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][3], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=I3EwkNNuRFKgo9Q4uhBbvQ]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,790][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,791][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,791][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,792][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][3]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][3], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=I3EwkNNuRFKgo9Q4uhBbvQ]
	at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
	at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
	Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@64df2a3a}
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@70679a75}
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@509844e2}
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@2444cded}
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@21dda8c5}
[2021-12-15T16:09:17,924][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#10]: Unable to get changes from seqNo: 392578. kotlinx.coroutines.JobCancellationException: Parent job is Cancelling; job=StandaloneCoroutine{Cancelling}@5f4d1f7c
Caused by: ReplicationException[failed to replay changes]
	at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
[2021-12-15T16:09:17,925][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#2]: Unable to get changes from seqNo: 393171. kotlinx.coroutines.JobCancellationException: Parent job is Cancelling; job=StandaloneCoroutine{Cancelling}@532ee7
Caused by: ReplicationException[failed to replay changes]
	at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:17,926][ERROR][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#4]: ShardReplicationTask: Caught downstream exception ReplicationException[failed to replay changes]
	at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:17,926][ERROR][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#3]: ShardReplicationTask: Caught downstream exception ReplicationException[failed to replay changes]
	at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:17,927][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#5]: Going to mark ShardReplicationTask as Failed with ReplicationException[failed to replay changes]
	at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:17,927][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#9]: Going to mark ShardReplicationTask as Failed with ReplicationException[failed to replay changes]
	at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
	Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:18,233][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#1]: Waiting 600000 millis for IndexReplicationTask to respond to failure of shard task
[2021-12-15T16:09:18,322][INFO ][o.o.r.a.p.TransportPauseIndexReplicationAction] [opensearch-replica-master-2] Pausing index replication on index:cadence-visibility
[2021-12-15T16:09:18,364][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#8]: Waiting 600000 millis for IndexReplicationTask to respond to failure of shard task
[2021-12-15T16:09:18,433][WARN ][o.o.c.r.a.AllocationService] [opensearch-replica-master-2] [.replication-metadata-store][0] marking unavailable shards as stale: [h6iKCCCiShadAlL546t8hQ]
[2021-12-15T16:09:18,737][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][clusterApplierService#updateTask][T#1]: Pause state received for index cadence-visibility. Cancelling [cadence-visibility][3] task
[2021-12-15T16:09:18,737][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][clusterApplierService#updateTask][T#1]: Pause state received for index cadence-visibility. Cancelling [cadence-visibility][4] task
[2021-12-15T16:09:18,738][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#6]: Received cancellation of ShardReplicationTask java.util.concurrent.CancellationException: Shard replication task received pause.
	at org.opensearch.replication.task.CrossClusterReplicationTask.cancelTask(CrossClusterReplicationTask.kt:87)
	at org.opensearch.replication.task.shard.ShardReplicationTask.access$cancelTask(ShardReplicationTask.kt:60)
	at org.opensearch.replication.task.shard.ShardReplicationTask$ClusterStateListenerForTaskInterruption.clusterChanged(ShardReplicationTask.kt:187)
[2021-12-15T16:09:18,738][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] Going to mark ShardReplicationTask:146988 task as completed
[2021-12-15T16:09:18,738][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#7]: Received cancellation of ShardReplicationTask java.util.concurrent.CancellationException: Shard replication task received pause.
	at org.opensearch.replication.task.CrossClusterReplicationTask.cancelTask(CrossClusterReplicationTask.kt:87)
	at org.opensearch.replication.task.shard.ShardReplicationTask.access$cancelTask(ShardReplicationTask.kt:60)
	at org.opensearch.replication.task.shard.ShardReplicationTask$ClusterStateListenerForTaskInterruption.clusterChanged(ShardReplicationTask.kt:187)
[2021-12-15T16:09:18,738][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] Going to mark ShardReplicationTask:147000 task as completed
[2021-12-15T16:09:18,806][INFO ][o.o.r.a.p.TransportPauseIndexReplicationAction] [opensearch-replica-master-2] Pausing index replication on index:cadence-visibility
[2021-12-15T16:09:19,033][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] Successfully persisted task status
[2021-12-15T16:09:19,033][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] Successfully persisted task status
[2021-12-15T16:09:20,463][WARN ][o.o.p.PersistentTasksClusterService] [opensearch-replica-master-2] persistent task replication:index:cadence-visibility failed
	at org.opensearch.replication.task.index.IndexReplicationTask$failReplication$2.invokeSuspend(IndexReplicationTask.kt:278) ~[?:?]

I have found a pattern.

My current setup:
3 nodes, each with the dimr roles.

If the node that restarts is the coordinator node, the replication fails. Otherwise, the replication keeps working as expected.

[opensearch@opensearch-replica-master-1 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty'
{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility",
  "syncing_details" : {
    "leader_checkpoint" : 2603947,
    "follower_checkpoint" : 2603884,
    "seq_no" : 2603886
  }
}
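
For reference, the gap between `leader_checkpoint` and `follower_checkpoint` in the response above gives a rough idea of how far the follower is behind. The helper below is only a sketch: the function name `replication_lag` is made up for illustration, and it assumes the pretty-printed layout shown above (one field per line) so that `sed` is enough and no JSON tool is needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a _status response on stdin and print
# leader_checkpoint - follower_checkpoint.
# Assumes the pretty-printed layout shown above (one field per line).
replication_lag() {
  local json leader follower
  json="$(cat)"
  leader="$(printf '%s\n' "$json" | sed -n 's/.*"leader_checkpoint"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')"
  follower="$(printf '%s\n' "$json" | sed -n 's/.*"follower_checkpoint"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')"
  # A FAILED or PAUSED response has no syncing_details, so report that instead.
  if [ -z "$leader" ] || [ -z "$follower" ]; then
    echo "no syncing_details in response" >&2
    return 1
  fi
  echo $(( leader - follower ))
}
```

It could be fed directly from the status call, for example `curl -s -u admin:admin 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty' | replication_lag`.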

[opensearch@opensearch-replica-master-1 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_cat/tasks'
cluster:indices/admin/replication[c]  Mf0-6OROR2yNVsz-lIAvFQ:4622 cluster:125                 persistent 1639642983075 08:23:03 34.5s       10.196.79.40  opensearch-replica-master-1
cluster:indices/shards/replication[c] Mf0-6OROR2yNVsz-lIAvFQ:4807 cluster:126                 persistent 1639642984765 08:23:04 32.8s       10.196.79.40  opensearch-replica-master-1
cluster:indices/shards/replication[c] 3qmptEvvQKSWRgFGnKTZjg:1451 cluster:127                 persistent 1639642984923 08:23:04 32.7s       10.196.47.117 opensearch-replica-master-0
cluster:indices/shards/replication[c] 3qmptEvvQKSWRgFGnKTZjg:1479 cluster:128                 persistent 1639642985407 08:23:05 32.2s       10.196.47.117 opensearch-replica-master-0
cluster:indices/shards/replication[c] Mf0-6OROR2yNVsz-lIAvFQ:4885 cluster:129                 persistent 1639642985717 08:23:05 31.9s       10.196.79.40  opensearch-replica-master-1
cluster:indices/shards/replication[c] 53wnO2a5RqedaEYLfF4-cA:5086 cluster:130                 persistent 1639642986145 08:23:06 31.4s       10.196.97.72  opensearch-replica-master-2
cluster:monitor/tasks/lists           Mf0-6OROR2yNVsz-lIAvFQ:5346 -                           transport  1639643017626 08:23:37 1.4ms       10.196.79.40  opensearch-replica-master-1
cluster:monitor/tasks/lists[n]        Mf0-6OROR2yNVsz-lIAvFQ:5347 Mf0-6OROR2yNVsz-lIAvFQ:5346 direct     1639643017628 08:23:37 361.8micros 10.196.79.40  opensearch-replica-master-1
cluster:monitor/tasks/lists[n]        3qmptEvvQKSWRgFGnKTZjg:1891 Mf0-6OROR2yNVsz-lIAvFQ:5346 transport  1639643017628 08:23:37 511.6micros 10.196.47.117 opensearch-replica-master-0
cluster:monitor/tasks/lists[n]        53wnO2a5RqedaEYLfF4-cA:5468 Mf0-6OROR2yNVsz-lIAvFQ:5346 transport  1639643017629 08:23:37 572.7micros 10.196.97.72  opensearch-replica-master-2

[opensearch@opensearch-replica-master-1 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_cat/nodes'
10.196.79.40  21 75 32 6.32 6.31 5.91 dimr * opensearch-replica-master-1
10.196.47.117 43 73 36 0.73 1.24 1.72 dimr - opensearch-replica-master-0
10.196.97.72  68 76 24 2.83 1.72 1.85 dimr - opensearch-replica-master-2
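
Note that in the two outputs above, the node flagged `*` in `_cat/nodes` and the node running the `cluster:indices/admin/replication[c]` task are the same one (opensearch-replica-master-1), so it is not obvious which of the two matters. The sketch below prints both so they can be compared before a restart. The function names are made up for illustration, and the column positions are an assumption based on the default output shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the node flagged "*" in default
# `_cat/nodes` output read from stdin.
# Assumes the column order shown above: column 9 is the flag, column 10 the name.
elected_master() {
  awk '$9 == "*" { print $10 }'
}

# Hypothetical helper: print the name of the node that runs the index-level
# replication task, given `_cat/tasks` output on stdin.
# Assumes the node name is the last column, as in the output above.
index_replication_task_node() {
  awk '$1 == "cluster:indices/admin/replication[c]" { print $NF }'
}
```

For example: `curl -s -u admin:admin 'http://localhost:9200/_cat/nodes' | elected_master` and `curl -s -u admin:admin 'http://localhost:9200/_cat/tasks' | index_replication_task_node`.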

kubectl delete po opensearch-replica-master-1

[opensearch@opensearch-replica-master-0 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty'
{
  "status" : "FAILED",
  "reason" : "Pause failed with \"Index cadence-visibility is already paused\". Original failure for initiating pause - [[cadence-visibility][0] - org.opensearch.replication.ReplicationException - \"failed to replay changes\"], [[cadence-visibility][2] - org.opensearch.common.io.stream.NotSerializableExceptionWrapper - \"replication_exception: failed to replay changes\"], ",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility"
}

If I add 3 dedicated data nodes to the cluster and remove the data role from the master nodes, I can't reproduce the failure when restarting the coordinator node.

Thanks for your support @ccr-devs

Happened again even with 3 data nodes.

@saikaranam could you please take a look?

Thanks in advance

I added the following lifecycle hooks to the StatefulSet definition and, so far, it is working without issues.

lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          # Pause replication before the pod stops, then keep retrying
          # until _cat/tasks no longer lists any replication task.
          BASE="http://$OPENSEARCH_REPLICA_MASTER_SERVICE_HOST:$OPENSEARCH_REPLICA_MASTER_SERVICE_PORT_HTTP"
          PAUSE_URL="$BASE/_plugins/_replication/cadence-visibility/_pause?pretty"
          curl -u "admin:$OPENSEARCH_ADMIN_PWD" -XPOST "$PAUSE_URL" -H 'Content-Type: application/json' -d'{}'
          sleep 5
          while curl -s -u "admin:$OPENSEARCH_ADMIN_PWD" "$BASE/_cat/tasks" | grep -q replication
          do
            curl -u "admin:$OPENSEARCH_ADMIN_PWD" -XPOST "$PAUSE_URL" -H 'Content-Type: application/json' -d'{}'
            sleep 5
          done
  postStart:
    exec:
      command:
        - bash
        - -c
        - |
          # Resume replication once the replacement pod has been created.
          curl -u "admin:$OPENSEARCH_ADMIN_PWD" -XPOST "http://$OPENSEARCH_REPLICA_MASTER_SERVICE_HOST:$OPENSEARCH_REPLICA_MASTER_SERVICE_PORT_HTTP/_plugins/_replication/cadence-visibility/_resume?pretty" -H 'Content-Type: application/json' -d'{}'
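
One caveat with the hooks above: Kubernetes runs a postStart hook immediately after the container is created, with no ordering guarantee relative to the container's entrypoint, so the resume call may be sent before OpenSearch is accepting requests. A possible refinement, which is not part of the hook as posted, is to wrap the call in a small retry loop. The `retry` function below is a hypothetical helper and only a sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original hook.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS attempts have been made.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1   # give up after the last allowed attempt
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}
```

Inside the postStart script it might be used as `retry 30 5 curl -sf -u "admin:$OPENSEARCH_ADMIN_PWD" -XPOST "<resume URL from the hook above>" -H 'Content-Type: application/json' -d'{}'`. With `-f`, curl also exits non-zero on HTTP error responses, so a resume request that the plugin rejects would be retried until the attempt limit is reached.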

Thanks for the update @stdmje
I will reproduce this on a test cluster and update this thread.

More info.

I see that most of the problems come when the admin replication task gets stuck. There is no way to remove it other than running a cancel task request and then resuming the replication.

[opensearch@opensearch-replica-data-2 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty'
{
  "status" : "PAUSED",
  "reason" : "User initiated",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility"
}

[opensearch@opensearch-replica-data-2 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_cat/tasks'
cluster:indices/admin/replication[c] 0O0WZYO3QYK5RTrCK27Ekg:1005520 cluster:1274                   persistent 1640420155570 08:15:55 4.4h        10.196.92.18 opensearch-replica-data-2

[opensearch@opensearch-replica-data-2 ~]$ curl -XPOST -u admin:admin -k -H 'Content-Type: application/json' 'http://localhost:9200/_plugins/_replication/cadence-visibility/_pause?pretty' -d '{}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "resource_already_exists_exception",
        "reason" : "Index cadence-visibility is already paused"
      }
    ],
    "type" : "resource_already_exists_exception",
    "reason" : "Index cadence-visibility is already paused"
  },
  "status" : 400
}
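
The cancel-and-resume workaround mentioned above could be scripted roughly as follows. This is a sketch under assumptions, not a verified fix: the function names, the `BASE_URL` variable and the `admin:admin` credentials are placeholders, it assumes a single followed index, and it relies on the generic `_tasks/<task_id>/_cancel` endpoint accepting the task id taken from column 2 of `_cat/tasks`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the task id (column 2) of the index-level
# replication task from `_cat/tasks` output read from stdin.
stuck_replication_task_id() {
  awk '$1 == "cluster:indices/admin/replication[c]" { print $2; exit }'
}

# Hypothetical wrapper: cancel that task, then resume replication on INDEX.
# BASE_URL and the credentials are placeholders for illustration only.
cancel_and_resume() {
  local base="${BASE_URL:-http://localhost:9200}" index="$1" task_id
  task_id="$(curl -s -u admin:admin "$base/_cat/tasks" | stuck_replication_task_id)"
  if [ -z "$task_id" ]; then
    echo "no index replication task found" >&2
    return 1
  fi
  curl -s -u admin:admin -XPOST "$base/_tasks/$task_id/_cancel"
  curl -s -u admin:admin -XPOST "$base/_plugins/_replication/$index/_resume" \
    -H 'Content-Type: application/json' -d '{}'
}
```

It would be called as `cancel_and_resume cadence-visibility`.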