Locate reindex bottleneck

Hello,

I need some help locating the reason for a slow reindexing speed. I’m reindexing one index from a 3-node cluster to another 3-node cluster in the same datacenter. The relevant changes that should affect the speed are disabling replicas and disabling refresh on the target index:

      "index.number_of_replicas" : "0",
      "index.refresh_interval" : "-1",
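
For illustration, the two settings above could be kept as a pair of request bodies for the index settings API (`PUT /<index>/_settings`): one for the duration of the copy, one to restore afterwards. This is only a sketch. The function names are hypothetical, the restore value of 1 replica comes from the source index described in this post, and the `1s` refresh interval is an assumption (it is the usual default).

```python
import json


def bulk_load_settings():
    """Settings from the post: no replicas and refresh disabled during the copy."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings to apply once the reindex has finished (values are assumptions)."""
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }


if __name__ == "__main__":
    # Print both bodies so they can be pasted into a PUT /<index>/_settings call.
    print(json.dumps(bulk_load_settings(), indent=2))
    print(json.dumps(restore_settings(), indent=2))
```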

Neither the CPU, nor the disk I/O, nor the network is even remotely saturated. I get a fairly constant indexing rate of about 1240 documents/s.
The index itself is a bit special since it is, for certain reasons, heavily oversharded. There are 960 primary shards + 1 replica.

How can I identify the bottleneck?

best regards,
Matthias

Anything unusual about this index itself - e.g. large docs? stored fields?

That’s a good point! The doc size varies a lot, from about 1 KB to tens of MB.

My next approach was to split the documents into groups by size, like 0 to 10000 bytes, 10000 to 15000 bytes… This also makes it possible to run the reindexing in parallel and to set a suitable batch size for each group. That was necessary because reindexing everything at once was often interrupted when the 100mb buffer was exceeded.
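
A minimal sketch of how such size groups could be turned into one `_reindex` request body per group. It only builds the JSON bodies and sends nothing. It assumes every document carries a numeric field with its size in bytes (called `doc_size_bytes` here, a hypothetical name; the post does not say how the size was determined), and the helper names, index names, and the cap of 1000 documents per batch are assumptions as well.

```python
BUFFER_BYTES = 100 * 1024 * 1024  # the 100mb buffer mentioned in the post


def size_buckets(edges):
    """Turn sorted edges like [0, 10000, 15000] into (low, high) ranges.

    Each range is half-open (low <= size < high); the last one has no
    upper bound and is returned with high=None.
    """
    pairs = list(zip(edges, edges[1:]))
    return pairs + [(edges[-1], None)]


def batch_size_for(high, budget=BUFFER_BYTES, cap=1000):
    """Pick a batch size so a full batch of the largest docs fits the budget."""
    if high is None:
        return 1  # open-ended bucket: documents may be arbitrarily large
    return max(1, min(cap, budget // high))


def reindex_body(low, high, source_index, dest_index,
                 size_field="doc_size_bytes", remote_host=None):
    """Build a _reindex body restricted to one size bucket."""
    bounds = {"gte": low}
    if high is not None:
        bounds["lt"] = high
    source = {
        "index": source_index,
        "size": batch_size_for(high),  # documents per scroll batch
        "query": {"range": {size_field: bounds}},
    }
    if remote_host is not None:
        source["remote"] = {"host": remote_host}
    return {"source": source, "dest": {"index": dest_index}}


if __name__ == "__main__":
    for low, high in size_buckets([0, 10000, 15000]):
        print(reindex_body(low, high, "my-source-index", "my-dest-index"))
```

Because the buckets do not overlap, each body can be submitted as its own reindex task and the tasks can run in parallel.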

With this approach I’ve achieved a reindexing rate of about 6K/s, which sounds reasonable.

Yeah - OpenSearch is geared more toward constant ingestion of documents than toward large batches like the one you originally described. It seems like you are doing a good job now. There are lots of optimization strategies available, but you often need to tune them according to the quirks of your specific documents.