Why are results duplicated after a failed index rebuild?
When an index rebuild fails due to a timeout, the results that were indexed during the operation are added to the index instead of being discarded, which causes duplicate results. Because a successful index rebuild takes about 8 hours and consumes most of the resources of our content management server, we can't always trigger another rebuild immediately. Why is this partial result set added to the live index? Is there a setting we can enable that only switches to the new result set on rebuild completion (like SwitchOnRebuild for Lucene)?
There are multiple phases in a Coveo for Sitecore rebuild operation:
- Synchronizing configuration
- Sending security identities
- Sending Sitecore items
- Validating updated Sitecore items are all searchable (has a 1h timeout)
- Deleting items that were not updated during this rebuild
- Validating items are deleted (has a 1h timeout)
There are 3 scenarios for an item in a rebuild:
- The item has never been indexed: The item will be indexed.
- The item is already in the index and in Sitecore: The item will be updated. Its indexed date will be updated.
- The item is already in the index but was removed from Sitecore: The indexed item is not updated. It will be deleted after the updated items are validated to be searchable.
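The phases and scenarios above can be sketched as a toy model. This is only an illustration under stated assumptions: `simulate_rebuild` is a hypothetical helper, not a Coveo for Sitecore API, and it assumes the index is keyed by item URI and that the deletion phase removes every item whose indexed date is older than the current rebuild.

```python
from datetime import datetime


def simulate_rebuild(index, sitecore_uris, rebuild_started, validation_times_out=False):
    """Toy model of a rebuild (hypothetical helper, not a Coveo API).

    index: dict mapping item URI -> datetime the item was last indexed.
    Returns a new dict representing the index after the rebuild attempt.
    """
    result = dict(index)
    # Scenarios 1 and 2: every item present in Sitecore is added or updated,
    # and its indexed date is set to this rebuild's start time.
    for uri in sitecore_uris:
        result[uri] = rebuild_started
    if validation_times_out:
        # The rebuild stops here, so the deletion phase never runs.
        return result
    # Scenario 3: items not updated during this rebuild are deleted.
    return {uri: t for uri, t in result.items() if t >= rebuild_started}


old = datetime(2020, 1, 1)
now = datetime(2020, 1, 2)
live = {"item-a": old, "item-b": old}  # item-b has since been removed from Sitecore
in_sitecore = ["item-a", "item-c"]     # item-c has never been indexed

failed = simulate_rebuild(live, in_sitecore, now, validation_times_out=True)
succeeded = simulate_rebuild(live, in_sitecore, now)
print(sorted(failed))     # ['item-a', 'item-b', 'item-c']
print(sorted(succeeded))  # ['item-a', 'item-c']
```

In this model, a timeout leaves the stale `item-b` searchable, but no URI ever appears twice. Two entries for the same content can only coexist if their URIs differ.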
A failed rebuild operation caused by a timeout in the post-validation (step 4 above) in Coveo for Sitecore should never lead to duplicate results. The worst it should produce is non-updated items that remain searchable after the failure.
Please compare indexed items that you are considering duplicate and report any differences between them. You can use the Coveo Cloud Content Browser and the "Item JSON" tab in the search results properties to get a comparable text version of the items.
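If comparing the two JSON texts by eye is tedious, a small script can list the fields that differ. This is a minimal sketch, assuming each "Item JSON" text was copied out and parsed into a dictionary (for example with `json.loads`). The field names and values below are made up for illustration and are not the real Coveo item schema.

```python
def diff_items(item_a, item_b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    fields = sorted(set(item_a) | set(item_b))
    return {
        f: (item_a.get(f), item_b.get(f))
        for f in fields
        if item_a.get(f) != item_b.get(f)
    }


# Hypothetical items: same title, different URI.
first = {"title": "Home", "uri": "example://items/home?version=1"}
second = {"title": "Home", "uri": "example://items/home?version=2"}

for field, (a, b) in diff_items(first, second).items():
    print(f"{field}: {a!r} != {b!r}")
```

A field that is missing from one item shows up with `None` on that side, so fields that exist in only one of the two results are reported as well.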
I suspect a difference at the URI level, possibly a Sitecore version difference.