Question by DEEPTHI KATTA, Jan 27, 2018 12:22 AM

Skip adding an item to the index based on a unique metadata value

Hi There!

On Coveo Cloud, is there a way to avoid adding a document to an index in a scenario like the one below?

"If there is a metadata field holding a unique key, check whether any other document on the source has the same value for that key. If yes, do not add this specific item to the index."

If this is not possible, how else can we ensure uniqueness in search results? Our only way to track uniqueness at this point is either through a special metadata value or via a combination of URL parameters. Essentially, two items are duplicates if the resulting page is the same but the URL parameters are subtly different.

The option below does not work for us, as it leads to incorrect counts in the result listing and might not perform well.

Is there anything else we can do to solve this duplicate issue?

Comment by DEEPTHI KATTA, Jan 27, 2018 12:40 AM

I guess that brings me to a core question: when does Coveo consider items on a source to be duplicates? I think that is important to answer at this point.

1 Reply

Answer by Jean-François L'Heureux, Jan 27, 2018 1:06 AM

The Coveo index primary key is the indexed item URI field. It is unique per source. If two items with the same URI are indexed during the same rebuild, the second one overrides the first one. However, nothing stops two sources from containing the same document with the same URI.
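
To picture that keying rule, here is a toy model in Python. It is only an illustration, not how Coveo stores items: the `index` dict, the `add_item` helper, and the source names are all made up, and the key is simply the exact URI string within a source.

```python
# Toy model of an index keyed by (source, URI). Not Coveo's actual storage.
index = {}

def add_item(source, uri, item):
    """Store an item; a later item with the same source and URI replaces the earlier one."""
    index[(source, uri)] = item

add_item("web", "https://example.com/page?id=1", {"title": "first"})
# Same source, same URI: overrides the item above.
add_item("web", "https://example.com/page?id=1", {"title": "second"})
# Same source, different URI string: stored as a separate item in this model.
add_item("web", "https://example.com/page?id=1&ref=nav", {"title": "third"})
# Different source, same URI: also stored separately.
add_item("intranet", "https://example.com/page?id=1", {"title": "fourth"})
```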

It is possible to avoid indexing some items based on their metadata values with indexing pipeline extensions in Coveo Cloud. You can write Python scripts that are executed pre- or post-conversion. Those scripts can tell Coveo not to index the document. In your case it would be post-conversion, as the meta tags of the HTML documents are read at conversion.
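
A minimal sketch of what such a post-conversion script could look like, under a few assumptions. The `skipindexing` metadata name is hypothetical (your pages would have to emit it themselves), and `FakeDocument` is only a local stand-in so the logic can be run outside Coveo. In a real extension the runtime supplies the `document` object; the method names `get_meta_data_value` and `reject` follow my understanding of the extension API, so please check them against the current Coveo documentation.

```python
class FakeDocument:
    """Local stand-in for the `document` object the extension runtime provides (assumption)."""

    def __init__(self, metadata):
        self._metadata = metadata
        self.rejected = False

    def get_meta_data_value(self, name):
        # Modeled as returning a list of values, empty when the key is absent.
        return self._metadata.get(name, [])

    def reject(self):
        # Marks the item as "do not index".
        self.rejected = True


def skip_if_flagged(document, flag_name="skipindexing"):
    """Reject the item when its own metadata carries the (hypothetical) flag."""
    values = document.get_meta_data_value(flag_name)
    if any(str(v).strip().lower() == "true" for v in values):
        document.reject()


# Local dry run with two fake items.
flagged = FakeDocument({"skipindexing": ["true"]})
normal = FakeDocument({"title": ["Some page"]})
skip_if_flagged(flagged)
skip_if_flagged(normal)
```

Note that the decision uses only the item's own metadata, since the script cannot look at other indexed items.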

However, it is not possible to query the index in a conversion script to check indexed document field values. Even if it were possible, it should be avoided. Imagine the load on the index if every indexed item required a query. It would be very slow.

The Coveo index is able to compare indexed items and know which ones are content duplicates (same original document content). They can have different field values, even different URIs. This is achieved with the `enableDuplicateFiltering` option you already found. It is an effective way to filter out duplicates at query time.

Comment by DEEPTHI KATTA, Jan 29, 2018 6:07 PM

Thanks Jeff.

I totally understand that it would be a performance hit to check whether a specific document already exists on a source while adding to the index.

I thought about it again and realized that our duplication problem comes from the business logic of this implementation, and the most graceful approach is to fix the issue there instead of working around it in Coveo.

I will try to push in that direction. It does help to know that the unique key is the Clickable URI. I am assuming that if the query parameters differ, Coveo will treat the items as two different documents and hence will index both rather than override.

Also, I am curious about the inner workings of enableDuplicateFiltering. Do you happen to know how Coveo detects content duplicates? I know there could be several factors that Coveo analyzes to decide whether two documents are similar at query time, but any further insight would help us make a few decisions on our end.

Comment by Daniel Lavoie, Jan 29, 2018 7:03 PM

Duplicate filtering is the feature that identifies and removes similar documents from the results list at query time. Duplicate filtering in Coveo is done through the use of shingles. A shingle is a merge of a specified number of consecutive words, taken randomly. Coveo uses 4 as the number of words, and 300 as the number of shingles. These shingles are then used to compute the similarity of documents, at query time.

So, for each document that is to be returned by Coveo, a check is done to see if it is a duplicate of any of the results that come before it in the results list. This is a computationally intensive process and, when returning hundreds of documents, can create performance issues.

By default, documents are considered duplicate at 85% similarity.
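
To make the idea concrete, here is a rough Python sketch of shingle-based filtering using the numbers above (4 words, 300 shingles, 85%). It is not Coveo's implementation. The Jaccard similarity measure, the way the 300 shingles are sampled (smallest hashes, as a repeatable stand-in for "taken randomly"), comparing only against results already kept, and all function names are assumptions made for illustration.

```python
import hashlib

SHINGLE_WORDS = 4    # words per shingle
MAX_SHINGLES = 300   # shingles kept per document
THRESHOLD = 0.85     # similarity at or above which a result counts as a duplicate


def shingles(text):
    """Return up to MAX_SHINGLES four-word shingles for a text."""
    words = text.lower().split()
    grams = {
        " ".join(words[i:i + SHINGLE_WORDS])
        for i in range(len(words) - SHINGLE_WORDS + 1)
    }
    # Assumption: sample by keeping the shingles with the smallest hashes.
    ranked = sorted(grams, key=lambda g: hashlib.sha256(g.encode("utf-8")).hexdigest())
    return set(ranked[:MAX_SHINGLES])


def similarity(a, b):
    """Jaccard similarity of two shingle sets (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_duplicates(results):
    """Keep a result only if it is not a near-duplicate of a result kept before it."""
    kept = []  # list of (text, shingle set) pairs
    for text in results:
        current = shingles(text)
        if all(similarity(current, earlier) < THRESHOLD for _, earlier in kept):
            kept.append((text, current))
    return [text for text, _ in kept]
```

Each result is compared with every result kept before it, so the work grows roughly with the square of the number of results, which is in line with the performance warning above.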
