Question by ken thomas, Dec 15, 2016 12:18 PM

question about wildcard searches

when I perform a coveo search for rm*,

the search finds items with rm all by itself first, before any item which has rm plus some more characters. pages and pages of results with rm, before any with rma (for example) are shown.

why are the items which only contain the first part ranked somehow higher than the other items? does not make any sense to me.

2 Replies
Answer by Martin Laporte, Dec 16, 2016 3:55 AM

The index tends to give more weight to items that match exactly what has been entered than to those matching a derivative. For example, when searching for "universe", a document containing this exact keyword will rank higher than one containing "universities", even though both documents match the query because of stemming. Something similar is probably happening here, although it concerns wildcards instead of stemming. Still, in my opinion it makes sense to favor keywords that are closer to the search string.

Comment by ken thomas, Dec 19, 2016 9:17 AM

That really does not make any sense, especially for a search with the wildcard at the end. The user is specifically saying that they are looking for something to be in that space. In my example, if I wanted matches for "rm" I would have just searched for "rm". Instead, I searched for rm*, meaning I am specifically looking for things with 3 letters or more. It makes no sense that the items that actually match the wildcard search are so much further down in the search results.

Comment by Martin Laporte, Dec 19, 2016 12:24 PM

The traditional interpretation for a wildcard is "0 or more characters", so in this case rm does indeed match the wildcard.
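
To illustrate the "0 or more characters" reading, here is a minimal sketch using Python's standard fnmatch module. It only demonstrates the conventional wildcard semantics and is not Coveo's implementation; the term list is made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical terms, chosen only for illustration.
terms = ["rm", "rma", "rm12", "normal"]

# "*" matches zero or more characters, so the bare term "rm" matches "rm*".
# "normal" does not match, because the pattern is anchored at the start.
matches = [t for t in terms if fnmatchcase(t, "rm*")]
print(matches)
```

A user who wants at least one extra character would need a pattern such as "rm?*" under these semantics, where "?" stands for exactly one character.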

Answer by Daniel Lavoie, Dec 20, 2016 9:23 AM

Assuming that the index is configured to return all the candidates for a wildcard expression (it's a setting, as some expressions could return the whole lexicon), the results will be ranked using all the words matching the expression. For example, if "rm*" matches [rm, rma, rm12, rms, rmve], documents will be ranked as if the query was actually: rm OR rma OR rm12 OR rms OR rmve.

So ranking does not use the length of the keywords or their closeness to the original wildcard expression. Instead, it uses the terms gathered from the expression and ranks the documents using the standard algorithm (match in [title, summary, concepts], Okapi BM25, etc.). So if "rm" matches many document titles as opposed to "rma", documents matching "rm" will be ranked higher.

Does that make more sense?
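
The expansion described in this answer can be sketched roughly as follows. This is an illustrative approximation, not the index's actual code: the lexicon, the helper names expand_wildcard and to_or_query, and the use of fnmatch are all assumptions made for the example.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, lexicon):
    """Return every lexicon term that matches the wildcard pattern."""
    return [term for term in lexicon if fnmatchcase(term, pattern)]


def to_or_query(terms):
    """Join the expanded terms into a single OR query string."""
    return " OR ".join(terms)


# Hypothetical lexicon; a real index would hold far more terms.
lexicon = ["rm", "rma", "rm12", "rms", "rmve", "normal"]

expanded = to_or_query(expand_wildcard("rm*", lexicon))
print(expanded)
```

Once the query has been rewritten this way, the wildcard itself plays no further role in ranking: each expanded term is scored like any other keyword.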

Comment by ken thomas, Dec 20, 2016 9:42 AM

makes totally zero sense. but thank you for trying.

First, what specific settings are you talking about, and where is the documentation for those settings? I don't know the details, but I think that, to avoid performance issues, wildcards are not usually set up to return all candidates. But I have not found any solid details explaining all of this, anywhere.

For my example, I scrolled through to page 51 and still had not seen any item with RMA in it, except as part of the word noRMAl. Also, I noticed that while RM by itself was highlighted in bold as a match, RMA by itself was not bolded, indicating that it was not a match. In the titles of items, RM by itself was bolded; in similar items with RMA in the title, it was not.

for a knowledge specific search (one of our sources) there are 36 items for RM* - all of them have RM. None of the RMA items show up with that search. Using the same pipeline, same knowledge source a search for rma (no wildcard) returns 361 items with RMA shown bold. None of this makes any sense at all to me

Comment by Daniel Lavoie, Dec 20, 2016 9:46 AM

Number of candidates:

Make sure it is set to a value high enough to include all the possible candidates for wildcard expressions, otherwise some candidates will be left out. Just make sure not to use a value that is too high, to prevent queries loading all the terms of the index through wildcards.
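
A rough sketch of why a low candidate limit can hide terms such as "rma". How the index actually picks which candidates to keep is not stated in this thread; the sketch simply assumes the first matches in lexicon order are kept, and the function name expand_with_limit and the lexicon are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_with_limit(pattern, lexicon, max_candidates):
    """Expand a wildcard, keeping at most max_candidates matching terms.

    Assumption: candidates are kept in lexicon order. The real selection
    rule may differ.
    """
    matched = [t for t in lexicon if fnmatchcase(t, pattern)]
    return matched[:max_candidates], matched[max_candidates:]


# Hypothetical lexicon for illustration.
lexicon = ["rm", "rm12", "rma", "rms", "rmve"]

kept, dropped = expand_with_limit("rm*", lexicon, max_candidates=2)
print("kept:", kept)
print("dropped:", dropped)
```

With a limit of 2 in this toy setup, "rma" never reaches the expanded query, so documents containing only "rma" would not be returned or highlighted for "rm*".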

Comment by JFCG, Dec 21, 2016 10:23 AM

Maybe the following would help in clarifying the behavior, Ken.

It should also be noted that the 'Number of Candidates' setting is not currently available to Cloud customers.

Wildcard search can be very taxing and consequently needs to be constrained to some extent, which can lead to unexpected outcomes. The following point from the doc linked above is thus relatively important:

1) If you're not getting the expected results when using wildcards, test out more leading characters in your query to get a feeling of what is occurring behind the scenes.

Perhaps you could use a trigger to cause your search to display a message to the above effect if you feel end-users may too often get tripped up by wildcards' inherent limitations.

Comment by ken thomas, Jan 5, 2017 10:43 AM

I had logged a support case for this issue and got that same article, which does help explain why the results Coveo gives after a wildcard search can often be completely non-intuitive and, from the user's perspective, wrong. I posted an idea to help improve the product around this behavior.
