
Question by fcote, Feb 5, 2015 11:55 AM

Exclusion filter using Regular expression

Using Sitecore Connector 2, I created two exclusion filters on my source using regular expressions:

1 - ^.*\/~\/media\/documents\/(?!(uae)\/).+$

2 - ^.*\/~\/media\/downloads\/(?!(uae)\/).+$

After rebuilding the source, I still see documents from the media/downloads/uae

http://www.mysite.com/sitecore%20modules/web/~/media/downloads/uae/product-categories/product-sub-cat/silent%20scan/brochure.pdf#search=%22scan%22

The second exclusion filter should have eliminated the link, but after rebuilding the index, it still appears.

What could be the cause of this behavior?


Comment by jouimet, Feb 5, 2015 12:12 PM

Correction to the question: the previous link should appear. This is the link that should not:

http://www.mysite.com/sitecore%20modules/web/~/media/downloads/turkey/product-categories/product-sub-cat/silent%20scan/brochure.pdf#search=%22scan%22

The regular expression should filter out anything that matches up to /media/downloads/ but is not followed by the uae token.


Comment by jouimet, Feb 5, 2015 2:32 PM

It looks like when Fredric asked the question here, he tried to sanitize the values. Unfortunately, by doing so he posted the wrong filters. The filters that are being used are:

^.*\/~\/media\/documents\/(?!(uae)\/).+$

and

^.*\/~\/media\/downloads\/(?!(uae)\/).+$

and the item that is being included is http://www.mysite.com/sitecore%20modules/web/~/media/downloads/turkey/product-categories/magnetic-resonance-imaging/silent%20scan/gehc-brochure-mr-silent-scan.pdf#search=%22scan%22

When tested on http://regexpal.com/, which is a JavaScript regular expression tester, the second filter matches this URL, which means that the item should be excluded. But for some reason, it is not.
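The same check can be reproduced outside the browser. The following is only a sketch: it uses Python's `re` module rather than the ECMAScript engine the connector uses (for this particular pattern the two should behave the same), and the helper name `is_matched` and the `BASE`/`TAIL` constants are made up for illustration. The URLs are the sanitized ones from the question.

```python
import re

# Second exclusion filter, copied from the thread.
# Assumption: Python's re engine is used here instead of ECMAScript.
FILTER = re.compile(r"^.*\/~\/media\/downloads\/(?!(uae)\/).+$")

# Hypothetical helpers to build the two sanitized URLs from the question.
BASE = "http://www.mysite.com/sitecore%20modules/web/~/media/downloads/"
TAIL = "/product-categories/product-sub-cat/silent%20scan/brochure.pdf#search=%22scan%22"


def is_matched(url: str) -> bool:
    """Return True when the filter pattern matches the URL."""
    return FILTER.match(url) is not None


turkey_url = BASE + "turkey" + TAIL
uae_url = BASE + "uae" + TAIL

print(is_matched(turkey_url))  # True: the folder after downloads/ is not uae/
print(is_matched(uae_url))     # False: the negative lookahead rejects uae/
```

If the filter is applied as an exclusion, a match on the turkey URL is what should cause that document to be dropped, while the uae URL should be left alone.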


Comment by jouimet, Feb 5, 2015 2:34 PM

It looks like it isn't Fredric, but the Rich Text Editor that is reformatting the message. Let me try this

^.*\/~\/media\/documents\/(?!(uae)\/).+$

and 

^.*\/~\/media\/documents\/(?!(uae)\/).+$ 
2 Replies

Answer by Jean-François L'Heureux, Feb 5, 2015 1:05 PM

As explained in the documentation (see Adding or Modifying Source Filters), the source filters use the ECMAScript regular expression syntax. This syntax requires that any forward slash is escaped with a backslash (see http://www.regular-expressions.info/javascript.html ).

The correct regular expressions are:

^.*\/~\/media\/documents\/(?!(uae)\/).+$
^.*\/~\/media\/downloads\/(?!(uae)\/).+$

You may even need a forward slash at the beginning and at the end of your regular expression, but I'm not sure.


Answer by Martin Laporte, Feb 6, 2015 11:06 AM

Exclusion filters are supposed to match the content to exclude, not the other way around. So if you want to exclude uae you should not use a negative lookahead but simply match the uae part, something like:

^.*\/\~\/media\/downloads\/uae\/.+$
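For comparison, here is a rough sketch of how a pattern that matches the uae folder directly would classify the URLs from this thread. It is written for Python's `re` module, not the connector's ECMAScript engine, and the pattern, the `should_exclude` helper, and the constants are illustrative assumptions rather than anything taken from the Coveo documentation.

```python
import re

# Illustrative pattern (an assumption, not a verified Coveo filter):
# matches any URL under ~/media/downloads/uae/ with no lookahead involved.
EXCLUDE_UAE = re.compile(r"^.*\/~\/media\/downloads\/uae\/.+$")

# Hypothetical helpers to build the sanitized URLs from the question.
DOWNLOADS = "http://www.mysite.com/sitecore%20modules/web/~/media/downloads/"
REST = "/product-categories/product-sub-cat/silent%20scan/brochure.pdf"


def should_exclude(url: str) -> bool:
    """Return True when the URL falls under the uae downloads folder."""
    return EXCLUDE_UAE.match(url) is not None


print(should_exclude(DOWNLOADS + "uae" + REST))     # True
print(should_exclude(DOWNLOADS + "turkey" + REST))  # False
```

This is the opposite selection from the negative-lookahead filters: it picks out the uae documents and leaves every other folder untouched.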


Comment by jouimet, Feb 6, 2015 11:22 AM

mlaporte,

There has been some confusion around the ticket as it was generated. I want to exclude everything that is not uae. So the URL

http://www.mysite.com/sitecore%20modules/web/~/media/downloads/turkey/product-categories/magnetic-resonance-imaging/silent%20scan/gehc-brochure-mr-silent-scan.pdf#search=%22scan%22

should be excluded, but it is not.
