
Question by AlexShapo, Mar 27, 2018 3:22 PM

How to merge URLs?

I have the following URL, which contains our press releases:
http://www.bce.ca/news-and-media/releases

That page has paging, which is implemented using the query string, like this:
http://www.bce.ca/news-and-media/releases?page=3&perpage=10

All of the links on that page look like this, with ?page=1 in the query string, so that the "back" button works:
http://www.bce.ca/news-and-media/releases/show/Bell-Canada-announces-offering-of-Series-US-1-Notes-1?page=1&month=&year=

The issue is that when indexing those pages, Coveo somehow finds two URLs, one with the query string and another without.

I understand where the first URL comes from, but I have no idea where it found the second one, without the query string.

http://www.bce.ca/news-and-media/releases/show/Bell-Canada-announces-offering-of-Series-US-1-Notes-1?page=1&month=&year=
http://www.bce.ca/news-and-media/releases/show/Bell-Canada-announces-offering-of-Series-US-1-Notes-1

As a result, the search results page shows a duplicate for every press release:
http://www.bce.ca/news-and-media/advancedsearch

One URL has the query string and the second does not. I need a way to merge them or show only one; it does not matter which.

I tried excluding ?page from my WEB source, but in that case it shows no results. I also tried the following exclusion filter and configuration, but that did not help.

*?page=*
"parameters": {
      "ExpandBeforeFiltering": {
        "sensitive": false,
        "value": "true"
      },

Is it possible to write an extension that fixes this? I tried, but it looks like extensions do not have access to the "uri" and "sysuri" fields, so I can't modify them. Any ideas on how I can do it?

Configuration:
Source: WEB
Site URL: http://www.bce.ca/news-and-media/releases
Inclusion Filter: */news-and-media/releases/show/*

1 Reply

Answer by Francois Verpaelst, Mar 27, 2018 3:46 PM

Hi! Pages with query parameters in their URL are usually listing pages, which should not be indexed. However, if you exclude them, the crawler will struggle to reach the pages referenced in these listing pages. I would rather let the crawler reach all the pages, but then reject the listing pages with a REGEX in a Python extension.

If your index is in a Cloud V2 org, you should be able to use document.uri to access and manipulate the URL.
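
As a rough illustration of the approach described above, here is a minimal sketch of the URL check such an extension could perform. It assumes that every press release is also reachable at its clean URL (as the duplicates in the question suggest), so any URI carrying a page parameter can be dropped. The pattern and the helper name is_paged_uri are hypothetical, and the wiring at the bottom is left commented out because the document object only exists inside the extension runtime; check the call for rejecting a document against the extension API documentation for your org.

```python
import re

# Assumption: any URI whose query string carries a "page" parameter is either
# a listing page or the duplicate copy of a press release. Adjust the pattern
# if other query parameters also need to be filtered out.
PAGED_URI = re.compile(r"[?&]page=", re.IGNORECASE)


def is_paged_uri(uri):
    """Return True when the URI contains a page= query parameter."""
    return PAGED_URI.search(uri) is not None


# Inside the indexing pipeline extension, the platform supplies `document`.
# The wiring would look roughly like this (not runnable outside the pipeline):
#
# if is_paged_uri(document.uri):
#     document.reject()
```

With this check, http://www.bce.ca/news-and-media/releases?page=3&perpage=10 and the ?page=1 copy of each press release would be rejected, while the URL without a query string would be kept.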
