
Question by dcr2290, Jan 23, 2017 5:46 AM

How to crawl orphan pages using Coveo?

Hi all, we have some URLs that are not linked anywhere on the site and are not listed in the sitemap either. Is there a REST API or another way to feed those orphan URLs to the Coveo crawler?

1 Reply

Answer by Sébastien Belzile, Jan 23, 2017 7:45 AM

Are we talking about a web source? A Sitemap source? Or something else?

On-Premises? Cloud?

A web source allows you to explicitly configure the URLs that you want to crawl. In a Sitemap source, you also have control over what is included in your Sitemap file.


Comment by dcr2290, Jan 23, 2017 8:38 AM

We are using a web source. Actually, I am a newbie here. Could you tell me how to configure the orphan URLs to be crawled? I would also like to know if there is a REST API to feed those URLs.


Comment by Sébastien Belzile, Jan 23, 2017 9:20 AM

Since you did not tell me which product you are using and you seem to be a new user, I will assume you use Cloud V2. Tell me if you use a different product and I will adjust the answer.

Configure orphan URLs:

  • Connect to the Administration tool (https://platform.cloud.coveo.com)
  • In the source panel, select the source that you want to edit.
  • Click Edit.
  • You should see your web source properties.

To know exactly what every parameter does, refer to the documentation (https://onlinehelp.coveo.com/en/cloud/addeditwebsource.htm#AddorEditaWebSource) or use the search in the online help.

Basically, you can add the URL of your orphan pages in the Site URL box.

REST API:

For the web connector, you should edit the configuration in the Administration console. There is indeed an API to edit the source configuration (https://platform.cloud.coveo.com/docs), but not one to push URLs to crawl.

Coveo also has a Push API that you can use to push actual content if you would like to index something that Coveo does not crawl out of the box, but Coveo does not offer an API to tell its crawlers to refresh or remove specific content.


Comment by dcr2290, Jan 24, 2017 7:55 AM

Hi, thanks for the quick response.

Since we have thousands of orphan URLs, we can't add them one by one in the Site URL box.

Is there any way to automate this process (e.g. something that would add all those URLs at once, automatically)?

We want all URLs to be indexed, so we are looking for an API or anything else that would let us add those URLs programmatically.


Comment by Sébastien Belzile, Jan 24, 2017 8:17 AM

Thousands of orphan URLs… Not SEO friendly at all. I think this is something you should tackle.

I don't think adding thousands of pages to your source configuration is good practice either, though it can easily be done: https://platform.cloud.coveo.com/docs. Editing the source configuration is a simple REST call, and the list of URLs to crawl is a simple JSON array.
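As a rough illustration of the "simple JSON array" part, here is a minimal sketch that merges a list of orphan URLs into a source configuration held as a Python dict and serializes the result. The `urls` field name, the `merge_urls` helper and the example configuration are all hypothetical; the real schema and the endpoint to send the payload to have to be taken from the API documentation linked above, and no HTTP call is made here.

```python
import json


def merge_urls(config, new_urls, key="urls"):
    """Return a copy of `config` with `new_urls` appended to the array at `key`.

    `key` is a hypothetical field name; check the actual source schema.
    Existing order is preserved and URLs already present are skipped.
    """
    merged = list(config.get(key, []))
    seen = set(merged)
    for url in new_urls:
        url = url.strip()
        if url and url not in seen:
            merged.append(url)
            seen.add(url)
    updated = dict(config)  # shallow copy so the caller's dict is not modified
    updated[key] = merged
    return updated


# Hypothetical configuration, as it might look after being fetched and parsed.
config = {"name": "my-web-source", "urls": ["https://example.com/"]}
orphans = [
    "https://example.com/orphan-1",
    "https://example.com/orphan-2",
    "https://example.com/",  # already configured, will be skipped
]

# JSON body that would then be sent back in the configuration update call.
payload = json.dumps(merge_urls(config, orphans), indent=2)
print(payload)
```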

An alternative and cleaner solution would be to have some sort of "orphan pages index". You probably have a list of your orphan pages, right? You can easily format that list as an HTML page, right? Then you could feed the crawler the URL of that page, or link that page somewhere in your site, which would fix your problem with every web crawler, not only Coveo's. What do you think?
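The "orphan pages index" idea could be sketched as follows: take the list of orphan URLs and render a plain HTML page with one link per URL, which a crawler can then follow. The function name `build_orphan_index`, the page title and the example URLs are made up for illustration; where the list comes from and where the page gets published are left open.

```python
from html import escape


def build_orphan_index(urls, title="Orphan pages index"):
    """Render a minimal HTML page containing one link per URL."""
    items = "\n".join(
        f'    <li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        "<body>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


# Example URLs only; in practice the list would be loaded from wherever
# the orphan pages are tracked (a database export, a text file, ...).
orphan_urls = [
    "https://example.com/orphan-1",
    "https://example.com/a?x=1&y=2",
]

page = build_orphan_index(orphan_urls)
print(page)
```

Saving `page` as a file on the site and adding that single URL to the Site URL box (or linking it from an existing page) would then replace the thousands of individual entries.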
