Question by Sathis, Sep 9, 2015 9:17 AM

Webpage Connector language restriction

Is there a Web Pages connector parameter that restricts the crawler to English (or another specific language) content only?

I am crawling an external site with the Web Pages connector, and that site links to pages in other languages. I don't want to crawl those pages.

2 Replies

Answer by Luc Bergeron, Sep 9, 2015 9:38 AM

Hi,

When using the Web Pages connector, you can define filters that include or exclude URLs by pattern. The right pattern depends on the site you are crawling. To define those filters:

  1. Open the CES Admin Tool.
  2. Edit your source.
  3. Click the Filters link in the left menu.

For instance, you might include http://yourSite.com/en/* to get only the English pages. Rules can be defined using either wildcards (* or ?) or regular expressions. If you need to match a query string argument in the pattern, I believe you have to use a regular expression.
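To make the difference between the two rule styles concrete, here is a minimal Python sketch. It does not use any Coveo API and is not how CES evaluates filters; it only assumes that the wildcard rules behave roughly like shell-style globs, which Python's fnmatch module approximates. The URLs and the lang query string argument are made up for illustration.

```python
import fnmatch
import re

# Hypothetical URLs, for illustration only.
urls = [
    "http://yourSite.com/en/products.html",
    "http://yourSite.com/fr/products.html",
    "http://yourSite.com/page?lang=en",
    "http://yourSite.com/page?lang=fr",
]

# Wildcard-style rule: '*' matches any run of characters,
# '?' matches exactly one character.
wildcard_rule = "http://yourSite.com/en/*"

# Regular-expression rule for a query string argument.
# The literal '.' is escaped, and the literal '?' is matched via [?&].
# 'lang' is an assumed parameter name, not something the site is known to use.
regex_rule = re.compile(r"^http://yourSite\.com/.*[?&]lang=en(&|$)")


def is_included(url):
    """Return True if the URL matches either inclusion rule."""
    if fnmatch.fnmatchcase(url, wildcard_rule):
        return True
    return regex_rule.search(url) is not None


included = [u for u in urls if is_included(u)]
for u in included:
    print(u)
```

In this sketch the wildcard rule keeps the /en/ page and the regular expression keeps the lang=en page, while both /fr/ and lang=fr are left out. The actual pattern syntax accepted by the Filters page may differ, so check it against your own source before relying on it.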

I hope this helps.

Comment by Sathis, Sep 9, 2015 10:47 AM

Thanks for the response.

Answer by Simon, Sep 9, 2015 9:38 AM

Do you have any fields on the page (meta tags, for example) that the connector can use as a filter?

For example: <meta name="language" content="en">

From there you could use a conversion script to filter out unwanted languages.

See this post for more info: https://answers.coveo.com/questions/352/can-you-retrieve-the-value-of-a-different-attribute-for-a-field-built-from-meta-header-tags
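As a rough illustration of the meta tag idea, the decision logic could look like the Python sketch below. This is not a CES conversion script and it uses no Coveo API; it only shows the check itself, assuming the page exposes a meta tag named language as in the example above. The names page_language and should_index are hypothetical.

```python
from html.parser import HTMLParser


class MetaLanguageParser(HTMLParser):
    """Collects the content of the first <meta name="language"> tag."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop after the first match.
        if tag != "meta" or self.language is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "language":
            self.language = (attributes.get("content") or "").strip().lower()


def page_language(html):
    """Return the declared language in lowercase, or None if the tag is absent."""
    parser = MetaLanguageParser()
    parser.feed(html)
    return parser.language


def should_index(html, allowed=("en",)):
    """Keep a page only if its declared language is in the allowed list."""
    return page_language(html) in allowed


sample = '<html><head><meta name="language" content="en"></head></html>'
print(should_index(sample))
```

Note that this sketch rejects pages that have no language meta tag at all. Whether such pages should be kept or dropped is a choice to make for your own site.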

Cheers,
Simon