Webpage Connector language restriction
Is there a Web Pages connector parameter that restricts the crawler to English (or another specific language) content?
I am crawling an external site using the Web Pages connector, and that site links to pages in other languages. I don't want to crawl those other-language pages.
When using the Web Pages connector, you can define filters that include or exclude URLs based on patterns. The pattern will depend on the site you are crawling. To define those filters:
- Open the CES Admin Tool.
- Edit your source.
- Click the Filters link in the left menu.
For instance, you might want to include http://yourSite.com/en/* to get only the English pages. Rules can be defined either with wildcards (*, ?) or with regular expressions. If you want to match a query string argument in the pattern, I guess you have to use regular expressions in that particular case.
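To make the difference between the two rule types concrete, here is a rough Python sketch of how a wildcard include rule and a regular-expression include rule would classify a few URLs. It only illustrates the matching logic; it is not how CES evaluates filters internally, and the `lang=en` query string argument and the function names are assumptions made up for the example.

```python
import re
from fnmatch import fnmatchcase

# Wildcard include rule, taken from the example above.
WILDCARD_INCLUDE = "http://yourSite.com/en/*"

# Regex include rule for a site that marks the language with a
# query string argument (hypothetical "lang=en" parameter).
REGEX_INCLUDE = re.compile(r"^http://yourSite\.com/.*[?&]lang=en(&.*)?$")


def included_by_wildcard(url: str) -> bool:
    """Return True when the URL matches the wildcard include rule."""
    return fnmatchcase(url, WILDCARD_INCLUDE)


def included_by_regex(url: str) -> bool:
    """Return True when the URL carries the assumed lang=en argument."""
    return REGEX_INCLUDE.match(url) is not None


if __name__ == "__main__":
    for url in (
        "http://yourSite.com/en/products/index.html",
        "http://yourSite.com/fr/products/index.html",
        "http://yourSite.com/page.aspx?id=3&lang=en",
        "http://yourSite.com/page.aspx?id=3&lang=fr",
    ):
        print(url, included_by_wildcard(url), included_by_regex(url))
```

The wildcard rule is enough when the language is part of the path. The regex form is only needed when the language lives somewhere a simple prefix cannot reach, such as the query string.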
I hope this helps.
Do you have any fields on the page (meta tags, for example) that the connector can use to filter on?
For example: <meta name="language" content="en">
From there you could use a conversion script to filter out unwanted languages.
See this post for more info: https://answers.coveo.com/questions/352/can-you-retrieve-the-value-of-a-different-attribute-for-a-field-built-from-meta-header-tags
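As a rough illustration of the meta tag idea, the following Python sketch reads a `language` meta tag from a page and decides whether the page should be kept. It is not a CES conversion script and does not use the CES scripting API; the class and function names are hypothetical, and it assumes the site really emits a tag like the one in the example.

```python
from html.parser import HTMLParser


class MetaLanguageParser(HTMLParser):
    """Collects the content of the first <meta name="language"> tag."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        # Only the first language meta tag is considered.
        if tag != "meta" or self.language is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "language":
            content = attributes.get("content")
            if content:
                self.language = content.strip().lower()


def page_language(html: str):
    """Return the declared language code, or None when no tag is present."""
    parser = MetaLanguageParser()
    parser.feed(html)
    return parser.language


def should_index(html: str, allowed=("en",)) -> bool:
    """Keep a page only when its declared language is in the allowed set."""
    return page_language(html) in allowed


if __name__ == "__main__":
    english = '<html><head><meta name="language" content="en"></head></html>'
    french = '<html><head><meta name="language" content="fr"></head></html>'
    print(should_index(english), should_index(french))
```

Note that this sketch rejects pages with no language meta tag at all. Whether such pages should be kept or dropped is a choice you would have to make for your own site.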