Question by Ramakrishna Kudikala, Jun 28, 2016 1:06 PM

Crawl web pages with a query string, but keep the query string out of the indexed URL

Hi,

I am configuring an index that crawls web pages. I want the crawler to request each page with the query string "view=app", but the URL of the indexed document should not contain that query string parameter. Do you have any suggestions on how to achieve this?

1 Reply

Answer by Simon, Jun 28, 2016 1:23 PM

Conversion scripts would be what you are looking for:

https://developers.coveo.com/x/YoAV

You can set the clickable URI field to a URL of your choice.

For Coveo Cloud V2, you can use web scraping:

http://www.coveo.com/go?dest=cloudhelp&lcid=9&context=277

Cheers,
Simon

Comment by Ramakrishna Kudikala, Jun 28, 2016 2:18 PM

Thank you. I have appended the querystring "view=app" to the clickable URI field using the pre conversion script. Now the indexed document link has the querystring in it. This is fine.

I have a question. Does the crawler actually index the whole web page as served with the querystring? In CES, I search for a phrase that appears in the HTML body of the page and expect that page to be in the search results, but the search doesn't return any documents. The page shows up only if I search for its title, meta tag keywords, or description.

Comment by Simon, Jun 28, 2016 2:38 PM

Is it a web source crawled using the web connector? The web connector usually grabs the entire body of the document.

Comment by Ramakrishna Kudikala, Jun 28, 2016 3:33 PM

Yes, it is crawled using the web connector. The index is in the WebCrawlerCollection.

Comment by Simon, Jun 28, 2016 4:05 PM

The web connector should index the entire document. In the Coveo Index Browser (On-Premises) or the Content Browser (Cloud), what is the file type of your indexed item? If it worked well, it should be an HTML document. You should also see a "cached" button that shows the body. Do you see it?

Comment by Ramakrishna Kudikala, Jun 28, 2016 4:37 PM

The content type is html. I checked the cached version and found that Coveo is not using the querystring parameter "view=app" when building the indexed document.

I appended the "view=app" querystring to the Clickable URI of the document in the pre-conversion script, and I can see it in the ClickableURI property. Why is the crawler not using view=app while indexing? I also changed one other setting: I unchecked the option "Skip addresses with parameters (domain.com?parameters)".

Comment by Simon, Jun 28, 2016 4:56 PM

Not 100% sure, but I believe the HTML is fetched using the clickuri, and you are changing it before the conversion with a pre-conversion script. Have you tried a post-conversion script instead?

Comment by Ramakrishna Kudikala, Jun 28, 2016 4:57 PM

Simon,

Also, from the operational logs, I observed that the document is indexed first and the PreConversion script is executed afterwards. Is this expected behavior?

Comment by Simon, Jun 28, 2016 4:59 PM

See my previous comment. The indexing is done before the conversion, but the HTML is fetched during the conversion. Have you tried using a post-conversion script?

Comment by Ramakrishna Kudikala, Jun 29, 2016 9:13 AM

I tried a post-conversion script to set the clickable URI. The cached HTML still shows that the page was pulled without the query string. I guess the HTML is fetched first and the pre- and post-conversion scripts execute afterwards. I came across this document, which shows that crawling happens before all pre- and post-conversion scripts: https://onlinehelp.coveo.com/en/ces/7.0/administrator/whataretheconversionphases.htm Does Coveo fetch the HTML in the crawling phase?

Comment by Jean-François L'Heureux, Jul 8, 2016 3:33 PM

Here's the order of operations:

  1. The HTML document is fetched by the crawler using the starting address of the source.
  2. The Pre-conversion scripts are run.
  3. The HTML document is converted (its text is extracted to be free text searchable, a cached version is saved).
  4. The Post-conversion scripts are run.
  5. The document is added to the search index.
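
To make the consequence of that ordering concrete, here is a toy model in plain Python. It is only a sketch of the five steps above, not Coveo code: the `FAKE_SITE` dictionary, the `Document` class, and every function name are hypothetical stand-ins. It shows that a script which edits the clickable URI cannot change the body, because the body was already downloaded in step 1.

```python
from dataclasses import dataclass

# Hypothetical website: the page body depends on the query string.
FAKE_SITE = {
    "http://example.com/page": "<html>default view</html>",
    "http://example.com/page?view=app": "<html>app view with extra text</html>",
}


@dataclass
class Document:
    clickable_uri: str
    raw_html: str
    body_text: str = ""


def crawl(address):
    # Step 1: the HTML is fetched here, using the crawled address.
    return Document(clickable_uri=address, raw_html=FAKE_SITE[address])


def convert(doc):
    # Step 3: text is extracted from the HTML that step 1 downloaded.
    doc.body_text = doc.raw_html.replace("<html>", "").replace("</html>", "")
    return doc


def run_pipeline(address, pre_script=None, post_script=None):
    doc = crawl(address)                 # step 1
    if pre_script:
        doc = pre_script(doc)            # step 2
    doc = convert(doc)                   # step 3
    if post_script:
        doc = post_script(doc)           # step 4
    return doc                           # step 5: what ends up in the index


def append_view_app(doc):
    # What was tried in this thread: only the metadata changes.
    doc.clickable_uri += "?view=app"
    return doc


def strip_query(doc):
    # Simplistic removal of the whole query string, good enough for the toy.
    doc.clickable_uri = doc.clickable_uri.split("?")[0]
    return doc


tried = run_pipeline("http://example.com/page", pre_script=append_view_app)
suggested = run_pipeline("http://example.com/page?view=app", pre_script=strip_query)
print(tried.clickable_uri, "|", tried.body_text)
print(suggested.clickable_uri, "|", suggested.body_text)
```

In the first run the link carries view=app but the indexed text is still the default view; in the second run the text comes from the view=app page while the link stays clean.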

You should put view=app in your starting address and make sure your website adds it to all the links on the page, so that the other documents are crawled with it too. Then, in a pre-conversion script, remove the view=app query string argument so that the indexed documents do not have it.
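
The URL rewrite itself is simple string handling. Below is a minimal sketch in plain Python using only the standard library; it does not use Coveo's scripting API, and the helper name `strip_query_param` is made up for illustration. The same logic would have to be ported to whatever language your conversion script uses and applied to the clickable URI there.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_param(url, name, value=None):
    """Return url without the query argument `name`.

    If `value` is given, only arguments equal to name=value are removed.
    Other arguments and the fragment are kept. Note that urlencode
    re-encodes the remaining arguments, so escaping may be normalized.
    """
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not (k == name and (value is None or v == value))
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_query_param("http://example.com/page?view=app", "view", "app"))
print(strip_query_param("http://example.com/page?id=7&view=app#top", "view", "app"))
```

Parsing the query string instead of doing a plain text replace avoids leaving a dangling "?" or "&" behind when view=app is the only argument or sits in the middle of several.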
