
Question by Coveo Answers, Jun 20, 2018 6:20 PM

Sitemap vs. Web Connector

What is the difference between the web crawler and the sitemap connector?

Why would I choose the web crawler over the sitemap connector, or vice versa?

Why are web crawler refreshes slower than refreshes in Salesforce, and how can I speed them up?

1 Reply

Answer by Charles-Hubert van Eyll, Jun 20, 2018 6:21 PM

A sitemap is essentially a metadata layer on top of a web site or set of web sites. For the detailed specification, see http://sitemaps.org.

With a basic web crawler, the idea is that you give the crawler a starting page or set of pages. The starting page(s) are indexed, links within them are identified and crawled, and the process repeats until every linked page has been crawled. Any page linked from your starting pages, or from pages linked from those pages, and so on--child, grandchild, great-grandchild--is indexed and potentially becomes a springboard to other pages. Conversely, any page in the site that has no link path from the starting page(s) won't get indexed.
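For illustration, here is a minimal sketch of that breadth-first crawl in Python (hypothetical code, not the actual connector; `index_page` is a placeholder for whatever hands the page to the indexer):

```python
# Hypothetical sketch of a basic web crawl: index the seed pages, follow
# their links, and repeat until no unvisited linked page remains.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, index_page):
    seen = set(seed_urls)
    queue = deque(seed_urls)
    hosts = {urlparse(u).netloc for u in seed_urls}  # stay within the seed sites
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable pages are skipped
        index_page(url, html)  # child, grandchild, great-grandchild... all get indexed
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            child = urldefrag(urljoin(url, href)).url  # resolve relative links, drop #fragments
            # Pages with no link path from the seeds are never reached.
            if child not in seen and urlparse(child).netloc in hosts:
                seen.add(child)
                queue.append(child)
```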

Also, a web server is not very smart about its content. If you ask for a particular page, the server will feed it out to you, along with metadata like the last-modified date (LMD), but it's not like a Salesforce table or an Oracle DB, where you can query for specific page types. You can't ask a web server, "Give me all the pages that have been modified in the last 24 hours." Instead, the web connector's "rescan" function goes through everything in the associated source and sends the web server a "HEAD" request for each item; the response tells us whether the file still exists and what its last-modified date is, among other things. If the file is gone, we remove it from the index. If it's still there, we compare its LMD to the LMD we have on file, and if the system of record is newer, we send a "GET" request to pull the whole file and re-index it, including crawling any new links that may have been added.
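Sketched in the same hypothetical style, the rescan loop looks roughly like this (`indexed_docs`, `reindex`, and `remove` are placeholder names for the index's stored LMDs and its update hooks, not Coveo's actual API):

```python
# Hypothetical sketch of a "rescan": one HEAD request per indexed URL, and a
# full GET only when the server's Last-Modified is newer than what we stored.
from email.utils import parsedate_to_datetime
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def rescan(indexed_docs, reindex, remove):
    """indexed_docs maps URL -> the Last-Modified datetime stored in the index."""
    for url, stored_lmd in indexed_docs.items():
        try:
            head = urlopen(Request(url, method="HEAD"))
        except HTTPError as err:
            if err.code == 404:
                remove(url)          # the file is gone: drop it from the index
            continue
        lmd_header = head.headers.get("Last-Modified")
        if lmd_header and parsedate_to_datetime(lmd_header) > stored_lmd:
            body = urlopen(url).read()   # GET the whole file
            reindex(url, body)           # re-index it and crawl any new links
```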

This is why a web connector's "rescan" is faster than a full "rebuild"--it doesn't necessarily crawl each and every file--but it's not as fast as, say, a "refresh" in Salesforce, which is purely an incremental pull.

So the difference with Sitemap is that instead of starting from a page that leads to pages, which in turn lead to still more pages, you provide a sitemap file: a concise file (or small set of files) that lists the URLs of ALL the files you want indexed. It can also include metadata, such as the LMD, so that the Sitemap connector can do a true refresh instead of a full rescan.
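For illustration, a minimal sitemap file following the sitemaps.org protocol looks like this (hypothetical URLs); each <url> entry gives a <loc> and, optionally, metadata such as <lastmod>:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/docs/getting-started</loc>
    <lastmod>2018-06-19</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/docs/faq</loc>
    <lastmod>2018-05-02</lastmod>
  </url>
</urlset>
```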

So when a file is updated, the site manager (or a system the site manager runs locally) should update the sitemap file with a fresh entry for that file. When the connector refreshes, it pulls down the whole sitemap file and looks for entries that are new or updated, then indexes the contents of the files at the URLs associated with those entries.
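A hedged sketch of that refresh, again with placeholder names (`indexed_lastmod`, `fetch_and_index`) rather than the real connector's internals: only URLs whose <lastmod> differs from what the index recorded at the previous refresh get fetched.

```python
# Hypothetical sketch of a Sitemap refresh: diff the sitemap's <lastmod>
# values against the index and fetch only the new or updated URLs.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def refresh(sitemap_xml, indexed_lastmod, fetch_and_index):
    """indexed_lastmod maps URL -> the <lastmod> value seen at the last refresh."""
    root = ET.fromstring(sitemap_xml)
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        # Unchanged entries are skipped entirely: no HEAD, no GET, no crawl.
        if indexed_lastmod.get(loc) != lastmod:
            fetch_and_index(loc)   # only changed or new URLs are downloaded


# The sample sitemap from above, compacted. Only /docs/faq has a newer
# <lastmod> than the index, so only that URL gets fetched and re-indexed.
SAMPLE = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/docs/getting-started</loc><lastmod>2018-06-19</lastmod></url>
  <url><loc>https://www.example.com/docs/faq</loc><lastmod>2018-05-02</lastmod></url>
</urlset>"""

refresh(SAMPLE,
        indexed_lastmod={"https://www.example.com/docs/getting-started": "2018-06-19",
                         "https://www.example.com/docs/faq": "2018-01-15"},
        fetch_and_index=print)
```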

It doesn't ask for any information about pages whose sitemap entries haven't changed, nor does it crawl any links on the pages it indexes. A web site with 100,000 documents and a well-maintained sitemap file will take only a couple of minutes to refresh if only a few pages have changed. Refreshing the same site with a regular crawler will take hours or days, depending on the performance of the server, since each page needs to be queried.

Last, but not least, we've made an extension to the Sitemap standard so that clients can easily add metadata to their web content without modifying the actual web pages. It is a powerful and flexible tool.

In short, the web crawler crawls sites driven by web servers, which are not intelligent tools by design; as such, the crawler is a sort of brute-force tool. Sitemap adds a layer of intelligence to any web site; it requires more work on the client side to be effective, but it makes for a much faster and more flexible indexing experience. If the added work of implementing and maintaining a sitemap file is an option, we strongly recommend that over a basic web crawl.

Original KB: https://support.coveo.com/s/article/2297
