Question by hrishy, Apr 12, 2016 10:38 AM

Crawl a Subversion site exposed over webdav using http

Hi

I would like to understand whether there is a way to index a Subversion repository that currently houses various types of documents, such as Word, PDF, Excel, TXT, ZIP, etc.

Since SVN (Subversion) itself can be exposed over WebDAV using the HTTP protocol, can Coveo crawl this site, following the links as deep as possible, and create an index?

I would then go ahead and build a web front end that searches the index and links back to the SVN repository.

Thanks for your help

1 Reply

Answer by mgrondines, Apr 12, 2016 10:57 AM

If the SVN repository is accessible over HTTP (i.e., viewable in a browser), you can use the Web Crawler to index its content. However, you must set up the source so that index pages are filtered out. For instance, you can crawl this repository (which I found on Google and which seems to be public).

The Web Crawler will follow the hyperlinks as it would on any website and download the binary files, depending on the Document type settings on the source.
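
To make the link-following idea concrete, here is a minimal Python sketch of how a crawler might split one listing page into folders to follow and files to download. This is not Coveo code; the `LinkCollector` class, the `split_listing` function, the sample HTML, and the `svn.example.com` URL are all hypothetical and only illustrate the general behavior described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in a listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_listing(base_url, html):
    """Return (folders_to_follow, files_to_download) as absolute URLs."""
    collector = LinkCollector()
    collector.feed(html)
    folders, files = [], []
    for href in collector.hrefs:
        if href.startswith(".."):
            continue  # skip the link back to the parent folder
        url = urljoin(base_url, href)
        # Assumption: folder links end with "/", file links do not.
        (folders if url.endswith("/") else files).append(url)
    return folders, files


# Hypothetical listing page, for illustration only.
SAMPLE = """
<ul>
  <li><a href="../">..</a></li>
  <li><a href="docs/">docs/</a></li>
  <li><a href="report.pdf">report.pdf</a></li>
</ul>
"""

folders, files = split_listing("http://svn.example.com/repo/trunk/", SAMPLE)
```

In this sketch, `folders` holds the pages a crawler would visit next and `files` holds the documents it would download.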

Then, you can add a source filter to remove pages without extensions, so that only binary files are indexed and "index" pages are removed. Remember to check the "Expand before filtering" checkbox in the UI!
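
The logic of such a filter can be sketched in Python as follows. This is only an illustration of the rule "keep URLs whose last path segment has an extension"; it is not the Coveo filter syntax, and the function name `has_file_extension` and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def has_file_extension(url):
    """Return True if the last path segment of the URL has an extension."""
    path = urlsplit(url).path
    name = posixpath.basename(path)  # empty string for URLs ending in "/"
    _, ext = posixpath.splitext(name)
    return ext != ""


urls = [
    "http://svn.example.com/repo/trunk/",
    "http://svn.example.com/repo/trunk/docs/",
    "http://svn.example.com/repo/trunk/docs/report.pdf",
    "http://svn.example.com/repo/trunk/README",
]

# Keep only the URLs that look like files with an extension.
kept = [u for u in urls if has_file_extension(u)]
```

Note that a rule like this also drops real files that have no extension (such as `README` above), so it is worth checking whether the repository contains any.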

Comment by hrishy, Apr 12, 2016 11:13 AM

Hi

Thank you so much for your input. I do not have much control over the SVN repository.

You mention, and I quote, "setup the source so that index pages are filtered out."

Is there something that can be done on Coveo's side while configuring the crawler to filter out things like HTML and index only files of certain types, such as those with the extensions docx, pdf, zip, etc.?

Comment by mgrondines, Apr 12, 2016 11:30 AM

Yes, you can modify the document type settings to reject HTML documents or any other type. You can also filter documents by content type, such as "text/html" or "application/pdf". See our documentation for more information: https://onlinehelp.coveo.com/en/ces/7.0/administrator/modifyinghowceshandlesadocumenttype.htm
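
As a rough illustration of content-type filtering in general (not of how CES implements it), the decision can be sketched in Python like this. The names `media_type`, `should_index`, and `REJECTED` are hypothetical; the actual setting is made in the document type configuration described in the linked documentation.

```python
# Hypothetical reject list: content types that should not be indexed.
REJECTED = {"text/html"}


def media_type(content_type_header):
    """Strip parameters such as '; charset=UTF-8' and normalize case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def should_index(content_type_header, rejected=REJECTED):
    """Return True unless the document's content type is on the reject list."""
    return media_type(content_type_header) not in rejected


index_pdf = should_index("application/pdf")
index_html = should_index("Text/HTML; charset=UTF-8")
```

Here `index_pdf` is `True` and `index_html` is `False`, so listing pages served as HTML would be skipped while PDF files would be indexed.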
