Question by mmintz, Oct 24, 2014 12:48 PM

I'm trying to figure out what the log description "Document Filtered (redirect)" means. I know that "Document Filtered" means that the document isn't being indexed in the source, but I'm not fully sure about "(redirect)". Does it mean that the URL of the page the crawler was attempting to index redirected to another page?

Answer by Jonathan Miller, Oct 24, 2014 3:31 PM

The "Document Filtered (redirect)" message appears when the crawler gets redirected to another page and that page does not match any of the inclusion filters. The redirect usually happens for one of the following reasons:

  1. The web server is configured to redirect the page.
  2. The page requires a logon.

So if the redirected page is the one that you want to crawl, you will need to make sure that your source has an inclusion filter that matches that page's link. You will also need to make sure that the filter doesn't include any escape characters. For example, %20 should be written as a blank space.
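If you copied the filter address from a browser or a log, it may still contain percent-escapes. As a rough sketch of the clean-up described above, the Python snippet below decodes them before you paste the address into the filter. The function name normalize_filter and the example.com address are made up for illustration; this is not part of the crawler or its API.

```python
from urllib.parse import unquote


def normalize_filter(filter_text: str) -> str:
    """Decode percent-escapes (for example %20) in a filter address.

    Hypothetical helper for preparing the text by hand; it is not
    something the crawler provides.
    """
    return unquote(filter_text)


# Example with a made-up address: %20 becomes a blank space.
print(normalize_filter("http://example.com/My%20Documents/*"))
```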

The above is what we normally see. However, while writing this answer I ran a few tests and was able to reproduce the same message in an unexpected way: you can also get it when you try to crawl a link that isn't part of your starting address but that an inclusion filter has been created for. Is this your case? Do you have a link that you want to crawl that isn't part of your starting address and that you already have an inclusion filter for? If so, please open a new case with support so that we can work on this with you.

Comment by Jonathan Miller, Oct 28, 2014 1:58 PM

As a follow-up to my answer, we have found that you can get an inclusion filter with a space in the address to work if you use a Regex filter instead of a wildcard one. You will need to replace all spaces in the address with (\s|\%20), and you will need to end the filter with .* instead of *. I hope that this helps you get this working.
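To illustrate the conversion, here is a minimal Python sketch that turns a wildcard-style address into a Regex filter along those lines. The function name and the example.com address are hypothetical, and the check at the end uses Python's own re module, which may not behave exactly like the crawler's regex engine. The sketch applies only the two substitutions described above and does not escape other regex characters such as the dot.

```python
import re

# Matches either a literal space (any whitespace) or its escaped form %20.
SPACE_PATTERN = r"(\s|\%20)"


def wildcard_to_regex_filter(address: str) -> str:
    """Convert a wildcard filter ending in * into a Regex filter.

    Hypothetical helper: it only applies the two substitutions from
    the comment above and leaves everything else untouched.
    """
    # Drop the trailing wildcard; .* is appended in its place.
    if address.endswith("*"):
        address = address[:-1]
    # Treat an already-escaped space the same as a blank space.
    address = address.replace("%20", " ")
    return address.replace(" ", SPACE_PATTERN) + ".*"


# Example with a made-up address.
pattern = wildcard_to_regex_filter("http://example.com/My Documents/*")
print(pattern)

# Both spellings of the address should match the resulting pattern.
for url in ("http://example.com/My Documents/a.html",
            "http://example.com/My%20Documents/a.html"):
    print(url, bool(re.match(pattern, url)))
```

Because the pattern accepts both a blank space and %20, it should match the link whichever way the crawler reports it.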

