Question by mnrichardstokes, Dec 9, 2016 4:12 PM

PDFs not indexing words

We are using the Coveo File Connector to index rather large PDFs (10 MB+, 1000+ pages). The connector does crawl each document, but when I look at the summary there is no content, and we cannot search for a word within the PDF. We can see the Quick View with all of the content and can highlight the text we are searching for, but again we cannot search for those words. We can also search for the words directly in the PDF, so I know that it is not just images.

What would cause this type of issue? I also tried LibreOffice and pdf2html, but I am still getting the same results.

Our goal is to search through these large PDFs (and spreadsheets) and perhaps use the Quick View so users can see the results of the search (each PDF is a collection of letters sent to people).

1 Reply

Answer by Simon, Dec 14, 2016 3:49 PM

Interesting. It works on my side. First of all, the list of supported files can be found here:

https://onlinehelp.coveo.com/en/ces/7.0/user/supportedfileformats.htm

Acrobat PDF files are supported from version 1.0 through 1.7.
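If you want to rule out the version quickly, a PDF normally declares it in a `%PDF-x.y` header near the start of the file. Here is a minimal Python sketch, not a Coveo tool, that reads that header and compares it to the 1.0 to 1.7 range mentioned above. The function names are my own, and note that a PDF's document catalog can declare a newer version than the header, so treat the result as a first check only.

```python
import re


def pdf_header_version(first_bytes):
    """Return the 'x.y' version from a '%PDF-x.y' header, or None if absent."""
    # The header is expected near the start of the file; only look there.
    match = re.search(rb"%PDF-(\d+)\.(\d+)", first_bytes[:1024])
    if match is None:
        return None
    return f"{int(match.group(1))}.{int(match.group(2))}"


def is_in_supported_range(version, low=(1, 0), high=(1, 7)):
    """Check a version string against the 1.0 to 1.7 range from the answer."""
    major, minor = (int(part) for part in version.split("."))
    return low <= (major, minor) <= high


def read_pdf_version(path):
    """Read the first bytes of a file on disk and return its header version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(1024))
```

For example, `read_pdf_version("letters.pdf")` (a hypothetical file name) returns a string such as `"1.4"` that you can pass to `is_in_supported_range`.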

My assumption is that you are using Coveo Enterprise Search 7.0 On-Premises, so in this case, the indexing behavior for each file type is defined in the Configuration >> Document Types menu.

In there you can decide what to do with a document. Files with the PDF extension are fully indexed by default, but only their metadata is indexed in case of failure. Could that be the case on your side? You would see an error like this in the index logs:

"Failed to convert…. will be indexed by reference…."
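If the logs are large, a small script can do the searching for you. This is only a sketch: it assumes the index log is a plain-text file you can read line by line, and it matches on the two phrases quoted above ("Failed to convert" and "indexed by reference"). The exact wording and layout of real log entries may differ, and the function name is my own.

```python
def find_conversion_failures(lines):
    """Return (line_number, text) for lines that look like conversion failures.

    Matching is case-insensitive and based only on the two phrases quoted
    in this thread; the real log format is an assumption.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if "failed to convert" in lowered and "indexed by reference" in lowered:
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_file(path):
    """Open a log file (path supplied by the caller) and scan it."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return find_conversion_failures(handle)
```

Call `scan_log_file` with the path to your own index log. An empty result suggests the converter did not report a failure for any document in that log.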

Also, a PDF can be copy protected. Find your PDF in Content >> Index Browser and expand the Details tab under it. There is a field called Copy Protected.

Cheers,
Simon

Comment by mnrichardstokes, Dec 15, 2016 11:51 AM

It is on-premises. The document is not copy protected. These are large PDFs, over 1000 pages long.

The weird thing is that I can search for one word in the footer of the PDF, but I cannot search for another word on the same line. I can select both words when viewing the PDF. They are in different fonts. There are no security restrictions in the PDF and no errors in the Coveo index log.
