PDFs not indexing words
We are using the Coveo File Connector to index rather large PDFs (10 MB+, 1000+ pages). What we are seeing is that it does crawl the document, but when I look at the summary there is no content, and we cannot search for a word within the PDF. We can see the Quick View with all of the content and can highlight the text we are searching for, but again we cannot search for those words. We can also search for the words directly in the PDF, so I know that it is not just images.
What would cause this type of issue? I also tried LibreOffice and pdf2html, but I am still getting the same results.
Our goal is to search through these large PDFs (and spreadsheets) and perhaps use the Quick View so users can see the results of the search (each PDF is a collection of letters sent to people).
Interesting. It works on my side. First of all, the list of supported files can be found here:
Acrobat PDF files are supported from version 1.0 to 1.7.
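If you want to rule out the version question quickly, here is a minimal Python sketch that reads the `%PDF-x.y` header every PDF file starts with and checks it against the 1.0 to 1.7 range. The function names are mine and not part of any Coveo tooling. Note that the header is only a first check: since PDF 1.4 a file can declare a newer version inside its document catalog, which this sketch does not look at.

```python
import re

# Matches the version marker at the start of a PDF, e.g. b"%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(path):
    """Return (major, minor) from the %PDF-x.y header, or None if not found."""
    with open(path, "rb") as f:
        head = f.read(1024)  # the header sits at (or very near) the start
    match = _HEADER.search(head)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def in_supported_range(version):
    """True if the version falls in the 1.0 to 1.7 range quoted above."""
    return version is not None and (1, 0) <= version <= (1, 7)
```

You would call `in_supported_range(pdf_header_version("letters.pdf"))`, where `letters.pdf` is a placeholder for one of your own files.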
My assumption is that you are using Coveo Enterprise Search 7.0 On-Premises, so in this case, the indexing behavior for each file type is defined in the Configuration >> Document Types menu.
In there you can decide what to do with each document type. By default, files with the PDF extension are fully indexed, but only their metadata is indexed if the conversion fails. Could that be the case on your side? If so, you would see an error like this in the index logs:
"Failed to convert…. will be indexed by reference…."
Also, a PDF can be copy protected. Find your PDF in Content >> Index Browser and expand the Details tab under it. There is a field called Copy Protected.
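If you have many files to go through, the two checks above (the conversion error in the logs and the copy protection) can be roughly scripted. This is only a sketch under assumptions: I am assuming the index log is a plain text file with one message per line, and that the copy protection uses the standard PDF encryption dictionary, which shows up as an `/Encrypt` entry in the file. The function names are made up for this example, and the byte search is a heuristic, so the Copy Protected field in the Index Browser remains the reference.

```python
def find_conversion_failures(lines):
    """Return (line_number, text) for log lines that look like the error above.

    Assumes one log message per line; matches on the two fragments quoted
    in the error message, ignoring case.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if "failed to convert" in lowered and "indexed by reference" in lowered:
            hits.append((number, line.rstrip("\n")))
    return hits


def looks_encrypted(path, chunk_size=1 << 20):
    """Heuristic: True if the bytes b"/Encrypt" appear anywhere in the file.

    Reads in chunks so that large PDFs are not loaded into memory at once.
    """
    needle = b"/Encrypt"
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            # Prepend the end of the previous chunk so a match that
            # straddles a chunk boundary is still found.
            if needle in tail + chunk:
                return True
            tail = chunk[-(len(needle) - 1):]
```

For the log part you would pass an open file, for example `find_conversion_failures(open(log_path, encoding="utf-8", errors="replace"))`, where `log_path` is wherever your index logs are stored.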