Question by kvarley, Jul 14, 2015 2:06 PM

Indexing Media Library with Sitecore Legacy connector

We have a customer using the Sitecore legacy connector (6.5 x86 Build 4898) who would like to begin indexing the PDFs in their media library. I'm trying to determine whether we can index this content and filter out all other content types, such as Word documents, using document type sets. I tried applying a custom document type set and changing the action to reject, but I still see this content in the Index Browser. One thing to note is that the items in Sitecore don't necessarily have the proper file extension. Is there some way to handle this via the Content Type value?

1 Reply
Answer by Simon, Jul 15, 2015 7:42 AM

That was actually really fun to test, since I have not played with CES 6.5 in a while.

Using a document type set would be the best option, I believe. Sitecore might indeed give some documents an unusual extension, but you can add new extensions to an existing document type set using the Add button.

If you still want to filter with a bit more logic, I would recommend using a pre-conversion script. Searching for "Extension" in our API help brought me to the Extension property of DocumentInfo.

So assigning the file extension to a variable in JavaScript would look like this:

var fileExtension = DocumentInfo.Extension;

You can then add some logic to compare and exclude, for example:

DocumentInfo.IsValid = fileExtension !== "the extension that I don't want";

Remember that you can also write messages to the CES Console using

PreConversion.Trace
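
Putting the pieces above together, a complete pre-conversion script might look like the sketch below. It is only an illustration under a few assumptions: the helper name shouldIndex and the ALLOWED_EXTENSIONS table are made up for this example, the thread does not say whether DocumentInfo.Extension includes a leading dot (so the helper accepts both forms), and PreConversion.Trace is called with a single string argument, as in the rest of this thread. An allow-list is used instead of a single excluded extension because the goal in the question is to keep only PDFs.

```javascript
// Extensions to keep in the index. Only "pdf" comes from the question;
// add entries here if the trace shows PDFs reported under another extension.
var ALLOWED_EXTENSIONS = { "pdf": true };

// Hypothetical helper (not part of the CES API): normalizes an extension
// (case and optional leading dot) and reports whether it is allowed.
function shouldIndex(extension) {
    var normalized = String(extension || "").toLowerCase();
    if (normalized.charAt(0) === ".") {
        normalized = normalized.substring(1);
    }
    return ALLOWED_EXTENSIONS[normalized] === true;
}

// DocumentInfo and PreConversion are provided by CES when the script runs.
// The guard only exists so the helper can be exercised outside CES.
if (typeof DocumentInfo !== "undefined") {
    var fileExtension = DocumentInfo.Extension;
    // Log what the crawler actually detected before deciding.
    PreConversion.Trace("Detected extension: " + fileExtension);
    // Reject every document whose extension is not in the allow-list.
    DocumentInfo.IsValid = shouldIndex(fileExtension);
}
```

The trace line makes it easy to compare what the connector detects with what you see in Sitecore before trusting the filter.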

And finally, once your script is ready, you can reference it in the Administration Tool:

https://onlinehelp.coveo.com/en/ces/6.5/developer/howtosetupapreconversion_script.htm

Should do the trick.
Cheers
Simon

Comment by kvarley, Jul 15, 2015 9:44 AM

Thanks for your response Simon. I have done some pre-conversion scripts in the past and figured that's where we'd end up if document types didn't work out.

Regardless of extension, it doesn't seem like Coveo is honoring the document type configuration. The Word documents I am seeing are all named *.doc, and the action for this document type is set to reject.

Comment by Simon, Jul 15, 2015 1:43 PM

Hmm, strange. Could it be docx?

What I would do is plug in a pre-conversion script to output the extension detected by the crawler. So a script with a single line:

PreConversion.Trace(DocumentInfo.Extension);

Maybe the connector is not detecting the same thing as what you are seeing.
