Question by pxbhuang, Jul 21, 2016 10:27 AM

Cleaning HTML content in PreConversion for Summary

I am currently trying to delete content between certain tags of an HTML source in PreConversion. The Summary tab under "Details" in the CES Index Browser currently shows information unrelated to the documents, and as I understand it, cleaning HTML in PreConversion will prevent Coveo from putting that information in the Summary. My code looks as follows:

string body = preConversion.InputDocument.ReadByteString(preConversion.InputDocument.BytesCount, Charsets.CP1252_CHARSET);
// Do deletion modifications to body
preConversion.InputDocument.WriteAsByteString(body, Charsets.CP1252_CHARSET); // throws InvalidCastException

The line where I try to write back to the InputDocument throws an InvalidCastException when a document is run through the preconverter. I have validated that the HTML body is successfully extracted and the deletion operation is performed, but it always fails on the last line. Why is this the case?

1 Reply
Answer by Jean-François L'Heureux, Jul 21, 2016 11:07 AM

This task should be done in a custom converter instead.

We already have example code that does this:

However, it lacks instructions for associating the converter with your source. First, add the converter. Next, create a new document type set and, within it, modify the HTML document type to use your custom converter. Then set the new document type set on your source and rebuild.

I hope this helps.


Comment by pxbhuang, Jul 21, 2016 11:22 AM

Thanks for the answer Jeff,

Is CustomConverter only available in JScript and VBScript? My conversion scripts are currently in .NET, and I would like to keep them consistent if possible.

Comment by pxbhuang, Jul 21, 2016 2:04 PM

Thanks, the custom converter works. For some reason I had convinced myself that this should work in PreConversion, which isn't the case.
