Cleaning HTML content in PreConversion for Summary
I am currently trying to delete content between certain tags of an HTML source in PreConversion. The Summary tab under "Details" in the CES Index Browser currently shows information unrelated to the documents, and as I understand it, cleaning HTML in PreConversion will prevent Coveo from putting that information in the Summary. My code looks as follows:
string body = preConversion.InputDocument.ReadByteString(preConversion.InputDocument.BytesCount, Charsets.CP1252_CHARSET);
// Do deletion modifications to body
preConversion.InputDocument.WriteAsByteString(body, Charsets.CP1252_CHARSET); // throws InvalidCastException
The line where I try to write back to the InputDocument throws an InvalidCastException when a document is run through the preconverter. I have validated that the HTML body is successfully extracted and the deletion operation is performed, but it always fails on the last line. Why is this the case?
This task should be done in a custom converter instead.
We already have example code for this: https://developers.coveo.com/display/Converter/Modifying+an+HTML+Document
However, that page does not explain how to associate the converter with your source. First, add the converter. Then create a new document type set. In the new document type set, modify the HTML document type and set your custom converter on it. Then, in your source, select the new document type set. Finally, rebuild the source.
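Whichever place the cleanup ends up running in, the deletion step itself is plain string handling. Here is a minimal sketch in C#, assuming the unwanted content sits inside a known element such as <nav>. The HtmlCleaner class and its RemoveElements method are hypothetical names made up for illustration and are not part of the Coveo API. It relies only on System.Text.RegularExpressions, and being regex-based it will not cope with nested elements of the same name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the Coveo API.
public static class HtmlCleaner
{
    // Removes every <tagName ...>...</tagName> element together with its content.
    // Matching is case-insensitive and spans line breaks.
    public static string RemoveElements(string html, string tagName)
    {
        string tag = Regex.Escape(tagName);
        // \b keeps "nav" from also matching "<navigation>";
        // .*? stops at the first closing tag, so nested same-name tags are not handled.
        string pattern = "<" + tag + @"\b[^>]*>.*?</" + tag + @"\s*>";
        return Regex.Replace(html, pattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

With the body string from the question, the call would look like body = HtmlCleaner.RemoveElements(body, "nav"); where "nav" stands in for whichever tag wraps the content you want kept out of the Summary.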
I hope this helps.