Question by gviswanath, Nov 5, 2015 11:36 AM

Strip out HTML tags when processing XML through an XML document definition

Hi,

We have an XML document containing a set of nodes that need to be processed into separate indexed documents. Using the XML document definition, we created mappings between the XML nodes and the fields, and used the Web page crawler to crawl the XML file. We are getting the desired results except for one thing.

The node that we mapped to the Content property contains HTML tags, which also show up in the search results. We would like to strip out the HTML tags when displaying the results. Is there any way to do this?

Sample of the node:

<RootElementNode>
  <ElementNode>
    <Description><![CDATA[<strong>Some heading &ndash;</strong> Some content]]></Description>
  </ElementNode>
</RootElementNode>

Mapping in XML document definition:

Root XML Path : RootElementNode/ElementNode
Content - %[Description]

Comment by Jean-François L'Heureux, Nov 5, 2015 3:36 PM

Is this other question from someone on the same project? https://answers.coveo.com/questions/4891/xml-parser-while-building-index-from-xml-file

Comment by gviswanath, Nov 5, 2015 4:45 PM

Yes, this is a follow-up to that question. We are using the XML document definition to process the XML nodes, but we need a way to strip the HTML tags from the Content so that none of them appear when the search results are shown on the page.

1 Reply

Answer by ldblanchet, Nov 6, 2015 7:54 AM

The best way to do this would probably be to use a conversion script (see Conversion Phases). If you use pre-conversion, you can modify the body of the document before it gets converted, so those HTML tags will not be searchable.

I do not have an example on hand that does exactly what you want; however, the API reference for the conversion objects can be found here: DocumentInfo. The PreConversion and PostConversion objects are also relevant. In post-conversion, you could modify the text extracted by the converters, which contains the HTML tags, and remove them before the document reaches the index.

One way to do it in post-conversion would be to replace the extracted text for your document:

'Replace the text on the document
Dim Text: Text = PostConversion.Text.ReadString(PostConversion.Text.BytesCount)
Text = Replace(Text, "OriginalWord", "NewWord")
Call PostConversion.TextToOverride.WriteString(Text)
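
The same read/override pattern could be adapted to remove the tags themselves rather than a fixed word. The following is an untested sketch, not an official Coveo sample: StripHtmlTags is a hypothetical helper name, the tag pattern is a naive one (it would also eat text such as "a < b > c"), and only a handful of entities are decoded. It relies on the built-in VBScript RegExp object and on the PostConversion calls already shown above.

```vbscript
' Hypothetical helper (not part of the Coveo API): removes HTML tags
' and decodes a few common entities from a string.
Function StripHtmlTags(ByVal Html)
    Dim TagPattern
    Set TagPattern = New RegExp
    TagPattern.Pattern = "<[^>]+>"   ' Naive match for anything between < and >
    TagPattern.Global = True         ' Replace every match, not just the first

    Dim Result
    ' Replace each tag with a space so adjacent words do not get glued together
    Result = TagPattern.Replace(Html, " ")

    ' Decode a few common entities (extend the list as needed).
    ' &amp; is decoded last so that "&amp;lt;" does not become "<".
    Result = Replace(Result, "&ndash;", ChrW(8211))
    Result = Replace(Result, "&nbsp;", " ")
    Result = Replace(Result, "&lt;", "<")
    Result = Replace(Result, "&gt;", ">")
    Result = Replace(Result, "&amp;", "&")

    StripHtmlTags = Result
End Function

' Read the extracted text, strip the tags, and override the text on the document
Dim Text: Text = PostConversion.Text.ReadString(PostConversion.Text.BytesCount)
Call PostConversion.TextToOverride.WriteString(StripHtmlTags(Text))
```

With the sample Description node from the question, this should leave roughly "Some heading – Some content" (with some extra spaces where the tags were). Whether the result excerpts pick up the overridden text would need to be checked on a rebuilt source.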