Strip out HTML tags when processing XML through an XML document definition
We have an XML document containing a set of nodes that need to be processed into separate indexed documents. Using the XML document definition, we created mappings between the XML nodes and the fields, and used the Web page crawler to crawl the XML file. We are getting the desired results except for one thing.
The node we mapped to the Content property contains HTML tags, and these show up in the search results too. We would like to strip out the HTML tags when displaying the results. Is there any way to do this?
Sample of the node:
<RootElementNode>
  <ElementNode>
    <Description><![CDATA[<strong>Some heading –</strong> Some content]]></Description>
  </ElementNode>
</RootElementNode>
Mapping in XML document definition:
Root XML Path: RootElementNode/ElementNode
Content - %[Description]
The best way to do this would probably be to use a conversion script (see Conversion Phases). If you use pre conversion, you can modify the body of the document before it gets converted, so those HTML tags will not be searchable.
I do not have an example on hand that does exactly what you want; however, the API reference for the conversion objects can be found here: DocumentInfo. The PreConversion and PostConversion objects are also interesting. In post conversion, you could modify the text extracted by the converters, which contains the HTML tags, and remove them before the document reaches the index.
One way to do it in post conversion would be to replace the extracted text for your document:
'Replace the text on the document
Dim Text: Text = PostConversion.Text.ReadString(PostConversion.Text.BytesCount)
Text = Replace(Text, "OriginalWord", "NewWord")
Call PostConversion.TextToOverride.WriteString(Text)
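To remove the tags themselves rather than a fixed word, the same post conversion approach could be combined with a regular expression. The following is only a rough sketch, not a tested script: it assumes the conversion script runs as VBScript (as the snippet above suggests), that the extracted text still contains the raw tags, and it reuses only the PostConversion calls already shown. StripHtmlTags is a hypothetical helper name, not part of the Coveo API.

```vbscript
' Hypothetical helper (not part of the Coveo API):
' removes anything that looks like an HTML tag from a string.
Function StripHtmlTags(ByVal Source)
    Dim TagPattern
    Set TagPattern = New RegExp
    ' A "<", then one or more characters that are not ">", then ">"
    TagPattern.Pattern = "<[^>]+>"
    ' Replace every match, not only the first one
    TagPattern.Global = True
    StripHtmlTags = TagPattern.Replace(Source, "")
End Function

' Read the extracted text, strip the tags, and override the text,
' using the same PostConversion calls as the snippet above.
Dim Text: Text = PostConversion.Text.ReadString(PostConversion.Text.BytesCount)
Text = StripHtmlTags(Text)
Call PostConversion.TextToOverride.WriteString(Text)
```

With the sample node from the question, "<strong>Some heading –</strong> Some content" would become "Some heading – Some content". Note that this pattern is deliberately simple: it does not decode HTML entities such as &amp;, and it would also remove ordinary text that happens to sit between a "<" and a ">", so it is worth checking against your real content first.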