Question by mhogue, May 1, 2017 8:14 AM

V2 Pipeline Extension: How to access the Body text

What Python code do I need to access the Body text? I can't find this anywhere in the documentation. In our V1 script custom field, we have these lines of code:

var UTF8_CHARSET = 2;
var documentContent = PostConversion.HTMLOutput.ReadByteString(PostConversion.HTMLOutput.BytesCount, UTF8_CHARSET);

Also, it may be that I need access to the Original file. If so, please advise.


Answer by François Lachance-Guillemette, May 1, 2017 12:05 PM

The V2 indexing pipeline extensions are _very_ different from the V1 version. They now use Python, can import packages, and have a streamlined API, which makes them more powerful and easier to use.

This Coveo Cloud V2 Indexing Pipeline Extensions page is a really useful resource to kickstart your Cloud V2 Extensions journey.

The Body text is exposed as a readable stream, as described in the Document Object Python API Reference (Get Data Streams).

You should be able to get that stream and do whatever you want with it.
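
As a rough sketch of what working with such a stream could look like, the snippet below reads a binary stream, decodes it as UTF-8, and runs a regular expression on the resulting string. It assumes the stream behaves like a file object whose read() returns bytes; an io.BytesIO object stands in for the real data stream, the helper name read_stream_as_text is hypothetical, and the sample HTML is made up. How to obtain the actual stream from the document object is covered in the API reference, not here.

```python
import io
import re


def read_stream_as_text(stream, encoding="utf-8"):
    """Read a binary, file-like stream to the end and decode it to a string.

    Hypothetical helper: it only assumes that stream.read() returns bytes.
    """
    raw = stream.read()
    # errors="replace" substitutes malformed bytes instead of raising an error.
    return raw.decode(encoding, errors="replace")


# Stand-in for the real body stream (assumption: the real one is also binary).
sample_html = u'<tag id="example" content-lang="fr"/>\n<p>caf\u00e9</p>'
fake_body_stream = io.BytesIO(sample_html.encode("utf-8"))

html_code = read_stream_as_text(fake_body_stream)

# Once decoded, the content is an ordinary string that regexes can search.
match = re.search(r'content-lang="([^"]+)"', html_code)
content_lang = match.group(1) if match else None
print(content_lang)
```

Decoding explicitly, rather than relying on a default encoding, is what keeps non-ASCII characters intact when the string is later searched or rewritten.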

Comment by mhogue, May 1, 2017 12:25 PM

Okay, but if there is more than one stream, how do I know which ones are of interest? How can I manipulate a stream after accessing it? How do I ensure the data is UTF-8 encoded?

Comment by mhogue, May 1, 2017 12:28 PM

By the way, I want the stream of interest as a string like the one below, so that I can run regular expressions against it.

html_code = """

<tag id="whateer" content-lang="uknow"/>

<another tag ... >


