Question by riversagency, Jun 26, 2015 10:10 AM

How to prevent Coveo from indexing specific content on HTML pages

I am using the JavaScript search interface on a Magento website. The header menus include a list of all products on the site. When I search for a specific product, every page on the website shows up in the search results, because every page has the same header menu.

I want to prevent Coveo from indexing content in specific divs on my pages. Is there a way to do this?

4 Replies

Answer by Luc Bergeron, Jun 26, 2015 10:47 AM

As far as I know, it is not possible to directly exclude specific parts of the HTML. However, Coveo's crawler sends a specific user agent when it crawls web pages. When your server recognizes this user agent, it can render the page without the menus, header, and footer, so only the relevant information remains and gets indexed.

If you are using the Web crawler, see https://onlinehelp.coveo.com/en/ces/7.0/administrator/configuringandindexingawebpagessource.htm

If you are using Coveo for Sitecore, you can use the same trick: https://developers.coveo.com/display/SC201506/Indexing+Documents+with+HTML+Content+Processor

I hope this helps

Comment by riversagency, Jun 26, 2015 11:04 AM

Thank you for this. I'm going to try it out and I'll let you know if it works.

Answer by Jean-François L'Heureux, Jun 26, 2015 11:08 AM

If you index your website pages with the Web pages connector, another option is to add special markup to your HTML that identifies zones that should not be indexed, and to remove those zones with a custom converter (see Modifying an HTML Document).

The documentation uses a <noindex> element, which is not valid HTML. Instead, you can delimit each zone with HTML comments, like this: <!-- BEGIN DO NOT INDEX -->content<!-- END DO NOT INDEX -->.
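
As a rough sketch of the stripping step only: the helper below removes everything between the two comment markers shown above. The class and method names (NoIndexZones, Strip) are made up for illustration, nested zones are not handled, and wiring this into an actual Coveo custom converter is not shown here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of any Coveo API): removes the zones
// delimited by the BEGIN/END DO NOT INDEX comment markers.
public static class NoIndexZones
{
    // Singleline lets '.' match newlines, so a zone may span several lines.
    // The lazy '.*?' stops at the first END marker, so nested zones are not supported.
    private static readonly Regex Zone = new Regex(
        @"<!--\s*BEGIN DO NOT INDEX\s*-->.*?<!--\s*END DO NOT INDEX\s*-->",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the HTML with every marked zone, markers included, removed.
    public static string Strip(string html)
    {
        return Zone.Replace(html ?? string.Empty, string.Empty);
    }
}
```

Inside a converter, you would presumably call NoIndexZones.Strip on the page HTML before writing the output document.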

Answer by abhisfortitude, Sep 3, 2015 3:33 PM

I was able to achieve the solution using the User Agent.

Here is what I did:

In the Coveo Admin tool, go to the index you would like to configure and open the General tab. In the User Agent field, enter "coveoCrawler", an arbitrary string that identifies the Coveo crawler.

Then, in the code-behind, read the server variable "HTTP_USER_AGENT" and check whether it contains the "coveoCrawler" string. If the string is found, don't emit the header or the footer.
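
A minimal sketch of that check, assuming the "coveoCrawler" string configured above. The class and method names are hypothetical, and the check is kept as a plain function so it does not depend on any particular web framework.

```csharp
using System;

// Hypothetical helper: decides whether a request comes from the Coveo crawler
// based on the user agent string configured on the index.
public static class CrawlerDetection
{
    // The arbitrary token entered in the User Agent field of the index.
    public const string CrawlerToken = "coveoCrawler";

    // True when the user agent contains the token (case-insensitive).
    // A missing or empty user agent is treated as a regular visitor.
    public static bool IsCoveoCrawler(string userAgent)
    {
        return !string.IsNullOrEmpty(userAgent)
            && userAgent.IndexOf(CrawlerToken, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

In an ASP.NET code-behind, this could be called as CrawlerDetection.IsCoveoCrawler(Request.ServerVariables["HTTP_USER_AGENT"]), and the result used to hide whatever controls render your header and footer.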

Answer by sivaji, Feb 22, 2016 3:05 PM

// AddReference:HtmlAgilityPack.dll
using System;
using System.IO;
using System.Linq;
using Coveo.CES.DotNetConverterLoader;
using Coveo.CES.Interops.COMCoveoConvertersWrappers;
using HtmlAgilityPack;

namespace CoveoConversionScripts
{
    public class CoveoCustomConverter : CustomConverter
    {
        public override void RunCustomConverter(CustomConversion p_CustomConversion, DocumentInfo p_DocumentInfo)
        {
            // Nothing to convert if the input document is missing or empty.
            if (p_CustomConversion.InputDocument == null || p_CustomConversion.InputDocument.BytesCount <= 0)
                return;

            try
            {
                Charsets CHARSET_CP1252 = Charsets.CP1252_CHARSET;
                var webPageHtml = p_CustomConversion.InputDocument.ReadByteString(
                    p_CustomConversion.InputDocument.BytesCount, Charsets.UTF8_CHARSET);

                var htmlDocument = new HtmlDocument();
                htmlDocument.LoadHtml(webPageHtml);

                // These XPath expressions match the class attribute exactly; adjust them to your markup.
                var headerNode = htmlDocument.DocumentNode.SelectSingleNode("//div[@class='header']");
                var footerNode = htmlDocument.DocumentNode.SelectSingleNode("//div[@class='footer ']");
                // SelectNodes returns null when nothing matches, so check before calling ToList().
                var scriptNodes = htmlDocument.DocumentNode.SelectNodes("//script");

                if (headerNode != null)
                {
                    headerNode.Remove();
                }
                if (footerNode != null)
                {
                    footerNode.Remove();
                }
                if (scriptNodes != null)
                {
                    // Copy to a list first so nodes can be removed while iterating.
                    scriptNodes.ToList().ForEach(node => node.Remove());
                }

                p_CustomConversion.Trace(
                    string.Format("After removing the header and footer {0}", htmlDocument.DocumentNode.InnerHtml),
                    SeverityEnumeration.SeverityNormal);
                p_CustomConversion.OutputDocument.WriteAsByteString(htmlDocument.DocumentNode.InnerHtml, CHARSET_CP1252);
            }
            catch (Exception ex)
            {
                p_CustomConversion.Trace(
                    string.Format("Exception in Coveo Custom Converter - Header and Footer : {0}", ex),
                    SeverityEnumeration.SeverityError);
            }
        }
    }
}