While I’d like to claim credit for the wisdom in this post, alas I cannot. One of Seven Sigma’s consultants (Daniel Wale) worked this one out and I thought that it was blog-worthy. Before I get into the issue and Daniel’s resolution, let me give you a bit of search engine theory 101 with a concept that I find is useful to help understand search optimisation.
Precision vs. recall
Each time a person searches for information, there is an underlying goal or intended outcome. While there has been considerable study of information seeking behaviours in academia and beyond, they boil down to three archetype scenarios.
- “I know exactly what I am looking for” – The user has a particular place in mind, either because they visited it in the past or because they assume it exists. This known as known item seeking, but is also referred to as navigational seeking or refinding.
- “I’m not sure what I am looking for but I’ll know it when I find it” – This is known as exploratory seeking and the purpose is to find information assumed to be available. This is characterised by
- - Looking for more than one answer
- - No expectation of a “right” answer
- - Open ended
- - Not necessarily knowing much about what is being looking for
- - Not being able to articulate what is being looked for
- “Gimme gimme gimme!” – A detailed research type search known as exhaustive seeking, leaving no stone unturned in topic exploration. This is characterised by;
- - Performing multiple searches
- - Expressing what is being looked for in many ways
Now among other things, each of these scenarios would require different search results to meet the information seeking need. For example: If you know what you are looking for, then you would likely prefer a small, highly accurate set of search results that has the desired result at the top of the list. Conversely if you are performing an exploratory or exhaustive search, you would likely prefer a greater number of results since any of them are potentially relevant to you.
In information retrieval, the terms precision and recall are used to measure search efficiency. Google’s Tim Bray put it well when he said “recall measures how well a search system finds what you want and precision measures how well it weeds out what you do not want”. Sometimes recall is just what the doctor ordered, whereas other times, precision is preferred.
The scenario and the issue…
That said, recently, Seven Sigma worked on a knowledgebase project for a large customer contact centre. The vast majority of the users of the system are customer centre operators who deal directly with all customer enquiries and have worked there for a long time. Thus most of the search behaviours are in the known item seeking category as they know the content pretty well – it is just that there is a lot of it. Additionally, picture yourself as one of those operators and then imagine the frustration a failed or time consuming search with an equally frustrated customer on the end of the phone and a growing queue of frustrated callers waiting their turn. In this scenario, search results need to be as precise as possible.
So what was the search related issue? In a nutshell, we forgot that the search crawler doesn’t differentiate between your pages content and items in your custom navigation. As a result, we had an issue where searches did not have adequate precision.
To explain the problem, and the resolution, I’ll take a step back and let Daniel continue the story… Take it away Dan…
The knowledgebase that Paul described above contained thousands of articles, and when the search crawler accessed each article page, it also saw the titles of many other articles in the dynamic menu code embedded in the page. As a result, this content also got indexed. When you think about it, the search crawler can’t tell whether content is real content versus when it is a dynamic menu that drops down/slides out when you hover over the menu entry point. The result was that when users searched for any term that appeared in the mega menu, they would get back thousands of results (a match for every page) even when the “actual content” of the page doesn’t contain any references to the searched term.
There is a simple solution however, for controlling what the SharePoint search crawler indexes and what it ignores. SharePoint knows to exclude content that exists inside of <div> HTML tags that have the class noindex added to them. Eg
<div class=”menu noindex”> <ul> <li>Article 1</li> <li>Article 2</li> </ul> </div>
There is one really important thing to note however. If your <div class=”noindex”> contains a nested <div> tag that doesn’t contain the noindex class, everything inside of this inner <div> tag will be included by the crawler. For example:
<div class=”menu noindex”> <ul> <li>Article 1</li> <div class=”submenu”> <ul> <li>Article 1.1</li> <li>Article 1.2</li> </ul> </div> <li>Article 2</li> </ul> </div>
In the code above the nested <div> to surround the submenu items does not contain the noindex class. So the text “Article 1.1” and “Article 1.2” will be crawled, while the “Article 1” and “Article 2” text in the parent <div> will still be excluded.
Obviously the example above its greatly simplified and like our solution, your menu is possibly making use of a DataViewWebPart with an XSL transform building it out. It’s inside your XSL where you’ll need to include the <div> with the noindex class because the Web Part will generate its own <div> tags that will encapsulate your menu. (Use the browser Developer Tools and inspect the code that it inserts if you aren’t familiar with the code generated, you’ll find at least one <div> elements that is nested inside any <div class=”noindex”> you put around your web part thinking you were going to stop the custom menu being crawled).
Initially looking around for why our search results were being littered with so many results that seemed irrelevant, I found the way to exclude the custom menu using this method rather easily, I also found a lot of forum posts of people having the same issue but reporting that their use of <div> tags with the noindex class was not working. Some of these posts people had included snippets of their code, each time they had nested <div> tags and were baffled by why their code wasn’t working. I figured most people were having this problem because they simply don’t read the detail in the solutions about the nesting or simply don’t understand that the web part will generate its own HTML into their page and quite likely insert a <div> that surrounds the content they are wanting to hide. As any SharePoint developer quickly finds out a lot of knowledge in SharePoint won’t come from well set out documentation library with lots of code examples that developers get used to with other environments, you need to read blogs (like this one), read forums, talk to colleagues and just build up your own experience until these kinds of gotchas are just known to you. Even the best SharePoint developer can overlook simple things like this and by figuring them out they get that little bit better each time.
Being a SharePoint developer is really about being the master of self-learning, the master of using a search engine to find the knowledge you need and most importantly the master of knowing which information you’re reading is actually going to be helpful and what is going to lead you down the garden path. The MSDN blog post by Mark Arend (http://blogs.msdn.com/b/markarend/archive/2010/06/07/control-search-indexing-crawling-within-a-page-with-noindex.aspx) gives a clear description of the problem and the solution, he also states that it is by design that nested <div> tags are re-evaluated for the noindex class. He also mentions the product team was considering changing this… did this create the confusion for people or was it that they read the first part of the solution and didn’t read the note about nested <div> tags? In any case it’s a vital bit of the solution that it seems a lot of people overlook still.
In case you are wondering, the built in SharePoint navigation menu’s already have the correct <div> tags with the noindex class surrounding them so they aren’t any concern. This problem only exists if you have inserted your own dynamic menu system.
Other Search Provider Considerations
It is more common that you think that some sites do not just use SharePoint Search. The <div class=”noindex”> is a SharePoint specific filter for excluding content within a page, what if you have a Google Search Appliance crawling your site as well? (Yep… we did in this project)
You’re in luck, the Google documents how to exclude content within a page from their search appliance. There are a few different options but the equivalent blanket ignore of the contents between the <div class=”noindex”> tags would be to encapsulate the section between the following two comments
If you want to know more about the GSA googleoff/googleon tags and the various options you have here is the documentation: http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html#pagepart
(… and Paul returns to the conversation).
I think Dan has highlighted an easy to overlook implication of custom designing not only navigational content, but really any type of dynamically generated content on a page. While the addition of additional content can make a page itself more intuitive and relevant, consider the implication on the search experience. Since the contextual content will be crawled along with the actual content, sometimes you might end up inadvertently sacrificing precision of search results without realising.
Hope this helps and thanks for reading (and thanks Dan for writing this up)