SharePoint and the Seven Words

I am often asked by Human Resources folks about “obscene content” in a SharePoint environment. However the content gets in, they don’t want it to be found. There are utilities like Forefront that will perform content filtering for SharePoint, but most fall short of filtering “everything”. For example, Forefront only filters documents, not list items, publishing pages, content editor web parts, or profile properties.

I got to thinking about this issue while listening to a recent interview with Rob Bogue on the SharePoint Pod Show. In his governance interview he was talking about configuring search in such a way that the act of searching for profanity returns a link to the Employee Code of Conduct.

So, you ask, “What words?” This makes me think of George Carlin’s Seven Words you cannot say on TV. I’ll soften the list for this post. We’ll try to keep this blog rated PG and use:

  • Darn
  • Dang
  • Frick
  • Shoot

I created a document with these words in it. When I crawl the content and search for “frick”, I get the following results.

Search results

My goal is to prevent the display of the document and direct the user to the “Employee Code of Conduct”.

Three Steps

There are three steps to this configuration:

  1. Edit the noise word file and add the words, one on each line.
  2. Edit the thesaurus file and add a substitution set for the words.
  3. Create a keyword with synonyms of our words and a best bet to the code of conduct.

Edit the Noise Word file

  1. Open and edit the noise word file for your language. In a default installation the files are found in the folder: c:\program files\Microsoft Office Servers\12.0\data\applications\<GUID>\Config. The GUID is the unique ID of the search application that you are configuring. If you have only one SSP (Shared Services Provider) you will probably have only one folder. For US English you will want to edit the file noiseenu.txt. Add the words, one per line, then save and close the file.
    • Darn
    • Dang
    • Frick
    • Shoot
  2. Stop and start the search service. This forces the indexer to read the changes to the file.
  3. Perform a full crawl. In my test environment I reset the crawled content, but this may be more than you want to do.
  4. I tested my changes by searching for “frick”. The result is a “noise word only” query and the results are shown below.

Synonym results

You can get more information on editing the noise word file from this support article: KB 837847
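
If you would rather script the noise word edit than do it by hand, here is a minimal Python sketch. It is not part of the original procedure: the function name add_noise_words and the default encoding are my assumptions, so check how your own noiseenu.txt is encoded and work on a backup copy first. It appends each word on its own line and skips words the file already lists.

```python
from pathlib import Path


def add_noise_words(noise_file, words, encoding="utf-8"):
    """Append words to a noise word file, one per line, skipping duplicates.

    The encoding default is an assumption; match it to your actual file.
    Returns the list of words that were actually appended.
    """
    path = Path(noise_file)
    text = path.read_text(encoding=encoding)

    # Compare case-insensitively so "Darn" and "darn" count as the same word.
    existing = {line.strip().lower() for line in text.splitlines()}

    added = []
    for word in words:
        entry = word.strip()
        if entry and entry.lower() not in existing:
            existing.add(entry.lower())
            added.append(entry)

    if added:
        # Start on a fresh line if the file does not already end with one.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with path.open("a", encoding=encoding) as handle:
            handle.write(prefix + "\n".join(added) + "\n")

    return added
```

A call such as add_noise_words(path_to_noiseenu, ["Darn", "Dang", "Frick", "Shoot"]) would cover the list used in this post. You still need to restart the search service and run a full crawl afterwards, exactly as in the manual steps.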

Edit the Thesaurus File

  1. Locate the thesaurus file in the same directory as the noise word file. For US English the file is tsenu.xml.

  2. Create a replacement set for your words and substitute a unique “key” that you could embed in the target file. Your complete file should look like this:

    <XML ID="Microsoft Search Thesaurus">
      <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <replacement>
          <pat>frick</pat>
          <pat>darn</pat>
          <pat>dang</pat>
          <pat>shoot</pat>
          <sub>codeofconduct</sub>
        </replacement>
      </thesaurus>
    </XML>

  3. Optionally, edit your Code of Conduct file and add the substitution keyword. In my example I just added “codeofconduct” to the document metadata; alternatively, I could have added it to the body of my “Employee Code of Conduct” document. The trick here is that we are telling the query engine to use the string “codeofconduct” in place of our bad words. Once the string “codeofconduct” appears anywhere in the document, the indexer catalogs it and the document is returned at query time. This step is optional because I only needed the query engine to recognize the “bad words”. I did not need the document itself to be returned in the results, because the Keyword/Best Bet combination supplies the link.

  4. Restart the Search Service.

  5. Recrawl the content.
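
If your word list is long, typing the replacement set by hand gets tedious. The sketch below generates the same file layout shown in step 2 from a Python list. The function name build_thesaurus_xml is hypothetical, and lowercasing the patterns simply mirrors the sample above; treat it as a starting point rather than a definitive tool.

```python
from xml.sax.saxutils import escape


def build_thesaurus_xml(patterns, substitute):
    """Return thesaurus XML with one replacement set.

    Every entry in `patterns` becomes a <pat> element and `substitute`
    becomes the single <sub> element, following the sample layout in step 2.
    """
    lines = [
        '<XML ID="Microsoft Search Thesaurus">',
        '  <thesaurus xmlns="x-schema:tsSchema.xml">',
        "    <diacritics_sensitive>0</diacritics_sensitive>",
        "    <replacement>",
    ]
    for pattern in patterns:
        # Escape XML special characters and lowercase to match the sample.
        lines.append(f"      <pat>{escape(pattern.strip().lower())}</pat>")
    lines.append(f"      <sub>{escape(substitute)}</sub>")
    lines += ["    </replacement>", "  </thesaurus>", "</XML>"]
    return "\n".join(lines) + "\n"
```

Printing build_thesaurus_xml(["Frick", "Darn", "Dang", "Shoot"], "codeofconduct") gives text you can review and then save as tsenu.xml. Keep a copy of the original file before you overwrite it.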

Add the Keyword and Best Bet

  1. Determine the URL of your “Employee Code of Conduct” file.
  2. Create the new Keyword and add the “bad words” as synonyms. Synonyms are not displayed. Create a Best Bet with the link to the Employee Code of Conduct document or page.

     Code of Conduct Best Bet

  3. Test your configuration by executing a search for your bad words.

     Best Bet Result
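
To see how the three pieces fit together, here is a toy model in Python. It does not call SharePoint and it is not how the search engine is implemented; it only mimics the behavior described in this post, and the document names and the intranet URL are hypothetical. Noise words are dropped at crawl time, the thesaurus rewrites the query, and the keyword synonyms trigger the Best Bet.

```python
import re

NOISE_WORDS = {"darn", "dang", "frick", "shoot"}
THESAURUS = {word: "codeofconduct" for word in NOISE_WORDS}
KEYWORD_SYNONYMS = set(NOISE_WORDS)
# Hypothetical URL, used only for illustration.
BEST_BET_URL = "http://intranet.example/hr/code-of-conduct.docx"


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(text, noise_words=NOISE_WORDS):
    """Crawl time: noise words never make it into the index."""
    return {term for term in tokenize(text) if term not in noise_words}


def run_query(query, index):
    """Query time: return (best_bets, hits) for a query string."""
    terms = tokenize(query)
    # A keyword synonym in the query surfaces the Best Bet link.
    best_bets = [BEST_BET_URL] if any(t in KEYWORD_SYNONYMS for t in terms) else []
    # The thesaurus swaps each bad word for the substitution key.
    rewritten = [THESAURUS.get(t, t) for t in terms]
    hits = [name for name, words in index.items()
            if all(t in words for t in rewritten)]
    return best_bets, hits


INDEX = {
    "memo.docx": index_document("Well darn, that frick report is late"),
    "conduct.docx": index_document("Employee Code of Conduct codeofconduct"),
}
```

In this model, run_query("frick", INDEX) returns the Best Bet link plus conduct.docx (because that document carries the “codeofconduct” key), while memo.docx is never matched on the bad word. An ordinary query such as "report" still finds memo.docx and shows no Best Bet.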

Credits

I did not develop this solution on my own. The idea came from Rob Bogue. The solution was aided by the folks on the product team who are always up to the challenge and come up with unique solutions to the problems I pose. Thanks!

Oh…and thanks to Bob Fox, SharePoint MVP. Before I met Bob, I did not know what a bad word was…or how masterfully and poetically expletives could be strung together.

Notes

Note that this solution works on MOSS 2007 after the Infrastructure Update (IU) has been applied. I noticed while testing this post that the Best Bet web part appears to behave differently between versions. Be sure that your web part looks like the one in the post or your mileage may vary.

