I am often asked by Human Resource folks about “obscene content” in a SharePoint environment. However the content gets in, they don’t want it to be found. There are utilities like Forefront that will perform content filtering for SharePoint, but most fall short of filtering “everything”. For example, Forefront only filters documents, not list items, publishing pages, content editor web parts, or profile properties.
I got to thinking about this issue while listening to a recent interview with Rob Bogue on the SharePoint Pod Show. In his governance interview he was talking about configuring search in such a way that the act of searching for profanity returns a link to the Employee Code of Conduct.
So, you ask, “What words?” This makes me think of George Carlin’s Seven Words you cannot say on TV. I’ll soften the list for this post. We’ll try to keep this blog rated PG and use: frick, darn, dang, and shoot.
I created a document with these words in it. After I crawl the content and search for “frick”, the document comes back in the search results.
My goal is to prevent the display of the document and direct the user to the “Employee Code of Conduct”.
There are three steps to this configuration:
Add the words to the noise word file. The file is located in c:\program files\Microsoft Office Servers\12.0\data\applications\<GUID>\Config. The GUID is the unique ID of the search application that you are configuring. If you have only one SSP you will probably have only one folder. For US English you will want to edit the file noiseenu.txt. Add the words, one per line, then save and close the file.
You can get more information on editing the noise word file from this support article: KB 837847
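If the word list grows beyond a handful of entries, the edit can be scripted. The following is a minimal Python sketch, not part of the original procedure. The function name `add_noise_words` is hypothetical, and the caller must supply both the full path to noiseenu.txt and the encoding of the existing file (check it before writing, since the sketch does not detect it).

```python
from pathlib import Path


def add_noise_words(noise_file, words, *, encoding):
    """Append words that are not already listed to a noise word file.

    The file format assumed here is the one described in the post:
    plain text, one word per line. `encoding` must match the existing
    file; this sketch makes no attempt to detect it.
    Returns the list of words that were actually added.
    """
    path = Path(noise_file)
    text = path.read_text(encoding=encoding) if path.exists() else ""

    # Compare case-insensitively so "The" and "the" count as the same word.
    existing = {line.strip().lower() for line in text.splitlines() if line.strip()}

    new_words = []
    for word in words:
        word = word.strip().lower()
        if word and word not in existing:
            existing.add(word)
            new_words.append(word)

    if new_words:
        # Make sure the appended words start on a fresh line.
        if text and not text.endswith("\n"):
            text += "\n"
        text += "\n".join(new_words) + "\n"
        path.write_text(text, encoding=encoding)

    return new_words
```

Take a backup copy of the noise word file before running anything like this against a production server.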
Locate the thesaurus file in the same directory as the noise word file. For US English the file is tsenu.xml.
Create a replacement set for your words and substitute a unique “key” that you could embed in the target file. Your complete file should look like this:
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <replacement>
      <pat>frick</pat>
      <pat>darn</pat>
      <pat>dang</pat>
      <pat>shoot</pat>
      <sub>codeofconduct</sub>
    </replacement>
  </thesaurus>
</XML>
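For a longer word list, typing each `<pat>` entry by hand gets tedious. Here is a small Python sketch that generates a file with the same layout as the example above from a list of words. The function name `build_thesaurus` is hypothetical, and the sketch assumes a single replacement set is all you need. Compare the output against your existing tsenu.xml before overwriting it.

```python
from xml.sax.saxutils import escape


def build_thesaurus(patterns, substitute):
    """Return thesaurus XML with one replacement set.

    Every entry in `patterns` becomes a <pat> element and `substitute`
    becomes the single <sub> element, mirroring the hand-written file
    shown in the post.
    """
    lines = [
        '<XML ID="Microsoft Search Thesaurus">',
        '  <thesaurus xmlns="x-schema:tsSchema.xml">',
        "    <diacritics_sensitive>0</diacritics_sensitive>",
        "    <replacement>",
    ]
    # escape() protects characters such as & and < inside element text.
    lines += [f"      <pat>{escape(p)}</pat>" for p in patterns]
    lines += [
        f"      <sub>{escape(substitute)}</sub>",
        "    </replacement>",
        "  </thesaurus>",
        "</XML>",
    ]
    return "\n".join(lines) + "\n"


# Example: regenerate the file contents used in this post.
print(build_thesaurus(["frick", "darn", "dang", "shoot"], "codeofconduct"))
```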
Optionally, edit your Code of Conduct file and add the substitution keyword. In my example I just added “codeofconduct” to the document metadata. In my case I only wanted the query engine to recognize the “bad words”. I did not need the document to be returned in the results because I am configuring a Keyword/Best Bet combination. Alternatively I could have added “codeofconduct” to my “Employee Code of Conduct” document. The trick here is that we are telling the query engine to use the string “codeofconduct” in place of our bad words. By adding the string “codeofconduct” anywhere in our document, the indexer will catalog the word and return the document at query time.
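To make the substitution idea concrete, here is a toy Python model of what a replacement set does to a query. It is only a mental model and says nothing about how the search engine is actually implemented. The function name `apply_replacement` is hypothetical.

```python
def apply_replacement(query, patterns, substitute):
    """Toy model: swap any query term that matches a pattern for the substitute.

    Matching here is a simple case-insensitive, whole-term comparison.
    """
    wanted = {p.lower() for p in patterns}
    terms = [substitute if term.lower() in wanted else term for term in query.split()]
    return " ".join(terms)


bad_words = ["frick", "darn", "dang", "shoot"]
print(apply_replacement("frick", bad_words, "codeofconduct"))
```

In this model the user's query for a bad word turns into a query for “codeofconduct”, which is why the key needs to be a string that would not otherwise appear in your content.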
Restart the Search Service.
Recrawl the content.
I did not develop this solution on my own. The idea came from Rob Bogue. The solution was aided by the folks on the product team who are always up to the challenge and come up with unique solutions to the problems I pose. Thanks!
Oh…and thanks to Bob Fox, SharePoint MVP. Before I met Bob I did not know what a bad word was…or how masterfully and poetically expletives could be strung together.
Note that this solution works on MOSS 2007 after the Infrastructure Update (IU) has been applied. I noticed while testing this post that the Best Bet web part appears to behave differently between versions. Be sure that your web part looks like the one in the post or your mileage may vary.