Crawling Special Circumstances

I recently helped a client with a crawl configuration that turned out to be a bit trickier than I expected. The client has a page that generates “hidden” links to a library of PDF documents. The URL to this page is simple enough: http://www.docs.company.com. The page has several links on it that categorize the folders and files to facilitate the crawl process. For example:

http://www.docs.company.com/documents/public/documents/resources/getdocs.foo?Division=Monterey%20Coast

This page renders another page with links to the actual documents with URLs like:

http://www.docs.company.com/documents/public/documents/resources/montereybayaquarium.pdf?Division=Monterey%20Coast
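
Before touching any crawl settings, it can help to pull these two URLs apart and see what the crawler is actually being asked to fetch. Here is a minimal sketch using only the Python standard library. The describe helper is my own illustrative name, not part of any SharePoint tooling.

```python
from posixpath import splitext
from urllib.parse import urlsplit, parse_qs


def describe(url):
    """Break a URL into the parts that matter when troubleshooting a crawl."""
    parts = urlsplit(url)
    # The extension comes from the path only; the query string is separate.
    extension = splitext(parts.path)[1].lstrip(".").lower()
    return {
        "host": parts.netloc,
        "extension": extension,
        "has_query": bool(parts.query),
        "query": parse_qs(parts.query),
    }


listing = describe(
    "http://www.docs.company.com/documents/public/documents/resources/"
    "getdocs.foo?Division=Monterey%20Coast"
)
document = describe(
    "http://www.docs.company.com/documents/public/documents/resources/"
    "montereybayaquarium.pdf?Division=Monterey%20Coast"
)

for name, info in (("listing", listing), ("document", document)):
    print(name, info["extension"], info["has_query"], info["query"])
```

Both URLs carry a query string, and the listing page has an extension (foo) that is not a common web page type.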

So I created a Content Source that pointed to the landing page and kicked off the crawl. The crawl stopped immediately and returned the dreaded message:

http://www.docs.company.com/...
The specified address was excluded from the index. The crawl rules may have to be modified to include this address. (The item was deleted because it was either not found or the crawler was denied access to it.)

Crawl Rules

Hmmm…crawl rules, right… I need to account for the special character in the URL: the question mark that starts the query string. So I add a crawl rule for the address:

Path: http://www.docs.company.com/*

Crawl Configuration:
* Include all items on this path
* Crawl complex URLs

Specify Authentication: Use default content access account
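
To reason about what a rule like this lets through, I find a toy model useful. The sketch below is only my simplified mental model, assuming a rule is a wildcard path plus a switch for URLs that contain a question mark. It is not SharePoint's actual matching logic, and rule_allows is a hypothetical name.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def rule_allows(url, path_pattern, crawl_complex_urls):
    """Toy model of an include rule (an assumption, not SharePoint's code)."""
    # The path must match the wildcard pattern, compared case-insensitively.
    if not fnmatchcase(url.lower(), path_pattern.lower()):
        return False
    # A "complex" URL here means one that carries a query string.
    has_query = bool(urlsplit(url).query)
    return crawl_complex_urls or not has_query


pattern = "http://www.docs.company.com/*"
listing_url = (
    "http://www.docs.company.com/documents/public/documents/resources/"
    "getdocs.foo?Division=Monterey%20Coast"
)

print(rule_allows(listing_url, pattern, crawl_complex_urls=True))
print(rule_allows(listing_url, pattern, crawl_complex_urls=False))
```

In this model the listing URL only passes when the complex URL switch is on, which is why both options are set on the rule.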

Crawl again and same result…

File Types

Then (again with the assistance of the great folks at Microsoft) it is pointed out to me that the crawler does not recognize the file extension FOO. So I add FOO to the list of file types.

Crawl again and same result…

IFilter

Wait, how on Earth would the indexer know what to do with a FOO file? I have never heard of that type… So I register FOO as a file extension and associate it with the HTML IFilter. To do this, create the following registry key:

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.foo]

With a default value of: {56BD18AD-CF9C-4110-AAAA-B2F96887D123}
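
If you would rather not type the key by hand, one option is to generate a .reg file and import it. The following is a rough sketch under a few assumptions: the key path and the default value are the ones quoted above, the file uses the standard "Windows Registry Editor Version 5.00" layout, and reg_file_text is a hypothetical helper name of mine.

```python
def reg_file_text(extension, filter_clsid):
    """Build .reg file text that maps a file extension to a filter CLSID."""
    # Key path taken from this post; it is specific to Office Server 12.0.
    key = (
        r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0"
        r"\Search\Setup\ContentIndexCommon\Filters\Extension\." + extension
    )
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + key + "]",
        '@="' + filter_clsid + '"',  # "@" sets the key's default value
        "",
    ]
    return "\r\n".join(lines)


text = reg_file_text("foo", "{56BD18AD-CF9C-4110-AAAA-B2F96887D123}")
print(text)
```

Save the output to a file with a .reg extension, review it, and back up the registry before importing anything on a production server.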

Stop and start the search service:

net stop osearch
net start osearch

Crawl again and “HEY…it works!”

Summary

My lesson in all this is that when you configure a crawl, you have to pay close attention to the “path” the crawler takes. As you troubleshoot, consider every URL and every file type the crawler requests along the way: whether a crawl rule includes the URL, whether the file type is on the list, and whether an IFilter is registered for it.
