After working on several search projects, it is clear that the out-of-the-box search experience for images is, well, less than optimal. Look at how cool the Microsoft Live Search Images interface is, then hover over an image and look at how cool the property display is!
How about the results in SharePoint?
OK, I know, SharePoint is a document-centric application, built for indexing, searching and displaying documents. I agree! But what if your company’s documents ARE images, no matter where you store them? You may use file shares, SharePoint Document libraries, SharePoint Picture libraries, and SharePoint Publishing Images libraries.
If I had to sum up why I love SharePoint, it is this: SharePoint is very flexible. So you can’t do GREAT image search out of the box. So what! The designers at Microsoft built the search capability with two important concepts in mind:
Today we don’t know what folks will want to crawl tomorrow, so make it flexible enough to configure the crawls for other file types.
Today we don’t know how our users will want their results displayed tomorrow, so make the search interface flexible enough to render results any way the customer wants.
It is with these two concepts in mind that I set out to teach SharePoint to do image search better than it does out of the box. This is the first post in a series where I’ll address the configuration areas required to improve the out of the box handling of image files (JPEG, GIF, PNG, etc.). My goal with this series of posts is to present a structured process for addressing a search project of this nature and present a solution that I think most folks can understand.
The three parts are:
Index Engine Configuration: “Can the indexer find your images?”
We will start with a test environment with images loaded in four locations: a File Share, a SharePoint Picture Library, a SharePoint Document Library, and a SharePoint Publishing Images Library. Images are handled differently by SharePoint depending on their location. I use:
The same images in each location
Exactly 10 images of each type in each location (this makes testing much easier)
Three different file types: JPEG, GIF and PNG
The environment is MOSS Enterprise SP1 with the Infrastructure Update applied. The screen shots you will see may differ from your environment if you are still working with an SP1 environment but the concepts are the same. I am using a Search Center with Tabs. I configure a tab with XML results based on this post. This configuration makes life easy for testing.
To make things easy I am going to configure two Content Sources, one for our test intranet site and one for the file share. Since SharePoint provides these protocol handlers out of the box, there is nothing special to do here besides set them up. For testing this makes it easy to re-crawl only the target of your tests. Later you can consolidate the content sources if you like.
My first crawl results are unimpressive, but that was to be expected. Let’s look at what we actually found. One of my images is Amelia.jpg. If I search for “Amelia” I get 7 results: 4 are the lists, libraries and pages related to the lists, and 3 are the actual list items. None are the actual files, and the file share images are not present.
I execute the following property search: fileextension:jpg (or fileextension:gif or fileextension:png) and see the following results:
The result set contains only 10 images, but there should be 40: 10 GIF images in each of the 4 indexed locations. What’s the deal? The results are the same for all three extensions. The only images we are finding with a recognized file extension are those from the Picture Library.
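The expected count is simple arithmetic from the test setup. Here is a minimal sketch in Python; the location names and helper functions are hypothetical and are not SharePoint APIs:

```python
# Test-environment layout described in this post; names are illustrative only.
LOCATIONS = [
    "File Share",
    "Picture Library",
    "Document Library",
    "Publishing Images Library",
]
FILE_TYPES = ["jpg", "gif", "png"]
IMAGES_PER_TYPE_PER_LOCATION = 10


def property_query(extension: str) -> str:
    """Build the keyword property query used in this post."""
    return f"fileextension:{extension}"


def expected_hits() -> int:
    """Images of one type a query should find if every location were indexed."""
    return IMAGES_PER_TYPE_PER_LOCATION * len(LOCATIONS)


if __name__ == "__main__":
    for ext in FILE_TYPES:
        print(property_query(ext), "-> expected hits:", expected_hits())
```

Comparing that number with what the results page actually shows is the quickest way to spot a location that is not being indexed the way you expect.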
Out of the box SharePoint does not include JPG, GIF and PNG in the crawl. So the files on the file share are excluded, but since the files are also stored in SharePoint libraries, the list items are included in the index. Looking at the returned XML answers the question. How do I view my results as XML?
The results are of Content Class STS_ListItem_PictureLibrary: the index is returning the list item, not the image. Looking at the crawl logs for each content source confirms this. The file share returned no files because the crawled File Types list does not include JPG, PNG, or GIF.
The Document Library and the Publishing Images Library returned no images, just list items.
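If you save the XML results to a file, a quick way to see what kind of items came back is to tally the content class values. The sketch below is plain Python, not part of SharePoint. It assumes the XML has one Result element per hit with a contentclass child element. The sample document is made up for illustration, so adjust the element names to match what your results page actually emits.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample in the assumed shape: one <Result> per hit,
# each with a <contentclass> child. Not real crawl output.
SAMPLE_XML = """
<All_Results>
  <Result><title>Amelia</title><contentclass>STS_ListItem_PictureLibrary</contentclass></Result>
  <Result><title>Boat</title><contentclass>STS_ListItem_PictureLibrary</contentclass></Result>
  <Result><title>Amelia</title><contentclass>STS_ListItem_DocumentLibrary</contentclass></Result>
</All_Results>
"""


def tally_content_classes(xml_text: str) -> Counter:
    """Count results per content class in a saved XML result set."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for result in root.iter("Result"):
        # Results without the element are counted under a placeholder key.
        value = result.findtext("contentclass", default="(missing)")
        counts[value.strip()] += 1
    return counts


if __name__ == "__main__":
    for content_class, count in tally_content_classes(SAMPLE_XML).items():
        print(content_class, count)
```

If every hit is a list item content class, the index is holding list items rather than the image files themselves.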
The list of File Types determines what files to include in the index. If the extension is not listed here the file will not be crawled. So click the link and add JPG, GIF and PNG.
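The rule is a plain inclusion check on the file extension. The Python sketch below models it so you can reason about which test files a crawl will pick up. The is_crawled helper and the starting list are hypothetical, and the list is an illustrative subset, not the actual default File Types list.

```python
from pathlib import PurePosixPath

# Illustrative subset only; the real File Types list is longer.
file_types_before = {"doc", "xls", "ppt", "txt", "html"}

# The same list after adding the three image extensions from this post.
file_types_after = file_types_before | {"jpg", "gif", "png"}


def is_crawled(filename: str, file_types: set) -> bool:
    """Return True if the file's extension appears in the File Types list."""
    extension = PurePosixPath(filename).suffix.lstrip(".").lower()
    return extension in file_types


if __name__ == "__main__":
    for name in ["Amelia.jpg", "logo.PNG", "report.doc"]:
        print(name, is_crawled(name, file_types_before), is_crawled(name, file_types_after))
```

The lowercase comparison here is a choice made for the sketch; it says nothing about how the crawler itself treats case.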
Closer! We got all of our images from all of our content sources.
The interesting thing is that our property search fileextension:jpg only returns 20 items, 10 from the Picture Library and 10 from the file share. What about the document libraries? The out of the box document libraries do not return an attribute that maps to FileExtension. Picture Libraries return “File Type”, which is mapped to FileExtension. Likewise, PictureWidth is not returned from the file share or the document library, but is returned from the Picture Library and Publishing Images library.
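To keep track of which location returns which property, the observations above can be written down as a small lookup. This is a Python sketch of the findings in this post, not something read from SharePoint. The missing FileExtension on the Publishing Images library is my inference from the 20-item result, and the names are hypothetical.

```python
# Which managed properties each test location returned, per the observations above.
PROPERTIES_BY_LOCATION = {
    "File Share": {"FileExtension"},
    "Picture Library": {"FileExtension", "PictureWidth"},
    "Document Library": set(),
    # FileExtension is assumed absent here, inferred from the 20-item result.
    "Publishing Images Library": {"PictureWidth"},
}
IMAGES_PER_TYPE_PER_LOCATION = 10


def locations_with(prop: str) -> list:
    """Locations whose items carry the given property."""
    return [loc for loc, props in PROPERTIES_BY_LOCATION.items() if prop in props]


def images_with(prop: str) -> int:
    """Images of one file type that carry the given property."""
    return IMAGES_PER_TYPE_PER_LOCATION * len(locations_with(prop))


if __name__ == "__main__":
    for prop in ["FileExtension", "PictureWidth"]:
        print(prop, locations_with(prop), images_with(prop))
```

A table like this makes it easy to see which locations still need property work before a single query can cover all of them.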