After the crawl is done, the collection is automatically indexed and then checked against the Serving Prerequisites (any criteria you wish to use to determine whether to move an indexed collection to production). The collection is then either moved to Production (and consequently becomes searchable) or moved to Staging. The Staging area lets you validate new crawls before letting users search against them.
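The Production-or-Staging decision can be pictured as a simple gate. The following is only a sketch of the idea in Python, not the appliance's own logic: the `CrawlResult` fields, the `choose_destination` function and the two checks are hypothetical stand-ins for whatever prerequisites you define.

```python
from dataclasses import dataclass


@dataclass
class CrawlResult:
    """Summary of one freshly indexed collection (all fields are hypothetical)."""
    documents_indexed: int
    required_urls_found: bool


def choose_destination(result: CrawlResult, min_documents: int) -> str:
    """Return "Production" only if every prerequisite passes, else "Staging"."""
    checks = [
        result.documents_indexed >= min_documents,  # crawl was not suspiciously small
        result.required_urls_found,                 # key pages made it into the index
    ]
    return "Production" if all(checks) else "Staging"


print(choose_destination(CrawlResult(12000, True), min_documents=10000))  # Production
print(choose_destination(CrawlResult(800, True), min_documents=10000))    # Staging
```

A collection that fails any check stays in Staging, where it can be inspected before users ever search against it.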
Crawling configuration
After your first crawl you may find you need to go back and tweak the crawling parameters. Google gives you a good amount of control over how sites are crawled: frequency, how many threads are used, etc. For sites with security, the GSA supports Basic Authentication, and an additional security module is available which supports Forms Authentication. The most challenging configuration aspect for us was determining the right combination of URL patterns to exclude from the search. If you are a Domino shop looking to use the GSA, you may need to spend some time getting the crawler configuration just right to support the sometimes convoluted Domino query string parameters.
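To give a feel for the Domino problem, here is a minimal sketch in Python of the kind of filtering an exclusion list performs. The patterns are ordinary Python regular expressions chosen for illustration, not the appliance's own pattern syntax, and the host name and the `should_crawl` helper are hypothetical. The idea is that view URLs which only expand, collapse or page through a listing repeat content that is already reachable through the document URLs.

```python
import re

# Hypothetical exclusion patterns (Python regexes, NOT the GSA pattern syntax).
EXCLUDE_PATTERNS = [
    re.compile(r"[?&](Expand|Collapse)=", re.IGNORECASE),      # single category toggles
    re.compile(r"[?&](Expand|Collapse)View\b", re.IGNORECASE),  # whole-view toggles
    re.compile(r"\?OpenView&.*\bStart=", re.IGNORECASE),        # paged view listings
]


def should_crawl(url: str) -> bool:
    """Return False when the URL matches any exclusion pattern."""
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


urls = [
    "http://intranet.example.com/docs.nsf/ByCategory?OpenView&Start=31&Count=30",
    "http://intranet.example.com/docs.nsf/ByCategory?OpenView&Expand=2.1",
    "http://intranet.example.com/docs.nsf/0/ABC123?OpenDocument",
]
for url in urls:
    print("crawl" if should_crawl(url) else "skip ", url)
```

Only the last URL, the document itself, survives the filter. Your own list will depend on how your Domino applications build their links.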
After we got the crawl parameters tuned and the first complete crawl done, we did some testing to see if the crawler had grabbed all the content. Browsing our site and testing with some strings buried deep inside the taxonomy, we always found the GSA had crawled them accurately. We also did some testing with strings inside PDF documents, PowerPoint presentations and the like. When we did come across something that hadn’t been crawled, a careful analysis led us to discover that we needed to do some more tweaking of the crawl settings.
Other notable features
Google also gives you a KeyMatch tool that allows you to specify which indexed documents should appear at the top of the results page for a given query. These manifest themselves almost identically to the Sponsored Links at the top of the results page of the Google we all use. A Synonym tool allows you to specify alternate words or phrases for search queries. For example, if someone searches for WCM, you can suggest “Web Content Management” at the top of the results page.
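Conceptually, both tools are lookup tables consulted before the ordinary results are shown. The sketch below, in Python, illustrates that idea only: the table contents, the URL and the `decorate_results` function are made up, and for simplicity it matches the whole query rather than modelling how the appliance actually matches terms.

```python
# Hypothetical admin-defined tables; the entries and the URL are made up.
KEYMATCHES = {"benefits": ["http://intranet.example.com/hr/benefits.html"]}
SYNONYMS = {"wcm": "Web Content Management"}


def decorate_results(query, organic_results):
    """Attach promoted links and a suggested phrase to the ordinary results."""
    key = query.strip().lower()
    return {
        "suggestion": SYNONYMS.get(key),      # shown as "you could also try ..."
        "promoted": KEYMATCHES.get(key, []),  # shown above the organic results
        "results": organic_results,
    }


page = decorate_results("WCM", ["http://intranet.example.com/cms/overview.html"])
print(page["suggestion"])  # Web Content Management
```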
An output format feature lets you control (via an XSLT stylesheet) the presentation of search results. You can use this for changing the fonts, colors, logo, header, etc. of the results page. We were able to easily remove the “Cached” feature on the results page with some XSLT modifications. The Reporting tool lets you run reports on search queries over various time ranges. It will show you the number of searches per day and per hour, along with the top 100 keywords and top 100 queries for the time period specified.
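For readers wondering what separates "keywords" from "queries" in such a report, here is a rough Python sketch of the aggregation. It assumes a hypothetical log of (date, query) pairs with invented entries and is not the appliance's report format: a query is counted as the whole search string, a keyword as each individual word in it.

```python
from collections import Counter

# Hypothetical query log: (ISO date, query string) pairs, invented for illustration.
log = [
    ("2005-03-01", "vacation policy"),
    ("2005-03-01", "WCM"),
    ("2005-03-02", "vacation policy"),
    ("2005-03-02", "expense report"),
    ("2005-03-02", "Vacation Policy"),
]


def summarize(entries, top_n=100):
    """Return searches per day, the top queries and the top keywords."""
    per_day = Counter(day for day, _ in entries)
    queries = Counter(query.strip().lower() for _, query in entries)
    keywords = Counter(word for _, query in entries for word in query.lower().split())
    return per_day, queries.most_common(top_n), keywords.most_common(top_n)


per_day, top_queries, top_keywords = summarize(log)
print(per_day["2005-03-02"])  # 3
print(top_queries[0])         # ('vacation policy', 3)
```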
Downsides
The GSA is not for organizations looking to index their shared network drives, as the appliance has no facility for crawling file systems. This is really too bad, as many companies struggle with huge quantities of unstructured content stored on their networks. Of course, there is a plethora of other products out there for exactly this issue.
Direct access to databases (e.g. SQL, Oracle, etc.) is another area that is off-limits for the GSA, as is any kind of integration with content or document management systems.
Conclusion
The Google Search Appliance (GSA) is an excellent search product for HTTP-accessible content. It gives great control over administrative features such as crawler configuration and results serving, and it offers sufficient reporting capabilities as well. Those looking for a solution that integrates directly with a content/document management system or databases, or that indexes network drives, should look to another product. However, if you have an intranet or Internet site with plenty of HTML-based content, the GSA may be just what you need.
Bryan Mjaanes is the creator/editor of Intranet101.com, a community-based forum for Intranet professionals.