After the crawl is done, the collection is automatically indexed and then checked against the Serving Prerequisites (any criteria you wish to use to determine whether to move an indexed collection to production), and the collection will either be moved to Production (and consequently become searchable) or be moved to Staging. The Staging area lets you validate new crawls before letting users search against them.
Crawling configuration
After your first crawl you may find the need to go back and tweak the crawling parameters. Google gives you a good amount of control over how sites are crawled, the frequency, how many threads are used, etc. For sites with security, the GSA supports Basic Authentication, and an additional security module is available which supports Forms Authentication. The most challenging configuration aspect for us was determining the right combination of URL patterns to exclude from the search. If you are a Domino shop and looking to use the GSA, you may need to spend some time getting the crawler configuration just right to support the sometimes convoluted Domino query string parameters.
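As a rough illustration of the kind of pattern tuning described above, the Python sketch below checks candidate exclusion patterns against sample URLs before they are committed to a crawler configuration. The regular expressions, the example URLs and the helper names are assumptions made for illustration. They are not the GSA's own pattern syntax, which should be taken from the appliance documentation.

```python
import re

# Hypothetical exclusion patterns, written as Python regular expressions.
# They target Domino URL arguments that tend to produce near-duplicate
# views of the same content (expanded/collapsed views, paging, edit forms).
EXCLUDE_PATTERNS = [
    re.compile(r"[?&](ExpandView|CollapseView)\b", re.IGNORECASE),
    re.compile(r"[?&](Expand|Collapse)=", re.IGNORECASE),
    re.compile(r"[?&]Start=\d", re.IGNORECASE),
    re.compile(r"\?(EditDocument|DeleteDocument)\b", re.IGNORECASE),
]


def is_excluded(url):
    """Return True if any exclusion pattern matches the URL."""
    return any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


def partition(urls):
    """Split URLs into (kept, excluded) lists, preserving input order."""
    kept, excluded = [], []
    for url in urls:
        (excluded if is_excluded(url) else kept).append(url)
    return kept, excluded


if __name__ == "__main__":
    # Made-up sample URLs in the style of a Domino site.
    samples = [
        "http://intranet.example.com/docs.nsf/ByDate/ABC123?OpenDocument",
        "http://intranet.example.com/docs.nsf/ByDate?OpenView&Start=30",
        "http://intranet.example.com/docs.nsf/ByDate?OpenView&ExpandView",
    ]
    kept, excluded = partition(samples)
    print("kept:", kept)
    print("excluded:", excluded)
```

Running a list of known-good and known-noisy URLs through a check like this makes it easier to see whether a new pattern accidentally excludes real content.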
After we got the crawl parameters tuned and the first complete crawl done, we did some testing to see if the crawler grabbed all the content. Browsing our site and testing with some strings buried deep inside the taxonomy, we always found the GSA had crawled them accurately. We also did some testing with strings inside PDF documents, PowerPoint presentations and the like. When we did come across something that hadn’t been crawled, a careful analysis led us to discover that we needed to do some more tweaking of the crawl settings.
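The spot checks described above could be made repeatable with a small script. The following Python sketch assumes a list of probe strings, each paired with the URL where it is known to live, and a `search` callable that returns result URLs for a query. Both the probe data and the `fake_search` stand-in are hypothetical. A real check would replace the stand-in with a call to the appliance's search interface.

```python
def find_missing(probes, search):
    """Return the probes whose expected URL is absent from the results.

    probes: mapping of query string -> URL expected to appear in the results.
    search: callable taking a query string and returning an iterable of URLs.
    """
    missing = {}
    for query, expected_url in probes.items():
        results = set(search(query))
        if expected_url not in results:
            missing[query] = expected_url
    return missing


# Stand-in index so the sketch runs without a search server (made-up data).
FAKE_INDEX = {
    "quarterly widget forecast": ["http://intranet.example.com/reports/q3.pdf"],
    "onboarding checklist": ["http://intranet.example.com/hr/onboarding.html"],
}


def fake_search(query):
    """Hypothetical replacement for a real search request."""
    return FAKE_INDEX.get(query, [])


if __name__ == "__main__":
    probes = {
        "quarterly widget forecast": "http://intranet.example.com/reports/q3.pdf",
        "onboarding checklist": "http://intranet.example.com/hr/onboarding.html",
        "legacy pricing sheet": "http://intranet.example.com/sales/pricing.ppt",
    }
    for query, url in find_missing(probes, fake_search).items():
        print("not found:", query, "->", url)
```

Any probe reported as missing points to content the crawler did not reach, which is the cue to revisit the crawl settings.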
Other notable features
Google also gives you a KeyMatch tool that allows you to specify which indexed documents should appear at the top of the results page for a given query. These manifest themselves almost identically to the Sponsored Links at the top of the results page of the Google we all use. A Synonym tool allows you to specify alternate words or phrases for search queries. For example, if someone searches for WCM, you can suggest “Web Content Management” at the top of the results page.
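Conceptually, both tools behave like lookup tables consulted before the ordinary results are shown. The Python sketch below models that idea only. The table contents, the function names and the choice of exact matching on a normalized query are assumptions for illustration, not a description of how the GSA implements KeyMatch or synonyms.

```python
# Hypothetical KeyMatch table: normalized query -> (title, URL) to promote.
KEYMATCHES = {
    "vacation policy": (
        "Vacation Policy",
        "http://intranet.example.com/hr/vacation.html",
    ),
}

# Hypothetical synonym table: normalized query -> suggested alternative.
SYNONYMS = {
    "wcm": "Web Content Management",
}


def normalize(query):
    """Lower-case the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def decorate_results(query, organic_results):
    """Attach any promoted link and suggestion to the ordinary results."""
    key = normalize(query)
    return {
        "suggestion": SYNONYMS.get(key),   # None when no synonym is defined
        "promoted": KEYMATCHES.get(key),   # None when no KeyMatch is defined
        "results": list(organic_results),
    }


if __name__ == "__main__":
    page = decorate_results("  WCM ", ["http://intranet.example.com/cms/"])
    print(page)
```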
An output format feature lets you control (via an XSLT) the presentation of the search results. You can use this for changing the fonts, colors, logo, header, etc. of the results page. We were able to easily remove the “Cached” feature on the results page with some XSLT modifications. The Reporting tool lets you run reports on search queries over various time ranges. It will show you the number of searches per day and per hour, and the top 100 keywords and top 100 queries for the time period specified.
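To make the report contents concrete, here is a minimal Python sketch that computes the same kinds of figures from a list of (timestamp, query) pairs. The log format, the sample entries and the `summarize` function are assumptions for illustration. The appliance produces these reports itself, and its internal log format is not described here.

```python
from collections import Counter
from datetime import datetime


def summarize(log, top_n=100):
    """Summarize (ISO timestamp, query) pairs into report-style counts."""
    per_day = Counter()
    per_hour = Counter()
    queries = Counter()
    keywords = Counter()
    for timestamp, query in log:
        when = datetime.fromisoformat(timestamp)
        per_day[when.date().isoformat()] += 1
        per_hour[when.hour] += 1
        # Treat queries case-insensitively and ignore extra whitespace.
        normalized = " ".join(query.lower().split())
        queries[normalized] += 1
        # A "keyword" here is simply each whitespace-separated term.
        keywords.update(normalized.split())
    return {
        "per_day": dict(per_day),
        "per_hour": dict(per_hour),
        "top_queries": queries.most_common(top_n),
        "top_keywords": keywords.most_common(top_n),
    }


if __name__ == "__main__":
    # Made-up sample log entries.
    sample_log = [
        ("2007-03-01T09:15:00", "WCM"),
        ("2007-03-01T09:40:00", "wcm"),
        ("2007-03-01T14:05:00", "vacation policy"),
        ("2007-03-02T10:00:00", "WCM vendors"),
    ]
    report = summarize(sample_log)
    print(report["per_day"])
    print(report["top_queries"][:3])
```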
Downsides
The GSA is not for organizations looking to index their shared network drives, as the appliance has no facility for crawling file systems. This is really too bad, as many companies struggle with the huge quantities of unstructured content stored on their networks. Of course, there is a plethora of other products out there for exactly this issue.
Direct access to databases (e.g. SQL, Oracle, etc.) is another area which is off-limits for the GSA, as is any kind of integration with content or document management systems.
Conclusion
The Google Search Appliance (GSA) is an excellent search product for HTTP-accessible content. It gives great control over administrative features such as crawler configuration and results serving, and it offers sufficient reporting capabilities as well. Those looking for a solution that integrates directly with a content/document management system or databases, or that indexes network drives, should look to another product. However, if you have an intranet or Internet site with plenty of HTML-based content, the GSA may be just what you need.
