The News & Observer

January 10, 2000

Stump the Geeks

Section: Connect
Edition: Final
Page: D6
Estimated Printed Pages: 2

Index Terms:
hi-tech
Letter

Article Type: Letter

Article Text:

Q. I'm putting historical material online for the public and have a problem: I want to put the pages online as page-images for accuracy, but also include hidden OCR'd text from the pages just for searching purposes. (Because HTML can't display OCR'd text EXACTLY like it is on the original pages and proofing kills you if there are thousands of pages.)

I thought of putting the OCR'd text within Comment Tags, "<!-- -->" but AltaVista says they don't index the contents of Comments.

Any suggestions?

Dave Maxey

Apex

A. You may be surprised to know that you have encountered not only a limitation of AltaVista, but a limitation of the Web.

Web search sites like AltaVista are in a way a compromise between solving two distinct problems, often known as "resource discovery" and "deep searching." The problem of finding your historical collection and related collections is an example of resource discovery.

Searching within your collection in a comprehensive way is an example of deep searching.

AltaVista does both at the same time, but neither extremely well. It views all data as one massive set of Web pages, which is one reason search results are often a jumble of links to sites, databases, specific documents and so on. (The development of XML will improve the situation, but it does not solve the underlying problem.)

For many people who use Web searching as a starting point for browsing, this approach is adequate. However, if you are trying to build your own document collection, you may find that you need features of deep searching beyond merely indexing Web pages.

If you are somewhat technically inclined, the solution is to run your own search engine. Otherwise you may have to choose an unsatisfying way to work around AltaVista, for example by making your text invisible in the HTML.
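
For the technically inclined, the core idea behind a do-it-yourself search engine can be sketched in a few lines of Python. This is only a minimal illustration, not a description of any particular product: it builds a simple word index over the OCR'd text and answers a query with the matching page images, so the reader sees the scan while the text stays behind the scenes. The file names, the sample text, and the function names are all made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical sample data: page-image file name -> OCR'd text of that page.
PAGES = {
    "page_001.gif": "Minutes of the town meeting",
    "page_002.gif": "Report on the railroad depot and town finances",
    "page_003.gif": "Letters concerning the railroad survey",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page images whose OCR text contains it."""
    index = defaultdict(set)
    for image, text in pages.items():
        for word in tokenize(text):
            index[word].add(image)
    return index


def search(index, query):
    """Return the sorted page images that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


index = build_index(PAGES)
print(search(index, "railroad"))       # pages mentioning "railroad"
print(search(index, "town railroad"))  # pages mentioning both words
```

A real search engine adds a great deal on top of this (ranking, phrase matching, tolerance for OCR errors), but the principle is the same: the text is used only to find the page, and the image is what gets shown.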

Another advantage of setting up your own search engine is that it will allow you to experiment with other formats such as PDF. PDF can represent the original documents exactly as scanned, with the OCR'd text included inside the document. Several search engines are able to index PDF files directly. This means that you would have to maintain only a single version of each document, and your users would get both the correct image and the text in a single document.

None of this answers the question: What is the best way for your collection to be searched with AltaVista? This is where we return to the limits of the Web. Are your data a "collection" or an indistinct area of "the Web"? Are they merely a part of the HTMLting pot? (Thank Paul Jones of Stump the Geeks for the pun.) Are they a resource worth being discovered?

Nassib Nassar

president of Etymon Systems Inc.

www.etymon.com

Copyright 2000 by The News & Observer Pub. Co.

Record Number: fo4qrk89